
Performance bottleneck in FluxCsvParser when parsing large CSV payloads (10MB+) #691

vessaldaneshvar opened this issue Apr 13, 2025 · 0 comments
Labels
bug Something isn't working

Specifications

  • Client Version: 1.48.0
  • InfluxDB Version: 2.7
  • Platform: macos

Code sample to reproduce problem

import influxdb_client

client = influxdb_client.InfluxDBClient(
    url="http://localhost:8086",
    token="TOKEN",
    org="organization",
)
query_api = client.query_api()
query = (
    'from(bucket: "sensors") '
    '|> range(start: 2025-04-13T14:18:11.036Z, stop: 2025-04-13T14:33:11.036Z)'
)
result = query_api.query(org="matna", query=query)

Expected behavior

The runtime of this query should be on the same order as running the same query in the InfluxDB UI.

Actual behavior

The runtime of this code is not on the same order of magnitude; it is dramatically slower than the UI.

Additional info

Hi InfluxDB team,

I've encountered a significant performance bottleneck in the FluxCsvParser class within the InfluxDB Python client when working with larger datasets.

🐞 Issue Description
When querying data (~10MB in size), the network call returns results in under 20 ms, which is excellent. However, the CSV parsing step takes over 5 seconds to complete. This introduces an unacceptable latency for high-throughput or low-latency use cases.

In contrast, using the Go client for the same query and dataset, the full query—including parsing—is completed in under 200 ms. This makes the Python client around 25x slower just in the parsing stage.

📈 Performance Benchmark
Data size: ~10MB (Flux CSV)

Query response time (network): < 20 ms

Parsing time (Python client): > 5000 ms

Parsing time (Go client): < 200 ms
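For reference, raw CSV tokenization itself need not be the bottleneck. The following standalone micro-benchmark (synthetic, hypothetical data — not the dataset from this report) parses a ~10MB Flux-style CSV payload with the stdlib csv module and typically completes in a small fraction of a second:

```python
import csv
import io
import time

# Synthetic ~10MB Flux-style CSV payload (hypothetical data, not the
# dataset from this report).
header = ",result,table,_time,_value,_field,_measurement\n"
row = ",_result,0,2025-04-13T14:18:11Z,23.5,temperature,sensors\n"
n_rows = (10 * 1024 * 1024) // len(row)
payload = header + row * n_rows

# Time only the parse, mirroring the network-vs-parse split above.
t0 = time.perf_counter()
records = list(csv.reader(io.StringIO(payload)))
elapsed_ms = (time.perf_counter() - t0) * 1000

print(f"parsed {len(records)} records in {elapsed_ms:.0f} ms")
```

This suggests the 5 s figure comes from per-record overhead layered on top of tokenization, not from the raw volume of CSV text.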

🔍 Root Cause
Profiling indicates that the performance degradation is centered in the FluxCsvParser implementation. The current parsing logic in Python seems to be inefficient for large responses due to overhead in string parsing, tokenization, and possibly memory management.
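If it helps triage, the hotspot can be pinned down with cProfile. The sketch below profiles a toy pure-Python line parser on synthetic data — the function name and data are illustrative stand-ins, not the client's actual code path:

```python
import cProfile
import io
import pstats

def parse_lines(text):
    # Toy stand-in for a pure-Python tokenizer; illustrative only.
    return [line.split(",") for line in text.splitlines()]

payload = "\n".join("a,b,c,%d" % i for i in range(100_000))

# Profile only the parse step, then rank by cumulative time.
profiler = cProfile.Profile()
profiler.enable()
rows = parse_lines(payload)
profiler.disable()

out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(5)
print(f"parsed {len(rows)} rows")
```

Running the same harness around query_api.query() would show which FluxCsvParser methods dominate the 5 s.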

💡 Suggested Improvement
To address this, I suggest reviewing the implementation of FluxCsvParser—specifically around how it handles tokenization, buffering, and line-by-line parsing. Additionally, performance could be dramatically improved by offloading the CSV parsing to a C extension (e.g., using cffi, cython, or ctypes) or integrating an existing optimized parser like libcsv or simdjson.

This would help close the gap with the Go client's performance while maintaining compatibility with the current interface.
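Part of that win may already be available from the stdlib: csv.reader is a C-accelerated parser, while naive per-line string splitting is both slower at scale and incorrect for quoted fields. A small illustrative comparison (helper names are hypothetical):

```python
import csv
import io

def parse_pure_python(text):
    # Naive pure-Python tokenizer: simple, but no quote handling.
    return [line.split(",") for line in text.splitlines()]

def parse_c_accelerated(text):
    # csv.reader is implemented as a C extension and handles quoting.
    return list(csv.reader(io.StringIO(text)))

sample = 'x,y\n1,"a,b"\n'
print(parse_pure_python(sample))    # mis-splits the quoted field
print(parse_c_accelerated(sample))  # [['x', 'y'], ['1', 'a,b']]
```

So even without a new native extension, routing tokenization through csv.reader (if FluxCsvParser does not already do so on this path) might recover much of the gap.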

✅ Request
Could the maintainers review the FluxCsvParser code path, especially the generate function?

Is there openness to rewriting this part as a performance-critical native extension, or at least modularizing it for optional native acceleration?
