Data loading is very slow when retrieving OHLCV for 600 instruments #1910

Open
trungtv opened this issue Apr 13, 2025 · 0 comments
Labels
question Further information is requested

trungtv commented Apr 13, 2025

❓ Questions and Help

Hi Qlib team,

Thank you for the great work on this project. I'm currently using Qlib to build a forecasting pipeline and have encountered a serious performance issue when loading data.

Specifically, when retrieving OHLCV data for ~600 instruments (using the default Alpha158 features), the data loading step alone takes about 168 seconds (about 170 seconds for the whole data initialization), which is much longer than I expected.

Here is a log snippet from my run:

[32077:MainThread](2025-04-13 09:41:09,724) INFO - qlib.timer - [log.py:127] - Time cost: 168.134s | Loading data Done
[32077:MainThread](2025-04-13 09:41:09,777) INFO - qlib.timer - [log.py:127] - Time cost: 0.038s | DropnaProcessor Done
[32077:MainThread](2025-04-13 09:41:10,816) INFO - qlib.timer - [log.py:127] - Time cost: 1.038s | FilterByInstrumentLengthProcessor Done
[32077:MainThread](2025-04-13 09:41:10,830) INFO - qlib.timer - [log.py:127] - Time cost: 0.009s | DropnaLabel Done
[32077:MainThread](2025-04-13 09:41:10,842) INFO - qlib.timer - [log.py:127] - Time cost: 0.011s | DropnaLabel Done
[32077:MainThread](2025-04-13 09:41:11,905) INFO - qlib.timer - [log.py:127] - Time cost: 1.063s | FilterByInstrumentLengthProcessor Done
[32077:MainThread](2025-04-13 09:41:11,907) INFO - qlib.timer - [log.py:127] - Time cost: 2.182s | fit & process data Done
[32077:MainThread](2025-04-13 09:41:11,907) INFO - qlib.timer - [log.py:127] - Time cost: 170.318s | Init data Done

This makes experimentation and model development slow. I checked disk performance and system load, and both look normal.
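
To make the breakdown in the log explicit, here is a minimal sketch that parses the `qlib.timer` lines quoted above and computes how much of the total initialization time is spent in the "Loading data" step. The regular expression and the helper names (`parse_timings`, `share_of_total`) are my own assumptions for illustration and are not part of Qlib.

```python
import re

# Log lines copied verbatim from the run quoted in this issue.
LOG = """\
[32077:MainThread](2025-04-13 09:41:09,724) INFO - qlib.timer - [log.py:127] - Time cost: 168.134s | Loading data Done
[32077:MainThread](2025-04-13 09:41:09,777) INFO - qlib.timer - [log.py:127] - Time cost: 0.038s | DropnaProcessor Done
[32077:MainThread](2025-04-13 09:41:10,816) INFO - qlib.timer - [log.py:127] - Time cost: 1.038s | FilterByInstrumentLengthProcessor Done
[32077:MainThread](2025-04-13 09:41:10,830) INFO - qlib.timer - [log.py:127] - Time cost: 0.009s | DropnaLabel Done
[32077:MainThread](2025-04-13 09:41:10,842) INFO - qlib.timer - [log.py:127] - Time cost: 0.011s | DropnaLabel Done
[32077:MainThread](2025-04-13 09:41:11,905) INFO - qlib.timer - [log.py:127] - Time cost: 1.063s | FilterByInstrumentLengthProcessor Done
[32077:MainThread](2025-04-13 09:41:11,907) INFO - qlib.timer - [log.py:127] - Time cost: 2.182s | fit & process data Done
[32077:MainThread](2025-04-13 09:41:11,907) INFO - qlib.timer - [log.py:127] - Time cost: 170.318s | Init data Done
"""

# Assumed line shape: "... Time cost: <seconds>s | <step name> Done"
TIMER_RE = re.compile(r"Time cost: (?P<secs>[\d.]+)s \| (?P<step>.+?) Done\s*$")


def parse_timings(log_text):
    """Return (step, seconds) pairs in the order they appear in the log."""
    timings = []
    for line in log_text.splitlines():
        match = TIMER_RE.search(line)
        if match:
            timings.append((match.group("step"), float(match.group("secs"))))
    return timings


def share_of_total(timings, step, total_step):
    """Fraction of `total_step` time spent in `step` (both must be unique names)."""
    lookup = dict(timings)
    return lookup[step] / lookup[total_step]


timings = parse_timings(LOG)
loading_share = share_of_total(timings, "Loading data", "Init data")
print(f"Loading data: {loading_share:.1%} of Init data")
```

For this log the loading step is roughly 98.7% of the total, so the processors (about 2 seconds combined) are not the bottleneck.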

Could you please help clarify:

Is this expected behavior with the current version of Qlib?

Are there any recommended configurations (e.g., cache setup, parallel loading, data format) to reduce the data loading time?

Would switching to a different storage format (e.g., parquet or Arrow) help here?

Are there any best practices when using a large number of instruments?
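
While waiting for an answer on Qlib's own cache options, one generic workaround is to pay the loading cost once and reuse the result across experiments. The sketch below is plain Python and does not use any Qlib API: `load_or_build` is a hypothetical helper, and `build_features` is a stand-in for whatever slow step produces the feature data. It assumes the built object is picklable and that the cache file is deleted manually whenever the underlying data or feature configuration changes.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(cache_path, build_fn):
    """Return the object cached at cache_path, building and caching it if absent."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        with cache_path.open("rb") as f:
            return pickle.load(f)

    obj = build_fn()  # the slow step runs only on a cache miss

    # Write to a temporary file first so an interrupted run never leaves
    # a half-written cache behind, then rename it into place.
    tmp_path = cache_path.with_suffix(cache_path.suffix + ".tmp")
    with tmp_path.open("wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)
    tmp_path.replace(cache_path)
    return obj


calls = []


def build_features():
    # Hypothetical stand-in for the slow data-loading step.
    calls.append(1)
    return {"AAA": [1.0, 2.0], "BBB": [3.0, 4.0]}


with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "features.pkl"
    first = load_or_build(path, build_features)   # cache miss: builds and saves
    second = load_or_build(path, build_features)  # cache hit: loads from disk
```

The second call returns the same data without invoking the builder again. This does not make the first load faster; it only avoids repeating it.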

Qlib version: 0.9.6

Python version: 3.9

OS: macOS

Data: custom dataset

trungtv added the "question" label on Apr 13, 2025