❓ Questions and Help
Hi Qlib team,
Thank you for the great work on this project. I'm currently using Qlib to build a forecasting pipeline and have encountered a serious performance issue when loading data.
Specifically, when retrieving OHLCV data for ~600 instruments (using default Alpha158 features), the data loading process takes around 170 seconds, which is significantly longer than expected.
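For reference, the loading step is invoked roughly like this. This is a simplified sketch, not my exact code: the provider path, instrument pool, and dates are placeholders, and I have omitted my processor configuration (DropnaProcessor, FilterByInstrumentLengthProcessor, DropnaLabel, as visible in the log below).

```python
# Simplified sketch of my pipeline (placeholder path/dates; processors omitted).
import qlib
from qlib.constant import REG_CN
from qlib.contrib.data.handler import Alpha158
from qlib.data.dataset import DatasetH

qlib.init(provider_uri="~/.qlib/qlib_data/custom_data", region=REG_CN)  # placeholder path

handler = Alpha158(
    instruments="all",            # my custom universe of ~600 instruments
    start_time="2015-01-01",      # placeholder dates
    end_time="2024-12-31",
    fit_start_time="2015-01-01",
    fit_end_time="2020-12-31",
)

dataset = DatasetH(
    handler=handler,
    segments={
        "train": ("2015-01-01", "2020-12-31"),
        "valid": ("2021-01-01", "2022-12-31"),
        "test": ("2023-01-01", "2024-12-31"),
    },
)
# The "Loading data" / "Init data" timings in the log below come from this handler/dataset init.
```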
Here is a log snippet from my run:
[32077:MainThread](2025-04-13 09:41:09,724) INFO - qlib.timer - [log.py:127] - Time cost: 168.134s | Loading data Done
[32077:MainThread](2025-04-13 09:41:09,777) INFO - qlib.timer - [log.py:127] - Time cost: 0.038s | DropnaProcessor Done
[32077:MainThread](2025-04-13 09:41:10,816) INFO - qlib.timer - [log.py:127] - Time cost: 1.038s | FilterByInstrumentLengthProcessor Done
[32077:MainThread](2025-04-13 09:41:10,830) INFO - qlib.timer - [log.py:127] - Time cost: 0.009s | DropnaLabel Done
[32077:MainThread](2025-04-13 09:41:10,842) INFO - qlib.timer - [log.py:127] - Time cost: 0.011s | DropnaLabel Done
[32077:MainThread](2025-04-13 09:41:11,905) INFO - qlib.timer - [log.py:127] - Time cost: 1.063s | FilterByInstrumentLengthProcessor Done
[32077:MainThread](2025-04-13 09:41:11,907) INFO - qlib.timer - [log.py:127] - Time cost: 2.182s | fit & process data Done
[32077:MainThread](2025-04-13 09:41:11,907) INFO - qlib.timer - [log.py:127] - Time cost: 170.318s | Init data Done
This makes experimentation and model development inefficient. I have already checked disk performance and overall system load, and both look normal.
Could you please help clarify:
1. Is this expected behavior with the current version of Qlib?
2. Are there any recommended configurations (e.g., cache setup, parallel loading, data format) to reduce the data loading time? My rough guess is in the sketch after this list.
3. Would switching to a different storage format (e.g., Parquet or Arrow) help here?
4. Are there any best practices for working with a large number of instruments?
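To make question 2 concrete: is something along these lines the recommended direction? This is only a guess from skimming the config; the cache and kernel arguments below are assumptions on my part, and I have not verified that they are the right knobs.

```python
# Guess at a cache/parallelism setup — the argument names below are assumptions
# based on config keys I found, not a verified recommendation.
import qlib
from qlib.constant import REG_CN

qlib.init(
    provider_uri="~/.qlib/qlib_data/custom_data",  # placeholder path
    region=REG_CN,
    expression_cache="DiskExpressionCache",  # assumed: on-disk cache for computed expressions
    dataset_cache="DiskDatasetCache",        # assumed: on-disk cache for prepared datasets
    kernels=8,                               # assumed: number of parallel expression workers
)
```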
Qlib version: 0.9.6
Python version: 3.9
OS: macOS
Data: custom dataset