
[Bug]: MongoDB pymongo.errors.OperationFailure (Size larger than 16MB) #17768

Open
xcvil opened this issue Feb 10, 2025 · 4 comments
Labels
bug Something isn't working triage Issue needs to be triaged/prioritized

Comments

@xcvil

xcvil commented Feb 10, 2025

Bug Description

File "/llama_index/core/indices/keyword_table/base.py", line 90, in __init__
    super().__init__(
  File "/llama_index/core/indices/base.py", line 82, in __init__
    self._storage_context.index_store.add_index_struct(self._index_struct)
  File "/llama_index/core/storage/index_store/keyval_index_store.py", line 46, in add_index_struct
    self._kvstore.put(key, data, collection=self._collection)
  File "/llama_index/storage/kvstore/mongodb/base.py", line 136, in put
    self.put_all([(key, val)], collection=collection)
  File "/llama_index/storage/kvstore/mongodb/base.py", line 175, in put_all
    self._db[collection].bulk_write(new_docs)

pymongo.errors.OperationFailure: BSONObj size: 18616930 (0x11C1262) is invalid. Size must be between 0 and 16793600(16MB) First element: q: { _id: "ffb66f08-ca96-4ab2-ad53-82444ee9295e" }, full error: {'ok': 0.0, 'errmsg': 'BSONObj size: 18616930 (0x11C1262) is invalid. Size must be between 0 and 16793600(16MB) First element: q: { _id: "ffb66f08-ca96-4ab2-ad53-82444ee9295e" }', 'code': 10334, 'codeName': 'BSONObjectTooLarge'}

Version

0.11.10

Steps to Reproduce

The error occurs because I generate a large number of nodes (~5000) and use a keyword index to extract keywords with an LLM. Later, the keyword_table index writes the nodes (chunks) and keywords into MongoDB, but llama_index does not handle this well: the total size of the document exceeds 16 MB. I am wondering whether you considered this case?
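For context, MongoDB rejects any single BSON document over roughly 16 MB, and the keyword table is serialized as one document. A rough sketch of estimating how large such a table gets, using JSON length as a stand-in for BSON encoding (real code would use `bson.encode()` from pymongo; the numbers below are illustrative, not from the actual index):

```python
import json

def estimate_doc_size(table: dict) -> int:
    """Rough proxy for the BSON size of a keyword-table document.

    JSON length is not identical to BSON size, but it is close
    enough to see whether a document is near the 16 MB cap.
    """
    return len(json.dumps(table).encode("utf-8"))

# Illustrative: ~5000 keywords, each mapping to a couple hundred
# 36-character node UUIDs, quickly adds up when stored as ONE document.
table = {f"keyword_{i}": ["x" * 36] * 200 for i in range(5000)}
size = estimate_doc_size(table)
over_limit = size > 16 * 1024 * 1024
```

With numbers in this ballpark, the single serialized document lands well past the 16 MB cap, which matches the `BSONObjectTooLarge` error above.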

Relevant Logs/Tracebacks

@xcvil xcvil added bug Something isn't working triage Issue needs to be triaged/prioritized labels Feb 10, 2025
@xcvil
Author

xcvil commented Feb 10, 2025

Similar issue here: #11958


dosubot bot commented Feb 10, 2025

The error you're encountering is due to MongoDB's BSON size limit of 16MB. When using LlamaIndex, there are no built-in mechanisms specifically mentioned for splitting or compressing data to avoid exceeding this limit [1]. However, you can manage large datasets by configuring parameters such as batch_size when using MongoDB as a storage backend [1].

To handle this issue, consider breaking down your data into smaller chunks before storing it in MongoDB. You can also adjust the batch_size parameter in the put_all method to ensure that data is processed in smaller batches, which might help in managing the size of each document being stored [2].

I didn't find any similar threads related to this specific issue, but you might want to check issue #11958 that you mentioned for any related discussions or solutions.


@logan-markewich
Collaborator

Hmm, I guess put_all() should probably be batching, i.e. in llama_index/storage/kvstore/mongodb/base.py, line 175, in put_all. Not terribly hard to fix; I welcome a PR
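A size-aware variant of that batching might look like the sketch below. This is a hypothetical helper, not the llama_index API: it uses JSON length as a proxy where real code would measure with `bson.encode()`, and it ignores the separate per-operation count limits that `bulk_write` batching should also respect.

```python
import json

MAX_BATCH_BYTES = 16 * 1024 * 1024  # MongoDB's BSON size cap

def batch_by_size(kv_pairs, max_bytes=MAX_BATCH_BYTES):
    """Yield lists of (key, value-dict) pairs whose combined
    serialized size stays under max_bytes.

    Hypothetical sketch: real code would measure with bson.encode()
    and pass each batch to a separate bulk_write() call.
    """
    batch, batch_bytes = [], 0
    for key, val in kv_pairs:
        size = len(json.dumps({"_id": key, **val}).encode("utf-8"))
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append((key, val))
        batch_bytes += size
    if batch:
        yield batch
```

Note that batching across pairs only helps when the total of many small documents is too large; it cannot help when a single document already exceeds 16 MB.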

@xcvil
Copy link
Author

xcvil commented Feb 11, 2025

Thanks Logan. I checked the put_all function and found that batched puts are already implemented. But I am still wondering what the cause is, since I cannot imagine how a dictionary could be larger than 16 MB given that the full file I am using is smaller than 16 MB...
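This would be consistent with the traceback: add_index_struct serializes the entire keyword table into a single KV entry, so one bulk_write operation carries one oversized document, and count-based batching never splits it. A keyword table can also be far larger than the source file, since each node's text is duplicated across every keyword that points to it. One hypothetical workaround (not the llama_index API) is to shard an oversized mapping across several sub-documents and merge them on load:

```python
import json

def shard_table(table: dict, max_bytes: int):
    """Split one large key->value mapping into several sub-dicts,
    each serializing to roughly at most max_bytes (JSON length used
    as a stand-in for BSON size). Hypothetical sketch only.
    """
    shards, current, current_bytes = [], {}, 2  # 2 bytes for "{}"
    for key, value in table.items():
        entry_bytes = len(json.dumps({key: value}).encode("utf-8"))
        if current and current_bytes + entry_bytes > max_bytes:
            shards.append(current)
            current, current_bytes = {}, 2
        current[key] = value
        current_bytes += entry_bytes
    if current:
        shards.append(current)
    return shards

# Each shard could be stored under keys like "<id>__part0",
# "<id>__part1", ... and the parts merged back when loading.
```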
