-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: use external chunker [cog-1354] #551
Conversation
WalkthroughThis pull request refactors the document chunking mechanism by introducing a new base class, Chunker, and its subclasses. The changes include adding a new LangchainChunker, updating TextChunker to inherit from Chunker, and modifying various document classes (Audio, Document, Image, Pdf, Text, Unstructured) and a task function to accept a chunker class instance (renamed to Changes
Possibly related PRs
Suggested reviewers
Poem
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms (28)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (6)
cognee/modules/chunking/LangchainChunker.py (3)
20-27
: Add type annotation forchunk_overlap
.
To enhance clarity and maintain consistency, include the explicit type annotation for thechunk_overlap
parameter, for examplechunk_overlap: int = 10
.- chunk_overlap=10, + chunk_overlap: int = 10,
32-34
: Unusedcheck_token_count
method.
check_token_count
is defined but never called inread()
. Consider removing or integrating it into the logic to avoid code duplication or confusion.
36-60
: Consider adding more logging for troubleshooting.
The coreread()
logic is clear. For robust debugging and monitoring, consider logging chunk sizes, indices, or token counts, especially when aValueError
is raised.cognee/modules/chunking/TextChunker.py (1)
11-11
: Good inheritance fromChunker
.
ExtendingChunker
alignsTextChunker
with the unified chunking pattern. Watch out for variable naming — the base class usesmax_chunk_size
while references here call itself.chunk_size
.cognee/modules/chunking/Chunker.py (1)
6-10
: Constructor naming clarity.
You passchunk_size
to the constructor, then store it asself.max_chunk_size
, which can be confusing for newcomers. Consider reusing the same name or clarifying the difference in a docstring.cognee/api/v1/cognify/cognify_v2.py (1)
127-130
: LGTM!The Task instantiation is correctly updated to include the TextChunker class as the chunker parameter, maintaining consistency with the new chunking mechanism.
Consider adding type hints for better clarity.
Consider adding type hints to the Task parameters for better code clarity:
Task( extract_chunks_from_documents, - max_chunk_tokens=get_max_chunk_tokens(), - chunker=TextChunker, + max_chunk_tokens: int = get_max_chunk_tokens(), + chunker: type[TextChunker] = TextChunker, ),
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
cognee/api/v1/cognify/cognify_v2.py
(2 hunks)cognee/modules/chunking/Chunker.py
(1 hunks)cognee/modules/chunking/LangchainChunker.py
(1 hunks)cognee/modules/chunking/TextChunker.py
(1 hunks)cognee/modules/data/processing/document_types/AudioDocument.py
(2 hunks)cognee/modules/data/processing/document_types/ChunkerMapping.py
(0 hunks)cognee/modules/data/processing/document_types/Document.py
(2 hunks)cognee/modules/data/processing/document_types/ImageDocument.py
(2 hunks)cognee/modules/data/processing/document_types/PdfDocument.py
(1 hunks)cognee/modules/data/processing/document_types/TextDocument.py
(2 hunks)cognee/modules/data/processing/document_types/UnstructuredDocument.py
(3 hunks)cognee/tasks/documents/extract_chunks_from_documents.py
(3 hunks)cognee/tests/integration/documents/AudioDocument_test.py
(2 hunks)cognee/tests/integration/documents/ImageDocument_test.py
(2 hunks)cognee/tests/integration/documents/PdfDocument_test.py
(2 hunks)cognee/tests/integration/documents/TextDocument_test.py
(2 hunks)cognee/tests/integration/documents/UnstructuredDocument_test.py
(5 hunks)
💤 Files with no reviewable changes (1)
- cognee/modules/data/processing/document_types/ChunkerMapping.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
- GitHub Check: docker-compose-test
- GitHub Check: Build Cognee Backend Docker App Image
🔇 Additional comments (28)
cognee/tests/integration/documents/ImageDocument_test.py (2)
3-3
: LGTM! Import aligns with the new chunking approach.The explicit import of
TextChunker
class supports the transition from string-based chunker identification to typed class references.
27-27
:❓ Verification inconclusive
LGTM! Verify consistency across other document types.
The update to use
TextChunker
class instead of a string identifier is a good improvement. The parameter namechunker_cls
better reflects that it expects a class reference.Let's verify that this change is consistent across other document types:
🏁 Script executed:
#!/bin/bash # Description: Check if all document test files have been updated to use chunker_cls # and TextChunker class instead of string-based chunker identification. # Search for any remaining string-based chunker usage in document tests echo "Checking for any remaining string-based chunker usage..." rg -l "chunker=\"text_chunker\"" "cognee/tests/integration/documents/*Document_test.py" # Search for the new pattern to confirm consistent adoption echo "Checking for new chunker_cls pattern..." rg -l "chunker_cls=TextChunker" "cognee/tests/integration/documents/*Document_test.py"Length of output: 572
LGTM – Code change looks good; please verify file paths for consistency manually.
The update to use the
TextChunker
class via thechunker_cls
parameter is a clear improvement. However, the automated check for consistency across document test files did not yield results due to file path issues (the glob pattern used was not expanded because it was quoted). Please manually verify that all document tests (e.g.,ImageDocument_test.py
and others undercognee/tests/integration/documents/
) consistently usechunker_cls=TextChunker
and that no old string-based usage (chunker="text_chunker"
) remains.cognee/modules/data/processing/document_types/PdfDocument.py (3)
2-2
: LGTM! Import changes align with the refactoring.The import of the
Chunker
class directly supports the transition to using explicit chunker class references.
10-10
: LGTM! Method signature changes improve type safety.The parameter rename and type change provide better type safety and make the API more explicit about the expected input.
18-20
:❓ Verification inconclusive
Verify that all callers have been updated.
The direct chunker instantiation looks good and is more type-safe. However, we should ensure all callers have been updated to pass Chunker classes instead of strings.
🏁 Script executed:
#!/bin/bash # Search for calls to PdfDocument.read to verify they pass Chunker classes rg -A 2 "PdfDocument.*\.read\("Length of output: 33
Chunker instantiation verified – please confirm caller updates manually.
The direct instantiation using
chunker_cls
is cleaner and more type-safe. However, our automated search for calls toPdfDocument.read
did not return any results. Since this absence of output doesn’t conclusively prove that all callers are updated to pass Chunker classes instead of strings, please perform a manual review of the callers to ensure they meet this new expectation.cognee/modules/chunking/LangchainChunker.py (2)
1-2
: No major concerns with these imports.
They are standard Python imports, and usage appears straightforward and appropriate.
12-18
: Well-documented class docstring.
The explanatory docstring provides good clarity on howLangchainChunker
uses Langchain’sRecursiveCharacterTextSplitter
.cognee/modules/chunking/TextChunker.py (1)
5-5
: Appropriate usage of the new base class.
ImportingChunker
and referencing it maintains consistency across your new chunking architecture.cognee/modules/chunking/Chunker.py (1)
12-14
: Abstract method is properly highlighted.
read()
raisesNotImplementedError
, making the contract clear that subclasses must override it.cognee/modules/data/processing/document_types/Document.py (1)
3-3
: LGTM! Type safety improvements.The changes improve type safety by:
- Using explicit
Chunker
class instead of string identifiers- Renaming parameter to
chunker_cls
to better reflect its purposeAlso applies to: 13-15
cognee/modules/data/processing/document_types/TextDocument.py (1)
2-2
: LGTM! Streamlined chunker instantiation.The changes improve the code by:
- Directly using the chunker class instead of string-based configuration
- Maintaining the existing generator-based text reading approach
Also applies to: 8-8, 19-21
cognee/modules/data/processing/document_types/AudioDocument.py (1)
2-2
: LGTM! Consistent with other document types.The changes maintain consistency with other document types while preserving the audio transcription functionality.
Also applies to: 14-14, 19-21
cognee/modules/data/processing/document_types/ImageDocument.py (1)
2-2
: LGTM! Consistent with other document types.The changes maintain consistency with other document types while preserving the image transcription functionality.
Also applies to: 14-14, 18-20
cognee/modules/data/processing/document_types/UnstructuredDocument.py (3)
3-3
: LGTM! Import statement updated for type safety.The import statement now correctly imports the base
Chunker
class, which aligns with the new type-safe approach.
12-12
: LGTM! Method signature enhanced with type safety.The parameter type change from
str
toChunker
improves type safety and makes the API more explicit.
31-33
: LGTM! Chunker instantiation simplified.The chunker instantiation is now more direct and type-safe, using the class reference instead of string-based instantiation.
cognee/tests/integration/documents/PdfDocument_test.py (2)
3-3
: LGTM! Import statement added for TextChunker.The import statement correctly adds the TextChunker class needed for the updated test.
28-28
: LGTM! Test updated to use class-based chunker.The test now correctly uses the TextChunker class instead of a string identifier, maintaining test coverage while improving type safety.
cognee/tests/integration/documents/TextDocument_test.py (2)
5-5
: LGTM! Import statement added for TextChunker.The import statement correctly adds the TextChunker class needed for the updated test.
41-41
: LGTM! Test updated to use class-based chunker.The test now correctly uses the TextChunker class instead of a string identifier, maintaining test coverage while improving type safety.
cognee/tasks/documents/extract_chunks_from_documents.py (3)
8-9
: LGTM! Import statements added for chunker classes.The import statements correctly add both the TextChunker implementation and the base Chunker class.
31-31
: LGTM! Function signature enhanced with type safety.The parameter type change from string to
Chunker
with a default value ofTextChunker
improves type safety while maintaining backward compatibility.
43-43
: LGTM! Document read call updated.The document.read call now correctly uses the chunker_cls parameter name, maintaining consistency with the new API.
cognee/tests/integration/documents/AudioDocument_test.py (2)
3-3
: LGTM!The import statement correctly imports the TextChunker class that replaces the string-based chunker identifier.
38-38
: LGTM!The change from string-based chunker to class reference improves type safety and makes the code more maintainable.
cognee/tests/integration/documents/UnstructuredDocument_test.py (2)
3-3
: LGTM!The import statement correctly imports the TextChunker class that replaces the string-based chunker identifier.
72-72
: LGTM!The changes consistently apply the new chunker class reference pattern across all document types (PPTX, DOCX, CSV, XLSX), improving type safety and maintainability.
Also applies to: 82-82, 92-92, 104-104
cognee/api/v1/cognify/cognify_v2.py (1)
28-28
: LGTM!The import statement correctly imports the TextChunker class that replaces the string-based chunker identifier.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
cognee/tests/integration/documents/TextDocument_test.py (1)
20-23
: Consider adding more test cases.The current test cases cover basic functionality with two different chunk sizes. Consider adding test cases for:
- Edge cases with very small or large chunk sizes
- Different types of text content (e.g., with varying paragraph structures)
- Boundary conditions for max_chunk_tokens
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
cognee/modules/chunking/LangchainChunker.py
(1 hunks)cognee/modules/data/processing/document_types/TextDocument.py
(1 hunks)cognee/tests/integration/documents/TextDocument_test.py
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- cognee/modules/chunking/LangchainChunker.py
- cognee/modules/data/processing/document_types/TextDocument.py
🔇 Additional comments (2)
cognee/tests/integration/documents/TextDocument_test.py (2)
41-41
: LGTM! Test correctly uses the new chunking mechanism.The test has been properly updated to use the TextChunker class, aligning with the architectural changes mentioned in the PR summary.
10-11
:✅ Verification successful
Verify updated ground truth values.
The ground truth values for "code.txt" have been significantly modified, with changes in both the word counts and text lengths. Additionally, the cut type has changed from "sentence_cut" to "paragraph_end", suggesting a fundamental change in the chunking behavior.
Let's verify these values by examining the test data file:
🏁 Script executed:
#!/bin/bash # Description: Verify the content and chunking of the test data file # Find and display the content of code.txt fd -t f "code.txt" -x cat {}Length of output: 2728
Ground truth values verified and aligned with the updated chunking approach.
After reviewing the content of code.txt and correlating it with the changes in the integration test (lines 10–11 of
cognee/tests/integration/documents/TextDocument_test.py
), we can confirm that:
- The updated word counts (252 and 56) and text lengths (1376 and 481) match what is expected from the new paragraph-based splitting.
- The change in
cut_type
from"sentence_cut"
to"paragraph_end"
is consistent with the revised chunking mechanism.- The integration test now accurately reflects the intended behavior of the
TextDocument
class.No further changes are required.
super().__init__(document, get_text, max_chunk_tokens, chunk_size) | ||
|
||
self.splitter = RecursiveCharacterTextSplitter( | ||
chunk_size=chunk_size, chunk_overlap=chunk_overlap |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
chunk_size
in RecursiveCharacterTextSplitter
refers to the number of characters in the chunk, but we refer to it as number of words. We need to unify this somehow, and moving to tokens makes sense. The only problem there is that counting tokens can be slow. And in this case it will be hard to assume the number of characters needed for 1024 tokens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
now I'm passing word count as length function in RecursiveCharacterTextSplitter
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit
New Features
LangchainChunker
and improvedTextChunker
.Refactor
Bug Fixes