
feat: use external chunker [cog-1354] #551

Merged
merged 18 commits into dev from feat/cog-1354-use-external-chunker
Feb 21, 2025

Conversation

alekszievr
Contributor

@alekszievr alekszievr commented Feb 18, 2025

Description

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin

Summary by CodeRabbit

  • New Features

    • Introduced a modular content chunking interface that offers flexible text segmentation with configurable chunk size and overlap.
    • Added new chunkers for enhanced text processing, including LangchainChunker and improved TextChunker.
  • Refactor

    • Unified the chunk extraction mechanism across various document types for improved consistency and type safety.
    • Updated method signatures to enhance clarity and type safety regarding chunker usage.
    • Enhanced error handling and logging during text segmentation to guide adjustments when content exceeds limits.
  • Bug Fixes

    • Adjusted expected output in tests to reflect changes in chunking logic and configurations.

Contributor

coderabbitai bot commented Feb 18, 2025

Walkthrough

This pull request refactors the document chunking mechanism by introducing a new base class, Chunker, and its subclasses. The changes include adding a new LangchainChunker, updating TextChunker to inherit from Chunker, and modifying various document classes (Audio, Document, Image, Pdf, Text, Unstructured) and a task function to accept a chunker class (parameter renamed to chunker_cls) instead of a string. These updates enforce type safety and standardize chunker instantiation across modules.
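The pattern the walkthrough describes can be sketched as follows. Class and parameter names (Chunker, TextChunker, read, chunker_cls, max_chunk_tokens) come from the PR; the bodies are simplified stand-ins, not the project's actual implementation.

```python
from typing import Iterator


class Chunker:
    """Base class; subclasses must implement read()."""

    def __init__(self, get_text, max_chunk_tokens: int, chunk_size: int = 1024):
        self.get_text = get_text
        self.max_chunk_tokens = max_chunk_tokens
        self.max_chunk_size = chunk_size

    def read(self) -> Iterator[str]:
        raise NotImplementedError  # the abstract contract described above


class TextChunker(Chunker):
    def read(self) -> Iterator[str]:
        # Naive word-based splitting as a placeholder for the real logic.
        for text in self.get_text():
            words = text.split()
            for i in range(0, len(words), self.max_chunk_size):
                yield " ".join(words[i : i + self.max_chunk_size])


class Document:
    def __init__(self, text: str):
        self.text = text

    # Callers now pass a Chunker *class* (chunker_cls), not a string name.
    def read(self, chunker_cls: type[Chunker], max_chunk_tokens: int) -> Iterator[str]:
        chunker = chunker_cls(lambda: [self.text], max_chunk_tokens, chunk_size=3)
        yield from chunker.read()


doc = Document("one two three four five")
print(list(doc.read(chunker_cls=TextChunker, max_chunk_tokens=512)))
# ['one two three', 'four five']
```

Passing the class rather than a string removes the need for a name-to-class mapping (the deleted ChunkerMapping.py) and lets type checkers catch an invalid chunker at the call site.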

Changes

  • cognee/modules/chunking/Chunker.py: New base class Chunker added with class-level attributes and an abstract read method.
  • cognee/modules/chunking/LangchainChunker.py: New subclass LangchainChunker added that implements the read method using RecursiveCharacterTextSplitter, logging, and token validation.
  • cognee/modules/chunking/TextChunker.py: Modified to inherit from Chunker; removed its own constructor and redundant attributes.
  • cognee/modules/data/processing/document_types/{Audio, Document, Image, Pdf, Text, Unstructured}Document.py: Updated read method signatures: renamed chunker to chunker_cls and changed its type from str (or previous types) to Chunker. Adjusted import statements accordingly.
  • cognee/modules/tasks/documents/extract_chunks_from_documents.py: Changed the chunker parameter from a string ("text_chunker") to a Chunker class with a default of TextChunker, and updated the call to read with chunker_cls.

Possibly related PRs

Suggested reviewers

  • borisarzentar

Poem

Oh, hop along in code so neat,
I’m a rabbit with a joyful beat.
New chunkers spring with typed embrace,
Document chunks find their rightful place.
Carrots and code in perfect sync – what a treat!
🐰🥕 Happy hopping with refactored feat!


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0448189 and 9b1e120.

📒 Files selected for processing (1)
  • cognee/modules/chunking/LangchainChunker.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • cognee/modules/chunking/LangchainChunker.py
⏰ Context from checks skipped due to timeout of 90000ms (28)
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_multimedia_example_test / test
  • GitHub Check: run_simple_example_test / test
  • GitHub Check: run_eval_framework_test / test
  • GitHub Check: run_networkx_metrics_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-13
  • GitHub Check: Test on macos-13
  • GitHub Check: windows-latest
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: run_dynamic_steps_example_test / test
  • GitHub Check: test
  • GitHub Check: Test on macos-13
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: docker-compose-test
  • GitHub Check: Build Cognee Backend Docker App Image


@alekszievr alekszievr changed the base branch from dev to feat/cog-1354-externalize-chunker February 18, 2025 12:14
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (6)
cognee/modules/chunking/LangchainChunker.py (3)

20-27: Add type annotation for chunk_overlap.
To enhance clarity and maintain consistency, include the explicit type annotation for the chunk_overlap parameter, for example chunk_overlap: int = 10.

-        chunk_overlap=10,
+        chunk_overlap: int = 10,

32-34: Unused check_token_count method.
check_token_count is defined but never called in read(). Consider removing or integrating it into the logic to avoid code duplication or confusion.
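One hypothetical way to integrate check_token_count into read(), as the comment suggests. The class below is a simplified stand-in: word count serves as a token proxy, and the splitter is replaced by a plain list of chunks, whereas the real class delegates to Langchain's RecursiveCharacterTextSplitter and an actual tokenizer.

```python
class LangchainChunkerSketch:
    def __init__(self, max_chunk_tokens: int):
        self.max_chunk_tokens = max_chunk_tokens

    def check_token_count(self, chunk_text: str) -> int:
        token_count = len(chunk_text.split())  # stand-in tokenizer
        if token_count > self.max_chunk_tokens:
            raise ValueError(
                f"Chunk has {token_count} tokens, exceeding the limit of "
                f"{self.max_chunk_tokens}; reduce chunk_size."
            )
        return token_count

    def read(self, chunks):
        for chunk in chunks:
            self.check_token_count(chunk)  # now actually used in read()
            yield chunk


chunker = LangchainChunkerSketch(max_chunk_tokens=4)
print(list(chunker.read(["a b c", "d e"])))
# ['a b c', 'd e']
```

Calling the check inside read() means an oversized chunk fails fast with an actionable message instead of silently passing through.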


36-60: Consider adding more logging for troubleshooting.
The core read() logic is clear. For robust debugging and monitoring, consider logging chunk sizes, indices, or token counts, especially when a ValueError is raised.

cognee/modules/chunking/TextChunker.py (1)

11-11: Good inheritance from Chunker.
Extending Chunker aligns TextChunker with the unified chunking pattern. Watch out for variable naming — the base class uses max_chunk_size while references here call it self.chunk_size.

cognee/modules/chunking/Chunker.py (1)

6-10: Constructor naming clarity.
You pass chunk_size to the constructor, then store it as self.max_chunk_size, which can be confusing for newcomers. Consider reusing the same name or clarifying the difference in a docstring.
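One way to resolve the naming mismatch flagged above is to keep a single name end to end. This is a hypothetical constructor, not the project's code; the real Chunker takes chunk_size and stores it as self.max_chunk_size.

```python
class ChunkerSketch:
    def __init__(self, max_chunk_size: int = 1024):
        # Parameter and attribute share one name, so readers never wonder
        # whether chunk_size and max_chunk_size mean different things.
        self.max_chunk_size = max_chunk_size
```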

cognee/api/v1/cognify/cognify_v2.py (1)

127-130: LGTM!

The Task instantiation is correctly updated to include the TextChunker class as the chunker parameter, maintaining consistency with the new chunking mechanism.

Consider adding type hints for better clarity.

Note that annotations cannot be written at a call site, so they cannot be added inside the Task(...) instantiation itself; the hints belong on the task function's signature instead, for example (signature abridged):

 def extract_chunks_from_documents(
     documents,
-    max_chunk_tokens,
-    chunker=TextChunker,
+    max_chunk_tokens: int,
+    chunker: type[Chunker] = TextChunker,
 ):
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fcb9298 and e364828.

📒 Files selected for processing (17)
  • cognee/api/v1/cognify/cognify_v2.py (2 hunks)
  • cognee/modules/chunking/Chunker.py (1 hunks)
  • cognee/modules/chunking/LangchainChunker.py (1 hunks)
  • cognee/modules/chunking/TextChunker.py (1 hunks)
  • cognee/modules/data/processing/document_types/AudioDocument.py (2 hunks)
  • cognee/modules/data/processing/document_types/ChunkerMapping.py (0 hunks)
  • cognee/modules/data/processing/document_types/Document.py (2 hunks)
  • cognee/modules/data/processing/document_types/ImageDocument.py (2 hunks)
  • cognee/modules/data/processing/document_types/PdfDocument.py (1 hunks)
  • cognee/modules/data/processing/document_types/TextDocument.py (2 hunks)
  • cognee/modules/data/processing/document_types/UnstructuredDocument.py (3 hunks)
  • cognee/tasks/documents/extract_chunks_from_documents.py (3 hunks)
  • cognee/tests/integration/documents/AudioDocument_test.py (2 hunks)
  • cognee/tests/integration/documents/ImageDocument_test.py (2 hunks)
  • cognee/tests/integration/documents/PdfDocument_test.py (2 hunks)
  • cognee/tests/integration/documents/TextDocument_test.py (2 hunks)
  • cognee/tests/integration/documents/UnstructuredDocument_test.py (5 hunks)
💤 Files with no reviewable changes (1)
  • cognee/modules/data/processing/document_types/ChunkerMapping.py
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: docker-compose-test
  • GitHub Check: Build Cognee Backend Docker App Image
🔇 Additional comments (28)
cognee/tests/integration/documents/ImageDocument_test.py (2)

3-3: LGTM! Import aligns with the new chunking approach.

The explicit import of TextChunker class supports the transition from string-based chunker identification to typed class references.


27-27:

❓ Verification inconclusive

LGTM! Verify consistency across other document types.

The update to use TextChunker class instead of a string identifier is a good improvement. The parameter name chunker_cls better reflects that it expects a class reference.

Let's verify that this change is consistent across other document types:


🏁 Script executed:

#!/bin/bash
# Description: Check if all document test files have been updated to use chunker_cls
# and TextChunker class instead of string-based chunker identification.

# Search for any remaining string-based chunker usage in document tests
echo "Checking for any remaining string-based chunker usage..."
rg -l "chunker=\"text_chunker\"" "cognee/tests/integration/documents/*Document_test.py"

# Search for the new pattern to confirm consistent adoption
echo "Checking for new chunker_cls pattern..."
rg -l "chunker_cls=TextChunker" "cognee/tests/integration/documents/*Document_test.py"

Length of output: 572


LGTM – Code change looks good; please verify file paths for consistency manually.

The update to use the TextChunker class via the chunker_cls parameter is a clear improvement. However, the automated check for consistency across document test files did not yield results due to file path issues (the glob pattern used was not expanded because it was quoted). Please manually verify that all document tests (e.g., ImageDocument_test.py and others under cognee/tests/integration/documents/) consistently use chunker_cls=TextChunker and that no old string-based usage (chunker="text_chunker") remains.

cognee/modules/data/processing/document_types/PdfDocument.py (3)

2-2: LGTM! Import changes align with the refactoring.

The import of the Chunker class directly supports the transition to using explicit chunker class references.


10-10: LGTM! Method signature changes improve type safety.

The parameter rename and type change provide better type safety and make the API more explicit about the expected input.


18-20:

❓ Verification inconclusive

Verify that all callers have been updated.

The direct chunker instantiation looks good and is more type-safe. However, we should ensure all callers have been updated to pass Chunker classes instead of strings.


🏁 Script executed:

#!/bin/bash
# Search for calls to PdfDocument.read to verify they pass Chunker classes
rg -A 2 "PdfDocument.*\.read\(" 

Length of output: 33


Chunker instantiation verified – please confirm caller updates manually.

The direct instantiation using chunker_cls is cleaner and more type-safe. However, our automated search for calls to PdfDocument.read did not return any results. Since this absence of output doesn’t conclusively prove that all callers are updated to pass Chunker classes instead of strings, please perform a manual review of the callers to ensure they meet this new expectation.

cognee/modules/chunking/LangchainChunker.py (2)

1-2: No major concerns with these imports.
They are standard Python imports, and usage appears straightforward and appropriate.


12-18: Well-documented class docstring.
The explanatory docstring provides good clarity on how LangchainChunker uses Langchain’s RecursiveCharacterTextSplitter.

cognee/modules/chunking/TextChunker.py (1)

5-5: Appropriate usage of the new base class.
Importing Chunker and referencing it maintains consistency across your new chunking architecture.

cognee/modules/chunking/Chunker.py (1)

12-14: Abstract method is properly highlighted.
read() raises NotImplementedError, making the contract clear that subclasses must override it.

cognee/modules/data/processing/document_types/Document.py (1)

3-3: LGTM! Type safety improvements.

The changes improve type safety by:

  1. Using explicit Chunker class instead of string identifiers
  2. Renaming parameter to chunker_cls to better reflect its purpose

Also applies to: 13-15

cognee/modules/data/processing/document_types/TextDocument.py (1)

2-2: LGTM! Streamlined chunker instantiation.

The changes improve the code by:

  1. Directly using the chunker class instead of string-based configuration
  2. Maintaining the existing generator-based text reading approach

Also applies to: 8-8, 19-21

cognee/modules/data/processing/document_types/AudioDocument.py (1)

2-2: LGTM! Consistent with other document types.

The changes maintain consistency with other document types while preserving the audio transcription functionality.

Also applies to: 14-14, 19-21

cognee/modules/data/processing/document_types/ImageDocument.py (1)

2-2: LGTM! Consistent with other document types.

The changes maintain consistency with other document types while preserving the image transcription functionality.

Also applies to: 14-14, 18-20

cognee/modules/data/processing/document_types/UnstructuredDocument.py (3)

3-3: LGTM! Import statement updated for type safety.

The import statement now correctly imports the base Chunker class, which aligns with the new type-safe approach.


12-12: LGTM! Method signature enhanced with type safety.

The parameter type change from str to Chunker improves type safety and makes the API more explicit.


31-33: LGTM! Chunker instantiation simplified.

The chunker instantiation is now more direct and type-safe, using the class reference instead of string-based instantiation.

cognee/tests/integration/documents/PdfDocument_test.py (2)

3-3: LGTM! Import statement added for TextChunker.

The import statement correctly adds the TextChunker class needed for the updated test.


28-28: LGTM! Test updated to use class-based chunker.

The test now correctly uses the TextChunker class instead of a string identifier, maintaining test coverage while improving type safety.

cognee/tests/integration/documents/TextDocument_test.py (2)

5-5: LGTM! Import statement added for TextChunker.

The import statement correctly adds the TextChunker class needed for the updated test.


41-41: LGTM! Test updated to use class-based chunker.

The test now correctly uses the TextChunker class instead of a string identifier, maintaining test coverage while improving type safety.

cognee/tasks/documents/extract_chunks_from_documents.py (3)

8-9: LGTM! Import statements added for chunker classes.

The import statements correctly add both the TextChunker implementation and the base Chunker class.


31-31: LGTM! Function signature enhanced with type safety.

The parameter type change from string to Chunker with a default value of TextChunker improves type safety while maintaining backward compatibility.


43-43: LGTM! Document read call updated.

The document.read call now correctly uses the chunker_cls parameter name, maintaining consistency with the new API.

cognee/tests/integration/documents/AudioDocument_test.py (2)

3-3: LGTM!

The import statement correctly imports the TextChunker class that replaces the string-based chunker identifier.


38-38: LGTM!

The change from string-based chunker to class reference improves type safety and makes the code more maintainable.

cognee/tests/integration/documents/UnstructuredDocument_test.py (2)

3-3: LGTM!

The import statement correctly imports the TextChunker class that replaces the string-based chunker identifier.


72-72: LGTM!

The changes consistently apply the new chunker class reference pattern across all document types (PPTX, DOCX, CSV, XLSX), improving type safety and maintainability.

Also applies to: 82-82, 92-92, 104-104

cognee/api/v1/cognify/cognify_v2.py (1)

28-28: LGTM!

The import statement correctly imports the TextChunker class that replaces the string-based chunker identifier.

@alekszievr alekszievr self-assigned this Feb 18, 2025
@borisarzentar borisarzentar changed the title Feat: use external chunker [cog-1354] feat: use external chunker [cog-1354] Feb 19, 2025
Base automatically changed from feat/cog-1354-externalize-chunker to dev February 19, 2025 12:26
@alekszievr alekszievr marked this pull request as draft February 19, 2025 15:54
@alekszievr alekszievr marked this pull request as ready for review February 19, 2025 20:18
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
cognee/tests/integration/documents/TextDocument_test.py (1)

20-23: Consider adding more test cases.

The current test cases cover basic functionality with two different chunk sizes. Consider adding test cases for:

  • Edge cases with very small or large chunk sizes
  • Different types of text content (e.g., with varying paragraph structures)
  • Boundary conditions for max_chunk_tokens
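The edge cases above can be sketched against a pure-Python stand-in splitter; this is not the project's TextChunker, and the boundary values are illustrative.

```python
def chunk_words(text: str, chunk_size: int) -> list[str]:
    """Stand-in word-based splitter used only to illustrate the edge cases."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# Very small chunk size: one word per chunk.
assert chunk_words("alpha beta gamma", 1) == ["alpha", "beta", "gamma"]
# Chunk size larger than the input: a single chunk.
assert chunk_words("alpha beta gamma", 100) == ["alpha beta gamma"]
# Empty input: no chunks at all.
assert chunk_words("", 10) == []
```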
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 3d16334 and 0448189.

📒 Files selected for processing (3)
  • cognee/modules/chunking/LangchainChunker.py (1 hunks)
  • cognee/modules/data/processing/document_types/TextDocument.py (1 hunks)
  • cognee/tests/integration/documents/TextDocument_test.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • cognee/modules/chunking/LangchainChunker.py
  • cognee/modules/data/processing/document_types/TextDocument.py
🔇 Additional comments (2)
cognee/tests/integration/documents/TextDocument_test.py (2)

41-41: LGTM! Test correctly uses the new chunking mechanism.

The test has been properly updated to use the TextChunker class, aligning with the architectural changes mentioned in the PR summary.


10-11:

✅ Verification successful

Verify updated ground truth values.

The ground truth values for "code.txt" have been significantly modified, with changes in both the word counts and text lengths. Additionally, the cut type has changed from "sentence_cut" to "paragraph_end", suggesting a fundamental change in the chunking behavior.

Let's verify these values by examining the test data file:


🏁 Script executed:

#!/bin/bash
# Description: Verify the content and chunking of the test data file

# Find and display the content of code.txt
fd -t f "code.txt" -x cat {}

Length of output: 2728


Ground truth values verified and aligned with the updated chunking approach.

After reviewing the content of code.txt and correlating it with the changes in the integration test (lines 10–11 of cognee/tests/integration/documents/TextDocument_test.py), we can confirm that:

  • The updated word counts (252 and 56) and text lengths (1376 and 481) match what is expected from the new paragraph-based splitting.
  • The change in cut_type from "sentence_cut" to "paragraph_end" is consistent with the revised chunking mechanism.
  • The integration test now accurately reflects the intended behavior of the TextDocument class.

No further changes are required.

    super().__init__(document, get_text, max_chunk_tokens, chunk_size)

    self.splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
Contributor


chunk_size in RecursiveCharacterTextSplitter refers to the number of characters in the chunk, but we refer to it as number of words. We need to unify this somehow, and moving to tokens makes sense. The only problem there is that counting tokens can be slow. And in this case it will be hard to assume the number of characters needed for 1024 tokens.

Contributor Author


now I'm passing word count as length function in RecursiveCharacterTextSplitter
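The fix described here keeps chunk_size measured in words by giving the splitter a custom length function; RecursiveCharacterTextSplitter accepts a length_function parameter for exactly this. The word-count helper is the whole idea; the splitter call is shown commented out so this sketch runs without the Langchain dependency, and the chunk sizes are illustrative.

```python
def word_count(text: str) -> int:
    # Length measured in words, not characters.
    return len(text.split())


assert word_count("splitting is measured in words now") == 6

# from langchain_text_splitters import RecursiveCharacterTextSplitter
# splitter = RecursiveCharacterTextSplitter(
#     chunk_size=128,          # now interpreted as 128 *words*
#     chunk_overlap=10,
#     length_function=word_count,
# )
```

This sidesteps the slow per-chunk tokenization the reviewer worried about, at the cost of word count being only an approximation of token count.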

@alekszievr alekszievr merged commit a61df96 into dev Feb 21, 2025
34 of 35 checks passed
@alekszievr alekszievr deleted the feat/cog-1354-use-external-chunker branch February 21, 2025 13:11