
Feat/cog 1364 golden contexts #579

Merged — 8 commits merged into dev from feat/COG-1364-golden-contexts on Feb 27, 2025

Conversation

@lxobr (Collaborator) commented Feb 25, 2025

Description

• Added a load_golden_context parameter to BaseBenchmarkAdapter's abstract load_corpus method, establishing a common interface for retrieving supporting evidence (a minimal sketch of the resulting interface follows this list)
• Refactored HotpotQAAdapter with a modular design: introduced a _get_metadata_field_name method to handle dataset-specific fields (making it extensible for child classes) and implemented golden-context retrieval
• Refactored TwoWikiMultihopAdapter to inherit from HotpotQAAdapter, overriding only the necessary methods while reusing the parent's functionality
• Added golden context support to MusiqueQAAdapter, handling its decomposition-based format
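
As a rough illustration — not the repository's exact code — the common interface described above might look like the following minimal sketch. The parameter and return annotations follow the review notes further down; everything else is an assumption.

from abc import ABC, abstractmethod
from typing import Any, List, Tuple


class BaseBenchmarkAdapter(ABC):
    """Common interface for benchmark dataset adapters (illustrative sketch)."""

    @abstractmethod
    def load_corpus(
        self, load_golden_context: bool = False
    ) -> Tuple[List[str], List[dict[str, Any]]]:
        """Return (corpus_documents, qa_pairs).

        load_golden_context: when True, attach the golden (supporting
        evidence) context to each QA pair.
        """
        ...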

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin

Summary by CodeRabbit

  • New Features
    • Introduced an option to include additional context during corpus loading, enhancing the quality and flexibility of generated QA pairs.
  • Refactor
    • Streamlined and modularized the processing workflow across different adapters for improved consistency and maintainability.
    • Updated metadata extraction to refine the display of contextual information.
    • Shifted focus in the TwoWikiMultihopAdapter from corpus loading to context extraction.

@lxobr lxobr self-assigned this Feb 25, 2025
@coderabbitai bot (Contributor) commented Feb 25, 2025

Walkthrough

This update enhances the corpus-loading functionality across several benchmark adapters. The load_corpus method in multiple adapters now accepts an additional boolean parameter, load_golden_context, allowing conditional inclusion of a golden context. New helper methods have been introduced to modularize context extraction and item processing logic. In one adapter, inheritance has been updated for structural consistency, and minor dataset metadata changes are applied.

Changes

File(s) — Change Summary

  • evals/.../base_benchmark_adapter.py — Updated the load_corpus signature to add the load_golden_context: bool = False parameter.
  • evals/.../hotpot_qa_adapter.py — Added _is_valid_supporting_fact, _get_golden_context, and _process_item methods; updated the load_corpus signature and return type to include load_golden_context.
  • evals/.../musique_adapter.py — Introduced _get_golden_context and _process_item methods; modified load_corpus to support load_golden_context while retaining auto_download.
  • evals/.../twowikimultihop_adapter.py — Changed inheritance from BaseBenchmarkAdapter to HotpotQAAdapter; removed load_corpus; updated the dataset_info key and added a _get_golden_context method.
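
For illustration, a hypothetical caller-side view of the new parameter (the adapter name and the golden_context key follow the comments below; treat this as a sketch, not the project's documented API):

adapter = HotpotQAAdapter()
corpus, qa_pairs = adapter.load_corpus(load_golden_context=True)
# Each QA pair may now carry the supporting evidence under "golden_context".
print(qa_pairs[0].get("golden_context"))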

Suggested Reviewers

  • hajdul88

Poem

Hoppin’ through the code I go,
Adjusting methods, nice and slow.
Golden context now in play,
Guiding data on its way.
In this realm of logic, I cheer—
A bunny's joy, crystal clear!
🐇✨ Happy coding all around!

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (7)
evals/eval_framework/benchmark_adapters/base_benchmark_adapter.py (1)

7-9: Add docstring for the new parameter.

The newly introduced parameter load_golden_context lacks documentation in this abstract method. Consider adding a brief docstring explaining how and why it should be used, so that all subclasses override it consistently.
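
One possible docstring, sketched under the assumption that the abstract method takes only the new parameter:

@abstractmethod
def load_corpus(self, load_golden_context: bool = False):
    """Load corpus documents and QA pairs.

    Args:
        load_golden_context: If True, include the golden (supporting
            evidence) context with each QA pair so evaluations can
            compare retrieved context against ground truth.
    """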

evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (2)

15-16: Consider a more descriptive metadata field name.

Currently returning "type" may be confused with Python’s built-in type. If feasible, a more domain-specific name (e.g., "question_type") can enhance code clarity.


19-28: Ensure readability and handle missing evidences gracefully.

  1. Adding evidence lines directly after the super call might jumble the content. Consider adding a second blank line or a delimiter between the inherited context and the evidence list for better readability.
  2. The current check if "evidences" in item: is straightforward; consider also validating that item["evidences"] is a list if there's any risk of malformed data.
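
A sketch combining both suggestions, assuming evidences hold fact triplets as described elsewhere in this review:

def _get_golden_context(self, item: dict[str, Any]) -> str:
    golden_context = super()._get_golden_context(item)
    evidences = item.get("evidences")
    if isinstance(evidences, list):  # guard against malformed data
        evidence_lines = [" ".join(map(str, triplet)) for triplet in evidences]
        # Blank-line delimiter separates the inherited context from the evidence facts.
        golden_context += "\n\n" + "\n".join(evidence_lines)
    return golden_context
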
evals/eval_framework/benchmark_adapters/musique_adapter.py (3)

21-39: Apply consistent formatting for answers.

Here, step answers are returned as-is, while the final QA pair answer is lowercased. If you intend to standardize all answers (including decomposed steps) to lowercase, consider applying .lower() conditionally to step answers for consistency.
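
A minimal sketch of normalizing step answers the same way (the question_decomposition field name is an assumption here):

# Hypothetical: lowercase decomposed step answers to match the final answer.
for step in item.get("question_decomposition", []):
    step["answer"] = step["answer"].lower()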


41-67: Validate paragraph content before appending.

The refactored logic is clear. However, if there's any chance of incomplete data, you might want to confirm that paragraph["paragraph_text"] exists or handle potential KeyError to prevent runtime exceptions.
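
For example, a defensive lookup might read as follows (field name as used in this review; the surrounding variables are assumptions):

# Skip paragraphs that are missing the expected text field.
text = paragraph.get("paragraph_text")
if text:
    golden_paragraphs.append(text)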


72-75: Document the new parameter.

You introduced load_golden_context in load_corpus, but the docstring does not describe it. Provide a short explanation to clarify its purpose for maintainers and external contributors.

evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (1)

21-35: Well-structured golden context extraction.

The implementation cleanly maps context titles to sentences, filters valid supporting facts, and formats them consistently. Consider adding logging for invalid supporting facts to aid in debugging.

def _get_golden_context(self, item: dict[str, Any]) -> str:
    """Extracts and formats the golden context from supporting facts."""
    # Create a mapping of title to sentences for easy lookup
    context_dict = {title: sentences for (title, sentences) in item["context"]}

    # Get all supporting facts in order
    golden_contexts = []
    for title, sent_idx in item["supporting_facts"]:
        sentences = context_dict.get(title, [])
        if not self._is_valid_supporting_fact(sentences, sent_idx):
+            # Consider logging this issue for debugging
+            # import logging
+            # logging.warning(f"Invalid supporting fact: {title}, index {sent_idx}")
            continue
        golden_contexts.append(f"{title}: {sentences[sent_idx]}")

    return "\n".join(golden_contexts)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1cb8331 and 32d5829.

📒 Files selected for processing (4)
  • evals/eval_framework/benchmark_adapters/base_benchmark_adapter.py (1 hunks)
  • evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (3 hunks)
  • evals/eval_framework/benchmark_adapters/musique_adapter.py (3 hunks)
  • evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (30)
  • GitHub Check: Test on macos-15
  • GitHub Check: run_networkx_metrics_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_dynamic_steps_example_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-13
  • GitHub Check: Test on macos-13
  • GitHub Check: lint (ubuntu-latest, 3.11.x)
  • GitHub Check: lint (ubuntu-latest, 3.10.x)
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_eval_framework_test / test
  • GitHub Check: run_multimedia_example_test / test
  • GitHub Check: Test on macos-13
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: run_simple_example_test / test
  • GitHub Check: windows-latest
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Build Cognee Backend Docker App Image
  • GitHub Check: docker-compose-test
🔇 Additional comments (12)
evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (3)

5-6: Looks good.

The type import and usage look correct here.


9-9: Verify override behavior.

By inheriting from HotpotQAAdapter instead of BaseBenchmarkAdapter, this class now relies on the parent’s logic for corpus loading. Ensure all new or overridden methods (e.g., _get_metadata_field_name, _get_golden_context) are adequately tested to confirm they're invoked correctly.


12-12: Consistent key naming is good.

Changing the key name to lowercase "url" aligns with typical JSON styling and is more conventional.

evals/eval_framework/benchmark_adapters/musique_adapter.py (2)

4-4: No issue with updated imports.


103-103: Ensure environment readiness.

This code depends on gdown being installed and on a functioning internet connection. If these conditions are unfulfilled, it will fail. Consider adding a more explicit error message or fallback mechanism for offline users.
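
One way to fail fast with a clearer message (a sketch; the adapter's actual download path may differ):

try:
    import gdown
except ImportError as err:
    raise RuntimeError(
        "MusiqueQAAdapter auto-download requires the 'gdown' package and "
        "network access; install gdown or provide a local dataset copy."
    ) from err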

evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (7)

5-5: Improved type annotations for better code quality.

The expanded import of typing classes enhances code clarity and type safety throughout the adapter.


17-19: Great validation helper method.

This utility method properly validates supporting fact indices against provided sentences, with good checks for non-empty sentences, integer type, and bounds checking.
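
A sketch consistent with that description (the exact signature in the PR may differ):

def _is_valid_supporting_fact(self, sentences: List[str], sent_idx: Any) -> bool:
    """True when sent_idx is an in-range integer index into non-empty sentences."""
    return bool(sentences) and isinstance(sent_idx, int) and 0 <= sent_idx < len(sentences)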


36-39: Good design for extensibility.

Extracting the metadata field name into a separate method allows child classes to easily override this behavior without duplicating other code.


40-61: Excellent modularization of item processing.

This method encapsulates the logic for processing corpus items, making the code more maintainable and easier to understand. The conditional addition of golden context is well-implemented.


62-64: Appropriate method signature update.

The addition of the load_golden_context parameter with a sensible default value maintains backward compatibility while enabling new functionality.


65-65: Clear and concise documentation.

The updated docstring clearly explains the purpose of the method, including the new golden context functionality.


87-87: Clean implementation of process delegation.

Replacing inline processing logic with a call to the dedicated method improves code readability and maintainability.

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (1)

25-38: Well-implemented golden context extraction

The extraction of golden contexts from supporting facts is well-structured with clear error handling for invalid indices. The use of a mapping for title lookup is efficient.

Consider adding a fallback message when no valid supporting facts are found to make the return value more explicit.

 def _get_golden_context(self, item: dict[str, Any]) -> str:
     """Extracts and formats the golden context from supporting facts."""
     # Create a mapping of title to sentences for easy lookup
     context_dict = {title: sentences for (title, sentences) in item["context"]}

     # Get all supporting facts in order
     golden_contexts = []
     for title, sentence_idx in item["supporting_facts"]:
         sentences = context_dict.get(title, [])
         if not self._is_valid_supporting_fact(sentences, sentence_idx):
             continue
         golden_contexts.append(f"{title}: {sentences[sentence_idx]}")

-    return "\n".join(golden_contexts)
+    if golden_contexts:
+        return "\n".join(golden_contexts)
+    return "No valid supporting facts found."
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32d5829 and 0ece58a.

📒 Files selected for processing (2)
  • evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (3 hunks)
  • evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (29)
  • GitHub Check: Test on macos-15
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_simple_example_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: Test on macos-13
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_eval_framework_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: Test on macos-13
  • GitHub Check: run_dynamic_steps_example_test / test
  • GitHub Check: Test on macos-13
  • GitHub Check: run_multimedia_example_test / test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: windows-latest
  • GitHub Check: run_networkx_metrics_test / test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: lint (ubuntu-latest, 3.10.x)
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Build Cognee Backend Docker App Image
  • GitHub Check: docker-compose-test
🔇 Additional comments (12)
evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (5)

5-6: Good enhancement to type hints

The updated imports with proper type annotations enhance code readability and enable better static analysis.


9-9: Improved inheritance structure

Changing the base class from BaseBenchmarkAdapter to HotpotQAAdapter leverages the common implementation already in the parent class, reducing duplication and matching the PR objective of improved modularity.


12-12: Updated field name for consistency

Changing "URL" to "url" improves naming consistency with the other adapters and typical JSON styling.


15-17: Good initialization with metadata field override

The new constructor properly calls the parent constructor and overrides metadata_field_name to match the field structure of the TwoWikiMultihop dataset, ensuring dataset-specific behavior.


19-28: Effective extension of the golden context extraction

The override of _get_golden_context demonstrates good OOP practice:

  1. It reuses the parent implementation via super()
  2. It extends functionality by appending evidence fact triplets when available
  3. It maintains a consistent format with the parent class

evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (7)

5-5: Good addition of type annotations

The expanded typing imports (Optional, Any, List, Tuple) match the enhanced annotations in the method signatures and improve IDE support.


17-19: Good class initialization

Initializing metadata_field_name in __init__ makes the field name configurable and gives child classes a clean extension point.


21-23: Well-implemented validation method

This helper properly validates supporting-fact indices with appropriate type checking and range validation.


25-38: Great extraction method for golden context

The _get_golden_context method is well structured and handles edge cases by:

  1. Creating a title-to-sentences mapping for easy lookup
  2. Validating indices before accessing data
  3. Properly formatting the extracted contexts

This implementation is robust and prevents potential index errors.


40-61: Good modularization of item processing

Encapsulating the per-item logic in _process_item improves organization and maintainability. The conditional inclusion of golden_context based on the load_golden_context parameter follows good separation of concerns.


62-64: Good method signature update

Adding load_golden_context with a default of False maintains backward compatibility, and the Tuple[List[str], List[dict[str, Any]]] return annotation improves type clarity.


86-87: Clean implementation of process delegation

Replacing the inline item-processing loop with a call to _process_item simplifies load_corpus and improves readability.
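
Putting the pieces together, a hypothetical reading of _process_item based solely on the observations above (field handling and naming are assumptions, not the repository's exact code; typing imports elided as in the other snippets):

def _process_item(
    self, item: dict[str, Any], load_golden_context: bool
) -> Tuple[List[str], dict[str, Any]]:
    # Corpus documents: one joined passage per context entry.
    documents = ["\n".join(sentences) for _, sentences in item["context"]]
    qa_pair = {
        "question": item["question"],
        "answer": item["answer"].lower(),
        "type": item.get(self.metadata_field_name),
    }
    # Golden context is attached only when explicitly requested.
    if load_golden_context:
        qa_pair["golden_context"] = self._get_golden_context(item)
    return documents, qa_pair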

@lxobr lxobr requested a review from borisarzentar February 27, 2025 10:14
@borisarzentar borisarzentar merged commit 4b7c21d into dev Feb 27, 2025
36 checks passed
@borisarzentar borisarzentar deleted the feat/COG-1364-golden-contexts branch February 27, 2025 12:25