
Feat/cog 1364 golden contexts #579

Merged — 8 commits merged into dev from feat/COG-1364-golden-contexts on Feb 27, 2025

Conversation

@lxobr (Collaborator) commented Feb 25, 2025

Description

• Added a load_golden_context parameter to BaseBenchmarkAdapter's abstract load_corpus method, establishing a common interface for retrieving supporting evidence (a minimal sketch of the resulting interface follows this list)
• Refactored HotpotQAAdapter with a modular design: introduced a _get_metadata_field_name method to handle dataset-specific fields (making it extensible for child classes) and implemented golden-context retrieval
• Refactored TwoWikiMultihopAdapter to inherit from HotpotQAAdapter, overriding only the necessary methods while reusing the parent's functionality
• Added golden context support to MusiqueQAAdapter, handling its decomposition-based format
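
As a rough illustration — not the repository's exact code — the common interface described above might look like the following minimal sketch. The parameter and return annotations follow the review notes further down; everything else is an assumption.

from abc import ABC, abstractmethod
from typing import Any, List, Tuple


class BaseBenchmarkAdapter(ABC):
    """Common interface for benchmark dataset adapters (illustrative sketch)."""

    @abstractmethod
    def load_corpus(
        self, load_golden_context: bool = False
    ) -> Tuple[List[str], List[dict[str, Any]]]:
        """Return (corpus_documents, qa_pairs).

        load_golden_context: when True, attach the golden (supporting
        evidence) context to each QA pair.
        """
        ...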

DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin

Summary by CodeRabbit

  • New Features
    • Introduced an option to include additional context during corpus loading, enhancing the quality and flexibility of generated QA pairs.
  • Refactor
    • Streamlined and modularized the processing workflow across different adapters for improved consistency and maintainability.
    • Updated metadata extraction to refine the display of contextual information.
    • Shifted focus in the TwoWikiMultihopAdapter from corpus loading to context extraction.

@lxobr lxobr self-assigned this Feb 25, 2025
@coderabbitai bot (Contributor) commented Feb 25, 2025

Walkthrough

This update enhances the corpus-loading functionality across several benchmark adapters. The load_corpus method in multiple adapters now accepts an additional boolean parameter, load_golden_context, allowing conditional inclusion of a golden context. New helper methods have been introduced to modularize context extraction and item processing logic. In one adapter, inheritance has been updated for structural consistency, and minor dataset metadata changes are applied.

Changes

File(s) — Change Summary

  • evals/.../base_benchmark_adapter.py — Updated the load_corpus signature to add the load_golden_context: bool = False parameter.
  • evals/.../hotpot_qa_adapter.py — Added _is_valid_supporting_fact, _get_golden_context, and _process_item methods; updated the load_corpus signature and return type to include load_golden_context.
  • evals/.../musique_adapter.py — Introduced _get_golden_context and _process_item methods; modified load_corpus to support load_golden_context while retaining auto_download.
  • evals/.../twowikimultihop_adapter.py — Changed inheritance from BaseBenchmarkAdapter to HotpotQAAdapter; removed load_corpus; updated the dataset_info key and added a _get_golden_context method.
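
For illustration, a hypothetical caller-side view of the new parameter (the adapter name and the golden_context key follow the comments below; treat this as a sketch, not the project's documented API):

adapter = HotpotQAAdapter()
corpus, qa_pairs = adapter.load_corpus(load_golden_context=True)
# Each QA pair may now carry the supporting evidence under "golden_context".
print(qa_pairs[0].get("golden_context"))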

Suggested Reviewers

  • hajdul88

Poem

Hoppin’ through the code I go,
Adjusting methods, nice and slow.
Golden context now in play,
Guiding data on its way.
In this realm of logic, I cheer—
A bunny's joy, crystal clear!
🐇✨ Happy coding all around!

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (7)
evals/eval_framework/benchmark_adapters/base_benchmark_adapter.py (1)

7-9: Add docstring for the new parameter.

The newly introduced parameter load_golden_context lacks documentation in this abstract method. Consider adding a brief docstring explaining how and why it should be used, so that all subclasses override it consistently.
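
One possible docstring, sketched under the assumption that the abstract method takes only the new parameter:

@abstractmethod
def load_corpus(self, load_golden_context: bool = False):
    """Load corpus documents and QA pairs.

    Args:
        load_golden_context: If True, include the golden (supporting
            evidence) context with each QA pair so evaluations can
            compare retrieved context against ground truth.
    """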

evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (2)

15-16: Consider a more descriptive metadata field name.

Currently returning "type" may be confused with Python’s built-in type. If feasible, a more domain-specific name (e.g., "question_type") can enhance code clarity.


19-28: Ensure readability and handle missing evidences gracefully.

  1. Adding evidence lines directly after the super call might jumble the content. Consider adding a second blank line or a delimiter between the inherited context and the evidence list for better readability.
  2. The current check if "evidences" in item: is straightforward; consider also validating that item["evidences"] is a list if there's any risk of malformed data.
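
A sketch combining both suggestions, assuming evidences hold fact triplets as described elsewhere in this review:

def _get_golden_context(self, item: dict[str, Any]) -> str:
    golden_context = super()._get_golden_context(item)
    evidences = item.get("evidences")
    if isinstance(evidences, list):  # guard against malformed data
        evidence_lines = [" ".join(map(str, triplet)) for triplet in evidences]
        # Blank-line delimiter separates the inherited context from the evidence facts.
        golden_context += "\n\n" + "\n".join(evidence_lines)
    return golden_context
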
evals/eval_framework/benchmark_adapters/musique_adapter.py (3)

21-39: Apply consistent formatting for answers.

Here, step answers are returned as-is, while the final QA pair answer is lowercased. If you intend to standardize all answers (including decomposed steps) to lowercase, consider applying .lower() conditionally to step answers for consistency.
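
A minimal sketch of normalizing step answers the same way (the question_decomposition field name is an assumption here):

# Hypothetical: lowercase decomposed step answers to match the final answer.
for step in item.get("question_decomposition", []):
    step["answer"] = step["answer"].lower()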


41-67: Validate paragraph content before appending.

The refactored logic is clear. However, if there's any chance of incomplete data, you might want to confirm that paragraph["paragraph_text"] exists or handle potential KeyError to prevent runtime exceptions.
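
For example, a defensive lookup might read as follows (field name as used in this review; the surrounding variables are assumptions):

# Skip paragraphs that are missing the expected text field.
text = paragraph.get("paragraph_text")
if text:
    golden_paragraphs.append(text)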


72-75: Document the new parameter.

You introduced load_golden_context in load_corpus, but the docstring does not describe it. Provide a short explanation to clarify its purpose for maintainers and external contributors.

evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (1)

21-35: Well-structured golden context extraction.

The implementation cleanly maps context titles to sentences, filters valid supporting facts, and formats them consistently. Consider adding logging for invalid supporting facts to aid in debugging.

def _get_golden_context(self, item: dict[str, Any]) -> str:
    """Extracts and formats the golden context from supporting facts."""
    # Create a mapping of title to sentences for easy lookup
    context_dict = {title: sentences for (title, sentences) in item["context"]}

    # Get all supporting facts in order
    golden_contexts = []
    for title, sent_idx in item["supporting_facts"]:
        sentences = context_dict.get(title, [])
        if not self._is_valid_supporting_fact(sentences, sent_idx):
+            # Consider logging this issue for debugging
+            # import logging
+            # logging.warning(f"Invalid supporting fact: {title}, index {sent_idx}")
            continue
        golden_contexts.append(f"{title}: {sentences[sent_idx]}")

    return "\n".join(golden_contexts)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1cb8331 and 32d5829.

📒 Files selected for processing (4)
  • evals/eval_framework/benchmark_adapters/base_benchmark_adapter.py (1 hunks)
  • evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (3 hunks)
  • evals/eval_framework/benchmark_adapters/musique_adapter.py (3 hunks)
  • evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (30)
  • GitHub Check: Test on macos-15
  • GitHub Check: run_networkx_metrics_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_dynamic_steps_example_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-13
  • GitHub Check: Test on macos-13
  • GitHub Check: lint (ubuntu-latest, 3.11.x)
  • GitHub Check: lint (ubuntu-latest, 3.10.x)
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_eval_framework_test / test
  • GitHub Check: run_multimedia_example_test / test
  • GitHub Check: Test on macos-13
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: run_simple_example_test / test
  • GitHub Check: windows-latest
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Build Cognee Backend Docker App Image
  • GitHub Check: docker-compose-test
🔇 Additional comments (12)
evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (3)

5-6: Looks good.

The type import and usage look correct here.


9-9: Verify override behavior.

By inheriting from HotpotQAAdapter instead of BaseBenchmarkAdapter, this class now relies on the parent’s logic for corpus loading. Ensure all new or overridden methods (e.g., _get_metadata_field_name, _get_golden_context) are adequately tested to confirm they're invoked correctly.


12-12: Consistent key naming is good.

Changing the key name to lowercase "url" aligns with typical JSON styling and is more conventional.

evals/eval_framework/benchmark_adapters/musique_adapter.py (2)

4-4: No issue with updated imports.


103-103: Ensure environment readiness.

This code depends on gdown being installed and on a functioning internet connection. If these conditions are unfulfilled, it will fail. Consider adding a more explicit error message or fallback mechanism for offline users.
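
One way to fail fast with a clearer message (a sketch; the adapter's actual download path may differ):

try:
    import gdown
except ImportError as err:
    raise RuntimeError(
        "MusiqueQAAdapter auto-download requires the 'gdown' package and "
        "network access; install gdown or provide a local dataset copy."
    ) from err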

evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (7)

5-5: Improved type annotations for better code quality.

The expanded import of typing classes enhances code clarity and type safety throughout the adapter.


17-19: Great validation helper method.

This utility method properly validates supporting fact indices against provided sentences, with good checks for non-empty sentences, integer type, and bounds checking.
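
A sketch consistent with that description (the exact signature in the PR may differ):

def _is_valid_supporting_fact(self, sentences: List[str], sent_idx: Any) -> bool:
    """True when sent_idx is an in-range integer index into non-empty sentences."""
    return bool(sentences) and isinstance(sent_idx, int) and 0 <= sent_idx < len(sentences)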


36-39: Good design for extensibility.

Extracting the metadata field name into a separate method allows child classes to easily override this behavior without duplicating other code.


40-61: Excellent modularization of item processing.

This method encapsulates the logic for processing corpus items, making the code more maintainable and easier to understand. The conditional addition of golden context is well-implemented.


62-64: Appropriate method signature update.

The addition of the load_golden_context parameter with a sensible default value maintains backward compatibility while enabling new functionality.


65-65: Clear and concise documentation.

The updated docstring clearly explains the purpose of the method, including the new golden context functionality.


87-87: Clean implementation of process delegation.

Replacing inline processing logic with a call to the dedicated method improves code readability and maintainability.

@coderabbitai bot (Contributor) left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (1)

25-38: Well-implemented golden context extraction

The extraction of golden contexts from supporting facts is well-structured with clear error handling for invalid indices. The use of a mapping for title lookup is efficient.

Consider adding a fallback message when no valid supporting facts are found to make the return value more explicit.

 def _get_golden_context(self, item: dict[str, Any]) -> str:
     """Extracts and formats the golden context from supporting facts."""
     # Create a mapping of title to sentences for easy lookup
     context_dict = {title: sentences for (title, sentences) in item["context"]}

     # Get all supporting facts in order
     golden_contexts = []
     for title, sentence_idx in item["supporting_facts"]:
         sentences = context_dict.get(title, [])
         if not self._is_valid_supporting_fact(sentences, sentence_idx):
             continue
         golden_contexts.append(f"{title}: {sentences[sentence_idx]}")

-    return "\n".join(golden_contexts)
+    if golden_contexts:
+        return "\n".join(golden_contexts)
+    return "No valid supporting facts found."
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 32d5829 and 0ece58a.

📒 Files selected for processing (2)
  • evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (3 hunks)
  • evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (29)
  • GitHub Check: Test on macos-15
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_simple_example_test / test
  • GitHub Check: run_notebook_test / test
  • GitHub Check: Test on macos-13
  • GitHub Check: run_notebook_test / test
  • GitHub Check: run_eval_framework_test / test
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on macos-15
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: Test on macos-13
  • GitHub Check: run_dynamic_steps_example_test / test
  • GitHub Check: Test on macos-13
  • GitHub Check: run_multimedia_example_test / test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: windows-latest
  • GitHub Check: run_networkx_metrics_test / test
  • GitHub Check: Test on ubuntu-22.04
  • GitHub Check: test
  • GitHub Check: lint (ubuntu-latest, 3.10.x)
  • GitHub Check: test
  • GitHub Check: test
  • GitHub Check: Build Cognee Backend Docker App Image
  • GitHub Check: docker-compose-test
🔇 Additional comments (12)
evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (5)

5-6: Good enhancement to type hints

The updated imports with proper type annotations enhance code readability and enable better static analysis.


9-9: Improved inheritance structure

Changing the base class from BaseBenchmarkAdapter to HotpotQAAdapter leverages the common implementation already in the parent class, reducing duplication and matching the PR objective of improved modularity.


12-12: Updated field name for consistency

Changing "URL" to "url" improves naming consistency with the other adapters and typical JSON styling.


15-17: Good initialization with metadata field override

The new constructor properly calls the parent constructor and overrides metadata_field_name to match the field structure of the TwoWikiMultihop dataset, ensuring dataset-specific behavior.


19-28: Effective extension of the golden context extraction

The override of _get_golden_context demonstrates good OOP practice:

  1. It reuses the parent implementation via super()
  2. It extends functionality by appending evidence fact triplets when available
  3. It maintains a consistent format with the parent class

evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (7)

5-5: Good addition of type annotations

The expanded typing imports (Optional, Any, List, Tuple) match the enhanced annotations in the method signatures and improve IDE support.


17-19: Good class initialization

Initializing metadata_field_name in __init__ makes the field name configurable and gives child classes a clean extension point.


21-23: Well-implemented validation method

This helper properly validates supporting-fact indices with appropriate type checking and range validation.


25-38: Great extraction method for golden context

The _get_golden_context method is well structured and handles edge cases by:

  1. Creating a title-to-sentences mapping for easy lookup
  2. Validating indices before accessing data
  3. Properly formatting the extracted contexts

This implementation is robust and prevents potential index errors.


40-61: Good modularization of item processing

Encapsulating the per-item logic in _process_item improves organization and maintainability. The conditional inclusion of golden_context based on the load_golden_context parameter follows good separation of concerns.


62-64: Good method signature update

Adding load_golden_context with a default of False maintains backward compatibility, and the Tuple[List[str], List[dict[str, Any]]] return annotation improves type clarity.


86-87: Clean implementation of process delegation

Replacing the inline item-processing loop with a call to _process_item simplifies load_corpus and improves readability.
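
Putting the pieces together, a hypothetical reading of _process_item based solely on the observations above (field handling and naming are assumptions, not the repository's exact code; typing imports elided as in the other snippets):

def _process_item(
    self, item: dict[str, Any], load_golden_context: bool
) -> Tuple[List[str], dict[str, Any]]:
    # Corpus documents: one joined passage per context entry.
    documents = ["\n".join(sentences) for _, sentences in item["context"]]
    qa_pair = {
        "question": item["question"],
        "answer": item["answer"].lower(),
        "type": item.get(self.metadata_field_name),
    }
    # Golden context is attached only when explicitly requested.
    if load_golden_context:
        qa_pair["golden_context"] = self._get_golden_context(item)
    return documents, qa_pair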

@lxobr lxobr requested a review from borisarzentar February 27, 2025 10:14
@borisarzentar borisarzentar merged commit 4b7c21d into dev Feb 27, 2025
36 checks passed
@borisarzentar borisarzentar deleted the feat/COG-1364-golden-contexts branch February 27, 2025 12:25