-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat/cog 1364 golden contexts #579
Conversation
WalkthroughThis update enhances the corpus-loading functionality across several benchmark adapters. The Changes
Possibly Related PRs
Suggested Reviewers
Poem
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (7)
evals/eval_framework/benchmark_adapters/base_benchmark_adapter.py (1)
7-9
: Add docstring for the new parameter.The newly introduced parameter
load_golden_context
lacks documentation in this abstract method. Consider adding a brief docstring explaining how and why it should be used, so that all subclasses override it consistently.evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (2)
15-16
: Consider a more descriptive metadata field name.Currently returning
"type"
may be confused with Python’s built-intype
. If feasible, a more domain-specific name (e.g.,"question_type"
) can enhance code clarity.
19-28
: Ensure readability and handle missing evidences gracefully.
- Adding evidence lines directly after the super call might jumble the content. Consider adding a second blank line or a delimiter between the inherited context and the evidence list for better readability.
- The current check
if "evidences" in item:
is straightforward; consider also validating thatitem["evidences"]
is a list if there's any risk of malformed data.evals/eval_framework/benchmark_adapters/musique_adapter.py (3)
21-39
: Apply consistent formatting for answers.Here, step answers are returned as-is, while the final QA pair answer is lowercased. If you intend to standardize all answers (including decomposed steps) to lowercase, consider applying
.lower()
conditionally to step answers for consistency.
41-67
: Validate paragraph content before appending.The refactored logic is clear. However, if there's any chance of incomplete data, you might want to confirm that
paragraph["paragraph_text"]
exists or handle potential KeyError to prevent runtime exceptions.
72-75
: Document the new parameter.You introduced
load_golden_context
inload_corpus
, but the docstring does not describe it. Provide a short explanation to clarify its purpose for maintainers and external contributors.evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (1)
21-35
: Well-structured golden context extraction.The implementation cleanly maps context titles to sentences, filters valid supporting facts, and formats them consistently. Consider adding logging for invalid supporting facts to aid in debugging.
def _get_golden_context(self, item: dict[str, Any]) -> str: """Extracts and formats the golden context from supporting facts.""" # Create a mapping of title to sentences for easy lookup context_dict = {title: sentences for (title, sentences) in item["context"]} # Get all supporting facts in order golden_contexts = [] for title, sent_idx in item["supporting_facts"]: sentences = context_dict.get(title, []) if not self._is_valid_supporting_fact(sentences, sent_idx): + # Consider logging this issue for debugging + # import logging + # logging.warning(f"Invalid supporting fact: {title}, index {sent_idx}") continue golden_contexts.append(f"{title}: {sentences[sent_idx]}") return "\n".join(golden_contexts)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
evals/eval_framework/benchmark_adapters/base_benchmark_adapter.py
(1 hunks)evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py
(3 hunks)evals/eval_framework/benchmark_adapters/musique_adapter.py
(3 hunks)evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (30)
- GitHub Check: Test on macos-15
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-13
- GitHub Check: Test on macos-13
- GitHub Check: lint (ubuntu-latest, 3.11.x)
- GitHub Check: lint (ubuntu-latest, 3.10.x)
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: Test on macos-13
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: run_simple_example_test / test
- GitHub Check: windows-latest
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Build Cognee Backend Docker App Image
- GitHub Check: docker-compose-test
🔇 Additional comments (12)
evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (3)
5-6
: Looks good.The type import and usage look correct here.
9-9
: Verify override behavior.By inheriting from
HotpotQAAdapter
instead ofBaseBenchmarkAdapter
, this class now relies on the parent’s logic for corpus loading. Ensure all new or overridden methods (e.g.,_get_metadata_field_name
,_get_golden_context
) are adequately tested to confirm they're invoked correctly.
12-12
: Consistent key naming is good.Changing the key name to lowercase
"url"
aligns with typical JSON styling and is more conventional.evals/eval_framework/benchmark_adapters/musique_adapter.py (2)
4-4
: No issue with updated imports.
103-103
: Ensure environment readiness.This code depends on
gdown
being installed and on a functioning internet connection. If these conditions are unfulfilled, it will fail. Consider adding a more explicit error message or fallback mechanism for offline users.evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (7)
5-5
: Improved type annotations for better code quality.The expanded import of typing classes enhances code clarity and type safety throughout the adapter.
17-19
: Great validation helper method.This utility method properly validates supporting fact indices against provided sentences, with good checks for non-empty sentences, integer type, and bounds checking.
36-39
: Good design for extensibility.Extracting the metadata field name into a separate method allows child classes to easily override this behavior without duplicating other code.
40-61
: Excellent modularization of item processing.This method encapsulates the logic for processing corpus items, making the code more maintainable and easier to understand. The conditional addition of golden context is well-implemented.
62-64
: Appropriate method signature update.The addition of the
load_golden_context
parameter with a sensible default value maintains backward compatibility while enabling new functionality.
65-65
: Clear and concise documentation.The updated docstring clearly explains the purpose of the method, including the new golden context functionality.
87-87
: Clean implementation of process delegation.Replacing inline processing logic with a call to the dedicated method improves code readability and maintainability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (1)
evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (1)
25-38
: Well-implemented golden context extractionThe extraction of golden contexts from supporting facts is well-structured with clear error handling for invalid indices. The use of a mapping for title lookup is efficient.
Consider adding a fallback message when no valid supporting facts are found to make the return value more explicit.
def _get_golden_context(self, item: dict[str, Any]) -> str: """Extracts and formats the golden context from supporting facts.""" # Create a mapping of title to sentences for easy lookup context_dict = {title: sentences for (title, sentences) in item["context"]} # Get all supporting facts in order golden_contexts = [] for title, sentence_idx in item["supporting_facts"]: sentences = context_dict.get(title, []) if not self._is_valid_supporting_fact(sentences, sentence_idx): continue golden_contexts.append(f"{title}: {sentences[sentence_idx]}") - return "\n".join(golden_contexts) + if golden_contexts: + return "\n".join(golden_contexts) + return "No valid supporting facts found."
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py
(3 hunks)evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py
(1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (29)
- GitHub Check: Test on macos-15
- GitHub Check: run_notebook_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: run_simple_example_test / test
- GitHub Check: run_notebook_test / test
- GitHub Check: Test on macos-13
- GitHub Check: run_notebook_test / test
- GitHub Check: run_eval_framework_test / test
- GitHub Check: Test on macos-15
- GitHub Check: Test on macos-15
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: Test on macos-13
- GitHub Check: run_dynamic_steps_example_test / test
- GitHub Check: Test on macos-13
- GitHub Check: run_multimedia_example_test / test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: windows-latest
- GitHub Check: run_networkx_metrics_test / test
- GitHub Check: Test on ubuntu-22.04
- GitHub Check: test
- GitHub Check: lint (ubuntu-latest, 3.10.x)
- GitHub Check: test
- GitHub Check: test
- GitHub Check: Build Cognee Backend Docker App Image
- GitHub Check: docker-compose-test
🔇 Additional comments (35)
evals/eval_framework/benchmark_adapters/twowikimultihop_adapter.py (15)
5-6
: Good enhancement to type hintsThe updated imports with proper type annotations enhance code readability and enable better static analysis. The change to inherit from
HotpotQAAdapter
is a good design decision that improves code reuse.
9-9
: Improved inheritance structureChanging from
BaseBenchmarkAdapter
toHotpotQAAdapter
properly leverages the common functionality already implemented in the parent class. This modification aligns with the single responsibility principle and DRY (Don't Repeat Yourself).
12-12
: Updated field name for consistencyChanging from "URL" to "url" improves naming consistency throughout the codebase. Consistent field naming makes code maintenance easier.
15-17
: Good initialization with metadata field overrideThe new constructor properly calls the parent constructor and overrides the
metadata_field_name
to match the field structure in the TwoWikiMultihop dataset. This approach ensures proper dataset-specific behavior.
19-28
: Well-implemented context extension methodThe override of
_get_golden_context
effectively extends the parent implementation by adding evidence fact triplets while preserving the base functionality. The implementation is clean and follows good inheritance practices.
5-6
: Import and inheritance changes look goodThe updated imports for typing and HotpotQAAdapter support the new inheritance structure and type annotations.
9-9
: Base class change aligns with modular designChanging the base class from BaseBenchmarkAdapter to HotpotQAAdapter looks good and matches the PR objective of improving modularity by letting this adapter inherit functionality from HotpotQAAdapter.
12-12
: Fixed URL key to maintain consistencyGood fix changing "URL" to "url" for consistency with other adapters.
15-17
: Constructor implementation looks goodThe constructor properly calls the parent class initializer and sets a custom metadata field name which will be used in the inherited methods.
19-28
: Golden context extraction looks goodThe implementation extends the parent's golden context extraction by appending evidence fact triplets when available. This maintains a consistent interface while adding adapter-specific functionality.
5-6
: Updated imports to match inheritance changesThe import statement now correctly includes
HotpotQAAdapter
to support the inheritance change, which is aligned with the modular design approach.
9-9
: Improved inheritance hierarchyChanging the parent class from
BaseBenchmarkAdapter
toHotpotQAAdapter
leverages the common implementation, reducing code duplication and improving maintainability.
12-12
: Updated URL key for consistencyChanged "URL" to "url" for consistency with other adapters. This is a good catch to ensure uniform casing in dictionary keys.
15-17
: Well-implemented constructor with parent class initializationThe constructor properly calls the parent class initializer and overrides the
metadata_field_name
attribute to match the dataset-specific field name.
19-28
: Effective extension of the golden context extractionThe override of
_get_golden_context
demonstrates good OOP practices:
- It reuses the parent implementation with
super()
- It extends functionality by adding evidence fact triplets
- It maintains a consistent format with the parent class
This approach shows proper understanding of inheritance and method overriding.
evals/eval_framework/benchmark_adapters/hotpot_qa_adapter.py (20)
5-5
: Good addition of type annotationsThe addition of proper typing improves code clarity and enables better static analysis.
17-19
: Good class initializationAdding an
__init__
method to initialize themetadata_field_name
as a class property is a good design decision that makes the code more configurable for child classes.
21-23
: Well-implemented validation methodThis helper method properly validates supporting fact indices with appropriate type checking and range validation.
25-38
: Great extraction method for golden contextThe
_get_golden_context
method is well structured and handles potential edge cases by:
- Creating a mapping for easy lookup
- Validating indices before accessing data
- Properly formatting the extracted contexts
This implementation is robust and prevents potential index errors.
40-61
: Good modularization of item processingEncapsulating the logic for processing individual items into a separate method improves code organization and maintainability. The conditional addition of golden context based on the parameter is clean and follows good separation of concerns.
62-64
: Good method signature updateThe addition of the
load_golden_context
parameter with a default value ofFalse
maintains backward compatibility while adding new functionality. The return type annotation withTuple[List[str], List[dict[str, Any]]]
improves type clarity.
87-87
: Clean implementation of process delegationReplacing the inline item processing with a call to the modular
_process_item
method simplifies theload_corpus
method and improves readability.
5-5
: Updated imports for better type annotationsGood addition of proper typing imports that match the enhanced type annotations in the method signatures.
17-19
: Good initialization of metadata field nameThe initialization of the metadata_field_name as a class property matches the feedback from previous comments and provides a clean extension point for child classes.
21-23
: Robust supporting fact validationThe validation method provides a clear safeguard against invalid supporting fact indices, protecting against potential runtime errors.
40-61
: Good encapsulation of item processing logicThe _process_item method effectively modularizes the logic for processing items, making the code more maintainable and easier to understand. The conditional inclusion of golden_context based on the load_golden_context parameter is well-implemented.
62-64
: Updated method signature with proper type annotationsThe addition of the load_golden_context parameter and improved type annotations enhances the method's interface and documentation.
86-87
: Simplified corpus loading with modular approachGood refactoring to use the _process_item method, which makes the load_corpus method cleaner and more maintainable.
5-5
: Good import organizationThe updated import statement improves type hinting by including more specific typing classes (Optional, Any, List, Tuple), which enhances code readability and IDE support.
17-20
: Proper initialization of metadata field nameGood implementation of the constructor that initializes the
metadata_field_name
attribute. This makes the code more maintainable by centralizing the field name configuration and allows child classes to easily override it.
21-24
: Well-structured validation helper methodThe
_is_valid_supporting_fact
method properly validates supporting fact indices with clear error checking. Using this helper method improves code modularity and readability.
25-39
: Good implementation of golden context extractionThe
_get_golden_context
method is well-structured with:
- Clear context dictionary creation
- Proper validation of supporting facts
- Formatted output generation
This implementation provides a clean way for child classes to extend the functionality while maintaining consistency.
40-61
: Excellent modularization of item processing logicExtracting the item processing logic into a separate method improves code organization and maintainability. The method handles both corpus and QA pair processing with conditional inclusion of golden context based on the parameter.
62-64
: Updated method signature with golden context supportThe method signature now includes the
load_golden_context
parameter with a default value ofFalse
, maintaining backward compatibility while adding new functionality.
86-87
: Simplified processing loop with method delegationGood refactoring that replaces inline processing code with a method call, making the code more readable and maintainable.
Description
• Added load_golden_context parameter to BaseBenchmarkAdapter's abstract load_corpus method, establishing a common interface for retrieving supporting evidence
• Refactored HotpotQAAdapter with a modular design: introduced _get_metadata_field_name method to handle dataset-specific fields (making it extensible for child classes), implemented get golden context functionality.
• Refactored TwoWikiMultihopAdapter to inherit from HotpotQAAdapter, overriding only the necessary methods while reusing parent's functionality
• Added golden context support to MusiqueQAAdapter with their decomposition-based format
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit
TwoWikiMultihopAdapter
from corpus loading to context extraction.