Test: eval_framework/evaluation [cog-1234] #575
Conversation
Walkthrough

This pull request introduces new unit tests for the evaluation framework: test suites for answer generation, the exact-match and F1 metrics, the DeepEvalAdapter, and the metrics dashboard.
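For reviewers who want to run the new suites locally, here is a minimal invocation sketch, assuming a standard pytest setup and the test paths shown later in this review (the exact command is not documented in the PR itself):

```python
# Runs the new eval_framework unit tests; equivalent to running
# `pytest cognee/tests/unit/eval_framework -v` from the repository root.
import sys

import pytest

if __name__ == "__main__":
    sys.exit(pytest.main(["cognee/tests/unit/eval_framework", "-v"]))
```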
Caution
Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments. If you are seeing this consistently, please check "Code review limits" under "Moderation" settings.
Actionable comments posted: 2
🧹 Nitpick comments (4)
cognee/tests/unit/eval_framework/metrics_test.py (2)
21-39: LGTM! Comprehensive test cases covering various string matching scenarios.

The parametrized test cases effectively cover exact matches, case sensitivity, whitespace handling, empty strings, and special characters.
Consider adding these edge cases:
```diff
 @pytest.mark.parametrize(
     "actual, expected, expected_exact_score, expected_f1_range",
     [
+        ("Hello\nWorld", "Hello World", 0.0, (0.0, 1.0)),  # Newline handling
+        ("Hello\tWorld", "Hello World", 0.0, (0.0, 1.0)),  # Tab handling
+        ("Hello 世界", "Hello 世界", 1.0, (1.0, 1.0)),  # Unicode support
+        (None, "Hello World", 0.0, (0.0, 0.0)),  # None handling
     ],
 )
```
40-52: Add docstring to test function.

The test function would benefit from a docstring explaining the purpose of each parameter and expected behavior.
```diff
 def test_metrics(metrics, actual, expected, expected_exact_score, expected_f1_range):
+    """Test exact match and F1 score metrics with various input combinations.
+
+    Args:
+        metrics: Dictionary containing ExactMatchMetric and F1ScoreMetric instances
+        actual: The actual output string to test
+        expected: The expected output string to compare against
+        expected_exact_score: Expected score from exact match metric (0.0 or 1.0)
+        expected_f1_range: Tuple of (min, max) expected F1 scores
+    """
     test_case = MockTestCase(actual, expected)
```
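For readers who don't have metrics_test.py open: the `metrics` fixture and `MockTestCase` referenced above aren't shown in this review. Below is a rough, self-contained sketch of that plumbing, under the assumption that the metrics follow a deepeval-style `measure(test_case)` convention; the PR's actual definitions may differ.

```python
import pytest


class MockTestCase:
    """Minimal stand-in for an evaluation test case: just the two compared strings."""

    def __init__(self, actual_output, expected_output):
        self.actual_output = actual_output
        self.expected_output = expected_output


class _StubExactMatchMetric:
    """Illustrative stand-in; the PR presumably uses the real ExactMatchMetric."""

    def measure(self, test_case):
        self.score = float(test_case.actual_output == test_case.expected_output)
        return self.score


@pytest.fixture
def metrics():
    # In the PR this would return the real ExactMatchMetric and F1ScoreMetric
    # instances; a stub is used here only to keep the sketch runnable on its own.
    return {"EM": _StubExactMatchMetric()}
```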
cognee/tests/unit/eval_framework/deepeval_adapter_test.py (1)

1-113: LGTM! Excellent test coverage with comprehensive edge cases.

The test suite effectively covers correctness, error handling, edge cases, and mocking. Good job on testing various scenarios including empty answers, missing fields, and partial matches.
Consider adding these test cases:
```python
@pytest.mark.asyncio
async def test_concurrent_evaluation(adapter):
    """Test evaluating multiple answers concurrently."""
    answers = [
        {"question": f"Q{i}", "answer": f"A{i}", "golden_answer": f"A{i}"}
        for i in range(5)
    ]
    evaluator_metrics = ["EM", "f1"]
    results = await asyncio.gather(*[
        adapter.evaluate_answers([answer], evaluator_metrics)
        for answer in answers
    ])
    assert len(results) == len(answers)
    for result in results:
        assert result[0]["metrics"]["EM"]["score"] == 1.0


@pytest.mark.asyncio
async def test_metric_calculation_logic(adapter):
    """Test the actual calculation logic of metrics."""
    answers = [
        {
            "question": "What is 2 + 2?",
            "answer": "The sum of 2 and 2 is 4",
            "golden_answer": "4",
        }
    ]
    evaluator_metrics = ["EM", "f1"]
    results = await adapter.evaluate_answers(answers, evaluator_metrics)

    # Verify that partial matches affect scores differently
    assert results[0]["metrics"]["EM"]["score"] == 0.0  # Exact match should fail
    assert 0.0 < results[0]["metrics"]["f1"]["score"] < 1.0  # F1 should be partial
```
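A brief note on why the last assertion expects a partial F1: token-overlap F1 (SQuAD-style) rewards shared tokens between the prediction and the gold answer even when the strings differ. Here is a standalone sketch of that computation; the project's `F1ScoreMetric` may tokenize or normalize differently, so treat the exact value as illustrative:

```python
from collections import Counter


def token_f1(prediction: str, ground_truth: str) -> float:
    """SQuAD-style token-overlap F1 over whitespace-split, lowercased tokens."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


# "The sum of 2 and 2 is 4" vs "4": exact match fails (score 0.0), but the one
# shared token "4" gives precision 1/8 and recall 1/1, so F1 = 2/9 ≈ 0.22.
print(token_f1("The sum of 2 and 2 is 4", "4"))
```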
cognee/tests/unit/eval_framework/dashboard_test.py (1)

51-56: Enhance HTML structure verification.

The current HTML content verification is basic. Consider adding more thorough checks of the generated HTML structure.
```python
def test_generate_metrics_dashboard_html_structure(self):
    """Test the structure of generated HTML dashboard."""
    generate_metrics_dashboard(self.temp_json.name, self.output_file)

    with open(self.output_file, "r", encoding="utf-8") as f:
        html_content = f.read()

    # Check basic HTML structure
    self.assertIn("<!DOCTYPE html>", html_content)
    self.assertIn("<html>", html_content)
    self.assertIn("</html>", html_content)

    # Check required sections
    self.assertIn("<div class='metrics-summary'>", html_content)
    self.assertIn("<div class='plotly-chart'>", html_content)

    # Check metric-specific elements
    for metric in ["accuracy", "relevance"]:
        self.assertIn(f"<h3>{metric.title()}</h3>", html_content)
        self.assertIn(f"<div class='{metric}-chart'>", html_content)
```
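The snippet above relies on `self.temp_json` and `self.output_file` from the test class's setUp, which isn't visible in this review. A hedged sketch of what that fixture plumbing might look like follows; the metrics JSON schema here is a guess, not taken from the PR:

```python
import json
import os
import tempfile
import unittest


class DashboardTestSetupSketch(unittest.TestCase):
    def setUp(self):
        # Write a small metrics JSON file for generate_metrics_dashboard to consume.
        self.temp_json = tempfile.NamedTemporaryFile(
            mode="w", suffix=".json", delete=False
        )
        json.dump(
            {"accuracy": [{"score": 0.9}], "relevance": [{"score": 0.8}]},
            self.temp_json,
        )
        self.temp_json.close()
        self.output_file = os.path.join(tempfile.gettempdir(), "dashboard.html")

    def tearDown(self):
        # Remove the temp JSON and any generated HTML.
        for path in (self.temp_json.name, self.output_file):
            if os.path.exists(path):
                os.remove(path)
```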
🛑 Comments failed to post (2)
cognee/tests/unit/eval_framework/answer_generation_test.py (1)
9-37: 🛠️ Refactor suggestion
Add error handling test cases.
While the happy path is well tested, consider adding test cases for error scenarios and multiple QA pairs.
Add these test cases:
```python
@pytest.mark.asyncio
async def test_answer_generation_error_handling():
    qa_pairs = [{"question": "Q1"}, {"question": "Q2"}]  # Missing 'answer' field
    mock_answer_resolver = AsyncMock()
    answer_generator = AnswerGeneratorExecutor()

    with pytest.raises(KeyError):
        await answer_generator.question_answering_non_parallel(
            questions=qa_pairs,
            answer_resolver=mock_answer_resolver,
        )


@pytest.mark.asyncio
async def test_answer_generation_multiple_qa_pairs():
    limit = 3  # Test with multiple QA pairs
    corpus_list, qa_pairs = DummyAdapter().load_corpus(limit=limit)
    mock_answer_resolver = AsyncMock()
    mock_answer_resolver.side_effect = lambda query: ["mock_answer"]

    answer_generator = AnswerGeneratorExecutor()
    answers = await answer_generator.question_answering_non_parallel(
        questions=qa_pairs,
        answer_resolver=mock_answer_resolver,
    )

    assert len(answers) == limit
    assert mock_answer_resolver.call_count == limit
```
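One detail worth calling out in the mocking pattern above: `AsyncMock` transparently awaits a plain (non-async) `side_effect`, so the resolver can be stubbed with a lambda rather than a coroutine. A quick standalone check of that behavior (the query string is illustrative only):

```python
import asyncio
from unittest.mock import AsyncMock


async def main():
    resolver = AsyncMock()
    resolver.side_effect = lambda query: ["mock_answer"]

    # Awaiting the mock returns the lambda's return value.
    result = await resolver("any question text")
    assert result == ["mock_answer"]
    assert resolver.call_count == 1


asyncio.run(main())
```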
cognee/tests/unit/eval_framework/dashboard_test.py (1)

42-56: 🛠️ Refactor suggestion
Add error handling test cases for invalid inputs.
While the happy path is well tested, consider adding test cases for invalid JSON and file permission scenarios.
Add these test cases:
```python
def test_generate_metrics_dashboard_invalid_json(self):
    """Test handling of invalid JSON input."""
    with open(self.temp_json.name, "w") as f:
        f.write("invalid json content")

    with self.assertRaises(json.JSONDecodeError):
        generate_metrics_dashboard(self.temp_json.name, self.output_file)


def test_generate_metrics_dashboard_permission_error(self):
    """Test handling of file permission errors."""
    # Make output directory read-only
    os.chmod(os.path.dirname(self.output_file), 0o444)

    with self.assertRaises(PermissionError):
        generate_metrics_dashboard(self.temp_json.name, self.output_file)

    # Restore permissions for cleanup
    os.chmod(os.path.dirname(self.output_file), 0o777)
```
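A caveat on the permission test suggested above: chmod-based read-only directories aren't enforced on Windows or when the tests run as root, so it may be worth guarding the test. A possible sketch using unittest's skip machinery, matching the unittest style of dashboard_test.py (the condition below is an assumption about the CI environments involved):

```python
import os
import sys
import unittest

# Skip where chmod-based read-only directories don't actually block writes.
skip_without_posix_permissions = unittest.skipIf(
    sys.platform.startswith("win") or os.geteuid() == 0,
    "directory write permissions are not enforced for this platform or user",
)
```

The decorator would then be applied to test_generate_metrics_dashboard_permission_error.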
Description
DCO Affirmation
I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin
Summary by CodeRabbit
- New unit tests for `DeepEvalAdapter`, covering correctness, unsupported metrics, and error handling.
- New unit tests for `ExactMatchMetric` and `F1ScoreMetric`, parameterized for various test cases.