Commit 1cb8331

Authored by lxobr, Vasilije1990, and hajdul88
feat: add experimental cognify pipeline [COG-1293] (#541)
feat: add experimental cognify pipeline [COG-1293] (#541)

## Description

- Integrate experimental tasks into the evaluation framework

## DCO Affirmation

I affirm that all code in every commit of this pull request conforms to the terms of the Topoteretes Developer Certificate of Origin.

## Summary by CodeRabbit

- **New Features**
  - Introduced interactive prompt templates for extracting graph nodes, edge triplets, and relationship names, resulting in more comprehensive and accurate knowledge graphs.
  - Added asynchronous processes to efficiently handle document data and integrate graph components.
  - Launched cascade graph task options to offer enhanced flexibility in task management workflows.
  - Added new functionality for extracting content nodes and relationship names from text.
- **Refactor**
  - Streamlined configurations for prompt processing and task initialization, improving overall modularity and system stability.
  - Updated task getter mechanisms to use function-based approaches for improved flexibility.

Co-authored-by: Vasilije <8619304+Vasilije1990@users.noreply.github.com>
Co-authored-by: hajdul88 <52442977+hajdul88@users.noreply.github.com>
1 parent 55411ff · commit 1cb8331

24 files changed: +380 −70 lines

cognee/infrastructure/llm/prompts/read_query_prompt.py (+5 −2)

```diff
@@ -3,10 +3,13 @@
 from cognee.root_dir import get_absolute_path


-def read_query_prompt(prompt_file_name: str):
+def read_query_prompt(prompt_file_name: str, base_directory: str = None):
     """Read a query prompt from a file."""
     try:
-        file_path = path.join(get_absolute_path("./infrastructure/llm/prompts"), prompt_file_name)
+        if base_directory is None:
+            base_directory = get_absolute_path("./infrastructure/llm/prompts")
+
+        file_path = path.join(base_directory, prompt_file_name)

         with open(file_path, "r", encoding="utf-8") as file:
             return file.read()
```

cognee/infrastructure/llm/prompts/render_prompt.py (+3 −2)

```diff
@@ -2,14 +2,15 @@
 from cognee.root_dir import get_absolute_path


-def render_prompt(filename: str, context: dict) -> str:
+def render_prompt(filename: str, context: dict, base_directory: str = None) -> str:
     """Render a Jinja2 template asynchronously.
     :param filename: The name of the template file to render.
     :param context: The context to render the template with.
     :return: The rendered template as a string."""

     # Set the base directory relative to the cognee root directory
-    base_directory = get_absolute_path("./infrastructure/llm/prompts")
+    if base_directory is None:
+        base_directory = get_absolute_path("./infrastructure/llm/prompts")

     # Initialize the Jinja2 environment to load templates from the filesystem
     env = Environment(
```
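Both prompt helpers now take an optional `base_directory` that falls back to the built-in prompts folder when omitted. A minimal standalone sketch of this default-or-override pattern, using temp directories in place of cognee's `get_absolute_path` (so none of cognee's modules are involved):

```python
import os
import tempfile

# Stand-in for the built-in prompts folder that get_absolute_path would resolve.
DEFAULT_PROMPT_DIR = tempfile.mkdtemp()


def read_query_prompt(prompt_file_name: str, base_directory: str = None) -> str:
    """Read a prompt file, falling back to the default directory when none is given."""
    if base_directory is None:
        base_directory = DEFAULT_PROMPT_DIR
    file_path = os.path.join(base_directory, prompt_file_name)
    with open(file_path, "r", encoding="utf-8") as file:
        return file.read()


# The default location still works unchanged...
with open(os.path.join(DEFAULT_PROMPT_DIR, "greet.txt"), "w", encoding="utf-8") as f:
    f.write("Hello from the default dir")

# ...while a caller-supplied directory overrides it.
custom_dir = tempfile.mkdtemp()
with open(os.path.join(custom_dir, "greet.txt"), "w", encoding="utf-8") as f:
    f.write("Hello from a custom dir")

default_text = read_query_prompt("greet.txt")
custom_text = read_query_prompt("greet.txt", base_directory=custom_dir)
```

This keeps every existing call site working while letting the new cascade tasks point at their own prompts directory.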

cognee/tasks/experimental/__init__.py

Whitespace-only changes.

cognee/tasks/graph/cascade_extract/__init__.py

Whitespace-only changes.
cognee/tasks/graph/cascade_extract/prompts/extract_graph_edge_triplets_prompt_input.txt (new file, +18 lines)

```
Using the provided potential nodes and relationships, extract concrete edges from the following text. Build upon previously extracted nodes and edges (if any), as this is round {{ round_number }} of {{ total_rounds }}.

**Text:**
{{ text }}

**Potential Nodes to Use:**
{{ potential_nodes }}

**Potential Relationships to Use:**
{{ potential_relationship_names }}

**Previously Extracted Nodes:**
{{ previous_nodes }}

**Previously Extracted Edge Triplets:**
{{ previous_edge_triplets }}

Create specific edge triplets between nodes, ensuring each connection is clearly supported by the text content. Use the potential nodes and relationships as your primary building blocks, while considering previously extracted nodes and edges for consistency and completeness.
```
cognee/tasks/graph/cascade_extract/prompts/extract_graph_edge_triplets_prompt_system.txt (new file, +8 lines)

```
You are an expert in knowledge graph building focusing on the extraction of graph triplets.
Your task is to extract structured knowledge graph triplets from text, using a provided list of potential nodes and relationship names as reference.
• Form triplets in the format (start_node, relationship_name, end_node), selecting the most precise and relevant relationship.
• Identify explicit and implied relationships by leveraging the given nodes and relationship names, as well as logical inference.
• Ensure completeness by cross-checking all nodes and relationships across multiple rounds.
• Exclude trivial, redundant, or nonsensical triplets, keeping only meaningful and well-structured connections.
• Add relevant edge triplets beyond the available potential nodes and relationship names.
• Return a list of extracted triplets, ensuring clarity and accuracy for knowledge graph integration.
```
cognee/tasks/graph/cascade_extract/prompts/extract_graph_nodes_prompt_input.txt (new file, +7 lines)

```
Extract distinct entities and concepts from the following text to expand the knowledge graph. Build upon previously extracted entities, ensuring completeness and consistency. This is round {{ round_number }} of {{ total_rounds }}.

**Text:**
{{ text }}

**Previously Extracted Entities:**
{{ previous_entities }}
```
cognee/tasks/graph/cascade_extract/prompts/extract_graph_nodes_prompt_system.txt (new file, +8 lines)

```
You are an expert in entity extraction and knowledge graph building focusing on node identification.
Your task is to perform a detailed entity and concept extraction from text to generate a list of potential nodes for a knowledge graph.
• Extract clear, distinct entities and concepts as individual strings.
• Be exhaustive: ensure completeness by capturing all entities, names, nouns, noun parts, and implied or implicit mentions.
• Also extract potential entity type nodes, directly mentioned or implied.
• Avoid duplicates and overly generic terms.
• Consider different perspectives and indirect references.
• Return only a list of unique node strings with all the entities.
```
cognee/tasks/graph/cascade_extract/prompts/extract_graph_relationship_names_prompt_input.txt (new file, +15 lines)

```
Analyze the following text to identify relationships between entities in the knowledge graph. This is round {{ round_number }} of {{ total_rounds }}.

**Text:**
{{ text }}

**Previously Extracted Potential Nodes:**
{{ potential_nodes }}

**Nodes Identified in Previous Rounds:**
{{ previous_nodes }}

**Relationships Identified in Previous Rounds:**
{{ previous_relationship_names }}

Extract both explicit and implicit relationships between the nodes, building upon previous findings while ensuring completeness and consistency.
```
cognee/tasks/graph/cascade_extract/prompts/extract_graph_relationship_names_prompt_system.txt (new file, +6 lines)

```
You are an expert in relationship identification and knowledge graph building focusing on relationships. Your task is to perform a detailed extraction of relationship names from the text.
• Extract all relationship names from explicit phrases, verbs, and implied context that could help form edge triplets.
• Use the potential nodes and reassign them to relationship names if they correspond to a relation, verb, action, or similar.
• Ensure completeness by working in multiple rounds, capturing overlooked connections and refining the nodes list.
• Focus on meaningful entities and relationships, whether directly stated, implied, or implicit.
• Return two lists: refined nodes and potential relationship names (for forming edges).
```

cognee/tasks/graph/cascade_extract/utils/__init__.py

Whitespace-only changes.
cognee/tasks/graph/cascade_extract/utils/extract_content_nodes_and_relationship_names.py (new file, +61 lines)

```python
from typing import List, Tuple
from pydantic import BaseModel

from cognee.infrastructure.llm.get_llm_client import get_llm_client
from cognee.infrastructure.llm.prompts import render_prompt, read_query_prompt
from cognee.root_dir import get_absolute_path


class PotentialNodesAndRelationshipNames(BaseModel):
    """Response model containing lists of potential node names and relationship names."""

    nodes: List[str]
    relationship_names: List[str]


async def extract_content_nodes_and_relationship_names(
    content: str, existing_nodes: List[str], n_rounds: int = 2
) -> Tuple[List[str], List[str]]:
    """Extracts node names and relationship names from content through multiple rounds of analysis."""
    llm_client = get_llm_client()
    all_nodes: List[str] = existing_nodes.copy()
    all_relationship_names: List[str] = []
    existing_node_set = {node.lower() for node in all_nodes}
    existing_relationship_names = set()

    for round_num in range(n_rounds):
        context = {
            "text": content,
            "potential_nodes": existing_nodes,
            "previous_nodes": all_nodes,
            "previous_relationship_names": all_relationship_names,
            "round_number": round_num + 1,
            "total_rounds": n_rounds,
        }

        base_directory = get_absolute_path("./tasks/graph/cascade_extract/prompts")
        text_input = render_prompt(
            "extract_graph_relationship_names_prompt_input.txt",
            context,
            base_directory=base_directory,
        )
        system_prompt = read_query_prompt(
            "extract_graph_relationship_names_prompt_system.txt", base_directory=base_directory
        )
        response = await llm_client.acreate_structured_output(
            text_input=text_input,
            system_prompt=system_prompt,
            response_model=PotentialNodesAndRelationshipNames,
        )

        for node in response.nodes:
            if node.lower() not in existing_node_set:
                all_nodes.append(node)
                existing_node_set.add(node.lower())

        for relationship_name in response.relationship_names:
            if relationship_name.lower() not in existing_relationship_names:
                all_relationship_names.append(relationship_name)
                existing_relationship_names.add(relationship_name.lower())

    return all_nodes, all_relationship_names
```
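The accumulation loops above deduplicate case-insensitively while preserving first-seen casing and insertion order. A self-contained sketch of that merge step, with the helper name `merge_unique` invented for illustration:

```python
from typing import Iterable, List, Set


def merge_unique(accumulated: List[str], seen_lower: Set[str], candidates: Iterable[str]) -> None:
    """Append candidates whose lowercase form is unseen, keeping first-seen casing and order."""
    for item in candidates:
        key = item.lower()
        if key not in seen_lower:
            accumulated.append(item)
            seen_lower.add(key)


# Simulate two extraction rounds returning overlapping node names.
all_nodes: List[str] = ["Cognee"]
seen = {n.lower() for n in all_nodes}

merge_unique(all_nodes, seen, ["Knowledge Graph", "cognee", "Pipeline"])   # "cognee" is a dupe
merge_unique(all_nodes, seen, ["pipeline", "Edge Triplet"])                # "pipeline" is a dupe
```

Tracking a separate lowercase set alongside the list keeps lookups O(1) without losing the original casing the LLM returned.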
cognee/tasks/graph/cascade_extract/utils/extract_edge_triplets.py (new file, +60 lines)

```python
from typing import List, Tuple
from cognee.infrastructure.llm.get_llm_client import get_llm_client
from cognee.infrastructure.llm.prompts import render_prompt, read_query_prompt
from cognee.shared.data_models import KnowledgeGraph
from cognee.root_dir import get_absolute_path


async def extract_edge_triplets(
    content: str, nodes: List[str], relationship_names: List[str], n_rounds: int = 2
) -> KnowledgeGraph:
    """Creates a knowledge graph by identifying relationships between the provided nodes."""
    llm_client = get_llm_client()
    final_graph = KnowledgeGraph(nodes=[], edges=[])
    existing_nodes = set()
    existing_node_ids = set()
    existing_edge_triplets = set()

    for round_num in range(n_rounds):
        context = {
            "text": content,
            "potential_nodes": nodes,
            "potential_relationship_names": relationship_names,
            "previous_nodes": existing_nodes,
            "previous_edge_triplets": existing_edge_triplets,
            "round_number": round_num + 1,
            "total_rounds": n_rounds,
        }

        base_directory = get_absolute_path("./tasks/graph/cascade_extract/prompts")
        text_input = render_prompt(
            "extract_graph_edge_triplets_prompt_input.txt", context, base_directory=base_directory
        )
        system_prompt = read_query_prompt(
            "extract_graph_edge_triplets_prompt_system.txt", base_directory=base_directory
        )
        extracted_graph = await llm_client.acreate_structured_output(
            text_input=text_input, system_prompt=system_prompt, response_model=KnowledgeGraph
        )

        for node in extracted_graph.nodes:
            if node.name not in existing_nodes:
                final_graph.nodes.append(node)
                existing_nodes.add(node.name)
                existing_node_ids.add(node.id)

        for edge in extracted_graph.edges:
            edge_key = (edge.source_node_id, edge.target_node_id, edge.relationship_name)
            if edge_key in existing_edge_triplets:
                continue

            if not (
                edge.source_node_id in existing_node_ids
                and edge.target_node_id in existing_node_ids
            ):
                continue

            final_graph.edges.append(edge)
            existing_edge_triplets.add(edge_key)

    return final_graph
```
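The edge loop applies two filters: skip duplicate (source, target, relationship) triplets, and drop edges whose endpoints were never added as nodes. A self-contained sketch of that validation step, with a simplified `Edge` stand-in for cognee's data model:

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Edge:
    """Simplified stand-in for the KnowledgeGraph edge model."""
    source_node_id: str
    target_node_id: str
    relationship_name: str


def keep_valid_edges(
    candidate_edges: List[Edge],
    known_node_ids: Set[str],
    seen_triplets: Set[Tuple[str, str, str]],
) -> List[Edge]:
    """Keep edges between known nodes, skipping duplicate triplets across rounds."""
    kept = []
    for edge in candidate_edges:
        key = (edge.source_node_id, edge.target_node_id, edge.relationship_name)
        if key in seen_triplets:
            continue  # already emitted in an earlier round
        if not (edge.source_node_id in known_node_ids and edge.target_node_id in known_node_ids):
            continue  # dangling endpoint: the LLM referenced a node it never defined
        kept.append(edge)
        seen_triplets.add(key)
    return kept


node_ids = {"a", "b"}
seen: Set[Tuple[str, str, str]] = set()
edges = [
    Edge("a", "b", "knows"),
    Edge("a", "b", "knows"),  # duplicate, dropped
    Edge("a", "c", "knows"),  # unknown endpoint "c", dropped
]
kept = keep_valid_edges(edges, node_ids, seen)
```

Passing `seen_triplets` in mutable form lets later rounds build on earlier ones, which is what the round loop in the file above relies on.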
cognee/tasks/graph/cascade_extract/utils/extract_nodes.py (new file, +45 lines)

```python
from typing import List
from pydantic import BaseModel

from cognee.modules.chunking.models.DocumentChunk import DocumentChunk
from cognee.infrastructure.llm.get_llm_client import get_llm_client
from cognee.infrastructure.llm.prompts import render_prompt, read_query_prompt
from cognee.root_dir import get_absolute_path


class PotentialNodes(BaseModel):
    """Response model containing a list of potential node names."""

    nodes: List[str]


async def extract_nodes(text: str, n_rounds: int = 2) -> List[str]:
    """Extracts node names from content through multiple rounds of analysis."""
    llm_client = get_llm_client()
    all_nodes: List[str] = []
    existing_nodes = set()

    for round_num in range(n_rounds):
        context = {
            "previous_nodes": all_nodes,
            "round_number": round_num + 1,
            "total_rounds": n_rounds,
            "text": text,
        }
        base_directory = get_absolute_path("./tasks/graph/cascade_extract/prompts")
        text_input = render_prompt(
            "extract_graph_nodes_prompt_input.txt", context, base_directory=base_directory
        )
        system_prompt = read_query_prompt(
            "extract_graph_nodes_prompt_system.txt", base_directory=base_directory
        )
        response = await llm_client.acreate_structured_output(
            text_input=text_input, system_prompt=system_prompt, response_model=PotentialNodes
        )

        for node in response.nodes:
            if node.lower() not in existing_nodes:
                all_nodes.append(node)
                existing_nodes.add(node.lower())

    return all_nodes
```

cognee/tasks/graph/extract_graph_from_data.py (+13 −9)

```diff
@@ -14,16 +14,10 @@
 from cognee.tasks.storage import add_data_points


-async def extract_graph_from_data(
-    data_chunks: list[DocumentChunk], graph_model: Type[BaseModel]
+async def integrate_chunk_graphs(
+    data_chunks: list[DocumentChunk], chunk_graphs: list, graph_model: Type[BaseModel]
 ) -> List[DocumentChunk]:
-    """
-    Extracts and integrates a knowledge graph from the text content of document chunks using a specified graph model.
-    """
-
-    chunk_graphs = await asyncio.gather(
-        *[extract_content_graph(chunk.text, graph_model) for chunk in data_chunks]
-    )
+    """Updates DocumentChunk objects, integrates data points and edges into databases."""
     graph_engine = await get_graph_engine()

     if graph_model is not KnowledgeGraph:
@@ -52,3 +46,13 @@ async def extract_graph_from_data(
     await graph_engine.add_edges(graph_edges)

     return data_chunks
+
+
+async def extract_graph_from_data(
+    data_chunks: list[DocumentChunk], graph_model: Type[BaseModel]
+) -> List[DocumentChunk]:
+    """Extracts and integrates a knowledge graph from the text content of document chunks using a specified graph model."""
+    chunk_graphs = await asyncio.gather(
+        *[extract_content_graph(chunk.text, graph_model) for chunk in data_chunks]
+    )
+    return await integrate_chunk_graphs(data_chunks, chunk_graphs, graph_model)
```
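The refactor splits per-chunk extraction (fan-out with `asyncio.gather`) from graph integration, so alternative extractors such as the cascade pipeline can reuse the same integration step. A minimal sketch of that structure, with string placeholders standing in for chunks and graphs (none of cognee's types are used):

```python
import asyncio
from typing import List


async def integrate_chunk_graphs(chunks: List[str], graphs: List[str]) -> List[str]:
    """Stand-in for the shared integration step: pairs each chunk with its graph."""
    return [f"{chunk}:{graph}" for chunk, graph in zip(chunks, graphs)]


async def default_extractor(chunk: str) -> str:
    """Stand-in for extract_content_graph."""
    return f"graph({chunk})"


async def extract_graph_from_data(chunks: List[str]) -> List[str]:
    # Fan out extraction per chunk, then hand the aligned results to the integrator.
    graphs = await asyncio.gather(*[default_extractor(c) for c in chunks])
    return await integrate_chunk_graphs(chunks, list(graphs))


result = asyncio.run(extract_graph_from_data(["c1", "c2"]))
```

Because `asyncio.gather` preserves input order, the chunk list and graph list stay index-aligned, which the integration step depends on.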
New file (+40 lines): cascade graph extraction pipeline

```python
import asyncio
from typing import List

from cognee.modules.chunking.models.DocumentChunk import DocumentChunk
from cognee.shared.data_models import KnowledgeGraph
from cognee.tasks.graph.cascade_extract.utils.extract_nodes import extract_nodes
from cognee.tasks.graph.cascade_extract.utils.extract_content_nodes_and_relationship_names import (
    extract_content_nodes_and_relationship_names,
)
from cognee.tasks.graph.cascade_extract.utils.extract_edge_triplets import (
    extract_edge_triplets,
)
from cognee.tasks.graph.extract_graph_from_data import integrate_chunk_graphs


async def extract_graph_from_data(
    data_chunks: List[DocumentChunk], n_rounds: int = 2
) -> List[DocumentChunk]:
    """Extract and update graph data from document chunks in multiple steps."""
    chunk_nodes = await asyncio.gather(
        *[extract_nodes(chunk.text, n_rounds) for chunk in data_chunks]
    )

    chunk_results = await asyncio.gather(
        *[
            extract_content_nodes_and_relationship_names(chunk.text, nodes, n_rounds)
            for chunk, nodes in zip(data_chunks, chunk_nodes)
        ]
    )

    updated_nodes, relationships = zip(*chunk_results)

    chunk_graphs = await asyncio.gather(
        *[
            extract_edge_triplets(chunk.text, nodes, rels, n_rounds)
            for chunk, nodes, rels in zip(data_chunks, updated_nodes, relationships)
        ]
    )

    return await integrate_chunk_graphs(data_chunks, chunk_graphs, KnowledgeGraph)
```
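The cascade runs three staged fan-outs: node candidates, then nodes plus relationship names, then edge triplets, feeding each stage's per-chunk output into the next via `zip`, and splitting paired results with `zip(*results)`. A self-contained sketch of that staging pattern with toy stage functions:

```python
import asyncio
from typing import List, Tuple


async def stage_nodes(text: str) -> List[str]:
    """Toy stage 1: propose node candidates per chunk."""
    return [text.upper()]


async def stage_nodes_and_rels(text: str, nodes: List[str]) -> Tuple[List[str], List[str]]:
    """Toy stage 2: refine nodes and collect relationship names per chunk."""
    return nodes + [text], ["mentions"]


async def cascade(chunks: List[str]) -> List[Tuple[List[str], List[str]]]:
    # Stage 1: per-chunk node candidates, order-aligned with chunks.
    chunk_nodes = await asyncio.gather(*[stage_nodes(c) for c in chunks])

    # Stage 2: each chunk is paired with its own stage-1 output via zip.
    results = await asyncio.gather(
        *[stage_nodes_and_rels(c, n) for c, n in zip(chunks, chunk_nodes)]
    )

    # zip(*results) splits the list of (nodes, rels) pairs into two aligned sequences.
    updated_nodes, relationships = zip(*results)
    return list(zip(updated_nodes, relationships))


out = asyncio.run(cascade(["a", "b"]))
```

Each stage awaits the whole batch before the next begins, so later stages always see complete, index-aligned results from earlier ones.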

evals/eval_framework/corpus_builder/corpus_builder_executor.py (+8 −12)

```diff
@@ -1,16 +1,18 @@
 import cognee
 import logging
-from typing import Optional, Tuple, List, Dict, Union, Any
+from typing import Optional, Tuple, List, Dict, Union, Any, Callable, Awaitable

 from evals.eval_framework.benchmark_adapters.benchmark_adapters import BenchmarkAdapter
-from evals.eval_framework.corpus_builder.task_getters.task_getters import TaskGetters
-from evals.eval_framework.corpus_builder.task_getters.base_task_getter import BaseTaskGetter
+from evals.eval_framework.corpus_builder.task_getters.TaskGetters import TaskGetters
+from cognee.modules.pipelines.tasks.Task import Task
 from cognee.shared.utils import setup_logging


 class CorpusBuilderExecutor:
     def __init__(
-        self, benchmark: Union[str, Any] = "Dummy", task_getter_type: str = "DEFAULT"
+        self,
+        benchmark: Union[str, Any] = "Dummy",
+        task_getter: Callable[..., Awaitable[List[Task]]] = None,
     ) -> None:
         if isinstance(benchmark, str):
             try:
@@ -23,13 +25,7 @@ def __init__(

         self.raw_corpus = None
         self.questions = None
-
-        try:
-            task_enum = TaskGetters(task_getter_type)
-        except KeyError:
-            raise ValueError(f"Invalid task getter type: {task_getter_type}")
-
-        self.task_getter: BaseTaskGetter = task_enum.getter_class()
+        self.task_getter = task_getter

     def load_corpus(self, limit: Optional[int] = None) -> Tuple[List[Dict], List[str]]:
         self.raw_corpus, self.questions = self.adapter.load_corpus(limit=limit)
@@ -48,5 +44,5 @@ async def run_cognee(self) -> None:

         await cognee.add(self.raw_corpus)

-        tasks = await self.task_getter.get_tasks()
+        tasks = await self.task_getter()
         await cognee.cognify(tasks=tasks)
```
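The executor now takes an awaitable callable instead of an enum key, so any task list (default or experimental cascade) can be injected without touching the executor. A minimal sketch of that injection pattern, with simplified names (`CorpusBuilder`, `default_tasks`, and string "tasks" are illustrative, not cognee's actual API):

```python
import asyncio
from typing import Awaitable, Callable, List


class CorpusBuilder:
    """Sketch of the new constructor contract: inject an awaitable task getter."""

    def __init__(self, task_getter: Callable[..., Awaitable[List[str]]]) -> None:
        self.task_getter = task_getter

    async def run(self) -> List[str]:
        # The executor simply awaits whatever callable was injected.
        return await self.task_getter()


async def default_tasks() -> List[str]:
    """Stand-in for a coroutine that builds the pipeline's task list."""
    return ["extract_nodes", "extract_edge_triplets"]


tasks = asyncio.run(CorpusBuilder(default_tasks).run())
```

Swapping pipelines then means passing a different coroutine function at construction time, with no enum registration step.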
