Graph consolidation Module¶
The LLM's output token limit directly constrains the order (number of entities) of the generated graphs. To overcome this limitation, we developed this module to merge several graphs produced by the knowledge graph generation module into a single graph, thereby increasing the order of the resulting merged graph.
The graph consolidation module processes sets of synthetic knowledge graphs in Turtle format, merging them by exact matching on entity names. It identifies overlapping entities (homonyms) and creates several merged graphs of different densities through a sequential homonym renaming process. The results range from a merged graph in which no homonym is renamed to one in which all homonyms are renamed, which corresponds to the juxtaposition of the graphs generated by the LLM.
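Exact matching on entity names comes down to counting, for each name, the number of files in which it appears. The sketch below illustrates that idea only. It assumes the entity names have already been extracted from each Turtle file, and the function names and sample data are hypothetical, not the module's actual code.

```python
from collections import Counter


def count_entity_occurrences(entities_per_file):
    """Map each entity name to the number of files in which it appears at least once."""
    occurrences = Counter()
    for entities in entities_per_file.values():
        # set() counts an entity once per file, however often it is used there
        occurrences.update(set(entities))
    return dict(occurrences)


def find_homonyms(occurrences):
    """Keep only the entities that appear in more than one file."""
    return {name: count for name, count in occurrences.items() if count > 1}


# Hypothetical input: entity names already extracted from three Turtle files
entities_per_file = {
    "graph_1.ttl": ["ex:Alice", "ex:Paris", "ex:Alice"],
    "graph_2.ttl": ["ex:Alice", "ex:Berlin"],
    "graph_3.ttl": ["ex:Alice", "ex:Paris"],
}

occurrences = count_entity_occurrences(entities_per_file)
homonyms = find_homonyms(occurrences)
```

Here `ex:Alice` and `ex:Paris` are homonyms because they appear in several files, while `ex:Berlin` appears in only one file and is left alone.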
Main Components:¶
- merge_ttl.py : Main script that launches the consolidation process
- utils_merge/utils.py : Supporting utilities for detecting homonymous entities, managing files and folders, sequentially renaming homonyms to build sets of graphs with different numbers of nodes, merging sets of graphs, and managing prefixes in the merged graphs.
  - List of functions :
    - build_merged_folder_paths_and_files(path_files)
    - manage_prefix(path_merged)
    - find_homonymes_nodes(path,logger_homonymes,ontology)
    - rename_and_merge(path_homonyme_treated,path_merged,homonymes_nodes_and_occurence,nbr_homonyme_max,logger_merge)
Features¶
- Duplicate Node Detection: Automatically identifies homonymous nodes across multiple TTL files
- Intelligent Merging: Combines graphs while preserving semantic relationships
- Prefix Management: Handles RDF namespace prefixes during the merge process
- Validation: Verifies TTL syntax validity of merged outputs
- Flexible Node Density: Supports different graph densities based on homonym occurrence thresholds
- Comprehensive Logging: Detailed logging for merge operations, homonym detection, and validation
Merged files are automatically checked for syntax errors by Turtle Validator. Each file that passes validation is stored in a "merged" folder beside the LLM-generated graphs.
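Concatenating Turtle files repeats their `@prefix` declarations throughout the merged text, which is why the prefix management feature is needed. The sketch below shows one possible clean-up, assuming each declaration sits on its own line and uses the `@prefix` form. The function `consolidate_prefixes` is a hypothetical illustration, not the module's `manage_prefix`.

```python
def consolidate_prefixes(turtle_text):
    """Move @prefix lines to the top of the text and drop exact duplicates."""
    prefixes = []
    seen = set()
    body = []
    for line in turtle_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("@prefix"):
            # Normalise whitespace so that differently spaced duplicates match
            key = " ".join(stripped.split())
            if key not in seen:
                seen.add(key)
                prefixes.append(stripped)
        else:
            body.append(line)
    return "\n".join(prefixes + body) + "\n"


# Hypothetical result of concatenating two small Turtle files
merged = (
    "@prefix ex: <http://example.org/> .\n"
    "ex:Alice ex:knows ex:Bob .\n"
    "@prefix ex: <http://example.org/> .\n"
    "@prefix foaf: <http://xmlns.com/foaf/0.1/> .\n"
    'ex:Bob foaf:name "Bob" .\n'
)
cleaned = consolidate_prefixes(merged)
```

This sketch does not handle two files that bind the same prefix label to different namespaces. Moving both declarations to the top would change the meaning of the triples, so a real implementation would need to detect that case.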
Process Workflow¶
- Path Setup: Creates necessary output directories for merged files and logs
- Homonym Detection: Scans all TTL files to identify duplicate node names
- Occurrence Counting: Counts how many files contain each homonymous node
- Merge Strategy: Applies renaming strategy based on occurrence thresholds
- File Merging: Combines processed TTL files into unified graphs
- Prefix Management: Cleans up RDF namespace prefixes
- Validation: Validates merged TTL syntax and moves invalid files
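The merge strategy step can be read as a loop over occurrence thresholds: for each threshold, an entity keeps its original name in at most that many files and is renamed in the others. The sketch below only computes which files would need a rename at each threshold. The function names, the choice of which files keep the original name (the first ones in alphabetical order), and the sample data are assumptions made for illustration.

```python
def plan_renaming(files_per_entity, max_occ_for_file_to_merge):
    """List, per entity, the files in which it would be renamed so that its
    original name survives in at most max_occ_for_file_to_merge files."""
    plan = {}
    for entity, files in files_per_entity.items():
        # Assumption: the first files in alphabetical order keep the original name
        to_rename = sorted(files)[max_occ_for_file_to_merge:]
        if to_rename:
            plan[entity] = to_rename
    return plan


def renaming_plans(files_per_entity):
    """Build one renaming plan per threshold, from 1 up to the maximum occurrence."""
    max_occurrence = max((len(files) for files in files_per_entity.values()), default=0)
    return {
        threshold: plan_renaming(files_per_entity, threshold)
        for threshold in range(1, max_occurrence + 1)
    }


# Hypothetical detection result: the files in which each entity appears
files_per_entity = {
    "ex:Alice": {"graph_1.ttl", "graph_2.ttl", "graph_3.ttl"},
    "ex:Paris": {"graph_1.ttl", "graph_3.ttl"},
    "ex:Berlin": {"graph_2.ttl"},
}
plans = renaming_plans(files_per_entity)
```

With threshold 1 every homonym is renamed in all but one file, which amounts to juxtaposing the graphs. With the highest threshold the plan is empty, so nothing is renamed and the graphs are merged on all shared names.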
Error Handling¶
- Syntax Errors: Invalid TTL files are moved to Invalid_Turtle_Syntax_for_merged_graphs/
- Missing Files: Graceful handling of missing input files
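A minimal sketch of these two behaviours follows. It assumes the invalid-syntax folder is created inside the folder that holds the merged files, and it takes the syntax check as a callable because the validator's interface is not described here. Both helper names are hypothetical.

```python
import logging
import shutil
from pathlib import Path

INVALID_DIR_NAME = "Invalid_Turtle_Syntax_for_merged_graphs"


def quarantine_invalid_files(merged_dir, is_valid_turtle):
    """Move every .ttl file rejected by is_valid_turtle into the invalid folder."""
    merged_dir = Path(merged_dir)
    # Assumption: the invalid folder lives inside the merged folder
    invalid_dir = merged_dir / INVALID_DIR_NAME
    moved = []
    for ttl_file in sorted(merged_dir.glob("*.ttl")):
        if not is_valid_turtle(ttl_file):
            invalid_dir.mkdir(exist_ok=True)
            shutil.move(str(ttl_file), str(invalid_dir / ttl_file.name))
            moved.append(ttl_file.name)
    return moved


def existing_files(paths, logger=logging.getLogger(__name__)):
    """Return the paths that exist, logging a warning for each missing one."""
    found = []
    for path in map(Path, paths):
        if path.is_file():
            found.append(path)
        else:
            logger.warning("Skipping missing input file: %s", path)
    return found
```

Passing the validator in as a function keeps the file handling independent of the tool used to check the syntax.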
Logging¶
Three separate log files track different aspects:
- Merge Log: Overall merge process and statistics
- Homonyms Log: Duplicate node detection details
- Validation Log: TTL syntax checking results
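One way to keep three independent log files is to give each aspect its own named logger with a dedicated file handler, using Python's standard `logging` module. The sketch below is an illustration under assumptions: the logger names, log file names, and message format are hypothetical.

```python
import logging
from pathlib import Path


def build_logger(name, log_file):
    """Create a logger that writes only to its own file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.FileHandler(log_file, encoding="utf-8")
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    # Keep messages out of the root logger so each file stays separate
    logger.propagate = False
    return logger


def build_loggers(log_dir):
    """Return one logger per aspect: merge, homonyms, validation."""
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    return {
        name: build_logger(f"consolidation.{name}", log_dir / f"{name}.log")
        for name in ("merge", "homonyms", "validation")
    }
```

A call such as `build_loggers("logs")["homonyms"].info("2 homonymous nodes found")` would then write only to `logs/homonyms.log`.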
Knowledge graph consolidation module¶
graph TD
A@{ shape: docs, label: "Initial set of valid graphs (Turtle files) produced by the knowledge graph generation module"} --> B:::highlight
B["Build the list of key:value pairs [entity:occurrence], where occurrence is the number of files in which an entity appears at least once"] --> C["Retrieve the 'Max occurrence' in the list of key:value pairs"]:::highlight
C --> Cbis["Set to 1 the value of a local variable (max_occ_for_file_to_merge) that represents the maximum number of files in which an entity may appear before merging"]:::highlight
Cbis --> D{{"while max_occ_for_file_to_merge is different from 'Max occurrence' + 1"}}:::highlight
D --> |max_occ_for_file_to_merge='Max occurrence' +1|Dbis["End of the consolidation process"]
D --> E["Based on the initial set of valid graphs, generate a new set of graphs in which every entity whose occurrence exceeds max_occ_for_file_to_merge is renamed in enough files to bring its occurrence down to max_occ_for_file_to_merge"]:::highlight
E --> F@{ shape: docs, label: "New sets of graphs (Turtle files) based on the value of max_occ_for_file_to_merge"}
E --> G["Increase the value of max_occ_for_file_to_merge by one"]:::highlight
G --> D
F --> H["Merge each new set of graphs (files concatenation)"]:::highlight
H --> I["Move all prefixes to the top of the merged file and remove the duplicate prefixes generated by the concatenation"]:::highlight
I --> J{{Check Turtle syntax}}:::highlight
J -->|Turtle syntax ok| K@{ shape: lin-doc, label: "Merged graph with specific number of entities and density derived from the merge process" }
J -->|Turtle syntax not ok|L[Discard the graph]
classDef largeBox min-width:300px,min-height:1000px
class B largeBox
classDef highlight fill:#f9d423,stroke:#333,stroke-width:4px;