KG Generation#
KG generation creates synthetic Knowledge Graph instances from extracted ontology metadata. Entities are created, assigned types, and connected with triples that respect ontology constraints.
Why This Page is Detailed
The KG generator is the core of PyGraft-gen's architecture — it's what enables generation at scales previously impractical (1M+ entities, 10M+ triples). This page is intentionally exhaustive because understanding these internals is essential for troubleshooting, performance tuning, and future extensions.
On this page:
- Overview — What KG generation does
- How Constraints Are Enforced — The rules PyGraft-gen respects
- The Generation Pipeline — Four phases at a glance
- Algorithm and Complexity — Deep dive into how it works
- Performance Characteristics — Runtime and memory expectations
- Configuration Parameters — Tuning generation behavior
- Fast Generation Mode — Speed optimization for large KGs
- FAQ — Common questions and troubleshooting
- Limitations — Known constraints
Overview#
KG generation takes the three metadata files from ontology extraction and produces a Knowledge Graph:
flowchart LR
CI[class_info.json]
RI[relation_info.json]
NI[namespaces_info.json]
subgraph G[KG Generator]
direction TB
E[Create<br/>Entities]
T[Assign<br/>Types]
TR[Generate<br/>Triples]
end
KG[Knowledge Graph]
INFO[kg_info.json]
CI --> E
RI --> TR
NI --> KG
E --> T
T --> TR
TR --> KG
TR --> INFO
style CI fill:#eee,stroke:#666,stroke-width:2px
style RI fill:#eee,stroke:#666,stroke-width:2px
style NI fill:#eee,stroke:#666,stroke-width:2px
style E fill:#f8c9c9,stroke:#c55,stroke-width:2px
style T fill:#f8c9c9,stroke:#c55,stroke-width:2px
style TR fill:#f8c9c9,stroke:#c55,stroke-width:2px
style KG fill:#eee,stroke:#666,stroke-width:2px
style INFO fill:#eee,stroke:#666,stroke-width:2px
Output files:
- `kg.{ttl|rdf|nt}` — The complete Knowledge Graph in your chosen RDF format
- `kg_info.json` — Statistics and parameters documenting what was generated
Extraction Scope
The generator enforces only the constraints present in the extracted metadata; constructs that weren't captured during extraction cannot be enforced.
See What's Supported
How Constraints Are Enforced#
During generation, the generator enforces the constraints extracted from the ontology: every generated triple satisfies the schema's explicit constraints.
Detailed Explanations
For detailed explanations with examples, see OWL Constraints.
Enforced Constraints#
Class Constraints:
| Constraint | Property | Enforcement |
|---|---|---|
| Hierarchy | `rdfs:subClassOf` | Entities inherit all superclasses |
| Disjointness | `owl:disjointWith` | Entities cannot have disjoint types |
Property Characteristics:
| Constraint | Property | Enforcement |
|---|---|---|
| Functional | `owl:FunctionalProperty` | At most one outgoing edge per subject |
| Inverse functional | `owl:InverseFunctionalProperty` | At most one incoming edge per object |
| Symmetric | `owl:SymmetricProperty` | Stores only one direction per entity pair |
| Asymmetric | `owl:AsymmetricProperty` | Rejects if reverse edge exists |
| Transitive | `owl:TransitiveProperty` | Prevents cycles with irreflexive closure |
| Irreflexive | `owl:IrreflexiveProperty` | No self-loops |
| Reflexive | `owl:ReflexiveProperty` | Not materialized (reasoners infer) |
Relational Constraints:
| Constraint | Property | Enforcement |
|---|---|---|
| Domain/range | `rdfs:domain` / `rdfs:range` | Triples respect class restrictions |
| Property disjointness | `owl:propertyDisjointWith` | Entity pairs cannot use disjoint relations |
| Inverse relationships | `owl:inverseOf` | Validates that inverse triples would be valid |
| Subproperty inheritance | `rdfs:subPropertyOf` | Constraints inherited from superproperties |
Forbidden Characteristic Combinations#
Some characteristic combinations are logically inconsistent and prohibited by OWL 2:
Direct Contradictions (Generation Stops):
| Combination | Why Forbidden |
|---|---|
| Reflexive + Irreflexive | Reflexive requires self-loops; Irreflexive forbids them |
| Symmetric + Asymmetric | Symmetric requires bidirectional; Asymmetric forbids it |
Problematic Combinations (Warnings, Relation Excluded):
| Combination | Issue |
|---|---|
| Asymmetric + Functional | Creates inconsistency in OWL 2 reasoning |
| Asymmetric + InverseFunctional | Creates inconsistency in OWL 2 reasoning |
| Transitive + Functional | Can lead to unintended inference chains and explosions |
When detected during schema loading, the generator either stops (SEVERE errors) or excludes the problematic relation (WARNING errors).
The Generation Pipeline#
Generation happens in four sequential phases, each building on the previous one.
Schema Loading#
The first phase prepares everything needed for efficient generation by converting metadata into optimized internal structures.
ID mappings:
Convert class/relation names to integer IDs and create bidirectional lookups (string ↔ ID). This enables fast array-based operations throughout generation.
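A minimal sketch of what these bidirectional mappings amount to (the variable names are illustrative, not the actual internal attributes):

```python
# Illustrative sketch of the string <-> integer ID mappings used during generation;
# the actual attribute names inside PyGraft-gen may differ.
class_names = ["Person", "Student", "Professor", "Organization"]

class2id = {name: idx for idx, name in enumerate(class_names)}  # string -> int
id2class = {idx: name for name, idx in class2id.items()}        # int -> string

# Integer IDs allow compact NumPy arrays instead of string comparisons.
assert class2id["Student"] == 1
assert id2class[1] == "Student"
```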
Constraint caches (pre-computed):
All constraint data is computed once upfront to avoid repeated lookups during generation:
- Domain/range entity pools for each relation
- Disjoint envelopes (classes disjoint with each relation's domain/range)
- Transitive superclass closures for all classes
- Property characteristics sets (functional, symmetric, transitive, etc.)
- Inverse mappings and subproperty chains
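As one concrete example, the transitive superclass closures could be pre-computed with a simple upward walk over `rdfs:subClassOf` links. The sketch below uses hypothetical variable names and is not the actual implementation:

```python
# Sketch: pre-compute transitive superclass closures from a
# child -> direct parents mapping (hypothetical data shape).
direct_parents = {
    "GraduateStudent": {"Student"},
    "Student": {"Person"},
    "Professor": {"Person"},
    "Person": set(),
}

def superclass_closure(cls, parents):
    """Return all direct and indirect superclasses of cls."""
    closure, stack = set(), list(parents.get(cls, ()))
    while stack:
        parent = stack.pop()
        if parent not in closure:
            closure.add(parent)
            stack.extend(parents.get(parent, ()))
    return closure

closures = {c: superclass_closure(c, direct_parents) for c in direct_parents}
# closures["GraduateStudent"] == {"Student", "Person"}
```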
Schema validation:
Checks for logical contradictions before generation begins:
- SEVERE errors stop generation immediately (e.g., reflexive + irreflexive, symmetric + asymmetric)
- WARNING errors exclude problematic relations (e.g., transitive + functional, asymmetric + functional)
Entity Creation and Typing#
With the schema loaded, entities are created and assigned types according to the class hierarchy.
Entity creation:
- Allocates entity ID space (E1, E2, ..., En)
- Splits entities into typed vs untyped based on `prop_untyped_entities`
Type assignment:
Typed entities receive classes sampled from the hierarchy:
- Uses power-law distribution (mimics real-world data where a few classes are very common and most are rare)
- Target depth controlled by `avg_specific_class_depth`
- Transitive superclasses added automatically
- Optional multityping adds additional most-specific classes
Disjointness enforcement:
If an entity receives disjoint classes, one is removed deterministically to maintain logical consistency.
Reverse indices:
Builds lookup tables for the next phase:
- `class2entities[Person] = [E1, E5, E12, ...]` — All entities of each class
- `class2unseen[Person] = [E5, E12, ...]` — Entities not yet used in triples
These enable fast candidate pool construction during triple generation.
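A conceptual sketch of how such reverse indices can be built in a single pass (the variable shapes are assumptions, not the actual internals):

```python
from collections import defaultdict

# Hypothetical mapping: entity ID -> set of assigned classes.
entity2classes = {0: {"Person", "Student"}, 1: {"Organization"}, 2: {"Person"}}

class2entities = defaultdict(list)
for entity, classes in entity2classes.items():
    for cls in classes:
        class2entities[cls].append(entity)      # all entities of each class

# class2unseen starts as a copy and shrinks as entities are used in triples.
class2unseen = {cls: set(ents) for cls, ents in class2entities.items()}
```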
Triple Generation#
The core phase where relational triples are created using batch sampling with constraint filtering.
Setup Phase (One-Time)#
Before generating any triples, the system prepares relation-specific data:
Build candidate entity pools per relation:
- Domain pool: Entities satisfying ALL domain classes (intersection)
- Range pool: Entities satisfying ALL range classes (intersection)
- Relations with empty pools are excluded from generation
Distribute triple budget across relations:
- Controlled by `relation_usage_uniformity`
- 0.0 = power-law distribution (few relations dominate)
- 1.0 = uniform distribution (all relations equal)
Initialize tracking structures:
- Duplicate detection: `seen_pairs[relation] = {(h1,t1), (h2,t2), ...}`
- Functional constraints: `functional_heads[relation] = {h1, h2, ...}`
- Inverse-functional: `invfunctional_tails[relation] = {t1, t2, ...}`
Generation Loop#
Sample batches of candidate triples and filter them through two phases:
Fast Filtering (Batch)
Applied to all candidates using vectorized NumPy operations. Checks that don't need current graph state:
- Irreflexive: `head != tail`
- Duplicate: `(head, tail)` not already generated
- Functional: `head` not already used for this relation
- Inverse-functional: `tail` not already used for this relation
- Asymmetric: reverse edge doesn't exist
- Symmetric duplicates: reverse direction not already generated
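A simplified sketch of how the fast-filter pass over a batch could look with NumPy. The array shapes and set names below are assumptions based on the tracking structures described earlier, not the exact internal code:

```python
import numpy as np

def fast_filter(heads, tails, seen_pairs, functional_heads, invfunctional_tails):
    """Boolean mask over a candidate batch; True = survives to deep validation.

    heads/tails: integer NumPy arrays of equal length (one candidate per index).
    The set arguments mirror the tracking structures described above (sketch only).
    """
    keep = heads != tails                                        # irreflexive
    keep &= np.fromiter(
        ((h, t) not in seen_pairs for h, t in zip(heads, tails)),
        dtype=bool, count=len(heads),
    )                                                            # duplicates
    keep &= ~np.isin(heads, list(functional_heads))              # functional
    keep &= ~np.isin(tails, list(invfunctional_tails))           # inverse-functional
    return keep

# Example: one self-loop, one duplicate, one functional clash, one valid candidate.
heads = np.array([1, 2, 3, 4])
tails = np.array([1, 5, 6, 7])
print(fast_filter(heads, tails, seen_pairs={(2, 5)}, functional_heads={3},
                  invfunctional_tails=set()))
# -> [False False False  True]
```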
Deep Validation (Per-Triple)
Applied during generation as each triple is added. These checks require type information or graph traversal:
- Domain/range typing: Entity types satisfy all constraints
- Disjoint envelopes: Entities not instances of classes disjoint with domain/range
- Inverse validation: Inverse triple would be valid
- Property disjointness: Entity pair is not already connected by a disjoint relation
- Transitive cycles: No reflexive cycles created
- Subproperty inheritance: Constraints from superproperties satisfied
Adaptive Mechanisms#
The generator adjusts dynamically to maintain efficiency:
- Oversample multiplier: Constrained relations (functional, inverse-functional) oversample 4x to compensate for higher rejection rates
- Batch sizing: Increases dynamically as valid candidates become scarce
- Stall detection: Relations producing 20 consecutive empty batches are dropped (candidate pool exhausted)
- Weight recomputation: Redistributes budget from dropped relations to active ones
Target vs Actual
`num_triples` is a target. Actual count may be lower due to constraint exhaustion and is reported in `kg_info.json`.
Serialization#
The final phase writes the generated Knowledge Graph to disk with proper formatting.
RDF graph (kg.ttl, .rdf, or .nt):
- Instance triples (head, relation, tail)
- Type assertions for most-specific classes only (reasoners infer superclasses)
- Namespace bindings from `namespaces_info.json`
Statistics (kg_info.json):
- User parameters (requested counts, configuration)
- Actual statistics (generated counts, averages, proportions)
Algorithm and Complexity#
This section provides a detailed look at how the generation algorithm works and its computational complexity. Understanding this helps you predict performance and troubleshoot slow generation.
Notation
Throughout this section, we use shorthand notation for readability. Here is what each symbol means:
| Symbol | Full name | Source |
|---|---|---|
| $n_\text{entities}$ | Number of entities | num_entities config parameter |
| $n_\text{triples}$ | Number of triples | num_triples config parameter |
| $n_\text{classes}$ | Number of classes | From the schema (class_info.json) |
| $n_\text{relations}$ | Number of relations | From the schema (relation_info.json) |
| $\text{avg_depth}$ | Average class hierarchy depth | Depends on schema structure |
| $\text{avg_types}$ | Average types per entity | When multityping is enabled |
| $\text{batch}$ | Batch size | 1K to 100K depending on KG scale |
We express complexity using Big-O notation. For example, $O(n_\text{entities})$ means the operation's runtime grows linearly with the number of entities.
Design Philosophy#
PyGraft-gen uses an integer-ID model with batch sampling and two-phase filtering to achieve scalability:
- **Integer-based identifiers**: Entities, classes, and relations use integer IDs internally. Strings only appear during serialization. This eliminates string overhead and enables efficient NumPy array operations.
- **Pre-computed constraint caches**: Domain/range pools, disjointness sets, and property characteristics computed once before generation. No repeated set intersections or dictionary lookups during sampling.
- **Batch sampling with vectorized filtering**: Sample large batches of candidate triples, then apply constraints in two phases — fast filtering eliminates most invalid candidates before expensive deep validation.
- **Incremental constraint tracking**: Functional and inverse-functional properties maintain sets of used heads/tails. Constraint checks become constant-time set membership tests, i.e., $O(1)$, instead of scanning all existing triples, which would cost $O(n_\text{triples})$.
Performance Impact
- Naive approach: 10M triples = 10M sampling operations, each with $O(n_\text{triples})$ validation cost
- PyGraft-gen: 10M triples ≈ 1,000 batch operations with $O(1)$ constraint checks
- Result: Hours reduced to minutes for large-scale generation
Schema Loading#
Schema loading converts JSON metadata into optimized internal structures.
flowchart LR
A[Load JSON files] --> B[Build ID mappings]
B --> C[Initialize caches]
C --> D[Validate]
D --> E[Compute envelopes]
Complexity breakdown:
| Step | Operation | Complexity |
|---|---|---|
| Load JSON | Parse class_info, relation_info | Linear in schema size: $O(n_\text{classes} + n_\text{relations})$ |
| ID mappings | Build bidirectional dictionaries | Linear in schema size: $O(n_\text{classes} + n_\text{relations})$ |
| Class caches | Layer mappings, transitive closures | Linear in classes times hierarchy depth: $O(n_\text{classes} \times \text{avg_depth})$ |
| Relation caches | Domain/range sets, property characteristics | Linear in relations: $O(n_\text{relations})$ |
| Validation | Check forbidden combinations | Linear in relations: $O(n_\text{relations})$ |
| Disjoint envelopes | Union disjoint classes per relation | Worst case relations times classes: $O(n_\text{relations} \times n_\text{classes})$ |
Total Complexity
$O(n_\text{classes} \times \text{avg_depth} + n_\text{relations} \times n_\text{classes})$ in the worst case, but typically much faster because disjointness declarations are sparse in real ontologies.
Entity Typing#
Entity typing assigns classes to entities while respecting hierarchy and disjointness constraints.
flowchart LR
A[Initialize space] --> B[Assign most-specific]
B --> C[Add multitypes]
C --> D[Compute transitive]
D --> E[Resolve conflicts]
Complexity breakdown:
| Step | Operation | Complexity |
|---|---|---|
| Initialize | Allocate arrays, select typed subset | Linear in entities: $O(n_\text{entities})$ |
| Most-specific assignment | Power-law sampling per entity | Linear in entities: $O(n_\text{entities})$ |
| Multityping | Add additional classes per entity | Linear in entities times avg types: $O(n_\text{entities} \times \text{avg_types})$ |
| Transitive closure | Add superclasses for each specific class | Linear in entities times hierarchy depth: $O(n_\text{entities} \times \text{avg_depth})$ |
| Conflict resolution | Check and repair disjoint violations | Linear in entities times avg types squared: $O(n_\text{entities} \times \text{avg_types}^2)$ |
| Profile replication (fast mode) | Copy profiles round-robin | Linear in entities: $O(n_\text{entities})$ |
Total Complexity
$O(n_\text{entities} \times (\text{avg_depth} + \text{avg_types}^2))$, dominated by transitive closure computation and disjoint conflict resolution.
Triple Generation#
Triple generation is the performance-critical phase. It uses a batch sampling pipeline with two-phase filtering.
Setup (One-Time)#
Before the main loop, the generator prepares per-relation data structures:
| Step | Operation | Complexity |
|---|---|---|
| Class-entity index | Build reverse mapping | Linear in entities times avg types: $O(n_\text{entities} \times \text{avg_types})$ |
| Candidate pools | Intersect entities per relation | Worst case relations times entities: $O(n_\text{relations} \times n_\text{entities})$ |
| Budget distribution | Compute weights and quotas | Linear in relations: $O(n_\text{relations})$ |
| Tracking init | Create empty sets | Linear in relations: $O(n_\text{relations})$ |
Setup Total Complexity
$O(n_\text{relations} \times n_\text{entities})$ in the worst case, though typically faster when domain/range constraints are selective and filter out most entities.
Main Loop#
The generation loop runs until the target triple count is reached:
flowchart LR
S[Sample relation] --> B[Sample batch]
B --> F[Fast filter]
F --> D[Deep validate]
D --> A[Accept valid]
A --> U[Update state]
U --> S
Each iteration processes a batch of candidates (typically 1K-100K depending on KG size):
| Step | Operation | Complexity per batch |
|---|---|---|
| Sample relation | Weighted random choice | Constant time: $O(1)$ |
| Sample candidates | Random from pools with freshness bias | Linear in batch size: $O(\text{batch})$ |
| Fast filtering | Vectorized constraint masks | Linear in batch size: $O(\text{batch})$ |
| Deep validation | Per-survivor constraint checks | Survivors times validation cost: $O(\text{survivors} \times V)$ |
| Accept and record | Update tracking structures | Linear in accepted count: $O(\text{accepted})$ |
Number of iterations: In the best case, approximately $\left\lceil \frac{n_{\text{triples}}}{\text{batch}} \right\rceil$ iterations. More iterations are needed when rejection rates are high due to tight constraints.
Loop Total Complexity
$O(n_\text{triples} \times V)$ where $V$ is the average validation cost per triple. For most constraints, $V$ is constant, i.e., $O(1)$. The exception is transitive cycle detection, which requires graph traversal.
Two-Phase Filtering Details#
The two-phase approach separates cheap batch operations from expensive per-triple checks:
**Fast filtering (batch):** Applied to the entire batch using vectorized NumPy operations. Each check runs in linear time over the batch with constant-time lookups:
| Constraint | Check | Cost per candidate |
|---|---|---|
| Irreflexive | `head != tail` | Constant time via vectorized comparison: $O(1)$ |
| Duplicate | `(h,t) in seen_pairs` | Constant time via hash set lookup: $O(1)$ |
| Functional | `h in functional_heads` | Constant time via hash set lookup: $O(1)$ |
| Inverse-functional | `t in invfunctional_tails` | Constant time via hash set lookup: $O(1)$ |
| Asymmetric | `(t,h) in kg_pairs` | Constant time via hash set lookup: $O(1)$ |
| Symmetric duplicate | `(t,h) in seen_pairs` | Constant time via hash set lookup: $O(1)$ |
Fast filtering catches most constraint violations before deep validation begins, ensuring expensive checks only run on candidates that pass the cheap tests.
**Deep validation (per-triple):** Applied to each survivor individually. Most checks are constant-time thanks to pre-computed caches:
| Constraint | Check | Cost per triple |
|---|---|---|
| Domain/range typing | Entity has required classes | Constant time via set membership: $O(1)$ |
| Disjoint envelope | Entity not in forbidden classes | Constant time via set intersection: $O(1)$ |
| Inverse validation | Inverse triple would be valid | Constant time via recursive check: $O(1)$ |
| Property disjointness | No conflict with disjoint relations | Linear in disjoint set size: $O(\vert\text{disjoints}\vert)$ |
| Subproperty inheritance | Super-property constraints satisfied | Linear in super-property count: $O(\vert\text{supers}\vert)$ |
| Transitive cycle | No path from tail to head | BFS traversal of relation subgraph: $O(V + E)$ |
Most constraints are constant-time, i.e., $O(1)$, due to pre-computed caches. The exception is transitive cycle detection, which requires graph traversal and can be expensive for dense transitive relations.
Transitive Cycle Detection#
For relations that are both transitive and (irreflexive or asymmetric), adding a triple $(h, r, t)$ could create a cycle in the transitive closure. The generator uses breadth-first search (BFS) to detect this:
flowchart TD
A[Start BFS from t] --> B[Follow existing edges for relation r]
B --> C{Reached h?}
C -->|Yes| D[Reject triple - cycle detected]
C -->|No more nodes| E[Accept triple]
Cycle Detection Complexity
The cost is proportional to the number of vertices and edges reachable from $t$ in the relation's subgraph, i.e., $O(V + E)$ where $V$ is vertices visited and $E$ is edges traversed. This is typically much smaller than the full graph because only edges of one specific relation are considered, the search stops as soon as $h$ is found, and most transitive relations have sparse connectivity.
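A minimal sketch of the idea: before accepting $(h, r, t)$, search forward from $t$ along existing edges of $r$ and reject the triple if $h$ is reachable. The adjacency structure below is assumed for illustration:

```python
from collections import deque

def creates_cycle(head, tail, adjacency):
    """Return True if adding (head -> tail) would close a cycle for this relation.

    adjacency maps node -> set of successors for the *same* relation (sketch only).
    """
    queue, visited = deque([tail]), {tail}
    while queue:
        node = queue.popleft()
        if node == head:              # a path tail -> ... -> head already exists,
            return True               # so adding head -> tail would create a cycle
        for nxt in adjacency.get(node, ()):
            if nxt not in visited:
                visited.add(nxt)
                queue.append(nxt)
    return False
```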
Incremental Tracking#
Instead of scanning all existing triples to check constraints (which would cost linear time in the number of triples, i.e., $O(n_\text{triples})$), the generator maintains incremental tracking structures:
| Structure | Purpose | Update cost | Query cost |
|---|---|---|---|
| `seen_pairs[r]` | Duplicate detection | Constant time to add: $O(1)$ | Constant time lookup: $O(1)$ |
| `functional_heads[r]` | Functional property | Constant time to add: $O(1)$ | Constant time lookup: $O(1)$ |
| `invfunctional_tails[r]` | Inverse-functional | Constant time to add: $O(1)$ | Constant time lookup: $O(1)$ |
| `transitive_adjacency[r]` | Cycle detection | Constant time to add edge: $O(1)$ | BFS traversal: $O(V + E)$ |
Key Optimization
This is the key optimization that makes PyGraft-gen scale. In naive implementations, checking functional constraints requires scanning all existing triples for that relation, costing $O(n_\text{triples})$ per check. With incremental tracking, the same check costs constant time, i.e., $O(1)$.
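The bookkeeping pattern is simple set membership plus set insertion. A minimal sketch for a single relation, with names mirroring the table above but not the actual internals:

```python
# Sketch of incremental constraint tracking for one relation r.
seen_pairs = set()          # all (head, tail) pairs already emitted for r
functional_heads = set()    # heads already used, if r is functional

def try_accept(head, tail, is_functional):
    if (head, tail) in seen_pairs:                    # O(1) duplicate check
        return False
    if is_functional and head in functional_heads:    # O(1) functional check
        return False
    seen_pairs.add((head, tail))                      # O(1) updates
    if is_functional:
        functional_heads.add(head)
    return True
```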
Serialization#
Serialization writes the generated KG to disk:
| Step | Operation | Complexity |
|---|---|---|
| Build RDF graph | Add triples to RDFLib | Linear in triples: $O(n_\text{triples})$ |
| Add type assertions | One per entity-class pair | Linear in entities times types: $O(n_\text{entities} \times \text{avg_types})$ |
| Serialize | Write to file | Linear in output size: $O(n_\text{triples} + n_\text{entities})$ |
| Statistics | Compute and write kg_info | Linear in triples: $O(n_\text{triples})$ |
Total Complexity
$O(n_\text{triples} + n_\text{entities} \times \text{avg_types})$, dominated by type assertion generation when multityping is enabled. Linear in output size.
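Serialization maps closely onto standard RDFLib usage. A minimal sketch with an invented namespace and facts, for illustration only:

```python
from rdflib import Graph, Namespace, RDF

# Hypothetical namespace and a couple of generated facts, purely for illustration.
EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)                             # namespace binding

g.add((EX.E1, RDF.type, EX.Student))         # most-specific type assertion
g.add((EX.E1, EX.advisedBy, EX.E2))          # instance triple (head, relation, tail)

g.serialize(destination="kg.ttl", format="turtle")
```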
Overall Complexity Summary#
| Phase | Complexity | What dominates |
|---|---|---|
| Schema loading | $O(n_\text{classes} \times \text{avg_depth} + n_\text{relations} \times n_\text{classes})$ | Disjoint envelope computation |
| Entity typing | $O(n_\text{entities} \times (\text{avg_depth} + \text{avg_types}^2))$ | Transitive closure, conflict resolution |
| Triple generation setup | $O(n_\text{relations} \times n_\text{entities})$ | Candidate pool construction |
| Triple generation loop | $O(n_\text{triples} \times V)$ | Validation cost $V$ per triple |
| Serialization | $O(n_\text{triples} + n_\text{entities} \times \text{avg_types})$ | Linear scan of outputs |
In practice, the triple generation loop dominates runtime for large KGs. The validation cost $V$ is constant, i.e., $O(1)$, for most constraints, but becomes a graph traversal cost, i.e., $O(V + E)$, for transitive relations with cycle detection.
Performance Characteristics#
Understanding performance helps you plan generation runs and troubleshoot issues.
Constraint Impact on Performance#
Some constraints are more expensive to validate than others:
| Constraint Type | Cost | Complexity | Notes |
|---|---|---|---|
| Irreflexive | Low | Constant: $O(1)$ | Simple equality check |
| Functional | Low | Constant: $O(1)$ | Set membership via incremental tracking |
| Inverse-functional | Low | Constant: $O(1)$ | Set membership via incremental tracking |
| Domain/Range | Low | Constant: $O(1)$ | Pre-filtered pools |
| Symmetric | Low | Constant: $O(1)$ | Duplicate check in seen_pairs |
| Asymmetric | Medium | Constant: $O(1)$ | Reverse edge lookup |
| Disjointness | Medium | Linear in disjoint set: $O(\vert\text{disjoints}\vert)$ | Intersection with disjoint set |
| Inverse validation | Medium | Constant: $O(1)$ | Recursive validation of inverse |
| Subproperty | Medium | Linear in supers: $O(\vert\text{supers}\vert)$ | Check inherited constraints |
| Transitive + Irreflexive | High | BFS traversal: $O(V + E)$ | Cycle detection in relation subgraph |
Scale Expectations#
Relative Scale Expectations
- Small (1K-10K entities): Seconds
- Medium (10K-100K entities): Minutes
- Large (100K-1M entities): Tens of minutes
- Very large (1M+ entities): Hours
Performance Factors
What affects speed:
- Hardware (CPU, RAM)
- Schema complexity (number of constraints, hierarchy depth)
- Configuration (`relation_usage_uniformity`, `enable_fast_generation`)
- Constraint density (how many relations have expensive properties like transitivity)
What Slows Generation Down
- Transitive cycle detection: Requires BFS graph traversal costing $O(V + E)$ for each candidate triple
- Deep validation: Each surviving candidate requires type checking against current state
- Inverse relation validation: Must validate both the triple and its inverse
- Small candidate pools: Tight constraints mean more sampling attempts per accepted triple
Memory Usage#
Memory requirements scale with graph size and schema complexity.
Memory components:
- Entity structures: Linear in entities, i.e., $O(n_\text{entities})$
- Triple storage: Linear in triples, i.e., $O(n_\text{triples})$
- Constraint caches: Linear in schema size, i.e., $O(n_\text{classes} + n_\text{relations})$
- Candidate pools: Linear in relations times average pool size, i.e., $O(n_\text{relations} \times \text{avg_pool_size})$
What Affects Memory
- Number of entities and triples (primary drivers)
- Schema size (classes, relations, constraints)
- Number of active relations being tracked
- Whether consistency checking is enabled (reasoner memory overhead)
Memory-Saving Strategies
During generation:
- Use fast generation mode for large KGs
- Reduce `relation_usage_uniformity` (smaller pool tracking)
Post-generation:
Consistency checking runs as a separate step after the KG is serialized. It uses the HermiT reasoner via Owlready2 which:
- Runs in its own JVM subprocess with independent memory
- Can be extremely memory-intensive on large KGs
- May exhaust Java heap space on very large graphs
For large KGs (1M+ entities), disable `check_kg_consistency` to avoid memory issues and long validation times.
See Consistency Checking for details.
Configuration Parameters#
Now that you understand how the generation pipeline works, you can fine-tune it using the configuration parameters in the kg section of your config file.
| Parameter | Controls |
|---|---|
| `num_entities` | Total entity count |
| `num_triples` | Target triple count |
| `prop_untyped_entities` | Proportion without class assignment (0.0-1.0) |
| `avg_specific_class_depth` | Target hierarchy depth for assigned classes |
| `multityping` | Allow multiple most-specific classes per entity |
| `avg_types_per_entity` | Target average class count when multityping enabled |
| `relation_usage_uniformity` | Triple distribution evenness across relations (0.0-1.0) |
| `enable_fast_generation` | Generate small prototype then scale up |
| `check_kg_consistency` | Run reasoner validation after generation |
Complete Reference
See Configuration Reference for detailed parameter descriptions.
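For orientation, the `kg` section boils down to a handful of values. The snippet below expresses them as a Python dict with purely illustrative numbers; the actual file format and defaults are described in the Configuration Reference:

```python
# Purely illustrative values; consult the Configuration Reference for the
# real config file format and defaults.
kg_config = {
    "num_entities": 10_000,
    "num_triples": 50_000,
    "prop_untyped_entities": 0.1,
    "avg_specific_class_depth": 2.5,
    "multityping": True,
    "avg_types_per_entity": 2.0,
    "relation_usage_uniformity": 0.5,
    "enable_fast_generation": False,
    "check_kg_consistency": True,
}
```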
Key Parameters Explained#
`relation_usage_uniformity` (0.0-1.0)
Controls triple distribution across relations. Example with 10 relations, 1000 triples:
| uniformity | R1 | R2 | R3 | R4 | R5 | R6 | R7 | R8 | R9 | R10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0.0 | 400 | 200 | 150 | 100 | 50 | 40 | 30 | 20 | 5 | 5 |
| 0.5 | 180 | 150 | 130 | 110 | 95 | 85 | 75 | 65 | 60 | 50 |
| 1.0 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
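One way to picture the parameter is as an interpolation between a power-law weight vector and a uniform one. The sketch below is conceptual, not necessarily the exact formula used internally:

```python
import numpy as np

def relation_weights(n_relations, uniformity, alpha=1.0):
    """Blend a power-law weight vector with a uniform one (conceptual sketch)."""
    ranks = np.arange(1, n_relations + 1)
    power_law = 1.0 / ranks**alpha
    power_law /= power_law.sum()
    uniform = np.full(n_relations, 1.0 / n_relations)
    weights = (1.0 - uniformity) * power_law + uniformity * uniform
    return weights / weights.sum()

# uniformity=0.0 -> a few relations dominate; uniformity=1.0 -> all relations equal.
print((relation_weights(10, 0.0) * 1000).round())
print((relation_weights(10, 1.0) * 1000).round())
```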
`avg_specific_class_depth`
Controls type specificity. Example hierarchy:
owl:Thing (depth 0)
├── Person (depth 1)
│ ├── Student (depth 2)
│ │ └── GraduateStudent (depth 3)
│ └── Professor (depth 2)
└── Organization (depth 1)
- depth 1.0 → Person, Organization
- depth 2.0 → Student, Professor
- depth 3.0 → GraduateStudent
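A fractional target depth can be realized by mixing the two nearest integer depths. The sketch below shows one conceptual way to do this, not necessarily the sampling scheme used internally:

```python
import random

# Hypothetical depth layers matching the example hierarchy above.
classes_by_depth = {
    1: ["Person", "Organization"],
    2: ["Student", "Professor"],
    3: ["GraduateStudent"],
}

def sample_class(target_depth, layers):
    """Pick a class near target_depth, e.g. 2.3 -> depth 2 with p=0.7, depth 3 with p=0.3."""
    lower = int(target_depth)
    frac = target_depth - lower
    depth = lower + (1 if random.random() < frac and (lower + 1) in layers else 0)
    depth = max(min(depth, max(layers)), min(layers))   # clamp to available layers
    return random.choice(layers[depth])

random.seed(0)
print(sample_class(2.3, classes_by_depth))
```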
Fast Generation Mode#
For large-scale generation, fast mode offers a speed-optimized alternative to the standard pipeline.
When enable_fast_generation: true, the generator creates a small prototype KG then replicates entity profiles to reach target size.
How It Works
Fast generation trades entity typing diversity for speed by reusing type profiles:
- Generate seed batch (10-20% of target size) with full constraint pipeline
- Capture entity type profiles from seed batch
- Replicate profiles round-robin to create remaining entities
- Each new entity gets a copy of a seed entity's type assignments
- Profiles are shuffled to maintain distribution
- Generate triples for all entities (seed + replicated)
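A minimal sketch of the profile replication step (round-robin copying of seed type profiles); the names and data shapes are illustrative:

```python
import random

def replicate_profiles(seed_profiles, num_total):
    """Extend a list of seed type profiles to num_total entities (conceptual sketch)."""
    pool = list(seed_profiles)
    random.shuffle(pool)                       # shuffle to avoid positional bias
    profiles = [set(p) for p in pool]          # seed entities keep their own profiles
    i = 0
    while len(profiles) < num_total:
        profiles.append(set(pool[i % len(pool)]))   # round-robin copy of a seed profile
        i += 1
    return profiles

seed = [{"Person", "Student"}, {"Organization"}, {"Person", "Professor"}]
print(len(replicate_profiles(seed, num_total=10)))   # -> 10
```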
Benefits
- Significantly faster for large KGs (100K+ entities)
- Preserves type distribution and hierarchy characteristics
- Avoids recomputing hierarchy sampling and disjointness resolution
Trade-offs
- Less diverse entity typing patterns (profiles are copied, not unique)
- Type distribution matches seed batch exactly
When to Use
- Large KGs (1M+ entities) where full generation is slow
- Heavily constrained schemas
- Testing configurations before full-scale runs
FAQ#
Here are answers to common questions and troubleshooting tips based on generation behavior.
Why did I get fewer triples than I requested?
Constraints are too restrictive (small domain/range pools, many functional properties) or relations exhausted their candidate space early and were dropped.
Why is generation taking hours?
Expensive constraints (transitive properties, deep subproperty hierarchies, complex inverse validations) or small candidate pools slow down validation.
Why are many relations excluded or dropped?
Empty candidate pools (domain/range constraints have no satisfying entities) or forbidden characteristic combinations (Asymmetric + Functional, Transitive + Functional). Check logs for specific exclusion reasons.
Limitations#
The generator enforces only constraints captured during ontology extraction. This has an important consequence: unsupported OWL constructs are not enforced during generation, yet they are still validated during consistency checking.
This creates a generation vs validation gap: A KG can be generated "correctly" according to extracted metadata but still fail consistency checking against the full ontology.
Why this matters:
- Generated KG respects extracted constraints
- Consistency checking validates against full ontology (including unsupported constructs)
- Result: KG may be marked inconsistent due to constraints not enforced during generation
Learn More
See Consistency Checking for details on this gap and What's Supported for unsupported constructs.
What's Next#
- OWL Constraints — Detailed constraint explanations
- Consistency Checking — Validating generated KGs
- Configuration Reference — All generation parameters
- KG Info Reference — Output statistics format