Skip to content

KG Generation#

KG generation creates synthetic Knowledge Graph instances from extracted ontology metadata. Entities are created, assigned types, and connected with triples that respect ontology constraints.

Why This Page is Detailed

The KG generator is the core of PyGraft-gen's architecture — it's what enables generation at scales previously impractical (1M+ entities, 10M+ triples). This page is intentionally exhaustive because understanding these internals is essential for troubleshooting, performance tuning, and future extensions.

On this page:


Overview#

KG generation takes the three metadata files from ontology extraction and produces a Knowledge Graph:

flowchart LR
  CI[class_info.json]
  RI[relation_info.json]
  NI[namespaces_info.json]

  subgraph G[KG Generator]
    direction TB
    E[Create<br/>Entities]
    T[Assign<br/>Types]
    TR[Generate<br/>Triples]
  end

  KG[Knowledge Graph]
  INFO[kg_info.json]

  CI --> E
  RI --> TR
  NI --> KG

  E --> T
  T --> TR
  TR --> KG
  TR --> INFO

  style CI fill:#eee,stroke:#666,stroke-width:2px
  style RI fill:#eee,stroke:#666,stroke-width:2px
  style NI fill:#eee,stroke:#666,stroke-width:2px
  style E fill:#f8c9c9,stroke:#c55,stroke-width:2px
  style T fill:#f8c9c9,stroke:#c55,stroke-width:2px
  style TR fill:#f8c9c9,stroke:#c55,stroke-width:2px
  style KG fill:#eee,stroke:#666,stroke-width:2px
  style INFO fill:#eee,stroke:#666,stroke-width:2px

Output files:

  • kg.{ttl|rdf|nt} — The complete Knowledge Graph in your chosen RDF format
  • kg_info.json — Statistics and parameters documenting what was generated

Extraction Scope

The generator can only enforce constraints from the extracted metadata but cannot enforce constructs that weren't captured during extraction.

See What's Supported


How Constraints Are Enforced#

The generator enforces constraints extracted from the ontology during generation. Every triple generated satisfies the schema's explicit constraints.

Detailed Explanations

For detailed explanations with examples, see OWL Constraints.

Enforced Constraints#

Class Constraints:

Constraint Property Enforcement
Hierarchy rdfs:subClassOf Entities inherit all superclasses
Disjointness owl:disjointWith Entities cannot have disjoint types

Property Characteristics:

Constraint Property Enforcement
Functional owl:FunctionalProperty At most one outgoing edge per subject
Inverse functional owl:InverseFunctionalProperty At most one incoming edge per object
Symmetric owl:SymmetricProperty Stores only one direction per entity pair
Asymmetric owl:AsymmetricProperty Rejects if reverse edge exists
Transitive owl:TransitiveProperty Prevents cycles with irreflexive closure
Irreflexive owl:IrreflexiveProperty No self-loops
Reflexive owl:ReflexiveProperty Not materialized (reasoners infer)

Relational Constraints:

Constraint Property Enforcement
Domain/range rdfs:domain / rdfs:range Triples respect class restrictions
Property disjointness owl:propertyDisjointWith Entity pairs cannot use disjoint relations
Inverse relationships owl:inverseOf Validates inverse triples would be valid
Subproperty inheritance rdfs:subPropertyOf Constraints inherited from superproperties

Forbidden Characteristic Combinations#

Some characteristic combinations are logically inconsistent and prohibited by OWL 2:

Direct Contradictions (Generation Stops):

Combination Why Forbidden
Reflexive + Irreflexive Reflexive requires self-loops; Irreflexive forbids them
Symmetric + Asymmetric Symmetric requires bidirectional; Asymmetric forbids it

Problematic Combinations (Warnings, Relation Excluded):

Combination Issue
Asymmetric + Functional Creates inconsistency in OWL 2 reasoning
Asymmetric + InverseFunctional Creates inconsistency in OWL 2 reasoning
Transitive + Functional Can lead to unintended inference chains and explosions

When detected during schema loading, the generator either stops (SEVERE errors) or excludes the problematic relation (WARNING errors).


The Generation Pipeline#

Generation happens in four sequential phases, each building on the previous one.

Schema Loading#

The first phase prepares everything needed for efficient generation by converting metadata into optimized internal structures.

ID mappings:

Convert class/relation names to integer IDs and create bidirectional lookups (string ↔ ID). This enables fast array-based operations throughout generation.

Constraint caches (pre-computed):

All constraint data is computed once upfront to avoid repeated lookups during generation:

  • Domain/range entity pools for each relation
  • Disjoint envelopes (classes disjoint with each relation's domain/range)
  • Transitive superclass closures for all classes
  • Property characteristics sets (functional, symmetric, transitive, etc.)
  • Inverse mappings and subproperty chains

Schema validation:

Checks for logical contradictions before generation begins:

  • SEVERE errors stop generation immediately (e.g., reflexive + irreflexive, symmetric + asymmetric)
  • WARNING errors exclude problematic relations (e.g., transitive + functional, asymmetric + functional)

Entity Creation and Typing#

With the schema loaded, entities are created and assigned types according to the class hierarchy.

Entity creation:

  • Allocates entity ID space (E1, E2, ..., En)
  • Splits entities into typed vs untyped based on prop_untyped_entities

Type assignment:

Typed entities receive classes sampled from the hierarchy:

  • Uses power-law distribution (mimics real-world data where a few classes are very common and most are rare)
  • Target depth controlled by avg_specific_class_depth
  • Transitive superclasses added automatically
  • Optional multityping adds additional most-specific classes

Disjointness enforcement:

If an entity receives disjoint classes, one is removed deterministically to maintain logical consistency.

Reverse indices:

Builds lookup tables for the next phase:

  • class2entities[Person] = [E1, E5, E12, ...] — All entities of each class
  • class2unseen[Person] = [E5, E12, ...] — Entities not yet used in triples

These enable fast candidate pool construction during triple generation.

Triple Generation#

The core phase where relational triples are created using batch sampling with constraint filtering.

Setup Phase (One-Time)#

Before generating any triples, the system prepares relation-specific data:

Build candidate entity pools per relation:

  • Domain pool: Entities satisfying ALL domain classes (intersection)
  • Range pool: Entities satisfying ALL range classes (intersection)
  • Relations with empty pools are excluded from generation

Distribute triple budget across relations:

  • Controlled by relation_usage_uniformity
  • 0.0 = power-law distribution (few relations dominate)
  • 1.0 = uniform distribution (all relations equal)

Initialize tracking structures:

  • Duplicate detection: seen_pairs[relation] = {(h1,t1), (h2,t2), ...}
  • Functional constraints: functional_heads[relation] = {h1, h2, ...}
  • Inverse-functional: invfunctional_tails[relation] = {t1, t2, ...}

Generation Loop#

Sample batches of candidate triples and filter them through two phases:

Fast Filtering (Batch)

Applied to all candidates using vectorized NumPy operations. Checks that don't need current graph state:

  • Irreflexive: head != tail
  • Duplicate: (head, tail) not already generated
  • Functional: head not already used for this relation
  • Inverse-functional: tail not already used for this relation
  • Asymmetric: reverse edge doesn't exist
  • Symmetric duplicates: reverse direction not already generated
Deep Validation (Per-Triple)

Applied during generation as each triple is added. These checks require type information or graph traversal:

  • Domain/range typing: Entity types satisfy all constraints
  • Disjoint envelopes: Entities not instances of classes disjoint with domain/range
  • Inverse validation: Inverse triple would be valid
  • Property disjointness: Entity pair doesn't exist for disjoint property
  • Transitive cycles: No reflexive cycles created
  • Subproperty inheritance: Constraints from superproperties satisfied

Adaptive Mechanisms#

The generator adjusts dynamically to maintain efficiency:

  • Oversample multiplier: Constrained relations (functional, inverse-functional) oversample 4x to compensate for higher rejection rates
  • Batch sizing: Increases dynamically as valid candidates become scarce
  • Stall detection: Relations producing 20 consecutive empty batches are dropped (candidate pool exhausted)
  • Weight recomputation: Redistributes budget from dropped relations to active ones

Target vs Actual

num_triples is a target. Actual count may be lower due to constraint exhaustion and is reported in kg_info.json.

Serialization#

The final phase writes the generated Knowledge Graph to disk with proper formatting.

RDF graph (kg.ttl, .rdf, or .nt):

  • Instance triples (head, relation, tail)
  • Type assertions for most-specific classes only (reasoners infer superclasses)
  • Namespace bindings from namespaces_info.json

Statistics (kg_info.json):

  • User parameters (requested counts, configuration)
  • Actual statistics (generated counts, averages, proportions)

Algorithm and Complexity#

This section provides a detailed look at how the generation algorithm works and its computational complexity. Understanding this helps you predict performance and troubleshoot slow generation.

Notation

Throughout this section, we use shorthand notation for readability. Here is what each symbol means:

Symbol Full name Source
$n_\text{entities}$ Number of entities num_entities config parameter
$n_\text{triples}$ Number of triples num_triples config parameter
$n_\text{classes}$ Number of classes From the schema (class_info.json)
$n_\text{relations}$ Number of relations From the schema (relation_info.json)
$\text{avg_depth}$ Average class hierarchy depth Depends on schema structure
$\text{avg_types}$ Average types per entity When multityping is enabled
$\text{batch}$ Batch size 1K to 100K depending on KG scale

We express complexity using Big-O notation. For example, $O(n_\text{entities})$ means the operation's runtime grows linearly with the number of entities.

Design Philosophy#

PyGraft-gen uses an integer-ID model with batch sampling and two-phase filtering to achieve scalability:

  • Integer-based identifiers: Entities, classes, and relations use integer IDs internally. Strings only appear during serialization. This eliminates string overhead and enables efficient NumPy array operations.

  • Pre-computed constraint caches: Domain/range pools, disjointness sets, and property characteristics computed once before generation. No repeated set intersections or dictionary lookups during sampling.

  • Batch sampling with vectorized filtering: Sample large batches of candidate triples, then apply constraints in two phases — fast filtering eliminates most invalid candidates before expensive deep validation.

  • Incremental constraint tracking: Functional and inverse-functional properties maintain sets of used heads/tails. Constraint checks become constant-time set membership tests, i.e., $O(1)$, instead of scanning all existing triples which would cost $O(n_\text{triples})$.

Performance Impact

  • Naive approach: 10M triples = 10M sampling operations, each with $O(n_\text{triples})$ validation cost
  • PyGraft-gen: 10M triples ≈ 1,000 batch operations with $O(1)$ constraint checks
  • Result: Hours reduced to minutes for large-scale generation

Schema Loading#

Schema loading converts JSON metadata into optimized internal structures.

flowchart LR
  A[Load JSON files] --> B[Build ID mappings]
  B --> C[Initialize caches]
  C --> D[Validate]
  D --> E[Compute envelopes]

Complexity breakdown:

Step Operation Complexity
Load JSON Parse class_info, relation_info Linear in schema size: $O(n_\text{classes} + n_\text{relations})$
ID mappings Build bidirectional dictionaries Linear in schema size: $O(n_\text{classes} + n_\text{relations})$
Class caches Layer mappings, transitive closures Linear in classes times hierarchy depth: $O(n_\text{classes} \times \text{avg_depth})$
Relation caches Domain/range sets, property characteristics Linear in relations: $O(n_\text{relations})$
Validation Check forbidden combinations Linear in relations: $O(n_\text{relations})$
Disjoint envelopes Union disjoint classes per relation Worst case relations times classes: $O(n_\text{relations} \times n_\text{classes})$

Total Complexity

$O(n_\text{classes} \times \text{avg_depth} + n_\text{relations} \times n_\text{classes})$ in the worst case, but typically much faster because disjointness declarations are sparse in real ontologies.

Entity Typing#

Entity typing assigns classes to entities while respecting hierarchy and disjointness constraints.

flowchart LR
  A[Initialize space] --> B[Assign most-specific]
  B --> C[Add multitypes]
  C --> D[Compute transitive]
  D --> E[Resolve conflicts]

Complexity breakdown:

Step Operation Complexity
Initialize Allocate arrays, select typed subset Linear in entities: $O(n_\text{entities})$
Most-specific assignment Power-law sampling per entity Linear in entities: $O(n_\text{entities})$
Multityping Add additional classes per entity Linear in entities times avg types: $O(n_\text{entities} \times \text{avg_types})$
Transitive closure Add superclasses for each specific class Linear in entities times hierarchy depth: $O(n_\text{entities} \times \text{avg_depth})$
Conflict resolution Check and repair disjoint violations Linear in entities times avg types squared: $O(n_\text{entities} \times \text{avg_types}^2)$
Profile replication (fast mode) Copy profiles round-robin Linear in entities: $O(n_\text{entities})$

Total Complexity

$O(n_\text{entities} \times (\text{avg_depth} + \text{avg_types}^2))$, dominated by transitive closure computation and disjoint conflict resolution.

Triple Generation#

Triple generation is the performance-critical phase. It uses a batch sampling pipeline with two-phase filtering.

Setup (One-Time)#

Before the main loop, the generator prepares per-relation data structures:

Step Operation Complexity
Class-entity index Build reverse mapping Linear in entities times avg types: $O(n_\text{entities} \times \text{avg_types})$
Candidate pools Intersect entities per relation Worst case relations times entities: $O(n_\text{relations} \times n_\text{entities})$
Budget distribution Compute weights and quotas Linear in relations: $O(n_\text{relations})$
Tracking init Create empty sets Linear in relations: $O(n_\text{relations})$

Setup Total Complexity

$O(n_\text{relations} \times n_\text{entities})$ in the worst case, though typically faster when domain/range constraints are selective and filter out most entities.

Main Loop#

The generation loop runs until the target triple count is reached:

flowchart LR
  S[Sample relation] --> B[Sample batch]
  B --> F[Fast filter]
  F --> D[Deep validate]
  D --> A[Accept valid]
  A --> U[Update state]
  U --> S

Each iteration processes a batch of candidates (typically 1K-100K depending on KG size):

Step Operation Complexity per batch
Sample relation Weighted random choice Constant time: $O(1)$
Sample candidates Random from pools with freshness bias Linear in batch size: $O(\text{batch})$
Fast filtering Vectorized constraint masks Linear in batch size: $O(\text{batch})$
Deep validation Per-survivor constraint checks Survivors times validation cost: $O(\text{survivors} \times V)$
Accept and record Update tracking structures Linear in accepted count: $O(\text{accepted})$

Number of iterations: In the best case, approximately $\left\lceil \frac{n_{\text{triples}}}{\text{batch}} \right\rceil$ iterations. More iterations are needed when rejection rates are high due to tight constraints.

Loop Total Complexity

$O(n_\text{triples} \times V)$ where $V$ is the average validation cost per triple. For most constraints, $V$ is constant, i.e., $O(1)$. The exception is transitive cycle detection, which requires graph traversal.

Two-Phase Filtering Details#

The two-phase approach separates cheap batch operations from expensive per-triple checks:

Applied to the entire batch using vectorized NumPy operations. Each check runs in linear time over the batch with constant-time lookups:

Constraint Check Cost per candidate
Irreflexive head != tail Constant time via vectorized comparison: $O(1)$
Duplicate (h,t) in seen_pairs Constant time via hash set lookup: $O(1)$
Functional h in functional_heads Constant time via hash set lookup: $O(1)$
Inverse-functional t in invfunctional_tails Constant time via hash set lookup: $O(1)$
Asymmetric (t,h) in kg_pairs Constant time via hash set lookup: $O(1)$
Symmetric duplicate (t,h) in seen_pairs Constant time via hash set lookup: $O(1)$

Fast filtering catches most constraint violations before deep validation begins, ensuring expensive checks only run on candidates that pass the cheap tests.

Applied to each survivor individually. Most checks are constant-time thanks to pre-computed caches:

Constraint Check Cost per triple
Domain/range typing Entity has required classes Constant time via set membership: $O(1)$
Disjoint envelope Entity not in forbidden classes Constant time via set intersection: $O(1)$
Inverse validation Inverse triple would be valid Constant time via recursive check: $O(1)$
Property disjointness No conflict with disjoint relations Linear in disjoint set size: $O(\vert\text{disjoints}\vert)$
Subproperty inheritance Super-property constraints satisfied Linear in super-property count: $O(\vert\text{supers}\vert)$
Transitive cycle No path from tail to head BFS traversal of relation subgraph: $O(V + E)$

Most constraints are constant-time, i.e., $O(1)$, due to pre-computed caches. The exception is transitive cycle detection, which requires graph traversal and can be expensive for dense transitive relations.

Transitive Cycle Detection#

For relations that are both transitive and (irreflexive or asymmetric), adding a triple $(h, r, t)$ could create a cycle in the transitive closure. The generator uses breadth-first search (BFS) to detect this:

flowchart TD
  A[Start BFS from t] --> B[Follow existing edges for relation r]
  B --> C{Reached h?}
  C -->|Yes| D[Reject triple - cycle detected]
  C -->|No more nodes| E[Accept triple]

Cycle Detection Complexity

The cost is proportional to the number of vertices and edges reachable from $t$ in the relation's subgraph, i.e., $O(V + E)$ where $V$ is vertices visited and $E$ is edges traversed. This is typically much smaller than the full graph because only edges of one specific relation are considered, the search stops as soon as $h$ is found, and most transitive relations have sparse connectivity.

Incremental Tracking#

Instead of scanning all existing triples to check constraints (which would cost linear time in the number of triples, i.e., $O(n_\text{triples})$), the generator maintains incremental tracking structures:

Structure Purpose Update cost Query cost
seen_pairs[r] Duplicate detection Constant time to add: $O(1)$ Constant time lookup: $O(1)$
functional_heads[r] Functional property Constant time to add: $O(1)$ Constant time lookup: $O(1)$
invfunctional_tails[r] Inverse-functional Constant time to add: $O(1)$ Constant time lookup: $O(1)$
transitive_adjacency[r] Cycle detection Constant time to add edge: $O(1)$ BFS traversal: $O(V + E)$

Key Optimization

This is the key optimization that makes PyGraft-gen scale. In naive implementations, checking functional constraints requires scanning all existing triples for that relation, costing $O(n_\text{triples})$ per check. With incremental tracking, the same check costs constant time, i.e., $O(1)$.

Serialization#

Serialization writes the generated KG to disk:

Step Operation Complexity
Build RDF graph Add triples to RDFLib Linear in triples: $O(n_\text{triples})$
Add type assertions One per entity-class pair Linear in entities times types: $O(n_\text{entities} \times \text{avg_types})$
Serialize Write to file Linear in output size: $O(n_\text{triples} + n_\text{entities})$
Statistics Compute and write kg_info Linear in triples: $O(n_\text{triples})$

Total Complexity

$O(n_\text{triples} + n_\text{entities} \times \text{avg_types})$, dominated by type assertion generation when multityping is enabled. Linear in output size.

Overall Complexity Summary#

Phase Complexity What dominates
Schema loading $O(n_\text{classes} \times \text{avg_depth} + n_\text{relations} \times n_\text{classes})$ Disjoint envelope computation
Entity typing $O(n_\text{entities} \times (\text{avg_depth} + \text{avg_types}^2))$ Transitive closure, conflict resolution
Triple generation setup $O(n_\text{relations} \times n_\text{entities})$ Candidate pool construction
Triple generation loop $O(n_\text{triples} \times V)$ Validation cost $V$ per triple
Serialization $O(n_\text{triples} + n_\text{entities} \times \text{avg_types})$ Linear scan of outputs

In practice, the triple generation loop dominates runtime for large KGs. The validation cost $V$ is constant, i.e., $O(1)$, for most constraints, but becomes a graph traversal cost, i.e., $O(V + E)$, for transitive relations with cycle detection.


Performance Characteristics#

Understanding performance helps you plan generation runs and troubleshoot issues.

Constraint Impact on Performance#

Some constraints are more expensive to validate than others:

Constraint Type Cost Complexity Notes
Irreflexive Low Constant: $O(1)$ Simple equality check
Functional Low Constant: $O(1)$ Set membership via incremental tracking
Inverse-functional Low Constant: $O(1)$ Set membership via incremental tracking
Domain/Range Low Constant: $O(1)$ Pre-filtered pools
Symmetric Low Constant: $O(1)$ Duplicate check in seen_pairs
Asymmetric Medium Constant: $O(1)$ Reverse edge lookup
Disjointness Medium Linear in disjoint set: $O(\vert\text{disjoints}\vert)$ Intersection with disjoint set
Inverse validation Medium Constant: $O(1)$ Recursive validation of inverse
Subproperty Medium Linear in supers: $O(\vert\text{supers}\vert)$ Check inherited constraints
Transitive + Irreflexive High BFS traversal: $O(V + E)$ Cycle detection in relation subgraph

Scale Expectations#

Relative Scale Expectations
  • Small (1K-10K entities): Seconds
  • Medium (10K-100K entities): Minutes
  • Large (100K-1M entities): Tens of minutes
  • Very large (1M+ entities): Hours
Performance Factors

What affects speed:

  • Hardware (CPU, RAM)
  • Schema complexity (number of constraints, hierarchy depth)
  • Configuration (relation_usage_uniformity, enable_fast_generation)
  • Constraint density (how many relations have expensive properties like transitivity)
What Slows Generation Down
  • Transitive cycle detection: Requires BFS graph traversal costing $O(V + E)$ for each candidate triple
  • Deep validation: Each surviving candidate requires type checking against current state
  • Inverse relation validation: Must validate both the triple and its inverse
  • Small candidate pools: Tight constraints mean more sampling attempts per accepted triple

Memory Usage#

Memory requirements scale with graph size and schema complexity.

Memory components:

  • Entity structures: Linear in entities, i.e., $O(n_\text{entities})$
  • Triple storage: Linear in triples, i.e., $O(n_\text{triples})$
  • Constraint caches: Linear in schema size, i.e., $O(n_\text{classes} + n_\text{relations})$
  • Candidate pools: Linear in relations times average pool size, i.e., $O(n_\text{relations} \times \text{avg_pool_size})$
What Affects Memory
  • Number of entities and triples (primary drivers)
  • Schema size (classes, relations, constraints)
  • Number of active relations being tracked
  • Whether consistency checking is enabled (reasoner memory overhead)
Memory-Saving Strategies

During generation:

  • Use fast generation mode for large KGs
  • Reduce relation_usage_uniformity (smaller pool tracking)

Post-generation:

Consistency checking runs as a separate step after the KG is serialized. It uses the HermiT reasoner via Owlready2 which:

  • Runs in its own JVM subprocess with independent memory
  • Can be extremely memory-intensive on large KGs
  • May exhaust Java heap space on very large graphs

For large KGs (1M+ entities), disable check_kg_consistency to avoid memory issues and long validation times.

See Consistency Checking for details.


Configuration Parameters#

Now that you understand how the generation pipeline works, you can fine-tune it using the configuration parameters in the kg section of your config file.

Parameter Controls
num_entities Total entity count
num_triples Target triple count
prop_untyped_entities Proportion without class assignment (0.0-1.0)
avg_specific_class_depth Target hierarchy depth for assigned classes
multityping Allow multiple most-specific classes per entity
avg_types_per_entity Target average class count when multityping enabled
relation_usage_uniformity Triple distribution evenness across relations (0.0-1.0)
enable_fast_generation Generate small prototype then scale up
check_kg_consistency Run reasoner validation after generation

Complete Reference

See Configuration Reference for detailed parameter descriptions.

Key Parameters Explained#

relation_usage_uniformity (0.0-1.0)

Controls triple distribution across relations. Example with 10 relations, 1000 triples:

uniformity R1 R2 R3 R4 R5 R6 R7 R8 R9 R10
0.0 400 200 150 100 50 40 30 20 5 5
0.5 180 150 130 110 95 85 75 65 60 50
1.0 100 100 100 100 100 100 100 100 100 100
avg_specific_class_depth

Controls type specificity. Example hierarchy:

owl:Thing (depth 0)
├── Person (depth 1)
│   ├── Student (depth 2)
│   │   └── GraduateStudent (depth 3)
│   └── Professor (depth 2)
└── Organization (depth 1)

  • depth 1.0 → Person, Organization
  • depth 2.0 → Student, Professor
  • depth 3.0 → GraduateStudent

Fast Generation Mode#

For large-scale generation, fast mode offers a speed-optimized alternative to the standard pipeline.

When enable_fast_generation: true, the generator creates a small prototype KG then replicates entity profiles to reach target size.

How It Works

Fast generation trades entity typing diversity for speed by reusing type profiles:

  1. Generate seed batch (10-20% of target size) with full constraint pipeline
  2. Capture entity type profiles from seed batch
  3. Replicate profiles round-robin to create remaining entities
  4. Each new entity gets a copy of a seed entity's type assignments
  5. Profiles are shuffled to maintain distribution
  6. Generate triples for all entities (seed + replicated)

Benefits

  • Significantly faster for large KGs (100K+ entities)
  • Preserves type distribution and hierarchy characteristics
  • Avoids recomputing hierarchy sampling and disjointness resolution

Trade-offs

  • Less diverse entity typing patterns (profiles are copied, not unique)
  • Type distribution matches seed batch exactly

When to Use

  • Large KGs (1M+ entities) where full generation is slow
  • Heavily constrained schemas
  • Testing configurations before full-scale runs

FAQ#

Here are answers to common questions and troubleshooting tips based on generation behavior.

Why did I get fewer triples than I requested?

Constraints are too restrictive (small domain/range pools, many functional properties) or relations exhausted their candidate space early and were dropped.

Why is generation taking hours?

Expensive constraints (transitive properties, deep subproperty hierarchies, complex inverse validations) or small candidate pools slow down validation.

Why are many relations excluded or dropped?

Empty candidate pools (domain/range constraints have no satisfying entities) or forbidden characteristic combinations (Asymmetric + Functional, Transitive + Functional). Check logs for specific exclusion reasons.


Limitations#

The generator enforces only constraints captured during Ontology extraction. This creates some important limitations to be aware of like some unsupported OWL constructs not being enforced during generation but being validated during consistency checking.

This creates a generation vs validation gap: A KG can be generated "correctly" according to extracted metadata but still fail consistency checking against the full ontology.

Why this matters:

  • Generated KG respects extracted constraints
  • Consistency checking validates against full ontology (including unsupported constructs)
  • Result: KG may be marked inconsistent due to constraints not enforced during generation

Learn More

See Consistency Checking for details on this gap and What's Supported for unsupported constructs.


What's Next#