Ontology Extraction

Ontology extraction analyzes existing RDFS/OWL ontologies and extracts structured metadata that PyGraft-gen uses to generate synthetic Knowledge Graphs.

On this page:

Overview - Understanding the extraction process
Design Philosophy - Core principles guiding extraction
What Gets Extracted - The three metadata files
Technical Details - Implementation and supported formats

Overview

Extraction converts an ontology file into three JSON metadata files:

flowchart LR
  O[Ontology File]

  subgraph E[Extraction Pipeline]
    N[Namespace<br/>Extraction]
    C[Class<br/>Extraction]
    R[Relation<br/>Extraction]
  end

  NI[namespaces_info.json]
  CI[class_info.json]
  RI[relation_info.json]

  O --> N
  O --> C
  O --> R

  N --> NI
  C --> CI
  R --> RI

  style O fill:#f8c9c9,stroke:#c55,stroke-width:2px
  style N fill:#f8c9c9,stroke:#c55,stroke-width:2px
  style C fill:#f8c9c9,stroke:#c55,stroke-width:2px
  style R fill:#f8c9c9,stroke:#c55,stroke-width:2px
  style NI fill:#eee,stroke:#666,stroke-width:2px
  style CI fill:#eee,stroke:#666,stroke-width:2px
  style RI fill:#eee,stroke:#666,stroke-width:2px

These files capture the ontology structure in a format the KG generator can use. Extraction is deterministic, read-only, and operates purely on explicit axioms without reasoning.

Extraction Scope

Extraction captures only OWL constructs that PyGraft-gen can enforce during generation.

See What's Supported

Design Philosophy

Extraction is built on three core principles that ensure predictable, debuggable results:

Explicit-First, No Inference

The extractor answers one question: "What does this ontology explicitly declare?"

It operates purely on axioms present in the ontology file. No OWL reasoning, RDFS entailment, or semantic expansion is performed.

What this means:

Declared relationships like foaf:Person rdfs:subClassOf foaf:Agent are captured
Relationships that are merely implied by other axioms are not captured unless the ontology also states them explicitly
Transitive relationships (like rdfs:subClassOf*) are computed via SPARQL property paths over explicit axioms, not semantic inference
External ontologies (FOAF, Dublin Core, etc.) are not loaded or traversed

Read-Only and Schema-Focused

Extraction is read-only. The original ontology is never modified.

The focus is purely structural: class hierarchies, property characteristics, and explicit constraints that can be enforced during generation.

Deterministic and Reproducible

Running extraction on the same ontology always produces identical output. All derived structures are computed using fixed SPARQL queries that execute in the same order every time.

This matters for debugging extraction issues, reproducing generation runs, and understanding exactly what the generator sees.
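
One way to check reproducibility in practice is to hash the three metadata files from two separate runs and compare the digests. The sketch below uses only the standard library; the helper names `digest_outputs` and `same_outputs` and the idea of one output directory per run are assumptions for illustration, not part of PyGraft-gen.

```python
import hashlib
from pathlib import Path

# File names as listed in this page's overview.
METADATA_FILES = ("namespaces_info.json", "class_info.json", "relation_info.json")


def digest_outputs(output_dir):
    """Return {file name: SHA-256 hex digest} for the three metadata files."""
    digests = {}
    for name in METADATA_FILES:
        data = (Path(output_dir) / name).read_bytes()
        digests[name] = hashlib.sha256(data).hexdigest()
    return digests


def same_outputs(run_a_dir, run_b_dir):
    """True when both directories hold byte-identical metadata files."""
    return digest_outputs(run_a_dir) == digest_outputs(run_b_dir)
```

If `same_outputs` returns `False` for two runs on the same ontology file, comparing the per-file digests shows which of the three files differs.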

What Gets Extracted

Extraction produces three JSON files, each capturing a different aspect of your ontology's structure.

Namespaces → namespaces_info.json

Prefix-to-namespace mappings and ontology metadata. All IRIs are normalized to prefix:LocalName format.

Requirements: Your ontology must include an owl:Ontology declaration. VANN annotations are recommended but optional.

Learn more → Reference - Namespaces Info

Classes → class_info.json

Class hierarchy and constraints: named classes, direct and transitive rdfs:subClassOf relationships, owl:disjointWith declarations, hierarchy layers, and statistics.

What counts as a class? Any named IRI appearing as an rdf:type target, rdfs:subClassOf participant, or property domain/range. Blank nodes are filtered out.

Learn more → Reference - Class Info

Relations → relation_info.json

Object property characteristics and constraints: OWL characteristics (symmetric, transitive, functional, reflexive, irreflexive, asymmetric, inverse functional), domain/range constraints, inverse relationships, subproperty hierarchies, and disjointness.

Note: Datatype properties are excluded.

Learn more → Reference - Relation Info

Technical Details

Implementation & Supported Formats

SPARQL-based extraction:

Extraction uses SPARQL queries executed through Python's rdflib library. Queries are organized by domain and stored as .rq files.

No reasoning:

Extraction captures only explicit assertions. Transitive closures are computed via SPARQL property paths over explicit axioms, not OWL reasoning.

Supported formats:

Turtle (.ttl) - Recommended
RDF/XML (.rdf, .owl, .xml)

N-Triples and other RDF serializations are not supported.

What's Next

KG Generation - How instances are created from extracted metadata
Consistency Checking - Validating generated KGs

Reference: