data-quality

A reference implementation of the data quality metrics introduced in my Master’s research project, documented in doc/*/paper.typ (Property Graph Quality Assessment). This project provides a systematic framework for evaluating completeness, validity, consistency, integrity, and uniqueness in labeled property graphs (e.g., Neo4j).

Architecture

flowchart TD
    subgraph SOURCES["Data Sources"]
        DB[(Relational DB)]
        CSV[/CSV/]
        JSON[/JSON/]
        TSV[/TSV/]
        OTHER[/Other formats.../]
    end

    subgraph GRAPH[ ]
        NEO4J[("Graph Database\n(Neo4j)")]
    end

    subgraph USAGES["Consumers"]
        AI[AI / ML]
        CRM[CRM]
        BI[BI]
    end

    subgraph FRAMEWORK[data-quality]
        DQ[Data Quality\nAssessment & Profiling]
    end

    DB -->|ETL / Ingestion| NEO4J
    CSV -->|ETL / Ingestion| NEO4J
    JSON -->|ETL / Ingestion| NEO4J
    TSV -->|ETL / Ingestion| NEO4J
    OTHER -->|ETL / Ingestion| NEO4J

    NEO4J -->|Analyzes & Improves| DQ
    DQ -->|Analyzes & Improves| NEO4J

    NEO4J -->|Supplies| AI
    NEO4J -->|Supplies| CRM
    NEO4J -->|Supplies| BI
    
    style DB fill:#2a7d56,stroke:#ffffff,color:#ffffff,stroke-width:1px
    style CSV fill:#2a7d56,stroke:#ffffff,color:#ffffff,stroke-width:1px
    style JSON fill:#2a7d56,stroke:#ffffff,color:#ffffff,stroke-width:1px
    style TSV fill:#2a7d56,stroke:#ffffff,color:#ffffff,stroke-width:1px
    style OTHER fill:#2a7d56,stroke:#ffffff,color:#ffffff,stroke-width:1px

    style AI fill:#da8f74,stroke:#ffffff,color:#ffffff,stroke-width:1px
    style BI fill:#da8f74,stroke:#ffffff,color:#ffffff,stroke-width:1px
    style CRM fill:#da8f74,stroke:#ffffff,color:#ffffff,stroke-width:1px

    style NEO4J fill:#014063,stroke:#ffffff,color:#ffffff,stroke-width:1px
    style DQ fill:#5db3f3,stroke:#ffffff,color:#ffffff,stroke-width:1px

Why measure data quality in a property graph?

Property graphs are schema‑flexible and semantically rich, but this freedom makes them prone to:

Missing relationships or nodes → completeness issues
Invalid label sets or malformed property values → validity violations
Inconsistent functional dependencies → consistency flaws
Structural anomalies (duplicate edges, missing mandatory properties) → integrity / uniqueness degradation
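
To make the first and last items concrete, here is a minimal sketch of two such checks on a small in-memory graph. It is not the set of metrics defined in the paper: the dictionary layout for nodes and edges, the function names, and the rule that a duplicate edge repeats the same (source, type, target) triple are all assumptions made for illustration.

```python
from collections import Counter


def property_completeness(nodes, label, prop):
    """Share of nodes carrying `label` that have a non-null `prop`."""
    labeled = [n for n in nodes if label in n["labels"]]
    if not labeled:
        # Assumption: with nothing to check, the property counts as complete.
        return 1.0
    present = sum(1 for n in labeled if n["props"].get(prop) is not None)
    return present / len(labeled)


def duplicate_edge_count(edges):
    """Number of redundant edges repeating a (source, type, target) triple."""
    counts = Counter((e["src"], e["type"], e["dst"]) for e in edges)
    return sum(c - 1 for c in counts.values())


# Hypothetical sample data, not taken from the project.
nodes = [
    {"labels": {"Person"}, "props": {"name": "Ada"}},
    {"labels": {"Person"}, "props": {}},
    {"labels": {"City"}, "props": {"name": "Paris"}},
]
edges = [
    {"src": 0, "type": "LIVES_IN", "dst": 2},
    {"src": 0, "type": "LIVES_IN", "dst": 2},
    {"src": 1, "type": "LIVES_IN", "dst": 2},
]

completeness = property_completeness(nodes, "Person", "name")
duplicates = duplicate_edge_count(edges)
print(completeness, duplicates)  # 0.5 1
```

Against a real Neo4j database the same counts would come from Cypher queries rather than Python lists; the in-memory form only shows what is being measured.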

Automated quality profiling helps:

Validate graph‑based ETL pipelines
Enforce domain constraints without a rigid schema
Detect semantic drift in labels and relationships
Improve downstream analytics (e.g., graph ML, path queries)

Getting started

Requires Python 3.14 and uv.

# Clone the repository
git clone https://github.com/LugolBis/data-quality.git
cd data-quality

# Create virtual environment and install dependencies
uv venv .venv && source .venv/bin/activate && uv sync

Create a .env file:

echo '' > .env

Then paste the following into it, replacing the placeholder values with your own Neo4j connection settings:

URI="neo4j://127.0.0.1:7687"
DB_USER="your_neo4j_user"
DB_PW="your_neo4j_password"
DB_NAME="your_database"
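
Before launching the UI, it can help to confirm that all four variables are set. The sketch below is a standalone check using only the standard library; it makes no claim about how the project itself reads .env, and parse_env, missing_keys, and REQUIRED_KEYS are hypothetical names introduced here.

```python
REQUIRED_KEYS = ("URI", "DB_USER", "DB_PW", "DB_NAME")


def parse_env(text):
    """Parse KEY="value" lines; blank lines and # comments are skipped."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding double or single quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values):
    """Return the required keys that are absent or empty."""
    return [k for k in REQUIRED_KEYS if not values.get(k)]


# Sample content with DB_NAME deliberately left out.
sample = '''
URI="neo4j://127.0.0.1:7687"
DB_USER="your_neo4j_user"
DB_PW="your_neo4j_password"
'''
config = parse_env(sample)
print(missing_keys(config))  # ['DB_NAME']
```

To check the real file, pass its contents instead of the sample, for example parse_env(open(".env").read()).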

Usage

Launch the interactive profiler (Streamlit UI):

streamlit run src/main.py

Then:

Connect to a Neo4j database (or upload a Cypher dump).
Define constraints based on your domain rules.
Run the assessment and export the results as CSV.
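
For readers who want to post-process an export, the sketch below shows one plausible shape for such a CSV, written with the standard csv module. The column names (dimension, constraint, score) and the sample rows are invented for illustration; the actual layout produced by the UI may differ.

```python
import csv
import io


def results_to_csv(results):
    """Serialize a list of result dictionaries to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["dimension", "constraint", "score"],  # hypothetical columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(results)
    return buffer.getvalue()


# Made-up results, only to show the row format.
results = [
    {"dimension": "completeness", "constraint": "Person.name is set", "score": 0.5},
    {"dimension": "uniqueness", "constraint": "no duplicate LIVES_IN edges", "score": 0.75},
]
csv_text = results_to_csv(results)
print(csv_text)
```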
