IngestGuard: Automated Data Governance with DataHub, MLflow, and OpenSearch

TL;DR: IngestGuard adds automated governance to DataHub. It ingests MLflow metadata into DataHub, uses OpenSearch to evaluate governance rules at scale, and writes violations back into the metadata graph as tags. Instead of being checked manually during audits, policies are evaluated automatically and surfaced directly inside DataHub as first-class metadata signals.

DataHub streamlines metadata management, data discovery, and governance by making data assets and their relationships visible across systems. While this improves visibility, governance often stops at exploration: assets can be browsed and searched, but nothing checks them against policy.

After ingestion, DataHub does not actively evaluate whether governance policies are still being followed. Ownership fields may exist but go unchecked, production assets can ship without tags, and stale models can remain unnoticed. Governance becomes a manual, periodic process, usually triggered by audits or incidents.

IngestGuard addresses this gap. MLflow is used only as a metadata source; its role is to emit execution metadata. IngestGuard’s job is to ensure that once this metadata enters DataHub, it can be evaluated, enforced, and audited automatically.

IngestGuard works in two phases:

Phase I: Ingestion

The first phase of IngestGuard focuses on ingestion and modeling. Metadata is extracted from MLflow, including experiments, runs, and agent information. This data is normalized into stable DataHub URNs and emitted into DataHub as first-class entities.

Entity creation and metadata attachment are handled through DataHub's OpenAPI v3 REST endpoints, primarily by writing the datasetProperties or mlModelProperties aspects with structured customProperties. Lineage between experiments, runs, and agents is created explicitly through the upstreamLineage aspect.
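To make the request shapes concrete, the sketch below only builds JSON bodies and sends nothing. The envelope used here (a list of entities, each with a "urn" and one aspect wrapped in "value") is an assumption about the OpenAPI v3 entity endpoints and should be checked against the Swagger UI of your DataHub version. The function names are hypothetical, and attaching lineage to a dataset-typed entity is also an assumption.

```python
import json
from typing import Any

PROPERTY_ASPECTS = ("datasetProperties", "mlModelProperties")


def properties_payload(urn: str, aspect: str, description: str,
                       custom: dict[str, Any]) -> list[dict]:
    """Build a body that attaches one properties aspect to one entity.

    customProperties is a string-to-string map in DataHub, so every value
    is stringified before it is sent.
    """
    if aspect not in PROPERTY_ASPECTS:
        raise ValueError(f"unsupported aspect: {aspect}")
    value = {
        "description": description,
        "customProperties": {key: str(val) for key, val in custom.items()},
    }
    return [{"urn": urn, aspect: {"value": value}}]


def lineage_payload(downstream_urn: str, upstream_urns: list[str]) -> list[dict]:
    """Build a body that declares the upstreams of one downstream entity."""
    upstreams = [{"dataset": urn, "type": "TRANSFORMED"} for urn in upstream_urns]
    return [{"urn": downstream_urn, "upstreamLineage": {"value": {"upstreams": upstreams}}}]


if __name__ == "__main__":
    exp = "urn:li:dataset:(urn:li:dataPlatform:mlflow,fraud_detection,PROD)"
    body = properties_payload(exp, "datasetProperties", "MLflow experiment",
                              {"run_count": 12, "source": "mlflow"})
    print(json.dumps(body, indent=2))
```

Keeping payload construction separate from the HTTP call makes the shapes easy to unit test and easy to adjust if the endpoint contract differs from what is assumed here.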

At the end of this phase, DataHub becomes the system of record for ML assets, including what exists, who owns it, and how everything is connected.

Phase II: Governance Scanning

Once metadata is ingested, DataHub automatically indexes it in OpenSearch. IngestGuard leverages this index to evaluate governance rules defined declaratively in YAML, such as models without owners, production models missing tags, or assets that haven’t been updated in a specified time window.
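The rule schema is not spelled out here, so the following is only a sketch of what the three example rules might look like once a YAML file has been loaded into Python. Every key name (id, entity_type, severity, check, field, where, max_age_days) and every allowed value is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

ALLOWED_CHECKS = {"missing_field", "stale"}
ALLOWED_SEVERITIES = {"low", "medium", "high"}

# Roughly what a YAML loader would return for a hypothetical rules file.
RAW_RULES = [
    {"id": "no_owner", "entity_type": "mlModel", "severity": "high",
     "check": "missing_field", "field": "owners"},
    {"id": "prod_missing_tags", "entity_type": "mlModel", "severity": "medium",
     "check": "missing_field", "field": "tags", "where": {"origin": "PROD"}},
    {"id": "stale_90d", "entity_type": "mlModel", "severity": "low",
     "check": "stale", "field": "lastModified", "max_age_days": 90},
]


@dataclass(frozen=True)
class Rule:
    id: str
    entity_type: str
    severity: str
    check: str
    target_field: str
    where: dict = field(default_factory=dict)
    max_age_days: Optional[int] = None


def parse_rule(raw: dict) -> Rule:
    """Validate one raw rule and convert it into a typed Rule."""
    if raw["check"] not in ALLOWED_CHECKS:
        raise ValueError(f"{raw['id']}: unknown check {raw['check']!r}")
    if raw["severity"] not in ALLOWED_SEVERITIES:
        raise ValueError(f"{raw['id']}: unknown severity {raw['severity']!r}")
    if raw["check"] == "stale" and not raw.get("max_age_days"):
        raise ValueError(f"{raw['id']}: stale rules need max_age_days")
    return Rule(
        id=raw["id"],
        entity_type=raw["entity_type"],
        severity=raw["severity"],
        check=raw["check"],
        target_field=raw["field"],
        where=dict(raw.get("where", {})),
        max_age_days=raw.get("max_age_days"),
    )


RULES = [parse_rule(raw) for raw in RAW_RULES]
```

Validating rules at load time means a typo in a rule file fails fast, rather than producing a query that silently matches nothing.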

These rules are translated into OpenSearch Query DSL and executed directly against DataHub’s search indices. This avoids expensive graph traversal and enables governance checks to scale with metadata growth. When violations are detected, IngestGuard writes governance signals back into DataHub using two APIs:
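The translation step could look roughly like the sketch below, which turns one rule (shown as a plain dict) into an OpenSearch bool query. The bool, term, exists, and range clauses are standard Query DSL. The rule keys and the index field names (owners, tags, origin, lastModified) are assumptions about how the rules and DataHub's search documents are laid out, and the staleness check assumes the timestamp is mapped as a date field so that date math such as now-90d applies.

```python
def rule_to_query(rule: dict) -> dict:
    """Translate one hypothetical governance rule into an OpenSearch query body."""
    # Optional pre-filters, e.g. restrict the rule to production assets.
    filters = [{"term": {name: value}} for name, value in rule.get("where", {}).items()]

    if rule["check"] == "missing_field":
        # Match documents where the field is absent or empty.
        return {"query": {"bool": {
            "filter": filters,
            "must_not": [{"exists": {"field": rule["field"]}}],
        }}}

    if rule["check"] == "stale":
        # Match documents last touched before the cutoff.
        cutoff = f"now-{int(rule['max_age_days'])}d"
        filters.append({"range": {rule["field"]: {"lt": cutoff}}})
        return {"query": {"bool": {"filter": filters}}}

    raise ValueError(f"unknown check: {rule['check']!r}")


no_owner = rule_to_query({"id": "no_owner", "check": "missing_field", "field": "owners"})
stale = rule_to_query({"id": "stale_90d", "check": "stale",
                       "field": "lastModified", "max_age_days": 90,
                       "where": {"origin": "PROD"}})
```

Each resulting body can be sent to the search endpoint of the relevant entity index. Putting the conditions in filter context skips relevance scoring, which is all a pass/fail governance check needs.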

The Global Tags API is used to mark violations (e.g., governance_violation_no_owner).
Entity-specific property APIs (datasetProperties, mlModelProperties) are used to store structured context such as rule ID, severity, timestamps, and remediation guidance.
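A sketch of the write-back step follows, again building payload values only. It assumes that writing an aspect replaces the whole aspect, so existing tags are merged in rather than overwritten. The {"tags": [{"tag": urn}]} shape matches DataHub's globalTags aspect, while the helper names and the "governance." key prefix are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional

TAG_PREFIX = "governance_violation_"


def merge_violation_tags(existing_aspect: Optional[dict], violations: list[str]) -> dict:
    """Return a globalTags value containing the old tags plus the violation tags."""
    tag_urns = [entry["tag"] for entry in (existing_aspect or {}).get("tags", [])]
    for violation in violations:
        urn = f"urn:li:tag:{TAG_PREFIX}{violation}"
        if urn not in tag_urns:  # keep the write idempotent across scans
            tag_urns.append(urn)
    return {"tags": [{"tag": urn} for urn in tag_urns]}


def violation_properties(rule_id: str, severity: str,
                         detected_at: datetime, remediation: str) -> dict[str, str]:
    """Flatten violation context into string-only customProperties entries."""
    return {
        "governance.rule_id": rule_id,
        "governance.severity": severity,
        "governance.detected_at": detected_at.astimezone(timezone.utc).isoformat(),
        "governance.remediation": remediation,
    }


existing = {"tags": [{"tag": "urn:li:tag:pii"}]}
tags_value = merge_violation_tags(existing, ["no_owner"])
props = violation_properties("no_owner", "high",
                             datetime(2024, 1, 1, tzinfo=timezone.utc),
                             "Assign an owner in DataHub")
```

The same merge concern applies to customProperties: the dictionary returned here would need to be combined with the entity's existing properties before it is written.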

To conclude, IngestGuard automates governance enforcement and surfaces violations that are usually discovered late. It removes the pain of “compliance relying on humans remembering to check”.