Skip to content

Cross-Dataset Validation

Validate foreign key relationships, reconcile datasets, and detect distribution drift between data sources.

Foreign Key Checks

from duckguard import connect

orders = connect("orders.csv")
customers = connect("customers.csv")

# All customer_ids must exist in customers table
result = orders.customer_id.exists_in(customers.customer_id)

# FK with null handling
result = orders.customer_id.references(customers.customer_id, allow_nulls=True)

Finding Orphans

orphans = orders.customer_id.find_orphans(customers.customer_id)
print(f"Invalid IDs: {orphans}")
# ['CUST-999', 'CUST-deleted']

Value Set Comparison

# Check if two columns share the same distinct values
result = orders.status.matches_values(lookup.code)

Row Count Comparison

# With tolerance (allows ±10 rows)
result = orders.row_count_matches(backup, tolerance=10)

Reconciliation

Full row-by-row comparison between two datasets:

source = connect("orders_source.parquet")
target = connect("orders_migrated.parquet")

recon = source.reconcile(
    target,
    key_columns=["order_id"],
    compare_columns=["amount", "status", "customer_id"],
)

print(recon.match_percentage)    # 95.5
print(recon.missing_in_target)   # 3
print(recon.extra_in_target)     # 1
print(recon.value_mismatches)    # {'amount': 5, 'status': 2}
print(recon.summary())

Use cases:

  • Validating data migrations
  • Comparing source vs. warehouse
  • Verifying ETL transformations

Distribution Drift Detection

Detect statistical drift between a baseline and current data using the Kolmogorov-Smirnov test:

baseline = connect("orders_jan.parquet")
current = connect("orders_feb.parquet")

drift = current.amount.detect_drift(baseline.amount)

print(drift.is_drifted)    # True/False
print(drift.p_value)       # 0.0023
print(drift.statistic)     # KS statistic
print(drift.message)       # "Significant drift detected (p=0.0023)"

Use cases:

  • Detecting data source changes
  • ML feature drift monitoring
  • Data pipeline regression testing