Why DuckGuard?

The Problem

Data quality tools are stuck in 2018. Here's what it looks like to validate a single column with the market leader:

# Great Expectations — 50+ lines before you validate anything
import pandas as pd
from great_expectations import get_context

df = pd.read_csv("orders.csv")  # the DataFrame we want to validate

context = get_context()
datasource = context.sources.add_pandas("my_ds")
asset = datasource.add_dataframe_asset(name="orders", dataframe=df)
batch_request = asset.build_batch_request()
expectation_suite = context.add_expectation_suite("orders_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="orders_suite"
)
validator.expect_column_values_to_not_be_null("customer_id")
# ... and you're just getting started

This is insane. You need to understand contexts, datasources, assets, batch requests, expectation suites, and validators — just to check if a column has nulls.

The DuckGuard Way

from duckguard import connect

orders = connect("orders.csv")
assert orders.customer_id.is_not_null()

Done. No ceremony. No boilerplate. If you can write pytest, you can write DuckGuard.
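
Because checks are plain assert statements, they drop straight into pytest with no plugin or custom runner. A minimal sketch, using only the connect API shown above:

from duckguard import connect

def test_orders_have_customer_ids():
    # pytest discovers this like any other test; a failing check is a failing assert
    orders = connect("orders.csv")
    assert orders.customer_id.is_not_null()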

Speed

DuckGuard is built on DuckDB, an extremely fast embedded analytics engine. The difference shows up directly in the numbers:

| Dataset Size  | Great Expectations | DuckGuard          |
| ------------- | ------------------ | ------------------ |
| 1 GB CSV      | 45 sec / 4 GB RAM  | 4 sec / 200 MB RAM |
| 10 GB Parquet | 8 min / 32 GB RAM  | 45 sec / 2 GB RAM  |

Why? DuckDB uses columnar storage, vectorized execution, and SIMD optimizations. DuckGuard reads files directly — no loading into pandas, no DataFrame conversion, no memory explosion.
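To make that concrete, here is roughly what a null check looks like once it is pushed down to DuckDB: a single SQL scan over the file, with no intermediate DataFrame. This is a conceptual sketch using the duckdb package directly, not DuckGuard's actual internals:

import duckdb

# DuckDB scans the CSV in place and reads only the column it needs
null_count = duckdb.sql(
    "SELECT count(*) FROM 'orders.csv' WHERE customer_id IS NULL"
).fetchone()[0]
assert null_count == 0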

Feature Comparison

| What you need          | DuckGuard | Great Expectations | Soda Core        |
| ---------------------- | --------- | ------------------ | ---------------- |
| Validate a column      | 1 line    | 50+ lines          | 10+ lines (YAML) |
| PII detection          | Built-in  |                    |                  |
| Anomaly detection      | 7 methods |                    | Partial          |
| Row-level errors       | Built-in  | Yes                |                  |
| Data contracts         | Built-in  |                    | Yes              |
| Conditional checks     | Built-in  |                    |                  |
| Query-based checks     | Built-in  |                    | Yes              |
| Drift detection        | Built-in  |                    |                  |
| Reconciliation         | Built-in  |                    |                  |
| Quality scoring (A-F)  | Built-in  |                    |                  |
| Learning curve         | Minutes   | Days               | Hours            |
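
For a flavor of what "built-in" means here, the sketch below imagines a few of these checks as one-liners. The method names (scan_pii, drift_against, score) are hypothetical illustrations, not documented DuckGuard API:

from duckguard import connect

orders = connect("orders.csv")

# Hypothetical method names, for illustration only
pii_report = orders.scan_pii()                          # flag email-, phone-, SSN-like columns
drift = orders.drift_against("orders_last_month.csv")   # compare column distributions
grade = orders.score()                                  # roll results up into an A-F grade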

Who Is DuckGuard For?

Data engineers who want validation that doesn't slow down their pipelines.

Analytics engineers who want data quality checks as readable as their SQL.

ML engineers who need to detect data drift before it breaks their models.

Anyone who's tired of writing 50 lines of Python or pages of YAML just to check that a column is not null.

Design Principles

  1. Zero boilerplate — If it takes more than 3 lines to get started, that's too many
  2. Speed by default — DuckDB under the hood, not pandas
  3. Batteries included — PII, anomalies, contracts, drift — all built in
  4. Pytest-native — Use assert, not .expect_column_values_to_be_blah()
  5. Progressive complexity — Simple things simple, complex things possible