Why DuckGuard?
The Problem
Data quality tools are stuck in 2018. Here's what it looks like to validate a single column with the market leader:
# Great Expectations — 50+ lines before you validate anything
import pandas as pd
from great_expectations import get_context

df = pd.read_csv("orders.csv")  # the DataFrame to validate
context = get_context()
datasource = context.sources.add_pandas("my_ds")
asset = datasource.add_dataframe_asset(name="orders", dataframe=df)
batch_request = asset.build_batch_request()
expectation_suite = context.add_expectation_suite("orders_suite")
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="orders_suite",
)
validator.expect_column_values_to_not_be_null("customer_id")
# ... and you're just getting started
This is insane. You need to understand contexts, datasources, assets, batch requests, expectation suites, and validators — just to check if a column has nulls.
The DuckGuard Way
from duckguard import connect
orders = connect("orders.csv")
assert orders.customer_id.is_not_null()
Done. No ceremony. No boilerplate. If you can write pytest, you can write DuckGuard.
Speed
DuckGuard is built on DuckDB, a fast in-process analytics engine. This means:
| Dataset size | Great Expectations (time / RAM) | DuckGuard (time / RAM) |
|---|---|---|
| 1 GB CSV | 45 sec / 4 GB RAM | 4 sec / 200 MB RAM |
| 10 GB Parquet | 8 min / 32 GB RAM | 45 sec / 2 GB RAM |
Why? DuckDB processes data in columnar batches with vectorized execution, and its inner loops are written so the compiler can auto-vectorize them into SIMD instructions. DuckGuard reads files directly: no loading into pandas, no DataFrame conversion, no memory explosion.
Feature Comparison
| What you need | DuckGuard | Great Expectations | Soda Core |
|---|---|---|---|
| Validate a column | 1 line | 50+ lines | 10+ lines (YAML) |
| PII detection | Built-in | ✗ | ✗ |
| Anomaly detection | 7 methods | ✗ | Partial |
| Row-level errors | Built-in | Yes | ✗ |
| Data contracts | Built-in | ✗ | Yes |
| Conditional checks | Built-in | ✗ | ✗ |
| Query-based checks | Built-in | ✗ | Yes |
| Drift detection | Built-in | ✗ | ✗ |
| Reconciliation | Built-in | ✗ | ✗ |
| Quality scoring (A-F) | Built-in | ✗ | ✗ |
| Learning curve | Minutes | Days | Hours |
Who Is DuckGuard For?
Data engineers who want validation that doesn't slow down their pipelines.
Analytics engineers who want data quality checks as readable as their SQL.
ML engineers who need to detect data drift before it breaks their models.
Anyone who's tired of writing 50 lines of boilerplate to check if a column is not null.
Design Principles
- Zero boilerplate: If it takes more than 3 lines to start, it's too many
- Speed by default: DuckDB under the hood, not pandas
- Batteries included: PII, anomalies, contracts, drift, all built in
- Pytest-native: Use `assert`, not `.expect_column_values_to_be_blah()`
- Progressive complexity: Simple things simple, complex things possible