Skip to content

Column Validation

DuckGuard's column validation API is designed to read like English. Every method returns a ValidationResult with .passed, .message, .summary(), and .failed_rows.

Null & Uniqueness

from duckguard import connect

orders = connect("orders.csv")

# No nulls allowed
assert orders.order_id.is_not_null()

# All values must be distinct
assert orders.order_id.is_unique()

# Alias for is_unique
assert orders.order_id.has_no_duplicates()

Column Properties

Access statistics directly on any column:

col = orders.total_amount

col.null_count       # Number of null values
col.null_percent     # Percentage of nulls
col.unique_count     # Number of distinct values
col.min              # Minimum value
col.max              # Maximum value
col.mean             # Mean (numeric columns)
col.median           # Median (numeric columns)
col.stddev           # Standard deviation

Range Checks

# Inclusive range
assert orders.total_amount.between(0, 10000)

# Exclusive bounds
assert orders.total_amount.greater_than(0)
assert orders.total_amount.less_than(100000)

Pattern Matching

# Regex validation
assert orders.email.matches(r'^[\w.+-]+@[\w-]+\.[\w.]+$')

# String length
assert orders.order_id.value_lengths_between(5, 10)

Enum Checks

assert orders.status.isin(["pending", "shipped", "delivered"])

Working with Results

Every validation returns a ValidationResult:

result = orders.quantity.between(1, 100)

# Check pass/fail
result.passed          # True or False

# Human-readable message
result.message         # "Column 'quantity' has 3 values outside [1, 100]"

# Detailed summary with row-level errors
print(result.summary())

# Access failed rows directly
for row in result.failed_rows:
    print(f"Row {row.row_number}: {row.value} ({row.reason})")

# Get just the values or indices
result.get_failed_values()        # [500, -2, 0]
result.get_failed_row_indices()   # [5, 23, 29]

Using with pytest

# test_data_quality.py
from duckguard import connect

def test_orders_quality():
    orders = connect("data/orders.csv")
    assert orders.row_count > 0
    assert orders.order_id.is_not_null()
    assert orders.order_id.is_unique()
    assert orders.quantity.between(0, 10000)
    assert orders.status.isin(["pending", "shipped", "delivered"])

Run with pytest:

pytest test_data_quality.py -v

Next