PII Detection
Automatically detect personally identifiable information (PII) in your data using column names and value patterns.
Quick Start
from duckguard import connect
from duckguard.semantic import SemanticAnalyzer
data = connect("customers.csv")
analyzer = SemanticAnalyzer()
# Find all PII columns
pii = analyzer.find_pii_columns(data)
for col_name, sem_type, warning in pii:
    print(f"⚠️ {col_name}: {sem_type.value} - {warning}")
Semantic Type Detection
DuckGuard detects 40+ semantic types from column names and value patterns:
from duckguard import detect_type, detect_types_for_dataset
# Single column
sem_type = detect_type(data, "email")
print(sem_type) # SemanticType.EMAIL
# All columns at once
types = detect_types_for_dataset(data)
for col, stype in types.items():
    print(f"{col}: {stype.value}")
Full Dataset Analysis
analysis = analyzer.analyze(data)
print(f"PII columns: {analysis.pii_columns}")
print(f"Has PII: {analysis.has_pii}")
for col in analysis.columns:
    print(f"{col.name}: {col.semantic_type.value} "
          f"(confidence: {col.confidence:.0%})")
    if col.is_pii:
        print(f"  ⚠️ {col.pii_warning}")
    if col.suggested_validations:
        print(f"  Suggested: {col.suggested_validations}")
PII Types Detected
| Type | Detection | Example |
|---|---|---|
| Email | Name + regex pattern | user@example.com |
| Phone | Name + digit pattern | +1-555-0123 |
| SSN | Name + \d{3}-\d{2}-\d{4} | 123-45-6789 |
| Credit Card | Name + 16-digit pattern | 4111-1111-1111-1111 |
| Person Name | Column name matching | first_name, surname |
| Address | Column name matching | street_address |
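To make the value-pattern column of the table concrete, here is a minimal sketch of regex checks for the four pattern-based types. The expressions and the `matching_types` helper are assumptions written for illustration; they are not DuckGuard's actual patterns, which may be stricter or looser.

```python
import re

# Illustrative patterns only (assumed, not taken from DuckGuard's source).
VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    # 16 digits in groups of four, with one consistent separator (or none).
    "credit_card": re.compile(r"^\d{4}([- ]?)\d{4}\1\d{4}\1\d{4}$"),
    # Deliberately loose: optional "+", then digits and common separators.
    "phone": re.compile(r"^\+?\d[\d\-\s().]{5,}\d$"),
}


def matching_types(value: str) -> list[str]:
    """Return the names of every pattern that the whole value matches."""
    return [name for name, pattern in VALUE_PATTERNS.items() if pattern.match(value)]


for example in ["user@example.com", "+1-555-0123", "123-45-6789", "4111-1111-1111-1111"]:
    print(example, matching_types(example))
```

With these particular patterns, the SSN and credit card examples also satisfy the loose phone pattern. Value patterns alone can be ambiguous, which is one reason to pair them with a column-name signal.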
Detection Methods
Detection uses a two-pass approach:
- Column name patterns: matches against 40+ name patterns (e.g., email, phone_number, ssn)
- Value patterns: regex matching on sampled values (e.g., email format, UUID format)
Confidence scores combine both signals (0.0–1.0).
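The two passes and the combined score can be sketched in plain Python. Everything here is hypothetical: the `NAME_HINTS` and `VALUE_PATTERNS` tables, the equal weights, and the `detect` function are assumptions chosen for illustration, and DuckGuard's real scoring may combine the signals differently.

```python
import re

# Hypothetical name hints and value patterns for two types.
NAME_HINTS = {
    "email": re.compile(r"e[-_]?mail", re.IGNORECASE),
    "ssn": re.compile(r"ssn|social[-_]?security", re.IGNORECASE),
}
VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}

# Assumed weights: each signal contributes half of the final confidence.
NAME_WEIGHT = 0.5
VALUE_WEIGHT = 0.5


def detect(column_name, sample_values):
    """Return (type_name, confidence) for the best-scoring type, or (None, 0.0)."""
    best_type, best_score = None, 0.0
    non_empty = [v for v in sample_values if v]
    for type_name, name_pattern in NAME_HINTS.items():
        # Pass 1: does the column name look like this type?
        name_score = 1.0 if name_pattern.search(column_name) else 0.0
        # Pass 2: what fraction of the sampled values match the value pattern?
        if non_empty:
            hits = sum(1 for v in non_empty if VALUE_PATTERNS[type_name].match(v))
            value_score = hits / len(non_empty)
        else:
            value_score = 0.0
        score = NAME_WEIGHT * name_score + VALUE_WEIGHT * value_score
        if score > best_score:
            best_type, best_score = type_name, score
    return best_type, best_score


print(detect("email", ["a@b.com", "c@d.org"]))   # both signals agree
print(detect("contact", ["a@b.com", "nope"]))    # value signal only, half the sample
```

A weighted sum keeps the result in the 0.0 to 1.0 range as long as the weights sum to 1. A column whose name matches but whose values do not (or the reverse) lands in the middle of the range.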
Generate Validation YAML
Output:
Quick Scan
For a fast type-only scan (no statistics):
types = analyzer.quick_scan(data)
# {'order_id': SemanticType.PRIMARY_KEY, 'email': SemanticType.EMAIL, ...}
Common Semantic Types
Identity: primary_key, foreign_key, uuid, id
Contact: email, phone, url, ip_address
PII: ssn, credit_card, person_name, address
Location: country, state, city, zipcode, latitude, longitude
Time: date, datetime, timestamp, year
Numeric: currency, percentage, quantity, age
Categorical: boolean, enum, status, category
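The categories above can be used to summarize a scan result. The sketch below is an assumption-laden illustration: it uses plain strings rather than DuckGuard's `SemanticType` enum, and the `group_by_category` helper and the "Other" fallback are hypothetical names, not part of the library.

```python
# Category lists copied from the documentation above.
SEMANTIC_CATEGORIES = {
    "Identity": ["primary_key", "foreign_key", "uuid", "id"],
    "Contact": ["email", "phone", "url", "ip_address"],
    "PII": ["ssn", "credit_card", "person_name", "address"],
    "Location": ["country", "state", "city", "zipcode", "latitude", "longitude"],
    "Time": ["date", "datetime", "timestamp", "year"],
    "Numeric": ["currency", "percentage", "quantity", "age"],
    "Categorical": ["boolean", "enum", "status", "category"],
}

# Reverse lookup: type name -> category.
TYPE_TO_CATEGORY = {
    type_name: category
    for category, type_names in SEMANTIC_CATEGORIES.items()
    for type_name in type_names
}


def group_by_category(column_types):
    """Group column names by the category of their detected type string."""
    grouped = {}
    for column, type_name in column_types.items():
        # "Other" is a hypothetical bucket for types not listed above.
        category = TYPE_TO_CATEGORY.get(type_name, "Other")
        grouped.setdefault(category, []).append(column)
    return grouped


print(group_by_category({"order_id": "primary_key", "email": "email", "ssn": "ssn"}))
```

If the scan returns enum members, passing each member's `.value` (as the examples on this page print) would presumably give the strings this lookup expects.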
CLI Integration
The discover and info commands include semantic analysis: