Data Quality Assurance Framework
The CLIF Consortium validates every dataset against a three-pillar framework (Conformance, Completeness, and Plausibility) so that data issues are caught before they reach your analysis.
Conformance
Schema & Structure
Conformance checks verify structure and schema: do the right tables exist, are the required columns present, are data types valid, and do categorical values belong to the allowed vocabulary? These checks are the first line of defense. If the skeleton is wrong, nothing downstream can be trusted.
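
Pass: every column has its expected type.

| Column | Type |
|---|---|
| weight | float64 |
| height | float64 |
| age | int64 |

Fail: weight contains a string ("abc"), so the column is typed object rather than float64.

| Column | Type |
|---|---|
| weight | object ("abc") |
| height | float64 |
| age | int64 |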
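
The type check illustrated above can be sketched in plain Python. This is a minimal illustration, not clifpy's implementation: the `EXPECTED_TYPES` mapping, the `check_conformance` helper, and the sample values are all assumptions made for the example, and Python types stand in for the pandas dtypes shown in the tables.

```python
# Hypothetical schema: column name -> expected Python type.
# (Stand-in for dtypes such as float64 / int64; not clifpy's schema format.)
EXPECTED_TYPES = {"weight": float, "height": float, "age": int}


def check_conformance(rows, expected_types):
    """Return (row_index, column, problem) for every schema violation."""
    problems = []
    for i, row in enumerate(rows):
        for column, expected in expected_types.items():
            if column not in row:
                problems.append((i, column, "missing column"))
            elif not isinstance(row[column], expected):
                found = type(row[column]).__name__
                problems.append(
                    (i, column, f"expected {expected.__name__}, got {found}")
                )
    return problems


# Made-up sample rows: one clean, one with a string in a numeric column.
good_rows = [{"weight": 70.5, "height": 172.0, "age": 54}]
bad_rows = [{"weight": "abc", "height": 172.0, "age": 54}]

conformance_ok = check_conformance(good_rows, EXPECTED_TYPES)
conformance_bad = check_conformance(bad_rows, EXPECTED_TYPES)
print(conformance_ok)
print(conformance_bad)
```

A real check would also compare categorical values against the allowed vocabulary; the same loop extends naturally with a per-column set of permitted values.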
Completeness
Data Coverage
A dataset can be perfectly conformant yet still unusable if half the values are NULL. Completeness checks look beyond structure to ask: is the data filled in? They flag excessive missing values, verify that conditional fields are populated when they should be, and ensure that expected categorical coverage is met.
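
Pass: both IMV rows report fio2_set and peep.

| device_category | fio2_set | peep |
|---|---|---|
| IMV | 0.6 | 10 |
| IMV | 0.5 | 8 |

Fail: each IMV row is missing a setting that should be populated.

| device_category | fio2_set | peep |
|---|---|---|
| IMV | NULL | 10 |
| IMV | 0.5 | NULL |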
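
A conditional-field check like the one in these tables might look as follows. The rule itself (IMV rows must carry `fio2_set` and `peep`) is inferred from the example, and `REQUIRED_WHEN` and `missing_conditional_fields` are hypothetical names, so treat this as a sketch rather than the rule set clifpy actually applies.

```python
# Assumed rule for illustration: rows with this device_category
# must have these columns populated.
REQUIRED_WHEN = {"IMV": ["fio2_set", "peep"]}


def missing_conditional_fields(rows, required_when):
    """Return (row_index, column) for each required field that is None."""
    gaps = []
    for i, row in enumerate(rows):
        required = required_when.get(row.get("device_category"), [])
        for column in required:
            if row.get(column) is None:
                gaps.append((i, column))
    return gaps


# The failing table from above, with NULL represented as None.
imv_rows = [
    {"device_category": "IMV", "fio2_set": None, "peep": 10},
    {"device_category": "IMV", "fio2_set": 0.5, "peep": None},
]

gaps = missing_conditional_fields(imv_rows, REQUIRED_WHEN)
print(gaps)
```

Rows whose `device_category` has no entry in the rule table are skipped, so a missing `peep` on a device that does not use it is not counted as a gap.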
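
Pass: every hosp_id in labs also exists in hospitalization.

| labs.hosp_id | in hospitalization? |
|---|---|
| H001 | ✓ |
| H002 | ✓ |
| H003 | ✓ |

Fail: H999 appears in labs but not in hospitalization (an orphan record).

| labs.hosp_id | in hospitalization? |
|---|---|
| H001 | ✓ |
| H999 | ✗ (orphan) |
| H002 | ✓ |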
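
The orphan check above amounts to a set-membership test between a child table and its parent. A minimal sketch, assuming the IDs are available as plain lists (`find_orphans` is a hypothetical helper, not a clifpy function):

```python
def find_orphans(child_ids, parent_ids):
    """Return child IDs absent from the parent table, in first-seen order."""
    known = set(parent_ids)
    seen = set()
    orphans = []
    for cid in child_ids:
        if cid not in known and cid not in seen:
            orphans.append(cid)
            seen.add(cid)
    return orphans


# IDs taken from the tables above.
hospitalization_ids = ["H001", "H002", "H003"]
labs_ids = ["H001", "H999", "H002"]

orphans = find_orphans(labs_ids, hospitalization_ids)
print(orphans)
```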
Plausibility
Clinical Sense
Plausibility checks go beyond structure to verify that data values are clinically reasonable. A heart rate of 450 is technically numeric but clinically impossible. These checks catch subtle ETL bugs that conformance alone would miss.

Pass: every vital falls within its allowed range.

| Vital | Value | Range |
|---|---|---|
| heart_rate | 88 | 0–300 |
| sbp | 125 | 0–400 |
| spo2 | 96 | 0–100 |

Fail: a heart_rate of 450 lies outside the 0–300 range.

| Vital | Value | Range |
|---|---|---|
| heart_rate | 450 | 0–300 ✗ |
| sbp | 125 | 0–400 |
| spo2 | 96 | 0–100 |
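
A range check of this kind reduces to comparing each value against per-vital bounds. The sketch below uses the three ranges shown in the tables; the `out_of_range` helper is hypothetical, and treating the bounds as inclusive is an assumption.

```python
# Ranges as shown in the tables above (bounds assumed inclusive).
RANGES = {"heart_rate": (0, 300), "sbp": (0, 400), "spo2": (0, 100)}


def out_of_range(vitals, ranges):
    """Return {vital: value} for every value outside its allowed range."""
    flagged = {}
    for name, value in vitals.items():
        low, high = ranges[name]
        if not (low <= value <= high):
            flagged[name] = value
    return flagged


plausible = out_of_range({"heart_rate": 88, "sbp": 125, "spo2": 96}, RANGES)
implausible = out_of_range({"heart_rate": 450, "sbp": 125, "spo2": 96}, RANGES)
print(plausible)
print(implausible)
```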
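
Pass: the device changes once, after a plausible interval.

| recorded_dttm | device_category |
|---|---|
| 10:00 | NIPPV |
| 10:30 | NIPPV |
| 11:00 | IMV |

Fail: the device flips from IMV to Room Air and back to IMV within two minutes.

| recorded_dttm | device_category |
|---|---|
| 10:00 | IMV |
| 10:01 | Room Air |
| 10:02 | IMV |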
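
One way to catch the pattern in the second table is to look for A → B → A sequences that complete within a short window. The rule, the five-minute threshold, and the `find_flip_flops` and `at` helpers are all assumptions chosen for illustration; the source does not state how this check is actually defined.

```python
from datetime import datetime, timedelta


def find_flip_flops(readings, max_gap=timedelta(minutes=5)):
    """Flag A -> B -> A device sequences completed within max_gap.

    readings: list of (datetime, device_category) sorted by time.
    Returns (start_time, end_time, intermediate_device) tuples.
    """
    flagged = []
    triples = zip(readings, readings[1:], readings[2:])
    for (t0, d0), (t1, d1), (t2, d2) in triples:
        if d0 == d2 and d0 != d1 and (t2 - t0) <= max_gap:
            flagged.append((t0, t2, d1))
    return flagged


def at(hhmm):
    # The tables show times only, so parse HH:MM without a calendar date.
    return datetime.strptime(hhmm, "%H:%M")


steady = [(at("10:00"), "NIPPV"), (at("10:30"), "NIPPV"), (at("11:00"), "IMV")]
erratic = [(at("10:00"), "IMV"), (at("10:01"), "Room Air"), (at("10:02"), "IMV")]

steady_flags = find_flip_flops(steady)
erratic_flags = find_flip_flops(erratic)
print(steady_flags)
print(erratic_flags)
```

The threshold is the main tuning knob: too short and genuine charting errors slip through, too long and legitimate brief trials off a device are flagged.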

Pass: each combination of hosp_id, lab_cat, and collect_dttm appears once.

| hosp_id | lab_cat | collect_dttm |
|---|---|---|
| H001 | potassium | Jan 15 08:00 |
| H001 | creatinine | Jan 15 08:00 |
| H001 | potassium | Jan 16 08:00 |

Fail: the potassium row for Jan 15 08:00 appears twice (a duplicate).

| hosp_id | lab_cat | collect_dttm |
|---|---|---|
| H001 | potassium | Jan 15 08:00 |
| H001 | potassium | Jan 15 08:00 |
| H001 | potassium | Jan 16 08:00 |
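
Duplicate detection comes down to counting rows per composite key. The sketch below assumes the key is (`hosp_id`, `lab_cat`, `collect_dttm`), as the tables suggest; `duplicate_keys` is a hypothetical helper rather than part of clifpy.

```python
from collections import Counter


def duplicate_keys(rows, key_columns):
    """Return {key_tuple: count} for every key that occurs more than once."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# The failing table from above.
labs_rows = [
    {"hosp_id": "H001", "lab_cat": "potassium", "collect_dttm": "Jan 15 08:00"},
    {"hosp_id": "H001", "lab_cat": "potassium", "collect_dttm": "Jan 15 08:00"},
    {"hosp_id": "H001", "lab_cat": "potassium", "collect_dttm": "Jan 16 08:00"},
]

dupes = duplicate_keys(labs_rows, ["hosp_id", "lab_cat", "collect_dttm"])
print(dupes)
```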
Run It Yourself
The clifpy package runs every check on this page against your own CLIF data and emits structured results, a PDF report, and a text summary.
# 1. install
pip install clifpy
# 2. orchestrate
from clifpy.orchestrator import ClifOrchestrator
co = ClifOrchestrator(config='clif_config.yaml')
co.initialize(['labs', 'vitals', 'medications'])
co.validate_all()
# 3. inspect
for err in co.labs.errors:
    print(err.column, err.message)

from clifpy.utils.validator import (
    run_conformance_checks,
    run_completeness_checks,
    run_plausibility_checks,
    run_full_dqa,
)
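# df and schema must already be defined: the labs table and its schema
result = run_full_dqa(df, schema, 'labs')
print(result.passed, result.metrics)
# Serialize or report out
# (the report helpers below must be imported separately; their import is not shown in this snippet)
result.to_dict()  # JSON-safe dict
generate_validation_pdf(result)  # PDF report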
generate_text_report(result)  # plain-text summary
New in clifpy 0.4.x
Nullable / allow_missing support, data profiling and monthly trends, ICD normalization (0.4.8+), atomic-check counts in reports, dual Polars / DuckDB backends for datasets in the hundreds of GB, and PDF + text report generators.