Data · Low risk

Data Cleaning Loop

Clean messy datasets with repeatable validation and artifact outputs.

What this Loop Engineering template does

Produce a clean dataset and a reproducible cleaning script, documenting every transformation.

Never overwrite the raw data. Label any imputed values. Keep the script reproducible.

When to use it

Messy CSV or table cleanup
Schema normalization
Reproducible pipelines

When not to use it

Ambiguous labels needing human judgment
Data that must not be modified
Unverifiable sources

Validation checks

Missing values are reported
Schema is documented
Cleaning script is reproducible
Final dataset passes validation checks
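
As an illustration of the first and third checks, here is a minimal sketch using only the Python standard library. The names `MISSING_TOKENS`, `missing_value_report`, and `is_reproducible`, as well as the sample CSV, are assumptions made for this example and are not part of the template. Which tokens count as "missing" depends on the dataset.

```python
import csv
import hashlib
import io

# Hypothetical set of tokens treated as "missing"; adjust per dataset.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "nan"}


def missing_value_report(rows, columns):
    """Count missing cells per column. `rows` is a list of dicts."""
    report = {col: 0 for col in columns}
    for row in rows:
        for col in columns:
            value = row.get(col)
            if value is None or value.strip().lower() in MISSING_TOKENS:
                report[col] += 1
    return report


def is_reproducible(clean_fn, raw_rows):
    """Run the cleaning function twice on copies of the raw rows
    and compare a digest of the two results."""
    def digest(rows):
        payload = repr([sorted(r.items()) for r in rows]).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    first = clean_fn([dict(r) for r in raw_rows])
    second = clean_fn([dict(r) for r in raw_rows])
    return digest(first) == digest(second)


# Made-up sample data, for illustration only.
RAW_CSV = "id,age,city\n1,34,Oslo\n2,,Bergen\n3,N/A,\n"
rows = list(csv.DictReader(io.StringIO(RAW_CSV)))

report = missing_value_report(rows, ["id", "age", "city"])
print(report)  # {'id': 0, 'age': 2, 'city': 1}
```

The report is meant to be saved alongside the cleaned dataset, so that missing values are reported rather than quietly handled.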

Boundaries & stop rule

Do not silently delete rows
Do not overwrite raw data
Do not invent missing values without labeling imputation
Stop rule: Stop when the cleaned dataset passes validation and the script reproduces it. If labels are ambiguous, stop and ask for a human decision rather than guessing.
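
The stop rule can be read as a small control loop. The sketch below is one possible shape, not the template's implementation: `clean_step`, `validate`, and `has_ambiguous_labels` are hypothetical hooks, and the limit of 4 comes from the "Maximum iterations" line in the loop prompt.

```python
MAX_ITERATIONS = 4  # from the loop prompt's "Maximum iterations: 4"


def run_loop(clean_step, validate, has_ambiguous_labels,
             max_iterations=MAX_ITERATIONS):
    """Return (status, iterations_used).

    All three callables are hypothetical hooks supplied by the caller.
    """
    for iteration in range(1, max_iterations + 1):
        # Fallback: ambiguous labels need a human decision, not a guess.
        if has_ambiguous_labels():
            return "needs_human_decision", iteration
        clean_step()
        if validate():
            return "passed", iteration
    return "iteration_limit_reached", max_iterations


# Toy usage: two outstanding issues, one fixed per cleaning step.
state = {"issues": 2}


def fix_one_issue():
    state["issues"] -= 1


result = run_loop(
    clean_step=fix_one_issue,
    validate=lambda: state["issues"] == 0,
    has_ambiguous_labels=lambda: False,
)
print(result)  # ('passed', 2)
```

Checking for ambiguity before each cleaning step means the loop stops before it acts on a guess, which matches the intent of the fallback.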

Copy the loop prompt

claude-goal.txt
/goal Produce a clean dataset and a reproducible cleaning script, documenting every transformation.
 
Work toward this goal until all validation checks pass or the stop rule is reached.
 
Loop cycle:
1. Discovery — Read the latest signal for this template before acting: CI output, issue detail, review comment, dataset report, or content brief.
2. Handoff — Hand the work to one agent in an isolated branch, worktree, or clearly scoped session. Keep final approval with a human.
3. Verification — Use an independent review pass to confirm the result, inspect the diff or artifact, and reject shortcut work.
4. Persistence — Save a short run note with the signal reviewed, actions taken, validation result, and next recommended step.
5. Scheduling — Run manually until the loop is reliable; only then consider a scheduled or event-triggered run.
 
Context:
Never overwrite the raw data. Label any imputed values. Keep the script reproducible.
 
Validation:
Missing values are reported
Schema is documented
Cleaning script is reproducible
Final dataset passes validation checks
 
Independent checker:
Use an independent review pass to confirm the result, inspect the diff or artifact, and reject shortcut work.
 
Boundaries:
Do not silently delete rows
Do not overwrite raw data
Do not invent missing values without labeling imputation
 
Stop rule:
Stop when the cleaned dataset passes validation and the script reproduces it.
Maximum iterations: 4
 
Budget:
Stop before exceeding the agreed per-run token budget.
 
Human approval:
Required before merge, deploy, delete, purchase, or external communication.
 
Fallback:
If labels are ambiguous, stop and ask for a human decision rather than guessing.
 
Do not delete tests, bypass checks, or modify unrelated files just to satisfy the validation condition. If blocked, stop and summarize the blocker, attempted fixes, and recommended next action.

Failure modes to watch

Silent row deletion
Overwritten raw data
Unlabeled imputation
Non-reproducible cleanup
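
Two of these failure modes, silent row deletion and unlabeled imputation, can be guarded against in the cleaning script itself. The following is a rough sketch under stated assumptions: the helper names `split_rejects` and `impute_with_flag`, the `_imputed` column suffix, and the choice of median imputation are all illustrative and not prescribed by the template.

```python
from statistics import median


def split_rejects(rows, is_valid):
    """Partition rows into kept and rejected instead of deleting them.
    The two lists always account for every input row."""
    kept = [r for r in rows if is_valid(r)]
    rejected = [r for r in rows if not is_valid(r)]
    assert len(kept) + len(rejected) == len(rows)
    return kept, rejected


def impute_with_flag(rows, column):
    """Fill missing numeric values with the column median and add a
    boolean `<column>_imputed` flag to every row.

    Raises statistics.StatisticsError if the column has no values at all.
    """
    present = [float(r[column]) for r in rows if r[column] not in ("", None)]
    fill = median(present)
    out = []
    for r in rows:
        new = dict(r)  # copy, so the input rows stay untouched
        was_missing = r[column] in ("", None)
        new[column] = fill if was_missing else float(r[column])
        new[f"{column}_imputed"] = was_missing
        out.append(new)
    return out


# Made-up sample rows, for illustration only.
raw_rows = [
    {"id": "1", "age": "30"},
    {"id": "2", "age": ""},
    {"id": "3", "age": "40"},
    {"id": "", "age": "50"},  # no id: rejected, not deleted
]

kept, rejected = split_rejects(raw_rows, lambda r: bool(r["id"]))
cleaned = impute_with_flag(kept, "age")

print(len(kept), len(rejected))  # 3 1
print(cleaned[1])  # {'id': '2', 'age': 35.0, 'age_imputed': True}
```

Writing `rejected` to its own file keeps the row count auditable: input rows equal kept rows plus rejected rows.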

Loop Engineering FAQ

Why should the raw data never be overwritten?

If a cleaning step turns out to be wrong, you need the original data to start over. Cleaning should always produce a new artifact and never overwrite the source.
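
One way to enforce this in a script is to refuse any output path that resolves to the raw file and to confirm the raw file's checksum is unchanged after writing. The sketch below assumes CSV input and output; the function names `file_sha256` and `write_cleaned` and the file names are hypothetical.

```python
import csv
import hashlib
import os
import tempfile


def file_sha256(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


def write_cleaned(raw_path, out_path, rows, fieldnames):
    """Write cleaned rows to a new file; never touch the raw file."""
    if os.path.realpath(raw_path) == os.path.realpath(out_path):
        raise ValueError("refusing to overwrite raw data")
    before = file_sha256(raw_path)
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    if file_sha256(raw_path) != before:
        raise RuntimeError("raw data changed during cleaning")
    return out_path


# Toy usage in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    raw = os.path.join(tmp, "raw.csv")
    with open(raw, "w", newline="", encoding="utf-8") as fh:
        fh.write("id,city\n1, Oslo \n")
    raw_digest = file_sha256(raw)

    write_cleaned(raw, os.path.join(tmp, "clean.csv"),
                  [{"id": "1", "city": "Oslo"}], ["id", "city"])
    raw_untouched = file_sha256(raw) == raw_digest

    try:
        write_cleaned(raw, raw, [], ["id", "city"])
        overwrite_blocked = False
    except ValueError:
        overwrite_blocked = True

print(raw_untouched, overwrite_blocked)  # True True
```

Recording the raw file's digest in the run note also lets a later run confirm it started from the same source.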