A hands-on training repository for IPA Research Associates and Research Managers. This session teaches AI-assisted data cleaning in Stata using GitHub Copilot inside VS Code, built on top of the IPA Stata Template.
Warning
NEVER COMMIT DATA FILES TO GITHUB.
NEVER USE AI ASSISTANTS WITH PERSONALLY IDENTIFIABLE DATA.
YOU ARE REQUIRED TO REMOVE IDENTIFYING INFORMATION BEFORE CONNECTING AI ASSISTANTS OR STORING IN ANY UNENCRYPTED LOCATION.
This training uses synthetic data only. No real survey data should ever be committed to this repository.
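For real projects, removing direct identifiers is a concrete Stata step to take before any dataset is shared with an AI assistant or stored unencrypted. A minimal sketch; the variable and file names here are hypothetical and are not part of the training dataset:

```stata
* Hypothetical de-identification step (variable and file names are illustrative only)
use "household_survey_identified.dta", clear
drop respondent_name phone_number gps_latitude gps_longitude   // direct identifiers
save "household_survey_deidentified.dta", replace
```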
- Git (download)
- VSCode (download)
- GitHub Copilot extension for VSCode (download)
- Stata 17+
- Install `just` by running the appropriate command in your terminal:
  - Windows: `winget install --id Casey.Just -e`
  - macOS/Linux: `brew install just`
  - Restart your terminal and proceed with the steps below.
- Clone the repository:

  ```bash
  git clone <repo-url>
  cd ai-assisted-data-cleaning
  ```
- Configure your Stata path

  Copy `.env-example` to `.env` and set your Stata executable path:

  ```bash
  # Windows example
  STATA_CMD='C:\Program Files\Stata18\StataSE-64.exe'
  STATA_EDITION='se'

  # macOS example
  # STATA_CMD='/Applications/Stata/StataSE.app/Contents/MacOS/StataSE'
  ```
- One-time setup (installs `setroot` and all required packages):

  ```bash
  just stata-setup
  # or from Stata directly:
  # do setup.do
  ```
- Generate the synthetic training dataset (run once)

  From the project root:

  ```bash
  just create-synthetic-data
  # or from Stata directly: do setup/generate_synthetic_data.do
  ```

  This creates `data/raw/household_survey_raw.dta` — a synthetic household survey with 500 observations and intentional data quality issues.
- Run the training pipeline

  ```bash
  # Full pipeline
  just stata-run
  # Or from Stata directly: do do_files/00_run.do

  # Run a single module
  just stata-script 02_string_cleaning
  # or: do do_files/00_run.do "02_string_cleaning"
  ```
- Check outputs

  - CSV exports: `outputs/`
  - Codebook: `outputs/codebook.xlsx`
  - Logs: `logs/`
  - Final clean dataset: `data/final/hh_clean_final.dta`
The pipeline consists of five modules, each designed as a hands-on Copilot exercise. Every module contains `// TODO` comments and `* COPILOT PROMPT:` comments with ready-to-use natural-language prompts.
| Module | File | Topic |
|---|---|---|
| 01 | `do_files/01_data_cleaning.do` | Load data, inspect quality, check identifiers |
| 02 | `do_files/02_string_cleaning.do` | Trim spaces, title case, standardise categories |
| 03 | `do_files/03_deduplication.do` | Detect and resolve duplicate records |
| 04 | `do_files/04_outliers_flags.do` | IQR outlier flagging, winsorisation, `.o` recoding |
| 05 | `do_files/05_labeling_codebook.do` | Variable labels, value labels, codebook export |
Each module:

- Runs standalone — includes an initialisation block that sets up paths if the module is run directly, without going through `00_run.do` (a sketch of such a block follows below)
- Runs as part of the pipeline — `00_run.do` calls each module in sequence
- Contains TODOs — blank sections where participants write Copilot-assisted code
- Contains COPILOT PROMPT comments — copy the plain-English prompt into Copilot Chat or let inline Copilot autocomplete the code
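The initialisation block itself is not reproduced in this README. One plausible shape, assuming `setroot` sets `${project_path}` when called with no arguments; the exact invocation in the module do-files may differ:

```stata
* Sketch of a standalone-run guard (assumed pattern; see the module do-files for the real one)
if "${project_path}" == "" {
    setroot                              // assumed: searches upward for .here and sets the root global
    global data    "${project_path}/data"
    global logs    "${project_path}/logs"
    global outputs "${project_path}/outputs"
}
```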
Each module contains one or more exercises marked with `// TODO` and `* COPILOT PROMPT:` comments. The prompts are written in plain English so you can paste them directly into GitHub Copilot Chat or use them as inline autocomplete triggers.
- Show missingness information (variable name, count missing, % missing) — key commands: `missings report`
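A minimal sketch of the same idea using only built-in commands; the exercise itself points to the community-contributed `missings report`, so the loop below is an alternative illustration rather than the expected solution:

```stata
* Count and percentage of missing values per variable (built-in commands only)
use "${data}/raw/household_survey_raw.dta", clear
foreach v of varlist _all {
    quietly count if missing(`v')
    display %-24s "`v'" %8.0f r(N) %8.1f 100*r(N)/_N "  % missing"
}
```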
| # | Exercise | Key commands |
|---|---|---|
| 1 | Trim leading and trailing whitespace from every string variable | `ds, has(type string)`, `strtrim()` |
| 2 | Standardise `enumerator_name` to title case | `proper()` |
| 3 | Clean `district_name` — lowercase, trim spaces, collapse internal spaces | `lower()`, `strtrim()`, `itrim()` |
| 4 | Recode `occupation_raw` to five canonical categories: Farmer, Teacher, Trader, Laborer, Other | `inlist()`, `strmatch()` |
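A hedged sketch of how these exercises might look once solved; the matching rules for `occupation_raw` and the `occupation_clean` variable name are illustrative, not the canonical mapping:

```stata
* Exercise 1: trim every string variable
ds, has(type string)
foreach v of varlist `r(varlist)' {
    replace `v' = strtrim(`v')
}
replace enumerator_name = proper(enumerator_name)                // exercise 2: Title Case
replace district_name   = itrim(strtrim(lower(district_name)))   // exercise 3
* Exercise 4: collapse free-text occupations (rules below are illustrative)
gen occupation_clean = "Other"
replace occupation_clean = "Farmer"  if strmatch(lower(occupation_raw), "*farm*")
replace occupation_clean = "Teacher" if strmatch(lower(occupation_raw), "*teach*")
replace occupation_clean = "Trader"  if strmatch(lower(occupation_raw), "*trad*")
replace occupation_clean = "Laborer" if strmatch(lower(occupation_raw), "*labo*")
```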
| # | Exercise | Key commands |
|---|---|---|
| 1 | Report how many records share the same `hhid` | `duplicates report` |
| 2 | Create an `is_duplicate` flag (0 = unique, 1 = duplicate) | `duplicates tag` |
| 3 | Export all duplicate records to `outputs/hh_duplicates.xlsx` for review | `export excel ... if is_duplicate == 1` |
| 4 | Keep only the most recent record per `hhid` using `survey_date` | `bysort hhid (survey_date): keep if _n == _N`, `isid` |
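One possible end-to-end sketch of these exercises, assuming `hhid` and `survey_date` exist as described in the table above:

```stata
duplicates report hhid                          // exercise 1: how many hhids repeat
duplicates tag hhid, gen(is_duplicate)          // exercise 2: 0 = unique, >0 = duplicated
replace is_duplicate = 1 if is_duplicate > 0    // collapse to a 0/1 flag
export excel using "${outputs}/hh_duplicates.xlsx" if is_duplicate == 1, ///
    firstrow(variables) replace                 // exercise 3
bysort hhid (survey_date): keep if _n == _N     // exercise 4: keep the most recent record
isid hhid                                       // confirm hhid now uniquely identifies rows
```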
| # | Exercise | Key commands |
|---|---|---|
| 1 | Flag outliers in `hh_income_monthly` using the IQR method; add an `income_flag_reason` string | `summarize, detail`, `r(p25)`, `r(p75)` |
| 2 | Winsorise `hh_expenditure` at the 1st and 99th percentiles | `winsor2 ..., cuts(1 99) replace` |
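One way the IQR exercise could be written; the 1.5×IQR multiplier, the flag text, and the `income_outlier` variable name are assumptions, and `winsor2` is a community-contributed package installed by setup:

```stata
* Exercise 1: flag values outside 1.5*IQR (multiplier is an assumption)
summarize hh_income_monthly, detail
local iqr   = r(p75) - r(p25)
local lower = r(p25) - 1.5*`iqr'
local upper = r(p75) + 1.5*`iqr'
gen income_outlier = !missing(hh_income_monthly) & ///
    (hh_income_monthly < `lower' | hh_income_monthly > `upper')
gen income_flag_reason = cond(income_outlier, "outside 1.5*IQR bounds", "")

* Exercise 2: winsorise expenditure at the 1st/99th percentiles
winsor2 hh_expenditure, cuts(1 99) replace
```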
| # | Exercise | Key commands |
|---|---|---|
| 1 | Apply descriptive variable labels to all 21 variables in the dataset | `label variable` |
| 2 | Define a `yn_label` (0 = No, 1 = Yes) and apply it to all `_yn` binary variables | `label define`, `label values`, `foreach var of varlist *_yn` |
| 3 | Define an `edu_label` (0–3: No education → Tertiary) and apply it to `edu_level` | `label define`, `label values` |
| 4 | Generate and export a codebook to `outputs/codebook.xlsx` | `ipacodebook` (preferred), or `codebook` + `putexcel` |
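A sketch of the labelling exercises; the intermediate education labels (Primary, Secondary) are assumptions about the 0–3 scale, and the exact `ipacodebook` options may differ from what is shown:

```stata
label variable hh_income_monthly "Monthly household income"   // exercise 1 (one of the 21 labels)

label define yn_label 0 "No" 1 "Yes", replace                  // exercise 2
foreach v of varlist *_yn {
    label values `v' yn_label
}

label define edu_label 0 "No education" 1 "Primary" 2 "Secondary" 3 "Tertiary", replace
label values edu_level edu_label                               // exercise 3

* Exercise 4 (exact ipacodebook options may differ):
* ipacodebook using "${outputs}/codebook.xlsx", replace
```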
```text
├── README.md                       # This file
├── CLAUDE.md                       # AI assistant instructions and conventions
├── .here                           # Project root marker (used by setroot)
├── .env                            # Stata executable config (gitignored — copy from .env-example)
├── .env-example                    # Template for .env
├── config.do.template              # Template for user-specific data paths
├── config.do                       # User-specific paths (gitignored — copy from template)
├── setup.do                        # One-time setup: installs setroot + packages
│
├── setup/
│   └── generate_synthetic_data.do  # Generates synthetic training data (run once)
│
├── do_files/                       # Stata do-files
│   ├── 00_run.do                   # Master runner (controls pipeline + single-module mode)
│   ├── 01_data_cleaning.do         # MODULE 01: Load data, quality checks, identifier validation
│   ├── 02_string_cleaning.do       # MODULE 02: String standardisation
│   ├── 03_deduplication.do         # MODULE 03: Duplicate detection and resolution
│   ├── 04_outliers_flags.do        # MODULE 04: Outlier detection and flagging
│   └── 05_labeling_codebook.do     # MODULE 05: Variable labels and codebook
│
├── data/
│   ├── raw/                        # Raw input data (read-only after generation)
│   │   └── household_survey_raw.dta  # Synthetic training dataset (generated by setup/)
│   ├── intermediate/               # Intermediate files produced by modules 01–04
│   └── final/                      # Final clean dataset (produced by module 05)
│
├── outputs/                        # All exported outputs
│   ├── missing_summary.csv         # From module 01
│   ├── dedup_log.csv               # From module 03
│   ├── flag_summary.csv            # From module 04
│   └── codebook.xlsx               # From module 05
│
├── logs/
│   ├── setup.log                   # One-time setup log (root level)
│   └── 14_Apr_2026/                # Date-based subfolder (one per run)
│       ├── 00_run.log
│       ├── 01_data_cleaning.log
│       ├── 02_string_cleaning.log
│       ├── 03_deduplication.log
│       ├── 04_outliers_flags.log
│       └── 05_labeling_codebook.log
├── ado/                            # Local Stata packages (installed by setup.do)
│
├── .config/
│   ├── stata/
│   │   ├── stata_requirements.txt  # Stata package list (installed via require)
│   │   └── install_packages.do     # Package installation script
│   └── quarto/                     # Quarto formatting configuration
│
└── .github/
    └── workflows/                  # CI workflows (code review, pre-commit)
```
The project uses `setroot` to automatically locate the project root from any directory by searching upward for the `.here` marker file. This means:

- No hardcoded paths — scripts work regardless of where Stata is launched
- No `if c(user)` blocks — paths resolve automatically for every team member
- Reproducible adopath — only BASE + the local `ado/` directory
After `00_run.do` runs, these globals are available in all modules:

| Global | Default | Purpose |
|---|---|---|
| `${project_path}` | (from `setroot`) | Project root directory |
| `${data}` | `${project_path}/data` | Data root folder |
| `${logs}` | `${project_path}/logs` | Log files |
| `${outputs}` | `${project_path}/outputs` | All exported outputs |
| `${scripts}` | `${project_path}/do_files` | Do-files directory |
| `${today}` | (from `c(current_date)`) | Date stamp for log file names |
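For illustration, inside a module these globals are combined into full paths; the intermediate file name below is hypothetical:

```stata
use "${data}/raw/household_survey_raw.dta", clear
export delimited using "${outputs}/missing_summary.csv", replace
save "${data}/intermediate/hh_01_cleaned.dta", replace   // hypothetical intermediate file name
```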
Data sub-folders (never define these separately — always construct from `${data}`):

```stata
"${data}/raw/"           // raw input data
"${data}/intermediate/"  // intermediate processed files
"${data}/final/"         // clean, analysis-ready datasets
```

If your data lives outside the repo (Dropbox, shared drive, Cryptomator vault):
- Copy the template:

  ```bash
  cp config.do.template config.do
  ```

- Edit `config.do` to set your paths:

  ```stata
  global data    "C:/Users/YourName/Dropbox/Project/data"
  global logs    ""   // leave blank to use project root default
  global outputs ""   // leave blank to use project root default
  ```

- Run as usual — `00_run.do` loads `config.do` automatically.
Important
Never commit `config.do` — it is gitignored because it contains machine-specific paths. Always commit `config.do.template`.
The master do-file orchestrates the training pipeline. It uses control switches to run specific modules:

```stata
// Set to 0 to skip a module during development
local run_01_data_cleaning     = 1
local run_02_string_cleaning   = 1
local run_03_deduplication     = 1
local run_04_outliers_flags    = 1
local run_05_labeling_codebook = 1
```

Runner pattern — pass a module name to run only that one:

```stata
do do_files/00_run.do "03_deduplication"
```

Or with just:

```bash
just stata-script 03_deduplication
```

Tip

If you get a "Root folder of project not found" error, change to the project directory in Stata first (`cd /path/to/repo`), then re-run.
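For orientation, one plausible way such a switch gates a module inside `00_run.do`; the actual implementation may differ:

```stata
* Run module 03 only if its control switch is set to 1
if `run_03_deduplication' == 1 {
    do "${scripts}/03_deduplication.do"
}
```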
Install `just` to run common tasks with short commands:

```bash
# Windows
winget install --id Casey.Just -e

# macOS/Linux
brew install just
```

Available commands:

```bash
just stata-setup                      # One-time setup (install setroot + packages)
just stata-run                        # Run full training pipeline
just stata-script 02_string_cleaning  # Run a single module
just stata-config                     # Show Stata configuration
just lint-stata                       # Lint all do-files with stata_linter
just fmt-markdown                     # Format Markdown files
just help                             # See all available commands
```

For Python tools and pre-commit hooks:

```bash
just get-started
```

This installs: uv (Python), markdownlint-cli2, and all Stata packages.
All do-files in this repository follow IPA Stata standards:

- Path management via `setroot` and `${data}` globals — no hardcoded paths
- Standard file header with project name, file, purpose, author, date
- Log open/close in every script using `${logs}` and `${today}`
- IPA extended missing values: `.d` (don't know), `.r` (refused), `.n` (not applicable), `.o` (other/out of range), `.s` (skipped)
- Defensive programming: `assert`, `isid`, and merge validation throughout
- Package management via `.config/stata/stata_requirements.txt` — no `ssc install` in do-files
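A condensed sketch of what these conventions can look like together in a do-file; the header fields, the sentinel codes (-98/-99), and the exact log path construction are illustrative assumptions, not the repository's literal template:

```stata
/*==============================================================================
Project : AI-assisted data cleaning training
File    : 02_string_cleaning.do
Purpose : String standardisation
Author  : <name>
Date    : <date>
==============================================================================*/
capture log close
log using "${logs}/${today}/02_string_cleaning.log", replace text   // path construction assumed

* IPA extended missing values (sentinel codes below are examples)
replace hh_income_monthly = .r if hh_income_monthly == -99   // refused
replace hh_income_monthly = .d if hh_income_monthly == -98   // don't know

assert !missing(hhid)                                         // defensive check
log close
```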
`setroot` not found — Run `do setup.do` first to install the required packages.

"Root folder not found" error — Change to the project directory in Stata (`cd /path/to/repo`) before running `do do_files/00_run.do`.

`household_survey_raw.dta` not found — Run `do setup/generate_synthetic_data.do` from the project root first.

Stata path errors — Check your `.env` file; ensure paths with spaces are quoted.

Package not found — Run `just stata-setup` (or `do setup.do`) to install all packages listed in `.config/stata/stata_requirements.txt`.
This training is built on the IPA Stata Template and the following resources:
- IPA Data Cleaning Guide — data.poverty-action.org/data-cleaning
- IPA Stata Coding Standards — data.poverty-action.org/software/stata
- Data Carpentry: Stata for Economics — datacarpentry.github.io/stata-economics (CC BY 4.0)
- DIME Analytics Data Handbook — worldbank.github.io/dime-data-handbook
- Sean Higgins Stata Guide — github.com/skhiggins/Stata_guide
- ipaplots — github.com/PovertyAction/ipaplots
- statacons — bquistorff.github.io/statacons (MIT)
Released under the MIT License. See LICENSE for details.