Tech Hubs Session 8: AI Applications in Research

A hands-on training repository for IPA Research Associates and Research Managers. This session teaches AI-assisted data cleaning in Stata using GitHub Copilot inside VS Code, built on top of the IPA Stata Template.

Warning

NEVER COMMIT DATA FILES TO GITHUB.

NEVER USE AI ASSISTANTS WITH PERSONALLY IDENTIFIABLE DATA.

YOU ARE REQUIRED TO REMOVE IDENTIFYING INFORMATION BEFORE CONNECTING AI ASSISTANTS OR STORING IN ANY UNENCRYPTED LOCATION.

This training uses synthetic data only. No real survey data should ever be committed to this repository.

Quick Start

Prerequisites

- Git
- VS Code
- GitHub Copilot extension for VS Code
- Stata 17+
- just (task runner). Install it from a terminal with the command for your operating system:
  - Windows: winget install Casey.Just
  - macOS/Linux: brew install just

Restart your terminal, then continue with the steps below.

Steps

Clone the repository

git clone <repo-url>
cd ai-assisted-data-cleaning

Configure your Stata path

Copy .env-example to .env and set your Stata executable path:

# Windows example
STATA_CMD='C:\Program Files\Stata18\StataSE-64.exe'
STATA_EDITION='se'

# macOS example
# STATA_CMD='/Applications/Stata/StataSE.app/Contents/MacOS/StataSE'

One-time setup (installs setroot and all required packages)

just stata-setup
# or from Stata directly:
# do setup.do

Generate the synthetic training dataset (run once)

From a terminal in the project root:
```
just create-synthetic-data
# or from Stata directly:
# do setup/generate_synthetic_data.do
```
This creates data/raw/household_survey_raw.dta — a synthetic household survey with 500 observations and intentional data quality issues.

Run the training pipeline

# Full pipeline
just stata-run

# Or from Stata directly
do do_files/00_run.do

# Run a single module
just stata-script 02_string_cleaning
# or: do do_files/00_run.do "02_string_cleaning"

Check outputs
- CSV exports: outputs/
- Codebook: outputs/codebook.xlsx
- Logs: logs/
- Final clean dataset: data/final/hh_clean_final.dta

Training Modules

The pipeline consists of five modules, each designed as a hands-on Copilot exercise. Every module contains // TODO comments and * COPILOT PROMPT: comments with ready-to-use natural language prompts.

| Module | File | Topic |
| --- | --- | --- |
| 01 | `do_files/01_data_cleaning.do` | Load data, inspect quality, check identifiers |
| 02 | `do_files/02_string_cleaning.do` | Trim spaces, title case, standardise categories |
| 03 | `do_files/03_deduplication.do` | Detect and resolve duplicate records |
| 04 | `do_files/04_outliers_flags.do` | IQR outlier flagging, winsorisation, `.o` recoding |
| 05 | `do_files/05_labeling_codebook.do` | Variable labels, value labels, codebook export |

How Each Module Works

Each module:

- Runs standalone: includes an initialisation block that sets up paths if the module is run directly, without going through 00_run.do
- Runs as part of the pipeline: 00_run.do calls each module in sequence
- Contains TODOs: blank sections where participants write Copilot-assisted code
- Contains COPILOT PROMPT comments: copy the plain-English prompt into Copilot Chat or let inline Copilot autocomplete the code

Exercises

Each module contains one or more exercises marked with // TODO and * COPILOT PROMPT: comments. The prompts are written in plain English so you can paste them directly into GitHub Copilot Chat or use them as inline autocomplete triggers.

Module 01 — Data Cleaning (`01_data_cleaning.do`)

Show missingness information (variable name, count missing, % missing) — key commands: missings report
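
The same report can be approximated with built-in commands only, which may help when checking what Copilot suggests. The sketch below runs on a small made-up dataset; the variable names and values are illustrative assumptions, not the training data, and the loop is a stand-in for missings report rather than a copy of its output.

```stata
* Toy data (hypothetical): two missing ages, one empty district
clear
input hhid age str10 district
1 34 "North"
2 .  "South"
3 51 ""
4 .  "North"
end

* For each variable: name, count missing, percent missing
foreach var of varlist _all {
    quietly count if missing(`var')
    display as text "`var'" _col(15) r(N) _col(25) %5.1f (100 * r(N) / _N) "%"
}
```

missing() treats empty strings and all numeric missing codes (., .a to .z) as missing, so one loop covers both variable types.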

Module 02 — String Cleaning (`02_string_cleaning.do`)

| # | Exercise | Key commands |
| --- | --- | --- |
| 1 | Trim leading and trailing whitespace from every string variable | `ds, has(type string)`, `strtrim()` |
| 2 | Standardise `enumerator_name` to title case | `proper()` |
| 3 | Clean `district_name`: lowercase, trim spaces, collapse internal spaces | `lower()`, `strtrim()`, `itrim()` |
| 4 | Recode `occupation_raw` to five canonical categories: Farmer, Teacher, Trader, Laborer, Other | `inlist()`, `strmatch()` |
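
A minimal sketch of the four exercises on a made-up three-row dataset follows. The example values, the misspelling "teachr", and the new variable occupation are assumptions for illustration; the real dataset will need its own list of variants. The sketch uses stritrim(), the current name for the itrim() function listed above.

```stata
* Toy data (hypothetical values with messy spacing and casing)
clear
input str20 enumerator_name str20 district_name str20 occupation_raw
"  jANE doe "  "  North   East "  "farmer"
"JOHN SMITH"   "north east"       "FARMING"
" ali khan"    "South  West"      "teachr"
end

* Exercise 1: trim leading and trailing whitespace in every string variable
ds, has(type string)
foreach var of varlist `r(varlist)' {
    replace `var' = strtrim(`var')
}

* Exercise 2: title case
replace enumerator_name = proper(enumerator_name)

* Exercise 3: lowercase, trim, collapse internal runs of spaces
replace district_name = stritrim(strtrim(lower(district_name)))

* Exercise 4: map raw spellings to canonical categories (only two shown)
generate occupation = "Other"
replace occupation = "Farmer"  if strmatch(lower(occupation_raw), "farm*")
replace occupation = "Teacher" if inlist(lower(occupation_raw), "teacher", "teachr")

list, clean
```

Writing the recode into a new variable keeps occupation_raw available for checking which raw spellings ended up in Other.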

Module 03 — Deduplication (`03_deduplication.do`)

| # | Exercise | Key commands |
| --- | --- | --- |
| 1 | Report how many records share the same `hhid` | `duplicates report` |
| 2 | Create an `is_duplicate` flag (0 = unique, 1 = duplicate) | `duplicates tag` |
| 3 | Export all duplicate records to `outputs/hh_duplicates.xlsx` for review | `export excel ... if is_duplicate == 1` |
| 4 | Keep only the most recent record per `hhid` using `survey_date` | `bysort hhid (survey_date): keep if _n == _N`, `isid` |
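
The flag-then-resolve sequence might look like the sketch below, shown on a made-up four-row dataset. The dates, the helper variables survey_date_str and dup_count, and the choice to skip the Excel export are assumptions made to keep the example self-contained.

```stata
* Toy data (hypothetical): household 101 was surveyed twice
clear
input hhid str10 survey_date_str
101 "2024-01-05"
102 "2024-01-06"
101 "2024-02-10"
103 "2024-01-07"
end
generate survey_date = date(survey_date_str, "YMD")
format survey_date %td

* Exercise 1: how many records share an hhid
duplicates report hhid

* Exercise 2: dup_count is 0 for unique records, 1 or more otherwise
duplicates tag hhid, generate(dup_count)
generate byte is_duplicate = dup_count > 0

* Exercise 4: keep the latest record per household, then verify uniqueness
bysort hhid (survey_date): keep if _n == _N
isid hhid
```

If two records for the same household share a survey_date, the sort order within that tie is arbitrary, so a tie-breaking variable would be needed for a reproducible result.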

Module 04 — Outliers & Flags (`04_outliers_flags.do`)

| # | Exercise | Key commands |
| --- | --- | --- |
| 1 | Flag outliers in `hh_income_monthly` using the IQR method; add an `income_flag_reason` string | `summarize, detail`, `r(p25)`, `r(p75)` |
| 2 | Winsorise `hh_expenditure` at the 1st and 99th percentiles | `winsor2 ..., cuts(1 99) replace` |
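
The sketch below illustrates both exercises on generated toy values. It assumes the conventional 1.5 x IQR fences (the exercise text names only "the IQR method"), introduces a hypothetical income_outlier indicator, and replaces winsor2 with built-in commands so the example runs without extra packages; results from winsor2 itself may differ slightly.

```stata
* Toy data (hypothetical): ten plausible incomes and one extreme value
clear
set obs 11
generate double hh_income_monthly = 100 * _n
replace hh_income_monthly = 50000 in 11
generate double hh_expenditure = 0.8 * hh_income_monthly

* Exercise 1: IQR fences (1.5 multiplier is an assumption)
quietly summarize hh_income_monthly, detail
local iqr   = r(p75) - r(p25)
local lower = r(p25) - 1.5 * `iqr'
local upper = r(p75) + 1.5 * `iqr'

generate byte income_outlier = 0 if !missing(hh_income_monthly)
replace income_outlier = 1 if (hh_income_monthly < `lower' | hh_income_monthly > `upper') & !missing(hh_income_monthly)

generate str30 income_flag_reason = ""
replace income_flag_reason = "below Q1 - 1.5*IQR" if hh_income_monthly < `lower'
replace income_flag_reason = "above Q3 + 1.5*IQR" if hh_income_monthly > `upper' & !missing(hh_income_monthly)

* Exercise 2: cap values at the 1st and 99th percentiles using built-ins
_pctile hh_expenditure, percentiles(1 99)
local p1  = r(r1)
local p99 = r(r2)
replace hh_expenditure = `p1'  if hh_expenditure < `p1'
replace hh_expenditure = `p99' if hh_expenditure > `p99' & !missing(hh_expenditure)

tabulate income_flag_reason, missing
```

The !missing() guards matter because Stata treats missing values as larger than any number, so an unguarded "greater than" condition would flag or overwrite them.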

Module 05 — Labeling & Codebook (`05_labeling_codebook.do`)

| # | Exercise | Key commands |
| --- | --- | --- |
| 1 | Apply descriptive variable labels to all 21 variables in the dataset | `label variable` |
| 2 | Define a `yn_label` (0 = No, 1 = Yes) and apply it to all `_yn` binary variables | `label define`, `label values`, `foreach var of varlist *_yn` |
| 3 | Define an `edu_label` (0–3: No education → Tertiary) and apply it to `edu_level` | `label define`, `label values` |
| 4 | Generate and export a codebook to `outputs/codebook.xlsx` | `ipacodebook` (preferred), or `codebook` + `putexcel` |
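
Exercises 1 to 3 could be sketched as follows on a three-variable toy dataset. The variables has_electricity_yn and owns_land_yn, the label wording, and the two middle education categories (Primary, Secondary) are assumptions; only the endpoints No education and Tertiary are given above. The codebook export is left out because it depends on ipacodebook.

```stata
* Toy data (hypothetical variable names and values)
clear
input has_electricity_yn owns_land_yn edu_level
1 0 2
0 1 0
1 1 3
end

* Exercise 1: descriptive variable labels
label variable has_electricity_yn "Household has electricity"
label variable owns_land_yn       "Household owns land"
label variable edu_level          "Highest education level of household head"

* Exercise 2: one yes/no value label shared by every *_yn variable
label define yn_label 0 "No" 1 "Yes"
foreach var of varlist *_yn {
    label values `var' yn_label
}

* Exercise 3: ordered education label (middle categories are assumed)
label define edu_label 0 "No education" 1 "Primary" 2 "Secondary" 3 "Tertiary"
label values edu_level edu_label

describe
```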

Project Structure

├── README.md                           # This file
├── CLAUDE.md                           # AI assistant instructions and conventions
├── .here                               # Project root marker (used by setroot)
├── .env                                # Stata executable config (gitignored — copy from .env-example)
├── .env-example                        # Template for .env
├── config.do.template                  # Template for user-specific data paths
├── config.do                           # User-specific paths (gitignored — copy from template)
├── setup.do                            # One-time setup: installs setroot + packages
│
├── setup/
│   └── generate_synthetic_data.do      # Generates synthetic training data (run once)
│
├── do_files/                           # Stata do-files
│   ├── 00_run.do                       # Master runner (controls pipeline + single-module mode)
│   ├── 01_data_cleaning.do             # MODULE 01: Load data, quality checks, identifier validation
│   ├── 02_string_cleaning.do           # MODULE 02: String standardisation
│   ├── 03_deduplication.do             # MODULE 03: Duplicate detection and resolution
│   ├── 04_outliers_flags.do            # MODULE 04: Outlier detection and flagging
│   └── 05_labeling_codebook.do         # MODULE 05: Variable labels and codebook
│
├── data/
│   ├── raw/                            # Raw input data (read-only after generation)
│   │   └── household_survey_raw.dta    # Synthetic training dataset (generated by setup/)
│   ├── intermediate/                   # Intermediate files produced by modules 01–04
│   └── final/                          # Final clean dataset (produced by module 05)
│
├── outputs/                            # All exported outputs
│   ├── missing_summary.csv             # From module 01
│   ├── dedup_log.csv                   # From module 03
│   ├── flag_summary.csv                # From module 04
│   └── codebook.xlsx                   # From module 05
│
├── logs/
│   ├── setup.log                       # One-time setup log (root level)
│   └── 14_Apr_2026/                    # Date-based subfolder (one per run)
│       ├── 00_run.log
│       ├── 01_data_cleaning.log
│       ├── 02_string_cleaning.log
│       ├── 03_deduplication.log
│       ├── 04_outliers_flags.log
│       └── 05_labeling_codebook.log
├── ado/                                # Local Stata packages (installed by setup.do)
│
├── .config/
│   ├── stata/
│   │   ├── stata_requirements.txt      # Stata package list (installed via require)
│   │   └── install_packages.do         # Package installation script
│   └── quarto/                         # Quarto formatting configuration
│
└── .github/
    └── workflows/                      # CI workflows (code review, pre-commit)

Path Resolution and Globals

How It Works

The project uses setroot to automatically locate the project root from any directory by searching upward for the .here marker file. This means:

- No hardcoded paths: scripts work regardless of where Stata is launched
- No if c(user) blocks: paths resolve automatically for every team member
- Reproducible adopath: only BASE + local ado/ directory

Global Path Variables

After 00_run.do runs, these globals are available in all modules:

| Global | Default | Purpose |
| --- | --- | --- |
| `${project_path}` | (from setroot) | Project root directory |
| `${data}` | `${project_path}/data` | Data root folder |
| `${logs}` | `${project_path}/logs` | Log files |
| `${outputs}` | `${project_path}/outputs` | All exported outputs |
| `${scripts}` | `${project_path}/do_files` | Do-files directory |
| `${today}` | (from `c(current_date)`) | Date stamp for log file names |

Data sub-folders (never define these separately — always construct from ${data}):

"${data}/raw/"           // raw input data
"${data}/intermediate/"  // intermediate processed files
"${data}/final/"         // clean, analysis-ready datasets
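
To make the table concrete, the sketch below builds the same globals by hand. It is an assumed reconstruction, not the code in 00_run.do: the project root is taken from the current working directory instead of setroot, and the underscore format for ${today} is inferred from the log folder name shown in the project structure.

```stata
* Assumption: the working directory stands in for the root setroot would find
global project_path "`c(pwd)'"

* Defaults from the table above
global data    "${project_path}/data"
global logs    "${project_path}/logs"
global outputs "${project_path}/outputs"
global scripts "${project_path}/do_files"

* c(current_date) looks like "14 Apr 2026"; trim it and replace spaces
* with underscores to get a folder-friendly stamp (format is assumed)
global today = subinstr(strtrim("`c(current_date)'"), " ", "_", .)

* Sub-folders are always built from ${data}, never stored in their own globals
display as text "${data}/raw/household_survey_raw.dta"
display as text "${logs}/${today}/00_run.log"
```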

Separating Code and Data

If your data lives outside the repo (Dropbox, shared drive, Cryptomator vault):

Copy the template:
```
cp config.do.template config.do
```

Edit config.do to set your paths:

global data    "C:/Users/YourName/Dropbox/Project/data"
global logs    ""   // leave blank to use project root default
global outputs ""   // leave blank to use project root default

Run as usual — 00_run.do loads config.do automatically.
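
How a runner might pick up config.do and fall back to the defaults for blank entries is sketched below. This is a guess at the pattern under stated assumptions, not the actual logic of 00_run.do: the working directory stands in for the project root, and the globals are cleared first so the snippet behaves the same on every run.

```stata
* Assumption: working directory stands in for the project root
global project_path "`c(pwd)'"

* Start from a clean slate so earlier sessions do not leak in
global data    ""
global logs    ""
global outputs ""

* Load user-specific paths only if config.do exists
capture confirm file "${project_path}/config.do"
if _rc == 0 {
    do "${project_path}/config.do"
}

* Any global left blank falls back to the project root default
if "${data}" == "" {
    global data "${project_path}/data"
}
if "${logs}" == "" {
    global logs "${project_path}/logs"
}
if "${outputs}" == "" {
    global outputs "${project_path}/outputs"
}

display as text "data:    ${data}"
display as text "logs:    ${logs}"
display as text "outputs: ${outputs}"
```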

Important

Never commit config.do — it is gitignored because it contains machine-specific paths. Always commit config.do.template.

Understanding `00_run.do`

The master do-file orchestrates the training pipeline. It uses control switches to run specific modules:

// Set to 0 to skip a module during development
local run_01_data_cleaning     = 1
local run_02_string_cleaning   = 1
local run_03_deduplication     = 1
local run_04_outliers_flags    = 1
local run_05_labeling_codebook = 1

Runner pattern — pass a module name to run only that one:

do do_files/00_run.do "03_deduplication"

Or with just:

just stata-script 03_deduplication
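
The switches and the single-module argument could interact roughly as sketched here. This is an assumed reconstruction, not the actual contents of 00_run.do: the local only_module and the counter n_run are hypothetical names, only three modules are listed, and the module name is set by hand so the snippet can be run as one block without arguments.

```stata
* In a do-file the argument would typically be captured with: args only_module
local only_module "03_deduplication"

* Control switches (1 = run, 0 = skip)
local run_01_data_cleaning   = 1
local run_02_string_cleaning = 1
local run_03_deduplication   = 1

local n_run = 0
foreach module in 01_data_cleaning 02_string_cleaning 03_deduplication {
    * A named module overrides the switches: run it and skip the rest
    if "`only_module'" != "" {
        local run_`module' = ("`module'" == "`only_module'")
    }
    if `run_`module'' {
        * A real runner would call: do "${scripts}/`module'.do"
        display as text "would run: do_files/`module'.do"
        local ++n_run
    }
}
display as text "modules selected: `n_run'"
```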

Tip

If you get a "Root folder of project not found" error, change to the project directory in Stata first: cd /path/to/repo then re-run.

Advanced Setup

Task Runner (`just`)

Install just to run common tasks with short commands:

# Windows
winget install --id Casey.Just -e

# macOS/Linux
brew install just

Available commands:

just stata-setup                      # One-time setup (install setroot + packages)
just stata-run                        # Run full training pipeline
just stata-script 02_string_cleaning  # Run a single module
just stata-config                     # Show Stata configuration
just lint-stata                       # Lint all do-files with stata_linter
just fmt-markdown                     # Format Markdown files
just help                             # See all available commands

Full Development Environment

For Python tools and pre-commit hooks:

just get-started

This installs: uv (Python), markdownlint-cli2, and all Stata packages.

IPA Coding Standards

All do-files in this repository follow IPA Stata standards:

- Path management via setroot and ${data} globals: no hardcoded paths
- Standard file header with project name, file, purpose, author, date
- Log open/close in every script using ${logs} and ${today}
- IPA extended missing values: .d (don't know), .r (refused), .n (not applicable), .o (other/out of range), .s (skipped)
- Defensive programming: assert, isid, and merge validation throughout
- Package management via .config/stata/stata_requirements.txt: no ssc install in do-files
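
Two of these standards, extended missing values and defensive checks, can be illustrated together. The sketch assumes a survey that coded "don't know" as -99 and "refused" as -88; those codes, the toy data, and the 0 to 120 age range are hypothetical and should be replaced with the values from your own questionnaire.

```stata
* Toy data (hypothetical): -99 = don't know, -88 = refused
clear
input hhid age
1 34
2 -99
3 -88
4 51
end

* Recode survey codes to IPA extended missing values
replace age = .d if age == -99
replace age = .r if age == -88

* Defensive checks: stop the script if an assumption is violated
isid hhid
assert inrange(age, 0, 120) | missing(age)

* Extended missing values can still be counted separately
count if age == .d
count if age == .r
```

Because missing() is true for .d and .r as well as for plain ., later range checks and summaries skip the recoded values without extra conditions.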

Troubleshooting

setroot not found — Run do setup.do first to install required packages.

"Root folder not found" error — Change to the project directory in Stata (cd /path/to/repo) before running do do_files/00_run.do.

household_survey_raw.dta not found — Run do setup/generate_synthetic_data.do from the project root first.

Stata path errors — Check your .env file; ensure paths with spaces are quoted.

Package not found — Run just stata-setup (or do setup.do) to install all packages listed in .config/stata/stata_requirements.txt.

Acknowledgments

This training is built on the IPA Stata Template and the following resources:

IPA Data Cleaning Guide — data.poverty-action.org/data-cleaning
IPA Stata Coding Standards — data.poverty-action.org/software/stata
Data Carpentry: Stata for Economics — datacarpentry.github.io/stata-economics (CC BY 4.0)
DIME Analytics Data Handbook — worldbank.github.io/dime-data-handbook
Sean Higgins Stata Guide — github.com/skhiggins/Stata_guide
ipaplots — github.com/PovertyAction/ipaplots
statacons — bquistorff.github.io/statacons (MIT)

License

Released under the MIT License. See LICENSE for details.
