
Disaggregation — Fine-grained Property Value Assessment

Code for Archbold et al., Fine-grained Property Value Assessment Using Probabilistic Disaggregation (IGARSS 2023). Paper: https://arxiv.org/pdf/2306.00246.pdf

The model takes a 1m-resolution aerial RGB tile of Hennepin County, Minnesota plus parcel-level boundary masks and learns a per-pixel probabilistic value field whose parcel-aggregated values match the ground-truth property assessments.
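
The core mechanism, as a minimal sketch (illustrative names and a plain MSE loss, not the paper's exact probabilistic formulation): predict a non-negative value per pixel, sum the field inside each parcel mask, and regress those sums against the assessed parcel values.

import torch
import torch.nn as nn

# Stand-in backbone; the five variants (ral/uniform/rsample/gauss/logsample)
# differ in how the per-pixel value distribution is modeled.
pixel_head = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 1), nn.Softplus(),        # non-negative value field
)

img = torch.rand(1, 3, 302, 302)               # one 302x302 RGB tile
parcel_masks = torch.rand(5, 302, 302) > 0.8   # 5 binary parcel masks
parcel_values = torch.rand(5) * 1e6            # assessed value per parcel

value_field = pixel_head(img)[0, 0]            # (302, 302) value per pixel
pred = (value_field * parcel_masks).sum(dim=(1, 2))  # aggregate per parcel
loss = nn.functional.mse_loss(pred, parcel_values)
loss.backward()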

@inproceedings{archbold2023fine,
  author    = {Archbold, Cohen and Brodie, Benjamin and Ogholbake, Aram Ansary and Jacobs, Nathan},
  booktitle = {IEEE International Geoscience and Remote Sensing Symposium (IGARSS)},
  title     = {Fine-grained Property Value Assessment Using Probabilistic Disaggregation},
  year      = {2023},
}

Repo layout

Directory Purpose
aggregation/ Main pipeline. PyTorch-Lightning modules for the 5 model variants (ral, uniform, rsample, gauss, logsample), Hennepin dataset loader, train/test/vis entrypoints.
segmentation/ Optional building-segmentation pre-training that produces the checkpoint aggregation/ loads when cfg.train.use_pretrained=True.
synthetic_testing/ Self-contained controlled experiment using EuroSAT with synthetic labels. See synthetic_testing/README.md.
dataset_generation/ Original pipeline that turned the raw Hennepin County parcels shapefile + WMS imagery into the ver8/-style per-tile layout.
scripts/ Setup helpers (setup_hennepin.sh).
paths.env Single source of truth for filesystem paths. Edit this once; both Python configs and the setup script read it.
paths.py Tiny stdlib loader that merges paths.env into os.environ at import time.
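
For reference, a loader of that shape can be a few lines of stdlib Python (a sketch; the real paths.py may differ in details):

import os
from pathlib import Path

def load_paths_env(env_file: Path = Path(__file__).parent / "paths.env") -> None:
    """Merge KEY=value lines into os.environ; existing shell exports win."""
    for line in env_file.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue                      # skip blanks and comments
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())

load_paths_env()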

Setup

  1. Install dependencies. Any Python 3.9+ environment works; CUDA is strongly recommended for training.

    pip install -r requirements.txt

    The original cluster lockfile is preserved at aggregation/legacy_py39_cluster.yml for reference but is not needed.

  2. Configure paths once in paths.env. Open the file at the repo root and fill in the values you care about. The defaults work if you keep the standard layout (Box-synced archives plus a repo-relative data/ directory). Every filesystem path used by the code flows from this file: both Python config.py files and scripts/setup_hennepin.sh read it. Anything you export in your shell takes precedence. An example is sketched below.
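
    For instance (placeholder values; BOX_DIR and HENNEPIN_DATA_ROOT are the two variables the setup script reads):

    # paths.env example (placeholder values)
    BOX_DIR=/path/to/Box
    HENNEPIN_DATA_ROOT=/abs/path/to/repo/data/hennepin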

  3. Get the data. The Hennepin aggregation dataset ships as two ~455 MB zips on Box:

    1m_302px.zip                  → 1916 tiles, uncombined parcels
    1m_302px_region_combined.zip  → 1916 tiles, parcels merged with neighbors
    

    Both files live in Box at <Box>/datasets/Hennepin_dataset/. If your local Box mount shows them as 0-byte placeholders, right-click each in Finder → Download Now and wait for the sync to finish.

  4. Unpack. From the repo root:

    bash scripts/setup_hennepin.sh

    The script reads BOX_DIR and HENNEPIN_DATA_ROOT from paths.env (or from your shell) and unpacks each zip to its own subdirectory under HENNEPIN_DATA_ROOT:

    $HENNEPIN_DATA_ROOT/1m_302px/{imgs,masks,vals.pkl}
    $HENNEPIN_DATA_ROOT/1m_302px_region_combined/{imgs,masks,vals.pkl}
    

    If you let HENNEPIN_DATA_ROOT fall back to its default during the unpack, set it explicitly afterwards:

    # add to paths.env, or export in your shell:
    HENNEPIN_DATA_ROOT=/abs/path/to/repo/data/hennepin
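
    A quick sanity check of the unpacked layout (a sketch; it assumes only the directory structure above and that vals.pkl unpickles):

    import os
    import pickle
    from pathlib import Path

    root = Path(os.environ["HENNEPIN_DATA_ROOT"]) / "1m_302px"
    print("imgs:", sum(1 for _ in (root / "imgs").iterdir()))
    print("masks:", sum(1 for _ in (root / "masks").iterdir()))
    with open(root / "vals.pkl", "rb") as f:
        print("vals.pkl entries:", len(pickle.load(f)))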

Train and evaluate the aggregation models

Either edit aggregation/config.py directly, or override the most common knobs via env vars at invocation time:

Env var Default What it sets
EXPERIMENT_NAME gauss_covariance cfg.experiment_name (controls results/<name>/ output dir)
TRAIN_MODEL gauss cfg.train.model — one of ral / uniform / rsample / gauss / logsample
USE_PRETRAINED 0 1 to load PRETRAINED_CKPT (see "Optional pre-training")
BATCH_SIZE 8 Training batch size
NUM_EPOCHS 300 Number of training epochs
NUM_WORKERS 4 DataLoader workers (use 0 on macOS to avoid mp issues)
PATIENCE 100 EarlyStopping patience on val_loss
PL_ACCELERATOR auto auto / cpu / gpu / mps
PL_DEVICES 1 Number of devices

Then:

cd aggregation
python train.py     # writes results/<experiment_name>/best.ckpt
python test.py      # writes results/<experiment_name>/test_results/*.pkl + stats.txt
python vis.py       # writes results/<experiment_name>/visualizations/*.png
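
These knobs are read straight from the environment; a sketch of the pattern (illustrative; aggregation/config.py holds the real names and defaults):

import os

EXPERIMENT_NAME = os.environ.get("EXPERIMENT_NAME", "gauss_covariance")
TRAIN_MODEL = os.environ.get("TRAIN_MODEL", "gauss")
BATCH_SIZE = int(os.environ.get("BATCH_SIZE", "8"))
NUM_EPOCHS = int(os.environ.get("NUM_EPOCHS", "300"))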

Smoke test

The full pipeline has been verified end-to-end on CPU with a 10-tile subset:

cd aggregation
EXPERIMENT_NAME=smoke_test \
BATCH_SIZE=4 NUM_EPOCHS=1 NUM_WORKERS=0 \
PATIENCE=1 \
PL_ACCELERATOR=cpu PL_DEVICES=1 \
python train.py && python test.py && python vis.py

(On macOS with the conda-shipped libomp, prefix with KMP_DUPLICATE_LIB_OK=TRUE.) This finishes in under a minute on CPU with a 10-tile dataset and produces the full output tree: checkpoint, stats, prediction pickles, scatter/error histograms, and per-sample visualizations. The model isn't useful at 1 epoch on 10 tiles — this just confirms the train→test→vis chain runs cleanly.

Hardware reality

Full training assumes a CUDA GPU and was originally done on a multi-GPU cluster. On a CPU-only laptop you can import the code and run a one-step Lightning smoke test, but real training will not finish in any practical time. Move to a server with a recent NVIDIA GPU, or set PL_ACCELERATOR=mps on Apple Silicon for limited experimentation.

Optional: pre-train the segmentation backbone

The aggregation models can be initialized from a building-segmentation network. The default configuration disables this (cfg.train.use_pretrained = False); to use it, produce a checkpoint at segmentation/outputs/buildsegpretrain/building_seg_pretrained.pth (or anywhere else, and point PRETRAINED_CKPT at it).

The segmentation pipeline expects the ver8/-style raw Hennepin layout, not the two zips above. Each tile has its own subdirectory keyed by lat_mid/lon_mid and contains:

<tile>/building_mask.tif
<tile>/parcel_boundary.tif
<tile>/parcel_mask.tif
<tile>/<lat>_<lon>.tif        # source image
<tile>/parcel_value.tif
<tile>/parcels.shp            # optional

plus a top-level hennepin_bbox.csv listing every tile bbox. This data is not in the Box archives — those zips contain only the downstream-compiled aggregation format (per-tile RGB PNG + pickled parcel masks + a single vals.pkl), with no GeoTIFF rasters or per-tile bbox catalog. The ver8/ layout was the upstream input that dataset_generation/hennepin_compile.py consumed to produce the zips. It lives on the original cluster.
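
A quick way to check a candidate ver8/ tree against this layout (a sketch; the source-image name <lat>_<lon>.tif varies per tile, so only the fixed filenames are checked):

from pathlib import Path

root = Path("/path/to/ver8")                     # placeholder path
required = ["building_mask.tif", "parcel_boundary.tif",
            "parcel_mask.tif", "parcel_value.tif"]
for tile in sorted(p for p in root.iterdir() if p.is_dir()):
    missing = [name for name in required if not (tile / name).exists()]
    if missing:
        print(f"{tile.name}: missing {missing}")
print("bbox catalog:", (root / "hennepin_bbox.csv").exists())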

Once you have it, set these in paths.env (or your shell):

HENNEPIN_VER8_ROOT=/path/to/ver8
HENNEPIN_VER8_CSV=/path/to/ver8/hennepin_bbox.csv
SEG_OUT_DIR=/abs/path/to/repo/segmentation/outputs/buildsegpretrain

Then:

cd segmentation
python train.py

That writes both model_dict.pth (best by val loss) and building_seg_pretrained.pth (the canonical filename consumed by aggregation/). After training, set cfg.train.use_pretrained = True (or export USE_PRETRAINED=1) and re-train the aggregation models.

If the ver8/ data is gone, regenerate it with dataset_generation/ (next section).

Optional: regenerate the ver8/ data and the compiled zips from scratch

dataset_generation/ is a complete pipeline that produces the per-tile ver8/ layout that segmentation/ consumes, and from there the two compiled zips that aggregation/ consumes. You only need this if you want to retrain the segmentation backbone and the original ver8/ data is unavailable.

What you need first:

Asset Where to get it Notes
hennepin_bbox.csv Already in repo at dataset_generation/hennepin_bbox.csv 5501-tile catalog, restored from git history.
Hennepin County parcels shapefile Auto-downloaded by dataset_generation/request_zip.sh from gis.hennepin.us ~92 MB.
Minnesota_ESPG26915.shp (Microsoft US Building Footprints, MN, reprojected to EPSG:26915) Auto-downloaded and reprojected by the same request_zip.sh (uses the Microsoft Building Footprints dataset on Azure and ogr2ogr to reproject). Requires ogr2ogr (GDAL CLI) on PATH.
Aerial imagery Downloaded on-the-fly by satellite_gather.py from the Hennepin County WMS service. Slow + bandwidth-heavy.

Pipeline:

cd dataset_generation

# 1. Get both shapefiles in one shot (parcels + buildings, reprojected to EPSG:26915):
chmod u+x request_zip.sh && ./request_zip.sh
# Produces: hennepin_county_parcels/hennepin_county_parcels.shp and Minnesota_ESPG26915.shp

# 2. Download aerial imagery into a per-tile lat/lon directory tree
python satellite_gather.py --path /path/to/raw_hennepin --gsd 1

# 3. Generate per-tile {building_mask, parcel_mask, parcel_boundary, masks/<PID>.tif}
python generate_labels.py --dir /path/to/raw_hennepin/ --gsd 1

# At this point /path/to/raw_hennepin matches the `ver8/` layout. You can now train
# the segmentation backbone (see "Optional: pre-train the segmentation backbone").

# 4. (Aggregation only) Compile the per-tile rasters into the imgs/+masks/+vals.pkl
#    layout the aggregation code reads. HENNEPIN_COMBINE=1 selects the combined-parcels variant.
HENNEPIN_RAW_ROOT=/path/to/raw_hennepin \
HENNEPIN_PARCEL_SHP=$(pwd)/hennepin_county_parcels/hennepin_county_parcels.shp \
HENNEPIN_COMPILED_OUT=/path/to/hennepin/1m_302px_region_combined \
HENNEPIN_COMBINE=1 \
python hennepin_compile.py

The supplied Box zips already contain the output of this entire chain for the aggregation pipeline. Re-run it only if you need the per-tile rasters too — i.e., to retrain the segmentation backbone.

Synthetic experiments

A self-contained EuroSAT-based ablation lives under synthetic_testing/. See synthetic_testing/README.md for full instructions.
