Research workspace for evaluating LLM routing behavior on a simplified street-network representation of Southern Helsinki.
This repository contains the research and evaluation side of the project: routing-network inputs, stable SSAL artifacts, exported experiment inputs, and analysis scripts. The broader project compares GPT-family and Gemini-family models on routing tasks over an OpenStreetMap-derived Helsinki network, using Routingpy as the reference baseline.
Relationship to llm-compare-dashboard
The project is split across two repositories:
- llm-compare-dashboard: run prompts, compare OpenAI and Gemini outputs side by side, and store/export history
- research: prepare routing artifacts, version experiment inputs, and evaluate routing results
In practice, prompts are run in the dashboard, the history is exported as JSON, and the exported results are analyzed here.
The project studies how LLMs handle route-generation tasks when given a compact graph-like representation of a real street network instead of a standard map UI. The current reference map is a selected area of Southern Helsinki derived from OpenStreetMap. The current evaluation focuses on GPT and Gemini models.
Main evaluation concerns:
- structured output correctness
- plausible node sequence selection
- distance estimation quality
- robustness as route difficulty increases
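The first two concerns can be checked mechanically before any route-quality scoring. The following is a minimal sketch, not the repository's evaluator: the JSON keys `route` and `distance_m`, the function name `validate_route_response`, and the `known_nodes` set are all assumptions made for illustration, since the expected response schema is not spelled out here.

```python
import json


def validate_route_response(raw_text, known_nodes):
    """Return (ok, reason) for one raw LLM reply.

    Assumed (hypothetical) schema: {"route": [node_id, ...], "distance_m": number}.
    """
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(payload, dict):
        return False, "top-level value is not an object"

    route = payload.get("route")
    if not isinstance(route, list) or len(route) < 2:
        return False, "route must be a list of at least two node IDs"

    # Reject IDs of the wrong type as well as IDs missing from the network.
    unknown = [
        n for n in route
        if not isinstance(n, (int, str)) or n not in known_nodes
    ]
    if unknown:
        return False, f"unknown node IDs: {unknown}"

    distance = payload.get("distance_m")
    is_number = isinstance(distance, (int, float)) and not isinstance(distance, bool)
    if not is_number or distance <= 0:
        return False, "distance_m must be a positive number"
    return True, "ok"
```

Separating format failures from routing failures in this way keeps output-format reliability measurable on its own, independent of how good the returned path is.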
Typical workflow, in order:
- Prepare an OSM-derived routing network
- Build the SSAL artifact from the GeoPackage input
- Run routing prompts in llm-compare-dashboard
- Export the dashboard history as JSON
- Store the export in this repository
- Evaluate the results with the scripts here
- Record summaries and notes for later review
Create and activate a virtual environment first.
python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
Create a repo-root .env file:
ORS_API_KEY=your_ors_api_key_here
GPKG_PATH=data/raw/routing_networks/osm_southern_helsinki_slimmed_cropped.gpkg
HISTORY_JSON=data/raw/llm_history_exports/llm_compare_history_2026-04-20.json
NODES_LAYER=slimmed_cropped_nodes
See scripts/README.md for script-specific usage details.
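For orientation, here is a minimal standard-library sketch of how a script could read these four values. It is an illustration only: the function names `parse_env_text` and `load_settings` are hypothetical, the repository's scripts may load configuration differently (for example through a dotenv library), and treating all four keys as required is an assumption.

```python
import os
from pathlib import Path

# Keys taken from the .env example above; treating all as required is an assumption.
REQUIRED_KEYS = ("ORS_API_KEY", "GPKG_PATH", "HISTORY_JSON", "NODES_LAYER")


def parse_env_text(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_settings(env_path=".env"):
    """Merge .env values with the process environment (the environment wins)."""
    path = Path(env_path)
    file_values = parse_env_text(path.read_text(encoding="utf-8")) if path.exists() else {}
    settings = {key: os.environ.get(key, file_values.get(key)) for key in REQUIRED_KEYS}
    missing = [key for key, value in settings.items() if not value]
    if missing:
        raise RuntimeError(f"missing configuration keys: {', '.join(missing)}")
    return settings
```

Failing early on a missing key gives a clearer error than a later file-not-found or authentication failure inside a script.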
The OSM road network is converted into a simplified semantic adjacency list to reduce token usage while keeping the routing structure that matters.
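To make the idea concrete, the sketch below renders a small edge list as one compact text line per node. It is not the project's SSAL format or conversion code: the line layout, the function name `build_adjacency_text`, the edge tuple shape, and the choice to treat every edge as two-way are all assumptions for illustration, and the sample data is invented.

```python
from collections import defaultdict


def build_adjacency_text(edges):
    """Render edges as one compact text line per node.

    Each edge is a hypothetical (u, v, length_m, street_name) tuple and is
    treated as undirected, which ignores one-way streets.
    """
    adjacency = defaultdict(list)
    for u, v, length_m, name in edges:
        adjacency[u].append((v, length_m, name))
        adjacency[v].append((u, length_m, name))

    lines = []
    for node in sorted(adjacency):
        neighbours = ", ".join(
            f"{other}({round(length_m)}m,{name})"
            for other, length_m, name in sorted(adjacency[node])
        )
        lines.append(f"{node}: {neighbours}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Invented sample data, not taken from the Helsinki network.
    sample_edges = [(1, 2, 120.4, "StreetA"), (2, 3, 80.0, "StreetB")]
    print(build_adjacency_text(sample_edges))
```

Rounding lengths to whole metres and listing each node once are the kind of choices that cut token count while keeping connectivity and approximate distances available to the model.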
The reusable conversion logic lives in:
The CLI entry point for regenerating the versioned SSAL artifact is:
Stable generated SSAL text artifacts are intentionally versioned in this repo.
The current script workflow lives under:
In particular:
- scripts/build_ssal.py builds the compact SSAL text artifact from the GeoPackage road-network input
- scripts/evaluate_history.py evaluates exported dashboard history against the routing network and prints per-entry and summary results
- scripts/README.md documents dependencies, configuration, and usage
An earlier prototype script feeds SSAL data and a routing prompt to an LLM and expects a route in JSON format. It is kept for historical reference and is not treated as the main current workflow.
Current location:
The project also uses a comparison interface (app.py in the separate dashboard repo) for side-by-side model testing and persisted history. That history is later exported and analyzed here.
- data/raw/routing_networks/: OSM-derived GeoPackage inputs
- data/derived/ssal/: stable generated SSAL text artifacts
- data/raw/llm_history_exports/: exported dashboard history JSONs
- research/: reusable Python logic
- scripts/: executable SSAL generation and evaluation scripts
- results/summaries/: experiment notes and summaries
- archive/prototypes/: older prototype scripts
Build the default SSAL artifact:
python scripts/build_ssal.py
Evaluate the default exported history:
python scripts/evaluate_history.py
Show script options:
python scripts/build_ssal.py --help
python scripts/evaluate_history.py --help
This repo reflects an evolving research workflow, not a finished software product.
Early experiment notes indicate:
- GPT-family models sometimes produced partially correct routes and distance estimates
- Gemini 2.5 Flash often failed to return the expected JSON format
- performance worsened on more difficult routes
- output-format reliability was itself a major issue
Detailed chronology and test-by-test notes are kept in the supporting docs, summaries, and changelog rather than in the README.
The current route evaluator uses an approximate, exploratory node-sequence comparison between LLM-produced node paths and OpenRouteService route geometry. This is useful for rough comparison, but it is not yet a fully graph-native path-equivalence metric: it does not establish that two paths traverse the same edges of the network.
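One simple way such an approximate comparison can work is sketched below, under stated assumptions. This is not the code in scripts/evaluate_history.py: the function names, the 30 m tolerance, and the (lat, lon) tuple representation are hypothetical. The sketch scores an LLM path by the share of its nodes that lie close to any vertex of the reference geometry.

```python
import math

EARTH_RADIUS_M = 6371000.0


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def share_of_nodes_near_route(node_coords, route_coords, tolerance_m=30.0):
    """Fraction of LLM path nodes within tolerance_m of any reference vertex."""
    if not node_coords or not route_coords:
        return 0.0
    hits = sum(
        1
        for node in node_coords
        if min(haversine_m(node, vertex) for vertex in route_coords) <= tolerance_m
    )
    return hits / len(node_coords)
```

A score like this shows the limits of a geometric comparison: it measures distance to reference vertices only (not to the segments between them), ignores the order in which nodes are visited, and cannot tell whether consecutive nodes are actually connected in the network.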