research

Research workspace for evaluating LLM routing behavior on a simplified street-network representation of Southern Helsinki.

This repository contains the research and evaluation side of the project: routing-network inputs, stable SSAL (simplified semantic adjacency list) artifacts, exported experiment inputs, and analysis scripts. The broader project compares GPT-family and Gemini-family models on routing tasks over an OpenStreetMap-derived Helsinki network, using routingpy as the reference baseline.

Relationship to llm-compare-dashboard

The project is split across two repositories:

llm-compare-dashboard: run prompts, compare OpenAI and Gemini outputs side by side, and store/export history
research: prepare routing artifacts, version experiment inputs, and evaluate routing results

In practice, prompts are run in the dashboard, the history is exported as JSON, and the exported results are analyzed here.

Project scope

The project studies how LLMs handle route-generation tasks when given a compact graph-like representation of a real street network instead of a standard map UI. The current reference map is a selected area of Southern Helsinki derived from OpenStreetMap. The current evaluation focuses on GPT and Gemini models.

Main evaluation concerns:

structured output correctness
plausible node sequence selection
distance estimation quality
robustness as route difficulty increases
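The first two concerns can be illustrated with a small check. This is a minimal sketch, not the project's evaluator: the response schema (a `route` list of node ids plus a numeric `distance_m`), the function name `check_route_response`, and the toy adjacency data are all assumptions made for illustration.

```python
import json


def check_route_response(raw_text, adjacency):
    """Return a list of problems found in one LLM response (empty if none).

    Assumed (hypothetical) schema: {"route": [node ids...], "distance_m": number}.
    `adjacency` maps each node id to the set of its neighbouring node ids.
    """
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    if not isinstance(payload, dict):
        return ["top-level value is not an object"]

    problems = []

    # Structured output correctness: both fields present with usable types.
    route = payload.get("route")
    if (
        not isinstance(route, list)
        or len(route) < 2
        or not all(isinstance(node, (str, int)) for node in route)
    ):
        problems.append("missing or malformed 'route' list")
        route = []
    distance = payload.get("distance_m")
    if isinstance(distance, bool) or not isinstance(distance, (int, float)):
        problems.append("missing or non-numeric 'distance_m'")

    # Plausible node sequence: every consecutive pair must share an edge.
    for a, b in zip(route, route[1:]):
        if b not in adjacency.get(a, ()):
            problems.append(f"no edge {a} -> {b}")
    return problems


# Toy three-node network, invented for this example.
toy_adjacency = {"A": {"B"}, "B": {"A", "C"}, "C": {"B"}}

print(check_route_response('{"route": ["A", "B", "C"], "distance_m": 420}', toy_adjacency))
print(check_route_response('{"route": ["A", "C"], "distance_m": "far"}', toy_adjacency))
print(check_route_response("Sure! Here is the route: A, B, C", toy_adjacency))
```

Returning a list of problems rather than a single pass/fail flag keeps format failures and routing failures separate, which matters when output-format reliability is itself under study.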

Current workflow

Prepare an OSM-derived routing network
Build the SSAL artifact from the GeoPackage input
Run routing prompts in llm-compare-dashboard
Export the dashboard history as JSON
Store the export in this repository
Evaluate the results with the scripts here
Record summaries and notes for later review

Setup

Create and activate a virtual environment first.

macOS / Linux

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

Create a repo-root .env file:

ORS_API_KEY=your_ors_api_key_here
GPKG_PATH=data/raw/routing_networks/osm_southern_helsinki_slimmed_cropped.gpkg
HISTORY_JSON=data/raw/llm_history_exports/llm_compare_history_2026-04-20.json
NODES_LAYER=slimmed_cropped_nodes

See scripts/README.md for script-specific usage details.
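How the scripts actually read these variables is documented in scripts/README.md and is not shown here. As a rough illustration only, the following standard-library sketch parses a `KEY=VALUE` file and reports which of the four variables above are unset. The helper names `load_env_file` and `resolve_settings` are hypothetical, and real environment variables are assumed to take precedence over the file.

```python
import os
from pathlib import Path

# Variable names taken from the .env example above.
SETTING_NAMES = ("ORS_API_KEY", "GPKG_PATH", "HISTORY_JSON", "NODES_LAYER")


def load_env_file(path=".env"):
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def resolve_settings(env_path=".env", environ=None):
    """Return (settings, missing); process environment wins over the file."""
    environ = os.environ if environ is None else environ
    file_values = load_env_file(env_path)
    settings = {
        name: environ.get(name) or file_values.get(name) for name in SETTING_NAMES
    }
    missing = [name for name, value in settings.items() if not value]
    return settings, missing


if __name__ == "__main__":
    _, missing = resolve_settings()
    print("missing settings:", missing)
```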

Core components

SSAL conversion

The OSM road network is converted into a simplified semantic adjacency list (SSAL) to reduce token usage while keeping the routing structure that matters.

The reusable conversion logic lives in:

research/ssal.py

The CLI entry point for regenerating the versioned SSAL artifact is:

scripts/build_ssal.py

Stable generated SSAL text artifacts are intentionally versioned in this repo.
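To give a feel for the idea of a compact adjacency-list text, here is a minimal sketch. It does not reproduce the format that research/ssal.py emits, which is not shown in this README. The line layout, the function name `build_adjacency_text`, and the toy edges are assumptions for illustration, and edges are treated as two-way for simplicity.

```python
from collections import defaultdict


def build_adjacency_text(edges):
    """Render edges as one text line per node.

    `edges` is an iterable of (from_node, to_node, street_name, length_m).
    Hypothetical output layout: "node: neighbour(street,metres) ...".
    """
    adjacency = defaultdict(list)
    for u, v, name, length_m in edges:
        metres = round(length_m)  # whole metres keep the text short
        adjacency[u].append((v, name, metres))
        adjacency[v].append((u, name, metres))  # assume two-way streets

    lines = []
    for node in sorted(adjacency):
        neighbours = " ".join(
            f"{other}({name},{metres})"
            for other, name, metres in sorted(adjacency[node])
        )
        lines.append(f"{node}: {neighbours}")
    return "\n".join(lines)


# Toy data with placeholder names and lengths; not taken from the Helsinki network.
toy_edges = [
    ("N1", "N2", "StreetA", 118.4),
    ("N2", "N3", "StreetB", 92.0),
    ("N1", "N3", "StreetC", 150.6),
]

print(build_adjacency_text(toy_edges))
```

Each node appears once with all of its neighbours on the same line, so a prompt carries the connectivity, street names, and rounded lengths without repeating coordinates or geometry.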

Evaluation scripts

The current script workflow lives under:

scripts/

In particular:

scripts/build_ssal.py builds the compact SSAL text artifact from the GeoPackage road-network input
scripts/evaluate_history.py evaluates exported dashboard history against the routing network and prints per-entry and summary results
scripts/README.md documents dependencies, configuration, and usage

LLM routing prototype

An earlier prototype script feeds SSAL data and a routing prompt to an LLM and expects a route in JSON format. It is kept for historical reference and is not part of the current main workflow.

Current location:

archive/prototypes/

Comparison interface and history

The project also uses a comparison interface (app.py in the separate dashboard repo) for side-by-side model testing and persisted history. That history is later exported and analyzed here.

Repository layout

data/raw/routing_networks/ — OSM-derived GeoPackage inputs
data/derived/ssal/ — stable generated SSAL text artifacts
data/raw/llm_history_exports/ — exported dashboard history JSONs
research/ — reusable Python logic
scripts/ — executable SSAL generation and evaluation scripts
results/summaries/ — experiment notes and summaries
archive/prototypes/ — older prototype scripts

Common commands

Build the default SSAL artifact:

python scripts/build_ssal.py

Evaluate the default exported history:

python scripts/evaluate_history.py

Show script options:

python scripts/build_ssal.py --help
python scripts/evaluate_history.py --help

Current status

This repo reflects an evolving research workflow, not a finished software product.

Early experiment notes indicate:

GPT-family models sometimes produced partially correct routes and distance estimates
Gemini 2.5 Flash often failed to return the expected JSON format
performance worsened on more difficult routes
output-format reliability was itself a major issue

Detailed chronology and test-by-test notes are kept in the supporting docs, summaries, and changelog rather than in the README.

Evaluation note

The current route evaluator uses an approximate, exploratory node-sequence comparison between LLM-produced node paths and OpenRouteService route geometry. This is useful for rough comparison, but it is not yet a fully graph-native path-equivalence metric.
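One way such an approximate comparison could look is sketched below. This is not the logic in scripts/evaluate_history.py: the function names, the 30 m tolerance, the point-to-vertex matching, and the toy coordinates are all assumptions chosen to keep the example short.

```python
import math

EARTH_RADIUS_M = 6_371_000.0


def haversine_m(a, b):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (
        math.sin((lat2 - lat1) / 2) ** 2
        + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def node_match_fraction(llm_nodes, reference_points, tolerance_m=30.0):
    """Share of LLM path nodes lying within tolerance of any reference vertex."""
    if not llm_nodes or not reference_points:
        return 0.0
    hits = sum(
        1
        for node in llm_nodes
        if min(haversine_m(node, point) for point in reference_points) <= tolerance_m
    )
    return hits / len(llm_nodes)


def relative_distance_error(estimated_m, reference_m):
    """Absolute error of the LLM distance estimate, relative to the reference."""
    return abs(estimated_m - reference_m) / reference_m


# Toy coordinates, invented for this example.
reference_geometry = [(60.1650, 24.9400), (60.1650, 24.9410), (60.1650, 24.9420)]
llm_path = [(60.1650, 24.9400), (60.1650, 24.9411), (60.1680, 24.9420)]

print(node_match_fraction(llm_path, reference_geometry))
print(relative_distance_error(1100.0, 1000.0))
```

A score like this ignores node order and graph connectivity, which is exactly the limitation noted above: two paths can share most nearby points without being equivalent routes on the network.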
