Internal geotechnical platform MVP for automatic organization of boreholes and geotechnical data. Upload SPT (Standard Penetration Test) borehole PDF reports, extract structured data with OCR and AI, store results in PostgreSQL/PostGIS, and query or visualize boreholes geographically.
- PDF ingestion — Upload SPT borehole reports via REST API or CLI
- Intelligent text extraction — Native PDF text parsing with automatic OCR fallback for scanned pages
- AI structured extraction — OpenAI or Anthropic APIs parse borehole metadata, soil layers, and SPT readings into JSON
- Geospatial storage — PostgreSQL + PostGIS for borehole locations and radius search
- Technical visualization — SPT depth profiles (Plotly) and map-ready GeoJSON endpoints
PDF Upload → Text extraction (pdfplumber / PyMuPDF)
→ OCR if scanned (Tesseract)
→ LLM extraction (OpenAI / Claude)
→ PostgreSQL + PostGIS
→ Geo search & visualization API
| Component | Technology |
|---|---|
| API | FastAPI, Uvicorn |
| Database | PostgreSQL 16, PostGIS |
| ORM | SQLAlchemy 2 (async), GeoAlchemy2 |
| PDF / OCR | pdfplumber, PyMuPDF, Tesseract |
| AI | OpenAI API, Anthropic API |
| Charts / maps | Plotly, GeoJSON |
- Python 3.11+
- Docker (recommended for PostGIS)
- Tesseract OCR (required for scanned PDFs)
- OpenAI or Anthropic API key
Install Tesseract and set the path in .env:

    TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe
    OCR_LANG=por+eng

Change to the project directory and create your .env file:

    cd d:\AI_Projects\AI_OCR
    copy .env.example .env

Edit .env and set at minimum:
- OPENAI_API_KEY (or ANTHROPIC_API_KEY with AI_PROVIDER=anthropic)
- TESSERACT_CMD on Windows if using OCR

Start the database:

    docker compose up -d

This starts PostGIS on localhost:5432 with user / password / database set to geo / geo / geotech.
Create a virtual environment and install the dependencies:

    python -m venv .venv
    .\.venv\Scripts\activate
    pip install -r requirements.txt

Start the API:

    uvicorn app.main:app --reload

- API: http://localhost:8000
- Interactive docs: http://localhost:8000/docs
- Health check: http://localhost:8000/health
On first startup, the app enables the PostGIS extension and creates database tables automatically.
All settings are loaded from .env. See .env.example for the full list.
| Variable | Description | Default |
|---|---|---|
| DATABASE_URL | Async PostgreSQL connection string | postgresql+asyncpg://geo:geo@localhost:5432/geotech |
| AI_PROVIDER | openai or anthropic | openai |
| OPENAI_API_KEY | OpenAI API key | (none) |
| OPENAI_MODEL | OpenAI model | gpt-4o |
| ANTHROPIC_API_KEY | Anthropic API key | (none) |
| ANTHROPIC_MODEL | Anthropic model | claude-sonnet-4-20250514 |
| TESSERACT_CMD | Path to Tesseract executable | (none) |
| OCR_LANG | Tesseract language packs | por+eng |
| UPLOAD_DIR | Stored PDF upload directory | ./data/uploads |
| MAX_UPLOAD_MB | Max upload size in MB | 50 |
| CORS_ORIGINS | Comma-separated allowed origins | http://localhost:3000,http://localhost:5173 |
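
As a rough illustration of how these variables might be resolved, the sketch below applies the defaults from the table and checks that the API key matching AI_PROVIDER is present. It is not the code in app/config.py: the function name `load_settings`, the returned dictionary shape, and the validation rules are assumptions made for this example, and it reads an already-populated environment mapping rather than parsing .env itself.

```python
import os
from typing import Mapping

# Defaults copied from the configuration table above.
DEFAULTS = {
    "DATABASE_URL": "postgresql+asyncpg://geo:geo@localhost:5432/geotech",
    "AI_PROVIDER": "openai",
    "OCR_LANG": "por+eng",
    "UPLOAD_DIR": "./data/uploads",
    "MAX_UPLOAD_MB": "50",
    "CORS_ORIGINS": "http://localhost:3000,http://localhost:5173",
}


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Hypothetical helper: merge environment values with the documented defaults."""

    def get(key: str) -> str:
        return env.get(key, DEFAULTS[key])

    provider = get("AI_PROVIDER")
    if provider not in ("openai", "anthropic"):
        raise ValueError(f"AI_PROVIDER must be 'openai' or 'anthropic', got {provider!r}")

    # The key that must be set depends on the selected provider.
    key_name = "OPENAI_API_KEY" if provider == "openai" else "ANTHROPIC_API_KEY"
    if not env.get(key_name):
        raise ValueError(f"{key_name} must be set when AI_PROVIDER={provider}")

    return {
        "DATABASE_URL": get("DATABASE_URL"),
        "AI_PROVIDER": provider,
        "API_KEY": env[key_name],
        "OCR_LANG": get("OCR_LANG"),
        "UPLOAD_DIR": get("UPLOAD_DIR"),
        "MAX_UPLOAD_MB": int(get("MAX_UPLOAD_MB")),
        # Split the comma-separated list and drop empty entries.
        "CORS_ORIGINS": [o.strip() for o in get("CORS_ORIGINS").split(",") if o.strip()],
    }
```

Failing fast on a missing key at startup is a design choice for this sketch; the real application may defer that check until the first extraction call.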
Base path: /api/v1
| Method | Endpoint | Description |
|---|---|---|
| POST | /upload/pdf | Upload a borehole PDF (multipart form, field: file) |
| GET | /upload/jobs/{job_id} | Check ingest job status |
Upload returns a job immediately; processing runs in the background. Poll the job endpoint until status is completed or failed.
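
The polling loop described above could look roughly like the sketch below, which uses only the standard library. The JSON key `status` is inferred from the description of the job endpoint; the helper names `fetch_job` and `wait_for_job`, the polling interval, and the timeout are assumptions for illustration.

```python
import json
import time
import urllib.request
from typing import Callable

BASE_URL = "http://localhost:8000/api/v1"


def fetch_job(job_id: str) -> dict:
    """Fetch the current job record from the (assumed) JSON job endpoint."""
    with urllib.request.urlopen(f"{BASE_URL}/upload/jobs/{job_id}") as resp:
        return json.load(resp)


def wait_for_job(
    job_id: str,
    fetch: Callable[[str], dict] = fetch_job,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the job reports 'completed' or 'failed', or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch(job_id)
        if job.get("status") in ("completed", "failed"):
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} did not finish within {timeout_s} s")
        sleep(interval_s)
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without a running server; in normal use, `wait_for_job(job_id)` with the defaults is enough.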
Example: upload a PDF (the trailing backtick is PowerShell's line-continuation character):

    curl -X POST "http://localhost:8000/api/v1/upload/pdf" `
      -F "file=@C:\path\to\borehole-report.pdf"

| Method | Endpoint | Description |
|---|---|---|
| GET | /boreholes | List boreholes (limit, offset) |
| GET | /boreholes/{id} | Borehole detail with layers and SPT readings |
| POST | /boreholes/search/nearby | Find boreholes within a radius |
Nearby search body:

    {
      "latitude": -23.5505,
      "longitude": -46.6333,
      "radius_m": 5000,
      "limit": 50
    }

| Method | Endpoint | Description |
|---|---|---|
| POST | /projects | Create a project (name, client, description) |
Projects are also auto-created from project_name during PDF extraction when present.
| Method | Endpoint | Description |
|---|---|---|
| GET | /viz/boreholes/{id}/spt-profile | Plotly JSON chart (N-value vs depth) |
| GET | /viz/map | GeoJSON FeatureCollection of all georeferenced boreholes |
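
To make the map output more concrete, here is a minimal sketch of building a GeoJSON FeatureCollection from borehole records. The input keys (`id`, `code`, `latitude`, `longitude`) and the chosen properties are assumptions for this example; the actual /viz/map response may carry different properties. Note that GeoJSON orders coordinates as longitude first, then latitude.

```python
def to_feature_collection(boreholes: list[dict]) -> dict:
    """Build a GeoJSON FeatureCollection, skipping boreholes without coordinates."""
    features = []
    for borehole in boreholes:
        lat = borehole.get("latitude")
        lon = borehole.get("longitude")
        if lat is None or lon is None:
            continue  # not georeferenced, so it cannot be placed on the map
        features.append(
            {
                "type": "Feature",
                # GeoJSON positions are [longitude, latitude].
                "geometry": {"type": "Point", "coordinates": [lon, lat]},
                "properties": {"id": borehole.get("id"), "code": borehole.get("code")},
            }
        )
    return {"type": "FeatureCollection", "features": features}
```

A dictionary in this shape can be serialized with `json.dumps` and passed directly to most web map libraries.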
Ingest a local PDF without running the HTTP server:

    python scripts/ingest_local.py C:\path\to\borehole-report.pdf

This requires the database to be running and .env to be configured.
Alembic is included for schema versioning. The app also auto-creates tables on startup for local development.

    # Apply migrations (uses sync psycopg2 driver)
    alembic upgrade head

Initial migration: alembic/versions/001_initial_schema.py
| Table | Purpose |
|---|---|
| projects | Site / client projects |
| boreholes | Borehole header, location (PostGIS point), metadata |
| soil_layers | Stratigraphy (depth intervals, USCS, description) |
| spt_readings | SPT N-values and blow counts by depth |
| ingest_jobs | PDF processing job status and errors |
Extracted fields include borehole code, coordinates, elevation, groundwater depth, total depth, soil layers, and SPT readings. Raw LLM output is stored in boreholes.raw_extraction for audit and reprocessing.
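
As background for the spt_readings table, the sketch below shows the conventional relationship between blow counts and the N-value: the test is driven in three 15 cm increments, and N is the sum of the blows for the second and third increments. Whether the platform recomputes N this way or stores the value reported in the PDF is not stated here; the function name and input format are assumptions for this example.

```python
from typing import Optional


def n_value(blows_per_15cm: list[int]) -> Optional[int]:
    """Return the SPT N-value from three 15 cm increments, or None if incomplete.

    The first increment is treated as seating and excluded; N is the sum of
    the blows recorded for the second and third increments.
    """
    if len(blows_per_15cm) < 3:
        return None  # incomplete test (for example, refusal before full penetration)
    return blows_per_15cm[1] + blows_per_15cm[2]
```

A check like this can help flag extracted readings where the reported N-value does not match the blow counts on the same row.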
AI_OCR/
├── app/
│ ├── main.py # FastAPI application
│ ├── config.py # Settings from .env
│ ├── database.py # Async SQLAlchemy engine
│ ├── api/routes/ # REST endpoints
│ ├── models/ # SQLAlchemy models
│ ├── schemas/ # Pydantic request/response schemas
│ ├── services/ # PDF, OCR, extraction, geo, viz logic
│ └── pipeline/ # Ingest orchestration
├── alembic/ # Database migrations
├── scripts/
│ └── ingest_local.py # CLI PDF ingest
├── docker-compose.yml # PostGIS database
├── requirements.txt
└── .env.example
- PDF read — Extract text per page with pdfplumber; fall back to PyMuPDF if needed.
- OCR — Pages with little or no text are rendered and processed with Tesseract.
- LLM extraction — Full report text is sent to OpenAI or Claude with a geotechnical schema prompt.
- Persist — Structured data is saved to PostgreSQL; coordinates are stored as WGS84 points when lat/lon are available.
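
The OCR fallback in the steps above can be sketched as a per-page decision: keep native text where a page has enough of it, and substitute OCR output otherwise. The 50-character threshold and both function names are assumptions for illustration, not values taken from the project, and the OCR call itself is left out so the example stays free of external dependencies.

```python
def pages_needing_ocr(page_texts: list[str], min_chars: int = 50) -> list[int]:
    """Return zero-based indexes of pages whose native text is too short to trust."""
    return [
        index
        for index, text in enumerate(page_texts)
        if len((text or "").strip()) < min_chars
    ]


def merge_page_texts(page_texts: list[str], ocr_texts: dict[int, str]) -> str:
    """Combine native and OCR text, preferring OCR output where it was produced."""
    merged = [ocr_texts.get(index, text or "") for index, text in enumerate(page_texts)]
    return "\n\n".join(merged)
```

In a real pipeline, the indexes returned by `pages_needing_ocr` would be rendered to images and passed to Tesseract, and the results fed back through `merge_page_texts` before the LLM extraction step.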
- Web UI (map viewer, borehole detail, SPT charts)
- UTM / local grid → WGS84 conversion (pyproj)
- Human review workflow for low-confidence extractions
- PDF report generation from structured data
- Supabase auth and multi-tenant access
- CAD / GIS import (DXF, Shapefile)
Database connection failed
Ensure Docker is running: docker compose ps. Wait for the health check to pass before starting the API.
OCR returns empty text
Verify Tesseract is installed and TESSERACT_CMD points to the executable. Install language packs matching OCR_LANG.
Extraction failed / low quality
Check your API key and model settings. Review ingest_jobs.error_message or the job response. Tune prompts in app/services/extractor.py using your actual report formats.
Geo search returns no results
Boreholes need valid latitude / longitude in the source report (or successful coordinate extraction). UTM-only locations are stored as raw_location until conversion is implemented.
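
When a nearby search returns nothing, it can help to check distances independently of the database. The sketch below uses the haversine formula on a spherical Earth, which is an approximation and not necessarily the calculation PostGIS performs for this endpoint; the function name and the mean Earth radius constant are choices made for this example.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius, spherical approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in metres between two WGS84 points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

If a borehole's stored coordinates give a distance well beyond radius_m from the query point, the likely causes are swapped latitude and longitude or coordinates in a projected system rather than degrees.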
Internal use — add your organization's license terms here.