cryptofan430/Geotech-Intelligence-Platform

Geotech Intelligence Platform

Internal geotechnical platform MVP for automatic organization of boreholes and geotechnical data. Upload SPT (Standard Penetration Test) borehole PDF reports, extract structured data with OCR and AI, store results in PostgreSQL/PostGIS, and query or visualize boreholes geographically.

Features

  • PDF ingestion — Upload SPT borehole reports via REST API or CLI
  • Intelligent text extraction — Native PDF text parsing with automatic OCR fallback for scanned pages
  • AI structured extraction — OpenAI or Anthropic APIs parse borehole metadata, soil layers, and SPT readings into JSON
  • Geospatial storage — PostgreSQL + PostGIS for borehole locations and radius search
  • Technical visualization — SPT depth profiles (Plotly) and map-ready GeoJSON endpoints

Architecture

PDF Upload → Text extraction (pdfplumber / PyMuPDF)
          → OCR if scanned (Tesseract)
          → LLM extraction (OpenAI / Claude)
          → PostgreSQL + PostGIS
          → Geo search & visualization API

| Component | Technology |
| --- | --- |
| API | FastAPI, Uvicorn |
| Database | PostgreSQL 16, PostGIS |
| ORM | SQLAlchemy 2 (async), GeoAlchemy2 |
| PDF / OCR | pdfplumber, PyMuPDF, Tesseract |
| AI | OpenAI API, Anthropic API |
| Charts / maps | Plotly, GeoJSON |

Prerequisites

  • Python 3.11+
  • Docker (recommended for PostGIS)
  • Tesseract OCR (required for scanned PDFs)
  • OpenAI or Anthropic API key

Windows: Tesseract

Install Tesseract and set the path in .env:

TESSERACT_CMD=C:\Program Files\Tesseract-OCR\tesseract.exe
OCR_LANG=por+eng

Quick start

1. Clone and configure

cd d:\AI_Projects\AI_OCR
copy .env.example .env

Edit .env and set at minimum:

  • OPENAI_API_KEY (or ANTHROPIC_API_KEY with AI_PROVIDER=anthropic)
  • TESSERACT_CMD on Windows if using OCR

2. Start the database

docker compose up -d

This starts PostGIS on localhost:5432 with user geo, password geo, and database geotech.

3. Install dependencies

python -m venv .venv
.\.venv\Scripts\activate
pip install -r requirements.txt

4. Run the API

uvicorn app.main:app --reload

On first startup, the app enables the PostGIS extension and creates database tables automatically.

Configuration

All settings are loaded from .env. See .env.example for the full list.

| Variable | Description | Default |
| --- | --- | --- |
| DATABASE_URL | Async PostgreSQL connection string | postgresql+asyncpg://geo:geo@localhost:5432/geotech |
| AI_PROVIDER | openai or anthropic | openai |
| OPENAI_API_KEY | OpenAI API key | |
| OPENAI_MODEL | OpenAI model | gpt-4o |
| ANTHROPIC_API_KEY | Anthropic API key | |
| ANTHROPIC_MODEL | Anthropic model | claude-sonnet-4-20250514 |
| TESSERACT_CMD | Path to Tesseract executable | |
| OCR_LANG | Tesseract language packs | por+eng |
| UPLOAD_DIR | Stored PDF upload directory | ./data/uploads |
| MAX_UPLOAD_MB | Max upload size in MB | 50 |
| CORS_ORIGINS | Comma-separated allowed origins | http://localhost:3000,http://localhost:5173 |
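
As an illustration of how a few of these variables might be interpreted, here is a minimal sketch using only the standard library. The `Settings` class and `load_settings` function are hypothetical names for this example; the project's actual app/config.py may be structured differently.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping, Optional


@dataclass(frozen=True)
class Settings:
    """Hypothetical subset of the settings described in the table above."""
    ai_provider: str
    max_upload_bytes: int
    cors_origins: List[str]


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from a mapping (defaults to the process environment)."""
    env = os.environ if env is None else env

    provider = env.get("AI_PROVIDER", "openai").strip().lower()
    if provider not in {"openai", "anthropic"}:
        raise ValueError(f"AI_PROVIDER must be 'openai' or 'anthropic', got {provider!r}")

    # MAX_UPLOAD_MB is given in megabytes; convert once to bytes for size checks.
    max_upload_bytes = int(env.get("MAX_UPLOAD_MB", "50")) * 1024 * 1024

    # CORS_ORIGINS is comma-separated; drop surrounding spaces and empty entries.
    raw = env.get("CORS_ORIGINS", "http://localhost:3000,http://localhost:5173")
    origins = [item.strip() for item in raw.split(",") if item.strip()]

    return Settings(provider, max_upload_bytes, origins)
```

Passing a plain dict instead of relying on `os.environ` keeps the function easy to test.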

API reference

Base path: /api/v1

Upload & ingest

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /upload/pdf | Upload a borehole PDF (multipart form, field: file) |
| GET | /upload/jobs/{job_id} | Check ingest job status |

Upload returns a job immediately; processing runs in the background. Poll the job endpoint until status is completed or failed.

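Example: upload a PDF from PowerShell. Calling curl.exe explicitly avoids the Windows PowerShell curl alias for Invoke-WebRequest, which does not accept -X or -F:

curl.exe -X POST "http://localhost:8000/api/v1/upload/pdf" `
  -F "file=@C:\path\to\borehole-report.pdf"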
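
The polling step described above could be sketched as follows. This is a minimal example, not the project's client: `wait_for_job` and `http_fetcher` are hypothetical helpers, and the response is assumed to be a JSON object with a `status` field that eventually becomes `completed` or `failed`.

```python
import json
import time
import urllib.request
from typing import Callable, Dict

TERMINAL_STATUSES = {"completed", "failed"}


def wait_for_job(fetch_job: Callable[[], Dict], interval_s: float = 2.0,
                 timeout_s: float = 300.0, sleep=time.sleep) -> Dict:
    """Call fetch_job repeatedly until the job reaches a terminal status."""
    deadline = time.monotonic() + timeout_s
    while True:
        job = fetch_job()
        if job.get("status") in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError("ingest job did not finish before the timeout")
        sleep(interval_s)


def http_fetcher(base_url: str, job_id: str) -> Callable[[], Dict]:
    """Return a fetch function for GET /api/v1/upload/jobs/{job_id}."""
    url = f"{base_url}/api/v1/upload/jobs/{job_id}"

    def fetch() -> Dict:
        with urllib.request.urlopen(url) as response:
            return json.load(response)

    return fetch
```

With the API running locally, `wait_for_job(http_fetcher("http://localhost:8000", job_id))` would block until the job finishes, where `job_id` comes from the upload response.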

Boreholes

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /boreholes | List boreholes (limit, offset) |
| GET | /boreholes/{id} | Borehole detail with layers and SPT readings |
| POST | /boreholes/search/nearby | Find boreholes within a radius |

Nearby search body:

{
  "latitude": -23.5505,
  "longitude": -46.6333,
  "radius_m": 5000,
  "limit": 50
}
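
To make the semantics of the request body concrete, the sketch below filters an in-memory list by great-circle distance. It is only an illustration of what a radius search means: the platform performs this search in PostGIS, and the `nearby` function and the dictionary keys used here are assumptions for the example.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, spherical approximation


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def nearby(boreholes, latitude, longitude, radius_m, limit=50):
    """Return up to `limit` boreholes within radius_m, nearest first."""
    hits = []
    for borehole in boreholes:
        distance = haversine_m(latitude, longitude,
                               borehole["latitude"], borehole["longitude"])
        if distance <= radius_m:
            hits.append((distance, borehole))
    hits.sort(key=lambda pair: pair[0])
    return [borehole for _, borehole in hits[:limit]]
```

A spherical model is adequate for a sanity check at these distances; a geodesic calculation on the database side will give slightly different values.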

Projects

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /projects | Create a project (name, client, description) |

Projects are also auto-created from project_name during PDF extraction when present.

Visualization

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /viz/boreholes/{id}/spt-profile | Plotly JSON chart (N-value vs depth) |
| GET | /viz/map | GeoJSON FeatureCollection of all georeferenced boreholes |
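
For readers unfamiliar with GeoJSON, the following sketch shows the general shape of a FeatureCollection of point features. The `to_feature_collection` function and the `id` and `code` property names are assumptions for this example; the actual /viz/map response may carry different properties.

```python
def to_feature_collection(boreholes):
    """Build a GeoJSON FeatureCollection from borehole dictionaries."""
    features = []
    for borehole in boreholes:
        lat = borehole.get("latitude")
        lon = borehole.get("longitude")
        if lat is None or lon is None:
            continue  # skip boreholes that are not georeferenced
        features.append({
            "type": "Feature",
            # GeoJSON positions are ordered [longitude, latitude].
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"id": borehole.get("id"), "code": borehole.get("code")},
        })
    return {"type": "FeatureCollection", "features": features}
```

The coordinate order is the most common source of mistakes: a map that shows points in the wrong hemisphere usually has latitude and longitude swapped.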

CLI ingest

Ingest a local PDF without running the HTTP server:

python scripts/ingest_local.py C:\path\to\borehole-report.pdf

Requires the database to be running and .env configured.

Database migrations

Alembic is included for schema versioning. The app also auto-creates tables on startup for local development.

# Apply migrations (uses sync psycopg2 driver)
alembic upgrade head

Initial migration: alembic/versions/001_initial_schema.py

Data model

| Table | Purpose |
| --- | --- |
| projects | Site / client projects |
| boreholes | Borehole header, location (PostGIS point), metadata |
| soil_layers | Stratigraphy (depth intervals, USCS, description) |
| spt_readings | SPT N-values and blow counts by depth |
| ingest_jobs | PDF processing job status and errors |

Extracted fields include borehole code, coordinates, elevation, groundwater depth, total depth, soil layers, and SPT readings. Raw LLM output is stored in boreholes.raw_extraction for audit and reprocessing.
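
The relationship between blow counts and the N-value can be sketched as below. This assumes the common convention (for example ASTM D1586) in which the sampler is driven in three 150 mm increments and N is the sum of the blows for the second and third; the `n_value` function is hypothetical, and reports that follow other conventions or record refusal need separate handling.

```python
def n_value(blows):
    """Return the SPT N-value from blow counts for three 150 mm increments.

    The first increment is treated as seating and excluded from N.
    """
    if len(blows) != 3:
        raise ValueError("expected blow counts for exactly three increments")
    if any(count < 0 for count in blows):
        raise ValueError("blow counts cannot be negative")
    return blows[1] + blows[2]
```

A check like this can flag extractions where the stored N-value disagrees with the stored blow counts.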

Project structure

AI_OCR/
├── app/
│   ├── main.py              # FastAPI application
│   ├── config.py            # Settings from .env
│   ├── database.py          # Async SQLAlchemy engine
│   ├── api/routes/          # REST endpoints
│   ├── models/              # SQLAlchemy models
│   ├── schemas/             # Pydantic request/response schemas
│   ├── services/            # PDF, OCR, extraction, geo, viz logic
│   └── pipeline/            # Ingest orchestration
├── alembic/                 # Database migrations
├── scripts/
│   └── ingest_local.py      # CLI PDF ingest
├── docker-compose.yml       # PostGIS database
├── requirements.txt
└── .env.example

Ingest pipeline

  1. PDF read — Extract text per page with pdfplumber; fall back to PyMuPDF if needed.
  2. OCR — Pages with little or no text are rendered and processed with Tesseract.
  3. LLM extraction — Full report text is sent to OpenAI or Claude with a geotechnical schema prompt.
  4. Persist — Structured data is saved to PostgreSQL; coordinates are stored as WGS84 points when lat/lon are available.
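
The decision made in steps 1 and 2 can be sketched as a per-page fallback. This is a simplified illustration, not the project's pipeline code: `extract_pages`, the `ocr_page` callable, and the 20-character threshold are assumptions for the example.

```python
from typing import Callable, List

MIN_NATIVE_CHARS = 20  # hypothetical threshold for "little or no text"


def extract_pages(native_texts: List[str], ocr_page: Callable[[int], str],
                  min_chars: int = MIN_NATIVE_CHARS) -> List[str]:
    """Keep native text per page; run OCR only for pages that look scanned.

    native_texts holds the text already extracted from each page.
    ocr_page(index) is called with the zero-based page index when needed.
    """
    pages = []
    for index, text in enumerate(native_texts):
        if len(text.strip()) < min_chars:
            text = ocr_page(index)
        pages.append(text)
    return pages
```

Deciding per page rather than per document means a report that mixes digital and scanned pages only pays the OCR cost where it is needed.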

Roadmap

  • Web UI (map viewer, borehole detail, SPT charts)
  • UTM / local grid → WGS84 conversion (pyproj)
  • Human review workflow for low-confidence extractions
  • PDF report generation from structured data
  • Supabase auth and multi-tenant access
  • CAD / GIS import (DXF, Shapefile)

Troubleshooting

Database connection failed
Ensure Docker is running: docker compose ps. Wait for the health check to pass before starting the API.

OCR returns empty text
Verify Tesseract is installed and TESSERACT_CMD points to the executable. Install language packs matching OCR_LANG.

Extraction failed / low quality
Check your API key and model settings. Review ingest_jobs.error_message or the job response. Tune prompts in app/services/extractor.py using your actual report formats.

Geo search returns no results
Boreholes need valid latitude / longitude in the source report (or successful coordinate extraction). UTM-only locations are stored as raw_location until conversion is implemented.

License

Internal use — add your organization's license terms here.
