AnwarDebes/drone-id

Drone-ID: a source-cited drone model catalog and identification API

Drone Identification System

A curated, source-cited reference catalog of drone models, plus an API that matches an observation to a known model by free-text description or by a structured feature set.

Read this first: what this is, and what it is not

This system answers one question: "what model of drone is this?" It is built so that two distinctions are impossible to miss and never get blurred.

1. Model identification, not operator attribution

The catalog tells you what a drone is: its model, manufacturer, and the country of the company that produces it (manufacturer_country). It never tells you whose a drone is.

manufacturer_country is a property of the model. It is not the nationality of whoever is flying a given airframe, and there is deliberately no operator, owner, or nationality field anywhere in the schema or the API. A Chinese-made DJI, a Turkish-made TB2, and a US-made MQ-9 are each flown by many different operators across many countries. Inferring an operator from a model is exactly the mistake this design refuses to make. Operator and nationality attribution is out of scope (see "Out of scope" below).

So: this is not a system that can prove whose drone crossed a border. It can say "that looks like a DJI Matrice 350 RTK". It cannot say who was holding the controller.

2. A catalog, not a sensor

This is a database and a matching API. It is not a sensing system. It stores whether a model supports Remote ID and what a model's RF, acoustic, and visual signatures look like, so that a separate sensing layer can match observations against it. It does not receive Remote ID broadcasts, capture RF, run radar, or classify camera images. Those capabilities need hardware and labeled datasets, and are deferred to Phase 2.

Detection inputs (Remote ID payload fields, RF band lists, image attributes) are designed to be matched against catalog entries, but the project never simulates or fakes sensor capability. The signature-match endpoint scores structured features that something else already extracted, not raw signals.

What is in the box

  • A PostgreSQL catalog (drone_models) with Alembic migrations.
  • A FastAPI application: CRUD, rich filtering, semantic search, and structured signature matching.
  • ChromaDB semantic search over model descriptions and spec text, using a multilingual embedding model by default.
  • An ingestion pipeline with three paths (per-source importers, manual YAML curation, and a validation step that rejects entries missing citations).
  • 56 fully cited entries spanning nine manufacturer countries and categories from consumer to military. The DJI lines are covered first because DJI dominates the market: the Mini, Air, Mavic, FPV, and enterprise Matrice / Agras families. Also included are Autel, Skydio, Parrot, Yuneec, fixed-wing mapping platforms (senseFly, WingtraOne, Quantum Systems), and notable ISR / strike platforms (TB2, MQ-9, Global Hawk, ScanEagle, Heron, Wing Loong II, and others). Run python -m ingestion.audit for the live breakdown.
  • SOURCES.md documenting every data source and its reliability.

Quickstart (no Docker needed)

The app defaults to SQLite and an offline embedding backend, so it runs with no external services. Production uses PostgreSQL and a multilingual model (below).

python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt          # core + dev
# optional, for real semantic search:
# pip install -r requirements-semantic.txt

cp .env.example .env

# offline-friendly settings for a first run:
export EMBEDDING_BACKEND=hash CHROMA_MODE=persistent CHROMA_PATH=./chroma_data

alembic upgrade head                      # build the schema
python -m ingestion.seed --reset          # validate + load the cited catalog + index
uvicorn app.main:app --reload             # serve at http://127.0.0.1:8000

Then open http://127.0.0.1:8000/docs for the interactive API, or:

curl localhost:8000/health
curl "localhost:8000/drones?manufacturer_country=CN"
curl -X POST localhost:8000/search/describe \
  -H 'content-type: application/json' \
  -d '{"text":"small white foldable camera quadcopter ~250g on 2.4GHz","limit":3}'

Running against PostgreSQL + ChromaDB (production shape)

docker compose up -d            # starts postgres + chroma
cp .env.example .env            # then set the two URLs below

In .env:

DATABASE_URL=postgresql+asyncpg://drone:drone@localhost:5432/drone_id
CHROMA_MODE=http
CHROMA_HOST=localhost
CHROMA_PORT=8000
EMBEDDING_BACKEND=sentence_transformers
EMBEDDING_MODEL=paraphrase-multilingual-MiniLM-L12-v2

Then build the schema, seed the catalog, and start the API:

alembic upgrade head
python -m ingestion.seed --reset
uvicorn app.main:app --host 0.0.0.0 --port 8080

The embedding model is multilingual on purpose, so descriptions and queries in other languages embed sensibly. If the semantic stack is not installed or Chroma is unreachable, the rest of the API keeps working and /search/describe returns 503 with a clear reason (check /health).

API

Method  Path                  Auth   Purpose
GET     /                     -      Service banner and scope statement
GET     /health               -      DB + semantic-search status
GET     /drones               -      List / filter the catalog
GET     /drones/{id_or_slug}  -      One model
POST    /drones               admin  Create a model
PUT     /drones/{id_or_slug}  admin  Update a model
DELETE  /drones/{id_or_slug}  admin  Delete a model
POST    /search/describe      -      Semantic search by free text
POST    /match/signature      -      Structured signature match

Filters on GET /drones: manufacturer, manufacturer_country, category, airframe, freq_band, remote_id_supported, weight_min/max, mtow_min/max, q (name + aliases), plus limit / offset.
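
As a rough illustration of combining these filters from a client, the sketch below builds a GET /drones URL with the standard library. The helper name drones_url is hypothetical, and two things are assumptions: that weight_min/max and mtow_min/max expand to four separate parameters, and that booleans are sent as lowercase strings.

```python
from urllib.parse import urlencode

# Filter names copied from the list above. Expanding "weight_min/max" and
# "mtow_min/max" into four separate parameters is an assumption.
KNOWN_FILTERS = {
    "manufacturer", "manufacturer_country", "category", "airframe",
    "freq_band", "remote_id_supported",
    "weight_min", "weight_max", "mtow_min", "mtow_max",
    "q", "limit", "offset",
}


def drones_url(base_url: str, **filters) -> str:
    """Build a GET /drones URL (hypothetical helper, not part of the project)."""
    unknown = set(filters) - KNOWN_FILTERS
    if unknown:
        raise ValueError(f"unknown filter(s): {sorted(unknown)}")
    params = {}
    for key, value in filters.items():
        if value is None:
            continue  # unset filters are simply omitted
        # Sending booleans as "true"/"false" is an assumption of this sketch.
        params[key] = str(value).lower() if isinstance(value, bool) else value
    query = urlencode(sorted(params.items()))  # sorted for a stable URL
    return f"{base_url}/drones?{query}" if query else f"{base_url}/drones"


print(drones_url("http://127.0.0.1:8000",
                 manufacturer_country="CN",
                 weight_max=250,
                 remote_id_supported=True))
```

Rejecting unknown names on the client side catches typos early, before a misspelled filter has a chance to be ignored and return the whole catalog.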

Write endpoints are gated by a shared secret in the X-Admin-Key header (ADMIN_API_KEY). This is a simple gate, not a full auth system; put a real identity layer in front before any deployment.
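
How the project performs this check internally is not shown here. The following is only a minimal sketch of a shared-secret comparison, assuming the request headers are available as a plain dict; the function name admin_key_ok is hypothetical.

```python
import hmac


def admin_key_ok(headers: dict, expected_key: str) -> bool:
    """Return True only if X-Admin-Key matches the configured ADMIN_API_KEY.

    Hypothetical helper: the project's real gate may be written differently.
    """
    if not expected_key:
        return False  # an unset or empty key should never grant access
    # HTTP header names are case-insensitive, so compare them in lowercase.
    supplied = next(
        (value for name, value in headers.items() if name.lower() == "x-admin-key"),
        "",
    )
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(supplied.encode(), expected_key.encode())
```

Even with a constant-time comparison this remains a single shared secret, which is why a real identity layer is still needed in front of any deployment.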

POST /match/signature

Accepts a structured observation and returns ranked candidate models with a transparent per-component breakdown:

{
  "freq_bands": ["2.4GHz", "5.8GHz"],
  "remote_id_fields": {"present": true, "standard": "ASTM F3411"},
  "visual_attributes": ["foldable", "quadcopter"],
  "weight_estimate": {"value": 245, "tolerance": 60},
  "size_estimate": {"value": 150, "tolerance": 80}
}

Scoring is a weighted mean over only the components you actually provide (freq 0.30, weight 0.20, size 0.20, remote_id 0.15, visual 0.15), so a sparse observation is not penalised for what it could not measure. See the important limitation in "Out of scope": this matches structured features, not raw signals.
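
The renormalisation described above can be sketched as follows. The weights are the ones listed; the per-component scores are assumed to be similarities in [0, 1] that were computed elsewhere, since how each component is scored is not described here. The function name combine is hypothetical.

```python
# Component weights as listed above.
WEIGHTS = {
    "freq": 0.30,
    "weight": 0.20,
    "size": 0.20,
    "remote_id": 0.15,
    "visual": 0.15,
}


def combine(component_scores: dict) -> float:
    """Weighted mean over only the components present in component_scores.

    component_scores maps a component name to a similarity in [0, 1].
    Components the observation did not provide are simply left out of the
    dict, so they contribute neither to the numerator nor to the denominator.
    """
    total_weight = sum(WEIGHTS[name] for name in component_scores)
    if total_weight == 0:
        return 0.0  # nothing was provided, so there is nothing to score
    weighted = sum(WEIGHTS[name] * score for name, score in component_scores.items())
    return weighted / total_weight


# Only frequency and weight were observed: (0.30*1.0 + 0.20*0.5) / 0.50
print(round(combine({"freq": 1.0, "weight": 0.5}), 3))
```

Dividing by the sum of the weights actually used is what keeps a sparse observation from being penalised: in the example, a perfect frequency match and a half match on weight score 0.8 rather than 0.4.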

Each candidate carries the model's data_confidence, so you can see how trustworthy its specs are, and the response sets ambiguous: true when the top two candidates are too close to separate confidently. That matters for a critical system: the API surfaces uncertainty instead of asserting a single answer.
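
One way to picture the ambiguous flag is as a gap test on the top two scores. The sketch below is an illustration only: the function name is_ambiguous and the min_gap threshold of 0.05 are assumptions, not values taken from the project.

```python
def is_ambiguous(candidate_scores: list, min_gap: float = 0.05) -> bool:
    """Return True when the two best scores are closer than min_gap.

    min_gap is a hypothetical threshold chosen for this sketch.
    """
    if len(candidate_scores) < 2:
        return False  # a single candidate has no runner-up to be confused with
    top, runner_up = sorted(candidate_scores, reverse=True)[:2]
    return (top - runner_up) < min_gap


print(is_ambiguous([0.82, 0.80, 0.41]))  # top two are 0.02 apart
print(is_ambiguous([0.90, 0.60]))        # top two are 0.30 apart
```

A caller would treat an ambiguous result as a shortlist to review rather than as an identification.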

Data model

One table, drone_models. Highlights:

  • Identity: model_name, aliases, manufacturer, manufacturer_country (ISO 3166-1 alpha-2), category, airframe, slug.
  • Physical: length_mm, width_mm, height_mm, diagonal_mm, weight_g, mtow_g.
  • Performance: max_speed_kmh, range_km, endurance_min, max_altitude_m, max_payload_g.
  • Comms: control_freq_bands, video_freq_bands, control_protocol, gnss_support.
  • Detection signatures: remote_id_supported, remote_id_standard, radar_cross_section_class, acoustic_notes, visual_identifiers.
  • Provenance: sources (JSON array of {field, url_or_citation, retrieved_at}) and data_confidence (verified / partial / unverified).

Every populated spec field carries a citation. Unknown fields are left null, not estimated. The catalog is designed and migrated for PostgreSQL (native arrays and JSONB); the same models and migration also run on SQLite for local dev and tests.

Ingestion

Three paths, all converging on the same validated DroneCreate shape:

  1. Per-source importers (ingestion/sources/): manufacturer pages, FAA UAS, EASA. The manufacturer/FAA/EASA importers are documented scaffolds today (they raise NotImplementedError with an implementation plan rather than returning fabricated data); the manual YAML importer is fully working.
  2. Manual curation (data/catalog/**.yaml, one file per model): the primary path right now. See data/SCHEMA.md.
  3. Validation (app/validation.py): rejects any entry that populates a spec field without a covering source citation.

python -m ingestion.seed --validate-only   # parse + validate, touch nothing
python -m ingestion.seed --reset           # rebuild catalog + index
python -m ingestion.seed --no-embed        # DB only, skip ChromaDB
python -m ingestion.audit                  # coverage / confidence / citation snapshot

Most of the catalog was built by verified web research captured as cited JSON in ingestion/research/ (kept as an audit trail), then turned into validated YAML by python -m ingestion.convert_research. That converter is a gate: it strips any field it cannot tie to a citation rather than guessing, so only cited data lands. The choice to grow the catalog this way, rather than via a scraper, is deliberate: for a critical system, verified and cited beats fast and fragile.
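
The two citation gates described in this section (validation rejects, the converter strips) can be sketched against the sources shape from the data model. This is not the code in app/validation.py or ingestion.convert_research: the function names, the set of fields exempt from citation, and the example entry with its placeholder URL are all assumptions made for illustration.

```python
# Fields treated as identity/provenance rather than specs.
# Which fields are exempt is an assumption of this sketch.
NON_SPEC_FIELDS = {"model_name", "slug", "aliases", "sources", "data_confidence"}


def uncited_fields(entry: dict) -> list:
    """List populated spec fields that no source entry covers."""
    cited = {
        src["field"]
        for src in entry.get("sources", [])
        if src.get("url_or_citation")
    }
    return sorted(
        name
        for name, value in entry.items()
        if name not in NON_SPEC_FIELDS and value is not None and name not in cited
    )


def validate(entry: dict) -> None:
    """Validation-style gate: reject the whole entry."""
    missing = uncited_fields(entry)
    if missing:
        raise ValueError(f"uncited spec field(s): {missing}")


def strip_uncited(entry: dict) -> dict:
    """Converter-style gate: null out uncited fields instead of guessing."""
    drop = set(uncited_fields(entry))
    return {name: (None if name in drop else value) for name, value in entry.items()}


# Placeholder entry, not real catalog data.
entry = {
    "model_name": "Example Quad",
    "weight_g": 249,
    "max_speed_kmh": 57,
    "sources": [
        {"field": "weight_g",
         "url_or_citation": "https://example.com/spec-sheet",
         "retrieved_at": "2024-01-01"},
    ],
}

print(uncited_fields(entry))                  # max_speed_kmh has no citation
print(strip_uncited(entry)["max_speed_kmh"])  # nulled rather than kept
```

Setting a stripped field to None rather than deleting the key mirrors the rule that unknown fields are left null, not estimated.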

Data coverage

Snapshot from python -m ingestion.audit (56 entries):

  • By category: commercial 16, consumer 15, military strike 12, military ISR 7, VTOL 3, racing 2, fixed-wing 1.
  • By manufacturer country (of the producer, never an operator): CN 30, US 13, TR 3, FR 2, CH 2, IL 2, RU 2, DE 1, IR 1.
  • Confidence: 89% verified (50/56), 11% partial (6/56). After a second re-verification pass against primary sources, the only entries left partial are those with no manufacturer or government spec sheet: CH-4, Wing Loong II, Lancet-3, Orlan-10, Shahed-136 (open-source intelligence estimates) and Freefly Alta X (no official altitude/range published). These are honestly marked, never inflated. For a system that could inform defense use, acknowledged uncertainty is safer than false confidence.
  • Citation health: every populated spec field is cited (the seeder enforces this).

The country spread is the point: it shows manufacturer country is a property of the model and nothing more. The same airframe is operated by many nations, which is exactly why operator attribution is out of scope.

Out of scope (Phase 2: needs hardware and/or separate ML effort)

These are intentionally not built here, and the design does not pretend they are:

  • Live RF, radar, and acoustic sensing. Requires hardware. The catalog stores signature descriptors to match against; it does not capture signals.
  • Camera + ML detection/classification. A separate ML project that needs a labeled drone image dataset. The catalog stores visual_identifiers text only.
  • Remote ID broadcast receiver/decoder. Requires hardware and RF. The catalog stores whether a model supports Remote ID; live decoding is separate.
  • RF / acoustic signature matching against real captured signals. The /match/signature endpoint matches on structured features only. Matching real captured signals would need a labeled signal dataset (RF fingerprints, acoustic profiles) that does not exist in this project.
  • Operator / nationality attribution of a detected drone. Out of scope by design, for the reasons in "Read this first".

Testing

pytest -q

All 46 tests pass offline (SQLite plus the hash embedder, no services needed). The suite covers the scoring logic, the citation validation, the schema rules, and a full end-to-end pass over the API (filters, semantic search, signature match, admin gating, and create/update/delete) against the seeded catalog.

Project layout

app/            FastAPI app: models, schemas, crud, scoring, search, routers
ingestion/      loader, seed CLI, validation use, per-source importers
data/catalog/   curated YAML, one file per model (the seed data)
migrations/     Alembic
tests/          unit + end-to-end
SOURCES.md      data sources and reliability
data/SCHEMA.md  YAML entry format