ETL pipeline developed in Python to extract data from the Zammad API, transform it, and load it into a PostgreSQL database for analytics and reporting.
This project implements a full ETL workflow:
1. Extraction: Data retrieval from the Zammad API (tickets, users, organizations, states, time entries, and real ticket change history).
2. Transformation: Data cleaning, normalization, and enrichment, including reconstruction of real ticket timelines.
3. Loading: Incremental loading into PostgreSQL with a relational schema optimized for BI analysis.
Zammad API → extract.py → transform.py → load_postgre.py → PostgreSQL
↑
main.py (orchestrator)
| File | Description |
|---|---|
main.py |
ETL orchestrator |
extract.py |
Data extraction from API (with pagination and ticket history) |
transform.py |
Data cleaning and transformation |
load_postgre.py |
PostgreSQL loading (table creation and upserts) |
load_sqlite.py |
SQLite loading (development-only, disabled by default) |
The load_sqlite.py was used during development and have been replaced by the final PostgreSQL implementation.
The model is structured in three layers:
Dimension tables: — reference catalogs:
users, organizations, ticket_states, time_entry_types
Core table:
tickets — connects all dimensions
Fact & history tables: — capture activity over time:
time_entries, ticket_state_history, ticket_event_history, fact_ticket_time, fact_ticket_time_enriched
Aggregated BI table:
ticket_summary — denormalized and optimized for direct consumption in visualization tools
A key feature of this project is the integration of the Zammad endpoint /api/v1/ticket_history/{id} . This endpoint returns all individual changes of a ticket with exact timestamps, allowing reconstruction of the real timeline of each ticket, including transitions between states such as open, pending, resolved, or billable.
Each change is stored with:
start_time
end_time
duration_h_m
This enables advanced analysis such as time per state, bottlenecks, and full lifecycle tracking.
Note: Access to this endpoint requires a token with admin and ticket.agent permissions enabled.
Create an .env file based on .env.example:
ZAMMAD_TOKEN=your_token
ZAMMAD_URL=https://your-zammad-instance/api/v1
DB_HOST=your_host
DB_PORT=5432
DB_NAME=your_db
DB_USER=your_user
DB_PASSWORD=your_passwordpip install -r requirements.txtMain dependencies:
pandas
requests
psycopg2
python-dotenv
python src/main.pyThe process performs:
- Extraction of tickets, users, organizations, states, and time entry types
- Extraction of real change history per ticket
- Data transformation and cleaning
- Table creation if not exists
- Data insertion/update (upserts using
ON CONFLICT) - Generation of the aggregated
ticket_summarytable
The project is fully containerized using Docker and includes an internal cron-based scheduler for automated execution.
The ETL runs inside a lightweight Python 3.11 container with cron installed.
Key aspects of the Docker configuration:
- Based on
python:3.11-slim - Installs
croninside the container - Copies the project and dependencies
- Registers a custom cron job
- Uses a startup script to initialize execution and scheduling
A cron job is embedded directly into the container via a custom crontab file:
*/30 * * * * root cd /app && python src/main.py >> /var/log/cron.log 2>&1This means:
- The ETL runs every 30 minutes.
- Execution happens inside the container.
- Logs are written to
/var/log/cron.log - Both standard output and errors are captured
The container is started using a script (start.sh in Docker, implemented as stash.sh), which orchestrates the runtime behavior:
#!/bin/bash
echo "Starting Zammad ETL..."
python src/main.py
echo "Initial execution completed. Starting cron..."
cron && tail -f /var/log/cron.logWhat this script does:
1. Initial ETL run
- Executes the pipeline immediately when the container starts
- Ensures data is available without waiting for the first cron cycle
2. Starts cron daemon
- Activates the scheduled job inside the container
3. Log streaming
- Uses
tail -fto continuously stream logs - Keeps the container alive (required for Docker)
- Automatic pagination for Zammad API
- Real ticket change history reconstruction (per ticket)
- Time unit transformation (1 unit = 15 minutes)
- Upserts using ON CONFLICT to prevent duplicates
- Normalized relational model + BI-ready ticket_summary table
- Ready for integration with BI tools like Metabase, Apache Superset, or Grafana
- Dockerized execution with built-in cron scheduling
- The
.envfile is excluded from the repository for security - Requires Zammad API access with admin and ticket.agent permissions
- PostgreSQL must be available before execution
- Fact tables (
fact_ticket_time, fact_ticket_time_enriched, ticket_state_history) are fully rebuilt on each run usingTRUNCATEbefore loading
Cèlia Trullà
This project was developed in a professional internship context as part of a data integration and processing workflow.
- GitHub: https://github.com/ChampCTE
- Email: celia.trulla@gmail.com
This project is intended for internal or educational use only. Commercial use is not permitted.
For inquiries or usage requests, contact: celia.trulla@gmail.com