Skip to content

ChampCTE/ETL-Zammad-API

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ETL Zammad → PostgreSQL

ETL pipeline developed in Python to extract data from the Zammad API, transform it, and load it into a PostgreSQL database for analytics and reporting.


Description

This project implements a full ETL workflow:

1. Extraction: Data retrieval from the Zammad API (tickets, users, organizations, states, time entries, and real ticket change history).

2. Transformation: Data cleaning, normalization, and enrichment, including reconstruction of real ticket timelines.

3. Loading: Incremental loading into PostgreSQL with a relational schema optimized for BI analysis.


Architecture

Zammad API → extract.py → transform.py → load_postgre.py → PostgreSQL
                                ↑
                          main.py (orchestrator)

Components

File Description
main.py ETL orchestrator
extract.py Data extraction from API (with pagination and ticket history)
transform.py Data cleaning and transformation
load_postgre.py PostgreSQL loading (table creation and upserts)
load_sqlite.py SQLite loading (development-only, disabled by default)

The load_sqlite.py was used during development and have been replaced by the final PostgreSQL implementation.


Data Model

The model is structured in three layers:

Dimension tables: — reference catalogs: users, organizations, ticket_states, time_entry_types

Core table: tickets — connects all dimensions

Fact & history tables: — capture activity over time: time_entries, ticket_state_history, ticket_event_history, fact_ticket_time, fact_ticket_time_enriched

Aggregated BI table: ticket_summary — denormalized and optimized for direct consumption in visualization tools


Real Change History

A key feature of this project is the integration of the Zammad endpoint /api/v1/ticket_history/{id} . This endpoint returns all individual changes of a ticket with exact timestamps, allowing reconstruction of the real timeline of each ticket, including transitions between states such as open, pending, resolved, or billable.

Each change is stored with:

start_time

end_time

duration_h_m

This enables advanced analysis such as time per state, bottlenecks, and full lifecycle tracking.

Note: Access to this endpoint requires a token with admin and ticket.agent permissions enabled.


Configuration

Create an .env file based on .env.example:

ZAMMAD_TOKEN=your_token
ZAMMAD_URL=https://your-zammad-instance/api/v1
 
DB_HOST=your_host
DB_PORT=5432
DB_NAME=your_db
DB_USER=your_user
DB_PASSWORD=your_password

Installation

pip install -r requirements.txt

Main dependencies:

pandas
requests
psycopg2
python-dotenv

Execution

python src/main.py

The process performs:

  1. Extraction of tickets, users, organizations, states, and time entry types
  2. Extraction of real change history per ticket
  3. Data transformation and cleaning
  4. Table creation if not exists
  5. Data insertion/update (upserts using ON CONFLICT)
  6. Generation of the aggregated ticket_summary table

Deployment

The project is fully containerized using Docker and includes an internal cron-based scheduler for automated execution.

Docker Setup

The ETL runs inside a lightweight Python 3.11 container with cron installed.

Key aspects of the Docker configuration:

  • Based on python:3.11-slim
  • Installs cron inside the container
  • Copies the project and dependencies
  • Registers a custom cron job
  • Uses a startup script to initialize execution and scheduling

Scheduled Execution (Cron inside container)

A cron job is embedded directly into the container via a custom crontab file:

*/30 * * * * root cd /app && python src/main.py >> /var/log/cron.log 2>&1

This means:

  • The ETL runs every 30 minutes.
  • Execution happens inside the container.
  • Logs are written to /var/log/cron.log
  • Both standard output and errors are captured

Startup Script (stash.sh)

The container is started using a script (start.sh in Docker, implemented as stash.sh), which orchestrates the runtime behavior:

#!/bin/bash

echo "Starting Zammad ETL..."

python src/main.py

echo "Initial execution completed. Starting cron..."

cron && tail -f /var/log/cron.log

What this script does:

1. Initial ETL run

  • Executes the pipeline immediately when the container starts
  • Ensures data is available without waiting for the first cron cycle

2. Starts cron daemon

  • Activates the scheduled job inside the container

3. Log streaming

  • Uses tail -f to continuously stream logs
  • Keeps the container alive (required for Docker)

Key Features

  • Automatic pagination for Zammad API
  • Real ticket change history reconstruction (per ticket)
  • Time unit transformation (1 unit = 15 minutes)
  • Upserts using ON CONFLICT to prevent duplicates
  • Normalized relational model + BI-ready ticket_summary table
  • Ready for integration with BI tools like Metabase, Apache Superset, or Grafana
  • Dockerized execution with built-in cron scheduling

Notes

  • The .env file is excluded from the repository for security
  • Requires Zammad API access with admin and ticket.agent permissions
  • PostgreSQL must be available before execution
  • Fact tables (fact_ticket_time, fact_ticket_time_enriched, ticket_state_history) are fully rebuilt on each run using TRUNCATE before loading

Author

Cèlia Trullà

This project was developed in a professional internship context as part of a data integration and processing workflow.


License

This project is intended for internal or educational use only. Commercial use is not permitted.

For inquiries or usage requests, contact: celia.trulla@gmail.com

About

ETL pipeline developed in Python to extract data from the Zammad API, transform it, and load it into a PostgreSQL database for analytics and reporting.

Topics

Resources

Stars

Watchers

Forks

Contributors