It's not part of the problem statement, but I view data architecture as a seven-layer burrito, pictured in this diagram (PDF) or, for a better experience, inside the Lucid App, where you can submit comments!
The diagram is just me drawing off the top of my head: no research involved, just hard-won experience.
The problem this repo addresses:
Use Docker Compose and its supporting technologies to run all the code necessary.
Run a dedicated container running MySQL 8.
Load the Top 1000 US Cities by population .csv file into MySQL as part of docker compose up.
Write a Python app that connects to the MySQL database and executes a query that aggregates the population determined by a list of US States as input and prints the results to standard out (stdout).
Build a container image that has the Python code above and the requirements in a virtualenv specific to just the code, not the system's Python dist-packages.
Have the ability to enable debug logging by setting an environment variable DEBUG=true.
Use command line arguments to specify specific US States and their total population.
Provide a README that documents how to setup, build, and run the project.
Provide a compressed archive (.zip or .tar.gz) file with the minimum Project files below.
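To make the aggregation requirement above concrete, here is a minimal sketch of how the query could be parameterized; the table name city_populations and its columns are my assumptions for illustration, not the actual schema:

```python
# Hypothetical sketch of the aggregation query. The table name
# "city_populations" and its columns are assumptions for illustration.
def build_population_query(states):
    """Return a parameterized SQL string summing populations per state."""
    placeholders = ", ".join(["%s"] * len(states))  # one %s per state name
    return (
        "SELECT state, SUM(population) AS total_population "
        "FROM city_populations "
        f"WHERE state IN ({placeholders}) "
        "GROUP BY state"
    )
```

The state names themselves are passed separately to the driver (e.g. cursor.execute(query, states)) so they are escaped properly rather than interpolated into the SQL string.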
Project files
You will provide, at the minimum:
README.md - instructions to setup, build, and run your project submission.
docker-compose.yml
Dockerfile for running the Python app code provided with no secrets embedded in the image.
Python source code in a subdirectory src/.
├── Dockerfile
├── README.md
├── .dockerignore
├── docker-compose.yml
├── entrypoint
├── requirements.txt
└── src/
Python app arguments
The Python app must support the following CLI arguments:
The username for MySQL authentication.
The password for MySQL authentication, passed in one of the following ways: via STDIN, read from a local file, or as plain text on the CLI (with a warning against doing that).
One or more state names for which to calculate the total population.
Display usage information when the argument --help is provided.
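A minimal sketch of how these rules could be wired up. I use argparse here to keep the example dependency-free (the repo itself uses click), and the flag names simply mirror the README examples; everything else is illustrative:

```python
import argparse
import getpass
import sys

def parse_args(argv):
    # Flag names mirror the README's examples; the rest is illustrative.
    parser = argparse.ArgumentParser(prog="us-city-pops")
    parser.add_argument("--mysql_username")
    parser.add_argument("--mysql_password", help="plain text on the CLI (discouraged)")
    parser.add_argument("--mysql_password_file", help="read the password from a local file")
    parser.add_argument("-s", "--states", help="comma-separated US state names")
    return parser.parse_args(argv)

def resolve_password(args):
    """Prefer a file, then the CLI flag (with a warning), then an interactive prompt."""
    if args.mysql_password_file:
        with open(args.mysql_password_file) as f:
            return f.read().strip()
    if args.mysql_password:
        print("WARNING: passing a password as plain text on the CLI is insecure",
              file=sys.stderr)
        return args.mysql_password
    # Falls back to prompting on the terminal (reads from STDIN).
    return getpass.getpass("MySQL password: ")
```

argparse provides the --help usage screen for free, which satisfies the last requirement.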
Plotly dataset: Top 1000 US Cities by population
Top 1000 US Cities by population (GitHub)
This is mostly hand-coded Python with minimal help from LLMs, other than using Claude to look up a few details of syntax. I borrowed bits of boilerplate code from other projects. Total time spent on this repo: about 5 hours, including polish and much time spent writing down my implementation notes and thoughts.
Copy .env-sample to .env and supply the MySQL root password and regular user username/password.
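For orientation, the .env might look something like this (variable names here are my guess; the repo's .env-sample is authoritative):

```shell
# Illustrative values only -- match the variable names in .env-sample
MYSQL_ROOT_PASSWORD=change-me
MYSQL_USER=mysql_user
MYSQL_PASSWORD=change-me-too
```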
To run:
# simplest (runs an all-states dump):
docker compose up --build
# for a usage menu:
docker compose run --rm us-city-pops --help
# to print population of 2 states
docker compose run --rm us-city-pops -s "Ohio, New York"
# now remove the contents of the .env file and try command-line authentication:
docker compose run --rm us-city-pops -s "Ohio, New York" --mysql_password=XXX --mysql_username=ZZZ
# or you can enter your password interactively (again, if .env is empty):
docker compose run --rm us-city-pops -s "Vermont" --mysql_username=mysql_user
# to run the tests:
poetry run python -m pytest
Prepend DEBUG=true to the entrypoint or the poetry command for more log output. This is case-sensitive.
- I'd normally cache the external data somehow. In-memory is fine for this data set, but Redis and Memcached are fast in-memory options; a document store like Mongo suits certain use cases, and a relational database like MySQL or PostgreSQL suits others.
- I sometimes prefer to distinguish LOG_LEVEL from a DEBUG flag. The semantics of the latter are "run the program with the debug behavior" (which may follow special code paths), while LOG_LEVEL determines what you print as you follow that code path. However, separate code paths for DEBUG mode can be a code smell; sometimes this is more harmful than useful.
- I chose to interpret DEBUG=true very literally, only the literal "true" will turn on debug. This is the letter of the requirement; normally, however, I'd be flexible and accept "truthy" values like 1 and "True".
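The literal interpretation is a one-liner; a sketch (the function name is mine):

```python
import os

def debug_enabled(env=None):
    """Strictly literal: only the exact string 'true' enables debug logging."""
    env = os.environ if env is None else env
    return env.get("DEBUG") == "true"
```

Under this strict reading, DEBUG=True and DEBUG=1 leave debug logging off.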
- structlog is typically my first import in a new project. I kept the formatter basic for now. Usually I'll spend time prettying up the structlog format and adding alternative output modes (JSON for machine use, more textual and with semantic colors for developers).
- click is my second import, and is a great abstraction over argument parsing and documentation.
- I didn't set up a pre-commit config file, but I would in a real project. This would ensure that the whole linter and test toolchain specified in pyproject.toml gets run successfully with every commit. You'd be surprised how many problems this catches before it ever touches a human reviewer.
- Similarly, in a real project, I'd plug an LLM-based code review into the GitHub CI/CD flow, whether coordinated by GitHub Actions, Harness, or CircleCI. That way PRs start out with an LLM review that the human can use as a springboard. Not a replacement for human review, but a good leg up.
- Data is loaded from the Plotly CSV into MySQL via a bulk executemany() insert that runs each time you start the container. This won't scale, but it's fine for the 4000-row data set we have -- think about this together with the caching issue mentioned earlier. In real code, a bulk insert or upsert would be almost mandatory, with consideration for maintaining idempotency. The size of the data is key in choosing the appropriate strategy.
- All the data is loaded into MySQL unvalidated. In many cases, I prefer not to validate at ingest time, instead deferring to a downstream view or transform, something like city_populations_by_year_clean, which would add checks for null constraints, valid data types, acceptable values, etc.
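A runnable sketch of the idempotent flavor, using sqlite3 as a stdlib stand-in for MySQL (with MySQL the clause would be INSERT ... ON DUPLICATE KEY UPDATE); the table and row values are illustrative, not the repo's schema:

```python
import sqlite3

# Illustrative rows; not real data from the Plotly CSV.
rows = [("Columbus", "Ohio", 905748), ("Albany", "New York", 99224)]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cities (city TEXT, state TEXT, population INTEGER, "
    "PRIMARY KEY (city, state))"
)
# Running the load twice leaves the table unchanged: the upsert makes
# the bulk insert idempotent across container restarts.
for _ in range(2):
    conn.executemany(
        "INSERT INTO cities (city, state, population) VALUES (?, ?, ?) "
        "ON CONFLICT(city, state) DO UPDATE SET population = excluded.population",
        rows,
    )
count = conn.execute("SELECT COUNT(*) FROM cities").fetchone()[0]
```

The composite primary key is what gives the ON CONFLICT clause something to conflict on; without a unique constraint, a restart would silently duplicate every row.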
- Time was short so test coverage is only partial. I did heavily use the tests to develop the param parsing logic, which is hard to get right without a unit test!
A few areas I'd get clarified in the requirements gathering phase (or just via conversations with the product owner as I roughed out the prototype's skeleton):
- The package name "states" isn't great; on a real project, I'd spend an hour of background time thinking up a better name. Names matter for the API surface we are exposing, because they matter to humans who use them. Without more context for this project it's difficult to pick a better name. Matching the package name in Python to the externally visible API surface isn't strictly necessary, but it saves everyone a lot of cognitive load translating.
- Requirements didn't specify fetching the CSV via HTTP vs. just copying the file into the project. I chose the former because it's generally better practice than shipping your repo with a multi-megabyte file.
- This line of requirements is ambiguous and in real life I'd be asking clarifying questions: "Use command line arguments to specify specific US States and their total population"
- Requirements don't specify the file type of entrypoint, so I guessed and made it a bash script.
- Requirements have this line but nothing else about visualizing the dataset with Plotly, so I disregarded it: "Plotly dataset: Top 1000 US Cities by population".
- Requirements say entrypoint, but as an old Unix hacker, I mislike shell scripts that don't advertise themselves as such, so it's entrypoint.sh in my repo.
- Requirements say "calculate the total population" but don't say whether I'm to report by year or sum all years together (which doesn't really make sense, but would be worth discussing with product owners to understand their needs; the app might need a "year" param in the CLI).
- "prints the results to standard out (stdout)" is insufficiently specified -- in what format, and with what columns? I chose to just print(json.dumps()) because that gets the product in front of the product owner quicker so that we can have that conversation.
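For example, the first-cut output is just the serialized dict; the shape and numbers here are placeholders, not real query results:

```python
import json

# Placeholder result shape: state name -> aggregated population.
results = {"Ohio": 5289607, "New York": 14399064}
payload = json.dumps(results, indent=2, sort_keys=True)
print(payload)  # good enough to start the conversation about output format
```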
I used PyCharm with an IntelliJ AI subscription and Claude Code alongside. I didn't use LLMs to code anything in this
repo, but I did rely heavily on IntelliJ's code completion/templating and I used Claude for linting,
reformatting, and checking style. CLAUDE.md has the rules for this. It's very powerful to put rulesets in these files,
especially as the project grows and you add nested CLAUDE.md files in submodules.
Even more powerful is to give your LLMs access to certain databases and data sets via MCP servers, and let the project
know about these via CLAUDE.md context files. Doing this securely takes some work, but it's the path to the
LLM-assisted land of milk and honey.
Copyright (c) 2026 Callie Tweney. All rights reserved.