While there are other projects building up small (~2x2) TPU-inspired designs (see related projects below), this project has a salient combination of goals:
- Modular SystemVerilog setup to support non-rectangular systolic architectures
- Easy software interface via ONNX EP and maybe others
- Scaffolding to evaluate architectural tradeoffs, including performance counters
- Support for FPGA deployment
Set up in WSL or other Linux:
- `sudo apt install iverilog` -- Icarus Verilog for simulation
- `sudo apt install verilator` -- compile SV -> C++ for EP linkage
- `sudo apt install yosys` -- (optional) Yosys for synthesis (or build from source for the latest version)
- Install pre-built onnxruntime (check https://github.com/microsoft/onnxruntime/releases) -- this is used to build the ONNX EP C++ library:
```sh
sudo mkdir -p /opt/onnxruntime
cd /tmp
wget https://github.com/microsoft/onnxruntime/releases/download/v1.24.2/onnxruntime-linux-x64-1.24.2.tgz
sudo tar -xzf onnxruntime-linux-x64-1.24.2.tgz -C /opt/onnxruntime --strip-components=1
```
- Add the ONNX Runtime library to your library path:
```sh
echo 'export LD_LIBRARY_PATH=/opt/onnxruntime/lib:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
```

Set up a venv for python packages:
```sh
python3 -m venv .venv
source .venv/bin/activate
# Python tool for more powerful SystemVerilog testing
pip install cocotb
# Run ONNX models (matching onnxruntime version to the downloaded release)
pip install onnxruntime==1.24.2 onnx
```

Build:
```sh
mkdir -p build && cd build
cmake .. -DSIM=ON
make -j
```

Important flags:
- `-DSIM=ON` -- link Verilator into the ONNX EP so that it executes the Verilator simulation. When off, the EP will attempt to use real hardware once implemented.
- Install the Surfer waveform viewer VSCode extension for viewing waveform files

```sh
cd build && ctest --verbose
```

Tests produce waveform files (`*.fst`) in `test/sim_build/`. Open them in VSCode with the Surfer extension to inspect signals.
The end-to-end flow is: generate an ONNX model → run it through onnxruntime
with the TinyXPU execution provider, which dispatches MatMulInteger to the
Verilator simulation of the systolic array.
Step 1 — build (see above, must use -DSIM=ON).
Step 2 — generate the ONNX model:
```sh
cd scripts
python3 matmul.py
# writes matmul_integer_4x4.onnx
```

The model contains a single MatMulInteger node:

```
X (int8,  [M, 4])
W (int8,  [4, 4])
Y (int32, [M, 4]) = MatMulInteger(X, W)
```
Constraints: K = N = 4 (the hardware array dimensions).
M is unrestricted — the array streams one row of X at a time.
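The node's semantics can be sketched in NumPy (an illustration, not the project's scripts; it assumes the optional MatMulInteger zero-point inputs are zero):

```python
import numpy as np

# MatMulInteger with zero points of 0: widen int8 operands to int32, then matmul.
rng = np.random.default_rng(0)
M = 6                                                      # any number of output rows
X = rng.integers(-128, 128, size=(M, 4), dtype=np.int8)    # K = 4
W = rng.integers(-128, 128, size=(4, 4), dtype=np.int8)    # K = N = 4
Y = X.astype(np.int32) @ W.astype(np.int32)                # int32, [M, 4]
```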
Step 3 — run with the TinyXPU EP:
```sh
python3 run_matmul.py
```

The script registers the plugin EP, loads the model, feeds a 4×4 int8 input, verifies the int32 result against a NumPy reference, and prints PASS on success. It replaces the old `onnx-plugin/test/test_tinyxpu_ep.py` development harness.
The systolic array is a `ROWS` × `COLS` grid of processing elements (PEs) connected in a mesh. Each PE performs one multiply-accumulate per cycle. The array size is set by `ROWS` and `COLS` parameters, overridable at elaboration time (e.g. via Verilator's `-GROWS=N -GCOLS=N`).
Current dataflow is weight-stationary: weights are loaded once into the PE grid, then activations stream east (→) through each row while partial sums cascade south (↓) through each column, accumulating as they go. This maximises weight reuse — each weight participates in every row of the output tile without being reloaded.
Input and output ports of `array.sv`:
- `data_in[ROWS]` — one int8 activation per row, presented sequentially (one output row per streaming cycle); internal skew registers stagger them automatically
- `weight_in[ROWS*COLS]`, `weight_ld` — load all PE weights in one cycle before streaming begins
- `acc_out[COLS]` — one int32 result per column, de-skewed so all columns are valid at the same cycle
Note on configurable dataflow: It is feasible to add a `DATAFLOW` parameter (weight-stationary vs output-stationary) switchable via a CMake option that passes `-GDATAFLOW=0/1` to Verilator. However, the two modes differ enough in their weight-loading interface (`weight_in[ROWS*COLS]` broadcast vs `weight_in[COLS]` streaming) that a clean unified port is awkward in SV. The most practical approach would be separate `array_ws.sv`/`array_os.sv` files, selected by the CMake `ARRAY_DATAFLOW` option, sharing a common `pe.sv` modified with a `generate if` for the accumulation logic.
Processing Element (PE) for systolic array, named as in Kung (1982):
- Performs multiply-accumulate: `acc += weight * data_in`
- Passes data through to neighboring PEs via `data_out`
- The PE does int8 × int8 → int32, then int32 + int32 → int32. int8×int8→int32 is the standard choice (used by Google's TPUs, Arm NEON `sdot`, etc.)
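The per-cycle MAC can be sketched as a Python golden model (illustrative names, not the actual `pe.sv` ports):

```python
import numpy as np

def pe_step(acc_in: np.int32, data_in: np.int8, weight: np.int8) -> np.int32:
    """One multiply-accumulate: int8 x int8 -> int32, then int32 + int32 -> int32."""
    prod = np.int32(data_in) * np.int32(weight)  # worst case (-128)*(-128) = 16384
    return np.int32(acc_in + prod)

# A single product always fits in int32; only very long accumulation
# chains could wrap the 32-bit accumulator.
```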
In a systolic array, there are two distinct phases:
- Weight loading phase (`weight_ld=1, en=0`): Before computation begins, you load each PE with its weight from the weight matrix. In a 2x2 systolic array doing C = A × B, each PE gets one element of B. This happens once per matrix multiply (or once per tile, for larger matrices).
- Compute phase (`weight_ld=0, en=1`): The weights stay "stationary" (this is the weight-stationary dataflow). Input activations stream through via `data_in`/`data_out`, and partial sums accumulate via `acc_in`/`acc_out`. The weights don't change during this phase.

So the typical sequence is:
- Load weights for all PEs (a few cycles with `weight_ld=1`)
- Stream many inputs through with weights held fixed (`en=1, weight_ld=0`)
- When you need new weights (next layer, next tile), load again
Data flows east (→), partial sums flow south (↓) — this is the standard weight-stationary systolic layout from Kung (1982).
```
             weight_ld
                 │
                 │    en
                 ▼    ▼
           ┌──────────┐
           │ PE       ├──► data_out
           │          │
           │  weight  │
           │  (reg)   │
data_in ──►│          │
           │   × +    │
acc_in  ──►│          ├──► acc_out
           └──────────┘
```
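Putting the two phases together, the array's behaviour can be checked with a cycle-level golden model in Python (a sketch only; the 2×2 size, register-timing convention, and tick offsets are illustrative, not taken from the RTL):

```python
import numpy as np

ROWS, COLS = 2, 2
A = np.array([[1, 2], [3, 4], [5, 6]], dtype=np.int8)  # M x ROWS activations
B = np.array([[1, 2], [3, 4]], dtype=np.int8)          # ROWS x COLS weights
M = A.shape[0]

# Phase 1 (weight_ld=1): every PE latches its stationary weight.
weight = B.astype(np.int32)

# Phase 2 (en=1): per-PE pipeline registers; data flows east, sums flow south.
data_reg = np.zeros((ROWS, COLS), dtype=np.int32)
acc_reg = np.zeros((ROWS, COLS), dtype=np.int32)
C = np.zeros((M, COLS), dtype=np.int32)

for t in range(M + ROWS + COLS):
    # Skewed external inputs: row r sees A[i, r] at tick i + r.
    ext = np.zeros(ROWS, dtype=np.int32)
    for r in range(ROWS):
        if 0 <= t - r < M:
            ext[r] = A[t - r, r]
    # One clock edge: each PE samples its west/north neighbours' registers.
    d_in = np.empty((ROWS, COLS), dtype=np.int32)
    a_in = np.empty((ROWS, COLS), dtype=np.int32)
    for r in range(ROWS):
        for c in range(COLS):
            d_in[r, c] = ext[r] if c == 0 else data_reg[r, c - 1]
            a_in[r, c] = np.int32(0) if r == 0 else acc_reg[r - 1, c]
    data_reg = d_in
    acc_reg = a_in + weight * d_in
    # De-skew: with this convention, column c emits output row t-(ROWS-1)-c.
    for c in range(COLS):
        i = t - (ROWS - 1) - c
        if 0 <= i < M:
            C[i, c] = acc_reg[ROWS - 1, c]

assert np.array_equal(C, A.astype(np.int32) @ B.astype(np.int32))
```

The exact valid-output tick differs by a cycle or two depending on where registers sit, but the structure (load once, stream with skew, de-skew per column) matches the phases described above.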
The standard way to fully utilize a weight-stationary systolic array is input skewing: PE row k receives its activation one cycle later than PE row k-1. For a 4×4 array computing C = A × B with M output rows, the driver presents:
```
Cycle:   0    1    2    3    4   ...  M+2
Row 0:  a00  a10  a20  a30  ...
Row 1:   0   a01  a11  a21  a31  ...
Row 2:   0    0   a02  a12  a22  ...
Row 3:   0    0    0   a03  a13  ...
```
With skewing, M output rows flow through the pipeline in M + (ROWS+COLS−1) total streaming ticks instead of M × (ROWS+COLS), so MAC efficiency approaches 100% as M grows (weight reuse AND compute utilisation both improve). Without skewing the current driver pays the full pipeline fill/drain cost per row, capping MAC efficiency at 12.5% regardless of M.
The skewed input stream must be de-skewed on the output side: acc_out[j] for output row i is valid at tick i + ROWS + j, not all at the same tick.
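These utilisation numbers follow from a few lines of arithmetic (a sketch; "efficiency" here means useful MACs divided by total PE-cycle capacity):

```python
ROWS = COLS = 4

def efficiency(M: int, skewed: bool) -> float:
    useful_macs = M * ROWS * COLS               # each output row needs ROWS*COLS MACs
    if skewed:
        ticks = M + (ROWS + COLS - 1)           # pipeline filled/drained once
    else:
        ticks = M * (ROWS + COLS)               # full fill/drain cost per row
    return useful_macs / (ticks * ROWS * COLS)  # fraction of PE-cycles doing work

# Without skewing, efficiency is 1/(ROWS+COLS) = 12.5% for a 4x4 array,
# independent of M; with skewing it approaches 100% as M grows.
```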
There are a number of "tiny TPU"-type projects, due to the current popularity of TPUs and LLMs.
- tiny-tpu-v2/tiny-tpu - 2x2 matmul + ReLU to solve the XOR problem
- Alanma23/tinytinyTPU - 2x2 matmul + ReLU / ReLU6