86x faster arithmetic acceleration through optimized custom hardware and DMA pipeline
This project demonstrates high-performance FPGA design using Custom Instructions, Modular Scatter-Gather DMA, and Avalon Streaming Pipeline to achieve massive speedups over pure software implementations on Nios II.
For the detailed implementation journey, design decisions, and technical deep dives, see:
- 🚀 Nios II & DMA Acceleration Guide
- 📈 Burst Master Optimization
- 🌊 Stream Processor Pipeline
- 🔄 Dynamic PLL Reconfiguration
- 📝 Project Roadmap (TODO)
Hardware-accelerated arithmetic unit integrated directly into Nios II CPU pipeline.
Optimization Highlights:
- Target Operation: (A × B) / 400
- Traditional Approach: Hardware divider → setup-time violations at 50MHz
- Our Solution: Shift-Add approximation (A × B × 5243) >> 21
- Mathematical accuracy: 99.998% (about 0.002% error)
- Zero timing violations even at high frequency
- Massive cycle reduction vs. software division
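As a quick sanity check of the shift-add idea above, the following Python sketch compares the approximation against exact integer division. The function names and the tested range are illustrative assumptions, not taken from my_multi_calc.v, and the hardware's operand widths and rounding behavior may differ.

```python
MULTIPLIER = 5243  # approximates 2**21 / 400 = 5242.88
SHIFT = 21


def approx_div400(x: int) -> int:
    """Approximate x // 400 with one multiply and one right shift."""
    return (x * MULTIPLIER) >> SHIFT


def first_mismatch(limit: int):
    """Return the smallest x below limit where the approximation differs from x // 400."""
    for x in range(limit):
        if approx_div400(x) != x // 400:
            return x
    return None


print(first_mismatch(100_000))               # first input that is off by one
print(approx_div400(43_999), 43_999 // 400)  # approximation vs. exact result
```

With these constants the first mismatch is at x = 43999, where the approximation returns 110 instead of 109. Because 5243/2^21 is slightly larger than 1/400, the approximation never undershoots, and the absolute error grows slowly with the size of the product.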
Parameterizable N-stage pipeline with robust backpressure handling.
Architecture:
Stage 0: Input Capture & Endian Swap
↓
Stage 1: Coefficient Multiplication (Input × Coeff)
↓
Stage 2: Division Approximation & Final Endian Swap
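The three stages above can be mirrored in a small Python golden model, for example to generate expected values for a testbench. This is a sketch under assumptions: 32-bit words, a hypothetical `coeff` parameter, and the `>> 21` approximation for the divide. The actual widths and coefficient in stream_processor.v may differ.

```python
MASK32 = 0xFFFFFFFF


def bswap32(word: int) -> int:
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes((word & MASK32).to_bytes(4, "big"), "little")


def stream_model(word_in: int, coeff: int) -> int:
    """Behavioral model of one word passing through the three stages."""
    s0 = bswap32(word_in)               # Stage 0: input capture and endian swap
    s1 = s0 * coeff                     # Stage 1: multiply by the coefficient
    s2 = ((s1 * 5243) >> 21) & MASK32   # Stage 2: approximate divide by 400 (assumed 32-bit result)
    return bswap32(s2)                  # Stage 2: final endian swap


# A byte-swapped 800 with coeff=1 should come back as a byte-swapped 2.
print(hex(stream_model(bswap32(800), 1)))
```

Swapping on both sides means the arithmetic in the middle always sees values in the pipeline's native byte order, whatever order the DMA delivers.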
Design Features:
- Valid-Ready Handshake: Industry-standard Avalon-ST backpressure
- Automatic Byte Swapping: Resolves mSGDMA endianness mismatch
- Reusable Template: pipe_template.v for future projects
- Timing Closure: Maintains high throughput while meeting 50MHz+ timing
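To make the valid-ready behavior concrete, here is a cycle-based Python model of an N-stage pipeline whose sink stalls periodically. It is a simplified sketch, not a translation of pipe_template.v: the function name, the `sink_ready` callback, and the list-based registers are all illustrative assumptions.

```python
from collections import deque


def run_pipeline(samples, sink_ready, stages=3):
    """Cycle-based model of a valid/ready pipeline; returns the words the sink accepted."""
    valid = [False] * stages
    data = [None] * stages
    pending = deque(samples)
    received = []
    cycle = 0
    while pending or any(valid):
        ready_now = sink_ready(cycle)
        # A stage may load when it is empty or the stage after it can accept data.
        can_load = [False] * stages
        downstream_ready = ready_now
        for i in reversed(range(stages)):
            can_load[i] = downstream_ready or not valid[i]
            downstream_ready = can_load[i]
        # The sink takes the last stage's word only when valid and ready coincide.
        if valid[-1] and ready_now:
            received.append(data[-1])
        # Clock edge: update back to front so each stage reads the old upstream values.
        for i in reversed(range(stages)):
            if not can_load[i]:
                continue  # stalled: hold valid and data
            if i > 0:
                valid[i], data[i] = valid[i - 1], data[i - 1]
            elif pending:
                valid[0], data[0] = True, pending.popleft()
            else:
                valid[0] = False
        cycle += 1
    return received


# Sink is ready only every third cycle; every word should still arrive, in order.
print(run_pipeline(range(8), lambda c: c % 3 == 0))
```

The model loops until the pipeline drains, so `sink_ready` must eventually return True. Stalled stages simply hold their contents, which is why no word is dropped or duplicated.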
Disaggregated mSGDMA architecture with inline computation.
Benefits:
- Zero CPU Load: Calculations happen during DMA transfer
- Memory Efficiency: Direct memory-to-memory with transformation
- Flexible Structure: Separate Dispatcher, Read Master, Write Master
Benchmarks on Nios II @ 50MHz with 1000-element array processing:
| Mode | Description | Performance vs. Software |
|---|---|---|
| Bypass | DMA copy only | 7.59x faster than CPU memcpy |
| Full Acceleration | DMA + Pipeline computation | 86.14x faster than software division |
Real Numbers:
- Software computation: ~860ms
- DMA + Hardware: ~10ms
- Result: 86x speedup 🚀
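For a rough sense of scale, the quoted times can be converted into clock cycles per element. This sketch assumes the ~860ms and ~10ms figures are total times for one run over the 1000-element array at 50MHz. The source does not state how the measurement was taken, so treat the per-element numbers as estimates.

```python
CLOCK_HZ = 50_000_000
ELEMENTS = 1000


def cycles_per_element(total_ms: float) -> float:
    """Convert a total run time in milliseconds into clock cycles per array element."""
    return total_ms * CLOCK_HZ / (1000 * ELEMENTS)


software_ms, hardware_ms = 860.0, 10.0  # approximate figures quoted above
print(f"speedup: {software_ms / hardware_ms:.0f}x")
print(f"software: {cycles_per_element(software_ms):.0f} cycles/element")
print(f"DMA + hardware: {cycles_per_element(hardware_ms):.0f} cycles/element")
```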
Professional hardware verification using Cocotb and pytest.
- ✅ Python-based testbenches for flexible test scenarios
- ✅ Automated waveform generation (VCD/FST)
- ✅ Pytest integration for CI/CD compatibility
- ✅ Isolated build directories per module
- ✅ Behavioral models for Altera IP (altsyncram)
cd tests/cocotb
pytest test_runner.py -v
# Output:
# test_runner.py::test_cocotb_modules[my_custom_slave] PASSED [50%]
# test_runner.py::test_cocotb_modules[stream_processor] PASSED [100%]
# ==================== 2 passed in 0.81s ====================
# GTKWave
gtkwave tests/cocotb/sim_build/stream_processor/dump.vcd
# Or use VS Code extension: Surfer
quartus_project/
├── RTL/
│ ├── stream_processor.v # 3-Stage Pipeline Accelerator
│ ├── pipe_template.v # Reusable N-Stage Template
│ ├── my_multi_calc.v # Custom Instruction Unit
│ ├── my_slave.v # Avalon-MM Slave w/ DPRAM
│ └── top_module.v # System Integration
│
├── ip/
│ └── dpram.v # Dual-Port RAM (1KB)
│
├── software/
│ └── cust_inst_app/
│ └── main.c # Benchmark & Test Application
│
├── tests/cocotb/
│ ├── test_runner.py # Pytest Runner
│ ├── tb_my_slave.py # Avalon-MM Testbench
│ ├── tb_stream_processor_avs.py # Pipeline Testbench
│ └── sim_models/
│ └── altsyncram.v # Behavioral Model
│
├── custom_inst_qsys.qsys # Platform Designer System
├── doc/
│ ├── burst_master.md # Burst Master Documentation
│ ├── history.md # Detailed Implementation Guide (EN)
│ ├── history_kor.md # Detailed Implementation Guide (KR)
│ ├── nios.md # Nios II Implementation Details
│ ├── pll.md # PLL Reconfiguration Details
│ ├── README_kor.md # Korean README
│ └── TODO.md # Project TODO List
└── README.md # Main English README
- Intel Quartus Prime (20.1 or later)
- Nios II EDS
- DE10-Nano Board (or Cyclone V FPGA)
- Python 3.8+ with Cocotb (for verification)
# Open Quartus project
quartus_sh --tcl_eval project_open custom_inst.qpf
# Compile (or use Quartus GUI: Processing → Start Compilation)
quartus_sh --flow compile custom_inst
cd software/cust_inst_app
nios2-app-generate-makefile --bsp-dir ../cust_inst_bsp
make
# Via Quartus Programmer or command line
quartus_pgm -c 1 -m JTAG -o "p;output_files/custom_inst.sof"
nios2-terminal # Connect to UART
# Then from Nios II shell:
./software/cust_inst_app/cust_inst_app.elf
Problem: Hardware divider couldn't meet 50MHz timing.
Solution: Mathematical transformation using fixed-point approximation:
1/400 ≈ 5243/2^21
Error: about 0.002% (5243/2^21 is slightly larger than 1/400)
Result: Zero timing violations
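The constant 5243 falls out of rounding 2^21 / 400 = 5242.88 to the nearest integer. The sketch below repeats that calculation for a few shift widths to show how the relative error of the constant changes. The helper name and the list of shifts are illustrative, and the source does not say why 21 bits were chosen over other widths.

```python
def best_constant(divisor: int, shift: int):
    """Return (multiplier, relative_error) such that 1/divisor ≈ multiplier / 2**shift."""
    multiplier = round((1 << shift) / divisor)
    rel_error = multiplier * divisor / (1 << shift) - 1.0
    return multiplier, rel_error


for shift in (16, 18, 21, 24):
    m, err = best_constant(400, shift)
    print(f"shift={shift:2d} multiplier={m:6d} error={err:+.4%}")
```

For shift 21 this yields the multiplier 5243 with a relative error of about +0.0023%. The positive sign means results round up rather than down for large inputs.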
Problem: mSGDMA "First Symbol In High-Order Bits" reversed byte order.
Solution: Automatic byte-swapping at pipeline input/output:
assign swapped = {original[7:0], original[15:8],
                  original[23:16], original[31:24]};
Problem: Data loss when downstream stalls.
Solution: Cascaded Valid-Ready handshake through all stages:
always @(posedge clk) begin
    // Advance only when stage N is empty or its downstream can accept data
    if (pipe_ready[N] || !pipe_valid[N])
        stage_data[N] <= stage_data[N-1];
end
If you're new to FPGA or Nios II development, check out:
- history.md - Complete design journey with rationale
- pipe_template.v - Reusable pipeline template with detailed comments
- Cocotb Tests - See tests/cocotb/ for verification examples
Contributions are welcome! Areas of interest:
- Additional test cases for edge scenarios
- Support for other FPGA boards
- Enhanced pipeline configurations
- Documentation improvements
MIT License - See LICENSE for details
- Intel FPGA University Program
- Cocotb open-source verification framework
- VS Code Surfer waveform viewer



