TensorCraft-HPC

Modern C++/CUDA AI Kernel Library for High-Performance Computing

Modern C++ / CUDA AI 高性能计算内核库

English | 简体中文 | Documentation | API Reference

Overview

TensorCraft-HPC is a modern C++/CUDA library designed for learning, validating, and implementing high-performance AI computing kernels. It provides a comprehensive collection of optimized implementations covering fundamental operations used in deep learning and AI workloads.

Key Features

GEMM Kernels: From naive to Tensor Core (WMMA) implementations
- Naive, Tiled, Double-Buffer, and Tensor Core versions
- Performance comparison and optimization study
Attention Mechanisms: Memory-efficient attention computation
- FlashAttention-style fused attention
- RoPE (Rotary Positional Embeddings)
- MoE (Mixture of Experts) Router
Normalization: Standard normalization layers
- LayerNorm, RMSNorm, BatchNorm
- Warp-optimized implementations
Convolution: 2D convolution operations
- Naive, Im2Col, and Depthwise Separable
Sparse Operations: CSR/CSC format support
- Sparse Matrix-Vector (SpMV) and Matrix-Matrix (SpMM) multiplication
Quantization: INT8 and FP8 (CUDA 12.0+) support
- Fused operations with quantization
Python Bindings: NumPy-compatible interface via pybind11

Quick Start

Prerequisites

CUDA Toolkit: 12.8
CMake: 3.20+
C++ Compiler: C++17-capable
NVIDIA GPU: Recommended for running tests

Installation

# Clone the repository
git clone https://github.com/LessUp/modern-ai-kernels.git
cd modern-ai-kernels

# Build with development preset
cmake --preset dev
cmake --build --preset dev --parallel 2

# Run tests
ctest --preset dev --output-on-failure

# Install Python bindings
python -m pip install -e .
python -c "import tensorcraft_ops as tc; print(tc.__version__)"

Quick Example

C++:

#include "tensorcraft/kernels/gemm.hpp"
#include "tensorcraft/memory/tensor.hpp"

// Using RAII Tensor wrapper
tensorcraft::FloatTensor A({256, 512});
tensorcraft::FloatTensor B({512, 128});
tensorcraft::FloatTensor C({256, 128});

// GEMM operation
tensorcraft::kernels::gemm(A.data(), B.data(), C.data(), 256, 128, 512);

Python:

import tensorcraft_ops as tc
import numpy as np

# Matrix multiplication
A = np.random.randn(256, 512).astype(np.float32)
B = np.random.randn(512, 128).astype(np.float32)
C = tc.gemm(A, B)

# Activation & Normalization  
x = np.random.randn(32, 256).astype(np.float32)
y = tc.gelu(tc.layernorm(x, gamma, beta))

Documentation

Bilingual Documentation | 双语文档

We provide comprehensive documentation in both English and 简体中文:

English: docs/en/
中文: docs/zh/

Documentation Structure

Section	Description	Link
Getting Started	Installation and troubleshooting	en / zh
Guides	Architecture and optimization	en / zh
API Reference	Complete API documentation	en / zh
Examples	Code examples and tutorials	en / zh
Changelog	Version history	CHANGELOG.md

Online Documentation

📚 https://lessup.github.io/modern-ai-kernels/

GPU Architecture Support

Architecture	SM	Tensor Core	TMA	WGMMA
Volta	70	✅	❌	❌
Turing	75	✅	❌	❌
Ampere	80	✅	❌	❌
Ada Lovelace	89	✅	❌	❌
Hopper	90	✅	✅	✅

Project Structure

modern-ai-kernels/
├── specs/                  # Specification documents (Single Source of Truth)
│   ├── product/           # Product feature definitions (PRDs)
│   ├── rfc/               # Technical design documents
│   ├── api/               # API specifications
│   ├── db/                # Database schema designs
│   └── testing/           # Test case specifications & implementation plans
├── include/tensorcraft/    # Header-only kernel library
│   ├── core/              # CUDA error handling, type traits
│   ├── memory/            # Tensor, memory pool
│   └── kernels/           # All compute kernels
├── src/python_ops/        # Python bindings
├── tests/                 # Unit tests
├── benchmarks/            # Performance benchmarks
├── docs/                  # Documentation (en/, zh/)
├── changelog/             # Development changelog
└── examples/              # Example code

Build Presets

Preset	Purpose
`dev`	Recommended CUDA development preset
`python-dev`	Lighter build focused on Python bindings
`release`	Full release build with benchmarks
`debug`	Debug-oriented CUDA build
`cpu-smoke`	CPU-only configure/install validation

Contributing

We welcome contributions! Please see our Contributing Guide for details.

License

This project is licensed under the MIT License.

Acknowledgments

Inspired by CUTLASS, FlashAttention, and other excellent CUDA libraries
Built with modern C++17/20 features and CUDA 12.8

Made with ❤️ for the AI HPC community

Name		Name	Last commit message	Last commit date
Latest commit History 25 Commits
.github		.github
.vscode		.vscode
_includes		_includes
_sass/color_schemes		_sass/color_schemes
assets/css		assets/css
benchmarks		benchmarks
changelog		changelog
docs		docs
examples		examples
include/tensorcraft		include/tensorcraft
specs		specs
src/python_ops		src/python_ops
tests		tests
.clang-format		.clang-format
.clang-tidy		.clang-tidy
.editorconfig		.editorconfig
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
AGENTS.md		AGENTS.md
CHANGELOG.md		CHANGELOG.md
CMakeLists.txt		CMakeLists.txt
CMakePresets.json		CMakePresets.json
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Gemfile		Gemfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
README.zh-CN.md		README.zh-CN.md
SECURITY.md		SECURITY.md
_config.yml		_config.yml
index.md		index.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TensorCraft-HPC

Overview

Key Features

Quick Start

Prerequisites

Installation

Quick Example

Documentation

Bilingual Documentation | 双语文档

Documentation Structure

Online Documentation

GPU Architecture Support

Project Structure

Build Presets

Contributing

License

Acknowledgments

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

TensorCraft-HPC

Overview

Key Features

Quick Start

Prerequisites

Installation

Quick Example

Documentation

Bilingual Documentation | 双语文档

Documentation Structure

Online Documentation

GPU Architecture Support

Project Structure

Build Presets

Contributing

License

Acknowledgments

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages