TensorCraft-HPC

Modern C++/CUDA AI Kernel Library for High-Performance Computing


English | Simplified Chinese | Documentation | API Reference



Overview

TensorCraft-HPC is a modern C++/CUDA library designed for learning, validating, and implementing high-performance AI computing kernels. It provides a comprehensive collection of optimized implementations covering fundamental operations used in deep learning and AI workloads.

Key Features

  • GEMM Kernels: From naive to Tensor Core (WMMA) implementations

    • Naive, Tiled, Double-Buffer, and Tensor Core versions
    • Performance comparison and optimization study
  • Attention Mechanisms: Memory-efficient attention computation

    • FlashAttention-style fused attention
    • RoPE (Rotary Positional Embeddings)
    • MoE (Mixture of Experts) Router
  • Normalization: Standard normalization layers

    • LayerNorm, RMSNorm, BatchNorm
    • Warp-optimized implementations
  • Convolution: 2D convolution operations

    • Naive, Im2Col, and Depthwise Separable
  • Sparse Operations: CSR/CSC format support

    • Sparse Matrix-Vector (SpMV) and Matrix-Matrix (SpMM) multiplication
  • Quantization: INT8 and FP8 (CUDA 12.0+) support

    • Fused operations with quantization
  • Python Bindings: NumPy-compatible interface via pybind11


Quick Start

Prerequisites

  • CUDA Toolkit: 12.8
  • CMake: 3.20+
  • C++ Compiler: C++17-capable
  • NVIDIA GPU: Recommended for running tests

Installation

# Clone the repository
git clone https://github.com/LessUp/modern-ai-kernels.git
cd modern-ai-kernels

# Build with development preset
cmake --preset dev
cmake --build --preset dev --parallel 2

# Run tests
ctest --preset dev --output-on-failure

# Install Python bindings
python -m pip install -e .
python -c "import tensorcraft_ops as tc; print(tc.__version__)"

Quick Example

C++:

#include "tensorcraft/kernels/gemm.hpp"
#include "tensorcraft/memory/tensor.hpp"

// Using RAII Tensor wrapper
tensorcraft::FloatTensor A({256, 512});
tensorcraft::FloatTensor B({512, 128});
tensorcraft::FloatTensor C({256, 128});

// GEMM operation
tensorcraft::kernels::gemm(A.data(), B.data(), C.data(), 256, 128, 512);

Python:

import tensorcraft_ops as tc
import numpy as np

# Matrix multiplication
A = np.random.randn(256, 512).astype(np.float32)
B = np.random.randn(512, 128).astype(np.float32)
C = tc.gemm(A, B)

# Activation & Normalization
x = np.random.randn(32, 256).astype(np.float32)
gamma = np.ones(256, dtype=np.float32)   # scale
beta = np.zeros(256, dtype=np.float32)   # shift
y = tc.gelu(tc.layernorm(x, gamma, beta))

Documentation

Bilingual Documentation

We provide comprehensive documentation in both English and Simplified Chinese:

Documentation Structure

| Section | Description | Link |
| --- | --- | --- |
| Getting Started | Installation and troubleshooting | en / zh |
| Guides | Architecture and optimization | en / zh |
| API Reference | Complete API documentation | en / zh |
| Examples | Code examples and tutorials | en / zh |
| Changelog | Version history | CHANGELOG.md |

Online Documentation

📚 https://lessup.github.io/modern-ai-kernels/


GPU Architecture Support

| Architecture | SM | Tensor Core | TMA | WGMMA |
| --- | --- | --- | --- | --- |
| Volta | 70 | ✓ | | |
| Turing | 75 | ✓ | | |
| Ampere | 80 | ✓ | | |
| Ada Lovelace | 89 | ✓ | | |
| Hopper | 90 | ✓ | ✓ | ✓ |

Project Structure

modern-ai-kernels/
├── specs/                  # Specification documents (Single Source of Truth)
│   ├── product/           # Product feature definitions (PRDs)
│   ├── rfc/               # Technical design documents
│   ├── api/               # API specifications
│   ├── db/                # Database schema designs
│   └── testing/           # Test case specifications & implementation plans
├── include/tensorcraft/    # Header-only kernel library
│   ├── core/              # CUDA error handling, type traits
│   ├── memory/            # Tensor, memory pool
│   └── kernels/           # All compute kernels
├── src/python_ops/        # Python bindings
├── tests/                 # Unit tests
├── benchmarks/            # Performance benchmarks
├── docs/                  # Documentation (en/, zh/)
├── changelog/             # Development changelog
└── examples/              # Example code

Build Presets

| Preset | Purpose |
| --- | --- |
| dev | Recommended CUDA development preset |
| python-dev | Lighter build focused on Python bindings |
| release | Full release build with benchmarks |
| debug | Debug-oriented CUDA build |
| cpu-smoke | CPU-only configure/install validation |

Contributing

We welcome contributions! Please see our Contributing Guide for details.


License

This project is licensed under the MIT License.


Acknowledgments

  • Inspired by CUTLASS, FlashAttention, and other excellent CUDA libraries
  • Built with modern C++17/20 features and CUDA 12.8

Made with ❤️ for the AI HPC community
