Simd by cstroie · Pull Request #28 · RightNow-AI/picolm

cstroie · 2026-04-16T10:59:46Z

Summary

- AVX build target (make avx): 8-wide float accumulators for all hot paths on Sandy Bridge+ / Bulldozer+ CPUs
- Tiered x86 build targets: make x86 (SSE2, the default), make sse2, make sse3, make avx
- SSE2 Q4_K dot product: previously SIMD-accelerated only on ARM NEON (scalar on x86); now uses 128-bit SSE2 on x86 (the hottest inference path)
- SSE2 Q6_K dot product: rewritten from scalar; uses an int8 sign-extension trick, so SSE4.1 is not required
- SSE2/SSE3/AVX RoPE: vectorized complex rotation (the SSE3 and AVX paths use _mm_addsub_ps / _mm256_addsub_ps; addsub is not available in plain SSE2)
- AVX paths for rmsnorm, softmax, elemwise_mul, vec_add, and all dot products
- --mem flag: load the model into RAM instead of mmap for more consistent inference latency
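The AVX bullet describes 8-wide float accumulators. A minimal sketch of that pattern for a plain f32 dot product follows. The name dot_f32 and its signature are hypothetical, not PicoLM's. It assumes a build with -mavx (so __AVX__ is defined) and falls back to scalar code otherwise. AVX alone has no FMA, so the multiply and add stay separate instructions.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Hypothetical f32 dot product: 8 lanes per step, scalar tail. */
float dot_f32(const float *a, const float *b, size_t n) {
    size_t i = 0;
    float sum = 0.0f;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();              /* 8 partial sums */
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb)); /* no FMA in AVX1 */
    }
    /* Reduce 8 lanes: fold the high 128 bits onto the low 128, then 4 -> 1. */
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc),
                          _mm256_extractf128_ps(acc, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));        /* lanes 0+2, 1+3 */
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 1));    /* lane 0 + lane 1 */
    sum = _mm_cvtss_f32(s);
#endif
    for (; i < n; i++)                             /* leftover elements */
        sum += a[i] * b[i];
    return sum;
}
```

Keeping a single vector accumulator and reducing horizontally only once, after the loop, keeps the hot loop free of shuffles.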
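For the Q4_K bullet, the core SSE2 work is unpacking 4-bit weights and widening both operands to 16 bits so that _mm_madd_epi16 can multiply and pair-sum them. SSSE3's _mm_maddubs_epi16 is not available at the SSE2 tier. The sketch below uses a hypothetical, simplified block of 32 unsigned nibbles packed two per byte, with no scales or mins. It is not the real Q4_K layout and only illustrates the technique. dot_q4_block is a hypothetical name.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#endif

/* Hypothetical block: qs[i] low nibble pairs with q8[i], high nibble with q8[i + 16]. */
int32_t dot_q4_block(const uint8_t qs[16], const int8_t q8[32]) {
#if defined(__SSE2__) || defined(_M_X64)
    const __m128i mask = _mm_set1_epi8(0x0F);
    const __m128i zero = _mm_setzero_si128();
    __m128i packed = _mm_loadu_si128((const __m128i *)qs);
    __m128i w[2] = {
        _mm_and_si128(packed, mask),                    /* low nibbles  */
        _mm_and_si128(_mm_srli_epi16(packed, 4), mask)  /* high nibbles */
    };
    __m128i a[2] = {
        _mm_loadu_si128((const __m128i *)q8),
        _mm_loadu_si128((const __m128i *)(q8 + 16))
    };
    __m128i acc = zero;
    for (int k = 0; k < 2; k++) {
        __m128i sgn = _mm_cmpgt_epi8(zero, a[k]);       /* 0xFF where a < 0 */
        /* Nibbles zero-extend (unpack with 0); int8 sign-extends (unpack with sign mask). */
        __m128i wl = _mm_unpacklo_epi8(w[k], zero);
        __m128i wh = _mm_unpackhi_epi8(w[k], zero);
        __m128i al = _mm_unpacklo_epi8(a[k], sgn);
        __m128i ah = _mm_unpackhi_epi8(a[k], sgn);
        /* madd: int16 products, adjacent pairs summed into int32 lanes */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(wl, al));
        acc = _mm_add_epi32(acc, _mm_madd_epi16(wh, ah));
    }
    /* Horizontal sum of the four int32 lanes. */
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(acc);
#else
    int32_t sum = 0;                                    /* scalar reference */
    for (int i = 0; i < 16; i++) {
        sum += (qs[i] & 0x0F) * q8[i];
        sum += (qs[i] >> 4) * q8[i + 16];
    }
    return sum;
#endif
}
```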
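The Q6_K bullet mentions an int8 sign-extension trick but does not say which one. A common SSE2 idiom, sketched here as an assumption, duplicates each byte into a 16-bit lane with _mm_unpacklo_epi8(v, v) and then arithmetic-shifts right by 8. This replaces SSE4.1's _mm_cvtepi8_epi16. It is useful for 6-bit quants because, once their stored offset is removed, both the weights and the int8 activations are signed. The helpers and dot_i8x16 are hypothetical names.

```c
#include <stdint.h>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>

/* int8 -> int16 without SSE4.1: unpacking v with itself puts each byte in
   both halves of a 16-bit lane; shifting right by 8 (arithmetic) leaves the
   byte's value with its sign bit replicated into the high byte. */
static inline __m128i sext_lo_epi8(__m128i v) {
    return _mm_srai_epi16(_mm_unpacklo_epi8(v, v), 8);
}
static inline __m128i sext_hi_epi8(__m128i v) {
    return _mm_srai_epi16(_mm_unpackhi_epi8(v, v), 8);
}
#endif

/* Hypothetical: dot product of 16 signed int8 pairs, accumulated in int32. */
int32_t dot_i8x16(const int8_t *a, const int8_t *b) {
#if defined(__SSE2__) || defined(_M_X64)
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i acc = _mm_add_epi32(
        _mm_madd_epi16(sext_lo_epi8(va), sext_lo_epi8(vb)),
        _mm_madd_epi16(sext_hi_epi8(va), sext_hi_epi8(vb)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(acc);
#else
    int32_t sum = 0;                       /* scalar reference */
    for (int i = 0; i < 16; i++) sum += (int32_t)a[i] * b[i];
    return sum;
#endif
}
```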
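The RoPE bullet relies on _mm_addsub_ps, which subtracts in even lanes and adds in odd lanes. That is exactly the sign pattern of a 2D rotation, x0' = x0·cos − x1·sin and x1' = x0·sin + x1·cos. The sketch below assumes interleaved pairs (x[2i], x[2i+1]) and precomputed per-pair cos/sin tables. Both are assumptions about PicoLM's layout, and rope_rotate is a hypothetical name. Without SSE3, it emulates addsub by flipping the sign bit of b's even lanes and adding. That is one possible SSE2 approach, not necessarily the PR's.

```c
#if defined(__SSE3__)
#include <pmmintrin.h>   /* _mm_addsub_ps (SSE3) */
#elif defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#endif

/* Rotate n_pairs interleaved pairs in place; pair i uses cosv[i], sinv[i]. */
void rope_rotate(float *x, const float *cosv, const float *sinv, int n_pairs) {
    int i = 0;
#if defined(__SSE3__) || defined(__SSE2__) || defined(_M_X64)
    for (; i + 2 <= n_pairs; i += 2) {                 /* two pairs per vector */
        __m128 v = _mm_loadu_ps(x + 2 * i);            /* x0 x1 x2 x3 */
        __m128 c = _mm_set_ps(cosv[i + 1], cosv[i + 1], cosv[i], cosv[i]);
        __m128 s = _mm_set_ps(sinv[i + 1], sinv[i + 1], sinv[i], sinv[i]);
        __m128 a = _mm_mul_ps(v, c);                   /* x0c x1c x2c x3c */
        __m128 sw = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); /* x1 x0 x3 x2 */
        __m128 b = _mm_mul_ps(sw, s);                  /* x1s x0s x3s x2s */
#if defined(__SSE3__)
        __m128 r = _mm_addsub_ps(a, b);                /* even: a-b, odd: a+b */
#else
        /* SSE2 emulation: negate b's even lanes (sign-bit XOR), then add. */
        const __m128 neg_even = _mm_set_ps(0.0f, -0.0f, 0.0f, -0.0f);
        __m128 r = _mm_add_ps(a, _mm_xor_ps(b, neg_even));
#endif
        _mm_storeu_ps(x + 2 * i, r);
    }
#endif
    for (; i < n_pairs; i++) {                         /* scalar tail */
        float x0 = x[2 * i], x1 = x[2 * i + 1];
        x[2 * i]     = x0 * cosv[i] - x1 * sinv[i];
        x[2 * i + 1] = x0 * sinv[i] + x1 * cosv[i];
    }
}
```

The AVX variant would follow the same shape with _mm256_addsub_ps and four pairs per vector.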
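The remaining AVX paths follow the same 8-wide body with a scalar tail. Below is a sketch for the two elementwise kernels named in the bullet. Only the names elemwise_mul and vec_add come from the PR; the signatures and semantics shown are assumptions.

```c
#include <stddef.h>
#if defined(__AVX__)
#include <immintrin.h>
#endif

/* Assumed semantics: out[i] = a[i] * b[i] */
void elemwise_mul(float *out, const float *a, const float *b, size_t n) {
    size_t i = 0;
#if defined(__AVX__)
    for (; i + 8 <= n; i += 8)   /* 8 floats per 256-bit register */
        _mm256_storeu_ps(out + i,
                         _mm256_mul_ps(_mm256_loadu_ps(a + i), _mm256_loadu_ps(b + i)));
#endif
    for (; i < n; i++) out[i] = a[i] * b[i];
}

/* Assumed semantics: acc[i] += x[i] */
void vec_add(float *acc, const float *x, size_t n) {
    size_t i = 0;
#if defined(__AVX__)
    for (; i + 8 <= n; i += 8)
        _mm256_storeu_ps(acc + i,
                         _mm256_add_ps(_mm256_loadu_ps(acc + i), _mm256_loadu_ps(x + i)));
#endif
    for (; i < n; i++) acc[i] += x[i];
}
```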
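The --mem trade-off works as follows. With mmap, weight pages are read from disk on first touch and can be evicted under memory pressure, so some tokens may stall on page faults. Reading the whole file into a heap buffer pays that cost once, at startup. Below is a POSIX sketch of a loader that supports both modes. model_buf, model_buf_open, and model_buf_close are hypothetical names and are not the PR's model_load.

```c
#define _POSIX_C_SOURCE 200809L
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical read-only view of a model file. */
typedef struct { void *data; size_t size; int in_ram; } model_buf;

/* use_mem != 0: copy the file into RAM; otherwise map it lazily. */
int model_buf_open(model_buf *mb, const char *path, int use_mem) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) { close(fd); return -1; }
    mb->size = (size_t)st.st_size;
    mb->in_ram = use_mem;
    if (use_mem) {
        char *buf = malloc(mb->size);
        if (!buf) { close(fd); return -1; }
        size_t off = 0;
        while (off < mb->size) {              /* read() may return short counts */
            ssize_t r = read(fd, buf + off, mb->size - off);
            if (r <= 0) { free(buf); close(fd); return -1; }
            off += (size_t)r;
        }
        mb->data = buf;
    } else {
        void *p = mmap(NULL, mb->size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (p == MAP_FAILED) { close(fd); return -1; }
        mb->data = p;                         /* pages fault in on first access */
    }
    close(fd);                                /* a mapping survives closing the fd */
    return 0;
}

void model_buf_close(model_buf *mb) {
    if (mb->in_ram) free(mb->data);
    else munmap(mb->data, mb->size);
    mb->data = NULL;
    mb->size = 0;
}
```

Callers see the same pointer-plus-size view either way, so the inference code does not need to know which mode is active.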

SIMD tier detection (compile-time)

PICOLM_NEON → PICOLM_AVX → PICOLM_SSE3 → PICOLM_SSE2 → scalar
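Here is one way to express this priority chain with the preprocessor. The sketch assumes the PICOLM_* macros are either passed with -D by the Makefile targets or derived from the compiler's own feature macros (__ARM_NEON, __AVX__, __SSE3__, __SSE2__). PICOLM_SIMD_NAME, PICOLM_SIMD_WIDTH, and picolm_simd_tier() are hypothetical.

```c
/* If no tier was forced with -D, derive one from compiler feature macros. */
#if !defined(PICOLM_NEON) && !defined(PICOLM_AVX) && \
    !defined(PICOLM_SSE3) && !defined(PICOLM_SSE2)
#  if defined(__ARM_NEON) || defined(__ARM_NEON__)
#    define PICOLM_NEON 1
#  elif defined(__AVX__)
#    define PICOLM_AVX 1
#  elif defined(__SSE3__)
#    define PICOLM_SSE3 1
#  elif defined(__SSE2__) || defined(_M_X64)
#    define PICOLM_SSE2 1
#  endif
#endif

/* Priority chain: the first tier that is defined selects the code path. */
#if defined(PICOLM_NEON)
#  define PICOLM_SIMD_NAME  "NEON"
#  define PICOLM_SIMD_WIDTH 4      /* floats per 128-bit vector */
#elif defined(PICOLM_AVX)
#  define PICOLM_SIMD_NAME  "AVX"
#  define PICOLM_SIMD_WIDTH 8      /* floats per 256-bit vector */
#elif defined(PICOLM_SSE3)
#  define PICOLM_SIMD_NAME  "SSE3"
#  define PICOLM_SIMD_WIDTH 4
#elif defined(PICOLM_SSE2)
#  define PICOLM_SIMD_NAME  "SSE2"
#  define PICOLM_SIMD_WIDTH 4
#else
#  define PICOLM_SIMD_NAME  "scalar"
#  define PICOLM_SIMD_WIDTH 1
#endif

/* Report the tier chosen at compile time, e.g. for a startup log line. */
const char *picolm_simd_tier(void) {
    return PICOLM_SIMD_NAME;
}
```

Under this sketch, building with -mavx reports AVX. A default x86-64 GCC or Clang build reports SSE2, because SSE2 is part of the x86-64 baseline and SSE3 is not.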

Test plan

- make clean && make <target> for each of sse2, sse3, avx, native: every target builds without warnings
- ./picolm model.gguf -p "The capital of France is" -n 20 -t 0: output matches the reference
- ./picolm model.gguf --mem -p "Hello" -n 10: RAM mode works correctly

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

- Add x86, sse2, sse3, avx targets to platform-specific builds section
- Update SIMD feature entry to mention SSE2/SSE3/AVX tiers
- Expand x86 SIMD optimization section with per-tier description
- Update performance waterfall chart to reflect 8-wide AVX ops
- Add --mem option to usage section
- Mark AVX as done in roadmap, keep AVX2/AVX-512 as next step
- Update FAQ SIMD mention

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cstroie and others added 7 commits April 16, 2026 10:52

feat: add --mem parameter to load model into RAM instead of mmap

d73aa69

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

fix: fix model_load call and signed comparison warning

3f30789

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

feat: add --mem option for model loading mode selection

9639988

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

feat: add fast mode with optimized parameters for better performance

bc1a262

Co-authored-by: aider (openrouter/z-ai/glm-4.5-air:free) <aider@aider.chat>

feat: enhance performance with SSE2 optimizations and update build configurations

92d5aa7

feat: add AVX support for optimized vector operations and enhance performance

ede96ff

cstroie wants to merge 7 commits into RightNow-AI:main from cstroie:simd
