yukon-systems/StorOps__Project_Coherent_Storage

Project Coherent Storage - ADR Package

  • Project: Project Coherent Storage
  • Architecture cycle: 2026-Q2
  • Architecture focus:
    • Auto-scaling AI/HPC storage architecture featuring an accelerator-centric Coherent Memory-Mesh
    • Custom Max-IO Grid-Engine with ACID-compliant cache transactions for superscalar architectures
    • Custom UA-Link pod-scale systems design with host-based CXL memory pools to clear the 'memory wall'
    • Fully automated deployment with Ansible workflows, SLURM workload management, and netboot ramdisks
    • Dev/Lab, Stage/LT, and Prod environment support with full CI/CD test coverage, load-test profiles, and ITIL change controls
    • Gate-based workflows with 'failure-semantics', SLO & SLA definitions for observability and monitoring
    • Network environment scaling from 10-25 Gb/s to 400-800 Gb/s, with port-based tuning profiles for RDMA/RoCEv2
    • NVMe-oF with DPU hardware-based protocol offloads for OpenZFS (storage tiering + ACLs + DoD-compliant encryption)
    • LLM Prompt-Cache acceleration supported via disaggregated heterogeneous GP-GPU compute (AMD, NVIDIA, NPU, FPGA)
  • Generated: 2026-05-18
  • Automation: Tracked workflows, machine profiles, neteng scopes, etc., are located in the 'Infra-Stage4-LLVM-NoGNU' repository.
  • Status: Proposed / Review

Visualized High-Level Architecture Scope

Operational View: API Front-End to Block Storage Layer

[Figure: Top-down Project Coherent Storage stack]

Operational View: Coherent-Mesh Traversal, S3-Object to REST-API

[Figure: S3/Object REST translator architecture]

Purpose

This package refreshes the ADR set using the expanded RAG corpus and the project directives. It keeps the core invariant: inference actors connect to the Coherence-CE Memory Mesh and never bind directly to OpenZFS, DPU, RoCEv2, NVMe-oF, CXL, UA-Link, VLANs, RDMA memory handles, or physical storage internals.
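
The invariant lends itself to a mechanical check. The following is a minimal sketch, not part of the Coherence-CE API: `ALLOWED_ENDPOINT`, `FORBIDDEN_LAYERS`, and `validate_actor_bindings` are hypothetical names, and the layer identifiers are simply lowercased forms of the layers listed in the paragraph above.

```python
# Hypothetical sketch: these names are illustrative, not the Coherence-CE API.
ALLOWED_ENDPOINT = "coherence-ce-memory-mesh"

# Lower layers an inference actor must never bind to directly
# (taken from the invariant stated above; identifiers are assumed spellings).
FORBIDDEN_LAYERS = frozenset({
    "openzfs", "dpu", "rocev2", "nvme-of", "cxl", "ua-link",
    "vlan", "rdma-memory-handle", "physical-storage",
})


def validate_actor_bindings(bindings):
    """Return the bindings that violate the invariant (empty list if compliant)."""
    return [b for b in bindings if b.strip().lower() in FORBIDDEN_LAYERS]


if __name__ == "__main__":
    # An actor bound only to the mesh passes; a direct CXL binding is flagged.
    print(validate_actor_bindings([ALLOWED_ENDPOINT]))
    print(validate_actor_bindings([ALLOWED_ENDPOINT, "CXL"]))
```

A check of this shape could run as a CI gate over actor configuration, so that a direct lower-layer binding is rejected before deployment.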

The architecture emphasizes:

  1. Coherence-CE namespace modalities with explicit Unified Namespace and Dimensional Indexed Namespace workflows for scalable cache locality.
  2. UA-Link enabled pod-scale systems as a scale-up accelerator domain inside pod/rack boundaries.
  3. Network architecture across scale-up and scale-out planes, separating UA-Link accelerator fabrics, Ethernet/RDMA scale-out, storage/NVMe-oF fabrics, management, telemetry, and timing.
  4. CXL memory pools as governed T1/T1.5 memory capacity for warm KV/prefix state, metadata, vector heads, and future shared-memory research paths.
  5. RDMA/RoCEv2 performance tuning with explicit PFC/ECN/DCQCN, traffic-class, rail, telemetry, and failure semantics.
  6. DPU/SmartNIC storage offload as a hard requirement for NVMe-oF/RDMA storage-network paths.
  7. General-purpose GPU and heterogeneous accelerator scheduling, covering vendor capability profiles and admission-control policy.
  8. Reference-architecture-focused development: baseline-scoped architecture elements suited to layering, adaptation, and easy feature adoption as the industry evolves, fully open source across the entire application stack.

Source basis

The source pass extracted text from 363 PDFs in the RAG-DATA/ corpus into a local processing cache.

  • Text extraction OK: 360 PDFs
  • Source map: review-artifacts/rag-extraction-and-source-map.md

Important sources include the UA-Link white paper, UniFabriX UA-Link material, OCP Open Cluster inference/training fabric reference architectures, OCP MRC, Arista/Broadcom lossless Ethernet/RoCE material, AMD Pensando/Pollara cluster and product collateral, Intel Gaudi 3 cluster design, CXL/KV/GPU research, and prior Marvell/XConn/CXL/DPU materials.

Package index

| Path | Purpose |
| --- | --- |
| reports/project-coherent-storage_architecture-report.md | Main architecture report for UA-Link pod scale, CXL memory pools, RDMA/RoCEv2, and heterogeneous GP-GPU compute. |
| reports/project-coherent-storage_engineering-deep-dive.md | Top-down engineering deep-dive from OpenAI/user layer through global/regional/datacenter load-balancer meshes and intra-datacenter storage layers. |
| reports/project-coherent-storage_overview__executive-overview.md | Executive overview for business value, hard requirements, namespace posture, and residual risks. |
| reports/project-coherent-storage_overview__director-overview.md | Director overview for procurement, lifecycle, deployment risk, and operational readiness. |
| reports/project-coherent-storage_overview__engineering-overview.md | Engineering/ARB overview for data paths, CXL roles, namespace rules, and validation checklist. |
| reports/project-coherent-storage_s3-object-rest-api-translator-design.md | Translator design report for S3/Object REST access and explicit prefix-cache namespace modalities. |
| reports/project-coherent-storage_coherence-ce-object-chunking-and-lfs-gateway-design.md | Design report for Coherence-CE object chunking, manifest semantics, and Git LFS gateway migration. |
| api/coherence-ce-vllm-adapter.openapi.yaml | OpenAPI contract for Coherence-CE vLLM adapter operations. |
| api/s3-object-rest-translator.openapi.yaml | OpenAPI contract for S3/Object REST translator routes, including Unified and Dimensional Indexed Namespace routes. |
| api/coherence-ce-object-chunking-lfs-gateway.openapi.yaml | OpenAPI contract for Coherence-native object chunking and Git LFS gateway facade routes. |
| adr/diagrams/*.puml, *.png, *.svg | Per-ADR PlantUML source and rendered PNG/SVG assets. |
| diagrams/*.puml, *.png, *.svg | Report-level PlantUML source and rendered PNG/SVG assets. |
| review-artifacts/rag-extraction-and-source-map.md and JSON peer | Extraction evidence and source map. |
| review-artifacts/ietf-icnrg-chunking-source-map.md and JSON peer | Source map for CCNx chunking, FLIC, RFC 8569/8609, and Git LFS API references. |
| docs/git-lfs-policy.md | Repository Git LFS lock-verification, normalized .gitattributes, pre-push hook, test-server, and migration policy. |
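
As a rough illustration of what object chunking with manifest semantics can look like, the sketch below splits an object into fixed-size chunks, records a SHA-256 digest per chunk plus a whole-object digest, and verifies the object on reassembly. Everything here is an assumption made for illustration: the fixed chunk size, the choice of SHA-256, the manifest field names, and the dict-based chunk store are not taken from the Coherence-CE design report.

```python
import hashlib

# Tiny on purpose so the example is easy to follow (assumed value, not from the design).
CHUNK_SIZE = 4


def chunk_object(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split data into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def build_manifest(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Describe an object as an ordered list of chunk digests (hypothetical format)."""
    chunks = chunk_object(data, chunk_size)
    return {
        "size": len(data),
        "chunk_size": chunk_size,
        "object_sha256": hashlib.sha256(data).hexdigest(),
        "chunks": [hashlib.sha256(c).hexdigest() for c in chunks],
    }


def reassemble(manifest, store):
    """Rebuild an object from a digest-keyed chunk store and verify it.

    Raises KeyError if a chunk is missing and ValueError on a digest mismatch.
    """
    data = b"".join(store[digest] for digest in manifest["chunks"])
    if hashlib.sha256(data).hexdigest() != manifest["object_sha256"]:
        raise ValueError("reassembled object does not match manifest digest")
    return data


if __name__ == "__main__":
    payload = b"hello coherent storage"
    manifest = build_manifest(payload)
    store = {hashlib.sha256(c).hexdigest(): c for c in chunk_object(payload)}
    assert reassemble(manifest, store) == payload
```

Keying chunks by digest means identical chunks are stored once, and the manifest alone determines order, which is why reassembly can verify the whole object without trusting the store.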

ADR index

| ADR File | Document Function |
| --- | --- |
| ADR-001_Inference_Storage_Principles_and_SLOs.md | Defines inference-first storage principles, latency SLOs, tier boundaries, and workload classes that govern all later ADRs. |
| ADR-002_Hot_KV_and_Prefix_Cache_Data_Plane.md | Defines the hot KV/prefix-cache data plane and keeps inference actors behind the Coherence-CE Memory Mesh. |
| ADR-003_Model_Weight_Object_and_Corpus_Data_Tiers.md | Defines model-weight, adapter, tokenizer, object, corpus, and artifact tiers for reproducible inference data placement. |
| ADR-004_RDMA_Fabric_and_GPU_Direct_Data_Paths.md | Defines RDMA, RoCEv2, GPU-direct, and scale-out data-path rules for cross-node inference and storage movement. |
| ADR-005_DPU_and_SmartNIC_Offload_Boundaries.md | Defines mandatory DPU/SmartNIC offload boundaries for NVMe-oF, RDMA mediation, isolation, telemetry, and degraded host fallback. |
| ADR-006_OpenZFS_NVMe_oF_and_Media_Layout.md | Defines OpenZFS, NVMe-oF, mirrored NAND, media layout, and durable block-substrate rules. |
| ADR-007_Inference_Scheduler_Locality_and_Admission_Control.md | Defines scheduler admission using model, KV, fabric, CXL, DPU, rail, and locality telemetry. |
| ADR-008_RAG_Vector_Index_and_Corpus_Service.md | Defines immutable RAG corpus, embedding, vector-index, retrieval-cache, and corpus-service architecture. |
| ADR-009_Observability_Benchmarking_and_Rollout_Gates.md | Defines observability, benchmark, failure-drill, and rollout gates for inference, fabric, storage, CXL, and scheduler claims. |
| ADR-010_Coherence_CE_Write_Policy_to_OpenZFS.md | Defines Coherence-CE write-through, write-back, write-around, and write-behind policy to OpenZFS by durability class. |
| ADR-011_KV_Durability_Classes.md | Defines KV-D0 through KV-D5 durability classes used by Coherence-CE, OpenZFS write policy, failure recovery, and scheduler admission. |
| ADR-012_Coherence_CE_vLLM_Adapter_API_Contract.md | Defines the Coherence-native and OpenAI-compatible API contract exposed to vLLM adapters without leaking lower-layer storage or fabric. |
| ADR-013_Failure_Semantics_and_Fencing.md | Defines failure semantics, fencing, recovery, drain behavior, and degraded-mode rules across compute, fabric, DPU, CXL, and storage. |
| ADR-014_Coherence_Metrics_Scheduler_Admission.md | Defines how Coherence-CE metrics roll up into scheduler GREEN, AMBER, RED, and DRAIN admission states. |
| ADR-015_CXL_Memory_Tiering_and_OpenZFS_Interaction.md | Defines CXL T1/T1.5 memory tiering, memory-pool governance, and safe OpenZFS-adjacent CXL roles. |
| ADR-016_Roadmap_Evidence_and_Public_Claim_Guardrails.md | Defines evidence grades and public-claim guardrails for vendor roadmap, partnership, and integration statements. |
| ADR-017_Research_Metadata_and_Arxiv_Publication_Workflow.md | Defines research metadata, arXiv API/bulk-data, Markdown, LaTeX, BibTeX, and publication workflow requirements. |
| ADR-018_UALink_Pod_Scale_Fabric_and_Compute_Domains.md | Defines UA-Link pod-scale accelerator fabric domains and their scheduler-visible but actor-hidden compute locality semantics. |
| ADR-019_Pod_Scale_Network_Architecture_and_RDMA_RoCEv2_Tuning.md | Defines pod-scale network planes and RDMA/RoCEv2 tuning gates for traffic classes, PFC, ECN/DCQCN, rails, and telemetry. |
| ADR-020_CXL_Memory_Pools_for_UALink_Pods.md | Defines CXL memory pools inside UA-Link pods as governed Coherence-owned warm capacity with ownership, latency, and failure gates. |
| ADR-021_Heterogeneous_GP_GPU_Compute_and_Scheduler_Governance.md | Defines heterogeneous GP-GPU and accelerator capability profiles for scheduler governance across vendors and fabrics. |
| ADR-022_S3_Object_to_REST_API_Protocol_Mapping_Translator.md | Defines the S3/Object-to-REST translator and its object, KV, vector, and prefix-cache REST contract. |
| ADR-023_Coherence_CE_Namespace_Modalities.md | Defines Unified Namespace and Dimensional Indexed Namespace workflows, API route semantics, and locality-governance rules. |
| ADR-024_System_Level_Benchmarking_Suite_Definitions.md | Defines system-level benchmark suite taxonomy across component, service, test-intent, SLURM execution, cross-platform tooling, and evidence gates. |
| ADR-025_Broad_Systems_E2E_Testing_Workflows_and_Tooling.md | Defines broad-systems E2E testing workflows, scheduler-adapter execution, failure-mode tests, evidence bundles, and CI/CD gates. |
| ADR-026_Coherence_CE_Object_Chunking_and_Manifest_Semantics.md | Defines Coherence-CE internal object chunking, manifest commit semantics, S3 multipart mapping, Git LFS facade behavior, and RAG byte-object boundaries. |
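
To make the ADR-014 idea of rolling metrics up into GREEN, AMBER, RED, and DRAIN admission states concrete, here is a minimal sketch. It assumes a "worst state wins" rule and an ordering in which DRAIN outranks RED; both are assumptions for illustration, since the actual roll-up rules live in ADR-014. `AdmissionState` and `roll_up` are hypothetical names.

```python
from enum import IntEnum


class AdmissionState(IntEnum):
    # Ordering is an assumption: higher value means more restrictive.
    GREEN = 0
    AMBER = 1
    RED = 2
    DRAIN = 3


def roll_up(component_states):
    """Combine per-component states into one scheduler admission state.

    Assumed rule: the most restrictive component state wins.
    `component_states` maps a component name (e.g. "fabric") to an AdmissionState.
    """
    if not component_states:
        # Refuse to guess when no telemetry was reported.
        raise ValueError("no component states reported")
    return max(component_states.values())


if __name__ == "__main__":
    states = {
        "fabric": AdmissionState.GREEN,
        "cxl": AdmissionState.AMBER,
        "dpu": AdmissionState.GREEN,
    }
    print(roll_up(states).name)
```

Using an ordered enum keeps the roll-up a single `max` call; a real policy would likely weight components or apply hysteresis, which this sketch deliberately omits.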

Top-down architecture composition

The design composes the system from inference SLOs down through hot-state placement, namespace modality, data tiers, fabrics, offload, durable media, scheduler admission, failure semantics, CXL/UA-Link pod resources, heterogeneous accelerator governance, S3/Object REST translation, object chunking and manifest semantics, Git LFS gateway behavior, benchmark evidence, broad-systems E2E testing, and research-publication workflow. Each ADR embeds its PNG diagram and has a PlantUML source file plus PNG/SVG renders under adr/diagrams/.

| ADR | Architecture interaction diagram |
| --- | --- |
| ADR-001_Inference_Storage_Principles_and_SLOs.md | PNG / SVG / PUML |
| ADR-002_Hot_KV_and_Prefix_Cache_Data_Plane.md | PNG / SVG / PUML |
| ADR-003_Model_Weight_Object_and_Corpus_Data_Tiers.md | PNG / SVG / PUML |
| ADR-004_RDMA_Fabric_and_GPU_Direct_Data_Paths.md | PNG / SVG / PUML |
| ADR-005_DPU_and_SmartNIC_Offload_Boundaries.md | PNG / SVG / PUML |
| ADR-006_OpenZFS_NVMe_oF_and_Media_Layout.md | PNG / SVG / PUML |
| ADR-007_Inference_Scheduler_Locality_and_Admission_Control.md | PNG / SVG / PUML |
| ADR-008_RAG_Vector_Index_and_Corpus_Service.md | PNG / SVG / PUML |
| ADR-009_Observability_Benchmarking_and_Rollout_Gates.md | PNG / SVG / PUML |
| ADR-010_Coherence_CE_Write_Policy_to_OpenZFS.md | PNG / SVG / PUML |
| ADR-011_KV_Durability_Classes.md | PNG / SVG / PUML |
| ADR-012_Coherence_CE_vLLM_Adapter_API_Contract.md | PNG / SVG / PUML |
| ADR-013_Failure_Semantics_and_Fencing.md | PNG / SVG / PUML |
| ADR-014_Coherence_Metrics_Scheduler_Admission.md | PNG / SVG / PUML |
| ADR-015_CXL_Memory_Tiering_and_OpenZFS_Interaction.md | PNG / SVG / PUML |
| ADR-016_Roadmap_Evidence_and_Public_Claim_Guardrails.md | PNG / SVG / PUML |
| ADR-017_Research_Metadata_and_Arxiv_Publication_Workflow.md | PNG / SVG / PUML |
| ADR-018_UALink_Pod_Scale_Fabric_and_Compute_Domains.md | PNG / SVG / PUML |
| ADR-019_Pod_Scale_Network_Architecture_and_RDMA_RoCEv2_Tuning.md | PNG / SVG / PUML |
| ADR-020_CXL_Memory_Pools_for_UALink_Pods.md | PNG / SVG / PUML |
| ADR-021_Heterogeneous_GP_GPU_Compute_and_Scheduler_Governance.md | PNG / SVG / PUML |
| ADR-022_S3_Object_to_REST_API_Protocol_Mapping_Translator.md | PNG / SVG / PUML |
| ADR-023_Coherence_CE_Namespace_Modalities.md | PNG / SVG / PUML |
| ADR-024_System_Level_Benchmarking_Suite_Definitions.md | PNG / SVG / PUML |
| ADR-025_Broad_Systems_E2E_Testing_Workflows_and_Tooling.md | PNG / SVG / PUML |
| ADR-026_Coherence_CE_Object_Chunking_and_Manifest_Semantics.md | PNG / SVG / PUML |

ADR diagram gallery

[Diagram gallery: one rendered diagram per ADR, ADR-001 through ADR-026, each captioned with its ADR file name.]

Public claim guardrails

Claims about UA-Link, CXL, RoCEv2, DPUs, and heterogeneous GPUs are each assigned one of the following evidence grades:

  • Direct: source explicitly states the relationship or capability.
  • Adjacent: relevant to architecture but not proof of a named integration.
  • Negative-control: retained to prevent overclaiming.
  • Not found in current sweep: searched but no direct source-backed mention found.
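
The grades above can be encoded so that tooling can gate public statements on them. The sketch below is illustrative only: `EvidenceGrade` and `may_claim_integration` are hypothetical names, and the rule that only a Direct grade supports stating a named integration as fact is an assumption inferred from the grade descriptions, not a quotation of ADR-016.

```python
from enum import Enum


class EvidenceGrade(Enum):
    DIRECT = "direct"                        # source explicitly states the capability
    ADJACENT = "adjacent"                    # relevant, but not proof of a named integration
    NEGATIVE_CONTROL = "negative-control"    # retained to prevent overclaiming
    NOT_FOUND = "not-found-in-current-sweep" # searched, no source-backed mention


def may_claim_integration(grade):
    """Assumed guardrail: only Direct evidence supports a named-integration claim."""
    return grade is EvidenceGrade.DIRECT


if __name__ == "__main__":
    for grade in EvidenceGrade:
        verdict = "allowed" if may_claim_integration(grade) else "blocked"
        print(f"{grade.value}: {verdict}")
```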

About

ExaScale Storage for AI/HPC Inference :: RDMA-Mesh + ZFS/CXL + DPU/NVMe-oF