C++ pybind11 extension for ASUN (Array-Schema Unified Notation).
Provides 7 functions without requiring manual schema strings for encoding:
encode, encodeTyped, encodePretty, encodePrettyTyped, decode, encodeBinary, decodeBinary.
The wheel also ships asun.pyi and py.typed, so editors and static type checkers can understand the extension module without a separate stub package.
| Tool | Version |
|---|---|
| g++ | ≥ 11 (C++17) |
| python3-dev | any (provides Python.h) |
| Python | ≥ 3.8 |
pybind11 2.13.6 headers are vendored in vendor/pybind11/ — no separate installation needed.
# Option A — shell script (auto-installs python3-dev via sudo if missing)
bash build.sh
# Option B — Makefile
make
# Option C — CMake
cmake -B build && cmake --build build

| Python value | Inferred ASUN type |
|---|---|
| bool | bool |
| int | int |
| float | float |
| str | str |
| None | optional (e.g. str?, int?) |
Cross-row type merging for lists: When encoding a list, all rows are scanned to compute the final type:
- A field that is non-None in row 0 but None in some later row is promoted to optional (e.g. str → str?, int → int?).
- Type conflicts between non-None values (e.g. int in row 0, str in row 1) fall back to str.
This means encodeTyped is safe to use even when only some rows have None for a given field.
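
The merging rule described above can be sketched in pure Python. This is only an illustration of the stated behavior, not the extension's C++ implementation: `scalar_type` and `merge_field_types` are hypothetical helpers, and the handling of a field that is None in every row (shown here as `str?`) and of values outside the inference table (treated as `str`) are assumptions.

```python
from typing import Any, Dict, List, Optional


def scalar_type(value: Any) -> Optional[str]:
    """Return the ASUN scalar name for a Python value, or None for None."""
    if value is None:
        return None
    if isinstance(value, bool):  # check bool before int: bool subclasses int
        return "bool"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    return "str"  # assumption: anything else is treated as str


def merge_field_types(rows: List[Dict[str, Any]]) -> Dict[str, str]:
    """Scan every row and merge per-field types (hypothetical helper)."""
    base: Dict[str, Optional[str]] = {}
    optional: Dict[str, bool] = {}
    for row in rows:
        for key, value in row.items():
            base.setdefault(key, None)
            optional.setdefault(key, False)
            t = scalar_type(value)
            if t is None:
                optional[key] = True   # None in any row makes the field optional
            elif base[key] is None:
                base[key] = t          # first non-None value sets the type
            elif base[key] != t:
                base[key] = "str"      # conflicting non-None types fall back to str
    # Assumption: a field that is None in every row is reported as "str?"
    return {
        key: (base[key] or "str") + ("?" if optional[key] else "")
        for key in base
    }


merged = merge_field_types([{"id": 1, "tag": "hello"}, {"id": 2, "tag": None}])
print(merged)
```

For the two-row input above, the sketch yields `int` for `id` and `str?` for `tag`, which matches the typed schema documented for the same rows.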
asun.encode({"id": 1, "name": "Alice"})
# → '{id,name}:\n(1,Alice)\n'
asun.encode([{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}])
# → '[{id,name}]:\n(1,Alice),\n(2,Bob)\n'

Decode semantics without scalar hints: When decoded with decode(), terminal field values are returned as strings because the schema omits scalar type hints. Structural bindings such as @{} and @[] still remain in the schema. Use encodeTyped when you need a type-preserving round-trip.
Type is inferred from all rows (not just the first). A field that is None in any row is made optional:
asun.encodeTyped({"id": 1, "name": "Alice", "active": True})
# → '{id@int,name@str,active@bool}:\n(1,Alice,true)\n'
# Optional field inferred from cross-row merging:
asun.encodeTyped([{"id": 1, "tag": "hello"}, {"id": 2, "tag": None}])
# → '[{id@int,tag@str?}]:\n(1,hello),\n(2,)\n'

pretty = asun.encodePretty(rows)

pretty = asun.encodePrettyTyped(rows)

decode handles both schemas with scalar hints and schemas without scalar hints embedded in the text:
# schema with scalar hints → values restored as Python types
rec = asun.decode('{id@int, name@str}:\n(1,Alice)\n') # {'id': 1, 'name': 'Alice'}
rows = asun.decode('[{id@int, name@str}]:\n(1,Alice),\n(2,Bob)\n')
# schema without scalar hints → scalar values returned as strings
rec2 = asun.decode('{id,name}:\n(1,Alice)\n') # {'id': '1', 'name': 'Alice'}

Block comments are supported anywhere whitespace is allowed:

rec = asun.decode('/* top */ {id@int,name@str}: /* row */ (1, /* name */ Alice)')

data = asun.encodeBinary(rows)

Schema is required because the binary wire format carries no embedded type information:

rows = asun.decodeBinary(data, "[{id@int, name@str}]")

asun-py includes inline typing support for the compiled extension:
from asun import decode
rows = decode("[{id@int, name@str}]:(1,Alice),(2,Bob)")

Type checkers will infer dict[str, Any] | list[dict[str, Any]] for decode results and validate function signatures from the bundled asun.pyi.
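
Because the decode result is typed as a union of a single record and a list of records, calling code usually narrows it before use. The sketch below shows one way to do that. `DecodeResult` and `as_rows` are hypothetical names that are not part of the asun API, and plain literals stand in for real decode results so the snippet runs without the compiled extension.

```python
from typing import Any, Dict, List, Union

# Mirrors the union that type checkers infer for decode results.
DecodeResult = Union[Dict[str, Any], List[Dict[str, Any]]]


def as_rows(result: DecodeResult) -> List[Dict[str, Any]]:
    """Normalize a decode result to a list of records (hypothetical helper)."""
    if isinstance(result, dict):
        return [result]   # single record becomes a one-element list
    return list(result)   # already a list of records; copy it


# Literals stand in for asun.decode(...) results.
single = as_rows({"id": 1, "name": "Alice"})
many = as_rows([{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}])
print(len(single), len(many))
```

The `isinstance` check lets a static type checker narrow the union in each branch, so downstream code can iterate over records without casts.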
Little-endian layout, identical to asun-rs and asun-go:
| Type | Bytes |
|---|---|
| int | 8 (i64 LE) |
| uint | 8 (u64 LE) |
| float | 8 (f64 LE) |
| bool | 1 |
| str | 4-byte length LE + UTF-8 bytes |
| optional | 1-byte tag (0=null, 1=present) + value |
| slice | 4-byte count LE + elements |
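
The per-type layout in the table can be reproduced with Python's standard `struct` module. This is a sketch of the individual field encodings only, not of how asun frames whole records. The `pack_*` helpers are hypothetical, and three details are assumptions the table does not state: that the string length prefix is unsigned, that it counts UTF-8 bytes, and that bool is written as 0 or 1.

```python
import struct
from typing import Optional


def pack_int(value: int) -> bytes:
    return struct.pack("<q", value)        # 8 bytes, i64 little-endian


def pack_float(value: float) -> bytes:
    return struct.pack("<d", value)        # 8 bytes, f64 little-endian


def pack_bool(value: bool) -> bytes:
    return b"\x01" if value else b"\x00"   # 1 byte; 0/1 values are an assumption


def pack_str(value: str) -> bytes:
    data = value.encode("utf-8")
    # 4-byte little-endian length (assumed unsigned byte count), then UTF-8 bytes
    return struct.pack("<I", len(data)) + data


def pack_optional_str(value: Optional[str]) -> bytes:
    if value is None:
        return b"\x00"                     # tag 0 = null, no payload follows
    return b"\x01" + pack_str(value)       # tag 1 = present, then the value


# The same 8 bytes read differently depending on the declared type.
raw = pack_int(1)
as_int = struct.unpack("<q", raw)[0]
as_float = struct.unpack("<d", raw)[0]
print(raw.hex(), as_int, as_float)
```

The last three lines illustrate why a schema is needed when decoding: the bytes that encode the integer 1 are also a valid (and very different) float, and nothing in the payload says which one was meant.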
# after building:
python3 -m pytest tests/ -v

import asun
users = [
{"id": 1, "name": "Alice", "score": 9.5},
{"id": 2, "name": "Bob", "score": 7.2},
]
# Schema is inferred automatically—no schema string needed
text = asun.encode(users) # schema binding without scalar hints
textTyped = asun.encodeTyped(users) # schema binding with scalar hints
pretty = asun.encodePrettyTyped(users) # pretty + scalar hints
blob = asun.encodeBinary(users) # binary (schema inferred internally)
assert asun.decode(textTyped) == users # round-trip with scalar hints
assert asun.decode(pretty) == users
assert asun.decodeBinary(blob, "[{id@int, name@str, score@float}]") == users

Measured on this machine with:
bash build.sh
PYTHONPATH=. python3 examples/bench.py

Headline numbers:
- Flat 1,000-record dataset: ASUN text serialize 118.98ms vs JSON 403.32ms, deserialize 221.21ms vs JSON 441.89ms
- Flat 10,000-record dataset: ASUN text serialize 81.70ms vs JSON 293.38ms, deserialize 158.39ms vs JSON 317.44ms
- Size summary for 1,000 flat records: JSON 137,674 B, ASUN text 57,761 B (58% smaller), ASUN binary 74,454 B (46% smaller vs JSON)
- Throughput summary on 1,000 records: ASUN text was 3.58x faster than JSON for serialize and 2.01x faster for deserialize
- Binary mode was even faster: 7.18x faster than JSON on serialization and 4.16x faster on deserialization in the benchmark summary
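
The size percentages follow directly from the byte counts in the size summary. The short calculation below reproduces them; `percent_smaller` is a hypothetical helper, and rounding to a whole percent is an assumption about how the summary figures were presented.

```python
# Byte counts copied from the size summary for 1,000 flat records.
json_bytes = 137_674
asun_text_bytes = 57_761
asun_binary_bytes = 74_454


def percent_smaller(size: int, baseline: int) -> int:
    """Size reduction relative to baseline, rounded to a whole percent."""
    return round((1 - size / baseline) * 100)


text_saving = percent_smaller(asun_text_bytes, json_bytes)
binary_saving = percent_smaller(asun_binary_bytes, json_bytes)
print(text_saving, binary_saving)  # prints: 58 46
```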