Skip to content

billhails/CEKF

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1,384 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CEKF(s)

logo

Low level, hopefully fast C implementation of a CEK machine with an additional "F" failure continuation supporting amb (and a stack).

CEKF is the rather unfortunate name of the repo and the implementation. The language itself I'm calling F♮ (F-Natural).

This is heavily based on a blog post by Matt Might Writing an interpreter, CESK-style, but using a bytecode interpreter rather than a tree-walking interpreter, and utilising a hybrid stack/closure/continuation implementation where variables local to a function are directly accessible on the stack, and closures and continuations are snapshots of stack frames. It additionally makes use of fast lexical addressing (technically De Bruijn Indexing) for added efficiency gains and an implementation of Hindley-Milner Algorithm W for strict implicit type checking.

I'm hoping that I can reproduce the F♮ language I once implemented in Python, but as a standalone binary with reasonable performance.

If you want to stick around, maybe start by reading the wiki, then maybe the math and comparing that with its implementation in step.c, or start at main.c and work your way through.

For AI coding assistants or developers wanting a comprehensive overview, see the AI Assistant Guide which covers architecture, build system, known complexity areas (TPMC, ANF), and the code generation system.

An aside on amb

I should probably give at least a brief explaination of amb here, for those who don't know what it is, since it's somewhat the point of this little project. amb is short for "ambivalent" in the sense of "having more than one value", and is a way of doing non-deterministic programming.

If you have a continuation passing style interpreter, then all control flow, both call and return, is always "forwards" by calling a function (call) or calling a continuation (return). It then becomes possible to thread an additional "failure" continuation as a sort of hidden argument through all those calls.

Mostly that additional continuation goes completely unnoticed, except in two specific cases:

  1. When amb is invoked with two (unevaluated) arguments, it arranges to have it's first argument evaluated, and additionally installs a new failure continuation that will, if invoked, restore the state of the machine to the point just after amb was invoked, but with the second argument to amb ready to be evaluated instead.
  2. When back is invoked, it restores the most recent state installed by amb, "backtracking" to the decision point and allowing the alternative to be produced.

To see amb in action, look at the sample fn/barrels.fn. Note that in this language amb is an infix operator called then.

For a good description of amb see SICP pp. 412-437.

What makes a CEK machine such an easy way to implement amb is that the failure continuation is just an additional register, nothing else in the CEK machine needs to change, apart from two additional cases in the $step$ function: one to deal with amb and one to deal with back.

Progress/Architecture

flowchart TD
classDef process fill:#aef,color:#123;
classDef new fill:#8bf,color:#000;
classDef minlam fill:#036,color:#fea;
classDef ast fill:#aa0,color:#000;
classDef lambda fill:#4a8,color:#000;
classDef anf fill:#a84,color:#000;

subgraph Parse
   source(Source) ==>
   scanner([Scanner]):::process ==>
   tokens(Tokens) ==>
   parser([Parser]):::process ==>
   oi([Operator Inlining]):::process --> scanner
   parser ==>
   ast(AST):::ast
end

ast ==> namespace_removal([Namespace Desugaring]):::process ==>
denamespace(Simplified AST):::ast

denamespace -.-> tcast

subgraph Proposed
   tcast([Typecheck AST]):::new
   tcast -.-> tcdast(Typechecked AST):::ast
   tcdast -.-> lconv([Lambda Conversion]):::new
end

lconv -.-> lambda_ds

denamespace ==> lc

subgraph Lambda Conversion
   lc([Lambda Conversion]):::process ==> tpmc([Pattern Matching Compiler]):::process
   lc <==> pg([Print Function Generator]):::process
   lc <==> me([Lazy Function Expansion]):::process
   tpmc ==> vs([Variable Substitution]):::process
   vs ==> lc
   lc <==> des([Desugaring]):::process
end

lc ==> lambda0(Plain Lambda Form):::lambda
lambda0 ==> sim([Simplification]):::process
sim ==> lambda1(Plain Lambda Form):::lambda
lambda1 ==> tc([Type Checking]):::process
tc <==> pc([Print Compiler]):::process
tc ==> lambda2(Plain Lambda Form):::lambda
lambda2 ==> ci([Constructor Inlining]):::process
ci ==> lambda3(Inlined Lambda):::lambda
desugaring(["Desugaring"]):::process
desugaring ==> lambda_ds("minimal lambda (MinLam)"):::minlam

lambda_ds ==> alpha(["ɑ-Conversion"]):::process
alpha ==> lambda_a(alphatized lambda):::minlam

lambda_a ==> curry(["Currying"]):::process
curry ==> curried(curried lambda):::minlam
curried ==> beta(["β-conversion"]):::process
beta ==> eta(["η-conversion"]):::process
eta ==> folding([Operator Folding]):::process
folding ==> folded(optimized operators):::minlam
folded ==> uncurry(["Un-Currying"]):::process
uncurry ==> uncurried(uncurried lambda):::minlam

uncurried ==> anfr([ANF Rewrite]):::process
anfr ==> lambda_b(New ANF):::anf

uncurried ==> cps

subgraph Target C
   cps([CPS Transform]):::process ==>
   beta2([Additional β-conversion]):::process ==>
   lambda_e("CPS λ"):::minlam ==>
   amb_conversion([AMB Conversion]):::process ==>
   amb(CPS λ with failure continuation):::minlam ==>
   beta3([Additional β-conversion]):::process ==>
   eta3(["Additional η-conversion (TCO)"]):::process ==>
   tree_shaking(["Dead Binding Elimination (Tree Shaking)"]):::process ==>
   small(Only Necessary Code):::minlam ==>
   closure_conversion([Closure Lifting]):::process ==>
   closures(Explicit Closures):::minlam ==>
   de_bruijn([DeBruijn Indexing]):::process ==>
   indexed("Annotated Variables (Final IR)"):::minlam ==>
   c_emitter([Emit C]):::process ==>
   target_c(Pure C code)
end
   
lambda3 ==> desugaring
uncurried ==> anfc

subgraph Target Bytecode
   anfc([A-Normal Form Conversion]):::process ==>
   anf(ANF):::anf ==>
   lexa([Lexical Analysis]):::process ==>
   ann(Annotated ANF):::anf ==>
   bcc([Bytecode Compiler]):::process ==>
   bc(Byte Code) ==>
   cekf([CEKF Runtime VM]):::process
   bc <==> bcf(Bytecode Files)
end
Loading

A big driver for the progress made has been arriving at a minimalist lambda (MinLam) set of structures for the IR. In the diagram above the stages making use of this MinLam representation are coloured dark-blue/grey.

The "ANF Rewrite" stage was a successful attempt to re-implement the clunky existing ANF transform, but in the process it became apparent that the CEK machine itself was blocking optimizations and so the trajectory pivoted towards targeting a more "traditional" register machine via CPS, initially in C but with an eye towards LLVM in the longer term.

The various components named in the diagram above are linked to their implementation entry point here:

The desugaring step is now rolled into thae lambda conversion phase as eary as possible to simplify downstream processing.

Function Over-Application

Implemented per RFC-001 (hybrid compile-time split + runtime staging). For details see: docs/RFC-001-over-application.md.

CEKF Formal Description

A formal mathematical description of the CEKF machine can be found here.

Bytecode

The description of the machine linked above assumes it is evaluating a tree of lambda expressions. That makes the concepts somewhat clearer so I've left it as originally written. However the actual implementation uses a bytecode interpreter instead. Translation from lambda expressions to bytecode turns out to be not that difficult, see docs/V2 for details of that.

Lexical Addressing

A lexical analysis stage annotates variables with their locations for faster run-time lookup. See docs/LEXICAL_ADDRESSING.

Type Inferencing

My previous attempt at implicit type-checking borrowed a pre-built implementation of Algorithm W written in Python. This time around I've gone with my own implementation, which required quite a lot of research. I've made notes on that process in docs/TYPES.

Arbitrary size integers

Rather than forcing a requirement on an external library like libgmp in the early stages, I've instead opted to incorporate a much smaller, public domain implementation from 983, only slightly amended to play nice with the CEKF memory management and garbage collection.

Unicode

There is basic support for Unicode, both in strings and in identifiers. The base Character type is now a wchar_t which stores raw Unicode code points (UTF-32). The only supported input/output encoding is UTF8, which is used internally for C strings in the interpreter, to read source files and and to output characters to the terminal.

SQLite

The implementation bundles SQLite3 and provides an interface to it, see here. The primary reason being to allow convenient and fast lookup of Unicode data for individual characters, should the need arise.

Code Generation

While not properly part of the language description, I'm using some hand-written code generators to make development easier and they're documented here.

About

Low level (C) implementation of a CEK machine with an additional "F" failure continuation supporting amb

Topics

Resources

License

GPL-3.0, Unknown licenses found

Licenses found

GPL-3.0
LICENSE.md
Unknown
LICENSE.bigint

Stars

Watchers

Forks

Packages

 
 
 

Contributors