Razen Compiler Roadmap (C++ Implementation)

Philosophy: Meaningful. Accurate. Simple. Maximum Performance. No Hidden Magic.

Style: Direct. No fluff. Every checkbox is a concrete deliverable.

Legend

Mark	Meaning
✓	Done — tested and working in pipeline
◐	Partial — parsed/validated but codegen missing or broken
☐	Not started

Stage 0: Project Infrastructure

Build System

✓ Makefile build system (clang++-20, C++20, make && ./razenc)
✓ Dependency on C++20 or later (-std=c++20)
✓ razenc CLI binary (separate from host build)
✓ Source file input (positional .rzn args + -f flag)
◐ Output file flags (emitObject/emitAssembly via IRGen — no --emit= CLI flag yet)
☐ Target triple specification for cross-compilation
☐ Optimization level flags (-O0 through -O3)
☐ DWARF debug info generation (no -g CLI flag yet)

Documentation

✓ README.md with philosophy, build, and quick start
✓ ROADMAP.md (this file)
✓ docs/ — introduction, basics, types, functions, control flow, behaviours, std_lib, compilation, syntax, style, faq, memory, modules, expressions, generics, attributes, error handling, testing
✓ design/ — keywords, std_new (detailed std spec), examples
☐ Language specification (formal grammar)
☐ Compiler internals guide

Testing

✓ Sample programs in src/samples/ (6 sample headers with 30+ test programs)
☐ Automated test runner
☐ Unit tests for lexer, parser, semantic, codegen
☐ Integration tests (compile + verify LLVM IR output)
☐ Fuzz testing for parser and semantic analyzer
☐ Regression test suite for all open issues

Stage 1: Lexer (Phase 1) ✓ Complete

Token Types — All Tokens Defined and Lexed

✓ Keywords: func, ret, if, else, loop, break, skip, match, const, mut, pub, use, mod, struct, enum, union, error, behave, ext, async, defer, try, catch, type, true, false, void, noret, any, test, needs
✓ Primitive types: i1-i128, u1-u128, isize, usize, int, uint, f16-f128, float, bool, char, str, string
✓ Operators: +, -, *, /, %, =, ==, !=, <, >, <=, >=, +=, -=, *=, /=, %=, !, &&, ||, &, |, ^, ~, <<, >>, ., .., ..., ..=, ->, =>, ~>, :, :=, ,, ;, @, ?
✓ Delimiters: (, ), {, }, [, ]
✓ Integer literals (decimal)
✓ Float literals
✓ String literals with escape sequences (\n, \t, \", \\, \', \0, \r, \xNN)
✓ Char literals with escape sequences
✓ Bool literals (true, false)
✓ Single-line comments (//) with correct line tracking
✓ Block comments (/* */) spanning multiple lines with correct line tracking
✓ Line/column tracking on every token
✓ EOF token
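
The escape sequences listed above can be decoded with one pass over the literal body. This is a minimal sketch, not the razenc lexer: `decodeEscapes` and `hexValue` are hypothetical names, and returning `std::nullopt` on a bad escape is an assumption (the real lexer may report the error differently).

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: value of one hex digit, or -1 if not a hex digit.
inline int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Decode the body of a string literal (text between the quotes).
// Handles \n, \t, \", \\, \', \0, \r and \xNN; anything else is rejected.
inline std::optional<std::string> decodeEscapes(std::string_view body) {
    std::string out;
    for (std::size_t i = 0; i < body.size(); ++i) {
        const char c = body[i];
        if (c != '\\') { out.push_back(c); continue; }
        if (++i >= body.size()) return std::nullopt;  // lone trailing backslash
        switch (body[i]) {
            case 'n':  out.push_back('\n'); break;
            case 't':  out.push_back('\t'); break;
            case 'r':  out.push_back('\r'); break;
            case '0':  out.push_back('\0'); break;
            case '"':  out.push_back('"');  break;
            case '\'': out.push_back('\''); break;
            case '\\': out.push_back('\\'); break;
            case 'x': {
                if (i + 2 >= body.size()) return std::nullopt;  // need two digits
                const int hi = hexValue(body[i + 1]);
                const int lo = hexValue(body[i + 2]);
                if (hi < 0 || lo < 0) return std::nullopt;
                out.push_back(static_cast<char>(hi * 16 + lo));
                i += 2;
                break;
            }
            default: return std::nullopt;  // unknown escape
        }
    }
    return out;
}
```

Char literals could reuse the same routine and then check that exactly one character was produced.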

Lexer Architecture

✓ Stateful Lexer class with position, line, char tracking
✓ Character-by-character processing loop
✓ Operator multi-character peek-ahead
✓ Dot operator differentiation (., .., ..., ..=)
✓ Word/keyword/number disambiguation
✓ Identifier tokenization
✓ Unrecognized character handling (throws LexerError with context)
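
Dot operator differentiation is a small case of multi-character peek-ahead: take the longest operator that matches. The sketch below assumes hypothetical names (`DotKind`, `DotToken`, `lexDot`); the real Lexer class may structure this differently.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical token kinds for the dot family: ".", "..", "...", "..=".
enum class DotKind { Dot, DotDot, DotDotDot, DotDotEq };

struct DotToken {
    DotKind kind;
    std::size_t length;  // number of characters consumed
};

// Classify the operator starting at src[pos].
// Precondition: pos < src.size() and src[pos] == '.'.
inline DotToken lexDot(std::string_view src, std::size_t pos) {
    // Peek without running off the end of the input.
    auto peek = [&](std::size_t offset) -> char {
        return pos + offset < src.size() ? src[pos + offset] : '\0';
    };
    if (peek(1) != '.') return {DotKind::Dot, 1};
    if (peek(2) == '.') return {DotKind::DotDotDot, 3};
    if (peek(2) == '=') return {DotKind::DotDotEq, 3};
    return {DotKind::DotDot, 2};
}
```

The caller advances its position by `length`, so `0..=5` yields one `..=` token rather than `..` followed by `=`.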

Stage 2: Parser & AST (Phase 2) ✓ Complete

AST Node Types (48 node types)

✓ Literal nodes: IntegerLiteral, FloatLiteral, StringLiteral, CharLiteral, BoolLiteral, ArrayLiteral, TupleLiteral
✓ Identifier node
✓ Type nodes: VarType, PointerType (*T), ArrayType ([T], [T;N]), OptionalType (?T), FailableType (!T), ErrorUnionType (Error!T)
✓ Declaration nodes: FunctionDeclaration, VarDeclaration, ConstDeclaration, StructDeclaration, EnumDeclaration, UnionDeclaration, ErrorMapDeclaration, TypeAliasDeclaration, ModuleDeclaration, UseDeclaration, BehaviourDeclaration, ExtDeclaration, Annotation, GenericParams
✓ Statement nodes: ReturnStatement, IfStatement, ElseIfStatement, LoopStatement, MatchStatement, TryStatement (TryExpression), CatchBlock (CatchExpression), DeferStatement, Assignment, Block, TryBlock
✓ BreakStatement / SkipStatement — dedicated AST nodes with correct loop-scope validation
✓ Expression nodes: BinaryExpression, UnaryExpression, FunctionCall, MemberAccess, BuiltinExpression, RangeExpression, CaptureBlock
✓ Structural nodes: ReturnType, IfBody, ElseBody, LoopBody, MatchBody, MatchCase, Parameters, Parameter, Argument
✓ Comment node

Declaration Parsing

✓ func name(params) -> ret_type { body } — full function parsing
✓ pub func / async func / const func / ext func / ext struct / ext enum / ext union variants
✓ Generic parameters: @Generic(T), @Generic(T, E) — parsed with GenericParams node, stored on declaration
✓ Parameter parsing with mut/const prefix, variadic ...
✓ struct Name { fields... } with methods, ~> trait impls, ~> rename syntax, field defaults
✓ enum Name: backing_type { variants... } with explicit values, methods, ~> (traits in children, consistent with struct/union)
✓ union Name { variants... } — tuple-style, struct-variant
✓ error Name { variants... } — error set declaration
✓ behave Name { needs..., func... } — behaviour/trait declaration, with ~> inheritance
✓ const Name: type = expr — compile-time constants
✓ type Name = Type — type aliases
✓ mod Name; — module declarations
✓ use dotted.path; — import statements
✓ pub visibility flag on all declarations

Statement Parsing

✓ Variable declarations: name: type = expr, name := expr, mut variant
✓ Assignment: name = expr, compound assignment name op= expr (+=, -=, *=, /=, %=)
✓ ret expr / ret (void return)
✓ if cond { ... } else { ... } — including else if chaining
✓ loop { ... } — infinite loop
✓ loop cond { ... } — conditional loop
✓ loop expr |item| { ... } — iterator loop (parsed)
✓ break, skip
✓ defer { ... }, defer stmt
✓ match expr { pat => body, ... } with literal/enum/destructure/wildcard patterns
✓ try expr, try expr catch |err| { ... }, try { ... } catch (err) { ... } (block variant)
✓ @as(Type, expr) and other @ builtins (parsed as BuiltinExpression)

Expression Parsing

✓ Full precedence-climbing expression parser (12 precedence levels)
✓ All binary operators with correct associativity
✓ Unary: -, !, ~ (bitwise not), & (address-of), * (dereference via ptr.*)
✓ Pointer dereference: ptr.* (postfix)
✓ Member access: a.b.c
✓ Function calls: f(args) with argument lists
✓ Array literals: [1, 2, 3]
✓ Tuple literals: .{a, b, c}
✓ Range expressions: .., ..=, ... parsed as dedicated RangeExpression nodes with precedence 11
✓ Capture blocks: |e| expr
✓ Parenthesized grouping
✓ Type annotations in expression context
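
Precedence climbing can be shown on a much smaller grammar. The sketch below is an evaluator, not an AST builder, and it assumes only two binary levels (`+ -` and `* / %`) plus unary minus and parentheses. The class name `MiniParser` is hypothetical; the real parser has 12 levels and produces BinaryExpression nodes.

```cpp
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string_view>

class MiniParser {
public:
    explicit MiniParser(std::string_view src) : src_(src) {}

    long parse() {
        const long v = parseExpr(1);
        skipSpace();
        if (pos_ != src_.size()) throw std::runtime_error("trailing input");
        return v;
    }

private:
    std::string_view src_;
    std::size_t pos_ = 0;

    void skipSpace() {
        while (pos_ < src_.size() &&
               std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
    }

    // Binding power of a binary operator; 0 means "not a binary operator".
    static int precedence(char op) {
        switch (op) {
            case '+': case '-': return 1;
            case '*': case '/': case '%': return 2;
            default: return 0;
        }
    }

    static long apply(char op, long a, long b) {
        switch (op) {
            case '+': return a + b;
            case '-': return a - b;
            case '*': return a * b;
            default:  // '/' or '%'
                if (b == 0) throw std::runtime_error("division by zero");
                return op == '/' ? a / b : a % b;
        }
    }

    long parsePrimary() {
        skipSpace();
        if (pos_ >= src_.size()) throw std::runtime_error("unexpected end");
        const char c = src_[pos_];
        if (c == '(') {
            ++pos_;
            const long v = parseExpr(1);
            skipSpace();
            if (pos_ >= src_.size() || src_[pos_] != ')')
                throw std::runtime_error("expected ')'");
            ++pos_;
            return v;
        }
        if (c == '-') { ++pos_; return -parsePrimary(); }
        if (!std::isdigit(static_cast<unsigned char>(c)))
            throw std::runtime_error("expected number");
        long v = 0;
        while (pos_ < src_.size() &&
               std::isdigit(static_cast<unsigned char>(src_[pos_])))
            v = v * 10 + (src_[pos_++] - '0');
        return v;
    }

    // Parse operators whose precedence is at least minPrec.
    long parseExpr(int minPrec) {
        long lhs = parsePrimary();
        for (;;) {
            skipSpace();
            if (pos_ >= src_.size()) return lhs;
            const char op = src_[pos_];
            const int prec = precedence(op);
            if (prec == 0 || prec < minPrec) return lhs;
            ++pos_;
            // prec + 1 makes the operator left-associative.
            const long rhs = parseExpr(prec + 1);
            lhs = apply(op, lhs, rhs);
        }
    }
};
```

Passing `prec` instead of `prec + 1` to the recursive call would make an operator right-associative, which is how one loop can serve every level.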

Type Parsing

✓ All primitive types (i1-u128, f16-f128, bool, char, void, noret, any)
✓ Pointer types: *T
✓ Optional types: ?T
✓ Failable types: !T
✓ Error union types: Error!T (named) and error!T (anonymous)
✓ Array types: [T], [T; N]
✓ Collection types: vec[T], map{K,V}, set{T}
◐ Builtin types: @Self, @Type, @Generic(T) — parsed as identifiers, no special validation
✓ mut type modifier

Stage 3: Semantic Analysis (Phase 3) ✓ Complete

Symbol Table & Scope Management

✓ Scope class with parent chain for lexical scoping
✓ Symbol types: Variable, Function, Struct, Enum, Union, Trait, ErrorSet, Module, TypeAlias
✓ PushScope / PopScope for block boundaries
✓ Two-pass design: pass 1 (declare globals) + pass 2 (analyze bodies)
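
A parent-chained scope needs two operations: declare in the current scope and look up through the chain. This is a minimal sketch under assumptions: the `Symbol` fields and the `declare`/`lookup` method names are hypothetical, and the real Scope class may store more per symbol.

```cpp
#include <string>
#include <unordered_map>
#include <utility>

enum class SymbolKind {
    Variable, Function, Struct, Enum, Union, Trait, ErrorSet, Module, TypeAlias
};

struct Symbol {
    SymbolKind kind;
    std::string type;        // type name as text, enough for this sketch
    bool isMutable = false;
};

class Scope {
public:
    explicit Scope(Scope* parent = nullptr) : parent_(parent) {}

    // Returns false if the name already exists in THIS scope
    // (the duplicate-declaration case). Shadowing a parent is allowed.
    bool declare(const std::string& name, Symbol sym) {
        return symbols_.emplace(name, std::move(sym)).second;
    }

    // Walk from the innermost scope outward; nullptr means undeclared.
    const Symbol* lookup(const std::string& name) const {
        for (const Scope* s = this; s != nullptr; s = s->parent_) {
            const auto it = s->symbols_.find(name);
            if (it != s->symbols_.end()) return &it->second;
        }
        return nullptr;
    }

private:
    Scope* parent_;
    std::unordered_map<std::string, Symbol> symbols_;
};
```

In the two-pass design, pass 1 would call `declare` on the global scope for every top-level declaration, so pass 2 can resolve calls to functions defined later in the file.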

Name Resolution

✓ Global declaration registration (functions, structs, enums, unions, traits, behaviours, aliases, modules)
✓ Variable name resolution in expressions
✓ Function name resolution for calls
◐ Module-scoped name resolution (mod / use paths — basic path tracking, no multi-file linking)
✓ std identifier whitelisted
✓ Built-in identifier whitelist: self, true, false, null, print, println, printf, puts, eprint, eprintln, exit, assert, panic, clock_ms, clock_ns

Declaration Validation

✓ Duplicate declaration detection in same scope
✓ Function argument count validation on calls (argument count checked against declared parameters)
✓ Mutability checks (immutable assignment detection)
✓ Undeclared identifier detection
✓ Return type validation (shows expected vs actual types with typeToString)
✓ Function argument type matching
☐ Constant expression evaluation (comptime)
✓ Struct field declaration tracking
✓ Enum variant tracking

Type Checking

✓ Expression type inference for :=
✓ Operator type compatibility (arithmetic, comparison, logical, bitwise) with rich error messages
✓ Assignment type compatibility with typeToString diff messages
✓ Pointer/reference type validation (address-of returns pointer type, dereference requires pointer)
✓ If condition must be boolean
✓ Loop condition must be boolean
✓ Break/skip outside loop detection
✓ Struct field access validation (field existence, returns field type)
✓ Error union handling (error_type extracted from named error sets, try/catch recognized)
☐ Array/slice index validation
☐ Behaviour implementation signature checking
☐ Comptime const expression validation

Error Reporting

✓ Categorized errors: [TypeError], [NameError], [MutError], [ReturnError], [DeclError], [FlowError], [ArgError], [SyntaxError], [FieldError]
✓ Color-coded output with RED category tags and CYAN position info
✓ Line:column position on every error (line N:M)
✓ Expected vs found type display via typeToString() for pointer, optional, error union, struct, enum types
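
A categorized, color-coded message can be assembled as below. The exact layout razenc prints is not shown in this roadmap, so the field order, the `Diagnostic` struct, and `formatDiagnostic` are assumptions; only the category tag, the red/cyan colors, and the `line N:M` form come from the list above.

```cpp
#include <string>

struct Diagnostic {
    std::string category;  // e.g. "TypeError"
    int line;
    int column;
    std::string message;
};

// Build "[Category] line N:M message", optionally wrapped in ANSI colors:
// red for the category tag, cyan for the position.
inline std::string formatDiagnostic(const Diagnostic& d, bool useColor) {
    const std::string red   = useColor ? "\033[31m" : "";
    const std::string cyan  = useColor ? "\033[36m" : "";
    const std::string reset = useColor ? "\033[0m"  : "";
    return red + "[" + d.category + "]" + reset + " " +
           cyan + "line " + std::to_string(d.line) + ":" +
           std::to_string(d.column) + reset + " " + d.message;
}
```

Keeping color behind a flag makes the output easy to compare in tests and safe to redirect to a file.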

Stage 4: LLVM IR Code Generation (Phase 4) ≈85%

Phase 4 Architecture

✓ Module preamble: source_filename, target layout (e-m:e-p270:...), target triple (x86_64-pc-linux-gnu)
✓ Libc function declarations (printf, puts, exit, abort) with LLVM attributes
◐ Std library IR injection (std.fmt module injection from src/std/fmt.rzn)
✓ Global node dispatch (genNode switch on all node types)
✓ IRGen shared state (Codegen struct with locals, types, enums, unions, errors maps)
✓ StringBuilder for efficient IR assembly (IRBuilder with LLVM API)
✓ Comment/Annotation/GenericParams/Module/Use/Behave/TypeAlias nodes skipped in codegen
✓ Optimization pipeline (mem2reg + instcombine via new PM PassBuilder)
✓ Object/assembly emission (emitObject/emitAssembly via TargetMachine)
✓ CLI: --verbose/--debug, --help, --version, -f, positional file args

Type Mapping to LLVM

✓ Primitive types: i1-u128, f16-f128, bool→i1, char→i8, void, str→ptr, string→ptr, any→ptr
✓ Pointer types: *T → ptr (opaque)
✓ Compound types: structs→%T, enums→iN, unions→{i32,[N x i8]}, error unions→{i1,T}, optionals→{i1,T}, failables→{i1,T}
✓ Array types: [N x T], slice→ptr
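
The primitive part of this mapping can be sketched as a lookup from a Razen type name to LLVM IR type text. `llvmTypeFor` is a hypothetical name and works on strings for illustration; the real codegen uses the LLVM API. Types whose mapping is not stated above (int, uint, float, isize, usize) are deliberately left unmapped.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Map a Razen primitive type name to LLVM IR type text.
// Returns std::nullopt for names this sketch does not cover.
inline std::optional<std::string> llvmTypeFor(std::string_view razenType) {
    if (razenType == "bool") return "i1";
    if (razenType == "char") return "i8";
    if (razenType == "void") return "void";
    if (razenType == "str" || razenType == "string" || razenType == "any")
        return "ptr";
    if (razenType == "f16")  return "half";
    if (razenType == "f32")  return "float";
    if (razenType == "f64")  return "double";
    if (razenType == "f128") return "fp128";

    // iN / uN: LLVM integers are signless, so both map to iN.
    if (razenType.size() >= 2 && (razenType[0] == 'i' || razenType[0] == 'u')) {
        int bits = 0;
        for (const char c : razenType.substr(1)) {
            if (c < '0' || c > '9') return std::nullopt;
            bits = bits * 10 + (c - '0');
            if (bits > 128) return std::nullopt;  // outside i1-i128
        }
        if (bits >= 1) return "i" + std::to_string(bits);
    }
    return std::nullopt;
}
```

Because signedness is dropped at this point, it has to be tracked separately wherever the instruction choice depends on it (division, comparison, widening).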

Variable Declarations

✓ Local alloca with store initializer
✓ Global const declarations with InternalLinkage and constant initializers
✓ Global non-const declarations with deferred init via __raz_global_init() constructor
✓ Type inference via := with type mapping

Function Code Generation

✓ define @name with parameter list
✓ Parameter alloca and store at entry block
✓ Return value handling with default zeroinit for void
✓ External function declarations (ext func) with variadic support
✓ Self parameter handling for method calls

Expression Code Generation

✓ Literals: int (i32), float (double), bool (i1), char (i8), string (global @.str.N with dedup)
✓ Identifier: load from alloca with type tracking, enum/error lookup, null → null constant
✓ Binary operators: arithmetic (Add/FAdd), comparison (ICmp/FCmp), logical (And/Or i1), bitwise (And/Or/Xor/Shl/AShr) — all with float/int dispatch and SExt/Trunc widening
✓ Unary operators: negate (Neg/FNeg), not (Xor 1), bitnot (Xor all-ones), address-of (&), dereference (.* via LoadInst)
✓ Function calls with argument type widening (SExt/Trunc)
✓ Member access: struct field GEP, enum variant constant, error set constant, method call with mangled name
◐ Method calls: c.method() → mangled name struct.method + self arg (basic support, no vtable dispatch)
✓ Array literal: ConstantArray or alloca + per-element GEP+store
✓ Tuple literal: ConstantStruct or alloca + stores
✓ Range expression: {start, end} struct pair
✓ Union construction: alloca, tag store, payload bitcast+store
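
The SExt/Trunc widening used for binary operators and call arguments reduces to a width comparison. A minimal sketch with hypothetical names (`IntCast`, `castForWidths`):

```cpp
// Which integer conversion to emit when a value's bit width differs
// from the width the destination expects.
enum class IntCast { None, SExt, Trunc };

inline IntCast castForWidths(unsigned fromBits, unsigned toBits) {
    if (fromBits < toBits) return IntCast::SExt;   // widen, preserving sign
    if (fromBits > toBits) return IntCast::Trunc;  // narrow, dropping high bits
    return IntCast::None;                          // same width: no cast
}
```

This mirrors the items above, which name only SExt and Trunc. Whether unsigned types are zero-extended instead is not stated in this roadmap.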

Statement Code Generation

✓ Variable declarations with initializer
✓ Assignment: =, +=, -=, *=, /=, %= (all with float/int dispatch)
✓ Assignment to struct field: x.field = expr via GEP chain
✓ Return statement with value (including SExt/Trunc for integer width mismatch)
✓ If/else with else-if chaining (CondBr based on i1 condition, end block joining)
✓ Loops: infinite (loop {}), conditional (loop cond {}), with cond/body/end basic blocks
✓ Break → br to loop.end, Skip → br to loop.continue
✓ Defer → LIFO flush before return (reverse iteration of deferred vector)
✓ Match → icmp eq chain for simple enums, tag dispatch for tagged unions with payload extraction and variable binding
✓ Try expression → flag check, branch to catch or propagate (return)
✓ Try block → scope with catch target
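
The defer rule (LIFO flush before return) can be sketched with strings standing in for emitted IR. `DeferredEmitter` and its methods are hypothetical; the real codegen iterates a vector of deferred AST nodes.

```cpp
#include <string>
#include <utility>
#include <vector>

class DeferredEmitter {
public:
    // Record a deferred statement in source order.
    void defer(std::string stmt) { deferred_.push_back(std::move(stmt)); }

    // Emit deferred statements in reverse order, then the return itself.
    std::vector<std::string> emitReturn(const std::string& ret) const {
        std::vector<std::string> out(deferred_.rbegin(), deferred_.rend());
        out.push_back(ret);
        return out;
    }

private:
    std::vector<std::string> deferred_;
};
```

Since `emitReturn` does not clear the list, every return in a function flushes the same deferred statements, which is what a function with several exit points needs.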

Struct Code Generation

✓ Type definition: %StructName = type { field_types }
✓ Field access via getelementptr with field type tracking
✓ Struct field assignment via GEP chain
✓ Constructors with explicit field initialization
◐ Struct methods: collected and emitted with mangled names (struct.method)

Enum Code Generation

✓ Type definition: %EnumName = type backing_integer (i32 default, custom backing)
✓ Variant values computed and tracked (explicit and implicit, auto-increment)
✓ Enum member access resolves to integer constant
✓ Match dispatch using icmp eq on backing integer type
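
Variant value assignment with explicit and implicit values can be sketched as follows. It assumes C-style auto-increment (an implicit variant takes the previous value plus one, starting at 0), which this roadmap does not spell out. `VariantDecl` and `assignEnumValues` are hypothetical names.

```cpp
#include <cstdint>
#include <optional>
#include <string>
#include <utility>
#include <vector>

struct VariantDecl {
    std::string name;
    std::optional<std::int64_t> explicitValue;  // set when a value is written
};

// Assign a backing integer to every variant, in declaration order.
inline std::vector<std::pair<std::string, std::int64_t>>
assignEnumValues(const std::vector<VariantDecl>& variants) {
    std::vector<std::pair<std::string, std::int64_t>> out;
    std::int64_t next = 0;
    for (const auto& v : variants) {
        const std::int64_t value = v.explicitValue.value_or(next);
        out.emplace_back(v.name, value);
        next = value + 1;  // implicit variants continue from here
    }
    return out;
}
```

A fuller version would also reject duplicate values and values that do not fit the backing type.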

Union Code Generation

✓ Tagged union type: %UnionName = type { i32, [N x i8] }
✓ Max payload size computed from all variants
✓ Union construction: tag store + payload bitcast+store
✓ Match tag dispatch with payload extraction
✓ Payload variable binding in match arms
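
The tagged-union layout follows from the largest variant payload. The sketch below builds the type definition as text; `taggedUnionType` is a hypothetical name and payload sizes are assumed to be known in bytes.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Build "%Name = type { i32, [N x i8] }" where N is the largest payload size.
inline std::string taggedUnionType(const std::string& name,
                                   const std::vector<std::size_t>& payloadSizes) {
    std::size_t maxSize = 0;
    for (const std::size_t s : payloadSizes) maxSize = std::max(maxSize, s);
    return "%" + name + " = type { i32, [" + std::to_string(maxSize) + " x i8] }";
}
```

A byte array has 1-byte alignment, so with this layout the payload starts at offset 4. Variants that need stricter alignment would need padding or an explicitly aligned alloca; how razenc handles that is not covered here.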

Error Handling Code Generation

✓ Error set declaration with variant→code mapping (incrementing from 0)
✓ Error variant reference in expressions: FileError.NotFound → i32 code
✓ Error union return construction: {i1 1, i32 code} or {i1 0, T value}
✓ Try expression: extractvalue flag check, branch to catch or propagate
✓ Try block: scope with catch handler target

Behaviour & Async Code Generation

☐ Behaviour / Trait code generation (vtable dispatch)
☐ Generator / Async code generation

Stage 5: Standard Library

Current Std Architecture

◐ std.fmt module — src/std/fmt.rzn provides print, println, printf, puts wrappers (injected at compile time when use std.fmt is found)
◐ C++ source-level module injection via file read during AST building
☐ LLVM IR templates embedded in C++ string constants
☐ C++ wrapper files for std modules
☐ Std function name mapping

All Modules

Modules: std.core, std.mem, std.str, std.string, std.fmt, std.io, std.fs, std.os, std.vec, std.map, std.set, std.ring, std.math, std.bits, std.ascii, std.unicode, std.parse, std.buf, std.hash, std.sync, std.time, std.testing, std.debug

☐ All items in all modules (design complete in design/std_new.md, implementation not started)

Stage 6: Critical Missing Features

Codegen Gaps

☐ Short-circuit evaluation for && / || (currently emits bitwise And/Or)
☐ Block-level break/skip labels (break/skip only target innermost loop)
☐ Behaviour/trait vtable dispatch
☐ Comptime / metaprogramming
☐ String literal interpolation ({} in format strings)
☐ Enum match IR — basic block ordering issues in some cases
☐ Generic monomorphization (@Generic(T) annotations parsed but not specialized)
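
The short-circuit gap is observable whenever the right operand has a side effect. The sketch below models operands as callables to show the difference; `Operand`, `eagerAnd`, and `shortCircuitAnd` are hypothetical names, and the comments map each branch to the basic blocks a fix would likely need.

```cpp
#include <functional>

using Operand = std::function<bool()>;

// Models the current behaviour: both operands are evaluated,
// then combined with a bitwise And on i1.
inline bool eagerAnd(const Operand& lhs, const Operand& rhs) {
    const bool l = lhs();
    const bool r = rhs();  // runs even when l is false
    return l & r;
}

// Models short-circuit &&: rhs runs only when lhs is true.
inline bool shortCircuitAnd(const Operand& lhs, const Operand& rhs) {
    if (!lhs()) return false;  // branch straight to the end block
    return rhs();              // rhs block; its value is the result
}
```

In IR this usually becomes a conditional branch on the left operand, a block for the right operand, and a phi in the join block. `||` is the mirror image: skip the right operand when the left is true.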

Usability

☐ Automated test runner
☐ Package/module resolution across files
☐ Error recovery in parser (currently panics on first error)
☐ Language server protocol (LSP) support
☐ WASM target

Stage 7: Compiler Self-Hosting

Bootstrap Path (all ☐)

☐ Razen type system self-hosting
☐ Razen lexer written in Razen
☐ Razen parser written in Razen
☐ Razen → C/C++ transpilation bootstrapping
☐ Full self-hosting

Milestone Summary

Milestone	Description	Key Deliverables	Status
M0	C++ pipeline skeleton	Makefile, lexer, parser, semantic stubs	✓ Complete
M1	Working lexer	Full tokenization of all Razen constructs	✓ Complete
M2	Full parser + AST	All declarations, statements, expressions, generics, ranges, else-if, block try, ext struct	✓ Complete
M3	Semantic analysis	Type checking, scope, validation, categorized errors, typeToString, pointer/error union compatibility	✓ Complete
M4	LLVM codegen	All types, expressions, statements, control flow, optimization, emission	◐ ≈85% — methods basic, no vtable dispatch
M5	Struct codegen	Struct types, field access, methods	◐ Types + fields ✓, methods ◐
M6	Enum + Match + Union	Enumerations, tagged unions, match dispatch	✓ Enum ✓, Union ✓, Match ✓
M7	Error handling	Error sets, error unions, try/catch	◐ Error sets + try expr done, union propagation basic
M8	Codegen optimization	mem2reg, instcombine, object/assembly emission	✓ Complete — new PM PassBuilder, TargetMachine
M9	CLI & tooling	--help, --version, -v, -f, file input	✓ Complete
M10	Collections	Vec, Map, Set with generics	☐ Not started
M11	Std library	All 23 std modules listed in Stage 5 implemented	◐ ~5% (only fmt.rzn functional)
M12	Self-hosting	Razen compiler compiles itself	☐ Not started

Design Constraints

☐ Zero hidden allocations — all allocation takes explicit Allocator param
☐ Predictable LLVM mapping — clear path from source to IR
☐ No implicit casts — type conversions must be explicit
☐ No hidden magic — no GC, no implicit allocs, no hidden control flow
☐ Zero-cost abstractions — behaviours dispatch without overhead

Progress: Phases 1-3 (Lexer, Parser, Semantic Analysis) are complete. Phase 4 (Codegen) is ~85% complete. Done: core infrastructure, the optimization pipeline (mem2reg + instcombine), object/assembly emission (emitObject/emitAssembly), the CLI (--help, --version, -v, file args), all major control flow (if/loop/match/defer/try), compound types (struct/enum/union/error), and all expression types. Remaining codegen work: vtable dispatch for methods and behaviours, short-circuit evaluation, comptime evaluation, and match IR block-ordering edge cases.

Std Library: ~5% — fmt.rzn (print/println/printf/puts) is functional and injected on use std.fmt.

All 10 built-in sample programs produce valid .o and .s files in output/; compiled object files link and execute correctly.

Next Target: Remaining codegen gaps (short-circuit &&/||, comptime evaluation, enum match CFG fix) + struct method codegen + begin std library modules (std.os, std.mem).
