l2/SPEC.md
IgorCielniak 8e8faf3c91 Initial commit
2025-12-06 16:30:58 +01:00


# L2 Language Specification (Draft)
## 1. Design Goals
- **Meta-language first**: L2 is a minimal core designed to be reshaped into other languages at runtime, pairing Forth-style malleability with modern tooling.
- **Native code generation**: Source compiles directly to NASM-compatible x86-64 assembly, enabling both AOT binaries and JIT-style pipelines.
- **Runtime self-modification**: Parsers, macro expanders, and the execution pipeline are ordinary user-defined words that can be swapped or rewritten on demand.
- **Total control**: Provide unchecked memory access, inline assembly, and ABI-level hooks for syscalls/FFI, leaving safety policies to user space.
- **Self-hosting path**: The bootstrap reference implementation lives in Python, but the language must be able to reimplement its toolchain using its own facilities plus inline asm.
## 2. Program Model
- **Execution units (words)**: Everything is a word. Words can be defined in high-level L2, inline asm, or as parser/runtime hooks.
- **Compilation pipeline**:
  1. Source stream tokenized via active reader (user-overridable).
  2. Tokens dispatched to interpreter or compiler hooks (also user-overridable).
  3. Resulting IR is a threaded list of word references.
  4. Code generator emits NASM `.text` with helper macros.
  5. `nasm` + `ld` (or custom linker) build an ELF64 executable.
- **Interpreted mode**: For REPLs or rapid experimentation, the compiler can emit temporary asm, assemble it, and either `dlopen` the resulting shared object or `execve` the resulting executable.
- **Bootstrapping**: `main.py` orchestrates tokenizer, dictionary, IR, and final asm emission.
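
The pipeline steps above can be sketched in Python, the language of the bootstrap implementation. This is only an illustration under assumptions: the function names `tokenize`, `compile_tokens`, and `emit_asm`, the `word_` label prefix, and the `push_lit` helper macro are hypothetical and are not taken from `main.py`.

```python
def tokenize(source):
    """Step 1: default reader, whitespace-delimited tokens."""
    return source.split()


def compile_tokens(tokens, dictionary):
    """Steps 2-3: turn tokens into a threaded list of word references."""
    ir = []
    for tok in tokens:
        if tok in dictionary:
            ir.append(("call", tok))
        else:
            # In this sketch, anything that is not a known word must be an integer literal.
            ir.append(("lit", int(tok, 0)))
    return ir


def emit_asm(name, ir, dictionary):
    """Step 4: emit NASM text for one word (label names and macro are hypothetical)."""
    lines = ["section .text", f"{name}:"]
    for kind, arg in ir:
        if kind == "lit":
            lines.append(f"    push_lit {arg}")
        else:
            lines.append(f"    call {dictionary[arg]}")
    lines.append("    ret")
    return "\n".join(lines)


DICTIONARY = {"+": "word_plus", "dup": "word_dup"}
ir = compile_tokens(tokenize("2 dup +"), DICTIONARY)
asm = emit_asm("word_double_two", ir, DICTIONARY)
```

Step 5 (assembling and linking) is left out here; the sketch stops at the generated text.
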
## 3. Parsing & Macro System
- **Reader hooks**:
  - `read-token`: splits the byte stream; default is whitespace delimited with numeric/string literal recognizers.
  - `on-token`: user code decides whether to interpret, compile, or treat the token as syntax.
  - `lookup`: resolves token → word entry; can be replaced to build new namespaces or module systems.
- **Compile vs interpret**: Each word advertises stack effect + immediacy. Immediate words execute during compilation (macro behavior). Others emit code or inline asm.
- **Syntax morphing**: Provide primitives `set-reader`, `with-reader`, and word-lists so layers (e.g., Lisp-like forms) can be composed.
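
To make the split between immediate and ordinary words concrete, here is a small Python model of a dictionary with a replaceable `lookup` hook and an `on_token` dispatcher. The `Word` and `Compiler` classes and the `twice` word are hypothetical names invented for this sketch; the spec does not define them.

```python
class Word:
    def __init__(self, name, action, immediate=False):
        self.name = name
        self.action = action        # callable taking the compiler state
        self.immediate = immediate  # True: run at compile time (macro behavior)


class Compiler:
    def __init__(self):
        self.words = {}
        self.output = []              # compiled word references
        self.lookup = self.words.get  # replaceable hook: token -> Word or None

    def define(self, name, action, immediate=False):
        self.words[name] = Word(name, action, immediate)

    def on_token(self, token):
        """Default hook: run immediate words now, compile everything else."""
        word = self.lookup(token)
        if word is None:
            self.output.append(("lit", int(token)))
        elif word.immediate:
            word.action(self)
        else:
            self.output.append(("call", word.name))

    def compile(self, source):
        for token in source.split():
            self.on_token(token)
        return self.output


c = Compiler()
c.define("dup", lambda comp: None)
# Hypothetical immediate word: duplicates the last compiled item at compile time.
c.define("twice", lambda comp: comp.output.append(comp.output[-1]), immediate=True)
result = c.compile("7 twice dup")
```

Because `twice` is immediate, it never appears in the compiled output; it only changes what gets compiled. Assigning a different callable to `lookup` is one way a namespace or module layer could be slotted in.
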
## 4. Core Types & Data Model
- **Cells**: 64-bit signed integers; all stack operations use cells.
- **Double cells**: 128-bit values formed by two cells; used for addresses or 128-bit arithmetic.
- **Typed views**: Optional helper words interpret memory as bytes, half-words, floats, or structs but core semantics stay cell-based.
- **User-defined types**: `struct`, `union`, and `enum` builders produce layout descriptors plus accessor words that expand to raw loads/stores.
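
As a rough model of the data rules above, the following Python sketch wraps integers to 64-bit signed cells and computes a struct layout descriptor. The helper names `to_cell` and `struct_layout` are hypothetical, and aligning each field to its own size is an assumption of the sketch, not something the spec states.

```python
CELL_BITS = 64
CELL_MASK = (1 << CELL_BITS) - 1


def to_cell(value):
    """Wrap a Python int to a 64-bit signed cell (two's complement)."""
    value &= CELL_MASK
    if value >> (CELL_BITS - 1):
        return value - (1 << CELL_BITS)
    return value


def struct_layout(fields):
    """Compute byte offsets for (name, size) fields.

    Assumption: each field is aligned to its own size, and the total
    size is rounded up to the largest field alignment.
    """
    offsets, offset, max_align = {}, 0, 1
    for name, size in fields:
        offset = (offset + size - 1) // size * size  # round up to alignment
        offsets[name] = offset
        offset += size
        max_align = max(max_align, size)
    total = (offset + max_align - 1) // max_align * max_align
    return offsets, total


offsets, size = struct_layout([("tag", 1), ("count", 4), ("ptr", 8)])
# offsets == {"tag": 0, "count": 4, "ptr": 8}, size == 16
```

An accessor word for `count` would then expand to a 4-byte load or store at the struct base plus 4.
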
## 5. Stacks & Calling Convention
- **Data stack**: Unbounded; limited only by available memory. Manipulated via standard words (`dup`, `swap`, `rot`, `over`). Compiled code keeps top-of-stack in registers when possible for performance.
- **Return stack**: Used for control flow. Directly accessible for meta-programming; users must avoid corrupting call frames unless intentional.
- **Control stack**: Optional third stack for advanced flow transformations (e.g., continuations) implemented in the standard library.
- **Call ABI**: Compiled words follow the System V AMD64 calling convention: arguments are mapped from the data stack into registers before `call`, and results are pushed back afterward.
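
The data-stack words named above can be modelled on a Python list whose last element is the top of stack. This sketch assumes the conventional Forth stack effects for `dup`, `swap`, `rot`, and `over`; the spec calls them standard but does not spell the effects out.

```python
def dup(stack):
    stack.append(stack[-1])                      # ( a -- a a )


def swap(stack):
    stack[-1], stack[-2] = stack[-2], stack[-1]  # ( a b -- b a )


def rot(stack):
    stack.append(stack.pop(-3))                  # ( a b c -- b c a )


def over(stack):
    stack.append(stack[-2])                      # ( a b -- a b a )


stack = [1, 2, 3]
rot(stack)    # [2, 3, 1]
over(stack)   # [2, 3, 1, 3]
swap(stack)   # [2, 3, 3, 1]
dup(stack)    # [2, 3, 3, 1, 1]
```

A register-caching code generator would keep `stack[-1]` in a register instead of memory, but the observable stack effects stay the same.
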
## 6. Memory & Allocation
- **Linear memory primitives**: `@` (fetch), `!` (store), `+!`, `-!`, `memcpy`, `memset` translate to plain loads/stores without checks.
- **Address spaces**: Single flat 64-bit space; no segmentation. Users may map devices via `mmap` or syscalls.
- **Allocators**:
  - Default bump allocator in the runtime prelude.
  - `install-allocator` allows swapping malloc/free pairs at runtime.
  - Allocators are just words; nothing prevents multiple domains.
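
A bump allocator and a swappable allocator slot can be sketched as follows. The `BumpAllocator` and `Runtime` classes, the 8-byte alignment, and the arena sizes are assumptions made for illustration. The sketch raises `MemoryError` on exhaustion purely for convenience; the spec's primitives are unchecked.

```python
class BumpAllocator:
    """Hands out consecutive offsets from a fixed arena; free is a no-op."""

    def __init__(self, size, align=8):
        self.memory = bytearray(size)
        self.top = 0
        self.align = align

    def malloc(self, nbytes):
        start = (self.top + self.align - 1) // self.align * self.align
        if start + nbytes > len(self.memory):
            raise MemoryError("arena exhausted")
        self.top = start + nbytes
        return start

    def free(self, address):
        pass  # a bump allocator never reclaims individual blocks


class Runtime:
    def __init__(self):
        self.allocator = BumpAllocator(1024)

    def install_allocator(self, allocator):
        """Swap the active malloc/free pair, mirroring `install-allocator`."""
        self.allocator = allocator


rt = Runtime()
a = rt.allocator.malloc(3)    # 0
b = rt.allocator.malloc(16)   # 8, rounded up from 3 to the next 8-byte boundary
rt.install_allocator(BumpAllocator(64))
c = rt.allocator.malloc(8)    # 0 again, but in the new arena
```

Since the allocator is just a value held by the runtime, several independent arenas can coexist, which matches the "multiple domains" remark above.
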
## 7. Control Flow
- **Branching**: `if ... else ... then`, `begin ... until`, `case ... endcase` compile to standard conditional jumps. Users can redefine the parsing words to create new control forms.
- **Tail calls**: `tail` word emits `jmp` instead of `call`, enabling explicit TCO.
- **Exceptions**: Not baked in; provide optional libraries that implement condition stacks via return-stack manipulation.
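
The lowering of `if ... else ... then` to conditional jumps can be illustrated with a small Python pass that tracks pending labels on a compile-time stack. The function name `compile_branches`, the `.L<n>` label scheme, and the pseudo-instructions are hypothetical; popping and testing the flag before `jz` is left out for brevity.

```python
def compile_branches(tokens):
    """Lower `if ... else ... then` to labels and jumps (pseudo-instructions)."""
    out, pending, counter = [], [], 0

    def new_label():
        nonlocal counter
        label = f".L{counter}"
        counter += 1
        return label

    for tok in tokens:
        if tok == "if":
            label = new_label()
            out.append(f"jz {label}")        # skip the true branch when the flag is zero
            pending.append(label)
        elif tok == "else":
            end = new_label()
            out.append(f"jmp {end}")         # true branch jumps over the false branch
            out.append(f"{pending.pop()}:")  # false branch starts here
            pending.append(end)
        elif tok == "then":
            out.append(f"{pending.pop()}:")  # both branches rejoin here
        else:
            out.append(f"call {tok}")
    if pending:
        raise SyntaxError("unterminated if")
    return out


code = compile_branches("flag if yes else no then done".split())
```

For the input above, `code` is `call flag`, `jz .L0`, `call yes`, `jmp .L1`, `.L0:`, `call no`, `.L1:`, `call done`. Because `if`, `else`, and `then` are ordinary parsing words, a user-defined control form would follow the same push-label/pop-label pattern.
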
## 8. Inline Assembly & Low-Level Hooks
- **Asm blocks**: `asm { ... }` injects raw NASM inside a word. The compiler preserves stack/register invariants by letting asm declare its stack effect signature.
- **Asm-defined words**: `:asm name ( in -- out ) { ... } ;` generates a label and copies the block verbatim, wrapping prologue/epilogue helpers.
- **Macro assembler helpers**: Provide macros for stack slots (`.tos`, `.nos`), temporary registers, and calling runtime helpers.
## 9. Foreign Function Interface
- **Symbol import**: `c-import "libc.so" clock_gettime` loads a symbol and records its address as a constant word. Multiple libraries can be opened and cached.
- **Call sites**: `c-call ( in -- out ) symbol` pops arguments, loads System V argument registers, issues `call symbol`, then pushes return values. For variadic calls the user must set `al` to an upper bound on the number of vector (SSE) registers used for arguments, as the System V AMD64 ABI requires.
- **Struct marshalling**: Helper words `with-struct` and `field` macros emit raw loads/stores so C structs can be passed by pointer without extra runtime support.
- **Error handling**: The runtime never inspects `errno`; users can read/write the TLS slot through provided helper words.
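
A `c-call` site could be lowered roughly as in the Python sketch below, which only generates assembly text. Several things here are assumptions rather than spec: the function name `emit_c_call`, the idea that data-stack items can be fetched with `pop`, and the convention that the last C argument sits on top of the data stack. Stack alignment before `call` and stack-passed arguments are ignored.

```python
INT_ARG_REGS = ["rdi", "rsi", "rdx", "rcx", "r8", "r9"]  # System V integer argument order


def emit_c_call(symbol, n_args, variadic=False, n_vector_args=0):
    """Return asm lines for calling `symbol` with n_args integer arguments."""
    if n_args > len(INT_ARG_REGS):
        raise ValueError("stack-passed arguments are not handled in this sketch")
    lines = []
    # Assumed convention: the last argument is on top, so pop in reverse order.
    for reg in reversed(INT_ARG_REGS[:n_args]):
        lines.append(f"    pop {reg}")
    if variadic:
        # For variadic callees, al carries an upper bound on vector registers used.
        lines.append(f"    mov eax, {n_vector_args}")
    lines.append(f"    call {symbol}")
    lines.append("    push rax")  # integer return value goes back on the data stack
    return lines


printf_call = emit_c_call("printf", 2, variadic=True)
```

With two integer arguments and no floating-point ones, the generated sequence pops `rsi`, then `rdi`, zeroes `eax`, calls `printf`, and pushes `rax`.
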
## 10. Syscalls & OS Integration
- **Primitive syscall**: `syscall ( args... nr -- ret )` expects the syscall number at the top of stack and loads it into `rax`, maps the values beneath it to `rdi`, `rsi`, `rdx`, `r10`, `r8`, `r9`, runs `syscall`, and returns `rax`.
- **Wrappers**: The standard library layers ergonomic words (`open`, `mmap`, `clone`, etc.) over the primitive but exposes hooks to override or extend them.
- **Process bootstrap**: Entry stub captures `argc`, `argv`, `envp`, stores them in global cells (`argc`, `argv-base`), and pushes them on the data stack before invoking the user `main` word.
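
The register mapping performed by the primitive can be modelled in Python as below. The function `map_syscall` and its explicit `n_args` parameter are assumptions: the spec does not say how the primitive learns the argument count, nor which stack item maps to `rdi`; this sketch assumes the deepest argument is the first.

```python
SYSCALL_ARG_REGS = ["rdi", "rsi", "rdx", "r10", "r8", "r9"]


def map_syscall(stack, n_args):
    """Pop the syscall number and n_args arguments; return the register assignment."""
    nr = stack.pop()
    args = [stack.pop() for _ in range(n_args)][::-1]  # restore first-to-last order
    regs = {"rax": nr}
    regs.update(zip(SYSCALL_ARG_REGS, args))
    return regs


# write(fd=1, buf=0x1000, count=5); write is syscall number 1 on x86-64 Linux.
stack = [1, 0x1000, 5, 1]
regs = map_syscall(stack, 3)
# regs == {"rax": 1, "rdi": 1, "rsi": 0x1000, "rdx": 5}
```

The fourth argument register is `r10` rather than `rcx` because the `syscall` instruction itself overwrites `rcx`.
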
## 11. Module & Namespace System
- **Wordlists**: Dictionaries can be stacked; `within wordlist ... end` temporarily searches a specific namespace.
- **Sealing**: Wordlists may be frozen to prevent redefinition, but the default remains open-world recompilation.
- **Import forms**: `use module-name` copies references into the active wordlist; advanced loaders can be authored entirely in L2.
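
Stacked wordlists, sealing, and `use` can be modelled with a few lines of Python. The class names `Wordlist` and `SearchOrder` and the choice of `PermissionError` for a sealed wordlist are hypothetical; the sketch assumes the most recently pushed wordlist is searched first.

```python
class Wordlist:
    def __init__(self, name):
        self.name = name
        self.words = {}
        self.sealed = False

    def define(self, word, entry):
        if self.sealed:
            raise PermissionError(f"wordlist {self.name} is sealed")
        self.words[word] = entry


class SearchOrder:
    def __init__(self):
        self.stack = []

    def lookup(self, word):
        for wordlist in reversed(self.stack):  # innermost wordlist wins
            if word in wordlist.words:
                return wordlist.words[word]
        return None

    def use(self, source, target):
        """Copy references from `source` into `target`, like `use module-name`."""
        for word, entry in source.words.items():
            target.define(word, entry)


core, app, extra = Wordlist("core"), Wordlist("app"), Wordlist("extra")
core.define("dup", "core-dup")
app.define("dup", "app-dup")
extra.define("hello", "extra-hello")

order = SearchOrder()
order.stack = [core, app]
order.use(extra, app)
core.sealed = True
```

Here `order.lookup("dup")` resolves to the `app` definition because `app` sits above `core`, and any further `core.define(...)` fails once the wordlist is sealed.
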
## 12. Build & Tooling Pipeline
- **Compiler driver**: `main.py` exposes modes: `build <src> -o a.out`, `repl`, `emit-asm`, `emit-obj`.
- **External tools**: Default path is `nasm -f elf64` then `ld`; flags pass-through so users can link against custom CRT or libc replacements.
- **Incremental/JIT**: Driver may pipe asm into `nasm` via stdin and `dlopen` the resulting shared object for REPL-like workflows.
- **Configuration**: A manifest (TOML or `.sl`) records include paths, default allocators, and target triples for future cross-compilation.
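
The default external-tool path could be driven from Python along these lines. The function names `build_commands` and `build` are hypothetical and not taken from `main.py`; only `nasm -f elf64`, `-o`, and `ld -o` are used, and extra linker flags are passed through untouched.

```python
import subprocess
from pathlib import Path


def build_commands(asm_path, output="a.out", ld_flags=()):
    """Return the nasm and ld command lines without running them."""
    obj = str(Path(asm_path).with_suffix(".o"))
    return [
        ["nasm", "-f", "elf64", str(asm_path), "-o", obj],
        ["ld", obj, "-o", output, *ld_flags],
    ]


def build(asm_path, output="a.out", ld_flags=()):
    """Run both tools in order; requires nasm and ld on PATH."""
    for cmd in build_commands(asm_path, output, ld_flags):
        subprocess.run(cmd, check=True)  # raises CalledProcessError on failure


commands = build_commands("prog.asm", ld_flags=["-static"])
```

Separating command construction from execution keeps the driver testable without the tools installed, and gives `emit-obj` a natural stopping point after the first command.
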
## 13. Self-Hosting Strategy
- **Phase 1**: Python host provides tokenizer, parser hooks, dictionary, and code emitter.
- **Phase 2**: Re-implement tokenizer + dictionary in L2 using inline asm for hot paths; Python shrinks to a thin driver.
- **Phase 3**: Full self-host: compiler, assembler helpers, and driver written in L2, requiring only `nasm`/`ld`.
## 14. Standard Library Sketch
- **Core words**: Arithmetic, logic, stack ops, comparison, memory access, control flow combinators.
- **Meta words**: Reader management, dictionary inspection, definition forms (`:`, `:noninline`, `:asm`, `immediate`).
- **Allocators**: Default bump allocator, arena allocator, and hook to install custom malloc/free pairs.
- **FFI/syscalls**: Thin wrappers plus convenience words for POSIX-level APIs.
- **Diagnostics**: Minimal `type`, `emit`, `cr`, `dump`, and tracing hooks for debugging emitted asm.
## 15. Command-Line & Environment
- **Entry contract**: `main` receives `argc argv -- exit-code` on the data stack. Programs push the desired exit code before invoking `exit` or returning to runtime epilogue.
- **Environment access**: `envp` pointer stored in `.data`; helper words convert entries to counted strings or key/value maps.
- **Args parsing**: Library combinators transform `argv` into richer domain structures, though raw pointer arithmetic remains available.
## 16. Extensibility & Safety Considerations
- **Hot reload**: Redefining a word overwrites its dictionary entry and emits fresh asm. Users must relink or patch call sites if binaries are already running.
- **Sandboxing**: None by default. Documented patterns show how to wrap memory/syscall words to build capability subsets without touching the core.
- **Testing hooks**: Interpreter-mode trace prints emitted asm per word to aid verification.
- **Portability**: Spec targets x86-64 System V for now but the abstraction layers (stack macros, calling helpers) permit future backends.