SPy: architecture, WASM, MLIR, and the future of Python compilation


About these notes

This post was generated by an LLM (generative AI) from an exploratory conversation. I guided the discussion, challenged the AI’s conclusions, and verified the technical content.

These notes summarise a conversation exploring the architecture of SPy, a research Python variant designed to be both easily interpreted and compiled to native code or WebAssembly. The discussion covers SPy’s runtime design, its use of WASM as an interpreter substrate, threading and GPU topics, the numbacc project, and the future of SPy package distribution.

Background: What is SPy?

SPy is an experiment by Antonio Cuni (Anaconda) to create a Python variant that can be easily interpreted — for a good development experience — and compiled to native code or WebAssembly for performance. It is emphatically not a drop-in CPython replacement: the language design deliberately diverges from Python where necessary to enable full ahead-of-time compilation.

The architecture rests on two pillars:

  • libspy — a small C runtime library that handles memory, panics, primitive types, and other low-level concerns. It is compiled to three targets: a static native library (libspy.a), a WASI module (libspy.wasm), and an Emscripten module for the browser.

  • The SPy compiler — written in Python, it translates .spy source to C (via a cwrite backend), then invokes clang + ninja to produce a native or WASM binary.

The interpreter runs .spy code by loading libspy.wasm into the Python process via wasmtime. An llwasm abstraction layer makes this transparent: on CPython it uses wasmtime; inside Pyodide/PyScript it reuses the browser’s own WASM engine.
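The environment dispatch can be pictured with a small sketch. The helper name below is hypothetical — the real llwasm layer has a different API — but the detection trick is real: under Pyodide, sys.platform reports "emscripten".

```python
import sys

def select_wasm_engine() -> str:
    """Toy model of llwasm-style backend selection (hypothetical helper,
    not the real llwasm API): pick a WASM engine for the host environment."""
    if sys.platform == "emscripten":
        # Running inside Pyodide/PyScript: reuse the browser's own WASM engine.
        return "browser"
    # Plain CPython: embed libspy.wasm via the wasmtime package.
    return "wasmtime"

print(select_wasm_engine())  # on ordinary CPython this prints "wasmtime"
```

The point of the abstraction is that everything above this dispatch — loading libspy.wasm, calling its exports — is written once, against a single interface.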


Why libspy.wasm and Not a Native libspy.so?

This is one of the most elegant architectural decisions in SPy, and it pays multiple dividends simultaneously.

1. Sandboxing unsafe code

SPy has an “unsafe” mode that allows raw pointers and low-level struct manipulation — constructs that can trivially segfault a process. By running this code inside a WASM sandbox via wasmtime, any crash is contained within the WASM linear memory. CPython survives and receives a proper SPyError exception rather than dying. A native .so would offer no such protection without expensive subprocess tricks.

2. Multiple isolated VM instances

Each WASM instance has its own linear memory. This means you can instantiate multiple independent SPy VMs inside the same Python process at zero extra cost — something that would require careful global-state management with a shared library.
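Isolation-by-default can be modelled in plain Python — a toy analogy, not SPy's actual API: each "VM" owns a private flat byte region standing in for a WASM linear memory, so a write in one instance is invisible in the other.

```python
class ToyVM:
    """Toy analogy of a SPy VM backed by a WASM instance: each instance
    owns its own private linear memory (a flat byte region)."""
    def __init__(self, pages: int = 1) -> None:
        # A WASM page is 64 KiB; every instantiation allocates a fresh copy.
        self.memory = bytearray(pages * 64 * 1024)

    def store_u8(self, addr: int, value: int) -> None:
        self.memory[addr] = value

    def load_u8(self, addr: int) -> int:
        return self.memory[addr]

vm_a = ToyVM()
vm_b = ToyVM()
vm_a.store_u8(0, 42)
print(vm_a.load_u8(0), vm_b.load_u8(0))  # → 42 0: vm_b never sees vm_a's write
```

With a shared native libspy.so, the same guarantee would require auditing every piece of global state in the C runtime; with WASM instances it falls out of the memory model.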

3. One build artefact, two uses

WASM is already a first-class compilation target for SPy (the whole point is to produce .wasm for browser and edge deployment). Using the same libspy.wasm artefact for the interpreter means there is a single build path to maintain. A libspy.so would be a second, diverging artefact.

4. Environment portability for free

A native .so cannot be loaded inside a browser or Pyodide environment. A .wasm file works everywhere — on native CPython via wasmtime and inside a browser via its built-in WASM engine. This is how Antonio Cuni and Hood Chatham were able to build the SPy playground running entirely in the browser.


Sharing Memory Between Two SPy VM Instances

Because each WASM instance has its own linear memory, two SPy VMs cannot accidentally share state. But intentional sharing is possible through the WebAssembly threads proposal, which wasmtime implements.

A SharedMemory object can be created by the host (Python) and passed as an import to two separate libspy.wasm instances. Both instances then read and write the same underlying bytes, with atomic access primitives available for synchronisation.

The limitations are important, though:

  • Only raw bytes are shared — a flat SharedArrayBuffer-like region. There is no concept of sharing heap-allocated SPy objects, reference counts, or GC roots across instances.

  • The shared-everything threads proposal — which would allow sharing tables, functions, and GC-managed references — is still under development and not yet available in wasmtime.

  • Wasmtime’s Store architecture adds another constraint: objects from different stores cannot interact directly; SharedMemory is the designated bridge.

Practical upshot: sharing large typed data buffers (arrays of i32, f64, etc.) between two SPy VMs is feasible today and efficient. Sharing higher-level SPy values requires either serialisation or waiting for shared-everything threads.
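What "raw bytes only" means in practice can be sketched in host Python: two views over one flat buffer, reinterpreted as typed f64 lanes with memoryview.cast. This is only an analogy for the real mechanism (a wasmtime SharedMemory import) — the essential property it illustrates is that bytes, not objects, are what cross the boundary.

```python
import struct

# One flat region standing in for a shared linear memory: room for four f64s.
shared = bytearray(8 * 4)

# Two "instances" hold views over the very same underlying bytes...
view_a = memoryview(shared).cast("d")
view_b = memoryview(shared).cast("d")

view_a[0] = 3.5       # instance A writes a float...
print(view_b[0])      # ...instance B observes it immediately: 3.5

# Only bytes travel: decoding is just a reinterpretation of the region.
# There is no way to place a Python (or SPy) object reference in here.
raw = struct.unpack_from("=d", shared, 0)[0]
print(raw)            # 3.5
```

Anything richer than a typed buffer — an object graph, a closure — has to be serialised into bytes on one side and rebuilt on the other.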


OpenMP and WASM

OpenMP and WASM are not fundamentally incompatible, but they are an awkward fit.

A proof-of-concept (wasm-openmp-examples) already demonstrates compiling libomp.a to WebAssembly using wasi-threads and running it in wasmtime. However, several friction points exist:

The architectural mismatch. OpenMP’s fork-join model assumes threads share an address space and a module instance. WASM’s threading model is “instance-per-thread” — each worker thread is a separately instantiated WASM module that shares only the linear memory, not globals or the function table. The OpenMP runtime must re-implement its fork-join barrier entirely inside WASM using atomics, which works but is not how it was designed.
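The shape of that reimplementation can be mimicked in host Python — a loose analogy, not the actual libomp port: worker "instances" share only a flat array (their common linear memory) and synchronise at a barrier, which in real WASM must be built out of atomic operations on that shared memory.

```python
import threading

N = 4
shared = [0] * N                    # stands in for the shared linear memory
barrier = threading.Barrier(N + 1)  # N workers + the forking "main" thread

def worker(tid: int) -> None:
    # Each thread models a separately instantiated module: no shared globals
    # or function tables, only the flat `shared` region.
    shared[tid] = tid * tid
    barrier.wait()                  # join point; atomics-based in real WASM

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
barrier.wait()                      # main thread reaches the join point too
for t in threads:
    t.join()
print(sum(shared))                  # 0 + 1 + 4 + 9 = 14
```

In native OpenMP the barrier, the thread team, and the shared globals all come for free from the process model; under instance-per-thread each of them has to be rebuilt by hand, which is exactly the awkwardness described above.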

wasi-threads is now legacy. The proposal that enables pthreads-style threading in WASM outside the browser is now considered legacy for WASI preview1. Future work on threads will go through the shared-everything-threads proposal targeting WASI v0.2.

A WASM-native alternative: wasi-parallel. Rather than mapping OpenMP onto WASM, the ecosystem is building wasi-parallel — a WASI proposal that provides a parallel for construct designed from scratch for WASM’s constraints. This is likely a cleaner long-term path than OpenMP-on-WASM.

For SPy specifically, libspy.wasm is single-threaded today, and OpenMP is not a near-term target. Explicit multi-instance concurrency or wasi-parallel are more natural future paths.


WASM and GPU

WASM and GPU are orthogonal technologies by design — WASM is a sandboxed CPU abstraction with no notion of a GPU. The ecosystem has two answers:

In the browser: WebGPU. WASM code calls out to WebGPU (a W3C API available in Chrome, Edge, and experimentally Firefox) to dispatch work to the GPU. Emscripten already has bindings for WebGPU. The division of labour is: WASM handles CPU logic, WebGPU handles GPU kernels.

Outside the browser: wasi-gfx. For runtimes like wasmtime, wasi-gfx is a phase-2 WASI proposal that exposes GPU access through WebGPU semantics, providing component bindings via wasi-webgpu. It is not yet production-ready.


MLIR and WASM

MLIR currently compiles to WASM by lowering through the LLVM dialect and then using LLVM’s existing WASM backend. This works, but it loses structural information: WASM’s control flow is structured (block/loop/if) whereas LLVM IR is flat, so the LLVM WASM backend has to reconstruct structure using algorithms like Relooper.

Active research (the WAMI project and a 2025 RFC to the LLVM community) proposes a native WASM dialect in MLIR — lifting WASM from being an LLVM backend target to a full citizen of the MLIR ecosystem. This would allow implementing new WASM proposals by adding a dialect and a lowering pass, without needing complicated reconstruction logic.

The MLIR → GPU → WASM gap

MLIR has a mature gpu dialect with a full pipeline for generating GPU kernels (PTX/NVVM for CUDA, SPIR-V for OpenCL). What does not exist is a unified compilation path that combines WASM for CPU and GPU kernels in one target. The WebGPU path uses WGSL (WebGPU Shading Language) — a completely different IR from PTX or SPIR-V — and no bridge between MLIR’s GPU dialects and WGSL exists today.

A realistic future “SPy on the web with GPU” would probably require compiling CPU orchestration code to WASM and writing GPU kernels separately in WGSL, with the SPy compiler eventually knowing how to emit both.


numbacc: A SPy→MLIR Compiler Under the Numba Umbrella

numba/numbacc is described as “the Numba ahead-of-time compiler”, but it has essentially nothing to do with Numba the library at a technical level.

The repository contains:

  • .spy source files (including an e2e.spy end-to-end demo)

  • A nbcc/ Python package implementing the compiler pipeline

  • .mlir files as intermediate output

  • Dependencies on spy and mlir, not on numba

The actual pipeline is SPy source → SPy type inference → MLIR (linalg/affine dialects) → native binary or GPU kernels. Compare this with numba-mlir, which explicitly reuses Numba’s CPython bytecode frontend and type inference alongside its LLVM infrastructure — numbacc shares none of that.

The connection to “Numba” is organisational and aspirational: the project lives under the numba/ GitHub organisation (Antonio Cuni works at Anaconda, which sponsors both projects) and signals “this is the direction Numba could evolve towards” — a clean-slate reimagining with SPy’s type-safe frontend and MLIR’s backend, rather than a dependency on the existing Numba codebase.

numbacc and interpreter mode

numbacc is purely an AOT tool — there is no interpreter mode, by design. The two tools are complementary:

| Mode | Tool | GPU |
| --- | --- | --- |
| Development / debugging | SPy interpreter + libspy.wasm | No |
| High-performance compiled output | numbacc + MLIR pipeline | Yes (CUDA/NVVM) |
| “Interpreted” GPU | Fundamental gap | — |

GPU kernels need a physical GPU (or a deprecated software emulator) to run. There is no lightweight interpreter-mode equivalent of libspy.wasm for GPU execution — this is not a numbacc limitation, it is a property of GPU hardware.


The Future: SPy Package Distribution

Speculative territory

No SPy package format exists yet — the only supported installation is an editable git checkout. What follows is a reasoned extrapolation from SPy’s current architecture. Some design decisions are genuinely open.

What a SPy wheel might contain

A hypothetical mypackage SPy wheel would likely need three kinds of artefact:

.spy source files. The compiler needs these for cross-module type checking and redshifting — for the same reason C libraries ship header files. Without source, the SPy compiler cannot specialise calls into the package.

A precompiled .wasm library. This is the portable artefact. A .wasm file compiled from SPy→C→clang is genuinely OS- and architecture-agnostic: the same file runs on Linux, macOS, Windows, and in the browser. This is already how libspy.wasm itself works.

A native Python extension (.so/.pyd). For the Python binding layer, this is unavoidably platform-specific. Separate wheels would be needed for each OS/architecture combination, exactly as for any C-extension package today. The wheel filename would encode this in the standard way: mypackage-1.0-cp312-cp312-manylinux_2_17_x86_64.whl .

This creates an interesting asymmetry: the SPy-to-SPy consumption path could in principle use a single portable wheel (.spy source + .wasm), while the Python binding forces the familiar platform-specific wheel proliferation.

Using a SPy package from SPy: two models

Interpreter path (WASM-to-WASM). In interpreted mode the SPy interpreter already loads .wasm modules via wasmtime. A package’s precompiled .wasm could be loaded as a separate wasmtime instance, with data exchanged through shared memory. No recompilation needed; full portability preserved.

Compiled path (AOT build). When producing a native binary with spy build, two sub-options exist:

  • Recompile from .spy source: the compiler has full information, can inline and specialise across module boundaries, and produces the best possible output. Requires the full compiler toolchain at build time.

  • Link the precompiled .wasm: clang can link WASM object files into a native binary via wasm-ld. Faster builds, but cross-module optimisation is limited.

The most likely outcome is that SPy will support both — use precompiled .wasm for fast, portable deployment, and allow source recompilation for maximum performance. This mirrors the C world’s header + .a distribution model, and it matches the spirit of SPy’s redshifting philosophy.

Open design questions

Several important questions have no answer yet:

Cross-module redshifting. If mypackage.add(x, y) can be specialised for i32 arguments, does that specialisation happen at package-compile time (producing multiple .wasm variants) or at application-compile time (requiring source)? This is a fundamental trade-off between distribution convenience and runtime performance.
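The trade-off can be made concrete with a toy specialisation cache — illustrative only, this is not SPy's redshifting machinery. Specialising at "package-compile time" means pre-generating one variant per supported type and shipping them; specialising at "application-compile time" means generating variants on demand, which requires the source.

```python
def specialise_add(type_name: str):
    """Toy redshift: emit a type-specialised variant of add.
    (Hypothetical illustration, not SPy's actual mechanism.)"""
    if type_name == "i32":
        def add(x: int, y: int) -> int:
            # Wrap to 32 bits, as an i32-specialised kernel would.
            return (x + y) & 0xFFFFFFFF
    elif type_name == "f64":
        def add(x: float, y: float) -> float:
            return x + y
    else:
        raise TypeError(f"no specialisation for {type_name}")
    return add

# "Package-compile time": ship pre-built variants for the common types only.
prebuilt = {t: specialise_add(t) for t in ("i32", "f64")}
print(prebuilt["i32"](2, 3))      # → 5
print(prebuilt["f64"](0.5, 1.0))  # → 1.5
# A type outside the shipped set would need source recompilation at the
# application's build time -- the distribution/performance trade-off above.
```

The pre-built dictionary corresponds to multiple .wasm variants in a wheel; the fallback path corresponds to requiring .spy source at application-compile time.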

Python binding generation. There is currently no mechanism in SPy for auto-generating a Python extension from SPy code. This would require something like a CFFI or pybind11 equivalent for SPy — a significant piece of work that has not yet been designed.

PyPI WASM wheel support. At the Python packaging level, there is no agreed wasm32-unknown-wasi platform tag for PyPI wheels. Discussions have happened but no consensus has been reached.

The libspy.wasm versioning problem. A package’s precompiled .wasm is linked against a particular version of libspy. As libspy evolves, ABI compatibility between a package wheel and the installed SPy runtime will need to be managed — a solved problem in the C world (soname versioning) but not yet addressed for SPy.


Summary

SPy’s use of WASM as its interpreter substrate is not a performance trick — it is a deliberate architectural choice that buys sandboxing, isolation, portability, and build-system simplicity in one move. The same .wasm artefact that gets deployed to edge and browser runtimes is the one that runs inside the development interpreter.

The ecosystem around SPy (numbacc, the WASM dialect RFC, wasi-parallel, wasi-gfx) is young but coherent: each piece addresses a real gap, and the pieces fit together in a way that suggests the overall architecture is sound. The main unknowns are in the packaging and distribution layer — an area that tends to lag compiler research by several years in any language ecosystem.

References and Further Reading