About these notes
This post was generated by an LLM (generative AI) from an exploratory conversation. I guided the discussion, challenged the AI’s conclusions, and verified the technical content.
These notes summarise a conversation exploring the architecture of SPy, a research Python variant designed to be both easily interpreted and compiled to native code or WebAssembly. The discussion covers SPy’s runtime design, its use of WASM as an interpreter substrate, threading and GPU topics, the numbacc project, and the future of SPy package distribution.
Background: What is SPy?
SPy is an experiment by Antonio Cuni (Anaconda) to create a Python variant that can be easily interpreted — for a good development experience — and compiled to native code or WebAssembly for performance. It is emphatically not a drop-in CPython replacement: the language design deliberately diverges from Python where necessary to enable full ahead-of-time compilation.
The architecture rests on two pillars:
- `libspy` — a small C runtime library that handles memory, panics, primitive types, and other low-level concerns. It is compiled to three targets: a static native library (`libspy.a`), a WASI module (`libspy.wasm`), and an Emscripten module for the browser.
- The SPy compiler — written in Python, it translates `.spy` source to C (via a `cwrite` backend), then invokes `clang` + `ninja` to produce a native or WASM binary.
The interpreter runs `.spy` code by loading `libspy.wasm` into the Python process via `wasmtime`. An `llwasm` abstraction layer makes this transparent: on CPython it uses `wasmtime`; inside Pyodide/PyScript it reuses the browser’s own WASM engine.
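The backend selection that `llwasm` performs can be sketched as follows. This is a hypothetical illustration, not the real `llwasm` API: the function name is invented, and only the detection logic (Pyodide reports `sys.platform == "emscripten"`; native CPython uses the `wasmtime` package) reflects the behaviour described above.

```python
import sys

def pick_backend():
    """Choose a WASM engine for the current host (illustrative sketch)."""
    if sys.platform == "emscripten":
        # Running under Pyodide/PyScript: reuse the browser's WASM engine.
        return "browser"
    try:
        import wasmtime  # noqa: F401  (native CPython path)
        return "wasmtime"
    except ImportError:
        return None  # no engine available; the interpreter cannot start
```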
Why `libspy.wasm` and Not a Native `libspy.so`?
This is one of the most elegant architectural decisions in SPy, and it pays multiple dividends simultaneously.
1. Sandboxing unsafe code
SPy has an “unsafe” mode that allows raw pointers and low-level struct manipulation — constructs that can trivially segfault a process. By running this code inside a WASM sandbox via wasmtime, any crash is contained within the WASM linear memory. CPython survives and receives a proper `SPyError` exception rather than dying. A native `.so` would offer no such protection without expensive subprocess tricks.
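The containment pattern can be sketched in plain Python. `WasmTrap` below is a stand-in for the trap exception a runtime like wasmtime raises when sandboxed code faults; the helper names are invented for illustration.

```python
class WasmTrap(Exception):
    """Stand-in for the exception a WASM engine raises on a sandbox fault."""

class SPyError(Exception):
    """The catchable error the SPy interpreter reports to user code."""

def call_sandboxed(fn, *args):
    """Invoke a function exported by the sandboxed runtime (sketch)."""
    try:
        return fn(*args)
    except WasmTrap as trap:
        # The faulting instance's linear memory may be corrupt, but the
        # host process is untouched: rewrap the trap as a catchable error.
        raise SPyError(f"sandboxed code trapped: {trap}") from trap

def bad_pointer_deref():
    # Simulates unsafe SPy code hitting an out-of-bounds access.
    raise WasmTrap("out of bounds memory access")
```

The point is that the host only ever sees `SPyError`; the segfault-equivalent never escapes the sandbox.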
2. Multiple isolated VM instances
Each WASM instance has its own linear memory. This means you can instantiate multiple independent SPy VMs inside the same Python process at zero extra cost — something that would require careful global-state management with a shared library.
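A toy model of this isolation, with invented names: each "VM" owns a private bytearray standing in for WASM linear memory (one WASM page is 64 KiB), so writes in one instance can never be observed from another.

```python
PAGE_SIZE = 65536  # WASM linear memory is allocated in 64 KiB pages

class ToySPyVM:
    """Illustrative stand-in for one instantiated libspy.wasm module."""

    def __init__(self, pages=1):
        # Each instance gets its own private linear memory.
        self.memory = bytearray(pages * PAGE_SIZE)

    def store_u8(self, addr, value):
        self.memory[addr] = value

    def load_u8(self, addr):
        return self.memory[addr]

# Two independent VMs in one process: no shared state to manage.
vm1, vm2 = ToySPyVM(), ToySPyVM()
vm1.store_u8(0, 42)
```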
3. One build artefact, two uses
WASM is already a first-class compilation target for SPy (the whole point is to produce `.wasm` for browser and edge deployment). Using the same `libspy.wasm` artefact for the interpreter means there is a single build path to maintain. A `libspy.so` would be a second, diverging artefact.
4. Environment portability for free
A native `.so` cannot be loaded inside a browser or Pyodide environment. A `.wasm` file works everywhere — on native CPython via wasmtime and inside a browser via its built-in WASM engine. This is how Antonio Cuni and Hood Chatham were able to build the SPy playground, which runs entirely in the browser.
OpenMP and WASM
OpenMP and WASM are not fundamentally incompatible, but they are an awkward fit.
A proof-of-concept (`wasm-openmp-examples`) already demonstrates compiling `libomp.a` to WebAssembly using `wasi-threads` and running it in wasmtime. However, several friction points exist:
The architectural mismatch. OpenMP’s fork-join model assumes threads share an address space and a module instance. WASM’s threading model is “instance-per-thread” — each worker thread is a separately instantiated WASM module that shares only the linear memory, not globals or the function table. The OpenMP runtime must re-implement its fork-join barrier entirely inside WASM using atomics, which works but is not how it was designed.
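A minimal fork-join sketch in plain Python, standing in for what the OpenMP runtime must do inside WASM: workers share only a flat byte buffer (mimicking the shared linear memory) and synchronise at an explicit barrier. Python threads here are a stand-in for separately instantiated WASM worker modules.

```python
import threading

shared = bytearray(4)            # stand-in for shared linear memory
barrier = threading.Barrier(4)   # 3 workers + the "primary" thread

def worker(idx):
    shared[idx] = idx + 1        # each worker writes only its own slot
    barrier.wait()               # the join point of the fork-join region

threads = [threading.Thread(target=worker, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
worker(3)                        # the primary participates in the region
for t in threads:
    t.join()
```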
`wasi-threads` is now legacy. The proposal that enables pthreads-style threading in WASM outside the browser is now considered legacy for WASI preview1. Future work on threads will go through the `shared-everything-threads` proposal targeting WASI v0.2.
A WASM-native alternative: `wasi-parallel`. Rather than mapping OpenMP onto WASM, the ecosystem is building `wasi-parallel` — a WASI proposal that provides a parallel `for` construct designed from scratch for WASM’s constraints. This is likely a cleaner long-term path than OpenMP-on-WASM.
For SPy specifically, `libspy.wasm` is single-threaded today, and OpenMP is not a near-term target. Explicit multi-instance concurrency or `wasi-parallel` are more natural future paths.
WASM and GPU
WASM and GPU are orthogonal technologies by design — WASM is a sandboxed CPU abstraction with no notion of a GPU. The ecosystem has two answers:
In the browser: WebGPU. WASM code calls out to WebGPU (a W3C API available in Chrome, Edge, and experimentally Firefox) to dispatch work to the GPU. Emscripten already has bindings for WebGPU. The division of labour is: WASM handles CPU logic, WebGPU handles GPU kernels.
Outside the browser: `wasi-gfx`. For runtimes like wasmtime, `wasi-gfx` is a phase-2 WASI proposal that exposes GPU access through WebGPU semantics, providing component bindings via `wasi-webgpu`. It is not yet production-ready.
MLIR and WASM
MLIR currently compiles to WASM by lowering through the LLVM dialect and then using LLVM’s existing WASM backend. This works, but it loses structural information: WASM’s control flow is structured (`block`/`loop`/`if`) whereas LLVM IR is flat, so the LLVM WASM backend has to reconstruct structure using algorithms like Relooper.
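For illustration, this is what structured control flow looks like in WASM’s text format: a counted loop expressed with `block`/`loop`/`br_if`, exactly the shape a Relooper-style algorithm must reconstruct from a flat CFG (an illustrative hand-written fragment, not compiler output):

```wat
(func $count_to_ten
  (local $i i32)
  (block $exit                 ;; branching to $exit leaves the loop
    (loop $top                 ;; branching to $top restarts the loop
      (local.set $i (i32.add (local.get $i) (i32.const 1)))
      (br_if $exit (i32.ge_s (local.get $i) (i32.const 10)))
      (br $top))))
```

There is no arbitrary `goto`: every branch targets an enclosing labelled construct, which is why flat LLVM IR cannot simply be emitted one basic block at a time.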
Active research (the WAMI project and a 2025 RFC to the LLVM community) proposes a native WASM dialect in MLIR — lifting WASM from being an LLVM backend target to a full citizen of the MLIR ecosystem. This would allow implementing new WASM proposals by adding a dialect and a lowering pass, without needing complicated reconstruction logic.
The MLIR → GPU → WASM gap
MLIR has a mature `gpu` dialect with a full pipeline for generating GPU kernels (PTX/NVVM for CUDA, SPIR-V for OpenCL). What does not exist is a unified compilation path that combines WASM for CPU and GPU kernels in one target. The WebGPU path uses WGSL (WebGPU Shading Language) — a completely different IR from PTX or SPIR-V — and no bridge between MLIR’s GPU dialects and WGSL exists today.
A realistic future “SPy on the web with GPU” would probably require compiling CPU orchestration code to WASM and writing GPU kernels separately in WGSL, with the SPy compiler eventually knowing how to emit both.
numbacc: A SPy→MLIR Compiler Under the Numba Umbrella
`numba/numbacc` is described as “the Numba ahead-of-time compiler”, but at a technical level it has essentially nothing to do with Numba the library.
The repository contains:
- `.spy` source files (including an `e2e.spy` end-to-end demo)
- A `nbcc/` Python package implementing the compiler pipeline
- `.mlir` files as intermediate output
- Dependencies on `spy` and `mlir` — not on `numba`
The actual pipeline is SPy source → SPy type inference → MLIR (linalg/affine dialects) → native binary or GPU kernels. Compare this with `numba-mlir`, which explicitly reuses Numba’s CPython bytecode frontend and type inference alongside its LLVM infrastructure — numbacc shares none of that.
The connection to “Numba” is organisational and aspirational: the project lives under the `numba/` GitHub organisation (Antonio Cuni works at Anaconda, which sponsors both projects) and signals “this is the direction Numba could evolve towards” — a clean-slate reimagining with SPy’s type-safe frontend and MLIR’s backend, rather than a dependency on the existing Numba codebase.
numbacc and interpreter mode
numbacc is purely an AOT tool — there is no interpreter mode, by design. The two tools are complementary:
| Mode | Tool | GPU |
|---|---|---|
| Development / debugging | SPy interpreter + `libspy.wasm` | No |
| High-performance compiled output | numbacc + MLIR pipeline | Yes (CUDA/NVVM) |
| “Interpreted” GPU | — | Fundamental gap |
GPU kernels need a physical GPU (or a deprecated software emulator) to run. There is no lightweight interpreter-mode equivalent of `libspy.wasm` for GPU execution — this is not a numbacc limitation, it is a property of GPU hardware.
The Future: SPy Package Distribution
Speculative territory
No SPy package format exists yet — the only supported installation is an editable git checkout. What follows is a reasoned extrapolation from SPy’s current architecture. Some design decisions are genuinely open.
What a SPy wheel might contain
A hypothetical `mypackage` SPy wheel would likely need three kinds of artefact:

`.spy` source files. The compiler needs these for cross-module type checking and redshifting — for the same reason C libraries ship header files. Without source, the SPy compiler cannot specialise calls into the package.

A precompiled `.wasm` library. This is the portable artefact. A `.wasm` file compiled from SPy→C→clang is genuinely OS- and architecture-agnostic: the same file runs on Linux, macOS, Windows, and in the browser. This is already how `libspy.wasm` itself works.

A native Python extension (`.so`/`.pyd`). For the Python binding layer, this is unavoidably platform-specific. Separate wheels would be needed for each OS/architecture combination, exactly as for any C-extension package today. The wheel filename would encode this in the standard way: `mypackage-1.0-cp312-cp312-manylinux_2_17_x86_64.whl`.
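The standard wheel naming scheme can be unpacked mechanically. A small sketch using only the stdlib (the function name is mine; real tooling would use the `packaging` library and handle build tags and compressed tag sets):

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its PEP 427 components (sketch)."""
    stem = filename[: -len(".whl")]
    parts = stem.split("-")
    # distribution-version[-build]-python-abi-platform
    python_tag, abi_tag, platform_tag = parts[-3:]
    return {
        "name": parts[0],
        "version": parts[1],
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }
```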
This creates an interesting asymmetry: the SPy-to-SPy consumption path could in principle use a single portable wheel (`.spy` source + `.wasm`), while the Python binding forces the familiar platform-specific wheel proliferation.
Using a SPy package from SPy: two models
Interpreter path (WASM-to-WASM). In interpreted mode the SPy interpreter already loads `.wasm` modules via wasmtime. A package’s precompiled `.wasm` could be loaded as a separate wasmtime instance, with data exchanged through shared memory. No recompilation needed; full portability preserved.
Compiled path (AOT build). When producing a native binary with `spy build`, two sub-options exist:

- Recompile from `.spy` source: the compiler has full information, can inline and specialise across module boundaries, and produces the best possible output. Requires the full compiler toolchain at build time.
- Link the precompiled `.wasm`: clang can link WASM object files into a native binary via `wasm-ld`. Faster builds, but cross-module optimisation is limited.
The most likely outcome is that SPy will support both — use precompiled `.wasm` for fast and portable deployment, and allow source recompilation for maximum performance. This mirrors the C world’s header + `.a` distribution model, and it matches the spirit of SPy’s redshifting philosophy.
Open design questions
Several important questions have no answer yet:
Cross-module redshifting. If `mypackage.add(x, y)` can be specialised for `i32` arguments, does that specialisation happen at package-compile time (producing multiple `.wasm` variants) or at application-compile time (requiring source)? This is a fundamental trade-off between distribution convenience and runtime performance.
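One way the package-compile-time option could look, sketched with invented names: the wheel ships a small table of pre-specialised `.wasm` variants keyed by argument types, with a generic build as fallback. Nothing like this exists in SPy today.

```python
# Hypothetical variant table a package wheel might ship.
PRECOMPILED = {
    ("i32", "i32"): "add__i32_i32.wasm",
    ("f64", "f64"): "add__f64_f64.wasm",
}

def pick_variant(arg_types, default="add__generic.wasm"):
    """Select a pre-specialised .wasm, falling back to the generic build."""
    return PRECOMPILED.get(tuple(arg_types), default)
```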
Python binding generation. There is currently no mechanism in SPy for auto-generating a Python extension from SPy code. This would require something like a CFFI or pybind11 equivalent for SPy — a significant piece of work that has not yet been designed.
PyPI WASM wheel support. At the Python packaging level, there is no agreed `wasm32-unknown-wasi` platform tag for PyPI wheels. Discussions have happened but no consensus has been reached.
The `libspy.wasm` versioning problem. A package’s precompiled `.wasm` is linked against a particular version of `libspy`. As `libspy` evolves, ABI compatibility between a package wheel and the installed SPy runtime will need to be managed — a solved problem in the C world (soname versioning) but not yet addressed for SPy.
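A soname-style rule could be adapted almost directly. The sketch below is one possible policy, not an SPy decision: a package records the `libspy` ABI version it was built against, and the runtime accepts it when the major versions match and the runtime’s minor version is at least as new.

```python
def abi_compatible(built_against, runtime):
    """Soname-style check on (major, minor) ABI version tuples (sketch)."""
    built_major, built_minor = built_against
    runtime_major, runtime_minor = runtime
    # Same major = no breaking changes; newer minor = additions only.
    return built_major == runtime_major and runtime_minor >= built_minor
```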
Summary
SPy’s use of WASM as its interpreter substrate is not a performance trick — it is a deliberate architectural choice that buys sandboxing, isolation, portability, and build-system simplicity in one move. The same `.wasm` artefact that gets deployed to edge and browser runtimes is the one that runs inside the development interpreter.
The ecosystem around SPy (numbacc, the WASM dialect RFC, `wasi-parallel`, `wasi-gfx`) is young but coherent: each piece addresses a real gap, and the pieces fit together in a way that suggests the overall architecture is sound. The main unknowns are in the packaging and distribution layer — an area that tends to lag compiler research by several years in any language ecosystem.