Make your numerical Python code fly at transonic 🚀 speed!¶

Pierre Augier, Ashwin Vishnu, Pierre Blanc-fatin

Pyconfr 2019 (3 November 2019, Bordeaux)

http://tiny.cc/pyconfr2019-transonic

Presentation given by Laura Mendoza (INRIA)!

A new package
A unified modern Python API for Python/Numpy accelerators
A thin layer between developers and Pythran / Cython / Numba

https://transonic.readthedocs.io

Who am I (the guy who prepared the slides)?¶

A scientist (CNRS researcher on turbulence in stratified and rotating fluids)
A Python user / developer (FluidDyn project)

Numerical simulations (FluidSim, FluidFFT)
Image processing (FluidImage)
Lab experiments (FluidLab)

Python, only a great glue language?¶

Can we write efficient scientific/numeric packages in Python?¶

Python $\geqslant$ 3.6
Numpy API
an API to describe numerical types

$\Rightarrow$ fine language to express algorithms

But... Python interpreters bad at crunching numbers¶

CPython
PyPy + C extensions (Numpy, Scipy, ...)

$\Rightarrow$ Python scientific stack not written mainly in Python!¶

($\neq$ Julia!)

A lot of C, Fortran, C++ and Cython...

So what ?¶

Transonic is landing 🛬 !¶

Pure Python package (>= 3.6) to easily accelerate modern Python-Numpy code with different accelerators

https://transonic.readthedocs.io

Unified Python API to use different Python-Numpy accelerators

Keep your Python-Numpy code clean and "natural" 🧘
Clean type annotations (🐍 3)
Easily mix Python code and compiled functions
Ahead-of-time (AOT) and just-in-time (JIT) modes
JIT based on AOT compilers (especially for Pythran)
Accelerate functions, methods (of classes) and blocks of code

Work in progress! Current state: 3 backends based on Pythran, Cython and Numba

Few Transonic code examples¶

These codes can be accelerated with Pythran, Cython and Numba.

Ahead-of-time compilation¶

In [4]:

import numpy as np

from transonic import boost

T0 = "int[:, :]"
T1 = "int[:]"

@boost
def row_sum(arr: T0, columns: T1):
    return arr.T[columns].sum(0)

@boost(boundscheck=False, wraparound=False)
def row_sum_loops(arr: T0, columns: T1):
    # locals type annotations are used only for Cython
    i: int
    j: int
    sum_: int
    res: "int[]" = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        sum_ = 0
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res

Just-in-time compilation¶

In [5]:

import numpy as np

from transonic import jit

def add(a, b):
    return a + b

@jit
def func(a, b):
    return np.exp(a) * b * add(a, b)

Type annotations¶

In [6]:

import numpy as np
from transonic import Type, NDim, Array, boost

T = Type(int, float, np.complex128)
N = NDim(1, 2, 3)

A = Array[T, N]
A1 = Array[np.float32, N + 1]

@boost
def compute(a: A, b: A, c: T, d: A1):
    ...

`inline` functions¶

In [7]:

from transonic import boost

T = int

@boost(inline=True)
def add(a: T, b: T) -> T:
    return a + b

@boost
def use_add(n: int = 10000):
    _: int
    for _ in range(n):
        tmp = add(tmp, 1)
    return tmp

Accelerate methods¶

In [8]:

from transonic import boost

@boost
class MyClass:
    attr: int
    
    @boost
    def numerical_kernel(self, arg: int):
        return self.attr + arg

Why Transonic ?¶

Yet another Python accelerator!?¶

Not really!

We try to fix issues of our community!

Incompatible accelerators
Cython "hegemony" (C-like code)
Pythran not as used/supported as it should

Performance (generalities)¶

First nice, readable, maintainable and correct code!¶

Balance between time/energy spent, generality of the code, readability

Do not optimize everything!¶

"Premature optimization is the root of all evil" (Donald Knuth)
80 / 20 rule, efficiency important for expensive things and NOT for small things

Measure ⏱, don't guess! Profile to find the bottlenecks.¶

Unittests before optimizing to maintain correctness!¶

Use the right algorithms and the right data structures!¶

Proper compilation needed to get very high performance¶

Python interpreters bad at crunching numbers¶

$\Rightarrow$ Few numerical functions/methods have to be accelerated, i.e. optimized and compiled!¶

In [9]:

from transonic import boost

@boost
def my_numerical_kernel():
    ...

Proper compilation needed for high efficiency !¶

Not like CPython: compile(...) to (high level) virtual machine instructions (with nearly no optimization)

Compilation to machine instructions¶

One needs to write code that can be well optimized by a compiler!

Just-in-time (@jit)

Has to be fast (warm up), can be hardware specific
Ahead-of-time (@boost)

Can be slow, hardware specific or more general to distribute binaries

First step for Python: Transpilation¶

From one language to another language (for example Python to C++, or Cython to C)

Many tools to compile / accelerate Python¶

Compiled at which levels?¶

programs (Nuitka)
slowest loops (PyPy)
modules (Cython, Pythran)
user-defined functions / methods (Numba, Transonic)
blocks of code (Transonic)
expressions (Numexp)
call compiled functions (Numpy / Python)

Which subset of Python (and its numerical ecosystem)?¶

Clearly, useless to properly compile all Python features (inspect, metaclass, ...)!

At least "Python used in numerical kernels" + Numpy!
User-defined classes?

How much the compiled code uses the Python interpreter?¶

Langage: superset of Python
A great mix of Python / C / CPython C API!

Very powerfull but a tool for experts!
Easy to study where the interpreter is used (cython --annotate).
Very mature
Very efficient for C-like code (explicit loops, "low level")
Now able to use Pythran internally...

My experience: large Cython extensions difficult to maintain

Numba: (per-method) JIT for Python-Numpy code¶

Very simple to use (just add few decorators) 🙂

In [10]:

from numba import jit

@jit
def myfunc(x):
    return x**2

"nopython" mode (fast and no GIL) 🙂
Also a "python" mode 🙂
GPU and Cupy 😀
Methods (of classes) 🙂

Numba: (per-method) JIT for Python-Numpy code¶

Only JIT 🙁

Sometimes not as much efficient as it could be 🙁

(sometimes slower than Pythran / Julia / C++)

Not good to optimize high-level NumPy code 🙁

Pythran: AOT compiler for modules using Python-Numpy¶

Transpiles Python to efficient C++

Good to optimize high-level NumPy code 😎
Extensions never use the Python interpreter (pure C++ ⇒ no GIL) 🙂
Can produce C++ that can be used without Python
Usually very efficient (sometimes faster than Julia)
- High and low level optimizations
  
  (Python optimizations and C++ compilation)
- SIMD 🤩 (with xsimd)
- Understand OpenMP instructions 🤗 !
Can use and make PyCapsules (functions operating in the native word) 🙂

Pythran: AOT compiler for module using Python-Numpy¶

Compile only full modules (⇒ refactoring needed 🙁)
Only "nopython" mode
- limited to a subset of Python
  - only homogeneous list / dict 🤷‍♀️
  - no methods (of classes) 😢 and user-defined class
- limited to few extension packages (Numpy + bits of Scipy)
- pythranized functions can't call Python functions
No JIT: need types (written manually in comments)
Lengthy ⌛️ and memory intensive compilations
Debugging 🐜 Pythran requires C++ skills
No GPU (maybe with OpenMP 4?)
compilers unable to compile Pythran C++11 👎
Small community, only 1 core-dev

First conclusions¶

Python great language & ecosystem for sciences & data

Performance issues, especially for crunching numbers 🔢

⇒ need to accelerate the "numerical kernels"

Many good accelerators and compilers for Python-Numpy code
- All have pros and cons! Very different technologies!
- Diversity good for open-source
- Pythran is really great
  
  The Python community would be wise to use and support it!
⇒ We shouldn't have to write specialized code for one accelerator!

Issues¶

Incompatible accelerators + "Cython hegemony"

Transonic is landing 🛬 !¶

Pure Python package (>= 3.6) to easily accelerate modern Python-Numpy code with different accelerators

Unified Python API to use different Python-Numpy accelerators

https://transonic.readthedocs.io

Transonic: how does it work?¶

AST analyses (using Beniget, no import at compilation time)
Write the (Pythran/Cython/Numba) files when needed
Compile the (Pythran/Cython) files when needed
Use the fast solutions when available

Transonic: examples from real-life packages¶

JIT (@jit)

fluidsim/solvers/plate2d/output/correlations_freq.py
AOT compilation for functions and methods (@boost)

fluidfft/fft3d/operators.py
Blocks of code (with if ts.is_transpiled:)

fluidsim/base/time_stepping/pseudo_spect.py
Parallelism with a class (adapted from Olivier Borderies)

omp/tsp.py (OpenMP) and tsp_concurrent.py (concurrent - threads)

Also compatible with MPI!

Works also well in simple scripts and IPython / Jupyter.

Transonic example: a unique code accelerated with 3 backends¶

Benchmark ahead-of-time compilation¶

In [17]:

import numpy as np

from transonic import boost

T0 = "int[:, :]"
T1 = "int[:]"

@boost
def row_sum(arr: T0, columns: T1):
    return arr.T[columns].sum(0)

@boost
def row_sum_loops(arr: T0, columns: T1):
    # locals type annotations are used only for Cython
    i: int
    j: int
    sum_: int
    res: "int[]" = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        sum_ = 0
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res

Transonic example: a unique code accelerated with 3 backends¶

Benchmark ahead-of-time compilation¶

Python
high level: 1.31e-03 s  (=  1.00 * norm)
low level:  1.04e-01 s  (= 79.27 * norm)

Cython
high level: 1.29e-03 s  (=  0.99 * norm)
low level:  4.10e-04 s  (=  0.31 * norm)

Numba
high level: 1.04e-03 s  (=  0.80 * norm)
low level:  2.69e-04 s  (=  0.21 * norm)

Pythran
high level: 7.68e-04 s  (=  0.59 * norm)
low level:  2.55e-04 s  (=  0.19 * norm)

Transonic example: a unique code accelerated with 2 backends¶

Benchmark Just-in-time compilation¶

In [1]:

import numpy as np
from transonic import jit

@jit(native=True, xsimd=True)
def fxfy(ft, fn, theta):
    sin_theta = np.sin(theta)
    cos_theta = np.cos(theta)
    fx = cos_theta * ft - sin_theta * fn
    fy = sin_theta * ft + cos_theta * fn
    return fx, fy

@jit(native=True, xsimd=True)
def fxfy_loops(ft, fn, theta):
    n0 = theta.size
    fx = np.empty_like(ft)
    fy = np.empty_like(fn)
    for index in range(n0):
        sin_theta = np.sin(theta[index])
        cos_theta = np.cos(theta[index])
        fx[index] = cos_theta * ft[index] - sin_theta * fn[index]
        fy[index] = sin_theta * ft[index] + cos_theta * fn[index]
    return fx, fy

Transonic example: a unique code accelerated with 2 backends¶

Benchmark Just-in-time compilation¶

fxfy (pure-Numpy)                : 1.000 * norm
norm = 6.90e-04 s
fxfy_numba                       : 0.952 * norm
fxfy_loops_numba                 : 0.776 * norm
fxfy_pythran                     : 0.152 * norm
fxfy_loops_pythran               : 0.784 * norm

Pythran high-level 5 times faster than Numba with loops!¶

Conclusions¶

Transonic: a unified Python API for different Python-Numpy accelerators
- a new package
- a long term project
- a tool adapted for expert users and simple users
- see also our text on why Transonic is useful for our community
Very efficient scientific software can be written in clean modern Python

Transonic perspectives¶

Improve the API, support more Cython, Pythran and Numba (PyCapsules, dataclass, ...)
Usage in important / fundamental packages (for example Scikit-image)
Improve backends, create other backends (Cupy, PyTorch, ...)

Feedback appreciated and help needed !¶

NotImplemented¶

dataclass (`numba.jitclass` and Cython `cdef class`)¶

In [19]:

from transonic import dataclass, boost

@dataclass
class MyStruct:
    attr: int

    def compute(self, arg: int):
        return self.attr + arg

    def modify(self, arg: int):
        self.attr = arg

@boost
def func(o: MyStruct, a: int):
    o.modify(a)
    return o

@boost
def func1(o: MyStruct, a: int):
    return o.compute(a)

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-19-2550813677e6> in <module>
----> 1 from transonic import dataclass, boost
      2 
      3 @dataclass
      4 class MyStruct:
      5     attr: int

ImportError: cannot import name 'dataclass' from 'transonic' (/home/pierre/Dev/transonic/transonic/__init__.py)

Make your numerical Python code fly at transonic 🚀 speed!¶

Who am I (the guy who prepared the slides)?¶

Python, only a great glue language?¶

Can we write efficient scientific/numeric packages in Python?¶

$\Rightarrow$ fine language to express algorithms

But... Python interpreters bad at crunching numbers¶

$\Rightarrow$ Python scientific stack not written mainly in Python!¶

$\Rightarrow$ Python scientific stack not written mainly in Python!¶

So what ?¶

Transonic is landing 🛬 !¶

Few Transonic code examples¶

Ahead-of-time compilation¶

Just-in-time compilation¶

Type annotations¶

inline functions¶

Accelerate methods¶

Why Transonic ?¶

Yet another Python accelerator!?¶

Performance (generalities)¶

First nice, readable, maintainable and correct code!¶

Do not optimize everything!¶

Measure ⏱, don't guess! Profile to find the bottlenecks.¶

Unittests before optimizing to maintain correctness!¶

Use the right algorithms and the right data structures!¶

Proper compilation needed to get very high performance¶

Python interpreters bad at crunching numbers¶

$\Rightarrow$ Few numerical functions/methods have to be accelerated, i.e. optimized and compiled!¶

Proper compilation needed for high efficiency !¶

Compilation to machine instructions¶

First step for Python: Transpilation¶

Many tools to compile / accelerate Python¶

Compiled at which levels?¶

Which subset of Python (and its numerical ecosystem)?¶

How much the compiled code uses the Python interpreter?¶

Numba: (per-method) JIT for Python-Numpy code¶

Numba: (per-method) JIT for Python-Numpy code¶

Pythran: AOT compiler for modules using Python-Numpy¶

Pythran: AOT compiler for module using Python-Numpy¶

First conclusions¶

Issues¶

Transonic is landing 🛬 !¶

Transonic: how does it work?¶

Transonic: examples from real-life packages¶

Transonic example: a unique code accelerated with 3 backends¶

Benchmark ahead-of-time compilation¶

Transonic example: a unique code accelerated with 3 backends¶

Benchmark ahead-of-time compilation¶

Transonic example: a unique code accelerated with 2 backends¶

Benchmark Just-in-time compilation¶

Transonic example: a unique code accelerated with 2 backends¶

Benchmark Just-in-time compilation¶

Pythran high-level 5 times faster than Numba with loops!¶

Conclusions¶

Transonic perspectives¶

Feedback appreciated and help needed !¶

NotImplemented¶

dataclass (numba.jitclass and Cython cdef class)¶

`inline` functions¶

dataclass (`numba.jitclass` and Cython `cdef class`)¶