Make your numerical Python code fly at transonic 🚀 speed!

Pierre Augier, Ashwin Vishnu, Pierre Blanc-fatin
           
PyConFr 2019 (3 November 2019, Bordeaux)
Presentation given by Laura Mendoza (INRIA)!
  • A new package

  • A unified modern Python API for Python/Numpy accelerators

  • A thin layer between developers and Pythran / Cython / Numba

Who am I (the guy who prepared the slides)?

  • A scientist (CNRS researcher on turbulence in stratified and rotating fluids)
  • A Python user / developer (FluidDyn project)
  • Numerical simulations (FluidSim, FluidFFT)
  • Image processing (FluidImage)
  • Lab experiments (FluidLab)

Python, only a great glue language?

Can we write efficient scientific/numeric packages in Python?

  • Python $\geqslant$ 3.6

  • Numpy API

  • an API to describe numerical types

$\Rightarrow$ fine language to express algorithms

But... Python interpreters bad at crunching numbers

  • CPython

  • PyPy + C extensions (Numpy, Scipy, ...)

$\Rightarrow$ Python scientific stack not written mainly in Python!


($\neq$ Julia!)

A lot of C, Fortran, C++ and Cython...

So what?

Transonic is landing 🛬 !

Pure Python package (>= 3.6) to easily accelerate modern Python-Numpy code with different accelerators

Unified Python API to use different Python-Numpy accelerators

  • Keep your Python-Numpy code clean and "natural" 🧘

  • Clean type annotations (🐍 3)

  • Easily mix Python code and compiled functions

  • Ahead-of-time (AOT) and just-in-time (JIT) modes

  • JIT based on AOT compilers (especially for Pythran)

  • Accelerate functions, methods (of classes) and blocks of code

Work in progress! Current state: 3 backends based on Pythran, Cython and Numba

A few Transonic code examples

These codes can be accelerated with Pythran, Cython and Numba.

Ahead-of-time compilation

In [4]:
import numpy as np

from transonic import boost

T0 = "int[:, :]"
T1 = "int[:]"

@boost
def row_sum(arr: T0, columns: T1):
    return arr.T[columns].sum(0)

@boost(boundscheck=False, wraparound=False)
def row_sum_loops(arr: T0, columns: T1):
    # local type annotations are used only for Cython
    i: int
    j: int
    sum_: int
    res: "int[]" = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        sum_ = 0
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res
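A minimal usage sketch (not from the slides; the array values are arbitrary): both versions compute, for each row, the sum over the selected columns, so they can be cross-checked directly.

arr = np.arange(12).reshape(3, 4)
columns = np.array([0, 2])
# the call is the same whether or not a compiled extension is available
assert np.array_equal(row_sum(arr, columns), row_sum_loops(arr, columns))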

Just-in-time compilation

In [5]:
import numpy as np

from transonic import jit

def add(a, b):
    return a + b

@jit
def func(a, b):
    return np.exp(a) * b * add(a, b)
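A hedged usage sketch (not from the slides; the input arrays are arbitrary): the decorated function is called like any Python function, and the pure-Python version is typically used until the background compilation is ready.

a = np.linspace(0.0, 1.0, 1000)
b = np.linspace(1.0, 2.0, 1000)
result = func(a, b)  # transparent call; acceleration kicks in once compiled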

Type annotations

In [6]:
import numpy as np
from transonic import Type, NDim, Array, boost

T = Type(int, float, np.complex128)
N = NDim(1, 2, 3)

A = Array[T, N]
A1 = Array[np.float32, N + 1]

@boost
def compute(a: A, b: A, c: T, d: A1):
    ...
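A sketch of arguments matching these annotations (assumed values, not from the slides): a and b share the scalar type and dimensionality described by A, c is a matching scalar, and d has one more dimension than a and b, as float32.

a = np.zeros((4, 4))                       # float64, ndim 2 -> matches A
b = np.ones((4, 4))                        # same scalar type and ndim as a
c = 1.5                                    # matches T (float)
d = np.zeros((4, 4, 4), dtype=np.float32)  # float32, ndim 3 -> matches A1
compute(a, b, c, d)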

Inline functions

In [7]:
from transonic import boost

T = int

@boost(inline=True)
def add(a: T, b: T) -> T:
    return a + b

@boost
def use_add(n: int = 10000):
    tmp: int = 0
    for _ in range(n):
        tmp = add(tmp, 1)
    return tmp
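A quick sanity check (not in the slides): starting from 0, use_add adds 1 n times, so it simply returns n, with or without compilation.

assert use_add(10) == 10
assert use_add() == 10000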

Accelerate methods

In [8]:
from transonic import boost

@boost
class MyClass:
    attr: int
    
    @boost
    def numerical_kernel(self, arg: int):
        return self.attr + arg
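A hypothetical usage sketch (not from the slides): the boosted method is called exactly like a normal method; here attr is set by hand since the example class defines no constructor.

obj = MyClass()
obj.attr = 100
print(obj.numerical_kernel(1))  # 101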

Why Transonic?

Yet another Python accelerator!?

Not really!

We try to fix issues of our community!

  • Incompatible accelerators

  • Cython "hegemony" (C-like code)

  • Pythran not as used/supported as it should be

Performance (generalities)

First, write nice, readable, maintainable and correct code!

Balance between time/energy spent, generality of the code, readability


Do not optimize everything!

  • "Premature optimization is the root of all evil" (Donald Knuth)

  • 80 / 20 rule: efficiency is important for the expensive parts and NOT for the small things


Measure ⏱, don't guess! Profile to find the bottlenecks.


Write unit tests before optimizing to maintain correctness!


Use the right algorithms and the right data structures!


Proper compilation needed to get very high performance

Python interpreters bad at crunching numbers

$\Rightarrow$ Only a few numerical functions/methods have to be accelerated, i.e. optimized and compiled!

In [9]:
from transonic import boost

@boost
def my_numerical_kernel():
    ...

Proper compilation needed for high efficiency!

Not like CPython's compile(...), which only produces (high-level) virtual machine instructions, with nearly no optimization
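To make this concrete, the standard dis module shows the (high-level) bytecode CPython produces for a tiny function; no real optimization happens at this stage.

import dis

def add_one(x):
    return x + 1

dis.dis(add_one)  # prints LOAD_FAST / binary add / RETURN_VALUE style instructions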

Compilation to machine instructions

One needs to write code that can be well optimized by a compiler!

  • Just-in-time (@jit)

    Has to be fast (warm up), can be hardware specific

  • Ahead-of-time (@boost)

    Can be slow, hardware specific or more general to distribute binaries

First step for Python: Transpilation

From one language to another language (for example Python to C++, or Cython to C)

Many tools to compile / accelerate Python

Compiled at which levels?

  • programs (Nuitka)

  • slowest loops (PyPy)

  • modules (Cython, Pythran)

  • user-defined functions / methods (Numba, Transonic)

  • blocks of code (Transonic)

  • expressions (Numexpr), as shown in the sketch after this list

  • call compiled functions (Numpy / Python)
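As an illustration of the "expressions" level, Numexpr compiles and evaluates a single array expression passed as a string (a sketch, not from the slides; the arrays are arbitrary).

import numexpr as ne
import numpy as np

a = np.random.rand(1_000_000)
b = np.random.rand(1_000_000)
# the whole expression is evaluated in one pass, without large temporaries
result = ne.evaluate("2 * a + 3 * b")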

Which subset of Python (and its numerical ecosystem)?

Clearly, useless to properly compile all Python features (inspect, metaclass, ...)!

  • At least "Python used in numerical kernels" + Numpy!

  • User-defined classes?

How much does the compiled code use the Python interpreter?

Cython: a compiler for a superset of Python

  • Language: a superset of Python

  • A great mix of Python / C / CPython C API!

    Very powerful, but a tool for experts!

  • Easy to study where the interpreter is used (cython --annotate).

  • Very mature

  • Very efficient for C-like code (explicit loops, "low level")

  • Now able to use Pythran internally...

My experience: large Cython extensions difficult to maintain

Numba: (per-method) JIT for Python-Numpy code

  • Very simple to use (just add a few decorators) 🙂

In [10]:
from numba import jit

@jit
def myfunc(x):
    return x**2

  • "nopython" mode (fast and no GIL) 🙂

  • Also a "python" mode 🙂

  • GPU and CuPy 😀

  • Methods (of classes) 🙂

Numba: (per-method) JIT for Python-Numpy code

  • Only JIT 🙁


  • Sometimes not as efficient as it could be 🙁

    (sometimes slower than Pythran / Julia / C++)


  • Not good to optimize high-level NumPy code 🙁

Pythran: AOT compiler for modules using Python-Numpy

Transpiles Python to efficient C++

  • Good to optimize high-level NumPy code 😎

  • Extensions never use the Python interpreter (pure C++ ⇒ no GIL) 🙂

  • Can produce C++ that can be used without Python

  • Usually very efficient (sometimes faster than Julia)

    • High and low level optimizations

      (Python optimizations and C++ compilation)

    • SIMD 🤩 (with xsimd)

    • Understand OpenMP instructions 🤗 !

  • Can use and make PyCapsules (functions operating in the native world) 🙂

Pythran: AOT compiler for modules using Python-Numpy

  • Compiles only full modules (⇒ refactoring needed 🙁)

  • Only "nopython" mode

    • limited to a subset of Python

      • only homogeneous list / dict 🤷‍♀️
      • no methods (of classes) 😢 and no user-defined classes
    • limited to a few extension packages (Numpy + bits of Scipy)

    • pythranized functions can't call Python functions

  • No JIT: needs types (written manually in comments, as shown in the sketch after this list)

  • Lengthy ⌛️ and memory-intensive compilations

  • Debugging 🐜 Pythran requires C++ skills

  • No GPU (maybe with OpenMP 4?)

  • Intel compilers unable to compile Pythran C++11 👎

  • Small community, only 1 core-dev
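To give an idea of plain Pythran usage (a sketch, not from the slides; the module and function names are made up): the types are declared in a special comment and the whole module is compiled with the pythran command.

# contents of mymod.py, compiled with `pythran mymod.py`
# pythran export rotate(float[], float[], float[])
import numpy as np

def rotate(ft, fn, theta):
    sin_theta = np.sin(theta)
    cos_theta = np.cos(theta)
    return cos_theta * ft - sin_theta * fn, sin_theta * ft + cos_theta * fn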

First conclusions

  • Python great language & ecosystem for sciences & data
  • Performance issues, especially for crunching numbers 🔢

    ⇒ need to accelerate the "numerical kernels"

  • Many good accelerators and compilers for Python-Numpy code

    • All have pros and cons! Very different technologies!

    • Diversity good for open-source

    • Pythran is really great

      The Python community would be wise to use and support it!

    ⇒ We shouldn't have to write specialized code for one accelerator!

Issues

Incompatible accelerators + "Cython hegemony"

Transonic is landing 🛬 !

Pure Python package (>= 3.6) to easily accelerate modern Python-Numpy code with different accelerators

Unified Python API to use different Python-Numpy accelerators

Transonic: how does it work?

  • AST analyses (using Beniget, no import at compilation time)

  • Write the (Pythran/Cython/Numba) files when needed

  • Compile the (Pythran/Cython) files when needed

  • Use the fast solutions when available
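For the row_sum example above, the Pythran file written by Transonic looks roughly like this (a simplified sketch; the exact generated code and file layout differ):

# simplified sketch of a generated __pythran__/... file
# pythran export row_sum(int[:, :], int[:])

def row_sum(arr, columns):
    return arr.T[columns].sum(0)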

Transonic: examples from real-life packages

Also works well in simple scripts and in IPython / Jupyter.

Transonic example: a unique code accelerated with 3 backends

Benchmark ahead-of-time compilation

In [17]:
import numpy as np

from transonic import boost

T0 = "int[:, :]"
T1 = "int[:]"

@boost
def row_sum(arr: T0, columns: T1):
    return arr.T[columns].sum(0)

@boost
def row_sum_loops(arr: T0, columns: T1):
    # local type annotations are used only for Cython
    i: int
    j: int
    sum_: int
    res: "int[]" = np.empty(arr.shape[0], dtype=arr.dtype)
    for i in range(arr.shape[0]):
        sum_ = 0
        for j in range(columns.shape[0]):
            sum_ += arr[i, columns[j]]
        res[i] = sum_
    return res

Transonic example: a unique code accelerated with 3 backends

Benchmark ahead-of-time compilation

Python
high level: 1.31e-03 s  (=  1.00 * norm)
low level:  1.04e-01 s  (= 79.27 * norm)

Cython
high level: 1.29e-03 s  (=  0.99 * norm)
low level:  4.10e-04 s  (=  0.31 * norm)

Numba
high level: 1.04e-03 s  (=  0.80 * norm)
low level:  2.69e-04 s  (=  0.21 * norm)

Pythran
high level: 7.68e-04 s  (=  0.59 * norm)
low level:  2.55e-04 s  (=  0.19 * norm)

Transonic example: a unique code accelerated with 2 backends

Benchmark Just-in-time compilation

In [1]:
import numpy as np
from transonic import jit

@jit(native=True, xsimd=True)
def fxfy(ft, fn, theta):
    sin_theta = np.sin(theta)
    cos_theta = np.cos(theta)
    fx = cos_theta * ft - sin_theta * fn
    fy = sin_theta * ft + cos_theta * fn
    return fx, fy

@jit(native=True, xsimd=True)
def fxfy_loops(ft, fn, theta):
    n0 = theta.size
    fx = np.empty_like(ft)
    fy = np.empty_like(fn)
    for index in range(n0):
        sin_theta = np.sin(theta[index])
        cos_theta = np.cos(theta[index])
        fx[index] = cos_theta * ft[index] - sin_theta * fn[index]
        fy[index] = sin_theta * ft[index] + cos_theta * fn[index]
    return fx, fy

Transonic example: a unique code accelerated with 2 backends

Benchmark Just-in-time compilation

fxfy (pure-Numpy)                : 1.000 * norm
norm = 6.90e-04 s
fxfy_numba                       : 0.952 * norm
fxfy_loops_numba                 : 0.776 * norm
fxfy_pythran                     : 0.152 * norm
fxfy_loops_pythran               : 0.784 * norm

Pythran high-level 5 times faster than Numba with loops!

Conclusions

  • Transonic: a unified Python API for different Python-Numpy accelerators

  • Very efficient scientific software can be written in clean modern Python

Transonic perspectives

  • Improve the API, support more Cython, Pythran and Numba (PyCapsules, dataclass, ...)

  • Usage in important / fundamental packages (for example Scikit-image)

  • Improve backends, create other backends (Cupy, PyTorch, ...)

Feedback appreciated and help needed!

NotImplemented

dataclass (numba.jitclass and Cython cdef class)

In [19]:
from transonic import dataclass, boost

@dataclass
class MyStruct:
    attr: int

    def compute(self, arg: int):
        return self.attr + arg

    def modify(self, arg: int):
        self.attr = arg

@boost
def func(o: MyStruct, a: int):
    o.modify(a)
    return o

@boost
def func1(o: MyStruct, a: int):
    return o.compute(a)
---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-19-2550813677e6> in <module>
----> 1 from transonic import dataclass, boost
      2 
      3 @dataclass
      4 class MyStruct:
      5     attr: int

ImportError: cannot import name 'dataclass' from 'transonic' (/home/pierre/Dev/transonic/transonic/__init__.py)