Pierre Augier, Ashwin Vishnu, Cyrille Bonamy, Antoine Campagne, Julien Salort


Coopcalcul, LEGI, 01/02/2018
http://www.legi.grenoble-inp.fr/people/Pierre.Augier/

Open-source dynamics and their consequences for research and teaching in fluid mechanics

1. Current dynamics in computing technologies and their implications for fluid mechanics

2. The FluidDyn open-science project

   

3. A proposal (my idea)

Rebalance the laboratory's shared funding between proprietary codes and open-source projects useful to the laboratory

$\newcommand{\kk}{\boldsymbol{k}} \newcommand{\eek}{\boldsymbol{e}_\boldsymbol{k}} \newcommand{\eeh}{\boldsymbol{e}_\boldsymbol{h}} \newcommand{\eez}{\boldsymbol{e}_\boldsymbol{z}} \newcommand{\cc}{\boldsymbol{c}} \newcommand{\uu}{\boldsymbol{u}} \newcommand{\vv}{\boldsymbol{v}} \newcommand{\bnabla}{\boldsymbol{\nabla}} \newcommand{\Dt}{\mbox{D}_t} \newcommand{\p}{\partial} \newcommand{\R}{\mathcal{R}} \newcommand{\eps}{\varepsilon} \newcommand{\mean}[1]{\langle #1 \rangle} \newcommand{\epsK}{\varepsilon_{\!\scriptscriptstyle K}} \newcommand{\epsA}{\varepsilon_{\!\scriptscriptstyle A}} \newcommand{\epsP}{\varepsilon_{\!\scriptscriptstyle P}} \newcommand{\epsm}{\varepsilon_{\!\scriptscriptstyle m}} \newcommand{\CKA}{C_{K\rightarrow A}} \newcommand{\D}{\mbox{D}}$

Part 1:

Current dynamics in computing technologies and their implications for fluid mechanics

The computer/web revolution, open-source, science and software

  • A revolution related to computers and www

  • Open-source is at the center of these technologies

    • Open-source is used by these technologies
    • Open-source uses these technologies

    $\Rightarrow$ A new open-source has emerged

  • This disrupts the way science is done: software at the center + sharing through the web

Methods and tools for open-source software engineering

  • Distributed source management tools and source development web-platforms

    Remark on Mercurial: easier to learn than Git / safer for beginners / as powerful for experts / can work with Git repositories (GitHub, GitLab) using hg-git

  • Third-party software repositories

    • CPAN for Perl,
    • Python Package Index (PyPI) and Anaconda,
    • ...
  • Unit tests and continuous integration platforms: Travis, Bitbucket Pipeline, Codecov

  • Websites to share knowledge

  • Automatic web documentation built on servers (Sphinx, Doxygen, Read the Docs)

Python, a programming language adapted for open-science

Designed to boost the communication of technical ideas between humans. Code as simple as possible to emphasize code readability. Humans can focus on the ideas.

  • Nice and elegant syntax. Blocks of code defined by indentation.

  • Explicit guidelines (PEP 8) for code regularity (and thereby readability).

  • Dynamic language. Variables (names) are not attached for life to an object. Types of the objects are determined at run time rather than declared.

  • Automatic memory management.

  • "Interpreted language": an interpreter executes the code instruction by instruction.

    No proper compilation: no translation to optimized machine instructions.

    • 2 advantages: shorter development cycle + interactive workflow (IPython, Jupyter)

    • 2 drawbacks...

Python is simple but also very powerful

  • Multiple programming paradigms, including imperative, object-oriented and functional.

  • Easy to interact with code written in other languages (in particular C, C++ and Fortran).

  • Large and high quality standard library.

  • Can be run on machines with different

    • operating systems (Linux, Windows, OSX, Android) and
    • architectures (from a microcontroller - with micropython - to a Blue Gene supercomputer).
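
As a minimal sketch of the first point above (multiple paradigms), the same computation, a sum of squares, can be written in an imperative, a functional and an object-oriented style. The names `values` and `SquareAccumulator` are hypothetical and chosen only for this illustration.

```python
from functools import reduce

values = [1, 2, 3, 4]

# Imperative style: explicit loop and mutable accumulator.
total_imperative = 0
for v in values:
    total_imperative += v * v

# Functional style: fold the list with a pure function.
total_functional = reduce(lambda acc, v: acc + v * v, values, 0)


# Object-oriented style: the state lives in an object.
class SquareAccumulator:
    """Hypothetical helper class, used only for this illustration."""

    def __init__(self):
        self.total = 0

    def add(self, value):
        self.total += value * value


accumulator = SquareAccumulator()
for v in values:
    accumulator.add(v)

print(total_imperative, total_functional, accumulator.total)
```

The three variants give the same result, so the style can be chosen according to what reads best for the problem at hand.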

A "glue" language for fast prototyping

Development much faster and easier than with most other languages.

Fewer bugs, if only because there are far fewer lines of code.


Nice learning curve

Python is good for developers of all levels:

  • very gentle for beginners
  • very powerful for advanced users.

An old language (first implementation in 1991!) which continues to evolve

   
  • For science, the transition Python 2 $\rightarrow$ Python 3 is basically over.

  • Interesting new features in recent versions. For Python 3.5 (first released in September 2015):

    • the @ operator for matrix multiplication,
    • the new async and await keywords for concurrency,
    • type hinting (with the module typing and an associated syntax, PEP 484) to allow for optional type checking (see the recent project Mypy).
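
A short sketch of the three features listed above, using only the standard library. The class `Vec` and the coroutines `double_later` and `gather_doubles` are hypothetical names introduced for this illustration; in practice the `@` operator is mostly used with numpy arrays, and the type hints are only checked if an external tool such as Mypy is run on the file.

```python
import asyncio
from typing import List


class Vec:
    """Tiny hypothetical vector class, only to illustrate the @ operator."""

    def __init__(self, values: List[float]) -> None:
        self.values = values

    def __matmul__(self, other: "Vec") -> float:
        # a @ b calls a.__matmul__(b); here it is defined as the dot product.
        return sum(x * y for x, y in zip(self.values, other.values))


async def double_later(x: int) -> int:
    # await hands control back to the event loop while "waiting".
    await asyncio.sleep(0)
    return 2 * x


async def gather_doubles() -> List[int]:
    # Run the three coroutines concurrently and collect their results.
    return list(await asyncio.gather(*(double_later(i) for i in range(3))))


dot = Vec([1.0, 2.0, 3.0]) @ Vec([4.0, 5.0, 6.0])

loop = asyncio.new_event_loop()
doubles = loop.run_until_complete(gather_doubles())
loop.close()

print(dot, doubles)
```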

An incredible success

Large and supportive community (see Stackoverflow tags).

Several companies using Python and the open-source dynamics.

Versatile language, Python widely used for several applications:

  • Simple scripting.

  • System, database and network administration.

  • Linux distribution software.

  • Main scripting language to add programmability to applications (Paraview, Visit, QGIS, Blender, ...).

  • Web servers.

  • Web scraping and data analysis.

  • Animation movies, game development and gaming.

  • Education.

  • Science!

Python for science

Widely used in scientific applications. Mature and powerful scientific ecosystem with:

  • numpy for N-dimensional homogeneous arrays,
  • scipy for fundamental features of scientific computing,
  • matplotlib for plotting
  • pandas for data structures

and several specialized packages (h5py, mpi4py, skimage, sklearn, ...).

Great tools for most of the applications. See landscape of Python visualization tools.

Data science

Main language for data science with pandas, statsmodels, sklearn, keras and tensorflow.

Python for science

Ready-to-use distributions

(similar to Matlab) freemium open-source distribution Anaconda $\Rightarrow$ very easy to use Python for scientific purposes.

Jupyter notebooks

Open-source web application to create and share documents that contain live code, equations, visualizations and narrative text.

Nice Python Integrated Development Environments (IDE) adapted for scientists

(similar to Matlab)

Some Python issues

1. No proper compilation

  • No type-checking (but Mypy is coming...)

  • In some CPU-bound cases, Python code can be too slow.

    One then has to use tools that produce optimized machine instructions.

Ahead Of Time compilation

First solution: use C, C++ or Fortran to speedup the performance critical code.

  • Libraries can be used directly
  • Extensions: Python modules written in C or C++ using the CPython C API

No need to write the extensions in C or C++: tools to make them from (nearly) Python code.


Cython: the most widely used static compiler for Python

Code in C/C++ style with a syntax similar to Python (but with type declarations).

Very powerful, but to really understand it one needs to know both C and Python.

Pythran: an open-source compiler for scientific kernels written in Python.

Creates compiled extensions from pure Python code with simple type annotations written in comments.

Resulting extensions usually as fast as Fortran or C++ written by non-specialists.
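
A minimal sketch of what such a kernel looks like, in the spirit of the introductory example of the Pythran documentation. The file name and the function name `dprod` are illustrative. The `#pythran export` comment is ignored by the regular interpreter, so the file remains valid pure Python; the compilation step itself (for instance with the `pythran` command-line tool) is assumed to be done separately and is not shown here.

```python
# File dprod.py (hypothetical name)

#pythran export dprod(int list, int list)
def dprod(l0, l1):
    """Naive dot product of two lists of integers."""
    return sum(x * y for x, y in zip(l0, l1))


# Without compilation, the function simply runs as ordinary Python.
print(dprod([1, 2, 3], [4, 5, 6]))
```

Because the annotation lives in a comment, the same source can be used with or without Pythran installed.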

Very interesting two-step compilation:

  1. code optimized at the Python level,
  2. automatically produced C++ code properly compiled.

Pythran supports:

  • OpenMP pragma (without the GIL limitation),
  • vectorizations with SIMD instructions,
  • both numpy-like vectorized code and C-like code with explicit loops.

Just In Time (JIT) compilation

JIT compilation: compiling only the critical code at run time.

Good results with other languages such as JavaScript, Matlab or Julia.

PyPy (an alternative interpreter written in Python) has a JIT compiler. However, PyPy is not widely used for scientific applications, mainly because of compatibility problems with extensions.

Adding a JIT to CPython is notoriously difficult.

Less money and work have been put into accelerating Python than, for example, into Java and JavaScript.

Numba: a JIT for CPython through an external package. Numba can take advantage of the GPU.
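
A hedged sketch of how Numba is typically used on a loop-heavy function. The function `sum_of_squares` is a hypothetical example. The fallback decorator is an assumption added here so that the snippet also runs where Numba is not installed; in that case the function is simply executed by the interpreter without any speed-up.

```python
try:
    from numba import njit  # JIT decorator provided by Numba
except ImportError:
    def njit(func):
        # Fallback (assumption for this sketch): no compilation, plain Python.
        return func


@njit
def sum_of_squares(n):
    # Explicit loop: slow in pure CPython, compiled at first call by Numba.
    total = 0.0
    for i in range(n):
        total += i * i
    return total


result = sum_of_squares(10)
print(result)
```

With Numba available, the first call triggers the compilation and later calls reuse the compiled machine code.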

Some Python issues

2. A lively, huge and thus complicated ecosystem

A versatile language $\Rightarrow$ for many applications, one needs external packages.

For most of the applications: several projects usable through Python.

A user/developer has to make choices. It can be difficult to understand a "landscape of projects" and to make good technological choices.

  • Important to get a good introduction on the open-source dynamics and the scientific Python ecosystem.

  • Useful to ask experts.

Python is much richer and more diverse than, for example, Matlab.

Often an advantage, but it can be an issue.

Some Python issues

3. Multicore computational parallelism using threads ($\sim$ light subprograms).

CPython prevents threads from interpreting Python code at the same time.

The Global Interpreter Lock (GIL)

PyPy has a GIL, while Jython and IronPython do not have this limitation.

Threads used in Python for concurrency, i.e. to perform i/o concurrently.

For proper computational parallelism with the GIL:

  • extensions (fine grain parallelism),
  • multiprocessing (coarse grain parallelism) and
  • processes communicating for example with MPI (with mpi4py) or zeromq.
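
To illustrate the remark above that threads are used for concurrency (typically waiting on I/O), here is a minimal sketch with the standard library. The function `fake_io` is a hypothetical stand-in for a real I/O operation such as reading a sensor or downloading a file.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_io(task_id):
    """Stand-in for an I/O operation (hypothetical): it only waits."""
    time.sleep(0.1)  # the GIL is released while the thread sleeps
    return task_id * 10


start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as executor:
    # The four waits overlap, so the total time is close to one wait.
    results = list(executor.map(fake_io, range(4)))
elapsed = time.perf_counter() - start

print(results)
print("elapsed: {:.2f} s".format(elapsed))
```

For CPU-bound pure Python code the same pattern would not give a speed-up because of the GIL; this is where extensions, multiprocessing or MPI come in.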

Rich landscape of open-source languages useful for science

  • Fortran

  • C

  • C++ (C++11, C++14)

  • Python

  • Several others: JavaScript, Java, Scala, Smalltalk, Haskell, R, Julia and Lua

  • New: Go and Rust

No perfect language: each language has strong points and weaknesses.

Rich landscape of open-source languages useful for science

Predict the future!?

The idea of "one language to do everything for science" will not succeed, at least not soon.

⇒ Important aspect: interoperability between coexisting tools.

So what about Python?

  • Today, the strongest dynamics in science and data analysis.
  • Many students learn it and will learn it. Many scientists will like the Pythonic approaches.

Python should be able to

  • embrace the new technological trends and
  • overcome some of its current limitations.

Python will continue to shine as

  • a language particularly good for human communication and to write ideas

  • a language for fast prototyping

  • a glue language able to interact nicely with code written in other languages (see cppyy and pybind11, or the possibility to interact with Rust code).

Conclusion: Python, a versatile Swiss Army knife for scientists

  • Great versatile tool for scientists.

  • Impressive scientific ecosystem.

  • Very strong and quickly growing scientific Python community.

A scientist with good skills in Python can do most of what she/he needs to do. (which does not mean that Python is the best language to do everything!)

In contrast, no skill in Python is a real disadvantage for many tasks and for employability.

$\Rightarrow$ In most fields, today, if a scientist has to acquire good skills in one language, it should be Python.

Productivity at individual, group and community levels

Short-term efficiency and long-term efficiency are often incompatible:

quick and dirty scripts efficient in the short term, but ...


Similarly, productivity at different scales:

  • individual

  • group

  • community

We have to consider these conflicts when choosing between different technologies.

Programming in the field of fluid mechanics

Fluid mechanics:

  • laboratory experiments,
  • in situ measurements,
  • analytic computation,
  • numerical simulations and
  • data processing.

Programming is everywhere...

However, the level of software engineering in the community is very low!

  • Not unusual to start a PhD without any serious training in Linux and in programming with modern tools.

  • Few people aware of the challenges and opportunities of open-source.

Programming in the field of fluid mechanics

Engineering science $\Rightarrow$ closed-source commercial software is strong.

The vicious circle of the closed-source model

  • A group pays for a license or a new development.

  • It does not learn how to do what it has paid for.

  • Through its money and its feedback, it helps improve the product sold by the company.

  • The group produces codes, books and courses using the closed-source product, or acquires knowledge on how to use the closed-source product.

  • The group needs the product more and is ready to pay more for it.

Such a circle is difficult to break, but it can be done with open-source solutions (a similar circle, but a virtuous one).

Programming in the field of fluid mechanics

Scientists produce code. But without the technical knowledge of how to work collectively on code, they produce low-quality code that is doomed to be abandoned soon.

However, one tries to reuse code...

Thousands of hours of highly qualified people are spent in trying to understand and reuse code badly written with inappropriate languages!

Coding in the field:

  • Mix of Fortran/C or C++, a shell language (such as bash) and Matlab.

  • For experiments, the graphical programming environment Labview.

Languages are often used for tasks they are not suited to

  • Fortran, C or C++ are inadequate for fast prototyping of complex programs.

  • For scientific purposes, one should use shell scripting only for extremely simple tasks.

  • Matlab is not suited to complex programs.

Problems for code reuse and sharing and collaborative development.

The Matlab issue

Matlab is a closed-source proprietary numerical computing environment.

Not so bad for individual productivity

  • A good tool for simple processing with matrices and data plotting.

    Language well adapted for these tasks.

  • Nice development environment.

  • Quite fast interpreter now that it has a JIT compiler.

Serious technical issues of the Matlab language

Bad tool for doing more than simple processing and data plotting.

1. One file is needed for each function that must be callable from outside the file where it is implemented

2. No real organization of the standard library

  • All functions available in a huge flat namespace.

  • No import mechanism, so looking at the code does not tell us where a function comes from.
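
For comparison, a small sketch of how Python's import mechanism makes the origin of a function explicit. The two standard-library modules used here both define a function called `sqrt`; the variable names are chosen only for this illustration.

```python
import math
import cmath

# Two functions share the name sqrt; the module prefix says which one is used.
real_root = math.sqrt(4)       # real-valued square root
complex_root = cmath.sqrt(-4)  # complex-valued square root

# The origin of a function can also be inspected at run time.
origin = math.sqrt.__module__

print(real_root, complex_root, origin)
```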

3. Global variables and files

Scripts that modify and define global variables (see this Matlab version of Diablo).

The language strongly encourages this practice.

Matlab files are not self-contained, i.e. it is normal for a file to use a global variable defined outside of the file.

4. Very bad default argument mechanism

In python:

In [ ]:
def myfunc(a, b, c=1, has_to_print=True):
    if has_to_print:
        print('a =', a, 'b =', b, 'c =', c)
    return c * (a + b)

In Matlab:

function ret = myfunc(a, b, varargin)
    if nargin < 2 || nargin > 4
        error(['The number of arguments has to be ' ...
               'at least 2 and at most 4'])
    end
    if nargin == 4
        has_to_print = varargin{2};
    else
        has_to_print = 1;
    end
    if nargin >= 3
        c = varargin{1};
    else
        c = 1;
    end

    if has_to_print
        disp(['a = ' num2str(a) '; b = ' num2str(b) '; c = ' num2str(c)])
    end
    ret = c * (a + b);
end
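
To make the contrast concrete, here is a sketch of how the Python version above can be called. The function is repeated so that the snippet is self-contained; the result variable names are hypothetical.

```python
def myfunc(a, b, c=1, has_to_print=True):
    if has_to_print:
        print('a =', a, 'b =', b, 'c =', c)
    return c * (a + b)


# Both optional arguments take their default values.
r_default = myfunc(1, 2)

# Override c only, by name.
r_custom_c = myfunc(1, 2, c=3)

# Override the flag by name without having to pass c.
r_silent = myfunc(1, 2, has_to_print=False)

print(r_default, r_custom_c, r_silent)
```

With the Matlab version shown above, the last call is not possible as such: to set `has_to_print`, the caller must also pass a value for `c`.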

5. Poor syntax for string operations

  • String comparison, strcmp(s1, 'fit') in Matlab versus s1 == 'fit' in Python
  • Look up, contains(s1, 'fit') in Matlab versus 'fit' in s1 in Python.
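
A short runnable sketch of the Python side of this comparison. The string value and the variable names are chosen only for illustration.

```python
s1 = 'best fit'

is_fit = (s1 == 'fit')          # comparison with the == operator
has_fit = 'fit' in s1           # lookup with the in keyword
starts = s1.startswith('best')  # methods are attached to the string object
upper = s1.upper()

print(is_fit, has_fit, starts, upper)
```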

6. Weak error handling mechanisms

7. Bad object oriented programming model and syntax

8. Parentheses for function calls and for indexing

9. Bad readability

  • Usually full of ;, .*, ./, && and || (quite "noisy").

  • Nothing like the Python PEP 8.

10. Syntax a(100, 100) = 1; to create and extend a matrix

  • No error with a = eye(2); a(i0, i1) = 1; even for crazy values of i0 and i1!

  • No error if a user misspells a variable var and writes something like vae(100, 100) = 1;.
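
By contrast, a sketch of what happens in Python in the two situations above. Nested lists are used here as a stand-in for a matrix so that the snippet needs only the standard library (numpy arrays also raise an `IndexError` for out-of-bounds assignments). The misspelled name `vae` is deliberate.

```python
a = [[1, 0], [0, 1]]  # 2 x 2 identity as nested lists

try:
    a[100][100] = 1  # out-of-bounds assignment
except IndexError as error:
    index_message = 'IndexError: ' + str(error)

try:
    vae[100][100] = 1  # typo: the variable vae was never defined
except NameError as error:
    name_message = 'NameError: ' + str(error)

print(index_message)
print(name_message)
```

Both mistakes are reported immediately instead of silently creating or resizing an array.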

Matlab is closed-source software $\Rightarrow$ a big black box

Impossible to study the implementation of its functions.

Using Matlab is not free.

One license (for non-commercial use) not very expensive for most research and teaching institutes.

However, most codes need toolboxes... Non negligible cost for some institutions.

Multi-processing (one license per process): very expensive!

Cracked versions of Matlab? Illegal!

Open-source interpreters for Matlab language (Octave and Scilab)

   

Unfortunately, they are not serious alternatives today.

  • Too small user community.

  • Not very efficient (now optimized Matlab codes rely on its JIT).

Common arguments pro Matlab and against Python

Integrated environment

True ... but now Anaconda, Spyder and JupyterLab.


Interactive plot

True with basic Matplotlib... but several other tools for interactive plot.

Common arguments pro Matlab and against Python

Python needs add-on libraries

Python bad for science because

  • need "add-on libraries for performing even basic mathematics" and
  • the "quality, comprehensiveness, and maintenance [of these libraries] varies widely".

Very mature, robust, well documented and integrated libraries (numpy, scipy and IPython).

Amateurism

The claim: scientific Python would be badly documented and buggy. But Python for science today is not old-school open-source.

  • New open-source methods give impressive results in terms of code quality.

  • Most of the big Python packages supported by companies or research institutes:

    Numba (Anaconda, Nvidia), TensorFlow (Google), Scikit-learn (INRIA), Mercurial (Facebook).

Matlab would be "the more natural way to express computational mathematics"

Let's compare code. First the Matlab version:

% matrix creation (row's shape is [1, 3])
row = [1 2 3];
% matrix transposition
col = row';
% matrix multiplication
inner = row * col;
outer = col * row;
% element-by-element multiplication and division
row2 = row .* (row + 1);
result = row2 ./ (row2 + 1);

and then the equivalent code in Python 3.6 (using numpy):

In [1]:
import numpy as np
# creation of a 2D array with shape (1, 3)
row = np.array([1, 2, 3], ndmin=2)
# matrix transposition
col = row.T
# matrix multiplication with the symbol @
inner = row @ col
outer = col @ row
# element-by-element multiplication and division
row2 = row * (row + 1)
result = row2 / (row2 + 1)

Teaching and scientific books mostly in Matlab?

Nowadays wrong...

  • Python most popular language for teaching

  • Searches on YouTube in early 2018 give:

    • 3,800,000 videos on Python and
    • 900,000 on Matlab.

Support

Matlab team praises their support and claims that no support exists for Python.

Completely wrong...

  • Very reactive support by the community (Stackoverflow, ...)

  • Professional support on scientific Python (Intel and Anaconda).

Performance: Matlab would be faster...

Matlab team compares well optimized Matlab to badly written scientific Python.

In our experience, the Python translation of a real Matlab code taken from a laboratory is faster, even when using only the basic scientific packages.

This is mainly because the principle (and the effect) of Matlab is to keep users with bad habits and a low skill level.

With the tools available today (Cython, Numba, Pythran, ...), we can run very efficient computations in Python with little effort.

Labview

Same analysis valid...

Moreover, graphical programming language with programs saved as binary files:

  • Impossible to read a Labview program without Labview.
  • Forbids source management tools which are so important for collective work and open-source.

Symbolic computation programs: Mathematica / Maple

   
  • Good but expensive...

  • Sage and Sympy are two complementary open-source Python-based alternatives.

   

Employability...

Matlab and Labview are also problems for students, who learn

  • bad coding habits and
  • languages much less in demand by employers than for example Python.

Different models for software development in fluid mechanics

Proprietary codes tend to dominate the field of fluid mechanics

Examples: Ansys, LaVision, Dantec, ...

Closed-source or even undistributed software

"We do not share ..."

  • "to keep a comparative advantage."

  • "because we do not provide support."

  • "because people would not be able to correctly use the code or interpret the results."

  • "because we want to control industrial usage."

  • "because we do not want people to review and criticize our code."

  • "because in this way, people will think our project has more value."

Different models for software development in fluid mechanics

The grey zone between closed-source and open-source

  • Share only with "friends" to control the dissemination of the software

  • Share without repository and/or without license

Lack of a proper open-source license ⇒ problems and limitation for users/developers

  • Open-source code written with proprietary tools

    To use them, we need to pay but we do not pay the authors of the software.

    The authors work (often for free) for the company that sells the proprietary tool?

Different models for software development in fluid mechanics

Open-source in fluid mechanics

Some real open-source codes have also emerged, for example Nek5000 (Fortran), OpenFOAM (C++), Basilisk (C), Channelflow (C++) and Code_Saturne (C/Fortran).

Big companies have started to use open-source development for fluid mechanics applications.

Python in fluid mechanics

CFD codes (Dedalus, SpectralDNS, Oasis, PyFR and FEniCS) or data analysis (OpenPTV).

The packages of the FluidDyn project are part of this trend.

Part 2 (much shorter ☺)

The FluidDyn project

   

The FluidDyn project

  • A project to foster open-science and open-source in fluid mechanics

  • A set of Python packages

  • Examples for:

    • Good coding practice: readable and understandable Python code (PEP 8).
    • Source control management (Mercurial) and forge (Bitbucket) that are simple for newcomers.
    • Packaging and installation procedures.
    • Licenses: depending on the packages, CeCILL-B or CeCILL licenses.
    • Documentation sites produced with the standard and up-to-date tools:
      Sphinx, Readthedocs, Anaconda and Jupyter.
    • Unit tests and continuous integration with Bitbucket Pipeline and Travis.
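
As an illustration of the last point, a minimal sketch of a unit test written with the standard `unittest` module. The function `kinetic_energy` and the test class are hypothetical and are not taken from the FluidDyn packages; a continuous integration service would run such tests automatically at each change.

```python
import unittest


def kinetic_energy(ux, uy):
    """Kinetic energy per unit mass of a 2D velocity (hypothetical example)."""
    return 0.5 * (ux ** 2 + uy ** 2)


class TestKineticEnergy(unittest.TestCase):
    def test_fluid_at_rest(self):
        self.assertEqual(kinetic_energy(0.0, 0.0), 0.0)

    def test_known_value(self):
        # 0.5 * (3**2 + 4**2) = 12.5
        self.assertAlmostEqual(kinetic_energy(3.0, 4.0), 12.5)


# Run the tests programmatically (unittest.main() is the usual entry point
# when the file is executed as a script).
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestKineticEnergy)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```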

The packages of the FluidDyn project

fluiddyn

Pure-python code that can be reused in scripts or in specialized FluidDyn packages. It also contains the code of miscellaneous command-line utilities useful for a typical fluid dynamics user.

fluidfft

A library which provides C++ and Python classes useful to perform Fast Fourier Transform (FFT) in sequential and in parallel.

fluidsim

Numerically oriented framework to run sequential and parallel Computational Fluid Dynamics (CFD) simulations and on-the-fly post-processing for a variety of problems (Navier-Stokes, Shallow Water, Föppl von Kármán equations, to name a few).

fluidlab

Package to carry out laboratory experiments. Primarily used to communicate with various hardware devices such as motors and pumps, to handle I/O between sensors, and to store data.

fluidimage

Scalable image processing package which implements various algorithms to calibrate cameras, to preprocess, to do Particle Image Velocimetry (PIV) and to postprocess data.

fluidcoriolis

Small package used to carry out experiments in the Coriolis platform and share the data obtained.

fluidfoam

Small package to load OpenFOAM data and plot them.

Conclusions (before Part 3 ☺)

  • Open-source in science has never been so strong

    Very good tools and methods to be more efficient and to do better science

  • Collectively:

    • LEGI does not take advantage of these dynamics as much as it could.
    • Addiction to some proprietary tools.
  • Question at the beginning of a PhD thesis: Matlab or Python?

    • One main tool. The PhD thesis is an opportunity to become an expert in one tool.
    • Transition during the PhD thesis very difficult.
    • A real, strong responsibility for the supervisors.
  • Open-source choice: an individual and collective choice. Scientific policy.

  • Open-source has a cost. No good business model for scientific open-source. We tend to be stingy!!

Plenty of opportunities for synergy through coding in the lab

For example:

  • Pseudo-spectral simulations

    • Achim Wirth
    • Nicolas Mordant
    • Chantal Staquet
    • Pierre Augier
    • MOST
  • Experiments

  • Data analysis

  • Fluid visualization

And "LEGI" open-source tools (good for the lab influence...)

Part 3 (also short ☺)

A proposal (my idea)

Goal:

  • Strengthen the open-source dynamics in the laboratory.

  • A small amount of leverage to steer open-source development in directions useful to the lab.

  • In the long term, savings for the lab (gradual reduction of the addiction to proprietary tools).

Means:

  • Rebalance the laboratory's shared funding between companies providing proprietary codes and open-source projects useful to the laboratory.

  • A small call for projects, "Open-source useful to the lab", at the lab level.

Call for projects "Open-source useful to the lab"

A small collective action by the lab...

Proposed method

An open organizing group is needed...

  1. List the lab's shared expenses for proprietary software that could potentially be replaced $\Rightarrow$ table with the total sum (2017: Matlab + Labview $\simeq$ 8000 € including tax).

    Remark: sums smoothed over several years

  2. Each year, decide what percentage of this total is allocated to the call for projects (a science-policy decision, hence taken by the lab's management and council).

  3. The call for projects itself. Short descriptions are requested (one A4 page?).

  4. Open, reasoned ranking by the organizing group.

  5. Decision (science policy $\Rightarrow$ lab management and council) whether or not to fund projects.

Call for projects "Open-source useful to the lab"

Examples of possible projects:

  • Anaconda support for the lab

  • Membership of an open-source association (Debian, OpenFoam France, PythonFr, ...)

  • Funding of an open-source project on a specific point useful to the lab (for example Spyder issue 4180, see "Spyder is unfunded")

  • Internship to improve an open-source tool of the lab that is useful to several projects in the lab.

  • Internship to move the control of an experiment or an instrument from Labview to open-source solutions (fluidlab, micropython, Raspberry Pi, Arduino).

  • Internship to migrate a code in order to get rid of a proprietary tool.

  • ...

Advertisement: the group Python for science and data analysis in Grenoble, and seminars

We are setting up a Python group at the Grenoble level.

"Scientific Python in Grenoble" day on March 8

Talks by, among others,

Seminar by Alexandre Gramfort at LEGI on March 7 at 2 pm

Machine learning with scikit-learn