Rebalancing the laboratory's shared funding between proprietary codes and open-source projects useful to the laboratory
A revolution driven by computers and the web
Open-source is at the center of these technologies
$\Rightarrow$ A new kind of open-source has emerged
It disrupts the way science is done: software at the center + sharing through the web
Distributed source management tools and source development web-platforms
Remark on Mercurial: easier to learn than Git / safer for beginners / as powerful for experts / can work with Git repositories (GitHub, GitLab) using hg-git
Third-party software repositories
Unit tests and continuous integration platforms: Travis, Bitbucket Pipelines, Codecov
Websites to share knowledge
Community-driven sites: Wikipedia, Stack Overflow, ...
IRC, Slack, Riot, ...
Automatic web documentation built on servers (Sphinx, Doxygen, Read the Docs)
Designed to boost the communication of technical ideas between humans. Code as simple as possible to emphasize code readability. Humans can focus on the ideas.
Nice and elegant syntax. Blocks of code defined by indentation.
Explicit guidelines (PEP 8) for code regularity (and thereby readability).
Dynamic language. Variables (names) are not attached for life to an object. Types belong to the objects and are determined at run time.
Automatic memory management.
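A minimal sketch of what "dynamic" means here: the same name can be rebound to objects of different types, and the type is a property of the object, not of the name. The name `x` and the helper `describe` are purely illustrative.

```python
def describe(obj):
    """Return the name of the object's type, looked up at run time."""
    return type(obj).__name__


x = 1                      # the name x refers to an int
kinds = [describe(x)]
x = "one"                  # the same name now refers to a str
kinds.append(describe(x))
x = [1.0, 2.0]             # ... and now to a list
kinds.append(describe(x))

print(kinds)
```

No declaration or explicit deallocation appears anywhere: objects that are no longer referenced are reclaimed by the interpreter.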
"Interpreted language": an interpreter executes the code instruction by instruction.
No proper compilation: no translation to optimized machine instructions.
2 advantages: shorter development cycle + interactive workflow (IPython, Jupyter)
2 drawbacks...
Multiple programming paradigms, including imperative, object-oriented and functional.
Easy to interact with code written in other languages (in particular C, C++ and Fortran).
Large and high quality standard library.
Can be run on machines with different
For science, the transition Python 2 $\rightarrow$ Python 3 is basically over.
Interesting new features in recent versions. For Python 3.5 (first released in September 2015)
Large and supportive community (see Stackoverflow tags).
Several companies using Python and the open-source dynamics.
Versatile language, Python widely used for several applications:
Simple scripting.
System, database and network administration.
Linux distribution software.
Main scripting language to add programmability to applications (Paraview, Visit, QGIS, Blender, ...).
Web servers.
Web scraping and data analysis.
Animation movies, game development and gaming.
Education.
Science!
Widely used in scientific applications. Mature and powerful scientific ecosystem with:
and several specialized packages (h5py, mpi4py, skimage, sklearn, ...).
Great tools for most of the applications. See landscape of Python visualization tools.
Main language for data science with pandas, statsmodels, sklearn, keras and tensorflow.
(similar to Matlab) freemium open-source distribution Anaconda $\Rightarrow$ very easy to use Python for scientific purposes.
Open-source web application to create and share documents that contain live code, equations, visualizations and narrative text.
(similar to Matlab)
First solution: use C, C++ or Fortran to speed up the performance-critical code.
No need to write the extensions in C or C++: tools to make them from (nearly) Python code.
Code in C/C++ style with a syntax similar to Python (but with type declarations).
Very powerful, but to really understand it one needs to know both C and Python.
Creates compiled extensions from pure Python code with simple type annotations written in comments.
Resulting extensions usually as fast as Fortran or C++ written by non-specialists.
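As an illustration of "type annotations written in comments", here is a minimal sketch of a module that Pythran could compile. The function `weighted_sum` is a made-up example, and the `float64[]` signature assumes the function will be called with 1D NumPy arrays once compiled.

```python
# pythran export weighted_sum(float64[], float64[])
def weighted_sum(values, weights):
    """Sum of values[i] * weights[i], written as a plain Python loop."""
    total = 0.0
    for i in range(len(values)):
        total += values[i] * weights[i]
    return total


# Without Pythran, the export line is an ordinary comment and the
# function runs as normal (slow) Python, here on plain lists.
print(weighted_sum([1.0, 2.0, 3.0], [0.5, 0.5, 1.0]))
```

The same file stays valid Python, so one can debug the pure-Python version first and compile it with the `pythran` command-line tool only when speed matters.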
Very interesting two-step compilation:
Pythran supports:
JIT compilation: compiling only the critical code at run time.
Good results with other languages such as JavaScript, Matlab or Julia.
PyPy (an alternative interpreter written in Python) has a JIT compiler. However, PyPy is not widely used for scientific applications, mainly because of compatibility problems with extensions.
Adding a JIT to CPython is notoriously difficult.
Less money and work have been put into accelerating Python than into, for example, Java and JavaScript.
Numba: a JIT for CPython through an external package. Numba can take advantage of the GPU.
A versatile language $\Rightarrow$ for many applications, one needs external packages.
For most of the applications: several projects usable through Python.
A user/developer has to make choices. It can be difficult to understand a "landscape of projects" and to make good technological choices.
Important to get a good introduction on the open-source dynamics and the scientific Python ecosystem.
Useful to ask experts.
Python is much richer and more diverse than, for example, Matlab.
Often an advantage, but it can be an issue.
CPython prevents several threads from interpreting Python code at the same time.
The Global Interpreter Lock (GIL)
Used to prevent race conditions.
Greatly simplifies the implementation of CPython.
Very difficult to remove it while keeping other nice technical properties.
PyPy also has a GIL, while Jython and IronPython do not have this limitation.
Threads used in Python for concurrency, i.e. to perform i/o concurrently.
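A minimal sketch of threads used for I/O concurrency. The function `fake_download` is a hypothetical stand-in for a real I/O operation: it only sleeps, which releases the GIL in the same way as waiting on a file or a socket.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(delay):
    """Stand-in for an I/O-bound task: sleeping releases the GIL."""
    time.sleep(delay)
    return delay


delays = [0.2, 0.2, 0.2, 0.2]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(delays)) as pool:
    results = list(pool.map(fake_download, delays))
elapsed = time.perf_counter() - start

# The four waits overlap, so the total is close to 0.2 s rather than 0.8 s.
print(f"{len(results)} tasks in {elapsed:.2f} s")
```

The same pattern gives no speedup for pure-Python number crunching, since only one thread at a time can hold the GIL.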
For proper computational parallelism with the GIL:
Fortran
C
C++ (C++11, C++14)
Python
Several others: JavaScript, Java, Scala, Smalltalk, Haskell, R, Julia and Lua
New: Go and Rust
No perfect language: each language has strong points and weaknesses.
Predict the future!?
The idea of "one language to do everything for science" will not succeed, at least not soon.
⇒ Important aspect: interoperability between coexisting tools.
Python should be able to
Python will continue to shine as
a language particularly good for human communication and to write ideas
a language for fast prototyping
a glue language able to interact nicely with code written in other languages (see cppyy and pybind11, or the possibility to interact with Rust code).
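To make the "glue language" point concrete, here is a minimal sketch that calls a C function from Python. It uses the standard-library `ctypes` module rather than cppyy or pybind11, simply because it needs no compilation step. It assumes that the system's C math library can be located; the helper `load_c_cos` is hypothetical and returns `None` otherwise.

```python
import ctypes
import ctypes.util
import math


def load_c_cos():
    """Return cos() from the C math library, or None if it cannot be located."""
    path = ctypes.util.find_library("m")  # assumption: a libm is installed
    if path is None:
        return None
    libm = ctypes.CDLL(path)
    libm.cos.argtypes = [ctypes.c_double]  # declare the C signature
    libm.cos.restype = ctypes.c_double
    return libm.cos


c_cos = load_c_cos()
if c_cos is not None:
    # Compare the C result with Python's own math.cos
    print(c_cos(1.0), math.cos(1.0))
```

Tools such as pybind11 or cppyy go further (C++ classes, templates, automatic conversions), but the principle is the same: the compiled code is exposed as ordinary Python callables.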
Great versatile tool for scientists.
Impressive scientific ecosystem.
Very strong and quickly growing scientific Python community.
A scientist with good skills in Python can do most of what she/he needs to do. (which does not mean that Python is the best language to do everything!)
In contrast, having no skill in Python is a real disadvantage for many tasks and for employability.
$\Rightarrow$ In most fields, today, if a scientist has to acquire good skills in one language, it should be Python.
Short-term efficiency and long-term efficiency are often incompatible:
Similarly, productivity at different scales:
individual
group
community
We have to consider these conflicts when choosing between different technologies.
Fluid mechanics:
Programming is everywhere...
However, the level of software engineering in the community is very low!
Not unusual to start a PhD without any serious training in Linux and in programming with modern tools.
Few people aware of the challenges and opportunities of open-source.
Engineering science $\Rightarrow$ closed-source commercial software is strong.
A group pays for a license or a new development.
It does not learn how to do what it has paid for.
Through its money and its feedback, it helps improve the product sold by the company.
The group produces codes, books and courses using the closed-source product, or acquires knowledge on how to use it.
The group needs the product more and is ready to pay more for it.
Such a circle is difficult to break, but it can be done with open-source solutions (a similar circle, but a virtuous one).
Scientists produce code. But without the technical knowledge of how to work collectively on code, they produce low-quality code that is doomed to be abandoned soon.
However, one tries to reuse code...
Thousands of hours of highly qualified people are spent in trying to understand and reuse code badly written with inappropriate languages!
Coding in the field:
Mix of Fortran/C or C++, a shell language (such as bash) and Matlab.
For experiments, the graphical programming environment Labview.
Fortran, C or C++ are inadequate for fast prototyping of complex programs.
For scientific purposes, one should use shell scripting only for extremely simple tasks.
Matlab is not suited to complex programs.
Problems for code reuse and sharing and collaborative development.
Matlab is a closed-source proprietary numerical computing environment.
A good tool for simple processing with matrices and data plotting.
Language well suited to these tasks.
Nice development environment.
Quite fast interpreter now that it has a JIT compiler.
Bad tool for doing more than simple processing and data plotting.
All functions available in a huge flat namespace.
No import mechanism, so one cannot tell from the code where a function comes from.
Scripts that modify and define global variables (see this Matlab version of Diablo).
The language strongly encourages this practice.
Matlab files are not self-contained, i.e. it is normal to use in a file a global variable defined outside of the file.
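For contrast with the flat namespace described above, a minimal Python sketch using only the standard library: every name used in the file is either defined in it or explicitly imported, so its origin can be read directly from the code.

```python
import math
from statistics import mean

print(math.sqrt(16.0))   # the prefix shows that sqrt comes from the math module
print(mean([1, 2, 3]))   # mean was imported by name from statistics
print(mean.__module__)   # the function object also records where it was defined
```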
In python:
def myfunc(a, b, c=1, has_to_print=True):
    if has_to_print:
        print('a =', a, 'b =', b, 'c =', c)
    return c * (a + b)
In Matlab:
function ret = myfunc(a, b, varargin)
    % a and b are required; c and has_to_print are optional
    if nargin < 2 || nargin > 4
        error(['The number of arguments has to be ' ...
               'at least 2 and at most 4'])
    end
    % fourth argument: has_to_print (default 1)
    if nargin == 4
        has_to_print = varargin{2};
    else
        has_to_print = 1;
    end
    % third argument: c (default 1)
    if nargin >= 3
        c = varargin{1};
    else
        c = 1;
    end
    if has_to_print
        disp(['a = ' num2str(a) '; b = ' num2str(b) '; c = ' num2str(c)])
    end
    ret = c * (a + b);
end
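A short usage sketch of the Python version above (redefined here so that the block is self-contained). It shows what the four-line definition provides without any hand-written argument checking; the variable names `r1`, `r2` and `message` are illustrative.

```python
def myfunc(a, b, c=1, has_to_print=True):
    if has_to_print:
        print('a =', a, 'b =', b, 'c =', c)
    return c * (a + b)


# Only the required arguments: c and has_to_print take their defaults.
r1 = myfunc(1, 2)

# Optional arguments can be given by name, in any order.
r2 = myfunc(1, 2, has_to_print=False, c=3)

# A wrong number of arguments raises TypeError automatically.
try:
    myfunc(1)
except TypeError as err:
    message = str(err)
    print(message)
```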
Usually full of ;, .*, ./, && and || (quite "noisy").
Nothing like the Python PEP 8.
a(100, 100) = 1; to create and extend a matrix
No error with a = eye(2); a(i0, i1) = 1; even for crazy values of i0 and i1!
No error if a user misspells a variable var and writes something like vae(100, 100) = 1;.
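For comparison, a minimal sketch of how Python reacts to the same two mistakes, using nested lists so that only the standard library is needed. The names `a` and `vae` mirror the Matlab snippets above; `errors` is an illustrative helper list.

```python
a = [[1, 0], [0, 1]]  # 2 x 2 identity as nested lists

errors = []

try:
    a[100][100] = 1  # out-of-range index: no silent resizing
except IndexError as err:
    errors.append(type(err).__name__)

try:
    vae[100][100] = 1  # misspelled name: vae was never defined
except NameError as err:
    errors.append(type(err).__name__)

print(errors)
```

Both mistakes stop the program at the faulty line instead of silently creating or enlarging a matrix.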
Impossible to study the implementation of its functions.
One license (for non-commercial use) not very expensive for most research and teaching institutes.
However, most codes need toolboxes... Non negligible cost for some institutions.
Multi-processing (one licence per process): very expensive!
Unfortunately, they are not serious alternatives today.
Too small user community.
Not very efficient (now optimized Matlab codes rely on its JIT).
True ... but now Anaconda, Spyder and JupyterLab.
True with basic Matplotlib... but several other tools for interactive plot.
Python bad for science because
Very mature, robust, well documented and integrated libraries (numpy, scipy and IPython).
Badly documented and buggy. Python for science today is not old-school open-source.
New open-source methods give impressive results in terms of code quality.
Most of the big Python packages supported by companies or research institutes:
Numba (Anaconda, Nvidia), TensorFlow (Google), Scikit-learn (INRIA), Mercurial (Facebook).
Let's compare code. First the Matlab version:
% matrix creation (row's shape is [1, 3])
row = [1 2 3];
% matrix transposition
col = row';
% matrix multiplication
inner = row * col;
outer = col * row;
% element-by-element multiplication and division
row2 = row .* (row + 1);
result = row2 ./ (row2 + 1);
and then the equivalent code in Python 3.6 (using numpy):
import numpy as np
# creation of a 2D array with shape (1, 3)
row = np.array([1, 2, 3], ndmin=2)
# matrix transposition
col = row.T
# matrix multiplication with the symbol @
inner = row @ col
outer = col @ row
# element-by-element multiplication and division
row2 = row * (row + 1)
result = row2 / (row2 + 1)
Nowadays wrong...
Python most popular language for teaching
Searches on YouTube gave, in early 2018:
The Matlab team compares well-optimized Matlab to badly written scientific Python.
In our experience, the Python translation of a real Matlab code taken from a laboratory is faster, even using only the basic scientific packages.
Actually, mainly because the principle (and the effect) of Matlab is to keep its users with bad habits and a low skill level.
With the tools available today (Cython, Numba, Pythran, ...), we can run very efficient computations in Python with little effort.
Same analysis valid...
Moreover, graphical programming language with programs saved as binary files:
Good but expensive...
Sage and Sympy are two complementary open-source Python-based alternatives.
Matlab and Labview are also problems for students, who learn
Examples: Ansys, LaVision, Dantec, ...
"We do not share ..."
"to keep a comparative advantage."
"because we do not provide support."
"because people would not be able to correctly use the code or interpret the results."
"because we want to control industrial usage."
"because we do not want people to review and criticize our code."
"because in this way, people will think our project has more value."
Share only for "friends" to control the dissemination of the software
Share without repository and/or without license
Lack of a proper open-source license ⇒ problems and limitations for users/developers
Open-source code written with proprietary tools
To use them, we need to pay but we do not pay the authors of the software.
The authors work (often for free) for the company that sells the proprietary tool?
Some real open-source codes have also emerged, for example Nek5000 (Fortran), OpenFOAM (C++), Basilisk (C), Channelflow (C++) and Code_Saturne (C/Fortran).
Big companies have started to use open-source development for fluid mechanics applications.
Volkswagen group (Volkswagen, Audi, Seat, Porsche, Skoda, ...) has used OpenFOAM since 2006.
Interesting open-source strategy of EDF (the main French electric utility company)
CFD codes (Dedalus, SpectralDNS, Oasis, PyFR and FEniCS) or data analysis (OpenPTV).
The packages of the FluidDyn project are part of this trend.
A project to foster open-science and open-source in fluid mechanics
A set of Python packages
Examples for:
Pure-python code that can be reused in scripts or in specialized FluidDyn packages. It also contains the code of miscellaneous command-line utilities useful for a typical fluid dynamics user.
A library which provides C++ and Python classes useful to perform Fast Fourier Transform (FFT) in sequential and in parallel.
Numerically oriented framework to run sequential and parallel Computational Fluid Dynamics (CFD) simulations and on-the-fly post-processing for a variety of problems (Navier-Stokes, Shallow Water, Föppl von Kármán equations, to name a few).
Package to carry out laboratory experiments. Primarily used to communicate with various hardware devices such as motors and pumps, to handle I/O between sensors, and to store data.
Scalable image processing package which implements various algorithms to calibrate cameras, to preprocess, to do Particle Image Velocimetry (PIV) and to postprocess data.
Small package used to carry out experiments in the Coriolis platform and share the data obtained.
Small package to load and plot OpenFOAM data.
Open-source in science has never been so strong
Very good tools and methods to be more efficient and to do better science
Collectively:
Question at the beginning of a PhD thesis: Matlab or Python?
Open-source choice: individual and collective choice. Scientific policy.
Open-source has a cost. No good business model for scientific open-source. We tend to be stingy!!
Aim:
Strengthen the open-source dynamics in the laboratory.
A small amount of leverage to steer open-source development in directions useful to the lab.
In the long term, savings for the lab (gradual reduction of the dependence on proprietary tools).
Means:
Rebalance the laboratory's shared funding between companies providing proprietary codes and open-source projects useful to the laboratory.
Small call for projects "Open-source useful to the lab" at the lab level.
A small collective action by the lab...
Need for an open organizing group...
List the lab's shared expenses for potentially replaceable proprietary software $\Rightarrow$ table with the total amount (2017, Matlab + Labview $\simeq$ 8000 € including tax).
Remark: the amounts are smoothed over several years.
Each year, decision on the percentage of this total allocated to the call for projects (a scientific policy decision, hence taken by the lab's management and council).
The call for projects itself. Short descriptions are requested (one A4 page?).
Reasoned and open ranking by the organizing group.
Decision (scientific policy $\Rightarrow$ lab management and council) on whether or not to fund projects.
Anaconda support for the lab
Membership of an open-source association (Debian, OpenFoam France, PythonFr, ...)
Funding an open-source project on a specific point useful to the lab (example: Spyder issue 4180, see Spyder is unfunded)
Internship to improve one of the lab's open-source tools that is useful to several projects in the lab.
Internship to move the control of an experiment or an instrument from Labview to open-source solutions (fluidlab, micropython, Raspberry Pi, Arduino).
Internship to port a code in order to break free from a proprietary tool.
...
A Python group is being set up at the Grenoble level.
Talks by, among others,
Serge Guelton (Pythran):
Something like: Introduction to numerical computing, questioning the "Fortran > all" myths
Alexandre Gramfort (scikit-learn):
How to build a success scientific software package: the scikit-learn model
Machine learning with scikit-learn