How Python became the Language of Choice for Data Science

A slightly updated version of my 2013 post.

Originally posted Nov 20, 2013 on blog.mikiobraun.de. Slightly edited.

Nowadays Python is probably the programming language of choice (besides R) for data scientists for prototyping, visualization, and running data analyses on small and medium-sized data sets. And rightly so, I think, given the large number of available tools (just look at the list at the top of this article).

However, it wasn’t always like this. In fact, when I started working on my Ph.D. back in 2000, virtually everyone was using MATLAB for this. And again, rightly so. MATLAB was very well suited to quickly prototyping linear algebra and matrix stuff, came with a nice set of visualizations, and even allowed you to do some text mining and file parsing if you really needed to.

The problem was, however, that MATLAB was and is actually very expensive. A single license costs a few thousand Euros, and each toolbox costs another few thousand Euros. However, MATLAB was always very cheap for universities, which made perfect sense: That way, students could be trained in MATLAB so that they already knew how to use it to solve problems and companies would then be willing to pay for the licenses.

All of this changed significantly in 2005 or so. At that time I was working at the Fraunhofer Institute FIRST, which belongs to a group of German publicly funded research institutes focused on applied research. Originally, Fraunhofer institutes could get the same cheap university licenses, but then MathWorks changed its policies so that you could only get the university rate if you were an institution that hands out degrees.

This did not hold for most publicly funded research institutes all over the world, like the Max Planck Institutes (such as the one in Tübingen where Bernhard Schölkopf is), or NICTA in Australia, where Alex Smola and others were working at the time. So we decided something had to change and we started looking for alternatives.

Python was clearly one of the possible choices, but at the time other options seemed possible as well. For example, Octave had been around for a long time and people wondered whether one should not just help its developers make Octave as good as MATLAB and fix all remaining compatibility issues. Together with Stefan Harmeling I started fantasizing about a new programming language dubbed rhabarber (the repo, originally hosted on Google Code, still exists) which would allow extending even the syntax dynamically, to be able to have true matrix literals (or even other things). Later I would play around with JRuby as a basis because it allowed better integration with Java to write high-performance code where necessary (instead of doing painful low-level stuff with C and SWIG).

If I remember correctly, the general consensus already back then was that Python would be the language of choice. I think early versions of NumPy already existed, as well as early versions of matplotlib. Shogun, which had been developed and used extensively in our lab, had already begun to provide Python bindings, and so on.

I personally always felt (and still feel, even in 2021) that there are things where MATLAB is still superior to Python. MATLAB was always a quite dynamic environment because you could edit files and it would reload them automatically. Python is also somewhat restrictive with what you can say on a single line. In MATLAB you would often load some data, start editing the functions and build your data analysis step by step, while in Python you tend to have files which you start from the command line (or at least that’s how I tend to do it).

In any case, early on there was also the understanding that we should focus our efforts on a single project and not have the work scattered over several independent projects. So we proposed a workshop on this for NIPS 2005, but unfortunately it was rejected. However, interest was so high that we simply rented a seminar room in the hotel where NIPS was going to be held, notified everyone we thought would be relevant, and held the Machine Learning Tools Satellite Workshop on the Sunday, the day before the conference.

The hot contender back then was the Elefant toolbox designed by Alex Smola and collaborators, which was a pretty ambitious project. The idea was to use PETSc as the numerical back end. PETSc was developed in the area of large-scale numerical simulations and had a number of pretty advanced features like distributed matrices and similar things. I think ultimately, it might have been a bit too advanced. Simple things like creating a matrix were already quite complicated.

I also gave a talk together with Stefan on rhabarber, but most people were skeptical whether a new language was really the right way to go, as Python seemed good enough. In any case, things really started to get going around that time and people were starting to build stuff based on Python. Humans are always hungry for social proof, and having that one-day meeting with a bunch of people from the same community gave everyone the confidence that they wouldn’t be left alone with Python.

A year later, we finally had our first Machine Learning Open Source Workshop, which eventually led to the creation of the MLOSS track over at JMLR in an attempt to give scientists a better incentive to publish their software. We had several iterations of our workshop, had Travis Oliphant give an intro to NumPy, invited John Hunter, the main author of matplotlib, who sadly passed away in 2012, as well as John W. Eaton, the main author of Octave, and there were also more workshops (although without me). Somehow, the big, open, interoperable framework didn’t emerge, but we’re still trying. Instead there exist many frameworks which wrap the same basic algorithms and tools again and again.

Eventually, Elefant didn’t make the race, but other toolboxes like scikit-learn became commonplace, and nowadays we luckily have a large body of powerful tools to work with data, without having to pay horrendous licensing fees. Other tools like Pandas were created in other communities and everything came together nicely. I think it’s quite a success story and having been a minor part of it is nice, although I didn’t directly contribute in terms of software.

Now in 2021, we have seen so many more additions, and Python is really the de facto standard platform for doing data science. The cores of TensorFlow and PyTorch are not written in Python, but Python is the main interface that people use. Python also became one of the main languages for serverless services. New frameworks like Ray are using Python as their main interface. Python has even added a dedicated matrix multiplication operator, @, to the language (in version 3.5) to make it easier to express matrix computations.
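To make that last point a bit more concrete, here is a minimal sketch of how the @ operator works at the language level. I am assuming the addition meant here is the matrix multiplication operator from Python 3.5 (PEP 465). The Matrix class below is purely hypothetical and only there to keep the example free of dependencies; in practice you would use NumPy arrays, which support @ as well.

```python
class Matrix:
    """Tiny illustrative matrix type (hypothetical, not a NumPy replacement)."""

    def __init__(self, rows):
        self.rows = [list(row) for row in rows]

    def __matmul__(self, other):
        # Python calls this method when it evaluates `self @ other`.
        if len(self.rows[0]) != len(other.rows):
            raise ValueError("inner dimensions do not match")
        columns = list(zip(*other.rows))  # columns of the right operand
        return Matrix(
            [[sum(a * b for a, b in zip(row, col)) for col in columns]
             for row in self.rows]
        )

    def __eq__(self, other):
        return isinstance(other, Matrix) and self.rows == other.rows

    def __repr__(self):
        return f"Matrix({self.rows!r})"


a = Matrix([[1, 2], [3, 4]])
b = Matrix([[0, 1], [1, 0]])  # swaps the two columns of the left operand
product = a @ b
print(product)  # Matrix([[2, 1], [4, 3]])
```

The point is less the arithmetic than the syntax: a @ b reads much closer to what you would write in MATLAB than a chain of nested function calls.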

Interestingly, I never became that much of a Python enthusiast. I wrote my own stuff in JRuby, which led to the development of jblas, but at some point started working on real-time analysis stuff where I just needed better control over my data structures, and used Java and Scala for that. In 2020 I started working more with Python, and I'm still not excited, but the word I associate most with it is solid. Depending on how you use it, it is quite fast. It did a good job at being open and extensible. Some newer features like type hints are nice. I still like Scala's collection API, though.

If you have stories to share (or corrections) on the “early years of Data Science”, I’d love to hear from you.

