Senior R&D Manager (Data Science) at Synopsys Software Integrity Group and Treasurer at Farset Labs & Bsides Belfast
After last week’s return to posting, I thought it was time to do something vaguely useful (and actually talk about it), so I’m tidying up a few meetup sessions I’ve presented into a series of Basic Data Science (with Python) posts. This is the first one and covers my Python environment, the Jupyter notebook environments I use for analysis, and a bit on the Plot.ly graphs and RISE / Reveal.js methods I use to turn those notebooks into presentations.
This section will be “Light” on the data science, maths, politics and everything else outside of the setup of this wee environment.
- Basic Linux/OSX familiarity (This setup was tested on Ubuntu Server 17.04, YMMV)
- Basic Programming familiarity in a hybrid language like Python, R, Bash, Go, Rust, Julia, Node.JS, Matlab, etc.
- That when I use the phrases “High Performance”, “Large”, “Big”, “Parallel”, etc., this is in the context of a single, small machine processing small amounts of interesting derived data. Yes, we can go big, but for now let’s start small…
- Download Anaconda Python 3.X
wget https://repo.continuum.io/archive/Anaconda3-4.4.0-Linux-x86_64.sh -O anaconda.sh
chmod +x anaconda.sh
- [OPTIONAL] Set up a conda environment to play in;
conda create -n datasci
source activate datasci
conda config --add channels conda-forge
conda install numpy scipy pandas nb_conda
conda install jupyter_nbextensions_configurator nbpresent nbbrowserpdf plotly seaborn
pip install cufflinks
conda install -c damianavila82 rise
jupyter notebook --ip=0.0.0.0
If you’re using something else to do Data Science, good for you, but in terms of flexibility, developer/analyst velocity and portability, Python is hard to beat. R still has Python beat in some of the depths of its Statistics expertise, but IMO Python gets you 99% of the way there.
My environment of choice uses the Anaconda distribution. Its largely pre-built and optimised modules, community maintained channels, and environment maintenance capabilities make it hard to beat for exploratory analysis / proof of concept without the overheads of containerization, the pain of virtualenvs or the mess that comes from sudo pip install.
I’ll skip past the installation instructions since Anaconda’s installation guides are pretty spot on, and we’ll swing on to some tasty packages.
NOTE: in most cases, when the Anaconda installer asks “Do you wish the installer to prepend the Anaconda3 install location to PATH…?”, unless you craft your own dotfiles, it’s a good idea to answer “yes”.
If you want to keep all your data science environments tidy, or you just want to minimise cross-bloat, conda makes environment creation and management extremely easy:
conda create -n datasci
source activate datasci
This way, you’re not munging over the base python installation, and best of all, everything lives in userland so this is great for shared machines where you don’t have
Once the base distribution is installed, it’s also worthwhile adding in the conda-forge channel, which gives you access to community built and tested versions of most major Python packages (and can be more up to date than Anaconda’s pre-built versions).
conda config --add channels conda-forge
Numpy and Pandas and Scipy, oh, my!
There are a few fundamental challenges in Data Science that can be loosely grouped as:
- Extraction, Collection and Structured Storage of Mixed Value/Format Data
- Slicing, Dicing, Grouping and Applying
- Extraction of Meaning/Context
The first two of these are made massively more comfortable with the addition of 3 (large) Python packages;
numpy and scipy specialise in highly optimised vector-based computation and processing, allowing the application of single functions across large amounts of data in one go. They’re both very tightly integrated with each other, but in general, numpy focuses on dense, fast data storage and iteration, with a certain amount of statistical power under the hood (most of it from scipy), and then scipy focuses on slightly more ‘esoteric’ maths from the scientific, engineering and academic fields.
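To make “single functions across large amounts of data in one go” a bit more concrete, here’s a minimal sketch. The array size and the particular calculation are just picked for illustration, they’re not from any real workload.

```python
import numpy as np

# A million floats held in one dense, typed array.
values = np.arange(1_000_000, dtype=np.float64)

# One vectorised expression applied to every element at once,
# with no explicit Python loop.
scaled = np.sqrt(values) * 2.0

# The same calculation done element by element on the first five values,
# purely for comparison.
looped = [2.0 * (v ** 0.5) for v in values[:5]]

print(scaled[:5])
print(looped)
```

The vectorised line and the loop give the same answers; the difference is that numpy does the work in compiled code across the whole array rather than one Python object at a time.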
Without a doubt, pandas is my favourite Python module; it makes the conversion, collection, normalisation and output of complex datasets a breeze. It’s a little bit heavy once you hit a few million rows, but at that point you should be thinking very carefully about what questions you’re trying to ask (and why you’re not using Postgres/Spark/Mongo/Elasticsearch depending on where you are).
So, enough introductions;
conda install numpy scipy pandas
Drops of Jupyter
While pandas is still my favourite module, jupyter probably makes the biggest difference in my day-to-day life. Oftentimes you will either be living in a text editor / IDE and executing those scripts from the command line, or running through a command line interpreter directly like IDLE or ipython.
jupyter presents a “middle-way” of a block-based approach hosted through a local web server. Go play with it, you’ll thank me.
While the base jupyter module is packaged as part of Anaconda, a collection of helper modules is included in the nb_conda package, which adds features such as automatic kernel detection.
conda install nb_conda
jupyter notebook --ip=0.0.0.0
--ip=0.0.0.0 sets the jupyter server to accept connections to all addresses; by default it will only listen on
On first time launch, you’ll need to connect with a tokenized url
After visiting that URL, you’ll be presented with Jupyter’s file-tree based browser view:
From here we can launch “notebooks” from a range of kernels; the datasci conda environment we just created, the “root” conda environment (not “root” as in “superuser”, “root” as in the base, user-space, conda installation), and the system default python version.
You can fire up the datasci environment to get into the new notebook view, where Jupyter really comes into its own.
As you can see, the “output” (stdout / stderr) is redirected from the cell scripts into the browser. This gets particularly powerful as we start to use more complex functionality than just “printing”.
Notice the lack of
pandas comes in…
When working with large datasets, “accidentally” filling the screen with noise when you just want to peek into a dataset is an occupational hazard, so all pandas collection objects have .head() methods that show the first 5 (by default) results. Unsurprisingly, there’s also a .tail(), but more interestingly, there’s a .describe() that gives you a great general rundown on the nature of the dataset.
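As a rough sketch of those three methods in action: the Series here is just seeded random noise I’ve made up for the example, not the data from the original notebook.

```python
import numpy as np
import pandas as pd

# 1000 normally distributed values, seeded so the run is repeatable.
rng = np.random.default_rng(seed=42)
s = pd.Series(rng.normal(size=1000))

first = s.head()        # first 5 rows by default
last = s.tail(3)        # last 3 rows; both methods take an optional count
summary = s.describe()  # count, mean, std, min, quartiles and max

print(first)
print(last)
print(summary)
```

In a notebook you’d normally skip the print calls and just leave the expression as the last line of the cell.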
But who needs numbers when we can have graphs?
Jupyter supports a wonderful meta-language, helpfully called “magic functions”, that gives lots of access to the innards of Jupyter. One handy one is the %matplotlib function, which lets you tell Jupyter how to interpret matplotlib plot objects.
%matplotlib inline essentially says “make it an inline png”
That’s fine for fairly constrained plots like histograms; if we’re dealing with some kind of long-term varying data (eg timeseries), it’s not ideal to try to “navigate” a static image.
%matplotlib nbagg instructs jupyter to set matplotlib to use a heavily customised backend especially for notebooks, which brings interactive panning, zooming, translation and image export.
(Note, when switching between inline and nbagg, the kernel has to be restarted using the relevant button in the toolbar; this will keep all your code, cells and outputs, but will clear any stored variables)
Using the Pan button on the bottom right of the plot, we can zoom into a particular area of the graph and export it if we like.
So far so meh; what about something vaguely realistic?
Using numpy’s random module and pandas.date_range functionality, we can generate a (basic) simulation of the daily up/down movements of a financial market;
From there it’s a hop and a skip to use the cumsum() method to get a simulated price change from the launch and plot it (with proper time-series x-axis labels!)
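Something along these lines would do it. The start date, the one-year length and the normal distribution for the daily moves are all assumptions on my part for the sake of a runnable sketch.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# One calendar year of daily timestamps (start date is arbitrary).
days = pd.date_range("2017-01-01", periods=365, freq="D")

# Daily up/down movements drawn from a standard normal distribution.
moves = pd.Series(rng.normal(loc=0.0, scale=1.0, size=len(days)), index=days)

# Running total of the daily moves: the simulated price change since launch.
price_change = moves.cumsum()

print(price_change.tail())
# In a notebook, price_change.plot() would draw this with dates on the x-axis.
```

Because the Series is indexed by the date range, the plot picks up the time-series labels without any extra work.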
Similar to (but far nicer than) the Series display, this 2 dimensional dataset (DataFrame) has a very clean and readable representation in
Two things to point out before we move on; dimensional aggregation and column creation;
df.mean() by default returns the average value of each column, but this behaviour can be made explicit or transposed through the
And then we can use the output of axis=1 to create a new “AVG” column in the dataframe in place
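Here’s a small sketch of both ideas using the axis parameter of DataFrame.mean(). The three-column frame and its A/B/C names are hypothetical, chosen so the averages are easy to check by eye.

```python
import pandas as pd

# Tiny hypothetical frame; column names and values are illustrative only.
df = pd.DataFrame({
    "A": [1.0, 2.0, 3.0],
    "B": [3.0, 4.0, 5.0],
    "C": [5.0, 6.0, 7.0],
})

# Default behaviour (same as axis=0): one average per column.
col_means = df.mean()

# axis=1 averages across the columns instead: one value per row.
row_means = df.mean(axis=1)

# Assigning to a new column name adds it to the existing dataframe.
df["AVG"] = row_means

print(col_means)
print(df)
```

Note that the row averages are computed before the “AVG” column is added, so the new column doesn’t end up averaging itself.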
Mixed Cells, Presentations and RISE / Reveal.js
For presentationy stuff later though, I like to add in a few extras. This first batch is related to extension management, PDF export and introducing Jupyter’s presentation capabilities, in particular RISE, which greatly augments them by bringing in the popular reveal.js presentation framework.
conda install jupyter_nbextensions_configurator nbpresent nbbrowserpdf
conda install -c damianavila82 rise
This also gives us an opportunity to experiment with the multiple cell types that Jupyter gives; (It’s not all about code!)
This regular boring comment could also be considered Markdown, if only we could tell jupyter…
By selecting the cell, and switching the cell type from the dropdown to Markdown, it’ll re-display as giant blue text while in edit mode, and when you “run” the cell, it’ll be rendered as a markdown title.
So we’ve got markdown, now to get presentations; in default configuration, we need to enable the “Slideshow” cell toolbar.
This adds an extra toolbar to each cell that allows you to control the slide state of that cell (this will make more sense in a sec)
This dropdown gives 6 options for what kind of slideshow behaviour that cell has;
-: Direct continuation, i.e. this cell will be considered part of the above cell (useful for mixing Markdown and Code in the same slide)
Slide: Make this a primary slide in the
Sub-Slide: Make this a secondary slide (i.e. slides “below” the primary slides)
Fragment: This cell will “appear” on the element above, think of it like an “appear” transition in Powerpoint/etc.
Skip: Simple enough, these cells aren’t shown.
Notes: These cells are used for Reveal.js Speaker Notes, YMMV and they don’t work with the native Jupyter integration, so I’ve never used them.
Using a combination of these, we can turn our mixed Markdown/Python notebook into a slideshow including code-outputs:
Click the “Enter/Exit Live Reveal Slideshow” button on the top toolbar (or Alt + r)
Using these tools and the pandas functionality from above, we can quickly grab the iris dataset from Vincent Arel-Bundock’s RDatasets repository and start graphing things almost instantly without showing all the gubbins in the presentation;
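A hedged sketch of the “grab” step: the URL below is my assumption of where the RDatasets iris CSV lives (check it before relying on it), load_iris is a helper name I’ve made up, and the column layout is assumed too. So that the example runs without network access, it reads a six-row stand-in in the same layout instead of hitting the URL.

```python
import io
import pandas as pd

# Assumed location of the iris CSV in the RDatasets collection.
IRIS_URL = "https://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv"

def load_iris(source=IRIS_URL):
    """Read the iris CSV from a URL, a path or a file-like object."""
    # The first column is assumed to hold the R row names, so use it as the index.
    return pd.read_csv(source, index_col=0)

# Offline stand-in: six rows in the assumed layout of the full file.
SAMPLE_CSV = """rownames,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
1,5.1,3.5,1.4,0.2,setosa
2,4.9,3.0,1.4,0.2,setosa
51,7.0,3.2,4.7,1.4,versicolor
52,6.4,3.2,4.5,1.5,versicolor
101,6.3,3.3,6.0,2.5,virginica
102,5.8,2.7,5.1,1.9,virginica
"""

iris = load_iris(io.StringIO(SAMPLE_CSV))

# Quick sanity check: per-species averages of every measurement.
species_means = iris.groupby("Species").mean()
print(species_means)
```

With a network connection, calling load_iris() with no arguments should pull the full dataset instead, assuming the URL is still good.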
Then click either the arrows on the bottom right, or PgDn, to reveal the next fragment
Graph all the things
One of the key parts of interrogating and analysing a dataset is visualisation. Python has a pile of fantastic visualisation libraries, with matplotlib being the granddaddy of them all (and a dependency of most numerical libraries because it’s handy as hell). But that’s not our only option by a long shot.
Plot.ly’s getting started instructions are spot on so I’ll not dive too far into them and go straight on to using it.
cufflinks is a brilliant little bridge between pandas and plotly, creating an overloaded iplot() method on DataFrame objects. Unfortunately it’s not in the default conda distribution so we have to pip it the old fashioned way.
conda install plotly seaborn
pip install cufflinks
But before we completely ignore that iris dataset, we’ll gently fire it at seaborn to get something interesting;
The hue argument gives much higher clarity on the delineation between the species, and for wide datasets like this, seaborn can infer what columns to plot against, allowing rapid inspection of relationships in data.
And that’s about it!
I’ll try to keep this page updated with “my ‘best’ practice” setup and workflow. Slap me around if I’ve messed anything up in these instructions, or to comment on how I’m totally wrong!
If there’s a decent level of interest in this, I might put together a proper course for this stuff, lemme know if that’s something that’d be useful! (Otherwise I’ll just keep ranting at Meetups!)