In this article you’ll find some tips to reduce the amount of RAM used when working with pandas, the fundamental Python library for data analysis and data manipulation.
When dealing with large(ish) datasets, reducing memory usage is something you need to consider when you’re stretching towards the limits of a single machine. For example, when you try to load data from a big CSV file, you want to avoid your program crashing with a MemoryError. These tips can also help speed up some downstream analytical queries.
The overall strategy boils down to choosing the right data types and loading only what you need. In this article you’ll learn about:
- Finding out how much memory is used
- Saving memory using categories
- Saving memory using smaller number representations
- Saving memory using sparse data (when you have a lot of NaN)
- Choosing the right dtypes when loading the data
- Loading only the columns you need
- Loading only a subset of rows
Finding out how much memory is used
First, let’s look into some simple steps to observe how much memory is taken by a pandas DataFrame.
For the examples I’m using a dataset about Olympic history from Kaggle. The dataset is in CSV format and takes roughly 40 MB on disk.
>>> import pandas as pd
>>> athletes = pd.read_csv('athlete_events.csv')
>>> athletes.shape
(271116, 15)
There are ~271K records with 15 columns.
For a breakdown of the memory usage, column by column, we can use memory_usage() on the whole DataFrame. The memory is reported in bytes:
>>> athletes.memory_usage(deep=True)
Index 128
ID 2168928
Name 20697535
Sex 15724728
Age 2168928
Height 2168928
Weight 2168928
Team 17734961
NOC 16266960
Games 18435888
Year 2168928
Season 17080308
City 17563109
Sport 18031019
Event 24146495
Medal 9882241
dtype: int64
The function also works for a single column:
>>> athletes['Name'].memory_usage(deep=True)
20697663
The difference between the two outputs is due to the memory taken by the index: when calling the function on the whole DataFrame, the Index has its own entry (128 bytes), while for a single column (i.e. a pandas Series) the memory used by the index is included in the single figure.
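To see this in isolation, memory_usage() on a Series also accepts index=False, which excludes the index bytes from the figure. A small sketch, using a synthetic Series rather than the Kaggle dataset:

```python
import pandas as pd

names = pd.Series(["Smith", "Jones", "Taylor"] * 1000)

with_index = names.memory_usage(deep=True)               # index included (the default)
data_only = names.memory_usage(deep=True, index=False)   # data bytes only

# the gap is exactly the memory taken by the RangeIndex
assert with_index - data_only == names.index.memory_usage()
```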
For an aggregated figure on the whole table, we can simply sum:
>>> athletes.memory_usage(deep=True).sum()
186408012 # roughly 178 MB
Why do we need deep=True? This flag will introspect the data deeply, reporting the actual system-level memory consumption. Without setting this flag, the function returns an estimate which could be quite far from the actual number, for example:
>>> athletes.memory_usage()
Index 128
ID 2168928
Name 2168928
Sex 2168928
Age 2168928
Height 2168928
Weight 2168928
Team 2168928
NOC 2168928
Games 2168928
Year 2168928
Season 2168928
City 2168928
Sport 2168928
Event 2168928
Medal 2168928
dtype: int64
>>> athletes['Name'].memory_usage()
2169056
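The gap comes from how object columns are stored: the column itself only holds 8-byte pointers (which is all the shallow estimate counts), while the string objects they point to live elsewhere on the heap and are only counted with deep=True. A quick sketch on synthetic data:

```python
import pandas as pd

s = pd.Series(["Gold", "Silver", "Bronze"] * 1000)

shallow = s.memory_usage()        # counts only the 8-byte pointers in the column
deep = s.memory_usage(deep=True)  # also counts the string objects themselves

print(shallow, deep)
```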
Another way of getting the overall memory consumption is through the function info(), which is going to be useful because it also gives us information on the data types (dtype) used by the DataFrame. Notice again the use of deep introspection for the memory usage:
>>> athletes.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 15 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ID 271116 non-null int64
1 Name 271116 non-null object
2 Sex 271116 non-null object
3 Age 261642 non-null float64
4 Height 210945 non-null float64
5 Weight 208241 non-null float64
6 Team 271116 non-null object
7 NOC 271116 non-null object
8 Games 271116 non-null object
9 Year 271116 non-null int64
10 Season 271116 non-null object
11 City 271116 non-null object
12 Sport 271116 non-null object
13 Event 271116 non-null object
14 Medal 39783 non-null object
dtypes: float64(3), int64(2), object(10)
memory usage: 177.8 MB
Notice how all the string fields are loaded as object, while all the numerical fields use a 64-bit representation, the default on most modern (64-bit) machines, even when the values would fit in a much smaller type.
Saving memory using categories
Some of the variables in our dataset are categorical, meaning they only have a handful of possible values. Rather than using a generic object for these variables, when appropriate we can use the more relevant Categorical dtype in pandas. For example, good candidates for this data type include the variables Medal, Season, or Team, amongst others.
If you don’t have a full description of the data, you can decide which columns to treat as categorical by looking at the number of unique values and confirming it is much smaller than the dataset size:
>>> athletes['Medal'].unique()
array([nan, 'Gold', 'Bronze', 'Silver'], dtype=object)
There are only three different values, plus the null value nan.
Observe the difference in memory consumption between using object and using categories:
>>> athletes['Medal'].memory_usage(deep=True)
9882369 # roughly 9.4 MB
>>> athletes['Medal'].astype('category').memory_usage(deep=True)
271539 # roughly 0.26 MB
Besides saving memory, another advantage of using categorical data in pandas is that we can include a notion of logical order between the values, different from the lexical order.
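For instance, we can declare that Bronze < Silver < Gold by building an ordered CategoricalDtype. A minimal sketch, on a synthetic medal Series:

```python
import pandas as pd

medal = pd.Series(["Gold", "Bronze", None, "Silver", "Bronze"])

# declare a logical order: Bronze < Silver < Gold
medal_dtype = pd.CategoricalDtype(categories=["Bronze", "Silver", "Gold"], ordered=True)
medal_cat = medal.astype(medal_dtype)

# ordered categories support comparisons and a meaningful min/max
print(medal_cat.max())        # 'Gold'
print(medal_cat > "Bronze")   # element-wise comparison against a category
```

With a plain object column, min/max would fall back to lexical order (Bronze < Gold < Silver), which is not what we want here.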
Saving memory using smaller number representations
Let’s look at some numerical variables, for example ID (int64), Height (float64) and Year (int64).
You can observe their range by checking the minimum and maximum values:
>>> athletes['ID'].min(), athletes['ID'].max()
(1, 135571)
The int64 dtype can hold numbers over a much broader range, at the price of a much bigger memory footprint:
>>> import numpy as np
>>> np.iinfo('int64') # integer info
iinfo(min=-9223372036854775808, max=9223372036854775807, dtype=int64)
Using int32 for the column ID is enough to store its values and it will save us half of the memory space:
>>> athletes['ID'].memory_usage(deep=True)
2169056
>>> athletes['ID'].astype('int32').memory_usage(deep=True)
1084592
We can do the same with floats:
>>> athletes['Height'].min(), athletes['Height'].max()
(127.0, 226.0)
In this case, a float16 is enough, and costs a quarter of the memory price:
>>> athletes['Height'].memory_usage(deep=True)
2169056
>>> athletes['Height'].astype('float16').memory_usage(deep=True)
542360
Finally, let’s look at the variable Year:
>>> athletes['Year'].min(), athletes['Year'].max()
(1896, 2016)
In this case, it looks like an int16 would be enough. On a closer look though, we can consider the Year column to be categorical as well: there’s only a handful of possible values. Let’s check the difference:
>>> athletes['Year'].memory_usage(deep=True)
2169056
>>> athletes['Year'].astype('int16').memory_usage(deep=True)
542360
>>> athletes['Year'].astype('category').memory_usage(deep=True)
272596
For this particular situation, it makes more sense to use categories rather than numbers, unless we plan on performing arithmetic operations on this column (you cannot sum or multiply two categories).
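Rather than working out the smallest dtype by hand for every column, we can also let pandas do it: pd.to_numeric() with the downcast argument picks the smallest representation that fits the observed min/max. A sketch, using a synthetic Series with the same 1..135571 range as the ID column:

```python
import pandas as pd

ids = pd.Series(range(1, 135572), dtype="int64")

# let pandas pick the smallest integer dtype that fits the values
small = pd.to_numeric(ids, downcast="integer")
print(small.dtype)  # int32: 135571 overflows int16, fits int32
```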
Saving memory using sparse data (when you have a lot of NaN)
The sparse dtypes in pandas are useful when dealing with columns that have a lot of null values. Depending on your variables, you may want to consider representing your data as sparse. The info() function used earlier tells us how many non-null records we have for each column, so if that number is much lower than the size of the dataset, it means we have a lot of null values.
This is exactly the case of the Medal column that we treated as categorical earlier:
>>> athletes['Medal'].memory_usage(deep=True)
9882369
>>> athletes['Medal'].astype('category').memory_usage(deep=True)
271539
>>> athletes['Medal'].astype('Sparse[category]').memory_usage(deep=True)
199067
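The saving comes from the fact that a sparse column only stores the non-fill values (plus their positions), not the NaN entries. You can check how sparse a column really is through the density attribute of the sparse accessor. A small sketch, on a synthetic numeric column that is 96% NaN:

```python
import numpy as np
import pandas as pd

# a column where 96% of the values are missing, as with Medal
s = pd.Series([1.0] * 4 + [np.nan] * 96)
sparse = s.astype("Sparse[float64]")

# density is the fraction of entries that are actually stored
print(sparse.sparse.density)   # 0.04
```

The lower the density, the bigger the saving compared to the dense representation.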
Choosing the right dtypes when loading the data
So far we have looked at the memory usage of different dtypes, converting the columns after the dataset was loaded.
Once we have chosen the desired dtypes, we can make sure they are used when loading the data, by passing the schema as a dictionary to the read_csv() function:
>>> schema = {
... 'ID': 'int32',
... 'Height': 'float16',
... # add all your Column->dtype mappings
... }
>>> athletes = pd.read_csv('athlete_events.csv', dtype=schema)
Note: it’s not possible to use the Sparse dtype when loading the data in this way; we still need to convert the sparse columns after the dataset is loaded.
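The resulting two-step pattern looks like this. For a self-contained sketch, the snippet below reads from a small inline CSV standing in for athlete_events.csv:

```python
import io

import pandas as pd

# a tiny inline stand-in for the real CSV file
csv = io.StringIO("ID,Weight\n1,64.0\n2,\n3,\n4,80.0\n")

# read_csv accepts the regular dtypes in the schema...
df = pd.read_csv(csv, dtype={"ID": "int32", "Weight": "float32"})

# ...while the sparse conversion has to happen as a second step
df["Weight"] = df["Weight"].astype("Sparse[float32]")
print(df.dtypes)
```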
Loading only the columns you need
Depending on the application, we often don’t need the full set of columns in memory.
In our medal example, let’s say we simply want to compute the overall count of the medals per nation. For this specific use case, we only need to look at the columns Medal and NOC (National Olympic Committee).
We can pass the argument usecols to the read_csv() function:
>>> athletes = pd.read_csv('athlete_events.csv', usecols=['NOC', 'Medal'], dtype=schema)
>>> athletes.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 271116 entries, 0 to 271115
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 NOC 271116 non-null category
1 Medal 39783 non-null category
dtypes: category(2)
memory usage: 816.3 KB
If we only want to count the medals per NOC, we can use a groupby operation on the above DataFrame:
>>> athletes.groupby('NOC')['Medal'].count()
# (long output omitted)
It’s also worth noting that for this groupby operation there is a significant speed-up when using the categorical data types.
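On a toy DataFrame, the medal count per committee looks like this; note that count() skips the NaN entries, so only actual medals are tallied (observed=True limits the output to categories that appear in the data):

```python
import pandas as pd

df = pd.DataFrame({
    "NOC": pd.Categorical(["USA", "USA", "ITA", "FRA"]),
    "Medal": pd.Categorical(["Gold", None, "Bronze", None]),
})

# count() ignores NaN, so this tallies actual medals per committee
medal_counts = df.groupby("NOC", observed=True)["Medal"].count()
print(medal_counts)
```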
Loading only a subset of rows
To complete the picture, the read_csv() function also offers options to limit the number of rows we’re loading. This can be useful in a few circumstances, for example when we just want to take a peek at the data without looking at the whole dataset, or when the dataset is big enough that we can answer our analytical questions with a sample.
The first case, just taking a peek at the data, is straightforward:
>>> athletes = pd.read_csv('athlete_events.csv', nrows=1000)
>>> len(athletes)
1000
Using the nrows argument, we’ll load the first N records (1000 in the example above) into the DataFrame. This can often be enough to get a first feel for the data before digging into further analysis.
If we want to implement some random sampling instead, the read_csv() function also offers the skiprows argument. Specifically, if we pass a custom function to this argument, we can implement our sampling logic. The function takes one argument (the row number) and should return True if you want to skip that row.
In the example, we want to keep the first row because it has the column names, and we load only ~10% of the data, using the function random() which returns a random float in the [0, 1) range (if this number is greater than 0.1, we skip the row):
>>> from random import random
>>> def skip_record(row_number):
... return random() > 0.1 and row_number > 0
...
>>> athletes = pd.read_csv('athlete_events.csv', skiprows=skip_record)
>>> len(athletes)
27176 # 27K rows in the sample, ~271K in the full dataset
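Note that this sample changes on every run. If you want a reproducible sample, one option is to draw from a seeded random generator instead of the module-level random(). A sketch, reading from a small inline CSV standing in for the real file (the seed value 0 is an arbitrary choice):

```python
import io
from random import Random

import pandas as pd

# inline stand-in for the CSV: a header row plus 100 data rows
csv = io.StringIO("ID\n" + "\n".join(str(i) for i in range(100)))

rng = Random(0)  # fixing the seed makes the sample reproducible

def skip_record(row_number):
    # always keep row 0 (the header), then keep ~10% of the data rows
    return row_number > 0 and rng.random() > 0.1

sample = pd.read_csv(csv, skiprows=skip_record)
print(len(sample))
```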
Summary
In this article we have discussed some options to save memory with pandas, by choosing the most appropriate data types and by loading only the data we need for our analysis.
Do you need to upskill your team in pandas? Marco runs public and private training courses on Effective pandas and other Python topics, please get in touch to know more.
Sign up to the newsletter Musings on Data to receive periodic updates and recommendations from Marco on Data Science.
Follow Marco on Twitter.