Getting into Data Science presentation at Hisar Coding Summit 2021

Last week I had the opportunity to speak at the Hisar Coding Summit 2021, an event organised by students of the Hisar School of Istanbul. The remote format opened the doors for participants around the globe, but the audience was mainly high-school students with an interest in Data Science.

The title of my presentation, Getting into Data Science, already suggests the core topics: an overview of what Data Science is, some examples of interesting applications and an overview of the types of skills that are useful for a data scientist.

Link to the slides

Some of the core points I discussed:

  • Data Science is a three-way handshake between Computer Science, Statistics and Business Domain Knowledge. There are many Data Science Venn diagrams out there, with Drew Conway’s (which I referenced in the presentation) probably being the most recognised. One aspect of the Venn diagram representation that I don’t fully like is that it can be misread as suggesting that Data Scientists only exist within the intersection of the three disciplines (hence the representation of Data Scientists as Unicorns). The point of the three-way handshake is to suggest that Data Scientists’ skills can be found in the union, rather than the intersection, of such a diagram (i.e. you don’t have to be an expert at everything to do Data Science).
  • “Data Scientist” is not the only job title in Data Science: there are plenty of other professionals whose skills are crucial for the positive outcome of a Data Science project, including data engineers, software engineers and business analysts. There is also still some fuzziness about what data scientists are supposed to do, so different organisations will use job titles differently. I used Monica Rogati’s Data Science Hierarchy of Needs to illustrate this point, explaining how people and companies often focus too much on the cool stuff at the top of the pyramid (AI and Deep Learning), which can only be achieved with solid foundations (reliable infrastructure and data handling) at the bottom of the pyramid.
  • Data Science Applications: too many to mention. One of the great aspects of Data Science is that it lets you work in almost any domain. Or if you prefer, any domain can benefit from Data Science. I offered some examples making references to previous presentations given at PyData London (either the meetup or our annual conference), citing Weather Forecasting, Healthcare, Biology, Journalism, Food Recommendations and more.
  • Data Science Skills: I again used the three-way handshake to discuss how the skills of a Data Scientist will lie somewhere in the union of Computer Science, Mathematics/Statistics and Domain Knowledge. We discussed this in terms of where to start and where to go next, rather than must-haves. I find a lot of those “top 10 must-have Data Science skills” articles out there to be silly at best, or damaging at worst, because again they instil the notion that you have to master multiple disciplines before you can even start, intimidating people rather than encouraging them into Data Science. I hope my main message, “you don’t have to be an expert at everything to do Data Science”, came through and left the students eager to learn more.

The session wrapped up with a few excellent questions (that I did not expect from high-school students!) ranging from data privacy and ethics to queries on state-of-the-art computer vision and NLP.

PyCon Italy 2017 write-up

Last week I travelled to Florence, where I attended PyCon Otto, the 8th edition of the Italian Python Conference. As expected, it was yet another great experience with the Italian Python community and many international guests.

This year the very first day, Thursday, was beginners’ day, with introductory workshops run by volunteer mentors. Thanks to a cancelled flight, I missed out on this opportunity, so I joined the party only for the main event.

On Friday, I ran another version of my tutorial on Natural Language Processing for beginners. The tutorial was oversubscribed and the organisers really made an effort to accommodate as many people as possible in the small training room, so in the end I had ~35 attendees. After the workshop, I had a lot of interesting conversations and some ideas on how to improve the material with additional exercises. Some credit for this is due to my friend Miguel Martinez, who contributed the text classification material for the first edition of the workshop.

As per tradition, at the end of the workshop I also ran a raffle to give away a free copy of my book, Mastering Social Media Mining with Python.

On Saturday, I gave a talk titled Word Embeddings for Natural Language Processing with Python (link to slides), somewhat of a natural follow-up to the tutorial with slightly more advanced concepts, but still tailored for beginners. The talk was really well received, and a lot of interesting questions and conversations came up.

Following the traditional social event on Saturday night (a huge fiorentina), Sunday was pretty much a mellow day, with the last few excellent talks, a light lunch and my journey back.

It was great to meet so many new and old friends! The quality of this community event was stellar, and this was possible thanks to the contributions of organisers, volunteers, mentors, speakers and all the attendees.

See you for PyCon Italy 2018!

PyCon UK 2016 write-up

Last week I had a long weekend at PyCon UK 2016 in Cardiff, and it was a fantastic experience! Great talks, great friends/colleagues and lots of ideas.

On Monday 19th, the last day of the conference, my friend Miguel and I ran a tutorial/workshop on Natural Language Processing in Python (the GitHub repo contains the Jupyter notebooks we used as well as some slides for an introduction).

Our NLP tutorial

Since I’ve already mentioned it, I’ll start from the end :)

The tutorial was tailored for NLP beginners and, as I mentioned explicitly at the very beginning, I wasn’t there to impress the experts. Rather, the whole point was to get the attendees a bit curious about Natural Language Processing, and to show them what you can do with a few lines of Python.

Overall, I think we were quite lucky as we had the perfect audience: the right number of people (20+) with a bit of Python knowledge but not much NLP knowledge.

We only had some minor hiccups with the installation process, which is something we’re going to work on to make it smoother and more beginner-friendly. In particular the things I’d like to improve are:

  • add some testing / pre-flight checks, e.g. “how do I know that the environment is set up correctly?” (Miguel has already added this)
  • support for Windows: I’m quite useless at troubleshooting Windows issues, and a couple of attendees had trouble with the installation process; maybe a virtual machine setup would help
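
As a rough sketch of what a pre-flight check of this kind could look like: the script below verifies the Python version and that a list of modules can be found. The module names and the minimum version are placeholders chosen for illustration, not the tutorial's actual requirements, and this is not the check that was added to the repo.

```python
import importlib.util
import sys

# Placeholder requirements (assumptions): swap in the real dependency list.
REQUIRED_MODULES = ["json", "collections"]
MIN_PYTHON = (3, 5)


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def preflight(names=REQUIRED_MODULES, min_python=MIN_PYTHON):
    """Return a list of readable problems; an empty list means all good."""
    problems = []
    if sys.version_info < min_python:
        problems.append("Python %d.%d or newer is required" % min_python)
    for name in missing_modules(names):
        problems.append("Module not found: " + name)
    return problems


if __name__ == "__main__":
    issues = preflight()
    print("Environment OK" if not issues else "\n".join(issues))
```

Attendees could run a script like this before the session and report any lines it prints other than "Environment OK".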

I also think that having the material available in advance, so the attendees can start setting up the environment, is very helpful. Most of them were quite engaged and I received a couple of “bug reports” on the fly, and even a pull request that improved the installation process (thanks!).

Last but not least, I was also happy to give out a copy of my book (Mastering Social Media Mining with Python) that I had with me (the raffle was implemented on the spot through random.choice(), and the book went to Paivi from Django Girls).
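
For the curious, a raffle along these lines takes only a couple of lines of Python. This is a minimal sketch: the attendee names are made up, and the `draw_winner` helper is a hypothetical wrapper around `random.choice()`, not the exact code typed on the day.

```python
import random

# Made-up attendee list (assumption): the real raffle used the people in the room.
attendees = ["Alice", "Bob", "Carol", "Dave"]


def draw_winner(names, rng=random):
    """Pick one name uniformly at random from a non-empty list."""
    return rng.choice(names)


winner = draw_winner(attendees)
print("And the book goes to:", winner)
```

Passing a seeded `random.Random` instance as `rng` makes the draw reproducible, which is handy for testing but defeats the purpose of a raffle.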

I’ll give a shorter version of this tutorial at PyCon Ireland later this year, so if you’re around, I’ll see you there :)

Unfortunately, the tutorials were not recorded so there is no video on-line, but the slides are in the GitHub repo so please dig in and send feedback if you have any.

The Open Day

Thursday 15th was “day zero” of the conference, hosted at Cardiff University. The ticket was free, although there was limited capacity. The day was aimed at introducing a new audience to Python and PyCon. We didn’t see much Python code on that day, as the talks were mainly for newcomers, yet we had a lot of food for thought. This is a great way to introduce more people to Python and to show them how the community is friendly and happy to get more beginners on board.

Teachers, Kids and Education

One of the main themes of the conference was Education. Friday 16th, the first day of the main event, was labelled “Teachers Day”, while Saturday 17th was “Kids Day”. The effort to make CS education more accessible for kids was very clear, and some of the initiatives were really spot-on. In particular, some of the kids were able to hack a small project together in a very short time, and they delivered a “show and tell” session at the end of the second day. I found their creativity very impressive, as well as the fact that they stood in front of a crowd of 500+ developers to show what they had been working on during their day.

Community in the Broader Sense

Another aspect that became quite clear is the strength of the Python Community. Some representatives of PyCon Poland, PyCon Switzerland and Django Europe introduced their upcoming events. Some attendees with fewer financial means, including students from India, were given the opportunity to attend through some form of financial support.

Representatives from PyCon Namibia and PyCon Zimbabwe were also attending and they discussed some of the challenges they are facing while building a local community in their countries.

In particular, the work Jessica from PyNAM is carrying out with young learners is extremely inspiring and deserves more visibility (link to the video of her talk).

Accessibility for Everybody

One of the features that I had never experienced at a conference before was the speech-to-text transcription. During the talks, the speech-to-text team were very busy writing down what the speakers were saying in real time. While this is sometimes considered an accessibility feature which might benefit only deaf users, it turned out live captions are extremely beneficial for everybody. Firstly, not all the non-deaf attendees have perfect hearing. Secondly, not everybody is a native English speaker (both speakers and audience), so a word might be missed, or an accent might cause some confusion. Lastly, not every attendee is paying full attention to every talk for the whole talk: sometimes towards the end of the day, you just switch off for a moment and the live captions allow you to catch up.

Providing an accessibility feature turned out to be beneficial for everybody.

Shout out to the Organisers

Organising such a big event (500+ attendees) is not an easy task, so all the people who have worked hard to make this conference happen deserve a big round of applause. Not naming names here, but if you’ve been involved, thanks!

Being Interviewed about NLP

This was a bit random, in a very pleasant way. On Saturday, Miguel, Lev from RaRe Technologies and I spent some time with Kate Jarmul, who by the way had just introduced her book on data wrangling, and also delivered a tutorial on the topic. The conversation was about our views, in the broader sense, on NLP / Text Analytics, how we got into this field, how we see this field evolving and so on. Apparently, this was an interview with some experts in the field, for a piece she’s writing for the O’Reilly blog (I should put an amazed emoticon here).

Using Python for …

The breadth of the topics discussed during the conference was really amazing. I think these kinds of events are a great way to see what people are working on and how the tools we use every day are used by other people.

I’m not going to name any talk in particular, because there are too many good talks that deserve to be mentioned.

In terms of topics, some fields that are well covered by Python are:

  • Data Science (and related topics like data cleaning, NLP and machine learning)
  • Web development (with Django and so many interesting libraries)
  • electronics and robotics (with Raspberry Pi, micro:bit, MicroPython etc)
  • you name it :)

I’m probably not saying anything new here, but it was nice to see it first-hand and step outside my data-sciency comfort zone.

Summary

Thanks to everybody who contributed to this event, and see you in Cardiff for PyCon UK 2017!

PyData London 2016 write-up

Last weekend I was at the PyData London conference for three Pythonic days. Firstly, thanks to the organisers, volunteers, speakers, sponsors and everyone who contributed in one way or another to make the event a great success.

This year I had the opportunity to contribute as a member of the review committee, which means I had a glimpse behind the scenes and I know how many great proposals we had. With three days and three to four tracks running in parallel, there is room for a lot of Pythonic parley, yet unfortunately many good proposals had to be turned down due to time/space constraints. The programme turned out to be great nevertheless.

The three days were really intense so there is just too much to say, but I’ll try to summarise some of the take-home messages.

Tutorials: delivering a tutorial is difficult. Everything that can go wrong will go wrong (big screen that goes bananas for 10 minutes, flaky Internet connection so a conda install takes ages, you name it). Jupyter notebook makes life better, but I strongly feel for the speakers, so a big thank you for taking the time to prepare some quality material.

Topics of interest: some topics seemed to capture most of the attention this year. In particular, there was a lot of interest around data pipelines, deep learning and Bayesian stats. Unsurprising?

Keynotes: following the recent news on the LIGO project, Prof. Andreas Freise gave an introduction to gravitational waves, lasers, the latest achievements in physics and other cool things far beyond my understanding. Something I could understand and relate to is the way he described needing to write code to do his job, even though writing code is not his main job. This is true for many academics and researchers without a software engineering background, who were also the main audience of my talk on building data pipelines (luckily enough, scheduled right after the keynote in the same room).

The second keynote, given by Tetiana Ivanova, was about the beginning of her journey in Data Science without formal education. Some of the suggestions were sensible; in fact, I recently shared some of the same ideas in a short talk to UCL students and post-docs who want to move to industry.

The third and last keynote was given by Travis Oliphant: CEO of Continuum Analytics, author of NumPy, creator of SciPy, Pythonista since the late 1990s. His talk was about scaling up and scaling out the PyData stack. Things to watch out for: Numba and Dask. Really exciting stuff going on!

My talk: I presented “Building Data Pipelines in Python”, with a focus on the need to bring R&D and Engineering together, and how basic engineering principles can be beneficial even if your job is not all about writing code. After presenting a very similar talk at PyCon Italy, I found the audience in London to be a bit more on the academic side than I initially thought, which was perfect for my engineering rants. After the usual first few minutes of feeling awkward when speaking publicly, I started my discussion on unit testing and asked how many in the audience write unit tests regularly. Random guy from the audience: “What’s a unit test?”. Thank you kind stranger, you lifted my spirit and the rest of the talk was a breeze.

The slides of my talk are on my speakerdeck.

Last year it took several months to get the videos out, this year only one day! So this is the video of my talk: https://www.youtube.com/watch?v=7NzH1Gx8-4E

I had some interesting questions after the talk and I also had some nice conversations the day after. Apparently, I raised some interest in Luigi: a few people told me that, after listening to my talk, they really had to attend the other talk about using Luigi in production, delivered by Pete Owlett from Deliveroo (the room was overflowing so I couldn’t even get close!). There was also some genuine interest in unit testing, and a very interesting question was how to apply it when working with Jupyter notebooks.
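
One common way to approach the notebook question (an assumption for illustration, not necessarily the answer given on the day) is to move the logic out of notebook cells into plain functions and test those with the standard `unittest` module. The `normalise` helper below is hypothetical.

```python
import unittest


def normalise(text):
    """Hypothetical notebook helper: lowercase text and collapse whitespace."""
    return " ".join(text.lower().split())


class NormaliseTest(unittest.TestCase):
    def test_lowercases(self):
        self.assertEqual(normalise("Hello World"), "hello world")

    def test_collapses_whitespace(self):
        self.assertEqual(normalise("  a \t b\n"), "a b")


if __name__ == "__main__":
    # Explicit argv and exit=False keep this runnable from a notebook cell too.
    unittest.main(argv=["ignored"], exit=False)
```

The same functions can later be moved into a module and imported by both the notebook and the test suite, so the notebook stays a thin layer of exploration on top of tested code.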

Lightning talks: apparently, saving your Jupyter notebooks on git is an issue that is taken very seriously by the community. In fact, three speakers came up with different solutions for the same problem.

Organisation: hats off to the organisers and everyone involved, and see you at the PyData London meetup!

Get in touch if you also have a write-up of the event:

@MarcoBonzanini

PyCon Italia / PyData Italy 2016 Write-Up

Last week I travelled to Florence to attend PyCon Sette, the seventh edition of the Italian Python Conference, born 10 years ago and held annually (with three editions of EuroPython in between).

First off, I have something to admit: as this was my first time at PyCon Italia, clearly I didn’t know what I was missing. I’ve been overly busy with work and side projects, so this is the perfect excuse to resume the blog.

Florence

The city doesn’t need much introduction: it’s simply one of the most beautiful cities in the world. I hadn’t been there for a few years, but things don’t seem to be very different from a tourist’s point of view. The craft beer scene is booming, but at the same time culinary traditions are well preserved. Both of these are big thumbs-up for me. The best random moment of my trip: getting lost in the back streets of the old city centre, and then finding a dodgy hole-in-the-wall place that sells incredible focaccia and panini.

The Conference

PyCon Sette can be summarised as three intense days of Python, with more than 500 attendees. The first day was opened by Alex Martelli with a keynote about exception handling in Python 2 vs Python 3. Apart from the keynotes, at any given time we had between 4 and 6 parallel sessions of talks or trainings. I decided to stick to the PyData track for the whole time, although the other tracks also featured some interesting talks. Some of the tracks were related to a particular sub-community, with PyData and DjangoVillage having a strong presence, but Odoo, DjangoGirls and the Italian Postgres User Group are also worth mentioning.

I listened to many interesting talks. Off the top of my head, a few to remember: the talk about Internet of Things by Stefano Terna of TomorrowData.io (also winners of the start-up contest), the one about deployment of scikit-learn models in the cloud by Alex Casalboni and an interesting one about Functional Programming and Dask by Holger Peters.

Overall, hats off to the organisers. In particular, I had some conversations with Valerio Maggio who is the founder of PyData Italy. We exchanged some opinions about the conference and the community in the broader sense. Hopefully the interest around Data Science in Italy will keep rising, so maybe several local events throughout the year will be held, rather than having just one big national event per year.

My Talk

On Saturday, I gave a talk on Building Data Pipelines in Python. I have written about building data pipelines with Luigi before, but this talk gave me the opportunity to look at the bigger picture. The general message was that Research and Engineering are different disciplines, but we (data-sciency and researchy people) can benefit from trying to meet in the middle. In particular, good engineering practices can help the less engineering-oriented researchers in their day-to-day mundane tasks. After opening the discussion on the overall topic, I had a brief moment of ranting about unit testing (or the lack of testing culture in some academic circles), then I introduced Luigi as a workflow manager to build pipelines in Python, and I closed with an overview of logging (described by Alex Martelli in his keynote as something that scares people off, at least initially) and a consideration about using good engineering practices in research.
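
To show that logging need not be scary, here is a minimal sketch using only the standard library. The logger name and the `clean_records` step are hypothetical examples, not code from the talk.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("pipeline")  # hypothetical logger name


def clean_records(records):
    """Hypothetical pipeline step: drop empty records and log what happened."""
    logger.info("Cleaning %d records", len(records))
    cleaned = [r.strip() for r in records if r and r.strip()]
    dropped = len(records) - len(cleaned)
    if dropped:
        logger.warning("Dropped %d empty records", dropped)
    return cleaned


result = clean_records(["  spam ", "", "eggs", "   "])
```

Compared with scattered `print()` calls, the log lines carry a timestamp and a severity level, and the verbosity can be changed in one place through `level`.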

The talk was addressed to beginners and to the less engineering-savvy PyData users, so expert software engineers probably didn’t benefit much from it. I had a good response anyway, with several people coming over after the talk for a chat. All in all, if at least one researcher looks into testing or decides to try one of the workflow managers I mentioned, I’d say I’ve reached my goal.

The slides of my talk are on my speakerdeck (videos will be on-line soon).

See you next year in Florence!