Mastering Social Media Mining with Python

book-cover

Great news, my book on data mining for social media is finally out!

The title is Mastering Social Media Mining with Python. I’ve been working with Packt Publishing over the past few months, and in July the book has been finalised and released.

Links:

As part of Packt’s Mastering series, the book assumes the readers already have some basic understanding of Python (e.g. for loops and classes), but more advanced concepts are discussed with examples. No particular experience with Social Media APIs and Data Mining is required. With 300+ pages, by the end of the book, the readers should be able to build their own data mining projects using data from social media and Python tools.

A bird’s eye view on the content:

  1. Social Media, Social Data and Python
    • Introduction on Social Media and Social Data: challenges and opportunities
    • Introduction on Python tools for Data Science
    • Overview on the use of public APIs to interact with social media platforms
  2. #MiningTwitter: Hashtags, Topics and Time Series
    • Interacting with the Twitter API in Python
    • Twitter data: the anatomy of a tweet
    • Entity analysis, text analysis, time series analysis on tweets
  3. Users, Followers, and Communities on Twitter
    • Analysing who follows whom
    • Mining your followers
    • Mining communities
    • Visualising tweets on a map
  4. Posts, Pages and User Interactions on Facebook
    • Interacting the Facebook Graph API in Python
    • Mining you posts
    • Mining Facebook Pages
  5. Topic analysis on Google Plus
    • Interacting with the Google Plus API in Python
    • Finding people and pages on G+
    • Analysis of notes and activities on G+
  6. Questions and Answers on Stack Exchange
    • Interacting with the StackOverflow API in Python
    • Text classification for question tags
  7. Blogs, RSS, Wikipedia, and Natural Language Processing
    • Blogs and web pages as social data Web scraping with Python
    • Basics of text analytics on blog posts
    • Information extraction from text
  8. Mining All the Data!
    • Interacting with many other APIs and types of objects
    • Examples of interaction with YouTube, Yelp and GitHub
  9. Linked Data and the Semantic Web
    • The Web as Social Media
    • Mining relations from DBpedia
    • Mining geo coordinates

The detailed table of contents is shown on the Packt Pub’s page. Chapter 2 is also offered as free sample.

Please have a look at the companion code for the book on my GitHub, so you can have an idea of the applications discussed in the book.

PyData London 2016 write-up

Last weekend I was at the PyData London conference for three Pythonic days. Firstly, thanks to the organiser, volunteers, speakers, sponsors and everyone who has contributed in a way or another to make the event a great success.

This year I had the opportunity to contribute as member of the review committee, which means I had a glimpse at the behind-the-scenes and I know how many great proposals we had. With three days and three to four tracks running in parallel, there is room for a lot of Pythonic parley, yet unfortunately many good proposals had to be turned down due to time/space constraints. The programme turned out to be great nevertheless.

The three days were really intense so there is just too much to say, but I’ll try to summarise some of the take-home messages.

Tutorials: delivering a tutorial is difficult. Everything that could go wrong, will go wrong (big screen that goes bananas for 10 minutes, flaky Internet connection so a conda install takes ages, you mention it). Jupyter notebook makes life better, but I strongly feel for the speakers, so a big thank you for taking the time to prepare some quality material.

Topics of interest: some topics seem to capture most of the attention this year, in particular there was a lot of interest around data pipelines, deep learning and Bayesian stats. Unsurprising?

Keynotes: following the recent news on the LIGO project, Prof. Andreas Freise gave an introduction to gravitational waves, lasers, the latest achievements in physics and other cool things far beyond my understanding. Something I could understand and relate to is his way to describe how he needs to write code to carry on his job, but writing code is not his main job. This is true for many academics and researchers without a software engineering background, who were also the main audience of my talk on building data pipelines (luckily enough, scheduled right after the keynote in the same room).

The second keynote, given by Tetiana Ivanova, was about the beginning of her journey in Data Science without formal education. Some of the suggestions were sensible, in fact I recently shared some of the same ideas in a short talk to UCL students and post-docs who want to move to industry.

The third and last keynote was given by Travis Oliphant: CEO of Continuum Analytics, author of NumPy, creator of SciPy, Pythonista since the late 1990’s. His talk was about scaling up and scaling out the PyData stack. Things to watch out for: Numba and Dask. Really exciting stuff going on!

My talk: I presented “Building Data Pipelines in Python”, with a focus on the need to bring R&D and Engineering together, and how basic engineering principles can be beneficial even if your job is not all about writing code. After presenting a very similar talk at PyCon Italy, I found the audience in London to be a bit more on the academic side than I initially thought, which was perfect for my engineering rants. After the usual first few minutes of feeling awkward when speaking publicly, I started my discussion on unit testing and asked how many in the audience write unit tests regularly. Random guy from the audience: “What’s a unit test?”. Thank you kind stranger, you lifted my spirit and the rest of the talk was a breeze.

The slides of my talk are on my speakerdeck.

Last year it took several months to get the videos out, this year only one day! So this is the video of my talk: https://www.youtube.com/watch?v=7NzH1Gx8-4E

I had some interesting questions after the talk and I also had some nice conversations the day after. Apparently, I raised some interest on Luigi, in fact a few people told me how they really had to attend the other talk about using Luigi in production, deliverd by Pete Owlett from Deliveroo, after listening to mine (the room was overflowing so I couldn’t even get close!). There was also some genuine interest on unit testing, and a very interesting question was how to apply it when working with Jupyter notebooks.

Lighting talks: apparently, saving your Jupyter notebooks on git is an issue that is taken very seriously by the community. In fact, three speakers came up with different solutions for the same problem.

Organisation: hat off to the organisers and everyone involved, and see you at the PyData London meetup!

Get in touch if you also have a write up of the event:

@MarcoBonzanini

PyCon Italia / PyData Italy 2016 Write-Up

Last week I’ve travelled to Florence to attend PyCon Sette, the seventh edition of the Italian Python Conference, born 10 years ago and held annually (with three editions of EuroPython in between).

First off, I have something to admit: as this was my first time at PyCon Italia, clearly I didn’n know what I was missing. Being overly busy with work and side projects, this is the perfect excuse to resume the blog.

Florence

The city doesn’t need much presentation: it’s simply one of the most beautiful cities in the world. I haven’t been there for a few years but things don’t seem to be very different from a turist’s point of view. The craft beer scene is booming, but at the same time culinary traditions are well preserved. Both of these are big thumbs-up for me. The best random moment of my trip: getting lost in the back streets of the old city centre, and then finding a dodgy hole-in-the-wall place that sells incredible focaccia and panini.

The Conference

PyCon Sette can be summarised as three intense days of Python, with more than 500 attendees. The first day was opened by Alex Martelli with a keynote about exception handling in Python 2 vs Python 3. A part from the keynotes, at any given time we had between 4 and 6 parallel sessions of talks or trainings. I decided to stick to the PyData track for the whole time, although the other tracks were also featuring some interesting talks. Some of the tracks were related to a particular sub-community, with PyData and DjangoVillage having a strong presence, but also Odoo, DjangoGirls and the Italian Postgres User Group are worth mentioning.

I’ve listened to many interesting talks. On top of my head, a few to remember: the talk about Internet of Things by Stefano Terna of TomorrowData.io (also winners of the start-up contest), the one about deployment of scikit-learn models in the cloud by Alex Casalboni and an interesting one about Functional Programming and Dask by Holger Peters.

Overall, hats off to the organisers. In particular, I had some conversations with Valerio Maggio who is the founder of PyData Italy. We exchanged some opinions about the conference and the community in the broader sense. Hopefully the interest around Data Science in Italy will keep rising, so maybe several local events throughout the year will be held, rather than having just one big national event per year.

My Talk

On Saturday, I gave a talk on Building Data Pipelines in Python. I wrote about building data pipelines with Luigi before, but this talk gave me the opportunity to look at the bigger picture. The general message was that Research and Engineering are different disciplines, but we (data-sciency and researchy people) can benefit from trying to meet in the middle. In particular, good engineering practices can help the less engineering-oriented researchers in their day-to-day mundane tasks. After opening the discussion on the overall topic, I had a brief moment of ranting about unit testing (or the lack of testing culture in some academic circles), I introduced Luigi as a workflow manager to build pipelines in Python and I closed with an overview on logging (described by Alex Martelli in his keynote as something that scares people off, at least initially) and a consideration about using good engineering practices in research.

The talk was addressed to beginners and to the less engineering-savvy PyData users, so expert software engineers probably didn’t benefit much from it. I had anyway a good response with several people coming after the talk for a chat. All in all, if at least one researcher will look into testing or will decide to try one of the workflow managers I mentioned, I’d say I’ve reached my goal.

The slides of my talk are on my speakerdeck (videos will be on-line soon).

See you next year in Florence!

PyData London Meetup 2015-02-03

A quick update with my impressions on the last PyData London meet-up I attended this evening.

I missed the past couple of meet-ups because of some clash on my personal schedule, so this was my first time in the new venue, still close to Old Street, hosted by Lyst. A nice big room to accomodate many many people (more than 200 probably?), and a nice selection of craft beers.

We started with the usual initial community-related announcements, including the impressive achievement of the group: 1,000+ members on the meetup.com dedicated page! Well done guys!

The core topic of the evening was data visualisation. For this reason, the main candidate for Python Module Of The Month was (… drum roll …) matplotlib :-D

The first talk, “Thinking about data visualisation”, was given by Andy Kirk from Visualising Data. Thinking was an important keyword of the presentation: after an initial comparison between talent (e.g. artistic, creative) vs thinking, Andy went on discussing different aspects of thinking and how our thinking can provide more value to the visualisation we are proposing. Long story short, you don’t have to be an artist to create useful visualisation, assuming you take the context and the overall story into account.

The second talk, “Lies, damned lies and dataviz”, was given by Andrew Clegg from Etsy. Andrew started the presentation with some argument in support of providing good dataviz, and then he continued with a great sequence of examples of bad dataviz, discussing how some of them simply confuse the user without providing any additional insight, while others really trick the user into wrong conclusions. Whether these are damned lies or just bad design choices… well that’s another story.

Both presentations were really great, and both of them technology-agnostic, which is something I enjoyed for some reason.

During the final lightning-and-community-talks, the interesting news was the announcement of a new O’Reilly book on data visualisation with Python and Javascript (link to come).

Overall a very nice evening, congratulations to all the organisers.

PyData London meetup.com group:
http://www.meetup.com/PyData-London-Meetup/