Sentiment Analysis with Python and scikit-learn

Sentiment Analysis is a field of study which analyses people’s opinions towards entities like products, typically expressed in written forms like on-line reviews. In recent years, it’s been a hot topic in both academia and industry, also thanks to the massive popularity of social media which provide a constant source of textual data full of opinions to analyse.

This article discusses one particular application of sentiment analysis: sentiment classification at the document level. In other words, given a document (e.g. a review), the task consists in finding out whether it provides a positive or a negative sentiment towards the product being discussed.

The following paragraphs describe the setup and the main components of our classification example, with samples of code in Python using scikit-learn, a popular machine learning library. The complete code is discussed at the end of this post, and is available as a Gist on GitHub.

Setting up for the experiments

We’re using Python and in particular scikit-learn for these experiments. To install scikit-learn:

pip install -U scikit-learn

Scikit-learn has a couple of dependencies, in particular numpy and scipy. If these dependencies are not resolved by pip for some reason, you can make the installation explicit with:

pip install -U numpy scipy scikit-learn

The data set used for these experiments is the well-known Polarity Dataset v2.0, downloadable from here.

The data set contains 2,000 documents, labelled and pre-processed. In particular, there are two labels, positive and negative with 1,000 documents each. Every document has been tokenised and lowercased; each line of a document represents a sentence. This pre-processing takes out most of the work we have to do to get started, so we can focus on the classification problem. Real world data are usually messy and need proper pre-processing before we can make good use of them. All we need to do here is read the files and split the words over white spaces.

Feature extraction in scikit-learn

In classification, items are represented by their features. In our case, documents are represented by their words, so we will use words as features.

scikit-learn provides several vectorizers to translate the input documents into vectors of features (or feature weights). Typically we want to give appropriate weights to different words, and TF-IDF is one of the most common weighting schemes used in text analytics applications. In scikit-learn, we can use the TfidfVectorizer:

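from sklearn.feature_extraction.text import TfidfVectorizer

# train_data and test_data are lists of documents (strings)
vectorizer = TfidfVectorizer(min_df=5,
                             max_df=0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)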

The parameters used in this example with the vectorizer are:

min_df=5, discard words appearing in fewer than 5 documents
max_df=0.8, discard words appearing in more than 80% of the documents
sublinear_tf=True, use sublinear term-frequency weighting, i.e. 1 + log(tf) instead of the raw count
use_idf=True, enable IDF (inverse document frequency) weighting
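
To make these parameters more concrete, here is a small pure-Python sketch of what they mean. It assumes scikit-learn's default smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and the sublinear term frequency 1 + ln(tf). The toy corpus, the thresholds and the helper names are made up for illustration, and the L2 normalisation that TfidfVectorizer applies by default is left out.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration, not the movie review data)
docs = [
    "good good film",
    "bad film",
    "good plot bad acting",
    "film film film",
]
n_docs = len(docs)
tokenised = [d.split() for d in docs]

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenised for word in set(tokens))


def keep(word, min_df=2, max_df=0.7):
    """Mimic min_df (absolute count) and max_df (proportion) filtering."""
    return df[word] >= min_df and df[word] / n_docs <= max_df


vocabulary = sorted(w for w in df if keep(w))


def weight(word, tokens):
    """Unnormalised TF-IDF weight with sublinear TF and smoothed IDF."""
    tf = tokens.count(word)
    if tf == 0:
        return 0.0
    sublinear_tf = 1 + math.log(tf)
    idf = math.log((1 + n_docs) / (1 + df[word])) + 1
    return sublinear_tf * idf


print(vocabulary)
print(weight("good", tokenised[0]))
```

Here "plot" and "acting" are dropped for being too rare, while "film" is dropped for appearing in more than 70% of the documents, which is the same pruning logic that min_df and max_df apply at a larger scale.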

More options are available and the best configuration might depend on your data or on the details of the task you’re facing.

The first call, to fit_transform(), creates the vocabulary (i.e. the list of words/features) and the feature weights from the training data. We then simply call transform() on the test data, which creates the feature weights for the test data using the same vocabulary as the training data.
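
A tiny illustration of this difference, using made-up documents and the vectorizer's default settings (both are assumptions for the sake of the sketch):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents, not taken from the data set
train_data = ["a great and moving film", "a dull and boring film"]
test_data = ["a great surprise"]

vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_data)  # learns vocabulary and IDF
test_vectors = vectorizer.transform(test_data)        # reuses both, learns nothing

print(sorted(vectorizer.vocabulary_))
print(train_vectors.shape, test_vectors.shape)
```

Both matrices have the same number of columns, one per word in the training vocabulary. The word "surprise" never appears in the training data, so it is silently ignored at test time. Note also that the default token pattern drops single-character tokens such as "a".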

Classification in scikit-learn

scikit-learn comes with a number of different classifiers already built-in. In these experiments, we use different variations of Support Vector Machine (SVM), which is commonly used in classification applications.

The classification procedure is fairly simple:

from sklearn import svm

classifier_rbf = svm.SVC()
classifier_rbf.fit(train_vectors, train_labels)
prediction_rbf = classifier_rbf.predict(test_vectors)

The SVC() class creates an SVM classifier with an RBF (Gaussian) kernel as the default option (several other kernels are available).

The fit() method will perform the training and it requires the training data processed by the vectorizer as well as the correct class labels.

The classification step consists in predicting the labels for the test data.
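
Putting vectorizer and classifier together, a self-contained toy run might look like the sketch below. The documents and labels are invented for illustration, and a linear kernel is used only to keep the example small; no particular prediction should be read into such a tiny training set.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and test documents
train_data = [
    "a wonderful , moving film",
    "great acting and a wonderful plot",
    "a dull , boring film",
    "boring plot and terrible acting",
]
train_labels = ["pos", "pos", "neg", "neg"]
test_data = ["a wonderful surprise", "terrible and dull"]

# Feature extraction: fit on training data, reuse on test data
vectorizer = TfidfVectorizer()
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)

# Training and prediction
classifier = svm.SVC(kernel="linear")
classifier.fit(train_vectors, train_labels)
prediction = classifier.predict(test_vectors)

for document, label in zip(test_data, prediction):
    print(label, "<-", document)
```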

Comments on The Complete Code

The complete code is available as a Gist on GitHub. The script takes the data folder as a parameter, assuming the same format as the original data, with two subfolders, pos and neg.

The first part of the script reads the content of the files and creates lists of training/testing documents and labels.
We split the data set into training (90% of the documents) and testing (10%) by exploiting the file names (they all start with “cvX”, with X=[0..9]). This naming scheme lends itself to k-fold cross-validation, which is not implemented in the example but is fairly easy to integrate.

if fname.startswith('cv9'):
    # 10% test data
    test_data.append(content)
    test_labels.append(curr_class)
else:
    # 90% training data
    train_data.append(content)
    train_labels.append(curr_class)
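
The cross-validation mentioned above is not part of the original script, but a rough sketch of how the ten splits could be generated from the file names is shown here. The function name `cv_splits`, the (fname, content, label) triples and the fake file names are assumptions made for this example.

```python
def cv_splits(documents, n_folds=10):
    """Yield (fold, train, test), one split per file-name prefix cv0..cv9.

    `documents` is a list of (fname, content, label) triples.
    """
    for fold in range(n_folds):
        prefix = "cv{}".format(fold)
        train = [(c, l) for f, c, l in documents if not f.startswith(prefix)]
        test = [(c, l) for f, c, l in documents if f.startswith(prefix)]
        yield fold, train, test


# Hypothetical stand-in for the real files: two documents per fold
documents = [
    ("cv{:03d}_x.txt".format(i), "content {}".format(i), "pos")
    for i in range(0, 1000, 50)
]

fold_sizes = []
for fold, train, test in cv_splits(documents):
    # Here you would vectorize, train and evaluate on this split,
    # then average precision/recall over the ten folds.
    fold_sizes.append((len(train), len(test)))

print(fold_sizes)
```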

Once the vectorizer has generated the feature vectors for training and testing, we can call the classifier as described above. In the example, we try different variations of SVM:

classifier_rbf = svm.SVC()
classifier_linear = svm.SVC(kernel='linear')
classifier_liblinear = svm.LinearSVC()

After performing the classification, we print the quality (precision/recall) results using classification_report(), and some timing information.
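
For readers unsure how to read that report, the sketch below computes precision and recall by hand for a handful of invented labels. It is not how the original script does it (the script calls classification_report()), and the helper name `precision_recall` is made up for this example.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall for one class, computed from scratch."""
    pairs = list(zip(true_labels, predicted_labels))
    true_positives = sum(1 for t, p in pairs if t == target and p == target)
    predicted_as_target = sum(1 for _, p in pairs if p == target)
    actually_target = sum(1 for t, _ in pairs if t == target)
    precision = true_positives / predicted_as_target if predicted_as_target else 0.0
    recall = true_positives / actually_target if actually_target else 0.0
    return precision, recall


# Hypothetical gold labels and predictions
true_labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
predicted = ["pos", "pos", "pos", "neg", "neg", "pos"]

for label in ("neg", "pos"):
    print(label, precision_recall(true_labels, predicted, label))
```

Precision for a class is the fraction of documents predicted as that class that really belong to it; recall is the fraction of documents of that class that the classifier managed to find.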

We notice that:

The default RBF kernel performs worse than the linear kernel
SVC() with a linear kernel is much slower than LinearSVC()

The first point opens up a discussion of Gaussian vs. linear kernels, which is beyond the scope of this blog post. As a rule of thumb, when the number of features is much higher than the number of samples (documents), a linear kernel is probably the preferred choice. Moreover, there are options to properly tune the parameters of an RBF kernel.
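
One common way to tune an RBF kernel is a grid search over C and gamma with cross-validation. The sketch below is not part of the original script: the toy documents, the candidate values and the use of a Pipeline with GridSearchCV are all assumptions chosen to keep the example short.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data: four positive and four negative documents
train_data = [
    "a wonderful and moving film",
    "great acting , great plot",
    "wonderful cast and a great story",
    "a moving story with great acting",
    "a dull and boring film",
    "terrible acting , boring plot",
    "boring cast and a terrible story",
    "a dull story with terrible acting",
]
train_labels = ["pos"] * 4 + ["neg"] * 4

# Vectorizer and classifier in one pipeline, so that each
# cross-validation fold fits its own vocabulary.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", svm.SVC(kernel="rbf")),
])

# Candidate values are illustrative, not recommendations
param_grid = {
    "svm__C": [1, 10, 100],
    "svm__gamma": [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(train_data, train_labels)
print(search.best_params_)
```

On the real data set you would use more folds and probably a wider, logarithmically spaced grid, at the cost of longer training times.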

The second bullet point is easily explained by the fact that, under the hood, scikit-learn relies on different C libraries. In particular SVC() is implemented using libSVM, while LinearSVC() is implemented using liblinear, which is explicitly designed for this kind of application.

Summary

We have discussed an application of sentiment analysis, tackled as a document classification problem with Python and scikit-learn.

The choice of the classifier, as well as the feature extraction process, will influence the overall quality of the results, and it’s always good to experiment with different configurations.

scikit-learn offers many options from this point of view.

Knowing the underlying implementation also allows for a better choice in terms of speed.

Full example in Python.

Published by

Marco

Data Scientist

15 thoughts on “Sentiment Analysis with Python and scikit-learn”

lrbarba says:

September 7, 2015 at 11:12 pm

Marco

How do I run the code? What is the command to run it in Python?

It is not clear for me.

Thank you for your aid.

Regards

Marco says:

September 8, 2015 at 6:46 am

Hi Rodrigo
if you follow the link for the complete code on Gist/GitHub at the end of the article, you’ll see what the full script looks like. Save the script and then call it from the command line with:
python sentiment_classification.py your-data-dir
where your-data-dir is the folder where you unzipped the Movie Review Dataset
Note: tested on Python 3

Cheers
Marco

Daniele says:

September 25, 2015 at 6:21 pm

Hi Marco,
I’m quite new to this subject, so this may be a trivial question. That’s also due to the fact that I’m working in R and I’m actually translating the Python code. However, the question is more theoretical: how did you create the training set labels on which we train the SVM? What actually are ‘pos’ and ‘neg’ in the script?
Thank you, Daniele

1. Marco says:
  
  September 25, 2015 at 7:14 pm
  
  Hi Daniele,
  “pos” and “neg” are the two labels/classes. The dataset comes with the documents already split in two sub-folders with those names, so simply all the data in the folder “pos” are positive reviews, and similarly all the docs in “neg” are negative reviews. They have been pre-labelled by the authors of the dataset as described in their papers. Moreover, the documents in each folder/class have names starting with “cvX”, with X being a digit: you can exploit this to create a 90-10 split of the data like I’m doing in the sample code. Ideally, for this kind of experiment, you would do a 10-fold validation, iterating through different 90-10 splits, one for each digit, and reporting the average precision/recall. The sample code only takes one split for simplicity. Hope this helps.
  
  Cheers,
  Marco
  
  1. Daniele says:
    
    September 25, 2015 at 7:21 pm
    
    Ok, I missed that part! So they are pre-labelled. I thought there should be some way to avoid labelling manually in the training phase, even if it seems counterintuitive!
    Thank you for the answer and the CV hint!
    I’ll continue to work on and will try to perform it.
    
    Daniele
    
Stan says:

November 13, 2015 at 3:17 am

Hi Marco,

Thanks for this great resource, I’m currently using it to further my understanding of text SVMs. If I wanted to expand this from not only pos neg, but using emotions as well (anger, fear etc). What would I have to do?

Cheers.

1. Marco says:
  
  November 13, 2015 at 6:59 am
  
  Hi Stan, if you want to stick with SVM or another supervised learning approach, you’ll need labelled data. I’m not aware of a freely available data set of documents annotated with emotions. There are a few lexical resources around, e.g. https://github.com/marcoguerini/DepecheMood/releases (this is the paper: http://aclweb.org/anthology/P14-2070 ) that you could use with an unsupervised approach.
  Cheers,
  Marco
  
ELSA says:

November 22, 2015 at 7:28 pm

Hi Marco,

I’m quite new to this subject. I’m trying to run the code, but I don’t know which lines I have to change to point it to the folder that I have already unzipped from the Movie Review Dataset.

Thanks for your help

1. Marco says:
  
  November 23, 2015 at 6:56 am
  
  Hi Elsa, there is no need to change the code as it takes the argument from the command line — if you follow the example on gist/github (last link at the end of the article) you can just run it with:
  python sentiment_classification.py your-data-dir
  
Nicolò says:

February 13, 2016 at 10:21 pm

Hi marco and thanks for sharing your work,
I was wondering: if I want to use your script with a dataset with more than 2 classes, let’s say four, is it enough to change classes = ['pos', 'neg'] to the actual classes? Something like classes = ['one', 'two', 'three', 'four']?
And to use it with my dataset, can I just make 4 folders for the four classes and then put 1,000 files in each, with names that start with 'cv' + some number?
Thanks in advance Nico.

1. Marco says:
  
  February 14, 2016 at 7:27 pm
  
  Hi Nico,
  extending the code for a multi-class task should be straightforward, I don’t see any particular problem. The naming “cv”+number is the approach used in the movie data set so it can make cross-validation (or k-fold validation) easier to perform — you don’t have to follow it if you have a clear train-vs-test split with your data.
  Cheers,
  Marco
  
Ryan NH says:

March 5, 2017 at 9:46 am

Hi, how do you use this to actually classify your own files?

1. Marco says:
  
  March 15, 2017 at 8:15 am
  
  Hi,
  assuming you have labelled data, you simply load your data into the variables train_data, train_labels, test_data, test_labels (the *_data variables are lists of documents, the *_labels variables are lists of labels in the same order as the documents). You can then pass these variables to the vectorizer and classifier using the same code.
  
  Cheers,
  Marco
  
cuixiwen says:

March 31, 2017 at 2:29 am

Could we use it to do sentiment analysis on other data, like tweets? And can the model be saved with pickle? I am sorry, but I could not fully understand the results: the precision is 0.92 for neg, does that mean 92% correct on the test data? And is there any way to use this to get a sentiment value, something like pos 0.67? I tried with pickle and I always get values like 0.6, 0.8, 1.0.

Swapnajit says:

January 15, 2018 at 7:52 pm

Hello Marco,

Very good example. I am trying to train the system using all movie data (including the ‘9’ series) and then use an independent test vector (outside of the movie data) to get the prediction. My question is what should I put as the curr_class in Line 44 of your code at https://gist.github.com/bonzanini/c9248a239bbab0e0d42e? Since this is independent test vector, I have no reasonable way of knowing the class (unlike the movie review dataset).

I went ahead and put everything as ‘neg’ just to try it out, but I get the following message for RBF:

Training time: 16.707707s; Prediction time: 0.031200s
C:\Python34\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
C:\Python34\lib\site-packages\sklearn\metrics\classification.py:1137: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)
             precision    recall  f1-score   support

        neg       0.00      0.00      0.00         1
        pos       0.00      0.00      0.00         0

avg / total       0.00      0.00      0.00         1

As you can see it did not work. Will very much appreciate any help from you.
