Sentiment Analysis with Python and scikit-learn

Sentiment analysis is a field of study that analyses people’s opinions towards entities such as products, typically expressed in written form, for example in online reviews. In recent years it has been a hot topic in both academia and industry, thanks in part to the massive popularity of social media, which provide a constant source of opinionated text to analyse.

This article discusses one particular application of sentiment analysis: sentiment classification at the document level. In other words, given a document (e.g. a review), the task is to determine whether it expresses a positive or a negative sentiment towards the product being discussed.

The following paragraphs describe the setup and the main components of our classification example, with samples of code in Python using scikit-learn, a popular machine learning library. The complete code is discussed at the end of this post and is available as a Gist on GitHub.

Setting up for the experiments

We’re using Python and in particular scikit-learn for these experiments. To install scikit-learn:

$ sudo pip install -U scikit-learn

Scikit-learn has a couple of dependencies, in particular numpy and scipy. If these dependencies are not resolved by pip for some reason, you can make the installation explicit with:

$ sudo pip install -U numpy scipy scikit-learn

The data set used for these experiments is the well-known Polarity Dataset v2.0, downloadable from here.

The data set contains 2,000 documents, labelled and pre-processed. In particular, there are two labels, positive and negative, with 1,000 documents each. Every document has been tokenised and lowercased; each line of a document represents a sentence. This pre-processing removes most of the work needed to get started, so we can focus on the classification problem. Real-world data are usually messy and need proper pre-processing before we can make good use of them. All we need to do here is read the files and split the words on whitespace.
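As a rough illustration of that loading step, the sketch below reads every file under one sub-folder per label and splits the text on whitespace. The names `read_documents` and `tokenise` are hypothetical helpers introduced here, and the folder layout (a `pos` and a `neg` sub-folder) is assumed from the description of the data; the complete script may organise this differently.

```python
import os

def read_documents(data_dir, classes=("pos", "neg")):
    """Return (documents, labels, filenames) read from data_dir/<class>/*."""
    docs, labels, names = [], [], []
    for label in classes:
        class_dir = os.path.join(data_dir, label)
        for fname in sorted(os.listdir(class_dir)):
            path = os.path.join(class_dir, fname)
            # errors="replace" avoids crashing on the odd non-UTF-8 byte
            with open(path, encoding="utf-8", errors="replace") as f:
                docs.append(f.read())
            labels.append(label)
            names.append(fname)
    return docs, labels, names

def tokenise(text):
    # The data is already tokenised and lowercased,
    # so splitting on whitespace is enough.
    return text.split()
```

Keeping the file names alongside the documents makes it easy to build the train/test split from the name prefixes later on.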

Feature extraction in scikit-learn

In classification, items are represented by their features. In our case, documents are represented by their words, so we will use words as features.

scikit-learn provides several vectorizers to translate the input documents into vectors of features (or feature weights). Typically we want to give appropriate weights to different words, and TF-IDF is one of the most common weighting schemes used in text analytics applications. In scikit-learn, we can use the TfidfVectorizer:

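from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=5,
                             max_df=0.8,
                             sublinear_tf=True,
                             use_idf=True)
# learn the vocabulary and the IDF weights from the training documents
train_vectors = vectorizer.fit_transform(train_data)
# reuse the same vocabulary and weights for the test documents
test_vectors = vectorizer.transform(test_data)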

The parameters used in this example with the vectorizer are:

  • min_df=5, discard words appearing in fewer than 5 documents
  • max_df=0.8, discard words appearing in more than 80% of the documents
  • sublinear_tf=True, use sublinear weighting, i.e. replace the term frequency tf with 1 + log(tf)
  • use_idf=True, enable IDF (inverse document frequency) weighting

More options are available and the best configuration might depend on your data or on the details of the task you’re facing.

The first call, fit_transform(), creates the vocabulary (i.e. the list of words/features) and the feature weights from the training data. We then simply call transform() on the test data, which creates the feature weights for the test data using the same vocabulary as the training data.
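To make the difference between the two calls concrete, here is a minimal sketch on a made-up corpus. The `toy_*` names and the sentences are invented for illustration, and `min_df`/`max_df` are left at their defaults because the thresholds used above would prune every term in such a tiny corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny invented corpus, only for illustration
toy_train = [
    "a great and moving film",
    "a dull and boring film",
    "great acting , great story",
]
toy_test = ["a boring story with unknown words"]

toy_vectorizer = TfidfVectorizer(sublinear_tf=True, use_idf=True)
# fit_transform() learns the vocabulary and IDF weights, then vectorises
toy_train_vectors = toy_vectorizer.fit_transform(toy_train)
# transform() only vectorises, reusing what was learned above
toy_test_vectors = toy_vectorizer.transform(toy_test)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_train_vectors.shape, toy_test_vectors.shape)
```

Both matrices have the same number of columns, one per vocabulary entry. Words that only occur in the test document (here "with", "unknown" and "words") are not in the vocabulary and are silently ignored.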

Classification in scikit-learn

scikit-learn comes with a number of different classifiers already built-in. In these experiments, we use different variations of Support Vector Machine (SVM), which is commonly used in classification applications.

The classification procedure is fairly simple:

from sklearn import svm

classifier_rbf = svm.SVC()
classifier_rbf.fit(train_vectors, train_labels)
prediction_rbf = classifier_rbf.predict(test_vectors)

The SVC() class creates an SVM classifier with an RBF (Gaussian) kernel by default (several other kernels are available).

The fit() method will perform the training and it requires the training data processed by the vectorizer as well as the correct class labels.

The classification step consists in predicting the labels for the test data.
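Putting the vectorizer and the classifier together, the whole procedure can be sketched end to end on a handful of invented sentences. Everything named `toy_*` is made up for this illustration; with so little data the result says nothing about real-world quality, it only shows how the pieces connect.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented mini data set: two positive and two negative "reviews"
toy_train_data = [
    "a wonderful , touching film",
    "brilliant acting and a wonderful script",
    "a boring , predictable mess",
    "terrible script and boring acting",
]
toy_train_labels = ["pos", "pos", "neg", "neg"]
toy_test_data = ["a wonderful film", "a boring mess"]

# Turn the text into TF-IDF feature vectors
toy_vectorizer = TfidfVectorizer()
toy_train_vectors = toy_vectorizer.fit_transform(toy_train_data)
toy_test_vectors = toy_vectorizer.transform(toy_test_data)

# Train on the labelled vectors, then predict labels for the unseen ones
toy_classifier = svm.SVC()
toy_classifier.fit(toy_train_vectors, toy_train_labels)
toy_prediction = toy_classifier.predict(toy_test_vectors)
print(list(toy_prediction))
```

The labels can be plain strings such as "pos" and "neg"; scikit-learn handles the mapping to internal class indices.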

Comments on the complete code

The complete code is available as a Gist on GitHub. The script takes the data folder as a parameter and assumes the same layout as the original data, with two subfolders, pos and neg.

The first part of the script reads the content of the files and creates lists of training/testing documents and labels. We split the data set into training (90% of the documents) and testing (10%) by exploiting the file names (they all start with “cvX”, with X=[0..9]). This naming scheme lends itself to k-fold cross-validation, which is not implemented in the example but is fairly easy to integrate.

if fname.startswith('cv9'):
    # 10% test data
    test_data.append(content)
    test_labels.append(curr_class)
else:
    # 90% training data
    train_data.append(content)
    train_labels.append(curr_class)
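For readers who want to try the cross-validation mentioned above, one possible way to integrate it is sketched below: each digit 0..9 takes a turn as the test fold. The function `cross_validate_by_prefix` is a hypothetical helper, not part of the original script. It uses LinearSVC and reports accuracy for brevity, whereas the article's script reports precision/recall.

```python
from sklearn import svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score

def cross_validate_by_prefix(names, docs, labels, n_folds=10, **vectorizer_params):
    """Return one accuracy per fold; fold k tests on files named 'cv<k>...'."""
    scores = []
    for fold in range(n_folds):
        prefix = "cv%d" % fold
        train_docs, train_labels, test_docs, test_labels = [], [], [], []
        for name, doc, label in zip(names, docs, labels):
            if name.startswith(prefix):
                test_docs.append(doc)
                test_labels.append(label)
            else:
                train_docs.append(doc)
                train_labels.append(label)
        # Fit the vectorizer on the training part only, to avoid leaking
        # information from the held-out fold into the features
        vectorizer = TfidfVectorizer(**vectorizer_params)
        train_vectors = vectorizer.fit_transform(train_docs)
        test_vectors = vectorizer.transform(test_docs)
        classifier = svm.LinearSVC()
        classifier.fit(train_vectors, train_labels)
        prediction = classifier.predict(test_vectors)
        scores.append(accuracy_score(test_labels, prediction))
    return scores
```

On the real data set one would pass the same vectorizer options as before, e.g. `cross_validate_by_prefix(names, docs, labels, min_df=5, max_df=0.8, sublinear_tf=True, use_idf=True)`, and report `sum(scores) / len(scores)` as the average over the ten folds.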

Once the vectorizer has generated the feature vectors for training and testing, we can call the classifier as described above. In the example, we try different variations of SVM:

classifier_rbf = svm.SVC()
classifier_linear = svm.SVC(kernel='linear')
classifier_liblinear = svm.LinearSVC()

After performing the classification, we print the quality (precision/recall) results using classification_report(), and some timing information.
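The reporting step might look roughly like the sketch below. The helper `timed_fit_predict` and the `demo_*` data are assumptions made for illustration: the two-feature vectors are invented stand-ins for the real TF-IDF matrices, and the actual script may measure time differently.

```python
import time
from sklearn import svm
from sklearn.metrics import classification_report

def timed_fit_predict(classifier, train_vectors, train_labels, test_vectors):
    """Fit and predict, returning (prediction, train_seconds, predict_seconds)."""
    t0 = time.perf_counter()
    classifier.fit(train_vectors, train_labels)
    t1 = time.perf_counter()
    prediction = classifier.predict(test_vectors)
    t2 = time.perf_counter()
    return prediction, t1 - t0, t2 - t1

# Invented two-feature vectors standing in for the TF-IDF matrices
demo_train = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
demo_train_labels = ["pos", "pos", "neg", "neg"]
demo_test = [[0.8, 0.2], [0.2, 0.8]]
demo_test_labels = ["pos", "neg"]

demo_prediction, train_time, predict_time = timed_fit_predict(
    svm.LinearSVC(), demo_train, demo_train_labels, demo_test)
# Per-class precision, recall and F1 as a formatted text table
demo_report = classification_report(demo_test_labels, demo_prediction)
print("Training time: %fs; prediction time: %fs" % (train_time, predict_time))
print(demo_report)
```

Calling the same helper with each of the three classifiers gives directly comparable timings and reports.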

We notice that:

  • The default RBF kernel performs worse than the linear kernel
  • SVC() with a linear kernel is much slower than LinearSVC()

The first point opens up a discussion on Gaussian vs. linear kernels that is beyond the scope of this blog post, but as a rule of thumb, when the number of features is much higher than the number of samples (documents), a linear kernel is probably the preferred choice. Moreover, there are options to properly tune the parameters of an RBF kernel, such as C and gamma.
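One common way to tune an RBF kernel in scikit-learn is a grid search over C and gamma with GridSearchCV. The sketch below is only an assumption about how this could be done here: the helper name `tune_rbf` is hypothetical and the grid values are arbitrary starting points, not recommendations from the original experiments.

```python
from sklearn import svm
from sklearn.model_selection import GridSearchCV

def tune_rbf(train_vectors, train_labels, cv=5):
    """Grid-search C and gamma for an RBF-kernel SVC; return the fitted search."""
    param_grid = {
        "C": [0.1, 1, 10, 100],          # illustrative values only
        "gamma": [0.001, 0.01, 0.1, 1],  # illustrative values only
    }
    search = GridSearchCV(svm.SVC(kernel="rbf"), param_grid, cv=cv)
    # Cross-validates every (C, gamma) pair on the training data,
    # then refits the best combination on all of it
    search.fit(train_vectors, train_labels)
    return search
```

After `search = tune_rbf(train_vectors, train_labels)`, the chosen values are in `search.best_params_` and the refitted model can be used with `search.best_estimator_.predict(test_vectors)`. Note that the search trains one model per parameter pair and fold, so it can be slow on the full data set.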

The second bullet point is easily explained by the fact that, under the hood, scikit-learn relies on different C libraries. In particular SVC() is implemented using libSVM, while LinearSVC() is implemented using liblinear, which is explicitly designed for this kind of application.

Summary

We have discussed an application of sentiment analysis, tackled as a document classification problem with Python and scikit-learn.

The choice of the classifier, as well as the feature extraction process, will influence the overall quality of the results, and it’s always good to experiment with different configurations.

scikit-learn offers many options from this point of view.

Knowing the underlying implementation also allows for a better choice in terms of speed.

Full example in Python.

11 thoughts on “Sentiment Analysis with Python and scikit-learn”

  1. Hi Rodrigo
    if you follow the link for the complete code on Gist/GitHub at the end of the article, you’ll see what the full script looks like. Save the script and then call it from the command line with:
    python sentiment_classification.py your-data-dir
    where your-data-dir is the folder where you unzip the Movie Review Dataset
    Notice: tested on Python 3

    Cheers
    Marco

  2. Hi Marco,
    I’m quite new in this subject, so I ask you maybe a trivial question. That’s due also to the fact that I’m working on R and I’m actually translating the Python code. However the question is more theoretical: how you created the training set labels for which we perform the svm? What actually are ‘pos’ and ‘neg’ in the script?
    Thank you, Daniele

    1. Hi Daniele,
      “pos” and “neg” are the two labels/classes. The dataset comes with the documents already split in two sub-folders with those names, so simply all the data in the folder “pos” are positive reviews, and similarly all the docs in “neg” are negative reviews. They have been pre-labelled by the authors of the dataset as described in their papers. Moreover, the documents in each folder/class have names starting with “cvX”, with X being a digit: you can exploit this to create a 90-10 split of the data like I’m doing in the sample code. Ideally, for this kind of experiment, you would do a 10-fold validation, iterating through different 90-10 splits, one for each digit, and reporting the average precision/recall. The sample code only takes one split for simplicity. Hope this helps.

      Cheers,
      Marco

      1. Ok, I miss reading that part! So they are pre-labelled, I think that should exist something to not label manually in the training phase, even if it seems counterintuitive!
        Thank you for the answer and the CV hint!
        I’ll continue to work on and will try to perform it.

        Daniele

  3. Hi Marco,

    Thanks for this great resource, I’m currently using it to further my understanding of text SVMs. If I wanted to expand this from not only pos neg, but using emotions as well (anger, fear etc). What would I have to do?

    Cheers.

  4. Hi Marco,

    I’m quite new in this subject, so I’m trying to run the code but I don’t know which lines do I have to change in it to add the folder that I have already unzip from Movie Review Dataset.

    Thanks for your help

    1. Hi Elsa, there is no need to change the code as it takes the argument from the command line — if you follow the example on gist/github (last link at the end of the article) you can just run it with:
      python sentiment_classification.py your-data-dir

  5. Hi marco and thanks for sharing your work,
    I was wondering if i want to use your script with a dataset with more than 2 classes, let’s say four, is enough to change classes = [‘pos’, ‘neg’] to the actual classes? something like : classes = [‘one’,’two’,’three’,’four’] ?
    and for use it with my dataset, can i just make 4 folder for the four classes and then put inside 1000 files each that starts with ‘cv’ + some number?
    Thanks in advance Nico.

    1. Hi Nico,
      extending the code for a multi-class task should be straightforward, I don’t see any particular problem. The naming “cv”+number is the approach used in the movie data set so it can make cross-validation (or k-fold validation) easier to perform — you don’t have to follow it if you have a clear train-vs-test split with your data.
      Cheers,
      Marco
