Sentiment Analysis with Python and scikit-learn

Sentiment Analysis is a field of study which analyses people’s opinions towards entities like products, typically expressed in written forms like on-line reviews. In recent years, it’s been a hot topic in both academia and industry, also thanks to the massive popularity of social media which provide a constant source of textual data full of opinions to analyse.

This article discusses one particular application of sentiment analysis: sentiment classification at the document level. In other words, given a document (e.g. a review), the task consists in finding out whether it provides a positive or a negative sentiment towards the product being discussed.

The following paragraphs describe the setup and the main components
or our classification example with samples of code in Python using scikit-learn, a popular machine learning library. The complete code is discussed at the end of this post, and available as Gist on Github.

Setting up for the experiments

We’re using Python and in particular scikit-learn for these experiments. To install scikit-learn:

pip install -U scikit-learn

Scikit-learn has a couple of dependencies, in particular numpy and scipy. If these dependencies are not resolved by pip for some reason, you can make the installation explicit with:

pip install -U numpy scipy scikit-learn

The data set used for this experiments is the well-known Polarity Dataset v2.0, downloadable from here.

The data set contains 2,000 documents, labelled and pre-processed. In particular, there are two labels, positive and negative with 1,000 documents each. Every document has been tokenised and lowercased; each line of a document represents a sentence. This pre-processing takes out most of the work we have to do to get started, so we can focus on the classification problem. Real world data are usually messy and need proper pre-processing before we can make good use of them. All we need to do here is read the files and split the words over white spaces.

Feature extraction in scikit-learn

In classification, items are represented by their features. In our case, documents are represented by their words, so we will use words as features.

scikit-learn provides several vectorizers to translate the input documents into vectors of features (or feature weights). Typically we want to give appropriate weights to different words, and TF-IDF is one of the most common weighting schemes used in text analytics applications. In scikit-learn, we can use the TfidfVectorizer:

vectorizer = TfidfVectorizer(min_df=5,
                             max_df = 0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)

The parameters used in this example with the vectorizer are:

  • min_df=5, discard words appearing in less than 5 documents
  • max_df=0.8, discard words appering in more than 80% of the documents
  • sublinear_tf=True, use sublinear weighting
  • use_idf=True, enable IDF

More options are available and the best configuration might depend on your data or on the details of the task you’re facing.

The first call to fit_transform() will create the vocabulary (i.e. the list of words/features) and the feature weights from the training data. Secondly, we call simply transform() on the test data, which will create the feature weights for the test data, using the same vocabulary as the training data.

Classification in scikit-learn

scikit-learn comes with a number of different classifiers already built-in. In these experiments, we use different variations of Support Vector Machine (SVM), which is commonly used in classification applications.

The classification procedure is fairly simple:

classifier_rbf = svm.SVC()
classifier_rbf.fit(train_vectors, train_labels)
prediction_rbf = classifier_rbf.predict(test_vectors)

The SVC() class generates a SVM classifier with RBF (Gaussian) kernel as default option (several other options are available).

The fit() method will perform the training and it requires the training data processed by the vectorizer as well as the correct class labels.

The classification step consists in predicting the labels for the test data.

Comments on The Complete Code

The complete code is available as Gist on Github. The script takes the data folder as parameter, assuming the same format of the original data, with two subfolders pos and neg.

The first reads the content of the files and creates lists of training/testing documents and labels.
We split the data set into training (90% of the documents) and testing (10%) by exploiting the file names (they all start with “cvX”, with X=[0..9]). This calls for k-fold cross-validation,
not implemented in the example but fairly easy to integrate.

if fname.startswith('cv9'):
    # 10% test data
    test_data.append(content)
    test_labels.append(curr_class)
else:
    # 90% training data
    train_data.append(content)
    train_labels.append(curr_class)

Once the vectorizer has generated the feature vectors for training and testing, we can call the classifier as described above. In the example, we try different variations of SVM:

classifier_rbf = svm.SVC()
classifier_linear = svm.SVC(kernel='linear')
classifier_liblinear = svm.LinearSVC()

After performing the classification, we print the quality (precision/recall) results using classification_report(), and some timing information.

We notice that:

  • The default RBG kernel performs worse than the linear kernel
  • SVC() with linear kernel is much much slower than LinearSVC()

The first point opens for a discussion on Gaussian vs. linear kernels, not really part of this blog post, but as a rule of thumb when the number of features is much higher than the number of samples (documents), a linear kernel is probably the preferred choice. Moreover, there are options to properly tune the parameters of a RBF kernel.

The second bullet point is easily explained by the fact that, under the hood, scikit-learn relies on different C libraries. In particular SVC() is implemented using libSVM, while LinearSVC() is implemented using liblinear, which is explicitly designed for this kind of application.

Summary

We have discussed an application of sentiment analysis, tackled as a document classification problem with Python and scikit-learn.

The choice of the classifier, as well as the feature extraction process, will influence the overall quality of the results, and it’s always good to experiment with different configurations.

scikit-learn offers many options from this point of view.

Knowing the underlying implementation also allows for a better choice in terms of speed.

Full example in Python.