Sentiment analysis is a field of study which analyses people’s opinions towards entities such as products, typically expressed in written form, e.g. as on-line reviews. In recent years it has been a hot topic in both academia and industry, thanks also to the massive popularity of social media, which provide a constant stream of opinion-rich textual data to analyse.
This article discusses one particular application of sentiment analysis: sentiment classification at the document level. In other words, given a document (e.g. a review), the task consists of determining whether it expresses a positive or a negative sentiment towards the product being discussed.
The following paragraphs describe the setup and the main components of our classification example, with samples of code in Python using scikit-learn, a popular machine learning library. The complete code is discussed at the end of this post, and available as a Gist on GitHub.
Setting up for the experiments
We’re using Python and in particular scikit-learn for these experiments. To install scikit-learn:
pip install -U scikit-learn
Scikit-learn has a couple of dependencies, in particular numpy and scipy. If these dependencies are not resolved by pip for some reason, you can make the installation explicit with:
pip install -U numpy scipy scikit-learn
The data set used for these experiments is the well-known Polarity Dataset v2.0, downloadable from here.
The data set contains 2,000 documents, labelled and pre-processed. In particular, there are two labels, positive and negative with 1,000 documents each. Every document has been tokenised and lowercased; each line of a document represents a sentence. This pre-processing takes out most of the work we have to do to get started, so we can focus on the classification problem. Real world data are usually messy and need proper pre-processing before we can make good use of them. All we need to do here is read the files and split the words over white spaces.
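As a rough sketch (not necessarily identical to the script in the Gist), reading the documents into Python lists could look like this, assuming data_dir points at the unzipped dataset with its pos and neg sub-folders:

import os

# Hypothetical path: data_dir is assumed to point at the unzipped dataset,
# which contains the two sub-folders "pos" and "neg"
data_dir = 'txt_sentoken'
classes = ['pos', 'neg']
train_data, train_labels = [], []

for curr_class in classes:
    dirname = os.path.join(data_dir, curr_class)
    for fname in os.listdir(dirname):
        with open(os.path.join(dirname, fname)) as f:
            # each file is one review, already tokenised and lowercased
            train_data.append(f.read())
            train_labels.append(curr_class)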
Feature extraction in scikit-learn
In classification, items are represented by their features. In our case, documents are represented by their words, so we will use words as features.
scikit-learn provides several vectorizers to translate the input documents into vectors of features (or feature weights). Typically we want to give appropriate weights to different words, and TF-IDF is one of the most common weighting schemes used in text analytics applications. In scikit-learn, we can use the TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=5,
                             max_df=0.8,
                             sublinear_tf=True,
                             use_idf=True)
train_vectors = vectorizer.fit_transform(train_data)
test_vectors = vectorizer.transform(test_data)
The parameters used in this example with the vectorizer are:
- min_df=5, discard words appearing in fewer than 5 documents
- max_df=0.8, discard words appearing in more than 80% of the documents
- sublinear_tf=True, use sublinear weighting
- use_idf=True, enable IDF
More options are available and the best configuration might depend on your data or on the details of the task you’re facing.
The first call to fit_transform() will create the vocabulary (i.e. the list of words/features) and the feature weights from the training data. Then we simply call transform() on the test data, which will create the feature weights for the test data, using the same vocabulary as the training data.
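As a quick sanity check, one can inspect the learned vocabulary and verify that training and test vectors share the same number of columns (a small illustrative sketch):

# vocabulary_ maps each word/feature to its column index in the vectors
print(len(vectorizer.vocabulary_))              # number of features
print(train_vectors.shape, test_vectors.shape)  # same number of columns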
Classification in scikit-learn
scikit-learn comes with a number of different classifiers already built-in. In these experiments, we use different variations of Support Vector Machine (SVM), which is commonly used in classification applications.
The classification procedure is fairly simple:
from sklearn import svm

classifier_rbf = svm.SVC()
classifier_rbf.fit(train_vectors, train_labels)
prediction_rbf = classifier_rbf.predict(test_vectors)
The SVC() class generates an SVM classifier with an RBF (Gaussian) kernel as the default option (several other options are available).
The fit() method performs the training; it requires the training data processed by the vectorizer, as well as the correct class labels.
The classification step consists of predicting the labels for the test data.
Comments on the complete code
The complete code is available as a Gist on GitHub. The script takes the data folder as a parameter, assuming the same format as the original data, with two subfolders pos and neg.
The first part of the script reads the content of the files and creates lists of training/testing documents and labels.
We split the data set into training (90% of the documents) and testing (10%) by exploiting the file names: they all start with “cvX”, where X is a digit from 0 to 9. This setup naturally lends itself to k-fold cross-validation, which is not implemented in the example but fairly easy to integrate (see the sketch after the snippet below).
if fname.startswith('cv9'):
    # 10% test data
    test_data.append(content)
    test_labels.append(curr_class)
else:
    # 90% training data
    train_data.append(content)
    train_labels.append(curr_class)
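A minimal sketch of how the full 10-fold idea could be integrated (this is not in the original script; all_docs is a hypothetical list of (fname, content, curr_class) tuples built while reading the dataset):

# Loop over the ten digits, using each "cvX" group in turn as the test fold
for digit in range(10):
    test_prefix = 'cv{}'.format(digit)
    train_data, train_labels, test_data, test_labels = [], [], [], []
    for fname, content, curr_class in all_docs:
        if fname.startswith(test_prefix):
            test_data.append(content)
            test_labels.append(curr_class)
        else:
            train_data.append(content)
            train_labels.append(curr_class)
    # vectorise, train and evaluate on this 90-10 split,
    # then average precision/recall over the ten folds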
Once the vectorizer has generated the feature vectors for training and testing, we can call the classifier as described above. In the example, we try different variations of SVM:
classifier_rbf = svm.SVC()
classifier_linear = svm.SVC(kernel='linear')
classifier_liblinear = svm.LinearSVC()
After performing the classification, we print the quality (precision/recall) results using classification_report(), and some timing information.
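Roughly, the evaluation of one of the classifiers can be sketched as follows (the exact output format in the Gist may differ):

import time
from sklearn.metrics import classification_report

# time the training and prediction steps
t0 = time.time()
classifier_rbf.fit(train_vectors, train_labels)
t1 = time.time()
prediction_rbf = classifier_rbf.predict(test_vectors)
t2 = time.time()

print("Results for SVC(kernel='rbf')")
print("Training time: %fs; Prediction time: %fs" % (t1 - t0, t2 - t1))
# per-class precision, recall and F1, comparing predictions with the gold labels
print(classification_report(test_labels, prediction_rbf))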
We notice that:
- The default RBF kernel performs worse than the linear kernel
- SVC() with linear kernel is much much slower than LinearSVC()
The first point opens up a discussion on Gaussian vs. linear kernels, which is not really within the scope of this blog post; as a rule of thumb, when the number of features is much higher than the number of samples (documents), a linear kernel is probably the preferred choice. Moreover, there are options to properly tune the parameters of an RBF kernel (see the sketch below).
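A common approach is a grid search with cross-validation over C and gamma; here is a minimal sketch using scikit-learn’s GridSearchCV (the parameter ranges are arbitrary examples, not recommendations):

from sklearn import svm
from sklearn.model_selection import GridSearchCV  # in very old versions: sklearn.grid_search

# arbitrary example grid; adapt the ranges to your data
param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(svm.SVC(kernel='rbf'), param_grid, cv=5)
grid.fit(train_vectors, train_labels)
print(grid.best_params_)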
The second bullet point is easily explained by the fact that, under the hood, scikit-learn relies on different C libraries. In particular SVC() is implemented using libSVM, while LinearSVC() is implemented using liblinear, which is explicitly designed for this kind of application.
Summary
We have discussed an application of sentiment analysis, tackled as a document classification problem with Python and scikit-learn.
The choice of the classifier, as well as the feature extraction process, will influence the overall quality of the results, and it’s always good to experiment with different configurations.
scikit-learn offers many options from this point of view.
Knowing the underlying implementation also allows for a better choice in terms of speed.
Marco
How do I run the code? What is the command line to run it in Python?
It is not clear to me.
Thank you for your help.
Regards
Hi Rodrigo
if you follow the link for the complete code on Gist/GitHub at the end of the article, you’ll see what the full script looks like. Save the script and then call it from the command line with:
python sentiment_classification.py your-data-dir
where your-data-dir is the folder where you unzipped the Movie Review Dataset.
Note: tested on Python 3
Cheers
Marco
Hi Marco,
I’m quite new to this subject, so this may be a trivial question. That’s also because I’m working in R and am actually translating the Python code. However, the question is more theoretical: how did you create the training set labels on which the SVM is trained? What exactly are ‘pos’ and ‘neg’ in the script?
Thank you, Daniele
Hi Daniele,
“pos” and “neg” are the two labels/classes. The dataset comes with the documents already split into two sub-folders with those names, so all the data in the folder “pos” are positive reviews, and similarly all the docs in “neg” are negative reviews. They have been pre-labelled by the authors of the dataset as described in their papers. Moreover, the documents in each folder/class have names starting with “cvX”, with X being a digit: you can exploit this to create a 90-10 split of the data as I’m doing in the sample code. Ideally, for this kind of experiment, you would do a 10-fold validation, iterating through different 90-10 splits, one for each digit, and reporting the average precision/recall. The sample code only takes one split for simplicity. Hope this helps.
Cheers,
Marco
OK, I missed reading that part! So they are pre-labelled. I think there should be some way to avoid labelling manually in the training phase, even if it seems counterintuitive!
Thank you for the answer and the CV hint!
I’ll continue to work on and will try to perform it.
Daniele
Hi Marco,
Thanks for this great resource; I’m currently using it to further my understanding of text SVMs. If I wanted to expand this beyond pos/neg to emotions as well (anger, fear, etc.), what would I have to do?
Cheers.
Hi Stan, if you want to stick with SVM or another supervised learning approach, you’ll need labelled data. I’m not aware of a freely available data set of documents annotated with emotions. There are a few lexical resources around, e.g. https://github.com/marcoguerini/DepecheMood/releases (this is the paper: http://aclweb.org/anthology/P14-2070 ) that you could use with an unsupervised approach.
Cheers,
Marco
Hi Marco,
I’m quite new to this subject. I’m trying to run the code but I don’t know which lines I have to change to point it to the folder I have already unzipped from the Movie Review Dataset.
Thanks for your help
Hi Elsa, there is no need to change the code as it takes the argument from the command line — if you follow the example on gist/github (last link at the end of the article) you can just run it with:
python sentiment_classification.py your-data-dir
Hi Marco, and thanks for sharing your work,
I was wondering: if I want to use your script with a dataset with more than 2 classes, let’s say four, is it enough to change classes = ['pos', 'neg'] to the actual classes? Something like: classes = ['one', 'two', 'three', 'four']?
And to use it with my dataset, can I just make 4 folders for the four classes and then put 1,000 files in each, with names starting with 'cv' + some number?
Thanks in advance Nico.
Hi Nico,
extending the code to a multi-class task should be straightforward; I don’t see any particular problem. The “cv” + number naming is the approach used in the movie data set to make cross-validation (or k-fold validation) easier to perform; you don’t have to follow it if you have a clear train-vs-test split in your data.
Cheers,
Marco
Hi, how do you use this to actually classify your own files?
Hi,
assuming you have labelled data, you simply load your data into the variables train_data, train_labels, test_data, test_labels (the *_data variables are lists of documents, the *_labels variables are lists of labels in the same order as the documents). You can then pass these variables to the vectorizer and classifier using the same code.
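As a tiny illustration (made-up example documents):

train_data = ["this film was brilliant", "a complete waste of time"]
train_labels = ["pos", "neg"]
test_data = ["what a wonderful movie"]
test_labels = ["pos"]
# then reuse the vectorizer and classifier code from the article on these lists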
Cheers,
Marco
Could we use this to do sentiment analysis on other data, like tweets? Can we save pickles? Also, I’m sorry but I couldn’t fully understand the results: if the precision is 0.92 for neg, does that mean 92% correct on the test data? And is there any way to use this to get a sentiment value, like pos 0.67 or something? I tried pickles and I always get values like 0.6, 0.8, 1.0.
Hello Marco,
Very good example. I am trying to train the system using all movie data (including the ‘9’ series) and then use an independent test vector (outside of the movie data) to get the prediction. My question is what should I put as the curr_class in Line 44 of your code at https://gist.github.com/bonzanini/c9248a239bbab0e0d42e? Since this is an independent test vector, I have no reasonable way of knowing the class (unlike the movie review dataset).
I went ahead and put everything as ‘neg’ just to try it out, but I get the following message for RBF:
Training time: 16.707707s; Prediction time: 0.031200s
C:\Python34\lib\site-packages\sklearn\metrics\classification.py:1135: UndefinedMetricWarning: Precision and F-score are ill-defined and being set to 0.0 in labels with no predicted samples.
  'precision', 'predicted', average, warn_for)
C:\Python34\lib\site-packages\sklearn\metrics\classification.py:1137: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples.
  'recall', 'true', average, warn_for)
             precision    recall  f1-score   support

        neg       0.00      0.00      0.00         1
        pos       0.00      0.00      0.00         0

avg / total       0.00      0.00      0.00         1
As you can see, it did not work. I would very much appreciate any help from you.