In this article I apply a series of natural language processing techniques on a dataset containing reviews about businesses. After that, I train a model using Logistic Regression to forecast if a review is “positive” or “negative”.
Roberto Sannazzaro · Feb 3
The natural language processing field offers a series of tools that are very useful for extracting, labeling, and forecasting information from raw text data. These techniques are mainly used in emotion recognition, text tagging (for example, to automate the sorting of customer complaints), chatbots, and voice assistants.
A condensed version of the Yelp dataset will be used. This version contains a collection of 1000 observations, originally in JSON format and then converted into a tabular format.
The review dataset being used:
Made up of 9 features (‘business_id’, ‘cool’, ‘date’, ‘funny’, ‘review_id’, ‘stars’, ‘text’, ‘useful’, ‘user_id’), this dataset contains a collection of reviews written by Yelp users; for each review, the user gave a score from 1 to 5 stars. To create an efficient model that forecasts whether a review is “positive” or “negative”, we start from a model that takes the text variable as the predictor and the stars variable as the target.
Data preprocessing and explorative analysis
Once the dataset is reduced to 2 columns, it is possible to conduct a small explorative analysis. It is important to know which distribution the target variable (stars received) follows: this reveals whether there is a bias in the dataset, i.e. an imbalance between positive and negative reviews. Such an imbalance influences the results of the model, giving it a propensity to predict the outcomes that are more frequent in the training set.
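A minimal sketch of this check, using a hypothetical handful of star ratings in place of the full column (in the article this would be the dataset's `stars` column):

```python
import pandas as pd

# Hypothetical stand-in for the "stars" column of the review dataset.
stars = pd.Series([5, 5, 5, 5, 4, 4, 3, 2, 1, 5])

# Relative frequency of each star rating; a bar plot of this Series
# (distribution.plot(kind="bar")) would show the imbalance at a glance.
distribution = stars.value_counts(normalize=True).sort_index()
```

Here half of the toy ratings are 5 stars, mimicking the skew toward positive reviews discussed above.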
As we can see from the plot, there is a major component of positive reviews (5 stars), which creates an imbalance or bias.
To obtain useful results, it is necessary to reduce the complexity of the problem. An efficient way to do so is to divide the reviews into positive and negative, using this division as the dependent variable.
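One common convention for this split (a sketch on hypothetical toy rows, not the article's exact rule) is to map 4-5 stars to positive, 1-2 stars to negative, and drop the ambiguous 3-star reviews:

```python
import pandas as pd

# Hypothetical toy rows standing in for the Yelp reviews DataFrame.
df = pd.DataFrame({
    "text": ["Great food!", "Terrible service.", "Loved it.",
             "Never again.", "It was okay."],
    "stars": [5, 1, 4, 2, 3],
})

# 4-5 stars -> positive (1), 1-2 stars -> negative (0);
# 3-star reviews are ambiguous, so they are dropped here.
df = df[df["stars"] != 3].copy()
df["sentiment"] = (df["stars"] >= 4).astype(int)
```

The new `sentiment` column then serves as the binary dependent variable.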
Before proceeding with any other visualization, it is mandatory to apply some preprocessing procedures very common in NLP:
- Remove any non-useful characters (slashes, punctuation, HTML tags, question marks, etc.)
- Convert the whole text to lowercase characters
A few helper functions (def blocks) will be very useful for preprocessing the text as described above. From there it is possible to determine which single words and which combinations of words (bigrams) are most common:
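A sketch of what such helpers might look like (the function names and the exact cleaning rules are assumptions, not the article's original code):

```python
import re
from collections import Counter

def clean_text(text):
    # Lowercase, then strip anything that is not a letter or whitespace
    # (punctuation, slashes, digits, leftover HTML tags, ...).
    text = text.lower()
    text = re.sub(r"<[^>]+>", " ", text)   # crude HTML tag removal
    text = re.sub(r"[^a-z\s]", " ", text)  # keep letters only
    return re.sub(r"\s+", " ", text).strip()

def top_ngrams(texts, n=1, k=10):
    # Count the k most common n-grams (n=1: single words, n=2: bigrams).
    counts = Counter()
    for t in texts:
        tokens = clean_text(t).split()
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts.most_common(k)
```

Calling `top_ngrams(reviews, n=1)` and `top_ngrams(reviews, n=2)` on the cleaned reviews yields the word and bigram frequencies used for the charts below.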
After a small indexing adjustment we can create a bubble chart displaying the most common words in the negative reviews:
And for the positive reviews:
After this short but interesting insight, we can proceed into the next phase: model creation.
A very simple, fast-to-train, and very efficient algorithm is Logistic Regression. The scikit-learn library provides a tool that helps build this model, but before doing so, and before the classical split between train and test set, it is necessary to perform a few steps: stemming, removal of stopwords, and vectorization:
- Stemming reduces every word to its root, which avoids ‘dispersion’ in the text: inflected forms of the same word, such as ‘connect’, ‘connected’, and ‘connection’, are all reduced to the root ‘connect’.
- The removal of stopwords consists of discarding very frequent words like ‘the’, ‘that’, and ‘of’, which carry little discriminative information and can hurt the model’s accuracy.
- Vectorization transforms every observation (review) in the dataset into a numerical representation. This phase is mandatory: every machine learning algorithm we might want to train requires numerical input, and vectorization translates the text into numbers.
Let’s take a look at a review before and after applying stemming and stopwords removal:
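To illustrate the effect, here is a toy version of the two steps. The tiny stopword set and the naive suffix-stripping stemmer below are deliberate stand-ins for a real pipeline (which would typically use NLTK's stopword corpus and PorterStemmer):

```python
# Toy stopword list; a real pipeline would use a full corpus of stopwords.
STOPWORDS = {"the", "that", "of", "a", "an", "is", "was", "and", "to", "it"}

def naive_stem(word):
    # Naive suffix stripping, standing in for a proper stemmer.
    for suffix in ("ing", "edly", "ed", "ly", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(review):
    tokens = [w for w in review.lower().split() if w not in STOPWORDS]
    return " ".join(naive_stem(w) for w in tokens)

before = "the waiters served amazing dishes and it was a lovely evening"
after = preprocess(before)
# after -> "waiter serv amaz dish love even"
```

The stopwords disappear and the remaining words collapse toward their stems, which is exactly the ‘before and after’ contrast shown next.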
Now it is possible to proceed with the text vectorization. The sklearn.feature_extraction.text.CountVectorizer class offers a tool that is very simple to use. It is initialized with the max_features argument, which sets the maximum size of the dictionary that will be created to represent the text. For example, after choosing 1500 as the number of features, the algorithm builds a dictionary from the 1500 most frequent words, so each review in the dataset is represented by a list of 1500 elements. Each element corresponds to one word of the dictionary, and its value is the number of times that word occurs in the observation (review).
Let’s check this example:
For each observation (Doc 1, Doc 2, Doc n..) a number represents the occurrences of this feature (word) in the observation (review).
To implement this technique in Python, only two lines of code are necessary:
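A self-contained sketch on a tiny hypothetical corpus (with the full dataset, the fit and transform would run on the preprocessed review texts instead):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical mini-corpus in place of the 1000 preprocessed reviews.
corpus = ["good food good service", "bad food", "good place"]

# The two lines: build the vocabulary and turn each document into counts.
vectorizer = CountVectorizer(max_features=1500)
X = vectorizer.fit_transform(corpus).toarray()
```

Since the corpus contains only 5 distinct words, each document becomes a 5-element count vector; with the real reviews and max_features=1500, each row would have 1500 elements.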
It is now possible to split the dataset into a training set and test set:
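A minimal sketch of the split, on hypothetical stand-in arrays (in the article, X is the document-term matrix and y the positive/negative labels):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Hypothetical stand-ins: X a document-term matrix, y the 0/1 sentiment labels.
X = rng.random((100, 20))
y = np.array([0] * 30 + [1] * 70)  # imbalanced, like the review data

# Hold out 20% for testing; stratify keeps the class ratio stable in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
```

Stratifying matters here precisely because of the imbalance discussed earlier: it prevents the test set from drifting toward an even more positive mix than the training set.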
Then to train the Logistic Regression model with 10 folds:
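One way to reproduce this step (a sketch assuming "10 folds" means 10-fold cross-validation; synthetic data stands in for the vectorized reviews):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the vectorized reviews and their labels.
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

clf = LogisticRegression(max_iter=1000)
# Evaluate the model with 10-fold cross-validation.
scores = cross_val_score(clf, X, y, cv=10)
mean_acc = scores.mean()
```

Averaging the 10 fold accuracies gives a more stable estimate than a single train/test score.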
The report shows that the overall accuracy is 88.5%, and the bias toward positive reviews is quite evident: the accuracy in predicting positive reviews is much higher than the accuracy in predicting negative ones. In other words, given a block of text, the model predicts whether it is ‘positive’ or ‘negative’ with an accuracy of 88.5%.
This difference is more evident in the confusion matrix:
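A small sketch of how such a matrix is computed (the labels below are hypothetical, chosen to mimic the bias: most errors fall on the negative class):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels: 0 = negative, 1 = positive.
y_true = [1, 1, 1, 1, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 1, 1, 1]

# Rows = true class, columns = predicted class:
# [[true neg, false pos],
#  [false neg, true pos]]
cm = confusion_matrix(y_true, y_pred)
```

In this toy example 2 of the 3 negative reviews are misclassified as positive, while 4 of the 5 positive reviews are caught, mirroring the asymmetry described above.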
This model is not perfect, but it does its job. As mentioned before, the bias toward positive reviews is quite strong. There are some possible ways to improve the model, for example:
- Increase the number of observations (the golden rule)
- Use a different algorithm, like Naïve Bayes, decision trees, or a neural approach such as an RNN, CNN, or HAN.
- Use a different stemming technique
- Use a different stopwords collection
After manually modifying some parameters like class_weight, it is possible to slightly improve the score. This practice is certainly not ideal, but knowing that the model is biased toward positive reviews, decreasing the weight of the positive class and increasing the weight of the negative class can lead to a (slightly) higher accuracy.
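A sketch of this adjustment (the exact weight values are illustrative, and the imbalanced synthetic data stands in for the vectorized reviews):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Imbalanced stand-in data: roughly 80% positive, like the review dataset.
X, y = make_classification(
    n_samples=500, n_features=20, weights=[0.2, 0.8], random_state=42
)

# Down-weight the majority (positive, 1) class and up-weight the
# minority (negative, 0) class; the 2.0 / 0.5 values are illustrative.
clf = LogisticRegression(max_iter=1000, class_weight={0: 2.0, 1: 0.5})
clf.fit(X, y)
```

The weighted loss penalizes mistakes on negative reviews more heavily, nudging the decision boundary toward better recall on the minority class at some cost to overall accuracy.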
Would you like the notebook? Just tell me in the comments ⬇️⬇️