Sentiment Analysis

This is a demonstration of a machine learning model that determines whether a given phrase has a positive or negative sentiment.

About the model and application


Training the sentiment analysis model

For this project, I had a corpus of 50,000 movie reviews labeled positive or negative, with 25,000 reviews in each category. I trained a logistic regression model to recognize whether a new piece of text has a positive or negative sentiment.

      import pandas as pd

      data = pd.read_csv('resources/Reviews.csv')
      print("Number of positive and negative reviews", '\n', data["sentiment"].value_counts())
      data.head()

      ...

      Number of positive and negative reviews 
      1    25000
      0    25000
                                                    review  sentiment
      0  My family and I normally do not watch local mo...          1
      1  Believe it or not, this was at one time the wo...          0
      2  After some internet surfing, I found the "Home...          0
      3  One of the most unheralded great works of anim...          1
      4  It was the Sixties, and anyone with long hair ...          0
    
Processing the text data

First I split the data into a 70/30 train/test split and then vectorize the review text. Vectorizing transforms the text into a standardized numerical representation that the model can process. This has to be done for the training data and for any new input, which is why the vectorizer is used again in the cloud function later on. I used the TfidfVectorizer from sklearn, configured to include unigrams and bigrams and to drop terms that are extremely rare or that appear in most documents.

      from sklearn.model_selection import train_test_split
      from sklearn.feature_extraction.text import TfidfVectorizer

      x_train, x_test, y_train, y_test = train_test_split(
        test_data["review"],
        test_data["sentiment"],
        test_size=0.3,
        random_state=42
      )

      vectorizer = TfidfVectorizer(max_df=0.9, min_df=0.0005, ngram_range=(1,2)).fit(x_train)
    

Previously I tried preprocessing the data manually before feeding it into the TfidfVectorizer. This entailed removing stop words, punctuation, and contractions, and applying lemmatization. But the additional step did not result in higher accuracy, likely because the vectorizer has similar preprocessing built in, so I have since removed the custom preprocessing.
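For reference, the removed step looked roughly like the sketch below. It uses nltk; the stop word list, contraction handling, and lemmatizer shown here are illustrative rather than an exact copy of what I had.

      import re
      import string

      import nltk
      from nltk.corpus import stopwords
      from nltk.stem import WordNetLemmatizer

      nltk.download('stopwords')
      nltk.download('wordnet')

      stop_words = set(stopwords.words('english'))
      lemmatizer = WordNetLemmatizer()

      def clean_review(text):
          # Expand a couple of common contractions (illustrative subset)
          text = re.sub(r"n't\b", " not", text)
          text = re.sub(r"'re\b", " are", text)
          # Lowercase and strip punctuation
          text = text.lower().translate(str.maketrans('', '', string.punctuation))
          # Drop stop words and lemmatize what remains
          tokens = [lemmatizer.lemmatize(w) for w in text.split() if w not in stop_words]
          return ' '.join(tokens)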

Training the model

The next step was to fit the model to the training data. On the held-out test data, this model achieved roughly 90% accuracy.

      from sklearn.linear_model import LogisticRegression

      x_train = vectorizer.transform(x_train)
      x_test = vectorizer.transform(x_test)


      model = LogisticRegression(solver='lbfgs')
      model.fit(x_train, y_train)
    

Code for training the model can be found HERE.
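For reference, the ~90% figure is the mean accuracy on the vectorized test set, which sklearn can report directly (the exact number varies a little with the random split):

      # Mean accuracy of the fitted model on the held-out 30% of reviews
      print(model.score(x_test, y_test))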

Deploying this ML model to a Google Cloud Function

The demo above calls a Google Cloud Function that runs a static, pretrained copy of the model. To implement this service, I set up a Google Cloud Function and created serialized versions of the vectorizer and the trained model.

Storing a pretrained model and vectorizer

The function uses a pretrained version of the sentiment analysis model, because storing the training data and retraining the model on every request would require far too much storage and computation. Since no new data is being added to the training set, there is also no need to update or retrain the model per request. The same applies to the vectorizer.

After training the model and fitting the vectorizer to the given data, I serialized the two using the pickle library.

      import pickle

      pickle.dump(model, open('serialized/trained_model.sav', 'wb'))
      pickle.dump(vectorizer, open('serialized/trained_vectorizer.sav', 'wb'))
    
Unpacking the model and vectorizer in Google Cloud

For the Google Cloud Function, I set up a Google Cloud Storage bucket and uploaded the trained_model.sav and trained_vectorizer.sav files to it.
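The upload is a one-off step that can be done through the Cloud Console or gsutil, or scripted with the same google-cloud-storage client used below; a rough sketch of the scripted route (the bucket name here is a placeholder):

      from google.cloud import storage

      client = storage.Client()
      bucket = client.get_bucket('my-sentiment-bucket')  # placeholder bucket name

      # Upload the serialized vectorizer and model created earlier
      bucket.blob('trained_vectorizer.sav').upload_from_filename('serialized/trained_vectorizer.sav')
      bucket.blob('trained_model.sav').upload_from_filename('serialized/trained_model.sav')

In the cloud function, I add these file names and the bucket name as environment variables.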

      import os

      BUCKET = os.environ['BUCKET']
      VECTORIZER_FILENAME = os.environ['VECTORIZER_FILENAME']
      MODEL_FILENAME = os.environ['MODEL_FILENAME']
    

Next I connect to the bucket using the Google Cloud storage library.

      from google.cloud import storage

      client = storage.Client()
      bucket = client.get_bucket(BUCKET)
    

Now that I am connected to the bucket, I can download the serialized vectorizer and model and deserialize them using pickle (the same library used earlier).

      vectorizer_blob = bucket.get_blob(VECTORIZER_FILENAME)
      model_blob = bucket.get_blob(MODEL_FILENAME)

      vectorizer = pickle.loads(vectorizer_blob.download_as_string())
      model = pickle.loads(model_blob.download_as_string())
    

At this point the model and vectorizer are ready to use as normal.

      # input_text is the text submitted with the request
      vectorized_text = vectorizer.transform([input_text])
      result = model.predict_proba(vectorized_text)
    

Unpacking the vectorizer and the model is the heaviest part of this process, the vectorizer especially: its serialized form is about 43 MB, while the model's is only 483 KB. That is probably because the vectorizer stores every term from the training data in order to map text to a vector. For this Google Cloud Function I had to allocate 512 MB of memory.
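Since the unpacking is the expensive part, one way to keep it off the request path is to load everything at module level, so warm invocations reuse the already-deserialized objects. The sketch below shows how the pieces could fit together in a single function file; the entry-point name, the request field, and the JSON response shape are assumptions, not necessarily what my deployed function uses.

      import json
      import os
      import pickle

      from google.cloud import storage

      BUCKET = os.environ['BUCKET']
      VECTORIZER_FILENAME = os.environ['VECTORIZER_FILENAME']
      MODEL_FILENAME = os.environ['MODEL_FILENAME']

      # Download and deserialize once at module load; warm invocations reuse these
      client = storage.Client()
      bucket = client.get_bucket(BUCKET)
      vectorizer = pickle.loads(bucket.get_blob(VECTORIZER_FILENAME).download_as_string())
      model = pickle.loads(bucket.get_blob(MODEL_FILENAME).download_as_string())

      def analyze_sentiment(request):
          # HTTP entry point: expects JSON like {"text": "..."}
          input_text = request.get_json()['text']
          vectorized_text = vectorizer.transform([input_text])
          # predict_proba returns one row per input: column 0 = negative, column 1 = positive
          negative, positive = model.predict_proba(vectorized_text)[0]
          return json.dumps({'positive': float(positive), 'negative': float(negative)})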

Code for the function can be found HERE.