Handy concepts for all NLP analysis techniques

Photo by Chris Ried on Unsplash

This article is aimed at beginners in natural language processing (NLP). When I first started learning NLP, my constant question was how I would actually use all these concepts; this walkthrough answers that with an end-to-end sentiment-analysis example.

The prerequisite for this article is basic knowledge of natural language processing concepts. You can read the article below to brush up on them.

NLP — Zero to Hero with Python

Topics to be covered:

  1. Reading the sentiment text file
  2. Data Exploration and Text Processing
  3. Data Cleaning — Stopwords, Stemming, and Lemmatization
  4. Model Building — Naive Bayes
  5. Saving and Reloading the Model
Reading the sentiment text file

Importing all the necessary libraries:

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

Reading the sentiment file (downloaded from Kaggle) with the pandas read_csv() method:

train_ds = pd.read_csv( "sentiment_train", delimiter="\t" )
train_ds.head(5)
A photo by Author

The above text file has two columns, sentiment and text. The sentiment column has binary values i.e. “0” and “1”.
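If your copy of the file lacks a header row, the column names can be supplied explicitly; a hedged variant of the read above (the names simply match the two columns described here):

# Only needed if the raw file has no header row
train_ds = pd.read_csv("sentiment_train", delimiter="\t",
                       header=None, names=["sentiment", "text"])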

To read the sentences properly, we need to increase the width of the column.

pd.set_option('max_colwidth', 800)

Now we will view rows by sentiment: "1" for positive and "0" for negative sentences. The code snippet below shows the first five rows whose sentiment is "1".

train_ds[train_ds.sentiment == 1][0:5]
A photo by Author

This code snippet shows the first five rows whose sentiment is "0".

train_ds[train_ds.sentiment == 0][0:5]
A photo by Author
Data Exploration and Text Processing

Data Exploration

To check the summary information of the data, use the info() method.

train_ds.info()
A photo by Author

Now, we will check the counts of positive and negative sentiments with a seaborn count plot.

import matplotlib.pyplot as plt
import seaborn as sn
%matplotlib inline

plt.figure(figsize=(6, 5))
# Create the count plot
ax = sn.countplot(x='sentiment', data=train_ds)
# Annotate each bar with its height (the class count)
for p in ax.patches:
    ax.annotate(p.get_height(), (p.get_x() + 0.1, p.get_height() + 50))
Sentiment Counts. A photo by Author


Text Processing

Now, we will convert the text data into a matrix of token counts (a bag-of-words representation) with the CountVectorizer model.

from sklearn.feature_extraction.text import CountVectorizer
# Initialize the CountVectorizer
count_vectorizer = CountVectorizer()
# Create the dictionary from the corpus
feature_vector = count_vectorizer.fit( train_ds.text )
feature_vector
A photo by Author

Now, to find the total number of features, use the get_feature_names() method.

# Get the feature names
word = feature_vector.get_feature_names()
print( "Total number of features: ", len(word))
#output:
Total number of features:  2132
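Note: in scikit-learn 1.0 and later, get_feature_names() is deprecated (and removed in 1.2) in favor of get_feature_names_out(). If you are on a newer version:

# For scikit-learn >= 1.0
word = feature_vector.get_feature_names_out()
print("Total number of features: ", len(word))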

To view a random sample of the features:

import random
random.sample(word, 10)
A photo by Author

Now, we will transform the documents into a sparse matrix of counts.

train_ds_features = count_vectorizer.transform( train_ds.text )
type(train_ds_features)
#output:
scipy.sparse.csr.csr_matrix

To check the shape of the sparse matrix

train_ds_features.shape
#output:
(6918, 2132)
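Most entries of this matrix are zero, which is why SciPy stores it in compressed sparse row (CSR) format. A small sketch to quantify the sparsity, using only the matrix created above:

# Fraction of zero entries in the document-term matrix
nnz = train_ds_features.getnnz()
total = train_ds_features.shape[0] * train_ds_features.shape[1]
print("Sparsity: {:.2f}% zeros".format(100 * (1 - nnz / total)))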

Now we will convert the sparse matrix into a dense DataFrame.

# Converting the matrix to a dataframe
train_ds_df = pd.DataFrame(train_ds_features.todense())
#Now, the column features are the words
train_ds_df.columns = word

To view the data frame

train_ds_df.head()
A photo by Author

To check the first row of the raw data

train_ds[0:1]

Now, we will see the first row with a slice of columns from the dense matrix.

train_ds_df.iloc[0:1, 150:157]

Counting the frequency of words

Counting the occurrences of each word and collecting them into a DataFrame:

# Count the occurrence of each word column-wise
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts_df = pd.DataFrame(dict(features=word, counts=words_counts))

plt.figure(figsize=(12, 5))
plt.hist(feature_counts_df.counts, bins=50, range=(0, 2000));
plt.xlabel('Frequency of words')
plt.ylabel('Number of words')
Number of words count. A photo by Author

Now, we will count how many words occur exactly once.

len(feature_counts_df[feature_counts_df.counts == 1])
#output:
1228
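So 1,228 of the 2,132 features appear only once, and such rare words add little signal. One hedged option, not used in the pipeline below, is the min_df parameter of CountVectorizer, which drops words appearing in fewer than the given number of documents:

# Drop words that appear in fewer than 2 documents (illustrative threshold)
pruned_vectorizer = CountVectorizer(min_df=2)
pruned_features = pruned_vectorizer.fit_transform(train_ds.text)
print(pruned_features.shape)  # vocabulary shrinks once singletons are removed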

To keep only the most frequent words, we restrict the vocabulary with max_features and build a DataFrame of their counts.

# Initialize the CountVectorizer with the 1000 most frequent words
count_vectorizer = CountVectorizer(max_features=1000)
# Create the dictionary from the corpus
feature_vector = count_vectorizer.fit(train_ds.text)
# Get the feature names
word = feature_vector.get_feature_names()
# Transform the documents into vectors
train_ds_features = count_vectorizer.transform(train_ds.text)
# Count the frequency of the features
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts = pd.DataFrame(dict(features=word, counts=words_counts))

To view the most frequent words as a DataFrame:

feature_counts.sort_values('counts',ascending = False)[0:15]
Most occurrence words. A photo by Author
Data Cleaning

Stopwords

Now, we will remove stopwords from the data because they add little meaning for sentiment analysis.

from sklearn.feature_extraction import text
my_stop_words = text.ENGLISH_STOP_WORDS
#Printing first few stop words
print("Few stop words: ", list(my_stop_words)[0:10])
#Output:
Few stop words: ['seem', 'her', 'else', 'noone', 'hereupon', 'find', 're', 'wherein', 'whither', 'if']

We can also add custom stopwords to the list.

# Adding custom words to the list of stop words
my_stop_words = text.ENGLISH_STOP_WORDS.union(
    ['harry', 'potter', 'code', 'vinci', 'da', 'harri',
     'mountain', 'movie', 'movies'])

Now, make a new data frame after removing stopwords.

# Setting the stop words list
count_vectorizer = CountVectorizer(stop_words=my_stop_words,
                                   max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = feature_vector.get_feature_names()
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts = pd.DataFrame(dict(features=word, counts=words_counts))

View the new data frame after removing stopwords.

feature_counts.sort_values( "counts", ascending = False )[0:15]
A photo by Author


Stemming and Lemmatization

Now, we will reduce words to their root form with the help of the Porter stemmer.

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
analyzer = CountVectorizer().build_analyzer()

# Custom analyzer for stemming and stop word removal
def stemmed_words(doc):
    # Stem every token produced by the default analyzer
    stem_words = (stemmer.stem(w) for w in analyzer(doc))
    # Remove the words in the stop words list
    non_stop_words = [word for word in list(set(stem_words) - set(my_stop_words))]
    return non_stop_words

Now, making a new DataFrame of the root words:

count_vectorizer = CountVectorizer(analyzer=stemmed_words,
                                   max_features=1000)
feature_vector = count_vectorizer.fit(train_ds.text)
train_ds_features = count_vectorizer.transform(train_ds.text)
word = feature_vector.get_feature_names()
words_counts = np.sum(train_ds_features.toarray(), axis=0)
feature_counts = pd.DataFrame(dict(features=word, counts=words_counts))
feature_counts.sort_values("counts", ascending=False)[0:15]
Root Words. A photo by Author
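The section title also promises lemmatization, which maps words to their dictionary form instead of a chopped stem. A minimal sketch with NLTK's WordNetLemmatizer, assuming the WordNet corpus has been downloaded:

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # one-time download of the WordNet corpus
lemmatizer = WordNetLemmatizer()

# Custom analyzer mirroring stemmed_words, but lemmatizing instead
def lemmatized_words(doc):
    lemmas = (lemmatizer.lemmatize(w) for w in analyzer(doc))
    return [w for w in set(lemmas) - set(my_stop_words)]

print(lemmatizer.lemmatize('movies'))
#output:
movie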

Now, converting the vector matrix to a data frame.

# Convert the document vector matrix into a dataframe
train_ds_df = pd.DataFrame(train_ds_features.todense())
# Assign the feature names to the columns
train_ds_df.columns = word
# Assign the sentiment labels from train_ds
train_ds_df['sentiment'] = train_ds.sentiment
Model Building

Naive Bayes Model

Train and Test set

from sklearn.model_selection import train_test_split

train_X, test_X, train_y, test_y = train_test_split(
    train_ds_features, train_ds.sentiment,
    test_size=0.3, random_state=42)

We will use the Bernoulli Naive Bayes classifier, which models each feature as a binary (word present/absent) variable.

from sklearn.naive_bayes import BernoulliNB

nb_clf = BernoulliNB()
nb_clf.fit(train_X.toarray(), train_y)
#output:
BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)
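BernoulliNB binarizes the counts (binarize=0.0 means any nonzero count becomes 1). Since our features are raw word counts, MultinomialNB is a common alternative worth comparing; a hedged sketch:

from sklearn.naive_bayes import MultinomialNB

# Multinomial NB uses the raw word counts instead of binary presence
mnb_clf = MultinomialNB()
mnb_clf.fit(train_X, train_y)
print(mnb_clf.score(test_X, test_y))  # accuracy on the held-out split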

To predict the sentiments, we run the classifier on the test data.

test_ds_predicted = nb_clf.predict(test_X.toarray())

Now, printing the classification report of the Naive Bayes classifier:

from sklearn import metrics
print( metrics.classification_report( test_y, test_ds_predicted ) )
Classification Report. A photo by Author

Print the confusion matrix:

from sklearn import metrics
cm = metrics.confusion_matrix( test_y, test_ds_predicted )
sn.heatmap(cm, annot=True, fmt='.2f' );
Confusion matrix. A photo by Author
Saving and Reloading the model

To save the model, we will use the pickle library.

import pickle
pickle.dump(nb_clf, open("Sentiment_Classifier_model", 'wb'))

To reload the same pickle file for prediction:

# load the model from disk
loaded_model = pickle.load(open("Sentiment_Classifier_model", 'rb'))
test_ds_predicted = loaded_model.predict(test_X.toarray())
print( metrics.classification_report(test_y, test_ds_predicted))
Classification report after reloading the pickle file. A photo by Author
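Note that scoring brand-new text needs the fitted CountVectorizer as well as the model, so in practice both should be pickled. A hedged sketch, with illustrative file and variable names:

# Persist the fitted vectorizer alongside the model
pickle.dump(count_vectorizer, open("Sentiment_Vectorizer", 'wb'))

loaded_vectorizer = pickle.load(open("Sentiment_Vectorizer", 'rb'))
new_text = ["I really loved this movie"]
new_features = loaded_vectorizer.transform(new_text)
print(loaded_model.predict(new_features.toarray()))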
Conclusion:

This article covered the basic concepts of building an NLP model: exploring and cleaning text, and turning words into features for prediction. Other feature-extraction techniques, such as TF-IDF and n-grams, can also be used for prediction; we will cover them in future articles.
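As a preview, both are available through scikit-learn's TfidfVectorizer; a minimal sketch with illustrative parameters:

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF weighting over unigrams and bigrams
tfidf_vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=1000)
tfidf_features = tfidf_vectorizer.fit_transform(train_ds.text)
print(tfidf_features.shape)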

I hope you like the article. Reach out to me on LinkedIn and Twitter.

Recommended Articles

1. Understand List as Big O and Comprehension with Python Examples
2. Python Data Structures Data-types and Objects
3. Exception Handling Concepts in Python
4. Principal Component Analysis in Dimensionality Reduction with Python
5. Fully Explained K-means Clustering with Python
6. Fully Explained Linear Regression with Python
7. Fully Explained Logistic Regression with Python
8. Basics of Time Series with Python
9. Data Wrangling With Python — Part 1
10. Confusion Matrix in Machine Learning

