A Customized Loss Function for Review Rating Prediction

In this project, anime review ratings are predicted using word embeddings and deep learning. The model reaches a mean absolute error (MAE) of approximately 1, and a specialised loss function is proposed. The new loss function not only learns faster but is also more resistant to overfitting than mean squared error (MSE).


Review source

The problem is to predict the overall rating of an anime review from its text. Ratings are out of ten. The sample review above, for instance, is not a positive one: its rating is 4 out of 10. There are other popular works that classify IMDB reviews as positive or negative. This study goes a step further by predicting a score for each review.

Data and Preprocessing

The Kaggle user NatLee has collected and shared MyAnimeList review data here:

I will not go through the basic data cleaning process. The other operations are:

  • Anime names are removed from their reviews, since their meanings can affect the predictions. For example, “Great Teacher Onizuka” looks like a positive review by itself :)
  • Every occurrence of “n’t” in the reviews is replaced with “ not”.
  • All stop words except “not”, “most”, “no”, “too” and “very” are removed from the reviews, because these five words can affect the rating.
  • Only English letters are kept in the reviews; punctuation and numbers are filtered out.
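The operations listed above can be sketched in a few lines of Python. This is only an illustration, not the code used in the project: the function name `clean_review` and the tiny `STOP_WORDS` set are assumptions made for the example, and a real run would use a full stop-word list.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only;
# the article does not say which list was actually used.
STOP_WORDS = {"the", "a", "an", "is", "was", "it", "this", "i", "do", "did",
              "and", "of", "to", "not", "no", "most", "too", "very"}
# Words the article keeps even though stop-word lists usually contain them.
KEEP = {"not", "most", "no", "too", "very"}


def clean_review(text, anime_name=""):
    """Apply the four preprocessing steps to one review and return its words."""
    if anime_name:
        # Drop the anime title (case-insensitive).
        text = re.sub(re.escape(anime_name), " ", text, flags=re.IGNORECASE)
    # Replace "n't" (curly or straight apostrophe) with " not".
    text = text.replace("n’t", " not").replace("n't", " not")
    # Keep only English letters; digits and punctuation become spaces.
    text = re.sub(r"[^A-Za-z]+", " ", text).lower()
    # Remove stop words, except the ones in KEEP.
    return [w for w in text.split() if w not in STOP_WORDS or w in KEEP]


sample = "I didn't like Great Teacher Onizuka, it was very slow!!! 4/10"
print(clean_review(sample, anime_name="Great Teacher Onizuka"))
# ['not', 'like', 'very', 'slow']
```

Note that a plain “n’t” replacement turns “can’t” into “ca not”; the sketch follows the rule exactly as stated in the list.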

Next, the maximum word count has to be chosen for tokenization. To do this, the word counts of the reviews are visualised.

Word counts for each review

As the chart above shows, the number of words follows a power-law distribution: a small share of the reviews are very long, while most reviews are short. 65% of reviews have fewer than 250 words, so 250 is chosen as the maximum number of words. If a review exceeds 250 words, only its last 250 words are used. Lastly, words are tokenized with the NLTK tokenizer. For text cleaning and tokenization I drew on this study.
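As a rough sketch of the length rule described above, the two helpers below compute the share of reviews under the limit and keep only the last 250 words of a long review. The function names and the toy data are made up for illustration; they are not taken from the project.

```python
def keep_last_words(tokens, max_words=250):
    """Return at most the last `max_words` tokens of a review."""
    return tokens[-max_words:]


def share_shorter_than(token_lists, max_words=250):
    """Fraction of reviews with fewer than `max_words` tokens."""
    return sum(len(t) < max_words for t in token_lists) / len(token_lists)


# Toy data (invented for the example, not the real reviews).
reviews = [["w"] * 10, ["w"] * 100, ["w"] * 300, ["w"] * 1000]

print(share_shorter_than(reviews))                  # 0.5
print([len(keep_last_words(r)) for r in reviews])   # [10, 100, 250, 250]
```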


Word embedding can be applied after tokenization. It is a technique that maps words to vectors according to their meanings. Keras provides an Embedding layer for this. CNN and LSTM layers are also used in the model. The network below is based on another one, which is presented here.

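from keras.models import Sequential
from keras.layers import Embedding, Conv1D, Dense

model = Sequential()
# 5000-word vocabulary, 100-dimensional vectors, reviews of 250 tokens
model.add(Embedding(5000, 100, input_length=250))
model.add(Conv1D(filters=32, kernel_size=3, activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))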
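To make the first two layers more concrete, here is a small NumPy sketch of what an embedding lookup does and what shape a width-3 convolution produces. It only mirrors the sizes used in the listing (5000, 100, 250, 32, 3); the random table and the random review are placeholders, and the sketch assumes the Keras default of no padding in `Conv1D`.

```python
import numpy as np

vocab_size, embed_dim, seq_len = 5000, 100, 250  # sizes from the listing
rng = np.random.default_rng(0)

# An embedding layer is a trainable lookup table: one row per word index.
# Random values stand in for learned weights here.
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# One tokenized review: 250 integer word indices (random, for illustration).
review = rng.integers(0, vocab_size, size=seq_len)

# Looking up each index gives one 100-dimensional vector per word.
embedded = embedding_table[review]

# A width-3 convolution without padding has seq_len - 3 + 1 output positions,
# each with one value per filter.
kernel_size, filters = 3, 32
conv_shape = (seq_len - kernel_size + 1, filters)

print(embedded.shape, conv_shape)  # (250, 100) (248, 32)
```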

First, I used MSE as the loss function, then created a customized loss function defined as follows:

New piecewise loss function

In the first part of the piecewise function, the error is ignored when the predicted value would equal the true rating after rounding. Since every user has to give an integer between 1 and 10, users are also rounding the score in their minds; for example, they give 8 if their score is 7.5. In the second part, the difference is multiplied by two before the mean squared error is computed. The reason for multiplying by two is to increase the effect of the error when the absolute difference is more than 0.5, because squaring shrinks values below 1. As an example, take a difference of 0.6: it contributes 0.36 to MSE, whereas the new function returns (2 × 0.6)² = 1.44. In summary, MSE shrinks errors below 1 and amplifies errors above 1, while the new function shrinks errors below 0.5 (to zero) and amplifies errors above 0.5.
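Written out, the piecewise definition looks roughly as follows. This is a reconstruction from the description above and from the Keras implementation in this article, so the exact boundary handling (which side of ±0.5 is included) is taken from that code; the symbols d and n are introduced here for the difference y − ŷ and the number of samples.

```latex
\[
L(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \ell\,(y_i - \hat{y}_i),
\qquad
\ell(d) =
\begin{cases}
0, & -0.5 < d \le 0.5 \\
(2d)^2, & \text{otherwise}
\end{cases}
\]
```

For d = 0.6 this gives (2 · 0.6)² = 1.44, and for d = 0.4 it gives 0, matching the example in the text.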

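from keras import backend as K

def piecewise_loss(y_true, y_pred):
    # Flag differences outside the (-0.5, 0.5] band
    errorGreater = K.greater((y_true - y_pred), 0.5)
    errorLess = K.less_equal((y_true - y_pred), -0.5)
    # 1.0 where the error counts, 0.0 where it is ignored
    error = K.cast(errorGreater | errorLess, K.floatx())
    # Double the remaining differences, then take the mean of the squares
    return K.mean(K.square(error * 2 * (y_true - y_pred)))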

This is how a customized loss function is written in Keras. The function should take two parameters, y_true and y_pred, and it should use tensor-compatible Keras backend functions.

model.compile(optimizer='sgd', loss=piecewise_loss, metrics=['mae'])

The function is passed as the loss parameter when compiling the model.
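The arithmetic from the 0.6 example can be checked without Keras. The NumPy function below is a plain re-implementation of `piecewise_loss` written for this check (the names `piecewise_loss_np` and `mse_np` are not part of the project), so it is a sketch for sanity-testing the numbers rather than something to train with.

```python
import numpy as np


def piecewise_loss_np(y_true, y_pred):
    """NumPy version of the piecewise loss, for checking values by hand."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Errors count only outside the (-0.5, 0.5] band.
    outside = (diff > 0.5) | (diff <= -0.5)
    return np.mean(np.square(outside * 2 * diff))


def mse_np(y_true, y_pred):
    """Ordinary mean squared error."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(np.square(diff))


# Difference of 0.6: MSE gives about 0.36, the piecewise loss about 1.44.
print(mse_np([8.0], [7.4]), piecewise_loss_np([8.0], [7.4]))
# Difference of 0.4: the prediction rounds to the true rating, so it is ignored.
print(mse_np([8.0], [7.6]), piecewise_loss_np([8.0], [7.6]))
```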


Both the MSE and the piecewise loss function give the same accuracy: MAE is approximately 1.

Validation data errors of two different loss functions in 20 epochs

The chart above shows MAE values on the validation dataset. The blue line uses the piecewise loss function and the orange line uses the MSE loss function. As seen, the piecewise loss function lets the model learn faster: it reached its minimum error after the 14th epoch, while the MSE model reached its minimum at the 18th epoch. The piecewise loss function also gave stable errors, whereas MSE gave fluctuating ones. Besides, the MSE model starts to overfit at the 20th epoch; the piecewise loss function is more resistant to overfitting since it ignores small errors.