With the huge influx of unstructured text data from a plethora of social media platforms , different forums and a whole wealth of documents, it’s evident that processing these sources of data to distill the information that they contain is challenging because of the inherent complexity involved in processing them.

Natural Language Processing (NLP) helps greatly in processing, analyzing and understanding these sources to gain information and meaningful insights; With the recent advances in computing and easier access to computing resources, certain Deep Learning models have achieved SOTA in solving some of the most challenging NLP tasks.

The NLP series by Women Who Code Data Science track gives the learners a comprehensive learning path; starting from the basics of NLP, gradually introducing advanced concepts like Deep Learning approaches to solve NLP tasks.

In this blog post, let us focus on answering the following questions.

  • What is NLP?
  • What are some interesting use cases of NLP?
  • What are the challenges in processing natural language?
  • What are common text preprocessing techniques?

What is NLP?

Natural language processing (NLP) can be considered to be a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language; in particular, how to program computers to process and analyze large amounts of natural language data.

With interesting applications such as text classification, sentiment analysis, machine translation, speech to text, text to speech, and so on, NLP has evolved over the past few decades from rule-based approaches, statistical techniques to AI-powered applications in the recent past.

Interesting use cases of NLP

Let’s take a look at some of the common use cases of NLP.

  • Machine Translation: Machine Translation is the task of automatically converting one natural language into another while preserving the meaning of the input text and producing fluent text in the output language. However, this task of machine translation comes with inherent challenges.

  • Text Classification: Text Classification is the process of assigning tags or categories to text according to its content; It’s a fundamental problem in NLP and can be done either manually(tedious, time-consuming, and susceptible to human errors) or by leveraging ML techniques.

  • Sentiment Analysis: Sentiment Analysis is the contextual mining of text which identifies and extracts subjective information in the source text, such as recognizing polarity(positive, negative, neutral), identifying emotions, etc. A typical example is in the e-commerce industry, where mining and analyzing reviews for gaining insights on customer satisfaction and experience, identifying potential areas for improvement are important.

Virtual assistants such as Siri, Alexa and Cortana; Google Translate, Speech to text and text to speech converters are all cool NLP applications that we use in our everyday lives!

Challenges in understanding natural language

Natural language has such great diversity, and every language has its own rich grammar and uniqueness. The following are some of the inherent challenges that arise in NLP tasks.

Ambiguity

Ambiguity is an intrinsic characteristic of human conversations and is particularly challenging in Natural Language Understanding scenarios where there might be different forms that are relevant in natural language and in the AI system that we’ve programmed. In AI theory, the process of handling ambiguity is called disambiguation.

Synonymity

Synonymity stems from the fact that we can express the same idea with different terms (which are also dependent on the specific context); For example, ‘big’ and ‘large’ have a similar meaning when referring to sizes, whereas ‘large’ doesn’t make sense when used as a qualifier to the word ‘sister.’

Co-reference

Co-reference is the process of finding all expressions that refer to the same entity in a text . Co-Reference resolution is an important step for a lot of higher-level NLP tasks that involve natural language understanding and is often instrumental in improving the performances of neural architectures like RNN and LSTM.

Syntactic Rules

Knowledge about the structure and syntax of the language is often helpful.

Text preprocessing

  • Contraction Mapping/ Expanding Contractions: Contractions are a shortened version of words or a group of words, quite common in both spoken and written language. In English, they are quite common, such as I will to I’ll, I have to I’ve , do not to don’t, etc. Mapping these contractions to their expanded form helps in text standardization.

  • Tokenization: Tokenization is the process of separating a piece of text into smaller units called tokens. Given a document, tokens can be sentences, words, subwords, or even characters depending on the application.

  • Noise cleaning: Special characters and symbols contribute to extra noise in unstructured text. Using regular expressions to remove them or using tokenizers, which do the pre-processing step of removing punctuation marks and other special characters, is recommended.

  • Spell-checking: Documents in a corpus are prone to spelling errors; In order to make the text clean for the subsequent processing, it is a good practice to run a spell checker and fix the spelling errors before moving on to the next steps.

  • Stopwords Removal: Stop words are those words which are very common and often less significant. Hence, removing these is a pre-processing step as well. This can be done explicitly by retaining only those words in the document which are not in the list of stop words or by specifying the stop word list as an argument in CountVectorizer or TfidfVectorizer methods when getting Bag-of-Words(BoW)/TF-IDF scores for the corpus of text documents.

  • Stemming/Lemmatization: Both stemming and lemmatization are methods to reduce words to their base form. While stemming follows certain rules to truncate the words to their base form, often resulting in words that are not lexicographically correct, lemmatization always results in base forms that are lexicographically correct.

However, stemming is a lot faster than lemmatization. Hence, to stem/lemmatize is dependent on whether the application needs quick pre-processing or requires more accurate base forms.

References and Additional Reading

[1]https://towardsdatascience.com/a-practitioners-guide-to-natural-language-processing-part-i-processing-understanding-text-9f4abfd13e72
[2]https://www.analyticsvidhya.com/blog/2020/04/beginners-guide-exploratory-data-analysis-text-data/
[3]https://towardsdatascience.com/a-complete-exploratory-data-analysis-and-visualization-for-text-data-29fb1b96fb6a
[4]https://towardsdatascience.com/preprocessing-text-data-using-python-576206753c28