Garbage In, Garbage Out. Clean data, Accurate Model.
Cleaning data is a required step when building machine learning models. The concept of garbage in, garbage out applies to machine learning: if a model is fed dirty data, it will produce poor results, while clean data leads to accurate ones. This makes data cleaning and preprocessing the essential first step when working with data.
In classical machine learning, the standard data cleaning methods include removing duplicate entries, outliers, and unwanted entries, and filling missing values with the mean or mode. These methods are helpful in specific cases, such as when dealing with numerical values, but they do not yield good results on textual data.
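As a quick illustration, here is a minimal sketch of those classical cleaning steps using pandas; the tiny DataFrame and its column names are made up purely for demonstration:
import pandas as pd

# Hypothetical numerical data set with a duplicate row and missing values
df = pd.DataFrame({
    "age": [25, 25, 31, None, 47],
    "salary": [50000, 50000, 62000, 58000, None],
})

df = df.drop_duplicates()                               # remove duplicate entries
df["age"] = df["age"].fillna(df["age"].mean())          # fill missing values with the mean
df["salary"] = df["salary"].fillna(df["salary"].mean())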
Natural language processing is a subfield of computer science, linguistics, and artificial intelligence that aims to help machines understand human language by using written or spoken words as data. Human language is complex, and words sometimes have multiple meanings; for example, current can refer to the flow of water or electricity, and also to something modern or trendy. Because of this complexity, applying the standard cleaning methods above has little or no effect on textual data.
Prerequisites — The following are required to follow along with this article:
- Basic knowledge of Python.
- Know how to import Python libraries.
- Good grasp of working with functions.
Outline –
- What is Text Preprocessing?
- Techniques of Text Preprocessing with Python.
- Popular tools used for Text Preprocessing.
What is Text Preprocessing?
Text preprocessing is cleaning text data to prepare it for model training. It involves removing noise that prevents the text from being properly represented in vector form. This noise can include extra whitespace, punctuation, numbers, emojis, emoticons, URLs, HTML tags, and symbols, among others.
Whether to remove a given item depends on the goal you are trying to achieve with your data. For example, if you are working with a financial data set, eliminating symbols such as the dollar sign, $, is not advisable. However, if you are dealing with biological or scientific texts, removing such symbols would be appropriate, as these texts hardly require them.
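As a simple illustration of this kind of noise removal, the sketch below strips URLs, HTML tags, and numbers with regular expressions. The patterns are deliberately basic and the function name is only for demonstration:
import re

def remove_noise(text):
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # remove URLs
    text = re.sub(r"<.*?>", "", text)                  # remove HTML tags
    text = re.sub(r"\d+", "", text)                    # remove numbers
    return " ".join(text.split())                      # collapse leftover whitespace

sample_text = "<p>Read the 2 tutorials at https://example.com before training.</p>"
remove_noise(sample_text)
# 'Read the tutorials at before training.'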
Techniques of Text Preprocessing with Python
The techniques of text preprocessing in Python include:
- Lower casing.
- Eliminating whitespaces.
- Eliminating punctuation.
- Expanding contractions.
- Spell correction.
- Tokenization.
- Eliminating stopwords.
- Stemming.
- Lemmatization.
Lower casing
As the name implies, lower casing involves converting all uppercase letters to lower case. The words Book, bOOk, and book will have different values when converted to vectors (text represented with numbers). Lower casing is therefore essential, as it ensures a uniform vector representation of similar texts.
Code Example.
def convert_to_lower_case(text):
    return text.lower()

sample_text = "Sometimes it pays to stay in bed on Monday, rather than spending the rest of the week debugging Monday's code. – Dan Salomon"

convert_to_lower_case(text=sample_text)
Output
"sometimes it pays to stay in bed on monday, rather than spending the rest of the week debugging monday's code. – dan salomon"
The code above converts all the letters in the sample text to lower case using the String lower() method.
Eliminating whitespaces
Whitespaces are extra spaces (two or more) between characters. They are not of much use for machine learning tasks, hence the need for elimination.
Code Example.
def remove_whitespace(text):
    return " ".join(text.split())

sample_text = "Talk is cheap.    Show me the code.    ― Linus Torvalds"  # note the extra spaces

remove_whitespace(text=sample_text)
Output.
'Talk is cheap. Show me the code. ― Linus Torvalds'
The remove_whitespace function splits the sample text into a list, using whitespace as the separator. It then joins the list together using single whitespace. With this method, the function removes any excess whitespace present in the text.
Eliminating Punctuation
Punctuation shows how a sentence is constructed and how it should be read. While this is important to humans, it does not matter to a machine trying to understand what a text means and represent it in vector form. That is why removing punctuation is included when cleaning text.
Code Example.
import re

def remove_punctuations(text):
    return re.sub(r"[^\w\s]", "", text)

sample_text = "Programming is about managing complexity: the complexity of the problem, laid upon the complexity of the machine. Because of this complexity, most of our programming projects fail."

remove_punctuations(sample_text)
Output.
'Programming is about managing complexity the complexity of the problem laid upon the complexity of the machine Because of this complexity most of our programming projects fail'
Here, the regular expression substitutes anything that is not a word character \w or a whitespace character \s with an empty string. If you look at the output, you will see the colon :, the commas ,, and the full stops . are all stripped from the original text.
Expanding Contractions.
Contractions are words made by combining and shortening two words. We commonly use them when writing and speaking. Examples are doesn’t (does + not), I’d (I + would), and don’t (do + not). You expand contractions to bring the words in a text to their base form for better analysis. A use case would be finding the verbs college students commonly use every day; expanding contractions produces better results because the machine cannot tell that I’ve and I have express the same idea.
Code Example.
First, you need to install the contractions library using pip install contractions
import contractions

def expand_contractions(text):
    return contractions.fix(text)

sample_text = "I'm not a great programmer; I'm just a good programmer with great habits. — Kent Beck"

expand_contractions(sample_text)
The contractions library converted “I’m” to “I am”. It works by iterating through every word in the sample text and looking it up in a dictionary that has the contraction as the key and its complete form as the value. When a contraction is found, it is replaced with the corresponding dictionary value.
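To make that mechanism concrete, here is a hand-rolled sketch of the same idea. The tiny dictionary below is only illustrative and is far smaller than the one the library actually ships with:
# Toy contraction dictionary, for illustration only
contraction_map = {"i'm": "I am", "don't": "do not", "doesn't": "does not"}

def expand_contractions_manually(text):
    words = text.split()
    expanded = [contraction_map.get(word.lower(), word) for word in words]
    return " ".join(expanded)

expand_contractions_manually("I'm sure it doesn't matter")
# 'I am sure it does not matter'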
Spell Correction
Spell correction involves correcting misspelled words to achieve consistency among the words in a text and to prevent different vector representations of the same, misspelled word.
Code Example.
First, you need to install the TextBlob library using pip install textblob
from textblob import TextBlob

def correct_spelling(text):
    text = TextBlob(text)
    correct_text = text.correct()
    return correct_text

sample_text = "Truht can only be fonud in one place: the cdoe."

correct_spelling(sample_text)
Output.
TextBlob("Truth can only be found in one place: the code.")
As seen from the operation above, TextBlob returned the correct spelling of the words “truth,” “found,” and “code.” Note that TextBlob’s spelling correction is about 70% accurate and sometimes returns the wrong word.
Tokenization
Tokenization in natural language processing involves breaking down texts into smaller blocks of words or sentences called tokens. With tokenization, text is split into tokens for models to find sequences in texts, resulting in a better understanding and representation of the text by the model.
Text can be tokenized into sentences, words, or subwords. Which of these tokenization methods to apply depends on the project’s goal. For example, if you are building a model around topic-specific essays, splitting the text into sentences could be the better choice. Word and sentence tokenization are shown next, and a toy subword sketch follows them.
Code Example.
First, install the Natural Language Toolkit Library (NLTK) using pip install nltk
import nltk
from nltk import word_tokenize

# Download the tokenizer models the first time (newer NLTK versions use "punkt_tab")
nltk.download("punkt")

def tokenize_word(text):
    return word_tokenize(text)

sample_text = "Object-oriented programming offers a sustainable way to write spaghetti code. It lets you accrete programs as a series of patches. ― Paul Graham"

tokenize_word(sample_text)
Output.
['Object-oriented', 'programming', 'offers', 'a', 'sustainable', 'way', 'to', 'write', 'spaghetti', 'code', '.', 'It', 'lets', 'you', 'accrete', 'programs', 'as', 'a', 'series', 'of', 'patches', '.', '―', 'Paul', 'Graham']
The word_tokenize function splits the raw text into individual words and returns a list containing the words.
Tokenizing sentences works the same way, but splits them into sentences rather than words.
from nltk import sent_tokenize

def tokenize_sent(text):
    return sent_tokenize(text)

sample_text = "Object-oriented programming offers a sustainable way to write spaghetti code. It lets you accrete programs as a series of patches. ― Paul Graham"

tokenize_sent(sample_text)
Output.
['Object-oriented programming offers a sustainable way to write spaghetti code.', 'It lets you accrete programs as a series of patches.', '― Paul Graham']
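Subword tokenization, mentioned earlier, normally relies on a learned vocabulary (for example Byte-Pair Encoding or WordPiece) and a dedicated library. The character n-gram sketch below is only a toy illustration of the idea of breaking words into smaller pieces:
def char_ngrams(word, n=3):
    # Split a single word into overlapping character n-grams
    return [word[i:i + n] for i in range(len(word) - n + 1)]

char_ngrams("programming")
# ['pro', 'rog', 'ogr', 'gra', 'ram', 'amm', 'mmi', 'min', 'ing']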
Eliminating Stopwords
Stopwords are words that appear frequently in texts. Words like “is”, “and”, “but”, and “I” are examples of stopwords. Since stopwords carry little importance for a model trying to understand the text, eliminating them is advisable.
Code Example.
import nltk
from nltk.corpus import stopwords
from nltk import word_tokenize

# Download the stopword list the first time
nltk.download("stopwords")

stopwords_list = stopwords.words("english")

def eliminate_stopwords(text):
    tokenized_text = word_tokenize(text)
    without_stopwords = [word for word in tokenized_text if word not in stopwords_list]
    return " ".join(without_stopwords)

sample_text = "Software and cathedrals are much the same — first we build them, then we pray."

eliminate_stopwords(sample_text)
Output.
'Software cathedrals much — first build , pray .'
The code tokenizes the text into words, then iterates through the words, removing any stopword found. The output shows the words “and”, “are”, “the”, “same”, “we”, “them”, and “then” are removed.
Stemming
Stemming is the process of reducing words to their root form. Words like “running,” “ran,” “runs,” and “run” all have their root form in the word “run.” Stemming’s purpose is to reduce data redundancy which negatively affects the performance of machine learning models.
Code Example.
from nltk.stem import PorterStemmer
from nltk import word_tokenize

stemmer = PorterStemmer()

def stem_words(text):
    tokenized_text = word_tokenize(text)
    stemmed_words = [stemmer.stem(word) for word in tokenized_text]
    return " ".join(stemmed_words)

sample_text = "One: Demonstrations always crash. And two: The probability of them crashing goes up exponentially with the number of people watching."

stem_words(sample_text)
Output.
'one : demonstr alway crash . and two : the probabl of them crash goe up exponenti with the number of peopl watch .'
The output above shows the effect of stemming on text. For example, the word “watching” becomes “watch” because the stemmer strips off suffixes. Stemming has a drawback: the resulting root form is not always a meaningful word. For example, “goes” becomes “goe,” which has no meaning. Lemmatization eliminates this drawback.
Lemmatization
Lemmatization is similar to stemming in that both reduce words to their root form. The difference is that lemmatization reduces words to a root form (the lemma) that actually exists in the dictionary.
Code Example.
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize

# Download the WordNet data the first time
nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def lemmatize_words(text):
    tokenized_text = word_tokenize(text)
    lemmatized_words = [lemmatizer.lemmatize(word) for word in tokenized_text]
    return " ".join(lemmatized_words)

sample_text = "Software undergoes beta testing shortly before it’s released. Beta is Latin for still doesn’t work."

lemmatize_words(sample_text)
Output.
'Software undergo beta test shortly before it ’ s release . Beta be Latin for still doesn ’ t work .'
The lemmatized words are in the dictionary, and each still retains its meaning, which solves the drawback of stemming. Lemmatization’s disadvantage is that it is more computationally intensive.
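One way to get better lemmas, assuming you know (or can tag) each word’s part of speech, is to pass the optional pos argument to WordNetLemmatizer, which otherwise treats every word as a noun. A brief sketch follows; the expected outputs in the comments may vary slightly across NLTK versions:
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

lemmatizer.lemmatize("testing")            # 'testing' - treated as a noun by default
lemmatizer.lemmatize("testing", pos="v")   # 'test'    - treated as a verb
lemmatizer.lemmatize("running", pos="v")   # 'run'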
Popular tools used for Text Preprocessing
- Natural Language Toolkit (NLTK): Natural Language Toolkit (NLTK) is an open-source library for working with human language data using the Python programming language. It was created by Steven Bird and Edward Loper in the Department of Computer and Information Science at the University of Pennsylvania and released in 2001. Initially built for research and educational purposes, you can use NLTK for classification, stemming, tokenization, part-of-speech tagging, and parsing.
- TextBlob: TextBlob, as stated on its official website, “stands on the giant shoulders of NLTK and pattern, and plays nicely with both.” It is a Python library for handling textual data. The library can perform NLP tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more. It has a gentle learning curve and can be a stepping stone to understanding NLP concepts.
- spaCy: spaCy is a more recent library for working with human language data, released in 2015 by Matthew Honnibal and Ines Montani. Written in Cython, it is fast and well suited for production environments. spaCy is also optimized to work with deep learning models built with TensorFlow, PyTorch, or MXNet through Thinc, its own machine learning library. A minimal example is shown below.
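As a taste of how spaCy handles several of the preprocessing steps above in a single pass, here is a minimal sketch. It assumes the small English model has already been downloaded with python -m spacy download en_core_web_sm:
import spacy

# Load the small English pipeline (tokenizer, tagger, lemmatizer, and more)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Talk is cheap. Show me the code.")

# Each token exposes its text, lemma, and whether it is a stopword
for token in doc:
    print(token.text, token.lemma_, token.is_stop)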
Conclusion.
This article was a comprehensive walkthrough of text preprocessing for building natural language models. Text preprocessing proves valuable because it leads to a uniform vector representation of raw text and yields better results. Applying these techniques according to your project and its goals will lead to success.