Term Frequency - Inverse Document Frequency (TF-IDF): Preparing Machines To Read Natural Language With and Without Code


Given our dependence on technology, you’d have thought computers would be up amongst some of the smartest things on the planet. 

Unfortunately, computers are pretty dumb - they can't even read human language. They depend on individuals, or groups of people putting their minds together, to provide them with instructions.

That raises the question: how on earth do we make our computers make sense of text?

It’s a good question considering the majority of us have been saved by autocorrect at some point. Heck, I’ve even got Grammarly installed on my computer to aid me when I start writing gibberish. But how? 

We humans can understand various types of data. Whether it's strings, letters, numbers, or hieroglyphics, we can interpret it in a meaningful way. Computers, on the other hand, can only read numbers. If we'd like to teach our computers to read, we have to speak their language, which means converting our text into numbers.

Sending Computers to School

Term frequency-inverse document frequency (TF-IDF) is one of many ways to convert text to numbers, and each method has its own advantages and disadvantages. TF-IDF performs this transformation using a numerical statistic that reflects how important a word is to a document in a corpus (a corpus is a collection of documents).

Without getting too "mathy", TF-IDF is computed by multiplying two different metrics:

  • The number of times a word appears in a document
  • The inverse document frequency of the word in a corpus.

As a result, the TF-IDF value increases proportionally to the number of times a word appears in the document, offset by the number of documents the word appears in. This adjusts for the fact that some words, such as "a", "the", and "what", appear frequently in almost every document; these words rank lower because they do not carry much meaning for any specific document.
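To make the two metrics concrete, here is a toy, hand-rolled version of the idea on a made-up three-document corpus (this is a simplified sketch, not scikit-learn's exact smoothed formula):

```python
import math

# a tiny made-up corpus of three "documents"
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
docs = [doc.split() for doc in corpus]

def tf(word, doc):
    # term frequency: how often the word appears in this document
    return doc.count(word) / len(doc)

def idf(word, docs):
    # inverse document frequency: words appearing in fewer documents score higher
    containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "the" appears in two of the three documents, "cat" in only one,
# so "cat" ends up with the higher TF-IDF score in document 0
print(tfidf("the", docs[0], docs))  # common word -> lower score
print(tfidf("cat", docs[0], docs))  # rarer word -> higher score
```

Note how the frequent word "the" is penalised by its low IDF even though it occurs twice in the document.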

Coded Example 

The majority of us are on Twitter, right? Even if you aren't, it won't be difficult to follow along. We are going to use Twitter data for our coded example.

# importing libraries
import re
import string
import pandas as pd

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
# previewing data
twitter_data = pd.read_csv("/content/sample_data/train.csv")

As we've stated previously, a machine learning (ML) model does not understand the text in a tweet unless we convert it into numbers. In addition, some ML models can't handle missing values in our data. Even though we won't be doing any modelling on our data, we will drop those columns to simplify this example.

# drop empty columns 
twitter_data.drop(["keyword", "location"], axis=1, inplace=True)


The first obstacle we face is that some of our text uses capital letters, so "Our" and "our" would not be considered the same word. To overcome this, I am going to convert all the text into the same case.

# convert all text to lower case
print(f"Before lowercasing:\n{twitter_data.loc[5, 'text']}\n\n")
twitter_data["text"] = twitter_data["text"].apply(lambda x: x.lower())
print(f"After lowercasing:\n{twitter_data.loc[5, 'text']}")

Great! Everything is lowercase now, but there is still an issue. Some of our tweets have hashtags, hyperlinks, stock market tickers, punctuation, and twitter's famous RT (retweet) symbol - essentially things that may not add any value to the meaning of the text when we try to predict whether a tweet is positive or not. Let’s remove them...

# remove stock market tickers like $GE
twitter_data["text"] = twitter_data["text"].apply(lambda x: re.sub(r'\$\w*', '', str(x)))
# remove old style retweet text "RT"
twitter_data["text"] = twitter_data["text"].apply(lambda x: re.sub(r'^RT[\s]+', '', str(x)))
# remove hyperlinks
twitter_data["text"] = twitter_data["text"].apply(lambda x: re.sub(r'https?:\/\/.*[\r\n]*', '', str(x)))
# remove hashtags - only removing the hash # sign from the word
twitter_data["text"] = twitter_data["text"].apply(lambda x: re.sub(r'#', '', str(x)))

# function to remove punctuation
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# remove punctuation
twitter_data["text"] = twitter_data['text'].apply(remove_punctuations)
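As an aside, the same punctuation stripping can be done in a single pass with Python's built-in str.translate, which avoids looping over every punctuation character (an optional, slightly faster alternative to the function above):

```python
import string

def remove_punctuations(text):
    # build a translation table that deletes every punctuation character
    return text.translate(str.maketrans('', '', string.punctuation))

print(remove_punctuations("hello, world!"))  # hello world
```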

Now we can apply TF-IDF to our text so that our computer can begin to read what we've written. To simplify things further, we then perform singular value decomposition (SVD) to reduce the dimensionality of our TF-IDF features.

# apply tf-idf
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(twitter_data["text"])
print(f"shape after tf-idf: {X.shape}")

# apply svd
svd = TruncatedSVD()
X_svd = svd.fit_transform(X)
print(f"shape after svd: {X_svd.shape}")

Same Example, Without Code!

That was a lot of steps to run through, and sometimes your team may not include someone with technical skills, but that doesn't mean you should miss out on the power of Natural Language Processing (NLP).

With our AI & Analytics Engine, you could seamlessly run through this process without knowing a word of code! 

Once you have your project set up and the data imported, all that is left is applying the transformations to the columns you wish to transform. To perform transformations, all you have to do is select the "ADD ACTION" button and boom! You're in action. 

Watch the short video to see how it works... 

[Video: applying transformations in the AI & Analytics Engine]

Once you've queued all the actions, simply select "COMMIT ACTIONS" to save all the steps you've taken on the data so that you can revisit them whenever you like.

Wrap Up 

In this article, we covered some very simple processing steps in NLP: converting text to lowercase, cleaning text, and transforming the text into numbers so that the computer has a chance at understanding it. We also covered two approaches to performing these steps: one is more technical and requires programming expertise, whereas the other, simpler solution leverages the power of our AI & Analytics Engine, which does not require any prior knowledge of programming, thereby making AI & analytics accessible to more people.

If you have enjoyed this article and want to get updates from PI.EXCHANGE, including tutorials, informative articles, and product updates, subscribe!
