# Term Frequency - Inverse Document Frequency (TF-IDF): Preparing Machines To Read Natural Language With and Without Code

Given our dependence on technology, you'd think computers would be among the smartest things on the planet.

Unfortunately, computers are pretty dumb - they can't even read human language. They depend on an individual, or a group of people putting their minds together, to provide them with instructions.

Well, that raises the question: how on earth do we make our computers make sense of text?

It’s a good question considering the majority of us have been saved by autocorrect at some point. Heck, I’ve even got Grammarly installed on my computer to aid me when I start writing gibberish. But how?

We humans can understand various types of data. Whether it's strings, letters, numbers, hieroglyphics, etc., we can interpret it in a way that is meaningful. Computers, on the other hand, can only read numbers. If we'd like to teach our computers to read, we have to speak their language, which means converting our text into numbers.

## Sending Computers to School

Term frequency-inverse document frequency (TF-IDF) is one of many ways to convert text to numbers. Each method performs this transformation in its own way, and each comes with its own advantages and disadvantages. TF-IDF does it by using a numerical statistic to reflect how important a word is to a document in a corpus (a corpus is a collection of documents).

Without getting too "mathy", TF-IDF is computed by multiplying two different metrics:

• Term frequency: the number of times a word appears in a document.
• Inverse document frequency: how common or rare the word is across the whole corpus.

As a result, the TF-IDF value increases proportionally with the number of times a word appears in the document, and is offset by the number of documents in the corpus that contain the word. The offset adjusts for words that appear frequently in almost any text, such as "a", "the", and "what": even though these words appear in many documents, they rank lower because they do not tell us much about any specific document.
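To make the two metrics concrete, here is a minimal from-scratch sketch using the plain textbook formulation tf(t, d) × log(N / df(t)) on a made-up three-document corpus. Libraries such as scikit-learn add smoothing and normalisation on top of this, so their numbers will differ slightly:

```python
import math

# a made-up three-document corpus
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are friends",
]

docs = [doc.split() for doc in corpus]
n_docs = len(docs)

def tf(term, doc):
    # term frequency: how often the term appears in this document
    return doc.count(term) / len(doc)

def idf(term):
    # inverse document frequency: rarer terms score higher
    df = sum(1 for doc in docs if term in doc)
    return math.log(n_docs / df)

def tf_idf(term, doc):
    return tf(term, doc) * idf(term)

# "the" appears in 2 of 3 documents, so its idf is low;
# "cat" appears in only 1, so it is weighted more heavily
print(f"tf-idf of 'the' in doc 0: {tf_idf('the', docs[0]):.3f}")
print(f"tf-idf of 'cat' in doc 0: {tf_idf('cat', docs[0]):.3f}")
```

Notice that "the" appears twice in the first document but still scores lower than "cat", which appears once - exactly the offsetting behaviour described above.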

## Coded Example

The majority of us are on Twitter, right? Okay, even if you said no, it won't be difficult to follow along. We are going to be using Twitter data for our coded example.

```python
# importing libraries
import re
import string
import pandas as pd

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# previewing data
twitter_data = pd.read_csv("/content/sample_data/train.csv")
twitter_data.head()
```

As we've stated previously, a machine learning (ML) model does not understand the text in a tweet unless we convert it into numbers. In addition, some ML models cannot handle missing values in the data. Even though we won't be doing any modelling here, we will drop the two columns that contain missing values to simplify this example.

```python
# drop the two columns that contain missing values
twitter_data.drop(["keyword", "location"], axis=1, inplace=True)
twitter_data.head()
```

The first obstacle we face is that some of our text uses capital letters, so "Our" and "our" would not be treated as the same word. To overcome this, we convert all the text to the same case.

```python
# convert all text to lower case
print(f"Before lowercasing:\n{twitter_data.loc[5, 'text']}\n\n")
twitter_data["text"] = twitter_data["text"].apply(lambda x: x.lower())
print(f"After lowercasing:\n{twitter_data.loc[5, 'text']}")
```

Great! Everything is lowercase now, but there is still an issue. Some of our tweets have hashtags, hyperlinks, stock market tickers, punctuation, and Twitter's famous RT (retweet) marker - essentially things that may not add any value to the meaning of the text when we try to predict whether a tweet is positive or not. Let's remove them...

```python
# remove stock market tickers like $GE
twitter_data["text"] = twitter_data["text"].apply(lambda x: re.sub(r'\$\w*', '', str(x)))
# remove old style retweet text "RT"
twitter_data["text"] = twitter_data["text"].apply(lambda x: re.sub(r'^RT[\s]+', '', str(x)))
# remove hyperlinks
twitter_data["text"] = twitter_data["text"].apply(lambda x: re.sub(r'https?://\S+', '', str(x)))
# remove hashtags - only removing the hash # sign from the word
twitter_data["text"] = twitter_data["text"].apply(lambda x: re.sub(r'#', '', str(x)))

# function to remove punctuation
def remove_punctuations(text):
    for punctuation in string.punctuation:
        text = text.replace(punctuation, '')
    return text

# remove punctuation
twitter_data["text"] = twitter_data["text"].apply(remove_punctuations)

print(twitter_data.loc[5, "text"])
```
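As an aside, the punctuation loop above copies the string once per punctuation character. Python's built-in `str.translate` does the same removal in a single pass, so a drop-in replacement for the function could look like this (a small sketch, equivalent in behaviour for the standard `string.punctuation` set):

```python
import string

def remove_punctuations(text):
    # str.maketrans('', '', chars) builds a table that deletes every
    # listed character; translate applies it in one pass over the string
    return text.translate(str.maketrans('', '', string.punctuation))

print(remove_punctuations("wow... #blessed, right?!"))  # wow blessed right
```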

Now we can apply TF-IDF to our text so that our computer can begin to read what we've written. To simplify things further, we also apply singular value decomposition (SVD) to reduce the dimensionality of our TF-IDF features.
` `
```python
# apply tf-idf
tfidf = TfidfVectorizer()
tfidf.fit(twitter_data["text"])
X = tfidf.transform(twitter_data["text"])
print(f"shape after tf-idf: {X.shape}")

# apply svd
svd = TruncatedSVD()
X_svd = svd.fit_transform(X)
print(f"shape after svd: {X_svd.shape}")
```

## Same Example, Without Code!

That was a lot of steps to run through, and sometimes your team may not include someone with the technical abilities to do so - but that doesn't mean you should miss out on the power of Natural Language Processing (NLP).

With our AI & Analytics Engine, you could seamlessly run through this process without knowing a word of code!

Once you have your project set up and the data imported, all that is left is applying the transformations to the columns you wish to transform. To perform transformations, all you have to do is select the "ADD ACTION" button and boom! You're in action.

Watch the short video to see how it works...

Once you've queued all the actions, you simply select "COMMIT ACTIONS" to save all the steps you've taken on the data so that you can revisit them whenever you like.

## Wrap Up

In this article, we covered some very simple processing steps in NLP: converting text to lowercase, cleaning the text, and transforming it into numbers so that the computer has a chance of understanding it. We also covered two approaches to performing these steps: one is more technical and requires programming expertise, whereas the other, simpler solution leverages the power of our AI & Analytics Engine, which requires no prior knowledge of programming, making AI & analytics accessible to more people.

#### If you have enjoyed this article and want to get updates from PI.EXCHANGE, including tutorials, informative articles, and product updates, subscribe!
