Application of Machine Learning for Entity Resolution on social-media user accounts

Social-media sites provide a wealth of information that users post publicly. This gives rise to many publicly available data sources that have the potential to provide an insight into the interests and activities of a user. Messages posted by users often encapsulate their typical online behaviour, a kind of "fingerprint".

Law-enforcement authorities are often interested in finding leads for further investigation of illegal activities such as cybercrime and terrorism, based on content posted online. The problem is that people involved in such activities typically use ambiguous user handles and fake profiles to post and share information about them. Authorities therefore often want to trace the "real person" behind such a user account from among a list of suspects who have a genuine, identifiable profile somewhere else on the internet. Since publicly posted content can be accessed legitimately, it can be used to build an algorithm that automatically matches a profile with an obfuscated identity to a legitimate one. The match can be based, for example, on the general style and content of the messages.

PI.EXCHANGE partnered with WorldStack Inc. to provide this functionality as a model trained and deployed as an API endpoint on the AIA Engine platform. This enabled WorldStack Inc. to provide a "user matching" service to law-enforcement agencies such as the Department of Home Affairs (Australian Government) and the Australian Federal Police.





The data collected by WorldStack Inc. for this pilot project consists of social media posts obtained as a result of searching and applying some filtering criteria. Social-media platforms were scoured for posts resulting from keyword searches. In addition to the text of the messages, metadata (such as profile name, profile information, message timestamp, post URL, etc.) was also collected. This resulted in around a million messages spanning from 16 Apr 2018 to 26 Apr 2018.
Below is a word cloud of the occurrences of the keywords that were searched:

A variety of social-media sites were used as sources for data collection. Shown below is a histogram of the number of messages sourced from each platform. The largest number of messages came from Twitter, followed by Reddit, Flickr, Google Plus, and NewsAPI. The fewest messages came from YouTube:

Data Preparation

As part of the data-preparation step, users with fewer than 20 messages in total were filtered out, because it is not possible to "fingerprint" a user's typical behaviour from very little online activity. This left us with 1000+ profiles. Retweets on Twitter were filtered out and not included in this count, since content posted by other users must not be used to build features for a user.
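The filtering step can be sketched in a few lines of Python. This is only an illustration under assumptions: the record layout and the field names `user`, `text`, and `is_retweet` are hypothetical and do not come from the WorldStack data. Only the threshold of 20 messages and the exclusion of retweets are taken from the text.

```python
from collections import Counter

MIN_MESSAGES = 20  # threshold stated in the text


def filter_active_users(messages, min_messages=MIN_MESSAGES):
    """Drop retweets, then keep only users with at least `min_messages` posts."""
    # Retweets carry someone else's content, so they are removed before counting.
    originals = [m for m in messages if not m["is_retweet"]]
    counts = Counter(m["user"] for m in originals)
    keep = {user for user, n in counts.items() if n >= min_messages}
    return [m for m in originals if m["user"] in keep]


# Toy data (hypothetical): alice has 25 original posts,
# bob has 15 original posts plus 10 retweets.
messages = (
    [{"user": "alice", "text": f"post {i}", "is_retweet": False} for i in range(25)]
    + [{"user": "bob", "text": f"post {i}", "is_retweet": False} for i in range(15)]
    + [{"user": "bob", "text": f"RT {i}", "is_retweet": True} for i in range(10)]
)
kept = filter_active_users(messages)
kept_users = sorted({m["user"] for m in kept})
```

In this toy example only alice survives: bob has 25 posts in total, but only 15 once retweets are excluded.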


ML Modelling, Labelling and Feature Extraction

To pose the problem as supervised classification, PI.EXCHANGE and WorldStack Inc. devised the following scheme to create data with noisy labels. We form all possible ordered pairs of user accounts ("Account A", "Account B"), including the pairing of an account with itself, and label each pair "not match" if A and B are two separate profiles and "match" if they are the same profile. The principle behind this approach is that genuine alias pairs will be few and far between in the pool, so mislabelling them as "not match" introduces only a small amount of noise in the labels. One modification was needed: if the only positive examples are accounts paired with themselves, the model learns to predict a match only when two accounts have exactly the same content, which is not the case in real life. Hence, we created positive labels in a different way. We split each account into part "1" and part "2" by taking a random split of its messages. We then make four special pairs for each account: part 1 with part 1, part 1 with part 2, part 2 with part 1, and part 2 with part 2. All of these are labelled as positive:


Account X    Account Y    Same Person?
A            B            False
B            A            False
A1           A1           True
A1           A2           True
A2           A1           True
B1           B1           True
B1           B2           True
...          ...          ...
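The labelling scheme can be sketched as follows. This is a minimal illustration and not the code used in the project: the function names are hypothetical, and splitting each account into two halves of equal size is an assumption (the text only says the split is random).

```python
import random
from itertools import permutations


def split_account(messages, rng):
    """Randomly split one account's messages into part 1 and part 2."""
    shuffled = messages[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2  # assumption: an even 50/50 split
    return shuffled[:mid], shuffled[mid:]


def build_labelled_pairs(accounts, seed=0):
    """Return (left_id, right_id, is_match) triples and the account parts."""
    rng = random.Random(seed)
    pairs = []
    # Negatives: every ordered pair of two different accounts.
    for a, b in permutations(sorted(accounts), 2):
        pairs.append((a, b, False))
    # Positives: the four ordered pairings of each account's two parts.
    parts = {}
    for name in sorted(accounts):
        parts[name + "1"], parts[name + "2"] = split_account(accounts[name], rng)
        for left in ("1", "2"):
            for right in ("1", "2"):
                pairs.append((name + left, name + right, True))
    return pairs, parts


# Toy accounts (hypothetical) with 20 messages each.
accounts = {
    "A": [f"a{i}" for i in range(20)],
    "B": [f"b{i}" for i in range(20)],
}
pairs, parts = build_labelled_pairs(accounts)
```

For two accounts this yields 2 negative pairs and 8 positive pairs, and the two parts of each account together contain every one of its messages exactly once.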


To extract features, the text data was tokenised into words and sentences. We then computed statistical measures such as word counts, word-length distributions, and the frequency of capitalisation, spaces, punctuation, and smileys, as well as cosine-distance-based similarity measures between the vocabularies of the two accounts. These measures were assembled into a feature vector of the same length for every pair. Together with the labels above, this gives the features and labels needed for supervised learning.
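A reduced version of such a feature recipe might look like the sketch below. The specific features, their names, and the simple regular-expression tokeniser are assumptions made for illustration; the actual recipe used more measures (for example smileys and sentence-level statistics) that are not reproduced here.

```python
import math
import re
import string
from collections import Counter


def tokenize(text):
    """Very simple word tokeniser (assumption: lower-cased alphabetic words)."""
    return re.findall(r"[a-z']+", text.lower())


def style_features(messages):
    """Per-account style statistics, returned as a name -> value mapping."""
    text = " ".join(messages)
    words = tokenize(text)
    n_chars = max(len(text), 1)
    return {
        "words_per_message": len(words) / max(len(messages), 1),
        "mean_word_length": sum(map(len, words)) / max(len(words), 1),
        "capital_ratio": sum(c.isupper() for c in text) / n_chars,
        "space_ratio": text.count(" ") / n_chars,
        "punctuation_ratio": sum(c in string.punctuation for c in text) / n_chars,
    }


def cosine_similarity(counts_a, counts_b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_features(messages_a, messages_b):
    """Fixed-length vector: style of A, style of B, vocabulary similarity."""
    fa, fb = style_features(messages_a), style_features(messages_b)
    vocab_a = Counter(tokenize(" ".join(messages_a)))
    vocab_b = Counter(tokenize(" ".join(messages_b)))
    return (
        [fa[k] for k in sorted(fa)]
        + [fb[k] for k in sorted(fb)]
        + [cosine_similarity(vocab_a, vocab_b)]
    )


vec_same = pair_features(["Hello World!", "Nice day"], ["Hello World!", "Nice day"])
vec_diff = pair_features(["Hello World!"], ["cats purr"])
```

Every pair yields a vector of the same length (11 values here), whatever the number of messages in each account. The last entry is the vocabulary similarity: 1.0 for identical vocabularies and 0.0 when no words are shared.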

The resulting dataset had 4,148 positive and 1,074,332 negative labels. The data was thus highly imbalanced. Despite this, the AIA Engine's scorers make sure that an appropriate classification metric is used for optimisation so that the important class is predicted with good fidelity.
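The two counts follow directly from the labelling scheme. The exact number of profiles is not stated in this article; 1,037 is the value that reproduces both reported figures and is consistent with the "1000+ profiles" mentioned earlier, so treat it as an inference. The function name below is hypothetical.

```python
def label_counts(n_accounts):
    """Number of positive and negative pairs produced by the labelling scheme."""
    negatives = n_accounts * (n_accounts - 1)  # ordered pairs of distinct accounts
    positives = 4 * n_accounts  # four part-vs-part pairs per account
    return positives, negatives


# 1037 is inferred from the reported counts, not stated in the source.
positives, negatives = label_counts(1037)
imbalance = negatives / positives  # negatives per positive example
```

With 1,037 accounts this gives 4,148 positives and 1,074,332 negatives, that is, 259 negative pairs for every positive pair.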

With this method, the model will produce a small number of "false positives". These are pairs of two different profiles that the model predicts as a match, even though they were originally labelled "not match". They are good candidates for further investigation, since the two sets of messages appear as statistically similar as two sets of messages drawn from the same profile.
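Selecting these candidates from model output amounts to a simple filter, sketched below. The score values, the 0.5 threshold, and the function name are assumptions for illustration only; they do not describe the output format of the AIA Engine.

```python
def candidate_matches(pairs, scores, threshold=0.5):
    """Return pairs labelled 'not match' that the model scores as a match.

    pairs  : list of (left_id, right_id, is_match) triples
    scores : model probability of 'match' for each pair (hypothetical values)
    """
    candidates = [
        (score, left, right)
        for (left, right, is_match), score in zip(pairs, scores)
        if not is_match and score >= threshold
    ]
    # Highest-scoring candidates first, so they can be investigated first.
    return sorted(candidates, reverse=True)


pairs = [("A", "B", False), ("B", "A", False), ("A", "C", False), ("A1", "A2", True)]
scores = [0.91, 0.88, 0.03, 0.97]  # made-up model outputs
candidates = candidate_matches(pairs, scores)
```

Here the pairs (A, B) and (B, A) are flagged for investigation, while the true positive (A1, A2) is excluded because it was built from a single account.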


Making an "App" for the AIA Engine to solve the problem

The AIA Engine platform currently supports the following three types of problems that can be solved with machine learning: regression, classification, and time-series forecasting. Hence, the above dataset can be used to train a classification model. To do so, one must first create a project, and then upload a dataset.

To create a project, select an organisation from the dashboard and click on the "New Project" button to open up a dialog. This dialog will step through to request a project name, the organisation that the project needs to be under, followed by a description. Additional users can be optionally added to the project with different levels of permission to control who can view, edit, or delete datasets and models in the project:

Once a project is created, the user uploads the dataset of extracted features from the project's card on the dashboard.

A dataset can be uploaded through multiple types of connections. Here is a simple "File on local PC" scenario:

The file we prepared with the extracted features is stored on the local PC in CSV format. After choosing that file for the dataset, the user clicks "Import" and sees a progress bar for the upload:

Once the upload is complete, the dataset is automatically analysed. This involves identifying each column's type (numerical, categorical, date-time, text, or ignorable) and computing descriptive statistics appropriate to that type. The results of the analysis are shown for the user to inspect. This way, the user can discover what business problems can be solved with the data, as well as identify potential problems with it. For example, for numeric columns, an estimate of the probability density function (using kernel density estimation) is shown alongside a histogram with numeric ranges as bins:
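To make the numeric-column summary concrete, here is a small standard-library sketch of a Gaussian kernel density estimate next to a histogram. It is not the AIA Engine's implementation: the kernel, the bandwidth rule (the normal-reference rule of thumb, 1.06 · σ · n^(-1/5)), the number of bins, and the sample values are all assumptions.

```python
import math
import statistics


def gaussian_kde(values, bandwidth=None):
    """Return a function estimating the probability density at a point x."""
    n = len(values)
    if bandwidth is None:
        # Normal-reference rule of thumb; assumes n >= 2 and non-constant data.
        bandwidth = 1.06 * statistics.stdev(values) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return density


def histogram(values, n_bins=5):
    """Equal-width histogram; returns bin edges and counts per bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)  # put the maximum in the last bin
        counts[idx] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


values = [1.0, 2.0, 2.5, 3.0, 4.0, 4.2, 5.0, 7.5]  # made-up column values
density = gaussian_kde(values)
edges, counts = histogram(values)
```

The histogram shows raw counts per numeric range, while `density` gives a smooth curve that integrates to one; plotting both together gives the kind of view described above.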

Once the data is uploaded and analysed, the user can create a new "App" (application), which essentially means defining the prediction task. The user simply chooses a target column, from which the engine identifies the problem type. Here, the target column "y" in the dataset holds the labels that we created:

When the dialog closes, a new "App" entity is created on the dashboard and a separate space is created where it is shown that the application is in the "Processing" stage:

In the background, the dataset is being split into train data and test data for model evaluation, and being uploaded to the cloud.

More importantly, the AIA Engine's Model Recommender computes certain meta statistics from the dataset (such as shape parameters) to automatically rank the best "model template" (ML Algorithm) for the particular dataset and task, from the currently available ones in the platform. It does so by estimating the final performance of all available models if they were trained on the selected data. The recommender's output predicts answers to common questions like "How good will the model's predictive performance be?", "How long does the model take to train?", "What is the latency for the model to predict on new data?" that are important to any business.

Once the app has finished processing and becomes "ready", the features tab will show estimates of feature importance based on a simple linear classifier trained on a sample of the data, to give a rough idea of the features that may be significant for the final model.

At this stage, one is ready to train their first model by hovering over the floating action button at the bottom right and clicking on "Train New Model". This opens up the new model dialog and shows the recommender's output to enable the user to choose the best training algorithm to suit their needs:

We can order the algorithms by any of the three criteria discussed above: predicted performance, training time, or prediction latency on new data.

As we can see, training an AdaBoost model would require 30 minutes, whereas a simple decision tree offers comparable performance at 20% of the cost (training time translates directly to cost incurred). LightGBM and XGBoost can offer slightly higher performance, but they take much longer to train. Hence, the decision-tree model is preferred and is selected in the next step of the dialog. The other three models are also included as a benchmark comparison. Once the "train models" button is clicked at the last stage of the model-creation dialog, one is taken to the listing page of models under the app, where they are shown to be training. These models are trained and evaluated on the same train/test split to enable a fair comparison:

When one or more models have completed training, one can click on the "comparison" tab to find out which model was the best among the ones chosen for training. This page includes an overlay of evaluation charts (such as ROC and precision-recall curves, in this case) as well as a sortable table of multiple evaluation metrics. This enables the business to decide the best model based on the metric that suits them best:

In practice, we would have trained only a decision-tree classifier, since it already offers a very high macro-averaged F1 score, which is a suitable metric when class imbalance is high. That model can now be deployed to a specified endpoint. Once it is deployed, the endpoint's details are shown. These include the URL for invoking the API and access-key information if applicable:
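The reason macro-averaged F1 suits imbalanced data can be seen with a small hand-rolled example. The class proportions below are made up for illustration, and the functions are a plain-Python sketch rather than the AIA Engine's scorers.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class (0.0 if undefined)."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Made-up data: 1000 "not match" pairs (0) and 4 "match" pairs (1).
y_true = [0] * 1000 + [1] * 4
always_negative = [0] * 1004  # a useless model that never predicts a match

accuracy = sum(t == p for t, p in zip(y_true, always_negative)) / len(y_true)
macro = macro_f1(y_true, always_negative)
perfect = macro_f1(y_true, y_true)
```

A model that never predicts a match still reaches an accuracy above 99%, yet its macro-averaged F1 is only about 0.5, because the rare "match" class counts as much as the common one.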

The screen also gives a sample code snippet that can be given to the IT team for integration with production APIs.


Sample Results of Matched Profiles

The deployed API endpoint above can be invoked with information about messages extracted from two user profiles, based on the feature-extraction recipe described above. To enable further investigation of the matched profiles, the client can view a commonality word cloud built from their messages. There were 55 such matches, out of roughly a million candidate pairs. These commonality clouds revealed that the model is able to pick up similarities in terms of language, style, and topics.
For example, the "ABC News" and "Associated Press" accounts were matched. This is probably due to similarities in trending news topics:

Two customer-service accounts related to gaming consoles were matched (Ask PlayStation UK and EA Help, both on Twitter). This was likely due to the style and mood of the answers given by agents serving customers with problems and concerns about similar products:

Very interestingly, two Dutch-language user accounts and two Portuguese-language user accounts were identified as matches. This is probably because the number of people who speak the language is small in a given interest group, compared with English-language speakers. Similarly, two news reporters from the USA tweeting about sporting events were matched, indicating that the model can pick up similar interests.
Many more interesting matched pairs were found in this way using the deployed API endpoint.
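The word weights behind a commonality cloud could be computed along the following lines. This is a hedged sketch: the weighting (the smaller of the two counts for each shared word), the tiny stop-word list, and the sample messages are assumptions, not the method used to produce the clouds in this case study.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def word_counts(messages):
    """Count lower-cased words across all messages, ignoring stop words."""
    words = re.findall(r"[a-z']+", " ".join(messages).lower())
    return Counter(w for w in words if w not in STOPWORDS)


def commonality_weights(messages_a, messages_b, top_n=10):
    """Words used by both profiles, weighted by the smaller of the two counts."""
    # Counter & Counter keeps shared keys with the minimum of their counts.
    shared = word_counts(messages_a) & word_counts(messages_b)
    return shared.most_common(top_n)


profile_a = ["Breaking news: storm hits the coast", "Storm warning for the coast"]
profile_b = ["Storm news from the coast", "Election news tonight"]
weights = commonality_weights(profile_a, profile_b)
```

In this toy example the shared words are "news", "storm", and "coast"; in a rendered cloud, larger weights would be drawn in a larger font.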



The PI.EXCHANGE platform provided a convenient way for WorldStack Inc. to quickly engineer features from data they collected, and rapidly construct a pilot exploratory model to solve their unique business problem, with the help of the Recommender-powered AIA Engine platform. Further exploration of predictions using the deployed API endpoint offered many new insights to the client for them to tune the model with better training data.