Identifying Sockpuppet Accounts on Social Media Platforms

Evolution of Disinformation Campaigns from 2016 - Present

Daniel Kats

Senior Principal Researcher

Published

April 29, 2020

Read time

12 Minutes

While online disinformation campaigns are not new phenomena, the topic generated renewed interest in the wake of the 2016 US presidential election. This has led to large-scale journalistic and scholarly investigations of disinformation operations on social media platforms in the years since. This article attempts to summarize our current understanding of disinformation operations and how they have evolved from 2016 to the present. We also outline key ways that such operations can be detected with examples from recent campaigns, the ongoing challenges in this area, and common misconceptions.

Key Takeaways:

The objective of a modern disinformation campaign generally isn’t to drive one political outcome, but rather to sow division among people or subtly drive wedges with respect to a politically or socially charged issue.
Facebook and Twitter have received much of the media attention when it comes to disinformation campaigns, while platforms like Instagram and Wikipedia are often glossed over; however, these platforms have seen a significant increase in the amount of large-scale coordinated inauthentic activity as well.
Bots and sockpuppets are best identified through five key features: IP-based correlation, temporal-based correlation, signs of automation in metadata, social subgraphs, and content similarity. However, just because two bots share similar features doesn’t mean they operate in the same way. Techniques that identify one type of bot network may be totally ineffectual at identifying another network.
Security researchers face challenges in understanding the full extent of sockpuppets and their disinformation efforts since social media platforms limit the scope of data, or simply don’t publish any datasets at all.

Definitions

To begin our discussion, we need to start with a common set of definitions. We define a sockpuppet as a fictitious online identity created for the purposes of deception. Sockpuppets are usually created in large numbers by a single controlling person or group. They are typically used for block evasion, creating false majority opinions, vote stacking, and other similar pursuits. Sockpuppets are as old as the internet, and the underlying tactic is older still. In academic literature, they are sometimes called Sybils, and they are a staple of research in distributed systems security.

In contrast, bots are fully automated accounts which disseminate and sometimes produce content. A simple example of a bot that is not a sockpuppet is the Big Ben Clock account on Twitter, which tweets “bong” every hour of every day. Bots are relatively rare on social media and are generally much easier to detect.

Trends

The use of sockpuppets is rapidly rising. In 2013, Wikipedia published research detailing their identification of “2700 unique suspected cases” of sockpuppets. These numbers seem almost quaint now. In 2017, a study was published analyzing 3656 verified sockpuppets in the comment section of various discussion communities. The University of Oxford estimated 70 countries were affected by state-sponsored social media manipulation in 2019, a 150% increase from just 2 years prior. Academic estimates suggest that 5-15% of all Twitter accounts are sockpuppets, while Facebook estimates that 5% of all accounts on its platform are fake.

Old Understanding of Social Media Manipulation: “Fake News”

In 2016 and early 2017, much of the discussion around social media manipulation used the term “Fake News.” “Fake News” usually refers to websites that intentionally publish hoaxes and erroneous news stories to purposefully mislead the public. A Buzzfeed News Analysis showed that in the final 3 months of the US 2016 presidential election campaign, the 20 top-performing fake news stories generated more engagement than the 20 top-performing real news stories. These fictitious stories were amplified on social media by sockpuppets from the Russia-based Internet Research Agency as well as real unsuspecting users.

Other reporting showed that entire websites were created to look like local papers, but whose sole purpose appeared to be to spread false news stories. One such example was The Denver Guardian, which claimed to be Denver’s “oldest news source,” but was uncovered by NPR to have been registered in July 2016.

Figure 4: Google Trends for "Fake News" over the past 5 years in USA

However, this is an outdated way to view modern misinformation campaigns. Whether because of increased scrutiny by Facebook of trending news stories, or because of increased public awareness, modern disinformation campaigns have mostly moved on to greener pastures.

Figure 5: Courtesy of Buzzfeed

Anatomy of Modern Disinformation Campaigns

Modern disinformation campaigns attempt to disseminate images or slogans replicating grassroots activism. One recent example is the Chinese information operation directed against the pro-democracy protests in Hong Kong, which was discovered by Twitter in August of 2019. Facebook went on to identify 7 pages with 15.5 thousand combined followers exhibiting what it calls “coordinated inauthentic behavior,” while Twitter removed 936 active accounts spreading active disinformation as well as 200,000 accounts “of lesser sophistication.” An example of a high-quality image-based slogan created by these accounts can be seen in figure 6.

Figure 6: A high-quality example graphic from the Chinese disinformation campaign against Hong Kong

These tactics are now universal. The goal is not necessarily achieving a specific political outcome but to sow division based on hot-button political issues within a country such as race and religion. Below are two images from a disinformation campaign against France (figure 8, figure 9).

These campaigns have been regularly observed by Twitter and Facebook on their platforms in recent years. Since October 2018, Twitter has identified at least 24 separate nation-state-level disinformation campaigns on its platform including one discovered in March 2020 that was centered in Ghana and Nigeria but directed against the United States. In that case, Facebook removed 69 pages, Twitter removed 71 accounts, and Instagram removed 85 accounts. As you can see from the image below (figure 7), the operators are masquerading as grass-roots racial-justice activists.

Figure 7: An image from a now-deleted Facebook page from the Ghana-centered disinformation campaign

Moving beyond Twitter and Facebook

While disinformation campaigns on Instagram have not received as much media attention as those on Twitter and Facebook, the platform has seen its own share of large-scale coordinated inauthentic behavior. For example, in November of 2018 Facebook removed 99 Instagram accounts with a combined 1.25 million followers. As Instagram becomes the platform of choice for the modern audience, we expect these numbers only to increase.

While Wikipedia is not typically discussed in the same context, it is a major battleground in the information war. A study by Kumar et al. in 2016 found malicious actors creating hoax articles at scale in coordination with broader campaigns on social media.

Identifying Sockpuppets

So how are sockpuppet accounts detected, both by the social media giants and in academia? Broadly speaking, detection relies on five features:

IP-based correlation (accounts that are closely linked geographically)
Temporal-based correlation (closely linked in time)
Signs of automation in username/handle and other account metadata
Social subgraphs
Content similarity

IP-Based Correlation

We mentioned earlier that Facebook and Twitter were able to catch a disinformation campaign orchestrated by the Chinese government. Now let us see how that happened. While Facebook and Twitter are banned in mainland China, the companies noticed some accounts posting from China-based IPs. This suggested that the posters were aligned with the Chinese security services and were able to receive an exemption to bypass China’s Great Firewall.

A 2017 paper by Kumar et al. at the prestigious WWW conference used this technique as a first-round classifier of sockpuppet accounts. While many sophisticated actors are able to hide their IPs through the abuse of malicious VPNs and proxy servers, even professionals make mistakes. For example, this type of slip-up led to the unmasking of hacker Guccifer 2.0, who claimed to be an independent lone hacker, but was in fact an officer of the Russian GRU. This shows that IPs are still a powerful feature in sockpuppet detection.
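As a rough illustration of IP-based correlation, the sketch below groups accounts by the IP addresses they were seen on and keeps the addresses shared by several accounts. The `(account_id, ip_address)` log format, the function name, and the threshold are assumptions made for this example; they are not taken from Kumar et al. or from any platform's internal tooling.

```python
from collections import defaultdict


def accounts_sharing_ips(login_events, min_accounts=2):
    """Return {ip: accounts} for every IP used by at least `min_accounts` accounts.

    `login_events` is an iterable of (account_id, ip_address) pairs.
    This log format is hypothetical and chosen only for illustration.
    """
    accounts_by_ip = defaultdict(set)
    for account_id, ip_address in login_events:
        accounts_by_ip[ip_address].add(account_id)
    return {
        ip: accounts
        for ip, accounts in accounts_by_ip.items()
        if len(accounts) >= min_accounts
    }


# Made-up events using documentation-reserved IP ranges.
events = [
    ("alice", "203.0.113.7"),
    ("sock_01", "198.51.100.23"),
    ("sock_02", "198.51.100.23"),
    ("sock_03", "198.51.100.23"),
    ("bob", "192.0.2.44"),
]
shared = accounts_sharing_ips(events)
```

Shared addresses also arise innocently (offices, universities, mobile carriers behind NAT), so a grouping like this is best treated as a first-pass signal to combine with the other features rather than as proof of coordination.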

Temporal Correlation

Humans are fickle creatures. Some days we tweet a lot, other days we don’t even open the app. However, a 2016 paper by Chavoshi et al. found that sockpuppet operators do not behave like this. In fact, they found that sockpuppets exhibit regularity in their posting habits (see Figure 10) and this correlation extends into the long term. Ordinary users don’t act in concert like this, so finding these kinds of correlations is a strong sign of coordination. Detecting such coordination has been a staple of bot detection for years, and it underlies the term “coordinated inauthentic activity.”

Figure 10: Chavoshi et al. show coordination between the posting times of two accounts

Figure 11: Chavoshi et al. find bots post much more regularly than typical users
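To make the idea of temporal correlation concrete, here is a minimal sketch that bins two accounts' posting timestamps into hourly counts and computes a plain Pearson correlation between the resulting series. This is a simplification chosen for illustration, not the method of Chavoshi et al.; the function names, the one-hour bin width, and the sample timestamps are all assumptions.

```python
import math


def hourly_counts(timestamps, start, n_bins, bin_seconds=3600):
    """Count posts per time bin; `timestamps` and `start` are Unix seconds."""
    counts = [0] * n_bins
    for ts in timestamps:
        index = int((ts - start) // bin_seconds)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


def pearson(xs, ys):
    """Pearson correlation of two equal-length series (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


# Made-up timestamps: accounts A and B post within seconds of each other,
# while account C posts in different hours.
series_a = hourly_counts([0, 10, 7200, 7210, 18000], start=0, n_bins=6)
series_b = hourly_counts([5, 20, 7205, 7230, 18010], start=0, n_bins=6)
series_c = hourly_counts([3600, 10800, 14400], start=0, n_bins=6)

score_ab = pearson(series_a, series_b)  # close to 1: same hourly pattern
score_ac = pearson(series_a, series_c)  # negative: activity does not overlap
```

In practice the comparison would run over much longer windows and many account pairs, and a high score for a single pair over a short window can easily occur by chance.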

Account Metadata

Research by Zheng et al, published in the IIHMS conference in 2011, suggests that bot accounts are generally short-lived. Platforms are diligent at catching sockpuppets after their active period, at which point they are deleted or abandoned. This is reinforced by recent data we analyzed from Twitter’s data archive of detected sockpuppets. Below is a graph of account age of the Ghana troll farm dataset provided by Twitter. You can see that most accounts were created relatively recently.

Figure 12: Courtesy of CNN
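An account-age feature of this kind can be computed directly from creation dates. The sketch below assumes a list of creation dates and a reference date (for example, the date a dataset was released); the function names, the 180-day cutoff, and the sample dates are illustrative assumptions and do not come from the Twitter dataset.

```python
from datetime import date


def account_ages_in_days(creation_dates, reference_date):
    """Age of each account, in days, as of `reference_date`."""
    return [(reference_date - created).days for created in creation_dates]


def fraction_newer_than(ages, max_age_days):
    """Share of accounts whose age is at most `max_age_days`."""
    if not ages:
        return 0.0
    return sum(age <= max_age_days for age in ages) / len(ages)


# Made-up creation dates for illustration only.
created = [date(2019, 12, 1), date(2020, 1, 15), date(2016, 5, 1), date(2020, 2, 1)]
ages = account_ages_in_days(created, reference_date=date(2020, 3, 1))
recent_share = fraction_newer_than(ages, max_age_days=180)
```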

Other research, such as that by Kumar et al., suggests that sockpuppet accounts tend to have high-entropy names or handles, such as those with many numbers or random characters. This is also a technique common to early spam operators. Because the people behind sockpuppets create many accounts, they are not selective about picking interesting and original usernames, whereas real users typically are. Indeed, Yang et al. found that using a feature based on bigram-likelihood over usernames was valuable to their overall classifier.
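One way to approximate a bigram-likelihood feature is to train a character-bigram model on usernames believed to be human and then score new usernames by their average log-likelihood under that model. The sketch below does this with add-one smoothing. It is an assumption-laden illustration rather than the feature used by Yang et al.: the reference names, function names, and smoothing choice are all hypothetical.

```python
import math
from collections import Counter


def train_bigram_model(reference_names):
    """Count character bigrams over reference usernames ('^' and '$' mark the ends)."""
    bigram_counts = Counter()
    context_counts = Counter()
    alphabet = set()
    for name in reference_names:
        padded = "^" + name.lower() + "$"
        alphabet.update(padded)
        for first, second in zip(padded, padded[1:]):
            bigram_counts[(first, second)] += 1
            context_counts[first] += 1
    return bigram_counts, context_counts, len(alphabet)


def average_log_likelihood(name, model):
    """Mean smoothed log-probability per bigram; lower means less name-like."""
    bigram_counts, context_counts, alphabet_size = model
    padded = "^" + name.lower() + "$"
    pairs = list(zip(padded, padded[1:]))
    total = 0.0
    for first, second in pairs:
        # Add-one smoothing so unseen bigrams still get a small nonzero score.
        numerator = bigram_counts[(first, second)] + 1
        denominator = context_counts[first] + alphabet_size
        total += math.log(numerator / denominator)
    return total / len(pairs)


# Tiny made-up reference set; a real model would need far more data.
model = train_bigram_model(
    ["johnsmith", "maryjones", "annasmith", "johnjones", "marysmith"]
)
human_like = average_log_likelihood("johnsmith", model)
random_like = average_log_likelihood("xq7zk29v", model)
```

With this toy reference set, `human_like` comes out higher than `random_like`. A feature like this is only as good as its reference data, and on its own it would flag legitimate users who choose random-looking handles.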

Social Subgraphs

Sockpuppet accounts form cliques and therefore have symmetric and predictable group structures. An analysis by Chu et al., published in 2010, suggests that the follower/friend ratio of bot accounts tends toward 1 much faster than that of humans. This can be seen clearly in figure 13.

Figure 13: Chu et al. show that bots have more symmetric follower/friend ratio than humans

This is great news, because it implies that the bot networks are extremely vulnerable. Because of their high connectivity, once you find a single bot account, you can find related bots. This was shown in a paper by Minnich et al. published in 2017, who introduced a technique called BotWalk to “spider out” from a single bot to discover related bots. Minnich shows how tightly connected many sockpuppet or bot groups are in figure 14.

Figure 14: Follower relationships between bots. Colors represent highly correlated clusters. Courtesy of Minnich et al

In the Ghana Twitter dataset, we can see this symmetric relationship, which is unusual among authentic users.

Because Twitter does not provide a list of follower/following relationships, we look at mentions in the Ghana dataset instead. We observed that 41 of 70 accounts in the dataset mentioned each other, and thus were vulnerable to approaches such as BotWalk. Of the remaining accounts, many were so new that they had fewer than 5 tweets.
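The mention-based analysis can be sketched as a small graph problem: build a directed graph of who mentions whom, find pairs that mention each other, and expand outward from a known account along those mutual links. The code below is a simplified breadth-first expansion written for illustration; it is not the BotWalk algorithm, and the `(author, mentioned)` input format and function names are assumptions.

```python
from collections import defaultdict, deque


def build_mention_graph(mentions):
    """Directed graph: author -> set of accounts that author mentioned."""
    graph = defaultdict(set)
    for author, mentioned in mentions:
        graph[author].add(mentioned)
    return graph


def reciprocal_pairs(graph):
    """Unordered pairs of distinct accounts that mention each other."""
    pairs = set()
    for author, targets in graph.items():
        for target in targets:
            if author != target and author in graph.get(target, set()):
                pairs.add(frozenset((author, target)))
    return pairs


def expand_from_seed(graph, seed):
    """Accounts reachable from `seed` by following only mutual mentions."""
    seen = {seed}
    queue = deque([seed])
    while queue:
        current = queue.popleft()
        for other in graph.get(current, set()):
            is_mutual = current in graph.get(other, set())
            if is_mutual and other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


# Made-up mentions: a, b and c mention each other; c also mentions a celebrity
# who never replies, so the celebrity is not pulled into the cluster.
mentions = [
    ("a", "b"), ("b", "a"),
    ("b", "c"), ("c", "b"),
    ("c", "celebrity"),
    ("d", "e"),
]
graph = build_mention_graph(mentions)
cluster = expand_from_seed(graph, "a")
```

Restricting the walk to mutual mentions is a design choice for this example: one-way mentions of popular accounts are common among ordinary users and would otherwise drag unrelated accounts into the result.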

Content Similarity

The traditional indicator of sockpuppets or bots is posting identical content. This was common in the old days of the internet as spam on message boards and was a key feature in email spam detection. More recently, this approach has been useful in flagging violators such as the Michael Bloomberg presidential campaign in early 2020.

Figure 15: Identical tweets relating to the Michael Bloomberg presidential campaign. Courtesy of the LA Times.
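A basic content-similarity check can be sketched in two parts: grouping posts that are identical after light normalization, and scoring near-duplicates with Jaccard similarity over word shingles. The normalization rules, shingle size, function names, and sample posts below are assumptions chosen for illustration, not a description of any platform's detection pipeline.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase, drop links and punctuation, collapse whitespace (ASCII-only simplification)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(text.split())


def duplicate_groups(posts, min_accounts=2):
    """Return {normalized_text: accounts} for text posted by several distinct accounts."""
    accounts_by_text = defaultdict(set)
    for account, text in posts:
        accounts_by_text[normalize(text)].add(account)
    return {
        text: accounts
        for text, accounts in accounts_by_text.items()
        if len(accounts) >= min_accounts
    }


def jaccard(text_a, text_b, shingle_size=3):
    """Jaccard similarity between the word-shingle sets of two texts."""
    def shingles(text):
        words = normalize(text).split()
        if len(words) < shingle_size:
            return {tuple(words)}
        return {
            tuple(words[i:i + shingle_size])
            for i in range(len(words) - shingle_size + 1)
        }

    set_a, set_b = shingles(text_a), shingles(text_b)
    return len(set_a & set_b) / len(set_a | set_b)


# Made-up posts for illustration only.
posts = [
    ("u1", "A president is born!"),
    ("u2", "a president is born"),
    ("u3", "Nice weather today."),
]
groups = duplicate_groups(posts)
near = jaccard("the quick brown fox jumps", "the quick brown fox sleeps")
```

Identical text alone is not proof of inauthentic behavior, since retweet-style copying and popular slogans produce the same pattern among real users, so this signal is usually combined with the timing and account features described earlier.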

Challenges

While these tools are powerful and give us plenty to work with, there is still some room for pessimism. First, Echeverria et al. showed in 2018 that while certain bots share similar characteristics, they can be quite different from other types of bots. As a result, techniques that identify one type of bot network may be totally ineffectual at identifying another network.

Figure 16: Echeverría et al. show a t-SNE plot of bot classes against real users

Secondly, researchers suffer from a lack of data. While Twitter exposes these disinformation datasets, they are missing key features that researchers typically use to create new models such as usernames (for accounts with fewer than 5000 followers), IP ranges, and follower graphs [1]. Facebook and Instagram (which is owned by Facebook), on the other hand, simply do not publish any datasets at all. This means that researchers often have a delayed or outdated understanding of how disinformation looks today.

Concluding Remarks

At Norton Labs we have been studying disinformation campaigns on social media for a while. We have developed new and innovative ways of identifying disinformation campaigns on social media, by standing on the shoulders of other researchers who have come before us. We hope that our work will help protect consumers while also leading the community to view this problem of disinformation collectively and openly, with data sharing between researchers, so that we can develop better defenses. After all, what’s at stake here is democracy and truth.

Notes

[1] While Twitter says this is to protect the “privacy” in case of false positives, this approach is nonsensical for historical data where the users have had an opportunity for recourse, such as data that is older than a year.

Daniel Kats

Senior Principal Researcher

Daniel earned his Masters at the University of Toronto Systems & Networking Group. His research involves building machine learning systems for security, and the subtle impact of those systems on the people who use them.