Bots date back to the early years of computing: the first chatbot was programmed at MIT in 1964. Its name was Eliza, and it simulated, in a simplistic way, a chat with a psychiatrist. But bots have come a long way, and today we have Apple’s Siri, IBM’s Watson, Amazon’s Alexa, Microsoft’s Cortana and Google Assistant.
Bots can be very useful, for example as digital assistants such as conversational chatbots for customer service, or as publishers that distribute personalized media content. But they also have the power to do great damage, especially when spreading fake news. The first type will openly declare that it is automated, while the second won’t.
Bot-like activity typically means retweeting things hundreds of times a day, spamming the same link repeatedly, and using multiple Twitter accounts to amplify the same message. But that isn’t the whole picture anymore. Over the past couple of years, bots have become smarter, and their profiles now look very similar to a real person’s.
A challenge for the School of AI
Cisco was the main sponsor of the last edition of the School of AI, supporting our vision of education free of cost for all participants. Thanks to that support, Eugenia Voytik and Darío López teamed up for eight weeks to develop a system that can do what even Twitter can’t: recognize bots.
According to the MIT Technology Review¹, bots are one of the most effective ways to broadcast extremist viewpoints on social media, “but also to amplify such views from other, genuine accounts by liking, sharing, retweeting, hearting, and following, just as a human would. By doing so they’re gaming the algorithms, and rewarding the posts they’ve interacted with by giving them more visibility.”
The battle against spam bots is an arms race between developers. “When you work in safe AI you have to develop faster than the malware,” says Eugenia. “It’s like a competition!”
During the first phase of the project, Eugenia and Darío realized that several bots on Twitter had built relationships with their followers: people would comment on, like, and share the bots’ content as if the accounts belonged to real people.
Looking at it from a different angle
As a biologist, Eugenia approached the problem in a different way. While studying it, she immediately thought of DNA sequencing: each of us has a different DNA sequence, built from combinations of four different elements.
“I think that, as a Biologist, it was easier to understand the mechanism, as you can compare the same with DNA sequences. When you’re doing genome sequences you also compare different types to find common patterns, subsequences.”
- Eugenia Voytik
This approach is called Social Fingerprinting, and it represents each account as a sequence of letters. “Looking at the chronological activity of a Twitter account, you can see that there is a sequence of tweeting, retweeting and replying. If you set a letter for each of these actions, you can build a long string for each account, as we do for each person on earth.” Once they had built this logic, they were able to apply it to thousands of accounts and identify patterns in the activities of bots. “This means they do the same actions and that there is an algorithm behind it,” says Eugenia. “And that’s how you can say that they are spam bots!”
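The encoding step described above can be sketched in a few lines of Python. This is only an illustration: the article does not say which letters the team assigned to each action, so the mapping `ACTION_TO_LETTER`, the function name `fingerprint`, and the sample timeline are all hypothetical.

```python
# Hypothetical letter assignment; the article does not specify the real one.
ACTION_TO_LETTER = {"tweet": "A", "retweet": "C", "reply": "T"}


def fingerprint(actions):
    """Turn a chronological list of action names into a single string."""
    return "".join(ACTION_TO_LETTER[action] for action in actions)


# Invented example timeline, oldest action first.
timeline = ["tweet", "retweet", "retweet", "reply", "tweet"]
print(fingerprint(timeline))  # ACCTA
```

With every account reduced to a string like this, comparing accounts becomes a string-comparison problem, much like comparing DNA sequences.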
The Machine Learning Approach
To implement this idea, Eugenia and Darío integrated supervised and unsupervised learning approaches for spam bot detection.
“The difference between supervised and unsupervised learning can be compared to learning a new language.”
- Eugenia Voytik
Supervised learning would be like learning a language at school: “when you go to classes, your teacher corrects you when you wrongly pronounce a word. The teacher already knows the answer and can reproduce it to teach you, so you can reproduce this output by yourself.” Unsupervised learning, on the other hand, is “when you are learning a language only by studying a book, and you have to learn everything by yourself, make all the connections and finally produce an output. You need to organize all the learning by yourself.”
So in their case, supervised learning means working with labeled data: inputs such as the number of hashtags used per post or the number of followers, paired with outputs that say whether a specific account is a bot or not. They use these features to train the model, and once it is trained, it can automatically say whether each new account they feed it is a bot or not.
Unsupervised learning means that they feed the model sequences without knowing whether they belong to bots or not. In this case, they used it to cluster data or to find the longest common substring that would allow them to identify bots.
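To make the idea of training on labeled features concrete, here is a minimal sketch using a nearest-centroid classifier. The article does not name the model Eugenia and Darío used, so this technique is a stand-in chosen for brevity, and the training rows are invented toy values, not project data. The names `TRAINING`, `scale`, `train` and `predict` are hypothetical.

```python
import math

# Invented toy rows: (hashtags per post, followers) -> label.
TRAINING = [
    ((6.0, 40), "bot"),
    ((5.0, 25), "bot"),
    ((7.5, 60), "bot"),
    ((1.0, 300), "human"),
    ((0.5, 150), "human"),
    ((1.5, 900), "human"),
]


def scale(features):
    """Log-scale the follower count so it does not dominate the distance."""
    hashtags, followers = features
    return (hashtags, math.log10(followers + 1))


def train(rows):
    """Compute one average (centroid) feature vector per label."""
    sums, counts = {}, {}
    for features, label in rows:
        x = scale(features)
        total = sums.setdefault(label, [0.0] * len(x))
        for i, value in enumerate(x):
            total[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(v / counts[label] for v in total)
        for label, total in sums.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the given features."""
    x = scale(features)
    return min(centroids, key=lambda label: math.dist(x, centroids[label]))


model = train(TRAINING)
print(predict(model, (6.5, 30)))   # bot
print(predict(model, (0.8, 400)))  # human
```

The labels play the role of the teacher: the model only learns what a bot looks like because every training row already says which kind of account it came from.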
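As an illustration of the substring search, the sketch below finds the longest run of consecutive actions shared by two accounts, assuming each account's activity has already been encoded as a string of letters. It compares only two strings at a time, which is a simplification; the article does not describe how the team searched across thousands of accounts. The function name and the two sample strings are hypothetical.

```python
def longest_common_substring(a, b):
    """Return the longest run of consecutive letters found in both strings."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at the previous letters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


# Invented activity strings for two accounts.
account_1 = "ACCTACCTAAC"
account_2 = "TTACCTACCTC"
print(longest_common_substring(account_1, account_2))  # ACCTACCT
```

A long shared run like this suggests the two accounts follow the same script, whereas the timelines of unrelated human users would be expected to share only short fragments.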
A work in progress
Eugenia and Darío presented the official results of their project at the closing event of the School of AI, on the 14th of December 2018. “Our approach works 1.3 times better than Twitter,” said Darío. “Of the 26 bots that we have identified, Twitter has now suspended only 20,” he added.
According to the MIT Technology Review, “In a few years, conversational bots might seek out susceptible users and approach them over private chat channels. They’ll eloquently navigate conversations and analyze a user’s data to deliver customized propaganda.”²
“To fight this malware, we need to be one step ahead, predict what else they can do, and be prepared,” says Eugenia.
Social Media Marketing Specialist