What would it look like to open source the Twitter algorithm?

Author | Travis Fischer

Translator | Hirakawa

Planning | Luo Yanshan

This article was originally published on Travis Fischer’s personal blog.

This is the first in a series of articles exploring what “open source Twitter’s algorithm” might look like from a more practical perspective.

Introduction

Elon Musk’s recent call to open source Twitter’s algorithm inspired me, as an experienced software engineer with a passion for open source, to start digging into this space. My main goal is to try to answer the following question:

What would an open source Twitter algorithm look like?

To answer this question, we first need to answer some related questions:

  • Why open source Twitter’s algorithm?

  • What do we mean when we say “Twitter’s algorithm”?

  • What does Twitter’s core data model look like?

  • What does Twitter’s network graph look like?

  • How does Twitter’s algorithmic push work?

  • What are the main engineering challenges we should know about?

Motivation

Why is Musk taking over Twitter? In his own words:

“The more we can trust Twitter as a public platform, the lower the risk to civilization.”

Musk’s motives are clear and consistent with his usual approach. It’s why he’s working so hard to build a sustainable colony on Mars, why he’s devoting resources to understanding the potential dangers of AI superintelligence, and why he’s so insistent on tackling climate change.

His guiding motivation is to increase the likelihood that humanity has a positive future.

Unlike most of us, he has the means to put enormous resources behind getting there. Whether measured by the investment of his personal wealth or by his track record as the world’s most successful serial entrepreneur, his purity of purpose, dedication, and practical results are hard to dispute.

Musk and former Twitter CEO Jack Dorsey both believe that increasing the transparency and optionality of Twitter’s core algorithm would benefit the world. There are so many legitimate concerns around free speech, censorship, privacy, bot armies, echo chambers… Fundamentally, these are nuanced topics, and the only way to meaningfully improve them while maximizing public trust in the platform and in each other will be to provide a higher level of transparency.

That leaves the goal of “open sourcing Twitter’s algorithm,” which sounds great in theory, and might even benefit Twitter’s core business, but is harder to pin down in practice. Let’s see if we can improve our understanding of this conversation from an engineering perspective.

How Twitter Works

Main Timeline View

Twitter provides users with two versions of the main timeline view: the default algorithmic feed, “Home”, and “Latest Tweets”. The Latest Tweets view is simpler: a reverse-chronological list of tweets from the accounts you directly follow. It was the default view until Twitter introduced algorithmic feeds in 2016.

Algorithmic feeds are how most people use Twitter, because defaults make a big difference. Twitter describes the algorithmic timeline as follows:

“A stream of tweets from accounts you follow on Twitter, as well as other content we think you might be interested in based on accounts you interact with frequently, tweets you engage with, and more.”

There’s a lot of complexity hiding in that “and more”. We’ll dive into it later, but first let’s understand why Twitter uses an algorithmic feed at all. The reason is very simple: user experience.

“You follow hundreds of people on Twitter — maybe thousands — and when you open Twitter, it can feel like you’ve missed some of their most important tweets. Today, we’re excited to share a new timeline feature that helps you catch up on the best tweets from the people you follow.” (Source; 2016)

This explanation makes sense from a user experience perspective, and algorithmic feeds certainly give Twitter more freedom to experiment with the product.

The real motivation, however, is that algorithmic feeds serve the ad-driven business model Twitter currently relies on: push more relevant content ⇒ higher engagement ⇒ more ad revenue. This is a proven, classic social networking strategy.

Alright, now that we’ve seen Twitter’s algorithmic feed at a high level, let’s take a deeper look at how it works under the hood.

Core Data Model

A good way to understand a complex system like Twitter is to start by understanding its core data model and work your way up from there. These resource models and the relationships between them form the basis of all of Twitter’s high-level business logic.

We will focus on the latest version (v2) of the Twitter public API, which was originally released in 2020.

Core Resource Models

  • Tweet – A short post that can reference other tweets, users, entities, and attachments.

  • User – An account on the Twitter platform.

Core Tweet Relationships

  • Timelines – A reverse-chronological stream of tweets from a specific account.

  • Likes – Liking a tweet is a core user interaction that expresses interest in it. Note that “likes” were historically known as “favorites”.

  • Retweets – Retweeting allows you to expand the reach of another user’s tweets to your own audience.

Core User Relationships

  • Follows – Following a user creates a directed edge in the network graph, which lets you subscribe to their tweets and opt in to receiving their direct messages.

  • Blocks – Blocks help people restrict specific accounts from contacting them, viewing their tweets, and following them.

  • Mutes – Muting an account removes its tweets from your timeline without unfollowing or blocking it. A muted account won’t know you’ve muted it, and you can unmute it at any time.
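To make the resource models and relationships above concrete, here is a minimal TypeScript sketch of how they might be represented. The field names are illustrative assumptions only; Twitter’s actual v2 API and internal schemas use different names and many more attributes.

export interface User {
  id: string
  username: string
  createdAt: Date
}

export interface Tweet {
  id: string
  authorId: string            // the User who posted this tweet
  text: string
  createdAt: Date
  referencedTweetId?: string  // e.g. a tweet being replied to or quoted
}

// A timeline is just an ordered list of tweets shown to a given user.
export type Timeline = Tweet[]

// Relationships are directed edges between users and/or tweets.
export interface Follow  { followerId: string; followedId: string } // User -> User
export interface Like    { userId: string; tweetId: string }        // User -> Tweet
export interface Retweet { userId: string; tweetId: string }        // User -> Tweet
export interface Block   { blockerId: string; blockedId: string }   // User -> User
export interface Mute    { muterId: string; mutedId: string }       // User -> User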

The World on the Back of a Turtle

Twitter’s public API also exposes other resource models (such as spaces, lists, media, polls, locations, etc.) and other relationships (such as mentions, quoted tweets, bookmarks, hidden replies, etc.). In order to focus as much as possible on the main content, we will ignore these for now.

Keep in mind that this is also just the public API. Internally, a platform like Twitter is a complex web of services, databases, caches, workflows, people, and all the glue that holds them together. I have no doubt that Twitter uses different abstractions at different levels of its public and internal APIs, depending on various factors such as who the API is used for, performance requirements, privacy requirements, etc. For an overview of this complexity, read The Infrastructure Behind Twitter: Scale (2017). (https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale?)

In other words, it’s a world resting on the backs of turtles, and we are intentionally limiting ourselves to just a few of those turtles – but the other turtles exist, and it’s important to keep that in mind as we go.

Sam Hollingsworth, “The World on Turtle’s Back”

Network Graph

Social networks like Twitter are examples of very large graphs, where nodes are models of users and tweets, and edges are models of interactions such as replies, retweets, and likes.

A visualization of Twitter’s dynamic network graph by Michael Bronstein of Twitter’s Graph ML team (2020).

A large part of Twitter’s core business value comes from this vast underlying dataset of users, tweets, and interactions. Whenever you log in, view a tweet, click on a tweet, view a user profile, post a tweet, reply to a tweet, etc. – every interaction you have on Twitter is logged to an internal database.

The data available from Twitter’s public API is only a small slice of the data Twitter tracks internally. This matters because Twitter’s internal recommendation algorithms have access to all of that rich interaction data, while any open source effort may only be able to work with a limited dataset.
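As a rough illustration of the kind of interaction data that never leaves Twitter, an internal event log might capture records shaped something like this. This is purely a guess at the shape of such data for the sake of discussion, not Twitter’s actual schema or pipeline.

// Hypothetical shape of an internal interaction event; purely illustrative.
type InteractionType =
  | 'impression'    // a tweet was shown in someone's timeline
  | 'click'
  | 'like'
  | 'retweet'
  | 'reply'
  | 'profile_view'
  | 'follow'

interface InteractionEvent {
  userId: string
  type: InteractionType
  tweetId?: string        // present for tweet-level interactions
  targetUserId?: string   // present for e.g. profile views and follows
  timestamp: Date
  client?: string         // e.g. 'ios', 'android', 'web'
}

// Each event effectively becomes an edge (or an edge feature) in the
// interaction graph that recommendation models are trained on. In reality
// this would go to a high-throughput event pipeline, not console output.
function logInteraction(event: InteractionEvent): void {
  console.log(JSON.stringify(event))
}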

The Algorithmic Feed

From “Using Deep Learning at Scale in Twitter’s Timelines” (2017):

“Before the introduction of ranking algorithms, the composition of the timeline was easy to describe: all the tweets from the people you followed since your last visit were gathered and displayed in reverse-chronological order. While the concept is easy to understand, delivering this experience reliably to Twitter’s hundreds of millions of users is a huge infrastructural and operational challenge.

For ranking, we make some additional adjustments: after all the tweets are collected, a relevance model scores each one. The model’s score predicts how interesting and engaging a tweet will be to you. The top-scoring tweets then appear at the top of your timeline, with the rest shown directly below.”
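In code, the collect-then-rank step described in that quote might look roughly like the sketch below. It reuses the Tweet and User types from the data model sketch above, and scoreRelevance is a hypothetical stand-in for Twitter’s internal relevance model, as is the illustrative score threshold.

// Rough sketch of the ranking step: score every candidate tweet, show the
// top-scoring tweets first, then the rest in reverse-chronological order.
// scoreRelevance is a hypothetical stand-in for Twitter's relevance model.
declare function scoreRelevance(user: User, tweet: Tweet): Promise<number>

async function rankTimeline(user: User, tweets: Tweet[]): Promise<Tweet[]> {
  // 1. Score every candidate tweet with the relevance model.
  const scored = await Promise.all(
    tweets.map(async (tweet) => ({
      tweet,
      score: await scoreRelevance(user, tweet),
    }))
  )

  // 2. The highest-scoring tweets go to the top of the timeline...
  const THRESHOLD = 0.5 // illustrative cutoff, not a real Twitter parameter
  const top = scored
    .filter((s) => s.score >= THRESHOLD)
    .sort((a, b) => b.score - a.score)

  // 3. ...and the remaining tweets follow in reverse-chronological order.
  const rest = scored
    .filter((s) => s.score < THRESHOLD)
    .sort((a, b) => b.tweet.createdAt.getTime() - a.tweet.createdAt.getTime())

  return [...top, ...rest].map((s) => s.tweet)
}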

Twitter’s algorithmic feed is powered by a personalized recommendation system that predicts which tweets and users you are most likely to engage with. The two most important aspects of this recommender system are:

  1. The underlying data used to train the ML models. This is the large-scale, proprietary network graph of Twitter described above.

  2. The ranking signals considered when determining relevance. Let’s dig into these signals to understand what Twitter means by “relevance”.

Ranking Signals

From “Using Deep Learning at Scale in Twitter’s Timelines” (2017):

“To predict whether a tweet will be interesting to you, our model considers the following features:

  • The tweet itself: its recency, the presence of media cards (image or video), the total number of interactions (such as retweets and likes).

  • Tweet author: Your past interactions with this author, the strength of your connection with them, the origin of your relationship.

  • You: tweets you’ve found engaging in the past, and how often and how heavily you use Twitter.

The list of features we consider, and their various interactions, keeps growing, giving our models more nuanced patterns of behavior.”

This 2017 description of the ranking signals may be a bit dated, but I have no doubt that these core signals remain highly relevant in 2022. This list has likely been generalized across the dozens or even hundreds of key machine learning models that underpin Twitter’s algorithm.

A visualization of a deep learning model for determining the likelihood that one user will follow another user in the future. This model represents a small subset of various recommender systems within Twitter. Image credit: Deep Learning on Dynamic Graphs; 2021
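To make the three feature categories above a bit more tangible, here is a hedged sketch of the kind of feature vector that might be assembled for a (user, tweet) pair before it reaches a relevance model. The individual fields are assumptions extrapolated from the 2017 description, not Twitter’s actual feature set.

// Illustrative feature vector for a (user, tweet) relevance prediction,
// loosely following the three categories from the 2017 blog post.
interface RelevanceFeatures {
  // The tweet itself
  tweetAgeSeconds: number
  hasMediaCard: boolean            // image or video attached
  totalEngagements: number         // likes + retweets + replies so far

  // The tweet's author, relative to the viewing user
  pastInteractionsWithAuthor: number
  connectionStrength: number       // e.g. mutual follows, reply frequency
  viewerFollowsAuthor: boolean

  // The viewing user
  historicalEngagementRate: number // how often they engage with similar tweets
  sessionsPerDay: number           // how often and how heavily they use Twitter
}

// A trained ML model would map these features to a predicted probability of
// engagement; only the function's shape is shown here.
declare function predictEngagement(features: RelevanceFeatures): number // 0..1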

Algorithmic Feed Pseudocode

If you’re a developer, this TypeScript pseudocode may better illustrate how Twitter’s algorithmic feed works:

export abstract class TwitterAlgorithmicFeed {
  /**
   * Pseudocode to help understand how Twitter's algorithmic feed works.
   */
  async getAlgorithmicTimelineForUser(user: User): Promise<Timeline> {
    const rawTimeline = await this.getRawTimelineForUser(user)
    const relevantTweets = await this.getPotentiallyRelevantTweetsForUser(user)

    const mergedTimeline = await this.mergeTimelinesForUserBasedOnRelevancy(
      user,
      rawTimeline,
      relevantTweets
    )

    return this.injectAdsForUserIntoTimeline(user, mergedTimeline)
  }

  /**
   * Returns a reverse-chronological stream of tweets from the accounts that a
   * given user follows.
   */
  abstract getRawTimelineForUser(user: User): Promise<Timeline>

  /**
   * Returns a stream of tweets ranked by relevance to the given user within a
   * given window of time.
   *
   * Only considers tweets from accounts the given user does not follow.
   */
  abstract getPotentiallyRelevantTweetsForUser(user: User): Promise<Timeline>

  /**
   * Returns a stream of tweets ranked by relevance to the given user, taking
   * into account both the raw timeline of recent tweets and the subset of the
   * network graph containing potentially relevant tweets.
   */
  abstract mergeTimelinesForUserBasedOnRelevancy(
    user: User,
    rawTimeline: Timeline,
    relevantTweets: Timeline
  ): Promise<Timeline>

  /**
   * Returns a stream of tweets with ads injected into the given user's
   * timeline.
   */
  abstract injectAdsForUserIntoTimeline(
    user: User,
    timeline: Timeline
  ): Promise<Timeline>
}

TypeScript pseudocode for understanding how Twitter’s algorithmic feed works. Note that it is for illustration only and has been greatly simplified; the full source code is available on GitHub.

Engineering Notes

Open sourcing all aspects of Twitter’s algorithmic feed will inevitably run into some significant engineering challenges.

A senior engineering manager at Twitter reacting to the idea of open sourcing Twitter’s algorithmic feed.

Scale

The first challenge is scale. Twitter’s network graph is enormous, and at that size the engineering and operational work needed to ensure a good user experience often outweighs every other consideration.

The following points can help you understand the scale we’re talking about:

  • Twitter’s network graph contains hundreds of millions of nodes and billions of edges. (Source; 2021)

  • Twitter has more than 300 million monthly active users worldwide. (Source; 2019)

  • On average, ~6K tweets are posted per second, along with over 6 million timeline queries. (Source; 2020)

  • “The public conversations that take place on Twitter often generate hundreds of millions of tweets and retweets every day. This makes Twitter perhaps one of the largest producers of graph-structured data in the world, second perhaps only to the Large Hadron Collider.” (Source; 2020)

Simply put, most developers, and even most companies, are not equipped to handle this much data in a lab environment, let alone in a production-like environment.

To address this challenge, Twitter offers select API partners a 1% sampled version of the public Tweet Firehose, as well as the ability to obtain a smaller subset of filtered streams.
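For a sense of what that looks like in practice, the v2 API’s sampled stream endpoint (GET /2/tweets/sample/stream) delivers roughly 1% of public tweets as newline-delimited JSON. Below is a minimal sketch of consuming it, assuming Node 18+’s built-in fetch; reconnection, backoff, and error handling are omitted.

// Minimal sketch of reading Twitter's ~1% sampled stream (API v2).
// Assumes Node 18+ (global fetch); real clients need reconnection logic.
const SAMPLE_STREAM_URL = 'https://api.twitter.com/2/tweets/sample/stream'

async function readSampledStream(bearerToken: string): Promise<void> {
  const res = await fetch(SAMPLE_STREAM_URL, {
    headers: { Authorization: `Bearer ${bearerToken}` },
  })
  if (!res.ok || !res.body) throw new Error(`stream request failed: ${res.status}`)

  const reader = res.body.getReader()
  const decoder = new TextDecoder()
  let buffer = ''

  while (true) {
    const { done, value } = await reader.read()
    if (done) break
    buffer += decoder.decode(value, { stream: true })

    // Tweets arrive as newline-delimited JSON, with blank keep-alive lines.
    const lines = buffer.split('\r\n')
    buffer = lines.pop() ?? ''
    for (const line of lines) {
      if (!line.trim()) continue
      const event = JSON.parse(line)
      console.log(event.data?.id, event.data?.text)
    }
  }
}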

Additionally, Twitter’s scale presents some unique challenges when building graph machine learning algorithms, since a network graph this large forces a choice between strong and eventual consistency. That complicates things, because there is no guarantee that every node in the graph will have the same features available.

Real Time

The real-time nature of Twitter presents another unique challenge. Users expect Twitter to be as close to real time as possible, which means the underlying network graph is highly dynamic and latency becomes a real user experience issue. When users refresh their timelines, they expect near-instant results, delivered at global scale within seconds. Doing this efficiently is very difficult when the underlying network graph is constantly changing.

Temporal Graph Networks (TGN) is an interesting open source project from Twitter Research. It proposes a framework for deep learning on highly dynamic graphs (graphs that change over time), represented as sequences of timed events.
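To give a flavor of the “sequence of temporal events” idea, here is a toy sketch of representing a dynamic graph as a time-ordered log of edge events. This is only an illustration of the input representation described above, not TGN’s actual API.

// Toy representation of a dynamic graph as a time-ordered sequence of edge
// events, in the spirit of Temporal Graph Networks (TGN); illustrative only.
interface TemporalEdgeEvent {
  sourceId: string                               // e.g. a user
  targetId: string                               // e.g. a tweet or another user
  kind: 'follow' | 'like' | 'retweet' | 'reply'
  timestamp: number                              // epoch milliseconds
}

// The "graph" is just the event log; a snapshot at time t is the set of
// edges whose events occurred at or before t.
function snapshotAt(events: TemporalEdgeEvent[], t: number): TemporalEdgeEvent[] {
  return events
    .filter((e) => e.timestamp <= t)
    .sort((a, b) => a.timestamp - b.timestamp)
}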

Reliability

Another major challenge is platform reliability. Hundreds of millions of people use Twitter as a central part of their online digital identity. The engineering and operational work required to run a global platform like Twitter with the reliability, user experience, and uptime people have come to expect is hard to overstate.

Security & Privacy

From “Rebuilding Twitter’s Public API” (2020):

“One of the platform’s biggest concerns from the start has been serving a healthy public conversation and protecting the personal data of Twitter users.

The new platform pushes all security- and privacy-related logic down into backend services, strictly defining where the relevant business logic lives. As a result, the API layer is independent of this logic, and privacy decisions are applied uniformly across all Twitter clients and APIs.

By isolating where those decisions are made, we prevent inconsistent data exposure. This way, what you see in the iOS app will be the same as what you would get by querying the API programmatically.”

Summary (as of now)

Hopefully this article has helped you understand how Twitter’s algorithmic feed works, what its underlying network graph looks like, and some of the major engineering considerations involved (a very challenging problem at scale).

Here are some deeper questions that I’ll address in follow-up articles:

  • What would an open source Twitter algorithm look like?

  • Is it possible to abstract away all the engineering complexities required to run a global production system like Twitter and develop a truly useful open source software specification or API?

  • Is it possible to produce meaningful results without access to Twitter’s full dataset?

  • What exactly does meaningful mean here? How would we define success?

  • What needs to be done to make this a reality?

  • Any practical suggestions to help improve the situation? (Because we don’t have $43 billion to buy Twitter)
