Translated by AI
Real-time Recommendation System at ByteDance
Introduction
This article is a note on the following paper published by Bytedance and featured in the RecSys'22 Workshop (Monolith: Real Time Recommendation System With Collisionless Embedding Table). I won't dive too deep into the details, so please read the paper if you're interested.
In a nutshell, this paper focuses on "how to implement online learning" in actual recommendation services. In many companies, model training is typically done via batch processing in an offline environment. I'll look into why online learning is necessary. I chose this paper because it deals with an online learning system that actually runs in a production service.
Also, part[1] of the implementation has been made public.
General Resources on Online Learning
- https://www.datacamp.com/blog/what-is-online-machine-learning
- https://huyenchip.com/2020/12/27/real-time-machine-learning.html
- https://huyenchip.com/2022/01/02/real-time-machine-learning-challenges-and-solutions.html
- https://github.com/online-ml/awesome-online-machine-learning
Below, images without citations are taken from the paper mentioned above.
Introduction
In the paper, the following two points are cited as major differences between recommendation models and language or image models:
- (Sparsity and Dynamism) Most features are sparse, categorical, and change dynamically.
- (Non-stationary Distribution) The distribution of training data is non-stationary (Concept Drift).
Sparsity and Dynamism
- In language models, the number of tokens is on the order of 100k to 200k at most, but the number of users and items in recommendation models can be orders of magnitude larger, in which case they cannot fit into the memory of a single host.
- Furthermore, embedding table sizes often grow over time due to new users and items.
Existing approaches (aside)
- Tolerate collisions
  - Treat only the top-N most frequent IDs in the training data separately, and map everything else to a single embedding
  - Hashing trick: `hash(id) mod j`
- Resolve collisions
  - Linear probing
  - Double hashing
- Avoid collisions
  - Compositional embeddings, e.g. Quotient-Remainder embedding
These methods implicitly treat all IDs as equally frequent, but real ID distributions are long-tailed: a small number of users and items account for most occurrences.
Also, as the table grows over time, collisions become more frequent, which degrades model accuracy.
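To make the collision problem concrete, here is a small Python sketch of the hashing trick (`hash(id) mod j`). The MD5-based bucketing function is my own illustrative choice, not something from the paper:

```python
import hashlib

def bucket(item_id: str, num_buckets: int) -> int:
    """Map an arbitrary string ID into a fixed-size embedding table (hashing trick)."""
    digest = int(hashlib.md5(item_id.encode()).hexdigest(), 16)
    return digest % num_buckets

# With far more IDs than buckets, distinct items inevitably share embedding slots.
num_buckets = 1000
ids = [f"item_{i}" for i in range(10_000)]
buckets = [bucket(i, num_buckets) for i in ids]
collision_rate = 1 - len(set(buckets)) / len(ids)
print(f"{collision_rate:.0%} of IDs share a slot with at least one other ID")
```

With 10,000 IDs and 1,000 slots, at least 90% of IDs are forced to share an embedding, and head items collide with tail items indiscriminately, which is exactly the accuracy problem the paper points out.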
Non-stationary Distribution
In language and image models, the time scale is months to as long as a century (meaning the influence of time is almost negligible), but in recommendation systems, even the same user's interests can change minute by minute. In other words, the data is non-stationary, and the fact that the data distribution changes between training and inference is called Concept Drift.
Solution
This is a paper that attempts to solve the above two challenges with:
- Collisionless embedding tables
- Real-time online training with high fault tolerance.
Design
Monolith utilizes the distributed Worker-ParameterServer functionality of TensorFlow.

The architecture is one where Workers read data and calculate gradients, while the Parameter Server receives the gradients and updates the model parameters.
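As a toy illustration of that split (a single-process, synchronous sketch; this is not ByteDance's implementation nor TensorFlow's actual ParameterServer runtime):

```python
import numpy as np

class ParameterServer:
    """Holds the parameters; applies gradients pushed by workers (plain SGD)."""
    def __init__(self, dim: int, lr: float = 0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self) -> np.ndarray:
        return self.w.copy()

    def push(self, grad: np.ndarray) -> None:
        self.w -= self.lr * grad

class Worker:
    """Reads a shard of the data and computes gradients against pulled parameters."""
    def __init__(self, x: np.ndarray, y: np.ndarray):
        self.x, self.y = x, y

    def gradient(self, w: np.ndarray) -> np.ndarray:
        # Least-squares gradient: d/dw mean((Xw - y)^2)
        return 2 * self.x.T @ (self.x @ w - self.y) / len(self.y)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = x @ true_w

ps = ParameterServer(dim=3)
workers = [Worker(x[i::2], y[i::2]) for i in range(2)]  # two data shards
for _ in range(200):
    for wk in workers:          # each worker pulls, computes, pushes
        ps.push(wk.gradient(ps.pull()))
```

In the real system the workers and parameter servers are separate processes on separate hosts, the parameters are sharded across many servers, and pushes arrive asynchronously; this sketch only shows the division of labor.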
Hash Table
The main features are the following three points:
- Use of Cuckoo Hashmap
- Insertion only after the frequency of occurrence exceeds a threshold
- TTL (Time To Live)
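The latter two points (frequency-threshold admission and TTL eviction) might be sketched as follows; the class name, threshold, and TTL values here are my own placeholders, not the paper's:

```python
from collections import Counter

class FilteredEmbeddingTable:
    """Toy ID -> embedding-slot table with frequency-threshold admission and TTL eviction."""
    def __init__(self, threshold=3, ttl_seconds=3600.0):
        self.threshold = threshold
        self.ttl = ttl_seconds
        self.counts = Counter()   # occurrences seen so far, per ID
        self.last_seen = {}       # admitted ID -> timestamp of last access
        self.table = {}           # admitted ID -> embedding (placeholder value)

    def touch(self, item_id, now):
        """Record one occurrence; admit the ID only once it is frequent enough."""
        self.counts[item_id] += 1
        if item_id in self.table or self.counts[item_id] >= self.threshold:
            self.table.setdefault(item_id, [0.0])  # would be a learned vector
            self.last_seen[item_id] = now
            return True
        return False

    def evict_expired(self, now):
        """Drop embeddings not touched within the TTL (stale users/items)."""
        for item_id, seen in list(self.last_seen.items()):
            if now - seen > self.ttl:
                del self.table[item_id], self.last_seen[item_id]

table = FilteredEmbeddingTable(threshold=3, ttl_seconds=10.0)
table.touch("rare_item", now=0.0)   # below threshold: no slot allocated yet
table.touch("rare_item", now=0.0)
table.touch("rare_item", now=0.0)   # third occurrence: admitted
table.evict_expired(now=100.0)      # long idle: evicted by TTL
```

Both filters attack the long tail: IDs that occur only a handful of times never consume a slot, and slots for IDs that go quiet are reclaimed.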
Since the latter two are self-explanatory, let's look a bit closer at only the Cuckoo Hashmap.
Cuckoo Hashmap
This is a hash map that allows keys to be inserted without collisions.
- Lookup and deletion are O(1)
- Insertion is amortized O(1)

If a cycle occurs, it is rehashed.
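A minimal two-table cuckoo hash might look like the sketch below: each key has exactly one candidate slot per table, so a lookup probes at most two positions; an insert that finds its slot occupied "kicks out" the occupant, which retries in the other table; if the kicks cycle, the table is rehashed at a larger size. This is my own simplified toy, not the paper's implementation:

```python
import hashlib

class CuckooHash:
    """Minimal two-table cuckoo hash map (toy, not production code)."""
    def __init__(self, size=8):
        self.size = size
        self.tables = [[None] * size, [None] * size]  # slots hold (key, value)

    def _slot(self, key, which):
        digest = hashlib.sha256(f"{which}:{key}".encode()).hexdigest()
        return int(digest, 16) % self.size

    def get(self, key):
        for which in (0, 1):                      # at most two probes: O(1) lookup
            entry = self.tables[which][self._slot(key, which)]
            if entry is not None and entry[0] == key:
                return entry[1]
        return None

    def put(self, key, value, max_kicks=32):
        for which in (0, 1):                      # update in place if key exists
            slot = self._slot(key, which)
            occupant = self.tables[which][slot]
            if occupant is not None and occupant[0] == key:
                self.tables[which][slot] = (key, value)
                return
        entry, which = (key, value), 0
        for _ in range(max_kicks):
            slot = self._slot(entry[0], which)
            entry, self.tables[which][slot] = self.tables[which][slot], entry
            if entry is None:                     # landed in an empty slot
                return
            which = 1 - which                     # evicted occupant retries elsewhere
        self._rehash(entry)                       # likely cycle: grow and rebuild

    def _rehash(self, pending):
        old = [e for t in self.tables for e in t if e is not None]
        self.__init__(self.size * 2)
        for k, v in old + [pending]:
            self.put(k, v)

m = CuckooHash()
for i in range(20):        # more items than the initial capacity: forces a rehash
    m.put(f"user_{i}", i)
```

Because every key lives in one of exactly two known slots, there are never silent collisions where two IDs share an embedding, which is the property Monolith needs.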
As a side note, Cuckoo refers to the bird, which has the following terrifying habit:
The cuckoo removes one of the Great Reed Warbler's eggs and lays its own in its place, leaving one cuckoo egg and one warbler egg in the nest. The cuckoo chick hatches in about 12 days, earlier than the warbler's egg. Within a few hours of hatching, the chick pushes the warbler's egg out of the nest, so eventually only the cuckoo chick remains.
The name "Cuckoo Hashmap" seems to have been given because this "ejection" behavior is the same.
Online Training

- Supports both batch and online training
- Joins features to user logs at inference time (presumably)
  - These are not the features at the exact moment the user log arrived.
  - Disk-level caching allows the join to succeed even when user logs are delayed (on the order of days).
- Negative sampling
  - A log-odds correction is applied to fix up the distribution at inference time.
- Incremental parameter synchronization
  - The model is several terabytes, so synchronization must be done without stopping inference or straining the network.
  - Since sparse parameters account for most of the model, only the parameters updated since the last synchronization are synced, at minute-level intervals.
  - Dense parameters are synchronized less frequently (sudden changes are unlikely with momentum-based optimizers).
- Daily parameter snapshots
  - For recovery from system failures, such as a Parameter Server crash.
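Of these, the negative-sampling correction is concrete enough to sketch. A standard calibration from the ad-CTR literature (I am inferring the exact form; the paper may implement it differently) is: if negatives are kept with probability w, the calibrated probability is p / (p + (1 - p) / w), which in logit space is simply adding log(w):

```python
import math

def correct_probability(p_biased, neg_sampling_rate):
    """Undo negative downsampling: if negatives were kept with probability w,
    the calibrated probability is p / (p + (1 - p) / w)."""
    w = neg_sampling_rate
    return p_biased / (p_biased + (1.0 - p_biased) / w)

def correct_logit(logit_biased, neg_sampling_rate):
    """Equivalent correction in logit space: add log(w)."""
    return logit_biased + math.log(neg_sampling_rate)
```

The two forms are algebraically identical (apply the sigmoid to the corrected logit and simplify); the intuition is that training on a world with 10x fewer negatives makes the model overestimate the positive rate, and the correction shifts it back down.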
Experiments
Is the "Collisionless" Hash Table Effective?

- In both cases, the "collisionless" hash table shows higher AUC and better performance.
- The fluctuation of AUC in the real-world data results (right figure) is due to Concept Drift. (The parameter synchronization frequency at this time is unknown.)
Is Real-Time Online Training Important?


- Regardless of the frequency, performance is higher with online training than without.
- The shorter the synchronization interval, the higher the performance becomes.

- A significant performance improvement was observed in the Ads model of the production system.
Is the Parameter Synchronization Mechanism Robust?
Omitted
Impressions
- If the model size grows any larger, distributed training (which is currently a hot topic) might become necessary, but combining it with model parallelism would likely increase the difficulty significantly.
- Please let me know if you know of any other stories about production services implementing online learning.
- Bytedance uses online learning in various services, which makes it a valuable reference. In fact, they state the following: [4]
Ultimately, your For You feed is powered by your feedback: the system is designed to continuously improve, correct, and learn from your own engagement with the platform to produce personalized recommendations that we hope inspire creativity and bring joy with every refresh of your For You feed.
References
- There is no Flink code for joining features and logs. ↩︎
- I wrote "almost" because there is a description that data prepared in real time with Kafka is first stored in HDFS, and the training part is performed in batches. This was as of the time that article was written (2020.6), so it might be online learning now, or the system may have changed significantly. ↩︎
- It seems they use KSQLDB instead of Flink. They also perform data structuring and downsampling to handle extremely large volumes of data. ↩︎
- https://newsroom.tiktok.com/en-us/how-tiktok-recommends-videos-for-you ↩︎