

Surfacing Interesting Content

by Micah Fivecoate (@abentspoon)

Heyzap is a social network for mobile gamers, and as such, we get lots of user-generated content. This is a collection of algorithms we’ve been using to surface the most interesting content.

Currently Popular

Do you need to know which songs your users are listening to? Which tags are trending on Twitter? No need to break out a cron job: this algorithm will keep you up to date in real time.


[Interactive visualization: adjustable Half-life, Vote Rate, and Vote Distribution]

What it Does

We use this algorithm to find popular games trending on our network. Each time a user plays a game, we cast a “vote” for that game. Each vote has a “score”, which decays with age. For our popular games, votes decay at a rate of 50% per week. To display the most popular games, add up the scores of all the votes for each game. Using this algorithm, a game played 20 times this week will be ranked higher than a game played 30 times last week, and lower than a game played 50 times last week.
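To make that arithmetic concrete, here is a throwaway sketch of the same scenario with a one-week half-life (the decayed helper exists only for illustration):

# decayed score of a batch of votes cast `weeks_ago` weeks ago, with a one-week half-life
def decayed(votes, weeks_ago, half_life_weeks = 1.0)
  votes * 2 ** (-weeks_ago / half_life_weeks)
end

decayed(20, 0) # => 20.0   20 plays this week
decayed(30, 1) # => 15.0   30 plays last week
decayed(50, 1) # => 25.0   50 plays last week
# 50 last week > 20 this week > 30 last week, matching the ranking above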

Similarly, if you wanted to track trending hashtags, you would cast a vote each time a tag appears. You could also use this algorithm to track word frequencies in news articles, or which countries are visiting your site.

In the visualization above, votes are cast randomly at a set of items. The orange bars indicate the current “popularity score” of each item, and the red bars indicate the probabilistic rate at which each item should accrue new votes.

The longer the half-life, the slower the algorithm will respond to new votes. At the extreme ends, a half-life of zero would answer “Which post was most recently voted on?”, whereas a half-life of infinity would answer “Which post has the most votes?”.

How it works

A straightforward implementation using a cron job might be:

  • Each time a vote is cast, add 1 to the popularity score of the corresponding item.
  • Once per day, divide all popularity scores by two.

In practice you would probably want to divide the scores by 2^(1/24) each hour instead, to minimize transients and make the decay more continuous.
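For reference, a minimal sketch of that cron variant, assuming the same REDIS client and ZSET used in the implementations below (the hourly job is what cron would run):

HOURLY_DECAY = 2 ** (1.0 / 24)  # dividing by this every hour halves a score over one day

# each time a vote is cast
def on_vote(post_id)
  REDIS.zincrby("popular_stream", 1, post_id)
end

# run once per hour from cron
def decay_scores
  REDIS.zrange("popular_stream", 0, -1, :with_scores => true).each do |id, score|
    REDIS.zadd("popular_stream", score / HOURLY_DECAY, id)
  end
end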

However, I loathe cron jobs, so here is an easier method:

  • Each time a vote is cast, add 2^((now - epoch) / half_life) to the corresponding item.

As we only care about the rank of the popular items, the only difference between the outputs of the two implementations is that this one is perfectly continuous, as opposed to the stuttering decay of the cron variation.

One drawback to the continuous implementation is float overflow. With a carefully chosen epoch, we can make use of a double-precision float’s 11 exponent bits to let the algorithm run for roughly 2048 half-lives. If your half-life is one day, you can run the algorithm for over five years before needing to migrate the epoch.
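To see where that limit comes from, this is just standard double-precision behavior, nothing specific to the algorithm:

2.0 ** 1023    # => 8.98846567431158e+307 (still finite)
2.0 ** 1024    # => Infinity
2.0 ** -1074   # => 5.0e-324 (smallest positive double)

# If the epoch is chosen so that early scores start near the bottom of that range,
# roughly 2048 half-lives can pass before 2 ** ((now - epoch) / half_life)
# overflows to Infinity.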

Using Redis as an External Index

In all my examples, I’m using Redis as an external index. You could add a column and an index to your posts table, but it’s probably huge, which presents its own limitations. Additionally, since we only care about the most popular items, we can save memory by only indexing the top few thousand items.

If you’re not familiar with Redis, I’m using ZSETs. ZSETs are sorted sets: half array, half dictionary. The value stored for each key determines the key’s relative “index” in the array. They have O(log N) inserts, O(log N) slices, and are indexed by double-precision floats, which makes them perfect for this implementation.
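If you haven’t used them before, here are the handful of ZSET commands these examples rely on, via the redis-rb gem (the key name is just for illustration):

require "redis"
REDIS = Redis.new

REDIS.zincrby("example_zset", 1.5, "post:42")    # dict["post:42"] += 1.5
REDIS.zadd("example_zset", 7.0, "post:99")       # dict["post:99"] = 7.0
REDIS.zrevrange("example_zset", 0, 9)            # top 10 members by score
REDIS.zrevrange("example_zset", 0, 9, :with_scores => true)  # ...with their scores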

Implementation

class PopularStream
  STREAM_KEY = "popular_stream"
  HALF_LIFE = 1.day.to_i

  # 2.5 * half_life (in days) years from now
  EPOCH = Date.new(2015, 10, 1).to_time.to_i

  def self.onVote(post)
    # dict[post.id] += value
    REDIS.zincrby(STREAM_KEY, 2 ** ((Time.now.to_i - EPOCH) / HALF_LIFE.to_f), post.id)
    trim(STREAM_KEY, 10000)
  end

  def self.get(limit = 20)
    # arr.sort.reverse[0, limit]
    REDIS.zrevrange(STREAM_KEY, 0, limit).map(&:to_i)
  end

  def self.trim(key, n)
    # arr = arr[-n, n]
    REDIS.zremrangebyrank(key, 0, -n) if rand < (2.to_f / n)
  end

  # run this in five years
  # you could make EPOCH and STREAM_KEY dynamic to make this process easier.
  # Otherwise migrate and deploy the new values
  def self.migrate(new_key, new_epoch)
    # rescale existing scores so they are relative to the new epoch
    REDIS.zunionstore(new_key, [STREAM_KEY],
                      :weights => [2 ** ((EPOCH - new_epoch) / HALF_LIFE.to_f)])
  end
end
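Hypothetical usage, assuming a REDIS connection and some hook in your app that fires on each play:

PopularStream.onVote(post)   # record a play / vote for this post
PopularStream.get(10)        # => ids of the ten most popular posts right now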

Hot Stream

If the age of the post is more relevant than the age of the votes, we can simplify things considerably by treating all votes as though they were cast at the time the post was created. This is the algorithm used by Reddit’s front page.

What it does

If we start the decay for all votes on a post at the same time, we can simplify the formula for a post’s score to:

post_creation_time / half_life + log2(votes + 1)
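To see how the formula trades age against votes, here is a throwaway sketch with an arbitrary 12-hour half-life (hot_score is only for illustration):

def hot_score(created_at, votes, half_life = 12.0 * 3600)
  created_at.to_f / half_life + Math.log2(votes + 1)
end

now = Time.now.to_i
hot_score(now, 10)              # a fresh post with 10 votes
hot_score(now - 12 * 3600, 20)  # a 12-hour-old post with 20 votes
# the fresh post ranks slightly higher: one half-life of age costs one doubling of votes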

In the visualization below, votes are cast randomly on a series of posts. Each column represents the “hot” score of each post. The tallest column would be the #1 post on the “hot” page, the second tallest #2, and so on.


[Interactive visualization: adjustable Half Life, Vote Rate, and Vote Distribution]

How it works

As I’ve tried to show in the visualization above, adding a constant to log(votes) is the same as multiplying the votes by a constant, since log(c) + log(n) = log(c*n). So each extra half-life of recency adds one to a post’s score, which is the same as doubling its votes, giving us the same decay we had in the previous algorithm.
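A two-line check of that identity (the numbers are arbitrary):

Math.log2(80) + 1    # => 7.321928094887363
Math.log2(80 * 2)    # => 7.321928094887363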

This means we don’t have to worry about overflows anymore!

Implementation

class HotStream
  STREAM_KEY = "hot_stream"

  # How long until a post with 100 votes is less interesting than one with 10 votes?
  # Reddit uses 12 hours
  TENTH_LIFE = 12.hours.to_f

  # just to make it clear it's still the same algorithm (about 3.6 hours)
  HALF_LIFE = TENTH_LIFE * Math.log(2) / Math.log(10)

  def self.onVote(post)
    # dict[post.id] = value
    REDIS.zadd(STREAM_KEY, post.created_at.to_i / TENTH_LIFE + Math.log10(post.votes + 1), post.id)
    trim(STREAM_KEY, 10000)
  end

  def self.get(limit = 20)
    # arr.sort.reverse[0, limit]
    REDIS.zrevrange(STREAM_KEY, 0, limit)
  end

  # same top-n trim as PopularStream
  def self.trim(key, n)
    REDIS.zremrangebyrank(key, 0, -n) if rand < (2.to_f / n)
  end
end

Drip Stream

This algorithm uses the same decay used in the hot stream, plus a threshold, to create a Digg-like, rate-limited, append-only stream.

What it does

Whenever a new post crosses the threshold, the threshold is incremented by the “drip period”, and the post is added to the drip stream. Since we’re constantly increasing the base score of each new post, a new post should be added to the stream once per drip period.

In the visualization below, votes are cast randomly on a series of posts. Each column represents the “hot” score of one post. The threshold is marked with a horizontal red line. As posts cross the threshold and are added to the drip stream, they are marked red.


[Interactive visualization: adjustable Half Life, Drip Rate, Vote Rate, and Vote Distribution]

Implementation

class DripStream
  STREAM_KEY = "drip_stream"
  THRESHOLD_KEY = "drip_stream_threshold"

  # How long until a post with 100 votes is less interesting than one with 10 votes?
  # Reddit uses 12 hours
  TENTH_LIFE = 12.hours.to_f

  # How often should a new story be pushed to the stream?
  DRIP_PERIOD = 1.hour.to_f

  def self.newVote(post)
    # already dripped into the stream?
    return if REDIS.zscore(STREAM_KEY, post.id)

    score = post.created_at.to_i / TENTH_LIFE + Math.log10(post.votes + 1)
    threshold = (REDIS.get(THRESHOLD_KEY) || score).to_f + DRIP_PERIOD / TENTH_LIFE

    if score > threshold
      REDIS.set(THRESHOLD_KEY, threshold + DRIP_PERIOD / TENTH_LIFE)
      # dict[post.id] = value
      REDIS.zadd(STREAM_KEY, Time.now.to_i, post.id)
      trim(STREAM_KEY, 10000)
    end
  end

  def self.get(limit = 20)
    # arr.sort.reverse[0, limit]
    REDIS.zrevrange(STREAM_KEY, 0, limit).map(&:to_i)
  end

  # same top-n trim as PopularStream
  def self.trim(key, n)
    REDIS.zremrangebyrank(key, 0, -n) if rand < (2.to_f / n)
  end
end

Friends Stream

This creates a Twitter-like stream of people/places/things you are following.

Isn’t that trivial?

Sure, usually. That’s why it’s at the end.

SELECT * FROM posts WHERE user_id IN (7, 23, 42, ...) ORDER BY created_at LIMIT 20

Unfortunately, as you scale, IN queries get slow. Mongo pulls down 20 posts from each user, sorts them all by hand, then crops. When users follow thousands of other users, that becomes expensive. The SQL databases I tried at the time didn’t cut it either.

However, don’t take my word for it. Just remember this is here if you start seeing thousand-entry IN queries in your slow log.

How it works

The active ingredient is a ZSET of all users and their most recent post. That ZSET can be quickly intersected with the set of followed users, then sliced to create a list of recently active people you follow.

In this implementation, I’m using the actives list to union ZSETs containing each user’s stream. You could just as easily use the list to pare down the arguments to your IN query.
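For example, a sketch of that alternative, assuming the active_users ZSET from the implementation below and an ActiveRecord-style Post model (model and column names are illustrative):

# intersect "users I follow" with "recently active users" in Ruby,
# then hand the much smaller id list to SQL
followed_ids = REDIS.smembers("user_friends_#{user.id}").map(&:to_i)
active_ids   = REDIS.zrevrange("active_users", 0, 999).map(&:to_i)

recent_ids = followed_ids & active_ids
posts = Post.where(:user_id => recent_ids).order("created_at DESC").limit(20)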

Implementation

class FriendsStream
  USER_STREAM_KEY         = lambda { |user_id| "user_stream_#{user_id}" }
  USER_FRIENDS_KEY        = lambda { |user_id| "user_friends_#{user_id}" }
  USER_ACTIVE_FRIENDS_KEY = lambda { |user_id| "user_active_friends_#{user_id}" }
  FRIENDS_STREAM_KEY      = lambda { |user_id| "friend_stream_#{user_id}" }
  ACTIVE_USERS_KEY        = "active_users"

  def self.follow(user, to_follow)
    REDIS.sadd(USER_FRIENDS_KEY[user.id], to_follow.id)
  end

  def self.push(post)
    # each user's personal stream, scored by post time
    REDIS.zadd(USER_STREAM_KEY[post.user_id], post.created_at.to_i, post.id)
    trim(USER_STREAM_KEY[post.user_id], 40)
    # global "recently active users" index, scored by their latest post time
    REDIS.zadd(ACTIVE_USERS_KEY, post.created_at.to_i, post.user_id)
    trim(ACTIVE_USERS_KEY, 10000)
  end

  def self.get(user, limit = 20)
    # recently active users I follow
    REDIS.zinterstore(USER_ACTIVE_FRIENDS_KEY[user.id],
                      [ACTIVE_USERS_KEY, USER_FRIENDS_KEY[user.id]])
    active_friends = REDIS.zrevrange(USER_ACTIVE_FRIENDS_KEY[user.id], 0, limit)
    # merge their streams and take the newest posts
    REDIS.zunionstore(FRIENDS_STREAM_KEY[user.id], active_friends.map(&USER_STREAM_KEY))
    REDIS.zrevrange(FRIENDS_STREAM_KEY[user.id], 0, limit).map(&:to_i)
  end

  # same top-n trim as PopularStream
  def self.trim(key, n)
    REDIS.zremrangebyrank(key, 0, -n) if rand < (2.to_f / n)
  end
end

Hiring Plug

Heyzap is always hiring great engineers. If you found this interesting, or better yet obvious, drop us an email. Make sure to mention you read this article (I think I get a bonus).

Email: jobs@heyzap.com

About Us: heyzap.com/about

