Over the past year, the Strava platform team has been working to rebuild the segment leaderboards system. This is the final article in a series of four blog posts detailing that process, and describes how the leaderboards event stream architecture enables simple and easy extensions to the core functionality. For some added context, please read part one, which details the background of the leaderboards system, part two, which distilled the problems of the previous leaderboards systems down to a set of principles a new system should solve, and part three, which describes details of the core leaderboards system.
As a quick refresher, part three of this series described a leaderboard architecture in which effort mutations in the Ruby on Rails app were logged to a Kafka topic partitioned by segment and user. Those mutations are consumed by a worker which refreshed effort data from the canonical effort store, and then applied the resulting updates to leaderboards storage.
While this system is the core of our final leaderboards system, it is far from the entire surface area of the leaderboards service at Strava. In addition to basic leaderboard functionality, the leaderboards service is also responsible for:
- Awarding top ten achievements (including KOM/QOM) from efforts on an activity.
- Maintaining counts of the total number of efforts ever on a segment, and the total number of unique users who have attempted a segment.
- Serving as a base store for dynamically filtered leaderboards (for example: a leaderboard composed only of the other users you are following, or the members of a club).
A naive way to support these features would be to execute queries against Cassandra leaderboard storage:
- Achievements — whenever we render an activity’s efforts within the Strava product, query the leaderboard for every segment traversed by that activity to determine if any of the activity’s effort resulted in a top ten.
- Counts — whenever we show a leaderboard, fetch all rows from it to determine the count.
- Filtered leaderboards — Whenever we need to show a dynamically filtered leaderboard, fetch all rows from the leaderboard and filter them for users which meet the criteria.
While straightforward, querying Cassandra every time is incredibly costly — you may have noticed the phrase “fetch all rows” more than once in the previous paragraph. Full Cassandra table scans are computationally expensive, and time consuming. If we attempted these naive query patterns during regular query load, we would overwhelm our infrastructure and cause undesirable and unpredictable latency for requests.
The next logical step, then, is to denormalize the data these features require into a derivative datasets. These datasets contain precomputed answers to the common questions the above queries aim to answer, giving us predictable performance and predictable latency for read requests.
Replication and Synchronization
Reflecting back on the three features our leaderboards service needs to support, we can classify each access pattern into one of three derivative data sets:
- Leaderboard Cache — copy of an often-requested portion of all leaderboards, held in a low latency format (i.e. in-memory), offering quick reads.
- Achievements — store of efforts which are within the top 10 of a leaderboard.
- Counts — aggregate counts of total efforts and unique number of users seen per leaderboard.
To keep derivative data stores like these in sync, we will want to update them as soon as the leaderboards themselves are updated. To achieve this goal, oftentimes engineers will do the simple and straightforward thing of adding lines of code in the upstream application to update derivative data stores as well. For example, to keep leaderboard aggregate counts in sync we would add code to increment the total effort counter every time a new effort is added or removed from a leaderboard.
This approach may sound straightforward at first, but there is a lot of hidden complexity:
- Latency in updating downstream data sets add latency to upstream processing.
- Availability of the upstream processing is now tied to availability of the downstream data stores.
- Upstream processing is now concerned with proper error-handling/retry/data consistency requirements of downstream data sets.
- Added complexity of code, integration, and testing downstream updates within the upstream system.
In nearly all cases, these complications can be avoided by instead architecting your system in a stream processing approach, where changes in the upstream system are asynchronously replicated to downstream ones. In this design, the upstream triggering system logs mutations it is making, then downstream services consume those updates and apply changes to their own data stores. In our implementation, the worker updating leaderboards simply logs leaderboard mutation messages as it updates leaderboards. Downstream clients consume those mutations to keep their derivative data stores in sync.
Derivative Data Sets
In the rest of this article we will summarize how the leaderboards stream processing architecture allowed us to build three derivative data sets. We’ll slowly build out a diagram of the entire leaderboard architecture, showing how all three of the systems fit together as part of the larger whole.
While Cassandra meets our use case for leaderboard storage quite nicely, due to its data model, it has a limited querying ability. You are not able to craft queries as expressive as those in more traditional RDBMS (such as MySQL). The normal approach to this limitation is to predefine query patterns and denormalize data into separate Cassandra tables, one for each query pattern.
This technique is somewhat challenging for leaderboards. For the dynamic leaderboards, it is prohibitively expensive to materialize all of them in Cassandra. We would be looking at an additional leaderboard per segment for each user that has traversed the segment and each club in which a member has traversed the segment, together ordering on the tens or hundreds of millions. Thus, we have to fetch all efforts from the base leaderboard, and filter them in memory. These are not common requests, but do happen with enough frequency, and are expensive enough to Cassandra, that we wanted to look at ways to limit them to some degree.
Additionally, aside from paginated leaderboard results, the other main query we must fulfill is calculating a user’s ranking on a given leaderboard. Since the ranking is dependent on the number of rows above the user in a given leaderboard, we have to fetch all efforts that are faster than the user’s entry to determine the rank. This could potentially be the entire leaderboard (if the user is in last place), another very expensive operation.
The solution to this problem was to put a cache in front of Cassandra to serve common leaderboard requests. This cache would hold a subset of frequently requested leaderboards, stored in a data structure to facilitate quick response times. The cache should be able to serve leaderboard requests faster than Cassandra, and could potentially provide a richer querying API to handle those requests without needing a full scan.
Populating the Cache
Populating the cache should be done on a cache miss as seen from leaderboard read requests. On a miss, the client logs a message into a Kafka topic noting the leaderboard which had the cache miss. A downstream consumer (discussed shortly) consumes the cache miss message, and populates the cache by querying for canonical leaderboard data from Cassandra.
You may be wondering why we didn’t craft the cache as a read through cache, where the client queries Cassandra on a miss, and then caches that response as part of the initial read request. The reason is that we actually cache the entire leaderboard (all efforts), not just the specific response for the specific request. And we do not want to introduce latency to the read request to do this cache operation, so instead it is deferred to a out-of-band downstream consumer.
This may see counterintuitive, but it’s important to note that caching specific leaderboard responses is incredibly hard. Leaderboard responses include a ranking value for each effort returned. This is the effort’s ranking in the leaderboard, noting how many faster efforts there are before it. Caching partial sections of leaderboard standings with a ranking is dangerous — because as new efforts are added or removed from the leaderboard, the ranking value for other efforts on the leaderboard change. For example, if I upload a best effort on a segment with 10,000 athletes and I am the new 250th best effort, the remaining 9,750 efforts all need to have their ranking shifted down by one. If we cached portions of leaderboards, on any given best effort mutation we would have to invalidate all partial responses that had their standing changed by that effort. This is certainly possible, but incredibly challenging and untenable.
Caching, and then invalidating, the entire leaderboard on each new best effort mutation is certainly simpler and easier to reason about. Unfortunately, popular (read: large) leaderboards see tens of updates a day, lowering cache hit rate for that class of leaderboard. This is extra-unfortunate since those popular leaderboards are the slowest to query, and then need caching the most.
The ideal solution is to cache the whole leaderboard in a structure which maintains ranking on updates, but allows partial updates as new efforts are added and removed from the leaderboard. This is tricky to do, as cache population due to cache misses needs to be synchronized with these partial updates. If you process them independently, you could end up with inconsistent data. Imagine Cassandra is queried by the actor processing the cache miss, but before the actor can write the leaderboard to cache, an effort is added to that same leaderboard in Cassandra. This effort was added after the first actor queried Cassandra, but since the leaderboard doesn’t exist yet (we are waiting on that actor to process the miss), the separate actor processing the leaderboard update does not apply the change. The first actor processing the miss then writes the leaderboard to cache, and we’ve lost the leaderboard update.
To achieve this synchronization, we rely on our good friend Kafka topic partitioning. Both cache miss messages, along with leaderboard updates, are logged into the same topic partitioned this time by segment and leaderboard type. This ensures that all writes to a particular type of segment leaderboard — both populates and updates — are serialized, processed by no more than one actor at a time.
Cache Data Store
What data store will work with this caching strategy? Basic object caches, like Memcache, are out, since they do not support partial updates or ordered data structures. What we need is a cache that has slightly richer features and data structures which support partial updates. As it turns out, Redis is a good candidate for this problem.
The Redis sorted set data structure is a natural fit, because it allows us to sort efforts by their elapsed time. However, due to details outside of the scope of this blog post, the actual implementation of leaderboards in Redis required keeping both a sorted set and a hash in order to efficiently implement all needed leaderboard queries. This represented a challenge, as Redis does not have any sort of transaction isolation, and we did not want to use locks, for reasons mentioned in earlier posts in this series.
However, Redis does have the ability to run Lua scripts via an embedded Lua interpreter, and the entire Lua script runs from start to finish atomically on the Redis instance. Leveraging this fact, we were able to write a suite of Lua scripts which atomically update both the sorted set and hash atomically. The cache worker then simply executes these scripts when processing cache miss and leaderboard update messages.
Finally, it’s important to address part one of this series, where we noted Redis was a “bad choice” for storage. We’re less concerned with the poor Redis ops story in this deployment, as all cache data is ephemeral. If our Redis instance crashes, we will lose all cached data, but the leaderboards system will still have high availability — we can read from Cassandra temporarily until a replacement node is back online.
To record top ten achievements, a worker consumes leaderboard mutation messages logged by the cache worker, filtering them out for only efforts on either the KOM or QOM (colloquially called XOM) leaderboards. For each XOM effort seen, we query the leaderboard it appears on, to see if the effort is in the top ten results. If it is, we enqueue a job to record the achievement in our Rails app, which owns the display of these achievements.
Why does the worker need to look up the top ten leaderboard results if we have the leaderboard mutation already? The answer is that the leaderboard mutation does not include the ranking of the user on the leaderboard. This was a conscious decision. While calculating such a rank is possible at effort add time to Cassandra, calculating it is prohibitively expensive as it requires counting all the rows for efforts with a elapsed time faster than the effort just inserted. Rather than slow down the workers updating Cassandra, we decided to do this work via a separate worker after the leaderboard has been updated. That downstream worker is then able to query for leaderboards like any other client (using the cache), and only query the top ten results: a much cheaper query than asking for the ranking of the effort on the whole leaderboard.
The tradeoff with this approach is a small chance of missed achievements during a high volume of concurrent effort uploads to the same leaderboard. For instance, if I upload an effort that is the new KOM on a segment, but before the worker calculating achievements processes the leaderboard mutation, my friend uploads their effort which is even faster and takes over the KOM. Technically my effort was the KOM, for the brief moment between my effort upload and my friends. However, when achievements worker actually queries for the top ten based on my effort it would see my effort as second place, with my friends effort as the KOM.
This is certainly not ideal, but this situation is very rare. Even if it does occur, most users would be okay (or even in some cases, unaware) they held the XOM for that short amount of time. Additionally, even if they did know they held it, under normal operating conditions they’d have only held the XOM for a second or two, not long enough to really consider themselves the XOM holder with any authority.
As mentioned above, we maintain two aggregate total effort counts on a segment: the number of efforts ever, and the number of distinct users. The old Scala-powered system updated both counts via addEffort or removeEffort calls. The total effort count was implemented as a denormalized Redis counter, while the distinct user count was simply the HLEN of the redis hash per segment.
For the distinct user count, there is no equivalent O(1) HLEN command for Cassandra. We have to use the more expensive COUNT(*) query on the overall the leaderboard. Since this is so much more expensive, we made the decision to denormalize effort counts into another data store, keyed by the segment. For every leaderboard mutation seen on the overall leaderboard, we issue a COUNT(*) command and write the result to the denormalized store. This will obviously calculate counts for every segment in the whole system, even for segments which no client ever queries for, but popular leaderboards see hundreds of requests for their count per day, so the tradeoff is worth it for predictable O(1) read performance.
Total effort counts are more complicated. We cannot simply increment/decrement counts as we consume effort mutation messages, because that operation is not idempotent. If the consumer crashes and picks up from an earlier offset in the log we will be double incrementing/decrementing efforts. It turns out this is a common problem in distributed systems, with the distinction that we’re not really looking for a counter, really we want a set of all efforts on a segment, using the cardinality of that set as the total count.
However, sets are quite memory intensive; we would need to store every effort ever at Strava to produce accurate counts. This is also another common problem, for which most solutions rely on probabilistic data structures. In our case, a HyperLogLog data structure works well for use case. It supports idempotent adds in O(1), with O(1) cardinality checking. So effort counts are now as simple as consuming effort mutations, adding each effort to a HyperLogLog per segment.
Unfortunately, as of today, this is not the way our production system maintains total effort counts. We still naively update counts via addEffort or removeEffort calls. As mentioned earlier, to accurately build a HyperLogLog counter for all efforts ever per segment we’d have to play all efforts ever into it. This is a non-trivial backfill operation, and instead we just decided to migrate the counters from the old infrastructure and maintain them the same way. This is obviously not ideal, but is no worse than the current implementation and allowed us to more quickly migrate away from the old leaderboards infrastructure.
Hopefully, after this article, it is clear that a stream processing architecture provides a great foundation to easily build services that produce derivative datasets. We saw it was easy to build several systems — a robust leaderboard cache containing a subset of often-requested leaderboards, a system to detect and publish top ten achievements, and a system to maintain denormalized aggregate counts.
While these datasets are updated asynchronously, with some latency, between their triggering actions upstream and an update reflected in the derivative data store, the latency is generally very low. This is a worthwhile tradeoff — in exchange for a latency hit, we get a system with increased reliability and consistency of updates, decoupling of components, scaling independence and flexibility, and simpler error and retry handling.