Using neural networks to gobble up the trail of breadcrumbs left by fraudsters

You’re a Lyft driver and you’ve just accepted a ride. You start making your way to the pickup location and suddenly you get a call.

“Hello Alex, I’m Tracy calling from Lyft HQ. This month we’re awarding $200 to all drivers with a rating of 4.7 stars and above, and I just wanted to congratulate you for being an awesome 4.9 star driver!”

“Hey Tracy, thanks!”

“No problem! And because we see that you’re in a ride, we’ll dispatch another driver so you can park at a safe location… Alright, your passenger will be taken care of by another driver. Before we can credit you the award, we just need to quickly verify your identity. We’ll now send you a verification text. Can you please tell us what those numbers are?…”

At this point, you’ve just given up complete access over every last cent in your driver account without even realizing it.

Behavior fingerprints

Posing as Lyft support, a particular kind of scammer would request a pickup and call the driver that accepts the ride. Polite and professional, the scammer would introduce himself as a Lyft HQ representative and then congratulate the driver on having been selected for a monetary award, ostensibly for being an outstanding driver. To credit the award to his account, the driver would have to verify his identity and provide account credentials. For added effect, the scammer would claim that it’s an important call and “re-dispatch” the ride for the passenger. This apparent administrative privilege is possible since the scammer is really the passenger and can simply cancel the ride. To the uninitiated driver, unfortunately, this elaborate display of authority on top of a well-rehearsed script is often very convincing. Past this point, the scammers would have access to any hard-earned money that hasn’t been cashed out from the account. What follows would not surprise anyone familiar with the concepts of social engineering and account takeover.

Analytically, we quickly learned to identify suspicious accounts through telltale signs in their user activity. For instance, the scammers’ accounts would exhibit unusually high driver contact and passenger cancellation rates with few completed rides. There were many variants of the exact history of user activity, of course, as these fraudsters changed their tactics over time. Their varied behavior made it challenging to codify their user activity into simpler, structured features that clearly distinguish good from bad users. To the experienced risk analyst, however, the patterns are obvious. There are “behavior fingerprints” that resulted from their modus operandi that didn’t — and perhaps couldn’t — fundamentally change.

In this post, we reveal the practical motivations for a paradigm shift in Fraud Research Science work over the past year, from pure, hardcoded feature engineering to a more model-centric approach. This newer approach focuses on modern machine learning techniques such as deep learning that expand the types of sources we can work with and help us better capture the predictive properties of our signals. We then dive into one of our latest production machine learning models, which detects behavior fingerprints in fraudulent users’ account activity. We end by sketching out some of the current directions we’re looking into at Lyft Fraud.

A (manual) labor of love

At Lyft, fraud decision-making is split between business rules handcrafted by analysts and machine learning models developed by research scientists. These business rules and machine learning models form the backbone of our detection system, which triggers pre-authorizations and identity challenges targeted at blocking fraudsters. To power these decision-making tools, our team puts a lot of effort into analyzing the behavior of fraudulent users and distilling the signals they leave behind into hand-engineered features. But hand-engineering features is hard.

To be precise, what’s hard isn’t generating a bunch of features. Rather, what’s hard is hand-engineering features that are robust: predictive not just in the short term but also in the medium and long term. As with diseases, it’s often easier to devise features that detect the symptoms rather than the cause of the symptoms. For instance, unusually high driver contact and cancellation rates might together form a good business rule that initially detects many of the con artists’ accounts described in the introduction. But in our case, these scammers quickly learned to call our drivers’ switchboard-assigned phone numbers using alternative phone numbers to escape detection. To be effective against an adaptive adversary, it’s important to develop robust features that look at things fraudsters find hard to control or change.

At its core, any counter-fraud measure worth its salt is designed to irrevocably drive the fraudster’s operational costs up to the point where the fraud vector becomes economically unsustainable. In the example above, the high driver contact rate features wouldn’t have driven up the operational costs because the scammers simply shifted to calling the drivers from a burner phone separate from the ones used to create their passenger accounts. What we needed was a way to capture the sequence of behaviors that exposes the recurring pattern of cancellations after the requested rides are dispatched amidst all the other account activity. Designed right, features should be robust: unaffected by variance in the general fraud pattern since the fraudsters’ modus operandi remains the same. In order to escape detection, the scammers will have to do something drastic, such as actually going through with the ride without cancellation. While a couple of obvious methods come to mind, such as using stolen credit cards or doing coupon fraud, both would incur much greater cost.

But easier said than done. Robust features require rich sources of signals that capture behavioral patterns that aren’t easy to defeat. The “classical way” to do this is to ingest as diverse a set of sources as possible into something like a logistic regression model with some interaction terms involved. An example of this sort of feature engineering is hinted at above: we can take, say, the harmonic mean of the cancellation, contact, and (for good measure) the ride non-completion rates. Perhaps we can even throw in a time-based rolling window to make sure that we’re not susceptible to “incubated accounts” with good ride history. If it’s not yet apparent that designing robust features is hard, recognize that this is but one of hundreds of fraud vectors used by dozens of highly adaptive fraud rings that actively target Lyft’s various product lines.
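As a rough illustration of the hand-engineered feature described above, here is a minimal Python sketch. The field names (`cancelled`, `contacted_driver`, `not_completed`), the ride-count window, and the default window size are assumptions made for the example, not our production definitions.

```python
def harmonic_mean(values):
    """Harmonic mean of non-negative rates; zero if any rate is zero."""
    if any(v == 0 for v in values):
        return 0.0
    return len(values) / sum(1.0 / v for v in values)


def rate(rides, flag, window):
    """Fraction of the most recent `window` rides (window > 0) with `flag` set."""
    recent = rides[-window:]
    if not recent:
        return 0.0
    return sum(1 for ride in recent if ride[flag]) / len(recent)


def risk_score(rides, window=20):
    """Harmonic mean of the three rates over a rolling window of rides."""
    flags = ("cancelled", "contacted_driver", "not_completed")
    return harmonic_mean([rate(rides, flag, window) for flag in flags])


# Hypothetical "incubated" account: a long good history, then scam behavior.
good = {"cancelled": False, "contacted_driver": False, "not_completed": False}
bad = {"cancelled": True, "contacted_driver": True, "not_completed": True}
incubated = [good] * 100 + [bad] * 10

print(risk_score(incubated, window=10))   # looks only at recent behavior
print(risk_score(incubated, window=110))  # diluted by the good history
```

The harmonic mean is only high when all three rates are high at once, which is why it works as a crude interaction term. The short window is what keeps a long, clean history from masking the recent pattern.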

Shifting attention

To improve our fraud decisioning, we started focusing on more modern, powerful modeling methods in the past year. For instance, we shipped gradient-boosted decision tree (GBDT) ensemble models that vastly improved our performance due to the decision tree’s inherent ability to capture interactive effects between multiple features. That meant that we could spend more time exploring feature sources that weren’t directly predictive of fraud but improved our models in concert with our existing feature set. Hand-engineering features that were independently correlated with our fraud labels (necessary for, say, a Naive Bayes classifier) thus became less important than finding the right combination of features. Our success with, and migration to, GBDT models was the start of our pursuit of better machine learning modeling approaches that make the most of the signals available to us.

More recently, we’ve shifted our attention to neural networks, which are even more powerful and can gracefully work with far richer streaming data sources. It wasn’t that we weren’t aware of these data sources. Often, our risk analysts were already poring over things like user activity logs and financial transaction histories in manual account reviews. We’ve even crafted some pretty predictive features around them for our GBDTs. The issue was that with most “shallow learning” methods, we’ve had to accept losing part of the information when transforming these sources into the structured features that work with these methods. For instance, even though GBDTs can capture the interactions between the cancellation, contact, and ride non-completion rates without the need for something like the harmonic mean, we couldn’t express the temporal relationship between these events. Part of the promise of neural networks was their ability to extract even more of the information inherent in these signals by working directly with “less processed” features in their more natural, sequential forms.

Automating feature engineering through architecture engineering

To handle complicated sequential signals, we explored various neural network architectures that gave the model dynamic temporal behavior for a time sequence. In other words, we wanted a deep learning model that was able to update its belief about a user as more information streams in. And while we did initially think of alternative methods, such as n-grams to capture particular action subsequences and hidden Markov models (HMMs), they didn’t seem appropriate. In the former case, the number of n-grams to capture important subsequences would be the permutative complexity of all sequences — too expensive and difficult to maintain as we constantly improve our product. In the latter case, there isn’t an obvious way to sidestep the non-Markovian nature of user activity with respect to whether a user is fraudulent. For instance, if a user exhibits suspicious log-in activity early on that isn’t strictly indicative of fraud, it’s not obvious how to preserve that information as we observe more user activity.
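To make the n-gram cost concrete, here is a small Python sketch that extracts action n-grams from a log and counts how many distinct n-grams are possible. The action names and the vocabulary size of 50 are illustrative assumptions.

```python
def ngrams(actions, n):
    """All length-n windows of consecutive actions, in order."""
    return [tuple(actions[i:i + n]) for i in range(len(actions) - n + 1)]


def possible_ngrams(vocab_size, n):
    """Number of distinct n-grams over a vocabulary of `vocab_size` actions."""
    return vocab_size ** n


# Hypothetical activity log.
log = ["open_app", "request_ride", "contact_driver", "cancel_ride", "request_ride"]
print(ngrams(log, 3))

# The candidate feature space grows exponentially in n.
for n in range(1, 5):
    print(n, possible_ngrams(50, n))
```

With only 50 action types, trigrams already span 125,000 candidates, and every new product feature that adds an action type inflates the count further.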

Reviewing the literature on modeling sequential data quickly pointed us to deep learning as the state-of-the-art method. It also seemed at the time to be the most practical way to marry sequential features with our existing structured features, given our earlier work on natural language processing (NLP) tasks with our Support Experience team. Practically, it was also easier for things like serialization. The alternative, e.g., running an HMM on top of a GBDT model, would have required significant changes to our existing model serving infrastructure to accommodate arbitrary ML model stacks.

To provide intuition into how neural networks work and insight into how we work with them, we dive deep 🤓 into how we use one of the richest signal sources for fraudulent patterns: the user’s activity log. Specifically, the activity log is a temporally ordered sequence of user actions taken on our app along with their various metadata. These user actions range from ride request button presses to map-magnifying screen pinching. The action metadata include the duration of the action, the time elapsed since the previous action, and the force applied on the phone screen by the user. Because the activity log is one of the most voluminous event streams we have at Lyft, it is impractical to take the classic approach of handcrafting features from it, and we had to turn to a deep learning approach that benefited from the scale of the available data. And that meant finding the best neural network architecture for our use case; i.e., architecture engineering.
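For readers who prefer code, one way to picture an activity log and how it might be turned into model input is sketched below in Python. The class, field names, reserved ids, and padding scheme are assumptions for illustration, not our actual schema.

```python
from dataclasses import dataclass
from typing import Dict, List

PAD, UNKNOWN = 0, 1  # reserved ids (hypothetical convention)


@dataclass(frozen=True)
class UserAction:
    name: str                # e.g. "request_ride"
    duration_s: float        # how long the action took
    since_previous_s: float  # time elapsed since the previous action
    touch_force: float       # force applied on the screen


def build_vocab(logs: List[List[UserAction]]) -> Dict[str, int]:
    """Assign an integer id to every action name seen in the logs."""
    names = sorted({action.name for log in logs for action in log})
    return {name: i + 2 for i, name in enumerate(names)}


def encode(log: List[UserAction], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Keep the most recent `max_len` actions (max_len > 0) as ids, padded at the end."""
    ids = [vocab.get(action.name, UNKNOWN) for action in log[-max_len:]]
    return ids + [PAD] * (max_len - len(ids))


log = [
    UserAction("request_ride", 0.2, 5.0, 0.4),
    UserAction("contact_driver", 30.0, 12.0, 0.3),
    UserAction("cancel_ride", 0.1, 2.0, 0.5),
]
vocab = build_vocab([log])
print(encode(log, vocab, max_len=5))  # [4, 3, 2, 0, 0]
```

The integer ids are what an embedding layer would consume. The numeric metadata could be fed alongside them as extra per-step features.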

Neural network alchemy

As with most applications of deep learning, our approach was largely empirical. To find the best performing model, we took heavy inspiration from surveys and recent papers on deep learning techniques for NLP and searched over thousands of neural network architectures using automated cross-validation jobs. The search ranged from small changes, such as specific activation functions and embedding dimensions, to larger ones, such as the order of the network layers and specific architectures proposed in the literature. In the end, we settled on a neural network that maps user actions to feature embeddings and feeds them through a convolutional-recurrent architecture with an attention mechanism. Fancy. 😎

User action embedding layer

User action embeddings are dense vectors that encode the semantics of each specific user action. Our usage is inspired by word embeddings commonly used in NLP applications, such as GloVe embeddings and word2vec. Like word embeddings, each dimension in the embedding space encodes some property of the set of all user actions. For instance, one dimension could encode how likely an action is related to user log-in and another could encode how much keystroke input is needed for the action. In our case, we noticed that “similar” ride cancellation-related interactions are clustered together away from the ride request one when doing a t-SNE visualization on our top 50 user actions. These semantic clusterings also acted as a sanity check against our training pipeline.

We observe a nearest neighbors-type clustering effect of similar actions in a t-SNE visualization. And while we posit that there should be linear substructures similar to what is observed in GloVe and word2vec embeddings, we didn’t dwell too much on it.
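The clustering effect can be mimicked with a toy lookup table in Python. The 2D vectors below are made up by hand for illustration; real embeddings are learned during training and have many more dimensions.

```python
import math

# Hypothetical, hand-set 2D embeddings (not learned values).
EMBEDDINGS = {
    "request_ride": [0.9, 0.1],
    "cancel_ride": [-0.8, 0.5],
    "confirm_cancel": [-0.7, 0.6],
}


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(action):
    """Most similar other action in embedding space."""
    others = [name for name in EMBEDDINGS if name != action]
    return max(others, key=lambda name: cosine(EMBEDDINGS[action], EMBEDDINGS[name]))


print(nearest("cancel_ride"))  # confirm_cancel
```

A check like this on learned embeddings is the kind of sanity test mentioned above: semantically related actions should end up as neighbors.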

Convolution network

The 1D convolutional network (ConvNet) forms the second component of our neural network. True to its name, the 1D ConvNet is built around the idea of 1D convolutions, where trainable filters are convolved with the sequence of embedded user actions.

Example of a 1D convolution between a 3-filter and a sequence of scalar values. In a ConvNet, the filter weights are trainable and the sequence is an input.
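The same operation in a few lines of Python, as a minimal sketch with fixed weights. As in most deep learning libraries, the "convolution" here slides the filter without flipping it (strictly, a cross-correlation), and no padding is applied.

```python
def conv1d(sequence, kernel):
    """Slide `kernel` over `sequence` and take a dot product at each position."""
    k = len(kernel)
    return [
        sum(w * x for w, x in zip(kernel, sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    ]


# A 3-filter over a sequence of scalars; in a ConvNet the weights are learned.
print(conv1d([1, 2, 3, 4, 5], [1, 0, -1]))  # [-2, -2, -2]
```

A sequence of length 5 and a filter of width 3 yield 3 outputs, one per position where the filter fully overlaps the sequence.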

Convolutional filters with learnable parameters help extract similar local features across multiple locations and encode subsequences of user actions that together form more meaningful local interactions. For instance, a single ride request action doesn’t mean much by itself. But when considered together with repeated ride cancellations and requests that precede it, the subsequence of repetitive user actions paints a much more suspicious picture of the user. One way to think about this idea is: user action subsequences are to user activity logs as word phrases are to sentences.

Another way to build intuition for ConvNets on user activities is to compare them with learning user action n-gram embeddings, where n is the size of an analogous convolution filter. Both try to encode the semantics of consecutive user actions. But compared to the n-gram embedding, processing single user action (1-gram) embeddings with a convolution layer of n-filters reduces the number of parameters. The reduction comes from not having to learn an embedding for every unique n-gram, which would also require enough observations of every possible n-gram. Letting a small number of n-filters interact with, say, fully-connected layers can help us capture the same amount of information as a one-to-one embedding mapping without the high sample complexity.
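A back-of-the-envelope comparison of the two parameter counts, sketched in Python. The vocabulary size, embedding dimension, and number of filters are hypothetical values chosen for the arithmetic, not our production settings.

```python
def ngram_embedding_params(vocab_size, n, embed_dim):
    """One embedding vector per possible n-gram."""
    return (vocab_size ** n) * embed_dim


def conv_params(vocab_size, n, embed_dim, num_filters):
    """One embedding per single action, plus `num_filters` filters of width n."""
    embedding = vocab_size * embed_dim
    filters = num_filters * (n * embed_dim + 1)  # +1 for each filter's bias
    return embedding + filters


print(ngram_embedding_params(vocab_size=50, n=3, embed_dim=16))        # 2000000
print(conv_params(vocab_size=50, n=3, embed_dim=16, num_filters=32))   # 2368
```

Under these assumed sizes, the convolutional route needs roughly three orders of magnitude fewer parameters, and none of them is tied to a specific n-gram having been observed.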

To further improve our sample efficiency, we stack multiple convolutional layers such that each learns abstractions of user action subsequences in a hierarchical fashion. This approach allows our network to be even more expressive with fewer parameters than simply using a large convolutional layer after an embeddings layer.

Example of a sequence of 2D user action embeddings that goes through two layers of 1D 2-filter convolutions. In this case, we get generalized versions of 4-grams built from bigrams at the last ConvNet layer. This figure is adapted from lecture slides from Stanford’s CS224d.

Recurrent network

In traditional sequence-analyzing models (think HMMs), the probability is usually conditioned on a window of preceding elements. To make model training tractable, these models often make simplifying assumptions such as the Markov assumption.

Here, we make the simplifying Markovian assumption that the ith element in the sequence is conditionally dependent only on the n preceding elements (the n-gram).
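In symbols, the assumption can be written as follows. The notation is ours for this sketch: x_i denotes the ith element of a sequence of length T, and the context is truncated at the start of the sequence when i ≤ n.

```latex
P(x_1, \dots, x_T)
  = \prod_{i=1}^{T} P(x_i \mid x_1, \dots, x_{i-1})
  \approx \prod_{i=1}^{T} P(x_i \mid x_{i-n}, \dots, x_{i-1})
```

The first equality is the chain rule of probability and is exact. The approximation is where the Markov assumption enters.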

These models typically achieve better performance with higher order n-grams, some Laplace smoothing, and backing off to lower order n-grams when the higher order ones haven’t been observed. This classical approach usually requires a lot of n-grams and huge memory resources. For instance, in Heafield et al.’s state of the art work for NLP applications, “[u]sing one machine with 140 GB RAM for 2.8 days, [they] built an unpruned model on 126 billion tokens.” This approach would not have been practical for us. 😅
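For a sense of what the classical approach looks like at toy scale, here is a Python sketch of a bigram model over user actions with Laplace (add-one) smoothing. It omits back-off and higher-order n-grams, and the action names are hypothetical.

```python
from collections import Counter


def train_bigram(sequences):
    """Count how often each action follows each other action."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        vocab.update(seq)
        for prev, cur in zip(seq, seq[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, cur)] += 1
    return context_counts, bigram_counts, vocab


def prob(cur, prev, context_counts, bigram_counts, vocab):
    """P(cur | prev) with add-one smoothing so unseen bigrams get nonzero mass."""
    return (bigram_counts[(prev, cur)] + 1) / (context_counts[prev] + len(vocab))


logs = [["request_ride", "cancel_ride", "request_ride", "cancel_ride"]]
counts = train_bigram(logs)
print(prob("cancel_ride", "request_ride", *counts))   # 0.75
print(prob("request_ride", "request_ride", *counts))  # 0.25
```

Every additional order of n multiplies the size of the count tables, which is where the memory cost quoted above comes from.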

Recurrent neural networks (RNNs) are one way to condition the probability on all the previous elements in a sequence using a parametric (and somewhat blackbox-ish) function. Very (very) roughly, the RNN “memorizes” the sequence of user actions observed so far and saves it as a hidden state within the “memory cell.” As new actions are sequentially ingested, the cell considers the hidden state together with the current user action to update its “memory.” To obtain the “probability,” the cell considers the hidden state together with the current user action and outputs an estimate.
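The update loop can be sketched in plain Python with the simplest kind of memory cell (an Elman-style cell with a tanh nonlinearity). The weights below are hand-set for illustration; in practice they are learned, and the cell is usually a gated variant.

```python
import math


def matvec(matrix, vector):
    """Matrix-vector product on plain lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def rnn_step(x, h, W_x, W_h, b):
    """New hidden state from the current input and the previous hidden state."""
    from_input, from_memory = matvec(W_x, x), matvec(W_h, h)
    return [math.tanh(i + m + bias) for i, m, bias in zip(from_input, from_memory, b)]


def run_rnn(inputs, W_x, W_h, b):
    """Feed inputs one at a time; return the hidden state after each step."""
    h = [0.0] * len(b)
    states = []
    for x in inputs:
        h = rnn_step(x, h, W_x, W_h, b)
        states.append(h)
    return states


def estimate(h, w_out, b_out):
    """Squash the hidden state into a (0, 1) score with a sigmoid readout."""
    z = sum(w * v for w, v in zip(w_out, h)) + b_out
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical weights and two toy 2D action embeddings.
W_x = [[0.5, -0.3], [0.1, 0.8]]
W_h = [[0.9, 0.0], [0.0, 0.9]]
b = [0.0, 0.0]
a, c = [1.0, 0.0], [0.0, 1.0]

forward = run_rnn([a, c], W_x, W_h, b)
backward = run_rnn([c, a], W_x, W_h, b)
print(forward[-1], backward[-1])  # same actions, different order, different memory
print(estimate(forward[-1], w_out=[1.0, 1.0], b_out=0.0))
```

Because the hidden state is carried forward at every step, the final state depends on the whole sequence and its order, not just on a fixed window of recent actions.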

Memory cells come in all shapes and forms. While it’s important to be aware of the more popular variants (e.g., LSTMs and GRUs), we focus here on the insight that forms the architectural foundation of the RNN.

Intuitively, when the inputs from the preceding ConvNet are passed into the RNN in our neural network, it determines how much of the information about user action subsequences should be retained for future consideration. It allows us to efficiently encode a temporal relation between earlier subsequence embeddings with later ones. To further improve on the RNN component, we experimented with a few ideas and augmented it with an attention mechanism.
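One common form of attention scores each hidden state against a learned query vector, turns the scores into weights with a softmax, and returns the weighted average of the states. The Python sketch below assumes that dot-product variant purely for illustration; it is not a description of the exact mechanism in our model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(states, query):
    """Weight each hidden state by its similarity to `query` and average them."""
    scores = [sum(h_j * q_j for h_j, q_j in zip(h, query)) for h in states]
    weights = softmax(scores)
    context = [
        sum(w * h[j] for w, h in zip(weights, states))
        for j in range(len(query))
    ]
    return weights, context


# Hypothetical RNN hidden states for three time steps, and a hand-set query.
states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights, context = attend(states, query=[0.0, 2.0])
print(weights)  # the second state gets the largest weight
print(context)
```

The weights also give a rough view of which parts of the activity log the model leaned on, which can help when reviewing a flagged account.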

Putting it all together

Our behavior fingerprinting neural network is implemented as a stack of the embedding layer, ConvNet, and RNN, in that order, on TensorFlow through the Keras interface. We concatenate the RNN’s output with the structured features and pass the result through fully-connected layers that return a softmax multi-class output, which gives the probability assigned to each possible fraud user segment. On a per-model basis, we found that adding the behavior fingerprinting module to our production structured features-only neural network produced a relative lift in recall of over 40% at the same precision with respect to all fraudulent users.
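The final step (concatenate, fully-connected layer, softmax) can be sketched in plain Python as follows. This is a simplified stand-in with a single layer and hand-set weights rather than our Keras code, and the segment names are hypothetical.

```python
import math

SEGMENTS = ["good_user", "account_takeover", "payment_fraud"]  # hypothetical


def dense(x, weights, bias):
    """Fully-connected layer: one weighted sum per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(rnn_output, structured_features, weights, bias):
    """Concatenate both feature groups and map them to segment probabilities."""
    combined = rnn_output + structured_features  # list concatenation
    return dict(zip(SEGMENTS, softmax(dense(combined, weights, bias))))


# Hand-set weights: one row per segment, one column per combined feature.
weights = [
    [-1.0, 0.0, -1.0, 0.0],
    [2.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
]
bias = [0.0, 0.0, 0.0]

probs = classify([0.9, -0.2], [1.0, 0.0], weights, bias)
print(max(probs, key=probs.get))  # account_takeover
```

Concatenating before the fully-connected layers lets the network learn interactions between the sequential fingerprint and the structured features, instead of scoring each source separately.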

Bringing out the big GANs

Historically, we’ve always had to first identify the fraudulent pattern and, in turn, use that to train our models. Today, not only are we fingerprinting fraudulent behavior, we’re also working on learning good user behavior to detect when someone is deviating from it. To that end, we’re developing models that use a semi-supervised version of the generative adversarial networks algorithm to detect anomalous user embeddings.

At a high level, we’re looking at ways to automatically encode human intuition about what’s not fraudulent in a user account. As opposed to our older approach of purely building discriminative models, we’re working on a generative model of the good user distribution. We indirectly sample fraudulent users from the complement of the good user distribution and use it to train a discriminative model with our existing true fraud targets. This approach is similar to what is described in Dai et al.’s work on the BadGAN. The hope is that understanding good behavior better can help us use it to protect and even reward good users.

Further reading

If you enjoyed this post, follow and recommend! And while we’re big fans of deep learning (in the right context), that’s not all that we do. To learn more, check out our other Research Science and Lyft Fraud posts!

As always, Lyft is hiring! If you’re passionate about developing state of the art machine learning models or building the infrastructure that powers them, read more about our Research Science and Engineering roles and reach out to me!

This post would not have been possible without the help of Yaniv Goldenberg, Vinson Lee, Patrick LaVictoire, Cam Bruggeman, Josh Cherry, Ryan Lane, Elaine Chow, and Will Megson. Many thanks!