Quantifying Fan-Player Relationships
Background: In a previous post, we briefly discussed a metric we created to quantify a fan’s avidity for specific Clubs. For that metric, we estimated the strength of the relationship a fan has with every MLB Club based on the fan’s previous behavior. For example, if a fan purchases Red Sox merchandise and attends Red Sox games, they are likely a Red Sox fan. In this post, we describe how we extended that idea from fan-team relationships to fan-player relationships. The league can use this information to understand trends in merchandise purchasing and to reach out-of-market fans, and individual Clubs can use it to customize marketing efforts and better understand their fanbases.
Approach: The difficulty in estimating fan-player relationships, similar to estimating fan-team relationships, lies in the fact that the value of the true relationship is unknown. That is, unlike predicting things like game attendance or whether Christian Yelich will get a hit, where a historical ground truth value is used to build a model, we don’t have a ground truth for how avid our fans are for a given player.
We approached the problem with latent variable techniques, which infer an unobserved variable from observed data. A simple example is inferring the topic of a conversation from a sample of the words in that conversation. That is, if you overheard someone talking and you could only make out the words “milk” and “purr,” you might infer that they were talking about a cat. In a nutshell, that is what we attempted to do. We cannot know for sure whether a fan likes a certain player, but by using a fan’s observable actions (e.g., merchandise purchases, All-Star votes, etc.), we can infer the existence of a relationship between that fan and player.
While there are many approaches to latent variable modeling, we used the one from Xiang et al. (2010), who modeled relationship strengths between individual users of social networks. In that study, the authors used social media connection metrics, such as friend networks, group memberships, and common friends, to model profile similarities, with the goal of estimating the relationship strength between two users. Repurposing this approach with fan-player interaction metrics, we estimated relationship strengths between fans and MLB players.
Data: The features we used can be broken down into two categories:
1) Causal factors — features that may influence a fan’s relationship with a player
2) Observed behaviors — actions that the fan carries out
Causal factors are pre-existing attributes of a fan-player pair that could affect their relationship. One obvious causal factor is fan location: the closer a fan lives to a team, the more likely they are to have relationships with that team’s players than with players from other teams. Other causal factors include the fan’s avidity score for the player’s team, the player’s on-field performance, and the player’s popularity both in the league and within their team. To estimate popularity, we combine a player’s total merchandise sales, total All-Star votes, and social media following.
Observed behaviors are actions we can directly record between a fan and a specific player, namely merchandise sales, All-Star ballot votes, and website views.
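As an illustration of the popularity estimate mentioned above, one simple way to combine the three signals is to standardize each across players and average the results. The post does not say how the signals are actually weighted or scaled, so the equal weights, the z-score scaling, the function names, and the sample numbers below are all assumptions.

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize a list of numbers to mean 0 and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def popularity_scores(players):
    """Combine per-player signals into one popularity score.

    players maps a player name to a tuple of
    (merchandise sales, All-Star votes, social media followers).
    Equal weighting of the standardized signals is an assumption.
    """
    names = list(players)
    # Transpose so that each column holds one signal for all players.
    columns = list(zip(*(players[name] for name in names)))
    standardized = [zscores(list(col)) for col in columns]
    return {
        name: mean(col[i] for col in standardized)
        for i, name in enumerate(names)
    }

# Made-up numbers for three hypothetical players.
example = {
    "Player A": (100, 50, 10),
    "Player B": (10, 5, 1),
    "Player C": (50, 20, 5),
}
example_scores = popularity_scores(example)
```

Standardizing first keeps a signal with large raw values, such as follower counts, from swamping the others.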
Methodology: We have two sets of data (causal factors and observed behaviors) and a latent variable (the strength of the relationships between fans and players) that connects them. We use the causal factors to produce an initial estimate of the latent variable. That estimate is then used to predict the observed behaviors, whose estimates are in turn fed back to update the estimate of the latent variable, and so on until the algorithm converges on a final estimate. This process uses the expectation-maximization algorithm to find the optimal values. Below is the general framework for our process.
First, using the causal factors shared between fans and players, the relationship score is approximated. This initial approximation, estimated using logistic regression, helps us determine the importance, or weight, that each causal factor has on the fan-player relationship.
Next, a Poisson regression model is used to generate estimated weights for the observed behaviors (i.e., fans’ merchandise sales, All-Star votes, and website traffic) based on the estimated relationship score from the initial logistic regression model. These new estimated weights are then adjusted for the total number of interactions of a single fan to ensure that a fan with more interactions does not dominate the model’s behavior. If a fan has no observed behaviors with any player, that fan’s “favorite” players are determined purely by causal factors. Once these values are approximated, the process is repeated until the scores converge.
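To make the alternating procedure more concrete, here is a minimal sketch of a simplified version, not our production model. It assumes the latent relationship is binary (present or absent), uses a logistic model on the causal factors as the prior, and gives each behavior count a Poisson rate per class. The relationship score is the posterior probability that a relationship is present. The function names, the initialization, the learning rate, and the toy data are all assumptions, and the per-fan adjustment for total interactions is left out.

```python
import math

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def logit_from_causal(w, x):
    """Linear score from causal factors; w[-1] is the intercept."""
    return sum(wj * xj for wj, xj in zip(w, x)) + w[-1]

def fit_relationship_em(X, Y, n_iter=50, grad_steps=20, lr=0.5):
    """X: causal-factor rows, Y: behavior-count rows, one per fan-player pair.

    Returns (scores, logistic weights, Poisson rates without a
    relationship, Poisson rates with a relationship).
    """
    n, d, k = len(X), len(X[0]), len(Y[0])
    w = [0.0] * (d + 1)
    means = [sum(row[j] for row in Y) / n + 1e-6 for j in range(k)]
    # Asymmetric start so the two classes can separate (an assumption).
    rate_no = [0.5 * m for m in means]
    rate_yes = [2.0 * m for m in means]
    q = [0.5] * n
    for _ in range(n_iter):
        # E-step: posterior probability of a relationship for each pair.
        for i in range(n):
            log_odds = logit_from_causal(w, X[i])
            for j in range(k):
                # Poisson log-likelihood ratio; the factorial term cancels.
                log_odds += Y[i][j] * math.log(rate_yes[j] / rate_no[j])
                log_odds -= rate_yes[j] - rate_no[j]
            q[i] = sigmoid(log_odds)
        # M-step, Poisson part: weighted mean count per class.
        total_yes = max(sum(q), 1e-9)
        total_no = max(n - sum(q), 1e-9)
        for j in range(k):
            yes = sum(q[i] * Y[i][j] for i in range(n)) / total_yes
            no = sum((1 - q[i]) * Y[i][j] for i in range(n)) / total_no
            rate_yes[j], rate_no[j] = max(yes, 1e-6), max(no, 1e-6)
        # M-step, logistic part: a few gradient steps toward the soft labels q.
        for _ in range(grad_steps):
            grad = [0.0] * (d + 1)
            for i in range(n):
                err = q[i] - sigmoid(logit_from_causal(w, X[i]))
                for j in range(d):
                    grad[j] += err * X[i][j]
                grad[-1] += err
            w = [wj + lr * g / n for wj, g in zip(w, grad)]
    return q, w, rate_no, rate_yes

# Toy data: one causal factor (e.g., team avidity) and two behavior
# counts (merchandise purchases, All-Star votes) per fan-player pair.
X = [[1.0], [0.9], [0.8], [0.1], [0.2], [0.0]]
Y = [[3, 2], [2, 3], [4, 1], [0, 0], [0, 1], [0, 0]]
scores, weights, rate_no, rate_yes = fit_relationship_em(X, Y)
```

Because the logistic weights are updated with a handful of gradient steps rather than solved exactly, this is a generalized form of expectation maximization. A pair with no observed behaviors still receives a score, driven by the causal-factor prior.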
Results: As with team avidity, typical quantitative evaluation methods cannot be used for player avidity because the true strength of the relationship between a player and a fan is unknown. That said, initial results from our approach are promising. Fans who vote for specific players or purchase their merchandise have higher player avidity scores for those players, both relative to other players in the league and relative to other players on that fan’s favorite team.
For example, our team avidity metric tells us that a fan’s favorite team is the Mets because they have purchased tickets to Mets games and view the Mets homepage the most, followed by the Diamondbacks because they have streamed several of their games on MLB.TV. Based on this behavior, our player avidity model is going to give higher weights to all Mets players than Diamondbacks players and higher weights to all Diamondbacks players than other players in the league. Next, this fan voted for Pete Alonso, Jacob deGrom, and Wilmer Flores for the All-Star game, giving each of them higher weights. And finally, this fan also bought a Christian Yelich jersey, increasing his weight.
Overall, our player avidity model ranks this fan’s top 3 players as: 1) Pete Alonso, 2) Jacob deGrom, and 3) Wilmer Flores, with Christian Yelich in the top 10. Knowing this fan’s player avidity ranking allows us to customize how we connect with them. So, given that two of their top three players are Mets, we’ll customize our Mets messaging to them by showcasing these players, and for any broader MLB media (e.g., All-Star messages), we’ll include a combination of their favorite players.
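The kind of lookup described above can be sketched in a few lines. The scores here are made-up placeholders chosen only to reproduce the ordering in the example, and the function name, the score values, and the player-to-team mapping are assumptions for illustration.

```python
def top_players(scores, player_team, n=3, team=None):
    """Return up to n players with the highest scores.

    If team is given, only players mapped to that team are considered.
    """
    candidates = [
        p for p in scores
        if team is None or player_team.get(p) == team
    ]
    return sorted(candidates, key=lambda p: scores[p], reverse=True)[:n]

# Placeholder avidity scores for one fan (not real model output).
fan_scores = {
    "Pete Alonso": 0.95,
    "Jacob deGrom": 0.90,
    "Wilmer Flores": 0.85,
    "Christian Yelich": 0.60,
}
# Team assignments assumed for this illustration.
player_team = {
    "Pete Alonso": "Mets",
    "Jacob deGrom": "Mets",
    "Wilmer Flores": "Diamondbacks",
    "Christian Yelich": "Brewers",
}

league_wide = top_players(fan_scores, player_team, n=3)
mets_only = top_players(fan_scores, player_team, n=3, team="Mets")
```

The league-wide list would feed broader MLB messaging, while the team-filtered list would feed Club-specific messaging.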
MLB has tens of millions of fans, and because each fan could potentially have a relationship with any of hundreds of players, the number of possible fan-player relationships is staggering. Running expectation maximization on such a large data source would be computationally expensive. Luckily, our results suggest that the model can be trained on a subset of the data and still be powerful, which allows us to calculate player avidity scores for our entire fan population.
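The train-on-a-subset pattern can be sketched as follows. How the subset is actually drawn is not described in this post, so uniform random sampling of fans, the batch size, and the `score_fn` callable (a stand-in for a model already fitted on the sample) are assumptions.

```python
import random
from itertools import islice

def sample_fans(fan_ids, fraction, seed=0):
    """Draw a reproducible uniform random subset of fans for training."""
    rng = random.Random(seed)
    k = max(1, round(len(fan_ids) * fraction))
    return rng.sample(fan_ids, k)

def batched(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

def score_population(fan_ids, score_fn, batch_size=1000):
    """Apply an already-fitted scoring function to every fan, in batches."""
    scores = {}
    for batch in batched(fan_ids, batch_size):
        for fan in batch:
            scores[fan] = score_fn(fan)
    return scores

# Toy usage: ten fans, 30% used for training, all ten scored afterwards.
all_fans = list(range(10))
training_fans = sample_fans(all_fans, fraction=0.3, seed=1)
# Placeholder scorer standing in for the fitted model.
all_scores = score_population(all_fans, lambda fan: fan * 0.1, batch_size=4)
```

Sampling whole fans rather than individual fan-player pairs keeps each sampled fan’s interactions together, which matters if weights are adjusted per fan.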
- Xiang, R., Neville, J., & Rogati, M. (2010). Modeling relationship strength in online social networks. Proceedings of the 19th International Conference on World Wide Web (WWW ’10), 981–990. doi:10.1145/1772690.1772790