MLB Pitch Classification
The history of MLB’s pitch classification system and a deep dive into recent enhancements.
Written by Cory Schwartz and Sam Sharpe
When you’re watching your favorite pitcher in your local ballpark, on your local RSN, or on MLB.TV, it’s unlikely that you’d say he threw a pitch at 88 miles per hour with a 2400 RPM spin rate and two inches of vertical movement above average. You’d say he threw a slider. Using data captured by your eyes, your brain has performed a pitch classification.
MLB performs a similar task in real time for nearly 750,000 pitches each season using patented pitch classification neural network software that is customized for every pitcher in Major League Baseball. The automated classifications for each pitch are displayed on Gameday, At-Bat, and MLB.TV, as well as on local and national broadcasts, on the scoreboards and in-stadium displays in all 30 MLB venues, and on Baseball Savant.
Real-time automated pitch classification began during the 2006 Postseason, when MLB launched the first automated pitch tracking system. The technology was expanded to all 30 ballparks by the start of the 2008 regular season. At first there were only two neural networks used for all pitchers, one for lefties and one for righties. But this model produced disappointing results since each pitcher’s repertoire is different, and even the same pitch will look different when thrown by different pitchers. Some pitchers touch 100 MPH with their fastballs, while others sit in the low 90’s; some pitchers throw slow, looping curveballs while others throw theirs with shorter, harder break.
To improve classification accuracy, we made several changes over the course of the 2008 and 2009 seasons: pitcher-specific scaling for velocity to better differentiate fastballs from changeups; adding biasing into the classifications to better reflect pitcher-specific repertoires; and enhancing the neural networks themselves to improve our differentiation of two-seamers vs. four-seamers, cutters vs. sliders, and other similar pitches.
During the 2010 season, we introduced customized neural networks for each pitcher, resulting in more accurate, real-time classifications. Steady improvements in player-specific classification continued over the ensuing years, as we better learned the nuances of pitcher-specific repertoires and of the data itself. Finally, enhancements in 2018-2019 and planned for 2020 incorporate a machine-learning-driven pitch arsenal detection system and an improved neural network framework, implemented in a modern code base, that further strengthens the classifier. So how does this work?
The neural network software used to classify pitches utilizes a library of manually-classified training pitches for each pitcher to define his specific repertoire. As each pitch is thrown live, the neural network compares its velocity, spin rate, and movement to those properties of previous pitches for that pitcher and chooses the highest likelihood pitch type in his repertoire.
We can re-train the neural network for any pitcher as frequently as necessary throughout the season to incorporate new pitches, and/or to adapt to significant changes in the pitcher’s velocity, spin or movement of any particular pitch type. And, neural networks can be re-trained as needed during live games, with changes taking effect immediately in-game.
We classify each pitch in each pitcher’s repertoire as the pitcher himself calls it; for instance, a pitch that one pitcher calls a cutter may be called a slider by another, even if the velocity, spin, and movement of both pitches are essentially identical. Direct quotes from the pitchers themselves, as well as from pitching coaches, managers, and other on-field staff, are used to verify each pitcher’s repertoire along with images of the pitchers using different pitch grips.
Two-seam fastballs and sinkers are generally labeled interchangeably unless the pitcher specifically refers to one term or the other. Knuckle-curves (or “spike” curves) are specifically identified, as opposed to a traditional grip, when that particular grip can be confirmed.
Pitches for rookies with no previous MLB experience are classified using a generic neural network, one each for righties and lefties. We then customize each such pitcher’s repertoire repeatedly until the neural network has enough training pitches to accurately and consistently classify his specific repertoire.
Several factors can result in inaccurate and/or inconsistent classifications:
- Pitches that behave very similarly and are hard to distinguish from one another
- Tracking system calibration issues or interference
- Changes in the pitcher’s velocity, spin rate and/or pitch movement
- Previously unseen pitch types in the pitcher’s repertoire
- Overly noisy or inaccurate training data
Pitches that are most similar in terms of velocity, spin, and/or movement are those most commonly confused with each other, particularly these pairs:
- Two-seam fastball (FT)/Sinker (SI) vs. four-seam fastball (FF)
- Four-seam fastball vs. cutter (FC)
- Cutter vs. slider (SL)
- Slider vs. curveball (CU)/knuckle curve (KC)
Split-finger fastballs (FS) and changeups (CH) behave very similarly and can be very difficult to distinguish for those pitchers who throw both. And, depending on their velocity, changeups can be difficult to differentiate from fastballs for some pitchers.
For reference, other common pitch abbreviations not mentioned above: Eephus (EP), Knuckleball (KN)
Some pitchers’ repertoires are very distinct and easy to classify, such as Nathan Eovaldi (pitches shown from October 2018):
Note the clear and distinct separation between all pitch types.
Other pitchers’ repertoires are much harder to classify, such as Felix Hernandez:
Note the lack of clear and distinct separation between the four-seam fastballs and sinkers, and between the sinkers and changeups.
2019 Improvements: New Pitch Identification & Misclassification
As noted above, pitchers are constantly tweaking their repertoires, and unfortunately, they don’t call us and let us know they are going to start throwing a new pitch the next day. We needed a way to catch these changes automatically, so we could reduce the manual work that goes into spotting them. For each pitcher, we ask three questions:
- Has the pitcher added a new pitch to his arsenal?
- Have we defined the pitcher’s arsenal incorrectly (e.g., did we label all of his sliders as cutters)?
- Is a single pitch labeled incorrectly?
We will cover the first two below.
Pitch Arsenal Detection
We approach the first two questions similarly. As we started to test out our method, it surfaced the cutter Marco Gonzalez added at the beginning of 2018. He is a great example of how this identification process has helped us.
Here is his arsenal pre-2018:
Our first goal was to build a classifier that would be able to classify pitches on a general league-wide level since these new or mislabeled pitch types are not encoded in our player level neural network. After exploring various methods, we used gradient boosting, specifically XGBoost.
We use various tracking metrics to help classify pitches including horizontal/vertical break, velocity, spin rate, spin axis and where the ball is released.
Using these tracking metrics alone gets us most of the way there, but for pitchers who deviate from the norm, we have a harder time classifying their pitches. For example, the league model might classify a 90 mph changeup from Jordan Hicks as a fastball, but we know he throws a 100+ mph fastball. To “personalize” these classifications without knowing the pitcher’s identity, we also add scaled versions of these measurements using rolling metrics. We solve the above example by adding a scaled pitch velocity:
Now, given a scaled velocity of about 0.5, it is easier to determine that the 90 mph changeup is in the middle of Hicks’ velocity range, and thus, is probably not a fastball. We apply similar scalings to other metrics to define the periphery of a pitcher’s arsenal.
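One way to compute such a rolling scaled velocity is a min-max scale over a trailing window, a sketch in which the window length and fallback value are our own illustrative choices, not the production settings:

```python
import pandas as pd

def rolling_scaled_velocity(velocities, window=200, min_periods=20):
    """Scale each pitch's velocity into the pitcher's recent range:
    0 = slowest recent pitch, 1 = fastest, 0.5 = middle of the range."""
    s = pd.Series(velocities, dtype=float)
    lo = s.rolling(window, min_periods=min_periods).min()
    hi = s.rolling(window, min_periods=min_periods).max()
    # Fall back to mid-range until enough history has accumulated
    return ((s - lo) / (hi - lo)).fillna(0.5)
```

On this scale, a 90 mph Hicks changeup lands near 0.5, while the same 90 mph pitch from a soft-tossing starter lands near 1.0.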
We perform other data manipulation tricks such as normalizing all horizontal break and spin axis values to be right-handed. There is no need for the model to have to learn the differences between L/R handed pitchers if we already can encode that ourselves.
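For example, a minimal version of the handedness normalization might look like the following; the exact mirroring convention depends on how spin axis is defined, so treat the formulas as illustrative:

```python
def normalize_to_right_handed(horizontal_break, spin_axis_deg, is_lefty):
    """Mirror a left-handed pitcher's pitch so the model only ever sees
    right-handed shapes: flip horizontal break and reflect the spin axis."""
    if is_lefty:
        return -horizontal_break, (360.0 - spin_axis_deg) % 360.0
    return horizontal_break, spin_axis_deg
```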
Many pitch types can be paired up since they are mostly just naming conventions (sinker/two-seam fastballs, knuckle curve/curve, splitter/change-up), so we group these pitch types together into one classification.
Now back to our example. After Gonzalez’s first start in 2018 we see that his arsenal looks a bit different. Note the small cluster of new pitches between 85 and 90 mph with little horizontal break and similar vertical break to his fastball.
At this point in the process we have no assumptions about the pitch types in his arsenal. We take the set of unknown pitches and classify them with our generic classifier:
A classifier isn’t going to solve the problem by itself. The information we gained from the classifier only tells us that there might be 15 or so cutters mixed into his last 750 pitches. Without manually looking at the data, we can’t tell whether these pitches form their own cluster. It could be that his fastball has more glove-side movement than most pitchers’, and we are mistakenly classifying that subset of fastballs.
Here is where unsupervised learning can help. To determine how many pitch types he throws, we use Gaussian mixture models (GMMs) to cluster the pitches in the same 3D space as the plots shown above. Since this is an automated procedure, we can’t eyeball the plots to choose a good number of clusters (k). Selecting k is sometimes rather ambiguous, and many approaches exist. We simply iterate through a range of pitch arsenal sizes and pick the best one based on BIC (the Bayesian information criterion).
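The cluster-count search can be sketched with scikit-learn; the candidate range of arsenal sizes below is our own illustrative choice:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_arsenal(pitches, k_range=range(2, 8), seed=0):
    """Fit one GMM per candidate arsenal size k and keep the lowest-BIC model."""
    best_gmm, best_bic = None, np.inf
    for k in k_range:
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(pitches)
        bic = gmm.bic(pitches)
        if bic < best_bic:
            best_gmm, best_bic = gmm, bic
    return best_gmm
```

The chosen model's n_components is the estimated arsenal size, and its predict method gives per-pitch cluster assignments.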
Applying this process to Marco Gonzalez’s pitches we get the following clusters:
OK, Gonzalez has four pitch types; now what? We take these cluster assignments, align them with the classification probabilities from our supervised model, and sum up the likelihoods in each cluster.
Each cluster now has a probability distribution over all pitch types, and we can simply assign the maximum-likelihood pitch type to that cluster (e.g., cluster 3 = FC, cluster 2 = CH/FS, etc.). In this case, the first cluster is slightly ambiguous. If you have ever used these types of methods, you know that they don’t work out this perfectly every time. Therefore, we implement some business rules to determine whether we should add clusters and help our recall of new or mislabeled pitches:
- If a pitch type makes up at least 40% of a cluster, add a cluster and assign it that pitch type.
- If the total probability of a pitch type summed across clusters is at least 50%, add a cluster and assign it that pitch type.
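The cluster-to-pitch-type assignment can be sketched as below. For simplicity it implements only the first rule, and the array shapes and threshold handling are our own illustrative choices:

```python
import numpy as np

def label_clusters(cluster_ids, probs, pitch_types, frac_thresh=0.40):
    """cluster_ids: (n_pitches,) GMM assignments; probs: (n_pitches, n_types)
    probabilities from the league classifier. Sum the likelihoods within each
    cluster, take the maximum-likelihood pitch type, and also keep any pitch
    type holding at least `frac_thresh` of a cluster's probability mass."""
    arsenal = set()
    for c in np.unique(cluster_ids):
        totals = probs[cluster_ids == c].sum(axis=0)  # likelihood mass per type
        frac = totals / totals.sum()
        arsenal.add(pitch_types[int(frac.argmax())])
        arsenal.update(p for p, f in zip(pitch_types, frac) if f >= frac_thresh)
    return arsenal
```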
Finally, we compare the current arsenal to our estimated arsenal and send an alert to prompt re-training if there are any differences.
Technology & Automation
We code this process into Python modules and package it as an Airflow job. The job compiles reports and sends a daily email alerting us to any detected changes or errors. We have evolved these reports to include links to generated plots (like those above) and options to ignore alerts for specific pitchers, which is very useful for pitchers like José Berríos, whose pitch movement doesn’t quite match what he calls the pitch.
2020 Improvements: Automated Classification
We set out with a couple of goals to improve the current automated classification procedure in the ballpark.
Update the codebase
We wanted to bring all machine learning under the umbrella of the ML team. Updating the codebase to Python fits well with our workflow, and we can also switch from custom neural network modules to open-source ML libraries, specifically TensorFlow.
Improve accuracy for debuting pitchers
When pitchers debut we often don’t have a trained network for their arsenal, so we use a generic neural network or we initialize with a similar pitcher’s network. We want a simple way to improve the prior knowledge in the network for debuts.
Improve accuracy for all pitchers
Obviously, we would like to accurately classify all pitchers, so we pick features and architectures that help us achieve better accuracy.
PitchNet: New Pitch Classification Neural Network
When developing our new system, PitchNet, we hypothesized that a centralized network would help performance, since each pitcher’s network wouldn’t have to relearn the concept of a “fastball”. However, we still need some way to personalize the predictions for each pitcher. We use embeddings, low-dimensional vector representations, to model pitchers, much like how word embeddings are used in modern natural language processing.
Modeling pitcher embeddings combined with our primary tracking data and a simple single-hidden-layer network improves our evaluation metrics for previously seen pitchers and yields modest improvements for debuting pitchers.
The architecture has other benefits besides improving performance. When we train individual pitchers, we can freeze the weights of our network and only fine-tune our pitcher embeddings on more recent data. When pitchers debut, we now have simple low-dimensional representations of our pitchers that we can average together to create better initializations for debuting pitchers. While training the network, we also randomly mask pitcher IDs with a generic pitcher ID so we have a quality “generic pitcher” embedding to fall back on. Let’s say we know a prospect has a high-90s fastball, but don’t know much about his other pitches. We can average a generic embedding with Aroldis Chapman’s embedding, and now we already have a better prior for his repertoire than a generic embedding on its own.
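A minimal Keras sketch of this kind of architecture follows; the layer sizes, embedding dimension, and reserved generic ID are our own illustrative choices, not the production configuration:

```python
import numpy as np
import tensorflow as tf

n_pitchers, emb_dim, n_features, n_types = 1000, 8, 6, 12
GENERIC_ID = n_pitchers  # extra embedding row reserved for the "generic pitcher"

# Pitcher identity enters through an embedding; tracking data enters directly
pitcher_id = tf.keras.Input(shape=(), dtype="int32", name="pitcher_id")
tracking = tf.keras.Input(shape=(n_features,), name="tracking")
emb = tf.keras.layers.Embedding(n_pitchers + 1, emb_dim, name="pitcher_emb")(pitcher_id)
x = tf.keras.layers.Concatenate()([emb, tracking])
x = tf.keras.layers.Dense(64, activation="relu")(x)  # single hidden layer
out = tf.keras.layers.Dense(n_types, activation="softmax")(x)
model = tf.keras.Model([pitcher_id, tracking], out)

# Debut initialization: average a similar pitcher's embedding with the generic one
weights = model.get_layer("pitcher_emb").get_weights()[0]
debut_id, similar_id = 42, 7  # hypothetical IDs for illustration
weights[debut_id] = (weights[similar_id] + weights[GENERIC_ID]) / 2
model.get_layer("pitcher_emb").set_weights([weights])
```

Freezing every layer except pitcher_emb then gives the per-pitcher fine-tuning described above.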
We now also have all the fun byproducts of low dimensional representations! For example, we can visualize pitcher arsenal similarity with t-SNE:
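For instance, with scikit-learn (a sketch using random stand-in embeddings; in practice you would feed the trained pitcher embedding matrix):

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-in for the learned (n_pitchers, emb_dim) embedding matrix
embeddings = np.random.default_rng(0).normal(size=(200, 8))

coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
# coords has shape (200, 2); scatter-plot it, coloring points by arsenal,
# to see pitchers with similar repertoires land near each other
```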
To evaluate performance, we back-tested PitchNet against archived predictions from the current network on ~183K pitches from July-August 2018. We trained PitchNet on historical data predating the archived predictions and compared two metrics on the held-out pitches.
- Accuracy — How often does the model predict the correct pitch? (The archived data doesn’t include probability distributions, so we cannot compare any entropy-based measures.)
- In-Arsenal % — How often does the model predict a pitch in the pitcher’s arsenal?
We first looked at “known pitchers”: pitchers who pitched in 2017. Note that each model has possible advantages in this evaluation. Over these two months, the current network could have been updated to reflect changes in repertoires, while PitchNet was fixed, using only historical pitches. We use current pitch classifications as ground truth (and for training PitchNet), so we may have retroactively changed some classifications based on new information about a pitcher’s name for that pitch, and not necessarily because the network labeled it incorrectly at the time of prediction.
Using all debut games from 2017–2018, we also compared the current network to PitchNet trained without these pitchers. Remember that the current network can be re-trained in-game with human prior knowledge, so we expect it to have a higher in-arsenal percentage, and it is promising to see commensurate accuracy.
PitchNet’s new initialization abilities mean we can average the embeddings of a handful of similar known pitchers as a starting point for a debuting pitcher. Picking “similar” pitchers for initialization is manual and subjective, so we needed a more automatic way of testing this feature. We aggregated all available prospect pitch grades for pitchers in 2017–2018 and computed similarity scores between debuting and known pitchers’ grades to select the most similar known pitchers. The initialization procedure increases accuracy and in-arsenal percentage by almost 4% and 5%, respectively.
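The similarity scoring can be as simple as cosine similarity over grade vectors; this is a sketch, and the actual scoring function may differ:

```python
import numpy as np

def most_similar(debut_grades, known_grades, k=3):
    """Return indices of the k known pitchers whose scouting-grade vectors are
    most similar (by cosine similarity) to the debuting pitcher's grades."""
    a = debut_grades / np.linalg.norm(debut_grades)
    b = known_grades / np.linalg.norm(known_grades, axis=1, keepdims=True)
    return np.argsort(b @ a)[::-1][:k]
```

The embeddings of the returned pitchers are then averaged to initialize the debut's embedding.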
We plan to productionalize this new approach to automated pitch classification at the start of the 2020 season and test the old and new networks in parallel to ensure a seamless transition and a quality product.