Clustering 2021 MLB Starting Pitchers
Using k-means clustering to find similarities in pitching style between starters this season
Starting pitchers come in all shapes and sizes. That’s what makes pitching such a unique and beautiful thing. Look at Randy Johnson and Greg Maddux, two starters who dominated during the late 20th century. Yet they were also two pitchers who couldn’t be more dissimilar in the way they pitched.
Some guys spin it more than others. Some guys throw it harder than others. No two pitchers are exactly the same. However, with the use of k-means clustering, we can attempt to find groupings, or shared characteristics, between starters this season.
What is k-means clustering? For those unfamiliar, k-means clustering is an algorithm that partitions a set of observations into k clusters, with each observation belonging to the cluster with the nearest mean. In simple words, it tries to group players based on similarities in their stats and characteristics.
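To make the "nearest mean" idea concrete, here is a minimal sketch of the standard k-means loop (Lloyd's algorithm) in plain Python. It is only an illustration, not the code behind this article: the function name `kmeans`, the toy points, and the choice to start from the first k observations are all assumptions made for the example. Real implementations usually pick random or k-means++ starting centers.

```python
import math


def kmeans(points, k, iterations=100):
    """Plain Lloyd's algorithm: returns (centers, labels)."""
    # Simplification: use the first k observations as starting centers.
    centers = [list(p) for p in points[:k]]
    labels = [None] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest mean.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so we are done
        labels = new_labels
        # Update step: move each center to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster ends up empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Toy data: two obvious groups in a made-up two-stat space.
toy = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centers, labels = kmeans(toy, k=2)
print(labels)
```

The first three toy points end up sharing one label and the last three share the other, which is the same "group by similarity" behavior applied to pitchers in the rest of the article.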
With the use of Alex Stern’s article on k-means and inspiration from Joey DiCresce’s article on clustering NFL cornerbacks, I created my own clustering for 2021 MLB starting pitchers.
To begin the process, I eliminated all starting pitchers with fewer than 90 IP this season. Including SPs below that innings threshold influenced the clusters substantially and restricted accurate analysis. After removing those starters, I decided on the metrics I wanted to base the clustering on. Those metrics are as follows:
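The innings filter can be sketched in a few lines of Python. The rows below are placeholders rather than real 2021 stat lines, and the helper name `innings` is made up for this example. The one detail worth handling is that box scores write partial innings as .1 and .2 (one and two outs), not as decimals.

```python
MIN_IP = 90  # innings threshold used in this article


def innings(ip_text):
    """Convert box-score notation ('89.2' = 89 and 2/3 innings) to a float."""
    whole, _, outs = ip_text.partition(".")
    return int(whole) + int(outs or 0) / 3


# Hypothetical rows: names and innings are placeholders, not real data.
pitchers = [
    {"name": "Pitcher A", "ip": "162.1"},
    {"name": "Pitcher B", "ip": "89.2"},
    {"name": "Pitcher C", "ip": "90.0"},
]

# Keep only starters who reached the threshold.
qualified = [p for p in pitchers if innings(p["ip"]) >= MIN_IP]
print([p["name"] for p in qualified])
```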
K/9, H/9, BB/9, HR/9
GB to FB Ratio
Soft Contact %
wFB/C (Standardized Runs Allowed on Fastball)
wOffspeed/C (Standardized Runs Allowed on Offspeed Pitches)
O-Swing % (Chase Rate)
Zone Contact %
Swinging Strike %
Physical Characteristics (Height & Weight)
Using per-nine pitching stats (K/9, H/9, etc.) helps identify “good” and “bad” pitchers without having to use overall pitching ability stats like ERA, xFIP, or SIERA. I opted to leave those stats out because I decided that clustering pitchers on raw stuff would be a more effective way to measure similarities.
GB to FB ratio helps us identify groundball and flyball pitchers, while Soft% can help do a similar comparison with hard and soft contact pitchers. wFB/C and wOffspeed/C are advanced pitch-by-pitch measures to understand the success of a given pitch (Fastball or Offspeed/Breaking). In addition, O-Swing%, Z-Contact%, and Swinging Strike% allow us to cluster pitchers who induce swings and misses and those who pitch to contact. Finally, I opted to include the height and weight of each pitcher to give us some physical characteristics.
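One practical note on mixing these metrics: k-means works on distances, so a stat measured in pounds can easily outweigh one measured in walks per nine. The article does not say whether the metrics were rescaled before clustering. A common choice, shown here purely as an assumption, is to z-score each column first. The function name `zscore_columns` and the sample rows are hypothetical.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Rescale each column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [
        # A constant column has sigma 0; map it to 0.0 instead of dividing.
        [(x - mu) / sigma if sigma else 0.0
         for x, mu, sigma in zip(row, mus, sigmas)]
        for row in rows
    ]


# Hypothetical rows of (K/9, weight in lb); values are illustrative only.
raw = [(8.0, 180), (10.0, 220), (12.0, 260)]
scaled = zscore_columns(raw)
for row in scaled:
    print([round(value, 3) for value in row])
```

After rescaling, both columns are expressed in standard deviations from the average, so neither dominates the distance calculation just because of its units.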
I decided to create 9 total clusters. Here are the centers for each metric in their respective clusters.
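For anyone unsure what a cluster "center" is, it is simply the average of each metric over the pitchers assigned to that cluster. Labels like "highest" or "second lowest" then come from ranking those averages across clusters. Here is a rough sketch with made-up numbers; the helper names `cluster_centers` and `rank_clusters` are hypothetical and the rows are not real 2021 data.

```python
from collections import defaultdict


def cluster_centers(rows, labels):
    """Average every metric over the members of each cluster."""
    members = defaultdict(list)
    for row, label in zip(rows, labels):
        members[label].append(row)
    return {
        label: [sum(col) / len(group) for col in zip(*group)]
        for label, group in members.items()
    }


def rank_clusters(centers, metric_index, highest_first=True):
    """Order cluster labels by their center value on one metric."""
    return sorted(
        centers,
        key=lambda c: centers[c][metric_index],
        reverse=highest_first,
    )


# Hypothetical (K/9, BB/9) rows with cluster labels; not real data.
rows = [(9.0, 2.0), (11.0, 3.0), (6.0, 2.5), (7.0, 3.5), (12.0, 1.5)]
labels = [1, 1, 2, 2, 3]

centers = cluster_centers(rows, labels)
print(centers)
print(rank_clusters(centers, 0))                       # by K/9, highest first
print(rank_clusters(centers, 1, highest_first=False))  # by BB/9, lowest first
```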
And here are each of the 74 pitchers in their respective clusters.
If you’re wondering why your favorite starting pitcher landed in his respective cluster, here is a bit of an explanation.
Cluster 1 - “Groundball Pitchers”
Highest GB to FB ratio
Second lowest HR/9
Also walks a lot of batters
Highest BB/9
Second lowest Outside Swing %
Cluster 1 sees the highest groundball rate by a pretty wide margin. Because these pitchers tend to keep the ball on the ground, the cluster has a pretty low average HR/9. On the other hand, Cluster 1 seems to be a victim of too many walks, likely a result of the low chase rate these pitchers get from batters. Sandy Alcantara and German Marquez seem to be ideal fits for this group as they rank 2nd and 4th in GB% but 20th and 8th in BB%, respectively, amongst qualified starters.
Cluster 2 - “Prevents Hard Contact, Strong Pitch Arsenal”
Second highest wOffspeed/C
Positive K/9
Second highest Soft Contact %
Third lowest HR/9
The second cluster is arguably one of the best to be in performance-wise. With very few, if any, negatives, this cluster displays a great combination of inducing soft contact while also displaying a strong mix of offspeed and breaking pitches. There are plenty of Cy Young winners and contenders in this group between Cole, Kershaw, and Bieber. There is a lot to be excited about with the young arms of Julio Urias and Luis Garcia as they are clearly on the right path.
Cluster 3 - “Smaller Size, Stronger Arm”
Shortest height
Second lowest weight
Third highest wFB/C
Good Swinging Strike %
With the likes of Marcus Stroman and a couple of other undersized SPs in this cluster, there was no other option than to name this group “Smaller Size, Stronger Arm.” An intriguing mix of wFB/C and Swinging Strike % makes this cluster full of entertaining starters who like to rely on their fastball to get outs (Buehler, Cueto).
Cluster 4 - “Pitches to Contact, Doesn’t See Much Success”
Extraordinarily high H/9
Lowest K/9
Lowest wFB/C and wOffspeed/C
Lowest Outside Swing %
Highest Zone Contact % and Lowest Swinging Strike %
There’s a lot more I could add about where this cluster averages in other statistics, but I feel like I’ve bashed them enough so I’m gonna lay off. If you couldn’t tell already, this group is pretty ugly. K-means pretty much just put all of the worst pitchers in this cluster and I don’t really fault it for that. These guys have struggled on average in practically every aspect of pitching this season. A high Zone Contact % means these guys are really having a hard time getting batters to swing and miss at their stuff in the zone. With an average ERA north of 5 in this group, it’s a bit alarming for two-time All-Star Patrick Corbin to fall in this cluster. It speaks volumes about just how poor he’s been this season.
Cluster 5 - “All-Star Level Pitchers”
Second highest K/9
Second lowest H/9
Good Zone Contact %
High Swinging Strike %
High BB/9
When looking at this cluster, it’s hard to imagine one better. I know I mentioned cluster 2 is arguably one of the best, but this one might just top it. And if you’re wondering why these clusters are only battling for second-best, and whether one of them has a case to be the best: well, not really. You will see why when we get to cluster 7.
Back to cluster 5. Cluster 5 displays a great mix of high strikeouts per nine and low hits per nine. Truly a recipe for good pitching (unlike Cluster 4). One downside is that this cluster is full of pitchers who tend to get too cute with their stuff. A high BB/9 with a low Zone Contact % tells us these pitchers don’t enjoy pitching to contact. This makes sense given Freddy Peralta, Carlos Rodon, and Max Scherzer, some of the best strikeout pitchers in the game, fall in this cluster.
Cluster 6 - “Pounds the Zone but Limits Hard Hit Balls”
Low K/9
High H/9
High Soft Contact %
Low BB/9
Above-average Zone Contact %
Cluster 6 consists of solid middle-of-the-rotation arms that successfully pitch to contact. Unlike the other pitch-to-contact cluster, this group of pitchers averages a high Soft Contact % in comparison to other starters. Their high H/9 despite the high soft contact suggests some of these starters might be on the unluckier side of things this season (e.g., Zach Eflin: 4.17 ERA, 3.61 xFIP).
Cluster 7 - “A Tier Above Everyone Else”
Highest K/9
Lowest H/9
Lowest BB/9
Lowest HR/9
Highest Soft Contact %
Highest wFB/C and wOffspeed/C
Highest O-Swing %
Lowest Z-Contact %
Highest Height & Weight
Yeah, this cluster is simply a tier above everyone else. Maybe even many, many more tiers above everyone else. Just two pitchers in this one: Jacob deGrom and Lance Lynn, whom many would call the frontrunners in the NL and AL Cy Young races, respectively, had deGrom not been shut down until September. This cluster ranks 1st in every statistic besides GB/FB ratio. The epitome of pitching ability.
Cluster 8 - “Middle of the Rotation Arms with Plus Potential”
Slightly above average K/9
Average Soft Contact %
Slightly above average wFB/C
Below average Outside Swing %
This cluster was a bit dull. Nothing really stood out, which was a bit disappointing when I saw the names in this cluster. For the most part, these guys are middle-of-the-rotation arms (Taillon, Walker, Rodriguez) but some show plus ability and potential (Darvish, Ryu, Manaea).
Cluster 9 - “Struggles to Keep the Ball in the Park”
Highest HR/9
Lowest Soft Contact %
Above-average Flyball Rate
Cluster 9 is pretty much a recipe for disaster. A high flyball rate and high HR/9 mean these pitchers produce a lot of flyballs and see quite a few of them leave the yard. This cluster includes the league leader in HR/FB ratio, Yusei Kikuchi, so what you’re seeing here seems to check out.
How Each Cluster Compares to One Another
Remember when I said Cluster 4 was pretty ugly? Well, it got even uglier. Cluster 4 is pretty much all alone, with almost no overlap with other clusters. Clusters 2 and 5 were the two fighting for second best, and as we can see from this visualization, it’s hard to differentiate the two. Pitchers in cluster 2 could’ve ultimately fallen in cluster 5 and vice versa.
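One rough way to put numbers on "all alone" versus "hard to differentiate" is to measure the distance between cluster centers: a small distance hints at overlap, a large one at isolation. This is not how the visualization was built. It is just a sketch under the assumption that the centers are in standardized units, and the three centers below are invented for illustration.

```python
import math
from itertools import combinations


def center_distances(centers):
    """Euclidean distance between every pair of cluster centers."""
    return {
        (a, b): math.dist(centers[a], centers[b])
        for a, b in combinations(sorted(centers), 2)
    }


# Hypothetical standardized centers (two metrics only), for illustration.
centers = {2: (0.8, 0.6), 4: (-2.0, -1.8), 5: (1.0, 0.5)}

distances = center_distances(centers)
closest = min(distances, key=distances.get)
farthest = max(distances, key=distances.get)

for pair, dist in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(dist, 3))
```

With these made-up centers, clusters 2 and 5 sit right next to each other while cluster 4 is far from both, which is the kind of pattern described above.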
Final Remarks
Starting pitchers are unique in every facet of the game. It’s impossible to completely replicate a pitcher, but by using k-means clustering we can find groupings amongst them based on pitching style. It is important to note that not all of the pitchers follow the exact guidelines set by their cluster. As you can see in the visualization above, there are overlaps between them. Some pitchers are better fits than others, but each player shares some form of similarity with the cluster they are placed in.
Hope you enjoyed! Follow me on Twitter for more content - @aborelli24