07/02/2013 06:16 pm ET Updated Sep 01, 2013

All-Star Profiling

In just a few days, baseball fans will be treated to the annual mid-year battle of the American and National Leagues: the Major League Baseball All-Star Game. Unlike the post-season interleague battle, the World Series, the teams in this game will consist of players from multiple ball clubs. How do these all-stars get chosen?

The selection process is twofold: There is a fan-voting component and a manager-selection component. During the 10 or so weeks prior to the game, fans vote for their favorites in each of the non-pitching positions. When voting ends, the top vote getters in each position (top three for outfielders) get to be the starters for their respective leagues. The backup position players and all the pitchers are then voted on and selected by the players and managers of all major league teams. In all, around 72 major league players each year earn the honor of being named an all-star.

While the individual statistics of the game do not count in players' official records, the outcome of the game does have a significant consequence: the winning league earns home field advantage for its World Series team that year. Since the team with home field advantage wins the World Series about 60 percent of the time, this is an advantage indeed.

So the rosters of the all-star teams matter! Not only are career-defining honors at stake, but so is a major factor in the ultimate career-defining accomplishment: the World Series.

RBIs, Home Runs, and... Tweets?

We analyzed data from the first half of the 2012 baseball season using a technique from data mining called partitioning. Data mining has been in the news lately due to the NSA's monitoring and analyzing of phone records. The particular technique that we employ here attempts to profile an all-star player by identifying characteristics common to most players who make the team. For this study, we examined traditional baseball statistics, such as batting average and RBIs, along with some nontraditional characteristics, such as country of origin and number of Twitter followers.
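For readers curious about how partitioning picks a dividing line, here is a minimal sketch in Python. It tries each candidate cutoff for one statistic and keeps the one that leaves the two resulting groups as "pure" as possible. The purity measure used here (Gini impurity) is an assumption chosen for illustration, since the criterion behind our actual analysis is not described in this article, and the eight data points are made up rather than taken from the 2012 data.

```python
from typing import List, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a list of 0/1 labels (0 means the group is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(values: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (cutoff, weighted impurity) for the best 'value >= cutoff' split."""
    best = (float("nan"), float("inf"))
    n = len(values)
    for cutoff in sorted(set(values)):
        below = [y for v, y in zip(values, labels) if v < cutoff]
        at_or_above = [y for v, y in zip(values, labels) if v >= cutoff]
        if not below or not at_or_above:
            continue  # a split must put players on both sides
        score = (len(below) * gini(below)
                 + len(at_or_above) * gini(at_or_above)) / n
        if score < best[1]:
            best = (cutoff, score)
    return best


# Made-up toy data (NOT the 2012 season): total bases and all-star flag.
toy_total_bases = [90, 110, 120, 135, 145, 150, 170, 180]
toy_all_star = [0, 0, 0, 0, 1, 0, 1, 1]

cutoff, impurity = best_split(toy_total_bases, toy_all_star)
print(cutoff, impurity)
```

A full partition tree repeats this search within each resulting group, over every available statistic, which is how a sequence of cutoffs on different statistics can arise.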

We discovered that for the 2012 season, the single statistic that best separates all-stars from non-all-stars was the total-bases statistic (which is the number of singles, plus twice the number of doubles, plus three times the number of triples, plus four times the number of home runs). Of those players who had at least 141 total bases before the all-star break in 2012, 65 percent made the all-star team, while only 3 percent of the players who had fewer than 141 total bases made the team.
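The total-bases arithmetic is simple enough to sketch in a few lines of Python. The function and constant names below are our own choices for illustration, and the sample stat line is hypothetical rather than any real player's numbers.

```python
TOTAL_BASES_CUTOFF = 141  # cutoff reported for the 2012 first-half data


def total_bases(singles: int, doubles: int, triples: int, home_runs: int) -> int:
    """Singles count once, doubles twice, triples three times, home runs four."""
    return singles + 2 * doubles + 3 * triples + 4 * home_runs


def clears_first_split(tb: int) -> bool:
    """True if a player lands in the 'at least 141 total bases' group."""
    return tb >= TOTAL_BASES_CUTOFF


# Hypothetical first-half stat line: 60 singles, 20 doubles, 3 triples, 15 home runs.
example_tb = total_bases(singles=60, doubles=20, triples=3, home_runs=15)
print(example_tb, clears_first_split(example_tb))
```

Landing in the higher group does not guarantee a selection; in the 2012 data, 65 percent of those players made the team.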

For those in this latter category (fewer than 141 total bases), the number of RBIs was important. No batter in this category who had fewer than 25 RBIs made the all-star team, while 14 players who had at least 25 RBIs did. For these all-stars, on-base percentage (OBP) was the next most significant statistic, and following that was an interesting one: the number of Twitter followers!

Consider players who had fewer than 141 total bases but at least 25 RBIs and an OBP of at least 0.337. There were 46 such players, seven of whom had at least 102,120 followers on Twitter. All seven of these players made the all-star team! Of the 46 players with fewer than this many followers, only 15 percent made the team.
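Taken together, the splits described so far can be read as a short chain of questions. The Python sketch below encodes only the cutoffs and outcomes stated in this article; the function name and the wording of the returned labels are ours, and the branch for players with an OBP below 0.337 is marked as unreported because the article gives no figure for it.

```python
def partition_leaf(total_bases: int, rbis: int, obp: float,
                   twitter_followers: int) -> str:
    """Return the 2012 group a batter falls into, with its reported outcome."""
    if total_bases >= 141:
        return "TB >= 141: 65 percent made the team"
    if rbis < 25:
        return "TB < 141, RBI < 25: no batter made the team"
    if obp < 0.337:
        # The article does not report an outcome for this branch.
        return "TB < 141, RBI >= 25, OBP < 0.337: rate not reported"
    if twitter_followers >= 102_120:
        return "TB < 141, RBI >= 25, OBP >= 0.337, popular: all seven made the team"
    return "TB < 141, RBI >= 25, OBP >= 0.337, fewer followers: 15 percent made the team"


# Hypothetical batter: 120 total bases, 40 RBIs, .350 OBP, 250,000 followers.
print(partition_leaf(120, 40, 0.350, 250_000))
```

The order of the questions matters: Twitter following only comes into play for batters who have already fallen below the total-bases cutoff and cleared the RBI and OBP cutoffs.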

Since fan popularity is a part of the selection process, it is not surprising that a player's Twitter following is significant in this type of statistical analysis.

Fortunate? Snubbed?

Our model did a pretty good job at determining the statistical factors that were significant in 2012 (an R-squared value of 0.77 for the statistically minded readers), but the partition tree didn't categorize players perfectly. If we look at some of the players the model predicted incorrectly, we find some interesting stories.

First, there were some players who did make the all-star team but who our model suggested should not have. There were only seven such players (examples of statistical Type I errors), and a few of them are worth noting. National League infielders Rafael Furcal (STL), Bryan LaHair (CHC), and Dan Uggla (ATL) each made the team, but our model suggested they should not have. Each of these players had a less-than-stellar second half. Batting averages, slugging percentages, and total bases dropped (significantly, in some cases) after the all-star break. For these players, it was probably good that the all-star selection took place when it did.

The Type II errors in our model involved players who the model said should have made the all-star team but did not. There were 17 of these players, and many of the names are well known. Josh Reddick (OAK), Albert Pujols (LAA), Mike Moustakas (KCR), and Edwin Encarnacion (TOR) are great examples. Notably, these four were among the players whom sports writers and commentators in 2012 called the worst "snubs" -- players who should have made the team.

What About This Year?

Let's see what the 2012 analysis might say about the 2013 all-star roster. If we apply the same partition categories to the players from this year, many expected names emerge as suggested all-stars: Miguel Cabrera (DET), Buster Posey (SFG), Mike Trout (LAA), and Joey Votto (CIN) are good examples. Each of these is at (or near) the top of his position's current voting list, and we should expect to see each of these players on the field next week.

We want to close by highlighting seven players who are not leading in votes but who our model suggests deserve all-star nods. Domonic Brown (PHI) is having a great year. He has over 160 total bases, and as of this writing, he is leading the National League in home runs. Nelson Cruz (TEX) is another player whom the analysis liked, and he is among the leaders in home runs in the American League. Our model also favors Carlos Gomez (MIL), Jean Segura (MIL), and Starling Marte (PIT), each of whom has, or is on pace to have, over 141 total bases. These three also rank first, second, and third in triples in the National League (a trio of triples!). Two other players to notice are Howie Kendrick (LAA) and Andrew McCutchen (PIT). These two will have over 141 total bases by the all-star break, and they have had great seasons so far.

Watch for these players to be among the players selected after the fan voting determines the starters. If you don't see them on the rosters, then look for them to be among those whom the sports pundits call "snubs" regarding the 2013 All-Star Game.