Mining the Ball Field

Grab a mass of data, like the 2011 baseball season's results. How can you glean useful information from all the numbers? This question sits squarely in the field of data mining, which is the science of extracting useful information from large sets of data.
This post was published on the now-closed HuffPost Contributor platform.

Take me out to the ball game, take me out with the crowd. Give me some statistics and computer time. I might find an unknown who's about to hit his prime!

The fields are cut, the peanuts and Cracker Jack are piled high, and soon the 2012 Major League Baseball season will begin with opening day. Who will win and who will lose? Which players will excel and which will wish they could return to today and start again? As popularized in the 2011 film Moneyball, baseball teams use analytical, evidence-based, sabermetric approaches to study and measure the effectiveness of players. A digital sea of data is only a few clicks away. Yet, how might a baseball fan begin to analyze it?

We were at a similar momentary impasse with March Madness as we considered completing a bracket for the Division I Men's Basketball Tournament. The posting Got March Madness? Try Math! described how to adapt a mathematical method used to rank college football teams for the Bowl Championship Series. With it, one could create a personalized March Madness bracket. Some such adaptations outscored more than 98% of the more than 6 million brackets submitted to the ESPN Tournament Challenge. Such a method is effective, in part, because it gives greater reward for wins against strong teams.
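To make "greater reward for wins against strong teams" concrete, here is a minimal sketch of a Colley-style rating. This is an assumption on our part: the Colley method was one of the computer rankings used in the Bowl Championship Series, but nothing above says it is the exact method the earlier posting adapted. The team names and game results are made up for illustration. Each team's rating depends on its record and on the ratings of the teams it played, so the ratings are refined by repeated sweeps until they settle.

```python
from collections import defaultdict


def colley_ratings(games, sweeps=200):
    """Colley-style ratings from a list of (winner, loser) pairs.

    Solves r_i = (1 + (wins_i - losses_i) / 2 + sum of opponents' ratings)
                 / (2 + games_i)
    by Gauss-Seidel sweeps, which converge because the underlying
    linear system is strictly diagonally dominant.
    """
    wins = defaultdict(int)
    losses = defaultdict(int)
    opponents = defaultdict(list)  # one entry per game played
    for winner, loser in games:
        wins[winner] += 1
        losses[loser] += 1
        opponents[winner].append(loser)
        opponents[loser].append(winner)

    teams = sorted(opponents)
    ratings = {team: 0.5 for team in teams}  # everyone starts as average
    for _ in range(sweeps):
        for team in teams:
            record_term = 1 + (wins[team] - losses[team]) / 2
            schedule_term = sum(ratings[opp] for opp in opponents[team])
            ratings[team] = (record_term + schedule_term) / (2 + len(opponents[team]))
    return ratings


# Hypothetical results: X and Y both finish 1-1, but X split its games
# with the strongest team while Y split with the weakest.
games = [
    ("Top", "Bottom"), ("Top", "Bottom"),
    ("X", "Top"), ("Top", "X"),
    ("Y", "Bottom"), ("Bottom", "Y"),
]
ratings = colley_ratings(games)
for team in sorted(ratings, key=ratings.get, reverse=True):
    print(f"{team:6s} {ratings[team]:.3f}")
```

In this toy schedule X and Y have identical win-loss records, yet X lands near 0.57 and Y near 0.43, because X's win came against the top-rated team.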

Can we tailor such ideas to baseball? Yes. In fact, such work was already started this past summer by Drs. John Harris and Kevin Hutson of Furman University and their undergraduate research assistants Will Decker, Jordan Lyerly, Aaron Markham, and Rob Picardi. While one could use such ideas to rank MLB teams and predict their play from day to day, this group's work focused on using such algorithms to study players.

To see why such ideas can be advantageous, look at batting average. This statistic does not account for the strength of the players involved in a hit. Each at-bat can be viewed as a game between the pitcher and the hitter: an out is a win for the pitcher, and a hit is a win for the hitter. Yet, such ranking methods can involve (and be improved by using) scores, not just wins and losses. What's the score for a single versus a triple? As is often the case in math, one can lean on the developed ideas of others. Here the Furman group used a system called Runs To End of Inning (RUE) that measures a particular event's contribution to runs produced in an inning. For instance, a single is worth 1.025, a strikeout 0.207, and a home run 1.942, as catalogued in The Book: Playing the Percentages in Baseball by Tom Tango, Mitchel Lichtman, and Andrew Dolphin. The group integrated such a scoring system and treated each at-bat as a game between a pitcher and a hitter. In the end, they were able to rank the pitchers and hitters from the 2010 season. Not surprisingly, many of the names that topped their list are household names like Josh Hamilton, Joey Votto, and Felix Hernandez.
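One way a fan might experiment with "each at-bat is a game with a score" is sketched below. It is not the Furman group's algorithm, which is not spelled out here. It is a simple stand-in that assumes each at-bat's run value is roughly the league average, plus a hitter rating, minus a pitcher rating, and fits those ratings by alternating averages. The three run values are the ones quoted above; the player names, the at-bats, and the `shrink` parameter (which pulls small samples toward average) are hypothetical.

```python
from collections import defaultdict

# Run values quoted in the article; other events are omitted in this sketch.
RUN_VALUE = {"single": 1.025, "strikeout": 0.207, "home_run": 1.942}


def rate_matchups(at_bats, sweeps=200, shrink=1.0):
    """Fit value ~ league + hitter_rating - pitcher_rating.

    at_bats: list of (hitter, pitcher, event) tuples.
    Returns (hitter_ratings, pitcher_ratings); higher is better for both.
    """
    values = [RUN_VALUE[event] for _, _, event in at_bats]
    league = sum(values) / len(values)

    by_hitter = defaultdict(list)
    by_pitcher = defaultdict(list)
    for (hitter, pitcher, _), value in zip(at_bats, values):
        by_hitter[hitter].append((pitcher, value - league))
        by_pitcher[pitcher].append((hitter, value - league))

    hit = {h: 0.0 for h in by_hitter}
    pit = {p: 0.0 for p in by_pitcher}
    for _ in range(sweeps):
        # A hitter gets credit for results above average, plus a bonus
        # for the quality of the pitchers faced.
        for h, rows in by_hitter.items():
            hit[h] = sum(y + pit[p] for p, y in rows) / (len(rows) + shrink)
        # A pitcher gets credit for holding hitters below what their
        # ratings would predict.
        for p, rows in by_pitcher.items():
            pit[p] = sum(hit[h] - y for h, y in rows) / (len(rows) + shrink)
    return hit, pit


# Hypothetical data: hitter_A and hitter_B have identical results, but
# hitter_A faced only the pitcher who shut down hitter_C.
at_bats = [
    ("hitter_A", "ace", "single"), ("hitter_A", "ace", "strikeout"),
    ("hitter_A", "ace", "strikeout"),
    ("hitter_B", "rookie", "single"), ("hitter_B", "rookie", "strikeout"),
    ("hitter_B", "rookie", "strikeout"),
    ("hitter_C", "ace", "strikeout"), ("hitter_C", "ace", "strikeout"),
    ("hitter_C", "rookie", "home_run"), ("hitter_C", "rookie", "single"),
]
hitters, pitchers = rate_matchups(at_bats)
for name in sorted(hitters, key=hitters.get, reverse=True):
    print(f"{name:9s} {hitters[name]:+.3f}")
```

Even though hitter_A and hitter_B produced the same outcomes, the sketch rates hitter_A higher because those outcomes came against the tougher pitcher, which is the kind of adjustment batting average cannot make.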

However, sprinkled among the superstars were lesser-known players such as Mike Napoli. In 2010, while playing for the Los Angeles Angels, Mike Napoli hit 0.238 with a 0.316 on-base percentage. To put this in perspective, the league average in 2010 for batting average was 0.257. From this perspective, Napoli was in the bottom of the pack. Yet, Furman's ranking methods placed him as the 62nd best hitter out of the 444 total batters in Major League Baseball. How could this be? The key is asking: who did he face when he stepped up to bat? Among the eight pitchers he faced most often in 2010, four were nominated for the Cy Young Award.

Mike Napoli was traded during the offseason and played for the Texas Rangers in 2011. In this different environment, Napoli's batting average improved dramatically from 0.238 in 2010 to 0.320 in 2011; his on-base percentage jumped from 0.316 to 0.414. Indeed, Furman's work uncovered a diamond in the rough.

Does this work only for batters? No. Doug Fister was a highly ranked pitcher from the 2010 season, according to the results of Furman's work, yet his conventional statistics were far from stellar. He posted a 4.11 earned run average (ERA) and won 6 games while losing 14 for the Seattle Mariners. In 2011, Fister was traded midseason from the Mariners to the Detroit Tigers. Before being traded, Fister posted 3 wins and 12 losses for the Mariners. In Detroit, he won 8 games and lost only 1 for the remainder of the season with a 1.79 ERA! Again, the math almost foresaw such a possibility.

Who are the wildcard players this year? If you are in the stands, who is playing exceptionally well even if the statistics are not yet reflecting it? Grab a mass of data, like the 2011 season's results. How can you glean useful information from all the numbers? This question sits squarely in the field of data mining, which is the science of extracting useful information from large sets of data. In fact, April 2012 is Math Awareness Month, sponsored by the American Mathematical Society, the American Statistical Association, the Mathematical Association of America, and the Society for Industrial and Applied Mathematics. Want to learn more about the field? Visit their site, Math Awareness Month, to read essays and view activities occurring this month.

So, root, root, root for the home team. If they don't win, it's a shame! It's one, two, three strikes, you're out in that game between pitchers and hitters. What statistics could you use from the game, and what methods could you use or adapt, to help uncover who just might grab the nation's attention in that old ball game?
