Play Ball: Introducing Fungoes
When I was growing up and didn't have a car, I always was busy with one of
two activities--working on my four-seam fastball from my Luis Tiant
wind-up or peeking and poking with BASIC. Now that I am older, I have
moved beyond BASIC and dived into baseball statistics, perhaps because
I never got a good feel for a curveball. Nearly all of my statistical
investigations have used open-source tools. I have tested the limits of
OpenOffice.org and Gnumeric. I have put numbers produced from g77-compiled
programs into Perl and TCL, and I have created graphs with gnuplot.
Until now, I have kept my work and the strange collection of tools I use
to myself.
How un-open source of me.
Around 1856, ex-cricket reporter Henry Chadwick started playing around
with box scores and numerical representations of a baseball game.
Since then, box scores have become a free data source for fans, a public
recording of the facts. Retrosheet is an
excellent place to find old box scores and download the data.
Although Retrosheet offers DOS tools on its site, the open-source tool
Chadwick also can be used on the data.
More recently, Bill James looked at baseball stats in different ways and wrapped them in a
different philosophy. He was not the first one to do so, however. Branch Rickey,
who also brought a guy named Jackie Robinson to the big leagues,
used statistical analysis when he was general manager of the Brooklyn Dodgers.
James' work grew into "sabermetrics". Sabermetrics, now taught and used
in college courses, has its proponents and opponents. This parallels,
in many ways, GNU and free software.
With Linux, Linus Torvalds arguably brought the first serious attention
to free and open-source software. Billy Beane, General Manager of the Oakland
A's and central figure of the book Moneyball, is the technical architect
who helped bring sabermetrics into the public eye. Whereas Linus used free
and open-source software to build Linux, Billy Beane used sabermetrics
to build the Oakland A's, with equally successful results.
Success has built slowly around sabermetrics, and many teams have come
to integrate it into their decision-making. The Cleveland Indians, with
Mark Shapiro as the General Manager, rely on statistical analysis in
drafting, signing and trading players. Theo Epstein helped lead the
Boston Red Sox to a championship, with consultant Bill James, using
sabermetrics in player decisions. Likewise, Linux and free and open-source
software slowly have worked their way into larger companies and their
decision-making processes.
Baseball Prospectus is a Web site that publishes daily articles about baseball,
using sabermetrics as the foundation. Baseball Prospectus also
publishes a book or two throughout the year. The site relates
to sabermetrics in the same way that Linux Journal
relates to Linux.
In February of this year, the good folks at O'Reilly published a book
called Baseball Hacks, penned by Joseph Adler.
Although not specifically about open-source tools--Excel and Access get
some ink--most of the tools he discusses to collect baseball stats and
mine the data are open source. He devotes a good chunk of the book to
using MySQL and The R Project--a great project too few people have heard
about--so people can fulfill their need for baseball statistical analysis. He also
includes a few great sections on collecting live data, including pulling
box score information off Major League Baseball's Web site
and shoving the information into a database.
Baseball Hacks removed my final excuse for not
moving much of my baseball and football stats work into an open-source
project. So I found acceptable replacements for my greenies, passed
my drug test and started Fungoes.
Fungoes will have several parts, including a display of historical
statistics. The site will offer different kinds of stats and allow
people to sort and filter on various ones. Users will be able to find
out who hit the most home runs, as well as who hit the fewest.
baseball=# select name,yearid,ab,hr
baseball-# from batting_career_totals
baseball-# where ab > 3000 and debut > 1900
baseball-# order by hr asc
baseball-# limit 5;
      name       | yearid |  ab  | hr
-----------------+--------+------+----
 Duane Kuiper    |     12 | 3379 |  1
 Bill Bergen     |     11 | 3028 |  2
 Al Bridwell     |     11 | 4169 |  2
 Johnny Cooney   |     20 | 3372 |  2
 Frank Taveras   |     11 | 4043 |  2
Note: the query above is the one that finally forced me to start this
project. I was writing an article called "Ironic Announcers for MLB's Home
Run Derby" while researching a book I hope to write someday. I
could not find a good way to include any statistical information
beyond cutting and pasting.
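The batting_career_totals relation in that query is not defined in the
article. As a rough sketch only, it could be a view that rolls per-season
batting rows up to one row per player. The example below uses Python's
built-in sqlite3 module in place of PostgreSQL so it runs anywhere; the
two-table schema, the integer debut year and every player and number in it
are made up for illustration and are not the Baseball Databank's real
layout or data.

```python
import sqlite3

# Hypothetical, cut-down schema; the real tables carry many more columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master  (playerid TEXT PRIMARY KEY, name TEXT, debut INTEGER);
    CREATE TABLE batting (playerid TEXT, yearid INTEGER, teamid TEXT,
                          ab INTEGER, hr INTEGER);

    -- One row per player: seasons played, career at-bats, career home runs.
    CREATE VIEW batting_career_totals AS
    SELECT m.name,
           m.debut,
           COUNT(DISTINCT b.yearid) AS yearid,
           SUM(b.ab)                AS ab,
           SUM(b.hr)                AS hr
    FROM batting b
    JOIN master m ON m.playerid = b.playerid
    GROUP BY m.playerid, m.name, m.debut;
""")

# Made-up players with deliberately unrealistic season totals.
conn.executemany("INSERT INTO master VALUES (?, ?, ?)", [
    ("aaa01", "Player A", 1975),
    ("bbb01", "Player B", 1980),
    ("ccc01", "Player C", 1890),  # debuted too early for the filter below
])
conn.executemany("INSERT INTO batting VALUES (?, ?, ?, ?, ?)", [
    ("aaa01", 1975, "AAA", 1600, 0),
    ("aaa01", 1976, "AAA", 1500, 1),
    ("bbb01", 1980, "BBB", 2000, 30),
    ("bbb01", 1981, "CCC", 1200, 25),
    ("ccc01", 1890, "DDD", 3500, 0),
])


def fewest_home_runs(conn, min_ab=3000, debut_after=1900, limit=5):
    """Run the article's query with its constants passed as bound values."""
    return conn.execute(
        """SELECT name, yearid, ab, hr
           FROM batting_career_totals
           WHERE ab > ? AND debut > ?
           ORDER BY hr ASC
           LIMIT ?""",
        (min_ab, debut_after, limit),
    ).fetchall()


rows = fewest_home_runs(conn)
for row in rows:
    print(row)
```

Counting distinct seasons matches the way the yearid column in the output
above appears to hold a number of seasons rather than a calendar year, but
that reading is an inference from the output, not something the article
states.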
The Fungoes project offers many interesting challenges for me. Many
statistics will need to be sorted and calculated. The range for sorting, the number
of columns and the filtering possibilities will test my humble programming
and database skills. In addition, because players can switch teams during the
year, finding a way in which to display their data in an informative
and easy-to-view manner increases the challenge.
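One way to keep the sorting and filtering manageable is to build each query
from small lists of allowed columns. What follows is a minimal sketch of
that idea, not the code Fungoes uses: the table name, the column lists, the
operator names and the build_query function are all hypothetical. Column
names cannot be passed as bound parameters, so they are checked against a
whitelist, and only the values are bound.

```python
# Hypothetical table and column whitelists for a stats listing page.
TABLE = "batting_career_totals"
SORTABLE = {"name", "yearid", "ab", "hr"}
FILTERABLE = {"yearid", "ab", "hr", "debut"}
OPERATORS = {"gt": ">", "lt": "<", "eq": "="}


def build_query(order_by, descending=False, filters=(), limit=25):
    """Return (sql, params) for a sorted, filtered listing.

    Each filter is a (column, operator_name, value) tuple. Anything that is
    not in the whitelists is rejected before it can reach the SQL text.
    """
    if order_by not in SORTABLE:
        raise ValueError(f"cannot sort on {order_by!r}")

    clauses, params = [], []
    for column, op, value in filters:
        if column not in FILTERABLE or op not in OPERATORS:
            raise ValueError(f"bad filter: {column!r} {op!r}")
        clauses.append(f"{column} {OPERATORS[op]} ?")
        params.append(value)

    sql = f"SELECT * FROM {TABLE}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    direction = "DESC" if descending else "ASC"
    sql += f" ORDER BY {order_by} {direction} LIMIT ?"
    params.append(limit)
    return sql, params


sql, params = build_query(
    "hr", filters=[("ab", "gt", 3000), ("debut", "gt", 1900)], limit=5
)
print(sql)
print(params)
```

The "?" placeholders follow the sqlite3 convention; a PostgreSQL driver
may expect a different placeholder style.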
Baseball Reference is an excellent site that provides team and player
statistics. It currently does a great job of displaying static data.
For example, here is the site's display for Jody Gerut.
Year Ag Tm  Lg   G   AB   R   H 2B 3B HR RBI SB CS  BB  SO   BA  OBP  SLG
-------------------------------------------------------------------------
2003 25 CLE AL 127  480  66 134 33  2 22  75  4  5  35  70 .279 .336 .494
2004 26 CLE AL 134  481  72 121 31  5 11  51 13  6  54  59 .252 .334 .405
2005 27 TOT     59  170  15  43 11  1  1  14  1  1  20  20 .253 .330 .347
        TOT NL  15   32   3   5  2  0  0   2  0  0   2   6 .156 .206 .219
        CLE AL  44  138  12  38  9  1  1  12  1  1  18  14 .275 .357 .377
        CHC NL  11   14   1   1  1  0  0   0  0  0   2   3 .071 .188 .143
        PIT NL   4   18   2   4  1  0  0   2  0  0   0   3 .222 .222 .278
-------------------------------------------------------------------------
3 Seasons      320 1131 153 298 75  8 34 140 18 12 109 149 .263 .334 .434
-------------------------------------------------------------------------
162 Game Avg        573  77 151 38  4 17  71  9  6  55  75 .263 .334 .434
Career High    134  481  72 134 33  5 22  75 13  6  54  70 .279 .336 .494
In 2005, Jody Gerut played for three teams: the Cleveland Indians, Chicago
Cubs and Pittsburgh Pirates. As shown above, his totals are split in several
ways: by season, by team and by league.
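Those season and league lines can be derived from the per-team rows. The
sketch below takes Gerut's three 2005 stints from the table above (games,
at-bats and hits only) and rolls them up; the function names and the
tuple layout are assumptions made for the example.

```python
from collections import defaultdict

# Jody Gerut's 2005 stints, copied from the table above:
# (team, league, games, at-bats, hits)
STINTS_2005 = [
    ("CLE", "AL", 44, 138, 38),
    ("CHC", "NL", 11, 14, 1),
    ("PIT", "NL", 4, 18, 4),
]


def batting_average(hits, at_bats):
    """Format H/AB the way stat lines do, for example '.253'."""
    if at_bats == 0:
        return "---"
    return f"{hits / at_bats:.3f}".lstrip("0")


def roll_up(stints):
    """Return season ('TOT') and per-league totals as {label: (G, AB, H)}."""
    totals = defaultdict(lambda: [0, 0, 0])
    for _team, league, games, at_bats, hits in stints:
        for label in ("TOT", league):
            totals[label][0] += games
            totals[label][1] += at_bats
            totals[label][2] += hits
    return {label: tuple(values) for label, values in totals.items()}


totals = roll_up(STINTS_2005)
for label, (games, at_bats, hits) in totals.items():
    print(label, games, at_bats, hits, batting_average(hits, at_bats))
```

The season line comes out to 59 games, 170 at-bats and a .253 average, and
the National League line to 15 games, 32 at-bats and .156, matching the TOT
rows in the table. The sketch also produces an AL line, which the table
omits because Gerut played for only one American League team.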
Fungoes will strive to display stats in the same way as Baseball
Reference, but using dynamic pages. First, I have to get the data or
spend time doing data entry. Lucky for me, the good people at
The Baseball Databank offer historical baseball data in
two formats, MySQL and comma-separated values (CSV). To get an idea of
the size of the data, here are the current row counts for
all the tables the site offers.
TABLE => ROWS
Master => 16566
Teams => 2505
TeamsFranchises => 120
TeamsHalf => 52
Batting => 87308
Pitching => 36898
Fielding => 126130
FieldingOF => 21603
Salaries => 17277
Managers => 3067
ManagersHalf => 93
Allstar => 4115
AwardsPlayers => 2383
AwardsSharePlayers => 5930
AwardsManagers => 47
AwardsShareManagers => 282
HallOfFame => 3369
HOFold => 260
BattingPost => 9069
FieldingPost => 8981
PitchingPost => 3597
SeriesPost => 229
Schools => 724
SchoolsPlayers => 5684
xref_stats => 16413
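Counts like these are easy to reproduce once the CSV files are loaded into
a database. The following is a minimal sketch, assuming each file starts
with a header row; the load_csv helper, the column names and the sample
rows are invented for the example, and sqlite3 again stands in for
PostgreSQL, where the COPY command would be the more usual bulk-load route.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_file):
    """Create `table` from the CSV header row, insert every data row and
    return the resulting row count.

    Every column is stored as TEXT to keep the sketch short. The table and
    column names are interpolated into the SQL, so they must come from
    trusted files.
    """
    reader = csv.reader(csv_file)
    header = next(reader)
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', reader)
    return conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]


# Stand-in for a downloaded file; the header and rows are made up.
sample = io.StringIO(
    "playerID,yearID,teamID,AB,HR\n"
    "xx01,2005,AAA,100,3\n"
    "yy01,2005,BBB,250,7\n"
)
conn = sqlite3.connect(":memory:")
count = load_csv(conn, "Batting", sample)
print(f"Batting => {count}")
```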
The next purpose of the Fungoes site will be to download box score
information from mlb.com and store the information in the Fungoes
database. At the end of the year, the data will be verified and added
to the historical data. I also want to display box score data and offer
sortable statistics for the current season.
Finally, what would be the point of having all this data if I didn't offer my
humble opinion on the season and my excellent analysis of the numbers?
Several records could fall this season. Barry Bonds' quest to topple
Babe Ruth's and Hank Aaron's home run records will cause plenty of debate.
Therefore, the Fungoes site will need to have an area for posting
articles that use data from the site.
To help make it easier for these articles to refer to data on the site,
I am going to add the functionality of My T Url, a TinyURL clone, to
all of the pages on the site. Doing so will allow each page to have a
small URL, for example, http://fungoes.mek.cc/link/000a1 instead of
http://fungoes.mek.cc/baseball-stats/player/aaronha01?batting%5forderby….
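How My T Url generates its codes is not described here. One common
approach, shown as a sketch under that caveat, is to give each long URL a
sequential integer id and write the id in base 36, padded to a fixed width
so it resembles the 000a1 in the example. The LinkTable class is a
hypothetical in-memory stand-in for the database table a real service
would use.

```python
import string

ALPHABET = string.digits + string.ascii_lowercase  # base 36: 0-9, then a-z
WIDTH = 5  # assumed width, chosen to match the look of "000a1"


def encode(link_id):
    """Turn a non-negative integer id into a zero-padded base-36 code."""
    if link_id < 0:
        raise ValueError("link_id must be non-negative")
    digits = []
    while link_id:
        link_id, remainder = divmod(link_id, 36)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits)).rjust(WIDTH, "0")


def decode(code):
    """Turn a base-36 code back into its integer id."""
    return int(code, 36)


class LinkTable:
    """In-memory stand-in for a table that maps short codes to URLs."""

    def __init__(self):
        self._urls = []  # position in the list is the link id
        self._ids = {}   # long URL -> link id, so repeats reuse a code

    def shorten(self, url):
        if url not in self._ids:
            self._ids[url] = len(self._urls)
            self._urls.append(url)
        return encode(self._ids[url])

    def resolve(self, code):
        return self._urls[decode(code)]


links = LinkTable()
long_url = "http://fungoes.mek.cc/baseball-stats/player/aaronha01"
code = links.shorten(long_url)
print(code, "->", links.resolve(code))
```

Under this scheme the id 361 encodes to 000a1, and five base-36 characters
leave room for a little over 60 million links.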
Behind the scenes, open-source tools running on a Linux box will power the site.
I looked at many different Web toolkits and development platforms. The current
flavor of the month is Ruby on Rails, but it just doesn't seem ready for
the abuse I would be inflicting on it. Plone has some good merits
but not enough to win me over. Drupal and many of the LAMP applications
don't handle large, complex SQL statements in a simple manner. In the
end, I decided to use OpenACS, with AOLserver as my Web server, PostgreSQL for
SQL and TCL as the scripting language.
On a side note, I currently maintain Uptime and My T Url. Both Uptime and My T Url are
free services--Web site monitoring and short URLs, respectively--with
GPLed source code. They both run under AOLserver, using TCL and
PostgreSQL. I also am involved in the OpenACS community from time to time.
OpenACS, AOLserver, PostgreSQL and TCL are not what most people consider
to be their standard toolset. I am not sure how many open-source packages
use PostgreSQL as their first database choice. Typically, people are
shoe-horning PostgreSQL support into an application that originally used MySQL.
AOLserver has an
undeserved bad rap, mainly due to the "AOL" in its name. AOLserver,
originally called Naviserver, is a multithreaded Web server built on
top of TCL. AOL eventually bought Naviserver and renamed it AOLserver.
AOL currently uses AOLserver for many of its Web sites.
TCL, the ugly duckling of scripting languages, too often is overlooked,
and people sometimes avoid looking at AOLserver and OpenACS because of TCL.
TCL is an easy language to learn; it sports only 90 or so core commands.
AOLserver adds about 20 commands on top of TCL. OpenACS contains about
2,000 procedures in its core packages, but you need only about 10% of
them to build sites. The hardest part of learning OpenACS, in fact, is
not using AOLserver and TCL but understanding that somebody probably
already wrote a procedure or function that does what you need to do.
To be completely honest, OpenACS suffers from two main problems. First, it
can be hard to install. Second, as with many packages that have been
around for a while, it suffers from bloat. I suffer from a little
"bloating" myself, though, so I don't hold that against OpenACS.
My plan for extended spring training, which I'll discuss in a follow-up
article, consists of the following:
- Install two instances of OpenACS, one for development and one for
production.
- Use Subversion for source control.
- Complete the change-over of the Baseball Databank to PostgreSQL.
- Integrate Hack 27 from Baseball Hacks for collection of the current
year's baseball stats.
- Create a baseball stats package for OpenACS that
creates all of the database tables and loads the data.