For Geeks Only
First of all, the title for this page is not just us being clever...there is going to be computer code here. If you are not interested in that kind of thing, please get the heck off this page before you see something you really do not like. We're not kidding...beat it!
For those of you interested, there are basically two semi-cool things that we did with this site, at least from a technical standpoint. If you ask us, the page header with the text and logo superimposed over the handwritten bracket looks pretty cool, too...but anyone that has a free hour and can figure out how to download and install The Gimp can figure out how to do that.
Web Scraping
The first cool thing that we figured out how to do is something called Web Scraping. The actual practice of web scraping has long been used, and is still being used, by some bad people to do some potentially bad things, but none of the folks that have RPI web sites were A) willing to hook me up with a data dump (understandable) or B) had their technical house in order enough to have an API or SDK (inexcusable).
So, we had to turn to web scraping. Here is a very brief explanation of how to do a very simple web scrape using PHP. There are robust software products that only do web scraping, but for something this simple, all you need to know is a little bit about PHP and regular expressions. The basic steps we followed are outlined below:
First, you need to load up the web page that you would like to scrape. For the purposes of this exercise, let's look at the official RPI page on the NCAA's web site. It looks something like this:
Next, you need to view the source of the page you want to scrape. Since we am using Google Chrome for this exercise, let's click on the "Control the current page" button, which looks like this:
, then click on "Developer", then click on "View Source". Here is a picture that should illustrate the clickstream.
The next step is to find the portion of the page source that displays the data you want to scrape. In this case, everything is contined in a table, with each team on its own row ("TR" tag), and each data item in its own cell ("TD" tag). We have highlighted the first such row in the picture below to illustrate what you should be looking for in the page source.
The next thing we need to do is use a PHP regular expression to extract the data. Here is the PHP code to do that:
$html = file_get_content("http://web1.ncaa.org/app_data/weeklyrpi/2009MBBrpi1.html");
//Use Regular Expressions to get an array with all of the data
preg_match_all( '/<TR><TD align=center>(.*?)<\/TD><TD align=center>(.*?)<\/TD><TD align=center>(.*?)<\/td><td align=center>(.*?)<\/td><td align=center>(.*?)<\/td><td align=center>(.*?)<\/td><td align=center>(.*?)<\/td><td align=center>(.*?)<\/td><td align=center>(.*?)<\/td><\/tr>/s',
$html,
$teams,
PREG_SET_ORDER
);
- item[1]: Rank
- item[2]: Previous Rank
- item[3]: Team Name
- item[4]: Conference
- item[5]: W-L
- item[6]: Road W-L
- item[7]: Neutral Site W-L
- item[8]: Home W-L
- item[9]: Non-Div-I W-L
So, when we want to put it all together, at least for our simple example, we are going to get all of our data from the web page (RPI Ranking from the NCAA site), store it into an array, and then output the data in the first 5 rows of that array onto the screen. Here is the code for the page that will make that happen:
<?php
echo("Preparing to open web location: http://web1.ncaa.org/app_data/weeklyrpi/2009MBBrpi1.html...<BR><BR>");
//Get the HTML content for the RPI page
$html = file_get_contents("http://web1.ncaa.org/app_data/weeklyrpi/2009MBBrpi1.html");
echo("File opened...<BR><BR>");
//Use Regular Expressions to get an array with all of the data
preg_match_all( '/<TR><TD align=center>(.*?)<\/TD><TD align=center>(.*?)<\/TD><TD align=center>(.*?)<\/TD><TD align=center>(.*?)<\/TD><TD align=center>(.*?)<\/TD><TD align=center>(.*?)<\/TD><TD align=center>(.*?)<\/TD><TD align=center>(.*?)<\/TD><TD align=center>(.*?)<\/TD><\/TR>/s',
$html,
$teams,
PREG_SET_ORDER
);
$numTeams = count($teams);
echo("I found " . $numTeams . " RPI Rankings...printing data for first 5");
echo("<BR><BR>");
?>
<TABLE>
<TR bgcolor="#CCCCCC">
<TH>RANK</TH>
<TH>PREV RANK</TH>
<TH>TEAM</TH>
<TH>CONF</TH>
<TH>W-L</TH>
<TH>ROAD</TH>
<TH>NUETRAL</TH>
<TH>HOME</TH>
<TH>NON-DIV-I</TH>
</TR>
<?php
//Iterate through the array and write to the screen
for ($i=0; $i<5; $i++)
{
echo("<TR>");
echo("<TD align='center'>");
echo($teams[$i][1]);
echo("</TD>");
echo("<TD align='center'>");
echo($teams[$i][2]);
echo("</TD>");
echo("<TD align='center'>");
echo($teams[$i][3]);
echo("</TD>");
echo("<TD align='center'>");
echo($teams[$i][4]);
echo("</TD>");
echo("<TD align='center'>");
echo($teams[$i][5]);
echo("</TD>");
echo("<TD align='center'>");
echo($teams[$i][6]);
echo("</TD>");
echo("<TD align='center'>");
echo($teams[$i][7]);
echo("</TD>");
echo("<TD align='center'>");
echo($teams[$i][8]);
echo("</TD>");
echo("<TD align='center'>");
echo($teams[$i][9]);
echo("</TD>");
echo("</TR>");
}
?>
</TABLE>
</FONT>
Of course, our production script writes all 300+ rows, and it writes them to our data mart, but we wanted to keep it simple for the purpose of illustrating the concept.
In order to see it in action, click HERE. My fellow Tar Heels fans should love the result, since I am using the 2009 data, and you will see North Carolina at the top of the list, where they will hopefully be once again, sooner rather than later.
Data Mart
If you are a Business Intelligence expert, prepare to be underwhelmed...but we decided to build a simple Data Mart, instead of a merely relational schema. The reason behind this is that we wanted to eventually offer some additional features that allowed users to do some BI-type things, like generate ad hoc team reports, that would best be suited by a true dimensional model / cube.
As of yet, the cube has not been created, and probably will not be until the 2010-2011 season. The easiest way to get this produced and out there was for us to use MySQL, which is always fun, but which does not have the ability to make cubes built in. We have not had time yet to play around with any of the open source business intelligence software...so, no cube...at least not yet.
For those interested in the dimension model itself, here it is:
As you can tell from the picture, it is as simple as it gets. Hopefully, this will have the opportunity to grow into something more complex, but currently, the most complex parts of our process are the 3 PHP scripts that do the daily web scraping / writing to the data mart, calculate the T.I. for all 347 teams, and build the bracket itself.