In this blog series we’ll be experimenting with the most interesting blends of data and tools. Whether it’s mixing traditional sources with modern data lakes, open-source devops on the cloud with protected internal legacy tools, SQL with noSQL, web-wisdom-of-the-crowd with in-house handwritten notes, or IoT sensor data with idle chatting, we’re curious to find out: will they blend? Want to find out what happens when IBM Watson meets Google News, Hadoop Hive meets Excel, R meets Python, or MS Word meets MongoDB?
Today: MS Access meets H2. Test your Baseball Knowledge
The Challenge
Today’s challenge is sport related. How well do you know Major League Baseball? Do you know who had the best pitching and batting record between 1985 and 1990? Do you know who has been the highest-paid baseball player of all times?
For a long time, baseball has been, and arguably still is, the most data focused sport. The most famous documentation of data analytics in Major League Baseball is, of course, the Moneyball book (and subsequent movie), but data analysis in baseball has a much longer history.
For this blending challenge, we used batting and pitching statistics for all active players from 1985 to 2015 in the National League and the American League. Such data has been made publicly available through the Sean Lahman’s website. (A loud “thank you” to all of the site contributors for making this standard baseball encyclopedia publicly available.) The Lahman Database stores player statistics as well as data about managers, birthdates, awards, all-star games, and much more.
Now, in most companies each department owns specific data, sometimes even using different separated databases. For instance, salaries and demographic data are often owned by HR, while performance metrics are owned by Operations. In this experiment, we assume the HR Department hosts salaries and demographics data in a MS Access database and the Operations Department stores the performance metrics (batting and pitching stats, among others) inside an H2 database.
MS Access is part of the Microsoft Office package and therefore available on most Windows based PCs. H2 is a relatively new open source database downloadable at http://www.h2database.com/html/download.html.
Today’s technical challenge is to attempt to blend data from a MS Access database and an H2 database. Will they blend?
Afterwards, on the blended data, we will give a short-guided tour of the KNIME WebPortal, to detect the best-paid and/or best-performing baseball players for each decade.
Topic. Best paid and best performing baseball players from 1985 to 2015.
Challenge. Blend data from MS Access and H2 databases and guide the users through the analytics on a web browser.
Access Mode. Database Connectors and KNIME WebPortal.
The Experiment
Accessing MS Access database
- To access the MS Access database we use the UCanAccess driver. UCanAccess is a pure Java JDBC Driver implementation, which allows Java developers and JDBC client programs to read/write from and to Microsoft Access databases (.mdb and .accdb files). Download the driver file from http://ucanaccess.sourceforge.net/site.html.
- Copy the required jar files (ucanacces-xxx.jar, jackcess-xxx.jar, commons-lang-xx.jar, commons-logging-xxx.jar, jar) into <KNIME_INSTALL>\jre\lib\ext
- Register the driver jar (<KNIME_INSTALL>\jre\lib\ext\ucanacces-xxx.jar) at File → Preferences → Databases in KNIME Analytics Platform. The driver should subsequently be visible in the database driver list for the Database Connector node.
- In the workflow, we used a List Files node and a String Manipulation node to create the UCanAccess JDBC URL as "jdbc:ucanaccess://",\$Location\$, ";showSchema=true".
- The UCanAccess JDBC URL is then fed to the Database Connection node through the Table Row to Variable node.
- Next, two Database Table Selector nodes select the Salaries table and the Master table, containing the players’ salaries and demographics information, respectively.
- Finally, two Database Connection Reader nodes export the data into KNIME Analytics Platform.
Accessing H2 database
- To connect to the H2 database, we used the H2 Connector node.
- After connecting to the database, two Database Table Selector nodes select the Batting and the Pitching statistics tables.
- Finally, two Database Connection Reader nodes read the data into KNIME Analytics Platform.
Blending data from MS Access and H2 databases.
After some transformation and cleaning in the “Preprocessing and joining data” wrapped node, which involved computation and normalization of the average stats for each player as well as normalization of salaries through the CPI (Consumer Price Indexes), we created a sequence of 4 webpages to be displayed on the KNIME WebPortal:
- Page 1 visualizes the average salary for baseball players of the two major baseball leagues, i.e. the National League and the American League, over time.
- Page 2 allows the user to select a 5-year time frame between 1985 and 2015.
- Page 3 displays the top 10 best-paid pitchers for the selected time frame in both National and American League and allows you to select some of them.
- Page 4 numerically describes the selected players through average statistics and radar plots.
- Page 5 displays the top 10 best-paid hitters for the selected time frame in both National and American League and allows you to select some of them.
- Page 6 numerically describes the selected players through average statistics and radar plots.
The final workflow is available for download on the EXAMPLES server under EXAMPLES/01_Data_Access/02_Databases/07_MSAccess_meets_H2EXAMPLES/01_Data_Access/02_Databases/07_MSAccess_meets_H2*.
Note. If you are using a different database, just change the connector node. In the category, Database/Connector in the Node Repository, a number of dedicated connector nodes are available. If you cannot find yours, you can always use the generic Database Connector node. The Database Connector node can connect to all databases, if the appropriate JDBC driver is is provided.
The Results
Yes, they blend! The database blending was successful. The question though is: are the players worth the money?
The scatter plot in page 1 on the WebPortal (Fig. 2) shows the growth of the average salary over the years. As said, salaries have been normalized with. That is quite some growth! Was it worth it?
Let’s select a time frame on page 2, for example 1985 to 1990.
On page 3, let’s select 3 of the top 10 best-paid pitchers who were active between 1985 and 1990: Orel Hershiser, Dan Quisenberry and Dwight Gooden. Who was the best pitchers at that time?
The answer to our question is in page 4. The tables in Fig. 3, extracted from page 4, summarize the pitching stats in a radar plot for each player. Clearly, all the three selected pitchers exhibit great pitching stats, but Orel Hershiser seems to have the best records.
Now it’s your turn and you can have a look at who was the best batters for the selected time frame! Who is your favorite baseball player? Using our workflow, you can check out their batting and pitching stats, directly in KNIME Analytics Platform or on a web browser!
This WebPortal application is the proof of the successful blending of data from MS Access and H2 database. The most important conclusion of this experiment is again: Yes, they blend!
The workflow is available on the KNIME EXAMPLES server under EXAMPLES/01_Data_Access/02_Databases/07_MSAccess_meets_H2EXAMPLES/01_Data_Access/02_Databases/07_MSAccess_meets_H2*.
Coming next...
If you enjoyed this, please share it generously and let us know your ideas for future blends.
We’re looking forward to the next challenge. What about blending XML formatted data with JSON formatted data? Will they blend?
* The link will open the workflow directly in KNIME Analytics Platform (requirements: Windows; KNIME Analytics Platform must be installed with the Installer version 3.2.0 or higher)