Data Henrik: softlayer

Showing posts with label softlayer. Show all posts

Monday, November 17, 2014

A quick look at dashDB and a happy SQuirreL

dashDB slogan on its website

This morning I took some time to take a look at dashDB, a new IBM DWaaS (Data Warehouse as a Service) offering. When you go to the dashDB website, you are offered two choices: Use the dashDB service available on IBM Bluemix or use a Cloudant account to add a warehouse to your JSON database. Let me give you a brief overview of what you can do with dashDB and how I connected a local (open source) SQuirreL SQL client to my new dashDB database.

Cloudant Warehousing (dashDB)

dashDB is a cloud-based analytics database ("analytics in a dash")) with roots in Netezza and DB2 with BLU Acceleration. Data is stored in table (rows and columns) format. It is ready to connect all kinds analytic tools, local or cloud-based, and is already set up for geo-spatial data analysis (instructions on how to use the ESRI ArcGIS Desktop are provided). The best is that your regular SQL database/analytic tools continue to work, see below for details.

dashDB: schema discovery

I started my journey by logging into my existing Cloudant account. There, on the dashboard menu is a new item "Warehousing". When clicking on the "New Warehouse" button, you can select the Cloudant databases that you want to import into the warehouse. Because multiple databases can be associated with a Cloudant account or a Bluemix Cloudant service, this step let's you pick the data of choice. After the source data is chosen, the dashDB database is created and so-called schema discovery turns the JSON documents into rows of tables. Thereafter, the data is ready to have analytics applied. That is the time to launch the dashDB control center, another so-called "dashboard".

The welcome screen shows some of the analytic options, e.g., the database is ready to be used with either Cognos, SPSS, InfoSphere DataStage, R scripts, or all of them and more:

Analytis for dashDB: Cognos, SPSS, DataStage, R

SQuirrel SQL client - dashDB connected

Because some time ago I already tested and blogged about a predecessor of dashDB (see here: how to set it up and how to use R), I was more interested in trying out a JDBC-based client with my new cloud-based data warehouse. Included as part of the dashboard are several sections that help you with the application setup. So it was easy for me to obtain the JDBC URL and configure it and the listed userid/password in my local SQuirrel SQL client (it will work in IBM Data Studio and the Optim tool, too). As you can see from the screenshot, the database connection from my laptop to the cloud-based dashDB succeeded. Ready for some SQL.

My lessons learned from testing database queries on the converted data (JSON to relational) will be part of another blog entry. Stay tuned...

Friday, October 24, 2014

Apply DB2 skills to Hadoop by using Big SQL on Bluemix

IBM Analytics for Hadoop

One of the many services offered on the Platform-as-a-Service (PaaS) IBM Bluemix is "IBM Analytics for Hadoop", basically InfoSphere BigInsights as cloud service. Because it is a Big Data service and it offers SQL capabilities I was eager to test it. Here is a first report how I got some queries running against data on my Hadoop file system in the cloud.

Software for download

After creating the Analytics for Hadoop service (which I left unbound) I launched the dashboard.
The first I noticed is a link to download software. I brought up a long list of drivers, Eclipse plugins, API packages and so forth. If you want to explore the full power of BigInsights / Hadoop you don't need to search for all the relevant and compatible software, it's just a click away. I passed on that offer and instead created new directories in the HDFS. I read in the BigInsights tutorials that the directories are needed to hold my data and to create workbooks or so-called BigSheets to transform data from files to something processable.

Creating the directories is done by clicking on the appropriate icon on top of the directory tree structure, picking the right parent directory and specifying a name. Been there, done that. I then used another icon to invoke the GUI for file upload. Few minutes later I had uploaded two files with historic weather data (see my older blog entries) and was in the BigSheets section of the dashboard, ready to create a workbook. I named it "weather" and also chose a file. Because the input file is not a CSV by definition but uses a semicolon to separate data, I had to apply the "Character Delimited Data with Text Qualifier" reader to it. I specified the semicolon as delimiter and also checked the "header included", i.e., the first row included the column names.

New BigInSights / Hadoop directory and uploaded files

After creating the workbook I clicked on the option "Create Table" and then experienced a "lesson learned":

Lesson learned: creating a table over a workbook in #BigInsights can be done for directories, not files. "sheet.weather" (not shit weather)
— Henrik Loeser (@data_henrik) October 23, 2014

So I created another directory, moved the files in there, deleted the workbook and wanted to create it again. However, I got an error message. Another lesson learned:

Lesson learned: When deleting a workbook from #BigSheets, first purge, then recreate. #bigdata #hadoop #bluemix
— Henrik Loeser (@data_henrik) October 23, 2014

Ok, purge the deleted workbook, then recreate it. Done. I had read that I can share the data of a workbook for use by Big SQL and Hive by creating a table based on the workbook. In my test I created "sheets.weather" as a first table. I also created a second workbook based on my first, removed some columns, and created a second table "sheets.myweather" (not too creative in naming stuff...). Anyway, the main purpose of this exercise was to import data and to get ready for SQL queries.

Next in my journey was to actually using what is called Big SQL, running SQL queries against data in my Hadoop file system. The Analytics for Hadoop service on Bluemix offers a basic SQL web console for this. As an alternative you can also issue queries from applications via JDBC (see software download above) or use other tools. As I was short on time and didn't want install any software, I decided to use the provided SQL console. When executing SQL statements you have the choice of "Big SQL" and "Big SQL V1". I picked the new "Big SQL" as it is based on the DB2 query compiler and runtime infrastructure. This comes in handy when you are unsure about available tables, column names etc. As you can see from the first screenshot below, I started by querying the Big SQL (DB2) system catalog to retrieve my available table names. The second screenshot shows a simple SQL query against the uploaded weather data.

Overall, given my short amount of time I had, it was a nice experience. Having the opportunity to apply DB2 skills towards data in a Hadoop file system reduced the barrier. Given that the data is already uploaded, I am sure I will try out other features of InfoSphere BigInsights / IBM Analytics for Hadoop, too. Have you tested it or Bluemix in general? You can sign up for a free trial (no credit card required) at http://bluemix.net.

DB2 catalog tables for Hadoop

Basic SQL query against Hadoop data

Monday, September 1, 2014

What a plot: DB2, R, and Bluemix help with vacation weather

Last week I reported on how I set up a in-memory DB2 database on IBM Bluemix and loaded some historic weather data. Over the last couple days I used some spare time to play with the Cloud-based analytic capabilities that are provided as part of the Softlayer/Bluemix/DB2 combination. Most of the time went into learning (some basics of) R, an environment for statistical computing and graphics. As an example I wanted to find out what temperatures to expect for a possible September vacation on the German island of Norderney.

[Update 2014-11-04: The Analytics Warehouse service on Bluemix is now called dashDB]

For my small exercise I used data provided by the German Meteorological Service "Deutscher Wetterdienst". It allows to freely download and use (under some conditions) data from several decades of weather observation. I uploaded the data to DB2/Bluemix as described in my previous post.

Bluemix: Change DB2 column name and type

While playing with the data I noticed that the column names required escaping of quotes and the observation dates were stored as integer values (yyyymmdd). In a second upload I simplified the column names and adapted the column data type using the DB2 load wizard (see picture). Thereafter I was set for my experiments with R.

The DB2 Cloud environment provides several examples for programming in R, a special function library "bluR" to easily connect R with DB2-based data, and it features the RStudio to develop, test, and execute code in R. Within RStudio it is possible to execute several demos to learn more about analytics, graphing, and data processing. For the DB2 in-memory database API for R there is a demo as well. You can invoke it using the "demo(blur)" command:

DB2 API demo for R in RStudio

The demo shows how to connect to DB2, execute a query and use the fetched data for analytic processing in R. Last week I already tweeted about how I tested escaping of quote characters (use C style, not SQL style):

For escaping quotes in #SQL statements for #DB2 in #RStudio/#rstats use C style, not SQL. (from my #Bluemix tests) pic.twitter.com/XslDgxZU7L
— Henrik Loeser (@data_henrik) August 29, 2014

The data set which I uploaded to DB2 has daily minimum and maximum temperatures (and lots of other meteorological) for about 70 years. I used a SQL query and then the ggplot2 library to create a graphic. It shows the band for the minimum temperatures for each September day as well as the band for the maximum daily temperatures.

DB2 with R: Historic September temperatures

The code for this graphic is pretty simple (and I started last week looking at R and DB2) and available from my Github account:

1:  ########### R script to analyze historic weather data for min/max values  
2:  ## Written by Henrik Loeser  
3:  ## Connection handle con to BLU for Cloud data warehouse is provided already  
4:  ## For plotting, we are using ggplot2 package  
5:  ##   
6:  library(ggplot2)  
7:  library(bluR)  
8:    
9:  ## initialize DB2 connection and environment  
10:  con <- bluConnect("BLUDB","","")  
11:  bluAnalyticsInit(con)  
12:    
13:  ## query DB2 weather data and fetch min/max values of min/max values  
14:  ## (lower/upper boundary each)   
15:  query<-paste('select max(lufttemperatur_maximum) as maxmax,min(lufttemperatur_minimum) as minmin,min(lufttemperatur_maximum) as minmax,max(lufttemperatur_minimum) as maxmin,tag from (select lufttemperatur_maximum, lufttemperatur_minimum, day(mdatum) as tag from blu01023.klima where month(mdatum)=9) group by tag order by tag asc')   
16:  df <- bluQuery(query,as.is=F)  
17:    
18:  ## Some plotting needs to be done  
19:  jpeg(type='cairo',"tempe.jpg",width=800,height=600)   
20:  ggplot(df, aes(x = TAG))+ylab("Temperature")+xlab("Day")+          
21:     geom_ribbon(aes(ymin = MINMIN, ymax=MAXMIN), fill='blue')+  
22:     geom_ribbon(aes(ymin = MAXMAX, ymax=MINMAX), fill='red')+  
23:     geom_ribbon(aes(ymin = MAXMIN, ymax=MINMAX), fill='white')+  
24:     geom_line(aes(y = MINMIN), colour = 'black') +  
25:     geom_line(aes(y = MAXMIN), colour = 'black') +  
26:     geom_line(aes(y = MINMAX), colour = 'black') +  
27:     geom_line(aes(y = MAXMAX), colour = 'black')   
28:    
29:  sink('/dev/null')   
30:    
31:  bluClose(con)  
32:  ## connection is closed, we are done

Pretty cool (my opinion)! I am already turning into a data scientist. And you can test this yourself on IBM Bluemix with the Analytics Warehouse service (DB2 in-memory database feature).

Wednesday, May 21, 2014

Clouds ahead: Playing with DB2 and Cognos (FREE)

By now everybody should have heard about IBM BLU Acceleration for Cloud. The tagline is "Data Warehousing and Analytics for Everyone" which caught my eye. So this morning I wanted to find out how easy it is to get started. My conclusion: Almost too easy for an IBM product... ;-)

First you have to visit the official BLU for Cloud website at http://bluforcloud.com/. There you click on the button "Try BLU Acceleration now" and you are taken to an overview of currently four different usage plans. The plan I chose is the free trial plan which is hosted on SoftLayer, but there are also metered plans available on SoftLayer or Amazon WebServices (AWS), and a managed service on IBM BlueMix.

After signing up by providing my Google ID (or alternatively name and email address), I was provided with my new account information within seconds and the system stated that everything was ready to go:

BLU Cloud account created successfully

After clicking on the "Start BLU Acceleration" link as shown in the screenshot above, the web-based managed console came up. It allows working with DB2 database objects, to query and analyze the data, and to run reports against the two sample databases. Of course you can upload your own data and try some of the analytic tools on them. My interest was Cognos and I tried the drill down reports:

Cognos drill down in BLU Cloud

In the graphical report based on the sample database you can click on the regions, product categories, etc. and then continue to subcategories and subregions ("drill down"). Of course you could try out Cognos or Industry Models with your own data.

What I did next is to explore how well the Cloud offering integrates with my local tools. So I opened a local shell on my Linux-based ThinkPad and launched the DB2 command line processor. First I added the Cloud-based DB2 server to my local directory using the CATALOG TCPIP NODE command. Next was to add the remote database using CATALOG DATABASE. Last I connected to the database providing the username and, when prompted, the password. Yeah, connected! That was easy!

Catalog the Cloud-based DB2 server and database on local machine

From start to finish it took me about 5 minutes for the signup process, logging into the Cloud service and adding the remote DB2 server to my local system. Have you tried it? It is free and fun. And I can tell my boss that I know my way around in the Cloud.

Pages