Showing posts with label analytics.

Friday, November 25, 2022

Finally together: Db2 and Zeppelin

United: Db2 and Zeppelin
If you have followed my blog, you may have noticed that I have written about Db2 and about Zeppelins in the past - but not together. Today, I am going to discuss how I configured a JDBC interpreter in an Apache Zeppelin notebook to connect to a Db2 on Cloud database. So, finally, within a single blog post, I can talk about both of them. Let's get started.
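
For reference, the core of that configuration is just a handful of interpreter properties. The following is a sketch of what I mean: the driver class and URL format are those of the IBM Db2 JDBC (jcc) driver, the property names are those of Zeppelin's JDBC interpreter, and host, port, and credentials are placeholders to be taken from your Db2 on Cloud service credentials.

  # Apache Zeppelin JDBC interpreter properties (sketch; placeholders in angle brackets)
  default.driver    com.ibm.db2.jcc.DB2Driver
  default.url       jdbc:db2://<host>:<port>/BLUDB:sslConnection=true;
  default.user      <username>
  default.password  <password>

  # Dependency (artifact) to add to the interpreter so the driver gets downloaded:
  com.ibm.db2:jcc:11.5.7.0

With the interpreter restarted, a paragraph starting with %jdbc (or whatever name you gave the interpreter) followed by a SQL statement runs directly against the database.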

Monday, May 18, 2020

Some advanced SQL to analyze COVID-19 data

Learn to write SQL
All of us are impacted by COVID-19. Looking at daily case numbers, the basic reproduction number, or mortality rates has become routine. But what is behind those numbers? How are they computed, and how does SQL help? In this post, I want to discuss how some SQL analytics clauses help to dig through the data. When I teach database systems, I always try to show how it applies to real life. Here, SQL may not directly help to cure the disease, but it is essential to compute and understand the statistics.
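
To give a flavor of the clauses I have in mind: turning cumulative case counts into daily new cases and a 7-day moving average is a job for OLAP (window) functions such as LAG and a windowed AVG. Below is a small sketch run from Python against a hypothetical Db2 table COVID_CASES(REPORT_DATE, TOTAL_CASES); the table and connection details are made up for illustration.

  import ibm_db_dbi

  # Hypothetical table with one row per day holding the cumulative case count
  SQL = """
  WITH daily AS (
    SELECT report_date,
           total_cases - LAG(total_cases, 1, 0) OVER (ORDER BY report_date) AS new_cases
    FROM covid_cases
  )
  SELECT report_date, new_cases,
         AVG(CAST(new_cases AS DOUBLE))
           OVER (ORDER BY report_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS avg_7day
  FROM daily
  ORDER BY report_date
  """

  conn = ibm_db_dbi.connect(
      "DATABASE=BLUDB;HOSTNAME=<host>;PORT=50000;PROTOCOL=TCPIP;UID=<user>;PWD=<password>;", "", "")
  cur = conn.cursor()
  cur.execute(SQL)
  for report_date, new_cases, avg_7day in cur.fetchall():
      print(report_date, new_cases, round(avg_7day, 1))

The same windowing pattern extends to growth factors and rough reproduction number estimates by comparing sums over shifted date ranges.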

Thursday, May 7, 2020

IBM Watson Studio: Download pandas DataFrame as CSV or Excel file

Gist code snippet
Right now, I am working with IBM Watson Studio on an analytics side project. It is nice to work with Jupyter Notebooks, Python, pandas, and data again. Even Db2 is involved.

From working with pandas DataFrames locally, I knew how to turn the data into CSV or Excel files. But in a hosted environment, accessing the local file system is not possible, so some other solution is needed.
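
One approach that works in such a hosted notebook (a minimal sketch, not necessarily the exact code from my project) is to serialize the DataFrame and turn it into a download link rendered right in the notebook output:

  import base64
  import pandas as pd
  from IPython.display import HTML

  def create_download_link(df, filename="data.csv", title="Download CSV"):
      # Serialize the DataFrame to CSV, base64-encode it, and embed it in a data URI
      csv_data = df.to_csv(index=False)
      b64 = base64.b64encode(csv_data.encode()).decode()
      return HTML(f'<a download="{filename}" href="data:text/csv;base64,{b64}">{title}</a>')

  # Example: render a clickable download link in the notebook output
  create_download_link(pd.DataFrame({"a": [1, 2], "b": [3, 4]}))

The same idea works for Excel files by writing df.to_excel() into an io.BytesIO buffer and base64-encoding that buffer instead.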

Tuesday, November 26, 2019

ETL in the cloud using SQL Query and Db2

From COS to Db2 using SQL
The SQL Query service on IBM Cloud allows you to process data stored on Cloud Object Storage (COS) by writing SQL queries. So far, results were shown in the console and/or written back to files on COS. Thus, I was happy to notice a new feature: query results can now be written back to Db2 tables (on cloud). From my tests, both Db2 on Cloud and Db2 Warehouse on Cloud are supported.

Wednesday, August 7, 2019

Track it from the command line: Search audit events in LogDNA using Python

Take a look at security logs
Earlier this year, IBM and LogDNA announced an integrated offering, Activity Tracker with LogDNA. It allows you to manage and search activity events in LogDNA instances on IBM Cloud. There are IAM, account management, and all kinds of service instance events that can be tracked. Viewing the events is typically done in the LogDNA UI. I, however, want to perform searches on the command line and integrate them with Cloud Functions. In this article, I discuss the small tool that I wrote to search the activity logs and export them.
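
At its core, such a tool is just an HTTP call against the LogDNA export API. The sketch below uses the Python requests package; the endpoint, the parameters, and the service-key authentication reflect my understanding of that API and are assumptions to verify against the documentation, not the exact code of my tool.

  import time
  import requests

  LOGDNA_API = "https://api.logdna.com/v1/export"  # assumption: the v1 export endpoint

  def search_events(service_key, query, hours=24):
      # from/to are UNIX timestamps (check the API docs for seconds vs. milliseconds)
      now = int(time.time())
      params = {"from": now - hours * 3600, "to": now, "query": query}
      # The LogDNA service key is passed as the basic-auth user name
      resp = requests.get(LOGDNA_API, params=params, auth=(service_key, ""))
      resp.raise_for_status()
      # The export API streams one JSON document per line
      return [line for line in resp.text.splitlines() if line]

  # Example: search for IAM-related events of the last day
  # events = search_events("<service key>", "action:iam*")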

Friday, February 8, 2019

Startup lessons from a Fuckup Night

Last Wednesday, I attended the Fuckup Night Friedrichshafen Vol. II. If you don't know it, Fuckup Nights is a global movement and event series dedicated to professional failures. That is, usually founders of failed startups tell their stories. Typically, it is a mix of funny adventures into the world of business, some sad parts, and most importantly some lessons learned. So what were the lessons I took away? Read on...

Monday, October 22, 2018

Automated reports with IBM Cloud Functions, Db2 and Slack

GitHub Traffic Analytics
One of my (many) favorite IBM Cloud solution tutorials is about combining serverless and Cloud Foundry for data retrieval and analytics. I blogged about it and described how an automated IBM Cloud Functions action retrieves GitHub statistics and stores them in Db2. Using an embedded Cognos dashboard and regular Javascript / HTML tables, the solution offers GitHub Traffic Analytics. I extended that solution with automatic weekly reports that are posted to Slack.
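
The Slack part boils down to posting a message to an incoming webhook. Here is a minimal sketch of such a Cloud Functions action in Python; the webhook URL and the report text are placeholders, and in the real solution the report is composed from Db2 queries.

  import requests

  def main(params):
      # In the full solution this text would be built from the Db2 traffic data
      report = params.get("report", "GitHub traffic report: no data passed in")
      webhook_url = params["slack_webhook_url"]  # e.g. https://hooks.slack.com/services/...
      # Slack incoming webhooks accept a simple JSON payload with a "text" field
      resp = requests.post(webhook_url, json={"text": report})
      return {"status": resp.status_code}

Bound to a weekly alarm trigger, the action posts the report without any server to manage.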

Wednesday, June 13, 2018

Securing your Python app with OpenID Connect (OIDC)

Authorized access only using OpenID Connect
Some weeks back, I introduced a tutorial on how to analyze GitHub traffic. The tutorial combines serverless technology and Cloud Foundry to automatically retrieve statistics and store them in Db2. The data can then be accessed and analyzed using a Python Flask app. Today, I am going to show you how the web site is protected using OpenID Connect and IBM Cloud App ID.
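
To give an idea of what the OpenID Connect flow looks like in a Flask app, here is a minimal sketch using the Authlib package against the App ID OIDC discovery document. It is an illustration, not the exact code from the tutorial; the tenant URL, client ID, and client secret are placeholders from your App ID service credentials.

  from flask import Flask, redirect, session, url_for
  from authlib.integrations.flask_client import OAuth

  app = Flask(__name__)
  app.secret_key = "change-me"

  oauth = OAuth(app)
  oauth.register(
      name="appid",
      # App ID publishes a standard OIDC discovery document per tenant
      server_metadata_url="https://<region>.appid.cloud.ibm.com/oauth/v4/<tenant-id>/.well-known/openid-configuration",
      client_id="<client-id>",
      client_secret="<client-secret>",
      client_kwargs={"scope": "openid email profile"},
  )

  @app.route("/login")
  def login():
      # Redirect the browser to App ID for authentication
      return oauth.appid.authorize_redirect(url_for("callback", _external=True))

  @app.route("/callback")
  def callback():
      # Exchange the authorization code for tokens; Authlib parses the ID token
      token = oauth.appid.authorize_access_token()
      session["user"] = token.get("userinfo")
      return redirect("/")

Routes that require a login then simply check for "user" in the session and redirect to /login otherwise.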


Wednesday, June 6, 2018

Tutorial: Analyze and visualize open data with Apache Spark

Life Expectancy Map
Many government agencies and public administrations offer access to their data, contributing to the pool of open data. Using IBM Watson Studio with Jupyter Notebooks and Apache Spark, it is simple to retrieve, combine, and analyze data from different sources. The results can be easily visualized. Learn what it takes with this IBM Cloud solution tutorial.

Tuesday, April 24, 2018

Automated, regular database jobs with IBM Cloud Functions (and Db2)

IBM Cloud Functions and Db2
Yesterday, I blogged about the latest tutorial I wrote. The tutorial discusses how to combine serverless and Cloud Foundry for data retrieval and analytics. That scenario came up when I looked into regularly downloading GitHub traffic statistics for improved usage insights. What I needed was a mechanism to execute a small Python script on a daily or weekly basis. After looking into some possible solutions, IBM Cloud Functions was the clear winner. In this blog post, I am going to discuss how simple it is to implement some regular, automated activities, such as maintenance jobs for a cloud database.
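
As a flavor of what such a job looks like: a Cloud Functions action in Python is just a main() function that receives and returns a dictionary. The sketch below runs a statistics-refresh statement against Db2 via the ibm_db driver; it is not the exact code from the tutorial, and the schema, table, and connection parameters are placeholders.

  import ibm_db

  def main(params):
      # Build the connection string from the action parameters (bound from service credentials)
      dsn = (
          "DATABASE={db};HOSTNAME={host};PORT={port};"
          "PROTOCOL=TCPIP;UID={username};PWD={password};"
      ).format(**params)
      conn = ibm_db.connect(dsn, "", "")
      try:
          # Example maintenance task: refresh statistics for one table (placeholder names)
          ibm_db.exec_immediate(
              conn,
              "CALL SYSPROC.ADMIN_CMD('RUNSTATS ON TABLE MYSCHEMA.MYTABLE WITH DISTRIBUTION')",
          )
      finally:
          ibm_db.close(conn)
      return {"done": True}

Hooked up to a periodic (alarm) trigger, the action then runs daily or weekly without any server to manage.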

Monday, April 23, 2018

Use Db2 and IBM Cloud to analyze GitHub traffic data (tutorial)

Architecture: GitHub Traffic Analytics
In a new solution tutorial, I show you how to automatically retrieve and store GitHub traffic data the serverless way with IBM Cloud Functions and Db2. The data can then be analyzed via a Web app deployed to Cloud Foundry on IBM Cloud. The app is secured with App ID using OpenID Connect. The new service Dynamic Dashboard Embedded provides visualization of the views and clones of GitHub repositories.

Thursday, November 30, 2017

IBM Cloud: Some fun with Python and Cloud Foundry Logging


IBM Cloud: Turn Log Data into Donut
Last month, after receiving user questions, I blogged about how to decipher Cloud Foundry log entries. Today, I want to point you to a small Cloud Foundry Python app I wrote. It helps to better understand Python and Cloud Foundry logging. You can also use it to test the IBM Cloud Log Analysis service which provides an easy-to-use interface to logs generated by applications running in the IBM Cloud. In the premium plans, external log events can also be fed into the service for consolidated storage and analysis.

As usual, the code for my app is available on GitHub: https://github.com/data-henrik/application-log-analysis/. Once deployed to IBM Cloud, the app can be used to send messages at a chosen log level back to the server. The server-side log level, i.e., the threshold for processed log messages, can also be set. The app produces diagnostic output on "stdout" and "stderr". The two are treated differently by Cloud Foundry. Here is a screenshot of the logging app:
Test app for Cloud Foundry logging
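
Reduced to a sketch (the real code is in the repository linked above), the idea is simple: regular output goes to stdout, the Python logging module writes to stderr, and the server-side threshold decides which log calls actually produce a line.

  import logging
  import sys

  # Diagnostic log messages go to stderr; Cloud Foundry tags them as ERR
  logging.basicConfig(stream=sys.stderr, level=logging.WARNING,
                      format="%(asctime)s %(levelname)s %(message)s")
  logger = logging.getLogger("app-log-analysis")

  def handle_message(text, level="INFO"):
      # Log at the level chosen in the web form; only levels at or above
      # the server-side threshold (WARNING here) are actually emitted
      logger.log(logging.getLevelName(level), text)
      # Regular application output goes to stdout; Cloud Foundry tags it as OUT
      print("processed message at level", level)

  handle_message("something to investigate", "ERROR")
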
The produced log entries can also be used to try out the IBM Cloud Log Analysis service. Diagnostic logs are automatically forwarded to the Log Search of that service. The messages are fed into Elasticsearch and can be analyzed using Kibana. I wrote some search queries (one shown below) and then built visualizations, like the "Donut" shown above, based on those queries. I will write more about that in a future blog post.
Search Query for Elasticsearch / IBM Cloud Log Analysis

An official tutorial using that app and Log Analysis is available in the IBM Cloud docs.

If you have feedback, suggestions, or questions about this post, please reach out to me on Twitter (@data_henrik) or LinkedIn.

Wednesday, April 5, 2017

Aero Expo, Drones and the IBM Cloud

The Aero Expo, the Global Show for General Aviation, is running in my hometown Friedrichshafen from today until the weekend. One of the expo and conference topics is drones of the future (AERODrones UAS Expo). Drones or UAV (Unmanned Aerial Vehicles) have been and are a hot topic for IBM and its customers. Let me give a brief overview of some interesting work where drones, artificial intelligence, analytics, database systems, Internet of Things (IoT) and the IBM Cloud come together.

Tuesday, November 29, 2016

SQL Magic in Notebooks in the IBM Data Science Experience

New Notebook in IBM Data Science Experience
At the recent IDUG DB2 Tech Conference in Brussels, I gave a talk on using Jupyter Notebooks with IBM DB2 or dashDB. For the presentation I used a local installation of the notebooks and DB2 (never trust Internet connectivity). Part of the talk was about using SQL Magic in a notebook as a simple interface to the database, e.g., for testing and prototyping.
After the conference I received a question about whether it is possible to use SQL Magic with Jupyter Notebooks in the IBM Data Science Experience. The answer is yes, and here is how.
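
In short, the notebook cells look roughly like the sketch below: install the ipython-sql extension together with the ibm_db_sa SQLAlchemy driver, load the magic, and connect. The credentials are placeholders for your own DB2/dashDB service.

  # Notebook cells (the '!' and '%' prefixes are Jupyter, not Python, syntax)
  !pip install --user ipython-sql ibm_db ibm_db_sa

  %load_ext sql
  %sql db2+ibm_db://<user>:<password>@<host>:50000/BLUDB

  # Now SQL can be written directly in a cell
  %sql SELECT tabname FROM syscat.tables FETCH FIRST 5 ROWS ONLY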

Wednesday, March 16, 2016

CeBIT: Goldsmith in the Hybrid Cloud - How to Create Value from Enterprise Data

Gold Nuggets - Data as Gold
Data, data, data. There is a lot of data: data already stored or archived, data about to be produced, generated, or measured, and even missing data. But there is not always value in the accessible data, even though data is considered the new gold. Similar to real gold, which a goldsmith turns into jewelry, data first needs to be worked on: refined, transformed, made interesting to consumers, and turned into information or insight.

Friday, March 11, 2016

Coincidence? CeBIT visitors and weather featuring Jupyter Notebooks, Spark and dashDB

Jupyter Notebook via Bluemix
Next week I am going to talk at the CeBIT fair in Hanover. As usual, I am interested in what the weather will be like. And with every conference or fair, a common question is about attendance. Why not combine the two and analyse past CeBIT weather and visitor counts for some Friday fun? Today I am going to look into Jupyter Notebooks on Apache Spark with some open data stored in dashDB, all available via IBM Bluemix.
(Note that I am in a hurry and don't have time for detailed steps today, but that I share the sources and will add steps later on.)

The screenshot on the right is the result of what I am going to produce today. The source file for the notebook, the exported HTML file, input data, etc. can be found in this GitHub repository. If you came here for DB2 or dashDB, you might wonder what Jupyter Notebooks are. Notebooks are interactive web pages with sections ("cells") that contain text or code. The text can be in different input formats, including Markdown. The code cells support various programming languages, can be edited inline, and are executed on demand. Basically, a notebook is an interactive, on-demand business/database report. And as you can see in the screenshot, the code is able to produce graphs.

The IBM Analytics for Apache Spark service on Bluemix provides those analytic notebooks, and it is the service I provisioned for my tests. Once you launch the service, you can start off with sample notebooks or create them from scratch. I started with samples to get up to speed and then composed my own (see my notebook source on GitHub). It has several cells written in Python to set up a connection to dashDB/DB2, execute queries, fetch data, and process that data within the notebook. The data is used to plot a couple of graphs.

For my example I am using dashDB (a DB2-based service) that I provisioned on Bluemix as a data store. I used the LOAD wizard to create and fill one table holding historic CeBIT dates and visitor counts and another table with historic weather data for Hanover, Germany (obtained from Deutscher Wetterdienst). Within the notebook those tables are queried and the data is fetched into so-called data frames. The data frames are used to transform and shape the data as needed and serve as the source for the generated graphs. Within the notebook it is also possible to combine data frames, execute queries on them, and more - something I didn't do today.
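
To illustrate the pattern, here is a minimal sketch of fetching such a table into a pandas data frame and plotting it. It assumes the ibm_db_dbi driver; the connection details are placeholders and the table name is made up for illustration - the actual notebook is linked above.

  import ibm_db_dbi
  import pandas as pd

  # Placeholder connection details for the dashDB (DB2-based) service
  conn = ibm_db_dbi.connect(
      "DATABASE=BLUDB;HOSTNAME=<host>;PORT=50000;PROTOCOL=TCPIP;UID=<user>;PWD=<password>;", "", "")

  cur = conn.cursor()
  cur.execute("SELECT year, visitors FROM cebit_visitors ORDER BY year")  # hypothetical table
  # Turn the result set into a data frame for further shaping and plotting
  df = pd.DataFrame(cur.fetchall(), columns=[d[0] for d in cur.description])
  df.columns = [c.lower() for c in df.columns]  # DB2 returns upper-case column names
  df.plot(x="year", y="visitors")               # quick line chart of the visitor counts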

To get to my dashDB-based graphs in a Jupyter Notebook on IBM Analytics for Apache Spark, I needed to work around some issues I ran into, including data type casts, naming of result columns, labeling of graphs, sourcing columns as input for a graph, and more. For time reasons, I refer you to the comments in the source code of my notebook.

After all that introduction, here is the resulting graph. It shows that during a sunny and warm week with close to no rain there were fewer CeBIT attendees. A little rain, some sun, and average temperatures yielded a high visitor count. So could it be that the weather-to-attendee relationship is bogus for computer fairs and may only hold for museums? Anyway, it was fun learning Jupyter Notebooks on Bluemix. Now I need to plot my weekend plans...
Historic CeBIT Weather and Attendance

Friday, January 29, 2016

Combining Bluemix, Open Data on Tourism and Watson Analytics for some Friday Insight

Inbound and Outbound Tourism, Watson Analytics
Yup, it is Friday again and the weekend is coming closer and closer. The Carnival or Fasnet/Fastnacht season is close to its peak, and some school holidays and unofficial holidays are coming up late next week. Tourists are pouring into carnival strongholds. Why not take some time today to test drive the Bluemix Analytics Exchange and Watson Analytics with tourism data and try to get some insight?

Wednesday, December 3, 2014

Introduction and resources for migrating from Oracle to DB2

Well, a boring headline for an interesting topic. Originally I had planned to title today's blog entry "Forget about Black Friday and Cyber Monday - Save big by moving to DB2". But then writing this entry dragged on for some days...
Database Conversion Workbench

In the past weeks I have been asked several times about how to migrate off Oracle and on to DB2. Let me give you a quick overview of the technical side; for the financial part, you have to ask an IBM Business Partner or an IBM seller. Although you can move to DB2 from database systems like Microsoft SQL Server, MySQL, and others, I will focus on Oracle because of the compatibility features built into DB2.

A move off Oracle could be for an SAP system (or another vendor application) or for other, custom applications ("non-SAP"). For SAP environments and applications from several other software vendors there is a (kind of) standardized process to migrate a system. The reason is that there are database-specific definitions and feature exploitations. A great example is how SAP makes use of the DB2-only BLU Acceleration to boost performance for analytic environments. Many software vendors provide tools for database migration and related processes or services.

For custom scenarios where the application code is available, a manual migration applies. The traditional barrier to a migration, the (more or less) incompatibility of the products, has been torn down by adding compatibility features to DB2. Some of those features come ready to use for any user, some require specific preparation of DB2 because they may impact the traditional handling/"look and feel". The DB2 Knowledge Center has a good introduction to and overview of that topic: "DB2 Compatibility Features". If you are coming to DB2 with a background in Oracle, then use the Terminology Mapping to discover how products, features, and commands are named in the world of DB2.

From release to release there have been several enhancements to the SQL compatibility with database vendors such as Oracle. An overview by release can be found in the "What's New" section of each of the recent DB2 releases.
I have to point out that the features mentioned in those documents are only related to the SQL language; there have been several other features dedicated to making the transition from Oracle to DB2 as smooth as possible. Some of them are covered in the section "Application development enhancements" of the "What's New" documents.

If you prefer a book over the DB2 Knowledge Center, then I recommend the IBM Redbook "Oracle to DB2 Conversion Guide: Compatibility Made Easy". It gives an overview of DB2, the tools needed for a migration in a non-SAP environment, and the conversion approach. In the appendix you will also find a nice terminology mapping, i.e., an explanation of how Oracle commands and features are named in the world of DB2.

A key tool for custom migrations is the Database Conversion Workbench (DCW). It is a plug-in for IBM Data Studio, a free tool for database administration, design, and SQL development. The DCW allows you to analyze a database schema with respect to DB2 compatibility. The result is a valuable foundation for estimating the overall conversion effort. Once that is done (or not needed), the Database Conversion Workbench helps in the process of moving the database schema, database-side SQL packages, and thereafter the data from the other database system to DB2. DCW also includes a package visualizer to help understand package dependencies, which simplifies the code conversion. See this page for an overview and more resources around the Database Conversion Workbench.

An important DB2 feature related to the compatibility is the ability to run PL/SQL code. Several administrative PL/SQL packages ship with DB2 and can be found in the "built-in modules" section of the documentation. Moreover, there are also some PL/SQL packages that can be used "as is" and are available from IBM developerWorks in the database conversion community; see there for the add-on PL/SQL packages.
That's all for now with my overview of resources for the Oracle to DB2 migration. I hope that it provides a good introduction to that (sometimes confusing) topic.

BTW: I have been covering many migration-related topics in my blog. Serge Rielau and now Rick Swagerman have provided many SQL tips for DB2 in their blog.

Friday, October 24, 2014

Apply DB2 skills to Hadoop by using Big SQL on Bluemix

IBM Analytics for Hadoop
One of the many services offered on the Platform-as-a-Service (PaaS) IBM Bluemix is "IBM Analytics for Hadoop", basically InfoSphere BigInsights as a cloud service. Because it is a Big Data service and offers SQL capabilities, I was eager to test it. Here is a first report on how I got some queries running against data in my Hadoop file system in the cloud.

Software for download
After creating the Analytics for Hadoop service (which I left unbound) I launched the dashboard.
The first thing I noticed was a link to download software. It brought up a long list of drivers, Eclipse plugins, API packages, and so forth. If you want to explore the full power of BigInsights / Hadoop, you don't need to search for all the relevant and compatible software; it's just a click away. I passed on that offer and instead created new directories in the HDFS. I read in the BigInsights tutorials that the directories are needed to hold my data and to create workbooks or so-called BigSheets to transform data from files into something processable.

Creating the directories is done by clicking on the appropriate icon on top of the directory tree structure, picking the right parent directory, and specifying a name. Been there, done that. I then used another icon to invoke the GUI for file upload. A few minutes later I had uploaded two files with historic weather data (see my older blog entries) and was in the BigSheets section of the dashboard, ready to create a workbook. I named it "weather" and also chose a file. Because the input file is not a CSV by definition but uses a semicolon to separate data, I had to apply the "Character Delimited Data with Text Qualifier" reader to it. I specified the semicolon as the delimiter and also checked "header included", i.e., the first row contains the column names.

New BigInSights / Hadoop directory and uploaded files
After creating the workbook I clicked on the option "Create Table" and promptly experienced a "lesson learned": the data files need to live in their own directory for this to work.

So I created another directory, moved the files in there, deleted the workbook, and wanted to create it again. However, I got an error message. Another lesson learned: a deleted workbook has to be purged before its name can be reused.

Ok, purge the deleted workbook, then recreate it. Done. I had read that I can share the data of a workbook for use by Big SQL and Hive by creating a table based on the workbook. In my test I created "sheets.weather" as a first table. I also created a second workbook based on my first, removed some columns, and created a second table "sheets.myweather" (not too creative in naming stuff...). Anyway, the main purpose of this exercise was to import data and to get ready for SQL queries.

Next in my journey was to actually use what is called Big SQL: running SQL queries against data in my Hadoop file system. The Analytics for Hadoop service on Bluemix offers a basic SQL web console for this. As an alternative, you can also issue queries from applications via JDBC (see the software download above) or use other tools. As I was short on time and didn't want to install any software, I decided to use the provided SQL console. When executing SQL statements you have the choice of "Big SQL" and "Big SQL V1". I picked the new "Big SQL" as it is based on the DB2 query compiler and runtime infrastructure. That comes in handy when you are unsure about available tables, column names, etc. As you can see from the first screenshot below, I started by querying the Big SQL (DB2) system catalog to retrieve my available table names. The second screenshot shows a simple SQL query against the uploaded weather data.

Overall, given the short amount of time I had, it was a nice experience. Being able to apply DB2 skills to data in a Hadoop file system lowered the barrier. Given that the data is already uploaded, I am sure I will try out other features of InfoSphere BigInsights / IBM Analytics for Hadoop, too. Have you tested it, or Bluemix in general? You can sign up for a free trial (no credit card required) at http://bluemix.net.
DB2 catalog tables for Hadoop

Basic SQL query against Hadoop data


Monday, September 1, 2014

What a plot: DB2, R, and Bluemix help with vacation weather

Last week I reported on how I set up an in-memory DB2 database on IBM Bluemix and loaded some historic weather data. Over the last couple of days I used some spare time to play with the cloud-based analytic capabilities that are provided as part of the Softlayer/Bluemix/DB2 combination. Most of the time went into learning (some basics of) R, an environment for statistical computing and graphics. As an example, I wanted to find out what temperatures to expect for a possible September vacation on the German island of Norderney.

[Update 2014-11-04: The Analytics Warehouse service on Bluemix is now called dashDB]

For my small exercise I used data provided by the German Meteorological Service, "Deutscher Wetterdienst". It allows you to freely download and use (under some conditions) data from several decades of weather observations. I uploaded the data to DB2/Bluemix as described in my previous post.
Bluemix: Change DB2 column name and type
While playing with the data I noticed that the column names required escaping of quotes and that the observation dates were stored as integer values (yyyymmdd). In a second upload I simplified the column names and adapted the column data type using the DB2 load wizard (see picture). Thereafter I was set for my experiments with R.

The DB2 cloud environment provides several examples for programming in R, a special function library "bluR" to easily connect R with DB2-based data, and it features RStudio to develop, test, and execute code in R. Within RStudio it is possible to run several demos to learn more about analytics, graphing, and data processing. For the DB2 in-memory database API for R there is a demo as well. You can invoke it using the "demo(blur)" command:

DB2 API demo for R in RStudio
The demo shows how to connect to DB2, execute a query, and use the fetched data for analytic processing in R. Last week I already tweeted about how I tested escaping of quote characters (use C style, not SQL style).

The data set which I uploaded to DB2 has daily minimum and maximum temperatures (and lots of other meteorological data) for about 70 years. I used an SQL query and then the ggplot2 library to create a graphic. It shows the band of the minimum temperatures for each September day as well as the band of the daily maximum temperatures.
DB2 with R: Historic September temperatures
The code for this graphic is pretty simple (and I only started looking at R and DB2 last week) and is available from my GitHub account:
########### R script to analyze historic weather data for min/max values
## Written by Henrik Loeser
## Connection handle con to BLU for Cloud data warehouse is provided already
## For plotting, we are using ggplot2 package
##
library(ggplot2)
library(bluR)

## initialize DB2 connection and environment
con <- bluConnect("BLUDB","","")
bluAnalyticsInit(con)

## query DB2 weather data and fetch min/max values of min/max values
## (lower/upper boundary each)
query<-paste('select max(lufttemperatur_maximum) as maxmax,min(lufttemperatur_minimum) as minmin,min(lufttemperatur_maximum) as minmax,max(lufttemperatur_minimum) as maxmin,tag from (select lufttemperatur_maximum, lufttemperatur_minimum, day(mdatum) as tag from blu01023.klima where month(mdatum)=9) group by tag order by tag asc')
df <- bluQuery(query,as.is=F)

## Some plotting needs to be done
jpeg(type='cairo',"tempe.jpg",width=800,height=600)
ggplot(df, aes(x = TAG))+ylab("Temperature")+xlab("Day")+
   geom_ribbon(aes(ymin = MINMIN, ymax=MAXMIN), fill='blue')+
   geom_ribbon(aes(ymin = MAXMAX, ymax=MINMAX), fill='red')+
   geom_ribbon(aes(ymin = MAXMIN, ymax=MINMAX), fill='white')+
   geom_line(aes(y = MINMIN), colour = 'black') +
   geom_line(aes(y = MAXMIN), colour = 'black') +
   geom_line(aes(y = MINMAX), colour = 'black') +
   geom_line(aes(y = MAXMAX), colour = 'black')

sink('/dev/null')

bluClose(con)
## connection is closed, we are done


Pretty cool (my opinion)! I am already turning into a data scientist. And you can test this yourself on IBM Bluemix with the Analytics Warehouse service (DB2 in-memory database feature).