Showing posts with label in-memory.

Monday, August 25, 2014

Setting up and using a DB2 in-memory database on IBM Bluemix

[Update 2014-11-04: The Analytics Warehouse service on Bluemix is now called dashDB.]
Last Friday I was on the way back from some customer visits. While traveling on a German high-speed train I used the on-board Wi-Fi, connected to IBM Bluemix, and created a DB2 in-memory database. Let me show you how I set it up, what you can do with it, and how I connect to the cloud-based database from my laptop.


Unbound DB2 service on Bluemix
The first thing to know is that on Bluemix the DB2 in-memory database service is called IBM Analytics Warehouse. To create a database, you select "Add service" and can leave it unbound, i.e., not directly associated with any Bluemix application. That is ok because at this point we are only interested in the database. Once the service is added and the database itself created, you can launch the administration console.

The console supports several administration and development tasks, as shown in the picture. It lets you load data, develop analytic scripts in R, execute queries, and link the data with Microsoft Excel for processing in a spreadsheet, and it has a section for connecting external tools or applications to the database.
Administration/development task in DB2 BLU console on Bluemix
One of the offered tasks is very interesting, and I tweeted about it on Friday, too:



You can set up replication from a Cloudant JSON database to DB2, so that the data stream is directly fed in for in-memory analyses. I haven't tested it yet, but plan to do so with one of my other Bluemix projects.

A task that I did use is (up)loading data. For this I took some historic weather data (planning ahead for a vacation location), let the load wizard extract the metadata to create a suitable table, and ran some queries.

Uploading data to DB2 on Bluemix

Specify new DB2 table and column names

For executing (simple) selects there is a "Run Query" dialog. It lets you choose a table and columns and then generates a basic query skeleton. I looked into whether a specific German island had warm nights, i.e., a daily minimum temperature of over 20 degrees Celsius. Only 14 days out of several decades and thousands of data points qualified.
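The generated skeleton is easy to extend by hand. As a rough sketch of the kind of query I ran - the table and column names here are made up for illustration, they are not the ones the load wizard created for my data set:

select observation_date, min_temperature
from weather_history
where min_temperature > 20.0
order by min_temperature desc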







Last but not least, I connected my local DB2 installation and tools to the Bluemix/Softlayer-based instance. The "CATALOG TCPIP NODE" command is needed to make the remote server and communication port known. Then the database is added. If you already have a database with the same name cataloged on the local system, you will get an error message as shown below. You can work around it by specifying an alias. So instead of calling the database BLUDB, I used BLUDB2. The final step was to connect to DB2 with BLU Acceleration in the cloud. And surprise, it uses a fixpack version that officially is not available yet for download...

DB:  => catalog tcpip node bluemix remote 50.97.xx.xxx server 50000
DB20000I  The CATALOG TCPIP NODE command completed successfully.
DB21056W  Directory changes may not be effective until the directory cache is
refreshed.
DB:  => catalog db bludb at node bluemix
SQL1005N  The database alias "bludb" already exists in either the local
database directory or system database directory.
DB:  => catalog db bludb as bludb2 at node bluemix
DB20000I  The CATALOG DATABASE command completed successfully.
DB21056W  Directory changes may not be effective until the directory cache is
refreshed.
DB:  => connect to bludb2 user blu01xxx
Enter current password for blu01xxx:

   Database Connection Information

 Database server        = DB2/LINUXX8664 10.5.4
 SQL authorization ID   = BLU01xxx
 Local database alias   = BLUDB2

I plan to develop a simple application using the DB2 in-memory database (BLU Acceleration / Analytics Warehouse) and then write about it. Until then, read more about IBM Bluemix in my other related blog entries.

Wednesday, July 23, 2014

Watch this! Move your DB2 monitoring to the in-memory interface (WLM monitoring)

Since its first days as a database management system, DB2 has constantly changed. It has been extended with new features to serve customer requirements and adapted to the state of the art in hardware and software technology. One major new feature was the introduction of DB2 Workload Management in version 9.5 and the related, more comprehensive monitoring with finer granularity (in-memory metrics monitoring) in version 9.7. As with many product changes, it takes a while for customers to really put them to use and reap the benefits, especially when the existing functionality still works.

Thus I was happy to see a new article on IBM developerWorks describing how to move off the (old) snapshot monitoring interfaces in DB2 to the in-memory metrics monitoring. The article includes an overview of the advantages of the newer interface, which should get you motivated to read the rest of it (and then to migrate if you have not done so yet). It contains a side-by-side comparison of old and new interfaces and has many sample SQL queries. The queries demonstrate how to obtain DB2 runtime metrics using the old and new interfaces for some popular monitoring tasks. You can find the documentation of the SQL interface to the in-memory metrics in the DB2 Knowledge Center in this overview. Most of the pages in the manual have further SQL samples to get you started.
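To give a flavor of the difference (my own minimal sketch, not taken from the article): the old interface exposes snapshot data through administrative views such as SYSIBMADM.SNAPDB, while the new interface reads the in-memory metrics through MON_GET table functions.

-- old snapshot-based administrative view (database level)
select db_name, rows_read, rows_selected from sysibmadm.snapdb

-- new in-memory metrics interface (connection level, -2 = all members)
select application_handle, rows_read, rows_returned, total_cpu_time
from table(mon_get_connection(cast(null as bigint), -2))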

So take a look, it will also help you with one of the upcoming DB2 quizzes on this blog.

Friday, August 9, 2013

DB2 BLU: Some additional metadata, the synopsis table, and page map indexes

Today I continue my recently started journey into the metadata of DB2 with BLU Acceleration. I had created a table, looked at the catalog, and found a synopsis table. The new column-organized tables for in-memory analytics are supposed to be simple to use, with no indexes to create. No indexes? Let me check...


First on today's list is a look at SYSCAT.INDEXES, the catalog view holding index information.

select indschema, indname, tabname, colnames, indextype from syscat.indexes where tabname like '%BLUDATA%'

INDSCHEMA     INDNAME               TABNAME                       COLNAMES                           INDEXTYPE
------------- --------------------- ----------------------------- ---------------------------------- ---------
SYSIBM        SQL130809032717072049 BLUDATA                       +SQLNOTAPPLICABLE+SQLNOTAPPLICABLE CPMA
SYSIBM        SQL130809032717232599 SYN130809032716942015_BLUDATA +SQLNOTAPPLICABLE+SQLNOTAPPLICABLE CPMA

  2 record(s) selected.
What is interesting to find is that two indexes have been created: one on the base table, one on the synopsis table. They are of a new index type, CPMA, which stands for CDE Page MAp (with CDE standing for Columnar Data Engine). I found this in the description of SYSCAT.INDEXES and by dumping the index data using db2dart (see below).

Next I wanted to see how the synopsis table changes when data is added. The following small script generated my data file with 10000 rows.

#!/bin/bash
# Generate a CSV data file with 10000 rows: id, quota, sold, empty comment

for i in {1..10000}
do
   echo "$i,0.8,$i,''" | tee -a bludata.csv
done


Then I was ready for loading the data. Here are some snippets of the output.

load from bludata.csv of del replace into hl.bludata
...
SQL3500W  The utility is beginning the "ANALYZE" phase at time "08/09/2013
05:06:47.313324".
...
SQL3500W  The utility is beginning the "LOAD" phase at time "08/09/2013
05:06:48.215545".
...
SQL3110N  The utility has completed processing.  "10000" rows were read from
the input file.
...

SQL3500W  The utility is beginning the "BUILD" phase at time "08/09/2013
05:06:48.802651".




First the data is analyzed to optimize the data layout and the column-specific compression. Then comes the actual load phase, followed by a build phase (for maintaining the internal index). With an empty table the synopsis table was empty, too. Now, with 10000 rows loaded, we can find some entries:
 
select * from sysibm.SYN130809032716942015_BLUDATA

IDMIN       IDMAX       QUOTAMIN QUOTAMAX SOLDMIN     SOLDMAX     TSNMIN               TSNMAX             
----------- ----------- -------- -------- ----------- ----------- -------------------- --------------------
          1        2328     0.80     0.80           1        2328                    0                 1023
        754        2810     0.80     0.80         754        2810                 1024                 2047
       1296        3563     0.80     0.80        1296        3563                 2048                 3071
       1567        4187     0.80     0.80        1567        4187                 3072                 4095
       3895        6243     0.80     0.80        3895        6243                 4096                 5119
       4790        6996     0.80     0.80        4790        6996                 5120                 6143
       5061        7387     0.80     0.80        5061        7387                 6144                 7167
       5694        8283     0.80     0.80        5694        8283                 7168                 8191
       7750       10000     0.80     0.80        7750       10000                 8192                 9215
       8984        9767     0.80     0.80        8984        9767                 9216                 9999

  10 record(s) selected.


For every 1024 rows there is an entry in the synopsis table. There is not much surprising to find, as my test data only has variations in the ID and SOLD columns.
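A quick sanity check on that granularity (a sketch using the synopsis table name generated in my database; yours will differ): 10000 rows split into ranges of 1024 tuple sequence numbers should give 10 entries.

select count(*) as synopsis_entries from sysibm.SYN130809032716942015_BLUDATA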

My last action for today was to look at the index data as stored on disk. By invoking db2dart with the DI option it is possible to dump formatted index data. Both indexes have only 3 pages each, and you can try db2dart yourself to see the output. Included in each report, as part of the index metadata, is the index type, which reads as follows:
Index Type = NonUnique Bi-directional Large RID CDE Page Map Split-High No exclusive lock optimzation RID-list Compression Variable Slot Directory Prefix Compression
Seeing it labeled as a Page Map index verifies that I dumped the right index. That concludes my Friday "spare time"; have a nice weekend. And if you have time to read, here is the link to my other DB2 BLU-related posts.



Monday, August 5, 2013

DB2 BLU: A look at the catalog tables and the synopsis table

DB2 10.5 with BLU Acceleration lets you create column-organized tables. They are the foundation for "actionable compression" and smart in-memory processing. I plan to cover some details in upcoming posts and have already written some articles about DB2 BLU earlier. Today I show you what happens in the DB2 catalog, the metadata, when you create a table.

I started with "db2set DB2_WORKLOAD=ANALYTICS". This is the magic knob for simplicity and performance for analytic queries. After creating a database, I connected and then was ready to create a table.
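As a rough sketch, the sequence up to that point looks like this from the command line (the database name TESTDB is just a placeholder I picked):

db2set DB2_WORKLOAD=ANALYTICS
db2 "create database testdb"
db2 "connect to testdb"

With the registry variable set before database creation, the new database picks up the analytics-friendly defaults, including column organization for new tables.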

create table hl.bludata(id int, quota decimal(5,2),sold int, comment varchar(200))
DB20000I  The SQL command completed successfully.


The most important metadata for tables is accessible through the SYSCAT.TABLES catalog view. The following query fetches a small subset of its columns for the table I just created:

select tabname,tabschema,property,compression,tableorg from syscat.tables where tabname like '%BLUDATA%'

TABNAME                        TABSCHEMA PROPERTY                         COMPRESSION TABLEORG
------------------------------ --------- -------------------------------- ----------- --------
BLUDATA                        HL                            Y                         C      
SYN130805063137432548_BLUDATA  SYSIBM                        YY                        C      

  2 record(s) selected.


But wait, there are two tables. The first has the name and schema name as specified; the second is created under the covers in the schema SYSIBM. It is the so-called synopsis table, and its name starts with the prefix SYN followed by a timestamp and the name of the base table. Both tables are column-organized (TABLEORG=C). For both we also see a value for PROPERTY, which is a character vector. The first "Y" is at position 20, indicating a column-organized table. The synopsis table also has a "Y" at position 21, officially marking it as a synopsis table (see the descriptions for SYSCAT.TABLES).
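A quick way to check those two positions directly is to extract them with SUBSTR (my own sketch, not from the catalog documentation):

select tabname,
       substr(property, 20, 1) as col_organized,
       substr(property, 21, 1) as is_synopsis
from syscat.tables where tabname like '%BLUDATA%'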

The purpose of the synopsis table is to help with the data organization of the base table and to aid DB2 with fast data skipping during query processing (I plan to talk about dynamic list prefetching in a future post).

Because both tables are related to each other, another catalog table is involved, SYSCAT.TABDEP, where table or object dependencies are recorded.

select tabname,dtype,bname,bschema,btype from syscat.tabdep where tabname like '%BLUDATA%' or bname like '%BLUDATA%'

TABNAME                       DTYPE  BNAME   BSCHEMA BTYPE
----------------------------- ------ ------- ------- -----
SYN130805063137432548_BLUDATA 7      BLUDATA HL      T   

  1 record(s) selected.



Based on the metadata, the table "SYN130805063137432548_BLUDATA" is a synopsis table (DTYPE=7) that depends on the table I created (BLUDATA). But what does the synopsis table look like? I used the command "describe table" to see which columns have been defined:

describe table SYSIBM.SYN130805063137432548_BLUDATA

                                Data type                     Column
Column name                     schema    Data type name      Length     Scale Nulls
------------------------------- --------- ------------------- ---------- ----- ------
IDMIN                           SYSIBM    INTEGER                      4     0 Yes  
IDMAX                           SYSIBM    INTEGER                      4     0 Yes  
QUOTAMIN                        SYSIBM    DECIMAL                      5     2 Yes  
QUOTAMAX                        SYSIBM    DECIMAL                      5     2 Yes  
SOLDMIN                         SYSIBM    INTEGER                      4     0 Yes  
SOLDMAX                         SYSIBM    INTEGER                      4     0 Yes  
TSNMIN                          SYSIBM    BIGINT                       8     0 No   
TSNMAX                          SYSIBM    BIGINT                       8     0 No   

  8 record(s) selected.


What can be seen is that for every non-character column of my table there is a corresponding MIN and MAX column. In addition there are two BIGINT columns, TSNMIN and TSNMAX. TSN stands for Tuple Sequence Number and helps to recompose a row from the individual column values. The synopsis table plays a pivotal role (pun intended) in faster query execution. But that is part of a future post...

Thursday, August 1, 2013

CTQ: Come on, Tune my Query (with DB2 BLU)

DB2 10.5 comes with BLU Acceleration. In a previous post I explained which technologies are used for BLU and other analytic accelerators. For BLU it is a mix of columnar storage organization with extreme compression, exploitation of parallel computing and new CPU instruction sets, plus enhancements to bufferpool and I/O management for in-memory processing - quite a long list. But how can I find out whether BLU Acceleration is used?

The key is to look for the new CTQ operator in the query access plans (explain output). A usage example is shown here in the DB2 Information Center:
Access Plan:
-----------
 Total Cost:   570246
 Query Degree:  32

              Rows 
             RETURN
             (   1)
              Cost 
               I/O 
               |
               191 
             LTQ   
             (   2)
             570246 
             505856 
               |
               191 
             CTQ   
             (   3)
             570245 
             505856 
               |
               191 
             GRPBY 
             (   4)
             570245 
             505856 
               |
           2.87997e+08 
             ^HSJOIN
             (   5)
             377881 
             505856 
         /-----+------\
   2.87997e+08         2e+06 
     TBSCAN           TBSCAN
     (   6)           (   7)
     329484           4080.56 
     499008            6848 
       |                |
   2.87997e+08         2e+06 
 CO-TABLE: TPCDS  CO-TABLE: TPCDS   
   STORE_SALES       CUSTOMER
       Q1               Q2

But what is the CTQ operator, and what does CTQ stand for? Is it "Critical To Quality", "Copy The Query", "Come on, Tune the Query", or "Critical To Quickness"? TQs are so-called Table Queue operators and are used to pass an intermediate result set from A to B. They are often seen in partitioned databases (InfoSphere Warehouse) as either Directed Table Queues (DTQs) or Broadcast Table Queues (BTQs), which move data across nodes in the cluster. For BLU Acceleration a CTQ is the "columnar to row" gatekeeper: it marks the point where the world of BLU is left and regular DB2 technology takes over.

Ideally, there is only a single CTQ in a query plan, and it sits high up, because that means (almost) everything is processed with BLU Acceleration. If you store only some tables in columnar format and others in classic row organization (remember: DB2 with BLU Acceleration is hybrid), and you see many CTQs in an access plan, all of them near the bottom of the plan, then rethinking the strategy for the physical database design might be a good idea.
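To check this on your own system, a plan like the one above can be captured with the explain facility. Here is a sketch, assuming the explain tables already exist and using a placeholder query against my test table:

set current explain mode explain
select count(*) from hl.bludata where quota > 0.5
set current explain mode no

Formatting the captured plan, for example with "db2exfmt -d testdb -1 -o plan.txt", produces a graph like the one shown earlier, where any CTQ operators are easy to spot.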

Come on, Tune the Queries - of course with DB2 and BLU Acceleration...

Friday, July 26, 2013

Epilepsy and computers: Types of Memory

Last month my son and I spent a week in hospital again for a regular post-surgery check-up. As you might know from past blog posts, my youngest son had a brain tumor which caused epilepsy. The tumor and some brain tissue including the hippocampus on his right side were removed two years ago. A hippocampus is like the central memory controller in a computer, and fortunately we have two of them for redundancy and failover.

When you work with computers, especially database systems like DB2 or Oracle, all day, it is interesting to look deeper into how the human memory works. There is a sensory memory acting like an I/O buffer, the short-term memory (think CPU registers, caches, and main memory), and the long-term memory (disk, tape, etc.). What I learned at the recent hospital visit is about the different types of long-term memory. Typically it is classified as declarative memory and procedural memory. You can think of the declarative memory as your "storage of facts" or the data in a database system. The procedural memory is for processes that can be performed like walking, playing piano, riding a bike, etc. In DB2 you could compare it to the package cache where information about how to execute statements is stored.

Having different types of memory and different ways in which new information is acquired and managed leads to the surprising fact that even though someone might have trouble learning new facts (because a hippocampus is missing), that person could still be excellent at learning and reciting piano pieces.

As written earlier: How does someone cope with memory problems caused by epilepsy? Our son has occupational therapy to develop strategies for more efficient use of his working memory and also to train (build up) his capabilities. The Epilepsy Society has this summary on the memory issue and workarounds that can be used, like sticky notes (external memory), task lists, etc.

Friday, July 19, 2013

Redbook Solution Guide: Faster DB2 Performance with IBM FlashSystem

A so-called IBM Redbook Solution Guide about using flash memory for DB2 has been published. It gives an overview of how and where performance could be improved by using an IBM FlashSystem and provides some test results. The short article also has many links to background material. It is not a deep dive into the topic, but a good introduction to possible performance gains and to when to consider investing in flash memory instead of an entirely new system.

Wednesday, July 3, 2013

IBM and DB2 analytics accelerators, in-memory computing, Big Data, NoSQL, ...

In recent months I had to answer questions about what technologies and products IBM offers for accelerating analytic queries, whether products like DB2 (for z/OS and for Linux, UNIX, and Windows) or Informix offer in-memory computing capabilities, why the products on the mainframe and the Intel/POWER platforms differ, and much more. Often an additional request is to answer in as few sentences as possible. :)

Here is an attempt to give you some answers and background information from my point of view, not necessarily IBM's. The article was written in a single afternoon with the plan to point to it and say "RTFM" instead of repeating myself...

Let me start this endeavor the way I would in front of students when I teach. So let's take a brief look at some relevant technologies and terms:
  • In-memory computing or in-memory processing: Instead of loading data from disk to memory on demand, all or most of the data to be processed is kept in memory for faster access. This could be traditional RAM or even flash memory to allow persistence. If RAM is used, data is initially loaded from disk.
  • Columnar storage or column-oriented databases: Traditional (relational) database systems use row-oriented storage and are optimized for row-by-row access. Analytic applications, however, typically scan and aggregate large amounts of data with similar values or properties. With column-oriented storage, many occurrences of the same value can be stored as a single entry that points to all related "rows" (row IDs - this is similar to RID-list compression for indexes). Columnar storage usually yields excellent compression ratios for analytic data sets.
  • Massively parallel processing (MPP) and Single Instruction, Multiple Data (SIMD): By processing several data sets or values in parallel, either by means of many CPUs or special CPU instructions, the entire data volume can be processed faster than in a regular, serial way (one CPU core processing the entire query).
  • Appliances, workload optimized appliance or expert-integrated systems: A combination of hardware and software that typically is "sealed off". Thus, it does not offer many knobs to turn which makes it simple to use. Analytic appliances and data warehouse appliances therefore have a focus on hardware and software for analytic query processing.
  • Big Data: This term is used to describe the large data sets that typically cannot be stored and processed with traditional database systems. MapReduce or "divide and conquer" algorithms are used for processing, similar to parallel processing as mentioned above. The boundaries between "traditional" database systems and Big Data systems are moving.
  • NoSQL: For handling big data sets, ways of instructing the database system other than SQL are used, hence "NoSQL". However, sometimes "NoSQL" is also read as "not-only SQL", hinting that the boundaries are moving and hybrid systems are available.
Based on specific customer requirements, on what is available in terms of technology, what makes sense cost-wise, and what fits into the strategy, different IBM products are available which use one or more of the mentioned technologies or relate to the terms.
  • DB2 for Linux, UNIX, and Windows (LUW) has been using parallelism (MPP, see above) for many years for data warehouses. Based on the shared nothing principle for the data, it can parallelize query processing and aggregate data quickly. It has been around as DB2 Parallel Edition, DB2 Database Partitioning Feature, DB2 Balanced Configuration Unit (BCU), InfoSphere Warehouse, and other names.
  • DB2 LUW with BLU Acceleration, introduced in DB2 10.5, offers appliance-like simplicity. By setting a single switch, data is stored column-oriented with the mentioned benefits. Based on improvements to the traditional bufferpool or data cache of row-oriented database technology, it can hold all or only parts of the data in memory; thus, in-memory processing is possible. BLU Acceleration utilizes SIMD technology to parallelize and speed up work, and data is laid out for the processor architecture. Individual tables can be stored either row-oriented or column-oriented, and both types of table can be accessed in a single (complex) query.
  • IBM Netezza is an analytic or data warehouse appliance. It makes use of specialized hardware and MPP (see above) to speed up query execution. 
  • Informix has the Informix Warehouse Accelerator to serve complex analytic queries. It uses column-oriented storage and parallelism and data is held in memory.
  • The IBM DB2 Analytics Accelerator (IDAA) can be added to DB2 for z/OS. The main design goal was to keep the System z DB2 attributes like security and the approach to administration. IDAA is based on the Netezza technology (see above) and integrates the appliance into the DB2 for z/OS environment. All access and administration is done through DB2.
  • InfoSphere BigInsights is IBM's Hadoop-based offering for the Big Data market.
To simplify the purchase process and the administration, to reduce the so-called time to value, and for many more marketing reasons ;-), IBM has introduced several appliances that make use of the products listed above. They are all named IBM PureData Systems now but, depending on the specific name, serve different markets.
With my spare time this afternoon coming to an end, I will end this introduction here. Please leave comments with hints on what to add, remove, or fix. And as you can see, I was able to resist the urge to mention SAP HANA, Oracle Exadata, Sybase IQ, Teradata, Greenplum, Microsoft, etc. :)