Henrik's thoughts on life in IT, data and information management, cloud computing, cognitive computing, covering IBM Db2, IBM Cloud, Watson, Amazon Web Services, Microsoft Azure and more.
Showing posts with label query.
Saturday, April 27, 2019
Db2: SQL-based explain and printed access plan
Friday, February 27, 2015
DB2 Battle: Optimization Profile vs. Statement Concentrator (Part 1)
[Image: DB2 explain output]
The original question I tried to answer was related to REOPT and a query like this:
select id, s
from betw
where id between ? and ?
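As background for the comparison: re-optimization is typically requested per package with the REOPT bind/prep option (or a REOPT guideline in an optimization profile), while the statement concentrator is a database configuration setting that replaces literals automatically. A minimal sketch of both knobs; the bind file name and the database name SAMPLE are just placeholders:
# ask DB2 to re-optimize the statement for every set of parameter marker values
db2 bind myprog.bnd reopt always
# let DB2 replace literals so similar statements can share one access plan
db2 update db cfg for SAMPLE using stmt_conc literals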
Labels: applications, best practices, catalog, DB2, diagnostics, fun, indexing, IT, performance, Python, query, sql, version 10.5
Friday, July 25, 2014
The Hunt for the Chocolate Thief (Part 2) - Putting IBM Bluemix, Cloudant, and a Raspberry Pi to good use
I am still on the hunt for the mean chocolate thief, kind of. In the first part I covered the Raspberry Pi side and uploading data to Cloudant. I showed how to set up an infrared motion sensor and a webcam with the RPi, capture a snapshot, and secure the image and related metadata in a Cloudant database on the IBM Bluemix Platform-as-a-Service (PaaS) offering. In this part I am going to create a small reporting website with Python, hosted as an IBM Bluemix service.
Similar to an earlier weather project, I use Python as the scripting language. On Bluemix, which is based on Cloud Foundry, this means "bring your own buildpack". I already described the necessary steps, which tell Bluemix how to create the runtime environment and install the needed Python libraries. So how do I access the incident data, i.e., the webcam snapshots taken by the Raspberry Pi when someone is in front of the infrared motion sensor? Let's take a look at the script:
import os
from flask import Flask,redirect
import urllib
import datetime
import json
import couchdb

app = Flask(__name__)

# couchDB/Cloudant-related global variables
couchInfo=''
couchServer=''
couch=''

# get service information if on Bluemix
if 'VCAP_SERVICES' in os.environ:
    couchInfo = json.loads(os.environ['VCAP_SERVICES'])['cloudantNoSQLDB'][0]
    couchServer = couchInfo["credentials"]["url"]
    couch = couchdb.Server(couchServer)
# we are local
else:
    with open("cloudant.json") as confFile:
        couchInfo = json.load(confFile)['cloudantNoSQLDB'][0]
        couchServer = couchInfo["credentials"]["url"]
        couch = couchdb.Server(couchServer)

# access the database which was created separately
db = couch['officecam']

@app.route('/')
def index():
    # build up result page
    page = '<title>Incidents</title>'
    page += '<h1>Security Incidents</h1>'
    page += '<h3>Requests so far</h3>'
    # list all recorded incidents, using an already created view
    for row in db.view('incidents/incidents'):
        page += 'Time: <a href="/incident/'+str(row.key["id"])+'">'+str(row.key["ts"])+'</a><br/>'
    # finish the page structure and return it
    return page

@app.route('/incident/<id>')
def incident(id):
    # build up result page
    page = '<title>Incident Detail</title>'
    page += '<h1>Security Incident Details</h1>'
    doc = db.get(id)
    # gather information from the database about the incident
    page += '<br/>Incident at date/time: '+str(doc["timestamp"])
    page += '<br/>reported by "'+doc["creater"]+'" at location "'+doc["location"]+'"'
    page += '<br/>Photo taken:<br/><img src="/image/'+id+'" />'
    # finish the page structure and return it
    return page

@app.route('/image/<id>')
def image(id):
    # redirect the request to Cloudant for now, should be hidden in the future
    return redirect(couchServer+'/officecam/'+id+'/cam.jpg')

port = os.getenv('VCAP_APP_PORT', '5000')
if __name__ == "__main__":
    app.run(host='0.0.0.0', port=int(port))
The setup phase includes reading in access data for the Cloudant database server. Either that information is taken from a Bluemix environment variable or provided in a file "cloudant.json" (similar to what I did on the RPi). The main part of the script defines three routes, i.e., how to react to certain URL requests. The index page (index()) returns an overview of all recorded incidents, an incident detail page (incident(id)) fetches the data for a single event and embeds the stored webcam image into the generated page, and the last route (image(id)) redirects the request to Cloudant.

[Image: Overview of Security Incidents]
Looking at how the index page is generated, you will notice that a predefined Cloudant view (secondary index) named "incidents/incidents" is evaluated. It is a simple map function that emits a composite key made up of the timestamp and the document ID, so the results come back sorted by time.
function(doc) {
  if (doc.type == "oc")
    emit({"ts" : doc.timestamp, "id" : doc._id}, 1);
}
Then I access the timestamp information and generate the list as shown in the screenshot above.
The incident detail page takes the document ID as a parameter. This makes it simple to retrieve the entire document and print the details; the webcam image is embedded. So who got my chocolate? Take a look. It looks like someone who got a free copy of "Hadoop for Dummies" at the IDUG North America conference.

[Image: Incident Detail: Hadoop involved?]
Maybe another incident will shed light on this mystery. Hmm, it looks like someone associated with the "Freundeskreis zur Förderung des Zeppelin Museums e.V." in Friedrichshafen. I showed the pictures to my wife and she was pretty sure who took some chocolate. I should pay more attention when grabbing another piece of my chocolate and watch more closely how much I am eating/enjoying.
[Image: Zeppelin letter seen at the robbery]
Have a nice weekend (and remember to sign up for a free Bluemix account)!
Friday, July 13, 2012
Big Data, Big Problems, Right Questions?
The term "Big Data" has been around for a while. Most of my time I have dealt with problems around traditional data (relational data), new data structures, non-traditional data, semi-structured data, in-memory data, on-disk data, hard-to-find data, lost data, object-oriented data, object-relational data, old data, new data, archive data, wrong data, corrupted data, THE right data, and all kinds of other data.
Almost daily I receive emails or letters where my name is misprinted, the address is incorrect, or the email was not even intended for me (long story). A few days ago I listened to a longer discussion about problems with various kinds of applications at several companies, all non-IBM-related. One key problem was that, due to failing components in an overly complex environment, data got outdated, went missing, or was incorrect. The consequences hit the "regular guys", ranging from missing or incorrect parts in an assembly line, to timesheets that were not updated, to employees not being able to "badge in". When I talk with my students at the Cooperative State University (who all have a job in the industry), many of them have seen or had to deal with incorrect or outdated data and the "information" based on it.
The issues remind me of one of the first lessons in physics at school: "Garbage in, garbage out". For Big Data, the amount of data is in a different dimension, but the same principles apply as with other data. It is important to know that there are, or could be, issues with the input data. How is that dealt with? Does it impact the algorithms and the result? Is the derived "information" really information? How is the data interpreted? Is it OK to base decisions on it, or is it taken as just one of many indicators? Do those dealing with information derived from Big Data know the source, and is everything fully disclosed? There are many more questions, many of them non-technical. One important question is what questions to ask of the data. Is the goal to "prove" some speculation and guesswork, or to really find out something new that may be validated by further work? And then we are back to a problem we have faced for years with existing, "small" data.
Now on to a small problem and statistics: over the weekend we will have big fireworks here over Lake Constance. My kids want to watch them, for sure. I could point out that only 20% of the persons I considered for my statistical findings will attend. My kids could respond that 70% will attend. They chose people aged 8-80 years, I those aged 0-8 years...
Context counts, and so does asking the right questions, regardless of small or big data.
Monday, July 2, 2012
A small DB2 SQL quiz - why the SQL error?
I have been playing with some new DB2 10.1 features and thought that a small SQL quiz at the beginning of the week would be appropriate. Consider the following SQL statements and the related output:
db2 => insert into ih values(1,'a')
DB21034E The command was processed as an SQL statement because it was not a
valid Command Line Processor command. During SQL processing it returned:
SQL0407N Assignment of a NULL value to a NOT NULL column "TBSPACEID=2,
TABLEID=257, COLNO=2" is not allowed. SQLSTATE=23502
db2 => select * from ih

ID          S
----------- ------------------------------

  0 record(s) selected.

db2 => describe select * from ih

 Column Information

 Number of columns: 2

 SQL type              Type length  Column name                     Name length
 --------------------  -----------  ------------------------------  -----------
 496   INTEGER                    4  ID                                        2
 449   VARCHAR                   30  S                                         1
How was the table defined or why is it giving an error? Any guesses? Post your answer and explanation as comment. I will post the solution in a couple days. [Edit: done and linked]
Wednesday, May 23, 2012
Partial Early Desk Cleaning - New in DB2
Remember those days when your parents told you to clean up your room or your desk? There are two strategies to deal with cleaning up a room or a desk. Do a little bit here and there or wait till it is really crowded (and filthy...?). The latter is more of a "last resort" action, e.g., before visitors arrive or someone gets really, really angry. The former (today often called "pro-active") means that some time is taken daily or weekly to clean up parts of the room or the desk. In the best scenario, the desk always shines, there is rarely the need for a bigger clean-up and it gives a good feeling (or it could mean you don't really have a real job...).
Now let's talk about DB2. Starting with version 10 it features PED and PEA. The first, Partial Early Distinct (PED), means that the big clean-up task of removing duplicates from a result set is not done at the end, but as early as possible ("here and there"). The big advantage is that it speeds up the query because smaller intermediate result sets can be moved through the system, less memory is needed for the sort heap (which is used for eliminating duplicates), and more. Partial Early Aggregation (PEA) works similarly and applies to GROUP BYs.
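To make the idea more concrete, these are the kinds of statements where the optimizer may consider PED and PEA; the table and column names are invented for the example:
-- PED: duplicates can already be removed while the join is processed, not only at the very end
SELECT DISTINCT c.name
FROM customers c JOIN orders o ON o.cust_id = c.id;
-- PEA: rows can be pre-aggregated early, so less data flows through the rest of the plan
SELECT o.cust_id, SUM(o.amount) AS total
FROM orders o JOIN customers c ON o.cust_id = c.id
GROUP BY o.cust_id;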
Some more of "partial early" is explained in a DB2 V10.1 Query performance enhancements paper published at the DB2 LUW Best Practices website.
More later, I promised my boss to clean up my desk...
Tuesday, November 30, 2010
Indexing PDF Files with DB2 Text Search
This week I was asked how to build a full text index on PDF documents in DB2. Because my hard drive is full of them, I created a new DB2 database and set up DB2 Text Search with so-called rich text document filters. After creating a table, loading data, and indexing the documents, I eventually used the CONTAINS and SCORE functions to search my PDF files. But let me show you the steps in some more detail...
Because DB2 Text Search support can be installed as part of the DB2 installation, my system already had the base support set up.
Thus my endeavors started with downloading the DB2 Accessory Suite. It contains additional document filters that help to extract full text information out of different file types such as PDF, word processor, or spreadsheet documents. The DB2 Information Center has instructions for downloading and installing the accessory suite. I checked that all necessary files from the regular DB2 installation were already in place and then succeeded with setting up rich text support.
Setup actually requires two steps: installing the software on the server, then setting it up for the instance. That caused some confusion, but I succeeded. Next was enabling rich text support; the instructions are in the Information Center again. Now on to creating the actual test database (on my Linux shell):
db2 CREATE DB FT
DB2 Text Search uses an environment variable DB2DBDFT. This should be set to the database you work with to simplify administration. Hence:
export DB2DBDFT=FT
Finally, we are able to set up our new database for full text search:
db2ts "ENABLE DATABASE FOR TEXT"
The db2ts command is the DB2 Text Search command processor. Now we can create a simple table in the database to hold the documents after connecting to the db:
db2 connect to ft
db2 "create table hlDocs(id int not null primary key, description varchar(200), doc BLOB(10M))"
The second statement above creates a table named "hlDocs" with three columns. The first column is named "id" and is of type integer; note that DB2 Text Search requires a primary key column in the indexed table. The second column will hold a short string describing the document. The last column is a BLOB that we use to store the PDF files. To easily insert PDF documents into the table, I created a small delimited file for import (named "blob.del"):
1,'train schedule Paris',ParisSchedule.pdf
2,'Schedule recycling paper collection',Papier_Abfallplan2010.pdf
3,'New resume',Current_resume.pdf
The above data can now be imported, with the PDF files assumed to be in the current directory:
db2 import from blob.del of del lobs from . insert into hlDocs
It is important to check that all rows were inserted:
Number of rows read = 3
Number of rows skipped = 0
Number of rows inserted = 3
Number of rows updated = 0
Number of rows rejected = 0
Number of rows committed = 3
Now the test data is in, but we don't have an index yet. That can be done using the db2ts command again. We first create the index, then update it:
db2ts "CREATE INDEX idx1 FOR TEXT ON hldocs(doc) FORMAT inso"
db2ts "UPDATE INDEX idx1 FOR TEXT"
CAUTION: Note that the option "FORMAT inso" is not clearly explained in the documentation (yet - hence this post). That option tells DB2 to use the additional document filters that we installed as part of the DB2 Accessory Suite.
Finally, we can test the index using the CONTAINS and the SCORE functions.
db2 "select id,description from hldocs where contains(doc,'Paris')=1"
ID DESCRIPTION --------------------------------------------------
1 'train schedule Paris'
1 record(s) selected.
db2 "select id,description,cast(score(doc,'Friedri*') as decimal(5,3)) as score from hldocs order by 3 desc"
ID DESCRIPTION SCORE
----------- ------------------------------------------
2 'Schedule for paper collection' 0.046
3 'Resume' 0.022
1 'Train schedule Paris' 0.000
3 record(s) selected.
Most of the steps are dedicated to proper setup of DB2 Text Search with the additional filters for PDF documents. However, once it is set up, inserting data and searching the documents is simple.