Henrik's thoughts on life in IT, data and information management, cloud computing, cognitive computing, covering IBM Db2, IBM Cloud, Watson, Amazon Web Services, Microsoft Azure and more.
Showing posts with label query.
Saturday, April 27, 2019
Db2: SQL-based explain and printed access plan
Friday, February 27, 2015
DB2 Battle: Optimization Profile vs. Statement Concentrator (Part 1)
[Image: DB2 explain output]
The original question I tried to answer was related to REOPT and a query like this:
select id, s
from betw
where id between ? and ?
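As background for the comparison: re-optimization is typically requested per package with the REOPT bind/prep option (or a REOPT guideline in an optimization profile), while the statement concentrator is a database configuration setting that replaces literals automatically. A minimal sketch of both knobs; the bind file name and the database name SAMPLE are just placeholders:
# ask DB2 to re-optimize the statement for every set of parameter marker values
db2 bind myprog.bnd reopt always
# let DB2 replace literals so similar statements can share one access plan
db2 update db cfg for SAMPLE using stmt_conc literals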
Labels: applications, best practices, catalog, DB2, diagnostics, fun, indexing, IT, performance, Python, query, sql, version 10.5
Friday, July 25, 2014
The Hunt for the Chocolate Thief (Part 2) - Putting IBM Bluemix, Cloudant, and a Raspberry Pi to good use
I am still on the hunt for the mean chocolate thief, kind of. In the first part I covered the Raspberry Pi side and uploading data to Cloudant. I showed how to set up an infrared motion sensor and a webcam with the RPi, capture a snapshot, and secure the image and related metadata in a Cloudant database on the IBM Bluemix Platform-as-a-Service (PaaS) offering. In this part I am going to create a small reporting website with Python, hosted as an IBM Bluemix service.
Similar to an earlier weather project, I use Python as the scripting language. On Bluemix, which is based on Cloud Foundry, this means "bring your own buildpack". I already described the necessary steps, which tell Bluemix how to create the runtime environment and install the needed Python libraries. So how do I access the incident data, i.e., the webcam snapshots taken by the Raspberry Pi when someone is in front of the infrared motion sensor? Let's take a look at the script:
import os
from flask import Flask,redirect
import urllib
import datetime
import json
import couchdb

app = Flask(__name__)

# couchDB/Cloudant-related global variables
couchInfo=''
couchServer=''
couch=''

# get service information if on Bluemix
if 'VCAP_SERVICES' in os.environ:
    couchInfo = json.loads(os.environ['VCAP_SERVICES'])['cloudantNoSQLDB'][0]
    couchServer = couchInfo["credentials"]["url"]
    couch = couchdb.Server(couchServer)
# we are local
else:
    with open("cloudant.json") as confFile:
        couchInfo = json.load(confFile)['cloudantNoSQLDB'][0]
        couchServer = couchInfo["credentials"]["url"]
        couch = couchdb.Server(couchServer)

# access the database which was created separately
db = couch['officecam']

@app.route('/')
def index():
    # build up result page
    page = '<title>Incidents</title>'
    page += '<h1>Security Incidents</h1>'
    page += '<h3>Requests so far</h3>'
    # list all recorded incidents, using an already created view
    for row in db.view('incidents/incidents'):
        page += 'Time: <a href="/incident/'+str(row.key["id"])+'">'+str(row.key["ts"])+'</a><br/>'
    # finish the page structure and return it
    return page

@app.route('/incident/<id>')
def incident(id):
    # build up result page
    page = '<title>Incident Detail</title>'
    page += '<h1>Security Incident Details</h1>'
    doc = db.get(id)
    # gather information from the database about the incident
    page += '<br/>Incident at date/time: '+str(doc["timestamp"])
    page += '<br/>reported by "'+doc["creater"]+'" at location "'+doc["location"]+'"'
    page += '<br/>Photo taken:<br/><img src="/image/'+id+'" />'
    # finish the page structure and return it
    return page

@app.route('/image/<id>')
def image(id):
    # redirect the request to Cloudant for now, should be hidden in the future
    return redirect(couchServer+'/officecam/'+id+'/cam.jpg')

port = os.getenv('VCAP_APP_PORT', '5000')
if __name__ == "__main__":
    app.run(host='0.0.0.0', port=int(port))
The setup phase includes reading in access data for the Cloudant database server. Either that information is taken from a Bluemix environment variable or provided in a file "cloudant.json" (similar to what I did on the RPi). The main part of the script defines three routes, i.e., how to react to certain URL requests. The index page (index()) returns an overview of all recorded incidents, an incident detail page (incident(id)) fetches the data for a single event and embeds the stored webcam image into the generated page, and the last route (image(id)) redirects the request to Cloudant.

[Image: Overview of Security Incidents]
Looking at how the index page is generated, you will notice that a predefined Cloudant view (secondary index) named "incidents/incidents" is evaluated. It is a simple map function that emits a composite key made up of the timestamp and the document ID, so the results come back sorted by time.
function(doc) {
  if (doc.type == "oc")
    emit({"ts" : doc.timestamp, "id" : doc._id}, 1);
}
Then I access the timestamp information and generate the list as shown in the screenshot above.
The incident detail page takes the document ID as a parameter. This makes it simple to retrieve the entire document and print the details; the webcam image is embedded. So who got my chocolate? Take a look. It looks like someone who got a free copy of "Hadoop for Dummies" at the IDUG North America conference.

[Image: Incident Detail: Hadoop involved?]
Maybe another incident will shed light on this mystery. Hmm, it looks like someone associated with the "Freundeskreis zur Förderung des Zeppelin Museums e.V." in Friedrichshafen. I showed the pictures to my wife and she was pretty sure who took some chocolate. I should pay more attention when grabbing another piece of my chocolate and watch more closely how much I am eating/enjoying.
[Image: Zeppelin letter seen at the robbery]
Have a nice weekend (and remember to sign up for a free Bluemix account)!
Friday, July 13, 2012
Big Data, Big Problems, Right Questions?
The term "Big Data" has been around for a while. Most of my time I have dealt with problems around traditional data (relational data), new data structures, non-traditional data, semi-structured data, in-memory data, on-disk data, hard-to-find data, lost data, object-oriented data, object-relational data, old data, new data, archive data, wrong data, corrupted data, THE right data, and all kinds of other data.
Almost daily I receive emails or letters where my name is misprinted, the address is incorrect, or the email was not even intended for me (long story). A few days ago I listened to a longer discussion about problems with various kinds of applications at several companies, all non-IBM-related. One key problem was that, due to failing components in an overly complex environment, data got outdated, went missing, or was incorrect. The consequences hit the "regular guys", ranging from missing or incorrect parts in an assembly line, to timesheets that were not updated, to employees not being able to "badge in". When I talk with my students at the Cooperative State University (who all have a job in the industry), many of them have seen or had to deal with incorrect or outdated data and the "information" based on it.
The issues remind me of one of the first lessons in physics at school: "Garbage in, garbage out". For Big Data, the amount of data is in a different dimension, but the same principles apply as with other data. It is important to know that there are, or could be, issues with the input data. How is that dealt with? Does it impact the algorithms and the result? Is the derived "information" really information? How is the data interpreted? Is it OK to base decisions on it, or is it taken as just one of many indicators? Do those dealing with information derived from Big Data know the source, and is everything fully disclosed? There are many more questions, many of them non-technical. One important question is what questions to ask of the data. Is the goal to "prove" some speculation and guesswork, or to really find out something new that may be validated by further work? And then we are back to a problem we have faced for years with existing, "small" data.
Now on to a small problem and statistics: over the weekend we will have big fireworks here over Lake Constance. My kids want to watch them, for sure. I could point out that only 20% of the persons I considered for my statistical findings will attend. My kids could respond that 70% will attend. They chose people aged 8-80 years, I those aged 0-8 years...
Context counts, and so does asking the right questions, regardless of small or big data.
Monday, July 2, 2012
A small DB2 SQL quiz - why the SQL error?
I have been playing with some new DB2 10.1 features and thought that a small SQL quiz at the beginning of the week would be appropriate. Consider the following SQL statements and the related output:
db2 => insert into ih values(1,'a')
DB21034E The command was processed as an SQL statement because it was not a
valid Command Line Processor command. During SQL processing it returned:
SQL0407N Assignment of a NULL value to a NOT NULL column "TBSPACEID=2,
TABLEID=257, COLNO=2" is not allowed. SQLSTATE=23502
db2 => select * from ih

ID          S
----------- ------------------------------

  0 record(s) selected.

db2 => describe select * from ih

 Column Information

 Number of columns: 2

 SQL type              Type length  Column name                     Name length
 --------------------  -----------  ------------------------------  -----------
 496   INTEGER                    4  ID                                        2
 449   VARCHAR                   30  S                                         1
How was the table defined or why is it giving an error? Any guesses? Post your answer and explanation as comment. I will post the solution in a couple days. [Edit: done and linked]
Wednesday, May 23, 2012
Partial Early Desk Cleaning - New in DB2
Remember those days when your parents told you to clean up your room or your desk? There are two strategies to deal with cleaning up a room or a desk. Do a little bit here and there or wait till it is really crowded (and filthy...?). The latter is more of a "last resort" action, e.g., before visitors arrive or someone gets really, really angry. The former (today often called "pro-active") means that some time is taken daily or weekly to clean up parts of the room or the desk. In the best scenario, the desk always shines, there is rarely the need for a bigger clean-up and it gives a good feeling (or it could mean you don't really have a real job...).
Now let's talk about DB2. Starting with version 10 it features PED and PEA. The first, Partial Early Distinct (PED), means that the big clean-up task of removing duplicates from a result set is not done at the end, but as early as possible ("here and there"). The big advantage is that it speeds up the query because smaller intermediate result sets can be moved through the system, less memory is needed for the sort heap (which is used for eliminating duplicates), and more. Partial Early Aggregation (PEA) works similarly and applies to GROUP BYs.
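To make the idea more concrete, these are the kinds of statements where the optimizer may consider PED and PEA; the table and column names are invented for the example:
-- PED: duplicates can already be removed while the join is processed, not only at the very end
SELECT DISTINCT c.name
FROM customers c JOIN orders o ON o.cust_id = c.id;
-- PEA: rows can be pre-aggregated early, so less data flows through the rest of the plan
SELECT o.cust_id, SUM(o.amount) AS total
FROM orders o JOIN customers c ON o.cust_id = c.id
GROUP BY o.cust_id;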
Some more of "partial early" is explained in a DB2 V10.1 Query performance enhancements paper published at the DB2 LUW Best Practices website.
More later, I promised my boss to clean up my desk...
Tuesday, November 30, 2010
Indexing PDF Files with DB2 Text Search
This week I was asked how to build a full text index on PDF documents in DB2. Because my hard drive is full of them, I created a new DB2 database and set up DB2 Text Search with so-called rich text document filters. After creating a table, loading data, and indexing the documents, I eventually used the CONTAINS and SCORE functions to search my PDF files. But let me show you the steps in some more detail...
Because DB2 Text Search support can be installed as part of the DB2 installation, my system already had the base support set up.
Thus my endeavors started with downloading the DB2 Accessory Suite. It contains additional document filters that help to extract full text information out of different file types such as PDF, word processor, or spreadsheet documents. The DB2 Information Center has instructions for downloading and installing the accessory suite. I checked that all necessary files from the regular DB2 installation were already in place and then succeeded with setting up rich text support.
Setup actually requires two steps: installing the software on the server, then setting it up for the instance. That caused some confusion, but I succeeded. Next was enabling rich text support; the instructions are in the Information Center again. Now on to creating the actual test database (on my Linux shell):
db2 CREATE DB FT
DB2 Text Search uses an environment variable DB2DBDFT. This should be set to the database you work with to simplify administration. Hence:
export DB2DBDFT=FT
Finally, we are able to set up our new database for full text search:
db2ts "ENABLE DATABASE FOR TEXT"
The db2ts command is the DB2 Text Search command processor. Now we can create a simple table in the database to hold the documents after connecting to the db:
db2 connect to ft
db2 "create table hlDocs(id int not null primary key, description varchar(200), doc BLOB(10M))"
The second statement above creates a table named "hlDocs" with three columns. The first column is named "id" and is of type integer; note that DB2 Text Search requires a primary key column in the indexed table. The second column will hold a short string describing the document. The last column is a BLOB that we use to store the PDF files. To easily insert PDF documents into the table, I created a small delimited file for import (named "blob.del"):
1,'train schedule Paris',ParisSchedule.pdf
2,'Schedule recycling paper collection',Papier_Abfallplan2010.pdf
3,'New resume',Current_resume.pdf
The above data can now be imported, with the PDF files assumed to be in the current directory:
db2 import from blob.del of del lobs from . insert into hlDocs
It is important to check that all rows were inserted:
Number of rows read = 3
Number of rows skipped = 0
Number of rows inserted = 3
Number of rows updated = 0
Number of rows rejected = 0
Number of rows committed = 3
Now the test data is in, but we don't have an index yet. That can be done using the db2ts command again. We first create the index, then update it:
db2ts "CREATE INDEX idx1 FOR TEXT ON hldocs(doc) FORMAT inso"
db2ts "UPDATE INDEX idx1 FOR TEXT"
CAUTION: Note that the option "FORMAT inso" is not clearly explained in the documentation (yet - hence this post). That option tells DB2 to use the additional document filters that we installed as part of the DB2 Accessory Suite.
Finally, we can test the index using the CONTAINS and the SCORE functions.
db2 "select id,description from hldocs where contains(doc,'Paris')=1"
ID DESCRIPTION --------------------------------------------------
1 'train schedule Paris'
1 record(s) selected.
db2 "select id,description,cast(score(doc,'Friedri*') as decimal(5,3)) as score from hldocs order by 3 desc"
ID DESCRIPTION SCORE
----------- ------------------------------------------
2 'Schedule for paper collection' 0.046
3 'Resume' 0.022
1 'Train schedule Paris' 0.000
3 record(s) selected.
Most of the steps are dedicated to proper setup of DB2 Text Search with the additional filters for PDF documents. However, once it is set up, inserting data and searching the documents is simple.