Connecting to DuckDB in Python
When working with DuckDB in Python, the first thing to consider is how you’ll connect to a target DuckDB database from your Python code. Even though DuckDB runs embedded within a parent process, meaning that you don’t have to worry about authenticating to a remote server, you still need to consider how your Python process will make and manage connections to potentially multiple DuckDB databases.
In this section, we’ll be covering the two broad ways to issue commands to DuckDB databases in Python:
- Using the default in-memory database: This database is automatically created for you when you import the duckdb module. The default database provides a convenient way to quickly perform activities such as ad-hoc data analysis when you know you will only need a single database in your session. This is what you will connect to when you invoke a function from the duckdb module that interacts with a database, such as duckdb.sql() and duckdb.execute().
- Explicitly creating and managing database connections: This gives you more flexibility and control over how you work with DuckDB databases. Database connection objects are created via the duckdb.connect() function, which you then use to issue commands against the database it’s connected to.
Both approaches give you access to the same Python APIs for interacting with DuckDB databases, but with different levels of convenience and customizability. Let’s go through each of these ways of interacting with DuckDB databases in Python.
Using the default in-memory database
The simplest way to hit the ground running with DuckDB in Python is to use the default database. This is a shared in-memory database that is available anywhere in your Python process after importing the duckdb module. It has the advantage of convenience, as it saves you from having to manage database connections yourself, making it perfect for ad-hoc data wrangling tasks, or if you’re just taking DuckDB for a spin.
The DuckDB Python client provides a range of functions for interacting with DuckDB databases. When called as attributes of the duckdb module, they dispatch your instructions to the shared in-memory database. From here on, we will refer to these functions associated with underlying connection objects as methods, in keeping with object-oriented programming terminology.
To see this in action, let’s invoke the duckdb.sql() method. The sql() method is part of DuckDB’s Relational API and is used to construct a SQL query to run against the database. Enter and run the following Python snippet inside a code cell of your Jupyter Notebook-compatible IDE; this is also how you should run all of the following code examples in this chapter:
```python
import duckdb

duckdb.sql("SELECT 'duck' AS animal, 'quack!' AS greeting")
```
In most notebook-based IDEs, when you run a code cell, the value of the final expression is displayed in the cell’s output area. We’ll be leveraging this convenient feature throughout our notebook-based exercises to display the output of each example. This means that after you run this block, you’ll see the following output:

Figure 7.3 – The result of a query run against the default database using the duckdb.sql() method
With that, we have just run our first DuckDB query in Python using the DuckDB Python client’s default database. If you are wondering about the two different types of quote characters in the code we just ran, remember that in DuckDB SQL, literal string values are represented using single-quote characters. The outer double-quote characters define the Python string literal containing our SQL query that we want to send to the database. As Python can use either double quotes or single quotes to represent string literals, we recommend using double quotes for Python string literals that contain SQL queries, as this will save you from having to escape single quotes when using string literals in your SQL queries.
The result of the sql() method call is a DuckDB relation object, which represents our query. We’ll unpack these relation objects in Chapter 8, as part of our treatment of DuckDB’s Relational API. For now, we’re just focusing on the fact that since we’ve invoked the sql() method against the duckdb module, our SQL query was automatically dispatched to the default database.
The default in-memory database is particularly convenient for interactive data analysis workflows, as you can jump in and start working with DuckDB without having to set up any connections. If you’re doing ad-hoc data wrangling or exploratory data analysis in a notebook, you may often find yourself reaching for this approach. In other scenarios, the default database won’t always be the most appropriate way to work with DuckDB. Being an in-memory database, the default database is ephemeral, meaning that its contents will be lost when its parent Python process ends. If you want to persist your database state after the process ends, you’ll need to create a new persistent connection explicitly, rather than use the default database.
Another consideration is that the default database is shared globally across the duckdb module within each Python process. If you’re developing a Python library that uses DuckDB, it’s important to bear in mind that downstream users of your library may themselves want to use DuckDB’s default database for their own needs. It is therefore recommended to avoid the default in-memory database when developing library packages, so that consumers of your library can safely import and use the duckdb module without clashing with any database state created by your library’s code.
Next, we’ll cover a more flexible way to connect to DuckDB databases that involves creating and configuring new connections explicitly. This approach allows you to manage and use multiple DuckDB databases in the same Python process, including persistent file-based databases, as well as allowing you to customize each database’s configuration.
Creating database connections explicitly
In addition to using the default in-memory database, you also have the option to use the duckdb.connect() function to explicitly create and manage connection objects. This is part of DuckDB’s support for the Python DB-API 2.0 standard, which we’ll cover in the next chapter. Managing database connection objects yourself gives you much more control over how you create and connect to DuckDB databases within your analytical environment or application logic. This is also how library packages can connect to DuckDB databases safely, without fear of clashing with any usage of DuckDB’s default database by downstream users of the library.
The connect() function returns a DuckDBPyConnection object that represents a connection to a specific DuckDB database. Calling connect() without any arguments results in the creation of a new in-memory database. It’s important to note that this is distinct from the default in-memory database. Let’s see this in action:
```python
conn = duckdb.connect()
conn.sql(
    """
    CREATE TABLE hello AS
    SELECT 'pato' AS animal, 'cuac!' AS greeting
    """
)
conn.sql("SELECT * FROM hello")
```
This gives us the following output:

Figure 7.4 – The result of a query run against a new in-memory database, using the sql() method of a connection object
In this example, we created a new in-memory database and assigned the resulting connection object to the conn variable. We then submitted two SQL queries to this new database via the sql() method of the connection object: one that created a table with a single row and another that retrieved this single row.
The connect() function has several parameters, the first of which is the database parameter, which takes a string specifying the database you want to connect to. As we’ll see later, passing a file path for this value results in a connection to a persistent database stored on disk. Its default value, however, is the special string :memory:, which is why calling the connect() function with no arguments returned a connection to a new in-memory database. Let’s make another in-memory database, specifying this explicitly with the database keyword argument:
```python
another_conn = duckdb.connect(database=":memory:")
```
Running this code will result in a second in-memory database being created, distinct from the first one, which we now have a reference to via the another_conn variable. Once again, this is distinct from the default in-memory database used by the duckdb.sql() method and other methods attached to the duckdb module. If you happen to want to create an explicit connection object to the default in-memory database, you can pass the :default: string as the value of the database keyword argument:
```python
default_conn = duckdb.connect(database=":default:")
```
Every database interaction requires a connection
If you’re finding the appearance of the sql() method on both the duckdb module and on connection objects confusing, it may be helpful to consider that every interaction with a DuckDB database using the Python client occurs via a DuckDBPyConnection object, even when issuing commands via methods from the duckdb module. When you call duckdb.sql(), it dispatches your query to the default database via a connection object that is automatically created when you import the duckdb module. This connection object is available via the duckdb.default_connection module attribute. To illustrate this, the following two Python expressions are functionally equivalent (though not equally convenient):
```python
duckdb.sql("SELECT 'hello'")
duckdb.default_connection.sql("SELECT 'hello'")
```
Connection objects provide a range of methods beyond the sql() method for interacting with their database, all of which can also be accessed via the duckdb module for convenient interaction with the default database. We’ll see more of these methods in the next chapter.
Explicitly creating connection objects also allows us to specify database configuration options. To do this, you need to pass a dictionary of configuration key-value pairs to the connect() function’s config parameter. Here’s an example where we create a new in-memory database, which we configure to use a maximum of 10 GB of system memory, as well as limiting the database to four CPU threads for parallel query execution:
```python
custom_conn = duckdb.connect(
    config={
        "memory_limit": "10GB",
        "threads": 4
    }
)
```
These database configuration options may come in handy if you need to tune DuckDB’s internal configuration to match the constraints of your environment. For best performance, the DuckDB performance guide recommends having 5 GB of memory available per thread for aggregation-heavy workloads, and 10 GB of memory per thread for join-heavy workloads, which is one reason why you might want to limit the number of threads DuckDB uses. Limiting the amount of system memory used by your database is perhaps a more niche need, but it may be helpful when working in multi-tenant environments with resource contention. Note that we have combined these two configuration options for illustration purposes, rather than to target a specific use case. See the DuckDB performance guide for more information about tuning DuckDB’s performance: https://duckdb.org/docs/guides/performance.
For the complete list of database configuration options that can be provided via the config parameter of the connect() function, consult the DuckDB configuration documentation: https://duckdb.org/docs/configuration.
Connecting to persistent-storage databases
When using an in-memory database, no data is persisted to disk, and all data stored in the database is lost when the Python process exits. If you need your database contents to persist, you can create a disk-based DuckDB database by passing a file path to the connect() function’s database parameter. If the database file already exists, a connection to this database will be created; if not, DuckDB will first create a new database at the specified path before returning a connection to it. Let’s look at an example, where we’ll connect to a new on-disk database and write our simple table to it:
```python
conn = duckdb.connect(database="quack.duckdb")
conn.sql(
    """
    CREATE OR REPLACE TABLE hello AS
    SELECT 'ente' AS animal, 'quak!' AS greeting
    """
)
conn.close()
```
After running this, you will now have a DuckDB database in the quack.duckdb file, which contains our hello table with a single row. Let’s verify this quickly by running a shell command that invokes the DuckDB CLI to connect to the new quack.duckdb database and query the hello table. Rather than leaving the notebook, we’ll do this in a notebook cell by prefixing our command with the ! character, which instructs the notebook to run it in a new shell, as if you had run it on your command line via your terminal application:
```shell
! duckdb quack.duckdb -c "SELECT * FROM hello"
```
Note that if you downloaded the DuckDB CLI directly, you’ll need to replace duckdb with a relative path from the chapter_08 folder to the location of your DuckDB CLI executable. After running the command, we should see our new greeting:

Figure 7.5 – The result of a query run against a disk-based DuckDB database via the DuckDB CLI
This shows us that our database changes were successfully persisted in the DuckDB database file quack.duckdb
.
Thinking about what we just did, notice that we connected to our on-disk database from a different process than the notebook’s Python kernel. It’s worth noting here that DuckDB does not allow other processes to read from a disk-based database that already has an existing write-enabled connection. This is why we made sure to run the close() method on our connection before querying it from another process. Without this step, our attempt to query the database using the DuckDB CLI would have thrown an error. In the next section, we’ll discuss closing connections in more detail.
Closing database connections
It’s good practice to make sure your connection objects are closed after you’re finished working with them. This is particularly important when making changes to disk-based databases due to the fact that when writing to a database, DuckDB maintains a temporary data structure that needs to be synchronized with the database file to ensure data integrity. Ending your Python process without first closing any connections to database files could result in unsynchronized data failing to be written to the persistent database.
As we’ve already seen, the simplest way to close a connection is to call its close() method after you’re finished using it:
```python
sql = "INSERT INTO hello VALUES ('Fulvigula', 'quack!')"
conn = duckdb.connect(database="quack.duckdb")
conn.sql(sql)
conn.close()
```
After running the preceding cell, we have safely added this new row to our hello table in the quack.duckdb disk-based database that we created previously.
A convenient way to ensure that you close your connections is to use connection objects as context managers. Context managers are Python objects that can be used with the with keyword to automatically perform certain actions when entering and exiting the with statement’s code block. They are commonly used for managing external resources such as database connections, network connections, and file descriptors. Here’s an example of using a DuckDB connection object as a context manager:
```python
sql = "INSERT INTO hello VALUES ('Labradorius', 'quack!')"
with duckdb.connect(database="quack.duckdb") as conn:
    conn.sql(sql)
```
As we can see from this example, to connect to DuckDB using a context manager, we use the with statement, followed by a code block that contains the database interactions we wish to perform. The as keyword allows us to capture the connection object as a variable, for reference inside the indented code block of the with statement. When the code block is exited, the connection’s close() method is called automatically for us. This provides a convenient way to create a new connection and automate its closing, freeing you from having to remember to explicitly close every connection you create.
Connection objects are also closed implicitly when they go out of scope. This happens, for example, when a connection object is created inside a function body and the function then exits without returning the connection object. The connection object is then no longer in scope and will therefore be closed. However, rather than relying on this behavior, we recommend either an explicit close() call or a context manager to ensure that connections are safely closed.
Sharing disk-based databases between processes
Persistent disk-based DuckDB databases can be shared across multiple processes, provided that all connections open the database in read-only mode. You can specify that a database is opened in read-only mode by giving the connect() function’s read_only parameter the value True:
```python
conn = duckdb.connect(
    database="quack.duckdb",
    read_only=True
)
# query the database here...
conn.close()
```
Note that the read_only flag can only be set to True when connecting to a disk-based database. A read-only in-memory database would not make sense, as it would never contain any data to query. Support for multiple concurrent read-only connections allows multiple consuming processes to connect to and query a single disk-based database. While it’s unlikely you’ll need to do this while performing data analysis in a notebook, this pattern may be useful for specific DuckDB use cases where multiple consuming services need to be able to query a single database concurrently.
Installing and loading extensions
The Python client provides explicit methods for both installing and loading extensions, which correspond to the INSTALL and LOAD DuckDB SQL commands we saw in Chapter 5. These methods are invoked on connection objects, meaning that extensions are both installed and loaded on a per-database-connection basis. To install and load the spatial extension in the default database, we run the following:
```python
duckdb.install_extension("spatial")
duckdb.load_extension("spatial")
```
As we have called these methods against the duckdb module, the corresponding commands are dispatched to the default database. We can confirm that the spatial extension is installed and loaded in the default database with the following query:
```python
duckdb.sql(
    """
    SELECT *
    FROM duckdb_extensions()
    WHERE loaded = true
    """
)
```
The result of this query shows us that the spatial extension is indeed installed and loaded. Note that you may see a slightly different set of loaded extensions depending on your environment.

Figure 7.6 – All loaded and installed extensions in our default database after installing and loading the spatial extension
Let’s run the same query again on a new in-memory database to confirm that the spatial extension was only installed and loaded for the default database:
```python
conn = duckdb.connect()
conn.sql(
    """
    SELECT *
    FROM duckdb_extensions()
    WHERE loaded = true
    """
)
```
This query shows that the new in-memory database does not have the spatial extension installed or loaded:

Figure 7.7 – All loaded and installed extensions in a newly created connection in-memory database, without installing or loading any extensions
To install and load this extension in our new database, we would need to run the following:
```python
conn.install_extension("spatial")
conn.load_extension("spatial")
```
This also serves as a reminder that multiple DuckDB databases running in a Python process are distinct and do not share state with each other.