Fastest open-source tool for replicating databases to Apache Iceberg or a Data Lakehouse. ⚡ Efficient, quick, and scalable data ingestion for real-time analytics, starting with MongoDB. Visit olake.io/docs for the full documentation and benchmarks.
Connector ecosystem for OLake. The key points OLake connectors focus on are:
- Integrated writers, to avoid blocking reads and to push records directly into destinations
- Connector Autonomy
- Avoid operations that don't contribute to increasing record throughput
Follow the steps below to get started with OLake:

- Create a folder on your computer. Let's call it `olake_folder_path`.

  💡 Note: In the configurations below, replace `olake_folder_path` with the newly created folder path.
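  For example, on Linux/macOS you could create it with (the folder name is just a placeholder):

  ```sh
  # create the working folder that will later be mounted into the OLake container
  mkdir olake_folder_path
  ```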
- Inside this folder, create two files:
  - config.json: This file contains your connection details. You can find examples and instructions here; an illustrative sketch is also shown below, after the writer.json examples.
  - writer.json: This file specifies where to save your data (local machine or S3).
Example (For Local):

```json
{
  "type": "PARQUET",
  "writer": {
    "normalization": false,                      // enable/disable level-one flattening
    "local_path": "/mnt/config/{olake_reader}"   // replace olake_reader with the desired folder name
  }
}
```
Example (For S3):

```json
{
  "type": "PARQUET",
  "writer": {
    "normalization": false,   // enable/disable level-one flattening
    "s3_bucket": "olake",
    "s3_region": "",
    "s3_access_key": "",
    "s3_secret_key": "",
    "s3_path": ""
  }
}
```
- Run the discovery process to identify your MongoDB data:

  ```sh
  docker run -v olake_folder_path:/mnt/config olakego/source-mongodb:latest discover --config /mnt/config/config.json
  ```

  This will create a catalog.json file in your folder. The file lists the data streams from your MongoDB database:

  ```json
  {
    "selected_streams": {
      "namespace": [
        {
          "partition_regex": "/{col_1, default_value, granularity}",
          "stream_name": "table1"
        },
        {
          "partition_regex": "",
          "stream_name": "table2"
        }
      ]
    },
    "streams": [
      {
        "stream": {
          "name": "table1",
          "namespace": "namespace",
          // ...
          "sync_mode": "cdc"
        }
      },
      {
        "stream": {
          "name": "table2",
          "namespace": "namespace",
          // ...
          "sync_mode": "cdc"
        }
      }
    ]
  }
  ```
Partition data based on a column value. Read more in the documentation about S3 partitioning.

```json
"partition_regex": "/{col_1, default_value, granularity}"
```

- `col_1`: Partitioning column. Supports `now()` as a value for the current date.
- `default_value`: If the column value is null or not parsable, the default value will be used instead.
- `granularity` (optional): Support for time-based columns. Supported values: `HH`, `DD`, `WW`, `MM`, `YY`.
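As an illustration, a stream partitioned by day on a timestamp column might be configured like the entry below; the column name created_at and the default value are hypothetical:

```json
{
  "partition_regex": "/{created_at, 1970-01-01, DD}",  // hypothetical column, default value, and day granularity
  "stream_name": "table1"
}
```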
To exclude streams, edit catalog.json and remove them from selected_streams.
Before:

```json
"selected_streams": {
  "namespace": [
    {
      "partition_regex": "/{col_1, default_value, granularity}",
      "stream_name": "table1"
    },
    {
      "partition_regex": "",
      "stream_name": "table2"
    }
  ]
}
```
After exclusion of table2:

```json
"selected_streams": {
  "namespace": [
    {
      "partition_regex": "/{col_1, default_value, granularity}",
      "stream_name": "table1"
    }
  ]
}
```
- Run the following command to sync data from MongoDB to your destination:

  ```sh
  docker run -v olake_folder_path:/mnt/config olakego/source-mongodb:latest sync --config /mnt/config/config.json --catalog /mnt/config/catalog.json --destination /mnt/config/writer.json
  ```
- If you've previously synced data and want to continue from where you left off, use the state file:

  ```sh
  docker run -v olake_folder_path:/mnt/config olakego/source-mongodb:latest sync --config /mnt/config/config.json --catalog /mnt/config/catalog.json --destination /mnt/config/writer.json --state /mnt/config/state.json
  ```
For more details, refer to the documentation.
For a collection of 230 million rows (664.81GB) from Twitter data, here's how Olake compares to other tools:
| Tool | Full Load Time | Performance |
|---|---|---|
| Olake | 46 mins | X times faster |
| Fivetran | 4 hours 39 mins (279 mins) | 6x slower |
| Airbyte | 16 hours (960 mins) | 20x slower |
| Debezium (Embedded) | 11.65 hours (699 mins) | 15x slower |
| Tool | Incremental Sync Time | Records per Second (r/s) | Performance |
|---|---|---|---|
| Olake | 28.3 sec | 35,694 r/s | X times faster |
| Fivetran | 3 min 10 sec | 5,260 r/s | 6.7x slower |
| Airbyte | 12 min 44 sec | 1,308 r/s | 27.3x slower |
| Debezium (Embedded) | 12 min 44 sec | 1,308 r/s | 27.3x slower |
Cost comparison (considering a first full load of 230 million rows and 50 million incremental rows per month), as of 30th September. Find more here.
Virtual Machine: Standard_D64as_v5

- CPU: 64 vCPUs
- Memory: 256 GiB RAM
- Storage: 250 GB of shared storage
- 3 Nodes running in a replica set configuration:
  - 1 Primary Node (Master) that handles all write operations.
  - 2 Secondary Nodes (Replicas) that replicate data from the primary node.
Find more here.
Drivers, aka Connectors/Sources, contain the logic for interacting with the database. The upcoming drivers being planned are:
- MongoDB (Documentation)
- MySQL (Coming Soon!)
- Postgres (Coming Soon!)
- DynamoDB
- Kafka
Writers are integrated directly into drivers to avoid blocking on writing to or reading from os.Stdout (or any other type of I/O). This enables records from each individually fired query to be inserted directly into the destination.
Writers are being planned in this order
- Parquet Writer (Writes Parquet files on Local/S3)
- S3 Iceberg Parquet (Coming Soon!)
- Snowflake
- BigQuery
- RedShift
The Core, or framework, is the component/logic that has been abstracted out of connectors to follow DRY. It includes the base CLI commands, state logic, validation logic, type detection for unstructured data, handling of the Config, State, Catalog, and Writer config files, logging, etc.
The Core includes an HTTP server that directly exposes live stats about a running sync, such as:
- Possible finish time
- Concurrently running processes
- Live record count
The Core handles the commands used to interact with a driver (a Docker invocation sketch follows this list):
- spec command: Returns a renderable JSON Schema that can be consumed by rjsf libraries in the frontend
- check command: Performs all necessary checks on the Config, Catalog, State, and Writer config
- discover command: Returns all streams and their schema
- sync command: Extracts data out of the source and writes it into destinations
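The discover and sync invocations below are taken from the getting-started steps above; the spec and check lines are assumptions about how the same Docker pattern would apply, so verify the exact flags against the CLI help or docs:

```sh
# discover and sync, as shown in the getting-started steps above
docker run -v olake_folder_path:/mnt/config olakego/source-mongodb:latest discover --config /mnt/config/config.json
docker run -v olake_folder_path:/mnt/config olakego/source-mongodb:latest sync --config /mnt/config/config.json --catalog /mnt/config/catalog.json --destination /mnt/config/writer.json

# assumed invocations for spec and check (the exact flags may differ)
docker run olakego/source-mongodb:latest spec
docker run -v olake_folder_path:/mnt/config olakego/source-mongodb:latest check --config /mnt/config/config.json
```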
SDKs are libraries/packages that can orchestrate the connector in two environments, i.e. Docker and Kubernetes. These SDKs can be consumed directly by users, similar to PyAirbyte and DLT-hub.
(Unconfirmed) SDKs may interact with connectors via a potential gRPC server to override certain default behaviors of the system by adding custom functions, enabling features like transformations, custom table names via the writer, or hooks.
OLake will be built on top of the SDKs, providing persistent storage and a user interface that enables orchestration directly from your machine, with S3 Iceberg Parquet as the default writer mode.
We ❤️ contributions big or small. Please read CONTRIBUTING.md to get started with making contributions to OLake.
Not sure how to get started? Just ping us on #contributing-to-olake in our Slack community.
You can find the docs at https://olake.io/docs. If you need any clarification or find something missing, feel free to raise a GitHub issue with the label documentation in the olake-docs repo, or reach out to us on the community Slack channel.
Join the Slack community to learn more about OLake, future roadmaps, community meetups, data lakes and lakehouses, and the data engineering ecosystem, and to connect with other users and contributors.
Check out the OLake Roadmap to track and influence the way we build it; your expert opinion is always welcome as we work to build a best-in-class open-source offering in the data space.
If you have any ideas, questions, or feedback, please share them in our GitHub Discussions or raise an issue.
As always, thanks to our amazing contributors!