Repartition & Parallel Data Ingestion

Cloud services:
- Source data sits in file-system-like storage (I used S3), so you'll need an AWS account (free tier works)
- An MPP warehouse to manage Big Data seamlessly: the target data warehouse is Snowflake (free tier works)
- A Databricks account for running the Spark jobs
So this project will help you migrate data from plain file-system-like storage (AWS S3) to a Massively Parallel Processing data warehouse - Snowflake.

Learnings:
- Touch base on AWS services & use S3 for source data storage
- Understand Parallel data ingestion
- Usage of basic Spark commands (I've used PySpark here)
- Databricks - hands-on cluster computing (it's free & very handy; best part, you can run it on your mobile as well! yayyy)
- Finally, one of my fav data warehouses - Snowflake
- Also some classes, data structures and not-so-shitty code :D
- Click on this link: https://accounts.cloud.databricks.com/registration.html - it redirects you to the Databricks sign-up page.
- Set it up
- Install the dependencies as mentioned in the code & run it: data_ingest_and_aggregate.py (https://github.com/AbhijithKp98/data_migration_tutorial_basic/blob/master/data_ingest_and_aggregate.py)
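The exact dependency list lives in the script itself; as a rough illustration only (placeholders, not the project's actual setup), a Databricks notebook for this kind of job needs the Snowflake Python connector available and Spark able to read the S3 bucket:

```python
# Rough setup sketch - all keys, bucket names and package choices here are
# placeholders, not the project's exact configuration.
# On Databricks you can install extra packages per notebook, e.g.:
#   %pip install snowflake-connector-python

from pyspark.sql import SparkSession

# On Databricks `spark` already exists; getOrCreate() just reuses it.
spark = SparkSession.builder.getOrCreate()

# Hand S3 credentials to the s3a filesystem. The project itself used an
# IAM role with limited access instead of raw keys (see the note at the end).
hadoop_conf = spark.sparkContext._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "<AWS_ACCESS_KEY_ID>")
hadoop_conf.set("fs.s3a.secret.key", "<AWS_SECRET_ACCESS_KEY>")
```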
- Product table: primary key - sku, clustered by - sku
- An aggregated table on the above rows, with `name` and `no. of products` as the columns
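To make the aggregated table concrete, here is a minimal PySpark sketch of that aggregation - the sample rows and app name are made up; only the column names come from the spec above:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("aggregation_example").getOrCreate()

# Made-up product rows with the columns described above
products = spark.createDataFrame(
    [
        ("soap", "SKU-1", "bath soap"),
        ("soap", "SKU-2", "hand soap"),
        ("brush", "SKU-3", "tooth brush"),
    ],
    ["name", "sku", "description"],
)

# `no. of products` per `name`, i.e. the aggregated table
agg = products.groupBy("name").agg(F.count("sku").alias("no_of_products"))
agg.show()  # soap -> 2, brush -> 1
```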
Creating the product table in Snowflake: `CREATE TABLE PRODUCTS ( NAME VARCHAR, SKU VARCHAR NOT NULL, DESCRIPTION VARCHAR, PRIMARY KEY (SKU) ) CLUSTER BY (SKU)`
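If you prefer to run that DDL from Python instead of a Snowflake worksheet, a hedged sketch using snowflake-connector-python could look like this (all connection values are placeholders; the project may equally have created the table through the Snowflake UI):

```python
import snowflake.connector

# Placeholder credentials - in practice keep these out of the code.
conn = snowflake.connector.connect(
    account="<account>",        # e.g. xy12345.us-east-1
    user="<user>",
    password="<password>",
    database="<database>",
    schema="<schema>",
    warehouse="<warehouse>",
)

conn.cursor().execute(
    """
    CREATE TABLE IF NOT EXISTS PRODUCTS (
        NAME        VARCHAR,
        SKU         VARCHAR NOT NULL,
        DESCRIPTION VARCHAR,
        PRIMARY KEY (SKU)
    ) CLUSTER BY (SKU)
    """
)
conn.close()
```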
Used classes to build the solution and modularized it, so that the end consumer can just run the query without any dependency issues.
Snowflake - a Massively Parallel Processing, enterprise-level data warehouse which supports parallel data ingestion. The ingestion here happens in less than 7 secs, i.e., Spark dataframe to table on Snowflake. Note: writing the CSV from an S3 stage to Snowflake is even more efficient.
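That note refers to Snowflake's bulk-load path: point an external stage at the S3 bucket and COPY INTO the table, so Snowflake ingests the file in parallel. A rough sketch of that route (bucket, stage name and credentials are placeholders, not from the project):

```python
import snowflake.connector

# Placeholder connection values
conn = snowflake.connector.connect(
    account="<account>", user="<user>", password="<password>",
    database="<database>", schema="<schema>", warehouse="<warehouse>",
)
cur = conn.cursor()

# External stage pointing at the source bucket
cur.execute("""
    CREATE OR REPLACE STAGE products_stage
      URL = 's3://my-source-bucket/'
      CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
""")

# Parallel bulk load straight from the staged CSV into the table
cur.execute("""
    COPY INTO PRODUCTS
      FROM @products_stage/products.csv
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
""")
conn.close()
```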
Created the table on Snowflake keeping SKU as both the primary key & the cluster key, so that any DML operation is supported & is computationally efficient.
The source data is loaded into a Spark dataframe & the aggregation is done in Spark. The result is written directly into Snowflake, which creates a table with the specified name automatically in the specified database & schema.
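Roughly, that flow looks like the sketch below - a minimal version, not the actual project code: the bucket path, table names and Snowflake connection values are placeholders (see data_ingest_and_aggregate.py for the real thing):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3_to_snowflake").getOrCreate()

# 1. Load the source CSV from S3 into a Spark dataframe
products_df = (
    spark.read
    .option("header", True)
    .csv("s3a://my-source-bucket/products.csv")   # placeholder path
)

# 2. Aggregate: number of products per name
agg_df = products_df.groupBy("name").agg(F.count("sku").alias("no_of_products"))

# 3. Connection options for the spark-snowflake connector (placeholders)
sf_options = {
    "sfURL": "<account>.snowflakecomputing.com",
    "sfUser": "<user>",
    "sfPassword": "<password>",
    "sfDatabase": "<database>",
    "sfSchema": "<schema>",
    "sfWarehouse": "<warehouse>",
}

# 4. Write both dataframes straight into Snowflake; the connector creates
#    the target table automatically if it doesn't exist.
(
    products_df.write.format("snowflake")
    .options(**sf_options)
    .option("dbtable", "PRODUCTS")
    .mode("overwrite")
    .save()
)
(
    agg_df.write.format("snowflake")
    .options(**sf_options)
    .option("dbtable", "PRODUCTS_AGG")   # placeholder name for the aggregate table
    .mode("overwrite")
    .save()
)
```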
Implemented the solution with this architecture as I had time & resource constraints. Given additional time, the enhancements would be:
1. AWS S3 - Data Staging (home for csv file)
2. AWS EMR - Compute Engine
3. Airflow - Automating the process with this orchestration tool & hosting it on AWS EC2 (a minimal DAG sketch follows after this list)
4. AWS Redshift - Target DW
5. Apache Spark - Distributed computing
6. Deployment - Would package the tasks & dependencies in Docker and manage multiple of them on Kubernetes.
This architecture would make it an enterprise-level solution & a pretty long-term one.
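For the Airflow piece (item 3), a bare-bones DAG might look something like this sketch - the task names, schedule and placeholder callables are all hypothetical, not part of the current project:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def stage_to_s3():
    """Placeholder: land the source CSV in the S3 staging bucket."""


def run_spark_aggregation():
    """Placeholder: submit the PySpark job (e.g. to an EMR cluster)."""


def load_to_warehouse():
    """Placeholder: copy the results into the target warehouse."""


with DAG(
    dag_id="s3_to_warehouse_migration",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="stage_to_s3", python_callable=stage_to_s3)
    transform = PythonOperator(task_id="spark_aggregation", python_callable=run_spark_aggregation)
    load = PythonOperator(task_id="load_to_warehouse", python_callable=load_to_warehouse)

    stage >> transform >> load
```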
Would have added AWS KMS to manage all the credentials, but Databricks Community Edition doesn't support it
(workaround used: creating an IAM role with limited access).