Chat with your database or your datalake (SQL, CSV, parquet). PandasAI makes data analysis conversational using LLMs and RAG.
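A minimal sketch of the conversational workflow this description refers to, assuming the pandasai 2.x-style SmartDataframe API and an OpenAI key; class names and config keys may differ between releases.

```python
import pandas as pd
from pandasai import SmartDataframe          # assumption: pandasai 2.x-style API
from pandasai.llm import OpenAI

# Illustrative in-memory data; in practice this could be a CSV, Parquet file, or SQL table.
df = pd.DataFrame({"country": ["US", "DE", "FR"], "revenue": [120, 80, 95]})

llm = OpenAI(api_token="YOUR_API_KEY")       # placeholder key
sdf = SmartDataframe(df, config={"llm": llm})

# Ask a question in natural language; the LLM turns it into pandas operations.
print(sdf.chat("Which country has the highest revenue?"))
```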
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
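A short sketch of loading a public Deep Lake dataset, assuming the deeplake 3.x API (`deeplake.load` plus tensor-style access); newer releases expose a different interface, so treat this as illustrative only.

```python
import deeplake

# Load a public dataset hosted by Activeloop (3.x-style API; illustrative dataset path).
ds = deeplake.load("hub://activeloop/mnist-train")

# Tensors are read lazily and converted to NumPy arrays on demand.
first_image = ds.images[0].numpy()
first_label = ds.labels[0].numpy()
print(first_image.shape, first_label)
```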
An IDE and translation engine for detection engineers and threat hunters. Be faster, write smarter, keep 100% privacy.
A Data Platform built for AWS, powered by Kubernetes.
This repository will help you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-life work, using PySpark and Spark SQL for development. The course ends with a few case studies.
Simplified ETL process in Hadoop using Apache Spark. Includes a complete ETL pipeline for a data lake, plus SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations.
A library to accelerate ML and ETL pipelines by connecting all data sources.
An idiomatic Python SDK for Cortex™ Data Lake.
An AWS-based solution using AWS CloudWatch and a Python AWS Lambda function to automatically terminate AWS EMR clusters that have been idle for a specified period of time.
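A hedged sketch of the underlying idea, assuming a Python Lambda handler with boto3: list WAITING clusters and terminate those past an idle threshold. The actual solution likely relies on CloudWatch's IsIdle metric and scheduled rules; the ready-time check below is only a stand-in.

```python
import boto3
from datetime import datetime, timedelta, timezone

IDLE_THRESHOLD = timedelta(hours=2)   # illustrative threshold
emr = boto3.client("emr")

def handler(event, context):
    # WAITING clusters have no running steps and are candidates for termination.
    clusters = emr.list_clusters(ClusterStates=["WAITING"])["Clusters"]
    now = datetime.now(timezone.utc)
    for cluster in clusters:
        ready = cluster["Status"]["Timeline"].get("ReadyDateTime")
        if ready and now - ready > IDLE_THRESHOLD:
            emr.terminate_job_flows(JobFlowIds=[cluster["Id"]])
```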
This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog Deploy data lake ETL jobs using CDK Pipelines, and complements the InsuranceLake Infrastructure project.
This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog Deploy data lake ETL jobs using CDK Pipelines, and complements the InsuranceLake ETL with CDK Pipelines project.
OEDI Data Lake Access
Building a poor man's data lake: Exploring the Power of Polars and Delta Lake
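A minimal sketch of the "poor man's data lake" pattern the post explores: append a Polars DataFrame to a local Delta table and read it back. Assumes polars with the deltalake extra installed; the path and data are illustrative.

```python
import polars as pl

# Illustrative transactional data.
df = pl.DataFrame({"id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})

# Append to a local Delta table (created on first write).
df.write_delta("./lake/transactions", mode="append")

# Read the table back, with Delta handling file layout and versioning.
print(pl.read_delta("./lake/transactions"))
```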
Sample data store project to be hosted on a remote server or cluster. CICD using GitHub actions for SSH Deploy to remote server for docker compose.
How to build a complete Data Platform.
Built a functional Python ETL script with functions that initialize Spark clusters using the PySpark library to extract songs stored in an S3 bucket. Partitioned the songs data by year and artist_id and compressed it into Parquet output files to improve load performance. Used overwrite mode in Spark to ensure every new run of the ELT script is overwritten in th…
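A short PySpark sketch of the partitioning and overwrite step described above, with illustrative local paths in place of the original S3 bucket; column names other than year and artist_id are assumed for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("songs_etl").getOrCreate()

# Extract raw song records (illustrative local path; the original reads from S3).
songs = spark.read.json("data/song_data/*.json")

# Partition by year and artist_id and write Parquet, overwriting prior runs.
(songs.select("song_id", "title", "artist_id", "year", "duration")
      .write.mode("overwrite")
      .partitionBy("year", "artist_id")
      .parquet("output/songs/"))
```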
🚴♂️⛷ Data Lake: performance tuning for text extraction from a huge number of files.