This project was executed as part of the Data Engineering Zoomcamp 2025 course held by DataTalks.Club - a free nine-week course that covers the fundamentals of data engineering. The goal of this project is to apply what was learnt during the course by building an ELT data pipeline for processing server access log data:
- Extract data from a data source (GitHub repo)
- Load the data into a data lake (GCS) and a data warehouse (BigQuery)
- Transform the data in the data warehouse using dbt
Finally, the processed data will be presented in a Looker dashboard.
Server log files are text files that automatically record the events and transactions that happen on a web server, in this case Heroku. They provide insight into how search engine crawlers and humans navigate a website: which pages they looked at, how visitors found the site, and what errors, if any, they ran into. By analysing log files in real time, you can identify crawl errors and inefficiencies and quickly make adjustments.
The output of this project is a data pipeline that analyses the server access logs generated by the Kabardian Poems Collection website hosted on Heroku. The ELT pipeline extracts the data from the server logs, cleans it by removing unnecessary and sensitive information, uploads it to a cloud provider, transforms it to derive the intended insights, and displays them in a dashboard.
The following data is extracted from the Heroku server logs:
- IP address of the user making the request:
10.1.22.113
- Timestamp - time the request was made:
2025-02-11 16:04:15.059069+00:00
- HTTP request method:
GET
- URL requested:
/poems/
- HTTP protocol:
HTTP/1.1
- Response code:
200
- Size of response:
23279
- Referer - the web page the user is visiting from:
/poems/64/
- User-Agent - identifies the client software (browser, crawler, etc.) making the request:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
The data is extracted via the Heroku CLI using bash commands and uploaded to a GitHub repo, similar to how it was done in the course. The data is stored in CSV files and uploaded daily; each file contains the requests from the previous day.
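For illustration, here is a minimal Python sketch of the kind of processing step that turns raw access-log lines into the daily CSV files described above. It assumes a combined-log-style line layout; the regex, field names and file names are illustrative, not the exact format emitted by Heroku or used in this repo.

```python
import csv
import re

# Illustrative pattern for a combined-log-style line; the real Heroku output
# may differ, so treat this as a sketch of the parsing step, not the exact format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

FIELDS = ["ip", "timestamp", "method", "url", "protocol",
          "status", "size", "referer", "user_agent"]


def logs_to_csv(log_path: str, csv_path: str) -> None:
    """Parse raw access-log lines and write the extracted fields to a CSV file."""
    with open(log_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        for line in src:
            match = LOG_PATTERN.match(line)
            if match:  # skip lines that do not look like access-log entries
                writer.writerow(match.groupdict())


if __name__ == "__main__":
    # hypothetical file names for one day of logs
    logs_to_csv("access.log", "access_2025-02-11.csv")
```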
The end-to-end data pipeline includes the following steps:
- downloading, processing and uploading of the initial dataset to a data lake;
- moving the data from the lake to a DWH;
- transforming the data in the DWH and preparing it for the dashboard;
- visualising the transformed data.
- Infrastructure: Terraform
- Cloud: Google Cloud Platform (GCP)
- Data Lake: Google Cloud Storage (GCS)
- Data Warehouse: Google BigQuery (GBQ)
- Data Visualization: Looker Studio
- Containerization: Docker Compose
- Orchestration: Kestra
- Data Transformation: Data Build Tool (dbt Cloud)
- Programming Language: Python, SQL
Clone this repo.
Make sure you have the following pre-installed components:
- Docker Engine and Docker Compose (see details below)
The details below list the commands used to install and verify Docker Engine and Docker Compose on Ubuntu 24.04:

```bash
# install Docker Engine (used when we want to handle only one container)
docker --version
# Docker version 27.4.1, build b9d17ea

# verify the installation, run:
sudo service docker start

# this command downloads a test image and runs it in a container
# when the container runs, it prints a confirmation message and exits
sudo docker run hello-world

# install Docker Compose (used when we have multiple containers to handle)
docker compose version
# Docker Compose version v2.32.1
```
- Set up a GCP free account if you don't have one; the free trial expires after 90 days.
- Create a new project and take note of the project number and project ID.
- Create a service account for the project.
- Configure service account to get access to this project and download auth-keys (.json).
- Make sure the service account has all the permissions below:
- Viewer
- Storage Admin
- Storage Object Admin
- BigQuery Admin
- Download SDK for local setup.
- Set env var to point to your downloaded GCP auth-key by switching to the folder where the key was saved and running the commands below on the command line:
```bash
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token/session, and verify authentication
gcloud auth application-default login
```
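As an optional sanity check before moving on, a short Python snippet (assuming the google-auth package is installed) can confirm that the application-default credentials resolve:

```python
# Optional sanity check: confirm application-default credentials resolve.
# Requires the google-auth package (pip install google-auth).
import google.auth

credentials, project_id = google.auth.default()
print(f"Credentials loaded; default project: {project_id}")
```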
- Enable the following options under the APIs and services section:
Terraform is an infrastructure-as-code (IaC) tool and will be used to build and destroy the GCP resources. The infrastructure we need to create for this project consists of:
- a Cloud Storage Bucket (google_storage_bucket) for our Data Lake
- a BigQuery Dataset (google_bigquery_dataset)
See full Ubuntu installation instructions here.
You will find two files in the cloned terraform folder:
- main.tf
- variables.tf
- update variables.tf with your project ID:

```hcl
variable "project" {
  description = "ADD-YOUR-PROJECT-ID_HERE"
}
```
After running the commands below you should see the "server_logs_bucket" bucket in GCS and the "server_logs_data" dataset in BigQuery:

```bash
# Refresh service-account's auth-token for this session:
gcloud auth application-default login

# Initialize configuration and import plugins for the Google provider -
# cd to the folder with the Terraform config files and run:
terraform init

# Preview the resources that will be created; add your project ID when prompted:
terraform plan

# Create the infrastructure with apply; add your project ID when prompted and type "yes":
terraform apply
```
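If you prefer to double-check the result from code rather than the GCP console, a small Python sketch (assuming the google-cloud-storage and google-cloud-bigquery packages are installed, and using the resource names from this section) could look like this:

```python
# Optional verification that the Terraform-created resources exist.
# Assumes GOOGLE_APPLICATION_CREDENTIALS is set as described above.
from google.cloud import bigquery, storage

storage_client = storage.Client()
print("Bucket exists:", storage_client.bucket("server_logs_bucket").exists())

bq_client = bigquery.Client()
dataset = bq_client.get_dataset("server_logs_data")  # raises NotFound if absent
print("Dataset location:", dataset.location)
```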
This section explains how to orchestrate the data ingestion into:
- a data lake (GCS) - the files from the GitHub Release will be moved to the server_logs_bucket, and
- a data warehouse (BigQuery) - the data from CSV files will be ingested into a table under the newly created server_logs_data dataset.
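For orientation, the sketch below shows these two steps in plain Python using the Google Cloud client libraries - it is not the Kestra flow itself, just an illustration of what the flow does. The bucket and dataset names come from the Terraform step; the table and file names are assumptions.

```python
# Illustration of the two ingestion steps the Kestra flow performs; this is a
# hand-rolled sketch, not the flow itself.
from google.cloud import bigquery, storage

BUCKET = "server_logs_bucket"
DATASET = "server_logs_data"
TABLE = "access_logs"                 # hypothetical table name
LOCAL_CSV = "access_2025-02-11.csv"   # hypothetical daily file

# 1) load the CSV into the data lake (GCS)
blob = storage.Client().bucket(BUCKET).blob(LOCAL_CSV)
blob.upload_from_filename(LOCAL_CSV)

# 2) load the GCS file into the data warehouse (BigQuery)
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = bq_client.load_table_from_uri(
    f"gs://{BUCKET}/{LOCAL_CSV}", f"{DATASET}.{TABLE}", job_config=job_config
)
load_job.result()  # wait for the load job to finish
```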
The cloned kestra folder contains flow files and docker-compose.yaml. Two of them need to be updated, see below:
- docker-compose.yaml : update "your-email-goes-here.com", row 45
- flows
  - 00_gcp_kv.yaml : update rows 10, 28, 34
    IMPORTANT! Watch out to NOT submit the updated file to GitHub!
  - 01_gcp_setup.yaml
  - 02_logs_2_gcs_2_bq.yaml
```bash
# change to the folder that includes your docker-compose.yaml file
cd kestra/

# build docker image
sudo docker compose build

# run docker compose in detached mode
sudo docker compose up -d
```
The above command will spin up two containers:
- kestra-postgres-1
- kestra-kestra-1
Wait until the above command has finished running, then open http://localhost:8080/ in your browser and add your flows via the UI: Flows > Create
Run 00_gcp_kv.yaml and 01_gcp_setup.yaml.
To run the 02_logs_2_gcs_2_bq.yaml flow, don't press Execute, as it contains a trigger. Instead:
Go to Triggers > "Backfill executions" > Select the start date > "Execute backfills".
Create 2 new empty datasets for your project in BigQuery:
- a development dataset, e.g.: dbt_dev_env
- a production dataset, e.g.: dbt_prod_env
NOTE: Make sure the datasets' region matches the region selected for the rest of the project.
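If you prefer to script this instead of clicking through the BigQuery UI, a hedged Python sketch (the project ID and region below are placeholders) could create both datasets with an explicit location:

```python
# Optional alternative to the BigQuery UI: create the two dbt datasets.
# Dataset names match the examples above; replace the placeholders.
from google.cloud import bigquery

PROJECT_ID = "your-project-id"   # placeholder
LOCATION = "EU"                  # use the same region as the rest of the project

client = bigquery.Client(project=PROJECT_ID)
for dataset_id in ("dbt_dev_env", "dbt_prod_env"):
    dataset = bigquery.Dataset(f"{PROJECT_ID}.{dataset_id}")
    dataset.location = LOCATION
    client.create_dataset(dataset, exists_ok=True)  # no error if it already exists
    print(f"Created (or found) dataset {dataset_id} in {LOCATION}")
```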
- Create a dbt Cloud account or log in to an existing one.
- Set up a GitHub repo for your dbt project.
- Set up dbt Cloud with BigQuery:
- connect dbt to BigQuery development dbt_dev_env dataset:
- create a BigQuery service account - use the instructions here; NOTE: the steps for creating a service account have changed slightly, e.g. before you can create a new service account you will first be asked whether you would like to re-use an existing one. In that case, press "Continue" to land on a form where you set up a new service account by providing the "Service account details":
- add service account name, e.g.: dbt-service-account
- press "create and continue"
- in the "Grant this service account access to project" select "BigQuery Admin"
- press "Done"
- select newly created service account and navigate to its "KEYS" section
- select "create new key" and the key type JSON; this will create and download the key file to your pc.
- set up a dbt Cloud project:
- Project name: server-logs-daily-data-pipeline and press "Continue"
- Configure your development environment: choose BigQuery as your data warehouse
- Upload the key you downloaded from BigQuery using the "create from file" option.
- Scroll down to the end of the page and set up your development credentials. This can also be done under: Develop > Configure Cloud CLI > Credentials > click on the project name > Development credentials > Dataset
- Set up a repository > GitHub > Connect GitHub Account > Authorize dbt Cloud > Add repository from: Git Clone > paste the SSH URL of your repo. More details here.
- You will get a deploy key in dbt Cloud. Copy it, head to your GitHub repo and go to the Settings tab. Under Security you'll find the "Deploy keys" menu. Click on "Add deploy key" and paste the deploy key provided by dbt Cloud. Make sure to tick "Allow write access".
- Replicate the file contents from the dbt folder of this repo (see that folder for the full file structure).
Deploying with dbt Cloud
Before going into production, make sure everything is submitted to GitHub. Then:
- navigate to your dbt environments via Deploy > Environments; at this stage only the Development environment should be there
- create a Production environment: click the "Create environment" button at the top right; the fields that need to be added/selected are:
- Environment name: Production
- Environment type: Deployment
- Connection: BigQuery
- Dataset: the name of your production dataset in BigQuery; press "Test Connection" and then Save after successful testing.
- go to Deploy > Jobs
- click on the "Create job" button, add the following details:
- Job name: dbt_build
- Environment: Production
- Execution settings > Commands:
- dbt seed
- dbt run
- dbt test
There are, of course, options to schedule the job, which I'm not using for now as I first want to run the job manually and see that it works; I therefore save the job and run it by clicking the "Run now" button.
When the production models are ready, you can start building a dashboard. The Server Access Logs Daily Data Pipeline dashboard is built using Looker Studio. The process for building a dashboard in Looker Studio can be found in this video.
The final dashboard includes the following diagrams:
- total requests
- total requests by page url
- HTTP status codes by page category
- bot vs human requests
- identified bots
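As a rough idea of the kind of aggregation behind a tile such as "total requests by page url", here is a hypothetical BigQuery query run from Python; the dataset, model and column names are assumptions, not the project's actual production schema:

```python
# Hypothetical sketch of the aggregation behind the "total requests by page url"
# tile; the table and column names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT url, COUNT(*) AS total_requests
    FROM `your-project-id.dbt_prod_env.fct_requests`   -- placeholder model name
    GROUP BY url
    ORDER BY total_requests DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.url, row.total_requests)
```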