This project was executed as part of the Data Engineering Zoomcamp 2025 course held by DataTalks.Club - a free nine-week course that covers the fundamentals of data engineering. The goal of this project is to apply what was learnt during the course by building an ELT data pipeline for processing server access log data:
- Extract data from a data source (GitHub repo)
- Load the data into a data lake (GCS) and a data warehouse (BigQuery)
- Transform the data in the data warehouse using dbt
Finally, the processed data will be presented in a Looker dashboard.
Server log files are text files that automatically record the events and transactions that happen on a web server, in this case Heroku. They provide insight into how search engine crawlers and humans navigate a website: which pages they looked at, how visitors found the site, and what errors, if any, they ran into. By analysing log files in real time, you can identify crawl errors and inefficiencies and quickly make adjustments.
The output of this project is a data pipeline that analyses the server access logs generated by the Kabardian Poems Collection website hosted on Heroku. The ELT pipeline extracts the data from the server logs, cleans it by removing unnecessary and sensitive information, uploads it to a cloud provider, transforms it to derive the intended insights, and displays them in a dashboard.
The following data is extracted from the Heroku server logs:
- IP address of the user making the request:
10.1.22.113
- Timestamp - time the request was made:
2025-02-11 16:04:15.059069+00:00
- HTTP request method:
GET
- URL requested:
/poems/
- HTTP protocol:
HTTP/1.1
- Response code:
200
- Size of response:
23279
- Referer - the web page the user is visiting from:
/poems/64/
- User-Agent - identifies the client software (browser, crawler, etc.) making the request:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36
The data is extracted via the Heroku CLI using bash commands and uploaded to a GitHub repo, similar to how it was done in the course. The data is stored in CSV files and uploaded daily; each file contains the requests from the previous day.
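For illustration, here is a minimal Python sketch of the kind of processing step that turns raw access-log lines into the daily CSV files described above. It assumes a combined-log-style line layout; the regex, field names and file names are illustrative, not the exact format emitted by Heroku or used in this repo.

```python
import csv
import re

# Illustrative pattern for a combined-log-style line; the real Heroku output
# may differ, so treat this as a sketch of the parsing step, not the exact format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+) "(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

FIELDS = ["ip", "timestamp", "method", "url", "protocol",
          "status", "size", "referer", "user_agent"]


def logs_to_csv(log_path: str, csv_path: str) -> None:
    """Parse raw access-log lines and write the extracted fields to a CSV file."""
    with open(log_path) as src, open(csv_path, "w", newline="") as dst:
        writer = csv.DictWriter(dst, fieldnames=FIELDS)
        writer.writeheader()
        for line in src:
            match = LOG_PATTERN.match(line)
            if match:  # skip lines that do not look like access-log entries
                writer.writerow(match.groupdict())


if __name__ == "__main__":
    # hypothetical file names for one day of logs
    logs_to_csv("access.log", "access_2025-02-11.csv")
```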
The end-to-end data pipeline includes the following steps:
- downloading, processing and uploading of the initial dataset to a data lake;
- moving the data from the lake to a DWH;
- transforming the data in the DWH and preparing it for the dashboard;
- visualising the transformed data.
- Infrastructure: Terraform
- Cloud: Google Cloud Platform (GCP)
- Data Lake: Google Cloud Storage (GCS)
- Data Warehouse: Google BigQuery (GBQ)
- Data Visualization: Looker Studio
- Containerization: Docker Compose
- Orchestration: Kestra
- Data Transformation: Data Build Tool (dbt Cloud)
- Programming Language: Python, SQL
Clone this repo.
Make sure you have the following pre-installed components:
- Docker Engine and Docker Compose (see details below)
The details below list the commands used to install and verify Docker Engine and Docker Compose on Ubuntu 24.04:

```bash
# install Docker Engine (used when we want to handle only one container)
docker --version
# Docker version 27.4.1, build b9d17ea

# verify the installation, run:
sudo service docker start

# this command downloads a test image and runs it in a container
# when the container runs, it prints a confirmation message and exits
sudo docker run hello-world

# install Docker Compose (used when we have multiple containers to handle)
docker compose version
# Docker Compose version v2.32.1
```
- Set up a GCP free account if you don't have one; the free trial expires after 90 days.
- Create a new project and take note of the project number and project ID.
- Create a service account for the project.
- Configure service account to get access to this project and download auth-keys (.json).
- Make sure the service account has all the permissions below:
- Viewer
- Storage Admin
- Storage Object Admin
- BigQuery Admin
- Download SDK for local setup.
- Set env var to point to your downloaded GCP auth-key by switching to the folder where the key was saved and running the commands below on the command line:
```bash
export GOOGLE_APPLICATION_CREDENTIALS="<path/to/your/service-account-authkeys>.json"

# Refresh token/session, and verify authentication
gcloud auth application-default login
```
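As an optional sanity check before moving on, a short Python snippet (assuming the google-auth package is installed) can confirm that the application-default credentials resolve:

```python
# Optional sanity check: confirm application-default credentials resolve.
# Requires the google-auth package (pip install google-auth).
import google.auth

credentials, project_id = google.auth.default()
print(f"Credentials loaded; default project: {project_id}")
```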
- Enable the following options under the APIs and services section:
Terraform is an infrastructure-as-code (IaC) tool and will be used to build and destroy the GCP resources. The infrastructure we need to create for this project consists of:
- a Cloud Storage Bucket (google_storage_bucket) for our Data Lake
- a BigQuery Dataset (google_bigquery_dataset)
See full Ubuntu installation instructions here.
You will find two files in the cloned terraform folder:
- main.tf
- variables.tf
- update variables.tf with your project ID:

```hcl
variable "project" {
  description = "ADD-YOUR-PROJECT-ID_HERE"
}
```
After running the commands below you should see the "server_logs_bucket" bucket in GCS and the "server_logs_data" dataset in BigQuery:

```bash
# Refresh service-account's auth-token for this session:
gcloud auth application-default login

# Initialize configuration and import plugins for the Google provider -
# cd to the folder with the Terraform config files and run:
terraform init

# Preview the resources that will be created; add your project ID when prompted:
terraform plan

# Create the infrastructure with apply; add your project ID when prompted and type "yes":
terraform apply
```
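If you prefer to double-check the result from code rather than the GCP console, a small Python sketch (assuming the google-cloud-storage and google-cloud-bigquery packages are installed, and using the resource names from this section) could look like this:

```python
# Optional verification that the Terraform-created resources exist.
# Assumes GOOGLE_APPLICATION_CREDENTIALS is set as described above.
from google.cloud import bigquery, storage

storage_client = storage.Client()
print("Bucket exists:", storage_client.bucket("server_logs_bucket").exists())

bq_client = bigquery.Client()
dataset = bq_client.get_dataset("server_logs_data")  # raises NotFound if absent
print("Dataset location:", dataset.location)
```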
This section explains how to orchestrate the data ingestion into:
- a data lake (GCS) - the files from the GitHub Release will be moved to the server_logs_bucket, and
- a data warehouse (BigQuery) - the data from CSV files will be ingested into a table under the newly created server_logs_data dataset.
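For orientation, the sketch below shows these two steps in plain Python using the Google Cloud client libraries - it is not the Kestra flow itself, just an illustration of what the flow does. The bucket and dataset names come from the Terraform step; the table and file names are assumptions.

```python
# Illustration of the two ingestion steps the Kestra flow performs; this is a
# hand-rolled sketch, not the flow itself.
from google.cloud import bigquery, storage

BUCKET = "server_logs_bucket"
DATASET = "server_logs_data"
TABLE = "access_logs"                 # hypothetical table name
LOCAL_CSV = "access_2025-02-11.csv"   # hypothetical daily file

# 1) load the CSV into the data lake (GCS)
blob = storage.Client().bucket(BUCKET).blob(LOCAL_CSV)
blob.upload_from_filename(LOCAL_CSV)

# 2) load the GCS file into the data warehouse (BigQuery)
bq_client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
load_job = bq_client.load_table_from_uri(
    f"gs://{BUCKET}/{LOCAL_CSV}", f"{DATASET}.{TABLE}", job_config=job_config
)
load_job.result()  # wait for the load job to finish
```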
The cloned kestra folder contains flow files and docker-compose.yaml. Two of them need to be updated, see below:
- docker-compose.yaml : update "your-email-goes-here.com", row 45
- flows
  - 00_gcp_kv.yaml : update rows 10, 28, 34
    IMPORTANT! Watch out to NOT submit the updated file to GitHub!
  - 01_gcp_setup.yaml
  - 02_logs_2_gcs_2_bq.yaml
```bash
# change to the folder that includes your docker-compose.yaml file
cd kestra/

# build docker image
sudo docker compose build

# run docker compose in detached mode
sudo docker compose up -d
```
The above command will spin up two containers:
- kestra-postgres-1
- kestra-kestra-1
Wait until the above command has finished running, then open http://localhost:8080/ in your browser and add your flows via the UI: Flows > Create
Run 00_gcp_kv.yaml and 01_gcp_setup.yaml.
To run the 02_logs_2_gcs_2_bq.yaml flow, don't press Execute, as it contains a trigger. Instead:
Go to Triggers > "Backfill executions" > Select the start date > "Execute backfills".
Create 2 new empty datasets for your project in BigQuery:
- a development dataset, e.g.: dbt_dev_env
- a production dataset, e.g.: dbt_prod_env
NOTE: Make sure the datasets' region matches the region selected for the rest of the project.
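If you prefer to script this instead of clicking through the BigQuery UI, a hedged Python sketch (the project ID and region below are placeholders) could create both datasets with an explicit location:

```python
# Optional alternative to the BigQuery UI: create the two dbt datasets.
# Dataset names match the examples above; replace the placeholders.
from google.cloud import bigquery

PROJECT_ID = "your-project-id"   # placeholder
LOCATION = "EU"                  # use the same region as the rest of the project

client = bigquery.Client(project=PROJECT_ID)
for dataset_id in ("dbt_dev_env", "dbt_prod_env"):
    dataset = bigquery.Dataset(f"{PROJECT_ID}.{dataset_id}")
    dataset.location = LOCATION
    client.create_dataset(dataset, exists_ok=True)  # no error if it already exists
    print(f"Created (or found) dataset {dataset_id} in {LOCATION}")
```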
- Create a dbt Cloud account or log in to an existing one.
- Set up a GitHub repo for your dbt project.
- Set up dbt Cloud with BigQuery:
- connect dbt to BigQuery development dbt_dev_env dataset:
- create a BigQuery service account - use the instructions here; NOTE: the steps for creating a service account have changed slightly, e.g. before you can create a new service account you will first be asked whether you would like to re-use an existing one. In that case, press "Continue" to land on a form where you set up a new service account by providing the "Service account details":
- add service account name, e.g.: dbt-service-account
- press "create and continue"
- in the "Grant this service account access to project" select "BigQuery Admin"
- press "Done"
- select newly created service account and navigate to its "KEYS" section
- select "create new key" and the key type JSON; this will create and download the key file to your pc.
- set up a dbt Cloud project:
- Project name: server-logs-daily-data-pipeline and press "Continue"
- Configure your development environment: choose BigQuery as your data warehouse
- Upload the key you downloaded from BigQuery using the "create from file" option.
- Scroll down to the end of the page and set up your development credentials. This can also be done under: Develop > Configure Cloud CLI > Credentials > click on the project name > Development credentials > Dataset
- Set up a repository > GitHub > Connect GitHub Account > Authorize dbt Cloud > Add repository from: Git Clone > paste the SSH URL of your repo. More details here.
- You will get a deploy key in dbt Cloud. Copy it, head to your GitHub repo and go to the Settings tab. Under Security you'll find the "Deploy keys" menu. Click on "Add deploy key" and paste the deploy key provided by dbt Cloud. Make sure to tick "Allow write access".
- Replicate the file contents from the dbt folder of this repo (see that folder for the full file structure).
Deploying with dbt Cloud
Before going into production, make sure everything is submitted to GitHub. Then:
- navigate to your dbt environments via Deploy > Environments; at this stage only the Development environment should be there
- create a Production environment: click the "Create environment" button at the top right; the fields that need to be added/selected are:
- Environment name: Production
- Environment type: Deployment
- Connection: BigQuery
- Dataset: the name of your production dataset in BigQuery; press "Test Connection" and then Save after successful testing.
- go to Deploy > Jobs
- click on the "Create job" button, add the following details:
- Job name: dbt_build
- Environment: Production
- Execution settings > Commands:
- dbt seed
- dbt run
- dbt test
There are, of course, options to schedule the job, which I'm not using for now as I first want to run the job manually and see that it works; I therefore save the job and run it by clicking the "Run now" button.
When the production models are ready, you can start building a dashboard. The Server Access Logs Daily Data Pipeline dashboard is built using Looker Studio. The process for building a dashboard in Looker Studio can be found in this video.
The final dashboard includes the following diagrams:
- total requests
- total requests by page url
- HTTP status codes by page category
- bot vs human requests
- identified bots
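As a rough idea of the kind of aggregation behind a tile such as "total requests by page url", here is a hypothetical BigQuery query run from Python; the dataset, model and column names are assumptions, not the project's actual production schema:

```python
# Hypothetical sketch of the aggregation behind the "total requests by page url"
# tile; the table and column names below are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT url, COUNT(*) AS total_requests
    FROM `your-project-id.dbt_prod_env.fct_requests`   -- placeholder model name
    GROUP BY url
    ORDER BY total_requests DESC
    LIMIT 10
"""
for row in client.query(query).result():
    print(row.url, row.total_requests)
```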