Architecture Overview
Which technologies are most in demand at the moment?
How does the demand for a particular technology evolve over time?
How long is a job offer online until it is taken offline?
Data Ingestion
Data Lake
Data Warehouse
Data Ingestion
The data is ingested via web scraping
The data source is the biggest recruitment platform in Germany
From this data source, two entities are scraped:
Sitemap
Job Description
Sitemap
The sitemap format is just plain XML
It contains URL links to all job descriptions
If a job is present in the sitemap, it means that it is online
Otherwise, it is offline
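As a sketch, the online/offline check can be done by parsing the sitemap with the stdlib XML parser and diffing two snapshots; the URLs and the second snapshot below are illustrative, only the sitemap namespace is standard:

```python
import xml.etree.ElementTree as ET

# Illustrative sitemap snapshot (real snapshots come from the scraper)
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/jobs/7577548</loc></url>
  <url><loc>https://example.com/jobs/7577549</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def job_urls(xml_text: str) -> set[str]:
    """Extract the set of job URLs contained in one sitemap snapshot."""
    root = ET.fromstring(xml_text)
    return {loc.text for loc in root.findall("sm:url/sm:loc", NS)}

previous = job_urls(SITEMAP_XML)
current = {"https://example.com/jobs/7577549"}  # hypothetical next snapshot
offline = previous - current  # jobs taken offline between the two snapshots
```

Comparing consecutive snapshots like this is what makes the "how long is a job offer online" question answerable.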
Job Description
A job description is a full HTML page
It contains all the publicly available details about the job offer:
company, location, job title, and full description
Web Scraping
The Sitemap is scraped via regular HTTP requests
But due to the data source restrictions, real user traffic needs to be simulated to crawl and download the Job Descriptions
To simulate web interactions, I use Playwright with a headless browser
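As a rough sketch (assuming Playwright for Python is installed via `pip install playwright` and `playwright install chromium`), the download step could look like this; the function name and wait strategy are illustrative choices:

```python
def fetch_job_description(url: str) -> str:
    """Download one Job Description page by driving a headless browser."""
    # Imported lazily so the module loads even without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded")
        html = page.content()  # full rendered HTML of the job offer
        browser.close()
        return html
```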
Both entities, Sitemap and Job Description, are stored in the Data Lake raw layer as-is, without any transformation
Job Orchestration
Airflow is used to orchestrate and schedule the data ingestion
The web scraping Airflow DAG is as follows
Job Orchestration
Additional details:
Data Lake
It is where all ingested data is stored permanently
It is located on my computer's file system, but it could easily be relocated to cloud storage
It is divided into two layers:
Raw Layer
Cleansed Layer
Raw Layer
It stores the data in exactly the format in which it is ingested, e.g. XML or HTML
The folder structure is as follows:
raw
└── <data_source>
    ├── <entity>
    │   ├── <year>
    │   │   ├── <month>
    │   │   │   ├── <date>
    │   │   │   │   ├── <hour>-<minute>-<second>
    │   │   │   │   │   ├── <entity_id>.<file_ext>
    │   │   │   │   │   ├── <entity_id>.<file_ext>
    │   │   │   │   │   └── ...
Here is an example of the data structure
raw
└── stepstone
    ├── job_description
    │   ├── 2021
    │   │   ├── 10
    │   │   │   ├── 03
    │   │   │   │   ├── 21-00-00
    │   │   │   │   │   ├── 7577548.html
    │   │   │   │   │   ├── 7577549.html
    │   │   │   │   │   └── ...
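A sketch of deriving such a raw-layer path from an ingestion timestamp, using only the stdlib; the function name is illustrative:

```python
from datetime import datetime
from pathlib import Path

def raw_path(data_source: str, entity: str, entity_id: int,
             file_ext: str, ts: datetime) -> Path:
    """Build a raw-layer path following the layout described above:
    raw/<data_source>/<entity>/<year>/<month>/<date>/<hh>-<mm>-<ss>/<id>.<ext>
    """
    return Path("raw", data_source, entity,
                f"{ts:%Y}", f"{ts:%m}", f"{ts:%d}",
                f"{ts:%H}-{ts:%M}-{ts:%S}",
                f"{entity_id}.{file_ext}")

ts = datetime(2021, 10, 3, 21, 0, 0)
p = raw_path("stepstone", "job_description", 7577548, "html", ts)
```

Keying each scrape run by its timestamp is what allows many snapshots of the same entity to coexist in the lake.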
I have been scraping the web source since October 2021
The current raw layer data size is around 275 GB
And it grows by around 2 GB per day
Cleansed Layer
The cleansed layer contains data in Parquet format
Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval
This format enables easy and performant data exploration across the whole dataset
The data in the cleansed layer is populated by processing the raw layer data
This processing parses the entities in the raw layer
And discards corrupted data
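A minimal sketch of this parse-and-discard step, using only the stdlib HTML parser; the `<h1>` title convention is an assumption for illustration, not the real page layout:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Extract the first <h1> text from a raw job description page."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.title = None
    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
    def handle_data(self, data):
        if self.in_h1 and self.title is None:
            self.title = data.strip()

def parse_job(html: str):
    """Return a parsed record, or None so corrupted data can be discarded."""
    if not html:
        return None
    parser = TitleParser()
    parser.feed(html)
    return {"title": parser.title} if parser.title else None

record = parse_job("<html><body><h1>Data Engineer</h1></body></html>")
```

Records that come back as None are dropped; the rest are written to Parquet.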
Additional information:
The cleansed layer data is partitioned as follows:
cleansed
└── <data_source>
    ├── <entity>
    │   ├── year=<year>
    │   │   ├── month=<month>
    │   │   │   ├── day=<day>
    │   │   │   │   └── part-0.parquet
    │   │   │   ├── day=<day>
    │   │   │   │   └── ...
Here is an example of the cleansed layer structure
cleansed
└── stepstone
    ├── job_description
    │   ├── year=2021
    │   │   ├── month=10
    │   │   │   ├── day=03
    │   │   │   │   └── part-0.parquet
    │   │   │   ├── day=04
    │   │   │   │   └── ...
The cleansed layer is also the data source for the data warehouse
The data contained in the Parquet files is accessed directly by the data warehouse
Data Warehouse
The Data Warehouse is based on DuckDB
DuckDB is an in-process SQL OLAP database management system
Its architecture is very similar to SQLite's, but it is column-oriented and optimized for analytical workloads
Being an in-process database is a big advantage, since it consumes no resources unless it is actively used by a program
In order to ingest data into the data warehouse
DuckDB uses a few views that expose the data stored in Parquet format
I call this layer, shared between the data lake and the data warehouse, "curated"; from the point of view of the data warehouse, it could also have been called "staging" or "ingestion"
Once the data is exposed as Parquet views
dbt scripts are used to process the curated data
And populate the fact and dimension tables
The data model in the data warehouse is as follows:
Architecture Overview
Thanks for your attention!