Architecture Overview
Which technologies are most in demand at the moment?
How does the demand for a particular technology evolve over time?
How long is a job offer online until it is taken offline?
Data Ingestion
Data Lake
Data Warehouse
Data Ingestion
The data is ingested via web scraping
The data source is the biggest recruitment platform in Germany
From this data source, two entities are scraped:
Sitemap
Job Description
Sitemap
The sitemap format is just plain XML
It contains URL links to all job descriptions
If a job is present in the sitemap, it means that it is online
Otherwise, it is offline
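As a sketch, the online/offline check can be done by parsing the sitemap with the stdlib XML parser and diffing two snapshots; the URLs and the second snapshot below are illustrative, only the sitemap namespace is standard:

```python
import xml.etree.ElementTree as ET

# Illustrative sitemap snapshot (real snapshots come from the scraper)
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/jobs/7577548</loc></url>
  <url><loc>https://example.com/jobs/7577549</loc></url>
</urlset>"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def job_urls(xml_text: str) -> set[str]:
    """Extract the set of job URLs contained in one sitemap snapshot."""
    root = ET.fromstring(xml_text)
    return {loc.text for loc in root.findall("sm:url/sm:loc", NS)}

previous = job_urls(SITEMAP_XML)
current = {"https://example.com/jobs/7577549"}  # hypothetical next snapshot
offline = previous - current  # jobs taken offline between the two snapshots
```

Comparing consecutive snapshots like this is what makes the "how long is a job offer online" question answerable.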
Job Description
A job description is a full HTML page
It contains all the publicly available details about the job offer:
company, location, job title, and full description
Web Scraping
The Sitemap is scraped via regular HTTP requests
But due to the data source restrictions, real user traffic needs to be simulated to crawl and download the Job Descriptions
To simulate web interactions, I use Playwright with a headless browser
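As a rough sketch (assuming Playwright for Python is installed via `pip install playwright` and `playwright install chromium`), the download step could look like this; the function name and wait strategy are illustrative choices:

```python
def fetch_job_description(url: str) -> str:
    """Download one Job Description page by driving a headless browser."""
    # Imported lazily so the module loads even without Playwright installed
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="domcontentloaded")
        html = page.content()  # full rendered HTML of the job offer
        browser.close()
        return html
```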
Both entities, Sitemap and Job Description, are stored in the Data Lake raw layer as-is, without any transformation
Job Orchestration
Airflow is used to orchestrate and schedule the data ingestion
The web scraping Airflow DAG is as follows
Job Orchestration
Additional details:
Data Lake
It is where all ingested data is stored permanently
It is located on my computer's file system, but it could easily be relocated to cloud storage
It is divided into two layers:
Raw Layer
Cleansed Layer
Raw Layer
It stores the data in exactly the format in which it is ingested, e.g. XML or HTML
The folder structure is as follows:
raw
└── <data_source>
    ├── <entity>
    │   ├── <year>
    │   │   ├── <month>
    │   │   │   ├── <date>
    │   │   │   │   ├── <hour>-<minute>-<second>
    │   │   │   │   │   ├── <entity_id>.<file_ext>
    │   │   │   │   │   ├── <entity_id>.<file_ext>
    │   │   │   │   │   └── ...
Here is an example of the data structure
raw
└── stepstone
    ├── job_description
    │   ├── 2021
    │   │   ├── 10
    │   │   │   ├── 03
    │   │   │   │   ├── 21-00-00
    │   │   │   │   │   ├── 7577548.html
    │   │   │   │   │   ├── 7577549.html
    │   │   │   │   │   └── ...
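A sketch of deriving such a raw-layer path from an ingestion timestamp, using only the stdlib; the function name is illustrative:

```python
from datetime import datetime
from pathlib import Path

def raw_path(data_source: str, entity: str, entity_id: int,
             file_ext: str, ts: datetime) -> Path:
    """Build a raw-layer path following the layout described above:
    raw/<data_source>/<entity>/<year>/<month>/<date>/<hh>-<mm>-<ss>/<id>.<ext>
    """
    return Path("raw", data_source, entity,
                f"{ts:%Y}", f"{ts:%m}", f"{ts:%d}",
                f"{ts:%H}-{ts:%M}-{ts:%S}",
                f"{entity_id}.{file_ext}")

ts = datetime(2021, 10, 3, 21, 0, 0)
p = raw_path("stepstone", "job_description", 7577548, "html", ts)
```

Keying each scrape run by its timestamp is what allows many snapshots of the same entity to coexist in the lake.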
I have been scraping the web source since October 2021
The current raw layer data size is around 275 GB
And it grows by around 2 GB per day
Cleansed Layer
The cleansed layer contains data in Parquet format
Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval
This format enables easy and performant data exploration across the whole dataset
The data in the cleansed layer is populated by processing the raw layer data
This processing parses the entities in the raw layer
And discards corrupted data
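A minimal sketch of this parse-and-discard step, using only the stdlib HTML parser; the `<h1>` title convention is an assumption for illustration, not the real page layout:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Extract the first <h1> text from a raw job description page."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False
        self.title = None
    def handle_starttag(self, tag, attrs):
        if tag == "h1":
            self.in_h1 = True
    def handle_endtag(self, tag):
        if tag == "h1":
            self.in_h1 = False
    def handle_data(self, data):
        if self.in_h1 and self.title is None:
            self.title = data.strip()

def parse_job(html: str):
    """Return a parsed record, or None so corrupted data can be discarded."""
    if not html:
        return None
    parser = TitleParser()
    parser.feed(html)
    return {"title": parser.title} if parser.title else None

record = parse_job("<html><body><h1>Data Engineer</h1></body></html>")
```

Records that come back as None are dropped; the rest are written to Parquet.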
Additional information:
The cleansed layer data is partitioned as follows:
cleansed
└── <data_source>
    ├── <entity>
    │   ├── year=<year>
    │   │   ├── month=<month>
    │   │   │   ├── day=<day>
    │   │   │   │   └── part-0.parquet
    │   │   │   ├── day=<day>
    │   │   │   │   └── ...
Here is an example of the cleansed layer structure
cleansed
└── stepstone
    ├── job_description
    │   ├── year=2021
    │   │   ├── month=10
    │   │   │   ├── day=03
    │   │   │   │   └── part-0.parquet
    │   │   │   ├── day=04
    │   │   │   │   └── ...
The cleansed layer is also the data source for the data warehouse
The data contained in the Parquet files is accessed directly by the data warehouse
Data Warehouse
The Data Warehouse is based on DuckDB
DuckDB is an in-process SQL OLAP database management system
Its architecture is very similar to SQLite's, but it is column-oriented and optimized for analytical workloads
Being an in-process database is a big advantage, since it consumes no resources unless it is actively used by a program
In order to ingest data into the data warehouse
DuckDB uses a few views that expose the data stored in Parquet format
I call this layer, shared between the data lake and the data warehouse, "curated"; from the point of view of the data warehouse, it could also have been called "staging" or "ingestion"
Once the data is exposed as Parquet views
dbt scripts are used to process the curated data
And populate the fact and dimension tables
The data model in the data warehouse is as follows:
Architecture Overview
Thanks for your attention!