NDL Core Data Pipeline

This Space provides an overview of the NDL Core Data Pipeline, the data engineering backbone of the National Data Library project.

About the Project

The NDL Core Data Pipeline is responsible for collecting, refining, and processing public sector data from multiple sources, including data.gov.uk and the Office for National Statistics (ONS).

It transforms raw datasets into clean, structured, AI-ready formats, and generates vector embeddings to support search, discovery, and downstream analytics.

Pipeline Overview

Data collection – ingesting raw data from public APIs and repositories
Data refinement – cleaning, normalising, and enriching datasets
Embedding generation – creating vector representations for AI workflows

The pipeline is orchestrated using Dagster and is designed to be modular, extensible, and reproducible.
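The three stages above can be sketched as a chain of plain Python functions. This is an illustrative sketch only, not the repository's implementation: the function names (`collect`, `refine`, `embed`, `run_pipeline`), the sample records, and the toy hashing-based embedding are all assumptions made for this example. The real pipeline fetches data from public sources and would use a proper embedding model; in a Dagster project, each stage would typically be declared as its own asset rather than called directly.

```python
import hashlib
import math
from typing import Dict, List


def collect() -> List[Dict[str, str]]:
    """Stage 1 (collection): return raw records.

    Hypothetical hard-coded records stand in for data fetched from a public API.
    """
    return [
        {"title": "  Population Estimates ", "description": "Mid-year estimates\nfor the UK"},
        {"title": "", "description": "record with no title"},
    ]


def refine(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Stage 2 (refinement): collapse whitespace and drop records without a title."""
    cleaned = []
    for record in records:
        title = " ".join(record["title"].split())
        description = " ".join(record["description"].split())
        if not title:
            continue  # skip records that cannot be identified
        cleaned.append({"title": title, "description": description})
    return cleaned


def embed(text: str, dim: int = 8) -> List[float]:
    """Stage 3 (embedding): toy feature-hashing vector, L2-normalised.

    A placeholder for a real embedding model, used only to keep the sketch
    self-contained.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[digest[0] % dim] += 1.0  # bucket chosen from the first hash byte
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def run_pipeline(dim: int = 8) -> List[Dict[str, object]]:
    """Run collection, refinement, and embedding in order."""
    refined = refine(collect())
    return [
        {**record, "embedding": embed(f"{record['title']} {record['description']}", dim)}
        for record in refined
    ]


if __name__ == "__main__":
    for row in run_pipeline():
        print(row["title"], row["embedding"])
```

Keeping each stage a pure function of its input is one way to get the modularity and reproducibility described above: a stage can be rerun or replaced without touching the others.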

Source Code & Documentation

The full implementation, documentation, and contribution guidelines are available on GitHub:

🔗 https://github.com/theodi/ndl-core-data-pipeline

Please refer to the GitHub repository for setup instructions, architecture details, and licensing information.

This Space is intended as a landing page and contextual overview rather than an executable application.