This Space provides an overview of the NDL Core Data Pipeline, the data engineering backbone of the National Data Library project.
The pipeline collects, refines, and processes public sector data from multiple sources, including data.gov.uk and the Office for National Statistics (ONS).
It transforms raw datasets into clean, structured, and AI-ready formats and generates vector embeddings to support search, discovery, and downstream analytics use cases.
The pipeline is orchestrated using Dagster and is designed to be modular, extensible, and reproducible.
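As a rough illustration of the collect, refine, and embed flow described above, here is a minimal standard-library Python sketch. It is not taken from the repository: the function names (`collect`, `refine`, `embed`, `run_pipeline`), the sample records, and the hashed bag-of-words "embedding" are all hypothetical stand-ins, and a real pipeline would call actual data sources and an embedding model. In a Dagster project, stages like these would typically be declared as assets or ops rather than plain functions.

```python
import hashlib
import math
from typing import Dict, List


def collect() -> List[Dict[str, str]]:
    """Stand-in for fetching raw records; these sample records are made up."""
    return [
        {"title": "  Population   Estimates ", "description": "Mid-year\npopulation estimates"},
        {"title": "", "description": "A record with no title"},
    ]


def refine(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Normalise whitespace and drop records that have no title."""
    cleaned = []
    for record in records:
        title = " ".join(record["title"].split())
        if not title:
            continue
        description = " ".join(record["description"].split())
        cleaned.append({"title": title, "description": description})
    return cleaned


def embed(text: str, dim: int = 8) -> List[float]:
    """Toy hashed bag-of-words vector, L2-normalised. Not a real embedding model."""
    vector = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vector[digest[0] % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


def run_pipeline(dim: int = 8) -> List[Dict[str, object]]:
    """Chain the stages: collect, then refine, then attach an embedding to each record."""
    return [
        {**record, "embedding": embed(f"{record['title']} {record['description']}", dim)}
        for record in refine(collect())
    ]


if __name__ == "__main__":
    for row in run_pipeline():
        print(row["title"], len(row["embedding"]))
```

Keeping each stage as a small function with explicit inputs and outputs is one way to get the modularity and reproducibility mentioned above, since any stage can be swapped or re-run on its own.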
The full implementation, documentation, and contribution guidelines are available in the GitHub repository:
🔗 https://github.com/theodi/ndl-core-data-pipeline
Please refer to the GitHub repository for setup instructions, architecture details, and licensing information.
This Space is intended as a landing page and contextual overview rather than an executable application.