We developed an open-source, S3-based data lake for the centralized ingestion, categorization, and search of data. The goal was to automate previously manual data management through an integrated architecture that combines workflow orchestration, data cataloging, and access control.
Client/Company/Industry
GFZ Helmholtz Centre for Geosciences
Duration
12 months
Product
Service
Expertise
Software Development
The goal of the project was to develop a central data lake solution that enables the integration of various data sources, the categorization of data, and efficient search across datasets. The existing manual data management process was to be automated and optimized in order to improve both efficiency and data accessibility.
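As an illustration of how ingestion, categorization, and search relate to each other, here is a minimal Python sketch. It is not the project's implementation: the names `CatalogEntry` and `Catalog`, the object keys, and the categories are all hypothetical, and entries are held in memory only. The project lists STAC and PostgreSQL among its technologies; this sketch uses neither and relies on the standard library alone.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class CatalogEntry:
    """Hypothetical catalog record for one object in the data lake."""
    object_key: str   # S3-style object key (illustrative, not a real path)
    category: str     # category assigned at ingestion time
    keywords: list[str] = field(default_factory=list)
    ingested_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class Catalog:
    """In-memory stand-in for a data catalog (assumption for this sketch)."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def ingest(self, object_key, category, keywords=()):
        # Normalize keywords so that search is case-insensitive.
        entry = CatalogEntry(object_key, category, [k.lower() for k in keywords])
        self._entries[object_key] = entry
        return entry

    def search(self, category=None, keyword=None):
        # Filters are optional; entries must match every filter that is given.
        results = []
        for entry in self._entries.values():
            if category is not None and entry.category != category:
                continue
            if keyword is not None and keyword.lower() not in entry.keywords:
                continue
            results.append(entry)
        return sorted(results, key=lambda e: e.object_key)
```

A caller would register each object once with `ingest(...)` and then narrow results with `search(category=..., keyword=...)`. In a real deployment the lookup would be delegated to a database or catalog service rather than a Python dictionary.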
A key challenge was bringing together the requirements for data storage, workflow orchestration, data cataloging, access control, and the user interface in one consistent overall architecture. In addition, suitable technologies first had to be evaluated in a proof-of-concept phase and then carried over into a viable MVP.
Programming Languages
Python, JavaScript/TypeScript, PL/pgSQL
Technologies
Apache Airflow, Docker, Docker Compose, FastAPI, GitLab CI/CD, HTML, CSS, Keycloak, MinIO, Nginx, OAuth2/OpenID Connect, Playwright, PostgreSQL, Pydantic, Pytest, REST, S3, STAC, Nuxt, Vitest, Vuetify
The image shows a schematic representation of a research data management system based on a data lake architecture.
The result was an MVP that enables data to be ingested and categorized. Automated workflows for data validation improved the reproducibility of scientific results. At the same time, the project established a technical foundation for the further development of an S3-based data lake infrastructure.
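To make the link between automated validation and reproducibility concrete, the following Python sketch shows one possible validation step. It is an assumption-laden example, not the project's workflow: the function name `validate_csv`, the CSV input format, and the report fields are hypothetical. The idea is that the same input bytes always produce the same report, including a SHA-256 checksum that identifies exactly which data was checked.

```python
import csv
import hashlib
import io


def validate_csv(raw: bytes, required_columns: set[str]) -> dict:
    """Return a deterministic validation report for one CSV payload.

    Hypothetical helper for illustration; column names are supplied
    by the caller and are not taken from the project.
    """
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8")))
    header = set(reader.fieldnames or [])
    missing = sorted(required_columns - header)
    rows = list(reader)

    # Count rows with an empty value in any required column that exists.
    present = required_columns & header
    incomplete = sum(
        1
        for row in rows
        if any(not (row[col] or "").strip() for col in present)
    )

    return {
        "sha256": hashlib.sha256(raw).hexdigest(),  # fingerprint of the input
        "row_count": len(rows),
        "missing_columns": missing,
        "incomplete_rows": incomplete,
        "valid": not missing and incomplete == 0,
    }
```

A step like this could run as a single task in a workflow orchestrator, with the report stored next to the dataset so that a later analysis can confirm it used the same validated input.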