Data Lake for Geoscience Research Data Management

We developed an open-source, S3-based data lake solution for the centralized ingestion, categorization, and search of research data. The goal was to replace the previously manual data management with an automated, integrated architecture combining workflow orchestration, data cataloging, and access control.

Client/Company/Industry

GFZ Helmholtz Centre for Geosciences

Duration

12 months

Product

Service

Expertise

Software Development

Goal

The goal of the project was to develop a central data lake solution that integrates various data sources, categorizes data, and enables efficient search across datasets. The existing manual data management process was to be automated and optimized to improve both efficiency and data accessibility.

Tasks

  • Planning and project organization
  • Designing the software architecture
  • Evaluating suitable software libraries
  • Evaluating S3 for storing unstructured data (see the MinIO/S3 sketch after this list)
  • Evaluating Apache Airflow for process and workflow orchestration (see the DAG sketch after this list)
  • Evaluating STAC for data cataloging (see the pystac sketch after this list)
  • Implementing a proof of concept based on MinIO, S3, Apache Airflow, and STAC
  • Evolving the proof of concept into an MVP-based data lake implementation
  • Implementing database functions with PL/pgSQL in PostgreSQL
  • Integrating an identity provider with Keycloak
  • Implementing SSO with OAuth2/OpenID Connect for authorizing REST endpoints
  • Implementing a REST interface with FastAPI and Pydantic (see the FastAPI sketch after this list)
  • Designing and developing the user interface with HTML, CSS, and JavaScript/TypeScript
  • Designing the SPA architecture
  • Developing new features with Vuetify
  • Developing unit tests, Nuxt component tests, and UI tests
  • Using Pytest, Playwright, and Vitest for automated testing (see the Pytest sketch after this list)
  • Configuring the reverse proxy with Nginx
  • Developing CI/CD pipelines
  • Maintaining and managing GitLab tickets
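
To illustrate the storage building block, the following is a minimal sketch of writing an unstructured file to a MinIO bucket via the S3 API with boto3. The endpoint, credentials, bucket, and object key are illustrative placeholders, not the production configuration.

    import boto3

    # Connect to a local MinIO instance through its S3-compatible API
    # (endpoint and credentials are illustrative defaults).
    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",
        aws_access_key_id="minioadmin",
        aws_secret_access_key="minioadmin",
    )

    try:
        s3.create_bucket(Bucket="raw-data")
    except s3.exceptions.BucketAlreadyOwnedByYou:
        pass  # bucket already exists

    # Upload an unstructured file as an object.
    with open("measurement.nc", "rb") as f:
        s3.put_object(Bucket="raw-data", Key="campaign-2024/measurement.nc", Body=f)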
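
The workflow orchestration can be pictured with a minimal Airflow DAG (TaskFlow API, Airflow 2.4+). The task bodies are stubs; the project's actual ingestion and validation logic is more involved.

    from datetime import datetime
    from airflow.decorators import dag, task

    @dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
    def ingest_and_validate():
        @task
        def ingest() -> str:
            # e.g. fetch the object from MinIO and return its S3 key
            return "campaign-2024/measurement.nc"

        @task
        def validate(key: str) -> None:
            # e.g. check file format and required metadata before cataloging
            print(f"validating {key}")

        validate(ingest())

    ingest_and_validate()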
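
Cataloging with STAC can be sketched using pystac; the IDs, coordinates, and asset URL below are illustrative only.

    from datetime import datetime
    import pystac

    # Describe one dataset as a STAC item pointing to the S3 object.
    item = pystac.Item(
        id="measurement-2024-001",
        geometry={"type": "Point", "coordinates": [13.06, 52.38]},
        bbox=[13.06, 52.38, 13.06, 52.38],
        datetime=datetime(2024, 1, 1),
        properties={},
    )
    item.add_asset(
        "data",
        pystac.Asset(
            href="s3://raw-data/campaign-2024/measurement.nc",
            media_type="application/netcdf",
        ),
    )

    catalog = pystac.Catalog(id="geoscience-data-lake", description="Research data catalog")
    catalog.add_item(item)
    catalog.normalize_and_save("catalog", pystac.CatalogType.SELF_CONTAINED)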
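
The REST interface combines FastAPI and Pydantic; a minimal sketch of a protected endpoint follows. Token verification is reduced to a stub here; in the real system, the JWT issued by Keycloak is validated against the identity provider (issuer, audience, signature via JWKS).

    from fastapi import Depends, FastAPI, HTTPException
    from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
    from pydantic import BaseModel

    app = FastAPI()
    bearer = HTTPBearer()

    class Dataset(BaseModel):
        name: str
        s3_key: str

    def current_user(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> str:
        # Stub: a real implementation decodes and verifies the Keycloak JWT here.
        if not creds.credentials:
            raise HTTPException(status_code=401, detail="Not authenticated")
        return "user"

    @app.post("/datasets")
    def register_dataset(dataset: Dataset, user: str = Depends(current_user)) -> dict:
        # Persist metadata and trigger the ingestion workflow here.
        return {"registered": dataset.s3_key, "by": user}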
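
Automated API tests follow the usual Pytest pattern with FastAPI's TestClient. The module name datalake_api is hypothetical and stands for wherever the app object from the sketch above lives.

    from fastapi.testclient import TestClient

    from datalake_api import app  # hypothetical module holding the FastAPI app

    def test_register_dataset_requires_token():
        client = TestClient(app)
        # Without a bearer token, HTTPBearer rejects the request.
        response = client.post("/datasets", json={"name": "m1", "s3_key": "k1"})
        assert response.status_code == 403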

Challenges

A key challenge was bringing together requirements for data storage, workflow orchestration, data cataloging, access control, and the user interface within one consistent overall architecture. In addition, suitable technologies first had to be evaluated in a proof-of-concept phase and then transferred into a viable MVP solution.

Programming Languages

Python, JavaScript/TypeScript, PL/pgSQL

Technologies

Apache Airflow, CSS, Docker, Docker Compose, FastAPI, GitLab CI/CD, HTML, Keycloak, MinIO, Nginx, Nuxt, OAuth2/OpenID Connect, Playwright, PostgreSQL, Pydantic, Pytest, REST, S3, STAC, Vitest, Vuetify

Project image: schematic representation of a research data management system based on a data lake architecture.

Takeaway

The result was an MVP that enables data to be ingested and categorized. Automated workflows for data validation improved the reproducibility of scientific results. At the same time, the project established a technical foundation for the further development of an S3-based data lake infrastructure.

Similar Projects

RIM2D - Highly Efficient 2D Hydraulic Simulation of Fluvial, Pluvial, and Urban Flooding

Hydrodynamic Simulation, Web Application, Geodata, GPU Computing

RIM2D is an existing, highly efficient 2D hydraulic simulation model for fluvial, pluvial, and urban flooding. As part of a strategic partnership, we supported the extension of the research code with a web application and a cloud-based GPU simulation environment, enabling its transition into a market-ready product.

Computer Vision Based AI for Wound Detection

Computer Vision, Machine Learning, Medical Image Processing

We developed computer vision and AI components for a wound detection system. The result was a service that segments wound areas in patient images and calculates their size based on reference markers.
