Data Lake for Geoscience Research Data Management

We developed an open-source, S3-based data lake solution for the centralized ingestion, categorization, and search of research data. The goal was to replace the previously manual data management with an automated, integrated architecture combining workflow orchestration, data cataloging, and access control.

Client/Company/Industry

GFZ Helmholtz Centre for Geosciences

Duration

12 months

Product

Service

Expertise

Software Development

Goal

The goal of the project was to develop a central data lake solution that integrates various data sources, categorizes data, and enables efficient search across datasets. The existing manual data management process was to be automated and optimized to improve both efficiency and data accessibility.

Tasks

  • Planning and project organization
  • Designing the software architecture
  • Evaluating suitable software libraries
  • Evaluating S3 for storing unstructured data (see the MinIO/S3 sketch after this list)
  • Evaluating Apache Airflow for process and workflow orchestration (see the DAG sketch after this list)
  • Evaluating STAC for data cataloging (see the pystac sketch after this list)
  • Implementing a proof of concept based on MinIO, S3, Apache Airflow, and STAC
  • Evolving the proof of concept into an MVP-based data lake implementation
  • Implementing database functions with PL/pgSQL in PostgreSQL
  • Integrating an identity provider with Keycloak
  • Implementing SSO with OAuth2/OpenID Connect for authorizing REST endpoints
  • Implementing a REST interface with FastAPI and Pydantic (see the FastAPI sketch after this list)
  • Designing and developing the user interface with HTML, CSS, and JavaScript/TypeScript
  • Designing the SPA architecture
  • Developing new features with Vuetify
  • Developing unit tests, Nuxt component tests, and UI tests
  • Using Pytest, Playwright, and Vitest for automated testing (see the Pytest sketch after this list)
  • Configuring the reverse proxy with Nginx
  • Developing CI/CD pipelines
  • Maintaining and managing GitLab tickets
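
To illustrate the storage building block, the following is a minimal sketch of writing an unstructured file to a MinIO bucket via the S3 API with boto3. The endpoint, credentials, bucket, and object key are illustrative placeholders, not the production configuration.

    import boto3

    # Connect to a local MinIO instance through its S3-compatible API
    # (endpoint and credentials are illustrative defaults).
    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",
        aws_access_key_id="minioadmin",
        aws_secret_access_key="minioadmin",
    )

    try:
        s3.create_bucket(Bucket="raw-data")
    except s3.exceptions.BucketAlreadyOwnedByYou:
        pass  # bucket already exists

    # Upload an unstructured file as an object.
    with open("measurement.nc", "rb") as f:
        s3.put_object(Bucket="raw-data", Key="campaign-2024/measurement.nc", Body=f)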
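
The workflow orchestration can be pictured with a minimal Airflow DAG (TaskFlow API, Airflow 2.4+). The task bodies are stubs; the project's actual ingestion and validation logic is more involved.

    from datetime import datetime
    from airflow.decorators import dag, task

    @dag(schedule=None, start_date=datetime(2024, 1, 1), catchup=False)
    def ingest_and_validate():
        @task
        def ingest() -> str:
            # e.g. fetch the object from MinIO and return its S3 key
            return "campaign-2024/measurement.nc"

        @task
        def validate(key: str) -> None:
            # e.g. check file format and required metadata before cataloging
            print(f"validating {key}")

        validate(ingest())

    ingest_and_validate()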
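
Cataloging with STAC can be sketched using pystac; the IDs, coordinates, and asset URL below are illustrative only.

    from datetime import datetime
    import pystac

    # Describe one dataset as a STAC item pointing to the S3 object.
    item = pystac.Item(
        id="measurement-2024-001",
        geometry={"type": "Point", "coordinates": [13.06, 52.38]},
        bbox=[13.06, 52.38, 13.06, 52.38],
        datetime=datetime(2024, 1, 1),
        properties={},
    )
    item.add_asset(
        "data",
        pystac.Asset(
            href="s3://raw-data/campaign-2024/measurement.nc",
            media_type="application/netcdf",
        ),
    )

    catalog = pystac.Catalog(id="geoscience-data-lake", description="Research data catalog")
    catalog.add_item(item)
    catalog.normalize_and_save("catalog", pystac.CatalogType.SELF_CONTAINED)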
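
The REST interface combines FastAPI and Pydantic; a minimal sketch of a protected endpoint follows. Token verification is reduced to a stub here; in the real system, the JWT issued by Keycloak is validated against the identity provider (issuer, audience, signature via JWKS).

    from fastapi import Depends, FastAPI, HTTPException
    from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
    from pydantic import BaseModel

    app = FastAPI()
    bearer = HTTPBearer()

    class Dataset(BaseModel):
        name: str
        s3_key: str

    def current_user(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> str:
        # Stub: a real implementation decodes and verifies the Keycloak JWT here.
        if not creds.credentials:
            raise HTTPException(status_code=401, detail="Not authenticated")
        return "user"

    @app.post("/datasets")
    def register_dataset(dataset: Dataset, user: str = Depends(current_user)) -> dict:
        # Persist metadata and trigger the ingestion workflow here.
        return {"registered": dataset.s3_key, "by": user}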
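
Automated API tests follow the usual Pytest pattern with FastAPI's TestClient. The module name datalake_api is hypothetical and stands for wherever the app object from the sketch above lives.

    from fastapi.testclient import TestClient

    from datalake_api import app  # hypothetical module holding the FastAPI app

    def test_register_dataset_requires_token():
        client = TestClient(app)
        # Without a bearer token, HTTPBearer rejects the request.
        response = client.post("/datasets", json={"name": "m1", "s3_key": "k1"})
        assert response.status_code == 403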

Challenges

A key challenge was bringing together requirements for data storage, workflow orchestration, data cataloging, access control, and the user interface within one consistent overall architecture. In addition, suitable technologies first had to be evaluated in a proof-of-concept phase and then transferred into a viable MVP solution.

Programming Languages

Python, JavaScript/TypeScript, PL/pgSQL

Technologies

Apache Airflow, CSS, Docker, Docker Compose, FastAPI, GitLab CI/CD, HTML, Keycloak, MinIO, Nginx, Nuxt, OAuth2/OpenID Connect, Playwright, PostgreSQL, Pydantic, Pytest, REST, S3, STAC, Vitest, Vuetify

Project image: schematic representation of a research data management system based on a data lake architecture.

Takeaway

The result was an MVP that enables data to be ingested and categorized. Automated workflows for data validation improved the reproducibility of scientific results. At the same time, the project established a technical foundation for the further development of an S3-based data lake infrastructure.

Similar Projects

RIM2D - Highly Efficient 2D Hydraulic Simulation of Fluvial, Pluvial, and Urban Flooding

Hydrodynamic Simulation, Web Application, Geodata, GPU Computing

RIM2D is an existing, highly efficient 2D hydraulic simulation model for fluvial, pluvial, and urban flooding. As part of a strategic partnership, we supported the extension of the research code with a web application and a cloud-based GPU simulation environment, enabling its transition into a market-ready product.

Computer Vision Based AI for Wound Detection

Computer Vision, Machine Learning, Medical Image Processing

We developed computer vision and AI components for a wound detection system. The result was a service that segments wound areas in patient images and calculates their size based on reference markers.
