Site Reliability Engineer II (Remote)
Job description
Agile Lab is a company founded in 2014 with the mission to create value for its customers in data-intensive environments through customisable solutions that establish performance-driven processes, sustainable architectures and automated platforms based on data governance best practices.
Having delivered over 100 successful Elite Data Engineering initiatives, we have used this experience to create Witboost: a modular, technology-agnostic platform that enables modern organisations to discover, value and produce their data in both traditional environments and fully compliant Data Mesh architectures.
With a highly skilled team of over 260 data engineers based in Europe, Agile Lab helps organisations with their data-driven transformation.
Take a look at our handbook to discover our core values and processes.
The opportunity:
We are looking for a Site Reliability Engineer II (SRE II) to join our growing team. You will play a key role in maintaining the reliability, observability, and operational efficiency of enterprise-level distributed systems.
In this role, you’ll coordinate a small technical team (3–4 people) in managing microservices in complex production environments. You will be involved in monitoring, incident management, release coordination, and performance tuning, with a strong focus on OpenShift platforms.
You’ll also work closely with multiple cross-functional teams to ensure high availability and performance of our cloud-native services.
This role includes on-call availability.
Gross annual salary (RAL): 38.5K-48.5K
Responsibilities:
- Ensure high reliability of microservices running in OpenShift environments
- Lead and coordinate a technical team of 3–4 engineers for operational excellence
- Manage incident resolution and ticketing workflows via ServiceNow
- Collaborate with development teams to drive performance optimization and tuning
- Design, configure and maintain monitoring dashboards (Grafana, Prometheus, etc.)
- Coordinate with the Service Control Room to maintain effective alerting and response
- Oversee release processes of new features, hotfixes, and updates in production