Site Reliability Engineering Lead

Hyqoo · Genova, Liguria, Italy


Job Description

Job Title: Site Reliability Engineering (SRE) Lead

Experience: 8 to 10 years

Duration: Contract-to-hire

Location: 16121 Genova GE, Italy

Hybrid: 3 days in office, 2 days remote.

Working hours: 8 AM to 5 PM CET

Languages: Bilingual (English and Italian)

Overview:

We are seeking a highly skilled and driven Site Reliability Engineering (SRE) Lead to spearhead reliability, scalability, and operational excellence across our Azure and hybrid-cloud infrastructure. The role is central to keeping our systems robust, scalable, and efficient, which directly affects our ability to deliver high-quality services. The ideal candidate will be an expert programmer with strong automation skills, deep Azure and AKS knowledge, and hands-on experience in SLO/SLI design, observability, and resiliency engineering. A background in the travel, logistics, or shipping industry is a plus.

Roles and Responsibilities:

  • Leadership & Collaboration: Lead SRE practices across environments, driving incident response, blameless postmortems, and reliability improvements. Partner with DevOps, Development, and Product teams to embed reliability principles into the SDLC. Mentor team members on automation, chaos testing, infrastructure design, and cloud-native practices.
  • Reliability Engineering: Define, track, and optimize SLIs, SLOs, and error budgets for critical applications and services. Design and implement monitoring, alerting, and telemetry solutions using tools like Dynatrace, Azure Monitor, or Prometheus/Grafana. Establish proactive chaos engineering and disaster recovery practices, including Azure Site Recovery.
  • Cloud & Infrastructure Automation: Administer and optimize Azure IaaS/PaaS environments, including VMSS, VNets, Load Balancers, Storage Accounts, RBAC, and ExpressRoute. Build resilient cloud-native apps and infrastructure using Azure Kubernetes Service (AKS). Develop and maintain Infrastructure-as-Code (IaC) using Terraform, ARM templates, or Bicep.
  • Programming & Automation: Build tooling and scripts in Python, PowerShell, or Bash to automate manual tasks and improve system reliability. Implement CI/CD pipelines and integrate SRE practices into DevOps workflows. Create runbooks and self-healing systems for repetitive incidents or known error conditions.
  • Incident & Change Management: Lead Sev1/Sev2 incident resolution end-to-end, driving rapid mitigation and long-term fixes. Track and manage change requests, including version updates, config changes, restarts, and rollbacks. Document and maintain RCAs, contributing to the knowledge base and process improvements.

Qualifications:

  • Bachelor's degree in Computer Science, Information Technology, or a related field.
  • 8-10 years of experience in site reliability engineering, DevOps, or a related field.
  • Proven experience in leading SRE practices and teams.
  • Strong understanding of ITIL principles and practices.
  • Excellent written and verbal communication skills.

Tools and Technologies:

  • Cloud: Deep hands-on experience with Microsoft Azure (IaaS, PaaS, AKS) and services like Azure Backup, Azure Storage, Azure Monitor.
  • Observability: Expertise in Dynatrace, Azure Monitor, or similar APM/log aggregation tools.
  • Scripting: Proficiency in Python, PowerShell, or Shell scripting.
  • IaC: Strong experience with Terraform, ARM Templates, or Bicep.
  • DevOps: Knowledge of CI/CD pipelines, GitOps, and release automation (Azure DevOps, Jenkins).
  • Config Management: Proficient with Ansible, Puppet, or Chef.
  • DR & Chaos: Hands-on experience with Azure Site Recovery and chaos testing frameworks (e.g., Gremlin, Chaos Mesh).
  • VMware: Familiarity with hybrid environments and integrating VMware workloads with the cloud.

Preferred Experience:

  • Experience in the travel, shipping, or logistics industries where 24x7 reliability and global scale are critical.
  • Exposure to container security, network policies, and zero-trust access models in AKS or hybrid environments.

Work Environment:

  • On-call support rotation required.
  • Flexibility in working hours to address high-priority issues or collaborate across time zones.
