Service Reliability Engineering Lead
Brief Description
Reporting to the Senior Manager – Systems Engineering, the
SRE team lead will champion operational excellence by driving the
adoption of SRE best practices and ensuring system availability,
performance, efficiency, change management, monitoring, emergency
response, security and capacity planning.
Key Responsibilities:
- Oversee and lead the implementation of SRE frameworks
and practices within the organization using the systems operations
toolchain. Foster a collaborative and inclusive team culture that
emphasizes reliability, innovation, and continuous improvement.
- Team Management: Manage team performance while fostering an
environment of trust, learning and collaboration, and cultivate a
culture of high performance.
- Build, recruit, retain, manage and develop a world-class SRE
team.
- Operational Excellence – Define,
measure, monitor and report key SRE performance indicators and escalate
breaches and violations. These indicators help assess the maturity
level of the team and inform the backlog and related
decisions. Collaborate with cross-functional teams to identify, prioritize,
and address reliability issues.
- Stakeholder Engagement – Engage the
business teams and promote a culture of participation and collaboration
to support effective and informed decision making.
- Define, measure, monitor and report key systems
reliability performance indicators and escalate breaches and violations.
- Problem and Incident Management – Lead incident
response efforts, ensuring that incidents are resolved quickly and
effectively while minimizing downtime and customer impact. Conduct
post-incident reviews to identify root causes and implement preventive
measures.
- Capacity Planning – Monitor system
resource utilization and plan for capacity upgrades as needed to support
business growth. Optimize resource allocation and cost-efficiency.
- Security and Compliance: Collaborate
with security teams to ensure the reliability and security of systems and
applications. Ensure compliance with relevant industry standards and
regulations.
- Drive continuous improvement of
applications through planned chaos simulations, AIOps,
automation and proactive alerting strategies.
- Document “tribal” knowledge and keep playbooks and runbooks
up to date so that teams get the information they need right
when they need it.
- Champion and lead the implementation of machine learning and
self-healing capabilities, and drive the organization towards a no-ops model.
Qualifications:
- Bachelor’s degree in Computer Science, Information
Technology, or a related field (Master’s degree preferred).
- Several years of experience in SRE or a related field, with a
proven track record of improving system reliability.
- Strong leadership and team management skills.
- Proficiency in programming/scripting languages (e.g., Python,
Go, Ruby).
- Experience with containerization and orchestration
technologies (e.g., Docker, Kubernetes).
- Knowledge of cloud computing platforms (e.g., AWS, Azure,
Google Cloud).
- Familiarity with monitoring and alerting tools (e.g.,
Prometheus, Grafana, ELK Stack).
- Excellent problem-solving and communication skills.
- Ability to work in a fast-paced, dynamic environment and
handle high-pressure situations effectively.
How To Apply