Description:
As a Systems Site Reliability Engineer Manager, you
will lead and support a team of site reliability engineers who identify
challenges, analyze causes, and apply corrective action to ensure that our
systems are reliable, scalable, and performant in line with agreed service level
objectives.
Reporting to the Principal SRE (Site Reliability
Engineering) Lead, you will be part of the team responsible for
supporting 24×7 uptime and availability of mission-critical production services
within the Bank. You will help to create more consistent, automated
environments across all applications and services, proactively test and tune all
aspects of the platforms, streamline CI/CD processes, monitor and respond to
system notifications and alerts, and continually work to optimize and improve the
performance, security, and reliability of our systems.
Responsibilities:
Lead SRE (Site Reliability Engineering) initiatives
in your areas of focus
Mentor and support the members of the team to achieve
high levels of performance
Lead the identification and establishment of
service level indicators to support SLOs (Service Level Objectives)
Take ownership of availability, stability,
resilience, and system/service health
Provide technical leadership in initiatives to
improve the availability, stability, and resilience of our services
Take leadership in incident response activities to
restore services
Collaborate with Dev teams to improve services
through rigorous testing and release procedures
Participate in architecture design, platform
management, and capacity planning exercises
Create sustainable systems and services through
automation and uplifts
Required Skills and Qualifications:
Bachelor’s degree in computer science or equivalent
5+ years’ experience as an SRE/DevOps Lead
Experience in managing SRE/DevOps/Software engineers
Strong oral and written communication skills
Attention to detail and strong troubleshooting skills
Demonstrable experience in containerization (Docker)
and orchestration (Kubernetes)
Demonstrable experience with CI/CD tools such as Azure
DevOps, CircleCI, Jenkins, etc.
Good understanding of Infrastructure as Code
(Terraform, CloudFormation, Ansible)
Familiarity with Linux and UNIX systems and command-line
system administration tools such as Bash, Vim, and SSH (Secure Shell)
Basic scripting skills (preferably Golang, Bash,
shell, etc.)
Experience in monitoring and analyzing infrastructure
performance using standard performance monitoring tools such as Dynatrace, Azure
Application Insights, Prometheus, and SolarWinds
Good understanding of networking concepts (e.g.,
network routing, load balancing, and networking protocols), with a basic knowledge of
TCP/IP and an understanding of HTTP and DNS
Experience in programming (structured and OOP) with
one or more high-level languages, such as Python, Java, .NET, or JavaScript
Knowledge of and proven hands-on experience with
large-scale databases and distributed technologies, such as Kafka and Redis,
will be an added advantage
How To Apply