Service Reliability Engineer
Detailed Description
Reporting to the SRE Lead, the Service Reliability Engineer will be
responsible for stabilizing production systems, improving system availability
and reliability, and ensuring automation of operational tasks, change management,
system monitoring, incident response and capacity planning. In addition, this
role will be responsible for:
- Ensuring operational excellence
by proactively building and implementing services, including end-to-end
monitoring, scripting and automation, modern tooling, and maintenance
of software;
- Providing software-related
operations support, including level two and level three incident
and problem management;
- Defining, measuring, monitoring and
reporting key SRE performance indicators, and escalating breaches and
violations;
- Documenting “tribal” knowledge
and keeping playbooks and runbooks up to date so that teams get the
information they need when they need it; and
- Implementing machine
learning and self-healing capabilities, and driving the organization towards a no-ops model.
Key Accountabilities:
- Run the production environment by
monitoring availability and taking a holistic view of system health.
- Create sustainable systems by
driving continuous improvement of the applications through chaos
experiments, automation, ML/AIOps and proactive alerting strategies.
- Build and set up new
development tools and infrastructure.
- Find ways to automate and
improve development and release processes.
- Implement SRE frameworks and
practices within the organization using the systems operations toolchain.
- Operational Excellence – ensure
systems availability, performance, efficiency, change management,
monitoring, emergency response, security, and capacity planning.
- Stakeholder Engagement – engage
the business teams and promote a culture of participation and
collaboration to support effective and informed decision making.
- Define, measure, monitor and
report key systems reliability performance indicators and escalate
breaches and violations, with an eye toward pushing our capabilities
forward, getting ahead of customer needs, and innovating to continually
improve.
- Continually improve skills and
competencies by proactively participating in various internal and external
training opportunities and stretch assignments.
- Research new, fit-for-future
technologies and actively implement viable solutions.
Job Qualifications:
- Bachelor’s Degree in Computer
Science, Information Systems, Software Engineering, IT, or another related
field
- More than three years of work
experience in programming and/or systems analysis using agile
frameworks
- Strong familiarity with web
servers and load balancing technologies
- Experience using SRE tools such
as Ansible, Rundeck, Terraform
- Experience using monitoring tools
such as Dynatrace/ELK/Splunk
- Experience working with multiple
programming and markup languages, such as Java, XML, JSON, YAML, Python
- Experience with Unix/Linux/AIX
operating systems and application security technologies, e.g., SSL
- Experience using code versioning
and collaboration tools such as Git and Bitbucket
- Strong knowledge of software
architecture principles
- Strong analytical and
problem-solving skills
- Experience working with agile
methodologies, such as Scrum and Kanban
- Professional experience and
knowledge of the telecommunications industry preferred
- A proactive approach to spotting
problems, areas for improvement, and performance bottlenecks.