Job Description
Reporting to the Service Reliability Engineering Lead – Systems Engineering,
the Service Reliability Engineer will be responsible for ensuring
system availability, performance, efficiency, change management, monitoring,
emergency response, security and capacity planning. In addition, this role will
be responsible for:
- Ensuring operational excellence by
proactively building and implementing services, including end-to-end
monitoring, scripting and automation, modern tooling, and maintenance of
software.
- Providing software-related operations
support, including level two and level three incident and problem
management.
- Defining, measuring, monitoring and reporting key
SRE performance indicators, and escalating breaches and violations.
- Documenting “tribal” knowledge and
keeping the playbooks and runbooks up to date so that teams get the
information they need right when they need it.
- Implementing machine learning and
self-healing capabilities, and driving the organization towards a no-ops model.
Responsibilities
- Run the production environment by
monitoring availability and taking a holistic view of system health.
- Implement SRE frameworks and practices
within the organization using the systems operations toolchain.
- Strong familiarity with web servers and
load balancing technologies.
- Operational Excellence – ensure system
availability, performance, efficiency, change management, monitoring,
emergency response, security, and capacity planning.
- Stakeholder Engagement – engage the
business teams and promote a culture of participation and collaboration
to support effective and informed decision making.
- Define, measure, monitor and report key
systems reliability performance indicators and escalate breaches and
violations, with an eye toward pushing our capabilities forward,
getting ahead of customer needs, and innovating to continually improve.
- Create sustainable systems by driving
continuous improvement of the applications through chaos experiments,
automation, ML/AIOps and proactive alerting strategies.
- Problem and Incident Management – ensure
level two and level three support requests and incidents are addressed within SLA.
- Continually improve skills and
competencies by proactively participating in various internal and external
training opportunities and stretch assignments.
- Research new fit-for-future
technologies and actively implement viable solutions.
Qualifications
- Bachelor’s degree in Computer Science,
Information Systems, Software Engineering, IT, or another related field.
- More than three years of work experience
in programming and/or systems analysis applying agile frameworks.
- Experience working with agile
methodologies, such as Scrum, Kanban, XP, LSD, and FDD.
- Experience using code versioning and
collaboration tools such as Git and Docker.
- Strong analytical and problem-solving
skills.
- Strong knowledge of software architecture
principles.
- Experience working in cloud-native
environments such as AWS.
- Experience working with multiple
platforms and programming and markup languages, such as Android, iOS, HTML, CSS,
JavaScript, Java, Ruby, PHP, SQL, XML, JSON, YAML, and Python, and
paradigms such as object-oriented, event-driven, procedural,
functional, and declarative programming.
- Experience with Unix/Linux/AIX operating
systems and application security technologies (e.g. SSL).
- Professional experience and knowledge of
the telecommunications industry preferred.
- Competency in system and application
administration and practices preferred.
- Independent thinker with the ability to
identify and drive new, uncharted solutions.
- Ability and willingness to share
knowledge with individuals with varying levels of experience.
- A proactive approach to spotting
problems, areas for improvement, and performance bottlenecks.
How To Apply