Shape, manage, lead and mentor the Infrastructure, Network, Security & SRE Team in a decentralized and global environment
Identify needs, weaknesses and risks, and build the roadmap to address them, bringing the platform and services to the next level
Manage, maintain, upgrade and monitor the critical infrastructure in a highly available environment
Keep the overall platform, systems, data and information secure by applying best practices and techniques
Identify risks (infrastructure, network, security, …) early on and ensure they are addressed before they become actual problems
Organize the team and define related processes to achieve 24/7 level 3 support
Work closely with the rest of the Engineering team to design and architect the platform and services
Productionize services through configuration management, monitoring, alerting, and documentation
Identify parts of the system that do not scale, provide palliative measures and drive long-term resolution of these issues
Propose ideas and solutions within the infrastructure team to reduce the workload through automation (Terraform, Ansible, …)
Perform and run blameless root cause analyses on incidents and outages, aggressively looking for answers that will prevent each incident from ever happening again
Keep up to date with trends and innovation in engineering, including containers and orchestration, serverless and other programming paradigms, microservices, etc.
Sustain a learning and knowledge-sharing culture in the organization and aim for a high level of technical excellence and stability
Have a proactive, go-for-it attitude: when you see something broken, you can't help but fix it
Prioritize tasks, work independently, and call out exceptions effectively
Requirements / Qualifications:
Build out our monitoring and alerting solution (Prometheus, Alertmanager, Grafana, others) for all our production platforms, APIs and systems
Identify the SLIs (Service Level Indicators) that will align the team to meet the availability and latency SLOs (Service Level Objectives)
Measure and optimize performance and solve issues across the entire stack: hardware, software, application, and network
Define relevant KPIs and metrics to assess and track the performance of the platform and systems
Actively look for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation with the rest of the team
Represent the SRE organization in design reviews and operational readiness exercises for new and existing services
Promote and spread the SRE culture across the organization and teams
The company operates a leading global network for mobile top-up solutions, innovative mobile rewards and airtime credit services.
We’re able to provide greater access to digital communications for over five billion people across emerging economies, resulting in more active participation in the global economy. Our global network interconnects more than 550 mobile operators across 160 countries and delivers smarter data-driven mobile solutions to ensure that no one is left unconnected.
Stratagem Global Recruitment Pte. Ltd., 105 Cecil Street, # 18-10 The Octagon, Singapore 069534
Registered in Singapore No.: 200915477N; Recruitment Licence number 10C3398