Job Title: Head of Infrastructure and SRE

Company: Digital Payments Company

Corporate Title: Flexible
Salary Range: Generous Package
Location: Singapore
Date Open: July 2020

Responsibilities / Duties:

  • Shape, manage, lead and mentor the Infrastructure, Network, Security & SRE Team in a decentralized and global environment
  • Identify needs, weaknesses and risks, and build the roadmap to address them, bringing the platform and services to the next level
  • Manage, maintain, upgrade and monitor the critical infrastructure in a highly available environment
  • Keep the overall platform, systems, data and information secure by applying best practices and techniques
  • Identify risks (infrastructure, network, security, …) early on and ensure they are addressed before they become actual problems
  • Organize the team and define related processes to achieve 24/7 level 3 support
  • Work closely with the rest of the Engineering team to design and architect the platform and services, and productionalize services through configuration management, monitoring, alerting, and documentation
  • Identify parts of the system that do not scale, provide palliative measures and drive long-term resolution of these issues
  • Propose ideas and solutions within the infrastructure team to reduce the workload by automation (Terraform, Ansible, …)
  • Perform and run blameless root cause analyses on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again
  • Keep up to date with trends and innovation in engineering, including containers and orchestration, serverless and other programming paradigms, microservices, etc.
  • Sustain a learning and knowledge-sharing culture in the organization and aim to achieve a high level of technical excellence and stability
  • Have a proactive, go-for-it attitude: when you see something broken, you can't help but fix it
  • Prioritize tasks, work independently, and call out exceptions effectively

Requirements / Qualifications:

  • Build out our monitoring and alerting solution (Prometheus, Alertmanager, Grafana, others) for all our production platforms, APIs and systems
  • Identify the SLIs (Service Level Indicators) that will align the team to meet the availability and latency objectives (Service Level Objectives, or SLOs)
  • Measure and optimize performance and solve issues across the entire stack: hardware, software, application, and network
  • Define relevant KPIs and metrics to assess and track the performance of the platform and systems
  • Actively look for opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation with the rest of the team
  • Represent the SRE organization in design reviews and operational readiness exercises for new and existing services
  • Promote and spread the SRE culture across the organization and teams

About the Company:

The company operates a leading global network for mobile top-up solutions, innovative mobile rewards and airtime credit services. We’re able to provide greater access to digital communications for over five billion people across emerging economies, resulting in more active participation in the global economy. Our global network interconnects more than 550 mobile operators across 160 countries and delivers smarter data-driven mobile solutions to ensure that no one is left unconnected.