We’re hiring a Site Reliability Engineer
ExpressVPN is looking for SREs to join our small but growing Cloud Platform tribe. If you identify as an SRE with prior platform and observability experience, or as a Software Engineer passionate about building resilient and scalable systems, this could be the role for you. With your ability to dive deep into problem spaces and come up with automated solutions, you will help shape company-wide initiatives to improve service reliability and customer satisfaction. You will work closely with product development teams across different business domains, including teams whose services serve millions of requests per minute to millions of users across the world.
What you’ll be doing
We are open to varying degrees of experience; the more experienced you are, the more we’ll expect your expertise to show.
- Design, build and operate the services we consume from AWS and the shared platform services we run on top of them.
- Embed and pair with product development teams across the company to solve application and infrastructure challenges.
- Help meet reliability objectives including service readiness, SLOs and SLAs across the business.
- Consult on reliability and on creating scalable, secure and resilient systems.
- Build tools and infrastructure to make developers’ lives easier.
What you’ll need to succeed
We do not expect you to have deep understanding or experience of everything listed, but you should be willing to develop in the areas where you have less experience:
- Excellent written and verbal communication skills.
- Working knowledge of scalable architectures and performance optimization techniques for services that serve millions of requests per minute to millions of users across the world.
- Exceptional interpersonal skills: Empathy, negotiation skills, problem-solving acumen, emotional intelligence.
- Solution-driven, with a track record of breaking down complex problems and measuring results.
- 3+ years of experience with a public cloud provider such as AWS or GCP.
- 3+ years of experience with observability concepts and solutions such as Prometheus, Datadog or Grafana, including their use in creating resilient systems.
- 3+ years of experience working with databases and object storage such as MySQL, PostgreSQL and S3.
- Experience in Linux environments with the ability to troubleshoot problems at the OS, database, server, or network level.
- Strong experience being on call for mission-critical services, managing incidents and running postmortems.
- Excellent understanding of Infrastructure as Code (IaC) concepts and tooling such as Terraform or CloudFormation.
- Experience with at least one programming language, such as Python or Golang.
- Familiarity with software development best practices, including test-driven development, continuous delivery and agile methodologies.
- Eager to learn and improve your skill set.
Nice skills to have, but not required
- Experience operating services at scale on top of Kubernetes, ideally with a service mesh such as Istio.
- Experience with distributed microservices architectures.
- Familiarity with caches and message queues such as Redis and RabbitMQ.
- Knowledge of OKRs.
- Ability to participate in build versus buy decisions.
Please upload your resume as a PDF and do not include compensation information.