Site Reliability Engineer

Waltham, MA

This role can be on-site in Seattle, WA or Waltham, MA

U.S. Citizens and those authorized to work in the U.S. are encouraged to apply. We are unable to sponsor at this time. No Corp to Corp.

This role is with an Ed Tech partner
Apply direct to: creposa@syrinx.com
12+ Month Contract with Possible Extension (Client will convert full time, if desired)

The next generation of our products is delivering engaging, adaptive, and personalized learning experiences to optimally support every learner. We are hiring a Site Reliability Engineer who will work with system and software engineers to build reliable, high-capacity, high-performance systems in support of our mission to reimagine learning for millions of students and learners worldwide. This position will be located at our Seattle, WA facility.

As a Site Reliability Engineer, you will help design, analyze and resolve infrastructure issues in collaboration with product development teams. You will also design, deploy and manage automation tools that increase predictability, shorten time to market and reduce cost.

Essential Accountabilities:

Hands-on design, analysis and troubleshooting of highly distributed, large-scale production systems
Ownership of the reliability, uptime, capacity analysis and performance analysis of those systems
Ensuring the repeatability, traceability, and transparency of our infrastructure automation including alignment with MHE standards and best practices for operational excellence
Identifying highest-impact opportunities to optimize existing systems
System design consulting for teams seeking to leverage or improve their production infrastructure
Anticipating, building and planning capacity for upcoming product/feature launches
Fully operationalizing software/systems projects, including their security requirements

Required Skills:

Expertise with cloud- and continuous-deployment-based software development lifecycles (e.g. CI/CD)
Mastery of infrastructure automation technologies (like Terraform, CodeDeploy, Puppet, Ansible, Chef)
Expertise in container/container-fleet-orchestration technologies (like Docker, Vagrant, Mesosphere, etcd, ZooKeeper)
Cloud and container native Linux administration/build/management skills (e.g. AWS AMIs, Packer, etc.)
Cloud database operations and deployment experience (e.g. RDS MySQL/Postgres/Aurora); caching operations and deployment experience (e.g. Memcached, Redis)
Expertise with Lean/Agile deployment processes (Blue/Green, ZDT, canary, load balancers/DNS strategies)
Familiarity with site and infrastructure monitoring systems (like AWS CloudWatch, Datadog, New Relic, Sumo Logic)
Strong problem solving, root cause analysis and systems engineering skills
Excellent presentation and communication skills
Experience programming in languages like JavaScript, Python, PHP, Go, or Ruby
Strong skills in reading, understanding and writing code in those languages
Ability to design and manage escalation response plans driven by monitoring, and to react, respond, remediate and retrospect in culturally aligned (proactive, customer-focused, collaborative, data-driven) ways
Demonstrated expertise building and managing highly scaled production infrastructure in the cloud (AWS required; GCP, Azure, OpenStack a plus)
Expertise with SDLC branching, SCM, and code deployment systems (e.g. git/gitflow, Jenkins, CircleCI, TravisCI, etc.)
BS Degree in Computer Science (or related technical field and/or equivalent industry experience)
