Sr. Site Reliability Engineer

Seattle, WA

Site Reliability Engineer (Senior)

McGraw-Hill Education is a digital learning company that draws on its more than 100 years of educational expertise to offer solutions which improve learning outcomes around the world. The Company has offices across North America, India, China, Europe, the Middle East and South America, and makes its learning solutions available in more than 65 languages. For additional information, visit www.mheducation.com.

The next generation of our digital products is delivering engaging, adaptive, and personalized learning experiences to optimally support every student. We are hiring a Site Reliability Engineer who will work with system and software engineers to build reliable, high-capacity, and high-performance systems in support of our mission to reimagine learning for millions of students and learners worldwide. This position will be located at our Seattle, WA office.

We aim to break down walls between development and operations, and we participate in finding and building solutions that enable teams to deliver software updates in a way that is highly stable and operationally sound. We are strongly invested in the AWS Cloud, infrastructure-as-code, and monitoring-as-code. We favor the practical and pragmatic over the ideal, including finding right-sized solutions. We are anticipatory and forward-looking, reliable, and have a bias toward taking action. We understand that without our customers our efforts are worthless, and that operational changes are likely to have a direct impact on user experience. We understand that uptime is paramount, and we work backwards from there.

Essential Accountabilities:

Collaborating with product teams and technical principals to prioritize our efforts.
Hands-on design, understanding, and troubleshooting of highly-distributed, large-scale production systems — both modern and legacy, monolithic and micro.
Co-ownership with the development teams over reliability, uptime, capacity, and performance.
Ensuring the repeatability, traceability, and transparency of our infrastructure automation including alignment with MHE standards and best practices for operational excellence.
Identifying highest-impact opportunities to optimize existing systems; ensuring “right-sized” solutions in consideration of technical and business constraints.
System design consulting for teams seeking to leverage or improve their production infrastructure.
Anticipate, build, and plan capacity for upcoming product/feature launches.
Working with application teams and product principals to fully operationalize software/systems projects (including security requirements), delivered on time and within budget.
Stay current on industry trends; conceive and present to management ways to improve current practices, strengthen our standing in the marketplace, and remain on the cutting edge of technology.
Mentor team members; foster growth by setting high-reaching goals and providing support as needed to achieve them.

Required:

3 years of experience as a software application engineer.
3 years of experience as a system/release engineer.
5 years of experience with the foundational AWS services: EC2, RDS, and S3.
3 years of experience with the supporting AWS services (e.g., SQS, SNS, SES, CloudWatch, ElastiCache, Lambda).
1 year of experience integrating continuous-integration and continuous-delivery software development lifecycles (i.e., CI/CD) into one or more applications (using Jenkins, CircleCI, or other modern CI tools).
3 years of experience with infrastructure and/or system configuration automation technologies (e.g., Terraform, AWS CodeDeploy, Puppet, Ansible, Chef).
3 years of experience in container and orchestration technologies (e.g., Docker, Vagrant, etcd, Consul, Zookeeper).
3 years of experience with Linux-in-the-cloud, with at least 1 year of “Enterprise Linux” distributions (e.g., RHEL, CentOS, Amazon Linux).
1 year of experience with cloud database operations and deployment (e.g., RDS MySQL, RDS PostgreSQL, Amazon Aurora) and with caching operations and deployment (e.g., Memcached, Redis).
3 years of experience with monitoring applications and infrastructure; familiarity with common monitoring systems (e.g., CloudWatch, Datadog, New Relic, Sumo Logic).
Strong problem-solving, root cause understanding, and systems engineering skills.
Ability to design and manage escalation response plans — from monitoring, to reaction/response/remediation, to retrospection/post-mortem in culturally-aligned (proactive, customer focused, collaborative, proven-with-data) ways.
Demonstrated expertise building and managing highly-scaled production infrastructure in the cloud (AWS required; GCP, Azure, OpenStack a plus).
Excellent presentation and communication skills.
B.S. Degree in Computer Science (or related technical field, or equivalent industry experience).

Nice to Have:

Being able to translate between development, operations, security, product, and management dialects is a highly-sought skill.
Ability to translate knowledge and ideas into written documentation.
Cloud and container-native Linux administration/build/management skills (e.g., AMIs, Packer).
Expertise with Lean/Agile deployment processes (e.g., blue/green, zero downtime, canary, and DNS strategies).
MHE is a polyglot organization. Being “conversational” in JavaScript/TypeScript, Python, PHP, Ruby, Golang, Java, Bash, Markdown, reStructuredText, HCL, JSON, YAML, and TOML would be valuable. Being fluent in 2-3 of them would be a huge plus.
Expertise with software development lifecycle branching and distributed source code management systems (e.g., Git/Mercurial, Git-Flow, GitHub-Flow).
A non-trivial background in open source is a huge plus.
