Lead SRE

Boston, MA

The next generation of our products is delivering engaging, adaptive, and personalized
experiences to optimally support every learner. We are hiring a Lead Site Reliability Engineer
who will work with our software engineers to build reliable, high-capacity, and high-performance
infrastructure in support of our mission to reimagine learning for millions of students worldwide.

If you love solving application reliability and operations problems by engineering solutions and writing code…
If you excel at defining roadmaps and setting vision for long-term projects…
If you love to identify, define, and solve strategic problems, thinking holistically about the whole system and its interactions with other systems…
If you thrive on effective partnerships with product leaders to manage scope and deliverables for the technical side of the product roadmap…
If you know AWS services inside out and have solid networking experience…you will thrive in this position!

Essential Accountabilities:

● Hands-on design, analysis, and troubleshooting of highly distributed, large-scale
production systems.
● Ownership of reliability, uptime, capacity, and performance, including their analysis.
● Deep care for the repeatability, traceability, and transparency of our infrastructure
automation.
● Identifying highest-impact opportunities to optimize existing systems.
● System design consulting for teams seeking to improve their production infrastructure.
● Anticipating, planning, and building capacity for upcoming product/feature launches.
● Focuses on technical decision-making, leading work that affects one or more complex
systems and mission-critical areas.
● Successfully plans & executes projects involving multiple developers and complex
requirements, prioritizing strategically.
● Helps define roadmaps and set vision for long-term projects.
● Contributes to all major architectural decisions and reads all tech specs within their
domain, tracking status and considering implications for other systems.
● Identifies, defines, and solves strategic problems, thinking holistically about the whole
system and its interactions with other systems.
● Tackles tech debt proactively.
● Excels at getting the team to focus on the highest-impact projects.
● Partners effectively with product to manage scope and deliverables for the technical side
of the product roadmap.
● Leads initiatives & meetings within team and domain. Regularly leads multi-person,
multi-week projects.
● Sought out as a mentor and a provider of technical guidance and effective coaching.
● Persuades others on technical tradeoffs and decisions.

Required Technical Skills:

● Mastery of AWS services (IAM, EC2, S3, EBS/EFS, ELB/ALB, Auto Scaling, RDS and
replication techniques, VPC, Subnets, Elastic IP, Route 53, CloudWatch, CloudFront,
Lambda, CloudFormation, ECS, SNS, ElastiCache).
● Expertise in container/container-fleet-orchestration technologies (Kubernetes, ECS).
● Hands-on experience with data storage and database solutions in the Cloud, including
fit-for-purpose DB architecture, DB schema change deployments, and backup and
DR solutions.
● Experience with database ETL techniques and solutions (preferably Apache Spark)
including streaming large data sets.
● Expertise in designing and managing escalation response plans driven by
monitoring (React, Respond, Remediate, and Retrospect), aligned with McGraw Hill’s
culture of being proactive, customer-focused, collaborative, data-driven, and automated.
● Mastery of infrastructure build and configuration automation technologies (like
Terraform, Ansible, Puppet, CodeDeploy).
● Strong skills in reading, understanding, and writing code in at least two of: JavaScript,
Python, PHP, Go, or Ruby.
● Solid network engineering skills.
● Cloud-native Linux administration/build/management skills (AMIs, Packer, etc.).
● Significant experience troubleshooting interactions among concurrent and distributed
systems.
● Expertise with continuous-deployment software development lifecycles in the Cloud (e.g.
CI/CD).
● Cloud database operations and deployment experience (RDS MySQL/Postgres/Aurora),
plus caching operations and deployments (Memcached, Redis).
● Expertise with Lean/Agile deployment processes (zero-downtime deployment: Blue/Green,
Canary, DNS strategies).
● Familiarity with site and infrastructure monitoring systems (CloudWatch, Datadog, New
Relic, Sumo Logic, ThousandEyes).
● Expertise with SDLC branching, SCM, and code deployment systems (Git/Gitflow,
Jenkins, CircleCI, etc.).

Other Requirements:

● Strong problem-solving, root-cause analysis, and systems engineering skills.
● Good presentation and communication skills.
● BS degree in Computer Science or a related technical field, and/or equivalent industry
experience.
