Lead SRE

Boston, MA
The next generation of our products is delivering engaging, adaptive, and personalized
experiences to optimally support every learner. We are hiring a Lead Site Reliability Engineer
who will work with the software engineers to build reliable, high-capacity, and high-performance
infrastructure in support of our mission to reimagine learning for millions of students worldwide.
If you love solving application reliability and operations problems by engineering
solutions and writing code...
 
  • If you excel at defining roadmaps and setting vision for long-term projects...
  • If you love to identify, define, and solve strategic problems, thinking holistically about the whole system and its interactions with other systems...
  • If you thrive on effective partnerships with product leaders to manage scope and deliverables for the technical side of the product roadmap...
  • If you know AWS services inside out and have solid networking experience...

...you will thrive in this position!
Essential Accountabilities:
 
● Hands-on design, analysis, and troubleshooting of highly distributed, large-scale
production systems.
● Ownership of reliability, uptime, capacity, and performance analysis thereof.
● Deep care for the repeatability, traceability, and transparency of our infrastructure
automation.
● Identifying highest-impact opportunities to optimize existing systems.
● System design consulting for teams seeking to improve their production infrastructure.
● Anticipating, building, and planning capacity for upcoming product/feature launches.
● Focused on technical decision making, leading work that affects one or more complex
systems and mission-critical areas.
● Successfully plans & executes projects involving multiple developers and complex
requirements, prioritizing strategically.
● Helps define roadmaps and set vision for long-term projects.
● Contributes to all major architectural decisions and reads all tech specs within their
domain, tracking status and considering implications to other systems.
● Identifies, defines, and solves strategic problems, thinking holistically about the whole
system and its interactions with other systems.
● Tackles tech debt proactively.
● Excels at getting the team to focus on the highest-impact projects.
● Partners effectively with product to manage scope and deliverables for the technical side
of the product roadmap.
● Leads initiatives & meetings within team and domain. Regularly leads multi-person,
multi-week projects.
● Sought out as a mentor and a provider of technical guidance and effective coaching.
● Convinces others about technical tradeoffs & decisions.
 
Required Technical Skills:
 
● Mastery of AWS services (IAM, EC2, S3, EBS/EFS, ELB/ALB, AutoScaling, RDS and
replication techniques, VPC, Subnets, Elastic IP, Route53, CloudWatch, CloudFront,
Lambda, CloudFormation, ECS, SNS, ElastiCache).
● Expertise in container/container-fleet-orchestration technologies (Kubernetes, ECS).
● Hands-on experience with data storage and database solutions in the Cloud, including
fit-for-purpose DB architecture, DB schema change deployments, and backup and
DR solutions.
● Experience with database ETL techniques and solutions (preferably Apache Spark)
including streaming large data sets.
● Expertise in designing and managing escalation response plans from
monitoring: React, Respond, Remediate, and Retrospect, aligned with McGraw Hill’s
culture of being proactive, customer-focused, collaborative, data-driven, and automated.
● Mastery of infrastructure build and configuration automation technologies (like
Terraform, Ansible, Puppet, CodeDeploy).
● Strong skills in reading, understanding, and writing code in at least two of: JavaScript,
Python, PHP, Go, or Ruby.
● Solid network engineering skills.
● Cloud-native Linux administration/build/management skills (AMIs, Packer, etc.).
● Significant experience troubleshooting interactions among concurrent and distributed
systems.
● Expertise with continuous-deployment software development lifecycles in the Cloud (e.g.
CI/CD).
● Cloud database operations and deployment experience (RDS MySQL/Postgres/Aurora), and
caching operations and deployments (Memcached, Redis).
● Expertise with Lean/Agile deployment processes (zero-downtime (ZDT) strategies: Blue/Green,
Canary, DNS).
● Familiarity with site and infrastructure monitoring systems (CloudWatch, Datadog, New
Relic, Sumo Logic, ThousandEyes).
● Expertise with SDLC branching, SCM, and code deployment systems (Git/Gitflow,
Jenkins, CircleCI, etc.).
 
Other Requirements: 

● Strong problem-solving, root cause analysis, and systems engineering skills;
● Good presentation and communication skills;
● BS Degree in Computer Science (or a related technical field and/or equivalent industry
experience).
 
