REMOTE: Senior Site Reliability Engineer (SRE)

Boston, MA

Senior Site Reliability Engineer (SRE)

This role is remote, with periodic travel to the Boston, MA or Seattle, WA office.

U.S. Citizens and those authorized to work in the U.S. are encouraged to apply. We are unable to sponsor at this time. No Corp to Corp.

We are looking for SREs who are true software engineers and have a track record of producing quality solutions in Terraform, Node.js, Go, Python, etc.

We aim to break down walls between development and operations. You will participate in finding and building solutions that enable teams to deliver software updates in a way that is highly stable and operationally sound.

We are strongly invested in the AWS Cloud, infrastructure-as-code, and monitoring-as-code. We favor the practical and pragmatic over the ideal, including finding right-sized solutions. We are anticipatory and forward-looking, reliable, and have a bias toward taking action.

Leadership:

● Listening to the needs of our teams, learning how they work best, and delivering solutions.

● The ability to collaborate with product teams and technical leads to prioritize our efforts.

● Stay current on industry trends; conceive and present to management ways to improve current practices, strengthen our standing in the marketplace, and remain on the cutting edge of technology.

● Ability to take ownership of a project, drive it forward, “sell” it to other teams inside the company as a solution for a given problem, and work with teams to drive adoption.

● If you see an opportunity to solve a problem or otherwise make something better, take the initiative.

● Mentor team members; foster growth by setting high-reaching goals and providing support as needed to achieve them.

Technical:

● Hands-on design, understanding, and troubleshooting of highly-distributed, large-scale production systems — both modern and legacy, monolithic and micro.

● Co-ownership with the development teams over reliability, uptime, capacity, and performance.

● Ensuring the repeatability, traceability, and transparency of our infrastructure automation.

● Identifying highest-impact opportunities to optimize existing systems; ensuring “right-sized” and cost-optimized solutions in consideration of technical and business constraints.

● System design consulting for teams seeking to leverage or improve their production infrastructure.

● Anticipate, build, and plan capacity for upcoming product/feature launches.

● Working with application teams and product principals to fully operationalize software/systems projects.

Required Skills:

● We are a polyglot organization. Being “conversational” in JavaScript/TypeScript, Node, Python, PHP, Ruby, Golang, Java, Bash, Markdown, reStructuredText, HCL, JSON, YAML, and TOML would be valuable. Must be fluent in 2-3 of them.

● Must have the skills of a senior (or higher) level software application engineer.

● Must have the skills of a senior (or higher) level cloud operations engineer.

● Ability to translate knowledge and ideas into the written word as documentation/1-pagers.

● Excellent presentation and communication skills.

● Mastery of AWS services (IAM, EC2, S3, EBS/EFS, ELB/ALB, AutoScaling, RDS and replication techniques, VPC, Subnets, Elastic IP, Route53, CloudWatch, CloudFront, Lambda, CloudFormation, ECS, SNS, ElastiCache).

● Expertise in container/container-fleet-orchestration technologies (Kubernetes, ECS, Docker).

● Expertise integrating continuous-integration and continuous-delivery software development lifecycles (i.e., CI/CD) into one or more applications (using Jenkins, CircleCI, Travis CI, or other modern CI tools).

● Expertise in infrastructure automation technologies (e.g., Terraform, CloudFormation).

● Expertise with Lean/Agile deployment processes (e.g., blue/green, zero downtime, canary, and DNS strategies).

● Significant experience troubleshooting interactions among concurrent and distributed systems.

● Cloud database operations and deployment experience (e.g., RDS MySQL/Postgres/Aurora), as well as caching operations and deployments (e.g., Memcached, Redis).

● Ability to design and manage escalation response plans, from monitoring, to reaction/response/remediation, to retrospection/post-mortem, in culturally aligned (proactive, customer-focused, collaborative, proven-with-data) ways.

● Familiarity with site and infrastructure monitoring systems (e.g., CloudWatch, Datadog, New Relic, Sumo Logic, ThousandEyes).

● Cloud and container-native Linux administration/build/management skills (e.g., AMIs, Packer).

● Strong problem-solving, root cause understanding, and systems engineering skills.

● Expertise with software development lifecycle branching and distributed source code management systems (e.g., Git/Mercurial, Git-Flow, GitHub-Flow).

● B.S. Degree in Computer Science (or related technical field, or equivalent industry experience).

● A non-trivial background in open source is a HUGE plus.
