SRE - Site Reliability Engineer

Columbus, OH

Description:

• Hands-on design, analysis, development and troubleshooting of highly distributed, large-scale production systems and event-driven services spanning on-prem and AWS-based hosting

• Ownership of reliability, uptime, system security, cost, operations, capacity and performance analysis

• Share a 24x7 on-call rotation with your team and respond to incidents; lead triage bridges during incidents and provide needed status updates

• Create and maintain monitoring, alerting and dashboarding solutions that improve visibility into our applications' performance and business metrics and keep operational workload in check

• Use automation technologies to ensure repeatability, eliminate toil, reduce time to action and repair services

• Participate in technical training events and game day scenarios

• Partner with engineering, security, performance, QA and product management teams to improve the availability and quality of service of our products

Required Skills:

• Strong Linux administration/build/management skills

• Development experience in at least one of these languages: Java, Go, C# or Python, with strong skills in reading, understanding and writing code in that language

• Demonstrated expertise building and managing highly scaled production infrastructure in on-prem and AWS-based environments

• Extensive experience troubleshooting n-tier architectures with diverse sets of technologies (e.g., load balancers, web/app/caching/database servers, queues, threading, memory, CPU, heap, storage, network, OS) strongly desired

• Strong experience using application and infrastructure monitoring systems (like Splunk, CloudWatch, Datadog, New Relic, Sumo Logic, ELK)

• Excellent presentation and communication skills

• Mastery of infrastructure automation technologies (like Terraform, Puppet, Ansible, Chef)

• Expertise with software development lifecycles based on continuous integration and continuous deployment (CI/CD)

• Experience with common middleware (e.g., Apache, NGINX, IIS, Tomcat, JBoss)

• Experience with SQL databases (e.g., PostgreSQL, Oracle, MySQL)

• Expertise with SDLC branching, SCM, and code deployment systems (Git/Gitflow, Jenkins, CircleCI, Travis CI, etc.)

• Expertise in container/container-fleet-orchestration technologies (like Docker, Vagrant, Mesosphere)

• BS degree in Computer Science or a related technical field, or equivalent industry experience
