Senior Engineer/Site Reliability Engineer
Process
Requires a mix of strategic engineering and design along with hands-on technical work and problem-solving skills. Passion for quality and automation, an ability to understand complex systems, and a desire for continual improvement and innovation. Explore and evaluate new technologies and solutions to push capabilities forward and get ahead of customers' needs. Able to communicate concepts at different levels of abstraction and to exercise influence across multiple levels of the organization. Share results from incident investigations with a wide IT audience through a blameless postmortem process, with the goal of exposing faults so they are fixed rather than left unresolved. Ability to execute a change across an enterprise environment with consistency and reliability by applying modern software, operations and quality principles such as progressive rollouts, problem detection and, if needed, rollbacks.
Operations
Work closely with software development engineers, systems engineers, network engineers, database administrators, the monitoring team, and the information security team in supporting new features, services and releases.
Deep understanding of, and ability to debug, standard networking protocols and components such as HTTP, DNS, TCP/IP, ICMP and load balancing. Experience with infrastructure provisioning on public, private, and hybrid clouds using state-of-the-art tools such as Terraform, vRealize, Cloud Foundry or CloudFormation. Experience with configuration management tools such as Puppet, Chef or Ansible.
Effectively use metrics, monitoring, and instrumentation of the application and infrastructure to:
- proactively discover problems before users notice
- achieve optimal application performance, stability and availability
- determine optimal configurations for application software and application servers
- scale infrastructure to meet demand
Experience with, and desire to influence, emerging operations techniques including, but not limited to: delivery and deployment through containers, Docker Swarm, or Kubernetes; auto-remediation to automatically resolve incidents; applications to test the resiliency of systems (e.g., Chaos Monkey).
Qualifications
You must possess the below minimum qualifications to be initially considered for this position. Preferred qualifications are in addition to the minimum requirements and are considered a plus factor in identifying top candidates.
- Bachelor's degree in a technical discipline (Computer Science, Computer Engineering, Electrical Engineering, Mathematics, Physics, or a related technical discipline) with 6+ years of experience
- Minimum of 6 years' experience with remote deployment and administration of Linux servers
- Hands-on experience in configuring monitoring tools like Prometheus, Grafana and Zabbix
- Experience running a highly visible, 24x7 mission-critical service using SRE and DevOps practices
- Experience working in a global team
Preferred qualifications:
- Master's degree in Computer Science, Computer Engineering, or a STEM field
- Experience running serverless infrastructure
- Experience with Git and parallel development, including branching strategies and CI/CD methodologies
- Interesting personal projects or contributions to open-source projects