Senior Site Reliability Engineer
Red Hat
Job Overview
Develop, scale, and operate managed cloud services built on OpenShift, Red Hat’s enterprise Kubernetes distribution. Contribute to running OpenShift at scale by enabling customer self-service, making monitoring more sustainable, and automating work. Help solve the complex scaling challenges unique to managed cloud services using skills in coding, operations, and large-scale distributed system design. Work in a global, transparent team environment that fosters learning from failures and supports continuous improvement.
Responsibilities
- Contribute code to increase the scalability and reliability of the service
- Contribute software tests and participate in peer review to increase the quality of our codebase
- Help develop peers’ capabilities through knowledge sharing, mentoring, and collaboration
- Participate in a regular on-call schedule, including occasional paid weekends and holidays
- Practice sustainable incident response and blameless postmortems
- Resolve customer issues escalated from the Red Hat Global Support team
- Work within a small agile team to develop and improve SRE software, support your peers, plan work, and keep improving
Qualifications
- Bachelor’s degree in Computer Science or related technical field involving software or systems engineering (or equivalent hands-on experience in Site Reliability Engineering)
- Experience programming in at least one language: Python, Golang, Java, C, C++, or another object-oriented language
- Experience working with public clouds such as AWS, GCP, or Azure
- Ability to collaboratively troubleshoot and solve problems in a team setting
- Experience troubleshooting as-a-service offerings (SaaS, PaaS, etc.) and working with complex distributed systems
- Direct experience with Kubernetes or OpenShift is a plus
- Demonstrated ability to debug and optimize code and to automate routine tasks
- Basic understanding of Unix/Linux operating systems
- Desired: 5+ years managing Linux servers (RHEL, CentOS, or Fedora) in the cloud (AWS, GCE, Azure)
- Desired: 3+ years with enterprise systems monitoring (Prometheus a plus)
- Desired: 3+ years with configuration management (Ansible, Puppet, Chef)
- Desired: 2+ years programming with Golang, Java, or Python
- Desired: 2+ years delivering a hosted service
- Desired: Ability to quickly troubleshoot system issues
- Desired: Understanding of TCP/IP networking and protocols like DNS and HTTP
- Desired: Solid communication skills and customer interaction experience
- Desired: 1+ year with Kubernetes or Docker-based containers