Willis Towers Watson
SRE Tech Lead
Posted on Jun 2, South Jordan, UT
We are looking for a Site Reliability Engineer (SRE) Tech Lead/Architect for our Cloud Engineering team. This position is responsible for leading the design, deployment, scaling, and maintenance of a complex, multi-tenant hybrid cloud and on-premises infrastructure that spans both Windows and Linux. You are passionate about security, reliability, and automation in line with DevOps and SRE principles. You have experience, but you are always willing to learn new things. You value expertise and have a passion for the craft of Cloud Engineering and for coordinating efforts with Software Engineering, Systems Engineering, and InfoSec teams. While you stay up to date with current techniques and tools, you are prudent, knowing what is and what isn’t a good fit.
• Become familiar with the career aspirations of current and aspiring Site Reliability Engineers and help in setting short- and long-term goals to support them in those pursuits.
• Mentor Site Reliability Engineers and others in the organization on reliability, reducing toil, operating software at growing scale, reducing technical complexity and sprawl, and writing software and tooling to improve resilience and automate operations.
• Participate in interviewing and the hiring process.
• Conduct 1-on-1 meetings with Site Reliability Engineers.
• Keep leadership well informed of Site Reliability Engineering direction and focus, and communicate changes or status updates to SREs across various teams.
• Explore new ways of improving communication between Site Reliability Engineers and with other teams.
• Promote inclusion and collaboration between various functional disciplines.
• Write and maintain architectural, stakeholder, and policy documentation.
• Encourage and inspire others to innovate.
• Look for new ways to improve:
- our processes and the quality of our infrastructure.
- the velocity with which teams deliver, using expertise from various functional disciplines.
- how quickly and safely we remediate production incidents.
Productivity and Initiative
• Define success and accountability for the Site Reliability Engineering discipline.
• Adhere to and advocate for best practices including Infrastructure as Code, monitoring, high availability, disaster recovery, security, and DevOps methodologies.
• Keep Site Reliability Engineers focused on goals.
• Provide prompt assistance and remediation solutions during critical situations and production incidents.
• Work with teams to implement and refine SRE standards as they are decided upon by the technology organization.
• 10+ years of hands-on technical experience with many of the following technologies; at least 50% of the day-to-day work will be focused in this area:
• Windows and Linux Servers
• Cloud platforms, preferably Azure
• Active Directory
• Secrets management with Azure Key Vault, HashiCorp Vault or similar systems
• Configuration management and infrastructure-as-code tools like Ansible and Terraform
• Load balancers such as F5 BIG-IP
• Web servers such as IIS (Internet Information Services)
• Application Performance Monitoring with tools like Application Insights / Azure Monitor
• Monitoring tools such as Azure Monitor, Zabbix, and SolarWinds
• Continuous Integration and Continuous Delivery with tools like TeamCity, Octopus Deploy, Concourse or GitHub Actions
• Log aggregation tools like Sumo Logic or Splunk
• Networking services and technologies such as DNS (Domain Name System), DHCP (Dynamic Host Configuration Protocol), proxy servers, and software-defined networking in cloud environments such as Azure
• One or more scripting languages, such as PowerShell and Bash
• Command line interfaces
Bachelor's degree preferred; high school diploma required
EOE, including disability/vets