Willis Towers Watson

Site Reliability Engineering Team Lead – Monitoring and Alerting

Posted on Jan 14 | South Jordan, UT

Our engineering team has built the largest private Medicare marketplace in the country. We passionately focus on the continuous improvement of the systems we build and the culture we promote. We build a platform that provides the best possible support to our customers who are shopping for insurance, and where our insurance carriers can be confident that their products are accurately and impartially represented.
We are looking for an Engineering Team Lead for our Insights Engineering team. This position is responsible for leading the design, deployment, scaling, and maintenance of the platforms that two dozen engineering teams use to instrument real-time telemetry, monitoring, and alerting. This team is embedded within our Platform Engineering Group.

We operate in a complex, multi-tenant, hybrid cloud and on-premises infrastructure that spans both Windows and Linux operating systems. We strive for security, reliability, and automation in line with DevOps and Site Reliability Engineering principles. If you are passionate about learning and improvement through metrics and automation, and about engendering that mindset in others, we want to hear from you.

The Role

• Keep leadership well informed of your team's direction and focus
• Ensure that your entire team is well informed of changes or status
• Explore new ways of improving communication among your team and with other teams
• Promote inclusion and collaboration between various functional disciplines
• Conduct 1-on-1 meetings with all team members
• Write and maintain architectural, stakeholder, and policy documentation

• Encourage and inspire others to innovate
• Look for new ways to improve our processes
• Look for new ways to improve the quality of our infrastructure
• Look for new ways to increase the velocity with which your team delivers, leveraging expertise from various functional disciplines
• Look for new ways to remediate production incidents more quickly and safely
• Encourage participation in department Communities of Practice

• Hold everyone accountable for being on time and staying productive
• Adhere to and advocate for best practices including Infrastructure as Code, monitoring, high availability, disaster recovery, security, and DevOps methodologies

• Know what needs to be worked on and keep the team focused on the goal
• Provide timely assistance and remediation solutions during critical situations and production incidents
• Take ultimate responsibility for the success or failure of delivering on time and with the highest quality possible

Group Culture
• Lead and influence across the organization without relying on hierarchy
• Guide the culture and attitude of the team toward an optimistic, proactive, and encouraging direction
• Foster an environment where it is safe to fail and to learn from failure

The Requirements

  • Hands-on Engineering
    • 10+ years of hands-on technical experience with many of the following technologies:
      • Windows and Linux servers
      • Infrastructure monitoring tools like Zabbix, Sensu, Nagios, SolarWinds, etc.
      • Application performance monitoring tools like New Relic, Datadog, etc.
      • Log aggregation tools like Sumo Logic, Splunk, ELK, etc.
      • Cloud platforms, preferably Azure
      • Secrets management with Consul and Vault or similar systems
      • Configuration management tools like Salt and Terraform
      • Continuous Integration and Continuous Delivery with tools like TeamCity, Octopus Deploy, Concourse, or Azure DevOps
      • Firewalls and load balancers such as F5
      • Web servers including IIS, NGINX, and Tomcat
    • Proficiency and a high level of comfort with:
      • One or more programming languages, such as Python or Go
      • One or more scripting languages, such as PowerShell and Bash
      • Command line interfaces
      • Networking infrastructure
      • Git
  • People Management
    • 3+ years of experience with and responsibility for the following HR concerns regarding your team members:
      • Recognize and coach problematic behavior, and discuss corrective actions
      • Reward colleagues who go above and beyond their job duties
      • Become familiar with the career aspirations of each team member, and assist in setting short- and long-term goals to support them in those pursuits
      • Resolve conflicts between individuals, and know when to get leadership or HR involved
      • Manage paid-time-off and work-from-home requests
      • Interview, hire, and onboard high-quality job applicants
  • Bachelor's degree or equivalent experience strongly preferred; high school diploma required

EOE, including disability/vets