How to Build an SRE Team in Your Organization

Sylvia Fronczak, Thu, 01/16/2020 - 09:49

So you’ve read up on the value of having an SRE team and want to start one at your organization. However, that task can feel daunting. This isn’t like creating a new application team that has one product to care for. The SRE team’s responsibilities can encompass products for all application teams and all infrastructure needs.

How can you incorporate all of that without tearing the whole organization apart and putting it back together? And how do you ensure that you give the SRE team the right objectives and goals?

The first step in the journey involves starting small.

Start Small

When first putting together an SRE team, consider a pilot project. You don’t have to go all in with a 20-member team and $5M budget. Instead, figure out where you need results and grow from there. Use iterative agile processes to build upon successes and learnings from the pilot to grow further. This will allow you to focus on the problems with the most potential value.

Also, don’t think of the SRE team as a team that takes care of all the availability and reliability needs of your organization starting on day one. Instead, consider how you could grow the SRE team as a byproduct of focusing on high-risk and high-value work first.

One way to start small would be to pick a system that currently experiences reliability issues. Now, please note, this shouldn’t simply be your least reliable system. It should be the least reliable system whose failures result in lost profit. If you work on improving the reliability of an application that doesn’t mean much to the business, then your SRE pilot will not prove anything.
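As a rough illustration of that selection rule, here is a minimal sketch that ranks candidate systems by estimated money lost to downtime rather than by raw unreliability. The system names, availability figures, and revenue numbers are all made up for the example, and the loss formula is a simplifying assumption, not a method prescribed by this article.

```python
from dataclasses import dataclass

# Approximate number of hours in an average month (assumption for the sketch).
HOURS_PER_MONTH = 730


@dataclass
class System:
    name: str
    availability: float      # measured fraction of time the system was up
    revenue_per_hour: float  # hypothetical revenue lost per hour of downtime


def estimated_monthly_loss(system: System) -> float:
    """Estimate money lost per month: downtime hours times revenue per hour."""
    downtime_hours = (1.0 - system.availability) * HOURS_PER_MONTH
    return downtime_hours * system.revenue_per_hour


def pick_pilot(systems: list[System]) -> System:
    """Pick the system whose unreliability costs the business the most."""
    return max(systems, key=estimated_monthly_loss)


# Hypothetical candidates: the wiki is the least reliable, but its
# downtime costs nothing, so it makes a poor pilot.
candidates = [
    System("internal-wiki", availability=0.95, revenue_per_hour=0.0),
    System("checkout", availability=0.995, revenue_per_hour=20000.0),
    System("search", availability=0.99, revenue_per_hour=1500.0),
]

print(pick_pilot(candidates).name)
```

In this made-up data set, the least reliable system (the wiki) is passed over in favor of the one whose downtime is most expensive, which is the distinction the pilot selection depends on.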

Once you’ve selected your application or system, build an SRE team with the singular goal of improving the reliability of that application. One of the first tasks the SRE team will have involves determining the required reliability level.
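To make a "required reliability level" concrete, it can help to translate an availability target into the downtime it permits. The sketch below does that arithmetic; the function name, the 30-day window, and the sample targets are illustrative assumptions, and your team may well measure reliability differently.

```python
def allowed_downtime_minutes(target: float, period_days: int = 30) -> float:
    """Return the minutes of downtime an availability target permits.

    target is a fraction such as 0.999 (99.9%); period_days is the
    measurement window, assumed here to be 30 days.
    """
    if not 0.0 < target <= 1.0:
        raise ValueError("target must be a fraction in (0, 1]")
    total_minutes = period_days * 24 * 60
    return (1.0 - target) * total_minutes


# Compare a few candidate targets over a 30-day window.
for target in (0.99, 0.999, 0.9999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.2%} allows about {minutes:.1f} minutes of downtime")
```

Each extra nine cuts the permitted downtime by a factor of ten, so the target is worth agreeing on with the business before the team commits to it.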

But first, let’s staff our team.

Find the Right People

Putting your SRE team together doesn’t mean you have to hire a brand new team off the street. In fact, strong SRE teams have a mix of established domain knowledge and fresh viewpoints. Consider having a mix of existing employees and new hires.

For your pilot, you might want two to four engineers to keep the team fast and nimble. When your team isn’t huge, there will be less overhead and bureaucracy, giving your team the agility they need to get results.

You’ll also want to make sure they report to someone that has the authority to reduce bureaucracy and give the team the leeway they need to implement new processes and automation. The team needs the right amount of respect so that they don’t become an ops team or a tools team. Make sure the management chain fully understands the SRE team’s purpose and value.

Next, how do we decide who to put on this team?

SRE Qualities

Let’s look at the qualities of a good SRE.

  1. Ability to solve problems and troubleshoot. SREs need to be able to troubleshoot various issues around availability and reliability. Oftentimes, they’re troubleshooting issues with applications they didn’t write themselves. So they need to be able to debug without deep domain knowledge.
  2. Desire to automate. One of an SRE’s goals includes automating away toil. Therefore, your SRE requires an innate desire to reduce manual work. They should want to reduce the manual burden that your traditional ops team can’t.
  3. Curiosity. With curiosity, SREs can find novel solutions to reliability problems. Curiosity also helps in finding unexpected causes of familiar problems.
  4. Teamwork. An SRE team needs to work together and with development teams. You need people that will band together behind a common goal.
  5. Communication. Whether discussing problems during a high-pressure outage or talking about them in a more relaxed session while looking at long-term automation strategies, your SRE must have strong communication skills.
  6. Ability to see the big picture. Oftentimes, reliability problems can be solved in many ways. The SRE must be able to look at each potential solution in a larger context. Otherwise, they may solve one problem but cause two others.

And remember that the people you choose will need to build the right SRE culture. Choose wisely.

Finding People

When staffing your SRE team, you have two options:

  1. Hire externally.
  2. Hire from within.

When starting out, first look within your existing organization. This will let you bring people into the SRE fold that already have domain knowledge. Additionally, you’ll know a lot about them based on their work.

So look at the qualities above and see if you already have individuals that can meet the requirements. But be careful of just seeding your team with your organization’s superstars. Just because someone does well on one development team or operations team doesn’t mean they’ll be able to contribute as an SRE. You have to look for a set of skills that will allow the person to thrive.

Additionally, a superstar used to saving the day might not be able to build trust and camaraderie with others on the team. Site reliability engineering is a team sport, and one toxic individual can make it difficult for the team to succeed.

Ultimately, bringing a team together should be done with care. You might not want to take a bunch of engineers that have never interacted with each other and expect everything to work out.

As an alternative to finding individuals from different teams, consider taking an existing high-performing team and converting it into an SRE team that’s focused on reliability.

And of course, if you don’t have the talent within, or if you get to a point of rapid growth, you will need to look for these candidates externally.

Training and Development

Now that you’ve identified the right people, how do you prepare them to succeed? Learning and development!

Becoming an SRE doesn’t just take a title change. It’s a culture change and a change in mindset. Without the proper fundamentals, you’re not giving your new team the opportunity to succeed.

Fortunately, ASPE’s Implementing Site Reliability Engineering course provides in-depth learning that can give your team the base knowledge that will take them further through the SRE journey.

Charter and Governance

Once your team is getting settled, define the charter for your SRE team. A charter spells out the SRE team’s priorities and how the team operates.

More importantly, the charter defines what the SRE team should not engage in. Otherwise, you may find that your SRE team gets pulled in too many directions that don’t provide business value.

As for governance, always tie the SRE work back to business value. Site reliability engineering puts emphasis on metrics and objectives. Therefore, the team should receive governance that focuses on business metrics and developer productivity. Make sure that whoever oversees the SRE team sets clear guidelines and expectations for the team and that the team has the tools to provide the relevant data.

Is This Still Too Much?

I’ve given just a brief overview of things to consider when building an SRE team. However, that can still be too much for some teams and organizations to get behind. And that’s OK.

If you’re not ready to build an SRE team yet, focus on building SRE practices in your organization first. Take individuals from a few teams and start building a culture of SRE with them. Again, provide the right training and development to build the right skills and culture. And then help your people teach others.