Test Data Management and Its Role in DevOps

Christian Melendez
Mon, 04/30/2018 - 08:58

In my career, there've been many times when I've experienced the false joy of my code change being ready to be released to production. I say false joy because everything worked as expected on my computer, in dev, in testing, and in staging. But in production, my recent code changes were causing intermittent problems.

You know the types of problems. It's always something little, like data being longer than expected for certain fields. No matter how careful I was when testing my change, there was a scenario that I forgot or I didn't know existed. If only I could have good data that helped me to do my job better! My joy wouldn't be diluted by those errors.

Having data for good quality testing is key. And that's where test data management (TDM) comes into play. But what's its role in DevOps? Is it possible to integrate TDM? And how would we go about it?

Let's find out.

What Is Test Data Management?

Missing a use case of our app is a common problem in all types of organizations that develop software. No matter how we think our app will be used, users tend to exceed our expectations for creativity when using it. And production is production. It serves as a reminder to us that our test cases don't cover everything. If only we could test our changes in production.

But what if we could have production-like environments? And what if these could be production-like environments not just in terms of infrastructure, but also in terms of data?

TDM is the process of creating production-like data for testing purposes. In some cases, the test data can simply be a copy of the production data. But when there's sensitive data, things change. That sensitive data needs to be masked so that, if the test data is ever compromised, the impact will be low.
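To make masking concrete, here's a minimal sketch in Python. The field names (name, email), the sample row, and the hard-coded salt are all assumptions for illustration, and a real TDM tool will do much more than this. The idea is simply to replace sensitive values with stable tokens while leaving the rest of the row untouched.

```python
import hashlib


def mask_value(value: str, salt: str = "demo-salt") -> str:
    """Return a stable token for a sensitive value.

    The hard-coded salt is for illustration only; a real setup would
    keep it secret and outside the code.
    """
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return digest[:12]


def mask_customer(row: dict) -> dict:
    """Copy a row, masking the fields we assume are sensitive."""
    masked = dict(row)
    masked["name"] = "user_" + mask_value(row["name"])
    # Keep the value shaped like an email so validation code still runs.
    masked["email"] = mask_value(row["email"]) + "@example.com"
    return masked


# Hypothetical production row; the non-sensitive fields pass through as-is.
production_row = {"id": 42, "name": "Ada Lovelace", "email": "ada@corp.test", "plan": "pro"}
test_row = mask_customer(production_row)
```

Because the same input always produces the same token, rows that referred to the same customer in production still line up after masking.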

Tests are important in DevOps. Without tests, it's easy to lose confidence in our applications, and deployments tend to be scary. You need data for good tests. TDM won't prevent you from introducing bugs, but it will help you to reduce the chances by giving you the ability to build data of good quality. That's because if you're able to reproduce a production error outside of production, you'll be able to fix it and make sure it won't happen again. Bugs will continue emerging, but they won't be the same ones over and over. Thus, it's important that anyone in the team can access the data they need when they need it.

Test Data Management Should Be Self-Service For Everyone

Well, we know now that there's a process to prepare data and there are tools for the job. The next obvious thought then will be to automate it and give access to the team. That way, operations and DBAs stop being the bottleneck.

When you're building this process, everyone should be involved. You should agree on what data can and should be used, what data will be masked, and how much data will be needed---much like a data discovery phase, if you will. But after there's an agreement, the team should work on having this process automated so that everyone can create, update, or duplicate data for testing. DBAs will appreciate it; it's one less thing to worry about.

You don't have to reinvent the wheel. There are already some tools for this job.

People shouldn't have to wait too long to get the data they need---and, in fact, they won't wait. They'll find ways to get around things that take time and will shift testing to the right. That's why it's key that you choose the proper tool and plan ahead what data will be needed.

Test Data Management Should Help You Keep Healthy Data

It's also important to keep in mind that we're talking about having production-like data in other environments. Initially, that might not be a problem. But as the data grows, it could be costly. This process will also force you to keep your data healthy, not bloated.

My recommendation is that you start by having all the data you need, but as you grow, start generating the data for each test case. Or you can even have a mix of pre-populated data and static data that you generate in the code for each test case. Look forward to having more static data because it will be cheaper and you'll have more control over it!
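Here's a minimal sketch of what generating static data in the code for each test case could look like. The function names, the record shape, and the 50-character column limit are hypothetical. Seeding a local random generator keeps the data identical on every run, and the second record deliberately exercises the "data longer than expected" scenario.

```python
import random
import string


def make_customer(seed: int, name_length: int = 12) -> dict:
    """Build one deterministic customer record for a test case."""
    rng = random.Random(seed)  # a local generator keeps tests independent
    name = "".join(rng.choice(string.ascii_lowercase) for _ in range(name_length))
    return {"id": seed, "name": name, "email": name[:20] + "@example.com"}


def fits_column(value: str, max_length: int = 50) -> bool:
    """Hypothetical check standing in for a database column limit."""
    return len(value) <= max_length


typical = make_customer(seed=1)
# An edge case built on purpose: a name longer than the assumed column limit.
oversized = make_customer(seed=2, name_length=80)
```

Since each test builds exactly the records it needs, there's nothing to store or refresh between runs.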

It will always depend on the use case, but I've seen databases with tables that hold years of data, not months or days. That affects not only cost, because it uses more storage, but also write performance. When this happens, why don't we consider keeping just the data that's needed and then moving historical data for reporting somewhere else? Or what about working with tables per day, week, or month? It's complex, but as with everything, there are always tradeoffs you need to consider.
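As a rough sketch of moving historical data somewhere else, the snippet below uses Python's built-in sqlite3 module with an in-memory database. The orders and orders_archive tables, the created_at column, and the cutoff date are all made up for the example; in a real system the archive would more likely live in a separate reporting store.

```python
import sqlite3


def archive_old_orders(conn: sqlite3.Connection, cutoff: str) -> int:
    """Move orders created before `cutoff` (an ISO date) into orders_archive."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO orders_archive SELECT * FROM orders WHERE created_at < ?",
            (cutoff,),
        )
        cursor = conn.execute("DELETE FROM orders WHERE created_at < ?", (cutoff,))
    return cursor.rowcount


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.execute("CREATE TABLE orders_archive (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "2015-03-01"), (2, "2016-07-15"), (3, "2018-04-20")],
)

moved = archive_old_orders(conn, cutoff="2018-01-01")
```

Storing dates as ISO strings is what lets a plain string comparison work here. With a real date type, the same query shape applies.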

The plan is to include TDM in your delivery pipeline. So always keep an eye on the time it takes to prepare data, and make sure to optimize. It's key to reduce lead time to deliver your software, so the less time it takes for TDM, the better.
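One simple way to keep an eye on that time is to measure the preparation step on every run and compare it with a budget. This is only a sketch: the timed helper, the five-second budget, and the placeholder prepare_test_data function are assumptions, not part of any particular tool.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(step: str, budget_seconds: float, results: dict):
    """Record how long a step took and whether it exceeded its budget."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        results[step] = {"seconds": elapsed, "over_budget": elapsed > budget_seconds}


def prepare_test_data() -> list:
    """Placeholder for the real data preparation step."""
    return [{"id": i} for i in range(1000)]


timings: dict = {}
with timed("prepare_test_data", budget_seconds=5.0, results=timings):
    rows = prepare_test_data()
```

A pipeline could fail the build, or just log a warning, whenever over_budget is true.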

It Should Be Part of CI/CD Processes

Once TDM is a self-service process that testers and developers can use when they need it (and once it runs fast), it's time to implement it in your continuous integration and delivery process.

In DevOps, every process or task that increases silos is shifted to the left. Shifting to the left means that you do it at the very beginning of the workflow. We've seen this happen with deployments, security, testing, and basically everything we need to develop and deliver software.

If we're talking about shifting to the left, it means that we should start including TDM even on a developer's machine. The same goes for testers and any other team member involved in the process. Some argue that developers and testers should generate static data for testing and that they should invest heavily in unit testing, mocks, and stubs. Some even say you should use containers. Going this route isn't just cheaper; it's also faster. I get it, and actually, I'm an advocate for that. But I won't lie; doing that is not an easy task.
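For a feel of what the mocks-and-stubs approach looks like, here's a small sketch using Python's standard unittest.mock module. The shipping_label function and the repository object with its get_customer method are hypothetical; the point is that the test supplies static data itself and never touches a database.

```python
from unittest.mock import Mock


def shipping_label(customer_id: int, repository) -> str:
    """Format a label from whatever the repository returns."""
    customer = repository.get_customer(customer_id)
    return f"{customer['name']}\n{customer['city']}"


# The stub stands in for a database-backed repository and returns static data.
repository = Mock()
repository.get_customer.return_value = {"name": "Test User", "city": "Testville"}

label = shipping_label(7, repository)
```

The tradeoff is that a stub only returns what you thought to put in it, which is why production-like data remains useful alongside it.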

So while you work on increasing test coverage with unit tests, the easiest way to start is by integrating TDM into your workflow. It's better that your process become reliable first. Then you can optimize and improve it.

Good Testing Data Increases Reliability

Want to increase deployment reliability and at the same time reduce the lead time? Invest in TDM and make it part of your process. Shift it to the left. Don't let testing become an afterthought. A sign that you're not shifting to the left enough is when development is finished and you still need to wait for tests to validate those changes.

Automate your tests. It's OK that you started testing manually, but try to take the time to automate as much as possible, including the process of preparing the data. After that, include it in your DevOps implementation. Make it part of your delivery process. But you also need to pay attention to the time it takes to generate testing data. This will force you in some way to think constantly about your data architecture.

And it's good to be thinking. If you do so, you'll always improve, and you'll decouple things that are hard not just to change but also to test.
