Mutation testing involves running your test suite many times, modifying the application code in different ways to see if the tests catch the change. If your application’s behavior can be changed without causing a test case to fail, then that particular behavior must not have been tested very thoroughly.
Example: Consider this simple function to determine whether a number is positive.
https://gist.github.com/grossvogel/a14f266c6dd3a056fac1db829187a1ab
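(The gist holds the actual code; as a minimal sketch, a function fitting that description might look like this, though the gist's exact contents may differ:)

```javascript
// isPositive.js - a minimal sketch of the function described above
const isPositive = (n) => n > 0;

module.exports = { isPositive };
```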
Suppose we ‘mutate’ the code by replacing `>` with `>=`, and the test suite still passes. This would mean that our test suite doesn’t verify that the function returns `false` when `n = 0`, an important edge case.
What if we replace the `>` with `<`, effectively reversing its output? If the tests still pass, we know this function is pretty much untested. Our standard coverage tools might tell us those lines were executed by our test suite, but failing to catch that crucial mutation would indicate that the output value isn’t checked by any assertions.
This is really the key difference between the two measurements:
- Coverage helps you understand where you need more test cases to exercise untested code.
- Mutation testing helps you understand where you need more assertions to verify the correctness of tested code, though this may require adding more test cases as well. (See the sketch after this list.)
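To make that difference concrete, here is a hypothetical pair of Jest tests against the `isPositive` sketch above; both produce identical line coverage, but only the second kills any mutants:

```javascript
// isPositive.test.js - coverage vs. assertions
const { isPositive } = require('./isPositive');

test('executes the function but asserts nothing', () => {
  isPositive(5); // counted as coverage, yet every mutant survives
});

test('asserts on the output, so mutants die', () => {
  expect(isPositive(5)).toBe(true);   // fails if > becomes <
  expect(isPositive(0)).toBe(false);  // fails if > becomes >=
  expect(isPositive(-3)).toBe(false); // fails if > becomes <
});
```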
Mutation testing in practice
I installed the Stryker mutator and tried it out on a small side project. Once I figured out how to digest the report, I spent some time adding test cases and assertions to improve the mutation score and get a feel for the benefits of mutation testing.
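(For anyone who wants to follow along, setup looked roughly like this for me; the exact commands depend on your Stryker version and test runner:)

```
npm install --save-dev @stryker-mutator/core
npx stryker init   # interactive setup; generates stryker.conf.js
npx stryker run    # mutates the code and runs the tests against each mutant
```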
Here are my impressions:
Reports are more information-dense.
The text-based reports generated by Stryker in the terminal are pretty verbose and not trivial to digest. The HTML-based report is very slick, though, and really does help you understand the code and how well it’s tested.
In the report above, you can see the overall mutation score of the file, followed by the code, annotated with details from the test run. Each red tag represents a code change that didn’t cause a test failure, and can be expanded to show the exact mutation. Mutation #10, for example, was a `StringLiteral` mutation replacing `'F'` with `""`.
You can play with the report yourself to get a feel for it.
It’s time-consuming.
Mutation testing, as implemented by Stryker, takes a lot of time to run. I ran the tool on another project with a few thousand lines of code and a decent test suite, and it took over an hour to run the full mutation suite. That involved creating over 2500 different mutations and running the full test suite nearly as many times.
That’s definitely not something I want to run on every pull request!
“Perfect” may not be the best goal
Shooting for a perfect mutation score can lead to tests that are too tightly coupled to the code. If you assert every little thing in every single test case, you can end up with a test suite that impedes rather than assists future efforts at refactoring and extension.
Let’s look at what “perfect” would look like in the code above.
To start with, the test suite that produced the report above looks like this:
https://gist.github.com/grossvogel/c8ec140628cfc707476c4e5c52dc1c8a
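(A rough sketch of such a starting point, assuming Jest and a hypothetical `getGrade` function that maps numeric scores to letter grades; the gist has the actual suite:)

```javascript
// grades.test.js - a sketch of a minimal starting suite
const { getGrade } = require('./grades');

describe('getGrade', () => {
  it('assigns a letter grade to a score', () => {
    expect(getGrade(95)).toBe('A');
  });
});
```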
To catch all of the mutations outlined in the report, we basically need to have an assertion for each of those possible grades, and another one to catch the edge case where the score is equal to one of the grade cutoff values:
https://gist.github.com/grossvogel/ab75bc567caf3f9d3472032e355f3c2c
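(Again a sketch rather than the gist's exact contents, assuming 90/80/70/60 grade cutoffs:)

```javascript
// grades.test.js - the kind of suite a perfect mutation score demands
const { getGrade } = require('./grades');

describe('getGrade', () => {
  it('assigns every possible grade', () => {
    expect(getGrade(95)).toBe('A');
    expect(getGrade(85)).toBe('B');
    expect(getGrade(75)).toBe('C');
    expect(getGrade(65)).toBe('D');
    expect(getGrade(50)).toBe('F');
  });

  it('handles scores exactly at the cutoffs', () => {
    // these assertions kill the boundary mutations, e.g. >= replaced with >
    expect(getGrade(90)).toBe('A');
    expect(getGrade(80)).toBe('B');
    expect(getGrade(70)).toBe('C');
    expect(getGrade(60)).toBe('D');
  });
});
```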
Now we have a perfect mutation score, but we also have a lot of tests to maintain given the extremely simple functionality. The ‘edge case’ test is absolutely necessary to have confidence in our code, and to me that shows the real value of mutation testing. The other test cases, though, boil down to checking that certain values were copied correctly from one place to another.
So what is the right goal?
This is a hard question, and like all hard questions, the answer is probably, “It depends.”
It depends on the codebase, the style of coding, the testing culture of your team, and the risks associated with shipping incompletely-tested code in a particular project.
One thing I think I’ve learned here is that not all types of mutations are equally critical.
Using the example above as a guide, I think tests that check logic are more useful than those that check data structures and constant values. If we wanted to trust the literals declared in our code, we could update our stryker.conf.js file to exclude those mutations: `excludedMutations: ['ArrayLiteral', 'BooleanLiteral', 'ObjectLiteral', 'StringLiteral']`. I think testing with these settings could still help detect untested logic without forcing developers to write overly verbose, brittle tests.
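(In context, the configuration might look something like this sketch; apart from excludedMutations, the fields shown are illustrative and vary by Stryker version:)

```javascript
// stryker.conf.js - a sketch; check the Stryker docs for your version
module.exports = {
  mutate: ['src/**/*.js'],
  testRunner: 'jest',
  reporters: ['html', 'clear-text'],
  mutator: {
    excludedMutations: ['ArrayLiteral', 'BooleanLiteral', 'ObjectLiteral', 'StringLiteral'],
  },
};
```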
One thing I don’t see in Stryker is a way to annotate the code to indicate that a particular bit of code doesn’t need to be mutated in certain ways, à la `/* istanbul ignore next */`. Something like this would allow us to be deliberate about which mutations we’re willing to live with while still meeting a high mutation score threshold. (Being part of the source, that annotation would be code reviewed to make sure it’s in agreement with the team’s standards and goals.)
Conclusion
I’m probably not ready to integrate mutation testing into all of my projects at this point, but I’m glad I took some time to learn about it.
I think it can be very valuable for taking stock of exactly how well tested your code is and for making course corrections to your testing tactics. It’s also been a valuable learning tool. I think a solid familiarity with both traditional coverage and mutation testing goes a long way toward understanding the ways in which your code can fail, and the ways in which testing, if used properly, can protect against those failures.