Code Coverage is a Terrible Metric
One of the areas I’m focusing on right now is the subject of testing. Automated testing in general, and at the moment, unit testing specifically. And one observation keeps jumping out at me:
Code coverage is a terrible metric
What is code coverage? Briefly, it’s the number or percentage of the lines of code in your application that are actually executed (“exercised”) when your test suite runs. The thought being that if some lines of code are executed in the course of running a test, then those lines must be “tested”. At first it sounds like a great idea, and I suppose that measuring code coverage is (maybe) better than not paying attention to what is being tested at all.
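As a quick, hypothetical illustration of what "exercised" means (the class below is invented purely for this example): if the test suite only ever calls the method with valid input, the early-return branch never runs, and a coverage tool will count those lines as uncovered.

#import <Foundation/Foundation.h>

// Hypothetical class, invented for illustration.
@interface MyAccount : NSObject
@property (nonatomic, copy) NSString *username;
- (BOOL)saveUsername:(NSString *)username;
@end

@implementation MyAccount
- (BOOL)saveUsername:(NSString *)username {
    if (username.length == 0) {
        return NO; // never executed by a happy-path-only test, so reported as uncovered
    }
    self.username = username;
    return YES;
}
@end

A coverage tool would mark the early-return line as never executed, and that single yes/no fact per line is all the metric actually captures.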
But in reality, monitoring or enforcing code coverage may do more harm than good, because it incentivizes writing junk tests and, even worse, creates a false picture of actual test coverage.
Junk Tests
What is a junk test? I'm using the term to mean a test that takes more effort to write, read, and maintain than it contributes in actual value to the project. My favorite sign that your test might be a junk test is when it contains code that is almost line for line the same as the code in the method it is testing. Here's an example:
#import <UIKit/UIKit.h>

@interface MyButton : UIButton
- (void)deactivate;
@end

@implementation MyButton
- (void)deactivate {
    self.backgroundColor = [UIColor grayColor];
    self.userInteractionEnabled = NO;
    self.titleLabel.textColor = [UIColor whiteColor];
    self.alpha = 0.5f;
}
@end
#import <XCTest/XCTest.h>

@interface MyButtonTests : XCTestCase
@end

@implementation MyButtonTests
- (void)testDeactivate {
    MyButton *testButton = [[MyButton alloc] init];
    [testButton deactivate];
    XCTAssertEqualObjects(testButton.backgroundColor, [UIColor grayColor]);
    XCTAssertEqual(testButton.userInteractionEnabled, NO);
    XCTAssertEqualObjects(testButton.titleLabel.textColor, [UIColor whiteColor]);
    XCTAssertEqual(testButton.alpha, 0.5f);
}
@end
As you can see, the test calls the -deactivate method on an instance of MyButton, and then checks, line by line, whether the values that were set inside the method were actually set. Here are a few of the many reasons why this is a bad unit test:
1. Someone had to write the same code twice. Rather than making the implementation code more resilient, it actually doubles the opportunities for bugs. The developer is just as likely to make a mistake specifying the expected property values in the test as they are to make a mistake specifying the desired values in the method itself. It's like writing an app twice to do a task once: twice the risk of human error, with the same functionality you would have had anyway.

Put another way, if this test didn't exist, there would be one possible place for a developer error: the implementation of -deactivate. With the addition of this test, there are now two ways things can go wrong: the test is wrong and the implementation is right (an incorrectly failing test), or the test is wrong and the implementation is wrong (an incorrectly passing test, if they are both wrong in the same places). There's actually a third possibility too: the implementation is wrong, and the test is simply incomplete and doesn't catch the error. The last two possibilities are devious because the developer probably assumes that the implementation is working properly, since "it is unit-tested".

2. The test is mostly just verifying that UIKit works. If the method implementation is supposed to set self.userInteractionEnabled = NO, it's much faster to see in a code review or a single quick inspection that it did so. Checking that a call to set userInteractionEnabled = NO actually results in userInteractionEnabled == NO is wasted effort with no payoff. If you thought to yourself, "Aha, but what if no one is reviewing or inspecting the code in the first place, wouldn't this test help to double-check?", my answer is covered in point 1 above: if no one is reviewing or inspecting the code, then how do you know the unit test itself is correct and checking the right thing anyway?

3. Good tests will normally stay the same once written. Then, when implementations change, the test code will catch errors in logic, refactors, and so on. For example, imagine a unit test that verifies a function that finds the distance between two points. The test would simply instantiate two points with specific values, pass them into the function being tested, and compare the resulting distance to the known correct answer for what that distance should be (see the sketch just after this list).

This is a good unit test. If the implementation of the distance function changes, the test will continue working without modification, because the same inputs should still produce the same correct answer.

On the other hand, in the case of the earlier bad test example for the MyButton class, the test has to be rewritten every single time the implementation changes. For example, if the -deactivate implementation is updated to set the text color of the disabled button to black instead of white, then the test method has to make the same change. Which, again, creates extra opportunities for mistakes in the course of maintaining the test, and highlights once more that the test is literally doing the same work the implementation already does.
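To make point 3 concrete, here is a minimal sketch of what such a distance test might look like. The MyGeometry class and its distanceBetweenPoint:andPoint: method are hypothetical names invented just for this illustration; what matters is the shape of the test.

#import <XCTest/XCTest.h>
#import <UIKit/UIKit.h> // for CGPoint
#include <math.h>

// Hypothetical class under test, invented for this illustration.
@interface MyGeometry : NSObject
+ (CGFloat)distanceBetweenPoint:(CGPoint)a andPoint:(CGPoint)b;
@end

// One possible implementation, shown only so the sketch is complete.
@implementation MyGeometry
+ (CGFloat)distanceBetweenPoint:(CGPoint)a andPoint:(CGPoint)b {
    return (CGFloat)hypot(b.x - a.x, b.y - a.y);
}
@end

@interface MyGeometryTests : XCTestCase
@end

@implementation MyGeometryTests

- (void)testDistanceBetweenPoints {
    // Known inputs with a known correct answer: a 3-4-5 right triangle.
    CGPoint a = CGPointMake(0.0, 0.0);
    CGPoint b = CGPointMake(3.0, 4.0);

    CGFloat distance = [MyGeometry distanceBetweenPoint:a andPoint:b];

    // The assertion encodes the expected answer, not the implementation's steps.
    XCTAssertEqualWithAccuracy(distance, 5.0, 0.0001);
}

@end

Notice that the test never repeats the distance formula; it only states inputs and the expected output, which is exactly why it survives refactors of the implementation.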
These types of junk tests are certainly a lot of wasted effort, but they are also dangerous because they create more opportunities for bugs. The simple fact is that there is no worthwhile unit test that can be written for a method like the -deactivate example above, and developer time is better spent on code review or on writing better tests for actual business logic. Note that there may be valid integration or UI automation tests to verify aspects of the button behavior above, but that is a separate discussion and outside the scope of this post.
So, if there are no good unit tests for the -deactivate method in the example above, why do some developers go to the trouble of writing a junk test? The number one reason I've seen is a fixation on having "100% code coverage". Because the method exists, it must be tested, even if pointlessly, in order to signal to a code coverage tool that "we have tested all these lines of code".
Smarter teams will carefully manage which files are counted in a code coverage report, and exclude views and other classes with no business logic from the total percentage. If you do that, you can probably avoid having your team write junk tests. But it will not help you avoid the second, and even bigger, pitfall of code coverage.
A False Picture
By far the worst thing about using the code coverage metric is the completely false premise that a single unit test for a method equals a tested method. This is almost always untrue. Let's start with an example in which we're creating our own array collection class:
@interface MyArray : NSObject
- (void)addObject:(id)object;
- (id)objectAtIndex:(NSUInteger)index;
@end
We don’t need to go into the implementations of these methods here, so let’s just look at writing some unit tests instead.
#import <XCTest/XCTest.h>

@interface MyArrayTests : XCTestCase
@end

@implementation MyArrayTests

- (void)testAddObject {
    MyArray *array = [[MyArray alloc] init];
    NSString *stringObject = @"TestStringObject";
    [array addObject:stringObject];
    XCTAssertEqualObjects(stringObject, [array objectAtIndex:0]);
}

- (void)testObjectAtIndex {
    MyArray *array = [[MyArray alloc] init];
    NSString *stringOne = @"StringOne";
    NSString *stringTwo = @"StringTwo";
    [array addObject:stringOne];
    [array addObject:stringTwo];
    XCTAssertEqualObjects(stringTwo, [array objectAtIndex:1]);
}

@end
The above two test methods are acceptable tests. They are not junk, since they verify intended functionality without recreating or paralleling the internal workings of the methods under test. So what is wrong with the above? What's wrong is the omission of all the other tests needed to verify that this array will actually behave as intended.
Each of the two tests above checks what is often called "the happy path": the most optimistic, ideal scenario the code will be used in. It's important to test the happy path, so that you know your code behaves as expected in the most commonly anticipated circumstances. But it's actually more helpful and more important to test the edge cases, the unsafe cases, and the times when your code may not be used exactly as you envisioned or hoped.
For the example methods above:
- What happens if you try to add a nil object?
- What happens if you try to add something that isn't an object at all, like an int?
- Do newly added objects go to the beginning of the array, or the end? How are we verifying that?
- What happens if you add the same object twice?
- What happens if you call -objectAtIndex: on an empty array?
- What happens if you pass a negative int as the index parameter instead of an unsigned int?
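As a rough illustration, here is what a couple of those edge-case tests might look like if they were added to the MyArrayTests class above. The expected behaviors asserted below (returning nil for an out-of-bounds index, silently ignoring a nil add) are assumptions made purely for this sketch; the real expectations should come from however you've specified MyArray to behave.

- (void)testObjectAtIndexOnEmptyArrayReturnsNil {
    MyArray *array = [[MyArray alloc] init];

    // Assumption for this sketch: an out-of-bounds lookup returns nil
    // instead of raising an exception.
    XCTAssertNil([array objectAtIndex:0]);
}

- (void)testAddingNilObjectDoesNotInsertAnything {
    MyArray *array = [[MyArray alloc] init];

    // Assumption for this sketch: adding nil is silently ignored.
    [array addObject:nil];
    XCTAssertNil([array objectAtIndex:0]);
}

If your specification says these cases should raise an exception instead, XCTAssertThrows would be the assertion to reach for.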
Questions like the ones above are examples of edge cases, and they are exactly the sorts of things your unit tests should be checking and verifying correct behavior for. And while the two example unit tests represent maybe 10% or less of the total test cases needed to fully verify correct behavior for the two existing methods, they add up to "100% code coverage" according to a code coverage tool!
And this is the absolute worst thing about "code coverage" as a goal or a metric: it doesn't have any correlation to the existence of good tests, or sufficient tests. A junk test written for every method in an application adds up to 0% of needed test coverage, but can equal 100% code coverage. One good unit test per method in an application may be 5% to 15% of the actual needed test coverage, but will add up to 100% code coverage. And even if you cover some number of edge cases in your tests and get up to 60% of the test coverage the application should have, the code coverage remains at — you guessed it — 100%. All of which means that "100% code coverage" is a completely meaningless number that does not measure anything of value.
And yet, many teams and projects hold up “100% code coverage” as a badge of pride or a guiding principle. While it’s great to prioritize the activity of testing, using code coverage as a way to measure adherence to good testing practices is about as effective as using an arbitrary metric like “overall team confidence in code”. And it’s likely worse, because “overall team confidence in code” could at least conceivably be more related to effective and sufficient test cases than the "code coverage" metric is. And having an “overall team confidence in code” rating of 70% would suggest that there is more work to be done, while many teams and developers I’ve observed use a “code coverage” of 100% to mean that nothing more needs to be done on the testing front.
Alternatives
So clearly, I would like to see “code coverage” be a passing fad that dies a fiery death in the near future, but what should we replace it with? Well, that is something I’m working on for a future post, but at a high level, I believe that any application should have an evolving and updated statement of all valid and invalid cases for important functionality and business logic. And each valid and invalid case should have an expected result or behavior that means the code is performing as intended. With that in place, a useful test coverage metric would be the percentage of those cases verified by a unit test. This removes the need for junk tests on view methods, and doesn’t rely at all on a simple boolean of whether code ran during the test suite or not (which is the sole basis of the code coverage metric). Of course, part of the challenge would then be having an automated and easy way to generate a report of the percentage of cases tested, which is part of what I’m trying to solve currently.
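To make the idea a bit more concrete, here is one naive way such a case list could be represented and measured. This is only a sketch to illustrate the metric, not the tooling I have in mind; the case descriptions and test names are invented for the example.

#import <Foundation/Foundation.h>

int main(int argc, char *argv[]) {
    @autoreleasepool {
        // Each entry maps a required behavior (valid or invalid case) to the
        // name of the unit test that verifies it, or NSNull if none exists yet.
        NSDictionary<NSString *, id> *cases = @{
            @"adding an object stores it at the end"        : @"testAddObject",
            @"objectAtIndex: returns the object added"      : @"testObjectAtIndex",
            @"adding nil does not insert anything"          : [NSNull null],
            @"objectAtIndex: on an empty array returns nil" : [NSNull null],
        };

        NSUInteger verified = 0;
        for (NSString *caseDescription in cases) {
            if (![cases[caseDescription] isEqual:[NSNull null]]) {
                verified++;
            }
        }

        // "Case coverage": the percentage of enumerated cases with a verifying test.
        double caseCoverage = (double)verified / (double)cases.count * 100.0;
        NSLog(@"Case coverage: %.0f%% (%lu of %lu cases verified)",
              caseCoverage, (unsigned long)verified, (unsigned long)cases.count);
    }
    return 0;
}

The interesting (and unsolved) part is generating and updating a report like this automatically, which is exactly what the alternative metric hinges on.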
In fact, perhaps the real reason why code coverage has gained any traction at all, is because it is a metric that can easily be measured automatically by a computer and is therefore easy to establish, update, and put on a team dashboard (or feed into a continuous integration system). And replacing it with a better metric may require having something else that can be automatically measured and computed just as easily.
Regardless of how you are currently measuring (or not measuring) your testing practices, please stop correlating "code coverage" with "fully tested" or even "well tested". And start thinking about what it actually takes to make an app well-tested. Stay tuned for more thoughts on this in the future, and in the meantime, share this post with people and projects you care about. :)
Posted in: code coverage, testing, unit tests