Monday, November 2, 2015

Test Specification, Over-specification, and Under-specification

A software test looks for the effects of a system or component in its environment when executed under specific conditions. This can be as simple as checking the outputs for a given input. Or it can be more complex, such as looking for changes in the component's surrounding environment (databases, other processes, UI, etc.) as the component runs. Simply put, a test checks what the component does. 
Over time, practitioners realized that small components -- units -- can be tested in isolation to find bugs early and obtain quick feedback in a controlled fashion. The benefits are many, including greater automation, comprehensiveness, design quality, and facilitation of refactoring. But there are also common problems that cause a lot of pain: over- and under-specification. 

Test Scope

The challenge of controlling the unit's entire environment soon became a dominant issue. Instead of only looking (or, should I say, digging) for the effects of the unit on external resources when testing, we were now looking primarily at the effects of the unit on other code pieces.

In the early days of unit testing, it was often good enough to verify the visible state of the unit based on return values, side effects on parameters, and queries to its public interface. If applicable, the unit's effects on global resources (those visible to the unit's clients, such as the file system) would be checked too. The following diagram, on the left, illustrates this concept.
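
To make the visible scope concrete, here is a minimal sketch in Python. The ShoppingCart class and its methods are hypothetical, invented only for illustration. The test touches nothing but return values, a side effect on a parameter, and queries to the public interface.

```python
class ShoppingCart:
    """Hypothetical unit under test; not taken from any real code base."""

    def __init__(self):
        self._items = {}  # name -> (price, quantity)

    def add(self, name, price, quantity=1):
        # Returns the quantity now held for this item.
        _, held = self._items.get(name, (price, 0))
        self._items[name] = (price, held + quantity)
        return held + quantity

    def total(self):
        return sum(price * qty for price, qty in self._items.values())

    def drain_into(self, receipt_lines):
        # Side effect on a parameter: one line per item, then the cart is emptied.
        for name, (price, qty) in sorted(self._items.items()):
            receipt_lines.append(f"{name} x{qty} = {price * qty}")
        self._items.clear()


def test_visible_state():
    cart = ShoppingCart()
    assert cart.add("pen", 2, quantity=3) == 3  # return value
    assert cart.add("pen", 2) == 4
    cart.add("ink", 5)
    assert cart.total() == 13                   # query to the public interface

    lines = []
    cart.drain_into(lines)                      # side effect on a parameter
    assert lines == ["ink x1 = 5", "pen x4 = 8"]
    assert cart.total() == 0


test_visible_state()
```

Nothing in this test knows that the cart stores its items in a dictionary. Everything it checks is something a client of the cart could observe.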

The visible scope is an incomplete picture, however. A unit is not a program. Alone, the unit makes little sense. It is the interaction with other units -- both callers and callees -- that makes the unit valuable. Those interactions should also be verified. Although the test assumes the caller's role, interactions with collaborators (callees) and even other resources (e.g., RESTful services) are often hidden from the test's perspective, as the figure illustrates on the right.

Fortunately, as the need to check the environment past this visibility barrier became evident for proper testing, tricks started to emerge, which eventually led to the creation of modern mocking and spying frameworks such as Mockito and PowerMock in Java.
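
As a rough illustration of looking past the visibility barrier, the sketch below uses Python's standard unittest.mock module in place of a Java framework. The Notifier class and the mailer's send method are assumptions made up for this example. The unit returns nothing and exposes no state, so its only observable effect is the call it makes to its collaborator.

```python
from unittest.mock import Mock


class Notifier:
    """Hypothetical unit whose only effect is a call to a collaborator."""

    def __init__(self, mailer):
        self._mailer = mailer

    def welcome(self, user_email):
        # No return value and no visible state: the effect is hidden from the caller.
        self._mailer.send(to=user_email, subject="Welcome")


def test_welcome_sends_one_email():
    mailer = Mock()  # stand-in for a real mail client (assumed interface)
    Notifier(mailer).welcome("ada@example.com")
    # Fails unless send() was called exactly once with these arguments.
    mailer.send.assert_called_once_with(to="ada@example.com", subject="Welcome")


test_welcome_sends_one_email()
```

Without the mock, this test would have to inspect a real mailbox to learn whether the unit did its job.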

What vs. How

Testing a unit (or component or system) should cover both the calls to the unit and the underlying interactions with its collaborators. The entirety of the effects of the unit in this environment defines, in short, how the unit behaves. This idea of how the unit works can also be extended to its implementation. But is this what needs to be tested? Are all calls and interactions relevant to testing?

Let's assume that all effects of the unit are verified in a test scenario and the test passes. So far, so good. The unit conforms to the expectations of the test. Later, though, we decide to improve the unit's implementation, adapt it to changes in other units, or add a new capability (e.g., sorting some results) without affecting what the unit already promises to do in that scenario. But now, some part of the unit's state or interactions makes the test fail. Why? Because the test is over-specified. It goes too far. It checks too much.

An over-specified test checks how the unit works rather than just what it is supposed to do. This poses a maintainability problem that makes software evolution harder rather than easier -- the opposite of what good testing is supposed to allow! We could react by stripping down the test to a minimum or even deleting it, but then the test could be under-specified. It could fail to detect what the unit should really do. Pick your poison.
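
Here is one way the two poisons might look, sketched in Python. Everything in it is hypothetical: we assume a unit that promises to return the names of all active users and says nothing about their order. Version 2 adds sorting, the kind of new capability mentioned earlier.

```python
# Assumed spec: return the names of all active users, in no promised order.

def active_names_v1(users):
    return [u["name"] for u in users if u["active"]]


def active_names_v2(users):
    # New capability: results are now sorted. The assumed spec still holds.
    return sorted(u["name"] for u in users if u["active"])


def active_names_buggy(users):
    # Broken on purpose: forgets to filter out inactive users.
    return [u["name"] for u in users]


USERS = [
    {"name": "zoe", "active": True},
    {"name": "bob", "active": False},
    {"name": "amy", "active": True},
]


def over_specified(unit):
    # Pins the incidental ordering that v1 happens to produce.
    return unit(USERS) == ["zoe", "amy"]


def under_specified(unit):
    # Only checks that something came back.
    return len(unit(USERS)) > 0


def matches_spec(unit):
    # Checks what is promised and nothing more: the right names, order ignored.
    return sorted(unit(USERS)) == ["amy", "zoe"]
```

The over-specified check passes for v1 and fails for v2, even though v2 still keeps its promise. The under-specified check passes for all three versions, including the buggy one. Only matches_spec accepts v1 and v2 while rejecting the bug.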


Where do we find the balance? What is the right boundary? To answer this, we need a very important concept in software engineering: specifications.

A specification is a description of what a system or a part of it does under given conditions. It can be formal or informal, documented or implicit, but in the end it boils down to the contract with the user or environment -- what the system requires and what it guarantees. How those guarantees are achieved is a different matter. That is implementation-specific.
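
A tiny sketch of a contract written down as "requires" and "guarantees" may help. The integer_sqrt function and its contract are assumptions chosen for illustration, and the implementation leans on math.isqrt from the standard library (Python 3.8 or later). The check is written against the guarantee alone, so it would keep working if the implementation were swapped for a hand-written loop.

```python
import math


def integer_sqrt(n):
    """Hypothetical contract.

    Requires:   n is an int and n >= 0.
    Guarantees: returns r such that r*r <= n < (r+1)*(r+1).
    """
    if not isinstance(n, int) or n < 0:
        raise ValueError("n must be a non-negative integer")
    return math.isqrt(n)  # one possible implementation among many


def satisfies_guarantee(n, r):
    # Restates the guarantee directly; knows nothing about how r was computed.
    return r * r <= n < (r + 1) * (r + 1)


def test_guarantee_holds():
    for n in range(200):
        assert satisfies_guarantee(n, integer_sqrt(n))


test_guarantee_holds()
```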

Returning to our over-specified test, we realize after a while that perhaps it did not matter after all whether and how often the unit reads from the database or a cache (assuming that performance is not part of the specification). It only matters that the unit fulfills its goal -- its contract -- correctly. Therefore, if the unit replaces some database queries with cache accesses, our test should not care. It should not fail just because the unit's internals changed.
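
The database-versus-cache case can be sketched as follows, again in Python with unittest.mock. The two profile classes and the get_user method on the database are hypothetical names. We assume the contract is only that display_name returns the user's name in title case.

```python
from unittest.mock import Mock


class DirectProfiles:
    """Hypothetical unit: queries the database on every call."""

    def __init__(self, db):
        self._db = db

    def display_name(self, user_id):
        return self._db.get_user(user_id)["name"].title()


class CachedProfiles:
    """Same assumed contract, but repeated lookups are served from a cache."""

    def __init__(self, db):
        self._db = db
        self._cache = {}

    def display_name(self, user_id):
        if user_id not in self._cache:
            self._cache[user_id] = self._db.get_user(user_id)["name"].title()
        return self._cache[user_id]


def make_db():
    db = Mock()  # stand-in for a real database client
    db.get_user.return_value = {"name": "ada lovelace"}
    return db


def contract_test(unit_cls):
    # Checks only what the unit promises to return.
    unit = unit_cls(make_db())
    return unit.display_name(7) == "Ada Lovelace" and unit.display_name(7) == "Ada Lovelace"


def query_count_test(unit_cls):
    # Over-specified: insists on one database query per call.
    db = make_db()
    unit = unit_cls(db)
    unit.display_name(7)
    unit.display_name(7)
    return db.get_user.call_count == 2
```

Here contract_test passes for both classes, while query_count_test passes for DirectProfiles and fails for CachedProfiles. If performance were part of the specification, a check on the number of queries would be legitimate again.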

So far, so good, right? Well, almost. What is the unit's goal after all? What is its specification? If we know it, we all agree on it, and it's stable, then we are good. We make sure we check only the behavior covered in the specification.

Often, however, we don't know exactly what the unit's behavior should be. The specification is still in early stages. We might disagree with coworkers. The unit might be old and written by developers who are no longer around. Requirements can change. All of this puts stress on the boundary between the what (specification) and the how (implementation) of a unit.

In Practice

So, specifications are often unclear. In those cases, our goal is to continue iterating on the unit and its specification. The primary role of tests here is to aid that process. If we understand that the specification needs to be defined eventually -- the sooner and the clearer, the better -- and achieve that goal, we can finally focus on completing its implementation and its regression test suite.

In all, we want to settle on that specification so we don't have to choose between over-specified tests that play safe but make change hard and under-specified tests that can miss important bugs.