THE NIGHT TIME FALLACY
Failure Pattern summary
Targeting "Night runs" can lead to inefficient automation
This failure pattern has been added by Michael Stahl. Failure patterns are also called "anti-patterns", as they are things you shouldn't do. They are Issues in this wiki.
Look out for this failure pattern if your management checks that you meet the schedule but not the quality of the automation.
One of the big selling points of test automation is the idea of “tests running at night”. This is a compelling vision: as the workday draws to a close, the testers fire up their automated test system, which runs unattended during the off hours, completes the test cycle and has the results ready when the testers come in the next morning. It is such an idyllic vision… you can almost hear the violins play.
But this vision drives some behaviors that result in an inefficient test automation system.
If tests run at night, it seems they cost the same whether they run for 1 hour or 8 hours. It’s still all off-shift. This means that testers can add many more tests and don’t have to be that diligent about what tests to add and how much test time these tests take.
This behaviour leads to the following problems:
- The night is too short
As the project progresses, older versions enter the maintenance phase and new versions are added, so more and more tests go into the Nightly Run. Eventually, you run out of night and test runs continue into the day, the next night, and so on.
- One way to deal with this would be to re-assess the test strategy: redesign tests and reduce test count and test time. The problem is that this calls for a lot of engineering time, and the engineers are busy with work that seems more critical and important than optimizing regression test suites.
- A different approach is to get more machines and run tests in parallel. A Corporate Truism: It’s easier to get budget for machines than for more testers. This corporate-world fact-of-life means that in most cases organizations end up buying more machines and splitting the regression tests over a large machine pool. Test time is down again to the one-night length. More tests are added… more machines are added… more code is added to manage these machines…
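As a rough illustration of why the pool keeps growing, the sketch below computes the smallest pool that fits the whole suite into one night. It is a minimal model under loud assumptions: tests split perfectly across machines with no setup overhead, the 10-hour night and the 20 extra hours of tests per release are made-up numbers, and `machines_needed` is a hypothetical helper, not part of any tool.

```python
import math

def machines_needed(total_test_hours: float, night_hours: float) -> int:
    """Smallest pool that finishes total_test_hours within one night.

    Assumes (hypothetically) that tests split perfectly across machines
    and that machines have no setup or scheduling overhead.
    """
    if night_hours <= 0:
        raise ValueError("night_hours must be positive")
    return math.ceil(total_test_hours / night_hours)

# Hypothetical growth: each release adds 20 hours of regression tests
# and no test is ever retired.
NIGHT_HOURS = 10  # assumed off-shift window
suite_hours = 8.0
for release in range(1, 6):
    print(f"release {release}: {suite_hours:5.1f} h of tests -> "
          f"{machines_needed(suite_hours, NIGHT_HOURS)} machine(s)")
    suite_hours += 20
```

With no tests ever retired, the required pool grows in step with the suite; only pruning or speeding up tests bends that curve.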
- Inefficient Logs
Eventually, you hit another problem: since tests were never written with optimization in mind, their logs are also inefficient. They generate either too much or too little information. In both cases, testers spend a good deal of their time going over test logs to figure out why something failed, what is a real issue and what can be ignored. So eventually you end up with too few testers, and learn another lesson. Test Automation Truism: Machines create work for more testers. You end up with a lot of machines running an inefficient set of automated test cases and a large test automation team whose main occupation is maintenance and wading through results.
- The Tragedy of the Commons (as applied to test systems)
Let’s take an example: A project has five test teams, all using the same machine pool to run their (inefficient) regression tests. The pool is overloaded and it takes time to get the test results back. Each of the teams would be wise to reduce its test time, but here is an interesting calculation:
Assume each team consumes 12 hours of test time. Run one after another on the shared pool, the regression tests end in 60 hours.
Let’s say that team A decides to invest time and cut its test time by 50%. Its tests will now run in 6 hours. The overall regression test cycle time will be… 54 hours.
So, for the super-human effort team A invested in test time reduction, they effectively got a 10% test time reduction, which is soon used up by the other teams.
The result is that teams have little incentive to improve their test time.
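The arithmetic in this example can be replayed in a few lines. This is a toy model that assumes, as the example does, that the shared pool effectively runs all teams' tests back to back; `regression_cycle_hours` is a hypothetical name, and team B adding 6 hours is an invented illustration of the freed time being used up by another team.

```python
def regression_cycle_hours(team_hours):
    """Shared-pool model from the example: every team's tests run
    back to back, so the cycle is the sum of all teams' test hours."""
    return sum(team_hours.values())

teams = {name: 12.0 for name in "ABCDE"}   # five teams, 12 h each
before = regression_cycle_hours(teams)     # 60 h

teams["A"] *= 0.5                          # team A halves its test time
after = regression_cycle_hours(teams)      # 54 h
saving = (before - after) / before         # only 10% for the whole cycle

teams["B"] += 6.0                          # hypothetical: team B fills the freed time
refilled = regression_cycle_hours(teams)   # back to 60 h

print(f"{before:.0f} h -> {after:.0f} h ({saving:.0%} shorter) -> {refilled:.0f} h")
```

Team A's 50% effort shows up as a 10% change in the number everyone watches, and disappears entirely once another team adds tests.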
It may be interesting to compare this situation to the economic principle named The Tragedy of the Commons.
“The Tragedy of the Commons is the depletion of a shared resource by individuals, acting independently and rationally according to each one's self-interest, despite their understanding that depleting the common resource is contrary to the group's long-term best interests.” (http://en.wikipedia.org/wiki/Tragedy_of_the_commons).
Here too, each team, doing what is right for itself, ends up using the available testing resources inefficiently.
- The Turnpike Effect
The Turnpike Effect (aka The Parking Lot effect) explains what happens when more capacity is introduced to a constrained system. The larger capacity ultimately results in increased volume of usage, up to the full capacity of the larger system [EReq, p.272-273].
Applied to a test machine pool, the Turnpike Effect predicts that any number of test machines added to an existing pool will quickly become busy running new tests, added simply because the extra capacity is there.
- Test is Automation
In some teams Test Automation becomes the holy grail. It is promoted as the most important effort; engineers working on test automation get exposure and recognition, while engineers whose main skill set is in creating good tests are not recognized enough.
Programming skills are nurtured; testing skills are neglected. New members are added to the team based on their ability to write code rather than on their ability as testers.
The lack of testing skills means a faster overload of the available test machine pool, as unskilled testers tend to create more tests than are needed, with little attention paid to test time. Making the risk decisions about what tests can be removed or how they can be optimized is beyond the test-skill capabilities of the test team.
Added to the basic pattern of The Night Time Fallacy, these phenomena almost guarantee that the test machine pools will be overloaded beyond capacity and will provide inferior testing services to the organization.
Experiences
Michael Stahl on his experiences:
Dealing with this pattern calls for a number of actions.
First, you should invest resources and effort in improving your testers’ testing skills. Knowledge of testing principles and techniques will increase the testers’ confidence when making risk and tradeoff decisions such as reduction of test time or smarter selection of test cases. The level of attention to testing skills needs to be at least equal to the attention given to coding skills. Automation needs to take its proper place in the skill hierarchy: automation serves testing; it’s not a goal by itself.
A second critical action is to link efficient test time directly to the teams’ performance goals. The easiest way to do this (and one that has immediate results) is to allocate machines to teams. This can be done physically, or through an accounting or test routing mechanism.
Once this is done, the three conditions for a Breakthrough System [BrP, p.29] are in place:
- The team has a clear understanding of the results expected from their work
- The team has immediate and valid feedback on current results vs. expected results
- The team has control of all the resources needed to meet the expectations
The first two conditions are either already in place as part of existing project-level reports or can be created relatively easily with existing test results data.
It’s the improved testing skills and the test machine allocation that make the third condition possible.
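For example, a per-team comparison of actual versus expected test time might be derived from existing run records along these lines. The record layout `(team, test_name, hours)`, the per-team budgets, and the function name `time_budget_report` are all hypothetical; a real system would read whatever its result database or CI logs actually store.

```python
from collections import defaultdict

def time_budget_report(results, budgets):
    """Compare each team's consumed test time with its expected budget.

    results: iterable of (team, test_name, hours) rows from the last cycle.
    budgets: dict mapping team -> expected test hours per cycle.
    Returns team -> (actual_hours, budget_hours, within_budget).
    Teams without a budget entry are left out of the report.
    """
    actual = defaultdict(float)
    for team, _test_name, hours in results:
        actual[team] += hours
    # A team with a budget but no results reports 0.0 actual hours.
    return {team: (actual[team], budget, actual[team] <= budget)
            for team, budget in budgets.items()}

runs = [("A", "login", 2.5), ("A", "checkout", 3.0), ("B", "search", 7.0)]
for team, (used, budget, ok) in time_budget_report(runs, {"A": 6.0, "B": 6.0}).items():
    print(f"team {team}: {used:.1f} h used of {budget:.1f} h budget "
          f"{'OK' if ok else 'OVER'}")
```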
Once the team’s testing skills are improved, tests can be optimized successfully. Once the team owns the test machines, any action that improves test efficiency immediately shows in the reports as the team’s success.
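One way to picture a test routing mechanism for allocated machines is a scheduler that only sends a team's jobs to machines that team owns. The sketch below is a minimal, hypothetical model (the `team_finish_times` name, the job list format, two machines per team and one-hour tests are all assumptions): each job goes to whichever of the team's own machines frees up first, so a team's finish time depends only on its own tests.

```python
import heapq
from collections import defaultdict

def team_finish_times(jobs, allocation):
    """Route each (team, hours) job only to machines owned by that team.

    allocation maps team -> number of dedicated machines (at least 1).
    Each job goes to the team's machine that becomes free first; a job
    for a team with no allocation raises KeyError.
    Returns team -> hours until that team's last job finishes.
    """
    pools = {}
    for team, count in allocation.items():
        if count < 1:
            raise ValueError(f"team {team} needs at least one machine")
        pools[team] = [0.0] * count  # "free at" times; an all-equal list is a valid heap
    finish = defaultdict(float)
    for team, hours in jobs:
        start = heapq.heappop(pools[team])  # earliest free machine of this team
        end = start + hours
        heapq.heappush(pools[team], end)
        finish[team] = max(finish[team], end)
    return dict(finish)

teams = "ABCDE"
allocation = {t: 2 for t in teams}  # hypothetical: two machines per team
before = team_finish_times([(t, 1.0) for t in teams for _ in range(12)], allocation)

# Team A halves its suite; the other teams change nothing.
after_jobs = [("A", 1.0)] * 6 + [(t, 1.0) for t in "BCDE" for _ in range(12)]
after = team_finish_times(after_jobs, allocation)
print(before["A"], after["A"], after["B"])
```

In this model team A's own finish time drops by half, from 6 to 3 hours, instead of being diluted across a shared pool, which is the incentive the allocation is meant to create.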
My experience shows that once these actions are taken, you will see a fast improvement in test efficiency.