Application Note: TestWorks Quality Index

TestWorks Quality Index™
Application Note

The TestWorks Quality Index:
A Quantitative Quality Index for Your Application Development Process

OVERVIEW

Assessing the relative quality of a software system is a complex but important matter in software engineering. Making reasoned decisions about complex software requires an approach that combines analysis of product properties with analysis of the underlying software construction process. A weighted figure-of-merit software quality index -- the TestWorks Quality Index described here -- offers an attractive way to do this because it takes into account software quality metrics, process assessments, and other practical considerations.

TestWorks Quality Index	Assess the quality of your quality process with the TestWorks Quality Index.
Table I -- Software Quality Process Filters	Assesses relative advantages and disadvantages of various software quality methods.
Table II -- TestWorks Quality Index Definitions	Summarizes the definitions of the factors that make up the TestWorks Quality Index and shows how to compute it for YOUR process.
Table III -- Product Application Profile	Gives a recommended composition of use of TestWorks products and indicates likely CMM-like levels and relative overall process efficiencies.


INTRODUCTION

Common questions in software development include:

How good is the quality of a specific software application?
How do I determine the quality of a specific software application?
How can I make decisions about my project, for example, should it be released yet?
How can I make decisions about what I ought to do in my project to improve its quality?

The Software Quality Index approach to this problem is to assess the quality of a particular application by weighing the answers to questions that address BOTH the properties of the application itself and the characteristics of the process used to produce it.

QUALITY PROCESS ASSESSMENT METHODS

The SEI CMM and the ISO-9000 type quality process models are based on examining the process that produces the product. This approach is based on the well-documented fact that a better industrial process tends to produce a better product, and that continual incremental improvements to that process tend to lead to continual incremental improvements in its product. This simple method can account for spectacular quality gains.

While this technique is clearly valid in general terms, sometimes good processes produce bad products and bad processes produce good products. This happens annoyingly often in software products, perhaps because some of the intermediate elements of the process can be very difficult to measure.

Quality Process Models accept such exceptions and focus on the main point: improving the process improves the product. The exceptions are treated as anomalies.

PRODUCT METHODS

The Product Analysis approach, often called the metrics approach or the static analysis approach, takes the opposite tack: look at the final product only, and base decisions about its quality on what is actually there, regardless of how it got there. After all, the final source code itself completely determines what an application can do. Regardless of how it was produced, regardless of the methodology or tools or process used to make it, the actual quality of a software product is determined directly by its own internal, intrinsic properties.

Even if it is junky, spaghetti code hacked together by rank amateurs, if it works well then it works well. Who needs a fancy software process, anyway? Simply put, quality is determined in the contest of the marketplace. Of course, given that quality is implicit in the as-built product, we still have to find a way to measure it if we want some measure of control over the result. To measure the quality of an application by its structural properties or content, we use software metrics (e.g., cyclomatic complexity, size metrics -- there are hundreds of possible metrics).

We know from experience that sometimes applications with very poor software quality metrics, e.g., with E(n) in the 100's and Halstead weights in the 1,000,000's, are perfectly good, perfectly reliable code and don't have any field problems. At the same time, some products that are high-scoring by every metric one can find are complete disasters.
Automated static analysis attempts to mechanize code inspection methods, but most implementations tend to find far too many things wrong. Long lists of product features that are "dangerous" but not "fatal" suggest not that a product will fail in the field, but that the builders don't mind living on the edge.
For example, if you always fixed everything that /bin/lint said ought to be fixed, assuming you could afford to do that, your software quality would be sure to go up, but there would be no guarantee that your delivered functionality would change for the better. It might change for the worse! You could be spending money to improve the product and changing it for the worse.

The paradox is that just having measurable high-quality code that meets small-scale and large-scale quality guidelines is no more a measure of field product success than having a perfect manufacturing process.

PROCESS/APPLICATION ASSESSMENTS

A combined process/application assessment brings together the strengths of each of these approaches. A software quality manager needs to take four factors into account: HOW a product was built; WHAT its characteristics are; WHY better quality is important; and how the producers and their management -- the team -- FEEL about how good the team/product combination is.

A multi-faceted assessment method can be fooled too, of course, but its strength is that it focuses on the aspects of both process assessment and product assessment that are perceived as key to quality.

HOW a product was built is addressed by questioning whether certain basic quality-oriented processes are used in its construction. For example: Was coverage analysis done? What coverage metrics, and how thoroughly? (How well was it tested?) Was automated regression done? How thoroughly? (How well was it tested?)
WHAT its characteristics are is addressed by identifying certain basic metrics that pertain to its actual functional content. For example: What is the average E(n) for functions? (How complex is the code?) What is the calltree aspect ratio? (How is the code shaped?)
WHY better quality is important is, simply put, an assessment of how critical the product is to the producers. If product quality isn't important, then quality shouldn't be of any concern. But if a product is intended for a life-critical application, and therefore has to have very high quality, then the need for quality has to be taken into account.
HOW the team FEELS about the product they've built affects a lot of things -- and will be controversial to measure. The very best software development efforts have often been fielded by dedicated, talented teams who believe in their work.

There are plenty of available technical alternatives. Some of them are shown in the accompanying

Table I -- Software Quality Process Filters

which shows a range of possible software quality filters and also indicates how they can be applied, what some of their limitations and advantages are, and where the payoffs -- if any -- lie with each method or approach.

THE METHODOLOGY

The TestWorks Quality Index is a balanced, weighted, experience-determined estimate of selected factors and uses a combination of estimates, measurements, and process-characterizations to come up with a quality figure that can be used to compare products.

The TestWorks Quality Index value is the average score obtained on a simple question list, where specific quantitative responses based on current engineering experience assign "points". The more points scored, the better the product.

In engineering this kind of calculation is usually called a "Figure of Merit (FOM)", and FOMs have a long tradition of use in comparing complex things. From assessing competitive proposals (which are scored according to weighted averages) to determining plant efficiency, engineers take the practical approach even when it is known that there is no theoretical solution.
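As a rough illustration of a figure of merit, the sketch below computes a weighted average of factor scores. The function name figure_of_merit, the sample scores, and the weights are all hypothetical; with no weights supplied it reduces to the plain average that the TestWorks Quality Index uses.

```python
from typing import Optional, Sequence


def figure_of_merit(scores: Sequence[float],
                    weights: Optional[Sequence[float]] = None) -> float:
    """Weighted average of factor scores; equal weights give a plain average."""
    if not scores:
        raise ValueError("at least one factor score is required")
    if weights is None:
        weights = [1.0] * len(scores)
    if len(weights) != len(scores):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Hypothetical proposal scored on three criteria (0-100 points each).
proposal_scores = [80, 60, 90]
print(figure_of_merit(proposal_scores))             # equal weights
print(figure_of_merit(proposal_scores, [2, 1, 1]))  # first criterion counts double
```

Doubling the weight of the first criterion moves the result from about 76.7 to 77.5, which shows how adjusting weights shifts emphasis without changing the underlying scores.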

Some benefits the TestWorks Quality Index offers in assessing your products are:

You can compare two products from your production shop in a quantitative way.
You can adjust the point values, if you like, to match your own experience. Or you can add quality-related factors if you like (but you can't take them away or you may destroy the integrity of the index).
You can use the evidence internally for your own purposes. You don't have to expose yourself to SEI or ISO-9000 audit processes.
You can have or seek a high SEI rating or obtain ISO-9000 approval. The TestWorks Quality Index assesses product/producer/process quality in a way that combines many of the features of these two important quality-related standards.
You can use the TestWorks Quality Index now, with data that you probably already have available, to come up with meaningful comparative estimates about your own product or process.
By adopting the TestWorks Quality Index you create an internal culture in which achieving practical goals for a software product becomes a common goal. You might even say in your advertising that the product has been "TestWorked" to a certain level.

HOW THE TestWorks QUALITY INDEX WORKS

The TestWorks Quality Index works as shown on the following chart. The factors on the chart are metrics that you can measure, or are assessments you can make, in a straightforward way. Detailed explanations of the terms follow the chart.

As you read the explanations, think of a specific project that you're working on, and try to calculate its TestWorks Quality Index score as you go along.

TestWorks QUALITY INDEX CRITERIA

Not just any list of scored questions qualifies as a valid comparative index. To qualify as an effective indicator, some constraints have to be put on the TestWorks Quality Index (or its in-place equivalent) to make sure that it isn't manipulated to favor a particular process, product feature, or quality assurance approach.

Constraints that make sense are the following:

There should be no more than 10 (ten) factors. This keeps the arithmetic simple (managers need this, technical people think).
At least half of the criteria should be completely quantitative, objective, measurable, repeatable; i.e. not subject to any kind of judgement (see below for factors that take non-objective judgements into account).
[Q = Quantitative].
At least three of them should deal with the static (non-dynamic) characteristics of the product.
[S = Static].
At least three of the criteria should involve actual tests of (experiments on, examinations of, executions of) the as-built software product. You can't design quality into a product without some kind of checking, and you can't score high on the index without something that works and does something.
[T = Testedness].
At least three of the factors should deal with something about the dynamic behavior of the product, how it actually works when you run it.
[D = Dynamic Behavior].
At least one of the factors should deal with something about how the product was put together, i.e., about the process used in constructing it.
[P = Process].
At least one of the factors has to deal with a measure of the quality "need" imposed from some outside force, e.g. the cost of repairing a defect in the field or the life-criticality of the application.
[$ = Cost Impact of a Defect (Money)].
Some of the factors can address several of these criteria: they don't have to be unique to each area.
No more than one of the qualifying factors can be a "wild card" that need not meet any of the above criteria (the error will still be no worse than 10%). At least eight of the ten factors should have non-zero values.
[J = Judgement].

The idea here is to constrain the ways the FOM is computed so you are forced to include certain kinds of factors that will assure that the FOM really is meaningful.
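The constraints above can be checked mechanically. The following is a minimal sketch, not part of any TestWorks tool: the function check_constraints and its messages are hypothetical, the key assignments are copied from the factor table in this note, and a "wild card" is assumed to be a factor with no inclusion keys at all.

```python
from typing import Dict, List, Set

# Inclusion keys per factor, copied from the factor table in this note.
FACTOR_KEYS: Dict[str, Set[str]] = {
    "F1": {"D", "S", "Q"},
    "F2": {"D", "S", "Q"},
    "F3": {"S", "Q"},
    "F4": {"S", "Q"},
    "F5": {"D", "P", "Q"},
    "F6": {"T", "Q"},
    "F7": {"S", "Q"},
    "F8": {"T", "J"},
    "F9": {"P", "J"},
    "F10": {"$", "J"},
}

# Minimum number of factors that must carry each key.
MINIMUMS = {"S": 3, "T": 3, "D": 3, "P": 1, "$": 1}


def check_constraints(factor_keys: Dict[str, Set[str]],
                      scores: Dict[str, float]) -> List[str]:
    """Return a list of constraint violations (empty if none)."""
    problems: List[str] = []
    n = len(factor_keys)
    if n > 10:
        problems.append(f"{n} factors; at most 10 allowed")

    def count(key: str) -> int:
        return sum(1 for keys in factor_keys.values() if key in keys)

    if 2 * count("Q") < n:
        problems.append("fewer than half of the factors are quantitative (Q)")
    for key, minimum in MINIMUMS.items():
        if count(key) < minimum:
            problems.append(f"only {count(key)} factor(s) carry {key}; need {minimum}")
    # Assumption: a wild card is a factor that carries no inclusion key.
    wild_cards = [f for f, keys in factor_keys.items() if not keys]
    if len(wild_cards) > 1:
        problems.append("more than one wild-card factor")
    nonzero = sum(1 for f in factor_keys if scores.get(f, 0) != 0)
    if nonzero < 8:
        problems.append(f"only {nonzero} factors have non-zero values; need 8")
    return problems


# Scores taken from the coverage analyzer example in this note.
example_scores = dict(zip(FACTOR_KEYS, [70, 85, 80, 60, 50, 80, 60, 95, 50, 80]))
for problem in check_constraints(FACTOR_KEYS, example_scores):
    print(problem)
```

With the keys exactly as printed in the factor table, this sketch reports that only two factors (F6 and F8) carry the T key, so anyone adapting the index may want to review which factors they count toward that criterion.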

DETAILED EXPLANATION OF TERMS

Here are short explanations of the measures indicated above. The inclusion keys [Q, S, T, D, P, $, J] were explained above. Note that no factor can be included in the matrix unless at least one of these inclusion keys is addressed.

Fn	Short Definition	Meets Criteria	EXPLANATION
F1	Cumulative C1 (Branch Coverage) Value for All Tests	D, S, Q	This is the total C1 value achieved for this product on all tests, e.g. as measured by TCAT. Note that statement coverage is NOT usable because it understates results by half or more. Statement coverage (C0) is not acceptable.
F2	Cumulative S1 (Callpair Coverage) Value for All Tests	D, S, Q	This is the total S1 value achieved for this product on all tests, e.g. as measured by TCAT. Note that we are counting the connects between caller and callee, not just whether a function was ever called (which is called module testing). Module testing coverage (S0) is not acceptable.
F3	Percent of Functions with E(n) < 20	S, Q	This measures the structural complexity for all functions or modules or methods in the current application. Experience shows that the cyclomatic complexity E(n) = E - N + 2 > 20 implies a "too complex function" -- not necessarily bad, but a potentially troublesome problem if a high percentage of the individual functions have this value. Though not necessarily harmful, too high a percentage of "too complex" functions can be a serious warning sign of trouble ahead.
F4	Percent of Functions with Clean Static Analysis	S, Q	Static analysis finds a broad class of defects that may cause trouble in the future. Many errors found by static analysis are non-critical, but too many static analysis detections are an indicator of poor quality. The measurement made here requires that a certain stated percentage of functions be subjected to some form of static analysis.
F5	Last PASS / FAIL Percentage	D, P, Q	This is the total number of tests that PASS vs. the total number of tests available, as would be measured by the test controller, e.g. SMARTS. Tests PASS if they run as expected and produce output close enough (as determined by the programmable differencer) to the baseline to be acceptable.
F6	Total Number of Test Cases / KLOC	T, Q	This is a measure of the degree to which you have thoroughly tested the software relative to its size measured in 1000's of lines of code (KLOC). Most software is very poorly tested, i.e. with very few test cases, so it may not take a great many tests to score a high value on this measure.
F7	Call Tree Aspect Ratio	S, Q	This is a measure of the "verticalness" of the call-tree of the package, with packages that have a less vertical structure (and thus are more independently testable) viewed as superior. The vertical height is the maximum depth of the calling tree (this is shown in Xcalltree/WINcalltree), and the horizontal width is the largest number of functions on any level in the tree. If the tree has multiple roots (as it will likely have in most modern applications) then the average value over all possible roots is taken.
F8	Current Number of OPEN Defects / KLOC	T, J	No software product/project is perfect; this metric indicates how many defects per KLOC are open that are critical. An open defect typically means it is reported, reproduced, but unresolved with no work-around available to the user.
F9	Path Coverage Performed for Some Percentage of Functions	P, J	For almost all packages some critical functions or modules require full path coverage, but not all. This measures the percentage of all functions for which some form of path coverage has been performed. Note that path coverage is NOT the same as branch coverage.
F10	Cost Impact / Defect	$, J	This is an indication of how critical a serious software defect might be, expressed in monetary terms, i.e. in terms of the direct cost of any defect. Note that the scale used tends to take points away for the most-critical kinds of projects; this is done so that the more critical projects receive the greatest attention.
Note: Any product which is "life critical" gets 0 points. This artifact has the effect of forcing "life critical" situations to gain TestWorks Quality Index points by increasing the requirements on all of the other factors.
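To make factor F3 concrete, here is a small sketch that applies E(n) = E - N + 2 to per-function edge and node counts and reports the percentage of functions below the threshold of 20. The function names and the sample counts are hypothetical; in practice the counts would come from a metrics tool, and converting the percentage into points depends on the scale in Table II, which is not reproduced in this note.

```python
from typing import Dict, Tuple


def cyclomatic_complexity(edges: int, nodes: int) -> int:
    """E(n) = E - N + 2 for a function's control-flow graph."""
    return edges - nodes + 2


def percent_below_threshold(graphs: Dict[str, Tuple[int, int]],
                            threshold: int = 20) -> float:
    """Percentage of functions whose E(n) is below the threshold (factor F3)."""
    if not graphs:
        raise ValueError("no functions supplied")
    simple = sum(1 for edges, nodes in graphs.values()
                 if cyclomatic_complexity(edges, nodes) < threshold)
    return 100.0 * simple / len(graphs)


# Hypothetical (edges, nodes) counts for four functions.
control_flow = {
    "parse_order": (12, 10),  # E(n) = 4
    "validate": (30, 22),     # E(n) = 10
    "dispatch": (75, 50),     # E(n) = 27, "too complex"
    "report": (9, 9),         # E(n) = 2
}
print(percent_below_threshold(control_flow))  # 75.0
```

Three of the four hypothetical functions fall below the threshold, so the sketch reports 75 percent.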
All of these factors are summarized in

Table II -- TestWorks Quality Index Definitions

which summarizes the definitions of the factors that make up the TestWorks Quality Index and shows how to compute it for your process.
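The call tree aspect ratio (F7) is described only in words, so the sketch below shows one possible reading of it. It rests on several assumptions not stated in this note: each function is placed on the first level at which it is reached, recursive calls are ignored, the ratio is taken as width divided by height, and for multiple roots the per-root ratios are averaged. The names tree_shape and aspect_ratio and the sample call graph are hypothetical.

```python
from typing import Dict, List, Tuple


def tree_shape(calls: Dict[str, List[str]], root: str) -> Tuple[int, int]:
    """Return (width, height) of the call tree under root, level by level."""
    seen = {root}
    level = [root]
    width = height = 0
    while level:
        height += 1
        width = max(width, len(level))
        next_level: List[str] = []
        for caller in level:
            for callee in calls.get(caller, []):
                if callee not in seen:  # place each function once; skips recursion
                    seen.add(callee)
                    next_level.append(callee)
        level = next_level
    return width, height


def aspect_ratio(calls: Dict[str, List[str]]) -> float:
    """Average width/height over all roots (functions that nobody calls)."""
    callees = {c for targets in calls.values() for c in targets}
    roots = [f for f in calls if f not in callees]
    if not roots:
        raise ValueError("no root found (every function is called by another)")
    ratios = []
    for root in roots:
        width, height = tree_shape(calls, root)
        ratios.append(width / height)
    return sum(ratios) / len(ratios)


# Hypothetical call graph with two roots: main and cli_tool.
call_graph = {
    "main": ["load", "process", "save"],
    "process": ["validate"],
    "cli_tool": ["load"],
}
print(aspect_ratio(call_graph))  # 0.75
```

Here the tree under main is three levels deep and three functions wide (ratio 1.0), while the tree under cli_tool is two levels deep and one function wide (ratio 0.5), giving an average of 0.75. A wider, shallower tree yields a larger ratio, matching the preference for less vertical structures.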

ILLUSTRATIVE EXAMPLES

Here's how the TestWorks Quality Index works when applied to some example projects.

Quick-And-Dirty Order Tracker

This project was done by one programmer and involved putting together an order tracking system for his company. It was written in C and uses an ASCII interface to a standard library of C++ functions.

Here's a summary of the scores that resulted:

Cumulative C1 (Branch Coverage) Value for All Tests	75
Cumulative S1 (Callpair Coverage) Value for All Tests	70
Percent of Functions with E(n) < 20	50
Percent of Functions with Clean Static Analysis	50
Last PASS / FAIL Percentage	80
Total Number of Test Cases / KLOC	50
Calling Tree Aspect Ratio (Width/Height)	60
Current Number of OPEN Defects / KLOC	50
Path Coverage Performed for % of Functions	50
Cost Impact / Defect	100
TOTAL POINTS SCORED	535
TestWorks Quality Index	53.5

Test Tool Vendor's Coverage Analyzer

This coverage analyzer, a very popular one, is a sophisticated compiler-based product...

Cumulative C1 (Branch Coverage) Value for All Tests	70
Cumulative S1 (Callpair Coverage) Value for All Tests	85
Percent of Functions with E(n) < 20	80
Percent of Functions with Clean Static Analysis	60
Last PASS / FAIL Percentage	50
Total Number of Test Cases / KLOC	80
Calling Tree Aspect Ratio (Width/Height)	60
Current Number of OPEN Defects / KLOC	95
Path Coverage Performed for % of Functions	50
Cost Impact / Defect	80
TOTAL POINTS SCORED	710
TestWorks Quality Index	71

Bedside Cardiac Monitor

Think of this as a life-critical application...

Cumulative C1 (Branch Coverage) Value for All Tests	100
Cumulative S1 (Callpair Coverage) Value for All Tests	100
Percent of Functions with E(n) < 20	80
Percent of Functions with Clean Static Analysis	80
Last PASS / FAIL Percentage	90
Total Number of Test Cases / KLOC	85
Calling Tree Aspect Ratio (Width/Height)	60
Current Number of OPEN Defects / KLOC	95
Path Coverage Performed for % of Functions	60
Cost Impact / Defect	50
TOTAL POINTS SCORED	800
TestWorks Quality Index	80
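The arithmetic behind these examples is simply the sum of the ten factor points divided by ten. The short sketch below reproduces it for the coverage analyzer and the cardiac monitor, using the points listed above; the function name quality_index is hypothetical. The ten scores printed for the order tracker add up to 635 rather than the stated 535, so one of its printed values may be mistyped and that example is left out here.

```python
from typing import Dict, List

# Points for factors F1 through F10, in the order listed in the examples.
examples: Dict[str, List[int]] = {
    "Test Tool Vendor's Coverage Analyzer": [70, 85, 80, 60, 50, 80, 60, 95, 50, 80],
    "Bedside Cardiac Monitor": [100, 100, 80, 80, 90, 85, 60, 95, 60, 50],
}


def quality_index(points: List[int]) -> float:
    """Average points per factor: total points divided by the number of factors."""
    return sum(points) / len(points)


for name, points in examples.items():
    print(f"{name}: total {sum(points)}, index {quality_index(points)}")
```

Running this gives totals of 710 and 800 and index values of 71.0 and 80.0, matching the figures shown for those two projects.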

CONNECTING THE INDEX TO REALITY

The hard part comes when trying to connect with reality. The main question everyone asks is, "How reliable will my application be in the field?"

As students of software quality know very well, this is a very deep question to which there are few definitive or even suggestive answers. Instead, about the best we can do is associate a particular process's TestWorks Quality Index score with a likely estimate of reliability based on judgment and experience.

An initial experimental estimate of this is done in the attached

Table III -- Product Application Profile

which gives a recommended composition of use of TestWorks products and indicates likely CMM levels and relative overall process efficiencies.

Time will tell whether the numbers are too high or too low. Time will tell if the reliability values correspond to the SEI/CMM levels, or if the achieved reliability is too low or too high.

And time will tell whether the application of relatively simple quality filters will achieve the expected effect often enough to be relied upon.

But in any case, making the attempt to tie these ingredients together is essential.
