Session S72.3

Controlling True Positive Rate in ROC Analysis

T Eftestøl*

University of Stavanger
Stavanger, Norway

ROC analysis is widely used to evaluate the performance of diagnostic markers. The method is straightforward when only a single marker is evaluated. When several markers are combined in a multi-dimensional feature vector, decision regions can be determined using Bayes decision theory, and the true positive and negative rates can be controlled by using loss functions to set the size of the decision region. Another issue arises when resampling is used and decision regions are determined repeatedly for specific true positive and negative rates. When data are scarce, the correspondence between the decision regions in the resampling iterations will be poor. A method is proposed for accurately controlling the true positive rate, which can be used for problems involving small data sets and resampling. The method estimates the probability density functions (PDFs) for the data sets and represents their values on an evenly spaced grid of coordinates covering the feature space. The PDFs are used to control the size of the decision regions, and thus the number of grid points determines the true positive rate resolution. The PDFs for the two classes, p1 and p2, are estimated, represented on the grid and normalised (each sums to 1). The prior probabilities are estimated as P1 and P2. According to Bayes decision theory, choosing the class with the highest gi=Pi*pi minimises the error rate. Equivalently, class 1 is selected if G=g1/g2>T, otherwise class 2 is selected, where T=1 gives the minimum-error rule. The true positive and negative rates can be changed by setting the threshold T to another value. G is represented on the grid and its values arranged in descending order in a vector T'. As each element in this vector represents a coordinate in feature space, the values in the p1 representation are arranged correspondingly in a vector p1' so that p1'(i) is the PDF value of T'(i).
The values in the p2 representation are arranged in the same way. The cumulative sum of p1' is computed and named TP. Thus TP(i) is the true positive rate corresponding to the threshold value T'(i). This threshold value corresponds to the decision region for class 1 consisting of the grid points corresponding to T'(1),..., T'(i). For ROC analysis, the required true positive values might be set with a given resolution. These specified values can then be found quite accurately in the vector TP (accuracy can be improved by increasing the number of grid points). The corresponding true negative rate is computed as the sum of the p2 representation over the grid points not in the decision region for class 1. As the number of grid points can be freely chosen, the problem of maintaining the same true positive value throughout resampling can be handled. A method for controlling the true positive rates for multi-dimensional feature vectors has been described.
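The procedure described above can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch under stated assumptions, not the author's implementation: the function names `grid_pdf` and `region_for_tp`, the Gaussian kernel estimator with its `bandwidth` parameter, the small constant guarding the division, and the synthetic two-class data are all hypothetical choices, since the abstract does not specify how the PDFs are estimated.

```python
import numpy as np

def grid_pdf(samples, grid_points, bandwidth):
    """Kernel density estimate on the grid, normalised to sum to 1.
    ASSUMPTION: a Gaussian kernel; the abstract names no estimator."""
    # Squared distance between every grid point (M, d) and every sample (N, d).
    d2 = ((grid_points[:, None, :] - samples[None, :, :]) ** 2).sum(axis=2)
    p = np.exp(-0.5 * d2 / bandwidth**2).sum(axis=1)
    return p / p.sum()

def region_for_tp(p1, p2, P1, P2, tp_target):
    """Return (class-1 region mask, achieved TP rate, TN rate)."""
    p1f, p2f = np.ravel(p1), np.ravel(p2)
    # G = g1/g2 with gi = Pi*pi; the tiny constant is a hypothetical
    # guard against division by zero where p2 vanishes.
    G = (P1 * p1f) / (P2 * p2f + 1e-300)
    order = np.argsort(-G, kind="stable")   # grid indices, G descending (T')
    TP = np.cumsum(p1f[order])              # cumulative sum of p1'
    i = int(np.argmin(np.abs(TP - tp_target)))  # closest achievable TP rate
    mask = np.zeros(p1f.size, dtype=bool)
    mask[order[: i + 1]] = True             # region: first i+1 sorted grid points
    tn = float(p2f[~mask].sum())            # p2 mass outside the class-1 region
    return mask.reshape(np.shape(p1)), float(TP[i]), tn

# Synthetic two-dimensional example (hypothetical data).
rng = np.random.default_rng(0)
x1 = rng.normal([0.0, 0.0], 1.0, size=(40, 2))   # class 1 samples
x2 = rng.normal([2.0, 2.0], 1.0, size=(60, 2))   # class 2 samples
axis = np.linspace(-4.0, 6.0, 60)
gx, gy = np.meshgrid(axis, axis)
grid = np.column_stack([gx.ravel(), gy.ravel()])

p1 = grid_pdf(x1, grid, bandwidth=0.5)
p2 = grid_pdf(x2, grid, bandwidth=0.5)
P1 = len(x1) / (len(x1) + len(x2))
P2 = 1.0 - P1

for target in (0.80, 0.90, 0.95):
    mask, tp, tn = region_for_tp(p1, p2, P1, P2, target)
    print(f"target TP={target:.2f}  achieved TP={tp:.4f}  TN={tn:.4f}")
```

Because the achieved true positive rate moves in steps no larger than the largest single grid value of p1, refining the grid narrows the gap between the requested and the achieved rate, which is the property the method relies on during resampling.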
(Abstract Control Number: 234)