Over the past year, data breaches caused by the exploitation of web, business, and mobile applications have continued to run rampant. In 2018, major household names like Ticketmaster, the United States Postal Service (USPS), Air Canada, and British Airways were hit by application-based exploits. To minimize vulnerabilities, and to identify existing ones before they can do this level of damage, application security solutions need to be fast, cover all classes of vulnerabilities, and, most importantly, be highly accurate if they are to be useful to DevOps application development teams. Providing results quickly but less accurately is counterproductive to an efficient and successful application security program: the time engineers waste triaging false positives far outweighs the benefit of speedier results.

Most automated application security testing solutions can scan thousands of applications containing millions of lines of code and can produce results containing millions of attack vectors. But every application is different (different functionality, different code, different size, and different complexity), so the security findings, and their accuracy, differ significantly from one application to the next. Moreover, picking the single scanned application with the best accuracy and presenting that figure as the solution’s accuracy is misleading. Even an average would be misleading, because it would measure only the limited set of applications that the vendor’s solution scanned and would therefore not be comparable to the accuracy of other solutions.

So, how do you benchmark and compare the accuracy of application security testing solutions?

OWASP Benchmark

The OWASP Benchmark Project is a vendor-neutral, well-respected indicator of accuracy that can be used to compare different solutions. It is a free and open testing project that evaluates how automated software vulnerability detection tools stack up in terms of accuracy. The project is sponsored by DHS and has created a huge test suite of over 21,000 test cases to gauge the true effectiveness of all kinds of application security testing tools. It calculates an overall score for each application security solution, based on both its true positive rate (TPR) and its false positive rate (FPR).

This benchmarking approach helps companies select the right tool to solve a very common problem. In the application security space, customers and prospects tell the same story time and time again:

“We set up an automated application security testing product, we got our findings from it, we brought them to our developers, and we convinced them to prioritize fixing these vulnerabilities.
But the first finding they worked on was a false positive. Then the second was too. Now, our engineering team no longer takes our reports seriously.
So tell me, why are these false positives even there? Why can’t these be suppressed automatically?”

Thoroughly vetting vulnerabilities can slightly compromise speed, but, as noted above, fast results that contain false positives cost engineers more time in triage than the extra speed saves.

Yet, that’s the approach that many automated application security solutions take.

False Positives Are Not a Full Measure of Accuracy

Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as a classification function. Sensitivity measures the proportion of actual positives that are correctly identified as such. Specificity measures the proportion of actual negatives that are correctly identified as such. In other words, sensitivity quantifies the avoidance of false negatives, and specificity does the same for false positives.

A perfect test would be described as 100 percent sensitive and 100 percent specific. In reality, however, any non-deterministic predictor will possess a minimum error bound known as the Bayes error rate.

According to Wikipedia, “For any test, there is usually a trade-off between the measures – for instance, in airport security, since testing of passengers is for potential threats to safety, scanners may be set to trigger alarms on low-risk items like belt buckles and keys (low specificity) in order to increase the probability of identifying dangerous objects and minimize the risk of missing objects that do pose a threat (high sensitivity).”

Similarly, in application security testing, false positives alone don’t determine accuracy. They are just one of the four outcomes that determine a tool’s accuracy; the other three are true positives, true negatives, and false negatives.

False Positives (FP): Tests with fake vulnerabilities that were incorrectly reported as vulnerable

True Positives (TP): Tests with real vulnerabilities that were correctly reported as vulnerable

False Negatives (FN): Tests with real vulnerabilities that were not correctly reported as vulnerable

True Negatives (TN): Tests with fake vulnerabilities that were correctly not reported as vulnerable
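
The four outcomes above can be tallied mechanically once you know, for each test case, whether the vulnerability is real and whether the tool reported it. The following is a minimal sketch of that bookkeeping, not the Benchmark's own code; the `classify` and `tally` helpers and the sample `results` list are hypothetical and exist only for illustration.

```python
def classify(is_real: bool, reported: bool) -> str:
    """Map one test case to TP, FN, FP, or TN."""
    if is_real and reported:
        return "TP"  # real vulnerability, correctly reported
    if is_real and not reported:
        return "FN"  # real vulnerability, missed by the tool
    if not is_real and reported:
        return "FP"  # fake vulnerability, incorrectly reported
    return "TN"      # fake vulnerability, correctly not reported


def tally(results):
    """Count each outcome over (is_real, reported) pairs."""
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for is_real, reported in results:
        counts[classify(is_real, reported)] += 1
    return counts


# Hypothetical scan results: (is_real_vulnerability, tool_reported_it)
results = [
    (True, True),
    (True, False),
    (False, True),
    (False, False),
    (True, True),
]

counts = tally(results)
print(counts)  # {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 1}
```

Note that the tally is only possible because the ground truth for every test case is known in advance, which is exactly what a benchmark test suite provides and what an arbitrary production application does not.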

Therefore, a true positive rate (TPR) is the rate at which real vulnerabilities were reported, correctly. A false positive rate (FPR) is the rate at which fake vulnerabilities were reported as real, incorrectly.

That is:

TPR = TP / (TP + FN)

FPR = FP / (FP + TN)
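
To make the two rates concrete, here is a small sketch that computes TPR as TP / (TP + FN) and FPR as FP / (FP + TN), then combines them into a single score. It assumes the score is simply TPR minus FPR, which is how the OWASP Benchmark scorecard is generally described; the function names and the counts below are hypothetical, not taken from any real scan.

```python
def true_positive_rate(tp: int, fn: int) -> float:
    """Share of real vulnerabilities that the tool reported."""
    total_real = tp + fn
    return tp / total_real if total_real else 0.0


def false_positive_rate(fp: int, tn: int) -> float:
    """Share of fake vulnerabilities that the tool reported as real."""
    total_fake = fp + tn
    return fp / total_fake if total_fake else 0.0


def benchmark_score(tp: int, fp: int, fn: int, tn: int) -> float:
    """Assumed scoring rule: TPR minus FPR."""
    return true_positive_rate(tp, fn) - false_positive_rate(fp, tn)


# Hypothetical counts for one tool
tp, fp, fn, tn = 75, 25, 25, 75

tpr = true_positive_rate(tp, fn)
fpr = false_positive_rate(fp, tn)
score = benchmark_score(tp, fp, fn, tn)

print(f"TPR={tpr:.0%} FPR={fpr:.0%} score={score:.0%}")
# TPR=75% FPR=25% score=50%
```

Under this rule, a tool that flags everything scores 0 (TPR and FPR are both 100 percent), and so does a tool that flags nothing, so a high score requires finding real issues while staying quiet on fake ones.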

TPR Can’t Be 100 Percent and FPR Can’t Be 0 Percent on Automated AppSec Testing

When developers and QA engineers write unit tests and functional tests, they write them specifically for their applications. If they’re testing a+b=c, then all of their tests are written to test exactly that. This helps keep the accuracy of these tests high, i.e., no false positives and no false negatives. Developers and QA engineers do not run their unit and functional tests using “generic” tests. In other words, they “curve-fit” their tests to match the expected results. If they didn’t, some false positives and false negatives would incorrectly fail their tests. Similarly, if they wrote their own security tests customized for their application, those would be highly accurate as well. But in automated application security testing solutions such as WhiteHat Sentinel, the security tests are fairly “generic” and written to scan any and all applications.

With that generality come both false positives and false negatives. False positives can easily be reduced by running fewer tests, which lowers coverage. But lower coverage also increases the false negatives. Therefore, it is almost impossible to eliminate both false positives and false negatives for these “generic” automated security tests. In other words, in practice a tool cannot reach a TPR of 100 percent and an FPR of 0 percent at the same time. Thus, the OWASP Benchmark Score can never be 100 percent.
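
The trade-off can be illustrated with a toy model. The sketch below assumes each generic test produces a confidence value and that the tool reports only findings at or above a cutoff; raising the cutoff stands in for running fewer, stricter tests. The confidence values, the `findings` list, and the `rates_at` helper are all hypothetical and do not reflect how any particular product works.

```python
# Hypothetical findings: (confidence, is_real_vulnerability)
findings = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.50, True), (0.40, False), (0.30, False),
]


def rates_at(threshold, findings):
    """Return (TPR, FPR) when only findings >= threshold are reported."""
    tp = sum(1 for conf, real in findings if real and conf >= threshold)
    fn = sum(1 for conf, real in findings if real and conf < threshold)
    fp = sum(1 for conf, real in findings if not real and conf >= threshold)
    tn = sum(1 for conf, real in findings if not real and conf < threshold)
    return tp / (tp + fn), fp / (fp + tn)


# Stricter cutoffs remove false positives but also drop real findings.
for threshold in (0.25, 0.55, 0.85):
    tpr, fpr = rates_at(threshold, findings)
    print(f"threshold={threshold:.2f} TPR={tpr:.0%} FPR={fpr:.0%}")
# threshold=0.25 TPR=100% FPR=100%
# threshold=0.55 TPR=75% FPR=50%
# threshold=0.85 TPR=50% FPR=0%
```

Because one fake finding (0.80) outranks a real one (0.70) in this sample, no cutoff yields a TPR of 100 percent together with an FPR of 0 percent, which mirrors the argument above.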

Image source: https://rawgit.com/OWASP/Benchmark/master/scorecard/OWASP_Benchmark_Home.html

In conclusion, we always encourage you to consider all four factors (speed, accuracy, breadth, and depth) when evaluating automated application security solutions.