by David Golden
I want to sum up a few things that I took away from the conversations and propose a series of major changes to CPAN Testers. Special thanks to an off-list (and very civil) conversation with chromatic for triggering some of these thoughts.
Type I and Type II errors
In statistics, a Type I error means a "false positive" or "false alarm". For CPAN Testers, that's a bogus FAIL report. A Type II error means a "false negative", e.g. a bogus PASS report. Often, there is a trade-off between these. If you think about spam filtering as an example, reducing the chance of spam getting through the filter (false negatives) tends to increase the odds that legitimate mail gets flagged as spam (false positives).
Generally, those involved in CPAN Testers have taken the view that it's better to have a false positives (false alarms) than false negatives (a bogus PASS report). Moreover, we've tended to believe -- without any real analysis -- that the false positive *ratio* (false FAILs divided by all FAILs) is low.
But I've never heard a single complaint about a bogus PASS report and I hear a lot of complaints about bogus FAILS, so it's reasonable to think that we've got the tradeoff wrong. Moreover, I think the downside to false positives is actually higher than for false negatives if we believe that CPAN Testers is primarily a tool to help authors improve quality rather than a tool to give users a guarantee about how distributions work on any given platform.
False positive ratios by author
Even if the aggregate false positive ratio is low, individual CPAN authors can experience extraordinarily high false positive ratios. What I suddenly realized is that the higher the quality of an author's distributions, the higher the false positive ratio.
Consider a "low quality" author -- one who is prone to portability errors, missing dependencies and so on. Most of the FAIL reports are legitimate problems with the distribution.
Now consider a "high quality" author -- one who is careful to write portable code, well-specified dependencies and so on. For this author, most of the FAIL reports only come when a tester has a broken or misconfigured toolchain The false positive ratio will approach 100%.
In other words, the *reward* that CPAN Testers has for high quality is increased annoyance from false FAIL reports with little benefit.
Repetition is desensitizing
From a statistical perspective, having lots of CPAN Testers reports for a distribution even on a common platform helps improve confidence in the aggregate result. Put differently, it helps weed out "outlier" reports from a tester who happens to have a broken toolchain.
However, from author's perspective, if a report is legitimate (and assuming they care), they really only need to hear it once. Having more and more testers sending the same FAIL report on platform X is overkill and gives yet more encouragement for authors to tune out.
So the more successful CPAN Testers is in attracting new testers, the more duplicate FAIL reports authors are likely to receive, which makes them less likely to pay attention to them.
When is a FAIL not a FAIL?
There are legitimate reasons that distributions could be broken such that they fail during PL or make in ways that are not the fault of the tester's toolchain, so it still seems like valuable information to know when distributions can't build as well as when they don't pass tests. So we should report on this and not just skip reporting. On the other hand, most of the false positives that provoke complaint are toolchain issues during PL or make/Build.
Right now there is no easy way to distinguish the phase of a FAIL report from the subject of an email. Removing PL and make/Build failures from the FAIL category would immediately eliminate a major source of false positives in the FAIL category and decrease the aggregate false positive ratio in the FAIL category. Though, as I've shown, while this may decrease the incidence of false positives for high quality authors, the false positive ratio is likely to remain high.
It almost doesn't matter whether we reclassify these as UNKNOWN or invent new grades. Either way partitions the FAIL space in a way that makes it easier for authors to focus on which ever part of the PL/make/test cycle they care about.
What we can fix now and what we can't
Some of these issues can be addressed fairly quickly.
First, we can lower our collective tolerance of false positives -- for example, stop telling authors to just ignore bogus reports if they don't like it and find ways to filter them. We have several places to do this -- just in the last day we've confirmed that the latest CPANPLUS dev version doesn't generate Makefile.PL's and some testers have upgraded. BinGOs has just put out CPANPLUS::YACSmoke 0.04 that filters out these cases anyway if testers aren't on the bleeding edge of CPANPLUS. We now need to push testers to upgrade. As we find new false positives, we need to find new ways to detect and suppress them.
Second, we can reclassify PL/make/Build fails to UNKNOWN. This won't break any of the existing reporting infrastructure the way that adding new grades would. I can make this change in CPAN::Reporter in a matter of minutes and it probably wouldn't be hard to do the same in CPANPLUS. Then we need another round of pushing testers to upgrade their tools. We could also take a decision as to whether UNKNOWN reports should be copied to authors by default or just sent to the mailing list.
However, as long as the CPAN Testers system has individual testers emailing authors, there is little we can do to address the problem of repetition. One option is to remove that feature from Test::Reporter and reports will only go to the central list. With the introduction of an RSS feed (even if not yet optimal), authors will have a way to monitor reports. And from that central source, work can be done to identify duplicative reports and start screening them out of notifications.
Once that is more or less reliable, we could restart email notifications from that central source if people felt that nagging is critical to improve quality. Personally, I'm coming around to the idea that it's not the right way to go culturally for the community. We should encourage people to use these tools, sign up for RSS or email alerts, whatever, because they think that quality is important. If the current nagging approach is alienating significant numbers of perl-qa members, how can we possibly expect that it's having a positive influence on everyone else?
Some of these proposal would be easier in CPAN Testers 2.0, which will provide reports as structured data instead of email text, but if "exit 0" is a straw that is breaking the Perl camel's back now, then we can't ignore 1.0 to work on 2.0 as I'm not sure anyone will care anymore by the time it's done.
What we can't do easily is get the testers community to upgrade to newer versions of the tools. That is still going to be a matter of announcements and proselytizing and so on. But I think we can make a good case for it, and if we can get the top 10 or so testers to upgrade across all their testing machines then I think we'll make a huge dent in the false positives that are undermining support for CPAN Testers as a tool for Perl software quality.
I'm interested in feedback on these ideas. In particular, I'm now convinced that the "success" of CPAN Testers now prompts the need to move PL/make fails to UNKNOWN and to discontinue copying authors by individual testers. I'm open to counter-arguments, but they'll need to convince me of a better long-run solution to the problems I identified.
David Golden is a CPAN Tester and prolific CPAN author with over two dozen modules released, including the groundbreaking CPAN::Reporter and Class::InsideOut. David was the release engineer for the alpha versions of Strawberry Perl. He has been a speaker at YAPC::NA, The New York Perl Seminar, and Boston.pm and has written articles for The Perl Review. David lives in New York City.