Shouldn't A.I. be used in a way where it only assists? E.g. a doctor takes a look first, and if (s)he can't find anything, then A.I. checks as well (or in parallel).
My personal opinion: AI should still be kept out of anything mission-critical, at all stages, except for evaluation.
There is another comment, very correctly, noting that this result is on 100% positive input. The same AI in "real life" would probably score much better in the end. But, as you point out, if used as a confirmation tool it is definitely bad.
> The same AI in "real life" would probably score much better in the end
Either I don't understand your reasoning or you are very much wrong. A "real life" dataset would contain real negatives too, and the result would be equal if the false-positive rate were zero and strictly worse if the rate were any higher. One should expect the same AI to score significantly worse in a real-life setting.
Depends on what you call better or worse. In real life, positives are far less common than negatives; if this system does not produce lots of FPs (which is very possible), the accuracy will be much higher than you might expect.
What I mean by "score" is having a relatively high accuracy.
Come, let's do the math: incidence of BC is 1 in every 12, let's say. Now say we have 12,000 patients: that's 1,000 true positives and 11,000 negatives.
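To make that concrete, here's a minimal sketch of the accuracy arithmetic. The 70% sensitivity and 95%/70% specificity figures are purely hypothetical stand-ins, since the thread doesn't give the model's actual operating point:

```python
# Minimal sketch: how overall accuracy depends on disease prevalence,
# for assumed (hypothetical) sensitivity and specificity.

def accuracy(n_patients, prevalence, sensitivity, specificity):
    """Overall accuracy = (TP + TN) / total for a given prevalence."""
    positives = n_patients * prevalence
    negatives = n_patients - positives
    tp = positives * sensitivity      # sick patients correctly flagged
    tn = negatives * specificity      # healthy patients correctly cleared
    return (tp + tn) / n_patients

n = 12_000
prev = 1 / 12                         # the 1-in-12 incidence assumed above

# On an all-positive benchmark, accuracy is just sensitivity:
print(accuracy(n, 1.0, 0.70, 0.95))  # 0.70

# On a realistic mix, low prevalence lets specificity dominate:
print(accuracy(n, prev, 0.70, 0.95)) # ~0.93

# ...but only while false positives stay rare; a sloppier model drops fast:
print(accuracy(n, prev, 0.70, 0.70)) # 0.70
```

So whether the "real life" score looks better or worse than the benchmark hinges almost entirely on the false-positive rate, which is exactly the point of contention above.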
There was a study that found that, in radiology, human-first assessment resulted in worse outcomes than human-alone. Possibly the humans were letting borderline cases through, on the assumption that the machine would catch them.
There's a roundup of such findings here, but they're a mixed bag: https://www.uxtigers.com/post/humans-negative-value I suspect you need careful process design to get better outcomes, and it's not one-size-fits-all.