> to get the most value, test suites should accelerate the time to useful feedback
Well, they should also optimise the usefulness of the feedback they provide. Typically, tests higher up the pyramid are also more brittle (e.g. end-to-end tests might fire up an entire browser and Selenium), and thus are more likely to fail when in actuality, nothing is wrong. That's an additional reason for limiting the number of those tests.
Brittle tests don't seem useful in general though, do they?
I'm not sure it's necessarily true that brittleness must correlate with pyramid height or execution time -- in my experience, brittleness correlates with Selenium more than it does with pyramid height (that's a statement about Selenium more than it is about any particular category of the testing pyramid).
It's possible to write very useful, non-brittle tests using something like headless Chrome ...
But yes, Selenium is brittle. That said, Google engineers actually did some investigation into this, and although I think their methods were probably a bit heavyweight, they did conclude that it's mostly RAM use that leads to brittleness.
I’m curious how many tests in that chart were in the small size range while also using tools associated with higher-than-average flakiness -- that would provide evidence for whether the size-flakiness correlation holds even for tests built on flakier tooling...
I’d also like more clarity around the mechanism for measuring flakiness. The definition they use is that a test is flaky if it shows both failing and successful runs on the “same code”, but does “same code” refer to a freeze of only the codebase under test, or does it also cover changes to the tools in the testing environment ...?
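Their definition (both pass and fail outcomes on frozen code) can be sketched as a simple rerun loop. This is just an illustration, not their actual methodology; the helper names and the timing-dependent toy test are made up here:

```python
import random

def is_flaky(test_fn, runs=20):
    """Re-run a test repeatedly against frozen code.

    Flaky means both outcomes (pass and fail) were observed.
    """
    outcomes = {test_fn() for _ in range(runs)}
    return outcomes == {True, False}

# A deterministic test: always the same outcome, so never flaky.
def stable_test():
    return 1 + 1 == 2

# A hypothetical stand-in for a test whose outcome depends on
# timing/scheduling rather than on the code under test.
random.seed(0)
def timing_dependent_test():
    return random.random() < 0.7

print(is_flaky(stable_test))            # False
print(is_flaky(timing_dependent_test))  # True (with this seed)
```

Note that this only freezes the code under test; nothing here controls for the version of the test tooling itself, which is exactly the ambiguity in the definition.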
I wonder what the test suites for tools like selenium/WebDriver look like ... do they track a concept of “meta-flakiness” to try and observe changes to test flakiness results caused by changes to the test tooling ...?