Seems like a better solution would be a checksum on the license text, no?

pabs3 · on April 8, 2022

The most accurate license detector is ScanCode. I think it uses some sort of rolling checksum thing for detection.

https://github.com/nexB/scancode-toolkit/

pombreda · on April 9, 2022

Hey, pabs3! Actually, this does not use a rolling checksum for detection, but rather a combo of a language model, checksums, automatons, bitvectors, inverted indexes and multiple sequence alignment (e.g. a specialized diff). I put some docs explaining the approach at https://github.com/nexB/scancode-toolkit/blob/develop/src/li...
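To give a rough feel for just one of those ingredients, here is a minimal sketch of an inverted index used to rank candidate licenses before any expensive comparison. This is not ScanCode's code or API: the function names, the tokenizer and the two tiny sample texts are all made up for illustration.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep only alphanumeric runs (an illustrative tokenizer).
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(licenses):
    # Inverted index: token -> set of license names whose text contains it.
    index = defaultdict(set)
    for name, text in licenses.items():
        for token in set(tokenize(text)):
            index[token].add(name)
    return index


def candidates(index, query):
    # Count how many distinct query tokens each known license shares,
    # and return the licenses ranked by that count, best first.
    hits = Counter()
    for token in set(tokenize(query)):
        for name in index.get(token, ()):
            hits[name] += 1
    return hits.most_common()


# Hypothetical sample texts, not complete license texts.
LICENSES = {
    "mit-like": "Permission is hereby granted, free of charge, to any person obtaining a copy",
    "gpl-like": "This program is free software; you can redistribute it and/or modify it",
}
INDEX = build_index(LICENSES)
ranked = candidates(INDEX, "this program is free software, you may redistribute it")
print(ranked)
```

The point of such an index is only to narrow the field cheaply: the top-ranked candidates would still need a real pairwise comparison to confirm a match.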

pombreda · on April 9, 2022

You need a bit more than checksums for this. If anything, the FSF has published many different versions of the "official" GPL2 text, and this would defeat checksums. See https://github.com/pombredanne/gpl-history ... so in the general case, hashing a text does not work consistently and safely. Eventually you need a diff for a pairwise comparison, and the only difficulty is making the diff fast, or finding ways to approximate it fast enough to avoid doing a full diff.
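Here is a small sketch of why that is, using only the Python standard library. The two snippets are illustrative stand-ins for two variants of the same notice that differ only in the postal address. `checksum` and `similarity` are hypothetical helper names, and `difflib.SequenceMatcher` is swapped in here as a generic diff, not the specialized one ScanCode uses.

```python
import difflib
import hashlib


def checksum(text):
    # Exact-match fingerprint: any single changed character changes the digest.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def similarity(a, b):
    # Word-level diff ratio in [0, 1]; 1.0 means identical token sequences.
    matcher = difflib.SequenceMatcher(None, a.split(), b.split(), autojunk=False)
    return matcher.ratio()


# Illustrative variants of the same notice with a different address.
V1 = (
    "You should have received a copy of the GNU General Public License "
    "along with this program; if not, write to the Free Software Foundation, "
    "Inc., 675 Mass Ave, Cambridge, MA 02139, USA."
)
V2 = (
    "You should have received a copy of the GNU General Public License "
    "along with this program; if not, write to the Free Software Foundation, "
    "Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA."
)

same_hash = checksum(V1) == checksum(V2)
score = similarity(V1, V2)
print(same_hash, round(score, 2))
```

The hashes disagree even though a human would call these the same notice, while the diff ratio stays high. The cost is that a diff against every known license text is slow, which is where the approximations come in.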