
Seems a better solution would be a checksum on the license text, no?


The most accurate license detector is ScanCode. I think it uses some sort of rolling checksum thing for detection.

https://github.com/nexB/scancode-toolkit/


Hey, pabs3! Actually this is not using a rolling checksum for detection but rather a combo of language model, checksums, automatons, bitvectors, inverted indexes and multiple sequence alignment (e.g. a specialized diff). I put some docs explaining the approach at https://github.com/nexB/scancode-toolkit/blob/develop/src/li...
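
To make the "inverted indexes" part concrete, here is a minimal sketch of how an inverted index can shortlist candidate license texts before any expensive alignment runs. This is not ScanCode's actual code: the names `tokenize`, `build_index`, `candidates` and the two sample rule texts are hypothetical and only there for illustration.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def build_index(rules):
    # Inverted index: token -> set of rule names whose text contains it.
    index = defaultdict(set)
    for name, text in rules.items():
        for token in set(tokenize(text)):
            index[token].add(name)
    return index


def candidates(index, query):
    # Rank rules by how many distinct query tokens they share.
    hits = Counter()
    for token in set(tokenize(query)):
        for name in index.get(token, ()):
            hits[name] += 1
    return hits.most_common()


# Hypothetical, shortened reference texts.
RULES = {
    "gpl-notice": "this program is free software you can redistribute it "
                  "under the terms of the gnu general public license",
    "mit-notice": "permission is hereby granted free of charge to any "
                  "person obtaining a copy of this software",
}

index = build_index(RULES)
ranked = candidates(
    index,
    "you can redistribute it under the terms of the GNU General Public License",
)
print(ranked)
```

Only the top-ranked candidates would then go through a slower pairwise comparison, which is the whole point of indexing first.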


You need a bit more than checksums for this. If anything, the FSF published many different versions of the "official" GPL2 text, and this would defeat checksums. See https://github.com/pombredanne/gpl-history ... so in the general case hashing a text does not work consistently and safely. Eventually you need a diff for a pairwise comparison, and the only difficulty is making the diff fast or finding ways to approximate it fast enough to avoid doing a full diff.
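
A small sketch of why a checksum breaks down while a diff still works, using only the Python standard library. The strings are short illustrative snippets (assumptions for the example, not the actual variants collected in the gpl-history repo), and `checksum` and `similarity` are hypothetical helper names.

```python
import difflib
import hashlib


def checksum(text):
    # Exact-match fingerprint: any single-character change alters it.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def similarity(a, b):
    # Word-level diff ratio in [0, 1]; 1.0 means identical token sequences.
    matcher = difflib.SequenceMatcher(None, a.split(), b.split(), autojunk=False)
    return matcher.ratio()


# Two near-identical notices that differ only in punctuation.
variant_a = ("This program is free software; you can redistribute it and/or "
             "modify it under the terms of the GNU General Public License")
variant_b = ("This program is free software: you can redistribute it and/or "
             "modify it under the terms of the GNU General Public License.")
# A different license text for contrast.
unrelated = ("Permission is hereby granted, free of charge, to any person "
             "obtaining a copy of this software")

same_hash = checksum(variant_a) == checksum(variant_b)
close = similarity(variant_a, variant_b)
far = similarity(variant_a, unrelated)

print("checksums equal:", same_hash)
print("similarity between variants:", round(close, 2))
print("similarity to unrelated text:", round(far, 2))
```

The checksums differ even though the two variants are the same notice for all practical purposes, while the diff ratio stays high for the variants and low for the unrelated text. The cost is that a pairwise diff against every known text is slow, hence the need to make it fast or approximate it.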



