I have thought about using Tesseract to OCR the TOC and generate something like this. But there are so many edge cases that can make the whole process fail. For example, how do you handle a title that breaks across two lines? What if the page number is not recognized correctly? For example, 10 can come out as "1o". What if there are dot leaders between the title and the page number? Maybe you could use GPT to clean up the extracted text.
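To make those edge cases concrete, here is a rough sketch of what the cleanup step might look like on text that has already been OCR'd. It is not something I actually ran in production: `parse_toc`, `DIGIT_FIXES`, and `ENTRY` are hypothetical names, the character map is an incomplete guess at common misreads, and it assumes that any line without a trailing page number is the first half of a wrapped title.

```python
import re

# Characters OCR often confuses with digits (illustrative and incomplete).
DIGIT_FIXES = str.maketrans({"o": "0", "O": "0", "l": "1", "I": "1", "S": "5"})

# A TOC entry: title, then dot leaders and/or spaces, then a page-like token
# at the end of the line. The lookahead requires at least one real digit so
# that ordinary words made of look-alike letters are not taken as page numbers.
ENTRY = re.compile(r"^(?P<title>.+?)[\s.]+(?P<page>(?=[0-9oOlIS]*\d)[0-9oOlIS]+)$")


def parse_toc(ocr_text):
    """Return a list of (title, page) pairs from raw OCR text of a TOC."""
    entries = []
    pending = ""  # title text carried over from a line that wrapped
    for raw in ocr_text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = ENTRY.match(line)
        if match is None:
            # No trailing page number: assume the title continues on the next line.
            pending = f"{pending} {line}".strip()
            continue
        title = f"{pending} {match['title']}".strip()
        page = int(match["page"].translate(DIGIT_FIXES))
        entries.append((title, page))
        pending = ""
    return entries


sample = (
    "Preface ..... 5\n"
    "A Very Long Chapter Title That\n"
    "Breaks Across Lines .... 1o\n"
)
for title, page in parse_toc(sample):
    print(page, title)
```

Even this small version shows the problem: every rule is a heuristic. A page number misread entirely as letters, a title that legitimately ends in a number, or a two-column TOC would each need yet another special case.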
In the end, I found that ChatGPT-4's multimodal capability recognizes title + page number pairs well when I feed it screenshots of the TOC, so I have settled on that.