ScribeOCR – Web interface for recognizing text, OCR, & creating digitized docs

fodkodrasz · 2025-10-10T06:47:10 1760078830

I really like the idea, but unfortunately it could not cope with my use case.

I have some lecture slides as an image-only PDF (Hungarian with a sprinkling of English and Latin (biology)). I tried the tool on them, and this was my experience:

- Proofreading with the overlay seems like a good idea, but in practice it is unusable when the original text is colored and you need to check diacritic marks. Being able to show the original in grayscale or black and white would help. (Black and white worked, but grayscale left everything colored.)

- For proofreading, the ebook mode was the most useful: I immediately spotted lots of errors that I could not see with the overlay. A quick switch between the modes would be useful.

- Editing text is not efficient when the error rate is high (Hungarian is not supported, which I guess caused most of the errors); the interface has high overhead for mass corrections.

Very good idea. I think after a little polish it would even fit my use case. For more traditional OCR use cases than mine it is probably already great.
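To make the grayscale vs. black-and-white request concrete, here is a minimal sketch of the two conversions on raw RGB pixels. It is not how ScribeOCR renders its overlay; the function names and the threshold of 128 are assumptions chosen for illustration, and the luminance weights are the common ITU-R BT.601 ones.

```python
def to_grayscale(pixel):
    """Map an (R, G, B) pixel to one 0-255 luminance value (BT.601 weights)."""
    r, g, b = pixel
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def to_black_and_white(pixel, threshold=128):
    """Binarize a pixel: 0 (black) below the threshold, 255 (white) otherwise.

    The threshold is an illustrative default, not a value taken from any tool.
    """
    return 0 if to_grayscale(pixel) < threshold else 255


def convert_image(rows, mode="gray"):
    """Convert a nested list of RGB pixels; mode is "gray" or "bw"."""
    convert = to_grayscale if mode == "gray" else to_black_and_white
    return [[convert(pixel) for pixel in row] for row in rows]


if __name__ == "__main__":
    # One row: red text pixel, yellow highlight pixel, white background pixel.
    slide = [[(255, 0, 0), (255, 255, 0), (255, 255, 255)]]
    print(convert_image(slide, "gray"))
    print(convert_image(slide, "bw"))
```

Either mode removes the hue, so colored slide text and the overlaid OCR text would differ only in brightness, which is what makes small diacritics easier to compare.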

zihotki · 2025-10-10T06:12:33 1760076753

According to what I read in the documentation, it uses Tesseract underneath. I used Tesseract v3 in the past and it was a pain. Tesseract 4 uses an LSTM neural net. How good are the performance and recognition quality in v4 nowadays? Could anyone share their experience?

graynk · 2025-10-10T08:18:26 1760084306

I use paperless-ngx for digitizing all my documents; it also uses Tesseract. The result is not perfect, but more than acceptable if I scan at 600 dpi.

oigursh · 2025-10-13T13:41:23 1760362883

There's https://github.com/icereed/paperless-gpt as a plugin

graynk · 2025-10-27T12:25:15 1761567915

I've found local LLMs not good enough for OCR (while being a lot more resource hungry), and I want to avoid OpenAI models for privacy reasons. Default Tesseract does the job for me, since my only requirement for the results is "I can easily find what I need with full-text search". I rarely need to actually copy the text from the resulting PDFs.

btian · 2025-10-10T23:42:54 1760139774

It's fine for simple use cases, but far inferior to the likes of GPT, Gemini, or Mistral.

aidenn0 · 2025-10-10T05:01:26 1760072486

This is my first encounter with Scribe.js; since I have many book scans, I always try OCRing them when I see a tool like this. Compared to Tesseract (which is the best I have so far), it gets the words right slightly more often, but the paragraph segmentation is many times worse. On a book where every paragraph is indented, it reliably decides that two consecutive one-line paragraphs are the same paragraph, which is understandable, but a downgrade from Tesseract, which gets the paragraph segmentation as correct as possible. (It doesn't handle paragraphs that span page breaks, since I'm feeding it one page at a time.)
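A rough sketch of why consecutive one-line paragraphs are hard to split on indentation alone. This is not the algorithm used by Scribe.js or Tesseract; `split_paragraphs`, its `indent_threshold`, and the pixel coordinates are all hypothetical, picked only to illustrate the failure mode.

```python
def split_paragraphs(lines, margin=None, indent_threshold=10):
    """Group OCR lines into paragraphs using first-line indentation.

    lines: list of (left_x, text) tuples in reading order.
    margin: x position of the text block's left edge; if None it is
            estimated as the smallest left_x seen on the page.
    A line starting more than indent_threshold past the margin is treated
    as the first line of a new paragraph.
    """
    if not lines:
        return []
    if margin is None:
        margin = min(x for x, _ in lines)
    paragraphs = []
    for x, text in lines:
        if not paragraphs or x - margin > indent_threshold:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
    return [" ".join(parts) for parts in paragraphs]


if __name__ == "__main__":
    # Two one-line paragraphs, both indented, and no unindented line nearby.
    page = [(50, '"Who is there?"'), (50, '"Nobody."')]
    print(split_paragraphs(page))             # margin estimated from the page
    print(split_paragraphs(page, margin=20))  # margin known from elsewhere
```

When the margin has to be estimated from a region that contains only indented lines, both lines look flush left and get merged; with the true margin supplied they are split correctly.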

zihotki · 2025-10-10T06:15:35 1760076935

Scribe is Tesseract. It uses tesseract.js, which is a WebAssembly port of Tesseract, so in theory they should be equal. In practice, custom settings or older versions could make a difference.

aidenn0 · 2025-10-10T13:54:08 1760104448

This is only true in the "speed" mode; in the "quality" mode it claims better word recognition than Tesseract on clean scans (which matches my tests): https://github.com/scribeocr/scribe.js/blob/master/docs/scri...

criddell · 2025-10-10T12:51:27 1760100687

What's the motivation for doing this in the browser? It seems like intentionally choosing a more difficult path to create an inferior result.

A native macOS or Windows application could use the OCR facilities of the operating system and, in my experience, both produce results that are far better than Tesseract.

Zardoz84 · 2025-10-10T13:24:01 1760102641

To generate the OCR on the fly, in the browser, when you do not have proper OCR data. As someone who works on public web libraries, I find it useful (but wasteful).

Elucalidavah · 2025-10-10T06:10:40 1760076640

> Tesseract (which is the best I have so far)

Have you looked at EasyOCR?

aidenn0 · 2025-10-10T13:52:27 1760104347

EasyOCR is significantly worse than Tesseract for clean printed text, while being orders of magnitude slower; it is far better than Tesseract for low-quality clean scans and for extracting text from pictures (e.g. comics), which Tesseract does not handle as well.

criddell · 2025-10-10T14:21:37 1760106097

Have you tried Abbyy FineReader? It's the best OCR package I've seen.

aidenn0 · 2025-10-11T15:33:45 1760196825

It doesn't seem to have a Linux version; I don't have a Mac or Windows machine.

ranger_danger · 2025-10-10T03:27:39 1760066859

This is awesome. Only issue was I had to disable my JShelter extension because it would freeze the page using 100% CPU forever.

constantinum · 2025-10-10T07:00:15 1760079615

Anyone looking for an OCR or text pre-processor that maintains the layout (tables, forms), try LLMWhisperer > https://pg.llmwhisperer.unstract.com/

Zardoz84 · 2025-10-10T10:09:13 1760090953

If it would generate ALTO XML files... IF!
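For readers unfamiliar with ALTO, here is a minimal sketch of what wrapping OCR word boxes in an ALTO-style skeleton could look like. It is not a ScribeOCR feature; `words_to_alto` and its input format are hypothetical, the output is a bare Layout/Page/PrintSpace/TextBlock/TextLine/String tree using the ALTO v4 namespace, and it has not been validated against the official schema.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def words_to_alto(words, page_width, page_height):
    """Serialize one line of OCR words as a minimal ALTO-style document.

    words: list of dicts with keys text, x, y, w, h (pixels). This input
    format is an assumption for illustration, not a real tool's output.
    """
    ET.register_namespace("", ALTO_NS)

    def tag(name):
        return f"{{{ALTO_NS}}}{name}"

    root = ET.Element(tag("alto"))
    layout = ET.SubElement(root, tag("Layout"))
    page = ET.SubElement(layout, tag("Page"), {
        "ID": "page_1", "PHYSICAL_IMG_NR": "1",
        "WIDTH": str(page_width), "HEIGHT": str(page_height),
    })
    space = ET.SubElement(page, tag("PrintSpace"))
    block = ET.SubElement(space, tag("TextBlock"), {"ID": "block_1"})

    # The line's bounding box is the union of its word boxes.
    left = min(w["x"] for w in words)
    top = min(w["y"] for w in words)
    right = max(w["x"] + w["w"] for w in words)
    bottom = max(w["y"] + w["h"] for w in words)
    line = ET.SubElement(block, tag("TextLine"), {
        "HPOS": str(left), "VPOS": str(top),
        "WIDTH": str(right - left), "HEIGHT": str(bottom - top),
    })
    for w in words:
        ET.SubElement(line, tag("String"), {
            "CONTENT": w["text"], "HPOS": str(w["x"]), "VPOS": str(w["y"]),
            "WIDTH": str(w["w"]), "HEIGHT": str(w["h"]),
        })
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = [
        {"text": "Hello", "x": 10, "y": 20, "w": 50, "h": 12},
        {"text": "world", "x": 70, "y": 22, "w": 48, "h": 12},
    ]
    print(words_to_alto(sample, page_width=600, page_height=800))
```

A real export would also need the Description section, multiple lines and blocks, and stable IDs, all of which this sketch leaves out.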