OCR image to text

Discussion:

Bill Powell

2024-07-14 00:49:34 UTC

I have a series of one-page images that are really images and not text even
though they look like they're just a page of simple text in the same font.

Is there a way to easily OCR a PDF to actual text on Windows for free?

Geoff

2024-07-14 01:32:09 UTC

Post by Bill Powell
I have a series of one-page images that are really images and not text even
though they look like they're just a page of simple text in the same font.
Is there a way to easily OCR a PDF to actual text on Windows for free?

https://letmegooglethat.com/?q=free+ocr+to+pdf

geoff

Peter

2024-07-14 02:03:21 UTC

Post by Geoff

Post by Bill Powell
Is there a way to easily OCR a PDF to actual text on Windows for free?

https://letmegooglethat.com/?q=free+ocr+to+pdf
geoff

You've never actually run that search, have you?
If you did, you'd know all you'll get are advertising shills.
All of which are online PDF converters which are huge privacy scams.

As far as I am aware, there is only one free Windows OCR converter extant.
That's GNU OCR (GOCR, aka JOCR) https://jocr.sourceforge.net/

The gocr help just says it works on "pnm,pgm,pbm,ppm,pcx..." files.
https://jocr.sourceforge.net/examples.html
https://www-e.ovgu.de/jschulen/ocr/download.html
"Windows-binary gocr049.exe" v0.49 154kB by Peter B L Meijer, Oct 2010
http://www-e.uni-magdeburg.de/jschulen/ocr/gocr049.exe
Name: gocr049.exe
Size: 153600 bytes (150 KiB)
SHA256: 1FFC4CD29A5B275F40FBC5F6F9194ED72B8D2BCCBD46019F088C9E5DE2923F59
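
For anyone who wants to check a downloaded copy against the hash listed above, here is a minimal Python sketch. It assumes the file was saved locally as gocr049.exe; the function names are made up for illustration and are not part of gocr.

```python
import hashlib

# Hash quoted in the post above for gocr049.exe.
EXPECTED_SHA256 = "1FFC4CD29A5B275F40FBC5F6F9194ED72B8D2BCCBD46019F088C9E5DE2923F59"


def sha256_of_file(path, chunk_size=65536):
    """Return the uppercase hex SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file's digest equals the expected hash (case-insensitive)."""
    return sha256_of_file(path) == expected.upper()
```

Calling `matches_expected("gocr049.exe")` should return True only if the download is byte-for-byte the file described above.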

gocr049.exe
Optical Character Recognition --- gocr 0.49 20100924
Copyright (C) 2001-2010 Joerg Schulenburg GPG=1024D/53BDFBE3
released under the GNU General Public License
use option -h for help

gocr049.exe -h
Optical Character Recognition --- gocr 0.49 20100924
Copyright (C) 2001-2010 Joerg Schulenburg GPG=1024D/53BDFBE3
released under the GNU General Public License
using: gocr [options] pnm_file_name # use - for stdin
options (see gocr manual pages for more details):
-h, --help
-i name - input image file (pnm,pgm,pbm,ppm,pcx,...)
-o name - output file (redirection of stdout)
-e name - logging file (redirection of stderr)
-x name - progress output to fifo (see manual)
-p name - database path including final slash (default is ./db/)
-f fmt - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
-l num - threshold grey level 0<160<=255 (0 = autodetect)
-d num - dust_size (remove small clusters, -1 = autodetect)
-s num - spacewidth/dots (0 = autodetect)
-v num - verbose (see manual page)
-c string - list of chars (debugging, see manual)
-C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
-m num - operation modes (bitpattern, see manual)
-a num - value of certainty (in percent, 0..100, default=95)
-u string - output this string for every unrecognized character
examples:
gocr -m 4 text1.pbm # do layout analyzis
gocr -m 130 -p ./database/ text1.pbm # extend database
djpeg -pnm -gray text.jpg | gocr - # use jpeg-file via pipe

webpage: http://jocr.sourceforge.net/

When I tested it just now, it ran, but it makes spelling errors even on
perfectly clean text, so it doesn't work well.

a. I couldn't get gocr to convert a docx or pdf to anything
gocr049.exe -i "testpage.docx" -o testpage.txt -f UTF8
b. Then I couldn't get ImageMagick to convert the PDF to anything
convert testpage.pdf testpage.pnm
c. So I saved testpage.pdf as testpage.png for ImageMagick to convert
convert testpage.png testpage.pnm
d. gocr049.exe -i "testpage.pnm" -o testpage.txt -f UTF8
(it had a tremendous amount of spelling errors, but it worked)
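
Steps c and d above can be chained in a short script. This is a minimal Python sketch, assuming ImageMagick and gocr049.exe are both on PATH; the function names are hypothetical. ImageMagick 7 names its command `magick`, and on Windows a bare `convert` can resolve to the unrelated system utility of the same name, so the executable name is left as a parameter. ImageMagick also relies on Ghostscript to read PDFs, which may be why step b failed, though that is only a guess.

```python
import subprocess
from pathlib import Path


def build_commands(png_path, gocr_exe="gocr049.exe", convert_exe="convert"):
    """Return the two command lines from steps c and d: PNG -> PNM -> text."""
    png = Path(png_path)
    pnm = png.with_suffix(".pnm")
    txt = png.with_suffix(".txt")
    to_pnm = [convert_exe, str(png), str(pnm)]
    to_txt = [gocr_exe, "-i", str(pnm), "-o", str(txt), "-f", "UTF8"]
    return to_pnm, to_txt


def ocr_png(png_path, **kwargs):
    """Run both commands in order and return the path of the text file."""
    for command in build_commands(png_path, **kwargs):
        # check=True raises CalledProcessError if either tool fails.
        subprocess.run(command, check=True)
    return Path(png_path).with_suffix(".txt")
```

For example, `ocr_png("testpage.png", convert_exe="magick")` would write testpage.pnm and then testpage.txt next to the input file.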

As far as I'm aware, there is no other Windows OCR freeware extant.

Geoff

2024-07-14 06:56:55 UTC

Post by Peter

Post by Geoff

Post by Bill Powell
Is there a way to easily OCR a PDF to actual text on Windows for free?

https://letmegooglethat.com/?q=free+ocr+to+pdf
geoff

You've never actually run that search, have you?
If you did, you'd know all you'll get are advertising shills.
All of which are online PDF converters which are huge privacy scams.

You didn't specify 'no online converters'. Do you think this one (found in
the search) is a privacy scam, or are you worried about potentially
exposing something very sensitive in the 'online' scenario?

https://www.adobe.com/acrobat/online/ocr-pdf.html

geoff

Abandoned Trolley

2024-07-14 05:26:41 UTC

There's an OCR reader / converter thing built in to MS Word (it might be
called MS Lens?) if it's any help, but a clear explanation of what you
have and what you want might be more useful.

You say you have a series of one-page images, which I assume are digital
files and not bits of paper?

If they are images, then they might be JPEGs or something, but you don't say.

Or they might be PDFs, as you say you want to "easily OCR a PDF to
actual text".

If they are PDFs, then simply cut and paste?

Alan Browne

2024-07-14 12:15:21 UTC

There are plenty of free online converters - of course you're exposing
content to a third party. Be mindful of what is in the doc.

--
"It would be a measureless disaster if Russian barbarism overlaid
the culture and independence of the ancient States of Europe."
Winston Churchill

Geoff Realname

2024-07-23 11:15:00 UTC

Late to the party here, but what about FreeOCR?
http://www.paperfile.net/index.html
It's a bit ancient, but it certainly works well for my needs. Also,
though I haven't tried it, IrfanView includes OCR capabilities
https://www.irfanview.com/

--
I would be unstoppable if I could get started.

Abandoned Trolley

2024-07-23 15:42:52 UTC

Post by Geoff Realname

As I said about a week ago, there's no clear explanation from the OP of
what he has and what he wants.

Matti Haveri

2024-08-24 15:06:57 UTC

Tesseract via the terminal is one option.

https://github.com/tesseract-ocr/tesseract
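
A minimal Python sketch of calling Tesseract from a script, assuming the `tesseract` executable is installed and on PATH and that the page has already been saved as an image (Tesseract reads image files such as PNG or TIFF, not PDFs). The function names are made up for illustration.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the command line; 'stdout' as the output base prints the text."""
    return ["tesseract", str(image_path), "stdout", "-l", lang]


def ocr_with_tesseract(image_path, lang="eng"):
    """Run Tesseract on one image and return the recognized text."""
    if shutil.which("tesseract") is None:
        raise FileNotFoundError("tesseract executable not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if Tesseract reports failure
    )
    return result.stdout
```

For example, `print(ocr_with_tesseract("testpage.png"))` would print the recognized text for a single page image.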

--
Matti Haveri <mattiDOThaveriATgmailDOTeiroskaaDOTcom> remove ei roskaa