Post by Geoff
Post by Bill Powell
Is there a way to easily OCR a PDF to actual text on Windows for free?
https://letmegooglethat.com/?q=free+ocr+to+pdf
geoff
You've never actually run that search, have you?
If you had, you'd know that all you get are advertising shills.
All of them are online PDF converters, which are huge privacy scams.
As far as I am aware, there is only one free Windows OCR converter extant.
That's GNU OCR (GOCR, aka JOCR) https://jocr.sourceforge.net/
The gocr help just says it works on "pnm,pgm,pbm,ppm,pcx..." files.
https://jocr.sourceforge.net/examples.html
https://www-e.ovgu.de/jschulen/ocr/download.html
"Windows-binary gocr049.exe" v0.49 154kB by Peter B L Meijer, Oct 2010
http://www-e.uni-magdeburg.de/jschulen/ocr/gocr049.exe
Name: gocr049.exe
Size: 153600 bytes (150 KiB)
SHA256: 1FFC4CD29A5B275F40FBC5F6F9194ED72B8D2BCCBD46019F088C9E5DE2923F59
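To check a downloaded copy against the hash above, here is a minimal Python sketch. Python is an assumption (nothing here needs it), and the function names are made up for illustration.

```python
import hashlib

# SHA256 of gocr049.exe as posted above.
EXPECTED = "1FFC4CD29A5B275F40FBC5F6F9194ED72B8D2BCCBD46019F088C9E5DE2923F59"


def sha256_of(path, chunk_size=65536):
    """Return the uppercase hex SHA-256 digest of the file at path."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


def matches_posted_hash(path):
    """True if the file's digest equals the hash posted above."""
    return sha256_of(path) == EXPECTED
```

Calling matches_posted_hash("gocr049.exe") in the download folder should return True if the file is the same one I hashed.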
gocr049.exe
Optical Character Recognition --- gocr 0.49 20100924
Copyright (C) 2001-2010 Joerg Schulenburg GPG=1024D/53BDFBE3
released under the GNU General Public License
use option -h for help
gocr049.exe -h
Optical Character Recognition --- gocr 0.49 20100924
Copyright (C) 2001-2010 Joerg Schulenburg GPG=1024D/53BDFBE3
released under the GNU General Public License
using: gocr [options] pnm_file_name # use - for stdin
options (see gocr manual pages for more details):
-h, --help
-i name - input image file (pnm,pgm,pbm,ppm,pcx,...)
-o name - output file (redirection of stdout)
-e name - logging file (redirection of stderr)
-x name - progress output to fifo (see manual)
-p name - database path including final slash (default is ./db/)
-f fmt - output format (ISO8859_1 TeX HTML XML UTF8 ASCII)
-l num - threshold grey level 0<160<=255 (0 = autodetect)
-d num - dust_size (remove small clusters, -1 = autodetect)
-s num - spacewidth/dots (0 = autodetect)
-v num - verbose (see manual page)
-c string - list of chars (debugging, see manual)
-C string - char filter (ex. hexdigits: 0-9A-Fx, only ASCII)
-m num - operation modes (bitpattern, see manual)
-a num - value of certainty (in percent, 0..100, default=95)
-u string - output this string for every unrecognized character
examples:
gocr -m 4 text1.pbm # do layout analyzis
gocr -m 130 -p ./database/ text1.pbm # extend database
djpeg -pnm -gray text.jpg | gocr - # use jpeg-file via pipe
webpage: http://jocr.sourceforge.net/
When I tested it just now, it ran, but it makes spelling errors even on
perfectly clean text, so while it works, it doesn't work well.
a. I couldn't get gocr to convert a docx or pdf to anything
(its help lists only image formats as input)
gocr049.exe -i "testpage.docx" -o testpage.txt -f UTF8
b. Then I couldn't get ImageMagick to convert the pdf to anything
convert testpage.pdf testpage.pnm
c. So I saved testpage.pdf as testpage.png for ImageMagick to convert
convert testpage.png testpage.pnm
d. gocr049.exe -i "testpage.pnm" -o testpage.txt -f UTF8
(it had a tremendous amount of spelling errors, but it worked)
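Steps c and d can be scripted. This is a rough Python sketch, assuming ImageMagick's convert and gocr049.exe are both on the PATH. The function names are made up for illustration, and newer ImageMagick releases call the command magick rather than convert. ImageMagick normally hands PDF rendering to Ghostscript, which may be why step b failed, but that is a guess.

```python
import subprocess
from pathlib import Path


def build_ocr_commands(png_path, gocr_exe="gocr049.exe", convert_exe="convert"):
    """Return the command lines for steps c and d, given a PNG page image."""
    png = Path(png_path)
    pnm = png.with_suffix(".pnm")  # intermediate image that gocr can read
    txt = png.with_suffix(".txt")  # OCR output
    to_pnm = [convert_exe, str(png), str(pnm)]
    to_text = [gocr_exe, "-i", str(pnm), "-o", str(txt), "-f", "UTF8"]
    return [to_pnm, to_text]


def run_ocr(png_path):
    """Run both steps in order, stopping if either tool reports an error."""
    for cmd in build_ocr_commands(png_path):
        subprocess.run(cmd, check=True)
```

run_ocr("testpage.png") would then leave testpage.pnm and testpage.txt next to the PNG. It does nothing about the recognition quality.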
As far as I'm aware, there is no other Windows OCR freeware extant.