
Basic usage of tesseract-ocr to extract text from images

news 2025/4/27 16:40:06

Visit the tessdoc Introduction page (Introduction | tessdoc) to download the Windows installer and the trained data for other languages.

Installer download page: Home · UB-Mannheim/tesseract Wiki · GitHub

Trained data for other languages: Traineddata Files for Version 4.00 + | tessdoc

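1. Download Tesseract-OCR
Download URL: https://github.com/UB-Mannheim/tesseract/releases/download/v5.4.0.20240606/tesseract-ocr-w64-setup-5.4.0.20240606.exe

2. Install Tesseract-OCR

3. Add the environment variable
Add the Tesseract-OCR installation path to the Path environment variable.

4. Download the Simplified Chinese trained data (chi_sim)
Download URL: https://github.com/tesseract-ocr/tessdata/raw/4.00/chi_sim.traineddata

5. Place the downloaded chi_sim.traineddata file in the tessdata directory under the Tesseract-OCR installation directory
Note: this tessdata directory already contains eng.traineddata by default.

6.1. Test: extract English text from an image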
Screenshot source: https://tesseract-ocr.github.io/tessdoc/Installation.html
Command: tesseract test1.png test1-output
Output file: test1-output.txt

6.2. Test: extract Chinese text from an image
Screenshot source: https://alk.12348.gov.cn/Detail?dbID=75&dbName=CWZC&sysID=152
Command: tesseract test2.png test2-output -l chi_sim
Output file: test2-output.txt

7. Note
Text extracted by Tesseract-OCR is not guaranteed to be accurate, so check the output against the source image.
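
The two test commands in step 6 can also be driven from a script. The following is a minimal Python sketch, not part of the original walkthrough. It assumes the tesseract executable is on the Path (step 3) and that the image files from the tests exist in the working directory. The helper names build_command, run_ocr, parse_langs and has_language are hypothetical, introduced here for illustration only. The language check relies on the CLI's --list-langs option and assumes its first output line is a header followed by one language name per line.

```python
import shutil
import subprocess
from pathlib import Path


def build_command(image, output_base, lang=None):
    """Build the tesseract argument list used in step 6 (hypothetical helper)."""
    cmd = ["tesseract", str(image), str(output_base)]
    if lang:
        # Same as appending "-l chi_sim" on the command line.
        cmd += ["-l", lang]
    return cmd


def parse_langs(listing):
    """Parse `tesseract --list-langs` output.

    Assumption: the first non-empty line is a header and every
    following non-empty line is one language name.
    """
    lines = [ln.strip() for ln in listing.splitlines() if ln.strip()]
    return lines[1:]


def has_language(lang):
    """Return True if the given trained data (e.g. chi_sim) is installed."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    return lang in parse_langs(result.stdout)


def run_ocr(image, output_base, lang=None):
    """Run tesseract and return the text written to <output_base>.txt."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract not found on Path (see step 3)")
    if lang and not has_language(lang):
        raise RuntimeError(f"{lang}.traineddata not found (see step 5)")
    subprocess.run(build_command(image, output_base, lang), check=True)
    # tesseract appends ".txt" to the output base name and writes UTF-8.
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")
```

Under these assumptions, run_ocr("test1.png", "test1-output") corresponds to the command in 6.1, and run_ocr("test2.png", "test2-output", lang="chi_sim") to the command in 6.2. The returned text is subject to the same accuracy caveat as the command-line output.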
