update README

Yi Ge 2019-04-29 23:16:37 +02:00
parent 04ad4597ff
commit 7ebf6a17f8
1 changed files with 24 additions and 14 deletions


@@ -2,7 +2,7 @@
 Extract hardcoded subtitles from videos using the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR engine with Python.
-Input video with hardcoded subtitles:
+Input a video with hardcoded subtitles:
 <p float="left">
 <img width="430" alt="screenshot" src="https://user-images.githubusercontent.com/10210967/56873658-3b76dd00-6a34-11e9-95c6-cd6edc721f58.png">
@@ -12,43 +12,48 @@ Input video with hardcoded subtitles:
 ```python
 import videocr
-print(videocr.get_subtitles('video.avi', lang='HanS'))
+print(videocr.get_subtitles('video.avi', lang='chi_sim+eng', sim_threshold=70))
 ```
 Output:
 ```
 0
-00:00:00,000 --> 00:00:02,711
--谢谢 … 你 好 -谢谢
-Thank you...Hi. Thanks.
-1
-00:00:02,794 --> 00:00:04,879
-喝 点 什么 ?
+00:00:01,042 --> 00:00:02,877
+喝 点 什么 ?
 What can I get you?
+1
+00:00:03,044 --> 00:00:05,463
+我 不 知道
+Um, I'm not sure.
 2
-00:00:05,046 --> 00:00:12,554
+00:00:08,091 --> 00:00:10,635
 休闲 时 光 …
 For relaxing times, make it...
 3
-00:00:12,804 --> 00:00:14,723
+00:00:10,677 --> 00:00:12,595
 三 得 利 时 光
 Bartender, Bob Suntory time.
 4
-00:00:16,474 --> 00:00:19,144
+00:00:14,472 --> 00:00:17,142
+我 要 一 杯 伏特 加
 Un, I'll have a vodka tonic.
 5
-00:00:19,394 --> 00:00:20,687
+00:00:18,059 --> 00:00:19,019
 谢谢
 Laughs Thanks.
 ```
+## Performance
+The OCR process runs in parallel and is CPU intensive. It takes 3 minutes on my dual-core laptop to extract a 20 seconds video. You may want more cores for longer videos.
 ## API
 ```python
@@ -71,7 +76,11 @@ Write subtitles to `file_path`. If the file does not exist, it will be created a
 - `lang`
-Language of the subtitles in the video. Besides `eng` for English, all language codes on [this page](https://github.com/tesseract-ocr/tessdata_best/tree/master/script) are supported.
+The language of the subtitles in the video. All language codes on [this page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400-november-29-2016) (e.g. `'eng'` for English) and all script names in [this repository](https://github.com/tesseract-ocr/tessdata_fast/tree/master/script) (e.g. `'HanS'` for simplified Chinese) are supported.
+Note that you can use more than one language. For example, `'hin+eng'` means using Hindi and English together for recognition. More details are available in the [Tesseract documentation](https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#using-multiple-languages).
+Language data files will be automatically downloaded to your `$HOME/tessdata` directory when necessary. You can read more about Tesseract language data files on their [wiki page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files).
 - `time_start` and `time_end`
@@ -92,3 +101,4 @@ Write subtitles to `file_path`. If the file does not exist, it will be created a
 - `use_fullframe`
 By default, only the bottom half of each frame is used for OCR. You can explicitly use the full frame if your subtitles are not within the bottom half of each frame.
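
The `use_fullframe` behaviour described above can be illustrated with a minimal sketch. This is not videocr's actual implementation: `crop_for_ocr` is a hypothetical helper, a frame is modelled as a plain list of pixel rows, and the handling of odd frame heights is an assumption.

```python
from typing import List, Sequence

Row = Sequence[int]


def crop_for_ocr(frame: Sequence[Row], use_fullframe: bool = False) -> List[Row]:
    """Return the rows of `frame` that would be handed to the OCR step.

    Hypothetical helper for illustration only; not part of videocr.
    """
    if use_fullframe:
        # Keep every row when subtitles may appear anywhere in the frame.
        return list(frame)
    height = len(frame)
    # Assumption: with an odd height, the middle row stays with the bottom half.
    return list(frame[height // 2:])


# A tiny 4-row "frame"; each inner list stands in for one row of pixels.
frame = [[0, 0], [1, 1], [2, 2], [3, 3]]
print(crop_for_ocr(frame))                      # bottom two rows only
print(crop_for_ocr(frame, use_fullframe=True))  # all four rows
```

Restricting OCR to the lower rows reduces the area that has to be recognised, which is presumably why it is the default; the full frame is only needed when subtitles sit elsewhere.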