update README

Yi Ge 2019-04-30 17:59:10 +02:00
parent 95da09ab7e
commit eb29dd4d90

@@ -1,6 +1,6 @@
 # videocr
-Extract hardcoded subtitles from videos using the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR engine with Python.
+Extract hardcoded (burned-in) subtitles from videos using the [Tesseract](https://github.com/tesseract-ocr/tesseract) OCR engine with Python.
 Input a video with hardcoded subtitles:
@@ -10,11 +10,16 @@ Input a video with hardcoded subtitles:
 </p>
 ```python
+# print_sub.py
 import videocr
-print(videocr.get_subtitles('video.avi', lang='chi_sim+eng', sim_threshold=70))
+if __name__ == '__main__':
+    print(videocr.get_subtitles('video.avi', lang='chi_sim+eng', sim_threshold=70))
 ```
+`$ python3 print_sub.py`
 Output:
 ```
@@ -47,7 +52,6 @@ Un, I'll have a vodka tonic.
 00:00:18,059 --> 00:00:19,019
 谢谢
 Laughs Thanks.
 ```
 ## Performance
@@ -60,7 +64,6 @@ The OCR process runs in parallel and is CPU intensive. It takes 3 minutes on my
 2. `$ pip install videocr`
 ## API
```python ```python
@@ -83,7 +86,7 @@ Write subtitles to `file_path`. If the file does not exist, it will be created a
 - `lang`
-The language of the subtitles in the video. All language codes on [this page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400-november-29-2016) (e.g. `'eng'` for English) and all script names in [this repository](https://github.com/tesseract-ocr/tessdata_fast/tree/master/script) (e.g. `'HanS'` for simplified Chinese) are supported.
+The language of the subtitles. You can extract subtitles in almost any language. All language codes on [this page](https://github.com/tesseract-ocr/tesseract/wiki/Data-Files#data-files-for-version-400-november-29-2016) (e.g. `'eng'` for English) and all script names in [this repository](https://github.com/tesseract-ocr/tessdata_fast/tree/master/script) (e.g. `'HanS'` for simplified Chinese) are supported.
 Note that you can use more than one language. For example, `'hin+eng'` means using Hindi and English together for recognition. More details are available in the [Tesseract documentation](https://github.com/tesseract-ocr/tesseract/wiki/Command-Line-Usage#using-multiple-languages).
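The `lang` description above says several languages can be combined with `+`. A minimal sketch of building such a string follows. `build_lang` and `print_subtitles` are hypothetical helpers written for illustration and are not part of videocr; only `videocr.get_subtitles` and its `lang` and `sim_threshold` arguments come from the README example in this diff.

```python
def build_lang(*codes):
    """Join Tesseract language codes or script names into one `lang` string.

    Hypothetical helper, not part of videocr.
    """
    if not codes:
        raise ValueError('at least one language code is required')
    return '+'.join(codes)


def print_subtitles(video_path, *codes):
    """Print the subtitles of `video_path`, recognised with the given languages."""
    # Imported lazily so build_lang() can be used without videocr installed.
    import videocr
    # sim_threshold=70 mirrors the README example; it may need tuning per video.
    print(videocr.get_subtitles(video_path, lang=build_lang(*codes),
                                sim_threshold=70))
```

Calling `print_subtitles('video.avi', 'hin', 'eng')` would pass `lang='hin+eng'`. As in the updated `print_sub.py`, such a call probably belongs under an `if __name__ == '__main__':` guard.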
@@ -108,4 +111,3 @@ Write subtitles to `file_path`. If the file does not exist, it will be created a
 - `use_fullframe`
 By default, only the bottom half of each frame is used for OCR. You can explicitly use the full frame if your subtitles are not within the bottom half of each frame.