Debugging issues with TVEHD (Spanish DVB)

CCExtractor Development

CCExtractor supports many different standards for extracting captions and subtitles from video files, in almost any language. Occasionally, however, we still run into recordings it cannot handle, and this is one of them: the samples we received either crash CCExtractor or produce garbage output.

The issue linked below contains the two samples we have. To solve this task, we'd like you to dig into why these samples cause problems.

We have already done some digging ourselves, and we are certain that the problem is related to OCR. We reached this conclusion because, without Tesseract, the XML files are generated just fine, although that leaves us with only a set of images. With Tesseract, CCExtractor crashes almost immediately. When exporting to .srt, the file generated before the crash contains garbage.
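The comparison described above (one run with OCR, one without) can be made repeatable with a small harness. The following is only a sketch: the helper names classify_run and garbage_ratio are hypothetical, the crash detection is a heuristic based on process exit codes, and the actual CCExtractor command lines are deliberately not shown, so look up the options for your build before using it.

```python
import subprocess
import sys


def classify_run(cmd, timeout=120):
    """Run cmd (a list of arguments) and classify how the process ended."""
    try:
        proc = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return "timeout"
    code = proc.returncode
    if code == 0:
        return "ok"
    # Heuristic (assumption): on POSIX a negative code means the process was
    # killed by a signal such as SIGSEGV; on Windows, fatal exceptions are
    # usually reported as large codes like 0xC0000005.
    if code < 0 or code >= 0xC0000000:
        return "crash"
    return "error"


def garbage_ratio(text):
    """Fraction of characters that are unprintable or U+FFFD placeholders."""
    if not text:
        return 0.0
    bad = 0
    for ch in text:
        if ch == "\ufffd":
            bad += 1
        elif not (ch.isprintable() or ch.isspace()):
            bad += 1
    return bad / len(text)
```

Call classify_run once with the command line that enables OCR and once with the one that does not, then pass the contents of any partial .srt output to garbage_ratio. Accented Spanish characters count as valid text, so a high ratio points at real corruption rather than at the language.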

Running under Visual Studio gave us no further pointers to where in our code the crash happens, so we assume it occurs inside one of the libraries.

To complete this task you need some proficiency with running a program under a debugger, so that you can trace the issue back to its origin.

We expect either a report explaining why it is impossible to extract captions from these samples or, if extraction turns out to be possible, the root cause of the failure. Bonus points if you can open a PR with a fix.

https://github.com/CCExtractor/ccextractor/issues/279

Task tags

spanish
debugging
ocr
crash
dvb

Students who completed this task

Harry Yu

Task type

Code
Quality Assurance