CCExtractor Development

Debugging issues with TVEHD (Spanish DVB)

CCExtractor supports many different standards for extracting captions and subtitles from video files, in almost any language. Occasionally, however, we still run into recordings it cannot handle, and this is one such case: the samples we received either crash CCExtractor or produce garbage output.

The issue linked in the external URL provides the two samples we have. To solve this task, we'd like you to dig into why these samples cause problems.

We have already done a bit of digging ourselves, and we are certain that the problem is related to OCR. We reached this conclusion because, without Tesseract, the XML files are generated just fine, although that leaves us with nothing but a bunch of images. With Tesseract, there is a quick crash. When exporting to .srt, the file generated before the crash contains garbage.
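When comparing runs with and without OCR, it can help to record the two symptoms described above in a repeatable way. The following is only a minimal sketch in Python: the names `garbage_ratio`, `summarize` and `run_sample`, as well as the 0.3 threshold, are hypothetical and not part of CCExtractor. The command line is supplied by the caller, so no CCExtractor options are assumed here. The heuristic only detects encoding-level garbage (control characters, invalid bytes); readable but nonsensical OCR text would still need a manual look.

```python
import subprocess
import unicodedata
from pathlib import Path


def garbage_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are control characters,
    unassigned/private-use/surrogate code points, or U+FFFD replacements."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    bad = sum(
        1
        for c in chars
        if c == "\ufffd" or unicodedata.category(c) in ("Cc", "Cn", "Co", "Cs")
    )
    return bad / len(chars)


def summarize(returncode: int, srt_text: str, threshold: float = 0.3) -> dict:
    """Record both symptoms. A nonzero exit code is treated as an abnormal
    exit; it does not by itself prove a crash (assumption: 0 means success)."""
    return {
        "abnormal_exit": returncode != 0,
        "garbage": garbage_ratio(srt_text) > threshold,  # hypothetical cutoff
    }


def run_sample(cmd: list, srt_path: str) -> dict:
    """Run the caller-supplied command, then inspect whatever .srt exists."""
    result = subprocess.run(cmd, capture_output=True)
    path = Path(srt_path)
    # Invalid UTF-8 bytes become U+FFFD, which garbage_ratio counts as bad.
    text = path.read_text(encoding="utf-8", errors="replace") if path.exists() else ""
    return summarize(result.returncode, text)
```

Calling `run_sample` once per sample and per output format gives a small table of results to attach to the report.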

Debugging with Visual Studio gave us no further pointers to where in our code it crashes, so we assume the crash happens inside one of the libraries.

To complete this task you need some proficiency with running a debugger on a program, so you can try to trace back the origin of the issue.

We expect either a report explaining why it is impossible to extract captions from these samples or, if it turns out to be possible, the root cause of the failure. Bonus points if you can open a PR with a fix.

Task tags

  • spanish
  • debugging
  • ocr
  • crash
  • dvb

Students who completed this task

Harry Yu

Task type

  • Code
  • Quality Assurance

2017