[Winnerstrack]: Produce a "smart" multiple choice exam analyzer (II)

CCExtractor Development

This task follows on from:

https://codein.withgoogle.com/dashboard/tasks/6498594207563776/

In that task you had to figure out how to correctly split a PDF into its questions (and we received a couple of reasonably good implementations). This new task requires you to:

Save each question into an individual file (you had to do this in the previous task, but let's formalize it).
For each question, OCR the text as well as you can. You can use an external library such as Tesseract, or Amazon's tools, which handle things like borders.
If a question has, for example, 4 possible answers (that's typically the case), try to extract each answer separately. This is useful, for example, when we want to shuffle the answers to generate an exam that has the same answers but in a different order.
Write the generated PDFs to files. Write the text to .json files and to a sqlite3 database (so we can run queries later).
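
As a rough illustration of the last three steps, here is a minimal Python sketch that uses only the standard library. It assumes the OCR step has already produced plain text for one question, and that answers are labelled "A)", "b." or "(C)" at the start of a line; real exams will need other patterns. The function names (split_question, shuffle_answers, save_question), the file naming scheme and the two-table schema are all hypothetical choices for this example, not part of the task specification.

```python
import json
import random
import re
import sqlite3
from pathlib import Path

# Assumed label format: "A)", "b." or "(C)" at the start of a line.
ANSWER_LABEL = re.compile(r"^\s*\(?([A-Ha-h])[\).]\s+", re.MULTILINE)


def split_question(ocr_text):
    """Split the OCR'd text of one question into a stem and its answers."""
    matches = list(ANSWER_LABEL.finditer(ocr_text))
    if not matches:
        return {"stem": ocr_text.strip(), "answers": []}
    answers = []
    for i, match in enumerate(matches):
        # An answer runs until the next label, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(ocr_text)
        answers.append({
            "label": match.group(1).upper(),
            # Collapse line breaks and repeated spaces left over from OCR.
            "text": " ".join(ocr_text[match.end():end].split()),
        })
    return {"stem": ocr_text[:matches[0].start()].strip(), "answers": answers}


def shuffle_answers(question, seed=None):
    """Return a copy with the answer texts reordered under the same labels."""
    texts = [a["text"] for a in question["answers"]]
    random.Random(seed).shuffle(texts)
    labels = [a["label"] for a in question["answers"]]
    return {
        "stem": question["stem"],
        "answers": [{"label": l, "text": t} for l, t in zip(labels, texts)],
    }


def save_question(question, number, out_dir, conn):
    """Write one question to a .json file and to a sqlite3 database."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    json_path = out_dir / f"question_{number:03d}.json"
    json_path.write_text(
        json.dumps(question, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS questions "
        "(number INTEGER PRIMARY KEY, stem TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS answers ("
        "question_number INTEGER NOT NULL REFERENCES questions(number), "
        "label TEXT NOT NULL, text TEXT NOT NULL)"
    )
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO questions VALUES (?, ?)",
            (number, question["stem"]),
        )
        # Drop old rows first so saving the same question twice is safe.
        conn.execute("DELETE FROM answers WHERE question_number = ?", (number,))
        conn.executemany(
            "INSERT INTO answers VALUES (?, ?, ?)",
            [(number, a["label"], a["text"]) for a in question["answers"]],
        )
    return json_path
```

A typical call sequence would be split_question(text) on each question's OCR output, then save_question(question, n, "out", sqlite3.connect("exam.db")). Note that this sketch does not track which answer is the correct one, so a real shuffler would need to carry that information along with each answer text.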

We consider this task to be hard (but fun).

You can find lots of different sample exams, for different subjects, here:

https://drive.google.com/file/d/1WULFj053Lm1_y6BTQVOyPdXeynNmOh14/view?usp=sharing

Task tags

python
hard
winnerstrack
rust

Students who completed this task

knightron0, Musab Kılıç, RobOHt

Task type

Code