Apertium

Use open-source OCR to convert open-source non-text news corpora to text. Evaluate an analyser's coverage on them.

Many languages have online newspapers that do not store the news as actual text but as images (e.g. GIFs). Find a newspaper for a language that lacks news text online (e.g. Marathi), check its license, find an OCR tool, and scrape a reasonably large corpus from the images if doing so would not violate CC/GPL. Then evaluate the morphological analyser's coverage on that corpus.
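The evaluation step could be sketched roughly as follows. This is only an illustration, assuming the OCR'd text has already been passed through the analyser (for example with lt-proc) and that the output is in Apertium stream format, where each token appears as ^surface/analysis$ and an unknown token comes back as ^word/*word$. The function name coverage and the sample string are hypothetical and not part of the task; the sample is hand-written rather than real analyser output.

```python
import re

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
# Simplification: escaped characters such as \/ inside a surface form are not handled.
LEXICAL_UNIT = re.compile(r"\^([^/$^]*)/([^$]*)\$")


def coverage(analysed_text):
    """Return (known, total, ratio) for text already analysed into stream format."""
    known = 0
    total = 0
    for _surface, analyses in LEXICAL_UNIT.findall(analysed_text):
        total += 1
        # Unknown tokens are echoed back with a leading '*', e.g. ^xyzzy/*xyzzy$
        if not analyses.startswith("*"):
            known += 1
    ratio = known / total if total else 0.0
    return known, total, ratio


# Hand-written illustrative input, not real analyser output.
sample = "^the/the<det><def><sp>$ ^xyzzy/*xyzzy$ ^house/house<n><sg>$"
known, total, ratio = coverage(sample)
print(f"{known}/{total} tokens known ({ratio:.1%})")
```

Note that this counts every lexical unit, including punctuation, so the figure is a naive token-level coverage. OCR errors will also show up as unknown tokens, so a low score may reflect OCR quality as much as gaps in the analyser.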

Task tags

  • python
  • morphology

Students who completed this task

Grzegorz Stark

Task type

  • Code
  • Outreach / Research

2017