
group and count possible lemmas output by guesser

Currently, a 'guesser' version of an Apertium transducer can output a list of possible analyses for unknown forms. Develop a new pipeline, preferably with shell scripts or Python, that runs a guesser on all unknown forms in a corpus, takes the list of all possible analyses, and outputs a hit count of the most common combinations of lemma and POS tag.

The first step is to compile an Apertium transducer that has guesser support (Kazakh, for example), enable the guesser, and make sure it works. Then run it on a large amount of text in the language.
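
The counting step could be sketched in Python roughly as follows. This is a minimal sketch under several assumptions, not a reference implementation: it assumes the guesser emits ordinary Apertium stream format (^surface/analysis1/analysis2$), that the first tag after the lemma is the POS tag, and that the text contains no escaped / ^ or $ characters. The function name count_lemma_pos and the sample forms are invented for illustration.

```python
import re
from collections import Counter

# One lexical unit in Apertium stream format: ^surface/analysis1/analysis2$
LEXICAL_UNIT = re.compile(r"\^([^$]*)\$")
# A lemma followed by its first tag (assumed here to be the POS tag).
LEMMA_AND_POS = re.compile(r"^([^<]+)<([^>]+)>")


def count_lemma_pos(stream_text):
    """Count (lemma, POS) pairs over every analysis found in stream_text."""
    counts = Counter()
    for unit in LEXICAL_UNIT.finditer(stream_text):
        # The first field is the surface form; the rest are analyses.
        analyses = unit.group(1).split("/")[1:]
        pairs = set()
        for analysis in analyses:
            if analysis.startswith("*"):
                continue  # form is still unknown, nothing to count
            match = LEMMA_AND_POS.match(analysis)
            if match:
                pairs.add((match.group(1), match.group(2)))
        # Each distinct pair is counted once per token, even if several
        # analyses of that token share the same lemma and POS.
        counts.update(pairs)
    return counts


# Invented sample output, for illustration only.
sample = (
    "^almalar/alma<n><pl><nom>/almala<v><tv><aor>$ "
    "^almany/alma<n><acc>$ "
    "^xyz/*xyz$"
)

for (lemma, pos), hits in count_lemma_pos(sample).most_common():
    print(f"{hits}\t{lemma}<{pos}>")
```

In a real pipeline the sample string would be replaced by the guesser's output for the whole corpus, for instance read from standard input. Counting each lemma and POS pair once per token is a design choice; counting every analysis separately would only require updating the counter inside the inner loop.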

Task tags

  • transducers
  • guesser
  • shellscripts

Students who completed this task

Ryan A. Chi

Task type

  • Code