Apertium

Write a script to deduplicate and/or sort individual lexc lexica.

The lexc format is a way to specify a monolingual dictionary that gets compiled into a transducer: see the wiki pages "Apertium-specific conventions for lexc" and "Lttoolbox and lexc#lexc". A single lexc file may contain quite a few individual lexicons of stems, e.g. for nouns, verbs, prepositions, etc. Write a script (in Python or Ruby) that reads a specified lexicon and, depending on which option the user specifies, identifies and removes duplicates from the lexicon and/or sorts its entries. Be sure to make a dry run (i.e., one that does not actually make the changes) the default, and add different levels of debugging output (such as displaying only the number of duplicates versus printing each duplicate). Also consider allowing for different criteria for matching duplicates: e.g., whether or not the comment has to match too. Two existing scripts already parse lexc files and would be a good starting point: lexccounter.py and inject-words-from-bidix-to-lexc.py (not fully functional).
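The core of such a script might look like the following minimal sketch in Python. It is not taken from lexccounter.py or inject-words-from-bidix-to-lexc.py; the function names (split_comment, lexicon_entries, process) and their parameters are hypothetical. It assumes one entry per line, that "!" starts a comment unless escaped with "%", and it ignores quoted strings and multi-line entries. File reading and command-line option handling are left out.

```python
import re

# Assumption: a lexicon starts at a line of the form "LEXICON Name"
# and runs until the next such line.
LEXICON_RE = re.compile(r"^\s*LEXICON\s+(\S+)")


def split_comment(line):
    """Split a line into (entry, comment) at the first unescaped '!'."""
    i = 0
    while i < len(line):
        if line[i] == "%":  # '%' escapes the following character
            i += 2
            continue
        if line[i] == "!":
            return line[:i].rstrip(), line[i:].strip()
        i += 1
    return line.rstrip(), ""


def lexicon_entries(lines, name):
    """Return (index, line) pairs for the entry lines of lexicon `name`."""
    inside = False
    found = []
    for idx, line in enumerate(lines):
        match = LEXICON_RE.match(line)
        if match:
            inside = match.group(1) == name
            continue
        # Blank lines and comment-only lines are not entries.
        if inside and split_comment(line)[0].strip():
            found.append((idx, line))
    return found


def process(lines, name, dedupe=False, sort=False,
            match_comments=False, apply=False, verbosity=0):
    """Return (lines, duplicates). Dry run unless apply=True."""
    entries = lexicon_entries(lines, name)

    def key(line):
        entry, comment = split_comment(line)
        entry = " ".join(entry.split())  # normalise whitespace
        return (entry, comment) if match_comments else (entry,)

    seen, kept, dupes = set(), [], []
    for _, line in entries:
        k = key(line)
        if dedupe and k in seen:
            dupes.append(line)  # the first occurrence is the one kept
        else:
            seen.add(k)
            kept.append(line)
    if sort:
        kept.sort(key=key)

    if verbosity >= 1:
        print(f"{len(dupes)} duplicate(s) in LEXICON {name}")
    if verbosity >= 2:
        for line in dupes:
            print("duplicate:", line)

    if not apply:  # dry run: report only, change nothing
        return list(lines), dupes

    # Write the kept entries back into the original entry slots so that
    # blank lines and comment-only lines stay where they were.
    slots = [idx for idx, _ in entries]
    out = list(lines)
    for slot, line in zip(slots, kept):
        out[slot] = line
    dropped = set(slots[len(kept):])
    return [l for i, l in enumerate(out) if i not in dropped], dupes
```

Calling process(lines, "Nouns", dedupe=True) only reports duplicates; adding apply=True returns the modified lines, and match_comments=True treats two entries as duplicates only if their comments are identical as well. Because sorted entries are written back into the existing entry slots, comments that sit on their own lines keep their positions rather than moving with a neighbouring entry, which may or may not be the desired behaviour.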


Task tags

  • python
  • ruby
  • lexc

Students who completed this task

Grzegorz Stark

Task type

  • Code

2017