CCExtractor Development

[AI][EASY] Chinese-English Parallel Corpora Collection

A corpus (plural: corpora) is a collection of texts. A parallel corpus is a collection of texts that have been translated into one or more other languages. For example, a Chinese-English parallel corpus consists of two files: one containing text written in Chinese, and another containing the same text translated into English.

[Problem Statement] Parallel corpora are needed for training machine translation systems. We need a 100% accurate parallel corpus for the Chinese-English language pair. Students need to search for publicly available (open source) Chinese-English parallel data and combine all the individual parallel texts into two files, one for Chinese and the other for English. We expect the corpus to be large: try to gather around 5 million parallel sentences.
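
The combining step could be sketched roughly as follows. This is a minimal illustration, not a prescribed method. It assumes each source has already been downloaded as a pair of UTF-8 text files with one sentence per line, where line N of the Chinese file is the translation of line N of the English file. The task does not specify that layout, and the function name `merge_parallel` and all file names are hypothetical.

```python
from itertools import zip_longest


def merge_parallel(pairs, out_zh, out_en):
    """Concatenate aligned (Chinese, English) file pairs into two files.

    `pairs` is an iterable of (chinese_path, english_path) tuples.
    Returns the number of sentence pairs written.
    """
    written = 0
    with open(out_zh, "w", encoding="utf-8") as zh_out, \
         open(out_en, "w", encoding="utf-8") as en_out:
        for zh_path, en_path in pairs:
            with open(zh_path, encoding="utf-8") as zh_in, \
                 open(en_path, encoding="utf-8") as en_in:
                # Stream both files in lockstep so large corpora are
                # never loaded into memory all at once.
                for zh, en in zip_longest(zh_in, en_in):
                    if zh is None or en is None:
                        # One file ended early: the pair is misaligned.
                        raise ValueError(
                            f"line count mismatch: {zh_path} vs {en_path}"
                        )
                    zh, en = zh.strip(), en.strip()
                    # Assumption: a pair with an empty side is unusable,
                    # so it is dropped from BOTH outputs to keep them
                    # line-aligned.
                    if not zh or not en:
                        continue
                    zh_out.write(zh + "\n")
                    en_out.write(en + "\n")
                    written += 1
    return written
```

A call such as `merge_parallel([("news.zh", "news.en"), ("subs.zh", "subs.en")], "all.zh", "all.en")` would then produce the two combined files and return the number of sentence pairs kept. Note that a mismatch is only detected partway through, so the output files should be discarded if the function raises. The check catches differing line counts only; it cannot confirm that the translations themselves are accurate.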

Task tags

  • machine translation
  • natural language processing

Students who completed this task

imahumanrcw, Jed Lim, lyect, Ivan Makarov

Task type

  • Outreach / Research
  • Quality Assurance