CCExtractor Development
[AI][EASY] Chinese-English Parallel Corpora Collection
Introduction
A corpus (plural: corpora) is a collection of texts. A parallel corpus consists of texts together with their translations into one or more other languages. For example, a Chinese-English parallel corpus consists of two files: one containing text written in Chinese and another containing the same text translated into English.
Skills
You will learn how to find datasets for different problems.
Task
Students need to search for publicly available (open-source) Chinese-English parallel data and combine all the individual parallel texts into two files, one for Chinese and the other for English. We expect the corpus to be large: try to gather around 5 million parallel sentences.
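The combining step could be sketched roughly as follows. This is only an illustration, not a required approach: it assumes each downloaded dataset has already been converted to a pair of UTF-8 text files with one sentence per line, where line N of the Chinese file is the translation of line N of the English file. The function name `merge_parallel` and all file names are hypothetical.

```python
from itertools import zip_longest


def merge_parallel(pairs, out_zh, out_en):
    """Concatenate aligned (Chinese, English) file pairs into two output files.

    `pairs` is a list of (zh_path, en_path) tuples. Both files of a pair are
    assumed to hold one sentence per line, aligned line by line.
    Returns the number of sentence pairs written.
    """
    written = 0
    with open(out_zh, "w", encoding="utf-8") as zh_out, \
         open(out_en, "w", encoding="utf-8") as en_out:
        for zh_path, en_path in pairs:
            with open(zh_path, encoding="utf-8") as zh_in, \
                 open(en_path, encoding="utf-8") as en_in:
                # Stream both files together so large corpora never need to
                # fit in memory; zip_longest yields None when one file ends
                # before the other.
                for zh, en in zip_longest(zh_in, en_in):
                    if zh is None or en is None:
                        raise ValueError(
                            f"line counts differ: {zh_path} vs {en_path}"
                        )
                    zh, en = zh.strip(), en.strip()
                    # Drop pairs where either side is empty, so the two
                    # output files stay aligned line by line.
                    if not zh or not en:
                        continue
                    zh_out.write(zh + "\n")
                    en_out.write(en + "\n")
                    written += 1
    return written
```

For example, `merge_parallel([("news.zh", "news.en"), ("subs.zh", "subs.en")], "all.zh", "all.en")` would write the two combined files and return the number of sentence pairs, which can be compared against the 5 million target. The sketch stops with an error when a pair of files is misaligned rather than guessing, since a single shifted line would corrupt every pair after it.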
Students who completed this task
Sudox, knightron0, AlephZero