[AI][EASY] Chinese-English Parallel Corpora Collection

CCExtractor Development

Introduction

A corpus (plural: corpora) is a collection of texts. A parallel corpus is a collection of texts that have been translated into one or more other languages. For example, a Chinese-English parallel corpus consists of two files: one contains text written in Chinese, and the other contains the same text translated into English.

Skills

You will learn how to find datasets for different problems.

Task

Students need to search for publicly available (open source) Chinese-English parallel data and combine all the individual parallel texts into two files, one for Chinese and the other for English. We expect the corpus to be large: try to gather around 5 million parallel sentences.
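
The combining step could look something like the sketch below. It is only an illustration, not a required approach. It assumes every downloaded dataset has already been converted to a pair of UTF-8 text files in which line N of the Chinese file is the translation of line N of the English file. The function name merge_parallel and all file names are hypothetical.

```python
from itertools import zip_longest


def merge_parallel(pairs, out_zh, out_en):
    """Merge line-aligned (Chinese, English) file pairs into two files.

    pairs is an iterable of (chinese_path, english_path) tuples.
    Returns the number of sentence pairs written.
    """
    seen = set()  # exact pairs already written, used to drop duplicates
    written = 0
    with open(out_zh, "w", encoding="utf-8") as fz, \
         open(out_en, "w", encoding="utf-8") as fe:
        for zh_path, en_path in pairs:
            with open(zh_path, encoding="utf-8") as zf, \
                 open(en_path, encoding="utf-8") as ef:
                # zip_longest pads the shorter file with None, which
                # lets us detect files that are not the same length.
                for zh, en in zip_longest(zf, ef):
                    if zh is None or en is None:
                        raise ValueError(
                            f"line count mismatch: {zh_path} vs {en_path}"
                        )
                    zh, en = zh.strip(), en.strip()
                    if not zh or not en:
                        continue  # skip pairs with an empty side
                    if (zh, en) in seen:
                        continue  # skip exact duplicate pairs
                    seen.add((zh, en))
                    fz.write(zh + "\n")
                    fe.write(en + "\n")
                    written += 1
    return written
```

A call such as merge_parallel([("a.zh", "a.en"), ("b.zh", "b.en")], "all.zh", "all.en") would write the merged corpus and return the number of sentence pairs, which makes it easy to check progress toward the 5 million target. Keeping every pair in memory for deduplication may become expensive at that size, so storing a hash of each pair instead is one possible refinement.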

Task tags

nlp
dataset
research

Students who completed this task

Sudox, knightron0, AlephZero

Task type

Outreach / Research