CCAligner: Find and integrate a text tokenization library

CCExtractor Development

Text tokenisation, or normalisation, means splitting a sentence into its constituent tokens. The challenging parts are separating possessives and interpreting numerals, dates, and similar items (see the example below). Read more about it here: https://en.wikipedia.org/wiki/Text_normalization

In CCAligner, the current implementation of text tokenisation is fairly naive and does not cover all of these cases. A good tokenisation library should be able to generate all possible text tokens for items such as currency amounts, dates, numbers, and symbols.

For example, the sentence

In 1996, 1996 people sent emails to someone @ example.com at 1:30 PM.

should be expanded to

In nineteen ninety six, one thousand nine hundred and ninety six people sent emails to someone at example dot com at one thirty p m

and all the alternative versions.
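
To make the "alternative versions" idea concrete, here is a minimal C++ sketch of how a single numeric token such as 1996 could be expanded into both its cardinal and its year reading. This is only an illustration, not CCAligner's current code and not Sparrowhawk's API: the function names (cardinalToWords, yearToWords, expandNumber) are hypothetical, only integers from 0 to 999999 are handled, and a real library would also have to cover currency, times, e-mail addresses and so on.

```cpp
#include <cassert>
#include <string>
#include <vector>

namespace {
const char* const kOnes[] = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen"};
const char* const kTens[] = {
    "", "", "twenty", "thirty", "forty", "fifty",
    "sixty", "seventy", "eighty", "ninety"};

// Spell out 0..99, e.g. 96 -> "ninety six".
std::string belowHundred(int n) {
    if (n < 20) return kOnes[n];
    std::string out = kTens[n / 10];
    if (n % 10 != 0) out += std::string(" ") + kOnes[n % 10];
    return out;
}

// Spell out 0..999, e.g. 996 -> "nine hundred and ninety six".
std::string belowThousand(int n) {
    if (n < 100) return belowHundred(n);
    std::string out = std::string(kOnes[n / 100]) + " hundred";
    if (n % 100 != 0) out += " and " + belowHundred(n % 100);
    return out;
}
}  // namespace

// Hypothetical helper: cardinal reading of 0..999999.
std::string cardinalToWords(int n) {
    assert(n >= 0 && n < 1000000);
    if (n < 1000) return belowThousand(n);
    std::string out = belowThousand(n / 1000) + " thousand";
    if (n % 1000 != 0) out += " " + belowThousand(n % 1000);
    return out;
}

// Hypothetical helper: year-style reading of a four-digit number,
// e.g. 1996 -> "nineteen ninety six". Years such as 2000 or 2005
// fall back to the cardinal reading in this sketch.
std::string yearToWords(int year) {
    assert(year >= 1000 && year <= 9999);
    const int high = year / 100;
    const int low = year % 100;
    if (low == 0 || high % 10 == 0) return cardinalToWords(year);
    if (low < 10) return belowHundred(high) + " oh " + kOnes[low];
    return belowHundred(high) + " " + belowHundred(low);
}

// Hypothetical helper: every reading this sketch knows for a number.
std::vector<std::string> expandNumber(int n) {
    std::vector<std::string> readings{cardinalToWords(n)};
    if (n >= 1000 && n <= 9999) {
        const std::string asYear = yearToWords(n);
        if (asYear != readings.front()) readings.push_back(asYear);
    }
    return readings;
}
```

With this sketch, expandNumber(1996) yields both readings used in the example above, so an aligner could try to match either one against the recognised speech.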

You can pick any tokenization library you consider suitable, but we already have a suggestion from Nikolay of CMUSphinx (the organisation that built Pocketsphinx, the automatic speech recognition system we are using): he suggested using Sparrowhawk. It looks really promising, but it may or may not be exactly what we need. We don't have experience with that library ourselves, so if you run into problems we'll look into them together.

The library needs to be integrated into the subtitle parser (you can find it here) that was implemented in CCAligner. We therefore expect a Pull Request from you that makes these changes.

Here's a research report from a student who attempted this task; it might help you: https://goo.gl/gYyh8P

https://github.com/saurabhshri/CCAligner/issues/7

Task tags

library
potentially hard
c++
tokenization

Students who completed this task

Nikunj Taneja

Task type

Code
Outreach / Research