Apertium

Write a scraper for aligned daily content from wol.jw.org in two languages

Write a scraper (preferably in Python) that accepts two languages and a date range, and creates an aligned corpus in TMX format from the content at wol.jw.org in those languages for that date range.
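
As a rough illustration of the output side of the task, here is a minimal sketch of writing already-aligned text pairs to TMX with Python's standard library. The function name `build_tmx`, the tool name in the header, and the choice of `segtype="paragraph"` are assumptions made for this example; how the pairs are extracted and aligned is left to the scraper.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predefined in XML; ElementTree serialises this as xml:lang.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_tmx(pairs, src_lang, tgt_lang):
    """Return a TMX 1.4 document (as a string) for (source, target) text pairs."""
    tmx = ET.Element("tmx", version="1.4")
    # Header attributes required by TMX; the values here are illustrative.
    ET.SubElement(tmx, "header", {
        "creationtool": "wol-scraper-sketch",  # hypothetical tool name
        "creationtoolversion": "0.1",
        "segtype": "paragraph",
        "o-tmf": "plain",
        "adminlang": "en",
        "srclang": src_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(tmx, "body")
    for src_text, tgt_text in pairs:
        # One translation unit per aligned pair, with one variant per language.
        tu = ET.SubElement(body, "tu")
        for lang, text in ((src_lang, src_text), (tgt_lang, tgt_text)):
            tuv = ET.SubElement(tu, "tuv", {XML_LANG: lang})
            ET.SubElement(tuv, "seg").text = text
    return ET.tostring(tmx, encoding="unicode")
```

For example, `build_tmx([("Hello", "Сәлем")], "en", "kk")` returns a string with a single `<tu>` element holding one `<tuv>` per language.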

Here are some example pages in different languages for a given date (December 1, 2015):

  • [http://wol.jw.org/en/wol/dt/r1/lp-e/2015/12/1]
  • [http://wol.jw.org/kk-cyrl/wol/dt/r43/lp-az/2015/12/1]
  • [http://wol.jw.org/ky/wol/dt/r51/lp-kz/2015/12/1]
  • [http://wol.jw.org/khk/wol/dt/r159/lp-kha/2015/12/1]
  • [http://wol.jw.org/tt/wol/dt/r100/lp-tat/2015/12/1]
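
The example URLs end in an unpadded year/month/day path, so one way to cover a date range is to generate one URL per day. The sketch below assumes that pattern holds for every date; the `BASE_PATHS` table and the `daily_urls` helper are hypothetical names, and the table holds only the path prefixes copied from the examples above (other languages would need their own codes looked up on the site).

```python
from datetime import date, timedelta

# Path prefixes copied from the example URLs; keyed by the first path segment.
BASE_PATHS = {
    "en": "en/wol/dt/r1/lp-e",
    "kk-cyrl": "kk-cyrl/wol/dt/r43/lp-az",
    "ky": "ky/wol/dt/r51/lp-kz",
    "khk": "khk/wol/dt/r159/lp-kha",
    "tt": "tt/wol/dt/r100/lp-tat",
}


def daily_urls(lang, start, end):
    """Yield one daily-text URL per day from start to end, inclusive."""
    base = BASE_PATHS[lang]
    day = start
    while day <= end:
        # Month and day are not zero-padded in the example URLs.
        yield f"http://wol.jw.org/{base}/{day.year}/{day.month}/{day.day}"
        day += timedelta(days=1)
```

Calling `daily_urls` with the same date range for each of the two languages gives two URL lists of equal length, so the pages can be paired by date before their contents are aligned.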

For further information and guidance on this task, you are encouraged to come to our IRC channel.

Task tags

  • python
  • aligning
  • scraper

Students who completed this task

vigneshv

Task type

  • Code

2015