Wikimedia

In Pywikibot's download_dump.py file, make the download process atomic

Pywikibot is a Python-based framework for writing bots for MediaWiki.

Thanks to work in Google Code-in, Pywikibot now has a script called download_dump.py. It downloads a Wikimedia database dump from http://dumps.wikimedia.org/ and places the dump in a predictable directory for semi-automated use by other scripts and tests.

It is, however, not [[https://en.wiktionary.org/wiki/atomic#Adjective|atomic]].

This task has two parts:

  1. Make sure the target file is //always// a complete file: commit it only if the download has finished completely (and has been verified against its checksum), and discard it if anything goes wrong.
  2. Make sure two downloaders do not write to the same partially downloaded file at the same time.
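
The first part can be sketched as "write to a temporary file, then rename". The following is only a minimal sketch, not the script's actual code: the function name `atomic_write`, the `chunks` iterable, and the `expected_sha1` parameter are hypothetical, and the use of SHA-1 is an assumption. It also relies on `os.replace`, which needs Python 3.3 or later.

```python
import hashlib
import os
import tempfile


def atomic_write(target_path, chunks, expected_sha1=None):
    """Write chunks so that target_path is either complete or untouched.

    Hypothetical helper: `chunks` is any iterable of bytes objects, and
    `expected_sha1` is an optional hex digest to verify before committing.
    """
    target_dir = os.path.dirname(os.path.abspath(target_path))
    # A unique temporary name in the same directory keeps the final rename
    # on one filesystem, and no two downloaders share a partial file.
    fd, temp_path = tempfile.mkstemp(dir=target_dir, suffix='.part')
    digest = hashlib.sha1()
    try:
        with os.fdopen(fd, 'wb') as temp_file:
            for chunk in chunks:
                temp_file.write(chunk)
                digest.update(chunk)
            temp_file.flush()
            os.fsync(temp_file.fileno())
        if expected_sha1 is not None and digest.hexdigest() != expected_sha1:
            raise ValueError('checksum mismatch for ' + target_path)
        # Commit: the rename replaces the target in a single step.
        os.replace(temp_path, target_path)
    except BaseException:
        # Discard the partial file if anything went wrong.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

Because every call gets its own temporary name, this sketch also covers the second part by avoiding shared filenames rather than by locking.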

One implementation is to perform [[https://en.wikipedia.org/wiki/File_locking|file locking]] on the file being downloaded to, and to exit with an error if the file about to be written is already locked. Another is to avoid identical filenames entirely.
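
The locking option could look roughly like the sketch below. It uses a separate lock file created with `O_CREAT | O_EXCL` instead of an operating-system lock on the download itself, which is a swapped-in technique chosen here because it behaves the same on all platforms. The names `download_lock` and `AlreadyLockedError` and the `.lock` suffix are hypothetical. `FileExistsError` requires Python 3.3 or later.

```python
import os
from contextlib import contextmanager


class AlreadyLockedError(Exception):
    """Raised when another downloader already holds the lock (hypothetical)."""


@contextmanager
def download_lock(target_path):
    """Hold an exclusive lock file next to target_path while downloading."""
    lock_path = target_path + '.lock'
    try:
        # O_EXCL makes creation fail if the lock file already exists,
        # so only one process can pass this point at a time.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise AlreadyLockedError(lock_path + ' is held by another downloader')
    try:
        # Record the owner's PID to help diagnose stale locks by hand.
        os.write(fd, str(os.getpid()).encode('ascii'))
        yield lock_path
    finally:
        os.close(fd)
        os.remove(lock_path)
```

A caller would wrap the download in `with download_lock(path):` and exit with an error on `AlreadyLockedError`. One limitation of this approach is that a crashed process leaves a stale lock file behind, which then has to be removed manually.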

Please do read https://phabricator.wikimedia.org/T183675 for a longer explanation!

You are expected to provide a patch in Wikimedia Gerrit. See https://www.mediawiki.org/wiki/Gerrit/Tutorial for how to set up Git and Gerrit.

Task tags

  • python
  • pywikibot

Students who completed this task

Ryan Chang

Task type

  • Code

2017