In Pywikibot's download_dump.py script, make the download process atomic
Pywikibot is a Python-based framework for writing bots for MediaWiki.
Thanks to work in Google Code-in, Pywikibot now has a script called download_dump.py.
It downloads a Wikimedia database dump from http://dumps.wikimedia.org/ and places the dump in a predictable directory for semi-automated use by other scripts and tests.
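It is, however, not [[https://en.wiktionary.org/wiki/atomic#Adjective|atomic]]: as the goals below imply, a failed or interrupted download may leave an incomplete file in place, and two downloads running at once may write to the same file.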
This task shall include two parts:
- Make sure the target file is //always// a complete file: commit it only if the download completed (and its checksum was verified), and discard it if anything goes wrong.
- Make sure two downloaders do not write to the same partially downloaded file at the same time.
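The first part could be sketched roughly as below. This is only an illustration, not the existing download_dump.py code: the function name `commit_download`, the `.part` suffix, and the choice of SHA-1 for the checksum are assumptions, and `os.replace` requires Python 3.3 or later. The data is written to a uniquely named temporary file in the target directory and moved into place only after the checksum matches.

```python
import hashlib
import os
import tempfile


def commit_download(chunks, target_path, expected_sha1=None):
    """Write chunks to a temp file, verify it, then move it into place.

    chunks: iterable of bytes (for example, the body of an HTTP response).
    expected_sha1: hex digest to verify against; None skips verification.
    (Hypothetical helper; names and checksum type are assumptions.)
    """
    target_dir = os.path.dirname(os.path.abspath(target_path))
    # Keep the temp file in the target directory so the final rename stays
    # on one filesystem, where os.replace is atomic.
    fd, temp_path = tempfile.mkstemp(dir=target_dir, suffix='.part')
    digest = hashlib.sha1()
    try:
        with os.fdopen(fd, 'wb') as temp_file:
            for chunk in chunks:
                temp_file.write(chunk)
                digest.update(chunk)
        if expected_sha1 is not None and digest.hexdigest() != expected_sha1:
            raise ValueError('checksum mismatch for ' + target_path)
        # Commit: the target is either the old file or the complete new one.
        os.replace(temp_path, target_path)
    except BaseException:
        # Discard the partial file if anything went wrong.
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
```

Because every call gets its own temporary filename from `tempfile.mkstemp`, two downloaders never share a partial file, although both may still download the same dump in parallel.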
One implementation is to perform [[https://en.wikipedia.org/wiki/File_locking|file locking]] on the file being downloaded, and exit with an error if the file to be written is already locked. Another is to avoid shared filenames entirely, so that each download writes to its own file.
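A minimal sketch of the locking approach, assuming a separate lock file next to the target rather than an OS-level lock on the dump itself. The name `download_lock` and the `.lock` suffix are hypothetical. Creating the file with `os.O_CREAT | os.O_EXCL` fails if it already exists, so only one downloader can hold the lock at a time.

```python
import os
from contextlib import contextmanager


@contextmanager
def download_lock(target_path):
    """Hold an exclusive lock file for target_path (hypothetical helper)."""
    lock_path = target_path + '.lock'
    try:
        # O_EXCL makes creation fail if the lock file already exists.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RuntimeError(
            'another download of %s appears to be in progress' % target_path)
    try:
        yield lock_path
    finally:
        # Release the lock whether the download succeeded or failed.
        os.close(fd)
        os.remove(lock_path)
```

A downloader would wrap its work in `with download_lock(path):` and report the `RuntimeError` to the user. One limitation of this sketch is that a process killed mid-download leaves a stale lock file behind, which would need manual removal or a timeout policy.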
Please do read https://phabricator.wikimedia.org/T183675 for a longer explanation!
You are expected to provide a patch in Wikimedia Gerrit. See https://www.mediawiki.org/wiki/Gerrit/Tutorial for how to set up Git and Gerrit.