Implement feature for detecting clumps (large areas) of text in a wiki page that lack references

Wikimedia

This task is about using artificial intelligence to check and/or improve the quality of content on Wikimedia wikis. It helps if you already have some basic knowledge of this area.

In https://phabricator.wikimedia.org/T170434 , it was pointed out that "Clumps of text without references is an important smell, even if the overall number of refs in the article is high."

We should be able to compute a useful statistic for this, such as the distance between consecutive references or the number of references per section on a wiki page.
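As a rough illustration of the "references per section" idea, here is a minimal sketch that uses only the Python standard library. The function name `refs_per_section`, the "(lead)" label, and the choice to split only on level-2 headings are assumptions made for this example; none of them come from wikiclass.

```python
import re

# Level-2 headings such as "== History ==". Deeper headings ("=== ... ===")
# are not matched, so subsections are counted as part of their parent section.
HEADING_RE = re.compile(r"^==(?!=)[ \t]*(.+?)[ \t]*==[ \t]*$", re.MULTILINE)

# Opening <ref> tags, including <ref name="..."> and self-closing <ref ... />.
# The word boundary keeps <references /> from being counted.
REF_OPEN_RE = re.compile(r"<ref\b", re.IGNORECASE)


def refs_per_section(wikitext):
    """Return a list of (section_title, ref_count) pairs.

    Text before the first heading is reported under the
    hypothetical title "(lead)".
    """
    sections = []
    title = "(lead)"
    start = 0
    for match in HEADING_RE.finditer(wikitext):
        sections.append((title, wikitext[start:match.start()]))
        title = match.group(1)
        start = match.end()
    sections.append((title, wikitext[start:]))
    return [(name, len(REF_OPEN_RE.findall(body))) for name, body in sections]


sample = (
    "Intro.<ref>A</ref>\n"
    "== History ==\n"
    "Some text.\n"
    "=== Early ===\n"
    'More.<ref name="b" />\n'
    "== Legacy ==\n"
    "No citations here.\n"
)
print(refs_per_section(sample))
```

A section with a count of zero, like "Legacy" in the sample, is a candidate clump even when the article as a whole has many references.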

Implement a feature that captures a signal for large chunks of uncited text (text in a wiki page that has no references), and check whether predictions improve.
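One possible signal is the length of the longest stretch of text between two references. The sketch below is only an illustration under stated assumptions: it works on raw wikitext with a regular expression, measures length in characters, and the names `uncited_chunk_lengths` and `longest_uncited_chunk` are hypothetical. It does not show how the value would be registered as a feature in wikiclass.

```python
import re

# Matches self-closing <ref ... /> tags and paired <ref ...>...</ref> tags.
# Assumption: references are written with <ref> tags; citation templates
# outside <ref> tags are not handled in this sketch.
REF_RE = re.compile(
    r"<ref\b[^>]*/>|<ref\b[^>]*>.*?</ref>",
    re.IGNORECASE | re.DOTALL,
)


def uncited_chunk_lengths(wikitext):
    """Return the character length of each stretch of text between references.

    The text before the first reference and after the last one
    each count as a chunk.
    """
    return [len(chunk.strip()) for chunk in REF_RE.split(wikitext)]


def longest_uncited_chunk(wikitext):
    """Return the length of the largest stretch of text without a reference."""
    return max(uncited_chunk_lengths(wikitext))


sample = "A" * 10 + "<ref>x</ref>" + "B" * 50 + '<ref name="y" />' + "C" * 5
print(uncited_chunk_lengths(sample))
print(longest_uncited_chunk(sample))
```

Raw character counts include markup such as templates and tables, so a real feature would probably need to measure visible text only, and might normalize by article length.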

This is the list of features in wikiclass for English Wikipedia. This task is done when at least one feature measuring how large the chunks of uncited text are has been added, and when the model has been rebuilt with the new statistics to show the improvement in the model's accuracy. Your pull request should be made against the wikiclass repository at https://github.com/wiki-ai/wikiclass

If you have questions, it is best to ask them in https://phabricator.wikimedia.org/T174384 (see the Phabricator help), as more people will see them there than on the GCI website.

https://phabricator.wikimedia.org/T174384

Task tags

python
artificial intelligence

Students who completed this task

Phantom42

Task type

Code
Outreach / Research