FOSSASIA

Privly: Post-Process Bot User Agent List

A previous GCI task collected user-agent strings presented by web crawler bots. We need to use these strings to block bots that attempt to fetch content from our server without respecting our robots.txt file. The problem is that there are many different possible user-agent strings, and we don't want to check for all of them on every request. One possible approach is to look for substrings that are shared by many different bots. For instance, we can check for "CommonCrawler" instead of the full string Mozilla/5.0 CommonCrawler Node 3BQ5KOMHUFGZXDFNO4UYL2CU23T6HCCSKFRWWIMUIUZ7TAEYQ4LX7RRN3FM2RMX.5.NFRJZFC7JZQ3PMWMKHGTFBBJVDP5ZKQBIC77WT46KHHBW2XR.cdn0.common.crawl.zone. This task is aimed at producing this condensed list.
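The matching approach described above can be sketched as a simple substring check. This is a minimal illustration, not the Privly implementation; the token list here is hypothetical and would come from the condensed spreadsheet this task produces.

```python
# Hypothetical condensed token list: each token identifies a whole family
# of bot user-agent strings (these example entries are assumptions).
BOT_TOKENS = ["CommonCrawler", "Googlebot", "bingbot"]

def is_known_bot(user_agent: str) -> bool:
    """Return True if the user agent contains any known bot token."""
    return any(token in user_agent for token in BOT_TOKENS)

# A long, node-specific user agent still matches via its shared token.
print(is_known_bot("Mozilla/5.0 CommonCrawler Node example"))   # True
print(is_known_bot("Mozilla/5.0 (Windows NT 10.0) Firefox"))    # False
```

Checking a handful of shared tokens per request is far cheaper than comparing against every full user-agent string ever collected.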

(1) Objective

Condense the list of robot user agents.

(2) Requirement

  1. Put the list of robots into a spreadsheet
  2. Add a column next to the robot list containing the string that will be used to identify that user agent, e.g., "CommonCrawler"
  3. Submit the spreadsheet for review

(3) Expected outcome

We have a list of strings we can check for when we want to block a known set of bots.

(4) Resources

list of bot user agents

Task tags

  • privly
  • blacklist
  • bots
  • spreadsheet

Students who completed this task

seadog007

Task type

  • Code
  • Quality Assurance

2015