User:RBSpamAnalyzerBot
Overview
This bot will post external link analysis, find probable spambot-created pages, and eventually tag them for speedy deletion. It will also generate a set of statistics that can be used by the community to determine whether some pages are being used as spam carriers.
The bot runs once per database dump. In the case of the English Wikipedia, I expect it to run once every 45-60 days.
Tasks
The bot itself is composed of a set of bash shell script files, each doing a single task:
- review.sh: The "bot" itself. This script simply calls each of the following scripts in order, handling any problems they report.
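A minimal sketch of this driver logic (not the actual script; `run_stages` is a name invented here for illustration):

```shell
#!/bin/bash
# Sketch of review.sh's control flow: run each maintenance stage in
# order and stop at the first failure, so a failed download never
# leads to processing or uploading stale data.

run_stages() {
    local stage
    for stage in "$@"; do
        echo "starting $stage"
        if ! bash "$stage"; then
            echo "$stage failed; aborting run" >&2
            return 1
        fi
    done
    echo "all stages completed"
}

# Typical invocation:
# run_stages download.sh process.sh upload.sh
```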
- download.sh: Checks download.wikimedia.org for new database dumps, comparing them against the last one processed. If a new dump is found, it generates a list of URLs for page.sql.gz and externallinks.sql.gz to be downloaded with wget.
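For illustration, the URL-list step could be sketched as below; the exact directory layout on download.wikimedia.org is an assumption, and `make_url_list` is a hypothetical helper:

```shell
#!/bin/bash
# Sketch: emit wget URLs for page.sql.gz and externallinks.sql.gz
# when the advertised dump date is newer than the last one processed.
# The URL layout below is assumed for illustration only.

make_url_list() {
    local last="$1" latest="$2" wiki="$3"
    # Dump dates use yyyymmdd, so a plain string compare orders them.
    if [ "$latest" \> "$last" ]; then
        echo "http://download.wikimedia.org/$wiki/$latest/$wiki-$latest-page.sql.gz"
        echo "http://download.wikimedia.org/$wiki/$latest/$wiki-$latest-externallinks.sql.gz"
    fi
}

# Example: make_url_list 20070101 20070215 enwiki > urls.txt
#          wget -i urls.txt
```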
- process.sh: Imports page.sql.gz and externallinks.sql.gz into a local database, then runs several custom-made queries to gather statistics:
SELECT COUNT(el_from) AS total, el_from, page_title
FROM externallinks, page
WHERE externallinks.el_from = page_id AND page_is_redirect = 0 AND page_namespace = 0
GROUP BY el_from
ORDER BY total DESC;
- Generates a list of articles sorted by the number of external links each contains.
SELECT COUNT(el_to) AS total, SUBSTRING_INDEX(el_to, '/', 3) AS search
FROM externallinks, page
WHERE page_id = el_from AND page_namespace = 0
GROUP BY search
ORDER BY total DESC;
- Generates a list of linked sites in descending order of link count.
SELECT page_id, page_title, page_namespace
FROM page
WHERE page_title LIKE '%index.php%'
   OR page_title LIKE '%/wiki/%'
   OR page_title LIKE '%/w/%'
   OR page_title LIKE '%/';
- Generates a list of pages whose titles contain one of several patterns left behind by malfunctioning bots, such as /wiki/ or /w/, or that end with /.
- After executing the queries, the script trims the resulting lists to a fixed maximum size, to keep the generated pages from growing too large. If a listing still contains more than 500 items, the bot stops, as the dump results must be analyzed manually.
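The 500-item safety check could be sketched like this (`check_listing` is a hypothetical helper, not the script's actual code):

```shell
#!/bin/bash
# Sketch: refuse to continue when a generated listing exceeds the
# item limit, signalling that the dump needs manual analysis.

check_listing() {
    local file="$1" limit="${2:-500}"
    local items
    items=$(wc -l < "$file")
    if [ "$items" -gt "$limit" ]; then
        echo "$file has $items items (limit $limit): manual analysis required" >&2
        return 1
    fi
}
```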
- upload.sh: This script handles all communication between the bot and the Wikipedia project. It logs the bot in and uploads the generated listings to a fixed location, currently User:ReyBrujo/Dumps. First, the script checks whether a previous dump page exists and, if so, archives it at User:ReyBrujo/Dumps/Archive. Then it uploads the listings and the dump page, using the following format:
- User:ReyBrujo/Dumps/yyyymmdd where yyyymmdd is the database dump date (and not the processing date)
- User:ReyBrujo/Dumps/yyyymmdd/Sites linked more than xxx times where xxx is usually 500 in the case of the English Wikipedia
- User:ReyBrujo/Dumps/yyyymmdd/Sites linked between xxx and yyy times where xxx and yyy are the range bounds used when a single listing would exceed 500 items.
- User:ReyBrujo/Dumps/yyyymmdd/Articles with more than xxx external links where xxx is usually 1000.
- User:ReyBrujo/Dumps/yyyymmdd/Articles with between xxx and yyy external links where xxx and yyy are the range bounds used when a single listing would exceed 500 items.
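As an example of this naming scheme, the unconditional page titles for one dump date could be built as follows (`dump_pages` is a hypothetical helper; the 500 and 1000 thresholds are the usual English Wikipedia values mentioned above):

```shell
#!/bin/bash
# Sketch: derive the wiki page titles for one dump from its date.
# Only the unconditional pages are shown; the "between xxx and yyy"
# variants are created only when a listing has to be split.

dump_pages() {
    local date="$1"   # database dump date, yyyymmdd
    local base="User:ReyBrujo/Dumps/$date"
    echo "$base"
    echo "$base/Sites linked more than 500 times"
    echo "$base/Articles with more than 1000 external links"
}
```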
Finally, the bot also edits a global page, currently found at meta:User:ReyBrujo/Dump statistics table, updating the statistics there. Permission for the bot to run on Meta will be requested after the bot is approved on the English Wikipedia.