Wikipedia:Bots/Requests for approval/TorNodeBot 2
- The following discussion is an archived debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA. The result of the discussion was Approved.
TorNodeBot (Second Request)
Operator: Shirik (talk · contribs)
Time filed: 18:14, Sunday October 31, 2010 (UTC)
Automatic or Manually assisted: Automatic
Programming language(s): PHP and Lua
Source code available: Yes, http://shiibot.com/torbot.php.txt and http://shiibot.com/blockcheck.lua.txt. Please note presently the line to actually perform the block is commented out.
Function overview: Identifies unblocked TOR nodes and blocks them temporarily, AO/ACB.
Links to relevant discussions (where appropriate): There is currently no discussion about this bot, however this bot was approved for trial in the past at WP:Bots/Requests for approval/TorNodeBot. I withdrew this request last time because it was determined it was no longer needed, but it seems it is needed once again.
Edit period(s): Frequently, through cron, though period has not yet been determined. Probably about every 15 minutes.
Estimated number of pages affected: There will likely be a large influx of blocks initially, probably on the order of 500 (revised upward from an initial estimate of 50-100). After this, a block would occur only when new tor nodes open up. This is probably on the order of a couple per day but I don't have any real data to back this up.
Exclusion compliant (Y/N): N (not applicable)
Already has a bot flag (Y/N): N
Function details: The TorBlock MediaWiki extension is designed to block accounts from editing via Tor. Unfortunately, either due to bugs or due to its refresh rate being insufficiently fast, it has been unable to keep up with accounts editing via Tor, so edits are still able to get through. This bot monitors a DNSBL searching for IPs and blocks them temporarily for editing (AO/ACB).
The DNSBL is designed for high traffic scenarios so strain on the remote server is minimal. The bot will run from my server, so there will be no impact on Wikipedia's servers from this additional traffic. The only additional traffic will be to the toolserver, where the bot makes a request once it identifies an IP that has a Tor exit node to Wikipedia to check if it's already blocked.
This bot only blocks Tor nodes which have an exit node to Wikipedia. Other tor nodes are not affected. Once a node has been identified as broadcasting an exit node to Wikipedia, nmap is used to double-check that the ports are actually open (some have Tor or their networks misconfigured and there is no reason to block these IPs). Only those IPs relevant to Tor are checked and this check is run on my server.
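The DNSBL lookup described above can be sketched roughly as follows. This is only an illustration, not the bot's actual PHP/Lua code: the request does not name the DNSBL, so the zone `ip-port.dnsbl.example` is a placeholder, and the query layout (reversed node IP, destination port, reversed destination IP, zone) is an assumption modelled on the usual ip-port style of exit-list lookups. The function names are hypothetical.

```python
import ipaddress
import socket

# Placeholder zone: the request does not say which DNSBL the bot queries.
EXAMPLE_ZONE = "ip-port.dnsbl.example"


def reverse_octets(ip: str) -> str:
    """Return an IPv4 address with its octets reversed, as DNSBL names use."""
    # IPv4Address() raises ValueError on malformed input.
    return ".".join(reversed(str(ipaddress.IPv4Address(ip)).split(".")))


def build_query(node_ip: str, dest_ip: str, dest_port: int,
                zone: str = EXAMPLE_ZONE) -> str:
    """Build the DNS name asking: does node_ip exit to dest_ip:dest_port?"""
    return (f"{reverse_octets(node_ip)}.{dest_port}."
            f"{reverse_octets(dest_ip)}.{zone}")


def exits_to(node_ip: str, dest_ip: str, dest_port: int,
             zone: str = EXAMPLE_ZONE) -> bool:
    """Treat any successful resolution as 'listed' (assumed convention)."""
    try:
        socket.gethostbyname(build_query(node_ip, dest_ip, dest_port, zone))
    except OSError:
        # NXDOMAIN and other resolver failures: treat as not listed.
        return False
    return True
```

Only `exits_to` touches the network; `build_query` can be checked offline, which keeps the name construction easy to verify before pointing it at a real zone.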
The block time for the trial before was 24 hours (because it was a trial) but it is likely that we should increase this, perhaps to a month. Community input for this would be appreciated.
The reason this was withdrawn last time was because the TorBlock extension is expected to be able to do this, and we appeared to be getting support with this extension. However, it is apparent that the fix was insufficient, and so I would like to put this bot up for approval once again as a stop-gap measure against the present Tor abuse.
If the community desires, we can leave this bot offline except at times when there are abusive Tor users running around, though I'm not sure what the benefit of this would be.
Please note that the concerns from the original request about whether or not this bot meets Toolserver rules have been addressed, as this bot no longer runs on the Toolserver. It does, however, make a request to a PHP script on Toolserver simply to check if an IP is already blocked (this is done because it is easier to query directly from the database than it is to scrape it from Wikipedia).
I will post notifications on VPT and AN regarding this bot for community input.
Discussion
A notification to the community has been placed at AN and VPPR --Shirik (Questions or Comments?) 18:22, 31 October 2010 (UTC)[reply]
As an indicator of this bot's utility, I would like to point out that this bot, during its previous trial (many months ago), blocked both 216.24.174.245 (talk · contribs · WHOIS) and 89.16.175.194 (talk · contribs · WHOIS), both of which are still under blocks by checkusers for Tor abuse. It is evident that the TorBlock extension isn't picking up even some long-term exit nodes. --Shirik (Questions or Comments?) 18:22, 31 October 2010 (UTC)[reply]
- I will probably add some comments in a bit, but just to inform people that I reopened bugzilla:23321 earlier today. -- zzuuzz (talk) 18:29, 31 October 2010 (UTC)[reply]
- It does seem that Zealking is back, as he edited as Special:Contributions/Dr.ZL_King today. It may be more than coincidence that this happened so soon after the RefDesk vandal appeared; perhaps Tor itself has changed in a way that makes it harder to detect and therefore block? I don't really know anything but hopefully we can do something. —Soap— 21:29, 31 October 2010 (UTC)[reply]
Some comments and suggestions:
- I would suggest a block length of no longer than two months, once the bot is proven, unless there's also a mechanism for unblocking.
- Are you aware that anon only blocks are virtually useless on rapidly rotating networks like Tor? Would you please consider hardblocks, as checkusers and admins would normally do?
- Could you add the {{Tor}} template to the block message, and not capitalise TOR.
- The block check doesn't seem to include rangeblocked IPs. I think there are quite a few of them. I don't know if that's something you'd want to consider.
- You are probably going to miss a number of nodes with the port check, as an increasing number move away from the standard ports. Since the ports are broadcast I'd hope you could find a way to check which ones they advertise.
- I understand the bot will be reading in a list of IPs from http://torstatus.kgprog.com/ ? It's currently down for me, and I think it has been for some time. If you were to read in a list from somewhere else you may need to update the regexp.
- I wouldn't mind seeing an up to date test run.
- Otherwise, as long as it works, I don't see any credible policy objections given that we've already got the Tor extension and ProcseeBot.
-- zzuuzz (talk) 20:29, 31 October 2010 (UTC)[reply]
- Thanks for the advice. The port check was kinda hacked together. You make a good point that I could manually look up the port and check only that port; I will switch it over to that. I also haven't run a test which is why I didn't realize that URL was down; I can fix that up as well. Once I've fixed those two things up I will be sure to do a test run and publish results.
- Additionally, regarding the hardblocks. Originally I had considered doing hardblocks but it was requested (off-wiki) that I take it down to AO/ACB. Whichever way the community chooses to go is that which I would do; it's a trivial change. --Shirik (Questions or Comments?) 20:34, 31 October 2010 (UTC)[reply]
- I have tried to address your concerns:
- I now use a new list (the other list was official, as is this one, but I'm not sure where the original one went). I am still looking for an alternative mechanism but this works at least for now.
- I now check the published routing port and try to connect to that directly. This will also handle non-standard ports (this is evidenced in the test data below, e.g., port 59001). I also only check that port, so this has an added benefit of reducing false negatives due to SPI firewalls.
- I have not yet addressed rangeblocked IPs; I intend to but this will take a little more time.
- I will gladly move to hardblocks if that is consensus. Such a change would take literally seconds to make.
- I have adjusted the block summary.
- Please see the following for test data summary (it's a bit long so I chose not to add it here): testdata.txt (note I'm still going through this to check accuracy)
- Please feel free to give any additional advice/criticism --Shirik (Questions or Comments?) 22:05, 31 October 2010 (UTC)[reply]
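The single-port check described in the list above (connect only to the port the node advertises) could look something like the sketch below. It is an assumption-laden illustration rather than the bot's own code: `port_is_open` is a hypothetical name, and the five-second timeout is an arbitrary choice.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Only the one advertised port is tried; nothing else on the host is probed.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, etc.: treat the port as closed.
        return False
```

Because it attempts one ordinary TCP connection, this is closer to a reachability test than to a scan, and a non-standard port such as 59001 is handled the same way as any other.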
- Additional note: I also corrected the capitalization of "tor", as I know how angry I get when people incorrectly capitalize Lua. I've made sure the block reason also transcludes {{tor}}. --Shirik (Questions or Comments?) 22:16, 31 October 2010 (UTC)[reply]
- I've also adjusted the initial estimates -- there are more tor exit nodes than I initially thought, so we're talking closer to an initial hit of 500 blocks. After this initial hit I still only expect a few blocks per day. --Shirik (Questions or Comments?) 22:20, 31 October 2010 (UTC)[reply]
I concur with the desire to convert the block to a hard block: named accounts shouldn't be using tor nodes either. As a future development, you should consider rechecking nodes that you have blocked on a regular basis, and unblocking/reblocking as appropriate. That makes the choice of block length less important.—Kww(talk) 22:43, 31 October 2010 (UTC)[reply]
I've implemented the check for rangeblocks now, and I've re-run the test to show the results with rangeblock data. --Shirik (Questions or Comments?) 01:55, 1 November 2010 (UTC)[reply]
- I've had a look through the test data and didn't notice anything weird. A very small number such as 83.30.229.138 and 83.171.151.28 weren't Tor by the time I checked, but they were recently and can be considered fair game for a short block. I found one false negative: 79.105.153.108. I'd suggest the blocks are kept fairly short until you can implement some sort of recheck and unblock mechanism. Regarding the rangeblocked IPs, it's not a problem if they get blocked (again) directly, but it is a problem if they get softblocked while the range is hardblocked. Most of these types of ranges are hardblocked. But that all seems OK now. No further comments from me. You know where I stand on the hardblocks. Thanks. -- zzuuzz (talk) 08:05, 1 November 2010 (UTC)[reply]
- RE: the "short block" argument. I'm not sure there is any real benefit to quickly unblocking an old tor node. Once it has been used as a tor node to access Wikipedia, isn't there a relatively high probability that it will be used that way again? I agree that rechecking and unblocking is necessary, but it doesn't seem like anything to try to make particularly speedy.—Kww(talk) 14:42, 1 November 2010 (UTC)[reply]
- A good proportion of Tor IPs are dynamic, and many others get closed soonish. I've unblocked a few (other types of proxies) blocked by ProcseeBot and requesting unblock and reported to WP:OP within the two month block length. Who knows how many didn't bother requesting unblock. I don't think there's a particular hurry to unblock them once they've been Tor, as you still have to consider regular downtimes, but there's no need to stretch the blocks more than a month beyond. Anyone who checks open proxies for unblock would always wait at least a few days, often more than a week. -- zzuuzz (talk) 16:19, 1 November 2010 (UTC)[reply]
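The rangeblock check discussed in this thread amounts to asking whether an address falls inside any blocked CIDR range before a direct block is placed. A minimal sketch, assuming the blocked ranges are already available as a list of CIDR strings (how the bot actually obtains them is not stated here, and `covering_ranges` is a hypothetical name):

```python
import ipaddress
from typing import Iterable, List


def covering_ranges(ip: str, blocked_ranges: Iterable[str]) -> List[str]:
    """Return the blocked CIDR ranges that contain ip, in input order."""
    addr = ipaddress.ip_address(ip)
    return [
        cidr
        for cidr in blocked_ranges
        # strict=False tolerates ranges written with host bits set.
        if addr in ipaddress.ip_network(cidr, strict=False)
    ]
```

If the returned list is non-empty, the bot could skip the address or compare block types first, which would avoid placing a softblock on an IP inside a hardblocked range.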
The Tor extension is evil and sucks! So let's use the exact same source of data! Q T C 17:30, 2 November 2010 (UTC)[reply]
- Are you sure this is where the tor extension is getting its data? If this was true, then the bot shouldn't be picking up nodes that we had to manually block. Alternatively, do you have another suggestion on a data source? --Shirik (Questions or Comments?) 17:35, 2 November 2010 (UTC)[reply]
- I checked the source for the TorBlock extension. It is not the same source of data. --Shirik (Questions or Comments?) 17:38, 2 November 2010 (UTC)[reply]
Tor is awesome, but I also feel it helps people vandalize. Support bot proposal as a second layer of anti-tor protection. Hamtechperson 12:33, 4 November 2010 (UTC)[reply]
- Looks good, but I have just one suggestion. After your bot blocks an IP address, it will query the block status of that address every time it runs. This query is unnecessary because the block status is unlikely to change. Adding a 'caching' function would be a good way to stop the unneeded query. -- d'oh! [talk] 00:55, 11 November 2010 (UTC)[reply]
- I do have intention of doing something like this, in conjunction with the unblocking capability that Zzuuzz had requested, but I would argue that such a capability is not necessary prior to trial because the trial would only be run once or twice and such a capability would not really affect the trial in any major way. --Shirik (Questions or Comments?) 01:16, 11 November 2010 (UTC)[reply]
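The caching idea suggested above might be sketched as follows. This is not something the bot is stated to implement: the class name, the one-hour lifetime, and the choice to cache only "blocked" answers (so that a newly unblocked node is noticed on the next run) are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class BlockStatusCache:
    """Remember recent 'blocked' answers so repeat runs skip the remote query."""

    def __init__(self, lookup: Callable[[str], bool],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup      # the real (remote) block-status query
        self._ttl = ttl_seconds    # how long a cached answer stays valid
        self._clock = clock        # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[bool, float]] = {}

    def is_blocked(self, ip: str) -> bool:
        now = self._clock()
        hit = self._entries.get(ip)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]
        blocked = self._lookup(ip)
        if blocked:
            # Cache positives only; unblocked IPs are rechecked every run.
            self._entries[ip] = (blocked, now)
        return blocked
```

Letting entries expire after a while would also fit the recheck-and-unblock mechanism requested earlier, since a cached answer is eventually verified again.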
{{BAG assistance needed}} I realize that this is an adminbot so extra care must be taken, but we've already shown good success with this above. Additionally, it's been more than a week since bugzilla:23321 was reopened with no real action, and in the meantime disruption is persisting. Hopefully we can at least get this into a trial to simultaneously act as a stop-gap measure and prove the worthiness of the bot, if someone is willing to give the green flag. --Shirik (Questions or Comments?) 17:37, 11 November 2010 (UTC)[reply]
- I'm not sure if this is relevant or not, but apparently, The automatic blocking of Tor exit nodes is working reliably again (bug #23321)--Rockfang (talk) 02:34, 16 November 2010 (UTC)[reply]
- I'm aware of this, but I would like to point out that this is not the first time this exact sequence of events has occurred, and this time through we took over a week of abuse before it was resolved. Instead, I'd still like to push this bot through approval so that either it can be a second layer of defense or we can leave it approved but inactive until we notice TorBlock isn't working again, at which time we can start it up (having already been approved). --Shirik (Questions or Comments?) 03:56, 16 November 2010 (UTC)[reply]
- That seems logical. Thank you for replying.--Rockfang (talk) 23:58, 16 November 2010 (UTC)[reply]
I'm no expert when it comes to tor. But there seems to be a clear consensus for this bot, and it's basically a duplicate of what the software should do automatically. Since this software is prone to breaking, it may be a good idea to have this bot as a back-up (in addition, the bot uses a different source to find the exit nodes). Is there still a wish to do a trial? I seem to recall a small dry run taking place, but maybe that was a dream I had. It might be a good idea to perform a dry run (listing the actions the bot would take on a user page somewhere). Or if you prefer we can try running it live instead, although I'm not sure if it's going to conflict with the TorBlock extension. Could someone provide an example of a block by this extension? Will the bot pick up that the IP address is already being blocked by the extension? - Kingpin13 (talk) 22:04, 23 November 2010 (UTC)[reply]
- I ran a dry run a few weeks ago linked above (and here so you don't have to go searching for it) which basically means the only thing that hasn't been tested is the final step of actually performing the block. I can run a fresh dry run if you'd like so that we can get fresher data (it is highly likely that some of the detections in the above list are no longer tor nodes).
- The TorBlock extension works at the code level directly in a manner parallel to actual blocks. What I mean by this is that it doesn't actually show up in the block log (and for that reason there's no way for me to verify that TorBlock has actually blocked it), it just keeps an internal list of exit nodes that should be blocked and every edit is run through that list; I like to think of it as closer to an automatically-updated abuse filter, except that it applies to when you open the edit page instead of when you save the edit (though it does check there, too, for security reasons). Putting a block on any nodes which are being detected by the TorBlock extension is harmless in the sense that TorBlock cannot be overridden (except, I believe, by IPBE), though TorBlock has an added benefit of not taking up an entry in the block log. Which way we want to go (leaving it down until there's a reason to pull it up, or leaving it constantly running) could work both ways. Leaving it down has the benefit of leaving block logs a bit more clear, but leaving it up means we are less likely to be affected by problems with TorBlock as well as a reduced time-to-block for newer exit nodes (TorBlock has a fixed rate at which it updates its exit node list; this bot could run faster if we so desire, with no additional stress on Wikipedia). --Shirik (Questions or Comments?) 13:59, 24 November 2010 (UTC)[reply]
- (Passing comment after reading this) I would rather see this bot only run during "times of trouble" and the TorBlock extension switched off too. IMO we should have an uneasy co-existence with open proxies, since they do have a purpose other than abuse, one that fits perfectly with the idea of "anyone can edit". If there is excessive abuse of it, sure, shut it down - but otherwise, let it be. I personally am not involved in the depths of this battle, so I'm not aware of what percentage of edits coming through Tor nodes are good edits. Those are the ones we should be thinking of though. Franamax (talk) 20:54, 25 November 2010 (UTC)[reply]
- The problem is that there's always times of trouble with proxies. (X! · talk) · @926 · 21:13, 25 November 2010 (UTC)[reply]
- The policy on open proxies is quite clear: don't do it. This bot is only enforcing that policy. Those that have legitimate reasons for their use can (and have) been given IP block exemption, but that should be so excessively rare that we should (and do) deal with it on a case-by-case basis. Open proxies are constantly being abused by editors on a literally daily basis, and when TorBlock went down recently we saw a massive influx of vandalism. Just a few days' look at WP:SPI should be able to describe the level of problems that arise out of open proxies, and that's why the policy is written like that. --Shirik (Questions or Comments?) 22:02, 25 November 2010 (UTC)[reply]
- To quite flagrantly cherrypick from that policy, "...legitimate users ... may freely use proxies...". Running this bot continuously runs against the spirit of "...until they are blocked." For as long as I can remember (only 3 years here) the uneasy coexistence has more or less worked, i.e. when Tor nodes come to our attention, we block 'em. I'm not in favour of a pre-emptive approach, as it presupposes that all proxy edits are going to be bad. The issue I have is that while we can all agree that many bad edits come from open proxies (I've seen quite a lot of the RefDesk vandal mentioned elsewhere), I have no information on how many good edits will be rejected. Comparing the OP "range" to any other netblock, why is this particular one being singled out? I don't find the IPblock-exempt angle all that persuasive either, that's not how people normally get addicted to editing here. Obviously if this is built into MediaWiki I'm behind the discussion though. I'm also troubled by the notion of officially sanctioning a process which makes intrusive port scans to external systems. I do that stuff from my own computer in times of need and I'm not going to ask en:wiki to sanction it. Franamax (talk) 22:50, 25 November 2010 (UTC)[reply]
- I think you're overreacting. First off, what do you think ProcseeBot does? Secondly, I don't do port scans wildly. First, I query a list of published tor nodes. Then I ask the tor service directly if it agrees that (1) the node is a tor node and (2) it exits to Wikipedia. If, and only if, this is confirmed, I try to connect to the advertised port only as a double-check that it really exists. I don't scan any other ports other than the advertised one, and I don't scan hosts that have not been advertised by two distinct sources as being a tor exit node. --Shirik (Questions or Comments?) 01:51, 26 November 2010 (UTC)[reply]
- Oh, I missed one other question. "How many good edits will be blocked?" The answer is, theoretically, none. The configuration on Wikipedia is (and has been for some time) such that TorBlock should be enabled. This should never let a user without IPBE edit. This bot is a band-aid for when that extension goes down, which has happened relatively frequently. Additionally, both the {{tor}} and MediaWiki:Torblock-blocked notices clearly indicate how to request exemption. We have been running in this manner for an extremely long time now; this is nothing new. --Shirik (Questions or Comments?) 02:04, 26 November 2010 (UTC)[reply]
- In response to Kingpin, I've found a way to check if TorBlock is currently blocking the IP by manually setting up a tor circuit to exit through that node and check if it's blocked (naturally this would be done without logging in). This will take a little reworking, but it looks like a good solution. With a little extra work this will also allow us to eliminate the dependency on the external site that is reporting the tor nodes (though I would still keep the DNSBL for verification) as we can query the directory list manually. --Shirik (Questions or Comments?) 01:51, 26 November 2010 (UTC)[reply]
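The verification sequence described in this thread (published list, confirmation from a second source, a connection to the advertised port only, then a check for an existing block) can be summarised in a short sketch. The three checks are passed in as callables because their real implementations are not shown on this page; all names are hypothetical, the ordering of the last two checks is an assumption, and the de-duplication step is an addition for illustration.

```python
from typing import Callable, Iterable, List, Set, Tuple

Node = Tuple[str, int]  # (IP address, advertised port)


def select_blocks(
    candidates: Iterable[Node],
    is_listed_exit: Callable[[str], bool],
    port_is_open: Callable[[str, int], bool],
    is_already_blocked: Callable[[str], bool],
) -> List[str]:
    """Return the IPs that pass every check, in first-seen order."""
    seen: Set[Node] = set()
    selected: List[str] = []
    for ip, port in candidates:
        # Skip repeated directory entries and IPs already selected.
        if (ip, port) in seen or ip in selected:
            continue
        seen.add((ip, port))
        if not is_listed_exit(ip):        # second source must agree
            continue
        if not port_is_open(ip, port):    # advertised port only
            continue
        if is_already_blocked(ip):        # nothing to do
            continue
        selected.append(ip)
    return selected
```

A dry run like the ones reported here would simply log the returned list instead of issuing blocks.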
Arbitrary break
I have made significant improvements based on the feedback from various people. I have identified a (hackish, but workable) solution to simultaneously make a triple-check that an IP is an exit node and check if TorBlock has blocked a given node. This would allow the bot to run in tandem with TorBlock and pick up anything it seems to miss. Because TorBlock shouldn't be down for long periods of time, I think it's a good argument to make that, should this bot be approved, we should keep it to very short durations (it can always re-block if necessary). I'm thinking 2 weeks. This is enough time for TorBlock to be repaired while keeping things under control. I am running a dry-run test of it right now to see what happens, but I have already noticed a few nodes that are not blocked (to be fair, they appear to be recently spawned; this is typical as TorBlock isn't instant, however it is necessary to block these nodes as quickly as possible as there are vandals that rebuild their circuits hoping to find these nodes, so I don't advocate putting in a check to only block nodes if they have a given uptime unless consensus is for that). I will post the dry run results when I have them. --Shirik (Questions or Comments?) 06:44, 26 November 2010 (UTC)[reply]
- Dry run complete. It found 27 unblocked tor nodes, some of which have been up for quite some time. I verified each at the time they were detected. Some were detected multiple times; this is because they had duplicate entries in the directory; this is a non-issue because they would be blocked after the first detection which would cause the second check to be skipped. I'm not sure why TorBlock is missing some of these nodes (my understanding is that any nodes that have been up for at least 30 minutes should be blocked, but this is apparently not the case). --Shirik (Questions or Comments?) 15:23, 26 November 2010 (UTC)[reply]
- I ran a second dry run after noticing that my shell got disconnected mid-run last night (which caused a SIGHUP and early termination of the run). With a full run it found 56 unblocked tor nodes, all of which are unique and have been verified. --Shirik (Questions or Comments?) 19:36, 26 November 2010 (UTC)[reply]
{{BAG assistance needed}} It's been a week and a half since the last set of comments and a week since the last dry run posted. Moreover I think that dry run shows that TorBlock is slower to react than we think (which is a good indicator of how some of our more well-known tor vandals might still be operating). What are our next steps? --Shirik (Questions or Comments?) 08:34, 3 December 2010 (UTC)[reply]
Approved.: I am satisfied that the bot both conforms to current policy (for which a significant consensus level is implied and added to by most contributors here) and is technically proficient. Please be aware, as I'm sure you will be, that any consensus formed against the bot in the future should be taken as an act of WP:CCC, and in essence negates this BRFA. What with our collection of fun-filled drama boards this happens surprisingly regularly on such contentious issues, and I would really hate to see my approval misquoted as some sort of carte blanche (which, of course, it isn't). Regards, - Jarry1250 [Who? Discuss.] 16:54, 3 December 2010 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.