Wikipedia:Bots/Requests for approval/WebCiteBOT
- The following discussion is an archived debate. Please do not modify it. Subsequent comments should be made in a new section. The result of the discussion was Approved.
Automatic or Manually Assisted: Automatic, with only occasional supervision
Programming Language(s): Perl
Function Summary: Combat link rot by automatically submitting newly added URLs to WebCite (WebCitation.org)
Edit period(s): Continuous
Already has a bot flag (Y/N): N
Function Details: This BOT has not actually been coded yet. I am submitting this proposal to test for consensus and iron out the exact details of the bot's operation. I have posted notices in several locations to encourage community participation in evaluating this request.
Background: Link rot is a major problem on all large websites, and Wikipedia is no exception. WebCite is a free service that archives web pages on demand. Unlike archive.org, it archives a page at the moment it is submitted, and immediately returns a permanent link to a copy of the cited content. It is intended for use by scholarly pages (like Wikipedia) that cite other online resources. I am proposing a bot (coded by me) that will automatically submit URLs recently added to Wikipedia to WebCite and then supplement the Wikipedia link as follows:
Original:
*Last, First. [http://originalURL.com "Article Title"], ''Source'', January 22, 2009.
Renders as: Last, First. "Article Title", Source, January 22, 2009.
Modified:
*Last, First. [http://WebCiteURL.com "Article Title"], ''Source'', January 22, 2009. [[WebCite]]d on January 23, 2009; Originally found at: http://originalURL.com
Renders as: Last, First. "Article Title", Source, January 22, 2009. WebCited on January 23, 2009; Originally found at: http://originalURL.com
New proposed format:
Original 1 - bare reference:
<ref>[http://originalURL.com "Article Title"]</ref>
Renders as: *"Article Title"
Modified 1*:
<ref>{{cite web |url=http://originalURL.com |title=Article Title |accessdate=(date article added to page) |author=(bot will try to find one on link page) |date=(bot will try to find one on link page) |publisher=(bot will pick one from list or use whatever.com) |archiveurl=http://archiveURL.com |archivedate=2009-01-26 |dead=0}}<!--- info filled in by BOT ---></ref>
Renders as: *Last, First (2007-04-30). "Article Title", Source, Retrieved on 2009-01-20. (Archived on 2009-01-26.)
Original 2 - "cite web"ed (similar format used for "cite news"):
<ref>{{cite web |url=http://originalURL.com |title=Article Title |accessdate=2009-01-25 |author=Last, First |date=2008-12-20 |publisher=Coolstuff.com}}</ref>
Renders as: *Last, First (2008-12-20). "Article Title", Coolstuff.com, Retrieved on 2009-01-25.
Modified 2*:
<ref>{{cite web |url=http://originalURL.com |title=Article Title |accessdate=2009-01-25 |author=Last, First |date=2008-12-20 |publisher=Coolstuff.com |archiveurl=http://archiveURL.com |archivedate=2009-01-26}}</ref>
Renders as: *Last, First (2008-12-20). "Article Title", Coolstuff.com, Retrieved on 2009-01-25. (Archived on 2009-01-26.)
Original 3 - hand referenced:
<ref>*Last, First. [http://originalURL.com "Article Title"], ''Source'', January 22, 2009.</ref>
Renders as: *Last, First. "Article Title", Source, January 22, 2009.
Modified 3:
<ref>*Last, First. [http://originalURL.com "Article Title"], ''Source'', January 22, 2009. ([http://archiveURL.com Archived] on January 26, 2009.)</ref>
Renders as: *Last, First. "Article Title", Source, January 22, 2009. (Archived on January 26, 2009.)
* - For the {{cite web}} template to actually produce results in this format, it will have to be modified as follows:
- add a new parameter, "dead"
- when dead is 0 or missing, use the original URL as the primary (i.e. the first one)
- when dead is set to 1, use the archived link as the primary and put "Originally found at http://originalURL.com (no longer available)" at the end of the reference.
Operation: WebCiteBOT will monitor the URL addition feed at IRC channel #wikipedia-en-spam. It will note the time of each addition, but take no immediate action. After 24 hours have passed it will go back and check the article to make sure the new link is still in place and that the link is used as a reference (i.e. not as an external link). These precautions will be in place to help prevent the archiving of spam/unneeded URLs.
After 24 hours have passed and the link has been shown to be used as a reference, the bot will take the following actions:
- Make sure the article in question isn't tagged for deletion. If it is, the bot will add it to a separate queue to be rechecked after 10 days' time (to see whether the article was actually deleted).
- Check to make sure the URL is functional as supplied. If it's not, it will mark the link with {{dead link}} and notify the person who added it, in the hopes that they will correct it.
- Make sure the URL is not from a domain that has opted out of WebCite's archiving. The list of such domains will be built up as the bot encounters domains so tagged.
- Submit the URL to WebCite.
- Check if the submission was successful. If not, figure out why and update the bot's logs as needed.
- If all was successful, update the Wikipedia article. This will be accomplished through a new parameter to {{cite web}}, and possibly a few other templates, titled "WCarchive". If the original reference was a bare reference (i.e. <ref>[http://ref.com title]</ref>) the bot will convert it to a {{cite web}}, filling in whatever fields it can and leaving a hidden comment to say it was filled in by a bot (a rough sketch of such a conversion follows this list).
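A rough sketch of such a bare-reference conversion is below. It is illustrative only - the bot has not been written yet, and the helper name and the variables holding the access and archive dates are placeholders.
use strict;
use warnings;

# Illustrative only: one way a bare, titled <ref>[url title]</ref> could be
# rewritten into {{cite web}}.  $access_date, $archive_url and $archive_date
# are assumed to have been determined elsewhere by the bot.
sub bare_ref_to_cite_web {
    my ($wikitext, $archive_url, $archive_date, $access_date) = @_;
    $wikitext =~ s{<ref>\s*\[(https?://\S+)\s+([^\]]+)\]\s*</ref>}
                  {<ref>{{cite web |url=$1 |title=$2 |accessdate=$access_date |archiveurl=$archive_url |archivedate=$archive_date}}<!--- info filled in by BOT ---></ref>}g;
    return $wikitext;
}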
Links for sites that have been previously archived will be handled as follows:
- if less than 24 hours have passed, assume the link is unchanged and add a link to the existing archive
- if more than 24 hours have passed, check if the linked page has changed since it was archived. This will be done via a checksum comparison (see the sketch after this list).
- If no changes were made, use the old archive; if it has changed, make a new archive.
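A minimal sketch of the checksum test described above, in Perl. The 24-hour cutoff and the hash stored at archive time come from the proposal; everything else (names, database handling) is assumed for illustration.
use strict;
use warnings;
use LWP::UserAgent;
use Digest::MD5 qw(md5_hex);

# Decide whether an existing archive of $url can be reused.  $old_hash and
# $age_in_hours are assumed to come from the bot's own database.
sub needs_new_archive {
    my ($url, $old_hash, $age_in_hours) = @_;
    return 0 if $age_in_hours < 24;        # archive is recent: reuse it
    my $res = LWP::UserAgent->new->get($url);
    return 0 unless $res->is_success;      # page unreachable: nothing new to archive
    return md5_hex($res->content) ne $old_hash;
}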
Discussion
Policy Comments:
- (I.e. Is this a good idea?)
As per my comments at the bot request, I believe the bot should not supplant the main URL. I suggest a system where {{citeweb}} is modified in such a way that the bot fills the normal archive parameters, but also adds a parameter like |archive=no, which stops the archive URL becoming the primary link in the citation. This parameter can be removed when the normal URL is confirmed to be 404'd, and then the archive URL becomes the primary link. I suggest something like what is shown below when the |archive parameter is set to no:
- "My favorite things part II". Encyclopedia of things. Retrieved on July 6, 2005. Site mirrored here automatically at July 8, 2008
and then the normal archiving look when |archive is not set. Foxy Loxy Pounce! 04:20, 24 January 2009 (UTC)[reply]
- I'm open to either option. On the one hand the archived version of the referenced article will always contain the exact information the user saw when they added the reference. On the other hand, the live URL may contain updated information that should be incorporated into our article. Also, I'm not sure adding a hidden parameter would help much since a user would still have to both notice the URL had gone dead and know about the hidden parameter in order to fix the problem.
- Thus, I'd probably prefer a solution that shows both URLs although I have no preference as to which is used as the primary URL for the reference. --ThaddeusB (talk) 05:05, 24 January 2009 (UTC)[reply]
- How about this version I once saw on fr.wp:
- "My favorite things part II". Encyclopedia of things. Retrieved on July 6, 2005. [Archive]
- ChrisDHDR 15:11, 24 January 2009 (UTC)[reply]
- I like that idea, although it would be good to incorporate the date somehow. Foxy Loxy Pounce! 01:56, 25 January 2009 (UTC)[reply]
- I think it's a great idea. Who knows how many dead links we have. Probably millions. This should solve that, mostly. I just wonder if the real URL should show up first in the reference. - Peregrine Fisher (talk) (contribs) 04:20, 24 January 2009 (UTC)[reply]
- I think the original should take precedence. –xeno (talk) 19:05, 24 January 2009 (UTC)[reply]
How will this handle duplicates? Add the |archive parameter, or ask WebCite to re-archive it? LegoKontribsTalkM 07:16, 24 January 2009 (UTC)[reply]
- There are at least two distinct forms of link duplication. 1) The same exact link was added multiple times in one or more articles in close succession - likely by a single user. 2) A link that was previously used on some other article just happened to be used again on a different article at a later date. In the first case, the editor was clearly looking at the same version of the page each time, and thus only one archive is required. In the second scenario, there is no guarantee that the users were viewing the same content, in which case a second archive is probably a good idea.
- Thus I will likely approach this by keeping a database of when each link was archived, having all references to a specific link that are added within a few days of each other point to the same WebCite archive, and allowing for the creation of new archives when the original archive is more than a few days old. Not sure what the precise definition of "a few days" should be, so I'm open to suggestions on that point. --ThaddeusB (talk) 05:16, 25 January 2009 (UTC)[reply]
- I think the best way to solve this problem would be to store the links which the bot has archived in a database table. If the link is the same, download the page which is going to be archived and compare it with the WebCite version. If there are any differences you can resubmit for archival; if not you can ignore the link. After you have made the first comparison you could take a hash of the target page to compare in the future, to prevent the need to download both the WebCite version and the target version. —Nn123645 (talk) 15:59, 25 January 2009 (UTC)[reply]
- Good suggestion. I have added a checksum function to its proposed mode of operation. --ThaddeusB (talk) 06:06, 26 January 2009 (UTC)[reply]
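As an illustration of the bookkeeping suggested here, a Perl/DBI sketch follows; the table and column names are invented, and the real schema is up to the bot's author.
use strict;
use warnings;
use DBI;
use Digest::MD5 qw(md5_hex);

# Connection details are placeholders.
my $dbh = DBI->connect('DBI:mysql:database=webcitebot', 'user', 'password',
                       { RaiseError => 1 });

# Remember the hash of the page content at the time it was archived.
sub record_archive {
    my ($url, $page_content, $archive_url) = @_;
    $dbh->do('INSERT INTO archived_links (url, content_hash, archive_url, archived_at)
              VALUES (?, ?, ?, NOW())',
             undef, $url, md5_hex($page_content), $archive_url);
}

# Return an existing archive URL if the page content is unchanged, else undef.
sub existing_archive_for {
    my ($url, $page_content) = @_;
    my ($archive_url) = $dbh->selectrow_array(
        'SELECT archive_url FROM archived_links WHERE url = ? AND content_hash = ?',
        undef, $url, md5_hex($page_content));
    return $archive_url;
}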
- Actually, WebCite is handling that part already internally. WebCite does not physically store/archive a document twice if the hash sum is the same. So if you repeatedly request to archive the same URL on different dates, WebCite will only archive it once. (Note that the WebCite ID which WebCite spits out differs each time you archive the same unchanged webpage, but behind the scenes it points to the same document. The WebCite ID encodes the URL and the archiving date/time, so that's why it will be different each time it is archived. Archived documents can be retrieved using the WebCite ID, a url/date combination, or even the hash sum itself). (Gunther Eysenbach - WebCite) --Eysen (talk) 18:23, 26 January 2009 (UTC)[reply]
- This is an excellent idea. However, it needs to be careful not to submit spam links. Perhaps wait some time to see if a link gets reverted before archiving it. —AlanBarrett (talk) 16:16, 24 January 2009 (UTC)[reply]
- I agree. The current proposal states "After 24 hours have passed it will go back and check the article to make sure the new link is still in place." By this I meant: make sure the link wasn't reverted or otherwise removed from the Wikipedia article. Sorry if it wasn't clear. Let me know if you think I should change the time frame. --ThaddeusB (talk) 05:16, 25 January 2009 (UTC)[reply]
- Sorry, I missed that part of the proposal. I don't know whether 24 hours is long enough to wait to see if a link is reverted or becomes subject to a dispute, but it's a reasonable starting point for experiment. —AlanBarrett (talk) 19:43, 25 January 2009 (UTC)[reply]
- Wikipedia:Citing sources says "You should follow the style already established in an article, if it has one. Where there is disagreement, the style used by the first editor to use one should be respected." Apparently this bot is intended to work with {{Cite web}} and perhaps a few other citation templates. It should not change citations that are not within a template to a citation template. Above, there is a mention of converting bare links. Perhaps we could agree that a bare link is a poor citation style, but how can the bot tell the difference between a bare link and a proper citation that is not within a template? Even if it could tell, how can it tell whether the other citations in the article use templates or not? --Gerry Ashton (talk) 19:49, 24 January 2009 (UTC)[reply]
- This is a valid point, but IMO a bare link reference like <ref>[http://somesite.com]</ref> or <ref>http://somesite.com</ref> isn't really a style, but rather an implied {{cite web}} without the fields filled in. Now if someone adds a link like <ref>Cool author. "The History of Stuff" [http://stuff.com Cool Stuff] Retrieved 1/23/09</ref> that is a different story entirely. Ideally the bot would simply add something like "[http://webcite.org Archived] on 1/24/09" to the end of such references. In the event this proves unfeasible, it will just ignore such formatted references.
- As to how it can tell if a link is in a template or not, the bot will only consider things enclosed in <ref> tags to be references (and possibly links found in the "references"/"sources" sections - I haven't decided on that one yet) and ignore all other links. It can then easily process the text between those tags to see what kind of reference it is.
- Now sometimes people add references by simply doing something like [http://somesite.com]. I don't see any practical way to 'fix' these links without risking a style change, so they will simply be ignored.
--ThaddeusB (talk) 05:18, 25 January 2009 (UTC)[reply]
- If someone adds <ref>[http://somesite.com]</ref> or <ref>http://somesite.com</ref> to an article that already has cite XXX tags, or no other references, it would indeed be appropriate to turn it into a cite web, either because it matches the style of the other references, or because you, through the bot, are making an editorial choice to use cite web in an article that previously had no references. But if the article already has valid hand-formatted references, then the new reference should be cast into the existing style, which is beyond the ability of a bot. --Gerry Ashton (talk) 05:52, 25 January 2009 (UTC)[reply]
I rather strongly disagree with supplementing existing citations. Why not make this endeavour widely known, so people who encounter dead links from now on know to search on WebCite and change the link (or expand the template) manually. It just seems like automatically supplementing article citations would lead to a lot of unnecessary (and IMO disruptive) watchpage traffic, similar to the complaints that have often been logged against Lightmouse and his bot. — Huntster (t • @ • c) 10:09, 25 January 2009 (UTC)[reply]
- I think the idea is a) to do only new links (so much less in the way of watchpage traffic) b) that by the time the links are broken, it's too late - and much better to have archived the page at the time that the user was looking at it (or shortly afterwards). - Jarry1250 (t, c) 11:54, 25 January 2009 (UTC)[reply]
- No no, I'm 100% for the bot automatically archiving linked pages, this actually excites me greatly...I just don't like the idea of changing the citation text in an article to reflect that such an archive has taken place. That is something better left to a manual change. — Huntster (t • @ • c)
- Oh, right, now I understand. Yeah, it does sound exciting considering the scale of the problem. - Jarry1250 (t, c) 14:20, 25 January 2009 (UTC)[reply]
- Maybe it could create archives after 24 hours, but only add them after a week. Then, if 1+ refs are added every day for a week, there would only be one bot edit. - Peregrine Fisher (talk) (contribs) 18:30, 25 January 2009 (UTC)[reply]
- I'm still not seeing why articles need to be edited in the first place. If it is made well-known that this effort in archiving is ongoing then people can update dead links manually. Why add a link to the archived page when the existing link still works? It's just more clutter. — Huntster (t • @ • c) 00:46, 26 January 2009 (UTC)[reply]
- A few reasons. Pages change, so with this, people can see the exact state of the page that was used as a ref. Also, while an experienced editor who knew we had been archiving every web page could fix a dead link, that doesn't help readers or less knowledgeable editors. And, editors don't notice dead links all the time. Look at some of the featured article candidates. These are pages that have had massive amounts of effort put into them, and the dead links still weren't caught until the candidacy. If it can be done automatically, why make a human do the work? Yeah, it will add a bit of clutter. Any article with refs is already cluttered looking in the edit window, though. The short version of the webcitation.org links are pretty short and clean, too. - Peregrine Fisher (talk) (contribs) 01:14, 26 January 2009 (UTC)[reply]
- I'm still not liking this aspect of it, but if this goes through, you'll need to interface with the "Cite XXX" and "Citation" template folks to determine the best, and most brief, method to get this installed. — Huntster (t • @ • c) 01:17, 26 January 2009 (UTC)[reply]
- Definitely. - Peregrine Fisher (talk) (contribs) 01:29, 26 January 2009 (UTC)[reply]
- To solve all problems, we need two bots: WebCiteArchivingBOT would immediately archive any link added to Wikipedia. WebCiteLinkingBOT would crawl Wikipedia detecting linkrot, and add the "Site mirrored here" message only if the link to the original website has been dead for a month. I think everybody would be happy with this. Nicolas1981 (talk) 11:21, 18 February 2009 (UTC)[reply]
- Except when some well-meaning user deletes the link two weeks into your month's wait. IMO, it's better to have the archive link present (even if not displayed) immediately to forestall that sort of issue. Having it displayed, as noted, also allows the reader to see the page as it was when originally sourced in case it has changed in the mean time. Anomie⚔ 12:30, 18 February 2009 (UTC)[reply]
- When you say "Lightmouse and his bot", are you referring to Lightbot? Can you give any examples of these complaints (a quick search has not turned up any)? Brian Jason Drake 05:04, 26 January 2009 (UTC)[reply]
- Lightbot is rather infamous in some corners for wasting resources de-linking dates. It wasn't really approved to de-link all dates, just broken ones, and the edits aren't really needed. Yes, linking dates is deprecated, but that doesn't mean mass unlinking them is a good use of resources. That said, there is very little comparison between my proposed bot and Lightbot. The only similarity is that mine would also make a lot of edits. --ThaddeusB (talk) 05:11, 26 January 2009 (UTC)[reply]
- Yes, that was the only comparison I intended to make...the volume of potential edits and cluttering of watchlists. I'm not in any way attempting to diminish the idea here, just expressing my POV with regard to the amount of automated edits that the original suggestion proposed. — Huntster (t • @ • c) 06:00, 26 January 2009 (UTC)[reply]
Great idea! I originally proposed putting more emphasis on editors themselves archiving their web references with WebCite but a bot would be way more efficient. You got my support. OlEnglish (talk) 23:42, 25 January 2009 (UTC)[reply]
I have modified the proposed formatting per the above discussion. Please feel free to make any additional comments you may have about the new suggested format. Once everyone is reasonably happy with the formatting I'll request the appropriate updates be made to {{cite web}}, {{cite news}}, {{citation}}, and any other necessary templates (please specify any you know of). Thanks. --ThaddeusB (talk) 06:06, 26 January 2009 (UTC)[reply]
- It looks like a wonderful idea to automatically archive new references, but please don't edit articles until the link actually goes dead. Maybe add an invisible or small link at the end, but even those edits would clutter watchlists horribly. Automatically finding the author name and publication date seems dangerous. Wrong information is a problem, while missing info is not so bad if we have an archived copy of the page. Also, you might want to check with WebCite if they can handle the traffic. --Apoc2400 (talk) 12:47, 26 January 2009 (UTC)[reply]
- Bot edits can be hidden from watchlists quite easily. — Jake Wartenberg 13:05, 26 January 2009 (UTC)[reply]
- Indeed; bot edits are, in general, clutter. That is precisely why they are hidden from watchlists by default. If you choose to disable that option, you implicitly accept that there will be increased clutter on your watchlist; if you don't want that, don't disable the option. Happy‑melon 16:15, 26 January 2009 (UTC)[reply]
- I think you're confusing recent changes (where bots are hidden by default) and watchlist (where bots are not hidden by default). –xeno (talk) 16:23, 26 January 2009 (UTC)[reply]
- I wouldn't be concerned about clutter. See how publishers like Biomedcentral are handling links to WebCite (see references 44 and 45 in this article [1]). As somebody else pointed out above, one idea of WebCite is also to ensure that what people see on the cited webpage is the same as the page the citing author saw. So it does not make sense to wait "until the link actually goes dead" to display the link to the archived version. The link may be still alive, but the content may have changed. I look forward to the bot implementation! --Eysen (talk) 18:37, 26 January 2009 (UTC)[reply]
- Looks good to me, from what I've seen. I've stumbled across plenty of dead links and I rarely click on the external links. Useight (talk) 00:57, 1 February 2009 (UTC)[reply]
- Solves a well-known problem (links are ephemeral, and even on sites that keep them indefinitely the page may be edited at a later time). Less concerned about spam, since if a spam link were added to the wiki, the worst outcome is webcitation.org has an archived version of the spam webpage (which is negligible in this context); the link itself being removed from the wiki exactly as normal. Sensible precautions against backing up short-lived links are fine.
- One feature that would be quite valuable (separate bot?) would be to compare the main link and archived link data (or compare a hash of the present link with the saved hash of the link as archived, or check the "last modified" date) to spot dead or changed links that may need review. FT2 (Talk | email) 13:37, 7 February 2009 (UTC)[reply]
Cool! I think it would be good to try out, and has the potential to be very productive. --Explodicle (T/C) 16:58, 11 February 2009 (UTC)[reply]
This is a fantastic idea. Speaking as an exopedian, it's often dismaying to revisit an article I wrote a year or two ago with a view to expanding it, then finding that four or five of the links have died and the Wayback Machine has no record of them. If it succeeds, this initiative will significantly advance the encyclopaedia's goal of providing free access to the sum of human knowledge to the people of the world. Skomorokh 16:37, 18 February 2009 (UTC)[reply]
I'm not sure if this is a policy or a technical remark. This sounds like a great idea, will it be usable on non-english projects too (for example, nl.wikipedia.org)? Regards, --Maurits (talk) 16:14, 22 February 2009 (UTC)[reply]
- I think this is a great idea, and I hope it gets approved. – Quadell (talk) 20:36, 2 March 2009 (UTC)[reply]
Technical Comments:
- (I.e. Will the bot work correctly?)
On the technical side of things: since programming has not commenced yet, there is no link to source code, so I can't evaluate whether it will work correctly. Foxy Loxy Pounce! 04:20, 24 January 2009 (UTC)[reply]
- After I feel consensus is forming, I'll start the actual programming and post the code at that time. --ThaddeusB (talk) 04:59, 24 January 2009 (UTC)[reply]
I understand that WebCite honors noarchive instructions contained on web sites, and there are opt-out mechanisms. Is one of the requirements for the bot that it will deal gracefully with a refusal by WebCite to archive a page? --Gerry Ashton (talk) 04:43, 24 January 2009 (UTC)[reply]
- Yes, if a page isn't archived (for whatever reason), no Wikipedia editing will be done. If the site has opted out, it will be added to a list of opted out sites that the bot will pre-screen to avoid doing unnecessary work (submitting URLs that are sure to fail). --ThaddeusB (talk) 04:59, 24 January 2009 (UTC)[reply]
To keep the number of archived pages down, if the bot checks an article after 24 hours and finds it marked for deletion (speedy, prod, AFD), I think it should put the article in a queue to be re-checked in ten days or so, so it won't go archiving references for articles that are just going to be deleted. --Carnildo (talk) 05:07, 24 January 2009 (UTC)[reply]
- Good suggestion. I'm adding this to its proposed mode of operation. --ThaddeusB (talk) 06:28, 24 January 2009 (UTC)[reply]
Okay, so this a very minor point, but what happens if a site is ref'd with an access-date that isn't current? I can imagine copy-paste scenarios and there might be other circumstances as well. The bot should therefore reason that the link is not in fact new, and it should not be archived. (Apologies if this is automatically handled somewhere; the irc feed might do I suppose.) - Jarry1250 (t, c) 13:22, 24 January 2009 (UTC)[reply]
- Good point. I'll add a check to make sure the reference is a current one. If it is an old one, it will only add the WebCite link if one already exists for the corresponding time period. I.e. it won't have a new archive created. --ThaddeusB (talk) 05:20, 25 January 2009 (UTC)[reply]
We should probably make the WebCite people aware of this discussion. This will increase their traffic by several orders of magnitude. — Jake Wartenberg 23:49, 24 January 2009 (UTC)[reply]
- Definitely tell them, but they have already thought of it. Their FAQ page indicates they've thought of doing this themselves. In their "I am a programmer or student looking for a project and ideas - How can I help?" section they say "develop a wikipedia bot which scans new wikipedia articles for cited URLs, submits an archiving request to WebCite®, and then adds a link to the archived URL behind the cited URL" - Peregrine Fisher (talk) (contribs) 23:52, 24 January 2009 (UTC)[reply]
- Done I have sent the project lead an email with a link to this page. — Jake Wartenberg 01:44, 25 January 2009 (UTC)[reply]
- Thanks for alerting us of this. We have indeed thought of a WebCiteBot ourselves, and are highly supportive of this. It seems like some consensus is building for developing this bot, and we are thrilled to see this implemented. As long as we are aware of when this goes live (and there will be some testing anyways), WebCite should be scalable to handle the additional load. We are determined to work with the programmer(s) who wish to develop the bot, including access to our server-side code (which is open source) in case any server-side modifications become necessary. In case anyone isn't aware, there is a rudimentary API which allows WebCite archiving requests and queries using XML [2]. Gunther Eysenbach, WebCite initiator --Eysen (talk) 18:52, 26 January 2009 (UTC)[reply]
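As a sketch of what such an archiving request might look like from Perl: the endpoint and parameter names below (archive, url, email, returnxml) are assumptions made for illustration and would need to be checked against WebCite's actual API documentation.
use strict;
use warnings;
use LWP::UserAgent;
use URI;

# Submit one URL for archiving and return the raw XML reply (undef on failure).
sub submit_to_webcite {
    my ($cited_url, $contact_email) = @_;
    my $api = URI->new('http://www.webcitation.org/archive');   # assumed endpoint
    $api->query_form(url => $cited_url, email => $contact_email, returnxml => 'true');
    my $res = LWP::UserAgent->new->get($api);
    return unless $res->is_success;
    return $res->decoded_content;    # XML to be parsed for the archive URL/ID
}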
An implementation point: rather than adding a new field, could you not use the existing |archivedate= and |archiveurl= parameters in Cite web? Martin (Smith609 – Talk) 19:27, 25 January 2009 (UTC)[reply]
- The problem with that is then the first URL linked to is the archive, and the live page is mentioned at the end of the ref. - Peregrine Fisher (talk) (contribs) 19:40, 25 January 2009 (UTC)[reply]
- I think that's a misfeature in the template, and suggest that it should be fixed globally, not only for new work by this bot. I think that the "url=" link should be more prominent than the "archiveurl=" link in all cases (which is the opposite of the current situation). This should probably be discussed at Template talk:cite web. —AlanBarrett (talk) 19:53, 25 January 2009 (UTC)[reply]
- I strongly agree that the existing archiveurl parameters should be used; adding a parameter such as |WCitation= produces an immediate external dependency: what happens if that site goes down or goes static? We can easily change to a different archive, but the parameter name is then wrong. I agree that the functionality of the |archiveurl= parameter is not optimal, but this should be fairly easy to fix. Happy‑melon 16:13, 26 January 2009 (UTC)[reply]
If you ask WebCite to archive the same page several times, you get different results each time, even if the underlying page hasn't changed. So I'd suggest checking whether WebCite already has a recent archive of the page, and creating a new archive only if necessary. —AlanBarrett (talk) 19:53, 25 January 2009 (UTC)[reply]
- I have added this feature to the proposed mode of operation. --ThaddeusB (talk) 06:06, 26 January 2009 (UTC)[reply]
- See my comment further up in this document. You get different "results" (i.e. a different WebCite ID) each time, even if the underlying page hasn't changed, when archiving the same page multiple times, because the WebCite ID encodes the cited/archived URL and the timestamp. Internally, WebCite checks the hash and stores a document only once, if the document hasn't changed.--Eysen (talk) 18:52, 26 January 2009 (UTC)[reply]
- Eysen has a good write-up at User:Eysen#What I am advocating/supporting in the Wikipedia context. One good point, which would help to alleviate the clutter mentioned above, is a template or field in a template that takes just the webcitation.org hash and automatically expands it into the URL. This will make the refs quite a bit smaller in the edit window. - Peregrine Fisher (talk) (contribs) 20:13, 26 January 2009 (UTC)[reply]
- I will take a look and decide how to best implement the suggestions on that page. Thanks for the heads up. --ThaddeusB (talk) 20:30, 26 January 2009 (UTC)[reply]
Will this be available for use on a per-article basis on the ToolServer in the future? I'm thinking this could join the ranks of dabfinder.py, Checklinks, persondata.php, etc.? Thanks, §hepTalk 23:06, 31 March 2009 (UTC)[reply]
- I'm not sure it would be a good idea to sic a bot on archiving old links, considering those links could be rotten, and the whole point of the bot is to combat such rot. Old links really should be manually archived, and WebCite offers a great bookmarklet for such jobs. Now, a human-audited toolserver tool might be useful, so long as it wasn't abused by simple copy/pasting of links without currency verification. — Huntster (t • @ • c) 00:10, 1 April 2009 (UTC)[reply]
- Unfortunately, that's exactly what's happening. — Dispenser 15:49, 22 April 2009 (UTC)[reply]
Things that this 'bot will get wrong, as currently specified
This 'bot will, as currently specified above, get several things wrong:
- The pages shown by news aggregators aren't correct source citations in the first place, and they shouldn't be archived at all. Yet people do tend to add their URLs as purported sources.
- Search engine results aren't correct source citations in the first place, and they shouldn't be archived at all. These URLs come in two flavours:
- Hyperlinks to search engine results lists
- It's not widely appreciated anywhere near enough by Wikipedia editors that in fact the results that one sees on Google's various searches not only vary according to the time of the search (as the spider updates the database) but also vary according to who is performing the search. It's come up from time to time at AFD where people have erroneously hyperlinked to search engine results instead of to actual sources, that what one person sees in one country will be different to what another person sees in another country. Google Books results can be wildly different from country to country. But this phenomenon is not limited just to Books. It's not even limited to just Google.
- Hyperlinks to cached copies of pages
- The main culprit here that I've seen is the category of citations that include links to specific pages of books at Google Books. But cached copies of Google Web results are also mis-used as well. In terms of Google Books specifically, note that we should not be favouring one of our listed book sources over all of the others. Whilst the URL may be handy in spoon-feeding editors who don't put in any effort, in talk page discussions, it's not really appropriate for a citation. Readers have many book sources available, not just Google Books.
- The articles published by news services shouldn't be archived because doing so is often a violation of the terms and conditions of use of the service. Several services scroll their articles behind costwalls, for example. The New York Times' copyright licence prohibits archiving services from copying its articles, for example.
On the gripping hand, the thinking behind this 'bot isn't quite right anyway. The URL in a newspaper article citation is just a convenience feature. A proper newspaper article citation lists title, publication, dateline, and byline. That's enough to locate the article, in microfiche archives in libraries or in publishers' archives on WWW sites. The exact URL of an archive on a publisher's WWW site is a convenience only. If the link rots (because of the publisher switching to a different naming scheme) the correct approach is to fix the URL, not to get some third party to violate the publisher's copyright.
Also note, as an additional point, that the publisher= parameter to {{cite web}} is the name of the publisher. It's not the name of the web site. That belongs in the work= parameter. The name of the publisher is not going to be easy for a 'bot to obtain, because it's often only available by reading the copyright or publication information in the small print somewhere on the web site (not even necessarily on the page hyperlinked-to). Uncle G (talk) 15:40, 1 February 2009 (UTC)[reply]
- I agree that this isn't going to solve all the problems with external links, but surely it's going to be help for many citations? Copyright has got to be a major concern, though, I don't think anyone's denying that; I wonder what WebCite's policies on that are. - Jarry1250 (t, c) 15:56, 1 February 2009 (UTC)[reply]
- Part of the problem is that, in my own estimation anyway, it probably won't help with the majority of citations. I think that this is at least true for newspaper article citations. Most news services have terms and conditions of use and copyright licences that prohibit archiving by the likes of WebCite. Just look at how many times one's search results on Google Web that find WWW-published newspaper articles don't provide access to cached copies by Google. Indeed, observe that Google News doesn't provide cached copies at all.
WebCite doesn't appear to have thought about this at all. (Google has. Hence the limitations.) One of the most telling points is that the very examples given by User:Eysen at User:Eysen#What I am advocating/supporting in the Wikipedia context violate copyright. It's a third-party unauthorised copy of an article published by The Guardian, which is being held and republished by WebCite in direct violation of The Guardian's terms and conditions and copyright licence. (See section #3 in particular.)
Note, by the way, that stopping the 'bot from adjusting {{cite news}} doesn't avoid this problem. Oftentimes people mis-use {{cite web}} to cite newspaper articles. Uncle G (talk) 16:14, 1 February 2009 (UTC)[reply]
- I'm not qualified to answer on those charges; I shall avoid making a hash of it by passing, if that's okay. - Jarry1250 (t, c) 16:29, 1 February 2009 (UTC)[reply]
- Do note that their FAQ does address the copyright question, including how someone like The Guardian can prevent archival and/or request an existing archival removed. I see no particular reason to assume they have not sought appropriate legal advice before offering their service, and thus no particular reason for copyright paranoia on our part. If anything, a feed from WebCite the bot could watch in order to remove WebCite links that have been taken down would be all that is needed IMO.
I also fail to see how the bot would be "failing" by submitting a URL for something that a human editor added and no human editor saw fit to remove, whether it's a reliable source, a search engine results list, a Google Books convenience link, or whatever. If WebCite doesn't want to archive certain types of URLs for whatever reason, they can easily enough reject the bot's request; to date, they have not expressed any concern over the vast quantity of URLs this bot would submit and in fact have actively solicited this bot's creation. Anomie⚔ 17:15, 1 February 2009 (UTC)[reply]
- Uncle G's arguments feel like a case of throwing the baby out with the bathwater. While some bad links may get archived as well, this is no reason to scrap the archiving of any links in the first place. Over time, a blacklist of inappropriately cited pages can be built and such things as aggregators, Google results, etc. can be filtered out, perhaps even leaving a superscripted note on the article indicating it was inappropriate. And as has been said, publishers have the ability to limit inclusion of their pages in the archive; if they choose not to take that step, that is their problem. — Huntster (t • @ • c) 20:42, 1 February 2009 (UTC)[reply]
- With all due respect, this section should really be called "things human editors do wrong that this bot won't correct." If a bot had to make sure an article was perfect when it was done editing, no bot could ever edit. Simply adding an archived version to an existing link will do no harm - anyone is free to remove the whole link (including archival record) at any time.
- Secondly, it is not our job to ensure the legality of the archive. Having said that, it is highly likely it is legal because: 1) They are operating as a non-profit 'library' and thus are entitled to "copy" and archive things for educational reasons. 2) They allow any site that is interested in doing so to opt out of being archived. --ThaddeusB (talk) 02:05, 4 February 2009 (UTC)[reply]
- I fully agree. A bot is not capable of judgement or independent editing, that's what we have humans for. A bot is a 'force multiplier': it amplifies the effects of human actions. In the few cases where those human actions are incorrect, of course the bot will not be able to magically rectify those mistakes. To say that no good should be done because some of its actions will be at worst of no consequence (since any copyright issues will go to WebCite, not to us), is indeed throwing the baby out with the bathwater. Happy‑melon 08:35, 4 February 2009 (UTC)[reply]
- "since any copyright issues will go to WebCite, not to us" That sounds incredibly insensitive, since we are intending to use them. — neuro(talk) 10:35, 14 February 2009 (UTC)[reply]
- They should be safe, though. They have been doing it for a while, already. They're non-profit, and they'll remove any content requested. - Peregrine Fisher (talk) (contribs) 01:59, 15 February 2009 (UTC)[reply]
- "since any copyright issues will go to WebCite, not to us" That sounds incredibly insensitive, since we are intending to use them. — neuro(talk) 10:35, 14 February 2009 (UTC)[reply]
- I fully agree. A bot is not capable of judgement or independent editing, that's what we have humans for. A bot is a 'force multiplier': it amplifies the effects of human actions. In the few cases where those human actions are incorrect, of course the bot will not be able to magically rectify those mistakes. To say that no good should be done because some of its actions will be at worst of no consequence (since any copyright issues will go to WebCite, not to us), is indeed throwing the baby out with the bathwater. Happy‑melon 08:35, 4 February 2009 (UTC)[reply]
- Meta:Avoid copyright paranoia. It is up to WebCite to police itself regarding copyright law. If they are in violation of copyright, they'll be contacted soon enough, and they'll have to amend their policies. We should not preemptively decide their legal responsibilities for them. It is up to them, not us, to determine what pages they should or shouldn't archive.--Father Goose (talk) 17:13, 19 February 2009 (UTC)[reply]
- Incidentally, read their FAQ; among the things they say there are, "Another broader societal aspect of the WebCite initiative is advocacy and research in the area of copyright. We aim to develop a system which balances the legitimate rights of the copyright-holders (e.g. cited authors and publishers) against the "fair use" rights of society to archive and access important material. We also advocate and lobby for a non-restrictive interpretation of copyright which does not impede digital preservation of our cultural heritage, or free and open flow of ideas." By preempting the copyright issue on their behalf, we're circumventing one of the very purposes of their project.--Father Goose (talk) 17:29, 19 February 2009 (UTC)[reply]
- Yeah, we shouldn't stop because of copyright concerns. - Peregrine Fisher (talk) (contribs) 18:46, 19 February 2009 (UTC)[reply]
Checklinks
Although I haven't had much time, I am willing to contribute to this bot. The PDFbot and Checklinks source code would provide a very good launch point. I could also attempt to incorporate the WebCite feature so we can see how well such a tool would work, and have a tool that takes mass archives on demand. — Dispenser 01:58, 22 February 2009 (UTC)[reply]
- Sounds good. I think we have consensus, we just need some super coders to do their thing. - Peregrine Fisher (talk) (contribs) 02:35, 22 February 2009 (UTC)[reply]
- I don't see any objections above that could not be addressed with small changes to the code (or changes to editor behavior, which is not the bot's responsibility). If you're willing to start work on it, that would be great.--Father Goose (talk) 20:42, 22 February 2009 (UTC)[reply]
alphanum = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def webCiteShortId(t):
    """
    WebCite's short identifier is the time of the archive request, measured
    in microseconds since 1970, written as a base-62 number.
    """
    s = ""
    while t >= 1:
        s = alphanum[t % 62] + s
        t //= 62   # integer division; a plain "/" breaks under Python 3
    return s
Document how the short id is derived. — Dispenser 01:55, 21 April 2009 (UTC)[reply]
Status Updates
Coding... I've been very busy and haven't been on Wikipedia for a couple weeks. However, I should have some free time this weekend. I think it is pretty clear there is consensus for this task, so I will begin working on the code now. --ThaddeusB (talk) 03:11, 28 February 2009 (UTC)[reply]
- Update: Per request, here's a quick update as to this project's status... The coding is mostly ready and the bot has been monitoring new additions for more than a week now. I am waiting on a response from the WebCite people to clear up a few things on the technical side. (I was told to expect a reply early this week.) Once I get a reply, I'll finish up the code and be ready for testing. Code will be posted for review at that time. --ThaddeusB (talk) 03:22, 18 March 2009 (UTC)[reply]
- That's great news, keep up the good work! Skomorokh 03:23, 18 March 2009 (UTC)[reply]
- I'm kinda excited. This bot should be pretty cool. - Peregrine Fisher (talk) (contribs) 05:30, 18 March 2009 (UTC)[reply]
- Another quick update: I heard back from the WebCitation people yesterday and am updating the code to meet their technical specifications. A working version of the bot's code should be published within 2 days. Thanks everyone for your patience :) --ThaddeusB (talk) 18:33, 25 March 2009 (UTC)[reply]
- Great! Thanks for your work.--Father Goose (talk) 21:12, 25 March 2009 (UTC)[reply]
Code Comments
A working version of the code is now complete. It is missing a few minor features (more robust capture of metadata; auto-skipping blocked domains; auto-tagging dead links), but the current form will "get the job done."
Here is the code, minus the headers (where my passwords and such are). I am publishing it on my web page, as I do not wish to release it into public domain at this time. The code has been tested locally and all bugs I could find have been fixed. Comments are, of course, welcome. --ThaddeusB (talk) 19:47, 31 March 2009 (UTC)[reply]
Monitor link additions
Main program
- Are there any remaining objections? If not, I'd recommend a trial run. – Quadell (talk) 13:06, 2 April 2009 (UTC)[reply]
- Have you tested it against a) Wikipedia pages with UTF-8 in the titles, and b) web pages with Unicode titles? --Carnildo (talk) 20:10, 2 April 2009 (UTC)[reply]
- It should properly handle both cases... My db is set to utf8 where appropriate. I assume EnLinkWatcher properly reports utf8 Wikipedia page titles to the IRC channel, but if by some chance it doesn't report them, the pages will just be skipped.
- Perl's LWP automatically decodes web pages, so as long as the encoding & character set are standard (e.g. utf8) the page title will work correctly. Now, there are a few ambiguous character sets which may cause problems. As it so happens, this page was in my test set. The page uses the "x-sjis" character set, which has multiple definitions (maps) according to this document. The encoding scheme assumed by LWP renders the title as: (note: the following may or may not appear the same way in your browser as they do in mine :\ )
Billboard, CASHBOX & Record World ��1 ALBUMS(1973年)
- Which appears to use the Unicode consortium mapping. But it is clear the actual intended scheme is the Microsoft version, which renders as:
Billboard, CASHBOX & Record World №1 ALBUMS(1973年)
- Ironically, my Internet Explorer also gets it wrong, rendering it as (and telling me I need to install a Japanese font to load the page properly):
Billboard, CASHBOX & Record World ‡‚1 ALBUMS(1973”N)
- So basically, rare cases like these may create a title that isn't perfect in the wiki text. However, all bot generated titles are so marked and can easily be "fixed" by a human at a later date. --ThaddeusB (talk) 03:26, 3 April 2009 (UTC)[reply]
- Double checking would be good. For example, depending on the version of your LWP module you may need to explicitly convert the title from Perl's internal representation to UTF8 (newer versions handle that automatically). And your logging may or may not give you warnings about "Wide character in print". Anomie⚔ 12:03, 3 April 2009 (UTC)[reply]
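A generic illustration of the two usual remedies for that warning (not taken from the bot's code):
use strict;
use warnings;
use utf8;                         # this example source file is itself UTF-8
use Encode qw(encode_utf8);

my $title = 'Record World №1 ALBUMS (1973年)';   # example title with non-ASCII

# Either let the filehandle encode on output...
binmode STDOUT, ':encoding(UTF-8)';
print "$title\n";                 # no "Wide character in print" warning

# ...or encode explicitly to UTF-8 bytes before printing or storing:
my $bytes = encode_utf8($title);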
- One thing that immediately jumps out at me: Don't parse the API responses using regular expressions and string manipulation functions! Perl has modules for (correctly) parsing JSON, XML, YAML, and PHP's serialized format, use one of those instead. Anomie⚔ 22:17, 2 April 2009 (UTC)[reply]
- Well, I am more comfortable working with text parsing than (for example) json containers, which I've never used before. The two should be functionally equivalent since it is highly unlikely a string like "[code] => maxlag" would ever appear in an article's body. However, if you think it is important I can change it. --ThaddeusB (talk) 03:26, 3 April 2009 (UTC)[reply]
- Feel free to look at AnomieBOT's code for that, which (besides variable initialization) is in sub _query. It's basically:
use JSON; # using the JSON::XS module
my $json = JSON->new->utf8;
my $ret;
eval { $ret=$json->decode($res->decoded_content); };
if($@){
# Handle error: "JSON decoding failed: $@"
}
- $ret is then a hashref matching the structure of the response, which you can traverse in the normal way. For example,
@pages = values(%{$ret->{query}{pages}}); if(exists($pages[0]{missing})){ ... }
checks if the first page result has the "missing" flag set. The PHP and YAML modules probably work similarly; some of the many XML modules may also do so, while others may give you an XML DOM tree. Anomie⚔ 12:03, 3 April 2009 (UTC)[reply]
- Updated to use JSON containers now; Thanks for the help Anomie. The code has been tested with the new format & seems to work right. --ThaddeusB (talk) 20:34, 4 April 2009 (UTC)[reply]
- No problem. I've taken a more thorough look over the code and found a few more issues. I just like things to be "right", even if the problem is rather unlikely; only the "major" issues are ones that IMO really need fixing.
- Thanks for all the input. I too want to "get it right." I have updated the code to match most of your suggestions - details below. --ThaddeusB (talk) 04:10, 6 April 2009 (UTC)[reply]
- (Replies inline below) Anomie⚔ 12:05, 6 April 2009 (UTC)[reply]
- Major issues (IMO)
- It looks like you're trying to group consecutive entries for the same page in the results of your $read_new query. For that to work reliably, you need to add "ORDER BY Page" to that query.
- I think you may have missed the @ReadyLinks = sort {$a->[0] cmp $b->[0]} @ReadyLinks; line. I let perl do the sorting, which (surprisingly) is considerably faster than MySQL doing it. So unless there is a reason you believe sort will break, I'll leave that as is.
- Yeah, I missed that (bad me!). If you don't have an index on (Time,Page), you could try adding one and seeing if that speeds up MySQL's sort, but the Perl sort shouldn't break.
- When substituting a variable into a regular expression, you need to use quotemeta or the \Q and \E escapes. Otherwise, any URL with a "?" in it (among other things) will cause you trouble. I see you do this in a few places, but not all.
- Good catch. I realized this of course, but for some reason it didn't occur to me that URLs could break my regexes. All fixed now.
- You missed one on line 347.
- Got it now, thanks!
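A generic example of why the escaping matters - without \Q...\E, the "?" in the URL is treated as a regex metacharacter and the match silently fails:
use strict;
use warnings;

my $url  = 'http://example.com/page?id=1';
my $text = 'See http://example.com/page?id=1 for details.';

print "naive match\n"  if $text =~ /$url/;       # does not match
print "quoted match\n" if $text =~ /\Q$url\E/;   # matches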
- Instead of messing around trying to split the URI into pieces, why not just use perl's URI module? Just as with using JSON to parse the query result, it's less likely to break in odd cases using the existing module.
- I didn't actually know about those URI functions. I'd only ever used it to construct (form data) URLs. Code is now changed.
- For proper edit conflict detection (as unlikely as that may be in this case), you should be including the "starttimestamp" parameter in your action=edit API call (you get the value from the same intoken=edit query as the edit token itself).
- I thought using "basetimestamp" was sufficient. Do you have to have both basetimestamp & starttimestamp? The API documentation just says both parameters are used to detect edit conflicts, which I assumed meant either/or. In any case, the code now uses both stamps.
- "basetimestamp" specifies the timestamp of the last revision at the time you loaded the page, and "starttimestamp" specifies the time you loaded the page. If "basetimestamp" is omitted, the API edit code assumes the most recent revision (even if it's not actually the same one), which means edit conflicts won't be detected. "starttimestamp" is used to detect whether the page was deleted since you started editing the page; if it is omitted, it uses "basetimestamp" which leads to the problem described in T17647.
- Besides the standard $data->{error}{code}, an action=edit API call has failed unless lc($data->{edit}{result}) eq 'success'. At the moment, I know that Extension:SpamBlacklist, Extension:AssertEdit, and Extension:ConfirmEdit indicate failure in this way; more extensions may in the future, as it allows the extension to give the API client more information than just "Oops, it failed".
- I had forgotten about that. Code fixed.
- You must have done that after your most recent upload; I don't see it at [3].
- My bad. Here's the relevant snippet:
if (lc($data->{edit}{result}) ne 'success') {
    printL "FAILED - " . $data->{edit}{result} . ": \n" . $res->decoded_content . "\n\n";
    die;
}
- I notice that temporary network failures could lead to $cited_links being empty after the call to submit_links(), which would result in the links never being archived as on failure there the newlinks entry is deleted. Also, BTW, is the newlinks entry ever deleted after a successful archival and page write?
- Good point. I changed it so it only deletes a page if all links were 404. (This should be expanded to also match some other response codes probably.) Currently, the code is a bit of a hack, but it was the first solution that occurred to me.
- Off the top of my head: 404 and 410 are probably permanent failures, and 401 may as well be treated as such. 403 could mean your bot was blocked somehow (probably by a user agent sniffer, you can minimize this by setting your own useragent (possibly spoofing IE or Firefox)) or it could mean a permanent failure. Most other 4xx codes probably need you to look at the situation and find out what went wrong, as do any 3xx or 1xx you actually see. BTW, $res->code will give you just the status code of the response if that is more convenient.
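- A sketch covering both points, the custom user agent and the status-code handling (the agent string shown is only an example):
use LWP::UserAgent;

my $ua = LWP::UserAgent->new(
    agent   => 'WebCiteBOT/0.1 (http://en.wikipedia.org/wiki/User:WebCiteBOT)',
    timeout => 30,
);

my $res  = $ua->get($url);   # $url is the link being checked
my $code = $res->code;       # just the numeric status, e.g. 404

if ($code == 404 || $code == 410) {
    # permanently gone - safe to drop the newlinks entry
} elsif ($code == 401 || $code == 403) {
    # possibly dead, or the bot may have been blocked by a user-agent sniffer
} elsif ($code >= 500) {
    # server-side trouble - keep the entry and retry later
}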
- Minor issues
- Why do you concatenate $addTime and $adder before inserting into @ReadyLinks, only to split them back apart on extraction? For that matter, fetchall_arrayref might be an easier way to populate @ReadyLinks.
- No good reason - In an earlier version of the code I only used one of the variables & when I changed the code I did it that way rather than adding another dimension to the array. In any case, I changed it to use the fetchall_arrayref as that is more efficient, as you suggested.
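- For reference, a minimal fetchall_arrayref sketch ($dbh, $cutoff and the table/column names are assumptions based on the discussion above):
my $sth = $dbh->prepare(
    'SELECT Page, URL, Time, Adder FROM newlinks WHERE Time < ?'
);
$sth->execute($cutoff);
my @ReadyLinks = @{ $sth->fetchall_arrayref };   # each element is an arrayref (one row)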
- Ditto for $wiki and $page in @ArchiveLinks, although there it makes slightly more sense as that's your database representation of a page title.
- I think I'll leave it as is, since that is the way the db is structured.
- I notice you like to construct and split strings instead of just using arrayrefs, e.g. for $URLs and $cited_links; is there any particular reason for that? It seems more error prone, as if your pseudo-array accidentally gets an entry containing your separator character you're in trouble.
- Most of the time, I would probably have used an array to do this sort of thing. Since the data comes from the db in "http://link1 http://link2 ..." format, I just chose to reuse that format. Since Wikipedia doesn't allow spaces in URLs it can't break, so I'll just leave it as is.
- Something similar goes for your citation metadata stored in your database. Best would be to use separate columns for each item of metadata. If possible schema changes are too difficult to worry about, I'd probably store a JSON-encoded hash, but as long as you make sure to encode "|" (as you already do for Title and Author) this would be a place that a split would be workable. As it is now, in the unlikely event that your title contains "Author=" you'd have a problem.
- The reason I did it that way is because the metadata I store is likely to change/be expanded in the future. There is no reason why I couldn't just add more columns at that time, I suppose. For now, I just fixed the regex to match "|Author=" instead of "Author=" (and so on) to prevent any possible strangeness.
- That seems good, although technically (?:^|\|) would be better in case you ever change it to put Author at the start of the metadata string. If you want to split the metadata as stored into an actual hash, %metadata = map { split(/=/, $_, 2) } split(/\|/, $metadata); should do it.
- Changed to a hash, per suggestion.
- Your construction of the filename for logging the page contents could lead to name conflicts; better would be to encode the problem characters somehow instead of stripping them.
- Yah I know, but it is just for testing purposes anyway so I won't worry about it.
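- For what it's worth, the encode-instead-of-strip idea only takes a line or two (purely illustrative):
sub log_filename {
    my $title = shift;
    # Percent-encode anything outside a safe set so distinct titles
    # can never collapse into the same file name.
    (my $safe = $title) =~ s{([^A-Za-z0-9_.-])}{sprintf '%%%02X', ord $1}ge;
    return "logs/$safe.txt";
}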
- Really minor issues
- If you want, POSIX::strftime('%F.log.txt', gmtime) is a more straightforward way to generate your log file name (or "%Y-%m-%d" if for some reason your strftime doesn't have "%F").
- Nice trick (although %F didn't work for some reason). So changed.
- Your C library's strftime must not support C99, I guess.
- Rather than ($wiki, @page) = split(/:/, $page); $page = join(":", @page); simply do ($wiki, $page) = split(/:/, $page, 2); instead.
- Forgot about the limit parameter of split. So changed.
- When a maxlag error is returned, the HTTP headers include a "Retry-After" header, if you want to use it.
- I won't worry about it.
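- For completeness, honouring the header would only take a couple of lines (the 5-second fallback is arbitrary):
my $wait = $res->header('Retry-After');
sleep( (defined $wait && $wait =~ /^\d+$/) ? $wait : 5 );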
- Why sleep an hour between submitting the request to WebCite and updating Wikipedia? Is that to give WebCite time to do the archiving on their end?
- You are correct - their archiving isn't instantaneous (although it is usually within minutes in my experience) and I was advised by the WebCitation people to wait an hour between archiving & retrieval.
- Anomie⚔ 04:21, 5 April 2009 (UTC)[reply]
- Let me know if there is anything else I should change. --ThaddeusB (talk) 04:10, 6 April 2009 (UTC)[reply]
- Just the one missed regex missing \Q\E needs changing; I'll wait a little while in case anyone else wants to comment, and then give you a trial. Anomie⚔ 12:05, 6 April 2009 (UTC)[reply]
- No problem. I'll be going out of town for a week on Wednesday afternoon, so I'll probably have to wait until I get back to run the trial. --ThaddeusB (talk) 16:00, 6 April 2009 (UTC)[reply]
Trial
I'm back in town and can start a trial at any time. --ThaddeusB (talk) 02:23, 17 April 2009 (UTC)[reply]
- Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Anomie⚔ 02:24, 17 April 2009 (UTC)[reply]
- Trial complete. ±50 edits made. A full log of the bot's activity (including non-edits) can be found under User:WebCiteBOT/Logs/.
There were some minor issues (all fixed now) as follows:
- The first edit used HTML markup "<a href=" instead of wiki formatting for links. I probably looked at the code a dozen times, but didn't catch this error until seeing the results of the first edit.
- This edit archived links of a page nominated under WP:AfD. A feature has been added to skip pages tagged AfD/prod/speedy/db. The article (Lil) Green Patch has been correctly skipped several times, for example in this log.
- This edit revealed the need for the bot to add a reference list when one doesn't already exist. Correctly handled here.
- This edit revealed a bug related to articles with references in <ref name="name"/> format.
Additionally, the following improvements were made:
- Links in the format [http://webpage.com A Website] are now handled better, such that "A Website" is used as the title of the cite, as seen here.
- New links to webcitation.org and archive.org are no longer stored.
Finally, edits such as this one where the bot can't get a title from the remote page are slightly problematic. I can't leave the field blank, as that breaks the template, so I used "?" as the title. I am open to alternative suggestions. --ThaddeusB (talk) 18:53, 20 April 2009 (UTC)[reply]
- (I'd recommend "No title".) – Quadell (talk) 19:41, 20 April 2009 (UTC)[reply]
- Or "Title unkown". - Peregrine Fisher (talk) (contribs) 19:46, 20 April 2009 (UTC)[reply]
- "Title unknown" would be best, since it isn't know whether or not the article has a title...just that the bot cannot, for whatever reason, retrieve it. — Huntster (t • @ • c) 01:28, 21 April 2009 (UTC)[reply]
- Or "Title unkown". - Peregrine Fisher (talk) (contribs) 19:46, 20 April 2009 (UTC)[reply]
- Changed to "Title unknown" --ThaddeusB (talk) 02:33, 21 April 2009 (UTC)[reply]
- The edits look good to me, but I do have a few comments:
- Did anything ever come of the "archive=no" suggestion above? It might be worth having the bot add it even if it's not supported at this time, so it will be present if the parameter is added to the template later.
- I posted a comment at Template_talk:Citation/core (which is used by most of the citation templates), but no one replied. Parties interested in changing the ordering should probably take comments there first, and then to the individual templates. In the meantime, I'd be happy to add a dummy parameter. I'd suggest either 'dead=no' or 'deadurl=no' to reflect the reason for the changing standard.
- [4] In the second ref, the title contained a literal "|", which needs to be escaped (either as &#124;, {{!}}, or inside <nowiki>) when inserted into the template.
- Good catch - I'll make the change now.
- This was a temporary bug caused by a missing '?' after I changed the regex to better handle links in the [http://webpage.com A Website] format.
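- Concretely, the escaping fix for the "|" case amounts to something like this (escape_template_value is a hypothetical helper; {{!}} renders as a literal pipe inside the template):
sub escape_template_value {
    my $value = shift;
    $value =~ s/\|/{{!}}/g;   # a bare "|" would otherwise start a new template parameter
    return $value;
}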
- You'll eventually need to figure out some way to match the date format in the rest of the reference for edits like [7], but I'd wait as there still might be a miracle where the "plain text dates only" crusaders allow us a better way to determine the appropriate format than "scan the text and guess".
- I agree. It's a shame people are opposed to auto-formatting.
- Anomie⚔ 01:43, 21 April 2009 (UTC)[reply]
- --ThaddeusB (talk) 02:33, 21 April 2009 (UTC)[reply]
- "deadurl=no" sounds better to me, only because it'll end up in every "cite xxx" if it ends up supported at all (were it only cite web, I'd go for "dead=no"). I intend to approve the bot, but I'll wait a short time in case there are any last-minute objections. Anomie⚔ 02:43, 21 April 2009 (UTC)[reply]
- Alleluia! Can't come a moment too soon IMO. Happy‑melon 10:03, 21 April 2009 (UTC)[reply]
- Seconded. – Quadell (talk) 13:48, 21 April 2009 (UTC)[reply]
- In the log of 20 April it says "link removed" for My Life at First Try. The links are live and should be archived. They're under the Reference section. It seems the bot only looks for references in the ref format; some pages have general references that aren't in that format. External links that aren't the official websites should also be archived, because they often contain general information that is used as a source. I suggest doing another 50-edit trial before approving the bot for mass link archiving. This would show the new features the bot got and the many bug fixes. A dead link tag should definitely be applied when a link is dead; this could help in spotting dead links for replacement.--Diaa abdelmoneim (talk) 11:46, 21 April 2009 (UTC)[reply]
- There are no links on the page [["my life at first try"]] which is the one in the log. Instead, the page is now redirected to [[My Life at First Try]]. (Confusing, I know). When the bot gets to that 2nd page it will archive the links in question, as it does handle general references accurately as long as they are in the "Reference(s)" section or "Notes". For example, see this edit.
- As to links in the External Links section, the idea of the bot is to save citations from rotting, not necessarily to save links in general. Archiving external links was not in the scope of the original bot request. --ThaddeusB (talk) 12:16, 21 April 2009 (UTC)[reply]
- Would the bot go through all articles on Wikipedia and archive all links or would it only look into new additions? Because as I currently understand it, the bot would "automatically submit URLs recently added to Wikipedia to WebCite".
- Currently, it only handles new links. The reason I said "When the bot gets to that 2nd page..." is because the move from "my life at first try" to My Life at First Try caused the links to be new additions at the new page. --ThaddeusB (talk) 13:40, 21 April 2009 (UTC)[reply]
Just wanted to say great job and thanks for all the hard work you guys put into this bot. I know it will be a real asset to Wikipedia and I appreciate what you guys have done here. Congratulations. :) -- OlEnglish (Talk) 20:31, 21 April 2009 (UTC)[reply]
- Should be escaping ], double and triple ' (wiki italic/bold markup), and }}. The meta author tag is almost never reliable in my experience and typically reflects the author of the layout, not the content.
- You are correct. In particular "]" is bound to happen from time to time. All codes now escaped.
- I do give preference to the "content_author" tag over the "author" tag.
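- For illustration, one way the head metadata could be pulled out with HTML::HeadParser rather than ad-hoc regexes (the exact header-name mapping for a "content_author" meta tag is an assumption):
use HTML::HeadParser;

my $parser = HTML::HeadParser->new;
$parser->parse($res->decoded_content);

my $title  = $parser->header('Title');                  # from <title>...</title>
my $author = $parser->header('X-Meta-Content-Author')   # <meta name="content_author" ...>
          || $parser->header('X-Meta-Author');          # <meta name="author" ...>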
- Would be nicer if the regex matched the surrounding template style.
- IMO, completely trivial - I am not going to worry about putting each parameter on a newline just b/c someone else did.
- archivedate= is supposed to be the date it was archived by WebCite, not the current date. accessdate= should not be filled in if the link has gone dead (since it indicates the date when the page was verified to contain the information).
- In the bot's normal course of action the archive date and the current date will normally be identical. Non-write tests and such caused an occasional disparity in the trial.
- The bot does not attempt to find archives for dead links, so obviously it won't be filling in accessdate for these either :) (If I add a feature addressing old links at a later date, it will be a separate BRFA.)
- On the bot generating link titles: this is considerably hard.
- Add "Site Maintenance in progress" to the blacklist and avoid archiving anything less than 2 KB (unless it contains frames).
- I can (and have) added a title blacklist, but see no reason to skip short pages.
- Suggest the you use <!-- BOT GENERATED TITLE --> instead of <!-- BOT GENERATED PAGE TITLE --> so we can keep the regexes simple.
- OK
- Possibly look at the DumZiBot source code for its junk leading & trailing character reduction and its blacklist.
- A diff which appears to involve a 404 page and character encoding issues.
- I don't see any 404 there (and the bot should not be touching any 404s at this time). The character encoding discrepancy is due to that page using an ambiguous non-standard character set as outlined above.
— Dispenser 15:49, 22 April 2009 (UTC)[reply]
Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Once you're ready, have another trial. – Quadell (talk) 18:51, 24 April 2009 (UTC)[reply]
Trial complete. No major problems. I did make some minor improvements to formatting and such, but nothing I would really call a bug. Well, maybe - a couple of improvements to the regexes were made to catch the occasional oddly formatted ref the bot was missing. Logs can be seen at User:WebCiteBOT/Logs/.
However, a fairly substantial error was discovered on WebCite's end. It seems they don't properly archive pages from FT.com (and presumably other sites with similar JavaScript). Instead of the page requested, it archives a piece of JavaScript. See: Wikipedia edit & the bad archive. I have notified them of the bug & they said they would look into it. Hopefully it gets fixed, but in the meantime I have added a check function to make sure the page archived matches the page requested before the bot makes an edit.
I also added some code to use Perl's PDF::API2 module to read title & author info from PDF files. --ThaddeusB (talk) 15:16, 28 April 2009 (UTC)[reply]
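Roughly speaking, the PDF lookup amounts to something like this (the fallback title and the $pdf_path variable are assumptions):
use PDF::API2;

my $pdf  = PDF::API2->open($pdf_path);
my %info = $pdf->info();   # document information dictionary

my $title  = $info{Title}  || 'Title unknown';
my $author = $info{Author} || '';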
Approved. Looks good! Anomie⚔ 01:12, 29 April 2009 (UTC)[reply]
- The above discussion is preserved as an archive of the debate. Please do not modify it. Subsequent comments should be made in a new section.