User talk:GreenC bot
Flagging non-dead link as dead
This edit flagged this URL as dead even though it isn't. Jo-Jo Eumerus (talk) 11:17, 18 July 2022 (UTC)
- Same with these edits:
- I appreciate that this probably has to do with some kind of automatic PDF link serving in JavaScript that Academia.edu uses, which wouldn't be readily captured by a bot. I don't know how fixable it is, but the links noted are not dead at all; I reverted both edits in which the bot flagged them. Ifly6 (talk) 14:35, 18 July 2022 (UTC)
- The url that Editor Jo-Jo Eumerus linked:
- Both of the urls that Editor Ifly6 links:
- There was some discussion about these kinds of academia links at Wikipedia:Link rot/URL change requests § www.academia.edu/download/
- -- Trappist the monk (talk) 14:43, 18 July 2022 (UTC); 14:46, 18 July 2022 (UTC)
- Jo-Jo Eumerus & User:Ifly6, they are dead for me (USA). Example. Are you getting a redirect to a cloudfront URL? I'm wondering if there is some kind of location-aware policy that determines when to serve the cloudfront URL vs a 404. If the cloudfront URL were known, it would be possible to save it at the Wayback Machine, then use the Cloudfront-Wayback URL on Wikipedia, treated as a dead link (due to its &Expires self-destruct mechanism; see WP:AWSURL). However, I wonder about copyright if academia.edu is making them unavailable in the US and possibly elsewhere: why have that policy if not for a rights issue? -- GreenC 15:04, 18 July 2022 (UTC)
- I'm in the US and am getting the links promptly. The links I am getting are Cloudfront ones with an expiry; I used the Academia.edu links to avoid the known expiry. Ifly6 (talk) 15:41, 18 July 2022 (UTC)
- Ah, I see you use British English, so I assumed you are not in the US. What browser do you use? Do you have any plugins that might affect JavaScript? This is impacting archive providers as well, such as Wayback Machine and Ghostarchive (US-based); they also get a 404. At Archive.today it "works" (global IP pool), but they are unable to save the PDF correctly. -- GreenC 16:00, 18 July 2022 (UTC)
- I do get a "d1wqtxts1xzle7.cloudfront.net" sort of thing. Jo-Jo Eumerus (talk) 17:33, 18 July 2022 (UTC)
- Language heuristics are always right 99pc of the time haha. I've confirmed on Edge (Windows 10) and Safari (macOS) that the Academia.edu links work. I don't have any plugins installed other than ad blockers that would affect something like this. The specific link that got generated for me with Rafferty was https://d1wqtxts1xzle7.cloudfront.net/51344857/Iris-_Fall_of_the_Roman_Republic-with-cover-page-v2.pdf. There was then a pile of GET parameters that I've cut out (they change every time anyway) but which are necessary to get the file served properly. Ifly6 (talk) 19:24, 18 July 2022 (UTC)
- Jo-Jo Eumerus do you use Edge or Safari? -- GreenC 19:38, 18 July 2022 (UTC)
- Wikipedia:Village_pump_(technical)#academia.edu/download .. seeing if anything comes up here. -- GreenC 19:52, 18 July 2022 (UTC)
- Ifly6, in the above thread someone suggested that perhaps you had signed up for an account on academia.edu at some point? Or there may be some old cookies that are giving permission. One way to test is to try to access it from a private window. -- GreenC 20:46, 18 July 2022 (UTC)
- Yea, that's probably it. I opened it in a private window and got the 404. Ifly6 (talk) 20:57, 18 July 2022 (UTC)
- Same for me (Firefox) Jo-Jo Eumerus (talk) 21:12, 18 July 2022 (UTC)
- Cool, glad it is figured out what is causing it. My thinking is to replace the academia.edu links with a Wayback version of the cloudfront URL so it's accessible for everyone. A second option is to use |url-access=registration, but that 404 page is confusing and will result in bots marking it dead. -- GreenC 21:30, 18 July 2022 (UTC)
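The behaviour worked out above could be expressed as a small rule in a link checker. The following is only a sketch and not GreenC bot's actual code: the function name `classify_link`, the three status labels, and the example URLs are all hypothetical. The one fact taken from this thread is that academia.edu/download URLs return a 404 to readers who are not logged in.

```python
from urllib.parse import urlparse


def classify_link(url: str, status_code: int) -> str:
    """Return 'live', 'registration', or 'dead' (hypothetical labels).

    A 404 from an academia.edu /download/ URL is treated as a
    registration wall rather than a dead link, per the thread above.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    is_academia_download = (
        (host == "academia.edu" or host.endswith(".academia.edu"))
        and parsed.path.startswith("/download/")
    )
    if 200 <= status_code < 400:
        return "live"
    if status_code == 404 and is_academia_download:
        return "registration"
    return "dead"


# Hypothetical URLs, for illustration only.
print(classify_link("https://www.academia.edu/download/12345/paper.pdf", 404))
print(classify_link("https://example.org/paper.pdf", 404))
```

A checker using a rule like this would tag the citation with |url-access=registration instead of {{dead link}}, although, as noted above, the 404 page would still confuse human readers.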
User:Jo-Jo Eumerus|User:Ifly6|User:Biogeographist: Would like to propose this solution: Special:Diff/1098978075/1099315632. It's only for academia.edu/download links, which are about 1,000 on enwiki.
- academia.edu returns a 404 when a user is not registered and logged in, which is most users. It does not say "log in to access paper"; instead it shows a misleading 404 dead-link page. This causes problems:
  - Archive bots will determine the links are dead (404) and mark them with {{dead link}}.
  - Users will be confused, thinking the link is dead rather than behind a registration wall.
  - Should the link ever actually die for real, there would be no archive available, since the Wayback Machine sees only a dead 404 page: the Wayback Machine is not an academia.edu registered user.
- While it is possible to use |url-access=registration, this does not solve the misleading 404 problems.
- The cloudfront link is an AWS container with an &Expires self-destruct mechanism. It's where the paper is actually located (not on academia.edu, which redirects to cloudfront).
- The proposal is to determine the active cloudfront link via bot magic, immediately create a Wayback Machine save of the cloudfront URL, and change the citation to the Wayback-cloudfront link, e.g. Special:Diff/1098978075/1099315632
This is what I can do somewhat easily right away. There are limits to what can be done, due to bot design and coding effort. -- GreenC 04:15, 20 July 2022 (UTC)
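For readers unfamiliar with what a "Wayback-cloudfront link" looks like, here is a minimal sketch of how such a URL is composed once a capture exists. It assumes the usual Wayback Machine form `https://web.archive.org/web/<14-digit UTC timestamp>/<original URL>`; the function name `wayback_url` and the sample cloudfront URL with its `Expires` value are made up for illustration, and the sketch does not perform the save itself.

```python
from datetime import datetime, timezone

WAYBACK_PREFIX = "https://web.archive.org/web/"


def wayback_url(original_url: str, captured_at: datetime) -> str:
    """Compose a Wayback Machine URL for a capture taken at `captured_at`.

    `captured_at` must be timezone-aware; it is converted to UTC and
    rendered as the 14-digit YYYYMMDDhhmmss stamp used in archive URLs.
    """
    stamp = captured_at.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}{stamp}/{original_url}"


# Hypothetical cloudfront URL; real ones carry more signed parameters.
cloudfront = "https://d1wqtxts1xzle7.cloudfront.net/51344857/example.pdf?Expires=1658300000"
print(wayback_url(cloudfront, datetime(2022, 7, 20, 4, 15, 0, tzinfo=timezone.utc)))
```

The whole cloudfront URL, query string included, is kept after the timestamp, which is why deleting the &Expires part from the archived link would point at a different (uncaptured) address.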
- Hmm. It seems a bit complex and I wonder if people will be deleting the "expires" part of the link. Jo-Jo Eumerus (talk) 10:22, 20 July 2022 (UTC)
- It's a complex situation. If they delete the &Expires the URL will break (404). It will break anyway, due to the Expires, that is why the archive URL version is made the primary. The archive URL is accessible to everyone - academia.edu account not required. -- GreenC 15:30, 20 July 2022 (UTC)
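The &Expires behaviour discussed here can be checked mechanically. The sketch below assumes the `Expires` query parameter holds a Unix timestamp in seconds, as it does in CloudFront signed URLs; the helper names `expiry_of` and `is_expired` and the sample URL (including its `Signature` value) are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, urlparse


def expiry_of(url: str) -> Optional[datetime]:
    """Return the UTC expiry encoded in the URL's Expires parameter, if any."""
    values = parse_qs(urlparse(url).query).get("Expires")
    if not values:
        return None
    # Assumption: Expires is epoch seconds.
    return datetime.fromtimestamp(int(values[0]), tz=timezone.utc)


def is_expired(url: str, now: datetime) -> bool:
    """True when the URL carries an Expires value that `now` has reached."""
    expiry = expiry_of(url)
    return expiry is not None and now >= expiry


# Hypothetical signed URL, for illustration only.
signed = "https://d1wqtxts1xzle7.cloudfront.net/51344857/example.pdf?Expires=1658300000&Signature=abc"
print(expiry_of(signed))
print(is_expired(signed, datetime(2022, 7, 21, tzinfo=timezone.utc)))
```

A URL without the parameter yields `None`, which matches the point made above: stripping &Expires does not make the link permanent, it just makes it invalid.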
Unfortunately there is something preventing cloudfront pages from being saved at Wayback. Not all pages, but most. So we have a bad situation with academia.edu/download links: ideally they should be converted to non-/download/ links, but that can't be done by bot because it requires manual searching. The /download/ links probably originate from Google Scholar, via copy-pasting. -- GreenC 15:56, 23 July 2022 (UTC)
Backlinks report
User:Certes/Backlinks/Report seems to have stopped, but User:GoingBatty/Backlinks/Report is running normally. I've not added any new backlinks recently. Can you see anything else that I may have broken? Certes (talk) 11:17, 25 July 2022 (UTC)
- It aborted for unknown reasons. I increased the memory allocation by 10x in case that is the problem. The data may be messed up from the abort. I've restarted the process and will see over the next hour or so whether it can recover. Worst case, I will just delete all the data and it will rebuild from scratch, but that will result in a missed day. -- GreenC 15:34, 25 July 2022 (UTC)
- Thanks. Let me know if I'm checking too many targets or if some produce exceptionally big reports, and I'll remove the less productive ones. Certes (talk) 15:45, 25 July 2022 (UTC)
- It was crashing at "m", then after increasing memory it made it to "v". Odd, because it should not run out of memory, and there are no error messages from the system or the program to suggest why it's silently halting, so it might be something different. I added debug statements; it takes a while to replicate, an hour or more. Thanks for holding. -- GreenC 04:26, 26 July 2022 (UTC)
- Odd: "m" and "v" are early in my list, and neither they nor anything earlier have many incoming links. If it's taking an hour then we may need to remove the entries with lowest benefit per second. A few entries have never triggered a fix and could probably be removed, but I've already removed the resource-heavy ones. Maybe I need to rate them all by fixes done per 1000 incoming links or similar and chop those scoring lowest. "v" is an oddity because it can indicate that the editor failed to press Ctrl when pasting: easy to spot, but hard to fix as you need to guess what was in their clipboard. Certes (talk) 12:39, 26 July 2022 (UTC)
- The memory problem appears to be cumulative. If I run m or v in isolation they do fine, but when running the whole bunch there is a massive spike in memory claim that occurs at the same spot, around v or x; the others also don't release their claims, so it builds up. It could be related to the Sun Grid Engine caching for performance reasons. I've checked the program for errant global vars and it's fine; there is nothing holding onto data. I might try separating the backlinks retrieval portion into a different program so it exits between each item, clearing any memory claims. -- GreenC 16:48, 26 July 2022 (UTC)
- I think it is fixed. It was a combination of repetitive backlinks reported by the API and inefficiencies in the program magnifying those repetitions. It should never use more than about 25MB of RAM, but with "V" (and "v") it was as high as 1 gigabyte. Why V? I suspect it's due to WP:V, which is so commonly linked outside mainspace. V exposed the problem, but it was occurring at a smaller scale with everything else. (The API typically and erroneously reports 100s of the same backlink; I don't know why it's always done this.) "V" had 2.5 million non-unique occurrences. Add to this that the program was inefficient in how it dealt with the repetitions: it added up, and the Grid Engine said nope and dropped the job. Right now it's starting over rebuilding the database; it should be back to normal soon. -- GreenC 05:44, 27 July 2022 (UTC)
- Thanks very much. The current version looks right, considering that it's for a few hours rather than the usual 24. Is it possible to add the namespace of the link target to the query? I'm not sure how you're extracting the data but, for example, Quarry would run its SQL much faster with "and pl_namespace=0". Certes (talk) 11:21, 27 July 2022 (UTC)
- API:Backlinks. When I first made this program (not your fork of it) around April 2015, Quarry was only about 6 months old, I think; anyway, I wasn't aware of it, and I wanted something that would run from anywhere, which left the API. Speed is not an issue when running daily, unless it takes > 24hrs. Your job completes in about 2 hours; it is exceptionally big. The API behavior of multiple results is weird but can be adjusted for. If it continues to be a problem I can look into Quarry; getting a JSON file would be nice. -- GreenC 15:41, 27 July 2022 (UTC)
- In that case, blnamespace is what I meant, but I'm not clear what it should be set to: the several namespaces in which relevant links appear, or ns 0 to which relevant links lead. If my job is taking two hours then I should be checking fewer targets; any clues as to which entries take the most time would help with that. Certes (talk) 18:27, 27 July 2022 (UTC)
- Below is an 'ls' of the data files. The timestamps show how long each took to complete. The file size is misleading, as the program filters out namespaces. For example, "V" (and "v"; they are identical to the API) is not a very large file, but took almost 25 minutes to complete. The whole run took about 85m to finish, not 120m, my mistake. V/v is about 50 minutes. U/u 20 minutes. N/n 10 minutes. Those are the big three and use 95% of the time (is that right?). Probably due to WP:V, WP:U and WP:N. -- GreenC 19:28, 27 July 2022 (UTC)
- Thanks. I'll take V/v, U/u and N/n out then. U and N rarely get a hit. V gets more but I'm less confident about fixing them as most of them require me to guess what article the editor was thinking of. Certes (talk) 20:57, 27 July 2022 (UTC)
- All working as normal today, and an hour faster than previously. Thanks again for your help. Certes (talk) 10:03, 28 July 2022 (UTC)
- Yes, finished in 25 minutes. No single one took very long (or much memory!). You are welcome and thanks for reporting it because it uncovered a problem in the program that only became evident at scale. -- GreenC 15:52, 28 July 2022 (UTC)
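The fix described in this thread (coping with the API reporting the same backlink many times) can be illustrated with a streaming de-duplication step. This is a generic sketch and not the program's actual code: the function name `unique_backlinks` is hypothetical, and the input is assumed to be any iterable of page titles as they arrive from the API.

```python
from typing import Iterable, Iterator


def unique_backlinks(titles: Iterable[str]) -> Iterator[str]:
    """Yield each title once, in first-seen order.

    Memory grows with the number of distinct titles, not with the number
    of repeated occurrences reported upstream.
    """
    seen = set()
    for title in titles:
        if title in seen:
            continue  # repeated report of the same backlink
        seen.add(title)
        yield title


# Made-up sample: the same pages reported several times.
reported = ["Talk:Foo", "Bar", "Talk:Foo", "Bar", "Baz"]
print(list(unique_backlinks(reported)))
```

Because duplicates are dropped as they stream past, a target with millions of non-unique occurrences costs only as much memory as its set of distinct linking pages.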
Extended content |
---|
22930 Jul 27 09:11 0.new 127027 Jul 27 09:11 1.new 16924 Jul 27 09:11 2.new 15575 Jul 27 09:11 3.new 15540 Jul 27 09:11 4.new 14709 Jul 27 09:12 5.new 12741 Jul 27 09:12 6.new 17054 Jul 27 09:12 7.new 15220 Jul 27 09:12 8.new 14745 Jul 27 09:12 9.new 7476 Jul 27 09:13 10.new 6315 Jul 27 09:13 100.new 15741 Jul 27 09:13 A.new 13776 Jul 27 09:13 B.new 16104 Jul 27 09:13 C.new 13410 Jul 27 09:13 D.new 13301 Jul 27 09:14 E.new 12605 Jul 27 09:14 F.new 13550 Jul 27 09:14 G.new 13518 Jul 27 09:14 H.new 14387 Jul 27 09:14 I.new 13005 Jul 27 09:14 J.new 12845 Jul 27 09:14 K.new 14099 Jul 27 09:14 L.new 13174 Jul 27 09:14 M.new 39805 Jul 27 09:18 N.new 13668 Jul 27 09:19 O.new 13088 Jul 27 09:19 P.new 11858 Jul 27 09:19 Q.new 14160 Jul 27 09:19 R.new 14529 Jul 27 09:19 S.new 13146 Jul 27 09:19 T.new 15718 Jul 27 09:21 U.new 96856 Jul 27 09:45 V.new 12403 Jul 27 09:45 W.new 12797 Jul 27 09:45 X.new 13659 Jul 27 09:45 Y.new 13403 Jul 27 09:45 Z.new 15741 Jul 27 09:45 a.new 13776 Jul 27 09:45 b.new 16104 Jul 27 09:45 c.new 13410 Jul 27 09:46 d.new 13301 Jul 27 09:46 e.new 12605 Jul 27 09:46 f.new 13550 Jul 27 09:46 g.new 13518 Jul 27 09:46 h.new 14387 Jul 27 09:46 i.new 13005 Jul 27 09:46 j.new 12845 Jul 27 09:46 k.new 14099 Jul 27 09:46 l.new 13174 Jul 27 09:46 m.new 39805 Jul 27 09:51 n.new 13668 Jul 27 09:51 o.new 13088 Jul 27 09:51 p.new 11858 Jul 27 09:51 q.new 14160 Jul 27 09:51 r.new 14529 Jul 27 09:51 s.new 13146 Jul 27 09:51 t.new 15718 Jul 27 09:53 u.new 96856 Jul 27 10:16 v.new 12403 Jul 27 10:16 w.new 12797 Jul 27 10:16 x.new 13659 Jul 27 10:16 y.new 13403 Jul 27 10:16 z.new 217699 Jul 27 10:17 ABC 5951 Jul 27 10:17 Accolade.new 118095 Jul 27 10:17 Acre.new 89027 Jul 27 10:17 Admiral.new 22088 Jul 27 10:17 Alphabet.new 29758 Jul 27 10:17 Amber.new 4295 Jul 27 10:17 Amen.new 31785 Jul 27 10:17 Aperture.new 2643 Jul 27 10:17 Ash.new 2643 Jul 27 10:17 ash.new 44238 Jul 27 10:17 Atlantic.new 1375 Jul 27 10:17 Back.new 1375 Jul 27 10:17 back.new 36337 Jul 27 10:17 
Bay.new 36337 Jul 27 10:17 bay.new 53374 Jul 27 10:17 Bowling.new 53374 Jul 27 10:17 bowling.new 2048 Jul 27 10:17 Cabinet 36569 Jul 27 10:17 Captain.new 36569 Jul 27 10:17 captain.new 12368 Jul 27 10:17 Calvary.new 12368 Jul 27 10:17 calvary.new 26920 Jul 27 10:17 Caterpillar.new 28665 Jul 27 10:17 Chancellor.new 28665 Jul 27 10:17 chancellor.new 31754 Jul 27 10:17 Chestnut.new 31754 Jul 27 10:17 chestnut.new 4924 Jul 27 10:17 Chin.new 725 Jul 27 10:17 Clipboard.new 725 Jul 27 10:17 clipboard.new 44162 Jul 27 10:17 Colony.new 44162 Jul 27 10:18 colony.new 3070 Jul 27 10:18 Colonies.new 3070 Jul 27 10:18 colonies.new 55 Jul 27 10:18 Colors.new 55 Jul 27 10:18 colors.new 565 Jul 27 10:18 Colours.new 565 Jul 27 10:18 colours.new 138372 Jul 27 10:19 Company.new 138372 Jul 27 10:20 company.new 6611 Jul 27 10:20 Companies.new 6611 Jul 27 10:20 companies.new 14699 Jul 27 10:20 Consul.new 14699 Jul 27 10:20 consul.new 76725 Jul 27 10:20 Colorado 3180 Jul 27 10:21 Commonwealth.new 3180 Jul 27 10:21 commonwealth.new 30657 Jul 27 10:21 Conservative.new 1206 Jul 27 10:21 Conservatives.new 113900 Jul 27 10:21 Corvette.new 2005 Jul 27 10:21 Corvettes.new 28639 Jul 27 10:21 Delphi.new 48181 Jul 27 10:21 Family.new 48181 Jul 27 10:21 family.new 2257 Jul 27 10:21 Families.new 2257 Jul 27 10:21 families.new 61603 Jul 27 10:21 Icon.new 61603 Jul 27 10:21 icon.new 6665 Jul 27 10:21 Icons.new 6665 Jul 27 10:21 icons.new 5801 Jul 27 10:21 Interpreter.new 5801 Jul 27 10:21 interpreter.new 70977 Jul 27 10:21 Jupiter.new 12095 Jul 27 10:21 Knot.new 12095 Jul 27 10:21 knot.new 80891 Jul 27 10:21 Krishna.new 121459 Jul 27 10:21 Lead.new 121459 Jul 27 10:21 lead.new 127 Jul 27 10:21 Liberal 180 Jul 27 10:21 Libertarian 183969 Jul 27 10:22 Madonna.new 183969 Jul 27 10:22 madonna.new 65528 Jul 27 10:22 Mass.new 65528 Jul 27 10:22 mass.new 5378 Jul 27 10:22 Meta.new 770 Jul 27 10:22 Ministry 3160 Jul 27 10:22 Model.new 3160 Jul 27 10:22 model.new 176677 Jul 27 10:23 Moon.new 176677 Jul 27 10:23 
moon.new 214735 Jul 27 10:23 National 199067 Jul 27 10:23 Oxygen.new 76332 Jul 27 10:23 Primate.new 76332 Jul 27 10:23 primate.new 5462 Jul 27 10:23 Roland.new 346 Jul 27 10:24 Ronaldo.new 68973 Jul 27 10:24 Salt.new 68973 Jul 27 10:24 salt.new 16813 Jul 27 10:24 Season.new 16813 Jul 27 10:24 season.new 44306 Jul 27 10:24 Shiraz.new 44306 Jul 27 10:24 shiraz.new 53287 Jul 27 10:24 Spire.new 53287 Jul 27 10:24 spire.new 153867 Jul 27 10:24 Stream.new 153867 Jul 27 10:24 stream.new 11482 Jul 27 10:24 Telegram.new 3845 Jul 27 10:24 Thermal.new 3845 Jul 27 10:24 thermal.new 88519 Jul 27 10:24 Tree.new 88519 Jul 27 10:24 tree.new 3102 Jul 27 10:24 Trojan 3102 Jul 27 10:24 trojan 167 Jul 27 10:24 U.S. 2334 Jul 27 10:24 Victory.new 26424 Jul 27 10:24 Ardennes.new 19159 Jul 27 10:24 Aspen.new 1884 Jul 27 10:24 Baler.new 105737 Jul 27 10:25 Batman.new 20662 Jul 27 10:25 Battle.new 53364 Jul 27 10:25 Bethlehem.new 439921 Jul 27 10:25 Birmingham.new 11530 Jul 27 10:25 Boulder.new 54094 Jul 27 10:25 Brampton.new 14995 Jul 27 10:25 Calvados.new 208354 Jul 27 10:25 Cambridge.new 71179 Jul 27 10:25 Canterbury.new 15715 Jul 27 10:25 Caracal.new 203571 Jul 27 10:26 Christchurch.new 78460 Jul 27 10:26 Cicero.new 43543 Jul 27 10:26 Durango.new 18943 Jul 27 10:26 East 296629 Jul 27 10:26 Edmonton.new 12304 Jul 27 10:26 Esplanade.new 25247 Jul 27 10:26 Eye.new 32977 Jul 27 10:26 Flint.new 151 Jul 27 10:26 Gladstone.new 81116 Jul 27 10:26 Gloucester.new 56266 Jul 27 10:26 Greenwich.new 780 Jul 27 10:26 Guna.new 21889 Jul 27 10:26 Horsham.new 199436 Jul 27 10:26 Hyderabad.new 89915 Jul 27 10:26 Ipswich.new 15229 Jul 27 10:26 Ithaca.new 132579 Jul 27 10:27 Lagos.new 68478 Jul 27 10:27 La 18993 Jul 27 10:27 Leek.new 439197 Jul 27 10:27 Liverpool.new 26324 Jul 27 10:27 Loire.new 54 Jul 27 10:27 Loni.new 8106 Jul 27 10:27 Malmesbury.new 35538 Jul 27 10:27 Mansfield.new 7545 Jul 27 10:27 March.new 16434 Jul 27 10:27 Mold.new 25849 Jul 27 10:27 Moselle.new 33698 Jul 27 10:27 New 270789 Jul 27 
10:27 New 205009 Jul 27 10:28 Norfolk.new 112023 Jul 27 10:28 Norwich.new 28431 Jul 27 10:28 Ore.new 71930 Jul 27 10:28 Pali.new 83138 Jul 27 10:28 Panama 373705 Jul 27 10:28 Perth.new 99124 Jul 27 10:28 Piedmont.new 22133 Jul 27 10:28 Pueblo.new 73659 Jul 27 10:28 Punjab.new 30869 Jul 27 10:28 Reading.new 100419 Jul 27 10:29 Republic 19646 Jul 27 10:29 Rye.new 23084 Jul 27 10:29 Saga.new 6106 Jul 27 10:29 Saint 5866 Jul 27 10:29 St. 11630 Jul 27 10:29 Saint 5336 Jul 27 10:29 St. 97107 Jul 27 10:29 St. 22068 Jul 27 10:29 Stanford.new 255991 Jul 27 10:29 Surrey.new 93952 Jul 27 10:29 Tripoli.new 50366 Jul 27 10:29 Troy.new 38853 Jul 27 10:29 Van.new 18130 Jul 27 10:29 Vosges.new 21909 Jul 27 10:29 Warwick.new 15455 Jul 27 10:29 Angels.new 23662 Jul 27 10:29 Arsenal.new 38084 Jul 27 10:29 Avalanche.new 2391 Jul 27 10:29 Barbarians.new 1558 Jul 27 10:29 Bears.new 5145 Jul 27 10:29 Border 296 Jul 27 10:29 Broncos.new 463 Jul 27 10:29 Buccaneers.new 1063 Jul 27 10:29 Canadiens.new 15399 Jul 27 10:29 Cavaliers.new 751 Jul 27 10:29 Cheetahs.new 367 Jul 27 10:29 Corinthians.new 3529 Jul 27 10:29 Coyotes.new 9722 Jul 27 10:29 Crusaders.new 5268 Jul 27 10:29 Dolphins.new 3090 Jul 27 10:29 Dragons.new 4159 Jul 27 10:29 Ducks.new 160 Jul 27 10:29 Eagles.new 45 Jul 27 10:29 Flames.new 48481 Jul 27 10:29 Force.new 181 Jul 27 10:29 Griquas.new 2627 Jul 27 10:29 Hawks.new 27971 Jul 27 10:29 Heat.new 653 Jul 27 10:29 Hornets.new 5809 Jul 27 10:29 Hurricanes.new 949 Jul 27 10:29 Jaguars.new 223 Jul 27 10:29 Jays.new 1571 Jul 27 10:29 Leopards.new 43470 Jul 27 10:30 Lightning.new 2409 Jul 27 10:30 Lions.new 229 Jul 27 10:30 Ospreys.new 1981 Jul 27 10:30 Pelicans.new 2413 Jul 27 10:30 Penguins.new 9026 Jul 27 10:30 Pirates.new 4012 Jul 27 10:30 Predators.new 2731 Jul 27 10:30 Rockets.new 802 Jul 27 10:30 Rockies.new 7330 Jul 27 10:30 Saints.new 9918 Jul 27 10:30 Saracens.new 3954 Jul 27 10:30 Sharks.new 3306 Jul 27 10:30 Stars.new 6305 Jul 27 10:30 Thunder.new 2129 Jul 27 10:30 
Tigers.new 26592 Jul 27 10:30 Titans.new 3808 Jul 27 10:30 Twins.new 98682 Jul 27 10:30 Vikings.new 663 Jul 27 10:30 Warriors.new 3396 Jul 27 10:30 Wasps.new 5597 Jul 27 10:30 Wolves.new 6 Jul 27 10:30 Zunz.new 795 Jul 27 10:30 Orsini.new 226 Jul 27 10:30 Rockefeller.new 32 Jul 27 10:30 Paintal.new 483 Jul 27 10:30 Rothschild.new 8 Jul 27 10:30 Pevsner.new 4861 Jul 27 10:30 O'Reilly.new 62 Jul 27 10:30 Primo 18 Jul 27 10:30 Cimarosa.new 53 Jul 27 10:30 Narasimha 505 Jul 27 10:30 Caracciolo.new 155 Jul 27 10:30 Bakunin.new 665 Jul 27 10:30 Weber.new 26 Jul 27 10:30 Malevich.new 57 Jul 27 10:30 Korotayev.new 18 Jul 27 10:30 Krauser.new 186 Jul 27 10:30 Ghazali.new 266 Jul 27 10:30 Touré.new 190 Jul 27 10:30 Sadat.new 288 Jul 27 10:30 Rajguru.new 289 Jul 27 10:30 Maitland.new 83 Jul 27 10:30 Strozzi.new 90 Jul 27 10:30 Delacroix.new 167 Jul 27 10:30 Reuter.new 185 Jul 27 10:30 Baden 31 Jul 27 10:30 Lessing.new 129 Jul 27 10:30 Boyle.new 96 Jul 27 10:30 Aelian.new 48 Jul 27 10:30 Zichy.new 64 Jul 27 10:30 Nomura.new 204 Jul 27 10:30 Takeda.new 21 Jul 27 10:30 Gilbert 265 Jul 27 10:30 Batista.new 939 Jul 27 10:30 Andrássy.new 544 Jul 27 10:30 Prabhu.new 165 Jul 27 10:30 Tyszkiewicz.new 22 Jul 27 10:30 Mommsen.new 251 Jul 27 10:30 Köppen.new 492 Jul 27 10:30 Della 168 Jul 27 10:30 Bernstein.new 32 Jul 27 10:30 Tippett.new 380 Jul 27 10:30 Sanseverino.new 51 Jul 27 10:30 Pucci.new 377 Jul 27 10:30 Hieronymus 113 Jul 27 10:30 Ghirlandaio.new 65 Jul 27 10:30 Beckett.new 711 Jul 27 10:30 O'Ryan.new 273 Jul 27 10:30 Neumann.new 10 Jul 27 10:30 Matsushita.new 1276 Jul 27 10:30 Ferrero.new 114 Jul 27 10:30 Dietz.new 59 Jul 27 10:30 Amorim.new 29 Jul 27 10:30 Wankel.new 594 Jul 27 10:30 Uexküll.new 20 Jul 27 10:30 Stirner.new 80 Jul 27 10:30 Sridhar.new 234 Jul 27 10:30 Rossetti.new 150 Jul 27 10:30 Nassar.new 115 Jul 27 10:30 Morandi.new 160 Jul 27 10:30 Bulgakov.new 25 Jul 27 10:30 Barks.new 136 Jul 27 10:30 Agnelli.new 350 Jul 27 10:30 Teleki.new 134 Jul 27 10:30 
Tarnowski.new 574 Jul 27 10:30 Hamdan.new 93 Jul 27 10:30 Guicciardini.new 589 Jul 27 10:30 Clark.new 97 Jul 27 10:30 Borromeo.new 22 Jul 27 10:30 Bazzi.new 51 Jul 27 10:30 Wolf-Ferrari.new 357 Jul 27 10:30 Sylvester.new 26 Jul 27 10:30 Schichau.new 164 Jul 27 10:30 Scarlatti.new 67 Jul 27 10:30 Noriega.new 24 Jul 27 10:30 Bohlen.new 40 Jul 27 10:30 Boiardo.new 45 Jul 27 10:30 Bosman.new 446 Jul 27 10:30 Braun.new 9 Jul 27 10:30 Gabrielli.new 56 Jul 27 10:30 Haider.new 49 Jul 27 10:30 Jayachandran.new 72 Jul 27 10:30 Jellinek.new 332 Jul 27 10:30 Manning.new 28 Jul 27 10:30 Naryshkin.new 157 Jul 27 10:30 Sachs.new 118 Jul 27 10:30 Sacks.new 101 Jul 27 10:30 Saunders.new 159 Jul 27 10:30 Uccello.new 204 Jul 27 10:30 Velazquez.new 29 Jul 27 10:30 Wills.new 60 Jul 27 10:30 Bergman.new 759 Jul 27 10:30 Haim.new 18588 Jul 27 10:30 Agamemnon.new 3872 Jul 27 10:30 Antigone.new 33458 Jul 27 10:30 Bloomsbury.new 36678 Jul 27 10:30 Cabaret.new 494 Jul 27 10:30 Can-Can.new 23895 Jul 27 10:30 Carousel.new 7172 Jul 27 10:30 Cyrano 47072 Jul 27 10:30 Dune.new 13573 Jul 27 10:30 Euphoria.new 6460 Jul 27 10:30 Falstaff.new 13338 Jul 27 10:30 Faust.new 575 Jul 27 10:30 Fra 1650 Jul 27 10:30 Gidget.new 16873 Jul 27 10:31 Gladiator.new 85498 Jul 27 10:31 Julius 10409 Jul 27 10:31 Medea.new 7415 Jul 27 10:31 Mystic 536 Jul 27 10:31 Peaky 9674 Jul 27 10:31 Peer 16265 Jul 27 10:31 Pericles.new 60538 Jul 27 10:31 Quartz.new 9418 Jul 27 10:31 Salome.new 49778 Jul 27 10:31 St. 
84 Jul 27 10:31 The 9885 Jul 27 10:31 Ansible.new 20259 Jul 27 10:31 Arrow.new 57727 Jul 27 10:31 Daily 672758 Jul 27 10:31 The 8853 Jul 27 10:32 Decanter.new 11944 Jul 27 10:32 Dissent.new 13559 Jul 27 10:32 Germania.new 7858 Jul 27 10:32 Guernica.new 29403 Jul 27 10:32 Life.new 6739 Jul 27 10:32 The 809 Jul 27 10:32 The 195831 Jul 27 10:32 The 13864 Jul 27 10:32 Referee.new 2987 Jul 27 10:32 Sunday 24360 Jul 27 10:32 Sunday 154416 Jul 27 10:32 The 5692 Jul 27 10:32 Cage.new 872 Jul 27 10:32 Carpenters.new 2853 Jul 27 10:32 Chrysalis.new 133 Jul 27 10:32 Doors.new 324 Jul 27 10:32 Fernando.new 62059 Jul 27 10:32 Grenade.new 38621 Jul 27 10:32 Guru.new 125 Jul 27 10:32 Happy.new 970 Jul 27 10:32 Hello.new 190 Jul 27 10:32 Jojo.new 13288 Jul 27 10:32 Pink.new 84108 Jul 27 10:33 Sugar.new 16057 Jul 27 10:33 anchorage.new 25 Jul 27 10:33 barks.new 105737 Jul 27 10:33 batman.new 109392 Jul 27 10:33 derby.new 166471 Jul 27 10:33 jersey.new 107237 Jul 27 10:33 limerick.new 121643 Jul 27 10:33 louvre.new 332 Jul 27 10:33 manning.new 7545 Jul 27 10:33 march.new 99124 Jul 27 10:34 piedmont.new 118 Jul 27 10:34 sacks.new 1443 Jul 27 10:34 sandbanks.new 26151 Jul 27 10:34 slough.new 255991 Jul 27 10:34 surrey.new 50366 Jul 27 10:34 troy.new 29 Jul 27 10:34 wills.new 523 Jul 27 10:34 The.new 523 Jul 27 10:34 the.new 48 Jul 27 10:34 Is.new 48 Jul 27 10:34 is.new 337 Jul 27 10:34 were.new 199 Jul 27 10:34 That.new 199 Jul 27 10:34 that.new 370 Jul 27 10:34 said.new 1155 Jul 27 10:34 One.new 1155 Jul 27 10:34 one.new 5430 Jul 27 10:34 goes.new |
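Reading per-target run times off a listing like the one above amounts to taking the gap between consecutive modification times. The sketch below shows that arithmetic on four entries copied from the listing; the function name `minutes_per_file` is hypothetical, and it assumes the files were written one after another on the same day, with only minute resolution available.

```python
from datetime import datetime


def minutes_per_file(entries):
    """Map each file to the minutes elapsed since the previous file's timestamp.

    `entries` is a list of (name, 'HH:MM') pairs in completion order.
    The first entry has no predecessor and is therefore omitted.
    """
    durations = {}
    previous = None
    for name, hhmm in entries:
        finished = datetime.strptime(hhmm, "%H:%M")
        if previous is not None:
            durations[name] = int((finished - previous).total_seconds() // 60)
        previous = finished
    return durations


# Four consecutive entries taken from the listing above.
sample = [("T.new", "09:19"), ("U.new", "09:21"), ("V.new", "09:45"), ("W.new", "09:45")]
print(minutes_per_file(sample))  # {'U.new': 2, 'V.new': 24, 'W.new': 0}
```

The 24-minute gap before V.new is the "almost 25 minutes" mentioned in the thread; gaps of 0 simply mean the file finished within the same minute as the one before it.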
Bot updating Webarchive template is adding "url" same as existing "url2"
This bot made a group of WaybackMedic 2.5 edits in June where it "rescued" an archive link in the |url= parameter of {{Webarchive}}, replacing it with a link which was already in the |url2= parameter. Two examples of this are Grant Bramwell: revised 1 June 2022 and List of ICF Canoe Sprint World Championships medalists in men's kayak: revised 26 June 2022. Can the bot remove the duplicate url2/date2/title2 parameters and renumber any subsequent url3/date3/title3, etc.? I've fixed over 500 of these edits myself, but there are still over 700 remaining to be fixed. Thanks. -- Zyxw (talk) 03:54, 9 August 2022 (UTC)
- That was part of the deprecation of WebCite, which is a dead archive provider. It didn't account for dups. It's complicated here because even though |url= and |url2= are the same, |title= and |title2= are different: which do you choose? I think the best course is to keep the |url= set and remove the |url2= set, at least based on the two examples. In terms of renumbering, that is not required, as the webarchive template is designed to allow any numbers up to 10; having a |url= .. aka |url1= .. is the only requirement. I'll start looking at this today. -- GreenC 15:35, 9 August 2022 (UTC)
- @GreenC: I agree with keeping the |url= set and removing the |url2= set when there is a duplicate URL, and that is what I did for the 500+ already fixed. I also thought {{Webarchive}} might automatically handle the missing |url2= set and display the |url3= set, but as per these tests that is not the case:
- archive with url/date/title, url2/date2/title2, and url3/date3/title3
- Medal Winners – Olympic Games and World Championships (1936–2007) – Part 1: flatwater (now sprint). CanoeICF.com. International Canoe Federation. at the Wayback Machine (archived 5 January 2010). Additional archives: Wayback Machine, BCU.org.uk.
- url2/date2/title2 removed with url3/date3/title3 remaining
- Medal Winners – Olympic Games and World Championships (1936–2007) – Part 1: flatwater (now sprint). CanoeICF.com. International Canoe Federation. at the Wayback Machine (archived 5 January 2010). Additional archives: BCU.org.uk.
- url2/date2/title2 removed and url3/date3/title3 renumbered
- Medal Winners – Olympic Games and World Championships (1936–2007) – Part 1: flatwater (now sprint). CanoeICF.com. International Canoe Federation. at the Wayback Machine (archived 5 January 2010). Additional archives: BCU.org.uk.
- -- Zyxw (talk) 16:15, 9 August 2022 (UTC)
- Reported at Template_talk:Webarchive#Gaps_in_argument_sequence. I wrote the template originally but Trappist did a major rewrite so I'm not sure if that is my bug or his. I processed the first 500 articles and there are only 3 with a |url3= suggesting 40 or 50 at most in the whole bunch. Anyway it won't be difficult to renumber them. -- GreenC 16:26, 9 August 2022 (UTC)
- Ah miscalculated it's 733 not 7,330 :) It's done, see anything more let me know. -- GreenC 17:08, 9 August 2022 (UTC)
- Fixed the webarchive bug. -- GreenC 18:06, 9 August 2022 (UTC)
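The cleanup agreed on in this thread (keep the |url= set, drop a later set whose URL repeats it, then close the numbering gap) can be sketched as below. This is only an illustration, not the bot's code: the function names and the plain-dict representation of the template arguments are assumptions made for the example.

```python
FIELDS = ("url", "date", "title")
MAX_SETS = 10  # the thread states the template allows numbers up to 10


def get_set(params, n):
    """Return the url/date/title values for slot n, or None if the slot has no URL."""
    suffixes = ["", "1"] if n == 1 else [str(n)]  # |url= and |url1= are aliases
    for suffix in suffixes:
        if params.get("url" + suffix):
            return {field: params.get(field + suffix) for field in FIELDS}
    return None


def dedupe_and_renumber(params):
    """Drop sets whose URL repeats an earlier one, then renumber without gaps."""
    kept, seen = [], set()
    for n in range(1, MAX_SETS + 1):
        slot = get_set(params, n)
        if slot is None or slot["url"] in seen:
            continue  # empty slot or duplicate URL: skip it
        seen.add(slot["url"])
        kept.append(slot)
    # Carry over every argument that is not part of a url/date/title set.
    result = {k: v for k, v in params.items()
              if k.rstrip("0123456789") not in FIELDS}
    for i, slot in enumerate(kept, start=1):
        suffix = "" if i == 1 else str(i)
        for field in FIELDS:
            if slot[field] is not None:
                result[field + suffix] = slot[field]
    return result
```

Because the first occurrence of a URL wins, the |title= attached to |url= is the one that survives, which matches the choice made above.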
Bad webcitation link replacement
So I've just found out that GreenC bot made edits like this, replacing a dead archive link with another dead archive link. Would it be possible to replace that archive link with, say, this one that actually works? Thanks very much! Graham87 11:48, 26 August 2022 (UTC)
- Bots are not 100% perfect. It relies on the Wayback API to determine live links, and that is not perfect either, so those errors depend on human intervention to correct. The alternative is not to use bots at all, in which case most links never get fixed due to the scale. It's boring back-end work people want bots to do, but there is no guarantee bots, or for that matter people, will not make mistakes. The question is the scale of mistakes. -- GreenC 15:08, 26 August 2022 (UTC)
- Yeah fair enough, soft 404's and all. On re-reading my message I spectacularly failed at phrasing it clearly ... there are nearly a hundred more such links; could you instruct the bot to replace them with a working archive (i.e. the one linked above)? I thought that would be the easiest way to fix this problem. I tried changing the archive link on InternetArchiveBot's side and asking it to fix the affected articles, but that didn't do what I intended. Graham87 13:34, 27 August 2022 (UTC)
- OK it's done. Yeah there's no way to automate replace of one archive with another via IABot. That would be a good feature though when finding soft-404s. -- GreenC 16:16, 27 August 2022 (UTC)
- Opened Phab T316438 .. no idea if or when. -- GreenC 16:34, 27 August 2022 (UTC)
Avoid editing inside HTML comments
GreenC bot now edits inside HTML comments, e.g. Special:Diff/1107954452, but I suggest it should not. Although the edit in this example happened to be harmless (even useful), comments can be used for a wide range of reasons, so there is a higher risk that automatic edits break their intent. Wotheina (talk) 03:49, 2 September 2022 (UTC)
- That's true but there is a positive trade-off so for a couple reasons I am OK fixing certain (not all) link rot in comments, as I have been doing for 7 years. If someone wants to preserve a block of immutable wikitext they should use the talk page, user page or offline - otherwise anyone can edit the comment or delete it entirely. Comments can be strangely formatted, I take measures, auto and manual, to check commented text before posting a live diff. -- GreenC 05:39, 2 September 2022 (UTC)
Stopping backlinks report during wikibreak
Hello, and thanks again for the useful Backlinks reports. I'm currently taking a Wikibreak and have attempted to exclude my list from the bot's tasks thus but it still ran today. It's not a problem for me if the reports continue but, if you'd like to save some resources by stopping it properly, please go ahead. Certes (talk) 11:25, 5 September 2022 (UTC)
- Fixed, it was seeing Action=RUN in the "#" comment. First time this code has been tested :) Have a good break. -- GreenC 05:14, 6 September 2022 (UTC)
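The bug described here, a setting being picked up from inside a "#" comment, is commonly avoided by stripping comments before looking for the key. A minimal sketch follows; the config layout, the `parse_action` name and the `STOP` value are all assumptions for illustration, since the actual job-list format is not shown in this thread.

```python
def parse_action(config_text, default="RUN"):
    """Return the value of the first Action= setting that is not inside a '#' comment."""
    for raw_line in config_text.splitlines():
        # Discard everything from the first '#' onward before matching.
        line = raw_line.split("#", 1)[0].strip()
        if line.upper().startswith("ACTION="):
            return line.split("=", 1)[1].strip().upper()
    return default
```

Splitting on the first "#" is the simplest rule; it would need refining if values themselves could legitimately contain a "#".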
Please Update the monthly list of Top 10000 wikipedia users by Article Count
Please Update the monthly list of Top 10000 wikipedia users by Article Count which changes every 1st and 15th date of a month. Abbasulu (talk) 07:52, 3 October 2022 (UTC)
- It's still running, but for some reason very slowly: in 3 days it only completed 19%. -- GreenC 12:51, 3 October 2022 (UTC)
Exactly what purpose did this edit serve? Edit summary is misleading at best
https://en.wikipedia.org/w/index.php?title=Rodney_Marks&diff=1095741886&oldid=1091111369 108.246.204.20 (talk) 20:17, 3 October 2022 (UTC)
- Don't use {{dead link}} if the citation has a working |archive-url=. -- GreenC 20:46, 3 October 2022 (UTC)
- it doesn't. "this page is not available". 108.246.204.20 (talk) 04:15, 14 October 2022 (UTC)
- Ah soft-404. Removed. I also updated the IABot database. -- GreenC 04:24, 14 October 2022 (UTC)
A cookie for you!
Ulises12345678 (talk) 11:00, 9 October 2022 (UTC)
- Thank you. For the Cookie. -- GreenC 14:12, 9 October 2022 (UTC)
RSSSF
Why is this bot changing "website=rsssf.com" to "website=RSSSF", where there is already "publisher=RSSSF" parameter, and then in many pages you get stupid outcome like this with double RSSSF linking? Snowflake91 (talk) 10:27, 7 February 2023 (UTC)
- Yeah it's not ideal, a work in progress. In any case the problem is there should not be both |work= and |publisher=; use one or the other, not both. And best practice on Wikipedia is to use the name of the site, not a domain name. There are so many RSSSF citations, and so many problems with them; I've done a lot of work to fix them but there are still things that need more work. -- GreenC 15:22, 7 February 2023 (UTC)
- Prefer |website= over |publisher=. {{cite web}} does not include |publisher= in the citation's metadata. Trappist the monk (talk) 16:18, 7 February 2023 (UTC)
- Special:Diff/1038698982/1138241646 -- GreenC 21:44, 8 February 2023 (UTC)
I think all the doubles are cleared, if you see any more or other problems let me know. -- GreenC 21:45, 8 February 2023 (UTC)
WaybackMedic
@GreenC: It seems that WaybackMedic 2.5 is running by GreenC bot 2. However, I can't find its source code of version 2.5 in the Github repo. I need to read the latest code to learn its current behavior. Have you published it yet? -- NmWTfs85lXusaybq (talk) 14:04, 24 March 2023 (UTC)
- I can send snippets or functions if you want for anything you are interested in. The entire codebase is not currently publicly available because it contains some proprietary information. It's written in Nim, and some awk utils. -- GreenC 14:44, 24 March 2023 (UTC)
- The bot detection of businessweek.com you mentioned in Wikipedia:Village_pump_(technical)/Archive_203#businessweek.com_links may be bypassed by simply assigning a user agent of a web browser in the header of http requests, such as Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36. As far as I know from version 2.1, WaybackMedic may execute external commands (via execCmdEx) to determine page status and the assignment of user-agent should be easily implemented via some available parameters. By the way, as of version 2.1, I can see the validate_robots function is implemented in medicapi.nim. -- NmWTfs85lXusaybq (talk) 16:55, 24 March 2023 (UTC)
- Thank you for the suggestion to use a browser agent. I tried it, they appear to limit based on query rate, and it's pretty sensitive. I was able to trigger it by manually requesting 8 headers rapidly then it stopped working, sending a header with "HTTP/1.1 307 s2s_high_score" and redirect to a javascript challenge ("press and hold button"). Maybe I could slow the bot down enough between queries, it would be difficult, and extremely slow, perhaps a month or longer for 10k articles, and would need to verify every header is not 307 otherwise abort and manually clear the challenge. GreenC 21:36, 24 March 2023 (UTC)
- If they limit the query rate based on ip, you can find some web proxies to accelerate this procedure as your bot may behave like a web crawler. After you collect and validate some free proxies, you can just apply them alternately to your bot, although their stability is not guaranteed. -- NmWTfs85lXusaybq (talk) 03:47, 25 March 2023 (UTC)
- I have access to a web proxy that uses home based IPs and it still didn't work. Maybe the solution is to pull every URL into a file and process them outside the bot with a simple script that waits x seconds between each header query. Then feed the results to the bot which URLs are dead. It can run for however long it wouldn't matter. Trying to do it inside the bot is too error prone too complicated and ties up the bot too long. -- GreenC 04:11, 25 March 2023 (UTC)
- It's a good idea to run this job outside the bot. However, I'm not sure what you mean by "a web proxy that uses home based IPs". Have you tried high-anonymity proxies? Did you change proxy IP every time you made a new request? NmWTfs85lXusaybq (talk) 04:45, 25 March 2023 (UTC)
- The IPs change with every request, and the IPs are sourced to home broadband users globally, so they are not detectable by CIDR block. I don't know how they got blocked, maybe Cloudflare is on this service and recorded all of the IPs. -- GreenC 14:46, 25 March 2023 (UTC)
- Then I suppose your proxy strategy is OK. Please make sure your web proxy has high anonymity if all of your configuration works fine. -- NmWTfs85lXusaybq (talk) 15:20, 25 March 2023 (UTC)
- I ran this bot-block avoidance script and it took forever. What I discovered is just about every link should be archived. Either 404, soft-404 or better-off-dead. The latter because the links went to content that was behind a paywall or otherwise messed up in some way, so the archived version is better in nearly every case. -- GreenC 14:17, 3 April 2023 (UTC)
- I see you mentioned some awk scripts as a workaround at Wikipedia:Link_rot/URL_change_requests#businessweek.com. However, I can't find the meta directory businessweek.00000-10000 you referred to in the Github repo of InternetArchiveBot and WaybackMedic. NmWTfs85lXusaybq (talk) 07:15, 24 April 2023 (UTC)
- Oh that's a note to myself, if you want the awk script let me know it's nothing more than going through a list of URLs, pausing between each to avoid rate limiting, getting the headers and recording the results and if it's a bot block header notify and abort the script. It also shuffles the agent string. It seemed to learn agent strings and block based on those which could be avoided by retiring an agent and adding a new one. -- GreenC 13:47, 24 April 2023 (UTC)
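The script described above (walk a list of URLs, pause between requests, fetch only the headers, rotate the agent string, and abort on a bot-block response) could look roughly like this in Python. It is a sketch under stated assumptions, not the awk script itself: the function names, the second agent string and the 10-second default delay are invented for illustration, and treating every 307 as a block is a simplification of the "307 s2s_high_score" response reported earlier in the thread.

```python
import random
import time
import urllib.error
import urllib.request

USER_AGENTS = [
    # First string is the one quoted in this thread; the second is illustrative.
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/60.0.3112.50 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0",
]
BLOCK_STATUS = 307  # simplification: any 307 is treated as the bot-block challenge


class NoRedirect(urllib.request.HTTPRedirectHandler):
    """Surface 3xx responses as errors instead of silently following them."""

    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None


def head_status(url, agent):
    """Send a HEAD request and return the HTTP status, or None on a network error."""
    opener = urllib.request.build_opener(NoRedirect)
    req = urllib.request.Request(url, method="HEAD", headers={"User-Agent": agent})
    try:
        with opener.open(req, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # 3xx, 4xx and 5xx responses arrive here
    except urllib.error.URLError:
        return None


def check_urls(urls, delay=10.0, fetch=head_status, sleep=time.sleep):
    """Check each URL in turn, pausing between requests; abort on a bot block."""
    results = {}
    for url in urls:
        status = fetch(url, random.choice(USER_AGENTS))
        if status == BLOCK_STATUS:
            raise RuntimeError(f"bot block at {url}: clear the challenge, then resume")
        results[url] = status
        sleep(delay)
    return results
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without network access; the resulting status map is what would then be fed back to the bot as the list of dead URLs.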
Backlinks report 2023
User:Certes/Backlinks/Report has stopped updating. The bot is running, as User:GoingBatty/Backlinks/Report still updates. I've not changed the job list in User:Certes/Backlinks since 8 May, nor pressed the stopbutton. Do you know how to restart the report please? Certes (talk) 12:17, 4 June 2023 (UTC)
- The process from June 2nd crashed for unknown reason and turned into a zombie preventing future runs. I can't kill it so I contacted Toolforge admins for help. -- GreenC 14:17, 4 June 2023 (UTC)
- Working again now – thanks! Certes (talk) 21:50, 4 June 2023 (UTC)
Archiving chapter urls
This is a bit of an edge case with GreenC bot's archive repair task, so I wanted to get your opinion. In several articles where I'm citing an archived book that has separate PDFs for each chapter, I use the |archive-url= parameter for the chapter url (since that's the most important one) and have a Wayback url for the book url in the |url= field. It's not ideal, but I'm not sure how else to handle it. My brief search also found this thread where you indicated that |archive-url= was okay to use for the chapter url. However, GreenC bot switches the |archive-url= field to be the archive of the |url= field (example here).
Is there a better way to format these citations? I'm not able to find any. Otherwise, is there any way I can mark the citations to be ignored by the bot? This seems like a relatively rare case; I imagine it's not worth modifying the bot to handle. Thanks, Pi.1415926535 (talk) 22:14, 14 August 2023 (UTC)
- Special:Diff/1170358971/1170410520. Another option:
- Vanasse Hangen Brustlin, Inc (August 2005). Beyond Lechmere Northwest Corridor Study: Major Investment Study/Alternatives Analysis. Massachusetts Bay Transportation Authority. Archived from the original on July 5, 2016. Chapter 4: Identification and Evaluation of Alternatives – Tier 1 at the Wayback Machine (archived 2016-07-05)
- I like this better because it doesn't hack the cite book template arguments. The downside is the display is a little messier. Another way with some duplication:
- Vanasse Hangen Brustlin, Inc (August 2005). "Chapter 4: Identification and Evaluation of Alternatives – Tier 1". Beyond Lechmere Northwest Corridor Study: Major Investment Study/Alternatives Analysis. Massachusetts Bay Transportation Authority. Archived from the original (PDF) on July 5, 2016. From Beyond Lechmere Northwest Corridor Study: Major Investment Study/Alternatives Analysis at the Wayback Machine (archived 2016-07-05)
- To keep the bot off the citation add {{cbignore}} template after the end of the cite book but inside the ref tags. -- GreenC 02:17, 15 August 2023 (UTC)
- Thanks, much appreciated. Pi.1415926535 (talk) 17:15, 15 August 2023 (UTC)
- @GreenC: Please take a look at Special:Diff/1171111146, where the bot edited several citations already tagged with {{cbignore}}. Thanks, Pi.1415926535 (talk) 06:35, 21 August 2023 (UTC)
- I found two problems. 1) The {{cbignore}} should follow directly after the template it targets: Special:Diff/1171510462/1171514730 - I think the cbignore docs have this. 2) My bot has a known limitation. Within any block of text between new lines (i.e. a paragraph of text), if there is more than one cbignore, the citations the cbignore follows all need to be unique. In this case the two citations are mirror copies. The bot ignored the cbignore for that reason (it has to do with disambiguation: it needs to know which citation to target). So, I modified one of the citations, they are now unique: Special:Diff/1171514730/1171514803 (changed the semi-colon to colon in the publisher field for the first citation) -- a bit quirky but tested and it works now. I do recommend though using the alt suggestions above because while my bot honors cbignore most other bots do not, and eventually it's probable some other tool will try to "fix" what it detects as an error (archive URL in the url field). -- GreenC 15:45, 21 August 2023 (UTC)
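The uniqueness constraint described in point 2 can be checked before saving a page. The sketch below is an assumption-laden illustration, not the bot's logic: it only recognises non-nested {{cite ...}} templates, it treats each newline-delimited block as one paragraph, and the `ambiguous_cbignore` name is hypothetical.

```python
import re
from collections import Counter

# A non-nested {{cite ...}} template immediately followed by {{cbignore}}.
PROTECTED = re.compile(
    r"(\{\{\s*cite [^{}]*\}\})\s*\{\{\s*cbignore\s*\}\}",
    re.IGNORECASE,
)


def ambiguous_cbignore(wikitext):
    """Return protected citations that occur more than once within one paragraph."""
    problems = []
    for paragraph in wikitext.split("\n"):
        counts = Counter(PROTECTED.findall(paragraph))
        problems.extend(cite for cite, n in counts.items() if n > 1)
    return problems
```

Any citation it returns would need a small edit to make the copies distinct, as was done with the publisher field in the diff above.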
Incorrect dead flags and archive.today
Hello GreenC! Your bot recently made this strange edit to Pokémon. In it, the bot changed "archive.is" and "archive.ph" to "archive.today". I'm not sure what purpose this has. The task is not explained on User:GreenC bot.
Furthermore, the bot flagged these three sources as dead:
- https://www.theguardian.com/technology/gamesblog/2013/oct/11/pokemon-blockbuster-game-technology
- https://order.mandarake.co.jp/order/detailPage/item?itemCode=1052117728
- https://www.nytimes.com/1997/12/20/news/big-firms-failure-rattles-japan-asian-tremors-spread.html
But as you can see, the above links are not dead. So something must've gone wrong there. I've remarked these refs as live. Cheers, Manifestation (talk) 11:04, 19 August 2023 (UTC)
- Archive.today is what the owner of archive.today wants us to use; it's a redirector that sends traffic to other domains as they are available. The reason those three got marked dead is that there was an archive URL in the |url= field, the bot moved it to the |archive-url= field, and the bot assumes that if someone put an archive URL in the main |url= field it was probably a dead URL. -- GreenC 14:47, 19 August 2023 (UTC)
- @GreenC: Aaah! So that's why. I wrote the text, so I take full responsibility for the url= / archive-url= mixup. As for archive.today: I looked at our article, and it cites this tweet from 4 January '19 in which the owner states that the .is domain might stop working soon. However, the domain is still active. In fact, the '@' handle used by the account to this day is still "@archiveis". I've used archive.today many times, including this year. It always gave me either a .is or a .ph link. Cheers, Manifestation (talk) 15:07, 19 August 2023 (UTC)
- Yeah it redirects to one of the 6 domains like .is or .ph, but if one of those domains gets shut down by the registrar, he can switch where it redirects to easily, without having to change every link on Wikipedia. -- GreenC 15:24, 19 August 2023 (UTC)
- Hmm ok. Well I guess we should honor his/her request then. For the sake of clarity, maybe the description of Job #2 / WaybackMedic 2.5 on User:GreenC bot could be expanded a little to include a mention of archive.today? archive.today is not part of the Internet Archive, so the term "WaybackMedic" is a bit misleading. - Manifestation (talk) 16:03, 19 August 2023 (UTC)
- Alright I updated fix #21 which also now links to Help:Using_archive.today#Archive.today_compared_to_.is,_.li,_.fo,_.ph,_.vn_and_.md. It started out as Wayback-specific then expanded to all archive providers but I kept the original name anyway. -- GreenC 16:41, 19 August 2023 (UTC)
- @GreenC Hi! I know that .today is the domain to be used, but every time I try to open a link with .today it returns a "This site cannot be reached" type of error, and the same goes for .ph links. The only active links I get are the ones with .is. Astubudustu (talk) 10:55, 2 April 2024 (UTC)
- This is because the DNS resolver you are using is hosted on CloudFlare and that won't work (well) with archive.today domains see Archive.today#Cloudflare_DNS_availability -- GreenC 15:38, 2 April 2024 (UTC)
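The domain normalisation discussed in this thread amounts to rewriting the host of an archive link while leaving the rest of the URL alone. A minimal sketch, assuming the six alias domains are the ones named in the help-page link above (.is, .li, .fo, .ph, .vn, .md); the `to_archive_today` name is hypothetical and this is not the bot's implementation.

```python
from urllib.parse import urlsplit, urlunsplit

# Alias domains taken from the help-page section title linked in this thread.
ALIASES = {
    "archive.is", "archive.li", "archive.fo",
    "archive.ph", "archive.vn", "archive.md",
}


def to_archive_today(url):
    """Rewrite an archive.today alias domain to archive.today; leave other URLs alone."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ALIASES:
        return urlunsplit(parts._replace(netloc="archive.today"))
    return url
```

Only the host changes, so both short-form links and long-form links with an embedded timestamp and original URL pass through intact.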
WaybackMedic 2.5 adding unnecessary URLs
I saw the bot's task run on Guardians of the Galaxy (film) here and it made edits to three references that used {{Cite Metacritic}}, {{Cite Box Office Mojo}}, and {{Cite The Numbers}}, adding in unnecessary URLs and marking the links as dead. The citation templates construct the urls from the given parameters (as most follow a common format on those sites) and were not dead. Didn't know if this was a bot issue, or the templates themselves doing something that is flagging the citations to make the bot adjust them. I can look into the templates to see what the issues may be if that is ultimately the case (and to know what to look for for the error). - Favre1fan93 (talk) 14:16, 24 August 2023 (UTC)
- That is a bot error. It is in 9 articles. I rolled them back (you got 2). Thanks for the report. -- GreenC 15:00, 24 August 2023 (UTC)
- No problem, thank you! - Favre1fan93 (talk) 15:26, 24 August 2023 (UTC)
Timestamp mismatch
This bot is changing the archive-url as seen here, but it is not changing the archive-date as required, creating a timestamp mismatch error, as seen here. I just recently emptied this category and now it has over 80 articles (when I wrote this) in it again. Your help would be appreciated. Thanks. Isaidnoway (talk) 05:57, 2 September 2023 (UTC)
- I am aware. I did it in two steps because, given the way this particular job was programmed, it was easier this way. You saw it in that 30-minute gap between runs. -- GreenC 16:11, 2 September 2023 (UTC)
My bot can empty that category easily. It was 40,000 a week ago. Got it down to few hundred edge cases, which I assume you fixed manually, thank you. I'd like to fully automate it, but right now it's all integrated into WP:WAYBACKMEDIC which can't be fully automated, so I run it on request. -- GreenC 16:16, 2 September 2023 (UTC)
User:Isaidnoway, I'm running a bot job to convert archive.today URLs from short-form to long-form. Example. It is exposing old problems with date mismatches that are showing up in Category:CS1 errors: archive-url -- after this bot job completes, I'll run another bot to fix the date mismatches, it will clear the tracking cat. No need to do anything manually. -- GreenC 04:57, 8 September 2023 (UTC)
- Hi GreenC! My bot is following yours today. There were several instances when your bot reformatted archive URLs like this edit, mine fixed the archive dates like my bot did in the following edit. My bot is running on Category:CS1 errors: dates, and pulling the archive date from the archive URL. Any chance your bot could do it all in one edit? Thanks! GoingBatty (talk) 18:25, 8 September 2023 (UTC)
- I used to be able to fix archive.today problems and date mismatches in the same process, but it was semi-automated. Fixing archive.today problems can and should be full-auto, so I separated that out to its own process that uses EventStream to monitor real-time when a new short-form link shows up, log the article name, and once a month or so fix them - all full-auto. Across 100s of wikis. The downside is this program can't fix date mismatch problems. I want to fix date mismatches automatically, and hope to do that eventually with its own process. Once I have that developed I can see about including it in the archive.today program, so it saves the extra edit, when the source of the date mismatch is archive.today short to long conversion.
- The tracking category will be cleared in the next few hours, it's currently generating diffs. This is a one-off event clearing out the backlog of archive.today problems which exposed a lot of problems. Going forward there will be much smaller numbers. We both currently have bots that can clear that category on request, do you know how to update the docs for the category page? -- GreenC 23:41, 8 September 2023 (UTC)
- Not sure which category page you're referring to, but most of the text on these category pages comes from Help:CS1 errors, so if you updated the help page, it would also appear on the appropriate category page. GoingBatty (talk) 03:15, 9 September 2023 (UTC)
- Category:CS1 errors: archive-url. Do you want me to include your bot in the doc as available to clear the cat on-request? I'm going to mention WaybackMedic is available, but only if there are more than 500 entries. -- GreenC 14:25, 9 September 2023 (UTC)
- I don't have a bot to clear Category:CS1 errors: archive-url. GoingBatty (talk) 18:21, 9 September 2023 (UTC)
- Oh I see I misinterpreted what you said above I thought it was fixing mismatched dates but it was actually fixing an incomplete date. -- GreenC 19:12, 9 September 2023 (UTC)
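For Wayback Machine links, the snapshot date is encoded in the archive URL itself as a timestamp beginning with YYYYMMDD, so a mismatch against |archive-date= can be detected mechanically. The following is a small sketch of that check only; the function names are hypothetical, it covers Wayback URLs and not archive.today long-form links, and it is not taken from either bot.

```python
import re
from datetime import date

# Wayback snapshot URLs carry a timestamp that starts with YYYYMMDD,
# optionally followed by HHMMSS and a modifier such as "if_" or "id_".
WAYBACK = re.compile(
    r"^https?://web\.archive\.org/web/(\d{4})(\d{2})(\d{2})\d{0,6}[a-z_]*/"
)


def wayback_date(archive_url):
    """Return the snapshot date encoded in a Wayback URL, or None if there is none."""
    match = WAYBACK.match(archive_url)
    if not match:
        return None
    year, month, day = map(int, match.groups())
    try:
        return date(year, month, day)
    except ValueError:  # digits that do not form a real calendar date
        return None


def has_mismatch(archive_url, archive_date):
    """True when the URL's snapshot date differs from the stated archive date."""
    snapshot = wayback_date(archive_url)
    return snapshot is not None and snapshot != archive_date
```

When `has_mismatch` is true, the value from `wayback_date` is the natural replacement for the archive date, which allows the URL change and the date fix to go out in a single edit.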
Economy of Zimbabwean
I need some help Mindthem (talk) 21:13, 25 September 2023 (UTC)
- @Mindthem: How would you like the bot to help with the Economy of Zimbabwe article? GoingBatty (talk) 19:20, 29 September 2023 (UTC) (talk page stalker)
Backlinks
Hi there! I see your bot delivered a new Backlinks report for Certes, but I didn't receive an update today. Could you please give the bot a nudge? Thanks! GoingBatty (talk) 19:21, 29 September 2023 (UTC)
- I saw some messages this morning Toolforge was down due to NFS, likely your run didn't complete before the outage. I see it aborted around 09:32GMT and Certes finished at 09:28 .. with minutes to spare. I'll run yours again now. -- GreenC 19:37, 29 September 2023 (UTC)
- Report received - thank you! GoingBatty (talk) 02:49, 30 September 2023 (UTC)
Bot put italics in strange places
I don't know what happened here, but the bot appears to have put italics in place where they didn't belong, and then missed putting them in where they did belong. Given that the bot had to edit three times, I imagine this bot run was stressful for you. If this code is still active, it might need yet another debugging. – Jonesey95 (talk) 18:26, 19 October 2023 (UTC)
- Yeah this was a pain, every time I thought it was done, some new issue came up. And getting those ticks right, in the right place, after the fact, wasn't easy. Anyway this task is done for me (1,200 articles deletion of {{BFI}}). If you see any problems they need manual adjustment. I don't think the number of problems is very large from spot checking. -- GreenC 18:35, 19 October 2023 (UTC)
- I think you are correct, based on my perusal of the list of Linter errors. – Jonesey95 (talk) 18:54, 19 October 2023 (UTC)
Flagging non-dead link as dead (2)
Hello. Why did GreenC bot rewrite url-status=live to url-status=dead in Special:Diff/1186567077 for a live URL? The URL [1] is alive, at least from Japan as of 2023-11-24 04:50 UTC (checked with Firefox and Chrome on Windows 10). Wotheina (talk) 05:05, 24 November 2023 (UTC)
- It's freemium content. Open an incognito window and see if it gives a different result. I tried to archive premium content pages for NatGeo because they use a freemium wall. View page source and search on "freemiumContentGatingEnabled". -- GreenC 05:42, 24 November 2023 (UTC)
- I see. I agree on switching from paywalls to archives, but for such unintuitive edits please write the intention somewhere, as in edit summary or embedded comment, or at least in User:GreenC/WaybackMedic 2.5. I think url-access= is the best way, but I guess you are not using that because there is no option "url-access=freemium" yet. Wotheina (talk) 06:46, 24 November 2023 (UTC)
- |url-access=freemium is a great idea. Until it appears, I think |url-access=live is less bad, or for a bonus point |url-access=live<!--freemium--> which can be converted in bulk later. I can see the goats too, but I block a lot of third-party scripts which might hide them in standard browsing. Certes (talk) 16:25, 8 December 2023 (UTC)
- Regarding "|url-access=live is less bad", did you mean "|url-status=live is less bad"? Wotheina (talk) 17:24, 8 December 2023 (UTC)
- Yes, sorry, I was confusing the two parameters. |url-access=live seems more accurate than |url-access=dead here. The least bad value for status might be |url-status=limited. I can't find a definition of limited to determine whether freemium falls within its scope. Certes (talk) 18:34, 8 December 2023 (UTC)
- When I did NatGeo, I didn't have the ability to add archive URLs with
|url-status=live
so unfortunately they were all set to dead. I have since added this ability after it was requested at Wikipedia:Link_rot/URL_change_requests#vh1.com by User:Alexis Jazz. I'm not sure about going back and resetting from dead to live the NatGeo links that are freemium, that would probably require some special one-off code and a lot of time to recheck all the links. But it's the kind of thing anyone could probably do pretty easily, if you have code to parse and edit CS1 templates. -- GreenC 17:34, 8 December 2023 (UTC)
- Regarding "
- I see. I agree on switching from paywalls to archives, but for such unintuitive edits please write the intention somewhere, as in edit summary or embedded comment, or at least in User:GreenC/WaybackMedic 2.5. I think url-access= is the best way, but I guess you are not using that because there is no option "url-access=freemium" yet. Wotheina (talk) 06:46, 24 November 2023 (UTC)
Backlinks timing
Hi there! I noticed that the Backlinks report hasn't run yet today for Certes or me. Looking at the bot's contributions, I see the report is running later each day this week. Could you please check the bots to see what's going on? Thank you! GoingBatty (talk) 15:22, 8 December 2023 (UTC)
- I started monitoring Buenos Aires as an experiment, not because its new links are likely to be wrong but because socks of a certain puppetmaster love linking to it. I've just removed it from my list, in case this widely-linked page is causing problems. Certes (talk) 16:18, 8 December 2023 (UTC)
- They are forks of the same script; they run from different cron jobs and directories, so it should not be possible for them to affect each other. If both are not working, I dunno, I'll check. -- GreenC 17:40, 8 December 2023 (UTC)
GoingBatty & Certes, I found a bug that only shows up when running from cron. It wasn't apparent when the script was on Toolforge because there you signify the working directory with -wd= with the jsub command, which masked the problem. The effect of the bug was to create duplicate entries in the list at /Backlinks, which is why it kept taking longer each run. For example GoingBatty had 7 instances of "hamlet" (from the script's perspective): one for the original and one for each of the 6 days the script ran. So I think the best solution is to wipe out the data files again and start over; the data files look kind of weird anyway. The usual: you'll see the message about new entries, then the next one should be good. -- GreenC 18:20, 8 December 2023 (UTC)
- On December 8, the bot started over and published a report, but didn't publish a report for December 9. Could you please check it again? Thanks! GoingBatty (talk) 04:34, 10 December 2023 (UTC)
GoingBatty, I don't know what happened. Nevertheless, it is working now. It looks system-level. Cron logs show the process ran, but it didn't. No apparent reason, and I can't replicate. Weird. Let me know if it doesn't run again, I enabled verbose logging. Also during testing I moved the job time to around 5:30 GMT .. or do you want the previous 8:30? Or some other time? -- GreenC 06:01, 10 December 2023 (UTC)
- Thank you! I'd prefer the previous 8:30, as I'm likely to see the 5:30 job right before I should be going to sleep, and then be tempted to stay up too late to address them immediately. Thanks! GoingBatty (talk) 07:04, 10 December 2023 (UTC)
User:Certes during testing your most recent report lost some data, seen below. -- GreenC 06:01, 10 December 2023 (UTC)
- Thanks; I'll take a look at those. I've a slight preference for 0830 over 0530, as I tend to look at the entries about 1000-1200 UTC and the fresher the better. Certes (talk) 16:07, 10 December 2023 (UTC)
It didn't run again. The logging helped. I'm narrowing in on the problem and made some changes. We'll see what happens next run. -- GreenC 21:18, 11 December 2023 (UTC)
At some point when this issue is resolved, are you willing to open Backlinks to other users? For example, see Wikipedia:Help desk#Notification for Links to Pages by Other Users. Thanks! GoingBatty (talk) 04:18, 12 December 2023 (UTC)
So, it does appear my IP is being rate limited by WMF. I moved all my tools off-site and it's generating a lot of traffic. The solution is to add a retry loop with pauses. Will try that next. -- GreenC 14:42, 12 December 2023 (UTC)
- Would moving the tools on-site be a solution? I know they just made that a whole lot more difficult by deprecating GridEngine. Certes (talk) 14:49, 12 December 2023 (UTC)
- That will take time because I think it will require building a custom Kubernetes image, which is a learning curve. I have a ticket open asking them about this but no reply yet. I should have been using a retry loop anyway so this will help either way. I have a function, but was apparently lazy and didn't call it. -- GreenC 15:26, 12 December 2023 (UTC)
- A lot of people will be climbing the same learning curve. It would be nice if we had a page for giving each other a leg up. Sadly (or perhaps gratefully), I've never had to use Kubernetes and so can't be of much assistance. Certes (talk) 16:17, 12 December 2023 (UTC)
- I hope to learn the system eventually, probably good thing to know. -- GreenC 18:02, 12 December 2023 (UTC)
Ran both manually with the new code. It will keep requesting when it gets a 429 ("Too many requests"). It tries 20 times with a 2 second delay. I have seen it make up to 5 requests, but it will depend on WMF server load. The jobs will run on the regular morning schedule tomorrow. -- GreenC 18:02, 12 December 2023 (UTC)
- If it's not too much work, escalating the delay might be good for both the program and the server, e.g. if the nth try fails, wait n seconds. (Exponential is recommended but seems extreme.) Certes (talk) 18:15, 12 December 2023 (UTC)
- There are too many tools making constant requests; it almost doesn't matter, they are going to saturate regardless. I'm concerned because if slowed down too much the work never gets done. Will keep on it. It will email if/when it reaches 20. -- GreenC 19:49, 12 December 2023 (UTC)
- Hmmm. It sounds as if they need a bigger computer. They can afford it. Certes (talk) 22:33, 12 December 2023 (UTC)
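As an illustration of the retry loop discussed above, combined with the escalating delay Certes suggests (wait n seconds after the nth failed try), here is a minimal Python sketch. It is not the bot's code: the function name fetch_with_retry and the injected fetch callable are hypothetical, and only the 429 status and the limit of 20 tries come from the discussion.

```python
import time
from typing import Callable, Tuple


def fetch_with_retry(fetch: Callable[[], Tuple[int, str]],
                     max_tries: int = 20,
                     sleep: Callable[[float], None] = time.sleep) -> str:
    """Call fetch() until it stops returning HTTP 429.

    fetch is a hypothetical callable returning (status_code, body).
    After the nth rate-limited try the loop waits n seconds.
    Any status other than 429 is handed back to the caller as-is.
    """
    for attempt in range(1, max_tries + 1):
        status, body = fetch()
        if status != 429:
            return body
        if attempt < max_tries:
            sleep(attempt)  # linear escalation: 1 s, 2 s, 3 s, ...
    raise RuntimeError(f"still rate limited after {max_tries} tries")
```

With 20 tries the worst case waits 1 + 2 + ... + 19 = 190 seconds before giving up, compared with 38 seconds for a fixed 2 second delay, which is the trade-off raised above between server load and getting the work done.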
Everything looks good today. Thank you. The only difference from before is that the output now appears alphabetically by target rather than sorted as in the parent page, but that's not a problem. Certes (talk) 10:13, 13 December 2023 (UTC)
- Because there were duplicates in the parent page I had to unique the list, which required a sort. I tried to unique it in a way that doesn't require a sort, i.e. cat file.txt | awk '!s[$0]++' > out.txt, but for some reason it dropped one of the entries. I didn't have time to investigate it so went with the tried and true method of sort file.txt | uniq > out.txt. You can try this yourself with the list of entries and see if the results differ in the number of entries on output compared to input. -- GreenC 15:43, 13 December 2023 (UTC)
- That sounds very reasonable. (sort -u may work on your system too.) Certes (talk) 16:33, 13 December 2023 (UTC)
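For comparison, an order-preserving de-duplication of the kind the awk one-liner attempts can be sketched in Python. The function name unique_in_order and the sample entries are illustrative only; this is not the script's actual code.

```python
from typing import Iterable, List


def unique_in_order(lines: Iterable[str]) -> List[str]:
    """Drop repeated lines while keeping the first occurrence in place."""
    seen = set()
    result = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            result.append(line)
    return result


# Illustrative list with one target repeated, as in the duplicate-entry bug
entries = ["hamlet", "Buenos Aires", "hamlet", "Apple"]
deduped = unique_in_order(entries)

# Same set of entries as a sort-based approach, but in the original order
assert sorted(deduped) == sorted(set(entries))
```

Comparing the sorted output of this function with the output of the sort-based method is one way to check whether the two approaches really differ in the number of entries.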
Buck Goldstein
Hi there! In this edit, your bot changed an incorrect |url= parameter, which added the article to Category:CS1 errors: URL. Should the bot have done something different, or should it ignore the |url= parameter and only update the |archiveurl= / |archive-url= parameter? Thanks! GoingBatty (talk) 06:02, 18 December 2023 (UTC)
- You mean Special:Diff/1187499427/1190066019. The bot that runs this process is a global bot. It is not programmed to handle templates in different languages; it only operates on the URL itself, without template knowledge. The bot didn't do anything wrong that wasn't already there; its only purpose is to normalize archive.today URLs wherever they happen to be. If that caused the pre-existing error to be exposed in the tracking cat, it's a step forward. -- GreenC 06:32, 18 December 2023 (UTC)
Preserving the correct archived version of archive.today links
In this edit, WaybackMedic 2.5 attempted to reformat a link to archive.today that had multiple different archives, but used the archive of the wrong date. The pre-existing link https://archive.is/2Ljk6 is an archive from 24 November 2023. The link should have been converted to http://archive.today/2023.11.24-014538/https://www.bloomberg.com/press-releases/1999-11-08/pokefans-can-now-eat-their-hearts-out-with-candy-planet-s (the "long link" for the page), but was instead converted to https://archive.today/20231124014538/https://www.bloomberg.com/press-releases/1999-11-08/pokefans-can-now-eat-their-hearts-out-with-candy-planet-s , which corresponds to the 6 December 2023 archive. This resulted in the new archive link leading to an archive of a 404 page instead of the successfully archived page, and the archive-date parameter not matching the timestamp on the page or in the long URL.
Ideally, the bot would notice when the new URL's archive date does not match the old URL's archive date and not make the edit if it cannot resolve this. Also, ideally it would catch when the citation template's archive-date doesn't match the URL's archive date, and either adjust the template's archive-date or display some kind of warning. SnorlaxMonster 12:09, 1 January 2024 (UTC)
- Actually, there also appears to be an issue on archive.today's end. While the page https://archive.md/2Ljk6 does have a share option that says that http://archive.today/2023.11.24-014538/https://www.bloomberg.com/press-releases/1999-11-08/pokefans-can-now-eat-their-hearts-out-with-candy-planet-s is the correct long URL, as it turns out, that long URL redirects to the 404 archive as well. In cases like that, I think WaybackMedic 2.5 should not change the URL to the long version, until archive.today corrects their long URLs for URLs with multiple archives. --SnorlaxMonster 12:12, 1 January 2024 (UTC)
- That's strange. Looks like a one-off error at archive.today .. never seen it before. I can't verify that every new long archive.today URL is the same, because the resource load on archive.today servers would double, as would the time it takes for the bot to finish. That might change if there were evidence of a widespread problem, but in 7 years and over half a million conversions this is the first time it's been reported. All I can do for now is add a static string to the code to skip processing when it sees 2Ljk6. Other tools, like IABot or possibly Citation Bot, might try to do the same conversion. This is a tricky problem to solve long term. Ideally archive.today would be notified; that is the correct solution. -- GreenC 19:23, 1 January 2024 (UTC)
- I notified archive.today about the specific issue with the long URL via their "report bug or abuse" button, but I have no idea how likely those reports are to get read. I think just manually excluding that specific case is the best option for now.
- With regards to validating that the target page is the same, I think it should be as simple as checking the timestamp is the same (ignoring that bug I mentioned in my second message, where the long URL can redirect to the wrong version). I assume whatever API you're using to get the long URL from the short URL returns the archive date of the short URL in the request you are already making; the long URL has the archive date in the URL itself, so to me it seems like it should be possible to validate that the archive date hasn't changed by just comparing those two values, without needing any additional API requests to archive.today. But I also don't know what code your bot uses, so I can't verify my assumptions about how it works. (I tried taking a look at the GitHub page linked on User:GreenC/WaybackMedic 2.5, but it appears that it is for Wayback Medic 2.1 and doesn't include the fixarchiveis function that's included in Wayback Medic 2.5.) --SnorlaxMonster 13:22, 2 January 2024 (UTC)
- There is no API for this. You download the HTML of the short URL page, and the long form is there towards the top (view source search on "long link"). The GitHub code is old, but you can see it here at line 173. If the long form URL goes to a different version of the HTML, as in this case, I would need to download both the short and long HTML page, and run a string comparison to see if they are approximately the same HTML. Thus downloading HTML twice. -- GreenC 22:28, 2 January 2024 (UTC)
- Ah okay, I suspected it could just be plain web scraping. Anyway, what I was trying to suggest was just comparing the date in the URL with the date on the HTML page (so there would be no need to resolve the long link). However, I had missed that the date in the long URL you retrieved was the correct one; the issue was entirely that archive.today redirects it. --SnorlaxMonster 23:34, 2 January 2024 (UTC)
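The timestamp comparison suggested in this thread could be sketched as follows. This is a hypothetical helper, not the bot's fixarchiveis function: the names capture_time and matches_archive_date are made up, and the pattern only covers the two long-form layouts and the host names quoted in this discussion. It compares timestamps embedded in URLs only, so it would not catch the case described above where archive.today itself redirects a correct long URL to a different capture.

```python
import re
from datetime import date, datetime
from typing import Optional

# Long-form URLs carry the capture time right after the host, either
# dotted (2023.11.24-014538) or compact (20231124014538).
_LONG_FORM = re.compile(
    r"^https?://archive\.(?:today|is|md|ph)/"
    r"(\d{4})\.?(\d{2})\.?(\d{2})-?(\d{6})/"
)


def capture_time(long_url: str) -> Optional[datetime]:
    """Return the capture time embedded in a long-form URL, else None."""
    match = _LONG_FORM.match(long_url)
    if match is None:
        return None  # short link or unrecognised layout
    try:
        return datetime.strptime("".join(match.groups()), "%Y%m%d%H%M%S")
    except ValueError:  # digits that do not form a real date, e.g. month 13
        return None


def matches_archive_date(long_url: str, archive_date: date) -> bool:
    """True when the URL's capture date equals the citation's archive-date."""
    captured = capture_time(long_url)
    return captured is not None and captured.date() == archive_date


target = ("https://www.bloomberg.com/press-releases/1999-11-08/"
          "pokefans-can-now-eat-their-hearts-out-with-candy-planet-s")
dotted = "http://archive.today/2023.11.24-014538/" + target
compact = "https://archive.today/20231124014538/" + target

# Both layouts name the same capture; a 6 December archive-date would not match
assert capture_time(dotted) == capture_time(compact)
assert matches_archive_date(compact, date(2023, 11, 24))
assert not matches_archive_date(compact, date(2023, 12, 6))
```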
bug report
At this edit, GreenC bot copied a malformed wayback machine url from |url= into |archive-url=. It ought not to have done it like that.
The wayback machine url is malformed because its timestamp is not an acceptable length (14 digits preferred, 4 or 6 tolerated). cs1|2 emits an error message for single-digit timestamps and another error message when the values assigned to |url= and |archive-url= are the same.
Trappist the monk (talk) 01:46, 30 January 2024 (UTC)
- Also, not clear where |archive-date=2007-06-15 came from.
- Trappist the monk (talk) 01:49, 30 January 2024 (UTC)
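The two conditions described in this report can be checked before an edit is saved. The sketch below is only an illustration under stated assumptions: the function name archive_url_problems and its messages are hypothetical, the accepted lengths (14, 4 or 6 digits) are taken from the comment above, and it is neither the cs1|2 module's logic nor the bot's.

```python
import re
from typing import List

# Timestamp digits, then an optional two-letter modifier such as "id_".
_WAYBACK = re.compile(r"^https?://web\.archive\.org/web/(\d+)(?:[a-z]{2}_)?/")


def archive_url_problems(url: str, archive_url: str) -> List[str]:
    """List reasons why archive_url should not be written to |archive-url=."""
    problems = []
    if url == archive_url:
        problems.append("url and archive-url are identical")
    match = _WAYBACK.match(archive_url)
    if match is None:
        problems.append("archive-url is not a recognisable Wayback Machine URL")
    elif len(match.group(1)) not in (4, 6, 14):
        problems.append("timestamp has %d digits" % len(match.group(1)))
    return problems


# A made-up malformed URL with a single-digit timestamp, used for both values
bad = "https://web.archive.org/web/2/http://example.com/"
assert archive_url_problems(bad, bad) == [
    "url and archive-url are identical",
    "timestamp has 1 digits",
]
```

An empty list means neither condition was triggered; anything else would be a reason to skip the edit and log it.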
Bug report: Incorrect archive-date
Hi there! In this edit, the bot added |archive-date=18990101080101. Is there something you could add to the bot to prevent the addition of incorrect dates such as this? Thanks! GoingBatty (talk) 18:22, 30 January 2024 (UTC)
- I do have warnings but apparently was lazy and forgot to check the logs. -- GreenC 20:08, 30 January 2024 (UTC)
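A plausibility check of the kind requested here might look like the sketch below. It is hypothetical, not the bot's warning code: the function name is made up, and the 1996 lower bound is an assumption based on the Internet Archive having started collecting web pages that year.

```python
from datetime import datetime, timezone
from typing import Optional

# Assumption: no web archive capture predates 1996.
EARLIEST = datetime(1996, 1, 1, tzinfo=timezone.utc)


def plausible_archive_timestamp(ts: str, now: Optional[datetime] = None) -> bool:
    """True for a 14-digit timestamp that falls between 1996 and now (UTC)."""
    if len(ts) != 14 or not ts.isdigit():
        return False
    try:
        when = datetime.strptime(ts, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    except ValueError:  # digits that do not form a real date and time
        return False
    if now is None:
        now = datetime.now(timezone.utc)
    return EARLIEST <= when <= now


# The value from this report is rejected; an ordinary capture time passes
assert not plausible_archive_timestamp("18990101080101")
assert plausible_archive_timestamp("20150122173241")
```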
bug report (2)
Category:CS1 errors: archive-url recently bloomed. I have just fixed these four articles broken by Wayback Medic 2.5:
Every error was a |archive-date= mismatch with the |archive-url= timestamp. |archive-date= was always off by one day; always earlier than the time stamp except for this one from 2024 Noto earthquake.
Trappist the monk (talk) 18:57, 1 February 2024 (UTC)
- And then there is this one that is off by a couple of weeks, this one off by a year. So it looks like what I wrote above may not hold much water...
- Trappist the monk (talk) 19:08, 1 February 2024 (UTC) 19:37, 1 February 2024 (UTC)
The date mismatch error preexisted. The bot only made it more obvious, so that CS1|2 error-checking is now able to see it. I would prefer to fix the archive-date at the same time as expanding archive.today URLs from short to long form (per RfC requirement). However, this task is universal: it operates on many language wikis and has no knowledge of template names or arguments in other languages. It only expands a URL wherever it may be; it doesn't look at templates. Fixing the dates would require another universal bot I guess, one that can operate on CS1|2 templates in multiple languages. If you want to write one, I have the approval to run it. The reason the dates are frequently offset by 1 day: users add an archive.today link they just created and set |archive-date= to the date at their own location, but archive.today uses UTC, which has already passed into a new day. The ones offset by a week or year are user entry errors. -- GreenC 21:49, 1 February 2024 (UTC)
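The one-day offset explained above can be reproduced with a short sketch. The function name and the two example editors are hypothetical; the only assumption carried over from the discussion is that the timestamp in the archive URL is in UTC.

```python
from datetime import datetime, timedelta, timezone


def utc_archive_date(saved_at: datetime) -> str:
    """ISO date of a timezone-aware save time, expressed in UTC."""
    return saved_at.astimezone(timezone.utc).date().isoformat()


# Hypothetical editor at UTC-5 saving a page late in the evening:
# the local date is one day earlier than the UTC date in the URL.
evening = datetime(2024, 1, 31, 21, 30, tzinfo=timezone(timedelta(hours=-5)))
print(evening.date().isoformat())   # 2024-01-31
print(utc_archive_date(evening))    # 2024-02-01

# Hypothetical editor at UTC+9 saving a page early in the morning:
# here the local date is one day later than the UTC date.
morning = datetime(2024, 1, 2, 3, 0, tzinfo=timezone(timedelta(hours=9)))
print(morning.date().isoformat())   # 2024-01-02
print(utc_archive_date(morning))    # 2024-01-01
```

Editors west of UTC therefore tend to enter a date one day early and editors east of UTC one day late, which fits a pattern where nearly all mismatches run in one direction with an occasional exception.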
- User:Trappist the monk: I have written a separate bot that fixes the date mismatch error populating Category:CS1 errors: archive-url. Example Special:Diff/1248926553/1248972462. It retrieves the date from the "suggested" date, generated by CS1|2 in the HTML warning message. This way it can run on other language wikis without needing to deal with language differences. It falls back to ISO mode if it can't get a suggestion. Do you think it is OK to rely on the "suggested" date generated by CS1|2? -- GreenC 14:25, 2 October 2024 (UTC)
- The suggested date is simply the date portion of the archive-url timestamp formatted according to the format specified by |df= → the global {{use xxx dates}} → format of the date in |archive-date= → YYYY-MM-DD. Getting the date from the html seems a reasonable thing to do; the grunt work has already been done.
- Trappist the monk (talk) 15:05, 2 October 2024 (UTC)
bug report (3) Bot ignores cbignore
Here [2] I noticed that the bot edited an external link with cbignore after it. I compared the links before and after the edit to see why the cbignore template was there. The long and short links are from different dates and display different content. The altered link no longer contained the relevant content. This would not matter if the bot observed the cbignore. --198.111.57.100 (talk) 17:05, 4 June 2024 (UTC)
- OK this problem is complicated. There are multiple things going on.
- All short-form archive.today links need to be expanded to long form. This is required as Wikipedia does not allow URL shortening which has security problems.
- Archive.today has a bug. When saving links from WebCite, it incorrectly gives the long form.
- Incorrect: http://archive.today/UfV6G --> https://archive.today/20121120012223/http://romeoareateaparty.org/wordpress/2012-candidates-2/races/u-s-senate/
- Correct: http://archive.today/UfV6G --> https://archive.today/20121120012223/https://www.webcitation.org/6CIutMLaZ?url=http://romeoareateaparty.org/wordpress/2012-candidates-2/races/u-s-senate/
- Notice the "Correct" version includes the original WebCite URL. The "Incorrect" version excludes the WebCite URL.
- GreenC bot has a bug in that it can't see cbignore when making these changes.
- GreenC bot has a bug in so far as it doesn't detect the Archive.today bug
- So I need to make some adjustments to work around the Archive.today bug. I also need to report the bug to Archive.today though there is no guarantee they will fix it. -- GreenC 17:28, 4 June 2024 (UTC)
- Update the bug is reported to Archive.today -- GreenC 18:14, 4 June 2024 (UTC)
- Archive.today fixed it. -- GreenC 21:01, 4 June 2024 (UTC)
- Thank you!--198.111.57.100 (talk) 16:27, 6 June 2024 (UTC)
Please don't convert old Google patents links to archive.today
This is a very unhelpful change: special:diff/1227937929. The links on the archived page to PDFs and drawings all 404, meaning that the actual content of the patent is not accessible. Nor are any of the other features originally presented by Google patent search. This type of archive page should not ever be used for patents. You should either fix the Google patent URLs, which is fairly trivial (you can see the fix for this page at special:diff/1227941924), or switch to links to the US patent office or similar.
Can you please revert or properly fix all of the similar recent edits you have made across Wikipedia? (Judging from your recent contribution list it seems like there were a lot.) Otherwise you're just creating work for someone else / leaving confused readers. jacobolus (t) 16:41, 8 June 2024 (UTC)
- 1. You should post this in the forum linked in the edit summary: WP:URLREQ#google.com/patents - that's the community forum for this task that everyone is reading.
- 2. There is nothing my bot can't do. And there is nothing that is permanent or can't be changed or undone. Do not panic or become upset.
- 3. Give me details. I will do it. But I need information. You gave a diff saying it's trivial, but how do I determine that https://archive.today/20121211035219/http://www.google.com/patents?id=lvNwAAAAEBAJ is the same as https://patents.google.com/patent/US417831A? There is a code in the second URL that does not exist in the first URL.
- Anyway, please follow at URLREQ so others can know what's going on. -- GreenC 16:52, 8 June 2024 (UTC)
Job 18 showing up in WPCleaner
I'm running the WPCleaner and noticed that Error 95 (Editor's signature of link to user space) has flagged the bot, specifically Job 18, on a ton of pages (Arundhathi Subramaniam is one to give an example). It looks like the bot signature is in the "reason" field of the template
{{verify source |date=September 2019 |reason=This ref was deleted Special:Diff/893567847 by a bug in VisualEditor and later restored by a bot from the original cite located at Special:Permalink/893405019 cite #4 - verify the cite is accurate and delete this template. [[User:GreenC bot/Job 18]]
I don't have a count of the pages, but it's not an insignificant amount from what I can see. Lindsey40186 (talk) 02:16, 11 June 2024 (UTC)
- I don't know about WPCleaner, or what the error message means. It was an old bot job, that no longer runs. It was a peculiar and difficult situation. -- GreenC 03:56, 11 June 2024 (UTC)
Typo
After Wikipedia:Link_rot/URL_change_requests#deccanchronicle.com, the bot is adding links to Deccan Chronical instead of Deccan Chronicle. See [3] and [4]. DareshMohan (talk) 18:59, 14 June 2024 (UTC)
- Oh sheesh, thanks. Fixed Special:Diff/1228320785/1229089609 in 829 pages . -- GreenC 20:17, 14 June 2024 (UTC)
Thanks
Hey, I just want to say thank you for using the Wayback Machine for MTV News for my citations. Can you do that for Drag-On's album Hell and Back? I'll post the original link. JuanBoss105 (talk) 13:30, 2 July 2024 (UTC)
- Hey, I found a link to a MTV.com source that can be used for Rocafella. Can you add it using the wayback machine?
- https://www.mtv.com/news/c1psz3/state-property-members-stress-independence-dont-take-orders&ved=2ahUKEwiS1cGYwIiHAxUdD1kFHf0oCVYQFnoECCIQAQ&usg=AOvVaw1m9yMSZqvcQC7xuV2PKS9D JuanBoss105 (talk) 13:53, 2 July 2024 (UTC)
- User:JuanBoss105: I found an archive URL with a different source URL: https://web.archive.org/web/20150122173241/http://www.mtv.com/news/1498885/state-property-members-stress-independence-dont-take-orders/
- I found it using the archive's search feature: Search: "State Property Members Stress Independence".
- You can find other archive URLs at MTV.com this way.
- For example in Special:Diff/1231668617/1232196891 you added https://www.mtv.com/news/v0uzg8/norah-jones-tops-a-mil-at-1-kanye-west-settles-for-2 you can find the archive URL by going to this search page: Search: "Norah Jones tops a mil". -- GreenC 16:07, 2 July 2024 (UTC)
Tampabay.com
Stop running this right now on tampabay.com links. Every one I've checked is wrong. It is adding archive links (okay) to currently live articles, and tagging them as dead (wrong). It is also overriding explicit |url-status=dead with |url-status=live when it encounters redirects to the main page of tampabay.com. Tired of fixing these because GreenC bot is on a roll. ▶ I am Grorp ◀ 00:21, 12 July 2024 (UTC)
- Clarification: Not every single instance, but too many, for sure. ▶ I am Grorp ◀ 00:31, 12 July 2024 (UTC)
Oh shoot, looks like they used an exotic redirect mechanism, it fooled the bot. I have a way around it, but this is the first I became aware of it. I'll have to reprocess. Anyway, thanks for the info. BTW you should post error reports in the section linked in the edit summary, that is the discussion for this job. -- GreenC 00:38, 12 July 2024 (UTC)
- @GreenC: That was gibberish to me so I found this talk page. I just now put a link from there to here. You're welcome to copy this over there, and delete this thread, if that makes more sense. I'll watchlist both. ▶ I am Grorp ◀ 00:42, 12 July 2024 (UTC)
- Not all of the edits were incorrect or needed correcting. If you want a list of which ones I corrected, then they're in my contributions list from 22:10, 11 July 2024 to 00:37, 12 July 2024 (UTC). All but the first of my corrections has "GreenC bot" in the edit summary. (I edit in a topic area that relies heavily on tampabay.com, many of which are on my watchlist.) ▶ I am Grorp ◀ 00:53, 12 July 2024 (UTC)
Grorp,
- Special:Diff/1233941553/1233989259 - this appears to be a one-off, maybe a network transient. When I run the page again (locally) the problem does not happen. I'd be surprised there are more like this. It can happen but I don't think it's systematic or common. If you see more, let me know.
- Special:Diff/1233948702/1233989465 - exotic redirect problem noted above
- Special:Diff/1233957098/1233990527 - ditto
- Special:Diff/1233959661/1233991011 - archive.today I manually verify beforehand. This one is a manual verification error, which is rare, but not impossible. I can provide a list of the archive.today URLs that were added (193).
I can redress the exotic redirect, which looks to be limited to URLs ending in .ece -- GreenC 01:29, 12 July 2024 (UTC)
- Update: I found 29 instances of the exotic redirect, among the set of 6,846 pages, or less than 1/2 of one percent. Of the archive.today error, there was one in 193, or about the same 1/2 of one percent. Thanks for the report, find any other problems let me know. -- GreenC 02:42, 12 July 2024 (UTC)
- Thanks. Will do. ▶ I am Grorp ◀ 05:45, 12 July 2024 (UTC)
I have no idea how to decipher/restore/resurrect these old pqarchiver links (like in your fourth example above). If there's a writeup, or some tips, please point me in the right direction. I do come across these fairly regularly in this topic area I edit; many point to old sptimes.com news articles (St Petersburg Times was bought out by Tampa Bay Times). If there is any way I can resurrect an actual copy of some of these old articles, I'd like to try to fix some of them. ▶ I am Grorp ◀ 05:45, 12 July 2024 (UTC)
- I found 63 pqarchiver links (out of the 193 archive.today links added) and they all worked, except this one. If it doesn't exist at archive.org or archive.today it's probably gone forever; you would probably need to find an alternate source. -- GreenC 06:09, 12 July 2024 (UTC)
Other wikis
Do you ever deploy the bot to other wikis to assist with link maintenance and updates? Imzadi 1979 → 18:20, 28 July 2024 (UTC)
- It's a very big job to internationalize the bot for templates, dates etc - I'd like to eventually. But it does update links in the IABot database (iabot.org), and IABot then updates 300+ wikis based on the contents of the database. Thus when my bot discovers a dead link on enwiki, it updates enwiki adding an archive URL, then also updates the IABot database changing the status to "dead" and adding the archive URL into the database. Then IABot scans the 300+ other wikis and when it finds that link, it adds the archive URL, taken from the database. -- GreenC 18:55, 28 July 2024 (UTC)
- I was curious if it would work on the AARoads Wiki, which uses the same templates as the English Wikipedia, so no internationalization needed. Imzadi 1979 → 19:12, 28 July 2024 (UTC)
- IABot would be better since it continuously scans pages and fully automatically replaces dead links. WaybackMedic does more specialized work on a per-domain basis for many types of issues with manual oversight. A good place to post a request is https://meta.wikimedia.org/wiki/User_talk:InternetArchiveBot -- GreenC 20:49, 28 July 2024 (UTC)
bot destructive
I just had to do a manual purge on Eyjafjallajökull after the bot had visited, as the page was displaying, from the top, "The time allocated for running scripts has expired." repeated three times. This is a complex page calling in a couple of data-rich templates, usually rendered well within the normal parsing allowance of 10 seconds, but if the Wikipedia infrastructure is under load it can fail on an edit. The bot accordingly needs a (manual?) check of page output after every use. Often the failure is towards the end of such a page, in the references, so it is only obvious on a full-page manual skim. Please ensure you do this, as many high quality pages have reference lists running into the hundreds with processing times around the 5 second mark. ChaseKiwi (talk) 21:16, 3 August 2024 (UTC)
Bug report - templates in images in infoboxes
Just wanted to flag Special:Diff/1239809626, doesn't seem to recognise there's a template in that URL. Primefac (talk) 12:03, 12 August 2024 (UTC)
- Oops my regex was stopping at "}" instead of "{" had it reversed. Thanks. -- GreenC 18:23, 12 August 2024 (UTC)
Job 15 GA mismatches stoppage
User:GreenC bot/Job 15 (GA mismatches) has stopped after Wikipedia:Good articles/all was edited. Adabow (talk) 10:07, 13 August 2024 (UTC)
- User:Adabow, because of Special:Diff/1229147724/1237436963 by User:Beland. The bot is not aware of Wikipedia:Good articles/all2. It aborted because the number of entries in Wikipedia:Good articles/all is below a magic number ie. it looks suspicious. Everything worked, except I neglected to add an email reminder (only logs) so I didn't notice. Thanks for the ping. -- GreenC 16:17, 13 August 2024 (UTC)
- User:Beland could you verify the lists are correct? There appears to be duplication at the top with two table of contents, for example two entries for "Agriculture, food, and drink". There is also a line that says "View the entire list of all good articles or" in which points to Wikipedia:Good articles/all .. is that still accurate? -- GreenC 16:22, 13 August 2024 (UTC)
- The duplicate TOCs were being transcluded from the per-topic pages. I suppressed them with "noinclude" tags. The link from subpages still points to /all, but once readers get there they will see "all" is split between /all and /all2. I think that's probably fine for now, unless we want to just stop altogether with combining multiple per-topic pages into one or two massive scrollable lists. -- Beland (talk) 20:33, 13 August 2024 (UTC)
- I think this change could break three bots: FACBot, LivingBot, and GreenC bot. There is a message in the page that says changes to the page layout will break the bots (GreenC bot is not mentioned; I will add it later). Bots should be notified and given time to adjust. (Looks like the two bots were notified, ty.) There might be other tools and bots as well. -- GreenC 16:34, 13 August 2024 (UTC)
- Actually it looks like the creation of "all2" was in February: Special:Diff/1066123344/1229147724 .. so my bot has not been running properly since. Trying this to better communicate: Special:Diff/1237436963/1240124928 -- GreenC 16:46, 13 August 2024 (UTC)
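The failure mode described in this thread (the run aborts when the entry count falls below a magic number, but only a log line is written) can be sketched roughly as below. This is an illustration only, not the bot's actual code: `MIN_EXPECTED_ENTRIES`, `check_entry_count`, and the `notify` callback are hypothetical names, and the threshold value is invented because the real one is not given here.

```python
from typing import Callable, List

# Hypothetical threshold; the bot's real "magic number" is not stated in this thread.
MIN_EXPECTED_ENTRIES = 30_000


def check_entry_count(entries: List[str],
                      notify: Callable[[str], None],
                      minimum: int = MIN_EXPECTED_ENTRIES) -> bool:
    """Return True if the list looks complete; otherwise notify and return False.

    The notify callback stands in for whatever alert channel is used
    (email, talk page message, ...), so an abort is never logged silently.
    """
    if len(entries) < minimum:
        notify(f"GA list has only {len(entries)} entries "
               f"(expected at least {minimum}); run aborted")
        return False
    return True


if __name__ == "__main__":
    alerts: List[str] = []
    ok = check_entry_count(["Article A", "Article B"], alerts.append, minimum=3)
    print(ok, alerts)
```

Passing the alert channel in as a callback keeps the suspicious-count check and the reminder in one place, so one cannot be added without the other.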
Broke 139 archive.ph links! They are clearly labeled.
Your bot took the url= with the live link & altered 139 archive-url= that had archive.ph links & changed them to "archive.today/[url of live link]". Not only is archive.today a DEAD site, but all my archive links were live.
This is an error when you visit that site:
This site can’t be reached https://archive.today/ is unreachable.
ERR_ADDRESS_UNREACHABLE
What is the purpose of this? ĆɧƥĘÉŹÉÉ (talk) 23:35, 15 September 2024 (UTC)
- Archive.today is not dead it works fine for me and everyone else. Your local machine's DNS resolver is having temporary problems. See Archive.today#Cloudflare_DNS_availability. Use a different DNS resolver and the problem will be solved. Please use archive.today it is the main gateway host to the site, which redirects to one of the backend sites like archive.ph .. the site is literally called "Archive.today" not "Archive ph", the .ph is an internal thing they do to protect against domain name hijacking. -- GreenC 00:15, 16 September 2024 (UTC)
Archive.today isn't accessible from Italy
Hi, I saw your bot replaced archive.is links with the respective archive.today ones in some pages on Italian Wikipedia (here is an example). However archive.today redirects to archive.ph, which has apparently been blocked by Italian Internet providers after being reported by police for hosting illegal content. This is a screenshot I took and here are other people talking about it. I wanted to warn you about this because now archived URLs aren't accessible and can't be checked without using proxies. Hope you can fix this. Un mondo a stelle e strisce (talk) 15:55, 17 September 2024 (UTC)
- User:Un mondo a stelle e strisce, thank you for this information. Archive.today has problems sometimes. They created multiple domains: archive.is, .fo, .li, .today, .vn, .md, .ph .. do you know if all are blocked in Italy? I read the discussion (6 months old) and this appears to be something done by the postal police? You could also try using a different DNS resolver that isn't going through Cloudflare; this is the problem for most people, due to a policy disagreement between Archive.today and Cloudflare -- GreenC 16:36, 17 September 2024 (UTC)
- archive.ph is the only one blocked, the others are all fine and working except for .today that redirects to it and therefore isn't accessible, too. According to the warning displayed when trying to reach the address, postal police took this measure because they found pedopornographic content on the website. I don't think the problem has anything to do with Cloudflare, as the page is still accessible via proxy. Un mondo a stelle e strisce (talk) 21:12, 17 September 2024 (UTC)
- If you want, we can change everything to .is or whichever. In the mean time, I have disabled the twice-monthly process that converts everything to .today -- GreenC 21:24, 17 September 2024 (UTC)
- Yes, replacing things with .is would be great, thanks for your help. Un mondo a stelle e strisce (talk) 08:26, 18 September 2024 (UTC)
- User:Un mondo a stelle e strisce, changed the first 3,000 pages, which is about 10%, then wait time before continuing (example). -- GreenC 01:32, 26 September 2024 (UTC)
User:Un mondo a stelle e strisce, this job is complete. Keep in mind, archive.today will continue to be added in many ways, by editors and bots. If you want to clear them out again, drop me a note. Or if this ban is ever lifted, drop me a note. Cheers. -- GreenC 15:18, 17 October 2024 (UTC)
- Yes, I'll let you know about eventual further developments. Thanks very much for your help. Un mondo a stelle e strisce (talk) 16:14, 17 October 2024 (UTC)
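The domain conversion discussed in this thread can be sketched roughly as follows. This is a sketch under stated assumptions, not the bot's implementation: the function name `normalize_archive_host` and the regex are hypothetical, the domain list is the one given above (archive.is, .fo, .li, .today, .vn, .md, .ph), and only URLs with an explicit http(s) scheme and a path are handled.

```python
import re

# Mirror TLDs named in the thread above.
ARCHIVE_TODAY_TLDS = ("is", "fo", "li", "today", "vn", "md", "ph")

# Match "http(s)://archive.<tld>" only when it is the host itself,
# i.e. directly after the scheme and directly before a "/".
_ARCHIVE_HOST_RE = re.compile(
    r"(?P<scheme>https?://)archive\.(?:" + "|".join(ARCHIVE_TODAY_TLDS) + r")(?=/)",
    re.IGNORECASE,
)


def normalize_archive_host(wikitext: str, target: str = "archive.is") -> str:
    """Rewrite every Archive.today mirror host in wikitext to the target host."""
    return _ARCHIVE_HOST_RE.sub(lambda m: m.group("scheme") + target, wikitext)


if __name__ == "__main__":
    sample = "|archive-url=https://archive.today/abc12 |url=https://example.com/archive.ph/x"
    print(normalize_archive_host(sample))
```

Making the target host a parameter means the same routine could convert to .today on one wiki and to .is on another, such as one where .ph is blocked.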
"url-status=usurped" causes a CS1 message
Hi GreenC!
I just noticed that the GreenC bot has flagged many refs as part of an effort to combat the passive spamming of the Judi gambling syndicate.
However, | url-status=usurped is currently causing a CS1 maintenance message. I am seeing these messages because I opted to make them visible through my common.css. Normally, they can not be seen.
When I preview a page with a usurped ref, it shows this warning at the top:
- "Script warning: One or more (...) templates have maintenance messages; messages may be hidden (help)."
Also, with me, the altered refs have this bit tagged at the end:
- "CS1 maint: unfit URL (link)"
See Category:CS1 maint: unfit URL, which currently has 48,594 entries.
Again, the maintenance message is normally not visible, not even to logged-in users. So this isn't an acute problem.
I believe the maintenance message is shown incorrectly. If the URL has been usurped, but the original page was properly archived, then the ref as used on Wikipedia is probably not "unfit", right? What can be done about this?
Cheers, Manifestation (talk) 19:09, 24 September 2024 (UTC)
- It looks like we are tracking all usages of unfit/usurped even legitimate uses and this automatically creates a maintenance message. I don't know what the rationale is. Maybe someone wants to know where the usurped URLs are? -- GreenC 19:35, 24 September 2024 (UTC)
- I have started a thread about this at Help talk:Citation Style 1. This has to be a bug. Cheers, Manifestation (talk) 19:41, 25 September 2024 (UTC)
Oil for your bot
A hard working bot deserves a refreshing glass of motor oil! Big Blue Cray(fish) Twins (talk) 09:26, 18 November 2024 (UTC)
Backlinks report not updating
Hi there! After several months taking a break from Backlinks, I've recently started using the report again. I noticed that the bot last created a report on December 3. Could you please jump start the bot for me? Thanks! GoingBatty (talk) 17:03, 6 December 2024 (UTC)
- Thanks for restarting the bot! GoingBatty (talk) 05:03, 7 December 2024 (UTC)
- User:GoingBatty, no problem, thanks for the report, because this problem (a missing symbolic link, the result of moving to a new computer) was having an effect on multiple tools. -- GreenC 17:41, 8 December 2024 (UTC)
Question about the Wikipedia:Good_articles/all page
Please see my question at Wikipedia_talk:Good_articles/all#Question to the bots. (I wrote it there because I am asking the same question to 3 bots.) Thank you. Prhartcom (talk) 20:03, 9 December 2024 (UTC)
bot "reformatting" valid dates to URL strings
I noticed a few edits such as this one where the bot replaced the date with a portion of the URL. Looks like it's specifically happening with webcitation.org URLs that don't have actual dates within the URL itself. = paul2520 💬 19:57, 18 December 2024 (UTC)
- User:Paul2520, problem in the bot fixed. I see you corrected the pages, about 27. -- GreenC 21:34, 18 December 2024 (UTC)
- (BTW that string is actually a date encoded in base62 - this repo will decode) -- GreenC 21:37, 18 December 2024 (UTC)
- TIL! Thanks for clarifying (and fixing). = paul2520 💬 22:19, 18 December 2024 (UTC)
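For readers curious about the base62 point, here is a minimal decoding sketch. It is not the code from the repo mentioned above. It assumes the alphabet order `0-9`, `A-Z`, `a-z` and that the decoded integer is a count of microseconds since the Unix epoch; both are assumptions to check against real webcitation.org IDs. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Assumed alphabet order: digits, then uppercase, then lowercase.
BASE62_ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def base62_decode(text: str) -> int:
    """Decode a base62 string to an integer (raises ValueError on bad characters)."""
    value = 0
    for ch in text:
        value = value * 62 + BASE62_ALPHABET.index(ch)
    return value


def base62_encode(number: int) -> str:
    """Inverse of base62_decode, included so the round trip can be checked."""
    if number == 0:
        return BASE62_ALPHABET[0]
    digits = []
    while number:
        number, rem = divmod(number, 62)
        digits.append(BASE62_ALPHABET[rem])
    return "".join(reversed(digits))


def webcite_id_to_datetime(archive_id: str) -> datetime:
    """Interpret the ID as microseconds since the epoch (assumption) and return UTC."""
    return EPOCH + timedelta(microseconds=base62_decode(archive_id))


if __name__ == "__main__":
    stamp = datetime(2010, 2, 1, 12, 30, tzinfo=timezone.utc)
    micros = int((stamp - EPOCH).total_seconds()) * 1_000_000
    encoded = base62_encode(micros)
    print(encoded, webcite_id_to_datetime(encoded).isoformat())
```

Using `timedelta(microseconds=...)` instead of dividing by a million keeps the arithmetic in integers, so no precision is lost to floating point.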
Bot makes error when seeing non-CS1 already-archived URLs
The domain xenu-directory.net was usurped, and a few days ago I manually checked/fixed all occurrences in mainspace. In the case where a citation uses the square bracket method like [https://web.archive.org/web/...etc. title-text] and the url is already a wayback machine URL, your bot is incorrectly adding {{usurped}} which renders for readers as little "usurped" tags. (The bot works just fine when the citation is CS1 template style.)
The following recent edits by GreenC bot include an incorrect tag (and it's still running and adding more):
- https://en.wikipedia.org/w/index.php?title=Aaron_Saxton&curid=26660450&diff=1264164606&oldid=1262817586
- https://en.wikipedia.org/w/index.php?title=APA_Task_Force_on_Deceptive_and_Indirect_Methods_of_Persuasion_and_Control&curid=7589571&diff=1264169964&oldid=1262815584
- https://en.wikipedia.org/w/index.php?title=Brain-Washing_(book)&curid=7881283&diff=1264176579&oldid=1262818821
- https://en.wikipedia.org/w/index.php?title=Citizens_Commission_on_Human_Rights&curid=20949376&diff=1264181897&oldid=1262810467
- https://en.wikipedia.org/w/index.php?title=Hubbard_v_Vosper&curid=37647003&diff=1264202456&oldid=1262815517
- https://en.wikipedia.org/w/index.php?title=Inside_Scientology:_How_I_Joined_Scientology_and_Became_Superhuman&curid=9497749&diff=1264203752&oldid=1262819362
- https://en.wikipedia.org/w/index.php?title=List_of_Masonic_buildings_in_the_United_States&curid=31726117&diff=1264221816&oldid=1262815969
I had to check about 2 dozen of your recent run (showed on my watchlist) and the above errors will need to be fixed by changing their citations into CS1 cite style. I really don't enjoy doing double work! Fix your bot to recognize when a citation is from archive.org instead of a usurped domain. ▶ I am Grorp ◀ 03:48, 21 December 2024 (UTC)
- It's correct. See documentation for {{usurped}}. -- GreenC 03:51, 21 December 2024 (UTC)
Bot edit ate a bunch of unrelated text
See [5]. The existing wikitext was obviously messed up (template inside an unclosed template inside of a reference), but the bot still shouldn't screw it up that badly. Jay8g [V•T•E] 05:41, 12 January 2025 (UTC)
- The nature of GIGO (Garbage In / Garbage Out). Not that I designed it that way. Probably it tried to find a closing }} and ate everything in between. Sometimes it might be a small amount other times a lot, depending on the article and location. I'd be happy to hear suggestions how to avoid this particular problem, or even better, someone to write a bot that detects and fixes these - AFAIK no one has ever been able to do it. I did make an attempt once and had some success but not fully automated. -- GreenC 06:13, 12 January 2025 (UTC)
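One possible guard against this kind of over-consumption, offered as a suggestion rather than a description of how the bot works: track `{{` / `}}` nesting depth when locating the end of a template, and refuse to edit when the braces never balance or when the matched span crosses a `</ref>`. The function names below are hypothetical, and the sketch deliberately ignores `{{{parameter}}}` syntax, `<nowiki>`, and comments.

```python
from typing import Optional, Tuple


def find_template_end(wikitext: str, start: int) -> Optional[int]:
    """Return the index just past the '}}' that closes the template opened at
    start, or None if the braces never balance."""
    if not wikitext.startswith("{{", start):
        return None
    depth = 0
    i = start
    while i < len(wikitext) - 1:
        pair = wikitext[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}":
            depth -= 1
            i += 2
            if depth == 0:
                return i
        else:
            i += 1
    return None


def safe_template_span(wikitext: str, start: int) -> Optional[Tuple[int, int]]:
    """Return (start, end) only when the template is balanced and does not
    swallow a closing </ref>; otherwise return None so the caller can skip it."""
    end = find_template_end(wikitext, start)
    if end is None or "</ref>" in wikitext[start:end]:
        return None
    return (start, end)


if __name__ == "__main__":
    good = "<ref>{{cite web |url=x |title={{lang|fr|y}} }}</ref> tail"
    bad = "<ref>{{cite web |url=x</ref> text {{other}} }}"
    print(safe_template_span(good, good.index("{{")))
    print(safe_template_span(bad, bad.index("{{")))
```

This only detects the problem so the page can be skipped or logged; it does not attempt to repair the broken wikitext, which is the part said above to be unsolved.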
Issues with comment inside reference
Happy New Year! I'm working through Category:CS1 errors: dates and ran across a couple edits by your bot like this one where the citation template has the |access-date= commented out for some reason, and your bot doesn't seem to be expecting that. Happy editing! GoingBatty (talk) 21:32, 14 January 2025 (UTC)
- GoingBatty, sorry, did you find many? I spent time looking at this, and have concluded the bot can't support this without significant work. For now I detect and skip the template. It can edit templates with wikicomments, just not edit fields within templates where the wikicomment co-exists. Also, I'm in the middle of a large batch that was started before this came up. Do you mind if I complete this batch? Hopefully it won't be too many. -- GreenC 03:02, 15 January 2025 (UTC)
- I found two: the one I mentioned earlier and this one. If I find a lot more, I'll let you know. Happy editing! GoingBatty (talk) 04:15, 15 January 2025 (UTC)