User:Nickj/Can We Link It

G'day All,

There's a link-suggesting tool that I'm temporarily putting out there for you all to have a play with, and to give feedback and comments on (either via email or on my talk page).

It takes an article of your choosing from the English Wikipedia, and suggests bits of text in that article that could potentially be linked. You can then accept or reject those individual suggestions, and then save your changes back to the Wikipedia. Please do have a look at WP:OVERLINK for some ideas about how to use your editorial judgment when selecting which links to employ. Also, before adding a wikilink to an article, follow each suggested link to the article that it points to, in order to avoid subtle mistakes. For example, many people have the same name; linking to a football player in a biochemistry article is probably not correct.

It tries to do this in a reasonably pleasant UI, where you see the list of suggestions, and then simply select "yes", "no", or "don't know" for each suggestion, and click "Preview with Added Links".

Quick overview of the UI

On the landing page, you type the name of the article that you want to see links for. It should appear in the list as you type (this bit uses suggestion searching).
Press enter or click the relevant link, then wait for up to 10 seconds for it to fetch the current version and suggest links, and then you'll be presented with a list of possible links that you can make.
To go through the list, you can either use the mouse, or you can use the keyboard.
For the keyboard, the keys are: Up arrow, Down arrow, "y" for yes, "n" for no, and "s" or "d" for skip/don't know.
"Yes" adds the link, "no" doesn't, and "don't know" doesn't add the link either; but "Don't know" will make the exact same link suggestion in future, whereas "yes" and "no" bring closure in that the same suggestion will no longer be made for that page in future.
If you don't make any choice for a suggestion, that's treated the same as choosing "don't know".
Each suggestion has a link that opens in a new tab/window, so if you want to determine whether something is an appropriate link or not, you can just click its link.
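
As a rough illustration of the choices described above, here is a small Python sketch. It is not the tool's actual code (which is PHP); the names `Choice`, `choice_for_key` and `outcome` are made up for the example, and treating an unrecognised key as "don't know" is an assumption.

```python
from enum import Enum
from typing import Optional, Tuple


class Choice(Enum):
    YES = "yes"
    NO = "no"
    DONT_KNOW = "don't know"


# Keyboard shortcuts from the list above: "s" and "d" both mean skip / don't know.
KEY_TO_CHOICE = {
    "y": Choice.YES,
    "n": Choice.NO,
    "s": Choice.DONT_KNOW,
    "d": Choice.DONT_KNOW,
}


def choice_for_key(key: Optional[str]) -> Choice:
    """Map a key press to a choice; making no choice counts as "don't know"."""
    if key is None:
        return Choice.DONT_KNOW
    # Assumption: any other key is treated like making no choice.
    return KEY_TO_CHOICE.get(key.lower(), Choice.DONT_KNOW)


def outcome(choice: Choice) -> Tuple[bool, bool]:
    """Return (add_link, suggest_again_for_this_page) for a choice."""
    add_link = choice is Choice.YES
    # Only "don't know" leaves the suggestion open for next time.
    suggest_again = choice is Choice.DONT_KNOW
    return add_link, suggest_again
```

So `outcome(choice_for_key("n"))` gives `(False, False)`: no link is added, and the suggestion is not made again for that page.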

If you want to play with it now, it's at: http://can-we-link-it.nickj.org/

Some caveats to be aware of

Currently only works for the English Wikipedia. Although I haven't tried it yet, conceptually similar languages like French should probably work (i.e. Left-to-Right, spaces between words to separate out ideas [no or quite limited compound words], generally use same characters in both article text and article names for the same idea, etc). No idea if this can be made to work for languages which differ substantially from this.
This site is a temporary experiment to see what happens, and whilst it has been up for over a month now, it may not stay up indefinitely.
Super-alpha status. It may blow up, eat your homework, key your vehicle, trash your favourite article, etc.
The tool will work much better if you have JavaScript turned on, and the front page won't work at all if you have JavaScript turned off.
It's SLOOOW (e.g. might take 7 seconds to generate suggestions for a 32 Kb page). It doesn't inherently have to be slow, but it is currently - partially because it's behind a DSL link, but mostly because it's not very efficient currently. I'd rather put out an early version though with some rough edges and slowness than wait until getting something perfect (which I may never get around to doing).
Currently the suggested links will include links to disambiguation pages. Shouldn't really do this (i.e. disambig pages should ideally be excluded from the results).
Someone reported that they saw the "null edit summary" detector complain at them when using this; not sure why this happens, as there is a default edit summary supplied.

Other things

It has "learning from its mistakes" functionality, in that a suggestion which is regularly rejected will no longer be suggested. The current cut-offs are that a suggestion must be rejected at least 5 times, and also rejected 50% or more of the time; once this threshold has been crossed the suggestion will no longer be made. Thus, the bad cruft should hopefully be progressively filtered out as the tool is used more, and what remains should hopefully be mostly useful.
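
The cut-off rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's real code: the function name `still_suggested` is hypothetical, and it assumes the rejection rate is calculated over "yes" and "no" votes only (ignoring "don't know").

```python
MIN_REJECTIONS = 5        # must be rejected at least 5 times ...
MIN_REJECTION_RATE = 0.5  # ... and rejected 50% or more of the time


def still_suggested(accepted: int, rejected: int) -> bool:
    """Return True while a suggestion has not crossed both cut-offs."""
    if rejected < MIN_REJECTIONS:
        return True
    # rejected >= 5 here, so the total is never zero.
    rejection_rate = rejected / (accepted + rejected)
    return rejection_rate < MIN_REJECTION_RATE
```

For example, a suggestion accepted 6 times and rejected 5 times is still shown (rejected less than half the time), whereas one accepted 5 times and rejected 5 times is not.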
There are some hidden switches, which you can add to the URL that shows the link suggestions, if you want to fiddle with stuff:
- The first is to add "&exhaustive" to the end of the URL, in which case it will stop trying to be smarter about suggesting links based on grammatical structure (e.g. by excluding single word links), and will be exhaustive about showing you the links it finds. This will result in roughly 4 times as many links being found.
- The second switch is that you can specify the number of characters to include in the "context" before and after the suggested link. The default is 60 characters, but you can set this to anything between 0 and 100 characters inclusive, such as by adding "&context=20" to the URL for 20 characters of context.
- Lastly you can specify to just check the wiki syntax. It performs some very simplistic checks on the wiki syntax automatically, that are all about balance (e.g. checks number of [[ equals number of ]] and so forth), and if an article's syntax looks invalid then it'll tell you what's wrong, but deliberately won't give you the link suggestions until you've fixed the syntax on the Wikipedia :-) However, if you don't want link suggestions, and only want syntax checking, then tack "&onlyCheckSyntax" onto the URL.
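
To give a feel for the kind of balance check described in the last switch, here is a minimal Python sketch. It is not the tool's own check: the function name `check_balance` is made up, and the `{{` / `}}` pair is an assumption (the text above only explicitly mentions `[[` and `]]`).

```python
from typing import List

# Pairs of delimiters whose counts are expected to match.
# Only the [[ ]] pair is confirmed by the description; {{ }} is an assumption.
PAIRS = [("[[", "]]"), ("{{", "}}")]


def check_balance(wikitext: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means balanced."""
    problems = []
    for opener, closer in PAIRS:
        n_open = wikitext.count(opener)
        n_close = wikitext.count(closer)
        if n_open != n_close:
            problems.append(
                f"found {n_open} of '{opener}' but {n_close} of '{closer}'"
            )
    return problems
```

Like the description says, this is very simplistic: it only compares counts, so it will not notice delimiters that are in the wrong order.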

Source code

The source code for this is available, under the GPL. Detailed setup instructions are below. You'll need Julien Lemoine's Suggestion Search daemon (TcpQuery) installed for this to work. It runs as a daemon that my PHP script talks to, to help determine which phrases have existing Wikipedia articles. Use the "[archive]" link for downloading: it's C++ code that gets compiled, and I think it assumes a UNIX / Linux type of system, although I'm not certain (check the README in the archive).

CWLI Setup instructions

If you want to set up a local copy of CWLI, then you can. However, setting this up has a number of steps, and is a bit complicated. I'm not trying to make it more complicated than it needs to be, but CWLI depends on a number of other bits of software to work, and so those dependencies do make it more complicated.

Firstly, you need a machine with PHP, a web server (such as Apache), and MySQL, and those bits of software need to be working and already configured. It is assumed from here onwards that you already have this, as setting these up is outside of the scope of this document.

The second thing you need is Julien Lemoine's Suggestion Search daemon (TcpQuery), which must be installed for this to work: http://suggest.speedblue.org/download.php

Compiling TcpQuery

mkdir suggest
cd suggest
# Note: I am using TcpQuery v0.44 - there is a later version, v0.51, but I do not suggest using it
# because it crashes for me, whereas v0.44 does not crash.
wget http://suggest.speedblue.org/tgz/wikipedia-suggest-0.44.tar.gz
tar xfz wikipedia-suggest-0.44.tar.gz
cd wikipedia-suggest-0.44
# If you need to install expat & glib and are on Debian or Ubuntu, try the following command:
# aptitude install libexpat1 libexpat1-dev libglib2.0-0 libglib2.0-dev
# ... and if you need a compiler installed, do this:
# aptitude install build-essential
# The netcat package provides the nc command used below. To install, use the following instruction:
# aptitude install netcat
./configure
make
make check
cd cmd
# This next step downloads a 122 Mb file, so can be a bit slow depending on your internet connection...
wget http://www2.speedblue.org/download/WikipediaSuggestCompiledEn-20060810.tar.bz2
# This command will be quite slow as it decompresses the above file:
tar xfj WikipediaSuggestCompiledEn-20060810.tar.bz2
# Check that there is a "pages.bin" and a "trie.bin" file in this directory from the above archive:
ls -al En
# Now start the TcpQuery daemon. Note: you can include a "-m" switch for much improved speed if you have ~ 1 Gb of memory on this box:
./TcpQuery -t 10 22581 En/trie.bin En/pages.bin &
# Now test with:
echo fish | nc localhost 22581
# Should get back an answer like: [["Fish","3167",""],["Fishing","2146",""],["Fishspinner","1113","Tropical cyclone"] (...etcetera...)

# Once you've got this working, you can add a line to your /etc/rc.local file to make the daemon run on bootup.
# An example line is something like: 
#( cd /root/tmp/wikipedia-suggest/wikipedia-suggest-0.44/cmd ; ./TcpQuery -m -t 10 22581 trie.bin pages.bin > /dev/null & )
# Note: you'll need to update the paths and file names as appropriate
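
If you would rather test the daemon from a script than with nc, something like the following Python sketch may work. It rests on several assumptions that are not confirmed above: that the reply is a valid JSON array of `[title, number, description]` rows (as the sample answer suggests), that it is UTF-8 encoded, and that the daemon either closes the connection after replying or goes quiet (in which case the read stops at the timeout). The function names are made up for the example, and the exact-title match in `has_article` is only a guess at how a phrase could be checked, not how the PHP script does it.

```python
import json
import socket
from typing import List


def parse_reply(reply: str) -> List[str]:
    """Extract the article titles from a reply such as [["Fish","3167",""], ...]."""
    rows = json.loads(reply)
    return [row[0] for row in rows]


def query_suggestions(prefix: str, host: str = "localhost",
                      port: int = 22581, timeout: float = 5.0) -> List[str]:
    """Send one query line to TcpQuery and return the suggested titles."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(prefix.encode("utf-8") + b"\n")
        while True:
            try:
                data = sock.recv(4096)
            except socket.timeout:
                break  # assumption: no more data is coming
            if not data:
                break  # the daemon closed the connection
            chunks.append(data)
    return parse_reply(b"".join(chunks).decode("utf-8"))


def has_article(phrase: str, titles: List[str]) -> bool:
    """Hypothetical check: does any suggested title match the phrase exactly?"""
    return any(title.lower() == phrase.lower() for title in titles)
```

With the daemon running as above, `query_suggestions("fish")` plays the same role as the `echo fish | nc localhost 22581` test.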

Copying over the PHP files

# Your path may be "/var/www/", so update as appropriate based on how you have configured your apache:
cd /var/www/hosts
# Note: I have updated the ZIP file, so this will have to be re-downloaded if you already have it - sorry!
wget http://can-we-link-it.nickj.org/can-we-link-it.zip
unzip can-we-link-it.zip
# This should show some files:
ls -al can-we-link-it.nickj.org/

# Then open this directory in your web browser (the URL to use will depend on how you have configured apache),
# and it should show a page just like on http://can-we-link-it.nickj.org/ . If it doesn't then there is something
# wrong with this step (either with copying over the PHP files, or with the configuration of apache).
# Now try typing something in the box (e.g. "test") As you type, it should suggest articles that match. If it
# does, then it's working - if not, then there is a problem with TcpQuery.

Setting up MySQL

The last step is setting up MySQL. I have created a full dump of MySQL as I currently have it, so with this you should have all the data that I have as at 19 Oct 2007.

To download and load this dump:

wget http://can-we-link-it.nickj.org/suggest-links.sql.bz2
bunzip2 suggest-links.sql.bz2
echo "create database suggest_links; " | mysql
# load the data - might take a minute or two:
mysql suggest_links < suggest-links.sql
mysql suggest_links
# and issue these three commands in MySQL:
grant all on suggest_links   to links@localhost identified by "links";
grant all on suggest_links.* to links@localhost identified by "links";
exit;

Now you should have a full dump of the files, and a functional local copy of the site.

Try it on a Wikipedia page. When it is finding links, it may take about 20 seconds on a page, and the hard disk light will flicker a lot.

If this is too slow to be usable, then you'll need to add the "-m" memory switch to the TcpQuery line, which loads everything into memory - however, it needs around 600 Mb of RAM for this to work.

Oh, and periodically, you should purge the unpopular link suggestions, so that they are not suggested any more. Here are the queries that I use:

mysql suggest_links
# Shows suggested links that were strongly disliked and which have not been purged yet:
select * from link_votes where against < 100 and against >= in_favour + 2 order by against - in_favour;
# Check that the results look sensible. If they do, then you can purge the strongly disliked links so that they are not suggested any more like so:
# (MySQL has no "+=" operator, so the increment is written out in full)
update link_votes set against = against + 100 where against < 100 and against >= in_favour + 2;
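
If you want to try the purge logic out before touching the real data, here is a small Python sketch that runs the same two queries against a throwaway in-memory SQLite database. SQLite is only a stand-in for MySQL here, and the table layout is an assumption: only the `link_votes` table and its `in_favour` and `against` columns appear in the queries above, so the `link_text` column and the sample rows are made up.

```python
import sqlite3
from typing import List

CONDITION = "against < 100 and against >= in_favour + 2"


def make_demo_db() -> sqlite3.Connection:
    """Build an in-memory table with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table link_votes (link_text text, in_favour integer, against integer)"
    )
    conn.executemany(
        "insert into link_votes values (?, ?, ?)",
        [("good link", 5, 1), ("bad link", 1, 4), ("already purged", 0, 103)],
    )
    return conn


def disliked(conn: sqlite3.Connection) -> List[str]:
    """List strongly disliked links that have not been purged yet."""
    query = (
        "select link_text from link_votes where " + CONDITION
        + " order by against - in_favour"
    )
    return [row[0] for row in conn.execute(query)]


def purge(conn: sqlite3.Connection) -> None:
    """Add 100 to 'against' so the links are no longer suggested."""
    conn.execute("update link_votes set against = against + 100 where " + CONDITION)
```

Calling `disliked()` before `purge()` lists the "bad link" row; calling it again afterwards returns nothing, because the `against < 100` test now excludes it.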

Thanks

I want to give a big thank you to Julien Lemoine for writing his Suggestion Search daemon / server, which this tool uses (or rather, abuses) in a rather cruel way to determine what's a valid article name and what's not :-) Also the front page uses a modified version of his web form to help you find the right page that you want to suggest links for.

Questions people have asked

Comparison with Link Suggester and LinkBot

Q: What is the difference between the Link Suggester tool, and Can We Link It?

A: The Link Suggester / LinkBot came first. The Link Suggester was an offline script that I would manually run to suggest links, and the LinkBot would save those link suggestions to article talk pages. However, after 3 or 4 small-scale test runs it became clear that this approach had a number of problems:

The links suggested would become out-of-date as the article changed.
The talk page would become cluttered with suggestions, which annoyed people.
Sometimes links would be suggested for articles and ignored.
It was hard for people to give, and for me to get, feedback about which link suggestions were good and which were bad.
People would ask for it to be run on specific articles, in addition to the ones I randomly selected.

Because of these problems, the talk-page approach was abandoned. Instead, the Link Suggester scripts were modified to make a web-based link-suggesting tool, called Can-We-Link-It. This tool has a number of benefits:

Its suggestions are always current and up-to-date.
It doesn't clutter up the talk page.
It only suggests links for articles that people want suggestions for.
It's easy for people to give feedback about good or bad links by saying "yes" or "no" to a link.
It doesn't require me to manually run it; instead it runs on demand.

The main downsides of the tool as it currently stands are:

Makes it easy to add lots of links - requiring evaluation of each link's merit from users.
You have to know the tool exists (i.e. the link suggestions don't come to you, you have to go and request them).

-- All the best,
Nickj (t) 01:48, 22 September 2006 (UTC)

Shutdown imminent

Sorry folks, but I'm going to have to shut down the Can-we-link-it web site in the next few weeks (during November 2009), and I don't know when or even if it will be back.

Why now?

We've sold our home. The site was hosted on a spare machine in my cupboard. We have to be out during November.

Why don't you just move it to your new home?

We don't have one yet. Finding a new home is our current work-in-progress. So for the next indefinite period of time we will be "between places", and all our stuff will be in storage.

Okay, so why don't you just move to the toolserver?

I enquired years ago, but the site uses a daemon that consumes around 600 to 800 MB of RAM, and there wasn't the available spare RAM on the toolserver then. Unless I'm informed otherwise, I'm going to assume that this still applies.

So why don't you rewrite it to use less RAM / do something else?

It's at the end of a very long to-do list, which means it's simply not going to happen. Sorry.

So where to from here?

Seems to me there are the following options:

Let it die, with no replacement. I.e. chalk it up as a "mildly useful whilst it existed, but hey, life moves on". Also known as "the do nothing option".
Shift it to somewhere else (e.g. the toolserver or somewhere else). Source code is available, and I'd send the database dump and any other info needed on to anyone who wants to take over running the site, and redirect the domain name, and so forth.
Rewrite it, probably as a MediaWiki extension. I personally think that this is the best way forward, and therefore think it's worth enumerating some of the costs and benefits of doing this:
- Costs:
  - A do-over from scratch as an extension would cost someone in time, and in effort, and in opportunity cost.
- Benefits:
  - The current code is a bit icky, as it evolved incrementally into what it is now. Starting over could be a good idea, and would probably also allow leveraging the MediaWiki parser (see below).
  - I'm sure the behavior / appearance / functionality can be improved. Think of the current thing as a proof-of-concept experiment to show one of the many possible ways of doing it.
  - If the code is hosted in subversion along with all the other extensions, then updates / bug fixes / improvements could be added by multiple people, which would be a big improvement and remove a single person from being the bottleneck (which has been the case up until now).
  - At the moment it only supports the English Wikipedia. There are more languages than English spoken on the Earth. A MediaWiki extension could presumably ultimately support all the languages that MediaWiki supports.
  - It could presumably hook into MediaWiki's parser to work out what text can be linked. At the moment, it gets the raw wikitext and tries to decipher that to work out what could be linked, which is just "a bad idea" (TM), and prone to breaking on certain wiki text, and it also means the current tool complains if it does not see balanced wiki text. Presumably the parser already has to work all this stuff out, and that could be hooked into.
  - As a completely separate site, it currently has no knowledge of changes to the Wikipedia - e.g. new articles, deleted articles, moved articles, etc - and so the data gradually gets more and more out-of-date. A MediaWiki extension could avoid this problem entirely by not duplicating the list of article names.
  - As a MediaWiki extension, it would be easier for users to discover, and could maybe even be turned on by default on the Wikipedia sites, if and when it has proved itself. Basically, the more articles a wiki has, the more it could benefit from a tool like this. Imagine you start a new article, and after you save, it says "you know you could add this link and this link?". Further, the more that people use it, the better it could get at knowing what's a good link, and what's a bad link.
  - Much more advanced models are possible. For example, currently it assumes the same text is as deserving of a link in one article as in another. But that's not really the case - some phrases should be linked in some articles, but not in others. Essentially if similar/related/same-subject-matter articles to yours are linking some text, and that same text occurs in your article, then you probably want to add that same link.
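
To make the last idea a little more concrete, here is one possible way it could be sketched in Python. This is purely hypothetical: nothing like it exists in the current tool, and the function names, the use of a simple fraction as the score, and the 0.5 threshold are all assumptions chosen for illustration.

```python
from typing import Iterable, Set


def link_support(phrase: str, related_links: Iterable[Set[str]]) -> float:
    """Fraction of related articles that already link the given phrase.

    Each element of related_links is the set of phrases linked in one
    related article.
    """
    related = list(related_links)
    if not related:
        return 0.0
    linking = sum(1 for links in related if phrase in links)
    return linking / len(related)


def should_suggest(phrase: str, related_links: Iterable[Set[str]],
                   min_support: float = 0.5) -> bool:
    """Suggest the link only if enough related articles also link it."""
    return link_support(phrase, related_links) >= min_support
```

For example, if two out of three related biochemistry articles link "enzyme", then `should_suggest("enzyme", ...)` would be true, whereas a phrase that none of them link would not be suggested.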

So those are the options that I am aware of. I personally favor the rewrite approach: throwing away the current code, taking the current ideas, improving them, and making them native to MediaWiki as a MediaWiki extension that can hopefully ultimately run on the Wikimedia servers ... but I don't envy the poor sod who gets to do the rewrite!

Anyway, whatever is going to happen, you have between 2 weeks and 4 weeks from now to make a decision and do it, because that's when the current site will be shut down.

-- All the best, Nickj (t) 07:37, 13 October 2009 (UTC)

Thanks all the same Nickj. I have found this to be a really useful tool. I do hope to see it again in whatever guise that might be (BTW, I have no expertise to manage a tool like this). All the best in finding a new home! Heds (talk) 02:41, 15 October 2009 (UTC)

You have my thanks too - it was a very useful tool while it lasted! --Alvestrand (talk) 14:02, 15 October 2009 (UTC)

Absolutely first rate! A tool that retains the place of a human thought process for the end result! Best luck, GeorgeLouis (talk) 04:42, 16 October 2009 (UTC)

figures; i just discovered it today. well, you deserve thanks for the use i got out of it so far!Gzuckier (talk) 23:54, 21 October 2009 (UTC)

Best of luck in the home search. So sorry to see this tool go, as it was massively excellent. I hope someone will pick it up and run with it (I sure wish I had the know how to do it). I'll remove the link to this page at WP:DEP.--Fabrictramp | talk to me 23:22, 26 December 2009 (UTC)

Just FYI the tool server RAM limit is 1000 Megabytes now, but you can't use that much all the time. Tim1357 (talk) 22:24, 25 April 2010 (UTC)
