Wikipedia:Wikipedia Signpost/Single/2015-12-02

Single-page Edition

2 December 2015

Op-ed
Whither Wikidata?

News and notes
Online harassment consultation; High voter turnout at ArbCom elections

In the media
Is Wikidata as transparent as it seems?; Wikimedia Fund-raising drive launches

Traffic report
Jonesing for episodes

Featured content
This Week's Featured Content

Technology report
Tech news in brief

2015-12-02

Whither Wikidata?

By Andreas Kolbe

Wikidata, a Wikimedia project spearheaded by Wikimedia Deutschland, recently celebrated its third anniversary. The project has a dual purpose: 1. Streamline data housekeeping within Wikipedia. 2. Serve as a data source for re-users on the web; in particular, Wikidata is the designated successor to Google's Freebase, designed to deliver data for the Google Knowledge Graph.

We need to talk about Wikidata.

Wikidata, covered in last week's Signpost issue in a celebratory op-ed that highlighted the project's potential (see Wikidata: the new Rosetta Stone), has some remarkable properties for a Wikimedia wiki:

A little more than half its statements are unreferenced.
Of those statements that do have a reference, significantly more than half are referenced only to a language version of Wikipedia (projects like the English, Latvian or Burmese Wikipedia).
Wikidata statements referenced to Wikipedia do not cite a specific article version, but only name the Wikipedia in question.
Wikidata has a no-attribution CC0 licence; this means that third parties can use the data on their sites without indicating their provenance, obscuring the fact that the data came from a crowdsourced project subject to the customary disclaimers.
Hoaxes long extinguished on Wikipedia live on, zombie-like, in Wikidata.

This op-ed examines the situation and its implications, and suggests corrective action.

But first ...

A little bit of history

Wikidata is one of the younger Wikimedia projects. It was launched in 2012, and its development has been led not by the Wikimedia Foundation itself but by Wikimedia Deutschland, the German Wikimedia chapter.

The initial development work was funded by a donation of 1.3 million euros, made up of three components:

Half the money came from Microsoft co-founder Paul Allen's Institute for Artificial Intelligence (AI2).
A quarter came from Google, Inc.
Another quarter came from the Gordon and Betty Moore Foundation, established by Intel co-founder Gordon E. Moore and his wife Betty I. Moore.

The original team of developers was led by Denny Vrandečić (User:Denny), who came to Wikimedia Deutschland from the Karlsruhe Institute of Technology (KIT). Vrandečić was, together with Markus Krötzsch (formerly KIT, University of Oxford, presently Dresden University of Technology), the founder of the Semantic MediaWiki project. Since 2013, Vrandečić has been a Google employee; in addition, since summer 2015 he has been one of the three community-elected Wikimedia Foundation board members.

Microsoft co-founder Paul Allen's Institute for Artificial Intelligence provided half the funding for the initial development of Wikidata

Wikimedia Deutschland's original press release, dated 30 March 2012, said,

“

Wikidata will provide a collaboratively edited database of the world's knowledge. Its first goal is to support the more than 280 language editions of Wikipedia with one common source of structured data that can be used in all articles of the free encyclopedia. For example, with Wikidata the birth date of a person of public interest can be used in all Wikipedias and only needs to be maintained in one place. Moreover, like all of Wikidata's information, the birth date will also be freely usable outside of Wikipedia. The common-source principle behind Wikidata is expected to lead to a higher consistency and quality within Wikipedia articles, as well as increased availability of information in the smaller language editions. At the same time, Wikidata will decrease the maintenance effort for the tens of thousands of volunteers working on Wikipedia.

The CEO of Wikimedia Deutschland, Pavel Richter, points out the pioneering spirit of Wikidata: "It is ground-breaking. Wikidata is the largest technical project ever undertaken by one of the 40 international Wikimedia chapters. Wikimedia Deutschland is thrilled and dedicated to improving data management of the world's largest encyclopedia significantly with this project."

Besides the Wikimedia projects, the data is expected to be beneficial for numerous external applications, especially for annotating and connecting data in the sciences, in e-Government, and for applications using data in very different ways. The data will be published under a free Creative Commons license.

”

Wikidata thus has a dual purpose: it is designed to make housekeeping across the various Wikipedia language versions easier, and to serve as a one-stop data shop for sundry third parties.

To maximise third-party re-use, Wikidata—unlike Wikipedia—is published under the CC0 1.0 Universal licence, a complete public domain dedication that waives all author's rights, to the extent allowed by law. This means that re-users of Wikidata content are not obliged to indicate the source of the data to their readers.

In this respect Wikidata differs sharply from Wikipedia, which is published under the Creative Commons Attribution-ShareAlike 3.0 Unported Licence, requiring re-users of Wikipedia content to credit Wikipedia (attribution) and to distribute copies and adaptations of Wikipedia content only under the same licence (Share-alike).

Search engines take on a new role as knowledge providers

Google contributed a quarter of the initial funding for the development of Wikidata, which is now replacing Freebase as one of the sources for the Google Knowledge Graph

The March 30, 2012 announcement of the development of Wikidata was followed nearly seven weeks later, on May 16, 2012, by the arrival of a new Google feature destined to have far-reaching implications: the Google Knowledge Graph. Similar developments also happened at Microsoft's Bing. These two major search engines, no longer content to simply provide users with a list of links to information providers, declared that they wanted to become information providers in their own right.

The Google Knowledge Graph, Google said, would enable Internet users

“

to search for things, people or places that Google knows about—landmarks, celebrities, cities, sports teams, buildings, geographical features, movies, celestial objects, works of art and more—and instantly get information that's relevant to your query. This is a critical first step towards building the next generation of search, which taps into the collective intelligence of the web and understands the world a bit more like people do.

Google's Knowledge Graph isn't just rooted in public sources such as Freebase, Wikipedia and the CIA World Factbook. It's also augmented at a much larger scale—because we're focused on comprehensive breadth and depth. It currently contains more than 500 million objects, as well as more than 3.5 billion facts about and relationships between these different objects. And it's tuned based on what people search for, and what we find out on the web.

”

The move makes sense from a business perspective: by trying to guess the information in which people are interested and making that information available on their own pages, search engines can entice users to stay on their sites for longer, increasing the likelihood that they will click on an ad—a click that will add to the search engine's revenue (in Google's case running at around $200 million a day).

Moreover, search engine results pages that do not include a Knowledge Graph infobox often feature ads in the same place where the Knowledge Graph is usually displayed: the right-hand side of the page. The Knowledge Graph thus trains users to direct their gaze to the precise part of a search engine results page that generates the operator's revenue. Alternatively, ads may also be (and have been) inserted directly into the Knowledge Graph itself.

Microsoft's Bing search engine has followed much the same path as Google with its "Snapshot" feature drawing on Wikimedia content

Microsoft's Bing followed a very similar development from 2012 onwards, with Bing's Satori-powered "Snapshot" feature closely mimicking the appearance and content of Google's Knowledge Graph. Bing has used some of the same sources as Google, in particular Wikipedia and Freebase, a crowdsourced database that was acquired by Google in 2010 and published under a Creative Commons Attribution Licence.

Neither Freebase nor Wikipedia really profited from this development. Wikipedia noted a significant downturn in pageviews that was widely attributed to the introduction of the Google Knowledge Graph, causing worries among Wikimedia fundraisers and those keen to increase editor numbers. After all, Internet users not clicking through to Wikipedia would miss both the Wikimedia Foundation's fundraising banners and a chance to become involved in Wikipedia themselves.

As for Freebase, Google announced in December 2014, a little over four years after acquiring the project, that it would shut it down in favour of the more permissively licensed Wikidata and migrate its content to Wikidata—Freebase's different Creative Commons licence, which required attribution, notwithstanding.

"The PR Pros and SEOs are Coming"

Freebase was widely considered a weak link in the information supply chain ending at the Knowledge Graph. Observers noted that search engine optimization (SEO) specialists were able to manipulate the Knowledge Graph by manipulating Freebase.

In a Wikidata Office Chat conducted on March 31, 2015, future Wikimedia Foundation board member Denny Vrandečić—juggling his two hats as a Google employee and the key thought leader of Wikimedia's Wikidata project—spoke about Google's transition from Freebase to Wikidata, explaining that Wikidata's role would be slightly different from the role played by Freebase:

Denny Vrandečić, the co-founder of the Semantic MediaWiki project, has to juggle three hats: he is a Google employee as well as a community-elected Wikimedia Foundation board member and the primary Wikidata thought leader

“

16:31:17 <dennyvrandecic> Google has decided that we want to rerelease as much data as possible from Freebase to Wikidata

16:31:34 <dennyvrandecic> and we are currently working hard on a number of pieces for that

16:31:48 <benestar> one question already: how many users do we have to expect who come from freebase to wikidata?

16:32:02 <sjoerddebruin> There are already people moving.

16:32:11 <dennyvrandecic> benestar: tough to say, but we Freebase never had anywhere near the numbers Wikidata has

16:32:14 <sjoerddebruin> Mostly annoying SEO / web design people

16:32:45 <dennyvrandecic> benestar: it is likely that they won't even make a difference in the Wikidata user numbers

16:33:05 <benestar> so servers wont explode :D

16:33:16 <dennyvrandecic> nah, not even close

16:33:26 <dennyvrandecic> yeah, one problem is that SEOs think that Wikidata is replacing Freebase within the Google infrastructure

16:33:35 <benestar> but we need guidelines on SEO on Wikidata

16:33:42 <dennyvrandecic> yes, that would be good

16:33:45 <benestar> companies will come and edit wikidata a lot now

16:33:55 <sjoerddebruin> We've already seen a huge wave of spam of companies and "SEO experts"

16:33:55 <dennyvrandecic> also, Wikidata is not a free ticket into the Knowledge Graph as Freebase was

16:34:07 <dennyvrandecic> it is just one source among many

16:34:27 <Lydia_WMDE> i think we really need to highlight this

16:34:30 <dennyvrandecic> benestar: actually I think that companies editing Wikidata might be very beneficial

”

Noam Shapiro, writing in Search Engine Journal, drew the following conclusions from his review of this chat:

“

1. The PR Pros and SEOs are Coming: There is certainly an awareness that SEO professionals and many company PR representatives have a newfound interest in Wikidata.

2. Wikidata's Mixed Feelings: These Wikidata thought leaders are somewhat wary regarding the pending influx of new Wiki editors, though there are conflicting views as to whether this is a positive or negative development.

3. No Spam/Bias Allowed: Keep in mind that some users might be on the lookout for spammy or biased edits. Keep any Wikidata edits as factual and unbiased as possible. Given the data-centered nature of Wikidata, and the need for recognized references, most edits will, in any event, most likely follow these guidelines, but be careful!

4. No More Free Lunch: As one of the insiders notes above, "Wikidata is not a free ticket into the Knowledge Graph as Freebase was." It may very well be that the direct relationship observed between Freebase and the Knowledge Graph will not be replicated in Wikidata's relationship with the Knowledge Graph. That being said, it is still "one source among many," and likely an important one. After all, the Knowledge Graph thrives on the existence of structured data, and—especially in the absence of Freebase—that is exactly what Wikidata provides.

[…] Wikidata matters. To one degree or another, Wikidata will be a source for the Google Knowledge Graph. And second, taking control of your brand has never been more important! Optimizing your brand's Wikidata entry is becoming increasingly critical for strengthening online presence, and is therefore highly recommended.

”

Shapiro's point concerning spam and bias mentioned "the need for recognized references". This is a topic that we will shortly return to, because Wikidata seems to have adopted a very lax approach to this requirement.

The relationship between Wikidata and Wikipedia: Sources? What sources?

Citations to Wikipedia (blue) outnumber all other sources (red) together (yellow = unreferenced)

The fact that Wikidata and Wikipedia have what seem, on the face of it, to be incompatible licences has been a significant topic of discussion within the Wikimedia community. It is worth noting that in 2012, Denny Vrandečić wrote on Meta,

“

Alexrk2, it is true that Wikidata under CC0 would not be allowed to import content from a Share-Alike data source. Wikidata does not plan to extract content out of Wikipedia at all. Wikidata will provide data that can be reused in the Wikipedias. And a CC0 source can be used by a Share-Alike project, be it either Wikipedia or OSM. But not the other way around. Do we agree on this understanding? --Denny Vrandečić (WMDE) (talk) 12:39, 4 July 2012 (UTC)

”

More recently, the approach seems to have been that because facts cannot be copyrighted, mass imports from Wikipedia are justified. The legal situation concerning database rights in the US and EU is admittedly fairly complex. At any rate, whatever licensing qualms Denny may have had about this issue at the time seem to have evaporated. If the original plan was indeed "not [...] to extract content out of Wikipedia at all", then the plan changed.

Bot imports from Wikipedia have long been the order of the day. In fact, in recent months contributors on Wikidata have repeatedly raised alarms about mass imports of content from various Wikipedias, believing that these imports compromise quality (the following quote, written by a non-native speaker, has been lightly edited for spelling and grammar):

“

In a Wikipedia discussion I came by chance across a link to the following discussion:

Wikidata:Project_chat/Archive/2015/10#STOP_with_bot_import

[…] To provide an outside perspective as a Wikipedian (and a potential user of WD in the future), I wholeheartedly agree with Snipre, in fact "bots are running wild", and the uncontrolled import of data/information from Wikipedias is one of the main reasons for some Wikipedias developing an increasingly hostile attitude towards WD and its usage in Wikipedias. If WD is ever to function as a central data storage for various Wikimedia projects and in particular Wikipedia as well (in analogy to Commons), then quality has to take the driver's seat over quantity. A central storage needs much better data integrity than the projects using it, because one mistake in its data will multiply throughout the projects relying on WD, which may cause all sorts of problems. For a crude comparison think of a virus placed on a central server rather than on a single client. The consequences are much more severe and nobody in their right mind would run the server with even less protection/restrictions than the client.

Another thing is that if you envision users of other Wikimedia projects such as Wikipedia or even 3rd party external projects to eventually help with data maintenance when they start using WD, then you might find them rather unwilling to do so, if not enough attention is paid to quality; instead they probably just dump WD from their projects.

In general all the advantages of the central data storage depend on the quality (reliability) of data. If that is not given to a reasonably high degree, there is no point in having central data storage at all. All the great applications become useless if they operate on false data.--Kmhkmh (talk) 12:00, 19 November 2015 (UTC)

”

The circular reference loop connecting Wikidata and Wikipedia

The result of these automated imports is that Wikipedia is today by far the most commonly cited source in Wikidata.

According to current Wikimedia statistics:

Half of all statements in Wikidata are completely unreferenced.
Close to a third of all statements in Wikidata are only referenced to a Wikipedia.
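
As a rough consistency check on these two figures, here is a minimal sketch. It assumes the rounded shares quoted above (one half and roughly one third), not exact counts from the statistics page, and works out what fraction of the referenced statements cite only a Wikipedia:

```python
from fractions import Fraction

# Rounded shares of all Wikidata statements, as quoted in the text above.
# These are approximations, not exact figures from the statistics page.
unreferenced = Fraction(1, 2)    # "half of all statements"
wikipedia_only = Fraction(1, 3)  # "close to a third of all statements"

referenced = 1 - unreferenced
other_sources = referenced - wikipedia_only

# Share of the *referenced* statements whose only reference is a Wikipedia.
wikipedia_share_of_referenced = wikipedia_only / referenced

print(f"Cite a non-Wikipedia source: {float(other_sources):.0%} of all statements")
print(f"Wikipedia-only share of referenced statements: {float(wikipedia_share_of_referenced):.0%}")
```

On these rounded figures, about two thirds of the referenced statements cite only a Wikipedia, and only about one statement in six cites any other kind of source.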

References to a Wikipedia do not identify a specific article version; they simply name the language version of Wikipedia. This includes many minor language versions whose referencing standards are far less mature than those of the English Wikipedia. Moreover, some Wikipedia language versions, like the Croatian and Kazakh Wikipedias, are not just less mature, but are known to have very significant problems with political manipulation of content.
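
To make the distinction concrete, here is a minimal sketch of how a statement's references might be sorted into the three groups discussed here. It assumes a simplified, hypothetical record layout (a list of references, each a dict mapping property IDs to values) rather than Wikidata's full JSON format. P143 ("imported from Wikimedia project") is assumed to be the property that marks a reference merely naming a Wikipedia, and P248 ("stated in") and P854 ("reference URL") stand in for references to other sources. The sample statements are invented for illustration.

```python
# Assumption: this property marks a reference that only names a Wikimedia project.
IMPORTED_FROM_WIKIMEDIA_PROJECT = "P143"


def classify(references):
    """Return 'unreferenced', 'wikipedia-only' or 'sourced' for one statement.

    `references` is a list of dicts mapping property IDs to values,
    a hypothetical simplification of a Wikidata statement's references.
    """
    # Ignore empty reference records so they do not count as sources.
    refs = [ref for ref in references if ref]
    if not refs:
        return "unreferenced"
    # A statement counts as sourced if any reference uses a property
    # other than the "imported from" marker.
    for ref in refs:
        if any(prop != IMPORTED_FROM_WIKIMEDIA_PROJECT for prop in ref):
            return "sourced"
    return "wikipedia-only"


# Invented sample statements, for illustration only.
samples = {
    "no reference": [],
    # Names the project only: no article title, no article version.
    "names a Wikipedia": [{"P143": "English Wikipedia"}],
    "cites a work": [{"P248": "some published work", "P854": "https://example.org/source"}],
}

for label, refs in samples.items():
    print(f"{label}: {classify(refs)}")
```

The second sample shows the weakness described above: the reference records which Wikipedia the value came from, but nothing that would let a reader check the claim against a particular article version.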

Recall Shapiro's expectation above that spam and bias would be held at bay by the "need for recognized references". Wikidata's current referencing record seems unlikely to live up to that expectation.

Of course, allowances probably have to be made for the fact that some statements in Wikidata may genuinely not be in need of a reference. For example, in a Wikidata entry like George Bernard Shaw, one might expect to receive some sympathy for the argument that the statement "Given name: George" is self-evident and does not need a reference. Wikidata, some may argue, will never need to have 100 per cent of its statements referenced.

However, it does not seem healthy for Wikipedia to be cited more often in Wikidata than all other types of sources together. This is all the more important as Wikidata may not just propagate errors to Wikipedia, but may also spread them to the Google Knowledge Graph, Bing's Snapshot, myriad other re-users of Wikidata content, and thence to "reliable sources" cited in Wikipedia, completing the "citogenesis" loop.

Data are not truth: sometimes they are phantoms

Citogenesis

As the popularity of Wikipedia has soared, citogenesis has been a real problem in the interaction between "reliable sources" and Wikipedia. A case covered in May 2014 in The New Yorker provides an illustration:

“

In July of 2008, Dylan Breves, then a seventeen-year-old student from New York City, made a mundane edit to a Wikipedia entry on the coati. The coati, a member of the raccoon family, is "also known as … a Brazilian aardvark," Breves wrote. He did not cite a source for this nickname, and with good reason: he had invented it. He and his brother had spotted several coatis while on a trip to the Iguaçu Falls, in Brazil, where they had mistaken them for actual aardvarks.

"I don't necessarily like being wrong about things," Breves told me. "So, sort of as a joke, I slipped in the 'also known as the Brazilian aardvark' and then forgot about it for awhile."

Adding a private gag to a public Wikipedia page is the kind of minor vandalism that regularly takes place on the crowdsourced Web site. When Breves made the change, he assumed that someone would catch the lack of citation and flag his edit for removal.

Over time, though, something strange happened: the nickname caught on. About a year later, Breves searched online for the phrase "Brazilian aardvark." Not only was his edit still on Wikipedia, but his search brought up hundreds of other Web sites about coatis. References to the so-called "Brazilian aardvark" have since appeared in the Independent, the Daily Mail, and even in a book published by the University of Chicago. Breves's role in all this seems clear: a Google search for "Brazilian aardvark" will return no mentions before Breves made the edit, in July, 2008. The claim that the coati is known as a Brazilian aardvark still remains on its Wikipedia entry, only now it cites a 2010 article in the Telegraph as evidence.

”

It seems inevitable that falsehoods of this kind will be imported into Wikidata, eventually infecting both other Wikipedias and third-party sources. That this not only can, but does happen is quickly demonstrated. Among the top fifteen longest-lived hoaxes currently listed at Wikipedia:List of hoaxes, six (nos. 1, 2, 6, 7, 11 and 13) still have active Wikidata entries at the time of writing. The following table reproduces the corresponding entries in Wikipedia:List of hoaxes, with a column identifying the relevant Wikidata item and supplementary notes added:

Hoax	Length	Start date	End date	Links	Wikidata item
Jack Robichaux: Fictional 19th‑century serial rapist in New Orleans	10 years, 1 month	July 31, 2005	September 3, 2015	Wikipedia:Articles for deletion/Jack Robichaux	https://archive.is/Z6Gne Note: The English Wikipedia link has been updated.
Guillermo Garcia: "Highly influential" but imaginary oil and forestry magnate in 18th-century South America	9 years, 10 months	November 17, 2005	September 19, 2015	Wikipedia:Articles for deletion/Guillermo Garcia (businessman)	https://archive.is/0pprA
Gregory Namoff: An "internationally known" but nonexistent investment banker, minor Watergate figure, and U.S. Senate candidate.	9 years, 6½ months	June 17, 2005	January 13, 2015	Wikipedia:Articles for deletion/Gregory Namoff Archive	https://archive.is/urElB Note: 10 months after the hoax article was deleted on Wikipedia, a user added "natural causes" as the manner of death on Wikidata
Double Hour: Supposed German and American television show, covering historic events over a two-hour span.	9 years, 6 months	September 23, 2005	April 4, 2015	Double Hour (TV series) deletion log	https://archive.is/rjFjw Note: This item has only ever been edited by bots.
Nicholas Burkhart: Fictitious 17th-century legislator in the House of Keys on the Isle of Man.	9 years, 2 months	July 19, 2006	September 26, 2015	Wikipedia:Articles for deletion/Nicholas Burkhart	https://archive.is/A0lt7
Emilia Dering: Long-lived article about a non-existent 19th century German poet started with the rather basic text "Emilia Dering is a famous poet who was Berlin,Germany on April 16, 1885" by a single-purpose account	8 years, 10 months	December 6, 2006	October 6, 2015	Emilia Dering deletion log; deleted via A7. On the day of the article's creation, a person claiming to be the granddaughter of Emilia Dering published a blog post with a poem supposedly written by her.	https://archive.is/eNJbc

Using the last entry from the above list as an example, a Google search quickly demonstrates that there are dozens of other sites listing Emilia Dering as a German writer born in 1885. The linkage between Wikidata and the Knowledge Graph as well as Bing's Snapshot can only make this effect more powerful: if falsehoods in Wikidata enter the infoboxes displayed by the world's major search engines, as well as the pages of countless re-users, the result could rightly be described as citogenesis on steroids.

The only way for Wikidata to avoid this is to establish stringent quality controls, much like those called for by Kmhkmh above. Such controls would appear absent at Wikidata today, given that the site managed to tell the world, for five months in 2014, that Franklin D. Roosevelt was also known as "Adolf Hitler". If even the grossest vandalism can survive for almost half a year on Wikidata, what chance is there that more subtle falsehoods and manipulations will be detected before they spread to other sites?

Yet this is the project that Wikimedians like Max Klein, who has been at Wikidata from the beginning, imagine could become the "one authority control system to rule them all". The following questions and answers are from a 2014 interview with Klein:

“

How was Wikidata originally seeded?

In the first days of Wikidata we used to call it a 'botpedia' because it was basically just an echo chamber of bots talking to each other. People were writing bots to import information from infoboxes on Wikipedia. A heavy focus of this was data about persons from authority files.

Authority files?

An authority file is a Library Science term that is basically a numbering system to assign authors unique identifiers. The point is to avoid a "which John Smith?" problem. At last year's Wikimania I said that Wikidata itself has become a kind of "super authority control" because now it connects so many other organisations' authority control (e.g. Library of Congress and IMDB). In the future I can imagine Wikidata being the one authority control system to rule them all.

”

Given present quality levels, this seems like a nightmare scenario: the Internet's equivalent of the Tower of Babel.

What is a reliable source?

An aardvark

A crowdsourced project like Wikidata becoming "the one authority control system to rule them all" is a very different vision from the philosophy guiding Wikipedia. Wikipedians, keenly aware of their project's vulnerabilities and limitations, have never viewed Wikipedia as a "reliable source" in its own right. For example, community-written policies expressly forbid citing one Wikipedia article as a source in another (WP:CIRCULAR):

“

Content from a Wikipedia article is not considered reliable unless it is backed up by citing reliable sources. Confirm that these sources support the content, then use them directly.

”

Wikidata abandons this principle—doubly so. First, it imports data referenced only to Wikipedia, treating Wikipedia as a reliable source in a way Wikipedia itself would never allow. Secondly, it aspires to become itself the ultimate reliable source—reliable enough to inform all other authorities.

For example, Wikidata is now used as a source by the Virtual International Authority File (VIAF), while VIAF in turn is used as a source by Wikidata. In the opinion of one Wikimedia veteran and librarian I spoke to at the recent Wikiconference USA 2015, the inherent circularity in this arrangement is destined to lead to muddles which, unlike the Brazilian aardvark hoax, will become impossible to disentangle later on.
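
The circularity can be pictured as a cycle in a directed "uses as a source" graph. The following is a minimal sketch, assuming a small hand-built graph: the two edges between Wikidata and VIAF come from the paragraph above, the edge from Wikidata to Wikipedia reflects the imports described earlier, and the function name `find_cycle` is an illustrative choice, not part of any Wikimedia tool.

```python
def find_cycle(graph):
    """Return one cycle in a directed graph as a list of nodes, or None.

    `graph` maps each node to the list of nodes it uses as a source.
    """
    visiting, done = set(), set()
    path = []

    def visit(node):
        visiting.add(node)
        path.append(node)
        for source in graph.get(node, []):
            if source in visiting:
                # Back edge: the cycle runs from `source` round to `source` again.
                return path[path.index(source):] + [source]
            if source not in done:
                cycle = visit(source)
                if cycle:
                    return cycle
        visiting.discard(node)
        done.add(node)
        path.pop()
        return None

    for node in graph:
        if node not in done:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


# Hand-built illustration: each key uses the listed nodes as sources.
uses_as_source = {
    "Wikidata": ["Wikipedia", "VIAF"],
    "VIAF": ["Wikidata"],
    "Wikipedia": [],
}

print(find_cycle(uses_as_source))
```

In this toy graph the search reports the loop Wikidata, VIAF, Wikidata. Once a value has travelled round such a loop, each side can point to the other as its authority, which is what makes later disentangling so hard.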

The implications of a non-attribution licence

Not an aardvark

The lack of references within Wikidata makes verification of content difficult. This flaw is only compounded by the fact that its CC0 licence encourages third parties to use Wikidata content without attribution.

Max Klein provided an insightful thought on this in the interview he gave last year, following Wikimania 2014:

“

Wikidata uses a CC0 license which is less restrictive than the CC BY SA license that Wikipedia is governed by. What do you think the impact of this decision has been in relation to others like Google who make use of Wikidata in projects like the Google Knowledge Graph?

Wikidata being CC0 at first seemed very radical to me. But one thing I noticed was that increasingly this will mean where the Google Knowledge Graph now credits their "info-cards" to Wikipedia, the attribution will just start disappearing. This seems mostly innocent until you consider that Google is a funder of the Wikidata project. So in some way it could seem like they are just paying to remove a blemish on their perceived omniscience. But to nip my pessimism I have to remind myself that if we really believe in the Open Source, Open Data credo then this rising tide lifts all boats.

”

Klein seems torn between his lucid rational assessment and his appeal to himself to "really believe in the Open Source, Open Data credo". Faith may have its rightful place in love and the depths of the human soul, but our forebears learned centuries ago that when you are dealing with the world of facts, belief is not the way to knowledge: knowledge comes through doubt and verification.

What this lack of attribution means in practice is that the reader will have no indication that the data presented to them comes from a project with strong and explicit disclaimers. Here are some key passages from Wikidata's own disclaimer:

“

Wikidata cannot guarantee the validity of the information found here.

[...] No formal peer review[:] Wikidata does not have an executive editor or editorial board that vets content before it is published. Our active community of editors uses tools such as the Special:Recentchanges and Special:Newpages feeds to monitor new and changing content. However, Wikidata is not uniformly peer reviewed; while readers may correct errors or engage in casual peer review, they have no legal duty to do so and thus all information read here is without any implied warranty of fitness for any purpose or use whatsoever. None of the contributors, sponsors, administrators or anyone else connected with Wikidata in any way whatsoever can be responsible for the appearance of any inaccurate or libelous information or for your use of the information contained in or linked from these web pages [...] neither is anyone at Wikidata responsible should someone change, edit, modify or remove any information that you may post on Wikidata or any of its associated projects.

”

Internet users are likely to take whatever Google and Bing tell them on faith. As a form of enlightenment, it looks curiously like a return to the dark ages.

When a single answer is wrong

Jerusalem—one of the most contested places on earth

This obscuring of data provenance has other undesirable consequences. An article published in Slate this week (Nov. 30, see this week's In the Media) introduces a paper by Mark Graham of the Oxford Internet Institute and Heather Ford of the School of Media and Communication at the University of Leeds. The paper examines the problems that can result when Wikidata and/or the Knowledge Graph provide the Internet public with a single, unattributed answer.

Ford and Graham say they found numerous instances of Google Knowledge Graph content taking sides in the presentation of politically disputed facts. Jerusalem, for example, is described in the Knowledge Graph as the "capital of Israel". Most Israelis would agree, but even Israel's allies (not to mention the Palestinians, who claim Jerusalem as their own capital) take a different view, a controversy well explained in the lead of the English Wikipedia article on Jerusalem, which tells its readers, "The international community does not recognize Jerusalem as Israel's capital, and the city hosts no foreign embassies." Graham provides further examples in Slate:

“

A search for "Londonderry" (the name used by unionists) in Northern Ireland is corrected to "Derry" (the name used by Irish nationalists). A search for Abu Musa lists it as an Iranian island in the Persian Gulf. This stands in stark contrast to an Arab view that the island belongs to the United Arab Emirates and that it is instead in the Arabian Gulf. In response to a search for Taipei, Google claims that the city is the capital of Taiwan (a country only officially recognized by 21 U.N. member states). Similarly, the search engine lists Northern Cyprus as a state, despite only one other country recognizing it as such. But it lists Kosovo as a territory, even though it's formally recognized by 112 other countries.

My point is not that any of these positions are right or wrong. It is instead that the move to linked data and the semantic Web means that many decisions about how places are represented are increasingly being made by people and processes far from, and invisible to, people living under the digital shadows of those very representations. Contestations are centralized and turned into single data points that make it difficult for local citizens to have a significant voice in the co-construction of their own cities.

”

Ford and Graham reviewed Wikidata talk page discussions to understand the consensus-forming process there, and found users warring and accusing each other of POV pushing, context that almost none of the Knowledge Graph's readers will ever be aware of.

In Ford's and Graham's opinion, the envisaged movement of facts from Wikipedia to Wikidata and thence to the Google Knowledge Graph has "four core effects":

“

data becomes a) less nuanced, b) its provenance (or source) is obscured, c) the agency of users to contest information is diminished and d) the use of personalised filters mean that users cannot see how the information presented to them is different from what is being presented to others. […]

We know that the engineers and developers, volunteers and passionate technologists are often trying to do their best in difficult circumstances. But there need to better attempts by people working on these platforms to explain how decisions are made about what is represented. These may just look like unimportant lines of code in some system somewhere, but they have a very real impact on the identities and futures of people who are often far removed from the conversations happening among engineers.

”

This is a remarkable reversal, given that Wikimedia projects have traditionally been hailed as bringing about the democratisation of knowledge.

Conclusions

Errors can always be fixed

From my observation, many Wikimedians feel problems such as those described here are not all that serious. They feel safe in the knowledge that they can fix anything instantly if it's wrong, which provides a subjective sense of control. It's a wiki! And they take comfort in the certainty that someone surely will come along one day, eventually, to fix any other error that might be present today.

This is a fallacy. Wikimedians are privileged by their understanding of the wiki way; the vast majority of end users would not know how to change or even find an entry in Wikidata. As soon as one stops thinking selfishly, and starts thinking about others, the fact that any error in Wikidata or Wikipedia can potentially be fixed becomes secondary to the question, "How much content in our projects is false at any given point in time, and how many people are misled by spurious or manipulated content every day?" Falsehoods have consequences.

Faced with quality issues like those in Wikidata, some Wikimedians will argue that cleverer bots will, eventually, help to correct the errors introduced by dumber bots. They view dirty data as a welcome programming challenge, rather than a case of letting the end user down. But it seems to me there needs to be more emphasis on controlling incoming quality, on problem prevention rather than problem correction. Statements in Wikidata should be referenced to reliable sources published outside the Wikimedia universe, just like they are in Wikipedia, in line with the WP:Verifiability policy.

Wikidata development was funded by money from Google and Microsoft, who have their own business interests in the project. These ties mean that Wikidata content may reach an audience of billions. It may make Wikidata an even greater honey pot to SEO specialists and PR people than Wikipedia itself. Wikis' vulnerabilities in this area are well documented. Depending on the extent to which search engines will come to rely on Wikidata, and given the observed loss of nuance in Knowledge Graph displays, an edit war won in an obscure corner of Wikidata might literally re-define truth for the English-speaking Internet.

If information is power, this is the sort of power many will desire. They will surely flock to Wikidata, swelling the ranks of its volunteers. It's a propagandist's ideal scenario for action. Anonymous accounts. Guaranteed identity protection. Plausible deniability. No legal liability. Automated import and dissemination without human oversight. Authoritative presentation without the reader being any the wiser as to who placed the information and which sources it is based on. Massive impact on public opinion.

... to rule them all

As a volunteer project, Wikidata should be done well. Improvements are necessary. But, looking beyond the Wikimedia horizon, we should pause to consider whether it is really desirable for the world to have one authority—be it Google or Wikidata—"to rule them all". Such aspirations, even when flying the beautiful banner of "free content", may have unforeseen downsides when they are realised, much like the ring of romance that was made "to rule them all" in the end proved remarkably destructive. The right to enjoy a pluralist media landscape, populated by players who are accountable to the public, was hard won in centuries past. Some countries still do not enjoy that luxury today. We should not give it away carelessly, in the name of progress, for the greater glory of technocrats.

One last point has to be raised: Denny Vrandečić combines in one person the roles of Google employee, community-elected Wikimedia Foundation board member and Wikidata thought leader. Given the Knowledge Graph's importance to Google's bottom line, there is an obvious potential for conflicts of interest in decisions affecting the Wikidata project's licensing and growth rate. While Google and Wikimedia are both key parts of the world's information infrastructure today, the motivations and priorities of a multi-billion-dollar company that depends on ad revenue for its profits and a volunteer community working for free, for the love of knowledge, will always be very different.

Further reading

Ford, H., and Graham, M. 2016. "Semantic Cities: Coded Geopolitics and the Rise of the Semantic Web". In Code and the City, eds. Kitchin, R., and Perng, S-Y. London: Routledge (in press).
Crum, C. October 29, 2013. "Google's 'Misinformation Graph' Strikes Again". WebProNews.
"Quality issues". November–December 2015 (Wikimedia-l mailing list discussion).
Taraborelli, D. 2015. "Wikipedia as the Front Matter to All Research". Wikimedia Foundation. Presentation video and slides available online.

Andreas Kolbe has been a Wikipedia contributor since 2006. He is a member of the Signpost's editorial board. The views expressed in this editorial are his alone and do not reflect any official opinions of this publication. Responses and critical commentary are invited in the comments section.

2015-12-02

Online harassment consultation; High voter turnout at ArbCom elections

By Andreas Kolbe

2015 voter edit count distribution vs. 2014 and 2013

This year's Arbitration Committee elections are seeing substantially higher voter turnout than in previous years, a reflection of the use of mass messaging to notify volunteers eligible to vote. SecurePoll's voters' list shows 2,778 votes cast at the time of this writing (including some repeat votes from users who changed their votes). Last year's ArbCom elections, by comparison, attracted just 593 valid votes.

The population of voters taking part also appears to be shaping up very differently from past years. According to an analysis by Opabinia regalis, herself a candidate in the election, the percentage of voters with relatively few contributions (150–5,000 edits) is markedly higher this year. However, early fears that the mass messaging would attract large numbers of voters who had not edited the English Wikipedia in years, and were consequently out of touch with the project, seem to have been unfounded. According to a post by Opabinia regalis on 1 December 2015, based on data available at the time, only 162 voters would have been filtered out by a "must have edited Wikipedia in the last three months before the election" eligibility criterion.

The elections are scheduled to close at 23:59 (UTC) on Sunday, December 6, 2015.

2015-12-02

Is Wikidata as transparent as it seems?; Wikimedia Fund-raising drive launches

By Mdann52

Jerusalem's Old City, at the centre of this week's article

This week, Slate commented on how misinformation can spread quickly now that Google sources more and more content automatically from databases such as Wikidata, with little human intervention (Nov. 30). The article points to a search for Jerusalem, which returns the result "Capital of Israel" in the Google Knowledge Graph, even though the city's status is in fact intensely contested. The author, Mark Graham of the Oxford Internet Institute, argues that because so much information and sourcing is stripped away, the result can easily be less transparency about where information comes from, and a lack of context when interpreting it.

“

... because of the ease of separating content from containers, the provenance of data is often obscured. Contexts are stripped away, and sources vanish into Google’s black box. For instance, most of the information in Google’s infoboxes on cities doesn’t tell us where the data is sourced from.

”

Does the Wikimedia Foundation really need more money?

The amount the WMF are asking for is the price of a cup of coffee.

Following the launch of the Wikimedia Foundation's annual English-speaking fund-raising drive this week, The Washington Post published a piece (December 2), commenting that the language of the banner may well lead readers to think ...

“

... that the world's seventh-largest site risks going dark if you don't donate. In reality, that couldn’t be further from the case.

”

They also point out that Wikipedia's drive is controversial within the community:

“

At other nonprofits, of course, even those in the media space, fundraising drives rarely provoke such contempt

”

One of the files involved with the lawsuit.

"Monkey see, monkey sue...?": Ars Technica covers the latest developments in the PETA lawsuit against David Slater[citation needed] over the monkey selfies (December 5). PETA asserts that Naruto, the monkey named in their complaint, actually holds the rights, a claim disputed by the defendants, who argue that even if a monkey could hold the copyright, Naruto wasn't the monkey who operated the camera's shutter, and that the suit should be dismissed on these grounds alone.
Vandalism following Syria vote: the Express reports (December 4) that, following a House of Commons vote to authorise air-strikes on Syria, the Wikipedia articles of several Labour defectors, including Hilary Benn, were vandalised.
VIP access after Wikipedia vandalism: Several outlets, including the Guardian, the BBC and the Independent, reported on a fan who gained backstage access at a performance by Peking Duk after adding himself as a family member in the act's Wikipedia article (December 3).
Jimmy Wales to lobby China: Channel News Asia reports (December 2) that Wikipedia co-founder Jimmy Wales plans to fly out to China in the next few weeks to discuss getting Wikipedia unblocked in China.
Artificial Intelligence tool launched: MIT Technology Review (November 30), the BBC (December 2) and others reported on the launch of ORES (Objective Revision Evaluation Service) this week. The tool aims to automatically flag up low-quality edits to articles. (Read more in The Wikimedia Blog.)

Do you want to contribute to "In the media" by writing a story or even just an "in brief" item? Edit next week's edition in the Newsroom or contact the editor.

2015-12-02

Jonesing for episodes

By Serendipodous

Once again, disaster fatigue has set in, and viewers are taking comfort in the bosom of mass media. The vast majority of articles on this list are either pop culture-related, returnees from previous weeks or years, or both. The main focus of that interest was the latest Marvel Netflix series, Jessica Jones, which took the top two slots.

For the full top-25 list, see WP:TOP25. See this section for an explanation of any exclusions. For a list of the most edited articles of the week, see here.

As prepared by Serendipodous, for the week of November 22 to 28, 2015, the 10 most popular articles on Wikipedia, as determined from the report of the most viewed pages, were:

Rank	Article	Views	Notes
1	Jessica Jones	1,918,186	The Netflix series based on this Marvel Comics superhero, starring Krysten Ritter (pictured), debuted on November 20, 2015, and, like its predecessor, Daredevil, has shot to the top of this list. Pandemic binge-watching of the latter among MCU fans led to a rapid decline in interest, as everyone scoffed down the entire season in two days. It will be interesting to see whether this series, which is far more thematically complex and problematic than Daredevil and thus more likely to trigger debate, will share the same fate. As before, ratings don't really apply to Netflix shows, but the critics have given this show almost as much love as Daredevil, with a 92% RT rating.
2	Jessica Jones (TV series)	1,599,328	See above.
3	Adele	1,389,379	Up from 903,238 views last week, as the popular singer's new album 25 (#20) was released on November 20 and set about rewriting sales records all over the world.
4	Thanksgiving	1,384,676	This beloved North American holiday has, in the past, been very ill-used by Wikipedia viewers. Every year, when it came around, immediately money-spinning spammers started flooding Wikipedia with fake views for this article, thus forcing us to remove what should have been a perfectly acceptable annual addition to this list. This is the first time since the project began that I can safely say that the article has been included entirely on its own merits without any, ahem, stuffing.
5	Lucy (Australopithecus)	1,148,394	In 1974 the discovery of this half-complete fossil assembly in Ethiopia's Hadar Formation overturned centuries of anthropocentric thinking by showing that our evolutionary big break from the animal kingdom came not when we started getting smarter, but when we started walking upright. On November 24, the 41st anniversary of her discovery, her "birthday" was celebrated by a Google Doodle.
6	Black Friday (shopping)	1,129,549	The day after Thanksgiving is also the day that retailers have earned enough to cover their debts from the previous year, and are thus "in the black" (at least, that's what they say; in truth it probably originated as a reaction to the traffic). Because of this, they often mark down their prices, leading this to become a major day on the shopping calendar and the unofficial start of the Christmas shopping season. Over in the UK, where I live, more pious commentators have been staring at this phenomenon with something like horror, decrying its gradual "consecration" as a holy day for the new religion of consumerism. They may be right.
7	Lady Colin Campbell	828,117	Campbell is a British socialite who is now appearing on the new season of Britain's I'm a Celebrity...Get Me Out of Here!, which debuted on November 15. This reality series is one whose American version floundered, but has enjoyed great success elsewhere including in Britain, and in Germany where it is nicknamed "Das Dschungelcamp". Her continuing popularity may also be due to her interesting backstory, as she was born intersex and raised as a boy, though her androgynous name is due to her brief and ill-fated marriage to Lord Colin Campbell, son of the 11th Duke of Argyll.
8	Survivor Series (2015)	774,114	As expected, this professional wrestling event, held on November 22, got a fifty percent boost in numbers this week. Roman Reigns (pictured) won the main event.
9	The Man in the High Castle (TV series)	763,741	Amazon Video's big competitor to Netflix's stable, an adaptation of Philip K. Dick's dystopian alternate history set in an America ruled by the victorious Axis Powers, put up eight of its ten episodes on November 20.
10	Islamic State of Iraq and the Levant	742,427	If they are to be believed, the repellent non-state has finally managed to extend its war beyond its shredded borders and into the heart of the West. This is an unprecedented escalation from them, but then, if there's one thing they've proven themselves good at in the last few years, it's unprecedented escalation. Some see it as desperation; ISIL have suffered numerous substantial losses from bombing and Kurdish incursions. Others have pondered if it marks the first shot in a new generational conflict.

2015-12-02

This Week's Featured Content

By Armbrust

This Signpost "Featured content" report covers material promoted from 22 to 28 November.
Text may be adapted from the respective articles and lists; see their page histories for attribution.

Featured articles

Eight featured articles were promoted this week.

Stefan Lochner (nominated by Ceoil) (c. 1410–1451) was a German painter working in the late "soft style" of the International Gothic. His paintings combine that era's tendency towards long flowing lines and brilliant colours with the realism, virtuoso surface textures and innovative iconography of the early Northern Renaissance. Based in Cologne, Lochner was one of the most important German painters before Albrecht Dürer.
Suillus bovinus (nominated by Sasata and Casliber) is a pored mushroom of the genus Suillus in the family Suillaceae. A common fungus native to Europe and Asia, it has been introduced to North America and Australia. The fungus grows in coniferous forests in its native range, and pine plantations in countries where it has become naturalised. It forms symbiotic ectomycorrhizal associations with living trees by enveloping the tree's underground roots with sheaths of fungal tissue, and is sometimes parasitised by the related mushroom Gomphidius roseus.
"From The Doctor to my son Thomas" (nominated by Cirt) is a viral video recorded by actor Peter Capaldi and sent to the autistic nine-year-old Thomas Goodall to console him over grief from the death of his grandmother. Capaldi filmed the video in character as the 12th incarnation of the Doctor in the BBC science-fiction series Doctor Who. Capaldi's message had a positive effect on Thomas; his father said the boy smiled for the first time since learning of his grandmother's death and gained the courage to go to her funeral.
The Norse-American medal (nominated by Wehwalt) was struck at the Philadelphia Mint in 1925, pursuant to an act of Congress. It was issued for the 100th anniversary of the voyage of the ship Restauration, bringing early Norwegian immigrants to the United States. The medals recognize those immigrants' Viking heritage, depicting a warrior on the obverse and a vessel on the reverse. They also recall the early Viking explorations of North America.
Smilodon (nominated by FunkMonk and LittleJerry) is an extinct genus of machairodont felid, commonly known as the saber-toothed tiger. It lived in the Americas during the Pleistocene epoch. The genus was named in 1842, based on fossils from Brazil. Smilodon hunted large herbivores such as bison and camels, and it remained successful even when encountering new prey species. It died out at the same time that most American megafauna disappeared.
History of York City F.C. (1908–80) (nominated by Mattythewhite) York City F.C. is a professional association football club based in York, England. The history of the club from 1908 to 1980 covers the period from the club's original foundation, through their reformation and progress in the Football League, to the end of the 1979–80 season.
The 2006 UAW-Ford 500 (nominated by Bentvfan54321) was a stock car racing competition which took place on October 8, 2006. Held at Talladega Superspeedway in Talladega, Alabama, the 188-lap race was the thirtieth in the 2006 NASCAR Nextel Cup Series and the fourth in the ten-race, season-ending Chase for the Nextel Cup.
Hurricane Fay (nominated by Juliancolton) was the first hurricane to make landfall on Bermuda since Hurricane Emily in 1987. The sixth named storm and fifth hurricane of the 2014 Atlantic hurricane season, Fay began as a subtropical cyclone on October 10, transitioning into a tropical storm on October 11, briefly achieving Category 1 hurricane status on October 12, and eventually degenerating into an open trough on October 13.

Featured lists

Four featured lists were promoted this week.

The class the stars fell on (nominated by Hawkeye7) is an expression used to describe the United States Military Academy Class of 1915. In the United States Army, the insignia reserved for generals is one or more stars. Of the 164 graduates that year, 59 attained the rank of general, the most of any class in the history of the United States Military Academy at West Point, New York, hence the expression.
List of Padma Bhushan award recipients (1954–1959) (nominated by Vivvt) The Padma Bhushan is the third-highest civilian award of the Republic of India. Instituted on 2 January 1954, the award is given for "distinguished service of a high order", without distinction of race, occupation, position, or sex. The recipients receive a Sanad (certificate) signed by the President of India and a circular-shaped medallion with no monetary grant associated with the award. In its first six years 94 awards were conferred.
List of Connecticut Huskies in the NFL Draft (nominated by Grondemar and Robert4565) The University of Connecticut Huskies football team has had 39 players selected in the National Football League Draft. Two of those selections were in the first round of the draft, both with the 27th overall pick: Donald Brown in 2009 and Byron Jones in 2015. A Connecticut football alumnus has been selected in every NFL Draft since 2007 and ten of the last eleven NFL Drafts.
The AAA Mega Championship (nominated by Wrestlinglover) is a professional wrestling world heavyweight championship in the Mexican Asistencia Asesoría y Administración promotion. The championship is generally contested in professional wrestling matches in which participants execute scripted finishes rather than in direct competition. Since its inception in 2007, eight wrestlers have held the title for a total of twelve reigns. The title is currently vacant.