Wikipedia:Turnitin/Objections

This page is for compiling and exploring objections to a potential collaboration with Turnitin.

Turnitin

Who's Turnitin again? Why should we trust some random company?

Globally, Turnitin evaluates about 40 million papers a year for text-matches. During final exam periods, the site processes 400 new submissions per second. As of 2012, Turnitin serves about 10,000 institutions in 13 languages in 126 countries. More than 2,500 higher education institutions use Turnitin, including 69 percent of the top 100 colleges and universities (U.S. News and World Report Best Colleges list). Almost 5,000 middle and high schools use Turnitin, including 56 percent of the top 100 high schools (U.S. News and World Report's America's Best High Schools). In Colorado, Turnitin is used by 100 schools--both secondary and higher education--and more than 200,000 students. More than 100 colleges use Turnitin to detect plagiarism in application essays. Turnitin's parent company iParadigms employs almost 100 people. It is backed by the private equity firm Warburg Pincus. It has 8 global offices serving almost 130 countries. It is headquartered in Oakland, California. Turnitin is one of the largest and best-known companies that provide this type of service.

Turnitin is not the only or best plagiarism detection service.

There are indeed other plagiarism detection companies which charge for their services. In setting up the project it did not appear that any other company had the history, reputation, knowledge, partnerships, database, code, scope, or scale to address Wikipedia's issues. It is true that not every possible company was approached and given the opportunity to match or better Turnitin's offer. Frankly, Turnitin's offer had so few strings attached, and their reputation as a leading service provider is so well established, that there was little motivation to do so aside from simply avoiding claims that there was no open bidding process. That said, it is necessary to demonstrate that Turnitin is not only known for successfully detecting plagiarism but that they are capable of doing so for Wikipedia. To that end, Turnitin must design and test a system that works, specifically for us. Furthermore, as the collaboration would be non-contractual and non-exclusive, if Wikipedia found a comparable or better service at any point, it could switch to using it instead if it was deemed more beneficial.

Turnitin doesn't crawl the entire web, doesn't have access to some subscription sites (like the New York Times), and doesn't index Google Books, which are stored as images. Thus, Turnitin can't catch all, or even some, common sources of plagiarism, and would give a false sense of security.

The relevant question is whether Turnitin would be an improvement over our current copyright detection regime, not whether it is perfect. Also any use of Turnitin would come with a clear disclaimer: "This report does not prove a copyright violation; only an editor's close inspection can do that. Absence of a text match does not mean the text is guaranteed to be free from copyright." Turnitin has access to about 20 billion web pages and also 100 million additional articles from content publishers including library databases, textbooks, digital reference collections, subscription-based publications, homework helper sites and books. Some of these are sources that we've never had the ability to check before (except manually), even using existing tools such as MadmanBot. Whether Turnitin is in fact an improvement and an asset to us would need to be tested and demonstrated by them during a pilot program. They're willing to invest the time to conduct one, so we should at least consider seeing how well they can perform.

There's something in it for Turnitin.

Yes. There are actually many things in it for Turnitin. They get to help Wikipedia. They get to advance their core values involving education. They get increased visibility in the Wikipedia community both among editors and readers. They get to say they work on Wikipedia. They get to put their algorithm and database to a novel use. The collaboration is supposed to be mutually beneficial. It is not pretending to be a purely altruistic arrangement, although Turnitin does seem to support Wikipedia's mission and believes Turnitin's services complement and further it.

We would be dependent on Turnitin for a vital service with no guarantee they'll continue to provide it.

This is true. The agreement would be informal and non-contractual. We could end it at any time, as could they, for any reason. Mutual benefit from the collaboration reduces the possibility of this happening, but it is of course possible. Reasons to end the partnership could be a lack of effectiveness in helping Wikipedia, the discovery of a comparable free or open-source system, changing attitudes on cooperating with companies, lack of increased visitors or customers for Turnitin, or reevaluation that the service is too taxing on Turnitin's servers for too little benefit. Any of those could happen. The question is, if they don't, is the collaboration worth pursuing for as long as we can keep it going?

Turnitin could try to charge us later for what they initially offer for free.

Turnitin has expressed no such intentions. They understand Wikipedia is a 'public good'. There is no contract as part of the arrangement, and any collaboration would operate with freedom of exit for either side. If Turnitin later asked for payment for services provided, the community could reevaluate and stop using Turnitin immediately. Turnitin would then lose the benefit from working with us, so there is a strong disincentive for them to do so.

There are free or open source alternatives.

There are indeed many free alternatives. On inspection, it appears that they only provide one aspect of the plagiarism detection process, which is a web crawler. This approach lacks one of Turnitin's core strengths, which is its database of millions of books and articles which it has developed through proprietary partnerships with various content providers. Also, Turnitin's web crawler may be superior for the purposes of finding plagiarism. It uses a pattern-matching algorithm that has been developed over two decades and which is different from standard keyword-matching algorithms used by search indexes such as Google, Bing and Yahoo. Turnitin's web index is also very large, up to 20 billion pages. Turnitin has devoted thousands of hours and hundreds of employees to developing their system, expanding it, and refining it--a process that free alternatives simply can't invest in. Last, free alternatives are unlikely to scale in a systematic and massive way, such as using them to check every single Wikipedia article on some regular basis.
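
To make the contrast with keyword matching concrete, here is a minimal sketch of one generic fingerprinting approach: comparing overlapping word sequences ("shingles") between an article and a candidate source. This is an illustration only. Turnitin's code is proprietary, and nothing here is claimed to resemble it; the function names `shingles` and `percent_matched` and the window size of 5 words are assumptions made for the example.

```python
import re


def shingles(text, n=5):
    """Return the set of overlapping n-word sequences in the text.

    Hypothetical helper: a generic fingerprinting technique,
    not Turnitin's actual (proprietary) method.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def percent_matched(article, source, n=5):
    """Percentage of the article's shingles that also occur in the source."""
    article_shingles = shingles(article, n)
    if not article_shingles:
        # Text shorter than n words has no shingles to compare.
        return 0.0
    shared = article_shingles & shingles(source, n)
    return 100.0 * len(shared) / len(article_shingles)
```

Unlike a keyword search, this kind of comparison only counts runs of words that appear in the same order, so two texts that merely share a topic and vocabulary score low, while copied passages score high.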

Turnitin's code is proprietary and we shouldn't rely on it if we can't see it.

It's not necessary for us to see Turnitin's code to know that it works, or at least that it works better than current systems. We can take into account Turnitin's reputation and history. We can also evaluate the results of a pilot program ourselves and check to see what Turnitin catches and what it does not, and with what frequency and reliability. As a community which shares many of the open source movement's goals, it may be ideal for Wikipedia to use only open source products, but it is not an obligation. It may simply be pragmatic and beneficial for us, at least in the short or medium term, to collaborate with those who have the extensive time, capital, resources, and motivation that are frequently (but not exclusively) found in successful private companies.

Turnitin has been sued, or protested against, at various schools and by specific students ([1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17]). An appeals court found that Turnitin's storing of documents was to the public benefit, highly transformative, caused no harm to the market value of those submissions, and constituted fair use. In a Wikipedia-specific context, all content is released under a CC-BY-SA license anyway, which permits use, modification, or commercial reuse. It seems that having Wikipedia stored in Turnitin's database is therefore not an issue; moreover, Turnitin already stores every Wikipedia article and revision in its database.

Wikipedia

This is an endorsement of a private company. Wikipedia doesn't do that.

This is a collaboration with a private company. It is explicitly not an endorsement. We would remain free to use other competing services if we choose to do so, or to discontinue the collaboration at any time.

Turnitin will advertise that they provide Wikipedia a service; regardless of semantics, the public will assume this is an endorsement of their corporation by Wikipedia.

Turnitin would be proud to collaborate with Wikipedia and want to share that news. Turnitin will have to be very clear about the context of the collaboration, using language that only reflects the specific extent that achieved consensus in the community. That might include saying that they are donating services to Wikipedia, that they are 'checking' Wikipedia, or that they are 'working with' Wikipedia to help resolve copyright violations. The community can refine this language and come to an understanding and agreement with Turnitin. If the collaboration was pursued, it would reflect that Wikipedia found Turnitin provided a useful service, but that is different from formally or exclusively endorsing the company.

We do/can/should do this ourselves.

In exploring a collaboration with Turnitin, a necessary question is whether current tools are sufficient, and whether they could or should be expanded rather than replaced by Turnitin's system. CorenSearchBot and/or MadmanBot currently check new articles for copyright violations. There are limitations to those bots: they do not check existing Wikipedia content, they only check articles against webpages, not a content database, and they don't have a corpus of prior submissions. It's possible that our bots are not as developed as the proprietary code by Turnitin and that their web crawler is not as extensive (Turnitin's indexes 20 billion pages). Turnitin does not use keyword-matching; instead its proprietary code uses various forms of pattern recognition and document fingerprinting. Our bots are limited to running one check per 5 seconds, which would allow them to check over 6 million articles yearly, enough to cover English Wikipedia almost twice; however, it's not clear if that level of operation is feasible. Our bots do not generate an itemized report which allows editors to actually see and compare plagiarized sections or identify the various sources which result in the match (for recent Coren's bot reports, see User:CorenSearchBot/manual). Also, our bots do not have access to a content database like Turnitin's, which contains millions of articles and journals. To determine which path forward is best, Turnitin needs to explain and demonstrate how they would approach analyzing Wikipedia content. Also, Coren, Madman, or others in the community would have to suggest or propose on-Wiki methods which were comparable. Still, it's highly unlikely that on-Wiki volunteers could design a system specifically for Wikipedia or have the resources and server capacity to check all of Wikipedia on a regular basis. In the end, there is nothing preventing us from using Turnitin only up until the point that one of our own tools is sufficient to do the job. In the meantime, there are significant pragmatic reasons to consider a collaboration.
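
The throughput figure for our bots can be checked with simple arithmetic. The sketch below assumes a bot running continuously all year with no downtime; the article count is a placeholder assumption (roughly the size of English Wikipedia in 2012) and should be replaced with the current figure.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def checks_per_year(seconds_per_check):
    """Number of checks a bot can run in a year at a fixed rate, with no downtime."""
    return SECONDS_PER_YEAR // seconds_per_check


yearly_checks = checks_per_year(5)  # 6,307,200 checks at one per 5 seconds

# Assumption for illustration only: about 4 million articles.
ASSUMED_ARTICLE_COUNT = 4_000_000
full_passes = yearly_checks / ASSUMED_ARTICLE_COUNT

print(f"{yearly_checks:,} checks per year")
print(f"{full_passes:.2f} full passes over {ASSUMED_ARTICLE_COUNT:,} articles")
```

Any real deployment would fall short of this ceiling, since it ignores outages, rate limits on the search services the bots query, and re-checks of the same article.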

If we do this, we lose our independence.

Wikipedia is unique because it is wholly community owned and operated. Turnitin would not change that. We would control the degree, process, and ability to exit from any collaboration with Turnitin. Turnitin has not requested anything on 'their terms' besides attribution where its reports are linked and permission to tell its own clients and the public about its involvement with Wikipedia. In short: we would benefit from the collaboration, could freely leave it, and would not have to make any other concessions. If losing a little independence is the cost of all of our articles being checked for copyright violations, then the community should weigh not only its principles, but also the opportunity cost. Sometimes we can achieve more with the help of others, even a private company, than by acting alone.

If we do this, we lose our neutrality.

Broadly, Wikipedia would be using Turnitin as a tool rather than entering a partnership or contract. There is no expectation of compensation beyond attribution, and it can be explicitly stated in the proposal that collaborating with Turnitin gives them no special right to attention or protection in articles related to their company. That's just text, of course, so it would be incumbent upon the community to maintain the neutrality of the specific articles related to Turnitin or plagiarism. If past work with conflicts of interest is any indication, subjects that are at the focus of a COI are often scrutinized more, not less, and Turnitin could expect to receive extra attention. Wikipedia manages to write articles about topics related to itself without particular difficulty; it doesn't seem particularly more difficult to write about a company we are collaborating with. Surely, many in the community will want to go out of their way to make those articles comprehensive, critical where appropriate, and above all, neutral.

We're exposing ourselves to 3rd party liability.

Wikimedia Foundation's legal counsel will have addressed this issue and decided if there are any risks or concerns. Using a Turnitin report does not endorse the company's operations or align us with their actions. Using a Turnitin report and finding a low match-level is also not a guarantee that an article is free from copyright violations, as it is only a starting point for investigations. The relevant notices and help documentation can state that plainly and up front.

If we compromise our principles here, we will make even worse concessions going forward.

This assumes we are compromising our principles here, which needs to be discussed and demonstrated. Also, this is a slippery slope argument, so it has to be taken in context. Claims that Wikipedia will end up branding itself, selling advertisements, or creating formal, exclusive partnerships--though reasonable fears--are just speculation. The main distinction about a collaboration with Turnitin is that it supports a core site operation and policy. We shouldn't preclude ourselves from exploring something that would benefit us specifically just because other hypothetical future arrangements could compromise our neutrality or independence.

Attribution

We're giving away too much.

This collaboration should be mutually beneficial. The arrangement with Turnitin would give them limited attribution off-wiki, on their reports, and the ability to say that they collaborate with Wikipedia. That is it. Although Wikipedia is a completely non-profit operation, Turnitin would be providing tens or hundreds of thousands of dollars' worth of services for the research, resources, and execution that would comprise ongoing work. If implemented effectively, Turnitin's services could lead to significant reforms or even a complete overhaul of how copyrighted content on Wikipedia is handled. Seen in that context, Wikipedia is receiving quite a lot and not giving away much.

Attribution is co-branding Wikipedia with a private company.

The level of attribution Turnitin received would have to be approved by the community. One way to think about attribution for these reports is WP:SAYWHEREYOUGOTIT. Even if Turnitin was willing to provide reports anonymously, it would be of some benefit for editors and readers to know about the source of those reports. Although Turnitin's specific code is proprietary, someone researching copyright problems could still learn about the company providing those reports, its approach, and reputation.

We're leading readers to an external site and away from Wikipedia.

Not all relevant information for Wikipedia is stored on Wikipedia. The best example is sources, which are commonly included with links to online copies. This allows readers to pursue further investigation.

The collaboration sets a dangerous precedent for corporatizing and/or commercializing Wikipedia.

Wikipedia is ardently non-corporate and non-commercial, and any action which threatens that principle warrants serious consideration. The first point that needs to be repeated is that this is a non-exclusive, informal collaboration. That might seem like semantics, but it has meaning. There literally is no contract, and we really could use other services or discontinue the collaboration at any time. The only aspect which directly rewards Turnitin is attribution on their reports when used. There is a risk that people will interpret a collaboration as an endorsement or partnership, and we would have to actively correct that and promote what it actually is when we discuss the project. We should ensure Turnitin does the same. In the end, if the community sees that there is any corporatizing or commercializing impact, they will have to weigh its degree and the opportunity cost from turning down the collaboration.

Implementation

Turnitin attempts to be comprehensive but acknowledges that it does not have access to all publications or websites. Further, its algorithms make certain choices regarding what to identify as a match and what to ignore. It would be made clear at all points of the collaboration that Turnitin reports are not conclusive and can't provide proof that something has or does not have copyright violations. First, matches may be to text which is from Wikipedia itself (mirrors) or from a site with a compatible copyright license. Second, Turnitin reports will always just be starting points for an investigation. Editors will have to use their discretion in exploring Turnitin reports and acknowledge that a clear Turnitin report is only proof that Turnitin didn't find any matches. Further investigation may be necessary. If Turnitin does find matches, investigation and confirmation are still necessary. The question is whether having Turnitin's reports gives us another beneficial tool and improves upon our current copyright checking regime. We can test to see if it does, and if it does, then a partial solution is better than no solution.

Turnitin's algorithms won't work, especially on existing (rather than new) articles, because of the many mirrors and copies of Wikipedia content.

Turnitin is well aware of the problem of mirrors and has committed to adapting and designing a separate algorithm that works just for Wikipedia. They are willing to invest the time, energy, and resources into that process. They also realize that no collaboration could be pursued on a wide scale without a well-designed pilot program that rigorously tests the effectiveness of Turnitin's algorithms and reports, and a period of feedback, analysis, and refinement to optimize the functionality of their approach.

Checking every article for plagiarism and copyright violations is assuming bad faith.

In the educational or academic context, some students argue that requiring papers to be checked for plagiarism poisons the student-teacher relationship by making a presumption of guilt. The collaboration with Wikipedia would work differently and in a different context than it does with schools. First, at least until the system was widely tested, adopted, and approved, submissions would never be pre-reviewed; Turnitin reports would only be run post-publication. So, there is no new barrier to entry (such as an edit filter, although the community could potentially explore that later). Second, Wikipedia editors are not students; they are contributors to a live encyclopedia. Our goal is not to educate contributors but to produce a reliable and truly free encyclopedia, as both law and policy require. Suspected copyright violations would nonetheless be handled tactfully, with appropriate levels of article tags and editor notifications handled by thoughtful editors using their discretion.

This is true. Turnitin reports are not conclusions; they are starting points for investigations. The collaboration would do several things to facilitate identifying copyright violations. It would create reports that presented matching text. It would identify the source of those matches. It would add a percentage-matched statistic. That statistic could be used on a Wikipedia project page which ranked articles by percentage-match. Those would be the most likely candidates for an investigation, and having the list would allow suspected copyright issues, copyright problems, and copyright cleanup investigations to prioritize their work. Also, for questions about a specific article, editors could use the Turnitin report to greatly speed up and enhance their work.

Turnitin has conducted research that supports its effectiveness in reducing plagiarism over time. A recent study showed otherwise. In any case, the use of Turnitin on Wikipedia would not primarily be as a deterrent, but as a tool for prioritizing and investigating copyright violations. Copyright violations often happen inadvertently, by editors who are uninformed about what it means to create an original work of writing. Others result from carelessness or desperation. The purpose of the collaboration is not to educate those editors (although that may be a happy side-effect); rather, it is to identify instances that violate the law and/or our policies and to fix them.

Even if effective in deterring plagiarism, Turnitin's reports will only motivate more efforts to circumvent detection, such as through close paraphrasing. The appearance of plagiarism prevention may only reflect that people have become smarter, because of Turnitin, in avoiding detection.

Turnitin's pattern recognition may be able to detect certain degrees and instances of close paraphrasing. This would have to be tested before implementation. Although still a violation, there is some reasoning that close paraphrasing is a lesser offense than outright copying, insofar as it at least disguises itself. On the other hand, close paraphrasing is notoriously difficult to detect, and an increase in the practice would raise real concerns. In the end, Turnitin's inability to detect close paraphrasing may not be a reason to reject the collaboration; after all, we can't detect close paraphrasing with any ease or reliability as it is. Also, the presence of Turnitin reports may encourage writers to make a more vigorous effort to avoid close paraphrasing, which, although not necessarily benevolent in motivation, may ultimately result in text that is different enough that it is no longer close paraphrasing at all, and is just acceptable paraphrasing.

We don't have the editors or infrastructure in place to actually evaluate the plagiarism reports for every article.

This is partly resolved already and at least partly resolvable. There is an extensive copyright detection and cleanup effort at Copyright cleanup investigations, Copyright problems, and Suspected copyright violations. There are a variety of copyright tags and templates for articles which may have or do have problems. There are also various user warnings for the talk pages of contributors who have violated copyright. Turnitin could easily integrate with this existing system, through one of the existing copyright bots such as MadmanBot, or through a new bot that posted article talk page notices with links to Turnitin reports. If Turnitin's algorithm was rigorously tested, we could design bots which automatically tagged pages with a high level of text-matching. Last, Turnitin text-matching scores could be listed on a Wikipedia page in descending order. This would allow copyright investigators to prioritize their efforts. It may be the case that Turnitin reveals more copyright issues than we currently have editors to fix. That may be a problem, but it is ultimately better than not knowing about those issues at all.
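
The prioritization described above could look something like the following sketch, which ranks reports by their match score and picks out the pages a bot might tag. Everything here is hypothetical: the `Report` record, the `prioritize` function, and the 50 percent tagging threshold are assumptions for illustration, not part of any existing bot or of Turnitin's report format. A real threshold would have to come out of the pilot program.

```python
from dataclasses import dataclass


@dataclass
class Report:
    """Hypothetical record of one Turnitin report for one article."""
    title: str
    percent_matched: float


def prioritize(reports, tag_threshold=50.0):
    """Rank reports from highest to lowest match and list titles to tag.

    tag_threshold is an assumed cutoff for illustration only.
    """
    ranked = sorted(reports, key=lambda r: r.percent_matched, reverse=True)
    to_tag = [r.title for r in ranked if r.percent_matched >= tag_threshold]
    return ranked, to_tag


reports = [
    Report("Article A", 12.0),
    Report("Article B", 80.0),
    Report("Article C", 55.0),
]
ranked, to_tag = prioritize(reports)
for r in ranked:
    print(f"{r.percent_matched:5.1f}%  {r.title}")
```

A tagged page would still need an editor's inspection, since a high score may come from a mirror or a compatibly licensed source rather than a violation.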

Providing Turnitin reports would attract editors who just want to submit and check their text through Turnitin's algorithms, as a way to avoid paying for its services. This would introduce a stream of vandalism and unconstructive contributions and could cannibalize Turnitin's paid services.

This is a concern not only for Wikipedia but also for Turnitin, as they have a strong incentive not to cannibalize their own services. The main way this undesirable situation would be avoided is by not running Turnitin reports constantly or after each edit, but instead doing so at periodic intervals. This would remove the immediacy of posting to Wikipedia and then seeing a Turnitin report. (It has not yet been decided what the optimal and achievable timeframe is. Although reports run monthly or after significant additions of new text may be optimal, it might also be the case that reports run as infrequently as yearly would still keep copyright investigators busy for many months.)

Miscellaneous

There must be something in it for Ocaasi.

Ocaasi is neither paid by nor personally affiliated or associated with Turnitin in any way. He read about the company in a news article and contacted them independently. They were interested in the idea, so he brought it back to the community. Ocaasi will receive no compensation, financial or otherwise, as part of this collaboration. He thinks it might benefit the community, but is considering all objections seriously and deferring to whatever the community decides is best.