
Wikipedia:Wikipedia Signpost/2009-11-16/Sister projects

Sister projects

Wiktionary interview

God knows which logo we will be using...

This is an interview about Wikimedia Foundation sister projects. The aim is to help Wikipedia editors understand these projects, in the hope that more will be interested in participating.

This week, the Signpost invited Dominic to explain Wiktionary.

Can you describe what Wiktionary is? What is its history?

Wiktionary is a dictionary, though it also includes material that might typically be found in a thesaurus, a rhyme book, and other similar language-related references. Wiktionary was the first sister project, founded in 2002. Also, more languages have their own Wiktionary than have any other sister project. Unusually, the highest article count among Wiktionaries is held by a non-English edition, the French Wiktionary. And, in terms of total content pages, both the French and English Wiktionaries are larger than all other projects besides the English Wikipedia. Both are quickly nearing 1.5 million total content pages. However, because Wiktionary pages can have any number of definitions in any number of languages on them, that measure does not actually mean all that much (it is estimated that the English Wiktionary actually outpaces the French one in terms of definition count). Also note, as a bit of trivia, that Wiktionary is the one content-oriented sister project that does not start with "wiki-".
One major development in Wiktionary history was the switch from first-character capitalization to first-character case-sensitivity. The decapitalization of Wiktionary (or "decapitation," as some of its detractors called it) initially faced opposition on the English Wiktionary, and around 25 other projects had already voted for and implemented case-sensitivity before the English Wiktionary, in early 2005, voted in favor of it. This created the need for considerable cleanup to put everything in the right place (such as all the entries that are properly capitalized but had been automatically decapitalized in the move). By now, all Wiktionaries are case-sensitive by default. The change was prompted by Wiktionary's unique need for precision in page titles, because of the linguistic implications of changes in letter case. The distinction between words like "analyse" and "Analyse" turns out to be rather important, especially in a multilingual dictionary where a language like German (which capitalizes nouns by convention) coexists with English. Even within English, it is common for capitalization to distinguish completely different words, especially proper nouns from common nouns (cf. "afghan" vs. "Afghan").
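To make the difference concrete, here is a minimal sketch of the two title rules. It is a simplified model rather than the wiki software's actual code, and both function names are hypothetical. It shows how forcing the first character to upper case collapses distinct words onto one page title, while case-sensitive titles keep them apart.

```python
def title_first_capital(word: str) -> str:
    """Old rule (assumed model): the first character is forced to upper case."""
    return word[:1].upper() + word[1:]


def title_case_sensitive(word: str) -> str:
    """Current rule (assumed model): the title is kept exactly as typed."""
    return word


# Word pairs taken from the interview text.
words = ["analyse", "Analyse", "afghan", "Afghan"]

# Under the old rule each pair shares a single page title.
old_titles = {title_first_capital(w) for w in words}

# Under the case-sensitive rule every spelling gets its own title.
new_titles = {title_case_sensitive(w) for w in words}

print(sorted(old_titles))
print(sorted(new_titles))
```

Under the old rule in this model, a word like "pH" could not even be given a correctly spelled title, since its first letter would be capitalized automatically.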
Precision also leads to another peculiarity of Wiktionary: the aversion to redirects. Page titles on Wiktionary are sacred, and are not used to disambiguate words spelled the same but with different meanings, or in different languages. Instead, all senses of all words in all languages that share a particular spelling go on the same page, with different language and part of speech headings to separate them. And whereas on Wikipedia you would expect common alternative spellings and misspellings and different forms of words to redirect you to the article you want, on Wiktionary these forms are generally either common enough (attestable) that they merit their own entry, or not included at all. "color," "colour," "colors," and "colored" all have separate linguistic existences; similarly, "friend" and "freind" both have their own entry, despite the fact that one is commonly considered a misspelling of the other (the entry is marked as such, of course). In any case, many misspellings or word forms are also valid words in their own right, often in different languages, and a redirect at that page title would prevent the addition of those other meanings at that spelling. Capitalization is also an issue of proper spelling for words like "pH."

What is the purpose of Wiktionary? Any aims and objectives?

Wiktionary is a multilingual dictionary, which means that each Wiktionary seeks to encompass all words in all languages, defined in that project's language. For example, the English Wiktionary will have an entry for all words in English, Azeri, Mapudungun, Zulu, and every other language, giving their definitions (or translations, if non-English), etymologies, pronunciations, synonyms, and so on, in English. The French Wiktionary also seeks to include all the same words, but by defining and describing them in French, and translating the non-French words into French.
In my opinion, Wiktionary is actually the most ambitious project currently envisioned by the Wikimedia Foundation, which may come as a surprise to Wikipedians. The "all words in all languages" mission is mind-boggling in its scope if you think about it. The Oxford English Dictionary contains over half a million words (counting the various listed derivatives and forms which don't get their own entries) supported by over two-and-a-half million quotations in 22,000 pages. Of course, the OED, while one of humanity's best attempts at a comprehensive dictionary of any language, is quite incomplete: favoring formal over informal language; British, especially English, regionalisms over those of other regions; well-established words over relatively new terms; print sources and language over electronic ones; in-use language over obsolete; and so on. While Wiktionary lacks many of the OED's entries, it manages to include many English terms that the OED lacks. Many authorities estimate the language has at least a million words. And Wiktionary's goal is not simply all of those, but all words in all languages. While it is likely all or most languages have fewer words than English, the principle that a substantial portion of each language's vocabulary remains undocumented holds true for all, especially for the many languages which lack a strong history of lexicography like English does. With between 5,000 and 10,000 languages in the world, that could mean over 100 million words in living languages alone (which Wiktionary does not limit itself to). And while we have a ways to go, a recent New York Times article noted that the OED adds around 1,000 new words every year, while we have just about doubled every year, which meant adding more than half a million pages in 2008 alone.

I see that Wikipedia often contains the definition of a word already. How is Wiktionary different from other projects?

If you look at a well-made Wiktionary entry, you will see major differences between how Wiktionary goes about defining words and how Wikipedia would. Wiktionary, as a dictionary, is concerned with linguistic information and relationships, not cultural ones. You might be surprised to see what you find at Tigger and Abraham Lincoln, for example.
One important consequence of this is that Wiktionary has no concept of notability. Officially, the mantra is "all words in all languages" (and by "words," we mean any idiomatic linguistic unit, even if it includes spaces or punctuation, like hot dog and live by the sword, die by the sword). The challenge is not determining the notability of a term, but whether it is sufficiently used by the language's speakers to be recognized as a word of that language. This is what we call "attestation." This is the reason that Tigger at Wikipedia is about the fictional character, a notable encyclopedic concept, while Tigger at Wiktionary is about the word, derived from the character, that entered the English language.
Another difference is the sort of evidence that a Wiktionary entry requires for proper citation. Typically, if a Wikipedia article needs to define a word, it will simply cite the definition from other dictionaries; anything less would be original research. As a dictionary, though, it is Wiktionary's job to cite its definitions with actual examples of the words in the literature of the source language, not to rely on secondary works. In order to attest a sense of a word, Wiktionary typically requires three or more independent examples spanning more than a year of it being used (and not merely mentioned) in durably archived sources. Wiktionary might end up quibbling with other dictionaries about inclusion or definition because the sources point it in a different direction, and it also includes many, many terms that are not found in any other dictionaries, especially new, vernacular, vulgar, colloquial, and jargon terms. (See cum dumpster and whore's paint.)
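The attestation rule of thumb described above can be sketched as a small check. This is only an illustration under stated assumptions: the `Citation` record and `is_attested` function are hypothetical, "independent" is approximated as "from distinct sources," and "more than a year" is approximated as more than 365 days between the earliest and latest qualifying quotation. Real attestation is a human judgment, not a script.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable


@dataclass(frozen=True)
class Citation:
    source: str             # hypothetical label for the author or publication
    published: date
    is_use: bool            # True if the word is used, False if merely mentioned
    durably_archived: bool  # True if the source counts as durably archived


def is_attested(citations: Iterable[Citation],
                min_sources: int = 3,
                min_span_days: int = 365) -> bool:
    """Rough model of the rule: enough independent uses over a long enough span."""
    # Only genuine uses in durably archived sources count.
    valid = [c for c in citations if c.is_use and c.durably_archived]
    # Independence is approximated by counting distinct sources (an assumption).
    if len({c.source for c in valid}) < min_sources:
        return False
    dates = [c.published for c in valid]
    # The quotations must span more than the minimum number of days.
    return (max(dates) - min(dates)).days > min_span_days


# Made-up citations, purely for illustration.
example = [
    Citation("A", date(2005, 1, 1), True, True),
    Citation("B", date(2005, 6, 1), True, True),
    Citation("C", date(2006, 3, 1), True, True),
]
print(is_attested(example))
```

Changing any one of the example citations to a mere mention, or squeezing all three into the same year, makes the check fail in this model.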
For further differences, see above where I discussed case-sensitivity, disambiguation, and redirects.

What do you tend to do on Wiktionary?

I have dabbled in several areas. I don't consider myself particularly great at coming up with definitions, but I do enjoy adding new and interesting words, especially idioms, from scratch. The majority of my content edits are probably in adding Spanish entries and translations. Wiktionary is particularly in need of people with skills in languages other than English, since the English language is only a small segment of the project's scope. Even common languages like Spanish still have lots of room for growth. I like doing research, and so citing senses, especially ones that are challenged, can also be interesting.

Ok, I'm the new kid in Wiktionary. What are some of the things that a newcomer can do easily?

There is a tutorial for newcomers. The English Wiktionary's two guiding content policies are the criteria for inclusion (CFI) and entry layout. There is an information desk for asking questions.
Wiktionary can be a good fit for the accomplished word nerd and the mere dabbler alike. It particularly attracts pedants, the linguistics-inclined, polyglots, and coders.
What you do depends completely on your interest. First, remember that even if you are an experienced Wikipedian, there is no shame in asking for help when you are unsure. There always seem to be idioms even in English, especially regional English, that are missing, and so these are often the easiest places to try your hand at creating new entries from scratch. Many people who have a non-English native language should, of course, consider joining that language's Wiktionary, but if you do edit the English Wiktionary, you may find that you can fill your days relatively easily adding new entries and translations. (Foreign-language entries are some of the simplest, since the basic form just needs the English translation of the word on the definition line, rather than a full definition of the kind English words require.)
There is also room for more new page/vandalism/spam patrollers, of course, if you are that type of person. We get considerably less abuse than Wikipedia, but we also have essentially no automated tools and few regular patrollers.

What's the big fuss about the logo change? It seems like a perennial discussion that just never goes away...

Original logo, used by en.wikt and others.

Wiktionary's original logo design (the one currently used on the English Wiktionary) has been a chronic source of controversy. In a way, the concept is clever, since the logo is itself a mock dictionary entry, and so it actually describes the project, while many other attempts to represent a dictionary end up just looking like a clip-art open book (when it's a website). On the other hand, it presents several problems, the main one being that because it is entirely words, it is not easily transferable or identifiable. When you want to adapt the logo to a new project, you cannot just slap on a shared image; the entire thing has to be translated, so there is no single logo image. The English Wiktionary, and the majority of Wiktionaries by number (though not necessarily by content), use this logo, which is the default for new projects.

New logo, used by fr.wikt and others.

As a result, there was an effort to create an improved logo, culminating in the 2006 vote that selected the Scrabble tile logo used on the French Wiktionary and several other projects. This logo has the advantage of being more easily adaptable to all the languages, as well as providing a favicon that is different from Wikipedia's. Unfortunately, this effort only compounded the problem by creating a confusing situation in which some Wiktionaries chose to adopt the new logo and others ignored the vote. The new logo turned out to be much less popular with the Wiktionarians, who were expected to use it, than with the people who had designed and supported it on meta, many of whom were not Wiktionary editors. For many Wiktionarians, particularly on the English one, the original logo is bad and the tile logo is worse.

Recently, there has been a renewed effort to fix the logo problem. Many new logo proposals were submitted and a vote is currently envisioned to pick a global Wiktionary logo. It remains to be seen how, if at all, such a vote would occur and whether it can avoid the problems of the previous attempt without making things worse.

On Wikipedia we have "Featured Article" to showcase its best articles. Is there a similar scheme for Wiktionary?

Wiktionary has a "Word of the Day" section on its main page that features English words that are considered exotic while still being useful. There is also currently a plan to test a foreign-language word of the day on the expected redesigned main page. However, while a word of the day must meet some basic requirements, it is not the same as a featured article, which has been vetted and determined to be of the highest quality. Wiktionary entries tend to be more basic than encyclopedia articles. At the same time, no English-language entry has ever really been finished, despite some valiant efforts (like "water"'s multi-gazillion translations, "time"'s 373 derived terms, "stick"'s 56 senses), considering the scope of the project. I am not aware of any real effort, despite occasional proposals, ever to create a process similar to Wikipedia's featured articles, and I am not sure if there ever will be one.

Is there a way I can recognize the best works of Wiktionary?

One way to measure the quality of Wiktionary entries is their citations. A good definition will have at least three primary source quotations that demonstrate the word being used in real literature (e.g. feed a cold). This is, of course, most important for less common words or meanings of words, especially idioms and colloquialisms. As an alternative, particularly for words unlikely to be in dispute, example sentences help differentiate senses and demonstrate usage (e.g. it). Another desirable quality is ample non-definitional data, like pronunciations, etymology, derived and related terms, synonyms and antonyms, translations, and so on. Wiktionary does not have a formal approval or vetting process, though any word can be challenged at requests for verification, and will be removed if not attested properly.

Does the project have any plans to promote itself or recruit more members?

I'm trying! :-)
If you are interested in getting involved, feel free to talk to me or ask on the information desk (the equivalent of the help desk) for any guidance or personal mentorship you would like. Volunteers wanted and appreciated!

What are some challenges that Wiktionary has faced?

Wiktionary struggles with writing a dictionary in software that was originally, and actively continues to be, designed for an encyclopedia. In terms of software, one of the most important differences between the two types of references is the structure.
As a simple example, "bat" has several meanings in English. The Spanish word "murciélago" (one of my favorite Spanish words) corresponds to only one of those English meanings: the mammal. However, as you will see at murciélago, no mechanism exists for linking a specific sense to another specific sense; rather, links go to the tops of pages or sections. The English Wiktionary is forced to compensate by trying to ensure that a manual gloss "(winged mammal)" accompanies all translations. The Spanish Wiktionary solves a similar problem by using the numbers of the senses to designate the meanings intended by particular translations (as in "es:cambio"). The result is that any addition or removal of a sense requires taking care not to break the numbering scheme in the translation section and have translations suddenly point to the wrong meaning.
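The fragility of numbered senses can be illustrated with a toy model. Everything here is an assumption made for illustration: the list of glosses, the second and inserted senses, and the helper name `sense_by_number` are not taken from any actual entry. The sketch compares a translation that points at a sense by number with one that carries a manual gloss.

```python
# Toy list of glosses for "bat"; only "winged mammal" comes from the interview.
senses = ["winged mammal", "club for hitting a ball"]

# Two ways a translation could say which sense it belongs to.
translation_by_number = {"word": "murciélago", "sense": 1}             # 1-based number
translation_by_gloss = {"word": "murciélago", "gloss": "winged mammal"}


def sense_by_number(sense_list, number):
    """Look up a sense by its 1-based position in the entry."""
    return sense_list[number - 1]


# Before any edit, the numbered link points at the mammal.
before = sense_by_number(senses, translation_by_number["sense"])

# Someone adds a new sense at the top of the entry (placeholder text).
senses_after_edit = ["(newly added sense)"] + senses

# The numbered link now silently points at the wrong meaning...
after = sense_by_number(senses_after_edit, translation_by_number["sense"])

# ...while the gloss still identifies the intended sense.
gloss_still_valid = translation_by_gloss["gloss"] in senses_after_edit

print(before, "|", after, "|", gloss_still_valid)
```

The gloss approach survives the edit, but it only works as long as editors keep the gloss text in the translation section in step with the definition it refers to.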
Another symptom of the same problem is the software's inability to recognize things like language and part of speech section headers as data about the entry. All Wiktionary entries are organized into top-level sections according to language, but despite this universal organization, it is impossible to let readers view, say, only the English or only the Romanian section. And before a recent Toolserver hack that lets the reader choose a language, hitting "random page" would usually just give you a Spanish or Italian verb form, rather than the English lemma forms most readers wanted.
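As a rough sketch of what treating language headers as data could look like, the snippet below splits an entry's wikitext into per-language sections. It assumes languages appear as level-two headings of the form `==English==`; the sample text is invented, and `split_by_language` is a hypothetical helper, not an existing tool.

```python
import re

# Matches a level-two heading such as "==English==" but not "===Noun===".
LANGUAGE_HEADING = re.compile(r"^==[ \t]*([^=]+?)[ \t]*==[ \t]*$", re.MULTILINE)


def split_by_language(wikitext: str) -> dict:
    """Return a mapping from language name to the text of its section."""
    sections = {}
    matches = list(LANGUAGE_HEADING.finditer(wikitext))
    for i, match in enumerate(matches):
        # A section runs until the next language heading or the end of the page.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(wikitext)
        sections[match.group(1)] = wikitext[match.end():end].strip()
    return sections


# Invented sample entry with placeholder definitions.
SAMPLE = """==English==
===Noun===
# (placeholder English definition)

==Romanian==
===Noun===
# (placeholder Romanian definition)
"""

sections = split_by_language(SAMPLE)
print(list(sections))
print(sections["English"])
```

A reader-side filter could then show only the chosen language's section, though a workaround like this has to guess at structure that the software itself does not record.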
If you would like to help with Wiktionary development, please let us know at the grease pit (which is like the technical Village Pump) or visit #wiktionary on IRC.