Wikipedia:Version 1.0 Editorial Team/Article selection
The process for article selection for offline releases of Wikipedia is now mainly automated. It is specifically used for assembling the so-called release versions; the next release (as of September 2011) is called Version 0.9.
The User:WP 1.0 bot collects information about the quality and importance of articles from participating WikiProjects. These data are then used to rank articles by a combination of quality and importance (as described below), and articles are selected based on these rankings. To ensure that nothing is overlooked, individual articles can still be nominated manually and then reviewed. This approach has been used for all releases beginning with Version 0.7 (31,000 articles).
The bot computes a numeric score for each article. Articles that have a score over a certain threshold (which will change from one release to the next) will be included in the release version. The threshold for Version 0.8 has been set at 1240. This page describes the algorithm that the selection bot uses to assign scores.
Older tests are described at Wikipedia:Version 1.0 Editorial Team/Selection trials.
Selection technique
The bot generates a score for each article in each project that has assessed the article. The overall article score consists of two components, the importance score and the quality score:
Overall_article_score = Importance_score + Quality_score.
An article will have one overall score for each project that assesses the article. The highest score given to an article by any project will determine whether the article is included in a release version.
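The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the bot's actual code: the function names and the shape of the `project_scores` mapping are assumptions made for this example.

```python
from typing import Dict, Tuple


def overall_score(importance_score: float, quality_score: float) -> float:
    """Overall article score for one project: importance plus quality."""
    return importance_score + quality_score


def is_selected(project_scores: Dict[str, Tuple[float, float]],
                threshold: float) -> bool:
    """Return True if the article's best per-project score exceeds the threshold.

    project_scores maps a (hypothetical) project name to a pair
    (importance_score, quality_score) for this article.
    """
    best = max(
        (overall_score(imp, qual) for imp, qual in project_scores.values()),
        default=float("-inf"),  # an article no project has assessed is never selected
    )
    return best > threshold


# Example with made-up scores and the Version 0.8 threshold of 1240.
scores = {"Project A": (900, 300), "Project B": (800, 500)}
print(is_selected(scores, 1240))
```

Here the article is selected on the strength of its score in Project B (1300), even though its score in Project A (1200) falls below the threshold.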
Importance score
In most cases, the overall importance score is obtained by adding points based on the importance assigned by the WikiProject and points based on external interest in the article:
Importance_score = Assessed_importance_points + External_interest_points.
Some WikiProjects, such as WP:MILHIST, have chosen not to assess for importance. In such cases, the overall importance score is calculated using the external interest points alone:
Importance_score = External_interest_points * (4/3).
This formula is also used for articles whose importance is marked as 'Unknown-Class' or 'Unassessed-Class'.
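The two cases can be written as a single function. This is a minimal sketch in which a missing importance assessment is represented by `None`; that representation and the function name are assumptions of this example, not details of the bot.

```python
from typing import Optional


def importance_score(external_interest_points: float,
                     assessed_importance_points: Optional[float] = None) -> float:
    """Combine assessed and external points, or fall back to the 4/3 formula."""
    if assessed_importance_points is None:
        # No importance assessment (or 'Unknown-Class' / 'Unassessed-Class'):
        # scale the external interest points by 4/3 instead.
        return external_interest_points * 4 / 3
    return assessed_importance_points + external_interest_points


print(importance_score(600))        # unassessed article
print(importance_score(600, 350))   # assessed article
```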
Assessed importance points
The assessed importance of an article is used to assign points based on the WikiProject itself and the importance rating assigned to the article:
Assessed_importance_points = Base_importance_points + WikiProject_scope_points.
The base importance points are taken from the following table.
Rating | Top | High | Mid | Low |
Points | 400 | 300 | 200 | 100 |
If the importance is not assessed, the 4/3 formula is used, and the base importance points are not used in the final score calculation. In this case, the WikiProject scope points also do not count towards the final score.
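As an illustration, the table lookup and the sum can be sketched as follows. The dictionary and function names are hypothetical, and returning `None` for an unassessed rating is a convention chosen for this example to signal that the 4/3 formula applies instead.

```python
from typing import Optional

# Base importance points, copied from the table above.
BASE_IMPORTANCE_POINTS = {"Top": 400, "High": 300, "Mid": 200, "Low": 100}


def assessed_importance_points(rating: str,
                               wikiproject_scope_points: float) -> Optional[float]:
    """Base points for the rating plus the project's scope points.

    Returns None when the rating is not one of Top/High/Mid/Low, in which
    case neither the base points nor the scope points count.
    """
    base = BASE_IMPORTANCE_POINTS.get(rating)
    if base is None:
        return None
    return base + wikiproject_scope_points


print(assessed_importance_points("High", 50))
print(assessed_importance_points("Unknown", 50))
```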
WikiProject scope points
WikiProject scope points are used to compensate for the difference in scope between WikiProjects. For example, the Geography WikiProject has a very broad scope, while the Åland WikiProject has a much narrower scope.
The WikiProject scope points are typically based on the external interest points, defined below, for the Top-Importance article that best represents the scope of the project. For example, Wikipedia:WikiProject Chicago is best represented by the article Chicago.
Some projects cover several subjects, either explicitly (Wikipedia:WikiProject Amphibians and Reptiles) or implicitly (Wikipedia:WikiProject Kingdom of Naples includes Kingdom of Two Sicilies). In these cases, the WikiProject scope points are based on two or more articles that cover the main subjects of the WikiProject.
In other cases, there is no single article that adequately represents the entire project, or the "representative" article has a much lower score than major topics within that subject. In such cases, a selection of two or three Top-Importance articles that lie at the core of the subject matter may be used. For example, the articles Jimi Hendrix and Eric Clapton were selected for Wikipedia:WikiProject Guitarists.
To compute the WikiProject score when multiple articles are considered, the page view counts, incoming page links, and interwiki links for all the articles are totaled, and then used as if they were the data for a single article in the formula for external interest points given below. This results in a raw score. The distribution of raw scores for Wikipedia 0.7 is shown in the following table.
Percentile | 10th | 25th | 50th | 75th | 90th |
Raw score | 785 | 900 | 1025 | 1130 | 1200 |
The WikiProject scope points are obtained by subtracting 1000 from the raw score and dividing the result by 2.
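A short sketch of this conversion, applied to the raw scores listed in the distribution table for Wikipedia 0.7 (the function name is an assumption of this example):

```python
def wikiproject_scope_points(raw_score: float) -> float:
    """Convert a project's raw score into scope points: (raw - 1000) / 2."""
    return (raw_score - 1000) / 2


# Raw scores from the distribution table above.
for raw in (785, 900, 1025, 1130, 1200):
    print(raw, wikiproject_scope_points(raw))
```

Under this formula, a project whose raw score is below 1000 receives negative scope points, so a narrow project lowers the assessed importance points of its articles and a broad one raises them.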
Task forces and child projects
Many WikiProjects, such as WP:Films and WP:Australia, use task forces to assess specialized areas within their general scope. In some cases (such as WP:Australia), the task force can assess importance within the speciality area independently of the parent project's importance assessments. In these cases, a separate WikiProject score is computed for the child project. In other cases (such as WP:Philosophy), importance is assessed only by the parent project. In these cases, the parent project's WikiProject score is used as the WikiProject score for the child project.
External interest points
These points measure the external interest in an article, independent of the ratings assigned by the WikiProject. The points are formed by combining the number of page views (hitcount), the number of incoming internal links, and the number of incoming interwiki links from Wikipedias in other languages:
External_interest_points = 50 * log10(hitcount) + 100 * log10(internal_links) + 250 * log10(interwiki_links)
The counts of page views, pagelinks, and interwiki links for all pages that redirect to a given article are added to the article's own counts before the external interest points are computed.
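The formula and the redirect handling can be sketched as below. The function names and the tuple layout are assumptions of this example. The page does not say how a count of zero is treated, so this sketch simply lets a zero count contribute zero points; the bot may behave differently.

```python
import math
from typing import Iterable, Tuple

# (hitcount, internal_links, interwiki_links) for one page
Counts = Tuple[float, int, int]


def _log10_or_zero(n: float) -> float:
    # ASSUMPTION: a zero count contributes 0 points (log10(0) is undefined).
    return math.log10(n) if n > 0 else 0.0


def external_interest_points(hitcount: float, internal_links: int,
                             interwiki_links: int) -> float:
    """50*log10(hitcount) + 100*log10(internal links) + 250*log10(interwiki links)."""
    return (50 * _log10_or_zero(hitcount)
            + 100 * _log10_or_zero(internal_links)
            + 250 * _log10_or_zero(interwiki_links))


def points_with_redirects(article: Counts, redirects: Iterable[Counts]) -> float:
    """Add the counts of all redirect pages to the article's own counts first."""
    hits, links, interwiki = article
    for r_hits, r_links, r_interwiki in redirects:
        hits += r_hits
        links += r_links
        interwiki += r_interwiki
    return external_interest_points(hits, links, interwiki)


# 1000 views, 100 internal links, 10 interwiki links: 150 + 200 + 250 points.
print(external_interest_points(1000, 100, 10))
```

Because each term is logarithmic, a tenfold increase in page views adds only 50 points, while a tenfold increase in interwiki links adds 250.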
The hitcount data is obtained from http://dammit.lt/wikistats/ (this is the same data used by http://stats.grok.se). From this data, a list of daily hitcounts over a period of several weeks is formed. For each article, the highest 20 percent and lowest 20 percent of these daily hitcounts are discarded, and the remaining data points are averaged (see truncated mean). The resulting statistic is used as a measure of the typical daily page views of the article. The hit statistics displayed in the selection bot stats on the toolserver are actually monthly hitcounts.
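The truncated mean described above can be sketched as follows. The function name is hypothetical, and the rounding rule (dropping `n // 5` values from each end) is an assumption of this example, since the page does not say how the bot rounds when the number of days is not a multiple of five.

```python
from typing import Sequence


def typical_daily_hitcount(daily_hitcounts: Sequence[float]) -> float:
    """Average the daily hitcounts after discarding the top and bottom 20 percent."""
    if not daily_hitcounts:
        raise ValueError("at least one daily hitcount is required")
    ordered = sorted(daily_hitcounts)
    k = len(ordered) // 5  # ASSUMPTION: round the 20% cut down to a whole number of days
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Ten days of made-up data with one traffic spike (100); the spike is discarded.
print(typical_daily_hitcount([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]))
```

Trimming both ends keeps a one-day spike in traffic, or a day of missing data, from distorting the estimate of typical page views.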
Quality score
The quality score for an article in a project is based on the quality rating assigned by the WikiProject, as shown in the following table.
Rating | FA | FL | A | GA | B | C | Start | Other |
Points | 500 | 500 | 400 | 400 | 300 | 225 | 150 | 0 |
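The quality table amounts to a simple lookup, sketched here with hypothetical names. Any rating not listed in the table falls under "Other" and scores 0.

```python
# Quality points, copied from the table above.
QUALITY_POINTS = {
    "FA": 500, "FL": 500, "A": 400, "GA": 400,
    "B": 300, "C": 225, "Start": 150,
}


def quality_score(rating: str) -> int:
    """Look up the quality points for a rating; unlisted ratings score 0."""
    return QUALITY_POINTS.get(rating, 0)


print(quality_score("GA"))
print(quality_score("Stub"))
```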