Quantile normalization

In statistics, quantile normalization is a technique for making two distributions identical in statistical properties. To quantile-normalize a test distribution to a reference distribution of the same length, sort the test distribution and sort the reference distribution. The highest entry in the test distribution then takes the value of the highest entry in the reference distribution, the next highest entry takes the value of the next highest entry in the reference distribution, and so on, until the test distribution is a permutation of the reference distribution.
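The procedure can be sketched in a few lines of Python. This is a minimal illustration rather than a canonical implementation: the function name `normalize_to_reference` and the sample values are hypothetical, and tied test values are simply ranked in order of appearance.

```python
def normalize_to_reference(test, reference):
    """Give each test value the reference value that has the same rank."""
    if len(test) != len(reference):
        raise ValueError("test and reference must have the same length")
    ref_sorted = sorted(reference)
    # Positions of the test values, ordered from lowest to highest value.
    # sorted() is stable, so tied values keep their original order (an assumption
    # of this sketch, not a rule stated in the text).
    order = sorted(range(len(test)), key=lambda i: test[i])
    result = [None] * len(test)
    for rank, idx in enumerate(order):
        result[idx] = ref_sorted[rank]
    return result


# Hypothetical sample values, chosen only for illustration.
test = [7.0, 1.5, 4.2, 9.9]
reference = [10.0, 20.0, 30.0, 40.0]
normalized = normalize_to_reference(test, reference)
print(normalized)  # [30.0, 10.0, 20.0, 40.0]
```

The output holds exactly the reference values, rearranged so that their rank order matches that of the test values.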

To quantile-normalize two or more distributions to each other, without a reference distribution, sort each distribution as before, then set every entry to the average (usually the arithmetic mean) of the entries that hold the same rank across the distributions. So the highest value in all cases becomes the mean of the highest values, the second highest value becomes the mean of the second highest values, and so on.
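A minimal sketch of this reference-free variant follows, assuming equal-length columns and the arithmetic mean. The name `quantile_normalize` and the input values are hypothetical, and ties are not treated specially here.

```python
from statistics import mean


def quantile_normalize(columns):
    """Quantile-normalize equal-length columns to each other (no tie handling)."""
    n = len(columns[0])
    if any(len(col) != n for col in columns):
        raise ValueError("all columns must have the same length")
    # rank_means[k] is the mean of the k-th smallest value of every column.
    rank_means = [mean(vals) for vals in zip(*(sorted(col) for col in columns))]
    normalized = []
    for col in columns:
        # Positions of this column's values, ordered from lowest to highest.
        order = sorted(range(n), key=lambda i: col[i])
        out = [None] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized


# Hypothetical input: two columns on very different scales.
columns = [[1.0, 3.0, 2.0], [30.0, 10.0, 20.0]]
result = quantile_normalize(columns)
print(result)  # [[5.5, 16.5, 11.0], [16.5, 5.5, 11.0]]
```

After normalization both columns contain the same set of values (5.5, 11.0, 16.5), each placed according to the original rank order within its column.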

Generally a reference distribution will be one of the standard statistical distributions such as the Gaussian distribution or the Poisson distribution. The reference distribution can be generated randomly or by taking regularly spaced samples from the cumulative distribution function of the distribution. However, any reference distribution can be used.
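One way to build a Gaussian reference from regularly spaced samples is to evaluate the inverse of the cumulative distribution function at evenly spaced probabilities. The sketch below uses `statistics.NormalDist` from the Python standard library. The helper name `gaussian_reference` and the midpoint probabilities (k + 0.5)/n are assumptions of this example; other spacing conventions are equally possible.

```python
from statistics import NormalDist


def gaussian_reference(n, mu=0.0, sigma=1.0):
    """Return n sorted reference values taken at regular quantiles of a Gaussian."""
    dist = NormalDist(mu, sigma)
    # Midpoint probabilities (k + 0.5) / n avoid 0 and 1, where the inverse
    # CDF is unbounded. This spacing is an assumed convention.
    return [dist.inv_cdf((k + 0.5) / n) for k in range(n)]


reference = gaussian_reference(5)
print([round(v, 3) for v in reference])
```

A randomly generated reference could instead be drawn with `NormalDist.samples` and then sorted, at the cost of results that vary from run to run unless a seed is fixed.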

Quantile normalization is frequently used in microarray data analysis. It was introduced as quantile standardization[1] and then renamed as quantile normalization.[2]

Example

A quick illustration of such normalizing on a very small dataset, organized into columns (1-3) and rows (A-D):

For each column, rank the entries from lowest to highest (i to iv):

Set aside these rank values to use later. Go back to the first set of data. Rearrange each column's values so that each column is in order from lowest to highest. The result is:

Now find the mean for each row, and rank them lowest to highest (i to iv):

Now take the ranking order from earlier and substitute in the means according to their corresponding ranks:

These are the new normalized values.

However, note that when, as in column two, values are tied in rank, they should instead be assigned the mean of the values corresponding to the ranks they would normally represent if they were different. In the case of column 2, they represent ranks iii and iv. So we assign the two tied rank iii entries the average of rank iii and rank iv ((4.67 + 5.67)/2 = 5.17). And so we arrive at the following set of normalized values:
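The whole procedure, including the tie rule, can be sketched as follows. The input values are illustrative assumptions, chosen so that the rank means come out as 2.00, 3.00, 4.67 and 5.67 and so that column 2 has two values tied at ranks iii and iv, as described in the text. The function name `quantile_normalize_ties` is likewise hypothetical.

```python
from statistics import mean


def quantile_normalize_ties(columns):
    """Quantile-normalize columns; tied values share the mean of their ranks' means."""
    n = len(columns[0])
    # rank_means[k] is the mean of the k-th smallest value of every column.
    rank_means = [mean(vals) for vals in zip(*(sorted(col) for col in columns))]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [None] * n
        start = 0
        while start < n:
            # Extend the run while the next sorted value equals the current one.
            end = start
            while end + 1 < n and col[order[end + 1]] == col[order[start]]:
                end += 1
            # Every value in the run gets the average of the rank means it spans.
            shared = mean(rank_means[start:end + 1])
            for pos in range(start, end + 1):
                out[order[pos]] = shared
            start = end + 1
        normalized.append(out)
    return normalized


# Assumed data: columns 1-3, each listing rows A-D.
columns = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.0, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]
result = quantile_normalize_ties(columns)
rounded = [[round(v, 2) for v in col] for col in result]
print(rounded)
# [[5.67, 2.0, 3.0, 4.67], [5.17, 2.0, 5.17, 3.0], [2.0, 3.0, 4.67, 5.67]]
```

In column 2 the two tied entries both receive 5.17, the average of the rank iii and rank iv means, while columns without ties are unaffected by the rule.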

The new values have nearly the same distribution in every column (the columns differ only where tied ranks were averaged) and can now be easily compared. Here are the summary statistics for each of the three columns:

References

  1. ^ Amaratunga, D.; Cabrera, J. (2001). "Analysis of Data from Viral DNA Microchips". Journal of the American Statistical Association. 96 (456): 1161. doi:10.1198/016214501753381814. S2CID 18154109.
  2. ^ Bolstad, B. M.; Irizarry, R. A.; Astrand, M.; Speed, T. P. (2003). "A comparison of normalization methods for high density oligonucleotide array data based on variance and bias". Bioinformatics. 19 (2): 185–193. doi:10.1093/bioinformatics/19.2.185. PMID 12538238.