Bonferroni correction: if n independent hypotheses are tested simultaneously on the same data set, the significance level used for each hypothesis should be 1/n of the level that would be used if only one hypothesis were tested.

Introduction

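For example: to test two independent hypotheses on the same data set at the common significance level of 0.05, each of the two hypotheses should be tested at the stricter level of 0.025, i.e. 0.05 * (1/2). The method was developed by Carlo Emilio Bonferroni, hence the name Bonferroni correction.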
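The two-hypothesis example can be sketched in R as follows. The p-values are hypothetical and chosen only for illustration; `p.adjust` with `method = "bonferroni"` is the built-in equivalent, which multiplies each p-value by n (capped at 1) so that it can be compared with the original level.

```r
alpha <- 0.05
p <- c(0.010, 0.030)            # hypothetical p-values for the two tests
n <- length(p)

# View 1: shrink the threshold to alpha / n
threshold <- alpha / n          # 0.025
reject_by_threshold <- p <= threshold

# View 2: inflate the p-values and keep the threshold at alpha
p_bonf <- p.adjust(p, method = "bonferroni")   # pmin(1, n * p)
reject_by_adjusted <- p_bonf <= alpha

print(threshold)
print(reject_by_threshold)
print(p_bonf)
```

Both views lead to the same decisions here: the first hypothesis is rejected and the second is not.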

The rationale is the following fact: when many hypotheses are tested on the same data set, about one in every 20 hypotheses may reach the 0.05 significance level purely by chance.
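To make the "one in 20" argument concrete, the sketch below computes the chance of at least one false positive among n tests, assuming the tests are independent and every null hypothesis is true (both are assumptions of this illustration). It compares the uncorrected level with the Bonferroni level alpha / n.

```r
alpha <- 0.05
n <- c(1, 2, 5, 20)

# Probability of at least one false positive among n independent true nulls
fwer_uncorrected <- 1 - (1 - alpha)^n

# Same probability when each test is run at the Bonferroni level alpha / n
fwer_bonferroni <- 1 - (1 - alpha / n)^n

print(data.frame(n, fwer_uncorrected, fwer_bonferroni))
```

With 20 uncorrected tests the chance of at least one spurious "significant" result is roughly 64%, while with the corrected level it stays at about 5% or below.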

Original Wikipedia text

Bonferroni correction

Bonferroni correction states that if an experimenter is testing n independent hypotheses on a set of data, then the statistical significance level that should be used for each hypothesis separately is 1/n times what it would be if only one hypothesis were tested.

For example, to test two independent hypotheses on the same data at 0.05 significance level, instead of using a p value threshold of 0.05, one would use a stricter threshold of 0.025.

The Bonferroni correction is a safeguard against multiple tests of statistical significance on the same data, where 1 out of every 20 hypothesis-tests will appear to be significant at the α = 0.05 level purely due to chance. It was developed by Carlo Emilio Bonferroni.

A less restrictive criterion is the rough false discovery rate, giving (3/4) × 0.05 = 0.0375 for n = 2 and (21/40) × 0.05 = 0.02625 for n = 20.
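The two quoted values are consistent with a threshold of alpha * (n + 1) / (2 * n). The sketch below assumes that form (the text gives only the two numbers, not the general formula) and sets it beside the Bonferroni threshold; `rfdr_threshold` is a hypothetical helper name.

```r
# Assumed general form, inferred from the two values quoted in the text
rfdr_threshold <- function(alpha, n) alpha * (n + 1) / (2 * n)

alpha <- 0.05
n <- c(2, 20)

rfdr <- rfdr_threshold(alpha, n)   # 0.0375 and 0.02625
bonf <- alpha / n                  # 0.025 and 0.0025

print(data.frame(n, rfdr, bonf))
```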

Multiple testing problems come up frequently in data analysis. In 1995 Benjamini proposed a method that addresses them by controlling the false discovery rate (FDR), for example by requiring that the FDR not exceed 5%.

According to the theorem Benjamini proved in his paper, the procedure for controlling the FDR is actually very simple.

Suppose there are m candidate genes in total, and let their p-values, sorted in ascending order, be p(1), p(2), ..., p(m).

The False Discovery Rate (FDR) of a set of predictions is the expected proportion of false predictions in the set. For example, if the algorithm returns 100 genes with a false discovery rate of 0.3, then we should expect 70 of them to be correct.

The FDR is very different from a p-value, and as such a much higher FDR can be tolerated than with a p-value. In the example above, a set of 100 predictions of which 70 are correct might be very useful, especially if there are thousands of genes on the array, most of which are not differentially expressed. In contrast, a p-value of 0.3 is generally unacceptable in any circumstance. Meanwhile, an FDR as high as 0.5 or even higher might be quite meaningful.

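The FDR error-control method, proposed by Benjamini in 1995, determines the p-value threshold by controlling the FDR (False Discovery Rate). Suppose you select R genes as differentially expressed, of which S are truly differentially expressed and the other V are in fact not, i.e. they are false positives. In practice one wants the error proportion Q = V/R not to exceed, on average, some preset value (for example 0.05); statistically, this is equivalent to controlling the FDR at no more than 5%.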
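Written out, the quantities in this paragraph relate as follows. The convention Q = 0 when nothing is selected (R = 0) is an assumption added here to keep the ratio well defined; the text itself does not state it. The numbers in the last line reuse the 100-gene example with an FDR of 0.3.

```latex
R = S + V, \qquad
Q =
\begin{cases}
  V / R, & R > 0,\\
  0,     & R = 0,
\end{cases}
\qquad
\mathrm{FDR} = \mathbb{E}[Q] \le q .

% Example: R = 100 selected genes, V = 30 false positives, S = 70 true positives
Q = \frac{30}{100} = 0.3 .
```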

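Sort the p-values of all candidate genes in ascending order. To control the FDR at no more than q, find the largest positive integer i such that p(i) <= (i*q)/m. Then select the genes corresponding to p(1), p(2), ..., p(i) as differentially expressed; this statistically guarantees that the FDR does not exceed q. Accordingly, the FDR-adjusted p-value is computed as follows: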

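adjusted p-value(i) = p(i) * m / i, which in R is p * length(p) / rank(p)

Here i is the rank of p(i) and m = length(p). R's p.adjust additionally takes a running minimum from the largest p-value downward and caps the result at 1, so the two agree only when the raw ratios are already monotone in rank, as in the example below.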
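The selection rule (find the largest i with p(i) <= i*q/m, then keep p(1), ..., p(i)) can be sketched as a small R function. `bh_select` is a hypothetical helper name and the p-values are made up for illustration; the last line cross-checks the result against R's built-in `p.adjust`.

```r
# Returns a logical vector: TRUE for hypotheses selected at FDR level q
bh_select <- function(p, q = 0.05) {
  m <- length(p)
  o <- order(p)                           # indices that sort p ascending
  passes <- p[o] <= seq_len(m) * q / m    # p(i) <= i * q / m
  selected <- logical(m)
  if (any(passes)) {
    k <- max(which(passes))               # largest i meeting the bound
    selected[o[seq_len(k)]] <- TRUE       # keep p(1), ..., p(k)
  }
  selected
}

p <- c(0.028, 0.30, 0.005, 0.035, 0.025)  # hypothetical p-values
sel <- bh_select(p, q = 0.05)
print(sel)

# Cross-check: adjusted p-values at or below q give the same selection
print(p.adjust(p, method = "BH") <= 0.05)
```

Note that 0.025 is selected even though it fails its own bound (2 * 0.05 / 5 = 0.02): the rule keeps everything up to the largest i that passes, which here is i = 4.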

References

1. Audic, S. and J. M. Claverie (1997). The significance of digital gene expression profiles. Genome Res 7(10): 986-95.

2. Benjamini, Y. and D. Yekutieli (2001). The control of the false discovery rate in multiple testing under dependency. The Annals of Statistics 29: 1165-1188.

For the calculation, see the p.adjust function in the R statistical software:

> p <- c(0.0003, 0.0001, 0.02)
> p
[1] 3e-04 1e-04 2e-02
> p.adjust(p, method = "fdr", n = length(p))
[1] 0.00045 0.00030 0.02000
> p * length(p) / rank(p)
[1] 0.00045 0.00030 0.02000
> length(p)
[1] 3
> rank(p)
[1] 2 1 3
> sort(p)
[1] 1e-04 3e-04 2e-02
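In the session above, p.adjust and p * length(p) / rank(p) give identical results. That need not hold in general, because p.adjust also forces the adjusted values to be non-decreasing in the raw p-values. The sketch below uses hypothetical p-values where the two differ and reproduces the built-in result by hand; it is an illustration of the idea, not a copy of R's source.

```r
p <- c(0.01, 0.04, 0.03)                # hypothetical p-values

naive <- p * length(p) / rank(p)        # 0.030 0.040 0.045
built_in <- p.adjust(p, method = "BH")  # 0.03 0.04 0.04 ("fdr" is an alias)

# By hand: scale by m / rank, then take a running minimum
# starting from the largest p-value, and cap at 1
m <- length(p)
o <- order(p, decreasing = TRUE)
by_hand <- numeric(m)
by_hand[o] <- pmin(1, cummin(p[o] * m / (m:1)))

print(rbind(naive, built_in, by_hand))
```

The third p-value (0.03) gets a naive value of 0.045, which is larger than the 0.04 assigned to the bigger p-value 0.04; the running minimum pulls it down to 0.04.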