2015年3月31日星期二

Two sided p value for one sample mean difference-bootstrap mehtod

boottestonemean<-function(x,theta0,b){
#
#  x = sample
#  theta0 is the null value of the mean
#  b is the number of bootstrap resamples
#
#  origtest contains the value of the test statistics
#           for the original sample
#  pvalue is the bootstrap p-value
#  teststatall contains the b bootstrap tests
#
n<-length(x)
v<-mean(x)
z<-x-v+theta0
d<-mean(x)-theta0
counter<-0
teststatall<-rep(0,b)
for(i in 1:b){xstar<-sample(z,n,replace=T)
              vstar<-mean(xstar)
               #if(vstar-theta0 >= d){counter<-counter+1}# one sided test
               if(abs(vstar-theta0) >= d){counter<-counter+1} #two sided
teststatall[i]<-vstar}
pvalue<-counter/b  #if we do the count *2  it has the same effect as above
list(origtest=v,pvalue=pvalue,teststatall=teststatall)
#list(origtest=v,pvalue=pvalue)
}
x<-c(119.7,104.1,92.8,85.4,108.6,93.4,67.1,88.4,101.0,97.2,95.4,77.2,100.0,114.2,150.3,102.3,105.8,107.5,0.9,94.1)
R<-boottestonemean(x,90,3000)
R$pvalue
hist(R$teststatall)
T<-(mean(x)-90)/(sd(x)/20^0.5)  #one sided t test
T
1-pt(T,19)




2015年3月30日星期一

one sided or two sided p value for permutation and bootstrap test

The one-sided p-value of the test is calculated as the proportion of sampled permutations where the difference in means was greater than or equal to T(obs). The two-sided p-value of the test is calculated as the proportion of sampled permutations where the absolute difference was greater than or equal to ABS(T(obs)).

From wiki

Hang-in and Have Smart Friends - The Road to HIV Resistance-by Professor Roger Detels


Exact test

Exact test

From Wikipedia, the free encyclopedia
Jump to: navigation, search
In statistics, an exact (significance) test is a test where all assumptions, upon which the derivation of the distribution of the test statistic is based, are met as opposed to an approximate test (in which the approximation may be made as close as desired by making the sample size big enough). This will result in a significance test that will have a false rejection rate always equal to the significance level of the test. For example an exact test at significance level 5% will in the long run reject true null hypotheses exactly 5% of the time.
Parametric tests, such as those described in exact statistics, are exact tests when the parametric assumptions are fully met, but in practice the use of the term exact (significance) test is reserved for those tests that do not rest on parametric assumptions – non-parametric tests. However, in practice most implementations of non-parametric test software use asymptotical algorithms for obtaining the significance value, which makes the implementation of the test non-exact.
So when the result of a statistical analysis is said to be an “exact test” or an “exact p-value”, it ought to imply that the test is defined without parametric assumptions and evaluated without using approximate algorithms. In principle however it could also mean that a parametric test has been employed in a situation where all parametric assumptions are fully met, but it is in most cases impossible to prove this completely in a real world situation. Exceptions when it is certain that parametric tests are exact include tests based on the binomial or Poisson distributions. Sometimes permutation test is used as a synonym for exact test, but although all permutation tests are exact tests, not all exact tests are permutation .

2015年3月26日星期四

A SAS Macro to search diseases by ICD 10 code and return order and number of patients for each block of ICD-10

%macro search_by_group(letter,from_,to_,k, database, result);
data dummy;
set &database;
%do i=1 %to &k;
icd10d&i._letter=substr(icd10d&i,1,1);
icd10d&i._number=input(substr(icd10d&i,2,5),best12.);
%end;
run;
%do j=1 %to &k;
data dummy&k;
set dummy;
if icd10d&j._letter="&letter" and icd10d&j._number>=&from_ and icd10d&j._number<&to_ then condition=icd10d&j;/***it should be a double quotation here, not a single quotation***/
new_ppn=compress(put(ppn,12.))||condition;
run;
proc sort data=dummy&j nodupkey;
by new_ppn;
run;
%End;
data &result(keep=ppn new_ppn condition diseasename);
set dummy1-dummy&k;
diseasename=condition;
format diseasename icd10ff.;
run;
proc sort data=&result nodupkey;
by new_ppn;
run;
proc freq data=&result order=freq;
table condition diseasename ;
run;
%mend search_by_group;
%macro numpatients(database);
data &database._2;
set &database;
if diseasename='' then delete;
run;
proc sort data=&database._2 nodupkey;
by ppn;
run;
data numberpatients;
set &database._2;
No_Patients='Number of patients';
run;
title "Order and Number of patients in the &database block";
proc freq data=numberpatients;
table No_Patients;
run;
%mend numpatients;

2015年3月18日星期三

2015年3月17日星期二

Marsaglia and Bray algorithm to generate iid bivariate normal observation

Marsaglia and Bray algorithm to generate iid bivariate normal observations


n<-50000
w<-c()
x1<-c()
x2<-c()
z<-c()
for (i in 1:n) {
u <- runif(1,-1,1)
v <- runif(1,-1,1)
w[i] <- u^2 + v^2
if(w[i]<=1){
z[i]<-(-2*log(w[i])/w[i])^0.5
x1[i]<-u*z[i]
x2[i]<-v*z[i]
}
else if(w[i]>1){
x1[i]<-NA
x2[i]<-NA
}
}

y1<-x1[!is.na(x1)]
y2<-x2[!is.na(x2)]
par(mfrow=c(1,2))
hist(y1,freq=F, breaks=100)
hist(y2,freq=F, breaks=100)

library(MASS)
bivn <- mvrnorm(50000, mu = c(0, 0), Sigma = matrix(c(1, 0, 0, 1), 2))
bivn.kde <- kde2d(bivn[,1], bivn[,2], n = 50)

Marsaglia <- kde2d(y1, y2, n = 50)


par(mfrow=c(3,2))
contour(Marsaglia,main="Marsaglia")
contour(bivn.kde,main="mvrnorm")

image(Marsaglia,main="Marsaglia")
image(bivn.kde,main="mvrnorm")

persp(Marsaglia, phi = 0, theta = 0,main="Marsaglia")

persp(bivn.kde, phi = 0, theta = 0, main="mvrnorm")
Compare the Marsaglia and mvrnorm:



2015年3月15日星期日

The constant "c" in Bayesian

"The posterior probability is proportional to the product of the prior probability and the likelihood."  why it is “proportional” or why the “c” is constant but not vary?


An interesting post from professor Larry Wasserman


https://normaldeviate.wordpress.com/2012/10/05/the-normalizing-constant-paradox/



2015年3月13日星期五

up to a normalizing constant

What does it mean when it says we only know the target distribution [π(x)] up to a normalizing constant?

This means that π(x) is proportional to the p.d.f. for every value x. That is, if f(x) is the p.d.f., then there is some constant K such that
f(x) = K * π(x)
for all values of x.

2015年3月12日星期四

Simulate significance level and power function for a t test with underling logistic distribution


emp_alpha<-function(nsims){

n<-20                          # sample size of 20
tc<-qt(1-0.05,n-1)            #this is 1.729, i.e the reject region,here we suppose u=0
ic<-0
for(i in 1:nsims){
    samp<-rlogis(n,0,1)             # generate 20 observations from logistic distribution with u=0
    ttest<-(sqrt(n)*mean(samp))/var(samp)^.5 # calcuate ts from sample.  
    if(ttest > tc){ic<-ic+1}
    }
alpha<-ic/nsims
err<-1.96*sqrt((alpha*(1-alpha))/nsims)
list(empiricalalpha=alpha,error=err)
}

emp_alpha(100000)


#######simulate a power function

emp_power<-function(nsims,u){

n<-20                          # sample size of 20
tc<-qt(1-0.05,n-1)            #this is 1.729, i.e the reject region,here we suppose u=0
ic<-0
for(i in 1:nsims){
    samp<-rlogis(n,u,1)          # generate 20 samples from the logistic distribuiton with u
    ttest<-(sqrt(n)*mean(samp))/var(samp)^.5 # calcuate ts from sample.  
    if(ttest > tc){ic<-ic+1}
    }
power<-ic/nsims
return(power)
#err<-1.96*sqrt((power*(1-power))/nsims)
#list(emppower=power,error=err)
}
emp_power(1000,0.5) #power for u=0.5

#We will draw a power function for this simulation
powers<-NULL
u<-seq(0,1,0.001)
for(i in 1:1001){
powers[i]<-emp_power(1000,u[i])
}

plot(u,powers,type="l") # this is power function plot





Simulate power function for a t-test with a contaminated normal distribution

rcn<-function(n,eps,sigmac,u){
ind<-rbinom(n,1,eps)
x<-rnorm(n,u,1)                             #norm with mean=u, sd=1
rcn<-x*(1-ind)+sigmac*x*ind        #produce obs fromcontaminated normal distriubtion with u
return(rcn)                              
}

#  returns a random sample of size n drawn from
#  a contaminated normal distribution with percent
#  contamination eps and variance ratio sigmac


emp_power<-function(nsims,mean){
sigmac<-25
eps<-.25
alpha<-.05
n<-20                               # sample size of 20
tc<-qt(1-alpha,n-1)            #this is 1.729, i.e the reject region,here we suppose u=0
ic<-0
for(i in 1:nsims){
    samp<-rcn(n,eps,sigmac,mean)                    # simulate 20 samples from
                                                                       #the contaminated normal distribution
    ttest<-(sqrt(n)*mean(samp))/var(samp)^.5  # calcuate ts from sample.
    if(ttest > tc){ic<-ic+1}
    }
power<-ic/nsims
return(power)
}

#err<-1.96*sqrt((power*(1-power))/nsims)
#list(empiricalpower=power,error=err)


#We will draw a power function for this simulation
powers<-NULL
u<-seq(0,1,0.001)
for(i in 1:1001){
powers[i]<-emp_power(1000,u[i])
}

plot(u,powers,type="l") # this is power function plot