biostatistics and epidemiology with R

Wednesday, February 6, 2013

Weighted Logistic Regression in R, SPSS, Stata

In R:

We will use the mtcars dataset to illustrate (R version 2.5.1).

> data(mtcars)
> mtcars <- as.data.frame(mtcars)
> mtcars$p <- runif(nrow(mtcars)) # create weights
> mtcars2 <- mtcars[c("am", "hp", "wt", "p")]
> head(mtcars2)
                  am  hp    wt         p
Mazda RX4          1 110 2.620 0.8901532
Mazda RX4 Wag      1 110 2.875 0.4601596
Datsun 710         1  93 2.320 0.9779983
Hornet 4 Drive     0 110 3.215 0.6889322
Hornet Sportabout  0 175 3.440 0.3856835
Valiant            0 105 3.460 0.1209667
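
A minimal sketch of how the weighted fit itself might look in R, assuming the model of interest is am on hp and wt with p as the weight (the formula, the seed, and the choice of family are assumptions, not taken from the post). With non-integer weights, family = binomial warns about non-integer successes; quasibinomial gives the same coefficient estimates without that warning.

```r
data(mtcars)
set.seed(1)                              # assumed seed, only for reproducibility
mtcars$p <- runif(nrow(mtcars))          # random weights, as in the session above

# Weighted logistic regression: `weights` enters the likelihood as prior weights.
# quasibinomial is used here (an assumption) to avoid the non-integer warning.
fit <- glm(am ~ hp + wt, data = mtcars, weights = p, family = quasibinomial)
summary(fit)$coefficients
```

Note that glm() treats these as precision weights rather than survey sampling weights; for design-based standard errors, survey::svyglm() is the usual route.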

Estimating Cox Model parameters with and without coxph function from survival package

Reading data

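library(survival)
data(veteran)                                   # Veterans' Administration lung cancer data
time.var <- veteran[,3]                         # column 3: survival time
status.var <- veteran[,4]                       # column 4: event indicator (1 = died, 0 = censored)
covariate.matrix <- veteran[,c(1,5:8)]          # trt, karno, diagtime, age, prior
covariate.names <- names(veteran[,c(1,5:8)])
data <- cbind.data.frame(time.var,status.var,covariate.matrix)  # analysis data set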
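
As a rough sketch of the "with and without coxph" comparison, the block below maximizes the Cox partial likelihood by hand with optim() and fits the same model with coxph(). The Breslow handling of ties, the standardization of the covariates (done only for numerical stability), and the names negloglik, fit_manual, and fit_coxph are assumptions made here for illustration.

```r
library(survival)

time   <- veteran$time
status <- veteran$status
# Standardized covariates (an assumption, to keep the optimizer well behaved)
X <- scale(as.matrix(veteran[, c("trt", "karno", "diagtime", "age", "prior")]))

# Negative log partial likelihood with Breslow handling of ties:
# each event contributes eta_i - log(sum of exp(eta) over its risk set).
negloglik <- function(beta) {
  eta <- drop(X %*% beta)
  contrib <- sapply(which(status == 1), function(i) {
    eta[i] - log(sum(exp(eta[time >= time[i]])))
  })
  -sum(contrib)
}

fit_manual <- optim(rep(0, ncol(X)), negloglik, method = "BFGS")
fit_coxph  <- coxph(Surv(time, status) ~ X, ties = "breslow")

# Side-by-side coefficients (per standard deviation of each covariate)
cbind(manual = fit_manual$par, coxph = coef(fit_coxph))
```

The two columns should be close but need not agree to many digits, since optim() here relies on numerical gradients while coxph() uses Newton-Raphson.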

Kaplan-Meier and Nelson-Aalen estimates with and without survfit function from survival package

Data reading:

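rm(list=ls(all=TRUE)) # wipe out everything else
require(survival)     # provides the aml data and survfit
data = aml[aml$x == "Maintained",] # keep only the maintenance chemotherapy group
data                  # print the subset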
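
A minimal sketch of the hand computation, assuming the usual product-limit and Nelson-Aalen formulas (the names d, event_times, n_risk, n_event, km, and na are illustrative): at each distinct event time, count the subjects at risk and the events, then accumulate.

```r
library(survival)

d <- aml[aml$x == "Maintained", ]

# Distinct event times, with the number at risk and number of events at each
event_times <- sort(unique(d$time[d$status == 1]))
n_risk  <- sapply(event_times, function(tt) sum(d$time >= tt))
n_event <- sapply(event_times, function(tt) sum(d$time == tt & d$status == 1))

km <- cumprod(1 - n_event / n_risk)   # Kaplan-Meier survival estimate
na <- cumsum(n_event / n_risk)        # Nelson-Aalen cumulative hazard estimate

# The same Kaplan-Meier estimate from survfit, for comparison
fit <- survfit(Surv(time, status) ~ 1, data = d)
cbind(time = event_times, km_manual = km, km_survfit = summary(fit)$surv, na_manual = na)
```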

Y-axis on both sides in Kaplan-Meier Survival Curve

Here is an example of how to draw the Y-axis on both sides of a Kaplan-Meier survival curve, with the same labels, axis titles, and positions. This should work for general plots as well.
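
One way this could be done in base graphics is sketched below, assuming the aml data and a simple two-group fit (the margin sizes and label text are illustrative choices): widen the right margin, draw the plot as usual, then repeat the axis and its title on side 4.

```r
library(survival)

fit <- survfit(Surv(time, status) ~ x, data = aml)

op <- par(mar = c(5, 4, 4, 4) + 0.1)   # extra room on the right for the second axis
plot(fit, lty = 1:2, xlab = "Time", ylab = "Survival probability")
axis(side = 4)                          # same tick positions, drawn on the right
mtext("Survival probability", side = 4, line = 3)  # matching axis title
par(op)                                 # restore the previous margins
```

Because axis() and mtext() are generic base-graphics calls, the same three steps apply to any plot() output.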

Nested case-control study

A nested case-control study can be described as follows: for a particular disease, all subjects in a given cohort who develop the disease are labeled "cases". For each case, a pre-specified number (say, 4) of "controls" are then matched from the subjects who are still healthy at the time the case's disease occurred, irrespective of whether these subjects become cases later. The design is attractive because it reduces cost with only a negligible loss of statistical efficiency compared to analyzing the whole cohort. More can be found here.
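
The sampling scheme described above can be sketched in base R as follows. The cohort is simulated, and every name here (cohort, m, ncc, the columns id, time, status) is hypothetical; the point is only the risk-set sampling step, where controls are drawn from subjects still under follow-up at the case's event time.

```r
set.seed(1)                                   # assumed seed for reproducibility
n <- 200
cohort <- data.frame(id     = seq_len(n),
                     time   = rexp(n, rate = 0.1),   # follow-up time
                     status = rbinom(n, 1, 0.3))     # 1 = became a case
m <- 4                                        # controls per case

cases <- cohort[cohort$status == 1, ]
ncc <- do.call(rbind, lapply(seq_len(nrow(cases)), function(i) {
  case <- cases[i, ]
  # Risk set: everyone still being followed at the case's event time, excluding
  # the case itself; future cases are eligible as controls.
  at_risk <- cohort$id[cohort$time >= case$time & cohort$id != case$id]
  k <- min(m, length(at_risk))
  controls <- if (k > 0) at_risk[sample.int(length(at_risk), k)] else integer(0)
  data.frame(set = i, id = c(case$id, controls), case = c(1, rep(0, k)))
}))
head(ncc)
```

Each value of set identifies one matched set (one case plus up to m controls), which is the stratum a conditional logistic regression would use.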

Conditional estimates for OR

When two independent samples are collected from separate populations, a product-binomial likelihood can be used. Conditioning on the margins, however, gives a different formulation, as follows (ψ being the population OR):
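
For reference, the standard form of this conditional distribution is the noncentral hypergeometric. The notation is assumed here: X successes out of n_1 in the first sample, Y out of n_2 in the second, success probabilities p_1 and p_2, and the total t = X + Y held fixed.

```latex
\[
P(X = x \mid X + Y = t;\ \psi)
  = \frac{\binom{n_1}{x}\binom{n_2}{t-x}\,\psi^{x}}
         {\sum_{u=L}^{U}\binom{n_1}{u}\binom{n_2}{t-u}\,\psi^{u}},
\qquad
\psi = \frac{p_1/(1-p_1)}{p_2/(1-p_2)},
\]
\[
L = \max(0,\ t - n_2), \qquad U = \min(n_1,\ t).
\]
```

Maximizing this expression in ψ gives the conditional maximum likelihood estimate of the OR, which is the estimate R's fisher.test() reports; at ψ = 1 it reduces to the ordinary hypergeometric distribution.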

Why log(RR) instead of RR?

As an example of how the variances of different measures are calculated, the derivation of the approximate variance (or SE) of RR is shown as follows (it uses the multivariate delta method):
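
A sketch of that derivation in assumed notation: two independent binomial samples with a events out of n_1 and c events out of n_2, so that \hat p_1 = a/n_1 and \hat p_2 = c/n_2.

```latex
\[
\widehat{RR} = g(\hat p_1, \hat p_2) = \frac{\hat p_1}{\hat p_2},
\qquad
\nabla g = \left(\frac{1}{p_2},\ -\frac{p_1}{p_2^{2}}\right),
\qquad
\Sigma = \operatorname{diag}\!\left(\frac{p_1(1-p_1)}{n_1},\ \frac{p_2(1-p_2)}{n_2}\right)
\]
\[
\operatorname{Var}(\widehat{RR}) \approx \nabla g^{\top}\,\Sigma\,\nabla g
  = \frac{p_1(1-p_1)}{n_1 p_2^{2}} + \frac{p_1^{2}(1-p_2)}{n_2 p_2^{3}}
  = RR^{2}\left[\frac{1-p_1}{n_1 p_1} + \frac{1-p_2}{n_2 p_2}\right]
\]
\[
\operatorname{Var}(\log \widehat{RR})
  \approx \frac{1-p_1}{n_1 p_1} + \frac{1-p_2}{n_2 p_2},
\qquad
\widehat{\operatorname{SE}}(\log \widehat{RR})
  = \sqrt{\frac{1}{a} - \frac{1}{n_1} + \frac{1}{c} - \frac{1}{n_2}}
\]
```

On the log scale the approximate variance no longer carries the RR^2 factor, and a confidence interval built as exp(log RR ± z × SE) cannot fall below zero, which is the usual motivation for working with log(RR).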