Wednesday, January 26, 2011

Biostatistics and R

Biostatistics is mainly statistics for clinical and epidemiological studies. It studies the occurrence of illness (morbidity) and death (mortality), either at a point in time or over a period of time, and it provides models and estimates of risk (the probability that such an event occurs).



To assess risk, it is common practice to compare a diseased population with a population that is not (yet) diseased. Depending on the study design, the comparison can instead be between exposed and unexposed subjects.

For simplicity, let's start with a binary outcome variable (diseased / not diseased) and a binary exposure or grouping variable (exposed / unexposed). A 2x2 contingency table is appropriate for summarizing the frequencies:



                                          Response or outcome
Group or exposure or predictive variable  Diseased  Not diseased  Marginal total
Exposed                                   a         b             n1
Unexposed                                 c         d             n2
Marginal total                            m1        m2            N

where
    • a = number of observations where exposure present and outcome present
    • b = number of observations where exposure present and outcome absent
    • c = number of observations where exposure absent and outcome present
    • d = number of observations where exposure absent and outcome absent
    • n1 = number of observations where exposure present
    • n2 = number of observations where exposure absent
    • m1 = number of observations where outcome present
    • m2 = number of observations where outcome absent
    • N = total subjects in the study


The analysis of this table depends on various aspects, such as:
    • objective of the study (usually the goal is to infer a causal dependence from an observed association)
    • design of the study (randomized trial or observational)
    • involvement of other variables (covariates and confounders)
    • sample sizes
    • reliability of measurement
and so on.

Prevalence is the probability (p in the sample, pi in the population) of being diseased at a specific point in time (cross-sectional study). Incidence is the analogous probability over a time period (follow-up), with all patients taken to be independent of baseline characteristics. Both can be estimated using the binomial distribution.
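As a rough illustration of the binomial estimate mentioned above, here is a minimal R sketch. The counts (12 diseased out of 200 sampled) are hypothetical and are not taken from any study in this post; binom.test() is base R and returns an exact (Clopper-Pearson) interval.

```r
# Hypothetical cross-sectional counts (assumed for illustration only):
# 12 diseased subjects among 200 sampled.
diseased <- 12
n <- 200

# Point estimate of prevalence: the sample proportion p.
p_hat <- diseased / n

# Exact binomial 95% confidence interval for the population value pi.
ci <- binom.test(diseased, n, conf.level = 0.95)$conf.int

cat("Estimated prevalence:", p_hat, "\n")
cat("95% CI:", round(ci[1], 3), "to", round(ci[2], 3), "\n")
```

The same call could be used for an incidence proportion, with the number of new cases during follow-up in place of the number diseased.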

For example, suppose we have the following case-control study results, summarized in tabular form:

Case-control study (outcome: cancer; exposure: smoking)

                  Diseased  Not diseased  Marginal total
Exposed           a = 41    b = 28        n1 = 69
Unexposed         c = 19    d = 32        n2 = 51
Marginal total    m1 = 60   m2 = 60       N = 120
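
One possible way to enter this table in R is sketched below, using only base R. The object name tab and the dimension labels are my own choices; addmargins() appends the marginal totals (labelled "Sum") so they can be checked against the table.

```r
# Case-control counts from the table: rows = smoking, columns = cancer.
tab <- matrix(c(41, 28,
                19, 32),
              nrow = 2, byrow = TRUE,
              dimnames = list(smoking = c("Exposed", "Unexposed"),
                              cancer  = c("Diseased", "Not diseased")))

# Print the table with its marginal totals (n1, n2, m1, m2 and N).
print(addmargins(tab))

# The margins can also be extracted separately.
row_totals <- rowSums(tab)   # n1, n2
col_totals <- colSums(tab)   # m1, m2
total      <- sum(tab)       # N
```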

Because of the sampling design, risk cannot be estimated directly from a case-control study: the number of cancer cases was fixed in advance by the design (here, 60 cases and 60 controls). Such studies are also prone to confounding and can therefore give misleading results, so one has to be cautious when analyzing such data. Additional tools, such as stratification, are usually used to handle them properly.
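
Although risk itself is not estimable here, the odds ratio (a*d)/(b*c) is a measure of association that can still be computed from case-control data. The sketch below is an illustration I am adding, not an analysis from the study: it computes the crude (unadjusted) odds ratio for the smoking/cancer table with an approximate 95% interval on the log scale (Woolf's method). It ignores confounding entirely, so it should be read with the caution described above.

```r
# Case-control counts from the table: rows = smoking, columns = cancer.
tab <- matrix(c(41, 28,
                19, 32),
              nrow = 2, byrow = TRUE,
              dimnames = list(smoking = c("Exposed", "Unexposed"),
                              cancer  = c("Diseased", "Not diseased")))

# Crude sample odds ratio: (a * d) / (b * c).
odds_ratio <- (tab[1, 1] * tab[2, 2]) / (tab[1, 2] * tab[2, 1])

# Standard error of log(OR): sqrt(1/a + 1/b + 1/c + 1/d).
se_log_or <- sqrt(sum(1 / tab))

# Approximate 95% confidence interval, back-transformed from the log scale.
ci <- exp(log(odds_ratio) + c(-1, 1) * qnorm(0.975) * se_log_or)

cat("Odds ratio:", round(odds_ratio, 2), "\n")   # about 2.47
cat("Approximate 95% CI:", round(ci, 2), "\n")
```

For an exact test, base R's fisher.test(tab) could be used instead; note that it reports a conditional maximum likelihood estimate of the odds ratio, which differs slightly from the sample value computed here.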


There are various R tools that we can use to apply biostatistical methods. Here are a few of them:

In subsequent posts, I plan to discuss a bit about the use of R in a biostatistical context.
