\documentclass{article} \usepackage{hyperref} \begin{document} \title{Microarray analysis of oral cancer samples} \author{Humberto Ortiz-Zuazaga} \date{April 27, 2011} \maketitle \section{Introduction} Bioconductor~\cite{bioconductor} is a set of R packages for analysis of biological data, with an emphasis on microarray and other high-throughput datasets. This example will use standard \verb@affy@~\cite{affy} and \verb@limma@~\cite{limma} commands to analyze the workshop dataset. Bioconductor has extensive help, which you can access in many ways. One simple way is to type \verb@?foo@ where you want help on the object called ``foo''. You can open an interactive browser interface to the help system by typing \verb@help.start()@. In the browser, you can look at the documentation for the installed packages to find help on \verb@limma@ and \verb@affy@. <<>>= library(limma) library(affy) @ \section{Reading the data} A simple text file with tab separated columns can describe the microarray samples. In our case the first 6 samples are positive for HPV, and the remaining 5 samples are negative. These are labeled ``pos'' and ``neg'' in the targets file. <<>>= targets <- readTargets("targets.txt") targets ab <- ReadAffy(filenames=targets$FileName) @ \verb@ab@ will contain the AffyBatch, with the raw expression values for each probe in each sample, with additional information on the probes and samples. \section{Normalization and pre-processing} We can use the \verb@rma@ command to normalize and summarize the probes for each feature. Prior to the summarization, each feature is represented with four probes. After the normalization and summarization routine, we have a single expression value for each feature in each sample. <<>>= probeNames(ab)[1:10] eset <- rma(ab) featureNames(eset)[1:10] @ A boxplot shows the distribution of expression values before (Figure~\ref{fig:boxplot:ab}) and after (Figure~\ref{fig:boxplot:es}) the normalization. <>= boxplot(ab) @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Box plot before normalization} \label{fig:boxplot:ab} \end{figure} <>= boxplot(exprs(eset)) @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Box plot after normalization} \label{fig:boxplot:es} \end{figure} \section{Experimental design} The experiment has a simple design, each sample is labeled in the targets file with the target it was hybridized with. This information can be used to constuct a design matrix that identifies each group. <>= f <- factor(targets$Target, levels=c("pos", "neg")) design <- model.matrix(~0+f) colnames(design) <- c("pos", "neg") design @ We can fit a model that has a mean for each group, and test if the group means are different. The \texttt{eBayes} function computes an empirical Bayes factor, pooling the variances from all the genes to estimate significance. <>= cont.matrix <- makeContrasts(posvsneg=pos-neg, levels=design) cont.matrix fit <- lmFit(eset, design) fit2 <- contrasts.fit(fit, cont.matrix) fit.b <- eBayes(fit2) @ \section{Reporting the results} We now have a model fit that estimates the log ratios between the positive and negative samples. An MA plot (Figure~\ref{fig:maplot}) summarizes the fit. The y axis plots M, the log ratio of expression in the positive and negative coefficients. The x axis plots the A, or average log intensity of each gene. <>= #smoothScatter(fit.b$Amean, fit.b$coefficients, xlab="A", ylab="M") plotMA(fit.b) @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{MA plot} \label{fig:maplot} \end{figure} The fit also has an estimate of the Bayes factor, the log odds of differential expression for each gene. A plot of the B vs log ratios is called a volcanoplot (see Figure~\ref{fig:volcano}). <>= #lod <- -log10(fit.b$p.value) #smoothScatter(fit.b$coefficients, lod, yaxt="n", xlab="M", ylab="p-value") #axis(2, at = seq(0, 4, by = 1), labels = 10^(-seq(0, 4, by = 1))) volcanoplot(fit.b) @ \begin{figure} \begin{center} <>= <> @ \end{center} \caption{Volcano plot} \label{fig:volcano} \end{figure} Another way to report the results is exporting a table with the most significant features. Estimated p-values using a number of multiple testing corrections can be computed, in this case we use the Benjamini \& Hochberg correction.~\cite{BH} <<>>= topTable(fit.b, adjust="BH") @ \begin{thebibliography}{} \bibitem{BH} Benjamini, Y., and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. \emph{Journal of the Royal Statistical Society Series B,} 57, 289--300. \bibitem{affy} Gautier, L., Cope, L., Bolstad, B. M., and Irizarry, R. A. (2004). affy---analysis of Affymetrix GeneChip data at the probe level. \textit{Bioinformatics} 20, 3 (Feb. 2004), 307-315. \bibitem{bioconductor} R. Gentleman, V. J. Carey, D. M. Bates, B.Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, and others Bioconductor: Open software development for computational biology and bioinformatics (2004). \textit{Genome Biology}, Vol. 5, R80 \bibitem{limma} Smyth, G. K. (2005). Limma: linear models for microarray data. In: 'Bioinformatics and Computational Biology Solutions using R and Bioconductor'. R. Gentleman, V. Carey, S. Dudoit, R. Irizarry, W. Huber (eds), Springer, New York, pages 397--420. \end{thebibliography} \end{document}