Custom Quality Control for Spotted Arrays
April 18, 2008
Agnes Paquet
1, Andrea Barczak
1, (Jean) Yee Hwa
Yang
2
1. Department of Medicine, Functional Genomics Core Facility,
University of California, San Francisco
[email protected]
2. School of Mathematics and Statistics, University of Sydney, Australia
Content
ArrayQualityprovides
a flexible framework for assessment of hybridization quality. Many of
the functions can be customized to better suit user's
needs. For example, arrayQuality is
currently using look-up tables adapted to hybridizations performed in the
Functional Genomics Core Facility at UCSF. Depending on your data, you may find
that the probes defined as controls in arrayQuality are not
present on your array, leading to NAs in the comparative boxplot, or you may be
working with a genome for which we are not providing references. In this section, we will describe how to:
- Define your own controls in gpQuality
- Use your own collection of good quality slides for quantitative assessment of the comparative boxplot part of gpQuality
- Define a new quantitative quality measure in the comparative boxplot
All example code will be
provided for GenePix data format, but the same functionalities are
available for Agilent or Spot data format.
Warning:
Modifying the default settings requires a good understanding
of two-color array design and some programming experience in R. Only
more advanced users should use the functionalities described in
this section. Other users should start by first using default
functions for general hybridization, as described in the basicQuality
user's guide, available from the main online help page.
1. How to specify your own set of controls or spot types in gpQuality
gpQuality has
several arguments that you can modify in order to use your own spot types or
your own collection of good slides. gpQuality arguments are listed
below:
gpQuality(fnames = NULL,
path = ".", organism = c("Mm", "Hs"),
compBoxplot = TRUE, reference
= NULL,
controlMatrix = controlCode, controlId = c("ID", "Name"),
output = FALSE, resdir =".", dev= "png", DEBUG =
FALSE,...)
To
use your own set of spot types (i.e. controls...): you will need to change controlMatrix and/or
controlId.
The spot types currenlty used in arrayQuality are defined in
a 2-column matrix called controlCode.
Pattern
|
Name
|
Buffer
|
Buffer
|
Empty
|
Empty
|
EMPTY
|
Empty
|
AT
|
Negative
|
M200009348
|
Positive
|
M200003425
|
Positive
|
NLG
|
con
|
Table 1: Examples of controls used in
arrayQuality
To define your own spot types, you will need to
replace the default values in controlCode with the controls present on your arrays. The easiest way to
do it is to create a tab-delimited text file named SpotTypes.txt, and read it
into arrayQuality using the function readcontrolCode. It is also possible to
create a new controlCode matrix directly.
1.1 If you want to use a Spot
Types file:
A spot types file is a tab-delimited text file which allows
you to identify different types of spots from the gene list. It should contain
at least a column named SpotType where all different spot types are listed and
one or more other columns, which should have the same names as columns in the
GAL file, containing patterns or regular expressions sufficient to identify the
spot-type. For more information, you can refer to the limma package
user's guide.
Warning: You will need
to include a spot type named probes!!
Below is an example of spot
types files for the swirl dataset. In this case there are only two types of
spots, probes and controls.
Example of spot types file
To
read the new spot types in arrayQuality:
- Create your spot types
file.
- Find which column of the file contains probes identification for each
type. In the example Figure 8, it is the "ID" column. You will need to pass this
column name as argument at the next step.
- Read the spot types files using
the readcontrolCode function.
>
controlCode <- readcontrolCode(file=”mySpotTypes.txt”,
controlId="ID")
- Find which column of the gpr
file can be used to identify your new spot types. It is typically the "ID" or
the "Name" column.
- To generate both types of plots: call gpQuality specifying
your new controlCode matrix in controlMatrix and
which column is used to define your spot types in controlId.
>
result <- gpQuality(controlMatrix = controlCode, controlId=”Id”)
1.2 If
you want to create a new controlCode matrix
directly
You will need to create another controlCode table
containing two columns as well, and then overwrite the default controlCode loaded
with arrayQuality. The controlCode matrix must have 2 columns:
-
A column named "Pattern" containing your control IDs
- A column named
"Name", describing what king of control is each probe (in particular what are
Positive, Negative, Empty controls)
You can do it by creating a tab
delimited text file and read it in R after loading arrayQuality:
>
library(arrayQuality)
> mycontrolCode <-
as.matrix(read.table("mycontrolCode.txt", sep="\t",
header=TRUE, quote="\"", fill=TRUE)))
Then, pass your new matrix as
argument when calling gpQuality. You can specify which column of the gpr file
contains probes identifiers in the controlId arguments (typically, it would be
"Id" or "Name").
> results <- gpQuality(controlMatrix =
mycontrolCode, controlId = "ID")
2. How to use your own collection of good quality reference slides
The
comparative boxplot generated as part of the general hybridization
diagnostic plot can be used to detect outlier arrays within in a large
dataset or if you want to study hybridization quality for other genomes.
This is done by comparing some quantitative statistics from each array
to a range of values corresponding to good quality arrays, which are determined over a collection of "good quality" arrays
selected from the same dataset. To use your own collection
of good slides: you will need to modify a look-up table named reference passed in the arguments of gpQuality.
To generate your
own reference:
1. Gather the slides of "good" quality you would like
to use as reference in a directory, for example "MyReferences". Slides can be
from different print-runs.
2. Change your R working directory to
"MyReferences"
3. Load arrayQuality
package by typing library(arrayQuality) in R
4.
Create your reference quality measures by typing:
> myReference <-
globalQuality()
5. Change R
working directory to the directory containing slides you would like to test, as
described. You can only compare slides from the same print-run
here. If you have an experiment using 2 print-runs, you will need to
run gpQuality 2 times.
6. Run gpQuality using the reference
measures and the scaling table you have generated:
> results <- gpQuality(reference =
myReference)
Other gpQuality arguments described above can
also be applied here.
3. How to define your own quantitative measures for the comparative boxplot
We have selected a set of 15 measures that can be used to verify
quantitatively the quality of an array. These measures are described in
detail in the basicQuality guide, that can be accessd from the main
help page of the package. This part can be customized a step
further, as it is possible to define your own set of quantitative
measures. The quantitative boxplot is generated as follows:
- Read in the array data from the image analysis software input. This can be done by calling the function readGPR, which stores specific columns from the input file in a list.
- Calculate the quantitative measures and statistics you need for
quality assessment. The is done one slide at a time using the function slideQuality. You will need to create your own slideQuality function if you prefer to use your own set of quantitative measures.
- Select a collection of "good quality" arrays to use as reference.
- Estimate the range of each measure corresponding to "good
quality" arrays by repeating steps 1 and 2 for each array in your
collection. Save as your new reference (see section 2 above for details).
- Draw the comparative boxplot for each slide using your new reference.
- Estimate the QC score for your array using the qcScore function.
The new slideQuality function must call the same argument as
slidequality, and return a matrix. Here is an example R code for such a
function.
The gprData argument is the result you get from running readGPR on your array. readGPR
reads in specific columns from the gpr file only, and returns the
results in a list. Depending on your needs, you may need to create your
own reading function as well. Then, you can use the elements from gprData
to calculate the QC measures you are interested in and return them in a
matrix. This example function returns 4 quality measures: the range of
red forground, the range of green foreground, the median spot area and
the spot radius.
mySlideQuality <- function(gprData = NULL, controlMatrix = controlCode, controlId = c("ID", "Name"), DEBUG = FALSE, ...)
{
Rf <- log.na(gprData[["RfMedian"]], 2)
Gf <- log.na(gprData[["GfMedian"]], 2)
rRf <- range(Rf, na.rm = TRUE)
rangeRf <- rRf[2] - rRf[1]
rGf <- range(Gf, na.rm = TRUE)
rangeGf <- rGf[2] - rGf[1]
spotArea <- median(gprData[["spotArea"]], na.rm = TRUE)
spotRadius <- round(sqrt(spotArea)/pi)
sortedMeasures <- c("range RF", "range GF", "spotArea", "spotRadius")
sortedRes <- c(rangeRf, rangeGf, spotArea, spotRadius)
numResult <- as.matrix(sortedRes)
rownames(numResult) <- sortedMeasures
colnames(numResult) <- gprData[["File"]]
return(numResult)
}
Example
1) Create mySlideQuality (as above)
2) Run the following R code:
> datadir <- system.file("gprQCData", package="arrayQuality")
> gprData <- readGPR(fnames="9Mm137.gpr", path=datadir) ## read gpr file
> res = cbind(mySlideQuality(gprData),mySlideQuality(gprData) + 2, mySlideQuality(gprData)+4) ## create dummy reference
> res1 = mySlideQuality(gprData) ## run the new function of the slide
> qualBoxplot(arrayQuality = res1, reference = res) ## draw the boxplot
> qcScore(arrayQuality = res1, reference = res) ## calculate QC score