Custom Quality Control for Spotted Arrays

April 18, 2008

Agnes Paquet¹, Andrea Barczak¹, (Jean) Yee Hwa Yang²

1. Department of Medicine, Functional Genomics Core Facility, University of California, San Francisco
[email protected]
2. School of Mathematics and Statistics, University of Sydney, Australia

Content

ArrayQualityprovides a flexible framework for assessment of hybridization quality. Many of the functions can be customized to better suit user's needs. For example, arrayQuality is currently using look-up tables adapted to hybridizations performed in the Functional Genomics Core Facility at UCSF. Depending on your data, you may find that the probes defined as controls in arrayQuality are not present on your array, leading to NAs in the comparative boxplot, or you may be working with a genome for which we are not providing references. In this section, we will describe how to:

Define your own controls in gpQuality
Use your own collection of good quality slides for quantitative assessment of the comparative boxplot part of gpQuality
Define a new quantitative quality measure in the comparative boxplot

All example code will be provided for GenePix data format, but the same functionalities are available for Agilent or Spot data format.

Warning: Modifying the default settings requires a good understanding of two-color array design and some programming experience in R. Only more advanced users should use the functionalities described in this section. Other users should start by first using default functions for general hybridization, as described in the basicQuality user's guide, available from the main online help page.

1. How to specify your own set of controls or spot types in gpQuality

gpQuality has several arguments that you can modify in order to use your own spot types or your own collection of good slides. gpQuality arguments are listed below:

gpQuality(fnames = NULL, path = ".", organism = c("Mm", "Hs"),
        compBoxplot = TRUE, reference = NULL,
        controlMatrix = controlCode, controlId = c("ID", "Name"),
          output = FALSE, resdir =".", dev= "png", DEBUG = FALSE,...)

To use your own set of spot types (i.e. controls...): you will need to change controlMatrix and/or controlId.

The spot types currenlty used in arrayQuality are defined in a 2-column matrix called controlCode.

Pattern	Name
Buffer	Buffer
Empty	Empty
EMPTY	Empty
AT	Negative
M200009348	Positive
M200003425	Positive
NLG	con

Table 1: Examples of controls used in arrayQuality

To define your own spot types, you will need to replace the default values in controlCode with the controls present on your arrays. The easiest way to do it is to create a tab-delimited text file named SpotTypes.txt, and read it into arrayQuality using the function readcontrolCode. It is also possible to create a new controlCode matrix directly.

1.1 If you want to use a Spot Types file:

A spot types file is a tab-delimited text file which allows you to identify different types of spots from the gene list. It should contain at least a column named SpotType where all different spot types are listed and one or more other columns, which should have the same names as columns in the GAL file, containing patterns or regular expressions sufficient to identify the spot-type. For more information, you can refer to the limma package user's guide.

Warning: You will need to include a spot type named probes!!

Below is an example of spot types files for the swirl dataset. In this case there are only two types of spots, probes and controls.

Example of spot types file

To read the new spot types in arrayQuality:

- Create your spot types file.
- Find which column of the file contains probes identification for each type. In the example Figure 8, it is the "ID" column. You will need to pass this column name as argument at the next step.
- Read the spot types files using the readcontrolCode function.

> controlCode <- readcontrolCode(file=”mySpotTypes.txt”, controlId="ID")

- Find which column of the gpr file can be used to identify your new spot types. It is typically the "ID" or the "Name" column.
- To generate both types of plots: call gpQuality specifying your new controlCode matrix in controlMatrix and which column is used to define your spot types in controlId.

> result <- gpQuality(controlMatrix = controlCode, controlId=”Id”)

1.2 If you want to create a new controlCode matrix directly

You will need to create another controlCode table containing two columns as well, and then overwrite the default controlCode loaded with arrayQuality. The controlCode matrix must have 2 columns:
    - A column named "Pattern" containing your control IDs
    - A column named "Name", describing what king of control is each probe (in particular what are Positive, Negative, Empty controls)

You can do it by creating a tab delimited text file and read it in R after loading arrayQuality:

> library(arrayQuality)
> mycontrolCode <- as.matrix(read.table("mycontrolCode.txt", sep="\t",
                          header=TRUE, quote="\"", fill=TRUE)))

Then, pass your new matrix as argument when calling gpQuality. You can specify which column of the gpr file contains probes identifiers in the controlId arguments (typically, it would be "Id" or "Name").
> results <- gpQuality(controlMatrix = mycontrolCode, controlId = "ID")

2. How to use your own collection of good quality reference slides

The comparative boxplot generated as part of the general hybridization diagnostic plot can be used to detect outlier arrays within in a large dataset or if you want to study hybridization quality for other genomes. This is done by comparing some quantitative statistics from each array to a range of values corresponding to good quality arrays, which are determined over a collection of "good quality" arrays selected from the same dataset. To use your own collection of good slides: you will need to modify a look-up table named reference passed in the arguments of gpQuality.

To generate your own reference:

    1. Gather the slides of "good" quality you would like to use as reference in a directory, for example "MyReferences". Slides can be from different print-runs.

    2. Change your R working directory to "MyReferences"

    3. Load arrayQuality package by typing library(arrayQuality) in R

    4. Create your reference quality measures by typing:
    > myReference <- globalQuality()

    5. Change R working directory to the directory containing slides you would like to test, as described. You can only compare slides from the same print-run here. If you have an experiment using 2 print-runs, you will need to run gpQuality 2 times.

    6. Run gpQuality using the reference measures and the scaling table you have generated:
    > results <- gpQuality(reference = myReference)

    Other gpQuality arguments described above can also be applied here.

3. How to define your own quantitative measures for the comparative boxplot

We have selected a set of 15 measures that can be used to verify quantitatively the quality of an array. These measures are described in detail in the basicQuality guide, that can be accessd from the main help page of the package. This part can be customized a step further, as it is possible to define your own set of quantitative measures. The quantitative boxplot is generated as follows:

Read in the array data from the image analysis software input. This can be done by calling the function readGPR, which stores specific columns from the input file in a list.
Calculate the quantitative measures and statistics you need for quality assessment. The is done one slide at a time using the function slideQuality. You will need to create your own slideQuality function if you prefer to use your own set of quantitative measures.
Select a collection of "good quality" arrays to use as reference.
Estimate the range of each measure corresponding to "good quality" arrays by repeating steps 1 and 2 for each array in your collection. Save as your new reference (see section 2 above for details).
Draw the comparative boxplot for each slide using your new reference.
Estimate the QC score for your array using the qcScore function.

The new slideQuality function must call the same argument as slidequality, and return a matrix. Here is an example R code for such a function.
The gprData argument is the result you get from running readGPR on your array. readGPR reads in specific columns from the gpr file only, and returns the results in a list. Depending on your needs, you may need to create your own reading function as well. Then, you can use the elements from gprData to calculate the QC measures you are interested in and return them in a matrix. This example function returns 4 quality measures: the range of red forground, the range of green foreground, the median spot area and the spot radius.

mySlideQuality <- function(gprData = NULL, controlMatrix = controlCode, controlId = c("ID", "Name"), DEBUG = FALSE, ...)
{
Rf <- log.na(gprData[["RfMedian"]], 2)
Gf <- log.na(gprData[["GfMedian"]], 2)
rRf <- range(Rf, na.rm = TRUE)
rangeRf <- rRf[2] - rRf[1]
rGf <- range(Gf, na.rm = TRUE)
rangeGf <- rGf[2] - rGf[1]
spotArea <- median(gprData[["spotArea"]], na.rm = TRUE)
spotRadius <- round(sqrt(spotArea)/pi)
sortedMeasures <- c("range RF", "range GF", "spotArea", "spotRadius")
sortedRes <- c(rangeRf, rangeGf, spotArea, spotRadius)
numResult <- as.matrix(sortedRes)
rownames(numResult) <- sortedMeasures
colnames(numResult) <- gprData[["File"]]
return(numResult)
}

Example

1) Create mySlideQuality (as above)
2) Run the following R code:
> datadir <- system.file("gprQCData", package="arrayQuality")
> gprData <- readGPR(fnames="9Mm137.gpr", path=datadir) ## read gpr file
> res = cbind(mySlideQuality(gprData),mySlideQuality(gprData) + 2, mySlideQuality(gprData)+4) ## create dummy reference
> res1 = mySlideQuality(gprData) ## run the new function of the slide
> qualBoxplot(arrayQuality = res1, reference = res) ## draw the boxplot
> qcScore(arrayQuality = res1, reference = res) ## calculate QC score