Package: cydar
Author: Aaron Lun ([email protected])
Last modified: 2017-03-21
Compilation date: 2017-11-09
Mass cytometry is a technique that allows simultaneous profiling of many (> 30) protein markers on each of millions of cells. This is frequently used to characterize cell subpopulations based on unique combinations of markers. One way to analyze this data is to identify subpopulations that change in abundance between conditions, e.g., with or without drug treatment, before and after stimulation. This vignette will describe the steps necessary to perform this “differential abundance” (DA) analysis.
The analysis starts from a set of Flow Cytometry Standard (FCS) files containing intensities for each cell. For the purposes of this vignette, we will simulate some data to demonstrate the methods below. This experiment will assay 30 markers, plus one additional marker used for quality control, and contain 3 replicate samples in each of 2 biological conditions. We add two small differentially abundant subpopulations to ensure that we get something to look at later.
ncells <- 20000
nda <- 200
nmarkers <- 31
down.pos <- 1.8
up.pos <- 1.2
conditions <- rep(c("A", "B"), each=3)
combined <- rbind(matrix(rnorm(ncells*nmarkers, 1.5, 0.6), ncol=nmarkers),
    matrix(rnorm(nda*nmarkers, down.pos, 0.3), ncol=nmarkers),
    matrix(rnorm(nda*nmarkers, up.pos, 0.3), ncol=nmarkers))
combined[,31] <- rnorm(nrow(combined), 1, 0.5) # last marker is a QC marker.
combined <- 10^combined # raw intensity values
sample.id <- c(sample(length(conditions), ncells, replace=TRUE),
    sample(which(conditions=="A"), nda, replace=TRUE),
    sample(which(conditions=="B"), nda, replace=TRUE))
colnames(combined) <- paste0("Marker", seq_len(nmarkers))
We use this to construct an ncdfFlowSet for our downstream analysis.
library(ncdfFlow)
library(cydar) # provides poolCells and the other cydar functions used below
collected.exprs <- list()
for (i in seq_along(conditions)) {
    stuff <- list(combined[sample.id==i,,drop=FALSE])
    names(stuff) <- paste0("Sample", i)
    collected.exprs[[i]] <- poolCells(stuff)
}
names(collected.exprs) <- paste0("Sample", seq_along(conditions))
collected.exprs <- ncdfFlowSet(as(collected.exprs, "flowSet"))
In practice, we can use the read.ncdfFlowSet function to load intensities from FCS files into the R session.
The resulting ncdfFlowSet object can replace all instances of collected.exprs in the downstream steps.
The intensities need to be transformed and gated prior to further analysis.
We first pool all cells together into a single flowFrame, which will be used for construction of the transformation and gating functions for all samples.
This avoids spurious differences from using sample-specific functions.
pool.ff <- poolCells(collected.exprs)
We use the estimateLogicle method from the flowCore package to obtain a transformation function, and apply it to pool.ff.
This performs a biexponential transformation with parameters estimated for optimal display.
library(flowCore)
trans <- estimateLogicle(pool.ff, colnames(pool.ff))
proc.ff <- transform(pool.ff, trans)
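For intuition about why a transformation of this kind is needed, the following sketch applies an inverse hyperbolic sine (arcsinh) transformation in base R. This is a simpler relative of the logicle transformation and is not what estimateLogicle computes; the cofactor of 5 and the example intensities are assumptions chosen for illustration.

```r
# Hypothetical raw intensities spanning zero to very large values.
raw <- c(0, 1, 10, 100, 1000, 10000)

# Assumed cofactor; it is not estimated from the data here.
cofactor <- 5
transformed <- asinh(raw / cofactor)

# For large values, arcsinh behaves like a shifted logarithm,
# while values near zero stay finite and roughly linear.
approx.log <- log(2 * raw / cofactor)
cbind(raw, transformed, approx.log)
```

The first row shows that a raw intensity of zero maps to zero rather than to negative infinity, which is the main practical advantage over a plain log-transformation.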
The next step is to construct gates to remove uninteresting cells. Several gates are commonly used in mass cytometry data analysis and are typically applied in a fixed order. One of these, gating on DNA content, can be performed with the dnaGate function.
To demonstrate, we will construct a gate to remove high values for the last marker, using the outlierGate function.
The constructed gate is then applied to the flowFrame, only retaining cells falling within the gated region.
gate.31 <- outlierGate(proc.ff, "Marker31", type="upper")
gate.31
## Rectangular gate 'Marker31_outlierGate' with dimensions:
## Marker31: (-Inf,4.00784615101964)
filter.31 <- filter(proc.ff, gate.31)
summary(filter.31@subSet)
## Mode FALSE TRUE
## logical 35 20173
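To illustrate what an outlier-based upper gate does, the following base R sketch derives an upper bound from the median plus a number of median absolute deviations (MADs). The upperBound helper, the choice of three MADs and the simulated intensities are assumptions for illustration; outlierGate may compute its bound differently.

```r
set.seed(100)
# Hypothetical transformed intensities for a single QC marker,
# with three obvious high outliers appended at the end.
x <- c(rnorm(1000, mean=2, sd=0.5), 6, 7, 8)

# Upper bound: median plus a number of MADs (nmads=3 is an assumed choice).
upperBound <- function(values, nmads=3) {
    median(values) + nmads * mad(values)
}

bound <- upperBound(x)
keep <- x < bound # analogous to retaining cells within (-Inf, bound)
table(keep)
```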
We apply the gate before proceeding to the next marker to be gated.
proc.ff <- Subset(proc.ff, gate.31)
### Applying functions to the original data
Applying the transformation functions to the original data is simple.
processed.exprs <- transform(collected.exprs, trans)
Applying the gates is similarly easy. Use methods from the flowViz package to see how gating diagnostics can be visualized.
processed.exprs <- Subset(processed.exprs, gate.31)
Markers used for gating are generally ignored in the rest of the analysis. For example, as long as all cells contain DNA, we are generally not interested in differences in the amount of DNA. This is achieved by discarding those markers (in this case, marker 31).
processed.exprs <- processed.exprs[,1:30]
By default, we do not perform any normalization of intensities between samples. This is because we assume that barcoding was used with multiplexed staining and mass cytometry. Thus, technical biases that might affect intensity should be the same in all samples, which means that they cancel out when comparing between samples.
In data sets containing multiple batches of separately barcoded samples, we provide the normalizeBatch function to adjust the intensities.
This uses range-based normalization to equalize the dynamic range between batches for each marker.
Alternatively, it can use warping functions to eliminate non-linear distortions due to batch effects.
The problem of normalization is much harder to solve in data sets with no barcoding at all.
In such cases, the best solution is to expand the sizes of the hyperspheres to “smooth over” any batch effects.
See the expandRadius function for more details.
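The idea behind range-based normalization can be sketched in base R for a single marker: each batch is linearly rescaled so that its lower and upper quantiles match a common target. The rescaleRange helper, the 1st and 99th percentiles and the averaged target are assumptions for illustration, and this is not the normalizeBatch implementation.

```r
set.seed(42)
# Hypothetical intensities for one marker in two batches;
# batch 2 has a compressed and shifted dynamic range.
batch1 <- rnorm(5000, mean=2, sd=0.6)
batch2 <- 0.8 * rnorm(5000, mean=2, sd=0.6) + 0.5

# Linearly rescale 'x' so that its chosen quantiles map onto 'target'.
rescaleRange <- function(x, target, probs=c(0.01, 0.99)) {
    current <- quantile(x, probs, names=FALSE)
    (x - current[1]) / diff(current) * diff(target) + target[1]
}

# Target range: average of the per-batch quantiles (an assumed choice).
probs <- c(0.01, 0.99)
target <- (quantile(batch1, probs, names=FALSE) +
    quantile(batch2, probs, names=FALSE)) / 2

norm1 <- rescaleRange(batch1, target, probs)
norm2 <- rescaleRange(batch2, target, probs)
rbind(quantile(norm1, probs), quantile(norm2, probs))
```

After rescaling, both batches share the same lower and upper quantiles for this marker, so differences in dynamic range no longer separate them.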
We quantify abundance by assigning cells to hyperspheres in the high-dimensional marker space, and counting the number of cells from each sample in each hypersphere.
To do this, we first convert the intensity data into a format that is more amenable for counting.
The prepareCellData function works with either a list of matrices or directly with an ncdfFlowSet object, and generates a CyData object containing the reformatted intensities.
cd <- prepareCellData(processed.exprs)
We then assign cells to hyperspheres using the countCells function.
Each hypersphere is centred at a cell to restrict ourselves to non-empty hyperspheres, and has radius equal to 0.5 times the square root of the number of markers.
The square root function adjusts for increased sparsity of the data at higher dimensions, while the 0.5 scaling factor allows cells with 10-fold differences in marker intensity (due to biological variability or technical noise) to be counted into the same hypersphere.
Also see the neighborDistances function for guidance on choosing a value of tol.
cd <- countCells(cd, tol=0.5)
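The radius calculation and the counting step can be sketched in base R for a single hypersphere. The simulated cells and sample labels are hypothetical, and this brute-force distance calculation is only meant to show the idea; it is not how countCells is implemented.

```r
set.seed(1)
nmarkers <- 30
tol <- 0.5
radius <- tol * sqrt(nmarkers) # about 2.74 for 30 markers

# Hypothetical transformed intensities: 500 cells from 2 samples.
cells <- matrix(rnorm(500 * nmarkers, mean=2, sd=0.5), ncol=nmarkers)
sample.of.cell <- sample(c("Sample1", "Sample2"), nrow(cells), replace=TRUE)

# Centre a hypersphere on the first cell and find all cells within the radius.
centre <- cells[1,]
dist.to.centre <- sqrt(rowSums(sweep(cells, 2, centre)^2))
inside <- dist.to.centre <= radius

# Count the cells from each sample that fall inside the hypersphere.
counts <- table(factor(sample.of.cell[inside], levels=c("Sample1", "Sample2")))
counts
```

Because the hypersphere is centred on a cell, it always contains at least that cell, which is why centring on cells avoids empty hyperspheres.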
The output is another CyData object with extra information added to various fields.
In particular, the reported count matrix contains the set of counts for each hypersphere (row) from each sample (column).
head(assay(cd))
## Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
## [1,] 8 6 7 6 9 4
## [2,] 5 6 10 3 3 4
## [3,] 1 1 5 2 0 1
## [4,] 3 3 3 1 3 2
## [5,] 2 3 3 5 6 7
## [6,] 1 2 2 6 0 2
Also reported are the “positions” of the hyperspheres, defined for each marker as the median intensity for all cells assigned to each hypersphere. These will be needed later for interpretation, as the marker intensities are used to define the function of each subpopulation. Shown below are the positions of the first six hyperspheres, each represented by its set of median intensities across all markers.
head(intensities(cd))
## Marker1 Marker2 Marker3 Marker4 Marker5 Marker6 Marker7
## [1,] 1.931244 1.866820 2.379113 2.582273 2.062216 2.525648 2.049491
## [2,] 2.424262 2.101528 2.173225 2.481787 2.357530 2.242472 2.306254
## [3,] 1.606079 1.446424 2.950898 1.672853 2.119159 3.008177 1.650358
## [4,] 2.334605 1.336171 1.682531 2.072550 2.223555 2.891503 1.942141
## [5,] 2.424519 1.808555 2.075388 2.207880 2.164884 2.613409 2.131138
## [6,] 1.659978 1.968532 2.433053 2.108803 2.166295 1.981348 2.452967
## Marker8 Marker9 Marker10 Marker11 Marker12 Marker13 Marker14
## [1,] 1.800529 2.317094 2.322218 1.660237 2.142871 2.181511 1.690442
## [2,] 2.065421 2.411989 1.875516 2.120249 2.171454 2.157398 1.805432
## [3,] 1.989199 2.245853 2.309089 1.450550 2.331697 2.194163 1.101783
## [4,] 1.808780 2.178555 2.235732 1.967146 2.114671 2.430949 1.513738
## [5,] 1.710951 1.954568 2.528689 1.988484 2.452533 2.306967 1.003715
## [6,] 1.650627 1.768316 2.254658 1.766145 2.270032 2.379800 1.822746
## Marker15 Marker16 Marker17 Marker18 Marker19 Marker20 Marker21
## [1,] 2.413519 2.064204 1.969572 2.275450 2.082601 1.723337 2.136551
## [2,] 2.525809 2.247500 2.497703 2.859261 2.375079 1.903090 1.959238
## [3,] 2.420948 1.742913 2.197725 1.994398 2.367873 1.875352 1.921021
## [4,] 2.881521 2.132088 2.407831 1.961040 2.627807 1.967286 2.383922
## [5,] 2.037492 2.160304 1.779051 2.663430 2.114902 1.332043 2.455978
## [6,] 2.017618 2.327225 1.799140 2.181881 2.347408 2.322177 2.448152
## Marker22 Marker23 Marker24 Marker25 Marker26 Marker27 Marker28
## [1,] 2.613480 1.943236 2.254744 2.405229 2.208645 2.330122 2.447775
## [2,] 2.098471 2.227989 2.560375 2.060277 1.574099 2.389267 2.810435
## [3,] 2.457890 2.070755 2.727472 1.802785 2.346651 2.234629 2.570514
## [4,] 3.030380 1.942028 2.342264 2.133311 1.308392 1.966470 1.545777
## [5,] 1.948633 2.388765 2.216602 2.289087 2.129856 2.397688 2.286880
## [6,] 2.124992 1.986571 2.600916 1.886158 1.796845 2.140024 2.004693
## Marker29 Marker30
## [1,] 2.750324 2.009846
## [2,] 1.650106 1.797402
## [3,] 1.965060 1.789484
## [4,] 2.233714 2.106068
## [5,] 1.658948 2.678580
## [6,] 1.547964 2.164445
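As a small illustration of how a position is defined, the sketch below computes the per-marker median for the cells assigned to one hypersphere. The intensities are simulated and the marker names are placeholders.

```r
set.seed(2)
# Hypothetical intensities for the cells assigned to one hypersphere
# (20 cells, 3 markers).
assigned <- matrix(rnorm(60, mean=2, sd=0.3), ncol=3,
    dimnames=list(NULL, paste0("Marker", 1:3)))

# The position is the per-marker median across the assigned cells.
position <- apply(assigned, 2, median)
position
```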
There is some light filtering in countCells to improve memory efficiency, which can be adjusted with the filter argument.
We can use a number of methods to test the count data for differential abundance. Here, we will use the quasi-likelihood (QL) method from the edgeR package. This allows us to model discrete count data with overdispersion due to biological variability.
library(edgeR)
y <- DGEList(assay(cd), lib.size=cd$totals)
First, we do some filtering to remove low-abundance hyperspheres with average counts below 5. These are mostly uninteresting as they do not provide enough evidence to reject the null hypothesis. Removing them also reduces computational work and the severity of the multiple testing correction. Lower values can also be used, but we do not recommend going below 1.
keep <- aveLogCPM(y) >= aveLogCPM(5, mean(cd$totals))
cd <- cd[keep,]
y <- y[keep,]
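The effect of this filter can be approximated in base R with a plain average count per hypersphere. This is a simplified sketch on a made-up count matrix: unlike the aveLogCPM-based filter above, it ignores differences in the total number of cells per sample.

```r
# Hypothetical count matrix: 4 hyperspheres (rows) by 6 samples (columns).
counts <- rbind(c(8, 6, 7, 6, 9, 4),
                c(1, 1, 5, 2, 0, 1),
                c(5, 6, 10, 3, 3, 4),
                c(0, 1, 0, 2, 1, 0))

# Keep hyperspheres with an average count of at least 5.
min.average <- 5
keep <- rowMeans(counts) >= min.average
counts[keep, , drop=FALSE]
```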
We then apply the QL framework to estimate the dispersions, fit a generalized linear model and test for significant differences between conditions. We refer interested readers to the edgeR user's guide for more details.
design <- model.matrix(~factor(conditions))
y <- estimateDisp(y, design)
fit <- glmQLFit(y, design, robust=TRUE)
res <- glmQLFTest(fit, coef=2)
Note that normalization by total cell count per sample is implicitly performed by setting lib.size=cd$totals.
We do not recommend using calcNormFactors in this context, as its assumptions may not be applicable to mass cytometry data.
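To see why coef=2 tests for a difference between conditions, it can help to inspect the design matrix directly. The sketch below rebuilds it in base R from the same conditions vector used in the simulation.

```r
conditions <- rep(c("A", "B"), each=3)
design <- model.matrix(~factor(conditions))

# The first column is the intercept (average log-abundance in condition A);
# the second column is the log-fold change of condition B over A.
colnames(design)
design
```

Dropping the second coefficient therefore corresponds to the null hypothesis of no change in abundance between conditions A and B.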
To correct for multiple testing, we aim to control the spatial false discovery rate (FDR).
This refers to the FDR across areas of the high-dimensional space.
We do this using the spatialFDR function, given the p-values and positions of all tested hyperspheres.
qvals <- spatialFDR(intensities(cd), res$table$PValue)
Hyperspheres with significant differences in abundance are defined as those detected at a spatial FDR of, say, 5%.
is.sig <- qvals <= 0.05
summary(is.sig)
## Mode FALSE TRUE
## logical 151 69
This approach is a bit more sophisticated than simply applying the Benjamini-Hochberg (BH) method to the hypersphere p-values. Such a simple approach would fail to account for the different densities of hyperspheres in different parts of the high-dimensional space.
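One way to account for density is to weight each p-value in the BH procedure, so that hyperspheres in crowded regions contribute less. The sketch below is a generic weighted BH adjustment in base R; the weightedBH helper and the example weights are assumptions for illustration, and how spatialFDR derives its weights is not shown here.

```r
# Weighted Benjamini-Hochberg adjustment (weights are supplied by the caller).
weightedBH <- function(p, w) {
    o <- order(p)
    adjusted <- p[o] * sum(w) / cumsum(w[o])
    # Enforce monotonicity from the largest p-value downwards, and cap at 1.
    adjusted <- pmin(1, rev(cummin(rev(adjusted))))
    adjusted[order(o)] # return in the original order
}

p <- c(0.001, 0.01, 0.02, 0.04, 0.5)

# With equal weights this reduces to the ordinary BH adjustment.
equal <- weightedBH(p, rep(1, length(p)))

# Hypothetical weights: down-weight three hyperspheres in a dense region.
w <- c(0.2, 0.2, 0.2, 1, 1)
weighted <- weightedBH(p, w)
cbind(p, equal, weighted)
```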
To interpret the DA hyperspheres, we use dimensionality reduction to visualize them in a convenient two-dimensional representation. This is done here with PCA, though for more complex data sets, we suggest using something like Rtsne.
sig.coords <- intensities(cd)[is.sig,]
sig.res <- res$table[is.sig,]
coords <- prcomp(sig.coords)
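When using PCA for visualization, it is worth checking how much variance the first two components capture. The sketch below does this in base R on simulated positions, which are hypothetical stand-ins for the hypersphere intensities.

```r
set.seed(3)
# Hypothetical positions for 50 hyperspheres across 30 markers.
positions <- matrix(rnorm(50 * 30, mean=2, sd=0.3), ncol=30)
pcs <- prcomp(positions)

# Proportion of variance captured by each component.
var.explained <- pcs$sdev^2 / sum(pcs$sdev^2)
var.explained[1:2]
```

If the first two proportions are small, a two-dimensional PCA plot may hide much of the structure, which is one reason to consider a non-linear method for complex data sets.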
Each DA hypersphere is represented as a point on the plot below, coloured according to its log-fold change between conditions. We can see that we've recovered the two DA subpopulations that we put in at the start. One subpopulation increases in abundance (red) while the other decreases (blue) in the second condition relative to the first.
plotCellLogFC(coords$x[,1], coords$x[,2], sig.res$logFC)
This plot should be interpreted by examining the marker intensities, in order to determine what each area of the plot represents.
We suggest using the plotCellIntensity function to make a series of plots for all markers, as shown below.
Colours represent the median marker intensities of each hypersphere, mapped onto the viridis colour scale.
par(mfrow=c(6,5), mar=c(2.1, 1.1, 3.1, 1.1))
limits <- intensityRanges(cd, p=0.05)
all.markers <- rownames(markerData(cd))
for (i in order(all.markers)) {
    plotCellIntensity(coords$x[,1], coords$x[,2], sig.coords[,i],
        irange=limits[,i], main=all.markers[i])
}
We use the intensityRanges function to define the bounds of the colour scale.
This caps the minimum and maximum intensities at the 5th and 95th percentiles, respectively, to avoid colours being skewed by outliers.
Note that both plotCellLogFC and plotCellIntensity return a vector of colours, named with the corresponding numeric value of the log-fold change or intensity.
This can be used to construct a colour bar; see ?plotCellLogFC for more details.
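The effect of capping at percentiles can be sketched in base R for a single marker. The simulated intensities are hypothetical, and the explicit clamping step is an assumption about how out-of-range values are handled before colours are assigned.

```r
set.seed(4)
# Hypothetical intensities for one marker, with a few extreme outliers.
intensity <- c(rnorm(200, mean=2, sd=0.4), 8, 9, -3)

# Bounds of the colour scale at the 5th and 95th percentiles.
p <- 0.05
limits <- quantile(intensity, c(p, 1 - p), names=FALSE)

# Values outside the limits are clamped to the nearest bound.
capped <- pmin(pmax(intensity, limits[1]), limits[2])
range(capped)
```

Without the cap, the three outliers would stretch the colour scale so that most hyperspheres end up with nearly identical colours.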
An alternative approach to interpretation is to examine each hypersphere separately, and to determine the cell type corresponding to the hypersphere's intensities. First, we prune down the number of hyperspheres to be examined in this manner. This is done by identifying “non-redundant” hyperspheres, i.e., hyperspheres that do not overlap hyperspheres with lower p-values.
nonred <- findFirstSphere(intensities(cd), res$table$PValue)
summary(nonred)
## Mode FALSE TRUE
## logical 217 3
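The pruning idea can be sketched as a greedy selection in base R: hyperspheres are visited in order of increasing p-value, and one is kept only if it is far enough from every hypersphere already kept. The pickNonRedundant helper, the Euclidean distance and the threshold of 1 are assumptions for illustration; findFirstSphere may define overlap differently.

```r
# Greedy selection of non-redundant points, in order of increasing p-value.
pickNonRedundant <- function(coords, pvalues, threshold) {
    keep <- logical(nrow(coords))
    for (i in order(pvalues)) {
        kept <- which(keep)
        if (length(kept) == 0L) {
            keep[i] <- TRUE # the lowest p-value is always kept
            next
        }
        # Distances from point i to every point kept so far.
        diffs <- sweep(coords[kept, , drop=FALSE], 2, coords[i,])
        if (all(sqrt(rowSums(diffs^2)) > threshold)) {
            keep[i] <- TRUE
        }
    }
    keep
}

# Hypothetical positions: two tight clusters of hyperspheres in two dimensions.
coords <- rbind(c(0, 0), c(0.1, 0), c(0, 0.1), c(5, 5), c(5.1, 5))
pvalues <- c(0.02, 0.001, 0.03, 0.01, 0.2)
nonred.sketch <- pickNonRedundant(coords, pvalues, threshold=1)
nonred.sketch
```

Each cluster is reduced to the single hypersphere with the lowest p-value, which is the one that would be passed on for manual annotation.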
We pass these hyperspheres to the interpretSpheres function, which creates a Shiny app where the intensities are displayed.
The idea is to allow users to inspect each hypersphere, annotate it and then save the labels to R once annotation is complete.
See the documentation for more details.
all.coords <- prcomp(intensities(cd))
app <- interpretSpheres(cd, select=nonred, metrics=res$table, run=FALSE,
    red.coords=all.coords$x[,1:2], red.highlight=is.sig)
# Set run=TRUE if you want the app to run automatically.
Users wanting to identify specific subpopulations may consider using the selectorPlot function from scran.
This provides an interactive framework by which hyperspheres can be selected and saved to an R session for further examination.
The best markers that distinguish cells in one subpopulation from all others can also be identified using pickBestMarkers.
sessionInfo()
## R version 3.4.2 (2017-09-28)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 16.04.3 LTS
##
## Matrix products: default
## BLAS: /home/biocbuild/bbs-3.6-bioc/R/lib/libRblas.so
## LAPACK: /home/biocbuild/bbs-3.6-bioc/R/lib/libRlapack.so
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] edgeR_3.20.1 limma_3.34.0
## [3] ncdfFlow_2.24.0 BH_1.65.0-1
## [5] RcppArmadillo_0.8.100.1.0 flowCore_1.44.0
## [7] cydar_1.2.1 SummarizedExperiment_1.8.0
## [9] DelayedArray_0.4.1 matrixStats_0.52.2
## [11] Biobase_2.38.0 GenomicRanges_1.30.0
## [13] GenomeInfoDb_1.14.0 IRanges_2.12.0
## [15] S4Vectors_0.16.0 BiocGenerics_0.24.0
## [17] BiocParallel_1.12.0 BiocStyle_2.6.0
## [19] knitr_1.17
##
## loaded via a namespace (and not attached):
## [1] locfit_1.5-9.1 Rcpp_0.12.13
## [3] mvtnorm_1.0-6 lattice_0.20-35
## [5] corpcor_1.6.9 rprojroot_1.2
## [7] digest_0.6.12 mime_0.5
## [9] R6_2.2.2 plyr_1.8.4
## [11] backports_1.1.1 pcaPP_1.9-72
## [13] evaluate_0.10.1 highr_0.6
## [15] ggplot2_2.2.1 zlibbioc_1.24.0
## [17] rlang_0.1.4 lazyeval_0.2.1
## [19] hexbin_1.27.1 Matrix_1.2-11
## [21] rmarkdown_1.6 splines_3.4.2
## [23] statmod_1.4.30 stringr_1.2.0
## [25] RCurl_1.95-4.8 munsell_0.4.3
## [27] shiny_1.0.5 compiler_3.4.2
## [29] httpuv_1.3.5 IDPmisc_1.1.17
## [31] htmltools_0.3.6 tibble_1.3.4
## [33] gridExtra_2.3 GenomeInfoDbData_0.99.1
## [35] viridisLite_0.2.0 flowViz_1.42.0
## [37] rrcov_1.4-3 MASS_7.3-47
## [39] bitops_1.0-6 grid_3.4.2
## [41] xtable_1.8-2 gtable_0.2.0
## [43] magrittr_1.5 scales_0.5.0
## [45] graph_1.56.0 KernSmooth_2.23-15
## [47] stringi_1.1.5 XVector_0.18.0
## [49] viridis_0.4.0 latticeExtra_0.6-28
## [51] robustbase_0.92-8 RColorBrewer_1.1-2
## [53] tools_3.4.2 DEoptimR_1.0-8
## [55] yaml_2.1.14 colorspace_1.3-2
## [57] cluster_2.0.6