I. DATA NORMALIZATION AND CORRECTION

The functions and code of the RNAdecay package were written for the purpose of analyzing decreasing levels of RNA abundance as measured by RNA short read sequencing (RNA-seq) after inhibition of transcription. Tissue is treated to block transcription (e.g., using a chemical such as cordycepin or ActinomycinD) and collected at regular intervals (referred to as the RNA decay time course, e.g., T0, T30, T60 refer to 0, 30, and 60 min of blocked transcription). Total RNA is extracted from the tissue and measured by RNA-seq. Although we wrote this package to analyze RNA-seq data, it could also be adapted to analyze individual gene data from quantitative reverse transcription polymerase chain reaction (qRT-PCR) measures, however, normalization (as herein described) is best performed using a number of stable reference genes.

Raw (Illumina) RNA-seq data are typically libraries made of up 10M-250M short (50 nt) sequences. These sequences are aligned to the genome and counted based on the location of their alignment in annotated genomic ranges (e.g., genes, exons, etc). Read counts for each gene are normalized to the size of their respective RNA-seq library for comparison to libraries generated from other biological material, therefore, read counts are expressed as reads per million (RPM). We begin RNA decay analysis in this package with RPM RNA abundance data.

1. T0 NORMALIZATION

Prior to modeling RNA decay, we normalize the read counts to the initial (T0) RNA abundance, the transcript level just prior to blocking transcription. This is unique for each gene, and, in an experiment with multiple replicates (reps), we use the mean value for all T0 reps. All RPM values in the time course are divided by the mean T0 value. For example:

# you can see that in this fabricated data the RNAs have a half-life of ~30 min
RPM_geneX <- data.frame(T0 = c(150, 135, 148), T30 = c(72, 76, 69), T60 = c(35, 35, 30), 
                     row.names = paste0("rep", 1:3))
RPM_geneX

##       T0 T30 T60
## rep1 150  72  35
## rep2 135  76  35
## rep3 148  69  30

RPM_geneX_T0norm <- RPM_geneX/mean(RPM_geneX$T0)
RPM_geneX_T0norm

##             T0       T30       T60
## rep1 1.0392610 0.4988453 0.2424942
## rep2 0.9353349 0.5265589 0.2424942
## rep3 1.0254042 0.4780600 0.2078522

colMeans(RPM_geneX_T0norm)

##        T0       T30       T60 
## 1.0000000 0.5011547 0.2309469

The following workflow depends on beginning with a data frame of RPM values with column names consisting of unique, non-nested variable tags separated by underscores (‘_’), with rownames as unique gene identifiers. Here, we use with built-in example data (from Sorenson et al., 2018) which we will use throughout this workflow. It has 4 genotypes, 8 time points, and 4 biological replicates for 128 samples (columns) and 118 genes (rows).

library(RNAdecay)
RPMs <- RNAdecay::RPMs # built-in example data
RPMs[1:2, c(1:11, 128)] # NOTE: not showing all columns here

##           WT_00_r1  WT_07_r1  WT_15_r1  WT_30_r1  WT_60_r1 WT_120_r1
## AT1G02860 10.35838  9.772902  9.682167  9.298616  8.478808  7.007951
## AT1G09660 19.77509 17.684298 19.717329 15.455808 15.164023 13.912843
##           WT_240_r1 WT_480_r1 sov_00_r1 sov_07_r1 sov_15_r1 vs_480_r4
## AT1G02860  5.482903  6.523851  11.78139  11.85849  9.238458  25.50327
## AT1G09660 10.342748 11.261410  21.48637  19.33097 18.476917  17.56892

Biological replicate data columns are indexed using the cols() function.

cols(RPMs, "WT", "00") #gives the column indices that contain both tags in the column names

## [1]  1 33 65 97

colnames(RPMs)[cols(RPMs, "WT", "00")]

## [1] "WT_00_r1" "WT_00_r2" "WT_00_r3" "WT_00_r4"

# NOTE: this is based on grep pattern matching so tags MUST be unique and non-nested (i.e. one entire label should not be part of another).

The mean and standard error (SE) RPM values are calculated by looping over label combinations and binding the values together in a new data frame.

# create a directory for results output
wrdir <- "~/DecayAnalysis/Example analysis results 1"
if(!file.exists(wrdir)) dir.create(wrdir, recursive = TRUE)

# specify sample tags from colnames of RPMs
treatment <- c("WT", "sov", "vcs", "vs") # NOTE: I did not use 'vcs sov' for the name of the double mutant, or cols(RPMs, "vcs", "00") would have indexed all T0 samples of both "vcs" and "vcs" sov"; instead, I used 'vs'.
reps <- c("r1", "r2", "r3", "r4")
tdecay <- c("00", "07", "15", "30", "60", "120", "240", "480") #again NO nesting of one label in another hence "00" instead of "0".

Loop over the treatments and timepoints to subset the appropriate columns, and calculate mean and SE of replicates.

mean_RPM <- data.frame("geneID" = rownames(RPMs))
SE_RPM <- data.frame("geneID" = rownames(RPMs))
for(g in treatment){
  for(t in tdecay){
    mean_RPM <- cbind(mean_RPM, rowMeans(RPMs[, cols(RPMs, g, t)]))
    names(mean_RPM)[length(mean_RPM)] <- paste0(g, "_", t)
    SE_RPM <- cbind(SE_RPM, apply(X = RPMs[, cols(RPMs, g, t)], MARGIN = 1, FUN = stats::sd)/sqrt(length(reps)))
    names(SE_RPM)[length(SE_RPM)] <- paste0(g, "_", t)
  }}
mean_RPM[1:2, ]

##              geneID    WT_00    WT_07    WT_15    WT_30     WT_60
## AT1G02860 AT1G02860 10.30576 10.92457 12.07227 10.18583  7.951206
## AT1G09660 AT1G09660 17.67649 17.60419 17.01889 15.56747 15.591995
##              WT_120    WT_240    WT_480  sov_00  sov_07   sov_15   sov_30
## AT1G02860  5.978875  4.907999  5.427231 10.4751 11.5972 10.76603 10.01565
## AT1G09660 13.271012 10.672795 10.035496 19.6877 19.3090 19.11892 18.04514
##              sov_60   sov_120   sov_240   sov_480   vcs_00   vcs_07
## AT1G02860  7.483476  5.882106  4.932124  5.242097 15.21011 16.66707
## AT1G09660 16.465382 13.815930 10.919155 11.141820 30.52888 28.87662
##             vcs_15   vcs_30   vcs_60   vcs_120   vcs_240   vcs_480
## AT1G02860 16.86892 14.45997 12.41270  9.471179  7.028314  6.223005
## AT1G09660 29.98092 26.93621 26.31798 24.230218 23.312002 22.954986
##              vs_00    vs_07    vs_15    vs_30    vs_60   vs_120   vs_240
## AT1G02860 23.98874 25.13802 24.56582 25.85643 23.09965 21.47141 20.11297
## AT1G09660 24.06467 23.25254 24.05339 21.89650 20.80315 22.72681 20.65078
##             vs_480
## AT1G02860 19.47309
## AT1G09660 20.31687

SE_RPM[1:2, ]

##              geneID     WT_00     WT_07     WT_15     WT_30     WT_60
## AT1G02860 AT1G02860 0.4482383 0.5495215 0.8677714 0.6957607 0.1826975
## AT1G09660 AT1G09660 1.0154376 0.9293058 0.9753225 0.9843020 1.3145629
##              WT_120    WT_240    WT_480    sov_00    sov_07    sov_15
## AT1G02860 0.4938045 0.2494622 0.5508578 0.4671639 0.5698232 0.6208221
## AT1G09660 0.7515276 0.6203130 0.8833117 0.7574344 0.4105105 0.5440400
##              sov_30    sov_60   sov_120   sov_240   sov_480   vcs_00
## AT1G02860 0.4900057 0.2521764 0.4775238 0.6288319 0.2047149 1.065802
## AT1G09660 0.9838571 0.8045619 0.7516455 0.4322530 0.7700577 1.149763
##              vcs_07    vcs_15    vcs_30    vcs_60   vcs_120   vcs_240
## AT1G02860 0.5303556 0.5052474 0.8726631 0.5491621 0.7702577 0.2626263
## AT1G09660 1.9180603 0.9563549 0.5693349 0.9244970 1.2275583 1.2626453
##             vcs_480    vs_00    vs_07    vs_15    vs_30    vs_60    vs_120
## AT1G02860 0.4689800 1.547137 1.006693 1.481889 2.374982 2.469157 2.7944529
## AT1G09660 0.9924902 2.215020 2.072957 1.388435 1.515420 1.204576 0.7517306
##              vs_240   vs_480
## AT1G02860 1.5576869 2.059173
## AT1G09660 0.4056269 1.758722

# write output to file
write.table(x = mean_RPM, paste0(wrdir, "/RPM_mean.txt"), sep = "\t")
write.table(x = SE_RPM,   paste0(wrdir, "/RPM_SE.txt"), sep = "\t")

Optionally, an RPM-based filtering step could be applied at this point to remove lowly expressed genes, for example.

filt1 <- rep(TRUE, 118) # we will not filter in this example, so the filter value for each gene is set to TRUE.

Replicate mean T0 RPM values are then used to normalize gene level data.

mT0norm <- data.frame(row.names = rownames(RPMs)[filt1])
for(g in treatment){
  mean_T0reps <- rowMeans(RPMs[filt1, cols(RPMs, g, "00")])
  for(r in reps){
    df <- RPMs[filt1, colnames(RPMs)[cols(RPMs, g, r)]]
    df <- df[, 1:length(tdecay)]/mean_T0reps
    mT0norm <- cbind(mT0norm, df)
  }}
write.table(x = mT0norm, file = paste0(wrdir, "/T0 normalized.txt"),  sep = "\t")

The mean and standard error of the T0 normalized data are then calculated for replicate samples.

mean_mT0norm <- data.frame(row.names = rownames(mT0norm))
for(g in treatment){
  for(t in tdecay){
    mean_mT0norm <- cbind(mean_mT0norm, rowMeans(mT0norm[, cols(mT0norm, g, t)]))
    names(mean_mT0norm)[length(names(mean_mT0norm))] <- paste0(g, "_", t)
  }}

SE_mT0norm <- data.frame(row.names = rownames(mT0norm))
for(g in treatment){
  for(t in tdecay){
    SE_mT0norm <- cbind(SE_mT0norm, apply(X = mT0norm[, cols(mT0norm, g, t)], MARGIN = 1, FUN = function(x) stats::sd(x)/sqrt(length(reps))))
    names(SE_mT0norm)[length(names(SE_mT0norm))] <- paste0(g, "_", t)
  }}

# write output to file
write.table(x = mean_mT0norm, file = paste0(wrdir, "/T0 normalized_Mean.txt"),  sep = "\t")
write.table(x = SE_mT0norm, file = paste0(wrdir, "/T0 normalized_SE.txt"),  sep = "\t")

2. DECAY FACTOR CORRECTION

After inhibition of RNA synthesis, RNA degradation continues causing the total pool of RNA in cells to decrease. As this occurs, very stable RNAs become a much larger proportion of the total RNA pool even though their concentration in the cell remains nearly constant. RNA-seq RPM data is scaled to a normalized library size (i.e. to the total RNA pool), and, because of this, abundance of stable RNAs appears to increase. The apparent increase in abundance of stable genes is proportional to the reduction in the total pool of RNA, therefore, we can use it to apply a correction to the data. We call this decay factor correction. The decay factor is estimated based on the RPM increase (relative to the T0 samples) of a set of stable and abundant reference genes.

We selected 30 genes with high abundance on which to calculate the decay factors for each sample. These were then used to correct abundance in each of the respective samples. A unique decay factor is calculated for each treatment at each time point.

First, make a vector of ’geneID’s of abundant and stable reference genes present in the data set.

stablegenes <- c( "ATCG00490", "ATCG00680", "ATMG00280", "ATCG00580", "ATCG00140", "AT4G38970", "AT2G07671", "ATCG00570", "ATMG00730", "AT2G07727", "AT2G07687", "ATMG00160" ,"AT3G11630", "ATMG00060", "ATCG00600", "ATMG00220", "ATMG01170", "ATMG00410", "AT1G78900", "AT3G55440", "ATMG01320", "AT2G21170" ,"AT5G08670", "AT5G53300", "ATMG00070", "AT1G26630", "AT5G48300", "AT2G33040", "AT5G08690", "AT1G57720")

Mean T0 normalized values of stable genes are then pulled out of the mean_mT0norm data frame and used to calculate the decay factor for each set of replicate samples.

stabletable <- mean_mT0norm[stablegenes, ]
normFactors <- colMeans(stabletable)
write.table(x <- normFactors, paste0(wrdir, "/Normalziation Decay Factors.txt"), sep = "\t")
normFactors_mean <- matrix(normFactors, nrow = length(tdecay))
normFactors_SE <- matrix(apply(X = stabletable, MARGIN = 2, function(x) stats::sd(x)/sqrt(length(stablegenes))), nrow = length(tdecay))
t.decay <- c(0, 7.5, 15, 30, 60, 120, 240, 480)
rownames(normFactors_mean) <- t.decay
rownames(normFactors_SE) <- t.decay
colnames(normFactors_mean) <- treatment
colnames(normFactors_SE) <- treatment
list(normalizationFactors = normFactors_mean, SE = normFactors_SE)

## $normalizationFactors
##           WT      sov      vcs       vs
## 0   1.000000 1.000000 1.000000 1.000000
## 7.5 1.036541 1.018474 1.019401 1.006349
## 15  1.039264 1.049762 1.030781 1.016110
## 30  1.093192 1.089058 1.068392 1.057446
## 60  1.159184 1.150108 1.143107 1.127042
## 120 1.244396 1.254913 1.238043 1.211260
## 240 1.360045 1.361653 1.355177 1.331136
## 480 1.518916 1.524910 1.508710 1.490484
## 
## $SE
##               WT          sov          vcs           vs
## 0   3.764009e-18 9.219900e-18 8.416579e-18 5.323112e-18
## 7.5 7.187629e-03 6.156141e-03 6.354406e-03 3.671789e-03
## 15  8.350257e-03 8.745853e-03 4.268441e-03 7.639774e-03
## 30  1.018092e-02 9.213515e-03 8.169738e-03 9.683929e-03
## 60  8.408549e-03 9.406738e-03 7.099825e-03 1.257889e-02
## 120 1.081113e-02 1.397098e-02 1.073154e-02 1.346151e-02
## 240 1.827014e-02 1.776487e-02 1.474298e-02 1.564277e-02
## 480 3.641681e-02 4.239630e-02 3.448004e-02 3.649816e-02

Generate a matrix of the same dimensions as mT0norm with the appropriate decay factors in the same respective positions.

nF <- vector()
ind <- sapply(names(normFactors), function(x) grep(x, colnames(mT0norm)))
for(i in 1:ncol(ind)){
nF[ind[, i]] <- colnames(ind)[i]
}
normFactorsM <- t(matrix(rep(normFactors[nF], nrow(mT0norm)), ncol = nrow(mT0norm)))
rm(nF, ind)

Apply the decay factor corrections.

mT0norm_2 <- data.frame(mT0norm/normFactorsM, 
                     row.names = rownames(mT0norm))

write.table(mT0norm_2, paste0(wrdir, "/T0 normalized and decay factor corrected.txt"), sep = "\t")

Rearrange the RNA decay data into a long form data frame for modeling and visualization.

mT0norm_2.1 <- reshape2::melt(as.matrix(mT0norm_2), varnames = c("geneID", "variable"))
mT0norm_2.1 <- cbind(mT0norm_2.1, reshape2::colsplit(mT0norm_2.1$variable, "_", names = c("treatment", "t.decay", "rep")))

mT0norm_2.1 <- mT0norm_2.1[, colnames(mT0norm_2.1) !=  "variable"]
mT0norm_2.1 <- mT0norm_2.1[, c(1, 3, 4, 5, 2)]
colnames(mT0norm_2.1) <- c("geneID", "treatment", "t.decay", "rep", "value")
mT0norm_2.1$rep <- gsub("r", "rep", mT0norm_2.1$rep)
mT0norm_2.1$t.decay <- as.numeric(gsub("7", "7.5", as.numeric(mT0norm_2.1$t.decay)))
mT0norm_2.1$treatment <- factor(mT0norm_2.1$treatment,levels = c("WT","sov","vcs","vs"))
mT0norm_2.1$rep <- factor(mT0norm_2.1$rep, levels = paste0("rep",1:4))
write.table(x = mT0norm_2.1,  file = paste0(wrdir, 
"/ExampleDecayData+stableGenes.txt"), sep = "\t")

Following the steps in this vignette, the resulting normalized data frame should be identical to the one supplied in this package called decay_data. To check:

table(RNAdecay::decay_data[,1:4] == mT0norm_2.1[,1:4]) # should be all TRUE no FALSE

## 
##  TRUE 
## 60416

The normalized data is now ready to be used for modeling decay rates.

II. MODELING DECAY RATES

RNA degradation is typically modeled with a constant exponential decay model. This model assumes a constant decay rate throughout the decay time course. However, a number of issues might violate this assumption. For example, decay might depend on continuous transcription of rapidly turned over components of decay machinery; inhibition of this transcription would cause slowed decay due to lost supply of decay components. Alternatively, decay rate might be regulated (e.g., diurnally) leading to a slow change in decay rate over a long decay time course. Feedback due to slowed transcription could also lead to slowed RNA decay. We, therefore, apply both a constant exponential decay model and a time-dependent exponential decay model. However, besides these possibilies, other distinct mechanisms could lead to an apparent slowing of RNA decay. For example, because RNA-seq reads are pooled from multiple cell types and mapped to genes, distinct gene mRNA subpopulations might have different decay rates (i.e., different splice isoforms, or the same mRNA in different cell types of a multicellular organism) that are are averaged when these are pooled together. We believe that for these the time-dependent exponential decay model will also better capture this average than a constant exponential decay model.

The dynamic nature of RNA decay can be a point of gene regulation. To identify treatments that might affect decay rate, we model all possibilities of treatment effect on both [initital] decay rates ($\alpha$) and decay of decay rates ($\beta$) using maximum likelihood modeling. This is done by running the modeling with constraints on treatment decay rates (i.e., constraining decay rates of treatments to be equal or allowing them to combinatorially vary in independent groups). This can be done with up to four treatments in this package. Modeling is performed one gene at a time with each set of constraints, and then each set of parameter estimates (from each set of contraints) are compared using the AICc statistic. In this package, we refer to these as equivalence constraint groupings.

For the detailed mathematical framework of the decaying decay model see Sorenson et. al. (2018). Below we describe the steps of the algorithmic implementation and walk though an example data set.

3. LOAD NORMALIZED AND CORRECTED DATA AND CHECK THE FORMAT

Load normalized and corrected data and check the format: the following script requires a data frame with five columns with the following column names exactly: “geneID”, “treatment”, “t.decay”, “rep”, “value”:

* geneID    = gene identifiers (`factor`)
* treatment = experimental treatment or genotype (`factor`)
* t.decay   = time after transcriptional inhibition (`numeric`)
* rep       = biological replicate number (`factor`)
* value     = RNA abundance normalized to library size, decay factor and mean T0 values (`numeric`)

library(RNAdecay)
decay_data <- RNAdecay::decay_data # built-in example data

# make new directory for results
wrdir <- "~/DecayAnalysis/Example analysis results 2"
if(!file.exists(wrdir)) dir.create(wrdir, recursive = TRUE)

Rename factor levels and specify level order of decay_data$treatment for sorting and making figures. The example data set is already ordered, but when data is read from a text file (e.g., .csv) factor levels are ordered automatically and should be reordered according to user preference now to avoid later headaches.

levels(decay_data$treatment) # This can not be longer than 4 elements for modeling.

## [1] "WT"  "sov" "vcs" "vs"

# reorder factor levels if necessary (skipped here):
# decay_data$treatment <- factor(decay_data$treatment, levels = levels(decay_data$treatment)[c(4, 1, 2, 3)]) # numbers here refer to the current order position in the factor levels() - the order of the numbers designates the new position

levels(decay_data$treatment)[4] <- "vcs.sov"
levels(decay_data$treatment)

## [1] "WT"      "sov"     "vcs"     "vcs.sov"

Sort the decay_data data frame as follows (NOTE: REQUIRED step for proper indexing below).

decay_data <- decay_data[order(decay_data$t.decay), ]
decay_data <- decay_data[order(decay_data$rep), ]
decay_data <- decay_data[order(as.numeric(decay_data$treatment)), ] 
decay_data <- decay_data[order(decay_data$geneID), ]
ids <- as.character(unique(decay_data$geneID)) # 118 in example set
decay_data[1:10, ]

##         geneID treatment t.decay  rep     value
## 1    AT1G02860        WT     0.0 rep1 1.0051056
## 119  AT1G02860        WT     7.5 rep1 0.9148649
## 237  AT1G02860        WT    15.0 rep1 0.9039961
## 355  AT1G02860        WT    30.0 rep1 0.8253565
## 473  AT1G02860        WT    60.0 rep1 0.7097449
## 591  AT1G02860        WT   120.0 rep1 0.5464521
## 709  AT1G02860        WT   240.0 rep1 0.3911804
## 827  AT1G02860        WT   480.0 rep1 0.4167638
## 945  AT1G02860        WT     0.0 rep2 1.1086173
## 1063 AT1G02860        WT     7.5 rep2 1.1531651

4. GENERATE MATRICES OF $\alpha$ AND $\beta$ EQUIVALENCE CONSTRAINT GROUPS

Create objects and set varables used in the modeling based on decay_data.

# (no changes needed here - just execute the code)
nEquivGrp <- if (length(unique(decay_data$treatment)) == 2) {2} else
  if (length(unique(decay_data$treatment)) == 3) {5} else 
    if (length(unique(decay_data$treatment)) == 4) {15}
genoSet <- 1:(length(unique(decay_data$rep)) * length(unique(decay_data$t.decay)))
nTreat <- length(unique(decay_data$treatment))
nSet <- length(genoSet)*nTreat
groups <- groupings(decay_data)
mods <- data.frame(
  "a" = as.vector(sapply(1:nEquivGrp, function(x) {rep(x, nEquivGrp + 1)})),
  "b" = rep(1:(nEquivGrp + 1), nEquivGrp),
  row.names = paste0("mod", 1:(nEquivGrp*(nEquivGrp+1)))
)

Create a colormap of model groups.

group_map(decaydata = decay_data, path = paste0(wrdir, "/Model grouping colormap.pdf"), nEquivGrp = nEquivGrp, groups = groups, mods = mods)

## png 
##   2

This file (“Model grouping colormap.pdf”) should now be in your working directory. It is a color map reference for understanding model constraints. For example, model 1 has all different colored boxes representing unconstrained $\alpha$s and unconstrained betas, whereas, the second to last model has only two box colors - one for all $\alpha$s indicating that they all have constrained equivalence and the other indicating that all $\beta$s also have constrained equivalence. Constant decay models (i.e., $\beta$s = 0) are indicated with gray color in the $\beta$ columns.

5. MODEL OPTIMIZATION

Determine the bounds for $\alpha$ and $\beta$ parameters. The bounds on $\alpha$ and $\beta$ are related to distinguishing constant decay, and decaying decay as described in the modeling supplement of Sorenson et al., 2018.

Care should be taken in determining these bounds given each experimental design. We have provided functions to aid in selection of the lower bounds of $\alpha$ and $\beta$ (a_low and b_low, respectively), and upper bound for $\alpha$ (a_high). These functions calculate limits based on the time points data were collected. For example, if abundance for an unstable mRNA is measured at the first time point to be 0.02% of the initial (T0) abundance, the information needed to estimate this decay rate was not collected. So, a_high caluclates the maximum estimatable decay rate based on the time at which the first decay data point was collected. Similarly, the lower bound for $\beta$ is required to distinguish the constant decay model (const_decay) from the decaying decay model (decaying_decay). The upper bound for $\beta$ is required to distinguish the no decay model (const_decay(0,t)) from decaying decay model. If $\beta$ is too small 1-exp(-$\beta$*t) ~ -$\beta$*t and c(t) ~ exp(-$\alpha$*t); therefore, we can’t distinguish between constant decay and decaying decay models. If $\beta$ is too big 1-exp(-$\beta$*t) ~ 1 and c(t) ~ exp(-$\alpha$/$\beta$), a constant, we can’t distinguish between no decay and decaying decay models. Refer to Sorenson et al. (2018) supplemental material for more detail about how we calculated the bounds for $\alpha$ and $\beta$.

a_bounds <- c(a_low(max(decay_data$t.decay)),
             a_high(min(unique(decay_data$t.decay)[unique(decay_data$t.decay)>0])))
b_bounds <- c(b_low(max(decay_data$t.decay)), 0.075)
a_bounds;b_bounds

## [1] 0.0001 0.7100

## [1] 0.001 0.075

Modeling is accomplished using the mod_optimization function. The function takes as arguments, the gene identifier, $\alpha$ and $\beta$ bounds, the decay_data data frame, a vector of model names to run optimization for (e.g., as c("mod1", "mod239", ... ), the equivalence constraint matrix (defined as groups above), and a matrix to specify which contstraint groupings to use for $\alpha$ and $\beta$ parameters (defined as mods above), respectively, for each model, and a results folder (which will be made if it doesn’t already exist) to which the results are written. A number of other arguments are available as well, see help file for details using ?mod_optimization.

Efficient model parameter optimization is accomplished using objective functions coded as binary dynamically linked shared libraries. These libraries are compiled from functions coded in the C++ language using functionality of the TMB package. The compiled files are dynamically linked library (*.dll, for Windows) or shared object (*.so, linux and Mac) files for each of the objective functions. If RNAdecay is installed from source, compiling C++ code will require a compiler be installed separatedly on your system (e.g., Rtools34 or later for Windows, https://cran.r-project.org/bin/windows/Rtools/; Xcode comand line tools for Mac, http://railsapps.github.io/xcode-command-line-tools.html; R development package for Linux, r-base-dev). If you don’t already have one of these installed, install one and restart R. C++ source files are located in the installed ‘RNAdecay/src’ folder. The compiled (shared library) files should also be written to this same folder upon installation.

Test the modeling on a couple of genes on a handful of models.

mod_optimization("AT2G18150", decay_data, group = groups, mod = mods, a_bounds, b_bounds, c("mod1", "mod2", "mod16", "mod239", "mod240"), file_only = FALSE)

## AT2G18150 done

##           geneID    mod     alpha_WT    alpha_sov    alpha_vcs
## mod1   AT2G18150   mod1 0.0001000000 0.0005096304 0.0008160446
## mod2   AT2G18150   mod2 0.0001000009 0.0001000009 0.0008160504
## mod16  AT2G18150  mod16 0.0001000001 0.0001000001 0.0006962210
## mod239 AT2G18150 mod239 0.0009407433 0.0009407433 0.0009407433
## mod240 AT2G18150 mod240 0.0001403765 0.0001403765 0.0001403765
##        alpha_vcs.sov    beta_WT    beta_sov    beta_vcs beta_vcs.sov
## mod1    0.0019003661 0.07500000 0.074999989 0.001000000  0.001000000
## mod2    0.0019003680 0.07499995 0.001000004 0.001000004  0.001000004
## mod16   0.0016237903 0.00000000 0.000000000 0.000000000  0.000000000
## mod239  0.0009407433 0.01558834 0.015588337 0.015588337  0.015588337
## mod240  0.0001403765 0.00000000 0.000000000 0.000000000  0.000000000
##           sigma2    logLik nPar nStarts  J     range.LL nUnique.LL
## mod1   0.1155597 -43.51417    9      50 50 1.230693e-07          1
## mod2   0.1162461 -43.89319    7      50 48 4.604865e-04          3
## mod16  0.1178604 -44.77586    5      50 41 1.079744e-02         10
## mod239 0.1314419 -51.75597    3      50 50 0.000000e+00          1
## mod240 0.1321416 -52.09573    2      50 50 3.765876e-13          1
##             C.alpha       C.beta        C.tot     AICc AICc_est
## mod1   8.629355e-06 1.035762e-06 9.665117e-06 106.5538 106.5538
## mod2   7.257466e-05 4.838938e-05 1.209640e-04 102.7197 102.7197
## mod16  3.727459e-04 0.000000e+00 3.727459e-04 100.0435 100.0435
## mod239 2.812325e-07 3.074500e-07 5.886825e-07 109.7055 109.7055
## mod240 1.214854e-07 0.000000e+00 1.214854e-07 108.2875 108.2875

mod_optimization("AT4G09680", decay_data, group = groups, mod = mods, a_bounds, b_bounds, c("mod1", "mod2", "mod16", "mod239", "mod240"), file_only = FALSE)

## AT4G09680 done

##           geneID    mod    alpha_WT   alpha_sov   alpha_vcs alpha_vcs.sov
## mod1   AT4G09680   mod1 0.006151427 0.008553631 0.008342841   0.004383190
## mod2   AT4G09680   mod2 0.006151427 0.007880245 0.009021612   0.004316542
## mod16  AT4G09680  mod16 0.004257736 0.005375819 0.006498702   0.002395319
## mod239 AT4G09680 mod239 0.006823760 0.006823760 0.006823760   0.006823760
## mod240 AT4G09680 mod240 0.004351118 0.004351118 0.004351118   0.004351118
##            beta_WT    beta_sov    beta_vcs beta_vcs.sov      sigma2
## mod1   0.003531260 0.005527378 0.003470373  0.004653506 0.003781782
## mod2   0.003531260 0.004501911 0.004501911  0.004501911 0.003862530
## mod16  0.000000000 0.000000000 0.000000000  0.000000000 0.007170759
## mod239 0.004544744 0.004544744 0.004544744  0.004544744 0.008177741
## mod240 0.000000000 0.000000000 0.000000000  0.000000000 0.012303559
##           logLik nPar nStarts  J     range.LL nUnique.LL      C.alpha
## mod1   175.33971    9      50 50 5.400125e-13          1 5.187145e-08
## mod2   173.98757    7      50 50 9.094947e-13          1 6.526222e-08
## mod16  134.39147    5      50 50 1.108447e-12          1 7.629824e-08
## mod239 125.98159    3      50 50 0.000000e+00          1 3.113375e-08
## mod240  99.83933    2      50 50 0.000000e+00          1 3.098428e-09
##              C.beta        C.tot      AICc  AICc_est
## mod1   1.271417e-07 1.790132e-07 -331.1540 -331.1540
## mod2   1.104053e-07 1.756675e-07 -333.0418 -333.0418
## mod16  0.000000e+00 7.629824e-08 -258.2911 -258.2911
## mod239 5.502242e-08 8.615616e-08 -245.7696 -245.7696
## mod240 0.000000e+00 3.098428e-09 -195.5827 -195.5827

Next, run all the models on each and every gene. This requires significant computational resources depending on data set size, number of models, and computing speed. The sample data set has 8 time points x 4 replicates each x 4 treatments (= 128 samples) for 118 genes and requires about 10 h processor time. 20000 genes at ~5 min each is ~70 d processor time, so be sure to test a few different models on a few handfuls of genes to feel comfortable running all the modeling. We recommend parallel processing (on multiple processor cores) using parallel::mclapply to cut overall time. If this function is unavailable on your machine (e.g., on Windows) you can use lapply, but it will take much longer. mod_optimization() writes results to file in tab-delimited form, and, optionally, can also return them as a data frame object.

####### To run all genes in parallel use:
# parallel::mclapply(ids, FUN = mod_optimization, 
#                    data = decay_data, group = groups, mod = mods, 
#                    alpha_bounds = a_bounds, beta_bounds = b_bounds,
#                    models = paste0("mod", 1:240), 
#                    path = paste0(wrdir, "/modeling_results"),
#   mc.cores = getOption("mc.cores",  15L), # set the number of compute cores to use here (e.g., 9L = 9 cores, 11L = 11 cores)
#   mc.preschedule = TRUE,
#   mc.set.seed = TRUE,
#   mc.silent = FALSE,
#   mc.cleanup = TRUE,
#   mc.allow.recursive = TRUE)

The entire example data set takes ~ 10 h processor time with lapply so for this example we will randomly select one gene ID to run here.

test_ids <- sample(ids, 1) # NOTE: that everytime this line is run it generates a different random sampling, therefore the gene modeled below will be different each time this code is run. To test the exact gene shown in the vignette make a new character vector of the gene id reported below and pass it to the gene argument of mod_optimization using lapply instead of passing 'test_ids' as we do here.  
test_ids

## [1] "AT5G07580"

a <- proc.time()[3]
models <- lapply(X = test_ids, # alternatively use `ids` here to complete the entire sample data set, but be prepared to wait ~ 10 h. These gene IDs will get passed one at time to the "gene" argument of mod_optimization() and return a list of the results data frame.
                FUN = mod_optimization,
                data = decay_data, 
                group = groups, 
                mod = mods,
                alpha_bounds = a_bounds, 
                beta_bounds = b_bounds,
                models = rownames(mods), 
                path = paste0(wrdir, "/modeling_results"),
                file_only = FALSE)

## AT5G07580 done

names(models) <- test_ids
b <- proc.time()[3]
(b-a)/60/length(test_ids) # gives you average min per gene

##  elapsed 
## 2.439117

For each gene, read in the results data frame written by mod_optimization as an element of a single list object.

models <- lapply( paste0( wrdir, "/modeling_results/", test_ids, "_results.txt"), read.delim, header = TRUE )   
names(models) <- test_ids

6. MODEL SELECTION

The AICc statistic is used to compare model performance. Better models have lower AICc values. Select a single model for each gene based on the lowest AICc statistic. We will continue here with the RNAdecay::models data included in the package which uses all 118 genes from the sample data set RNAdecay::decay_data.

models <- RNAdecay::models # built-in example data, comment this line out to continue with your own modelling output
results <- t(sapply(models, function(x) x[x[, "AICc"] == min(x[, "AICc"]), ]))
results <- as.data.frame(results)
results[, 1:2] <- sapply(as.data.frame(results[, 1:2]), function(x) as.character(unlist(x)))
results[, -c(1,2)] <- sapply(results[, -c(1,2)], unlist)
write.table(results, file = paste0(wrdir,"/best model results.txt"), sep = "\t")
results <- read.delim(paste0(wrdir,"/best model results.txt"))

Generate some graphics to visualize results.

library(ggplot2)
pdf(paste0(wrdir,"/distributions of stats.pdf"))
p <- ggplot(results)
print(p+geom_histogram(aes(x = sigma2), bins = 300)+
        coord_cartesian(xlim = c(0,0.5))+
        geom_vline(xintercept = 0.0625,color = "red2")+
        plain_theme())
print(p+stat_bin(aes(x = sigma2), breaks = c(seq(0,0.25,0.25/50),1), geom = "bar")+coord_cartesian(xlim = c(0,0.25)))
print(p+stat_ecdf(aes(sigma2), geom = "line")+
        coord_cartesian(xlim = c(0,0.5)))
print(p+stat_ecdf(aes(sqrt(sigma2)), geom = "line")+
        coord_cartesian(xlim = c(0,sqrt(0.5))))
print(p+geom_histogram(aes(x = range.LL), bins = 60))
print(p+geom_histogram(aes(x = nUnique.LL), bins = 60))
dev.off()

## png 
##   2

pdf(paste0(wrdir,"/lowest AICc model counts.pdf"), height = 8, width = 32)
p <- ggplot(data = data.frame(
  model = as.integer(gsub("mod","",names(table(results$mod)))),
  counts = as.integer(table(results$mod))))+
  geom_bar(aes(x = model, y = counts), stat = "identity")+
  scale_x_continuous(limits = c(0,nrow(mods)), breaks = seq(0,nrow(mods),5))+
  ggtitle("model distribution of absolute lowest AICs")
print(p+plain_theme(25))

## Warning: Removed 1 rows containing missing values (geom_bar).

dev.off()

## png 
##   2

Because two models are considered to perform differently if their AICc value difference is >2, any models that have AICc values within two of the model with the lowest AICc are considered to have performed similarly. The number of models performing similarly to the best model are plotted as a histogram here. The number of different alpha groupings represented in these models is also plotted.

min_mods <- sapply(models, function(x) which (x[, "AICc"] < (2+min(x[, "AICc"])))) 
min_alpha_mods <- lapply(min_mods, function(x) unique(mods[x, "a"]))

pdf(paste0(wrdir,"/number of models that performed similar to the one selected.pdf"))
barplot(height = table(sapply(min_mods, length)), xlab = "No. models in the lowest AICc group (not more than 2 different from lowest)",
        ylab = "No. genes")
barplot(height = table(sapply(min_alpha_mods, length)), xlab = "No. alpha groups in the lowest AICc group (not more than 2 different from lowest)",
        ylab = "No. genes")
dev.off()

## png 
##   2

Build the modeling results data frame. Group genes with similar $\alpha$ groupings and then group genes with similar $\beta$ groupings.

results <- read.delim(paste0(wrdir,"/best model results.txt"))
results$alpha_grp <- mods[as.character(results$mod), "a"]
results$beta_grp <- mods[as.character(results$mod), "b"]
results$mod <- as.numeric(gsub("mod", "", as.character(results$mod)))

results$alphaPattern <- sapply(rownames(results), function(x) {
  paste0(gsub("alpha_", "", colnames(results)[3:(2+nTreat)][order(round(results[x, 3:(2+nTreat)], 4))]), collapse = "<=")
  })
results$alphaPattern <- paste0(results$alpha_grp, "_", results$alphaPattern)
results$betaPattern <- sapply(rownames(results), function(x){
  paste0(gsub("beta_", "", colnames(results)[(3+nTreat):(2+2*nTreat)][order(round(results[x, (3+nTreat):(2+2*nTreat)], 4))]), collapse = "<=")
  })
results$betaPattern <- paste0(results$beta_grp, "_", results$betaPattern)

results <- results[order(rownames(results)), ]
results <- results[order(results$beta_grp), ]
results <- results[order(results$alphaPattern), ]
results <- results[order(results$alpha_grp), ]

results$alphaPattern <- factor(results$alphaPattern, levels = as.character(unique(results$alphaPattern)))

results <- data.frame(results[, 3:(2*nTreat+3), 2], results[, c("AICc", "alpha_grp", "beta_grp", "alphaPattern", "betaPattern")])
results$nEqMods <- sapply(min_mods[rownames(results)], length)
results$nEqAgp <- sapply(min_alpha_mods[rownames(results)], length)

# Customize: add columns of relative alphas and betas as desired, e.g.:
results$rA_sov.WT    <- results$alpha_sov      / results$alpha_WT
results$rA_vcs.WT    <- results$alpha_vcs      / results$alpha_WT
results$rA_vcssov.WT <- results$alpha_vcs.sov  / results$alpha_WT

write.table(results, paste0(wrdir,"/alphas+betas+mods+grps+patterns+relABs.txt"), sep = "\t")

results <- read.delim(paste0(wrdir,"/alphas+betas+mods+grps+patterns+relABs.txt"), header = TRUE, colClasses =  c(NA, rep("numeric", 2+2*nTreat), rep("integer", 2), rep("character", 2), rep("integer", 2), rep("numeric", 3)))
# results$alpha_subgroup <- factor(results$alpha_subgroup, levels = unique(results$alpha_subgroup))
results$alphaPattern <- factor(results$alphaPattern, levels = unique(results$alphaPattern))
results$betaPattern <- factor(results$betaPattern, levels = unique(results$betaPattern))
results[1:3, ]

##             alpha_WT  alpha_sov  alpha_vcs alpha_vcs.sov    beta_WT
## AT1G69760 0.05813074 0.06732814 0.01556491   0.006992057 0.02257430
## AT2G01430 0.08379579 0.05122255 0.01793076   0.014767874 0.07500000
## AT5G11090 0.07968370 0.06943801 0.01926957   0.030393364 0.07125452
##             beta_sov    beta_vcs beta_vcs.sov      sigma2      AICc
## AT1G69760 0.02257430 0.005966764  0.002746884 0.009627945 -214.7804
## AT2G01430 0.07500000 0.014944595  0.014944595 0.017846952 -137.8476
## AT5G11090 0.07125452 0.011931300  0.027141770 0.007789566 -241.9016
##           alpha_grp beta_grp            alphaPattern
## AT1G69760         1       11 1_vcs.sov<=vcs<=WT<=sov
## AT2G01430         1       12 1_vcs.sov<=vcs<=sov<=WT
## AT5G11090         1       11 1_vcs<=vcs.sov<=sov<=WT
##                        betaPattern nEqMods nEqAgp rA_sov.WT rA_vcs.WT
## AT1G69760 11_vcs.sov<=vcs<=WT<=sov       8      2 1.1582193 0.2677570
## AT2G01430 12_vcs<=vcs.sov<=WT<=sov       4      2 0.6112783 0.2139816
## AT5G11090 11_vcs<=vcs.sov<=WT<=sov       3      2 0.8714205 0.2418258
##           rA_vcssov.WT
## AT1G69760    0.1202816
## AT2G01430    0.1762365
## AT5G11090    0.3814251

All files were written to the wrdir. The modeling results are now ready to be visualized.

RNAdecay Vignette: Normalization, Modeling, and Visualization of RNA Decay Data

Reed Sorenson | [email protected] | Department of Biology | University of Utah

2019-01-04

I. DATA NORMALIZATION AND CORRECTION

1. T0 NORMALIZATION

2. DECAY FACTOR CORRECTION

II. MODELING DECAY RATES

3. LOAD NORMALIZED AND CORRECTED DATA AND CHECK THE FORMAT

4. GENERATE MATRICES OF \(\alpha\) AND \(\beta\) EQUIVALENCE CONSTRAINT GROUPS

5. MODEL OPTIMIZATION

6. MODEL SELECTION

III. DATA VISULIZATION

7. LOAD DECAY DATA AND MODELING RESULTS (AND, OPTIONALLY, GENE DESCRIPTIONS)

8. PRINT DECAY PLOTS TO PDF