Author: Charles Plessy <[email protected]>
Date: 1 May 2013
clonotypeR is a R package and accompanying scripts to identify and analyse clonotypes from high-throughput T cell receptors sequence libraries. clonotypeR is suited to process and organise very large number of clonotypes, in the order of millions, typically produced by Roche 454 instruments, and to prepare these sequences for differential expression analysis with the typical transcriptomics tools as well as for statistical analysis using existing R packages.
The home page of clonotypeR is http://clonotyper.branchable.com/.
Typically, the user receives the output of a next-generation sequencer and runs some shell commands that are not part of the clonotypeR R package, but that are distributed with it on http://clonotyper.branchable.com/.
Currently, clonotypeR provides only support for mouse sequences, by providing
pre-formatted sequences from V
and J
segments. Support of human sequences
is in preparation. See http://clonotyper.branchable.com/references/README
for more information.
The workflow presented here summarises the different commands to run. Other examples are available on line at http://clonotyper.branchable.com/doc/workflow/.
This example analysis assumes a unix system (Linux, Mac OS, …)
ClonotypeR come with test data in the R
package. While it is not deep enough
to discuss about biology, you can use it to familiarise yourself with the
commands or test them.
The R package is loaded as usual.
library(clonotypeR)
The data is a table of 120 clonotypes in the extdata
folder of the package.
The command read_clonotypes
will parse it in a data frame. The clonotypes
are arbitrarily assigned to three libraries called A
, B
, and C
. The
read_clonotypes
comments determines at load time if the peptidic sequence has
a stop codon or is frame-shifted, and records the information in the
unproductive
column.
clonotypes <- read_clonotypes(system.file('extdata', 'clonotypes.txt.gz', package = "clonotypeR"))
summary(clonotypes)
## lib V J score mapq
## A:40 TRAV14-2 : 1 TRAJ31 :35 Min. : 45.0 Min. : 0.00
## B:40 TRAV14-3 :23 TRAJ49 :10 1st Qu.:215.0 1st Qu.: 31.00
## C:40 TRAV14D-3/DV8: 1 TRAJ57 :10 Median :226.0 Median : 37.50
## TRAV14N-1 :83 TRAJ34 : 9 Mean :221.6 Mean : 36.62
## TRAV14N-2 :11 TRAJ27 : 8 3rd Qu.:233.0 3rd Qu.: 40.00
## TRAV14N-3 : 1 TRAJ23 : 7 Max. :247.0 Max. :218.00
## (Other):41
## read dna qual
## Length:120 Length:120 Length:120
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## pep unproductive ambiguous
## Length:120 Mode :logical Mode :logical
## Class :character FALSE:102 FALSE:120
## Mode :character TRUE :18 NA's :0
## NA's :0
##
##
##
The clonotype_table
command counts how many times a given clonotype is found
in each library. It can also count simpler features, in particular V
and J
segments, or any combination of them.
head(clonotype_table(levels(clonotypes$lib), data=clonotypes))
## A B C
## TRAV14-2 AASAHYGSSGNKLI TRAJ32 1 0 0
## TRAV14-3 AAADYSNNRLT TRAJ7 1 0 0
## TRAV14-3 AAISNYNQGKLI TRAJ23 1 0 0
## TRAV14-3 AAMNQGGSAKLI TRAJ57 1 1 0
## TRAV14-3 AASAVDYANKMI TRAJ47 1 0 0
## TRAV14-3 AATVDSNYQLI TRAJ33 1 0 0
head(clonotype_table(levels(clonotypes$lib), "V", data=clonotypes))
## A B C
## TRAV14-2 1 0 0
## TRAV14-3 6 7 5
## TRAV14N-1 23 24 24
## TRAV14N-2 2 3 4
## TRAV14N-3 1 0 0
## TRAV14D-3/DV8 0 0 1
head(clonotype_table(levels(clonotypes$lib), "J", data=clonotypes))
## A B C
## TRAJ13 1 1 0
## TRAJ15 1 0 1
## TRAJ23 5 0 1
## TRAJ26 2 2 0
## TRAJ27 2 2 3
## TRAJ31 9 12 12
ClonotypeR provides other functions for further analysis. yassai_identifier
calculates a unique identifier using the V
, J
, peptidic and nucleotidic
information, following the work of Yassai et al.
# Unique identifier
head(yassai_identifier(clonotypes))
## [1] "aAYt.22A14N2A49L10" "aYt.2A14N2A49L10"
## [3] "aGRt.34A14N1A40L11" "aADTIt.1221A14N1A27L11"
## [5] "AEMn.411A14N1A23L11" "sAHy.11A14-2A32L14"
unique_clonotypes
and common_clonotypes
are typically used when comparing libraries.
clonotypes <- clonotype_table(levels(clonotypes$lib), data=clonotypes)
# First six clonotypes of library C
head(unique_clonotypes("C", data=clonotypes))
## [1] "TRAV14N-1 AASDTNTGKLT TRAJ27" "TRAV14N-1 AASNSNNRIF TRAJ31"
## [3] "TRAV14N-2 AAYTGYQNFY TRAJ49" "TRAV14-3 AASYGSSGNKLI TRAJ32"
## [5] "TRAV14N-1 AALNSNNRIF TRAJ31" "TRAV14N-1 AAPSNTNKVV TRAJ34"
# Count clonotypes found in library A, and B or C.
length(common_clonotypes(group1="A", group2=c("B","C"), data=clonotypes))
## [1] 5
# Matrix of numbers of common clonotypes
common_clonotypes(data=clonotypes)
## A B C
## A 26 2 3
## B 2 27 4
## C 3 4 27
With deeper data, a typical follow-up would be to identify differentially
represented clonotypes between libraries, for instance with the edgeR
package,
or to calculate distance between libraries, for instance with the
vegan
package.
The data provided on-line at http://clonotyper.branchable.com/example_data/ is a sub-sample of three sequence libraries of mouse T cell receptors α (2,000 reads each) made on the 454 Titanium or the 454 junior platforms. The original libraries will be deposited in public databanks after publication in a peer-reviewed journal.
These example libraries are called A
, B
and C
, and are in FASTQ format,
with entries like the following (the sequence was truncated for the convenience
of the display).
@HKTLYLP01B0MTM
gactGTCCATCTTCCTTTTATCGGACACTGAAGTATGGATATCAGAAGTGCAgggccttcccacgggaacg
+
IIIIIIIIIIIHHFF::::G>IIIGGGIIIIIIIIIGGIIIIIIFEBDCDC<//-5522------
Run the command clonotypeR detect A.fastq
in the same directory as a copy of
the file A.fastq
.
The result is stored in a temporary directory called extraction_files
, that
will be created if it does not already exist.
clonotypeR detect
compares the sequences to the reference V segments using
BWA, and produces output like the following.
[bsw2_aln] read 2000 sequences/pairs (843395 bp)...
[samopen] SAM header is present: 167 sequences.
[main] Version: 0.6.2-r126
[main] CMD: bwa bwasw -t8 /usr/share/clonotypeR/references/V/index A.fastq
[main] Real time: 1.099 sec; CPU: 8.225 sec
This indicates that 2,000 reads have been processed, representing 843,395 base
pairs in total. There were 167 reference V segments, and the version number of
BWA was 0.6.2-r126
. The whole process took less than 10 seconds.
Process the example libraries B
and C
similarly with the commands
clonotypeR detect B.fastq
and clonotypeR detect C.fastq
.
Run the command clonotypeR extract A
in the same directory as where you ran
clonotypeR detect A.fastq
. The result is a table stored in a directory
called clonotypes
, that will be created if it does not already exist.
The output is quite voluminous, and indicates which V / J combinations are being found, like on the following.
TRAV14-3 233
TRAJ61 0
TRAJ60 0
TRAJ59 0
TRAJ58 1
TRAJ57 39
TRAJ56 2
TRAJ55 0
The format of the table is explained in the manual page of the function
read_clonotypes()
of the R package.
For each library (A
, B
and C
), one file is available in the clonotypes
directory. With BWA 0.6.2-r126
, the following numbers of clonotypes are found.
1072 clonotypes/A.tsv
924 clonotypes/B.tsv
689 clonotypes/C.tsv
The files need to be concatenated before analysis in R
, with the following command.
find clonotypes/ -name '*tsv' | xargs cat > clonotypes.tsv
A copy of the result file is provided in inst/extdata/clonotypes2.tsv.xz
for convenience.
Load the clonotypeR library:
library(clonotypeR)
Load the data in a R object called clonotypes:
clonotypes <- read_clonotypes('clonotypes.tsv')
Alternatively, you can load the convenience copy from
inst/extdata/clonotypes2.tsv.xz
(see above).
clonotypes <- read_clonotypes(system.file("extdata", "clonotypes2.tsv.xz", package = "clonotypeR"))
The command summary(clonotypes)
already provides useful information.
summary(clonotypes)
## lib V J score
## A:1114 TRAV14-1 :1000 TRAJ31 : 403 Min. : 44.0
## B: 996 TRAV14-2 : 247 TRAJ23 : 281 1st Qu.:214.0
## C: 716 TRAV14-3 : 265 TRAJ22 : 279 Median :225.0
## TRAV14D-3/DV8: 255 TRAJ37 : 160 Mean :218.6
## TRAV14N-1 : 624 TRAJ34 : 147 3rd Qu.:232.0
## TRAV14N-2 : 250 TRAJ40 : 112 Max. :248.0
## TRAV14N-3 : 185 (Other):1444
## mapq read dna qual
## Min. : 0.00 Length:2826 Length:2826 Length:2826
## 1st Qu.: 31.00 Class :character Class :character Class :character
## Median : 38.00 Mode :character Mode :character Mode :character
## Mean : 63.50
## 3rd Qu.: 56.75
## Max. :223.00
##
## pep unproductive ambiguous
## Length:2826 Mode :logical Mode :logical
## Class :character FALSE:2223 FALSE:2787
## Mode :character TRUE :603 TRUE :39
## NA's :0 NA's :0
##
##
##
Identify unique clonotypes, count their sequences in the libraries A
, B
and C
, and store the result as a table arbitrarily named abc
.
abc <- clonotype_table(c('A','B','C'), data=clonotypes)
head(abc)
## A B C
## TRAV14-1 AAASSGSWQLI TRAJ22 1 0 0
## TRAV14-1 AACNNRIF TRAJ31 1 0 0
## TRAV14-1 AAGAKLT TRAJ39 3 0 0
## TRAV14-1 AAGGSWQLI TRAJ22 1 0 0
## TRAV14-1 AAGTNTGKLT TRAJ27 1 0 0
## TRAV14-1 AAHDTNAYKVI TRAJ30 1 0 0
summary(abc)
## A B C
## Min. : 0.0000 Min. : 0.0000 Min. : 0.0000
## 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000
## Median : 0.0000 Median : 0.0000 Median : 0.0000
## Mean : 0.7669 Mean : 0.6797 Mean : 0.5107
## 3rd Qu.: 1.0000 3rd Qu.: 1.0000 3rd Qu.: 0.0000
## Max. :18.0000 Max. :22.0000 Max. :124.0000
The summary shows that the most frequent clonotype is in C
. Using R
index vectors, we can see that its CDR3 sequence is AASDSNNRIF and that it was not found in the other libraries.
abc[abc$C == 124, ]
## A B C
## TRAV14N-1 AASDSNNRIF TRAJ31 0 0 124
The clonotype_table
function can also produce a count table for and combination of V, CDR3 or J segments.
clonotype_table(c('A','B','C'), "V", data=clonotypes)
## A B C
## TRAV14-1 238 515 0
## TRAV14-2 134 65 0
## TRAV14-3 82 11 117
## TRAV14D-3/DV8 145 51 4
## TRAV14N-1 78 26 399
## TRAV14N-2 83 63 52
## TRAV14N-3 102 33 2
head(clonotype_table(c('A','B','C'), c("V","J"), data=clonotypes))
## A B C
## TRAV14-1 TRAJ11 1 1 0
## TRAV14-1 TRAJ12 2 2 0
## TRAV14-1 TRAJ13 2 1 0
## TRAV14-1 TRAJ15 11 11 0
## TRAV14-1 TRAJ16 4 3 0
## TRAV14-1 TRAJ18 1 22 0