make-data 4 ExperimentHub

Christian Panse1*

1Functional Genomics Center Zurich, UZH|ETHZ

2018-12-13

Abstract

Mass spectrometry data are available via ProteomeXchange with identifier PXD009301. NGS datasets are available at the European Nucleotide Archive (ENA) under accession number PRJEB25673. NGS and MS data were handled and annotated using the B-Fabric information management system, Türker et al. (2010), and are available for registered users under the project identifiers 1644 and 1875. The following content is described in more detail in Egloff et al. (2018) (under review NMETH-A35040).

1 Timeline Plots
2 Make Data (replaces make-data.R)
3 Uploading to S3
4 Overview/Getting started using Bioconductor ExperimentHub
5 Session info
References

1 Timeline Plots

The package contains only a subset of the most important data generated over a period of five years. To give an impression of the full scope, the timeline plots show an overview of all annotated samples (S) and workunits (W) in the B-Fabric system, Türker et al. (2010).

Figure: timeline plot of the NGS data (p1644).

Figure: timeline plot of the mass spec data (p1875).

2 Make Data (replaces `make-data.R`)

2.1 `NL42_100K.fastq.gz`

The sample NGS data contain 100K merged MiSeq reads, in FASTQ format, that demonstrate the linkage between nanobodies (NB) and flycodes (FC).

NL42_100K <- NestLink:::.getReadsFromFastq("inst/extdata/NL42_100K.fastq.gz")
save(NL42_100K, file="inst/extdata/NestLink_NL42_100K.RData")

2.2 `knownNB.txt`

An optional part of the NestLink workflow is the use of known nanobodies in the sequencing experiment to estimate sensitivity and specificity levels. This example file contains nucleotide sequences of nanobodies that should be detectable in this experiment. Later in the workflow, these nanobodies are highlighted and labeled as known NB.
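The file layout implied by the `read.table` call used in the next section (tab-separated, header row, names in the first column, a `Sequence` column) can be sketched with a toy file. The entry names and nucleotide sequences below are invented for illustration, and the header label `Name` is an assumption; only the column name `Sequence` is taken from this document.

```r
# Toy stand-in for knownNB.txt; names and sequences are made up.
toy <- data.frame(
  Name = c("NB_toy1", "NB_toy2"),
  Sequence = c("ATGGCTCAGGTT", "ATGGCTGAAGTT"),
  stringsAsFactors = FALSE
)
toyFile <- tempfile(fileext = ".txt")
write.table(toy, file = toyFile, sep = "\t", quote = FALSE, row.names = FALSE)

# Same arguments as the read.table call used for knownNB.txt.
knownNB_toy <- read.table(toyFile,
                          sep = "\t",
                          header = TRUE,
                          row.names = 1,
                          stringsAsFactors = FALSE)

# Sanity check before translation: lengths should be multiples of three.
inFrame <- nchar(knownNB_toy$Sequence) %% 3 == 0
print(knownNB_toy)
print(inFrame)
```

Checking that every sequence length is a multiple of three before translating is an optional safeguard added here, not a step of the NestLink workflow.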

2.3 `nanobodyFlycodeLinkage.RData`

This file contains the NGS ground truth, derived by applying the function runNGSAnalysis to the two previous files.

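library(ExperimentHub)
library(NestLink)
library(testthat)   # provides expect_true()

eh <- ExperimentHub()

# FASTQ input, fetched from ExperimentHub
expFile <- query(eh, c("NestLink", "NL42_100K.fastq.gz"))[[1]]
expect_true(file.exists(expFile))
# run the analysis in a scratch directory
scratchFolder <- tempdir()
setwd(scratchFolder)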

knownNB_File <- query(eh, c("NestLink", "knownNB.txt"))[[1]]
knownNB_data <- read.table(knownNB_File,
                           sep='\t',
                           header = TRUE,
                           row.names = 1,
                           stringsAsFactors = FALSE)

knownNB <- Biostrings::translate(DNAStringSet(knownNB_data$Sequence))
names(knownNB) <- rownames(knownNB_data)
knownNB <- sapply(knownNB, toString)

param <- list()
param[['NB_Linker1']] <- "GGCCggcggGGCC"
param[['NB_Linker2']] <- "GCAGGAGGA"
param[['ProteaseSite']] <- "TTAGTCCCAAGA"
param[['FC_Linker']] <- "GGCCaaggaggcCGG"
param[['knownNB']] <- knownNB
param[['nReads']] <- 100
param[['minRelBestHitFreq']] <- 0.8 
param[['minConsensusScore']] <- 0.9
param[['maxMismatch']] <- 1
param[['minNanobodyLength']] <- 348
param[['minFlycodeLength']] <- 33
param[['FCminFreq']] <- 1

nanobodyFlycodeLinkage.RData <- runNGSAnalysis(file = expFile[1], param)

2.4 `NB.tryptic` and `FC.tryptic`

Both files are the output of the previous NGS step generating the linkage between NBs and FCs.

The files are used to demonstrate the detectability of the AA sequences.

The wrapper functions extend the tables by the SSRC prediction (ssrc) and the parent ion mass (pim), both determined using protViz.

The column ESP_Prediction was generated by using the service from https://genepattern.broadinstitute.org, see also Fusaro et al. (2009).

library(NestLink)
NB <- getNB()
FC <- getFC()

The first six rows of each table are shown below:

peptide	ESP_Prediction	cond	pim	ssrc	peptideLength
AAAGITYYADSVK	0.82378	NB	1329.6685	21.93845	13
AACCPVAR	0.39342	NB	904.4127	5.56465	8
AADPGSWGQGTPVTVSSELK	0.64844	NB	1986.9767	26.10345	20
AADYYYGMNHWGK	0.15954	NB	1575.6685	24.80345	13
AANPFGLVQGFGSWGK	0.44514	NB	1635.8278	40.19691	16
AAPDYWGQGTPVTVSSELK	0.39622	NB	2005.9865	31.76845	19

	peptide	ESP_Prediction	cond	pim	ssrc	peptideLength
120	GSAAAAADSWLTVR	0.75450	FC	1375.696	27.80445	14
121	GSAAAAATDWLTVR	0.76422	FC	1389.712	29.00445	14
122	GSAAAAATGWLTVR	0.65522	FC	1331.707	28.60445	14
123	GSAAAAATVWLR	0.65496	FC	1173.637	29.10445	12
124	GSAAAAAYEWLTVR	0.72754	FC	1465.743	33.10445	14
125	GSAAAADAAWQEGGR	0.53588	FC	1417.645	11.70445	15
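The `pim` column can be cross-checked by hand. The following is a minimal base-R sketch, not the protViz implementation: it sums monoisotopic residue masses and adds one water and one proton to obtain the singly protonated mass. The function name `parentIonMassSketch` is hypothetical. Taking cysteine as carbamidomethylated (160.03065 Da) is an assumption of this sketch, chosen because it reproduces the tabulated value for AACCPVAR.

```r
# Monoisotopic residue masses in Da; C is assumed carbamidomethylated.
residueMass <- c(
  G = 57.02146, A = 71.03711, S = 87.03203, P = 97.05276, V = 99.06841,
  T = 101.04768, C = 160.03065, L = 113.08406, I = 113.08406, N = 114.04293,
  D = 115.02694, Q = 128.05858, K = 128.09496, E = 129.04259, M = 131.04049,
  H = 137.05891, F = 147.06841, R = 156.10111, Y = 163.06333, W = 186.07931
)

# Hypothetical helper: [M+H]+ of an unmodified peptide sequence.
parentIonMassSketch <- function(peptide) {
  aa <- strsplit(peptide, "")[[1]]
  stopifnot(all(aa %in% names(residueMass)))
  water <- 18.01056
  proton <- 1.00728
  sum(residueMass[aa]) + water + proton
}

# Compare with the pim values listed in the tables above.
tabulated <- c(AACCPVAR = 904.4127,
               AAAGITYYADSVK = 1329.6685,
               GSAAAAATVWLR = 1173.637)
computed <- sapply(names(tabulated), parentIonMassSketch)
print(round(cbind(tabulated, computed, diff = computed - tabulated), 4))
```

For real analyses the protViz functions should be preferred; the sketch only makes the definition of the column explicit.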

2.5 `F255744.RData` and `WU160118.RData`

2.5.1 Mass spec data

The mass spec files below are available through ProteomeXchange under the identifier PXD009301.

2.5.2 Compute the peptide spectrum matches

The mass spectra were assigned to peptide sequences using Matrix Science's Mascot Server, Perkins et al. (1999), version 2.5. The most important search parameters are listed in the table below.

Parameter	Value
COM	170819_MS1708116_NL5idx4to5_Competition2BG_db8_db10_swissprot_d_merge
FASTA 1	p1875_db8_20160704.fasta
FASTA 2	p1875_db10_20170817.fasta
TOL	10
TOLU	ppm
ITOL	0.6
ITOLU	Da
USERNAME	egloffp
CHARGE	2+
IT_MODS	Deamidated (NQ),Oxidation (M)
INSTRUMENT	ESI-TRAP
release	fgcz_swissprot_d_20140403.fasta
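To make the precursor tolerance concrete, the following sketch converts the 10 ppm setting (TOL, TOLU) into an absolute m/z window for a doubly charged ion (CHARGE 2+). It uses the flycode GSAAAAATVWLR and its tabulated parent ion mass from section 2.4 as an example. Treating `pim` as the singly protonated mass, and the variable names, are assumptions of this sketch; it does not reproduce how Mascot applies the tolerance internally.

```r
pim <- 1173.637      # [M+H]+ of GSAAAAATVWLR, taken from the FC table
proton <- 1.007276   # mass of a proton in Da
z <- 2               # charge state, as in CHARGE 2+

# m/z of the [M+zH]z+ ion, derived from the singly protonated mass
mz <- (pim + (z - 1) * proton) / z

tolPpm <- 10                   # TOL = 10, TOLU = ppm
delta <- mz * tolPpm * 1e-6    # absolute half-width of the window
mzWindow <- c(lower = mz - delta, upper = mz + delta)

print(round(c(mz = mz, delta = delta), 5))
print(round(mzWindow, 4))
```

The fragment tolerance (ITOL 0.6 Da) is already given in absolute units and needs no such conversion.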

The results were exported as XML. The XML was parsed and converted to a data.frame using the function protViz:::as.data.frame.mascot of the protViz package, Panse and Grossmann (2019).

2.5.3 Workflow available through B-Fabric

The above-described results and workflows are available for registered users in B-Fabric. However, it is not necessary to access B-Fabric in order to use this package.

2.5.4 make-data for NestLink

The following code snippet was executed to generate the data set shipped with the NestLink package.

Here only the metadata were extracted (no MS2).

load("~/Downloads/444589.RData")
library(protViz)
library(NestLink)
WU160118 <- do.call('rbind', lapply(list("F255737", "F255744", "F255747", 
  "F255749", "F255751", "F255760", "F255761", "F255762"), 
  function(datfilename){
      df <- as.data.frame.mascot(get(datfilename))
      df$datfilename <- datfilename
      df
    }
  ))
save(WU160118, file = "../inst/extdata/WU160118.RData", 
     compress = TRUE, compression_level = 9)
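The combine step above follows a common pattern: look up each object by name, tag its rows with that name, and stack the results. A minimal sketch with toy data frames illustrates it; the object names `F_toy1` and `F_toy2` and the score values are invented, while the column names `pep_seq` and `pep_score` are the ones used later in this document.

```r
# Toy stand-ins for the Mascot result objects, kept in their own environment.
env <- new.env()
assign("F_toy1",
       data.frame(pep_seq = c("GSAAAAADSWLTVR", "AACCPVAR"),
                  pep_score = c(40, 12),
                  stringsAsFactors = FALSE),
       envir = env)
assign("F_toy2",
       data.frame(pep_seq = "GSAAAAATVWLR",
                  pep_score = 31,
                  stringsAsFactors = FALSE),
       envir = env)

# Fetch each object by name, record its origin, then row-bind everything.
combined <- do.call("rbind", lapply(c("F_toy1", "F_toy2"),
  function(datfilename) {
    df <- get(datfilename, envir = env)
    df$datfilename <- datfilename
    df
  }))

print(combined)
print(table(combined$datfilename))
```

The extra `datfilename` column presumably explains why `WU160118` has one column more than `F255744` in the `dim()` output shown in section 4.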

The data ships with the NestLink package and can be browsed using the following code snippet:

library(ExperimentHub)
eh <- ExperimentHub(); 
load(query(eh, c("NestLink", "WU160118.RData"))[[1]])
class(WU160118)

## [1] "data.frame"

PATTERN <- "^GS[ASTNQDEFVLYWGP]{7}(WR|WLTVR|WQEGGR|WLR|WQSR)$"
idx <- grepl(PATTERN, WU160118$pep_seq)
WU <- WU160118[idx & WU160118$pep_score > 25,]

x
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_02_IMACelution.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_03_IMACelution.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_05_HiLoadElution.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_04_HiLoadElution.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_08_MaxBindingBG.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_07_MaxBindingBG.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_09_MaxBinding.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_10_MaxBinding.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_12_Competition1.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_13_Competition1.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_14_Competition1BG.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_15_Competition1BG.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_17_Competition2.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_18_Competition2.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_19_Competition2BG.raw”
“S:/p1875/Proteomics/FUSION_2/egloffp_20170814_NL5idx4to5/20170814_20_Competition2BG.raw”
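The flycode filter applied to `WU160118` above can be tried on a few peptides without downloading the data. In this sketch the peptide sequences are taken from the tables in section 2.4, whereas the `pep_score` values are invented for illustration.

```r
PATTERN <- "^GS[ASTNQDEFVLYWGP]{7}(WR|WLTVR|WQEGGR|WLR|WQSR)$"

toy <- data.frame(
  pep_seq = c("GSAAAAADSWLTVR",   # flycode, high score
              "GSAAAAATVWLR",     # flycode, score below threshold
              "AAAGITYYADSVK",    # nanobody peptide, no flycode pattern
              "GSAAAADAAWQEGGR"), # flycode, high score
  pep_score = c(40, 20, 55, 31),  # made-up Mascot scores
  stringsAsFactors = FALSE
)

# TRUE where the sequence is GS + 7 variable residues + a known C-terminal tag
idx <- grepl(PATTERN, toy$pep_seq)

# keep flycode matches with a score above 25, as in the snippet above
WU_toy <- toy[idx & toy$pep_score > 25, ]

print(idx)
print(WU_toy)
```

Only rows that satisfy both conditions survive, so a high-scoring nanobody peptide and a low-scoring flycode are both dropped.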

2.6 `PGexport2_normalizedAgainstSBstandards_Peptides.csv`

This file contains mass spectrometry based label-free quantitative (LFQ) results of nanobodies expressed in SMEG and COLI species.

Workunit : 158716 - QEXACTIVEHF_1
- 20170919_16_62465_nl5idx1-3_6titratecoli.raw
- 20170919_05_62465_nl5idx1-3_6titratecoli.raw
Workunit : 158717 - QEXACTIVEHF_1
- 20170919_14_62466_nl5idx1-3_7titratesmeg.raw
- 20170919_09_62466_nl5idx1-3_7titratesmeg.raw

Two LC-MS/MS runs were aligned in Progenesis QI (Nonlinear Dynamics) with an alignment score of 93.1 %, followed by peak picking with an allowed ion charge of +2 to +5.

3 Uploading to S3

#!/bin/bash

aws --profile AnnotationContributor s3 cp NestLink/F255744.RData s3://annotation-contributor/NestLink/F255744.RData --acl public-read

aws --profile AnnotationContributor s3 cp NestLink/WU160118.RData s3://annotation-contributor/NestLink/WU160118.RData --acl public-read

aws --profile AnnotationContributor s3 cp NestLink s3://annotation-contributor/NestLink --recursive --acl public-read

4 Overview/Getting started using Bioconductor ExperimentHub

Load the metadata:

library(knitr)   # provides kable()

fl <- system.file("extdata", "metadata.csv", package='NestLink')
kable(metadata <- read.csv(fl, stringsAsFactors=FALSE))

Title	Description	BiocVersion	Genome	SourceType	SourceUrl	SourceVersion	Species	TaxonomyId	Coordinate_1_based	DataProvider	Maintainer	RDataClass	DispatchClass	RDataPath	Tags	Notes
Sample NGS NB FC linkage data	Sample NGS demonstratig the linkage between nanobodies (NB) and flycodes (FC). data in FASTQ	3.9	NA	FASTQ	https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1644	Nov 28 2018	NA	NA	NA	Functional Genomics Center Zurich (FGCZ)	Markus Seeger [email protected], Pascal Egloff [email protected], Lennart Opitz [email protected]	DNAStringSet	FilePath	NestLink/NL42_100K.fastq.gz	NA	md5=4a13c5c61a5b29f4fd8830c1c15419b6;
Flycodes tryptic digested	Flycodes tryptic digested amino acid sequences with ESP_Prediction score.	3.9	NA	TXT	https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1875	Nov 28 2018	NA	NA	NA	Functional Genomics Center Zurich (FGCZ)	Markus Seeger [email protected], Pascal Egloff [email protected], Christian Panse [email protected]	data.frame	FilePath	NestLink/FC.tryptic	NA	md5=f6faa7458350ce1805bec30e9ffdeaae;
Nanobodies tryptic digested	Nanobodies tryptic digested amino acid sequences with ESP_Prediction score.	3.9	NA	TXT	https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1875	Nov 28 2018	NA	NA	NA	Functional Genomics Center Zurich (FGCZ)	Markus Seeger [email protected], Pascal Egloff [email protected], Christian Panse [email protected]	data.frame	FilePath	NestLink/NB.tryptic	NA	md5=db85a806c5151113536b710d566d9cf3;
FASTA as ground-truth for unit testing	FASTA data as ground-truth for unit testing.	3.9	NA	RData	https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1644	Nov 28 2018	NA	NA	NA	Functional Genomics Center Zurich (FGCZ)	Markus Seeger [email protected], Pascal Egloff [email protected], Lennart Opitz [email protected]	data.frame	FilePath	NestLink/nanobodyFlycodeLinkage.RData	NA	md5=57b2756fb0ebcf73d4036846580cb5b2;
Known nanobodies	Known nanobodies as nucleic acid sequences.	3.9	NA	TXT	https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1644	Nov 28 2018	NA	NA	NA	Functional Genomics Center Zurich (FGCZ)	Markus Seeger [email protected], Pascal Egloff [email protected], Lennart Opitz [email protected]	data.frame	FilePath	NestLink/knownNB.txt	NA	md5=003bf82c58f0a96a2bd945d171dc907c;
Quantitaive results for SMEG and COLI	Mass spectrometry based label free quantitative results of nanobodies expressed in SMEG and COLI species.	3.9	NA	CSV	https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-project.html?id=1875	Nov 28 2018	NA	NA	NA	Functional Genomics Center Zurich (FGCZ)	Markus Seeger [email protected], Pascal Egloff [email protected], Christian Panse [email protected]	data.frame	FilePath	NestLink/PGexport2_normalizedAgainstSBstandards_Peptides.csv	NA	md5=0ca525d0a65d4938f0cbc785b7e0d2d3; bfabric WU158716, WU158717
F255744 Mascot Search result	F255744 peptide spectrum matches (PSMs) of Flycodes.	3.9	NA	TXT	https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-resource.html?id=409912	Dec 13 2018	NA	NA	NA	Functional Genomics Center Zurich (FGCZ)	Markus Seeger [email protected], Pascal Egloff [email protected], Christian Panse [email protected]	data.frame	FilePath	NestLink/F255744.RData	NA	md5=d5e4d13e9ecba4231d1808c6bb0bb454; R409912
WU160118 Mascot Search results	WU160118 peptide spectrum matches (PSMs) Flycodes.	3.9	NA	TXT	https://fgcz-bfabric.uzh.ch/bfabric/userlab/show-workunit.html?id=160118	Dec 13 2018	NA	NA	NA	Functional Genomics Center Zurich (FGCZ)	Markus Seeger [email protected], Pascal Egloff [email protected], Christian Panse [email protected]	data.frame	FilePath	NestLink/WU160118.RData	NA	md5=a17f4505e322d440bc0e9edf8e5277bb; bfabric WU160118
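The Notes column of the metadata table carries an MD5 checksum for each file. The following sketch shows how such a checksum could be used to verify a local copy. The helper names `extractMd5` and `verifyMd5` are hypothetical and not part of NestLink or ExperimentHub; `tools::md5sum` ships with R. The demonstration uses a temporary toy file, since the real files would first have to be downloaded.

```r
# Hypothetical helper: pull the 32-digit hex checksum out of one Notes string.
extractMd5 <- function(notes) {
  m <- regmatches(notes,
                  regexpr("(?<=md5=)[0-9a-f]{32}", notes, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

# Hypothetical helper: compare a file's checksum with the one in Notes.
verifyMd5 <- function(path, notes) {
  identical(unname(tools::md5sum(path)), extractMd5(notes))
}

# Notes entry copied from the metadata table above.
note <- "md5=0ca525d0a65d4938f0cbc785b7e0d2d3; bfabric WU158716, WU158717"
print(extractMd5(note))

# Round trip on a toy file whose checksum is computed on the fly.
tmp <- tempfile()
writeLines("toy content", tmp)
toyNote <- paste0("md5=", unname(tools::md5sum(tmp)), "; toy entry")
print(verifyMd5(tmp, toyNote))
```

With the hub data, the path returned by `query(eh, ...)[[1]]` could be passed as `path`, assuming it points to the locally cached file as in the `load()` calls in this document.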

Query and load the NestLink package data from AWS S3:

library(ExperimentHub)

eh <- ExperimentHub(); 
query(eh, "NestLink")

## ExperimentHub with 8 records
## # snapshotDate(): 2024-10-24
## # $dataprovider: Functional Genomics Center Zurich (FGCZ)
## # $species: NA
## # $rdataclass: data.frame, DNAStringSet
## # additional mcols(): taxonomyid, genome, description,
## #   coordinate_1_based, maintainer, rdatadateadded, preparerclass, tags,
## #   rdatapath, sourceurl, sourcetype 
## # retrieve records with, e.g., 'object[["EH2063"]]' 
## 
##            title                                 
##   EH2063 | Sample NGS NB FC linkage data         
##   EH2064 | Flycodes tryptic digested             
##   EH2065 | Nanobodies tryptic digested           
##   EH2066 | FASTA as ground-truth for unit testing
##   EH2067 | Known nanobodies                      
##   EH2068 | Quantitaive results for SMEG and COLI 
##   EH2069 | F255744 Mascot Search result          
##   EH2070 | WU160118 Mascot Search results

load(query(eh, c("NestLink", "F255744.RData"))[[1]])
dim(F255744)

## [1] 15655    21

load(query(eh, c("NestLink", "WU160118.RData"))[[1]])
dim(WU160118)

## [1] 128390     22

5 Session info

Here is the compiled output of sessionInfo():

## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.20-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.12.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] knitr_1.48                  scales_1.3.0               
##  [3] ggplot2_3.5.1               NestLink_1.22.0            
##  [5] ShortRead_1.64.0            GenomicAlignments_1.42.0   
##  [7] SummarizedExperiment_1.36.0 Biobase_2.66.0             
##  [9] MatrixGenerics_1.18.0       matrixStats_1.4.1          
## [11] Rsamtools_2.22.0            GenomicRanges_1.58.0       
## [13] BiocParallel_1.40.0         protViz_0.7.9              
## [15] gplots_3.2.0                Biostrings_2.74.0          
## [17] GenomeInfoDb_1.42.0         XVector_0.46.0             
## [19] IRanges_2.40.0              S4Vectors_0.44.0           
## [21] ExperimentHub_2.14.0        AnnotationHub_3.14.0       
## [23] BiocFileCache_2.14.0        dbplyr_2.5.0               
## [25] BiocGenerics_0.52.0         BiocStyle_2.34.0           
## 
## loaded via a namespace (and not attached):
##  [1] DBI_1.2.3               bitops_1.0-9            deldir_2.0-4           
##  [4] rlang_1.1.4             magrittr_2.0.3          compiler_4.4.1         
##  [7] RSQLite_2.3.7           mgcv_1.9-1              png_0.1-8              
## [10] vctrs_0.6.5             pwalign_1.2.0           pkgconfig_2.0.3        
## [13] crayon_1.5.3            fastmap_1.2.0           magick_2.8.5           
## [16] labeling_0.4.3          caTools_1.18.3          utf8_1.2.4             
## [19] rmarkdown_2.29          UCSC.utils_1.2.0        tinytex_0.54           
## [22] purrr_1.0.2             bit_4.5.0               xfun_0.49              
## [25] zlibbioc_1.52.0         cachem_1.1.0            jsonlite_1.8.9         
## [28] blob_1.2.4              highr_0.11              DelayedArray_0.32.0    
## [31] jpeg_0.1-10             parallel_4.4.1          R6_2.5.1               
## [34] bslib_0.8.0             RColorBrewer_1.1-3      jquerylib_0.1.4        
## [37] Rcpp_1.0.13-1           bookdown_0.41           splines_4.4.1          
## [40] Matrix_1.7-1            tidyselect_1.2.1        abind_1.4-8            
## [43] yaml_2.3.10             codetools_0.2-20        hwriter_1.3.2.1        
## [46] curl_5.2.3              lattice_0.22-6          tibble_3.2.1           
## [49] withr_3.0.2             KEGGREST_1.46.0         evaluate_1.0.1         
## [52] pillar_1.9.0            BiocManager_1.30.25     filelock_1.0.3         
## [55] KernSmooth_2.23-24      generics_0.1.3          BiocVersion_3.20.0     
## [58] munsell_0.5.1           gtools_3.9.5            glue_1.8.0             
## [61] tools_4.4.1             interp_1.1-6            grid_4.4.1             
## [64] latticeExtra_0.6-30     AnnotationDbi_1.68.0    colorspace_2.1-1       
## [67] nlme_3.1-166            GenomeInfoDbData_1.2.13 cli_3.6.3              
## [70] rappdirs_0.3.3          fansi_1.0.6             S4Arrays_1.6.0         
## [73] dplyr_1.1.4             gtable_0.3.6            sass_0.4.9             
## [76] digest_0.6.37           SparseArray_1.6.0       farver_2.1.2           
## [79] memoise_2.0.1           htmltools_0.5.8.1       lifecycle_1.0.4        
## [82] httr_1.4.7              mime_0.12               bit64_4.5.2

References

Egloff, Pascal, Iwan Zimmermann, Fabian M. Arnold, Cedric A. J. Hutter, Damien Morger, Lennart Opitz, Lucy Poveda, et al. 2018. “Engineered Peptide Barcodes for In-Depth Analyses of Binding Protein Ensembles.” bioRxiv. https://doi.org/10.1101/287813.

Fusaro, V. A., D. R. Mani, J. P. Mesirov, and S. A. Carr. 2009. “Prediction of high-responding peptides for targeted protein assays by mass spectrometry.” Nat. Biotechnol. 27 (2): 190–98.

Panse, Christian, and Jonas Grossmann. 2019. protViz: Visualizing and Analyzing Mass Spectrometry Related Data in Proteomics. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.

Perkins, David N., Darryl J. C. Pappin, David M. Creasy, and John S. Cottrell. 1999. “Probability-Based Protein Identification by Searching Sequence Databases Using Mass Spectrometry Data.” Electrophoresis 20 (18): 3551–67. https://doi.org/10.1002/(sici)1522-2683(19991201)20:18<3551::aid-elps3551>3.0.co;2-2.

Türker, Can, Fuat Akal, Dieter Joho, Christian Panse, Simon Barkow-Oesterreicher, Hubert Rehrauer, and Ralph Schlapbach. 2010. “B-Fabric: The Swiss Army Knife for Life Sciences.” In Proceedings of the 13th International Conference on Extending Database Technology - EDBT 10. ACM Press. https://doi.org/10.1145/1739041.1739135.