Summix: estimating and adjusting for ancestry in genetic summary data

Audrey Hendricks, [email protected]

Gregory Matesi, [email protected]

November 1, 2020

1 Introduction

Hidden heterogeneity (such as ancestry) within genetic summary data can lead to confounding in association testing or inaccurate prioritization of putative variants. Here, we provide Summix, a method to estimate and adjust for reference ancestry groups within genetic allele frequency data. This method was developed by the Summix team at the University of Colorado Denver and is headed by Dr Audrey Hendricks.

References: Arriaga-MacKenzie IS, Matesi G, Chen S, Ronco A, Marker KM, Hall JR, Scherenberg R, Khajeh-Sharafabadi M, Wu Y, Gignoux CR, Null M, Hendricks AE (2021). Summix: A method for detecting and adjusting for population structure in genetic summary data. Am J Hum Genet 108, 1270-1282. https://doi.org/10.1016/j.ajhg.2021.05.016

2 Installation

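# Install BiocManager if it is not already available
if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

# Install Summix from Bioconductor only if it is missing
if(!requireNamespace("Summix", quietly = TRUE)){
    BiocManager::install("Summix")
}

# Load the package without printing startup messages
suppressPackageStartupMessages(library(Summix))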

3 Summix

Function to estimate reference ancestry proportions in heterogeneous genetic data.

3.1 SLSQP

The summix() function uses the slsqp() function in the nloptr package to run Sequential Least Squares Quadratic Programming (SLSQP). Documentation: https://www.rdocumentation.org/packages/nloptr/versions/1.2.2.2/topics/slsqp
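
To make the role of slsqp() concrete, the following is a minimal sketch of a constrained least-squares mixture fit on simulated data. It is not the Summix source code: the objective, the simulated reference matrix, and the names ref, obs, true_pi, and fit are assumptions made for illustration. It assumes the nloptr package is installed.

```r
# Sketch only: simulated data and an assumed least-squares objective,
# not the Summix implementation.
library(nloptr)

set.seed(1)
n_snps  <- 500
ref     <- cbind(anc1 = runif(n_snps),   # simulated reference allele frequencies
                 anc2 = runif(n_snps),
                 anc3 = runif(n_snps))
true_pi <- c(0.7, 0.2, 0.1)              # mixture proportions used to simulate
obs     <- as.vector(ref %*% true_pi)    # noise-free "observed" frequencies

# Sum of squared differences between the mixture and the observed frequencies
objective  <- function(p) sum((as.vector(ref %*% p) - obs)^2)
# Analytic gradient of the objective: 2 * t(ref) %*% (ref %*% p - obs)
gradient   <- function(p) as.vector(2 * crossprod(ref, ref %*% p - obs))
# Equality constraint: proportions must sum to one
sum_to_one <- function(p) sum(p) - 1

K   <- ncol(ref)
fit <- slsqp(x0 = rep(1 / K, K),         # start at 1/K, as summix() does by default
             fn = objective, gr = gradient,
             lower = rep(0, K), upper = rep(1, K),
             heq = sum_to_one)

round(fit$par, 3)                        # estimated proportions
```

Because the simulated data contain no noise, the estimates in fit$par should land very close to true_pi; real allele frequency data will not fit exactly.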

3.2 Usage

summix(data, reference, observed, pi.start)

3.2.1 Arguments

data Data frame of the observed and reference allele frequencies for N genetic variants. See the data formatting document at https://github.com/hendriau/Summix for more information.

reference Character vector of the column names for the reference ancestries.

observed Column name of the heterogeneous observed ancestry as a character string.

pi.start Numeric vector of length K of the starting guess for the ancestry proportions. If not specified, this defaults to 1/K where K is the number of reference ancestry groups.
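
The argument rules above can be checked before calling summix(). The helper below is hypothetical (check_summix_inputs is not part of the Summix package) and uses a toy data frame; it only illustrates the column-name requirements and the 1/K default for pi.start.

```r
# Hypothetical helper, not part of Summix: checks that the named columns
# exist and returns the starting guess that would be used.
check_summix_inputs <- function(data, reference, observed, pi.start = NULL) {
  missing_cols <- setdiff(c(reference, observed), names(data))
  if (length(missing_cols) > 0) {
    stop("columns not found in data: ", paste(missing_cols, collapse = ", "))
  }
  K <- length(reference)
  if (is.null(pi.start)) {
    pi.start <- rep(1 / K, K)   # default described above: 1/K for each group
  }
  stopifnot(is.numeric(pi.start), length(pi.start) == K)
  pi.start
}

# Toy data frame with two reference columns and one observed column
toy <- data.frame(ref_A = c(0.1, 0.5),
                  ref_B = c(0.3, 0.2),
                  obs   = c(0.2, 0.35))

check_summix_inputs(toy, reference = c("ref_A", "ref_B"), observed = "obs")
```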

3.2.2 Details

Estimates the proportion of each reference ancestry within the chosen observed group.

3.2.3 Value

Data frame with components:

objective Least Squares value at solution

iterations Number of iterations

time Function run time in seconds

filtered Number of NA SNPs filtered out

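K estimated proportions One column per reference ancestry group (K columns in total), each named after its reference column and holding the estimated proportion for that group.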

3.3 Example

library("Summix")
# load the data
data("ancestryData")
 
# Estimate 5 reference ancestry proportion values for the gnomAD African/African American ancestry group
summix( data = ancestryData, 
      reference=c("ref_AF_afr_1000G", 
          "ref_AF_eur_1000G", 
          "ref_AF_sas_1000G", 
          "ref_AF_iam_1000G", 
          "ref_AF_eas_1000G"), 
      observed="gnomad_AF_afr" )
##   objective iterations           time filtered ref_AF_afr_1000G
## 1  2.135725         23 0.3747385 secs        0        0.8250554
##   ref_AF_eur_1000G ref_AF_sas_1000G ref_AF_iam_1000G ref_AF_eas_1000G
## 1        0.1576768      0.003285205      0.006308355      0.007674304

3.4 Example

library("Summix")
# load the data
data("ancestryData")
 
# Estimate 5 reference ancestry proportion values for the gnomAD African/African American ancestry group,
# this time supplying a starting guess through pi.start
summix( data = ancestryData, 
      reference=c("ref_AF_afr_1000G", 
          "ref_AF_eur_1000G", 
          "ref_AF_sas_1000G", 
          "ref_AF_iam_1000G", 
          "ref_AF_eas_1000G"), 
      observed="gnomad_AF_afr",
      pi.start = c(0.8, 0.1, 0.05, 0.02, 0.03))
##   objective iterations           time filtered ref_AF_afr_1000G
## 1  2.135725         27 0.3392296 secs        0        0.8250554
##   ref_AF_eur_1000G ref_AF_sas_1000G ref_AF_iam_1000G ref_AF_eas_1000G
## 1        0.1576768      0.003285201      0.006308353      0.007674306
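
The two runs above start from different guesses but report essentially the same solution. The short check below copies the printed proportions from both outputs into plain numeric vectors (the names run_default and run_custom are introduced here for illustration) and compares them, so it runs without the package.

```r
# Proportions copied from the printed output of the two examples above
run_default <- c(ref_AF_afr_1000G = 0.8250554,
                 ref_AF_eur_1000G = 0.1576768,
                 ref_AF_sas_1000G = 0.003285205,
                 ref_AF_iam_1000G = 0.006308355,
                 ref_AF_eas_1000G = 0.007674304)

run_custom  <- c(ref_AF_afr_1000G = 0.8250554,
                 ref_AF_eur_1000G = 0.1576768,
                 ref_AF_sas_1000G = 0.003285201,
                 ref_AF_iam_1000G = 0.006308353,
                 ref_AF_eas_1000G = 0.007674306)

# The estimated proportions sum to one, up to rounding in the printed output
sum(run_default)

# Largest difference between the two runs across the five ancestry groups
max(abs(run_default - run_custom))
```

The largest difference is below 1e-8, which suggests that for this data set the choice of pi.start affects the number of iterations (23 versus 27) rather than the estimates.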

4 adjAF

Ancestry Adjusted Allele Frequency: function to estimate ancestry adjusted allele frequencies given the proportions of reference ancestry groups.

4.1 Usage

adjAF(data, reference, observed, pi.target, pi.observed)

4.1.1 Arguments

data Data frame containing the unadjusted allele frequencies for the observed group and the allele frequencies for K-1 reference ancestry groups, for N SNPs.

reference Character vector of the column names for K-1 reference ancestry groups. The name of the last reference ancestry group is not included as that group is not used to estimate the adjusted allele frequencies.

observed Column name for the observed ancestry.

pi.observed Numeric vector of the mixture proportions for K reference ancestry groups for the observed group. The order must match the order of the specified reference character vector, with the last entry matching the missing ancestry reference group.

pi.target Numeric vector of the mixture proportions for K reference ancestry groups in the target sample or subject. Order must match the order of the specified reference character vector with the last entry matching the missing ancestry reference group.

4.1.2 Details

Estimates ancestry adjusted allele frequencies in an observed sample of allele frequencies given estimated reference ancestry proportions and the observed AFs for K-1 reference ancestry groups.
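
One way to read this description is as a two-step calculation: first recover the allele frequency of the group without reference data by removing the K-1 known contributions from the observed frequency and rescaling by that group's observed proportion, then remix all K groups using the target proportions. The sketch below implements that reading in base R. The function adjust_af_sketch is hypothetical and is not the package's adjAF() code, which may handle details differently; the input values are the first five rows of ancestryData as printed in the example in Section 4.2.

```r
# Hypothetical sketch of the adjustment, not the Summix implementation.
adjust_af_sketch <- function(obs_af, ref_af, pi.observed, pi.target) {
  K      <- length(pi.observed)
  ref_af <- as.matrix(ref_af)                      # N x (K-1) reference AFs
  stopifnot(ncol(ref_af) == K - 1, length(pi.target) == K)

  # Contribution of the K-1 groups that have reference data
  known_part <- as.vector(ref_af %*% pi.observed[-K])
  # Implied allele frequency of the group without reference data
  missing_af <- (obs_af - known_part) / pi.observed[K]
  # Remix all K groups using the target proportions
  as.vector(ref_af %*% pi.target[-K]) + pi.target[K] * missing_af
}

# First five SNPs, copied from head(ancestryData) in Section 4.2
gnomad_afr <- c(0.4886100, 0.0459137, 0.1359770, 0.8548790, 0.7241780)
ref_eur    <- c(0.173275495, 0.001237745, 0.168316089, 0.428212624, 0.204214851)

adjusted <- adjust_af_sketch(obs_af      = gnomad_afr,
                             ref_af      = ref_eur,
                             pi.observed = c(0.15, 0.85),
                             pi.target   = c(0, 1))
round(adjusted, 8)
```

For these five SNPs the sketch agrees with the adjustedAF column printed in Section 4.2 to about six decimal places, which is consistent with this reading of the method but does not confirm that adjAF() is implemented this way.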

4.1.3 Value

List with components:

pi table of input reference ancestry groups, pi.observed values, and pi.target values

observed.data name of the data column for the observed group from which adjusted ancestry allele frequency is estimated

Nsnps number of SNPs for which adjusted AF is estimated

adjusted.AF data frame of original data with an appended column of adjusted allele frequencies

4.2 Example

library("Summix")
data(ancestryData)

head(ancestryData)
##   CHR       RSID     POS A1 A2 ref_AF_eur_1000G ref_AF_afr_1000G
## 1   1  rs2887286 1156131  C  T      0.173275495       0.54166349
## 2   1 rs41477744 2329564  A  G      0.001237745       0.03571448
## 3   1  rs9661525 2952840  G  T      0.168316089       0.12004821
## 4   1  rs2817174 3044181  C  T      0.428212624       0.95932526
## 5   1 rs12139206 3504073  T  C      0.204214851       0.80156548
## 6   1  rs7514979 3654595  T  C      0.004950604       0.41865218
##   ref_AF_sas_1000G ref_AF_eas_1000G ref_AF_iam_1000G gnomad_AF_afr
## 1       0.53171227        0.8462232           0.7093     0.4886100
## 2       0.00000000        0.0000000           0.0000     0.0459137
## 3       0.09918029        0.3938534           0.2442     0.1359770
## 4       0.63907198        0.5704540           0.5000     0.8548790
## 5       0.39367076        0.3898812           0.3372     0.7241780
## 6       0.00000000        0.0000000           0.0000     0.3362490
##   gnomad_AF_amr gnomad_AF_oth
## 1    0.52594300    0.22970500
## 2    0.00117925    0.00827206
## 3    0.28605200    0.15561700
## 4    0.48818000    0.47042500
## 5    0.29550800    0.25874800
## 6    0.01650940    0.02481620
tmp.aa<-adjAF(data   = ancestryData,
     reference   = c("ref_AF_eur_1000G"),
     observed    = "gnomad_AF_afr",
     pi.target   = c(0, 1),
     pi.observed = c(.15, .85))
## $pi
##          ref.group pi.observed pi.target
## 1 ref_AF_eur_1000G        0.15         0
## 2             NONE        0.85         1
## 
## $observed.data
## [1] "observed data to update AF: 'gnomad_AF_afr'"
## 
## $Nsnps
## [1] 10000
## 
## [[4]]
## [1] "use $adjusted.AF to see adjusted AF data"
tmp.aa$adjusted.AF[1:5,]
##   CHR       RSID     POS A1 A2 ref_AF_eur_1000G ref_AF_afr_1000G
## 1   1  rs2887286 1156131  C  T      0.173275495       0.54166349
## 2   1 rs41477744 2329564  A  G      0.001237745       0.03571448
## 3   1  rs9661525 2952840  G  T      0.168316089       0.12004821
## 4   1  rs2817174 3044181  C  T      0.428212624       0.95932526
## 5   1 rs12139206 3504073  T  C      0.204214851       0.80156548
##   ref_AF_sas_1000G ref_AF_eas_1000G ref_AF_iam_1000G gnomad_AF_afr
## 1       0.53171227        0.8462232           0.7093     0.4886100
## 2       0.00000000        0.0000000           0.0000     0.0459137
## 3       0.09918029        0.3938534           0.2442     0.1359770
## 4       0.63907198        0.5704540           0.5000     0.8548790
## 5       0.39367076        0.3898812           0.3372     0.7241780
##   gnomad_AF_amr gnomad_AF_oth adjustedAF
## 1    0.52594300    0.22970500 0.54425727
## 2    0.00117925    0.00827206 0.05379769
## 3    0.28605200    0.15561700 0.13027010
## 4    0.48818000    0.47042500 0.93017307
## 5    0.29550800    0.25874800 0.81593620

5 sessionInfo()

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /home/biocbuild/bbs-3.18-bioc/R/lib/libRblas.so 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_GB              LC_COLLATE=C              
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Summix_2.8.0
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.33     R6_2.5.1          fastmap_1.1.1     xfun_0.40        
##  [5] cachem_1.0.8      knitr_1.44        htmltools_0.5.6.1 rmarkdown_2.25   
##  [9] cli_3.6.1         nloptr_2.0.3      sass_0.4.7        jquerylib_0.1.4  
## [13] compiler_4.3.1    tools_4.3.1       evaluate_0.22     bslib_0.5.1      
## [17] yaml_2.3.7        rlang_1.1.1       jsonlite_1.8.7