karyotapR Basic Workflow

Setup

library(karyotapR)
library(forcats)
set.seed(2023)  # seed is set to ensure tutorial is reproducible

Data Download

This guide uses the cell mixture experiment from the KaryoTap publication (Mays, 2023). The Tapestri Pipeline .h5 output file is available on Zenodo and can be downloaded by [curl::curl_download()] or directly from the website.

curl::curl_download(url = "https://zenodo.org/record/8305841/files/tapestri-experiment01-panelv1.h5?download=1",
    destfile = "./tap-cellmixture.h5", quiet = FALSE)

Basic Usage

Data Import

The cell mixture dataset is imported from the .h5 file that is generated by the Tapestri Pipeline, which generates a new TapestriExpriment object. This dataset comprises a mixture of 5 cell lines with differing karyotypes and was processed on the Tapestri instrument for single-cell DNA sequencing using Custom Oligo Panel 261 (a.k.a. Panel Version 1). Setting the panel.id parameter automatically assigns the correct probes to the grnaProbe and barcodeProbe slots in the object, which are used for special applications. Several useful processes run automatically on import, indicated in the status messages. For example, cytobands are automatically added to the probe metadata, chromosome Y probes are automatically detected and moved to a specific slot in the object (although now chrY probes exist in this panel), and any special probes that do not target the endogenous human genome are moved to their appropriate slots.

cellmix <- createTapestriExperiment("./tap-cellmixture.h5", panel.id = "CO261")

# ── Loading Tapestri Experiment ──────────────────

# • Sample Name: Teresa_s_cell_line_mix

# • Pipeline Panel Name: CO261_NYU_Davoli_03102021_hg19

# • Pipeline Version: 2.0.2

# • Date Created: 2021-09-15

# ── Metrics ──

# • Number of Cells: 3555

# • Number of Probes: 330

# • Mean Reads per Cell per Probe: 89.22

# ── Notes ── ℹ Adding cytobands from hg19.

# ℹ ChrY probe ID(s) not found in TapestriExperiment object.

# ℹ No non-genomic probe IDs found.

Calling the object will print a summary of the contained data. The TapestriExperiment class is built on top of the SingleCellExperiment and SummarizedExperiment classes, so they inherit their basic functionality and interface. Calling colData() and rowData() will return the metadata for the cells and probes/amplicons respectively.

cellmix
#> class: TapestriExperiment 
#> dim: 330 3555 
#> metadata(7): sample.name pipeline.panel.name ... date.h5.created
#>   mean.reads.per.cell.per.probe
#> assays(1): counts
#> rownames(330): AMPL158802 AMPL146998 ... AMPL162086 AMPL161738
#> rowData names(8): probe.id chr ... cytoband arm
#> colnames(3555): AACAACCTAATAGTGGTT-1 AACAACCTAGTCCTAGTT-1 ...
#>   TTGTAATGCTCGTACCTT-1 TTGTATCACACTTGATCT-1
#> colData names(2): cell.barcode total.reads
#> reducedDimNames(0):
#> mainExpName: CNV
#> altExpNames(1): alleleFrequency
#> barcodeProbe: not specified
#> grnaProbe: not specified
#> gmmParams(0):

colData(cellmix)  # cell metadata
#> DataFrame with 3555 rows and 2 columns
#>                              cell.barcode total.reads
#>                               <character>   <numeric>
#> AACAACCTAATAGTGGTT-1 AACAACCTAATAGTGGTT-1       60264
#> AACAACCTAGTCCTAGTT-1 AACAACCTAGTCCTAGTT-1       22455
#> AACAACCTATGGACGAGA-1 AACAACCTATGGACGAGA-1       51206
#> AACAACTGGCGCACTCTT-1 AACAACTGGCGCACTCTT-1       29616
#> AACAACTGGGACATAACG-1 AACAACTGGGACATAACG-1       17998
#> ...                                   ...         ...
#> TTGGTAACTCCATATCTT-1 TTGGTAACTCCATATCTT-1       44154
#> TTGTAATGCCTATAGCTC-1 TTGTAATGCCTATAGCTC-1       24469
#> TTGTAATGCGAGTGTGCC-1 TTGTAATGCGAGTGTGCC-1       45023
#> TTGTAATGCTCGTACCTT-1 TTGTAATGCTCGTACCTT-1       16216
#> TTGTATCACACTTGATCT-1 TTGTATCACACTTGATCT-1       18831

rowData(cellmix)  # probe metadata
#> DataFrame with 330 rows and 8 columns
#>               probe.id      chr start.pos   end.pos total.reads median.reads
#>            <character> <factor> <numeric> <numeric>   <numeric>    <integer>
#> AMPL158802  AMPL158802        1   1479191   1479385      202832           49
#> AMPL146998  AMPL146998        1   6196653   6196900      112408           24
#> AMPL158817  AMPL158817        1  11832076  11832330       99472           22
#> AMPL158827  AMPL158827        1  17087135  17087388     1162986          304
#> AMPL158845  AMPL158845        1  24412922  24413191        1008            0
#> ...                ...      ...       ...       ...         ...          ...
#> AMPL161732  AMPL161732        X 128880366 128880635      568729          133
#> AMPL161734  AMPL161734        X 130415574 130415843      231144           49
#> AMPL161735  AMPL161735        X 135429800 135430069      156462           33
#> AMPL162086  AMPL162086        X 140993717 140993974      195494           47
#> AMPL161738  AMPL161738        X 142794979 142795248      400235           93
#>               cytoband      arm
#>            <character> <factor>
#> AMPL158802      p36.33       1p
#> AMPL146998      p36.31       1p
#> AMPL158817      p36.22       1p
#> AMPL158827      p36.13       1p
#> AMPL158845      p36.11       1p
#> ...                ...      ...
#> AMPL161732       q26.1       Xq
#> AMPL161734       q26.2       Xq
#> AMPL161735       q26.3       Xq
#> AMPL162086       q27.2       Xq
#> AMPL161738       q27.3       Xq

Allele Frequency Clustering

We cluster on allele frequency to partition different cell lines represented in the experiment. First, we run Principal Components Analysis (PCA) and use the knee plot to identify the principal components (PCs) accounting for the most variation in the dataset.

cellmix <- runPCA(cellmix)
PCAKneePlot(cellmix)

Next, we run UMAP with the top PCs to embed them into two dimensions and plot the result.

cellmix <- runUMAP(cellmix, pca.dims = 1:4)
#> ℹ Running UMAP on: alleleFrequency.
reducedDimPlot(cellmix, dim.reduction = "umap")

Next, we partition the data into clusters using the dbscan method. The eps parameter can be used to adjust the granularity of the clustering. We can then update the UMAP plot with the clusters.

cellmix <- runClustering(cellmix, eps = 0.9)
#> ℹ Finding clusters in: alleleFrequency UMAP
reducedDimPlot(cellmix, dim.reduction = "umap", group.label = "cluster")

As expected, we have 5 major clusters corresponding to the 5 cell lines in the sequencing run, with the smaller clusters representing doublets (i.e., two cells in sequenced together in one droplet). We can subset the doublets out by pulling the cell barcodes corresponding to clusters 1-5 (clusters are ordered and named by descending size) and using those to subset the “columns” of the object into a new object. This is done here using a logical vector, but the cell barcodes can be passed in as a character vector of barcodes as well.

cellmix.subset <- cellmix[, colData(cellmix)$cluster %in% 1:5]

We’ll rename the cluster labels by renaming the factor levels of the “cluster” column in the colData slot, print an updated plot, and count the number of cells in each cluster.

colData(cellmix.subset)$cluster <- forcats::fct_recode(colData(cellmix.subset)$cluster,
    cellline1 = "1", cellline2 = "2", cellline3 = "3", cellline4 = "4", cellline5 = "5")
reducedDimPlot(cellmix.subset, dim.reduction = "umap", group.label = "cluster")

forcats::fct_count(colData(cellmix.subset)$cluster)
#> # A tibble: 13 × 2
#>    f             n
#>    <fct>     <int>
#>  1 cellline1   987
#>  2 cellline2   591
#>  3 cellline3   575
#>  4 cellline4   450
#>  5 cellline5   398
#>  6 6             0
#>  7 7             0
#>  8 8             0
#>  9 9             0
#> 10 10            0
#> 11 11            0
#> 12 12            0
#> 13 13            0

Copy Number Calling

The KaryoTap method works best with a reference population where the copy number for each chromosome arm is known. Here we used RPE1 cells which are diploid (2 copies) except for a third copy of the chromosome 10q arm. We know from the KaryoTap preprint that “cellline2” corresponds to the RPE1 cells.

Her we normalize the read counts in the object and calculate a copy number score relative to cellline 2. control.copy.number gives the cluster label and copy number value to normalize each chromosome arm to. generateControlCopyNumberTemplate() creates a dataframe that is used to indicate the copy number of the reference population. The entry for chr10q has to be changed to 3.

cellmix.subset <- calcNormCounts(cellmix.subset)
control.copy.number <- generateControlCopyNumberTemplate(cellmix.subset, sample.feature.label = "cellline3",
    copy.number = 2)
control.copy.number["10q", "copy.number"] <- 3

The calcCopyNumber function will throw an error if the median normalized counts for a probe in the reference population is zero, which would otherwise result in a division-by-zero calculation error. These probes need to be removed before moving forward.

try(cellmix.subset <- calcCopyNumber(cellmix.subset, control.copy.number = control.copy.number,
    sample.feature = "cluster"))
#> Error in calcCopyNumber(cellmix.subset, control.copy.number = control.copy.number,  : 
#>   AMPL158845, AMPL147043, AMPL147154, AMPL159975, AMPL147293, AMPL113086,
#> AMPL147323, AMPL158390, and AMPL158655 control cell median equal to 0. This
#> will cause a division-by-zero error. Filter out prior to proceeding.

probes.to.remove <- c("AMPL158845", "AMPL147043", "AMPL147154", "AMPL159975", "AMPL147293",
    "AMPL113086", "AMPL147323", "AMPL158390", "AMPL158655")
cellmix.subset <- cellmix.subset[!rowData(cellmix.subset)$probe.id %in% probes.to.remove,
    ]

cellmix.subset <- calcCopyNumber(cellmix.subset, control.copy.number = control.copy.number,
    sample.feature = "cluster")

The calcNormCounts() and calcCopyNumber() functions take the count matrix in the main assay slot of of the TapestriExperiment, perform their operation, and save the result to new assay slots, which can be accessed using assay() or listed using assays(). Assays in the SingleCellExperiment sense are sets of measurements for the same set of samples (columns) and features (rows). For the counts, normcounts and copyNumber assays, the features (probes) and samples (cell barcodes) are the same.

assays(cellmix.subset)
#> List of length 3
#> names(3): counts normcounts copyNumber

calcSmoothCopyNumber() produces one smoothed copy number score for each chromosome and cell. Since the features here are chromosomes, no longer probes, the values get saved to an altExp (alternate experiment) slot, which allows for measurements from the same samples (cell barcodes), with a different feature set than the top-level experiment in the TapestriExperiment object (i.e. probes vs. chromosomes).

cellmix.subset <- calcSmoothCopyNumber(cellmix.subset)
#> ℹ Smoothing copy number by median...
#> ✔ Smoothing copy number by median... [7.9s]
#>

Visualization of the copy number scores reveals the heterogeneity. Here we’re showing the copy number scores by probe, and the smoothed copy number scores by whole chromosome and by chromosome arm. See the documentation for assayHeatmap() for details on customization.

assayHeatmap(cellmix.subset, assay = "copyNumber", split.col.by = "arm", split.row.by = "cluster",
    annotate.row.by = "cluster", color.preset = "copy.number")

assayHeatmap(cellmix.subset, alt.exp = "smoothedCopyNumberByChr", assay = "smoothedCopyNumber",
    split.row.by = "cluster", annotate.row.by = "cluster", color.preset = "copy.number")

assayHeatmap(cellmix.subset, alt.exp = "smoothedCopyNumberByArm", assay = "smoothedCopyNumber",
    split.row.by = "cluster", annotate.row.by = "cluster", color.preset = "copy.number")

Finally, a integer copy number value for each chromosome in each cell can be called using a Gaussian Mixture Model (GMM) framework. calcGMMCopyNumber() takes a vector of cell barcodes for the reference sample and a template data frame from generateControlCopyNumberTemplate() indicating the copy number of each chromosome arm in the reference. Here we can use the same control.copy.number template that was generated earlier. Here we are specifying a model that has copy number = {1,2,3,4}. The results of this are saved as new assays in the “smoothedCopyNumber” altExp slots for chromosomes and chromosome arms.

reference.bcs <- colData(cellmix.subset)$cell.barcode[colData(cellmix.subset)$cluster ==
    "cellline2"]
cellmix.subset <- calcGMMCopyNumber(cellmix.subset, cell.barcodes = reference.bcs,
    control.copy.number = control.copy.number, model.components = 1:4)
#> ℹ Calculating GMMs using 591 input cells.
#> ℹ Generating probe values for 500 simulated cells...
#> ✔ Generating probe values for 500 simulated cells... [2.1s]
#> 
#> ℹ Fitting Gaussian distributions to simulated cells...
#> ✔ Fitting Gaussian distributions to simulated cells... [10.3s]
#> 
#> ℹ Calculating posterior probabilities...
#> ✔ Calculating posterior probabilities... [502ms]
#> 
#> ℹ Calling copy number from posterior probabilities...
#> ✔ Calling copy number from posterior probabilities... [139ms]
#> 
#> ✔ Saving whole chromosome copy number calls to altExp: smoothedCopyNumberByChr,
#>   assay: gmmCopyNumber...
#> ✔ Saving chromosome arm copy number calls to altExp: smoothedCopyNumberByArm,
#>   assay: gmmCopyNumber...
#> ✔ Saving GMM models and metadata to `gmmParams` slot...

assayHeatmap(cellmix.subset, alt.exp = "smoothedCopyNumberByChr", assay = "gmmCopyNumber",
    split.row.by = "cluster", annotate.row.by = "cluster", color.preset = "copy.number")

assayHeatmap(cellmix.subset, alt.exp = "smoothedCopyNumberByArm", assay = "gmmCopyNumber",
    split.row.by = "cluster", annotate.row.by = "cluster", color.preset = "copy.number")

Session Info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.3.3 (2024-02-29)
#>  os       Ubuntu 22.04.4 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en
#>  collate  C.UTF-8
#>  ctype    C.UTF-8
#>  tz       UTC
#>  date     2024-04-12
#>  pandoc   3.1.11 @ /opt/hostedtoolcache/pandoc/3.1.11/x64/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package              * version    date (UTC) lib source
#>  abind                  1.4-5      2016-07-21 [1] CRAN (R 4.3.3)
#>  askpass                1.2.0      2023-09-03 [1] CRAN (R 4.3.3)
#>  Biobase              * 2.62.0     2023-10-24 [1] Bioconductor
#>  BiocGenerics         * 0.48.1     2023-11-01 [1] Bioconductor
#>  bitops                 1.0-7      2021-04-24 [1] CRAN (R 4.3.3)
#>  bslib                  0.7.0      2024-03-29 [1] CRAN (R 4.3.3)
#>  cachem                 1.0.8      2023-05-01 [1] CRAN (R 4.3.3)
#>  circlize               0.4.16     2024-02-20 [1] CRAN (R 4.3.3)
#>  cli                    3.6.2      2023-12-11 [1] CRAN (R 4.3.3)
#>  clue                   0.3-65     2023-09-23 [1] CRAN (R 4.3.3)
#>  cluster                2.1.6      2023-12-01 [3] CRAN (R 4.3.3)
#>  codetools              0.2-19     2023-02-01 [3] CRAN (R 4.3.3)
#>  colorspace             2.1-0      2023-01-23 [1] CRAN (R 4.3.3)
#>  ComplexHeatmap         2.18.0     2023-10-24 [1] Bioconductor
#>  crayon                 1.5.2      2022-09-29 [1] CRAN (R 4.3.3)
#>  dbscan                 1.1-12     2023-11-28 [1] CRAN (R 4.3.3)
#>  DelayedArray           0.28.0     2023-10-24 [1] Bioconductor
#>  desc                   1.4.3      2023-12-10 [1] CRAN (R 4.3.3)
#>  digest                 0.6.35     2024-03-11 [1] CRAN (R 4.3.3)
#>  doParallel             1.0.17     2022-02-07 [1] CRAN (R 4.3.3)
#>  dplyr                  1.1.4      2023-11-17 [1] CRAN (R 4.3.3)
#>  evaluate               0.23       2023-11-01 [1] CRAN (R 4.3.3)
#>  fansi                  1.0.6      2023-12-08 [1] CRAN (R 4.3.3)
#>  farver                 2.1.1      2022-07-06 [1] CRAN (R 4.3.3)
#>  fastmap                1.1.1      2023-02-24 [1] CRAN (R 4.3.3)
#>  fitdistrplus           1.1-11     2023-04-25 [1] CRAN (R 4.3.3)
#>  forcats              * 1.0.0      2023-01-29 [1] CRAN (R 4.3.3)
#>  foreach                1.5.2      2022-02-02 [1] CRAN (R 4.3.3)
#>  formatR                1.14       2023-01-17 [1] CRAN (R 4.3.3)
#>  fs                     1.6.3      2023-07-20 [1] CRAN (R 4.3.3)
#>  generics               0.1.3      2022-07-05 [1] CRAN (R 4.3.3)
#>  GenomeInfoDb         * 1.38.8     2024-03-15 [1] Bioconduc~
#>  GenomeInfoDbData       1.2.11     2024-04-12 [1] Bioconductor
#>  GenomicRanges        * 1.54.1     2023-10-29 [1] Bioconductor
#>  GetoptLong             1.0.5      2020-12-15 [1] CRAN (R 4.3.3)
#>  ggplot2                3.5.0      2024-02-23 [1] CRAN (R 4.3.3)
#>  GlobalOptions          0.1.2      2020-06-10 [1] CRAN (R 4.3.3)
#>  glue                   1.7.0      2024-01-09 [1] CRAN (R 4.3.3)
#>  gtable                 0.3.4      2023-08-21 [1] CRAN (R 4.3.3)
#>  highr                  0.10       2022-12-22 [1] CRAN (R 4.3.3)
#>  htmltools              0.5.8.1    2024-04-04 [1] CRAN (R 4.3.3)
#>  IRanges              * 2.36.0     2023-10-24 [1] Bioconductor
#>  iterators              1.0.14     2022-02-05 [1] CRAN (R 4.3.3)
#>  jquerylib              0.1.4      2021-04-26 [1] CRAN (R 4.3.3)
#>  jsonlite               1.8.8      2023-12-04 [1] CRAN (R 4.3.3)
#>  karyotapR            * 1.0.1.9000 2024-04-12 [1] local
#>  knitr                  1.46       2024-04-06 [1] CRAN (R 4.3.3)
#>  labeling               0.4.3      2023-08-29 [1] CRAN (R 4.3.3)
#>  lattice                0.22-5     2023-10-24 [3] CRAN (R 4.3.3)
#>  lifecycle              1.0.4      2023-11-07 [1] CRAN (R 4.3.3)
#>  magrittr               2.0.3      2022-03-30 [1] CRAN (R 4.3.3)
#>  MASS                   7.3-60.0.1 2024-01-13 [3] CRAN (R 4.3.3)
#>  Matrix                 1.6-5      2024-01-11 [3] CRAN (R 4.3.3)
#>  MatrixGenerics       * 1.14.0     2023-10-24 [1] Bioconductor
#>  matrixStats          * 1.3.0      2024-04-11 [1] CRAN (R 4.3.3)
#>  memoise                2.0.1      2021-11-26 [1] CRAN (R 4.3.3)
#>  munsell                0.5.1      2024-04-01 [1] CRAN (R 4.3.3)
#>  openssl                2.1.1      2023-09-25 [1] CRAN (R 4.3.3)
#>  pillar                 1.9.0      2023-03-22 [1] CRAN (R 4.3.3)
#>  pkgconfig              2.0.3      2019-09-22 [1] CRAN (R 4.3.3)
#>  pkgdown                2.0.8      2024-04-10 [1] any (@2.0.8)
#>  png                    0.1-8      2022-11-29 [1] CRAN (R 4.3.3)
#>  purrr                  1.0.2      2023-08-10 [1] CRAN (R 4.3.3)
#>  R6                     2.5.1      2021-08-19 [1] CRAN (R 4.3.3)
#>  ragg                   1.3.0      2024-03-13 [1] CRAN (R 4.3.3)
#>  RColorBrewer           1.1-3      2022-04-03 [1] CRAN (R 4.3.3)
#>  Rcpp                   1.0.12     2024-01-09 [1] CRAN (R 4.3.3)
#>  RCurl                  1.98-1.14  2024-01-09 [1] CRAN (R 4.3.3)
#>  reticulate             1.35.0     2024-01-31 [1] CRAN (R 4.3.3)
#>  rjson                  0.2.21     2022-01-09 [1] CRAN (R 4.3.3)
#>  rlang                  1.1.3      2024-01-10 [1] CRAN (R 4.3.3)
#>  rmarkdown              2.26       2024-03-05 [1] CRAN (R 4.3.3)
#>  RSpectra               0.16-1     2022-04-24 [1] CRAN (R 4.3.3)
#>  S4Arrays               1.2.1      2024-03-04 [1] Bioconduc~
#>  S4Vectors            * 0.40.2     2023-11-23 [1] Bioconduc~
#>  sass                   0.4.9      2024-03-15 [1] CRAN (R 4.3.3)
#>  scales                 1.3.0      2023-11-28 [1] CRAN (R 4.3.3)
#>  sessioninfo            1.2.2      2021-12-06 [1] any (@1.2.2)
#>  shape                  1.4.6.1    2024-02-23 [1] CRAN (R 4.3.3)
#>  SingleCellExperiment * 1.24.0     2023-10-24 [1] Bioconductor
#>  SparseArray            1.2.4      2024-02-11 [1] Bioconduc~
#>  SummarizedExperiment * 1.32.0     2023-10-24 [1] Bioconductor
#>  survival               3.5-8      2024-02-14 [3] CRAN (R 4.3.3)
#>  systemfonts            1.0.6      2024-03-07 [1] CRAN (R 4.3.3)
#>  textshaping            0.3.7      2023-10-09 [1] CRAN (R 4.3.3)
#>  tibble                 3.2.1      2023-03-20 [1] CRAN (R 4.3.3)
#>  tidyr                  1.3.1      2024-01-24 [1] CRAN (R 4.3.3)
#>  tidyselect             1.2.1      2024-03-11 [1] CRAN (R 4.3.3)
#>  umap                   0.2.10.0   2023-02-01 [1] CRAN (R 4.3.3)
#>  utf8                   1.2.4      2023-10-22 [1] CRAN (R 4.3.3)
#>  vctrs                  0.6.5      2023-12-01 [1] CRAN (R 4.3.3)
#>  viridisLite            0.4.2      2023-05-02 [1] CRAN (R 4.3.3)
#>  withr                  3.0.0      2024-01-16 [1] CRAN (R 4.3.3)
#>  xfun                   0.43       2024-03-25 [1] CRAN (R 4.3.3)
#>  XVector                0.42.0     2023-10-24 [1] Bioconductor
#>  yaml                   2.3.8      2023-12-11 [1] CRAN (R 4.3.3)
#>  zlibbioc               1.48.2     2024-03-13 [1] Bioconduc~
#> 
#>  [1] /home/runner/work/_temp/Library
#>  [2] /opt/R/4.3.3/lib/R/site-library
#>  [3] /opt/R/4.3.3/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────