Getting Started with expranno • expranno

library(expranno)
demo <- example_expranno_data()

expranno expects:

an expression matrix-like data frame with gene_id in the first column
sample columns after gene_id
a metadata table with sample in the first column

flowchart TB
  subgraph inputs["Inputs"]
    expr["expr
gene x sample table"]
    meta["meta
sample x metadata table"]
  end

  subgraph core["Core workflow"]
    validate["Validate
check required columns"]
    annotate["Annotation
normalize Ensembl IDs
hybrid human or mouse mapping"]
    report["Coverage report
annotation_rate by field"]
    merge["Merge
combine expr_anno and meta"]
  end

  subgraph outputs["Saved outputs"]
    expranno["expr_anno.csv"]
    merged["expr_meta_merged.csv"]
  end

  subgraph downstream["Downstream analyses"]
    deconv["Deconvolution
immunedeconv
one CSV per method"]
    signature["Signature
GSVA or ssGSEA
one CSV per method"]
  end

  expr --> validate
  meta --> validate
  validate --> annotate
  annotate --> expranno
  annotate --> report
  expranno --> merge
  meta --> merge
  merge --> merged
  merged --> deconv
  merged --> signature

  classDef input fill:#eef5fb,stroke:#315c86,color:#17324d,stroke-width:1.5px;
  classDef process fill:#ffffff,stroke:#315c86,color:#17324d,stroke-width:1.5px;
  classDef output fill:#eaf6ef,stroke:#3c7a57,color:#17324d,stroke-width:1.5px;

  class expr,meta input;
  class validate,annotate,report,merge,deconv,signature process;
  class expranno,merged output;

The workflow at a glance

The workflow is easiest to read in five checkpoints:

validate the input contract
annotate genes and inspect coverage
inspect provenance and ambiguity
merge expression with metadata
run Deconvolution or Signature analyses from the merged table

If you want fixed human or mouse defaults instead of manually setting species, version stripping, and Ensembl release each time, use one of the built-in presets.

list_annotation_presets()
#>   annotation_preset species  recommended_input expr_scale annotation_engine
#> 1        human_v102   human                any       auto            hybrid
#> 2        mouse_v102   mouse                any       auto            hybrid
#> 3    human_tpm_v102   human TPM-like abundance  abundance            hybrid
#> 4    mouse_tpm_v102   mouse TPM-like abundance  abundance            hybrid
#> 5  human_count_v102   human         raw counts      count            hybrid
#> 6  mouse_count_v102   mouse         raw counts      count            hybrid
#>   strip_version biomart_version                   symbol_priority
#> 1          TRUE             102 hgnc_symbol -> external_gene_name
#> 2          TRUE             102  mgi_symbol -> external_gene_name
#> 3          TRUE             102 hgnc_symbol -> external_gene_name
#> 4          TRUE             102  mgi_symbol -> external_gene_name
#> 5          TRUE             102 hgnc_symbol -> external_gene_name
#> 6          TRUE             102  mgi_symbol -> external_gene_name
#>                                   fallback_order
#> 1  biomaRt -> org.Hs.eg.db -> EnsDb.Hsapiens.v86
#> 2 biomaRt -> org.Mm.eg.db -> EnsDb.Mmusculus.v79
#> 3  biomaRt -> org.Hs.eg.db -> EnsDb.Hsapiens.v86
#> 4 biomaRt -> org.Mm.eg.db -> EnsDb.Mmusculus.v79
#> 5  biomaRt -> org.Hs.eg.db -> EnsDb.Hsapiens.v86
#> 6 biomaRt -> org.Mm.eg.db -> EnsDb.Mmusculus.v79
#>                       bundled_truth
#> 1 example_annotation_truth('human')
#> 2 example_annotation_truth('mouse')
#> 3 example_annotation_truth('human')
#> 4 example_annotation_truth('mouse')
#> 5 example_annotation_truth('human')
#> 6 example_annotation_truth('mouse')

If you want the preset choices in a more explicit comparison table, including recommended input scale, backend cascade, and the matching bundled truth resource, see the preset reference article.

The bundled truth resources are useful when you want a reproducible validation example without preparing your own table first.

example_annotation_truth("human")
#>              gene_id symbol                        gene_name        biotype
#> 1 ENSG00000141510.17   TP53                tumor protein p53 protein_coding
#> 2 ENSG00000146648.18   EGFR epidermal growth factor receptor protein_coding
#> 3 ENSG00000012048.23  BRCA1      BRCA1 DNA repair associated protein_coding

Pick a species

Use species = "human" for ENSG... input IDs and species = "mouse" for ENSMUSG... input IDs. If IDs are clean Ensembl IDs, species = "auto" can infer this for you.

human_demo <- example_expranno_data("human")
mouse_demo <- example_expranno_data("mouse")

head(human_demo$expr$gene_id)
#> [1] "ENSG00000141510.17" "ENSG00000146648.18" "ENSG00000012048.23"
head(mouse_demo$expr$gene_id)
#> [1] "ENSMUSG00000059552.8"  "ENSMUSG00000020122.15" "ENSMUSG00000017167.16"

Pick an annotation engine

expranno exposes five annotation modes:

"hybrid": recommended production mode
"biomart": use only biomaRt
"orgdb": use only org.Hs.eg.db or org.Mm.eg.db
"ensdb": use only EnsDb
"none": skip annotation lookup and keep the normalized IDs only

"hybrid" is the default because it is the most coverage-oriented mode. It uses a cascade:

biomaRt with a fixed Ensembl release (v102 by default)
species-specific orgdb
species-specific EnsDb

That mirrors a practical manual workflow: query Ensembl first, fill unmapped IDs from org.Hs.eg.db or org.Mm.eg.db, then use EnsDb to enrich structural fields such as biotype and coordinates.

For lightweight docs and tests, the examples below use "none".

For real runs, a reproducible preset is often easier to share than a long list of arguments. Examples:

annotation_preset = "human_tpm_v102"
annotation_preset = "mouse_tpm_v102"
annotation_preset = "human_count_v102"
annotation_preset = "mouse_count_v102"

The preset table above is the quickest way to compare recommended input scale, backend order, and bundled truth resource.

annotated <- annotate_expr(
  expr = demo$expr,
  meta = demo$meta,
  species = "human",
  annotation_engine = "none"
)

merged <- merge_expr_meta(
  expr_anno = annotated$expr_anno,
  meta = annotated$meta_checked
)

The all-in-one wrapper is run_expranno().

result <- run_expranno(
  expr = demo$expr,
  meta = demo$meta,
  species = "human",
  annotation_engine = "none",
  output_dir = tempdir(),
  run_deconvolution = FALSE,
  run_signature = FALSE
)

result
#> <expranno_result>
#>   annotated genes: 3
#>   merged rows: 9
#>   deconvolution runs: 0
#>   signature runs: 0
#>   benchmark runs: 0
#>   validation runs: 0

For a reproducible production-style run, switch to a preset explicitly:

run_expranno(
  expr = demo$expr,
  meta = demo$meta,
  annotation_preset = "human_count_v102",
  output_dir = "results"
)

After a run, you can also return to a Bioconductor-native container.

annotated <- annotate_expr(
  expr = demo$expr,
  meta = demo$meta,
  species = "human",
  annotation_engine = "none",
  verbose = FALSE
)

if (requireNamespace("SummarizedExperiment", quietly = TRUE)) {
  se <- as_expranno_se(annotated)
  se
} else {
  "Install SummarizedExperiment to use as_expranno_se()."
}
#> class: SummarizedExperiment 
#> dim: 3 3 
#> metadata(1): expranno
#> assays(1): expression
#> rownames(3): ENSG00000141510 ENSG00000146648 ENSG00000012048
#> rowData names(48): gene_id_raw gene_id ... annotation_backend_mirror
#>   annotation_date
#> colnames(3): sample_a sample_b sample_c
#> colData names(4): sample group batch species

If your data already live in a SummarizedExperiment, you can start there directly.

result <- run_expranno(
  expr = se,
  species = "human",
  assay_name = "counts",
  gene_id_col = "gene_id",
  annotation_engine = "hybrid"
)

Control duplicate symbols and score kernels

Downstream methods usually operate on gene symbols, so duplicated symbols have to be collapsed. expranno now makes that rule explicit.

expr_scale = "count" usually pairs with duplicate_strategy = "sum"
expr_scale = "abundance" or "log" usually pairs with duplicate_strategy = "mean"

For signature scoring, signature_kcdf should also match the expression scale:

use "Poisson" for count-like input
use "Gaussian" for continuous log-scale input
use "auto" if you want GSVA to choose

Method-specific deconvolution arguments

Some immunedeconv methods require extra arguments. In particular, timer and consensus_tme need an indications vector with one tumor type per sample.

run_expranno(
  expr = demo$expr,
  meta = demo$meta,
  species = "human",
  annotation_engine = "hybrid",
  output_dir = "results",
  run_deconvolution = TRUE,
  deconv_args = list(indications = c("SKCM", "SKCM")),
  run_signature = TRUE,
  geneset_file = "hallmark.gmt",
  signature_method = "both",
  signature_kcdf = "Gaussian"
)

What to inspect after a run

For a production analysis with "hybrid", the first things to check are:

expr_anno.csv
annotation_report.csv
annotation_ambiguity.csv
annotation_provenance.csv
expr_meta_merged.csv

If symbol coverage is weak, downstream Deconvolution and Signature methods will usually be the first places where the problem appears.

If you want to compare engines on the same input before standardizing a lab workflow, benchmark them explicitly.

benchmark <- benchmark_annotation_engines(
  expr = demo$expr,
  meta = demo$meta,
  species = "human",
  engines = c("none"),
  verbose = FALSE
)

benchmark$summary
#>      engine status species message annotated_genes total_genes
#> none   none     ok   human    <NA>               0           3
#>      ambiguous_gene_fields symbol_coverage gene_name_coverage
#> none                     0               0                  0

If you also have a truth table, you can validate exact matches instead of only coverage.

truth <- data.frame(
  gene_id = demo$expr$gene_id,
  symbol = c("TP53", "EGFR", "BRCA1"),
  stringsAsFactors = FALSE
)

validation <- validate_annotation_engines(
  expr = demo$expr,
  meta = demo$meta,
  truth = truth,
  species = "human",
  engines = "none",
  fields = "symbol",
  verbose = FALSE
)

validation$summary
#>              engine  field truth_rows matched_rows missing_prediction_rows
#> none::symbol   none symbol          3            0                       3
#>              mismatch_rows ambiguous_rows match_rate missing_prediction_rate
#> none::symbol             0              0          0                       1
#>              mismatch_rate
#> none::symbol             0

For reproducible examples or CI runs, you do not need to write that truth table by hand. expranno ships with small human and mouse truth resources that line up with the demo Ensembl IDs.

example_annotation_truth("human")
#>              gene_id symbol                        gene_name        biotype
#> 1 ENSG00000141510.17   TP53                tumor protein p53 protein_coding
#> 2 ENSG00000146648.18   EGFR epidermal growth factor receptor protein_coding
#> 3 ENSG00000012048.23  BRCA1      BRCA1 DNA repair associated protein_coding

If your workflow starts and ends in Bioconductor containers, convert the result back into a SummarizedExperiment.

se <- as_expranno_se(result)

That keeps the annotated expression matrix in the assay, writes annotation fields to rowData, and appends sample-level Deconvolution or Signature outputs to colData.