pangenome
Tags: alignment core-genome pan-genome phylogeny comparative-genomics bactopia-tool
Pangenome analysis with optional core-genome phylogeny.
This Bactopia Tool creates a pangenome from GFF3 annotation files using one of three tools: Panaroo (default), PIRATE, or Roary. It generates core-genome alignments and gene presence/absence matrices, followed by SNP distance calculations. You can supplement your pangenome with completed genomes using the --species or --accessions parameters, which download genomes from RefSeq and annotate them with Prokka. A phylogeny based on the core-genome alignment is created by IQ-TREE, with optional recombination masking using ClonalFrameML. Finally, pan-genome-wide association studies can be conducted using Scoary.
Usage
Bactopia CLI:
bactopia --wf pangenome \
--bactopia /path/to/your/bactopia/results
Nextflow:
nextflow run bactopia/bactopia/workflows/bactopia-tools/pangenome/main.nf \
--bactopia /path/to/your/bactopia/results
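To supplement the pangenome with completed RefSeq genomes, one option is to pass a file of NCBI Assembly accessions through --accessions. The following is a minimal sketch, not a prescribed workflow: the file name `accessions.txt` and the two accessions are illustrative placeholders to be replaced with assemblies for your own species, and the `bactopia` call is only attempted when the command is installed.

```shell
# Write one NCBI Assembly accession per line.
# These two accessions are placeholders; substitute your own.
printf '%s\n' 'GCF_000013425.1' 'GCF_000009645.1' > accessions.txt

# Only attempt the run when bactopia is available on this machine.
if command -v bactopia >/dev/null 2>&1; then
    bactopia --wf pangenome \
        --bactopia /path/to/your/bactopia/results \
        --accessions accessions.txt
fi
```

Downloaded genomes are annotated with Prokka before they enter the pangenome, so the Prokka parameters listed further down also apply to them.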
Outputs
Expected Output Files
<BACTOPIA_DIR>
└── <SAMPLE_NAME>
    └── pangenome-<TIMESTAMP>
        ├── clonalframeml
        │   ├── core-genome.ML_sequence.fasta.gz
        │   ├── core-genome.em.txt
        │   ├── core-genome.emsim.txt
        │   ├── core-genome.importation_status.txt
        │   ├── core-genome.labelled_tree.newick
        │   ├── core-genome.position_cross_reference.txt.gz
        │   └── logs
        │       ├── nf.command.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        ├── core-genome.distance.tsv
        ├── core-genome.masked.aln.gz
        ├── core-genome.masked.distance.tsv
        ├── core-genome.treefile
        ├── iqtree
        │   ├── core-genome.alninfo.gz
        │   ├── core-genome.bionj
        │   ├── core-genome.ckp.gz
        │   ├── core-genome.contree
        │   ├── core-genome.iqtree
        │   ├── core-genome.log
        │   ├── core-genome.mldist
        │   ├── core-genome.splits.nex
        │   ├── core-genome.ufboot
        │   └── logs
        │       ├── nf.command.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        ├── iqtree-fast
        │   ├── logs
        │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │   └── versions.yml
        │   ├── roary.bionj
        │   ├── roary.ckp.gz
        │   ├── roary.iqtree
        │   ├── roary.log
        │   ├── roary.mldist
        │   ├── roary.model.gz
        │   └── roary.treefile
        ├── nf-reports
        │   ├── pangenome-dag.dot
        │   ├── pangenome-report.html
        │   └── pangenome-timeline.html
        ├── roary
        │   ├── accessory.header.embl
        │   ├── accessory.tab
        │   ├── accessory_binary_genes.fa.gz
        │   ├── accessory_binary_genes.fa.newick
        │   ├── accessory_graph.dot
        │   ├── blast_identity_frequency.Rtab
        │   ├── clustered_proteins
        │   ├── core_accessory.header.embl
        │   ├── core_accessory.tab
        │   ├── core_accessory_graph.dot
        │   ├── core_alignment_header.embl
        │   ├── core_gene_alignment.aln.gz
        │   ├── gene_presence_absence.Rtab
        │   ├── gene_presence_absence.csv
        │   ├── logs
        │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │   └── versions.yml
        │   ├── number_of_conserved_genes.Rtab
        │   ├── number_of_genes_in_pan_genome.Rtab
        │   ├── number_of_new_genes.Rtab
        │   ├── number_of_unique_genes.Rtab
        │   ├── pan_genome_reference.fa.gz
        │   └── summary_statistics.txt
        ├── roary.aln.gz
        ├── scoary
        │   ├── Bogus_trait.results.csv
        │   ├── Tetracycline_resistance.results.csv
        │   └── logs
        │       ├── nf.command.{begin,err,log,out,run,sh,trace}
        │       ├── scoary.log
        │       └── versions.yml
        ├── snpdists
        │   └── logs
        │       ├── nf.command.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        └── snpdists-masked
            └── logs
                ├── nf.command.{begin,err,log,out,run,sh,trace}
                └── versions.yml
Pangenome Results
| File | Description |
|---|---|
*.aln | Core-genome alignment file containing genes present across all input genomes |
*.csv | Gene presence/absence matrix showing which genes are present in each genome |
*.tsv | SNP distance matrix between all samples |
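As a quick way to explore the SNP distance matrix, the sketch below lists sample pairs at or below a chosen distance. It assumes the matrix is square and tab-separated, with sample names in the first row and first column in the same order; the toy file `toy.distance.tsv`, its values, and the threshold of 5 are made up for illustration, so point the `awk` command at your own `core-genome.distance.tsv` instead.

```shell
# Toy matrix standing in for core-genome.distance.tsv (values are made up).
printf 'snp-dists\tA\tB\tC\n'  > toy.distance.tsv
printf 'A\t0\t3\t12\n'        >> toy.distance.tsv
printf 'B\t3\t0\t15\n'        >> toy.distance.tsv
printf 'C\t12\t15\t0\n'       >> toy.distance.tsv

# Walk the upper triangle only, so each unordered pair is reported once.
# Row NR holds the sample whose own column is also NR, hence i starts at NR + 1.
awk -F'\t' -v max=5 '
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
    { for (i = NR + 1; i <= NF; i++) if ($i + 0 <= max) print $1 "\t" name[i] "\t" $i }
' toy.distance.tsv > close_pairs.tsv

cat close_pairs.tsv
```

Each output line holds the two sample names followed by their distance. The same command can be run on `core-genome.masked.distance.tsv` to compare distances after recombination masking.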
Phylogeny Results
Only created if --skip_phylogeny is not enabled
| File | Description |
|---|---|
*.treefile | Maximum likelihood phylogenetic tree in Newick format |
*.iqtree | IQ-Tree analysis report with model selection and support values |
*.log | IQ-Tree execution log |
Recombination Analysis
Only created if --skip_recombination is not enabled
| File | Description |
|---|---|
*.masked.aln | Core-genome alignment with recombination regions masked |
Association Analysis
Only created if --scoary_traits is specified
| File | Description |
|---|---|
scoary/* | Scoary association analysis results and plots |
Panaroo Results
Only created when Panaroo is selected as the pangenome tool
| File | Description |
|---|---|
panaroo/* | Panaroo-specific output files including graph and statistics |
PIRATE Results
Only created when PIRATE is selected as the pangenome tool
| File | Description |
|---|---|
pirate/* | PIRATE-specific output files including gene families and clusters |
Roary Results
Only created when Roary is selected as the pangenome tool
| File | Description |
|---|---|
roary/* | Roary-specific output files including gene presence/absence matrices |
Audit Trail
Below are files that can assist you in understanding which parameters and program versions were used.
Logs
Each process that is executed will have a folder named logs. In this folder are helpful
files for you to review if the need ever arises.
| Extension | Description |
|---|---|
| .begin | An empty file used to designate the process started |
| .err | Contains STDERR outputs from the process |
| .log | Contains both STDERR and STDOUT outputs from the process |
| .out | Contains STDOUT outputs from the process |
| .run | The script Nextflow uses to stage/unstage files and queue processes based on given profile |
| .sh | The script executed by bash for the process |
| .trace | The Nextflow trace report for the process |
| versions.yml | A YAML formatted file with program versions |
Nextflow Reports
These Nextflow reports provide a great summary of your run. They can be used to optimize resource usage and to estimate expected costs on cloud platforms.
| Filename | Description |
|---|---|
| pangenome-dag.dot | The Nextflow DAG visualization |
| pangenome-report.html | The Nextflow Execution Report |
| pangenome-timeline.html | The Nextflow Timeline Report |
| pangenome-trace.txt | The Nextflow Trace report |
Parameters
Required Parameters
Define where the pipeline should find input data and save output data.
| Parameter | Type | Default | Description |
|---|---|---|---|
--bactopia | string | | The path to bactopia results to use as inputs |
NCBI Genome Download Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--species | string | | Name of the species to download assemblies |
--accession | string | | An NCBI Assembly accession to be downloaded |
--accessions | string | | A file of NCBI Assembly accessions (one per line) to be downloaded |
--format | string | fasta | Comma separated list of formats to download |
--section | string | refseq | NCBI section to download |
--assembly_level | string | complete | Comma separated list of assembly levels to download |
--kingdom | string | bacteria | Comma separated list of taxonomic groups to download |
--limit | string | | Limit the number of assemblies to download |
--keep_downloads | boolean | false | Save downloaded files into the bactopia-runs folder |
Prokka Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--prokka_proteins | string | | FASTA file of trusted proteins to first annotate from |
--prokka_prodigal_tf | string | | Training file to use for Prodigal |
--prokka_compliant | boolean | false | Force GenBank/ENA/DDBJ compliance |
--prokka_centre | string | Bactopia | Sequencing centre ID |
--prokka_coverage | integer | 80 | Minimum coverage on query protein |
--prokka_evalue | string | 1e-09 | Similarity e-value cut-off |
--prokka_opts | string | | Extra Prokka options in quotes. |
--prokka_debug | boolean | false | Enable debug mode for Prokka |
PIRATE Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--use_pirate | boolean | false | Use PIRATE instead of panaroo in the 'pangenome' subworkflow |
--pirate_steps | string | 50,60,70,80,90,95,98 | Percent identity thresholds to use for pangenome construction |
--pirate_features | string | CDS | Comma-delimited features to use for pangenome construction |
--pirate_para_off | boolean | false | Switch off paralog identification |
--pirate_z | boolean | false | Retain all PIRATE intermediate files |
--pirate_pan_opt | string | | Additional arguments to pass to pangenome construction. |
Roary Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--roary_use_prank | boolean | false | Use PRANK instead of MAFFT for core gene alignment |
--use_roary | boolean | false | Use Roary instead of Panaroo in the 'pangenome' subworkflow |
--roary_i | integer | 95 | Minimum percentage identity for blastp |
--roary_cd | integer | 99 | Percentage of isolates a gene must be in to be core |
--roary_g | integer | 50000 | Maximum number of clusters |
--roary_s | boolean | false | Do not split paralogs |
--roary_ap | boolean | false | Allow paralogs in core alignment |
--roary_iv | number | 1.5 | MCL inflation value |
Panaroo Run Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--panaroo_mode | string | strict | The stringency mode at which to run panaroo (choices: strict, moderate, sensitive) |
--panaroo_alignment | string | core | Output alignments of core genes or all genes (choices: core, pan) |
--panaroo_aligner | string | mafft | Aligner to use for core/pan genome alignment (choices: mafft, prank, clustal) |
--panaroo_core_threshold | number | 0.95 | Core-genome sample threshold |
--panaroo_threshold | number | 0.98 | Sequence identity threshold |
--panaroo_family_threshold | number | 0.7 | Protein family sequence identity threshold |
--panaroo_len_dif_percent | number | 0.98 | Length difference cutoff |
--panaroo_merge_paralogs | boolean | false | Do not split paralogs |
--panaroo_opts | string | | Additional options to pass to panaroo |
SNP-Dists Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--snpdists_a | boolean | false | Count all differences not just [AGTC] |
--snpdists_b | boolean | false | Keep top left corner cell |
--snpdists_csv | boolean | false | Output CSV instead of TSV |
--snpdists_k | boolean | false | Keep case, don't uppercase all letters |
ClonalFrameML Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--clonalframeml_emsim | integer | 100 | Number of simulations to estimate uncertainty in the EM results |
--clonalframeml_opts | string | | Extra ClonalFrameML options in quotes |
--skip_recombination | boolean | false | Skip ClonalFrameML execution in subworkflows |
IQ-TREE Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--iqtree_model | string | HKY | Substitution model name |
--iqtree_bb | integer | 1000 | Ultrafast bootstrap replicates |
--iqtree_alrt | integer | 1000 | SH-like approximate likelihood ratio test replicates |
--iqtree_asr | boolean | false | Ancestral state reconstruction by empirical Bayes |
--iqtree_opts | string | | Extra IQ-TREE options in quotes. |
--skip_phylogeny | boolean | false | Skip IQ-TREE execution in subworkflows |
Scoary Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--scoary_traits | string | | Input trait table (CSV) to test for associations |
--scoary_p_value_cutoff | number | 0.05 | For statistical tests, genes with higher p-values will not be reported |
--scoary_correction | string | I | Apply the indicated filtration measure. (choices: I, B, BH, PW, EPW, P) |
--scoary_permute | integer | 0 | Perform N number of permutations of the significant results post-analysis |
--scoary_start_col | integer | 15 | Column in the gene presence/absence file where the individual strain information starts |
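The association step is driven by the trait table passed to --scoary_traits. The sketch below assumes Scoary's usual layout: a comma-delimited file with a blank top-left cell, one column per trait, one row per sample, and 1/0 for trait presence/absence. The sample names `sample01` to `sample04` and the file name `traits.csv` are hypothetical; in a real run the names must match those in the gene presence/absence matrix.

```shell
# Hypothetical samples and a single binary trait (1 = present, 0 = absent).
cat > traits.csv <<'EOF'
,Tetracycline_resistance
sample01,1
sample02,0
sample03,1
sample04,0
EOF

# Only attempt the run when bactopia is available on this machine.
if command -v bactopia >/dev/null 2>&1; then
    bactopia --wf pangenome \
        --bactopia /path/to/your/bactopia/results \
        --scoary_traits traits.csv \
        --scoary_p_value_cutoff 0.05
fi
```

With a table like this, results would be expected under `scoary/` as one `<trait>.results.csv` file per trait column, matching the layout shown in the expected output files.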
Filtering Parameters
Use these parameters to specify which samples to include or exclude.
| Parameter | Type | Default | Description |
|---|---|---|---|
--include | string | | A text file containing sample names (one per line) to include in the analysis |
--exclude | string | | A text file containing sample names (one per line) to exclude from the analysis |
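Both filters take a plain text file with one sample name per line. A minimal sketch, assuming two hypothetical samples (`sample03`, `sample07`) that should be left out and a hypothetical file name `exclude.txt`:

```shell
# Hypothetical sample names to leave out of the pangenome, one per line.
printf '%s\n' 'sample03' 'sample07' > exclude.txt

# Only attempt the run when bactopia is available on this machine.
if command -v bactopia >/dev/null 2>&1; then
    bactopia --wf pangenome \
        --bactopia /path/to/your/bactopia/results \
        --exclude exclude.txt
fi
```

The same file layout works for --include when it is easier to list the samples to keep.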
Optional Parameters
These optional parameters can be useful in certain settings.
| Parameter | Type | Default | Description |
|---|---|---|---|
--outdir | string | bactopia | Base directory to write results to |
--skip_compression | boolean | false | Output files will not be compressed |
--datasets | string | | The path to cache datasets to |
--keep_all_files | boolean | false | Keeps all analysis files created |
Max Job Request Parameters
Set the top limit for requested resources for any single job.
| Parameter | Type | Default | Description |
|---|---|---|---|
--max_retry | integer | 3 | Maximum times to retry a process before allowing it to fail. |
--max_cpus | integer | 4 | Maximum number of CPUs that can be requested for any single job. |
--max_memory | string | 128.GB | Maximum amount of memory that can be requested for any single job. |
--max_time | string | 240.h | Maximum amount of time that can be requested for any single job. |
--max_downloads | integer | 3 | Maximum number of samples to download at a time |
Nextflow Configuration Parameters
Parameters to fine-tune your Nextflow setup.
| Parameter | Type | Default | Description |
|---|---|---|---|
--nfconfig | string | | A Nextflow compatible config file for custom profiles, loaded last and will overwrite existing variables if set. |
--publish_dir_mode | string | copy | Method used to save pipeline results to output directory. (choices: symlink, rellink, link, copy, copyNoFollow, move) |
--infodir | string | ${params.outdir}/pipeline_info | Directory to keep pipeline Nextflow logs and reports. |
--force | boolean | false | Nextflow will overwrite existing output files. |
--cleanup_workdir | boolean | false | After Bactopia is successfully executed, the work directory will be deleted. |
Institutional config options
Parameters used to describe centralized config profiles. These should not be edited.
| Parameter | Type | Default | Description |
|---|---|---|---|
--custom_config_version | string | master | Git commit id for Institutional configs. |
--custom_config_base | string | https://raw.githubusercontent.com/nf-core/configs/master | Base directory for Institutional configs. |
--config_profile_name | string | | Institutional config name. |
--config_profile_description | string | | Institutional config description. |
--config_profile_contact | string | | Institutional config contact information. |
--config_profile_url | string | | Institutional config URL link. |
Nextflow Profile Parameters
Parameters to fine-tune your Nextflow setup.
| Parameter | Type | Default | Description |
|---|---|---|---|
--condadir | string | | Directory Nextflow should use for Conda environments |
--registry | string | quay.io | Registry to pull Docker containers from. |
--datasets_cache | string | <HOME>/.bactopia/datasets | Directory where downloaded datasets should be stored. |
--singularity_cache | string | | Directory where remote Singularity images are stored. |
--singularity_pull_docker_container | boolean | | Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead. |
--force_rebuild | boolean | false | Force overwrite of existing pre-built environments. |
--queue | string | general,high-memory | Comma-separated name of the queue(s) to be used by a job scheduler (e.g. AWS Batch or SLURM) |
--cluster_opts | string | | Additional options to pass to the executor. (e.g. SLURM: '--account=my_acct_name') |
--container_opts | string | | Additional options to pass to Apptainer, Docker, or Singularity. (e.g. Singularity: '-D pwd') |
--disable_scratch | boolean | false | All intermediate files created on worker nodes will be transferred to the head node. |
Helpful Parameters
Uncommonly used parameters that might be useful.
| Parameter | Type | Default | Description |
|---|---|---|---|
--monochrome_logs | boolean | | Do not use coloured log outputs. |
--nfdir | boolean | | Print directory Nextflow has pulled Bactopia to |
--sleep_time | integer | 5 | The amount of time (seconds) Nextflow will wait after setting up datasets before execution. |
--validate_params | boolean | true | Boolean whether to validate parameters against the schema at runtime |
--help | boolean | | Display help text. |
--wf | string | bactopia | Specify which workflow or Bactopia Tool to execute |
--list_wfs | boolean | | List the available workflows and Bactopia Tools to use with '--wf' |
--show_hidden_params | boolean | | Show all params when using --help |
--help_all | boolean | | An alias for --help --show_hidden_params |
--version | boolean | | Display version text. |
Composition
This workflow uses the following subworkflows:
- clonalframeml - Detect and mask recombination events in bacterial phylogenies.
- iqtree - Construct maximum likelihood phylogenetic trees from alignments.
- ncbigenomedownload - Download bacterial genomes from NCBI's RefSeq database.
- pangenome - Perform pangenome analysis with optional core-genome phylogeny.
- prokka - Annotate bacterial genomes with functional information.
- scoary - Pan-genome wide association studies.
Citations
If you use this in your analysis, please cite the following.
- Bactopia
  Petit III RA, Read TD Bactopia - a flexible pipeline for complete analysis of bacterial genomes. mSystems 5 (2020)
- ClonalFrameML
  Didelot X, Wilson DJ ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes. PLoS Comput Biol 11(2) e1004041 (2015)
- IQ-TREE
  Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies. Mol. Biol. Evol. 32:268-274 (2015)
- ModelFinder
  Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS ModelFinder - Fast model selection for accurate phylogenetic estimates. Nat. Methods 14:587-589 (2017)
- UFBoot2
  Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS UFBoot2: Improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35:518-522 (2018)
- ncbi-genome-download
  Blin K ncbi-genome-download: Scripts to download genomes from the NCBI FTP servers (GitHub)
- Panaroo
  Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biology 21(1), 180. (2020)
- PIRATE
  Bayliss SC, Thorpe HA, Coyle NM, Sheppard SK, Feil EJ PIRATE: A fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria. Gigascience 8 (2019)
- Prokka
  Seemann T Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068-2069 (2014)
- Roary
  Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D, Keane JA, Parkhill J Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691-3693 (2015)
- Scoary
  Brynildsrud O, Bohlin J, Scheffer L, Eldholm V Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary. Genome Biol. 17:238 (2016)