pangenome

Tags: alignment core-genome pan-genome phylogeny comparative-genomics bactopia-tool

Pangenome analysis with optional core-genome phylogeny.

This Bactopia Tool creates a pangenome from GFF3 annotation files using one of three tools: Panaroo (default), PIRATE, or Roary. It generates a core-genome alignment and a gene presence/absence matrix, then calculates pairwise SNP distances. You can supplement your pangenome with completed genomes using the --species or --accessions parameters, which download genomes from RefSeq and annotate them with Prokka. A phylogeny based on the core-genome alignment is created by IQ-TREE, with optional recombination masking using ClonalFrameML. Finally, pan-genome wide association studies can be conducted using Scoary.

Usage

Bactopia CLI:

bactopia --wf pangenome \
  --bactopia /path/to/your/bactopia/results

Nextflow:

nextflow run bactopia/bactopia/workflows/bactopia-tools/pangenome/main.nf \
  --bactopia /path/to/your/bactopia/results
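Both commands above run Panaroo, the default tool. As a minimal sketch of switching tools, the helper below only prints the command line it would run. `pangenome_cmd` and its arguments are hypothetical names introduced for illustration; `--use_pirate` and `--use_roary` are the flags listed under Parameters.

```shell
#!/bin/sh
# Hypothetical helper: print the pangenome command for a chosen tool.
# Usage: pangenome_cmd <panaroo|pirate|roary> <bactopia results dir>
pangenome_cmd() {
  tool="$1"
  results="$2"
  case "$tool" in
    panaroo) flag="" ;;              # default tool, no extra flag needed
    pirate)  flag="--use_pirate" ;;
    roary)   flag="--use_roary" ;;
    *) echo "unknown tool: $tool" >&2; return 1 ;;
  esac
  # Append the flag only when one was selected.
  echo "bactopia --wf pangenome --bactopia $results${flag:+ $flag}"
}

# Print the command for review before running it yourself.
pangenome_cmd pirate /path/to/your/bactopia/results
```

Printing instead of executing keeps the sketch safe to try; once the path points at real results, the printed line can be run as is.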

Outputs

Expected Output Files

<BACTOPIA_DIR>
└── <SAMPLE_NAME>
    └── pangenome-<TIMESTAMP>
        ├── clonalframeml
        │   ├── core-genome.ML_sequence.fasta.gz
        │   ├── core-genome.em.txt
        │   ├── core-genome.emsim.txt
        │   ├── core-genome.importation_status.txt
        │   ├── core-genome.labelled_tree.newick
        │   ├── core-genome.position_cross_reference.txt.gz
        │   └── logs
        │       ├── nf.command.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        ├── core-genome.distance.tsv
        ├── core-genome.masked.aln.gz
        ├── core-genome.masked.distance.tsv
        ├── core-genome.treefile
        ├── iqtree
        │   ├── core-genome.alninfo.gz
        │   ├── core-genome.bionj
        │   ├── core-genome.ckp.gz
        │   ├── core-genome.contree
        │   ├── core-genome.iqtree
        │   ├── core-genome.log
        │   ├── core-genome.mldist
        │   ├── core-genome.splits.nex
        │   ├── core-genome.ufboot
        │   └── logs
        │       ├── nf.command.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        ├── iqtree-fast
        │   ├── logs
        │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │   └── versions.yml
        │   ├── roary.bionj
        │   ├── roary.ckp.gz
        │   ├── roary.iqtree
        │   ├── roary.log
        │   ├── roary.mldist
        │   ├── roary.model.gz
        │   └── roary.treefile
        ├── nf-reports
        │   ├── pangenome-dag.dot
        │   ├── pangenome-report.html
        │   └── pangenome-timeline.html
        ├── roary
        │   ├── accessory.header.embl
        │   ├── accessory.tab
        │   ├── accessory_binary_genes.fa.gz
        │   ├── accessory_binary_genes.fa.newick
        │   ├── accessory_graph.dot
        │   ├── blast_identity_frequency.Rtab
        │   ├── clustered_proteins
        │   ├── core_accessory.header.embl
        │   ├── core_accessory.tab
        │   ├── core_accessory_graph.dot
        │   ├── core_alignment_header.embl
        │   ├── core_gene_alignment.aln.gz
        │   ├── gene_presence_absence.Rtab
        │   ├── gene_presence_absence.csv
        │   ├── logs
        │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │   └── versions.yml
        │   ├── number_of_conserved_genes.Rtab
        │   ├── number_of_genes_in_pan_genome.Rtab
        │   ├── number_of_new_genes.Rtab
        │   ├── number_of_unique_genes.Rtab
        │   ├── pan_genome_reference.fa.gz
        │   └── summary_statistics.txt
        ├── roary.aln.gz
        ├── scoary
        │   ├── Bogus_trait.results.csv
        │   ├── Tetracycline_resistance.results.csv
        │   └── logs
        │       ├── nf.command.{begin,err,log,out,run,sh,trace}
        │       ├── scoary.log
        │       └── versions.yml
        ├── snpdists
        │   └── logs
        │       ├── nf.command.{begin,err,log,out,run,sh,trace}
        │       └── versions.yml
        └── snpdists-masked
            └── logs
                ├── nf.command.{begin,err,log,out,run,sh,trace}
                └── versions.yml

Pangenome Results

File	Description
`*.aln`	Core-genome alignment file containing genes present across all input genomes
`*.csv`	Gene presence/absence matrix showing which genes are present in each genome
`*.tsv`	SNP distance matrix between all samples
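
The SNP distance matrix is often easier to scan as a list of pairs. The sketch below assumes the matrix is square and tab-separated, with a header row of sample names and rows in the same order as the columns. The file `toy.distance.tsv`, its sample names, and its values are made up for illustration, and `pairwise_distances` is a hypothetical helper.

```shell
#!/bin/sh
# Toy stand-in for core-genome.distance.tsv (made-up names and values).
{
  printf 'snp-dists\tsampleA\tsampleB\tsampleC\n'
  printf 'sampleA\t0\t12\t40\n'
  printf 'sampleB\t12\t0\t35\n'
  printf 'sampleC\t40\t35\t0\n'
} > toy.distance.tsv

# Print each unordered pair once (upper triangle), closest pairs first.
pairwise_distances() {
  awk -F'\t' -v OFS='\t' '
    NR == 1 { for (i = 2; i <= NF; i++) name[i] = $i; next }
    # Row NR sits on the diagonal at column NR, so start one column to the right.
    { for (i = NR + 1; i <= NF; i++) print $1, name[i], $i }
  ' "$1" | sort -t "$(printf '\t')" -k3,3n
}

pairwise_distances toy.distance.tsv
```

The same function could be pointed at `core-genome.masked.distance.tsv` to compare distances before and after recombination masking, provided that file follows the same layout.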

Phylogeny Results

Note: Only created if --skip_phylogeny is not enabled

File	Description
`*.treefile`	Maximum likelihood phylogenetic tree in Newick format
`*.iqtree`	IQ-Tree analysis report with model selection and support values
`*.log`	IQ-Tree execution log
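
A quick sanity check on a Newick tree is to confirm that it has as many tips as there were input genomes. The sketch below relies on the fact that a tree with N tips contains N-1 commas, and assumes sample names contain no commas. `count_tips` is a hypothetical helper and `toy.treefile` is a made-up tree standing in for `core-genome.treefile`.

```shell
#!/bin/sh
# Hypothetical helper: count tips in a Newick file by counting commas.
count_tips() {
  commas=$(tr -cd ',' < "$1" | wc -c | tr -d ' ')
  echo $((commas + 1))
}

# Toy tree with four tips (names and branch lengths are made up).
printf '(sampleA:0.1,sampleB:0.2,(sampleC:0.3,sampleD:0.4):0.5);\n' > toy.treefile

count_tips toy.treefile   # prints 4
```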

Recombination Analysis

Note: Only created if --skip_recombination is not enabled

File	Description
`*.masked.aln`	Core-genome alignment with recombination regions masked

Association Analysis

Note: Only created if --scoary_traits is specified

File	Description
`scoary/*`	Scoary association analysis results and plots

Panaroo Results

Note: Only created when Panaroo is selected as the pangenome tool

File	Description
`panaroo/*`	Panaroo-specific output files including graph and statistics

PIRATE Results

Note: Only created when PIRATE is selected as the pangenome tool

File	Description
`pirate/*`	PIRATE-specific output files including gene families and clusters

Roary Results

Note: Only created when Roary is selected as the pangenome tool

File	Description
`roary/*`	Roary-specific output files including gene presence/absence matrices
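
As a rough illustration of reading a presence/absence matrix, the sketch below counts how many genes are present in every sample. It assumes the layout of `gene_presence_absence.Rtab`: tab-separated, a header row of sample names, and one row per gene with 0/1 values. The toy file, gene names, and `core_gene_summary` helper are made up for illustration.

```shell
#!/bin/sh
# Toy stand-in for gene_presence_absence.Rtab (made-up genes and samples).
{
  printf 'Gene\tsampleA\tsampleB\tsampleC\n'
  printf 'geneX\t1\t1\t1\n'
  printf 'geneY\t1\t0\t1\n'
  printf 'geneZ\t0\t0\t1\n'
} > toy.Rtab

# Hypothetical helper: count all genes and those present in every sample.
core_gene_summary() {
  awk -F'\t' '
    NR == 1 { n = NF - 1; next }             # number of samples
    {
      present = 0
      for (i = 2; i <= NF; i++) present += $i
      total++
      if (present == n) core++
    }
    END { printf "genes=%d core=%d\n", total, core }
  ' "$1"
}

core_gene_summary toy.Rtab   # prints genes=3 core=1
```

This uses a strict 100% definition of core. The tools themselves apply configurable thresholds (for example `--roary_cd` or `--panaroo_core_threshold`), so their reported core size can differ from this count.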

Audit Trail

Below are files that can assist you in understanding which parameters and program versions were used.

Logs

Each process that is executed will have a folder named logs. In this folder are helpful files for you to review if the need ever arises.

Extension	Description
.begin	An empty file used to designate the process started
.err	Contains STDERR outputs from the process
.log	Contains both STDERR and STDOUT outputs from the process
.out	Contains STDOUT outputs from the process
.run	The script Nextflow uses to stage/unstage files and queue processes based on given profile
.sh	The script executed by bash for the process
.trace	The Nextflow trace report for the process
versions.yml	A YAML formatted file with program versions
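
When troubleshooting a run, a reasonable first step is to list the processes that wrote anything to STDERR. The sketch below searches a run directory for non-empty `nf.command.err` files. `stderr_logs` is a hypothetical helper, and the `toy-run` directory and its contents are made up to mirror the layout shown under Expected Output Files.

```shell
#!/bin/sh
# Hypothetical helper: list non-empty nf.command.err files under a directory.
stderr_logs() {
  find "$1" -type f -name 'nf.command.err' -size +0c | sort
}

# Toy layout: one empty and one non-empty STDERR file.
mkdir -p toy-run/iqtree/logs toy-run/scoary/logs
: > toy-run/iqtree/logs/nf.command.err
echo "warning: example message" > toy-run/scoary/logs/nf.command.err

stderr_logs toy-run   # prints toy-run/scoary/logs/nf.command.err
```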

Nextflow Reports

These Nextflow reports provide a great summary of your run. They can be used to optimize resource usage and to estimate expected costs when using cloud platforms.

Filename	Description
pangenome-dag.dot	The Nextflow DAG visualization
pangenome-report.html	The Nextflow Execution Report
pangenome-timeline.html	The Nextflow Timeline Report
pangenome-trace.txt	The Nextflow Trace report

Parameters

Required Parameters

Define where the pipeline should find input data and save output data.

Parameter	Type	Default	Description
`--bactopia`	string		The path to bactopia results to use as inputs

NCBI Genome Download Parameters

Parameter	Type	Default	Description
`--species`	string		Name of the species to download assemblies
`--accession`	string		An NCBI Assembly accession to be downloaded
`--accessions`	string		A file of NCBI Assembly accessions (one per line) to be downloaded
`--format`	string	`fasta`	Comma separated list of formats to download
`--section`	string	`refseq`	NCBI section to download
`--assembly_level`	string	`complete`	Comma separated list of assembly levels to download
`--kingdom`	string	`bacteria`	Comma separated list of kingdoms to download
`--limit`	string		Limit the number of assemblies to download
`--keep_downloads`	boolean	`false`	Save downloaded files into the bactopia-runs folder
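
The `--accessions` file holds one NCBI Assembly accession per line. The sketch below writes such a file and checks its shape before printing a command that uses it. The accessions are placeholders to be replaced with real ones, and `check_accessions` is a hypothetical helper, not part of Bactopia.

```shell
#!/bin/sh
# Placeholder accessions: replace with real NCBI Assembly accessions.
printf '%s\n' GCF_XXXXXXXXX.1 GCF_YYYYYYYYY.1 > accessions.txt

# Hypothetical check: every line must hold exactly one whitespace-free token.
check_accessions() {
  awk '
    NF != 1 { printf "line %d: expected exactly one accession\n", NR; bad = 1 }
    END { exit bad + 0 }
  ' "$1"
}

# Print the command for review only when the file looks well formed.
check_accessions accessions.txt &&
  echo "bactopia --wf pangenome --bactopia /path/to/your/bactopia/results --accessions accessions.txt"
```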

Prokka Parameters

Parameter	Type	Default	Description
`--prokka_proteins`	string		FASTA file of trusted proteins to first annotate from
`--prokka_prodigal_tf`	string		Training file to use for Prodigal
`--prokka_compliant`	boolean	`false`	Force GenBank/ENA/DDBJ compliance
`--prokka_centre`	string	`Bactopia`	Sequencing centre ID
`--prokka_coverage`	integer	`80`	Minimum coverage on query protein
`--prokka_evalue`	string	`1e-09`	Similarity e-value cut-off
`--prokka_opts`	string		Extra Prokka options in quotes.
`--prokka_debug`	boolean	`false`	Enable debug mode for Prokka

PIRATE Parameters

Parameter	Type	Default	Description
`--use_pirate`	boolean	`false`	Use PIRATE instead of panaroo in the 'pangenome' subworkflow
`--pirate_steps`	string	`50,60,70,80,90,95,98`	Percent identity thresholds to use for pangenome construction
`--pirate_features`	string	`CDS`	Comma-delimited features to use for pangenome construction
`--pirate_para_off`	boolean	`false`	Switch off paralog identification
`--pirate_z`	boolean	`false`	Retain all PIRATE intermediate files
`--pirate_pan_opt`	string		Additional arguments to pass to pangenome construction.

Roary Parameters

Parameter	Type	Default	Description
`--roary_use_prank`	boolean	`false`	Use PRANK instead of MAFFT for core gene alignment
`--use_roary`	boolean	`false`	Use Roary instead of Panaroo in the 'pangenome' subworkflow
`--roary_i`	integer	`95`	Minimum percentage identity for blastp
`--roary_cd`	integer	`99`	Percentage of isolates a gene must be in to be core
`--roary_g`	integer	`50000`	Maximum number of clusters
`--roary_s`	boolean	`false`	Do not split paralogs
`--roary_ap`	boolean	`false`	Allow paralogs in core alignment
`--roary_iv`	number	`1.5`	MCL inflation value

Panaroo Run Parameters

Parameter	Type	Default	Description
`--panaroo_mode`	string	`strict`	The stringency mode at which to run panaroo (choices: `strict`, `moderate`, `sensitive`)
`--panaroo_alignment`	string	`core`	Output alignments of core genes or all genes (choices: `core`, `pan`)
`--panaroo_aligner`	string	`mafft`	Aligner to use for core/pan genome alignment (choices: `mafft`, `prank`, `clustal`)
`--panaroo_core_threshold`	number	`0.95`	Core-genome sample threshold
`--panaroo_threshold`	number	`0.98`	Sequence identity threshold
`--panaroo_family_threshold`	number	`0.7`	Protein family sequence identity threshold
`--panaroo_len_dif_percent`	number	`0.98`	Length difference cutoff
`--panaroo_merge_paralogs`	boolean	`false`	Do not split paralogs
`--panaroo_opts`	string		Additional options to pass to panaroo

SNP-Dists Parameters

Parameter	Type	Default	Description
`--snpdists_a`	boolean	`false`	Count all differences not just [AGTC]
`--snpdists_b`	boolean	`false`	Keep top left corner cell
`--snpdists_csv`	boolean	`false`	Output CSV instead of TSV
`--snpdists_k`	boolean	`false`	Keep case, don't uppercase all letters

ClonalFrameML Parameters

Parameter	Type	Default	Description
`--clonalframeml_emsim`	integer	`100`	Number of simulations to estimate uncertainty in the EM results
`--clonalframeml_opts`	string		Extra ClonalFrameML options in quotes
`--skip_recombination`	boolean	`false`	Skip ClonalFrameML execution in subworkflows

IQ-TREE Parameters

Parameter	Type	Default	Description
`--iqtree_model`	string	`HKY`	Substitution model name
`--iqtree_bb`	integer	`1000`	Ultrafast bootstrap replicates
`--iqtree_alrt`	integer	`1000`	SH-like approximate likelihood ratio test replicates
`--iqtree_asr`	boolean	`false`	Ancestral state reconstruction by empirical Bayes
`--iqtree_opts`	string		Extra IQ-TREE options in quotes.
`--skip_phylogeny`	boolean	`false`	Skip IQ-TREE execution in subworkflows

Scoary Parameters

Parameter	Type	Default	Description
`--scoary_traits`	string		Input trait table (CSV) to test for associations
`--scoary_p_value_cutoff`	number	`0.05`	For statistical tests, genes with higher p-values will not be reported
`--scoary_correction`	string	`I`	Apply the indicated filtration measure. (choices: `I`, `B`, `BH`, `PW`, `EPW`, `P`)
`--scoary_permute`	integer	`0`	Perform N number of permutations of the significant results post-analysis
`--scoary_start_col`	integer	`15`	The column in the gene presence/absence file where individual strain info starts
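
The file given to `--scoary_traits` is a CSV trait table. As a hedged sketch, the example below assumes the layout described in Scoary's own documentation: comma-delimited, sample names in the first column, one column per trait, and values of 0 or 1 (with NA for missing). The sample names and values are made up, `check_traits` is a hypothetical helper, and the layout should be confirmed against the Scoary documentation for your version.

```shell
#!/bin/sh
# Toy trait table (made-up samples and values); the header cell above
# the sample names is left blank.
{
  printf ',Tetracycline_resistance\n'
  printf 'sampleA,1\n'
  printf 'sampleB,0\n'
  printf 'sampleC,NA\n'
} > traits.csv

# Hypothetical check: report any trait value that is not 0, 1, or NA.
check_traits() {
  awk -F',' '
    NR == 1 { next }                         # skip the header row
    {
      for (i = 2; i <= NF; i++)
        if ($i != "0" && $i != "1" && $i != "NA") {
          printf "%s: unexpected value \"%s\" in column %d\n", $1, $i, i
          bad = 1
        }
    }
    END { exit bad + 0 }
  ' "$1"
}

check_traits traits.csv && echo "traits.csv looks binary"
```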

Filtering Parameters

Use these parameters to specify which samples to include or exclude.

Parameter	Type	Default	Description
`--include`	string		A text file containing sample names (one per line) to include in the analysis
`--exclude`	string		A text file containing sample names (one per line) to exclude from the analysis
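
An include or exclude file is a plain list of sample names, one per line. The sketch below builds one from the sub-directory names of a results directory, assuming each sample has its own sub-directory named after it. `make_sample_list`, the `toy-results` directory, and the sample names are all made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print sub-directory names matching a shell pattern.
# Usage: make_sample_list <results dir> <pattern>
make_sample_list() {
  results="$1"
  pattern="$2"
  for d in "$results"/*/; do
    [ -d "$d" ] || continue
    name=$(basename "$d")
    case "$name" in
      $pattern) echo "$name" ;;   # unquoted on purpose so the pattern can glob
    esac
  done
}

# Toy results directory with three made-up samples.
mkdir -p toy-results/ST8-001 toy-results/ST8-002 toy-results/ST5-001

make_sample_list toy-results 'ST8-*' > include.txt
cat include.txt
```

The resulting `include.txt` could then be passed with `--include include.txt`; the same helper with a different pattern would produce a file for `--exclude`.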

Optional Parameters

These optional parameters can be useful in certain settings.

Parameter	Type	Default	Description
`--outdir`	string	`bactopia`	Base directory to write results to
`--skip_compression`	boolean	`false`	Output files will not be compressed
`--datasets`	string		The path to cache datasets to
`--keep_all_files`	boolean	`false`	Keeps all analysis files created

Max Job Request Parameters

Set the top limit for requested resources for any single job.

Parameter	Type	Default	Description
`--max_retry`	integer	`3`	Maximum times to retry a process before allowing it to fail.
`--max_cpus`	integer	`4`	Maximum number of CPUs that can be requested for any single job.
`--max_memory`	string	`128.GB`	Maximum amount of memory that can be requested for any single job.
`--max_time`	string	`240.h`	Maximum amount of time that can be requested for any single job.
`--max_downloads`	integer	`3`	Maximum number of samples to download at a time

Nextflow Configuration Parameters

Parameters to fine-tune your Nextflow setup.

Parameter	Type	Default	Description
`--nfconfig`	string		A Nextflow compatible config file for custom profiles. It is loaded last and will overwrite existing variables if set.
`--publish_dir_mode`	string	`copy`	Method used to save pipeline results to output directory. (choices: `symlink`, `rellink`, `link`, `copy`, `copyNoFollow`, `move`)
`--infodir`	string	`${params.outdir}/pipeline_info`	Directory to keep pipeline Nextflow logs and reports.
`--force`	boolean	`false`	Nextflow will overwrite existing output files.
`--cleanup_workdir`	boolean	`false`	After Bactopia is successfully executed, the `work` directory will be deleted.

Institutional config options

Parameters used to describe centralized config profiles. These should not be edited.

Parameter	Type	Default	Description
`--custom_config_version`	string	`master`	Git commit id for Institutional configs.
`--custom_config_base`	string	`https://raw.githubusercontent.com/nf-core/configs/master`	Base directory for Institutional configs.
`--config_profile_name`	string		Institutional config name.
`--config_profile_description`	string		Institutional config description.
`--config_profile_contact`	string		Institutional config contact information.
`--config_profile_url`	string		Institutional config URL link.

Nextflow Profile Parameters

Parameters to fine-tune your Nextflow setup.

Parameter	Type	Default	Description
`--condadir`	string		Directory Nextflow should use for Conda environments
`--registry`	string	`quay.io`	Registry to pull Docker containers from.
`--datasets_cache`	string	`<HOME>/.bactopia/datasets`	Directory where downloaded datasets should be stored.
`--singularity_cache`	string		Directory where remote Singularity images are stored.
`--singularity_pull_docker_container`	boolean		Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead.
`--force_rebuild`	boolean	`false`	Force overwrite of existing pre-built environments.
`--queue`	string	`general,high-memory`	Comma-separated name of the queue(s) to be used by a job scheduler (e.g. AWS Batch or SLURM)
`--cluster_opts`	string		Additional options to pass to the executor. (e.g. SLURM: '--account=my_acct_name')
`--container_opts`	string		Additional options to pass to Apptainer, Docker, or Singularity. (e.g. Singularity: '-D `pwd`')
`--disable_scratch`	boolean	`false`	All intermediate files created on worker nodes will be transferred to the head node.

Helpful Parameters

Uncommonly used parameters that might be useful.

Parameter	Type	Default	Description
`--monochrome_logs`	boolean		Do not use coloured log outputs.
`--nfdir`	boolean		Print directory Nextflow has pulled Bactopia to
`--sleep_time`	integer	`5`	The amount of time (seconds) Nextflow will wait after setting up datasets before execution.
`--validate_params`	boolean	`true`	Boolean whether to validate parameters against the schema at runtime
`--help`	boolean		Display help text.
`--wf`	string	`bactopia`	Specify which workflow or Bactopia Tool to execute
`--list_wfs`	boolean		List the available workflows and Bactopia Tools to use with '--wf'
`--show_hidden_params`	boolean		Show all params when using `--help`
`--help_all`	boolean		An alias for --help --show_hidden_params
`--version`	boolean		Display version text.

Composition

This workflow uses the following subworkflows:

clonalframeml - Detect and mask recombination events in bacterial phylogenies.
iqtree - Construct maximum likelihood phylogenetic trees from alignments.
ncbigenomedownload - Download bacterial genomes from NCBI's RefSeq database.
pangenome - Perform pangenome analysis with optional core-genome phylogeny.
prokka - Annotate bacterial genomes with functional information.
scoary - Pan-genome wide association studies.

Citations

If you use this in your analysis, please cite the following.

Bactopia
Petit III RA, Read TD Bactopia - a flexible pipeline for complete analysis of bacterial genomes. mSystems 5 (2020)
ClonalFrameML
Didelot X, Wilson DJ ClonalFrameML: Efficient Inference of Recombination in Whole Bacterial Genomes. PLoS Comput Biol 11(2) e1004041 (2015)
IQ-TREE
Nguyen L-T, Schmidt HA, von Haeseler A, Minh BQ IQ-TREE: A fast and effective stochastic algorithm for estimating maximum likelihood phylogenies. Mol. Biol. Evol. 32:268-274 (2015)
ModelFinder
Kalyaanamoorthy S, Minh BQ, Wong TKF, von Haeseler A, Jermiin LS ModelFinder - Fast model selection for accurate phylogenetic estimates. Nat. Methods 14:587-589 (2017)
UFBoot2
Hoang DT, Chernomor O, von Haeseler A, Minh BQ, Vinh LS UFBoot2: Improving the ultrafast bootstrap approximation. Mol. Biol. Evol. 35:518-522 (2018)
ncbi-genome-download
Blin K ncbi-genome-download: Scripts to download genomes from the NCBI FTP servers (GitHub)
Panaroo
Tonkin-Hill G, MacAlasdair N, Ruis C, Weimann A, Horesh G, Lees JA, Gladstone RA, Lo S, Beaudoin C, Floto RA, Frost SDW, Corander J, Bentley SD, Parkhill J Producing polished prokaryotic pangenomes with the Panaroo pipeline. Genome Biology 21(1), 180. (2020)
PIRATE
Bayliss SC, Thorpe HA, Coyle NM, Sheppard SK, Feil EJ PIRATE: A fast and scalable pangenomics toolbox for clustering diverged orthologues in bacteria. Gigascience 8 (2019)
Prokka
Seemann T Prokka: rapid prokaryotic genome annotation Bioinformatics 30, 2068-2069 (2014)
Roary
Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MTG, Fookes M, Falush D, Keane JA, Parkhill J Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics 31, 3691-3693 (2015)
Scoary
Brynildsrud O, Bohlin J, Scheffer L, Eldholm V Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary. Genome Biol. 17:238 (2016)

Source

View source on GitHub
