bactopia

Tags: bacteria assembly annotation amr mlst genomics pipeline named-workflow

Comprehensive bacterial analysis pipeline for complete genomic characterization.

This workflow performs end-to-end analysis including quality control, assembly, annotation, antimicrobial resistance detection, MLST typing, and optional pathogen-specific analysis through Merlin. It processes raw sequencing reads and produces a complete genomic characterization suitable for downstream analysis.

Pipeline Overview

Bactopia Workflow

Looking at the workflow overview above, it might not look like much is happening, but I can assure you that a lot is going on. The workflow is broken down into 8 steps, which are:

Gather - Collect all the data in one place
QC - Quality control of the data
Assembler - Assemble the reads into contigs
Annotator - Annotate the contigs
Sketcher - Create a sketch of the contigs, and query databases
Sequence Typing - Determine the sequence type of the contigs
Antibiotic Resistance - Determine the antibiotic resistance of the contigs and proteins
Merlin - Automatically run species-specific tools based on distance

If you are looking for a guide to get started quickly, please check out the Beginner's Guide.

Step 1 - Gather

The main purpose of the gather step is to get all the samples into a single place. This includes downloading samples from ENA/SRA or NCBI Assembly. The tools used are:

Tool	Description
art	For simulating error-free reads for an input assembly
fastq-dl	Downloading FASTQ files from ENA/SRA
ncbi-genome-download	Downloading FASTA files from NCBI Assembly

This gather step also does basic QC checks to help prevent downstream failures.

Failed Quality Checks

Filename	Description
<SAMPLE_NAME>-gzip-error.txt	Sample failed Gzip checks and was excluded from further analysis
<SAMPLE_NAME>-low-basepair-proportion-error.txt	Sample failed basepair proportion checks and was excluded from further analysis
<SAMPLE_NAME>-low-read-count-error.txt	Sample failed read count checks and was excluded from further analysis
<SAMPLE_NAME>-low-sequence-depth-error.txt	Sample failed sequenced basepair checks and was excluded from further analysis

Poor samples are excluded to prevent downstream failures

Samples that fail any of the QC checks will be excluded from further analysis. Those samples will generate a *-error.txt file with the error message. Excluding these samples prevents downstream failures that cause the whole workflow to fail.
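As a rough illustration, the sketch below collects the *-error.txt files from an output directory and groups the sample names by failure type. The suffixes come from the table above; the function name and the assumption that the error files sit somewhere under the output directory (searched recursively) are illustrative, not something Bactopia defines.

```python
from collections import defaultdict
from pathlib import Path

# Suffixes taken from the "Failed Quality Checks" table for the gather step.
ERROR_SUFFIXES = (
    "-gzip-error.txt",
    "-low-basepair-proportion-error.txt",
    "-low-read-count-error.txt",
    "-low-sequence-depth-error.txt",
)


def find_failed_samples(outdir):
    """Map each error suffix to the sample names that produced it."""
    failures = defaultdict(list)
    # Assumption: error files live somewhere under outdir, so search recursively.
    for path in sorted(Path(outdir).rglob("*-error.txt")):
        for suffix in ERROR_SUFFIXES:
            if path.name.endswith(suffix):
                # Strip the suffix to recover the sample name.
                failures[suffix].append(path.name[: -len(suffix)])
                break
    return dict(failures)
```

Calling find_failed_samples("results/") returns a dictionary keyed by suffix, which makes it easy to see which check excluded each sample.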

Example Error: Input FASTQ(s) failed Gzip checks

If the input FASTQ(s) fail the Gzip test, the sample will be excluded from further analysis.

Example Text from <SAMPLE_NAME>-gzip-error.txt:
<SAMPLE_NAME> FASTQs failed Gzip tests. Please check the input FASTQs. Further analysis is discontinued.

Example Error: Input FASTQs have a disproportionate number of basepairs

If the paired-end FASTQs for a sample differ disproportionately in sequenced basepairs between R1 and R2, the sample will be excluded from further analysis. You can adjust this minimum proportion using the --min_proportion parameter.

Example Text from <SAMPLE_NAME>-low-basepair-proportion-error.txt:
<SAMPLE_NAME> FASTQs failed to meet the minimum shared basepairs. They shared Y basepairs, with R1 having A bp and R2 having B bp. Further analysis is discontinued.
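To make the proportion check concrete, here is a minimal sketch. It assumes the proportion is the smaller basepair total divided by the larger one; the document does not state the exact formula Bactopia uses, so treat the function and its name as hypothetical. The default of 0.5 matches the --min_proportion default listed under Gather Parameters.

```python
def passes_proportion_check(r1_bp, r2_bp, min_proportion=0.5):
    """Return True when R1 and R2 are balanced enough in sequenced basepairs.

    Assumption: proportion = smaller total / larger total.
    """
    if r1_bp <= 0 or r2_bp <= 0:
        return False
    proportion = min(r1_bp, r2_bp) / max(r1_bp, r2_bp)
    return proportion >= min_proportion
```

Under this assumption, a sample with 1,000,000 bp in R1 and 400,000 bp in R2 has a proportion of 0.4 and would be excluded at the default setting.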

Example Error: Input FASTQ(s) have too few reads

If the input FASTQ(s) for a sample have fewer than the minimum required reads, the sample will be excluded from further analysis. You can adjust this minimum read count using the --min_reads parameter.

Example Text from <SAMPLE_NAME>-low-read-count-error.txt:
<SAMPLE_NAME> FASTQ(s) contain X total reads. This does not exceed the required minimum Y read count. Further analysis is discontinued.

Example Error: Input FASTQ(s) have too few sequenced basepairs

If the input FASTQ(s) for a sample fail to meet the minimum number of sequenced basepairs, the sample will be excluded from further analysis. You can adjust this minimum basepair count using the --min_basepairs parameter.

Example Text from <SAMPLE_NAME>-low-sequence-depth-error.txt:
<SAMPLE_NAME> FASTQ(s) contain X total basepairs. This does not exceed the required minimum Y bp. Further analysis is discontinued.
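If you want to check a FASTQ before submitting it, a sketch like the following counts reads and basepairs and compares them with the defaults of --min_reads and --min_basepairs listed under Gather Parameters. The function names are hypothetical, the FASTQ is assumed to use four lines per record, and whether Bactopia compares with "greater than" or "greater than or equal" is not stated here, so the comparison is an assumption.

```python
import gzip

MIN_READS = 7472          # default of --min_reads in this document
MIN_BASEPAIRS = 2241820   # default of --min_basepairs in this document


def count_fastq(path):
    """Return (reads, basepairs) for a gzip-compressed, four-line-per-record FASTQ."""
    reads = basepairs = 0
    with gzip.open(path, "rt") as handle:
        for line_number, line in enumerate(handle):
            if line_number % 4 == 1:  # the sequence line of each record
                reads += 1
                basepairs += len(line.strip())
    return reads, basepairs


def precheck(path, min_reads=MIN_READS, min_basepairs=MIN_BASEPAIRS):
    """Compare a FASTQ's totals with the minimum thresholds."""
    reads, basepairs = count_fastq(path)
    return {
        "reads": reads,
        "basepairs": basepairs,
        "enough_reads": reads >= min_reads,          # assumption: inclusive
        "enough_basepairs": basepairs >= min_basepairs,
    }
```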

Step 2 - QC

The qc module uses a variety of tools to perform quality control on Illumina and Oxford Nanopore reads. The tools used are:

Tool	Technology	Description
bbtools	Illumina	A suite of tools for manipulating reads
fastp	Illumina	A tool designed to provide fast all-in-one preprocessing for FastQ files
fastqc	Illumina	A quality control tool for high throughput sequence data
fastq_scan	Nanopore	A tool for quickly scanning FASTQ files
lighter	Illumina	A tool for correcting sequencing errors in Illumina reads
NanoPlot	Nanopore	A tool for plotting long read sequencing data
nanoq	Nanopore	A tool for calculating quality metrics for Oxford Nanopore reads
porechop	Nanopore	A tool for removing adapters from Oxford Nanopore reads
rasusa	Nanopore	Randomly subsample sequencing reads to a specified coverage

Similar to the gather step, the qc step will also stop samples that fail to meet basic QC checks from continuing downstream.

Failed Quality Checks

Filename	Description
<SAMPLE_NAME>.error-fastq.gz	A gzipped FASTQ file of reads that failed QC
<SAMPLE_NAME>-low-read-count-error.txt	Sample failed read count checks and was excluded from further analysis
<SAMPLE_NAME>-low-sequence-coverage-error.txt	Sample failed sequenced coverage checks and was excluded from further analysis
<SAMPLE_NAME>-low-sequence-depth-error.txt	Sample failed sequenced basepair checks and was excluded from further analysis

Example Error: After QC, too few reads remain

If, after cleaning reads, a sample has fewer than the minimum required reads, the sample will be excluded from further analysis. You can adjust this minimum read count using the --min_reads parameter.

Example Error: After QC, too little sequence coverage remains

If, after cleaning reads, a sample fails to meet the minimum sequence coverage required, the sample will be excluded from further analysis. You can adjust this minimum coverage using the --min_coverage parameter.

Note: This check is only performed when a genome size is available.

Example Text from <SAMPLE_NAME>-low-sequence-coverage-error.txt:
After QC, <SAMPLE_NAME> FASTQ(s) contain X total basepairs. This does not exceed the required minimum Y bp (Zx coverage). Further analysis is discontinued.
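The relationship between basepairs, genome size, and coverage in that message can be worked through with a small sketch. It assumes coverage is simply total basepairs divided by genome size, which is the usual definition but is not spelled out in this document; the function names are hypothetical, and the default of 10 matches --min_coverage.

```python
def estimated_coverage(total_bp, genome_size):
    """Coverage as total sequenced basepairs divided by genome size."""
    return total_bp / genome_size


def min_bp_for_coverage(genome_size, min_coverage=10):
    """Basepairs needed to reach min_coverage for a given genome size."""
    return genome_size * min_coverage


# Worked example: a 2.8 Mbp genome needs 28,000,000 bp for 10x coverage,
# so 21,000,000 bp after QC corresponds to 7.5x and would fall short.
required = min_bp_for_coverage(2_800_000)
observed = estimated_coverage(21_000_000, 2_800_000)
```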

Example Error: After QC, too few sequenced basepairs remain

If, after cleaning reads, a sample fails to meet the minimum number of sequenced basepairs, the sample will be excluded from further analysis. You can adjust this minimum basepair count using the --min_basepairs parameter.

Step 3 - Assembler

The assembler module uses a variety of assembly tools to create an assembly of Illumina and Oxford Nanopore reads. The tools used are:

Tool	Description
Dragonflye	Assembly of Oxford Nanopore reads, as well as hybrid assembly with short-read polishing
Shovill	Assembly of Illumina paired-end reads
Shovill-SE	Assembly of Illumina single-end reads
Unicycler	Hybrid assembly, using short-reads first then long-reads

Summary statistics for each assembly are generated using assembly-scan.

Prefer --short_polish over --hybrid with recent ONT sequencing

Using Unicycler (--hybrid) to create a hybrid assembly works well when you have low-coverage, noisy long reads. However, if you are using recent ONT sequencing, you likely have high coverage, and the --short_polish method will likely yield better results (and be faster!) than --hybrid.

Failed Quality Checks

Filename	Description
-assembly-error.txt	Sample failed assembly checks and excluded from further analysis

Example Error: Assembled Successfully, but 0 Contigs

If a sample assembles successfully, but 0 contigs are formed, the sample will be excluded from further analysis.

Example Text from <SAMPLE_NAME>-assembly-error.txt:
<SAMPLE_NAME> assembled successfully, but 0 contigs were formed. Please investigate <SAMPLE_NAME> to determine a cause (e.g. metagenomic, contaminants, etc...) for this outcome. Further assembly-based analysis of <SAMPLE_NAME> will be discontinued.

Example Error: Assembled successfully, but poor assembly size

If your sample assembles successfully, but the assembly size is less than the minimum allowed genome size, the sample will be excluded from further analysis. You can adjust this minimum size using the --min_genome_size parameter.

Example Text from <SAMPLE_NAME>-assembly-error.txt:
<SAMPLE_NAME> assembled size (000 bp) is less than the minimum allowed genome size (000 bp). If this is unexpected, please investigate <SAMPLE_NAME> to determine a cause (e.g. metagenomic, contaminants, etc...) for the poor assembly. Otherwise, adjust the --min_genome_size parameter to fit your need. Further assembly-based analysis of <SAMPLE_NAME> will be discontinued.
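The two assembly checks above can be approximated outside the workflow with a sketch like this one. It sums contig lengths from a gzip-compressed FASTA and compares the total with the --min_genome_size default of 100000 from the Gather Parameters table. The function names and status strings are hypothetical and do not reproduce Bactopia's own implementation.

```python
import gzip


def assembly_size(fasta_gz):
    """Return (contig_count, total_bp) for a gzip-compressed FASTA."""
    contigs = total_bp = 0
    with gzip.open(fasta_gz, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                contigs += 1  # each header line starts a new contig
            else:
                total_bp += len(line.strip())
    return contigs, total_bp


def assembly_status(contigs, total_bp, min_genome_size=100000):
    """Classify an assembly the way the two example errors describe."""
    if contigs == 0:
        return "no contigs"
    if total_bp < min_genome_size:
        return "below minimum genome size"
    return "ok"
```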

Step 4 - Annotator

The annotator step uses either Prokka (default) or Bakta (via --use_bakta) to annotate assembled contigs with functional information including genes, proteins, rRNA, tRNA, and other genomic features.

Step 5 - Sketcher

The sketcher module uses Mash and Sourmash to create sketches and query RefSeq and GTDB.

Step 6 - Sequence Typing

The mlst step uses mlst to scan assemblies against PubMLST typing schemes and determine the sequence type.

Step 7 - Antibiotic Resistance

The amrfinderplus step uses AMRFinder+ to identify antimicrobial resistance genes and point mutations from both assembled contigs and annotated protein sequences.

Step 8 - Merlin

The merlin step automatically selects and runs species-specific typing tools based on Mash distance results from the sketcher step. Enable with --ask_merlin. See the Pathogen-Specific Analysis output section below for the full list of supported organisms and tools.
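The distance-based selection can be pictured with the sketch below. It assumes the input follows the standard five-column, tab-separated mash dist layout (reference, query, distance, p-value, shared hashes) and keeps references at or below a cutoff of 0.1, the --merlin_dist default listed under mashdist Parameters. How Merlin maps a close reference to a specific tool is not covered here, and the function name is hypothetical.

```python
import csv
import io


def close_references(mash_dist_text, max_dist=0.1):
    """Return (reference, distance) pairs at or below max_dist, closest first."""
    hits = []
    for row in csv.reader(io.StringIO(mash_dist_text), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        distance = float(row[2])  # third column is the Mash distance
        if distance <= max_dist:
            hits.append((row[0], distance))
    return sorted(hits, key=lambda hit: hit[1])
```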

Usage

Bactopia CLI:

bactopia \
  --input samples.csv \
  --outdir results/

Nextflow:

nextflow run bactopia/bactopia \
  --input samples.csv \
  --outdir results/

Outputs

Expected Output Files

<BACTOPIA_DIR>
├── <SAMPLE_NAME>
│   ├── main
│   │   ├── annotator
│   │   │   └── prokka
│   │   │       ├── <SAMPLE_NAME>-blastdb.tar.gz
│   │   │       ├── <SAMPLE_NAME>.faa.gz
│   │   │       ├── <SAMPLE_NAME>.ffn.gz
│   │   │       ├── <SAMPLE_NAME>.fna.gz
│   │   │       ├── <SAMPLE_NAME>.fsa.gz
│   │   │       ├── <SAMPLE_NAME>.gbk.gz
│   │   │       ├── <SAMPLE_NAME>.gff.gz
│   │   │       ├── <SAMPLE_NAME>.sqn.gz
│   │   │       ├── <SAMPLE_NAME>.tbl.gz
│   │   │       ├── <SAMPLE_NAME>.tsv
│   │   │       ├── <SAMPLE_NAME>.txt
│   │   │       └── logs
│   │   │           ├── <SAMPLE_NAME>.err
│   │   │           ├── <SAMPLE_NAME>.log
│   │   │           ├── nf.command.{begin,err,log,out,run,sh,trace}
│   │   │           └── versions.yml
│   │   ├── assembler
│   │   │   ├── <SAMPLE_NAME>.fna.gz
│   │   │   ├── <SAMPLE_NAME>.tsv
│   │   │   ├── logs
│   │   │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
│   │   │   │   ├── shovill-se.log
│   │   │   │   └── versions.yml
│   │   │   └── supplemental
│   │   │       ├── illumina.txt
│   │   │       └── shovill.corrections
│   │   ├── gather
│   │   │   ├── <SAMPLE_NAME>-meta.tsv
│   │   │   └── logs
│   │   │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│   │   │       └── versions.yml
│   │   ├── qc
│   │   │   ├── <SAMPLE_NAME>_SE.fastq.gz
│   │   │   ├── logs
│   │   │   │   ├── <SAMPLE_NAME>-fastp.log
│   │   │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
│   │   │   │   └── versions.yml
│   │   │   └── supplemental
│   │   │       ├── <SAMPLE_NAME>.fastp.html
│   │   │       ├── <SAMPLE_NAME>.fastp.json
│   │   │       ├── <SAMPLE_NAME>_SE-final.json
│   │   │       ├── <SAMPLE_NAME>_SE-final_fastqc.html
│   │   │       ├── <SAMPLE_NAME>_SE-final_fastqc.zip
│   │   │       ├── <SAMPLE_NAME>_SE-original.json
│   │   │       ├── <SAMPLE_NAME>_SE-original_fastqc.html
│   │   │       └── <SAMPLE_NAME>_SE-original_fastqc.zip
│   │   └── sketcher
│   │       ├── <SAMPLE_NAME>-k21.msh
│   │       ├── <SAMPLE_NAME>-k31.msh
│   │       ├── <SAMPLE_NAME>-mash-refseq88-k21.txt
│   │       ├── <SAMPLE_NAME>-sourmash-gtdb-rs207-k31.txt
│   │       ├── <SAMPLE_NAME>.sig
│   │       └── logs
│   │           ├── nf.command.{begin,err,log,out,run,sh,trace}
│   │           └── versions.yml
│   └── tools
│       ├── amrfinderplus
│       │   ├── <SAMPLE_NAME>.tsv
│       │   └── logs
│       │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │       └── versions.yml
│       └── mlst
│           ├── <SAMPLE_NAME>.tsv
│           └── logs
│               ├── nf.command.{begin,err,log,out,run,sh,trace}
│               └── versions.yml
└── bactopia-runs
    └── bactopia-<TIMESTAMP>
        ├── merged-results
        │   ├── amrfinderplus.tsv
        │   ├── assembly-scan.tsv
        │   ├── logs
        │   │   ├── amrfinderplus-concat
        │   │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │   │   └── versions.yml
        │   │   ├── assembly-scan-concat
        │   │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │   │   └── versions.yml
        │   │   ├── meta-concat
        │   │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │   │   └── versions.yml
        │   │   └── mlst-concat
        │   │       ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │       └── versions.yml
        │   ├── meta.tsv
        │   └── mlst.tsv
        └── nf-reports
            ├── bactopia-dag.dot
            ├── bactopia-report.html
            └── bactopia-timeline.html
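Given the layout above, a short sketch can enumerate samples and locate a per-sample result file. It relies only on what the tree shows: sample folders sit next to bactopia-runs, and MLST results are at tools/mlst/<SAMPLE_NAME>.tsv. The function names are hypothetical.

```python
from pathlib import Path


def list_samples(bactopia_dir):
    """Return sample directory names, skipping the run-level bactopia-runs folder."""
    return sorted(
        entry.name
        for entry in Path(bactopia_dir).iterdir()
        if entry.is_dir() and entry.name != "bactopia-runs"
    )


def mlst_results(bactopia_dir):
    """Map each sample name to its MLST TSV path when that file exists."""
    found = {}
    for sample in list_samples(bactopia_dir):
        path = Path(bactopia_dir, sample, "tools", "mlst", f"{sample}.tsv")
        if path.is_file():
            found[sample] = path
    return found
```

Samples that were excluded before typing simply do not appear in the mapping returned by mlst_results.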

Quality Control

File	Description
`supplemental/*_fastqc.*`	FastQC quality control reports for raw and cleaned reads
`supplemental/*-NanoPlot.*`	NanoPlot reports for Nanopore reads
`supplemental/*.fastp.*`	Fastp quality reports (when applicable)
`supplemental/*-original.json`	Quality metrics for original reads
`supplemental/*-final.json`	Quality metrics for final reads

Assembly

File	Description
`*.fasta`	Assembled genome sequences
`assembly-stats.tsv`	Assembly quality metrics
`merged-assembly-stats.tsv`	Consolidated assembly statistics

Annotation

Note: Output format depends on the chosen annotation tool (Bakta or Prokka).

File	Description
`*.gff.gz`	Genome annotation in GFF3 format (compressed)
`*.gbk.gz`	Genome annotation in GenBank format (compressed)
`*.faa.gz`	Protein sequences (compressed)
`*.fna.gz`	Nucleotide sequences from annotation (compressed)
`*.ffn.gz`	Feature nucleotide sequences (compressed)
`annotation.tsv`	Annotation summary tables
`blastdb.*`	BLAST database created from annotation

Typing

File	Description
`mlst.tsv`	MLST sequence type results
`merged-mlst.tsv`	Consolidated MLST results

Antimicrobial Resistance

File	Description
`amrfinderplus.tsv`	AMR gene detection results
`amrfinderplus.mutation.tsv`	AMR point mutation results
`merged-amrfinderplus.tsv`	Consolidated AMR results

Comparative Analysis

File	Description
`*-k21.msh`	Mash sketch files (k=21)
`*-k31.msh`	Mash sketch files (k=31)
`*-mash-refseq88-*.txt`	Mash screening results against RefSeq
`*.sig`	Sourmash signatures
`*-sourmash-*.txt`	Sourmash classification results

Pathogen-Specific Analysis

Note: Only created if --ask_merlin is enabled.

File	Description
`merlin/clermontyping/*`	E. coli phylogroup typing
`merlin/ectyper/*`	E. coli serotyping (O and H antigens)
`merlin/shigatyper/*`	Shigella serotype prediction
`merlin/shigapass/*`	Shigella serotype prediction and Shigella/EIEC differentiation
`merlin/shigeifinder/*`	Shigella and EIEC detection
`merlin/stecfinder/*`	STEC detection and typing
`merlin/emmtyper/*`	S. pyogenes emm typing
`merlin/hicap/*`	H. influenzae capsular typing
`merlin/hpsuissero/*`	H. parasuis serotyping
`merlin/kleborate/*`	Klebsiella species typing
`merlin/staphtyper/*`	S. aureus spa typing
`merlin/agrvate/*`	S. aureus agr typing
`merlin/sccmec/*`	S. aureus SCCmec typing

Merged Results

Note: Run-level aggregated results from all samples.

File	Description
`samplesheet.tsv`	Sample metadata and quality metrics

Audit Trail

Below are files that can assist you in understanding which parameters and program versions were used.

Logs

Each process that is executed will have a folder named logs. In this folder are helpful files for you to review if the need ever arises.

Extension	Description
.begin	An empty file used to designate the process started
.err	Contains STDERR outputs from the process
.log	Contains both STDERR and STDOUT outputs from the process
.out	Contains STDOUT outputs from the process
.run	The script Nextflow uses to stage/unstage files and queue processes based on given profile
.sh	The script executed by bash for the process
.trace	The Nextflow trace report for the process
versions.yml	A YAML formatted file with program versions
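When something looks wrong, one quick way to triage is to list the processes that wrote anything to STDERR. The sketch below walks an output directory for nf.command.err files, as named in the output tree, and reports the folders where that file is not empty. The function name is hypothetical, and a non-empty STDERR does not by itself mean a process failed, since many tools log progress there.

```python
from pathlib import Path


def processes_with_stderr(bactopia_dir):
    """Return folders whose nf.command.err file is not empty."""
    flagged = []
    for err_file in sorted(Path(bactopia_dir).rglob("nf.command.err")):
        if err_file.stat().st_size > 0:
            flagged.append(err_file.parent)
    return flagged
```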

Nextflow Reports

These Nextflow reports provide a great summary of your run. They can be used to optimize resource usage and estimate expected costs if using cloud platforms.

Filename	Description
bactopia-dag.dot	The Nextflow DAG visualization
bactopia-report.html	The Nextflow Execution Report
bactopia-timeline.html	The Nextflow Timeline Report
bactopia-trace.txt	The Nextflow Trace report
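For a quick overview of a run, the trace report can be summarized with a few lines of code. This sketch assumes bactopia-trace.txt uses Nextflow's usual tab-separated trace layout with a header row that includes a status column; the function name is hypothetical.

```python
import csv
from collections import Counter


def status_counts(trace_path):
    """Count tasks per status value in a tab-separated Nextflow trace file."""
    with open(trace_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return Counter(row["status"] for row in reader)
```

A result dominated by one status with a handful of others is a cue to look at the logs folders of the affected processes.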

Parameters

Required Parameters

The following parameters are how you will provide either local or remote samples to be processed by Bactopia.

Parameter	Type	Default	Description
`--samples`	string		A FOFN (via bactopia prepare) with sample names and paths to FASTQ/FASTAs to process

`--r1`	string		First set of compressed (gzip) Illumina paired-end FASTQ reads (requires --r2 and --sample)
`--r2`	string		Second set of compressed (gzip) Illumina paired-end FASTQ reads (requires --r1 and --sample)
`--se`	string		Compressed (gzip) Illumina single-end FASTQ reads (requires --sample)
`--ont`	string		Compressed (gzip) Oxford Nanopore FASTQ reads (requires --sample)
`--hybrid`	boolean	`false`	Create hybrid assembly using Unicycler. (requires --r1, --r2, --ont and --sample)
`--short_polish`	boolean	`false`	Create hybrid assembly from long-read assembly and short read polishing. (requires --r1, --r2, --ont and --sample)
`--sample`	string		Sample name to use for the input sequences

`--accessions`	string		A file containing ENA/SRA Experiment accessions or NCBI Assembly accessions to be processed
`--accession`	string		A single ENA/SRA Experiment accession or NCBI Assembly accession to be processed

`--assembly`	string		An assembled genome in compressed FASTA format. (requires --sample)
`--check_samples`	boolean	`false`	Validate the input FOFN provided by --samples

AMRFinder+ Parameters

Parameter	Type	Default	Description
`--amrfinderplus_ident_min`	number	`-1`	Minimum proportion of identical amino acids in alignment for hit (0..1)
`--amrfinderplus_coverage_min`	number	`0.5`	Minimum coverage of the reference protein (0..1)
`--amrfinderplus_organism`	string		Taxonomy group to run additional screens against
`--amrfinderplus_translation_table`	integer	`11`	NCBI genetic code for translated BLAST
`--amrfinderplus_noplus`	boolean	`false`	Disable running AMRFinder+ with the --plus option
`--amrfinderplus_report_common`	boolean	`false`	Report proteins common to a taxonomy group
`--amrfinderplus_report_all_equal`	boolean	`false`	Report all equally-scoring BLAST and HMM matches
`--amrfinderplus_opts`	string		Extra AMRFinder+ options in quotes.
`--amrfinderplus_db`	string		A custom AMRFinder+ database to use, either a tarball or a folder

csvtk concat Parameters

Parameter	Type	Default	Description
`--csvtk_concat_opts`	string		Extra csvtk concat options in quotes

Assembler Parameters

Parameter	Type	Default	Description
`--shovill_assembler`	string	`skesa`	Assembler to be used by Shovill (choices: `skesa`, `megahit`, `spades`, `velvet`)
`--dragonflye_assembler`	string	`flye`	Assembler to be used by Dragonflye (choices: `flye`, `miniasm`, `raven`)
`--use_unicycler`	boolean		Use unicycler for paired end assembly
`--min_contig_len`	integer	`500`	Minimum contig length <0=AUTO>
`--min_contig_cov`	integer	`2`	Minimum contig coverage <0=AUTO>
`--contig_namefmt`	string		Format of contig FASTA IDs in 'printf' style
`--shovill_opts`	string		Extra assembler options in quotes for Shovill
`--shovill_kmers`	string		K-mers to use <blank=AUTO>
`--dragonflye_opts`	string		Extra assembler options in quotes for Dragonflye
`--trim`	boolean		Enable adaptor trimming
`--no_stitch`	boolean		Disable read stitching for paired-end reads
`--no_corr`	boolean		Disable post-assembly correction
`--unicycler_mode`	string	`normal`	Bridging mode used by Unicycler (choices: `conservative`, `normal`, `bold`)
`--min_component_size`	integer	`1000`	Graph components smaller than this size (bp) will be removed from the final graph
`--min_dead_end_size`	integer	`1000`	Graph dead ends smaller than this size (bp) will be removed from the final graph
`--nanohq`	boolean	`false`	For Flye, use '--nano-hq' instead of --nano-raw
`--medaka_model`	string		The model to use for Medaka polishing
`--medaka_rounds`	integer	`0`	The number of Medaka polishing rounds to conduct
`--racon_rounds`	integer	`1`	The number of Racon polishing rounds to conduct
`--no_polish`	boolean		Skip the assembly polishing step
`--no_miniasm`	boolean		Skip miniasm+Racon bridging
`--no_rotate`	boolean		Do not rotate completed replicons to start at a standard gene
`--reassemble`	boolean	`false`	If reads were simulated, they will be used to create a new assembly.
`--polypolish_rounds`	integer	`1`	Number of polishing rounds to conduct with Polypolish for short read polishing
`--pilon_rounds`	integer	`0`	Number of polishing rounds to conduct with Pilon for short read polishing

Gather Parameters

Parameter	Type	Default	Description
`--skip_fastq_check`	boolean		Skip minimum requirement checks for input FASTQs
`--min_basepairs`	integer	`2241820`	The minimum number of basepairs required to continue downstream analyses.
`--min_reads`	integer	`7472`	The minimum number of reads required to continue downstream analyses.
`--min_coverage`	integer	`10`	The minimum amount of coverage required to continue downstream analyses.
`--min_proportion`	number	`0.5`	The minimum proportion of basepairs for paired-end reads to continue downstream analyses.
`--min_genome_size`	integer	`100000`	The minimum estimated genome size allowed for the input sequence to continue downstream analyses.
`--max_genome_size`	integer	`18040666`	The maximum estimated genome size allowed for the input sequence to continue downstream analyses.
`--attempts`	integer	`3`	Maximum times to attempt downloads
`--use_ena`	boolean		Download FASTQs from ENA
`--no_cache`	boolean		Skip caching the assembly summary file from ncbi-genome-download

Sketcher Parameters

Parameter	Type	Default	Description
`--sketch_size`	integer	`10000`	Sketch size. Each sketch will have at most this many non-redundant min-hashes.
`--sourmash_scale`	integer	`10000`	Choose number of hashes as 1 in FRACTION of input k-mers
`--no_winner_take_all`	boolean		Disable winner-takes-all strategy for identity estimates
`--screen_i`	number	`0.8`	Minimum identity to report.

MLST Parameters

Parameter	Type	Default	Description
`--mlst_scheme`	string		Don't autodetect, force this scheme on all inputs
`--mlst_minid`	integer	`95`	Minimum DNA percent identity of full allele to consider 'similar'
`--mlst_mincov`	integer	`10`	Minimum DNA percent coverage to report partial allele at all
`--mlst_minscore`	integer	`50`	Minimum score out of 100 to match a scheme
`--mlst_nopath`	boolean	`false`	Strip filename paths from FILE column
`--mlst_db`	string		A custom MLST database to use, either a tarball or a directory

QC Parameters

Parameter	Type	Default	Description
`--use_bbmap`	boolean		Illumina reads will be QC'd using BBMap
`--use_porechop`	boolean	`false`	Use Porechop to remove adapters from ONT reads
`--skip_qc`	boolean		The QC step will be skipped and it will be assumed the input sequences have already been QCed.
`--skip_qc_plots`	boolean		QC Plot creation by FastQC or Nanoplot will be skipped
`--skip_error_correction`	boolean		Lighter error correction of reads will be skipped.
`--adapters`	string		A FASTA file containing adapters to remove
`--adapter_k`	integer	`23`	Kmer length used for finding adapters.
`--phix`	string		phiX174 reference genome to remove
`--phix_k`	integer	`31`	Kmer length used for finding phiX174.
`--ktrim`	string	`r`	Trim reads to remove bases matching reference kmers (choices: `f`, `r`, `l`)
`--mink`	integer	`11`	Look for shorter kmers at read tips down to this length, when k-trimming or masking.
`--hdist`	integer	`1`	Maximum Hamming distance for ref kmers (subs only)
`--tpe`	string	`t`	When kmer right-trimming, trim both reads to the minimum length of either (choices: `f`, `t`)
`--tbo`	string	`t`	Trim adapters based on where paired reads overlap (choices: `f`, `t`)
`--qtrim`	string	`rl`	Trim read ends to remove bases with quality below trimq. (choices: `rl`, `f`, `r`, `l`, `w`)
`--trimq`	integer	`6`	Regions with average quality BELOW this will be trimmed if qtrim is set to something other than f
`--maq`	integer	`10`	Reads with average quality (after trimming) below this will be discarded
`--minlength`	integer	`35`	Reads shorter than this after trimming will be discarded
`--ftm`	integer	`5`	If positive, right-trim length to be equal to zero, modulo this number
`--tossjunk`	string	`t`	Discard reads with invalid characters as bases (choices: `f`, `t`)
`--ain`	string	`f`	When detecting pair names, allow identical names (choices: `f`, `t`)
`--qout`	string	`33`	PHRED offset to use for output FASTQs (choices: `33`, `64`)
`--maxcor`	integer	`1`	Max number of corrections within a 20bp window
`--sampleseed`	integer	`42`	Set to a positive number to use as the random number generator seed for sampling
`--ont_minlength`	integer	`1000`	ONT Reads shorter than this will be discarded
`--ont_minqual`	integer	`0`	Minimum average read quality filter of ONT reads
`--porechop_opts`	string		Extra Porechop options in quotes
`--nanoplot_opts`	string		Extra NanoPlot options in quotes
`--bbduk_opts`	string		Extra BBDuk options in quotes
`--fastp_opts`	string		Extra fastp options in quotes

Bakta Download Parameters

Parameter	Type	Default	Description
`--bakta_db`	string		Tarball or path to the Bakta database
`--bakta_db_type`	string	`full`	Which Bakta DB to download 'full' (~30GB) or 'light' (~2GB) (choices: `full`, `light`)
`--bakta_save_as_tarball`	boolean	`false`	Save the Bakta database as a tarball
`--download_bakta`	boolean	`false`	Download the Bakta database to the path given by --bakta_db

Bakta Parameters

Parameter	Type	Default	Description
`--bakta_proteins`	string		FASTA file of trusted proteins to first annotate from
`--bakta_prodigal_tf`	string		Training file to use for Prodigal
`--bakta_replicons`	string		Replicon information table (tsv/csv)
`--bakta_min_contig_length`	integer	`1`	Minimum contig size to annotate
`--bakta_keep_contig_headers`	boolean	`false`	Keep original contig headers
`--bakta_compliant`	boolean	`false`	Force GenBank/ENA/DDBJ compliance
`--bakta_skip_trna`	boolean	`false`	Skip tRNA detection & annotation
`--bakta_skip_tmrna`	boolean	`false`	Skip tmRNA detection & annotation
`--bakta_skip_rrna`	boolean	`false`	Skip rRNA detection & annotation
`--bakta_skip_ncrna`	boolean	`false`	Skip ncRNA detection & annotation
`--bakta_skip_ncrna_region`	boolean	`false`	Skip ncRNA region detection & annotation
`--bakta_skip_crispr`	boolean	`false`	Skip CRISPR array detection & annotation
`--bakta_skip_cds`	boolean	`false`	Skip CDS detection & annotation
`--bakta_skip_sorf`	boolean	`false`	Skip sORF detection & annotation
`--bakta_skip_gap`	boolean	`false`	Skip gap detection & annotation
`--bakta_skip_ori`	boolean	`false`	Skip oriC/oriT detection & annotation
`--bakta_opts`	string		Extra Bakta options in quotes. Example: '--gram +'

Prokka Parameters

Parameter	Type	Default	Description
`--prokka_proteins`	string	`${projectDir}/data/proteins.faa`	FASTA file of trusted proteins to first annotate from
`--prokka_prodigal_tf`	string		Training file to use for Prodigal
`--prokka_compliant`	boolean	`false`	Force GenBank/ENA/DDBJ compliance
`--prokka_centre`	string	`Bactopia`	Sequencing centre ID
`--prokka_coverage`	integer	`80`	Minimum coverage on query protein
`--prokka_evalue`	string	`1e-09`	Similarity e-value cut-off
`--prokka_opts`	string		Extra Prokka options in quotes.
`--prokka_debug`	boolean	`false`	Enable debug mode for Prokka

mashdist Parameters

Parameter	Type	Default	Description
`--mash_sketch`	string		The reference sequence as a Mash Sketch (.msh file)
`--mash_seed`	integer	`42`	Seed to provide to the hash function
`--mash_table`	boolean	`false`	Table output (fields will be blank if they do not meet the p-value threshold)
`--mash_m`	integer	`1`	Minimum copies of each k-mer required to pass noise filter for reads
`--mash_w`	number	`0.01`	Probability threshold for warning about low k-mer size.
`--mash_max_p`	number	`1.0`	Maximum p-value to report.
`--mash_max_dist`	number	`1.0`	Maximum distance to report.
`--merlin_dist`	number	`0.1`	Maximum distance to report when using Merlin.
`--full_merlin`	boolean	`false`	Go full Merlin and run all species-specific tools, no matter the Mash distance
`--mash_use_fastqs`	boolean	`false`	Query with FASTQs instead of the assemblies

ClermonTyping Parameters

Parameter	Type	Default	Description
`--clermontyping_threshold`	integer	`0`	Do not use contigs under this size

ECTyper Parameters

Parameter	Type	Default	Description
`--ectyper_opid`	integer	`90`	Percent identity required for an O antigen allele match
`--ectyper_opcov`	integer	`90`	Minimum percent coverage required for an O antigen allele match
`--ectyper_hpid`	integer	`95`	Percent identity required for an H antigen allele match
`--ectyper_hpcov`	integer	`50`	Minimum percent coverage required for an H antigen allele match
`--ectyper_verify`	boolean	`false`	Enable E. coli species verification
`--ectyper_print_alleles`	boolean	`false`	Prints the allele sequences if enabled as the final column

emmtyper Parameters

Parameter	Type	Default	Description
`--emmtyper_wf`	string	`blast`	Workflow for emmtyper to use. (choices: `blast`, `pcr`)
`--emmtyper_blastdb`	string		Path to custom EMM BLAST DB.
`--emmtyper_cluster_distance`	integer	`500`	Distance between cluster of matches to consider as different clusters
`--emmtyper_percid`	integer	`95`	Minimal percent identity of sequence
`--emmtyper_culling_limit`	integer	`5`	Total hits to return in a position
`--emmtyper_mismatch`	integer	`5`	Threshold for number of mismatch to allow in BLAST hit
`--emmtyper_align_diff`	integer	`5`	Threshold for difference between alignment length and subject length in BLAST
`--emmtyper_gap`	integer	`2`	Threshold gap to allow in BLAST hit
`--emmtyper_min_perfect`	integer	`15`	Minimum size of perfect match at 3' primer end
`--emmtyper_min_good`	integer	`15`	Minimum size where there must be 2 matches for each mismatch
`--emmtyper_max_size`	integer	`2000`	Maximum size of PCR product

hicap Parameters

Parameter	Type	Default	Description
`--hicap_database_dir`	string		Directory containing locus database
`--hicap_model_fp`	string		Path to prodigal model
`--hicap_full_sequence`	boolean	`false`	Write the full input sequence out to the genbank file rather than just the region surrounding and including the locus
`--hicap_debug`	boolean	`false`	hicap will print debug messages
`--hicap_gene_coverage`	number	`0.8`	Minimum percentage coverage to consider a single gene complete
`--hicap_gene_identity`	number	`0.7`	Minimum percentage identity to consider a single gene complete
`--hicap_broken_gene_length`	integer	`60`	Minimum length to consider a broken gene
`--hicap_broken_gene_identity`	number	`0.8`	Minimum percentage identity to consider a broken gene

Mykrobe Parameters

Parameter	Type	Default	Description
`--mykrobe_species`	string		Species panel to use (choices: `sonnei`, `staph`, `tb`, `typhi`)
`--mykrobe_kmer`	integer	`21`	K-mer length
`--mykrobe_min_depth`	integer	`1`	Minimum depth
`--mykrobe_model`	string	`kmer_count`	Genotype model used. (choices: `kmer_count`, `median_depth`)
`--mykrobe_report_all_calls`	boolean	`false`	Report all calls
`--mykrobe_opts`	string		Extra Mykrobe options in quotes

GenoTyphi Parameters

Parameter	Type	Default	Description
`--genotyphi_kmer`	integer	`21`	K-mer length
`--genotyphi_min_depth`	integer	`1`	Minimum depth
`--genotyphi_model`	string	`kmer_count`	Genotype model used. (choices: `kmer_count`, `median_depth`)
`--genotyphi_report_all_calls`	boolean	`false`	Report all calls
`--genotyphi_mykrobe_opts`	string		Extra Mykrobe options in quotes

Kleborate Parameters

Parameter	Type	Default	Description
`--kleborate_preset`	string	`kpsc`	Preset module to use for Kleborate (choices: `kpsc`, `kosc`, `escherichia`)
`--kleborate_opts`	string		Extra options in quotes for Kleborate

legsta Parameters

Parameter	Type	Default	Description
`--legsta_noheader`	boolean	`false`	Don't print header row

LisSero Parameters

Parameter	Type	Default	Description
`--lissero_min_id`	number	`95.0`	Minimum percent identity to accept a match
`--lissero_min_cov`	number	`95.0`	Minimum coverage of the gene to accept a match

ngmaster Parameters

Parameter	Type	Default	Description
`--ngmaster_csv`	boolean	`false`	Output comma-separated format (CSV) rather than tab-separated

pasty Parameters

Parameter	Type	Default	Description
`--pasty_min_pident`	integer	`95`	Minimum percent identity to count a hit
`--pasty_min_coverage`	integer	`95`	Minimum percent coverage to count a hit

pbptyper Parameters

Parameter	Type	Default	Description
`--pbptyper_min_pident`	integer	`95`	Minimum percent identity to count a hit
`--pbptyper_min_coverage`	integer	`95`	Minimum percent coverage to count a hit

SeqSero2 Parameters

Parameter	Type	Default	Description
`--seqsero2_run_mode`	string	`k`	Workflow to run. 'a' allele mode, or 'k' k-mer mode (choices: `a`, `k`)
`--seqsero2_input_type`	string	`assembly`	Input format to analyze. 'assembly' or 'fastq' (choices: `assembly`, `fastq`)
`--seqsero2_bwa_mode`	string	`mem`	Algorithms for bwa mapping for allele mode (choices: `mem`, `sam`)

SeroBA Parameters

Parameter	Type	Default	Description
`--seroba_noclean`	boolean	`false`	Do not clean up intermediate files
`--seroba_coverage`	integer	`20`	Threshold for k-mer coverage of the reference sequence

SISTR Parameters

Parameter	Type	Default	Description
`--sistr_full_cgmlst`	boolean	`false`	Use the full set of cgMLST alleles which can include highly similar alleles

AgrVATE Parameters

Parameter	Type	Default	Description
`--agrvate_typing_only`	boolean	`false`	agr typing only. Skips agr operon extraction and frameshift detection

spaTyper Parameters

Parameter	Type	Default	Description
`--spatyper_repeats`	string		List of spa repeats
`--spatyper_repeat_order`	string		List spa types and order of repeats
`--spatyper_do_enrich`	boolean	`false`	Do PCR product enrichment

sccmec Parameters

Parameter	Type	Default	Description
`--sccmec_min_targets_pident`	integer	`90`	Minimum percent identity to count a target hit
`--sccmec_min_targets_coverage`	integer	`80`	Minimum percent coverage to count a target hit
`--sccmec_min_regions_pident`	integer	`85`	Minimum percent identity to count a region hit
`--sccmec_min_regions_coverage`	integer	`93`	Minimum percent coverage to count a region hit

STECFinder Parameters

Parameter	Type	Default	Description
`--stecfinder_use_reads`	boolean	`false`	Paired-end Illumina reads will be used instead of assemblies
`--stecfinder_hits`	boolean	`false`	Show detailed gene search results
`--stecfinder_cutoff`	number	`10.0`	Minimum read coverage for gene to be called
`--stecfinder_length`	number	`50.0`	Percentage of gene length needed for positive call
`--stecfinder_ipah_length`	number	`10.0`	Percentage of ipaH gene length needed for positive gene call
`--stecfinder_ipah_depth`	number	`1.0`	Minimum depth for positive ipaH gene call (requires --stecfinder_use_reads)
`--stecfinder_stx_length`	number	`10.0`	Percentage of stx gene length needed for positive gene call
`--stecfinder_stx_depth`	number	`1.0`	Minimum depth for positive stx gene call (requires --stecfinder_use_reads)
`--stecfinder_o_length`	number	`60.0`	Percentage of wz_ gene length needed for positive call
`--stecfinder_o_depth`	number	`1.0`	Minimum depth for positive wz_ gene call (requires --stecfinder_use_reads)
`--stecfinder_h_length`	number	`60.0`	Percentage of fliC gene length needed for positive call
`--stecfinder_h_depth`	number	`1.0`	Minimum depth for positive fliC gene call (requires --stecfinder_use_reads)
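The four depth thresholds above are documented as requiring `--stecfinder_use_reads`. As a minimal sketch of a pre-flight check, assuming parameters are held in a plain dictionary keyed by flag name without the leading dashes, the helper below (`misplaced_depth_flags`, a hypothetical name that is not part of Bactopia or STECFinder) lists depth settings supplied while read mode is off.

```python
# Depth thresholds that the table marks as "requires --stecfinder_use_reads".
DEPTH_FLAGS = (
    "stecfinder_ipah_depth",
    "stecfinder_stx_depth",
    "stecfinder_o_depth",
    "stecfinder_h_depth",
)


def misplaced_depth_flags(params: dict) -> list:
    """Return depth flags that were set although read mode is off."""
    if params.get("stecfinder_use_reads", False):
        return []
    return [name for name in DEPTH_FLAGS if name in params]


# Hypothetical parameter sets, for illustration only.
assembly_run = {"stecfinder_stx_depth": 2.0, "stecfinder_length": 60.0}
reads_run = {"stecfinder_use_reads": True, "stecfinder_stx_depth": 2.0}

print(misplaced_depth_flags(assembly_run))  # ['stecfinder_stx_depth']
print(misplaced_depth_flags(reads_run))     # []
```

A non-empty result means a depth threshold was set for an assembly-based run, where the table indicates it does not apply.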

TB-Profiler Profile Parameters

Parameter	Type	Default	Description
`--tbprofiler_call_whole_genome`	boolean	`false`	Call whole genome
`--tbprofiler_mapper`	string	`bwa`	Mapping tool to use. If you are using nanopore data it will default to minimap2 (choices: `bwa`, `minimap2`, `bowtie2`, `bwa-mem2`)
`--tbprofiler_caller`	string	`freebayes`	Variant calling tool to use (choices: `bcftools`, `gatk`, `freebayes`)
`--tbprofiler_calling_params`	string		Extra variant caller options in quotes
`--tbprofiler_suspect`	boolean	`false`	Use the suspect suite of tools to add ML predictions
`--tbprofiler_no_flagstat`	boolean	`false`	Don't collect flagstats
`--tbprofiler_no_delly`	boolean	`false`	Don't run delly
`--tbprofiler_opts`	string		Extra options in quotes for TBProfiler

TB-Profiler Collate Parameters

Parameter	Type	Default	Description
`--tbprofiler_itol`	boolean	`false`	Generate itol config files
`--tbprofiler_full`	boolean	`false`	Output mutations in main result file
`--tbprofiler_all_variants`	boolean	`false`	Output all variants in variant matrix
`--tbprofiler_mark_missing`	boolean	`false`	An asterisk will be used to mark predictions which are affected by missing data at a drug resistance position

Dataset Parameters

Define where the pipeline should find input data and save output data.

Parameter	Type	Default	Description
`--species`	string		Name of species for species-specific dataset to use
`--ask_merlin`	boolean		Ask Merlin to execute species-specific Bactopia tools based on Mash distances
`--coverage`	integer	`100`	Reduce samples to a given coverage, requires a genome size
`--genome_size`	integer	`0`	Expected genome size (bp) for all samples, required for read error correction and read subsampling
`--use_bakta`	boolean		Use Bakta for annotation, instead of Prokka
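The link between `--coverage` and `--genome_size` is easiest to see as arithmetic: a coverage target only becomes a number of bases once a genome size is known. The sketch below uses the standard depth formula (bases = coverage x genome size). It is an illustration of that relationship, not Bactopia's actual subsampling code, and the function names and the sample values are assumptions.

```python
from typing import Optional


def target_bases(coverage: int, genome_size: int) -> Optional[int]:
    """Bases to keep so that mean depth is roughly `coverage`x.

    Returns None when no genome size is known (the documented default of 0),
    since a coverage target cannot be computed without one.
    """
    if coverage <= 0:
        raise ValueError("coverage must be a positive integer")
    if genome_size <= 0:
        return None
    return coverage * genome_size


def reads_to_keep(coverage: int, genome_size: int, mean_read_length: int) -> Optional[int]:
    """Approximate read count for the target, given a mean read length in bp."""
    bases = target_bases(coverage, genome_size)
    if bases is None:
        return None
    return bases // mean_read_length


# Hypothetical sample: a 2.8 Mbp genome sequenced with 150 bp reads.
print(target_bases(100, 2_800_000))        # 280000000
print(reads_to_keep(100, 2_800_000, 150))  # 1866666
print(target_bases(100, 0))                # None
```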

Optional Parameters

These optional parameters can be useful in certain settings.

Parameter	Type	Default	Description
`--outdir`	string	`bactopia`	Base directory to write results to
`--skip_compression`	boolean	`false`	Output files will not be compressed
`--datasets`	string		The path to cache datasets to
`--keep_all_files`	boolean	`false`	Keeps all analysis files created
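When Bactopia is launched from a script, options like these are usually kept in a dictionary and converted to command-line tokens. The helper below is a minimal sketch of that conversion. `to_cli_args` is a hypothetical name, the directory name is made up, and input-selection flags are omitted. It assumes the usual Nextflow behaviour that a boolean parameter given without a value is treated as true.

```python
import shlex


def to_cli_args(params: dict) -> list:
    """Turn {'outdir': 'results', 'keep_all_files': True} into CLI tokens."""
    args = []
    for name, value in params.items():
        if value is True:
            args.append(f"--{name}")  # boolean switch, passed without a value
        elif value is False or value is None:
            continue  # leave the documented default in place
        else:
            args.extend([f"--{name}", str(value)])
    return args


options = {
    "outdir": "my results",  # hypothetical directory name containing a space
    "skip_compression": True,
    "keep_all_files": False,
    "datasets": None,
}

# shlex.join quotes tokens so the printed line is safe to paste into a shell.
print(shlex.join(["bactopia", *to_cli_args(options)]))
# bactopia --outdir 'my results' --skip_compression
```

Passing the token list to `subprocess.run` directly avoids shell quoting altogether. The joined string is shown only because it is easier to read and log.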

Max Job Request Parameters

Set the top limit for requested resources for any single job.

Parameter	Type	Default	Description
`--max_retry`	integer	`3`	Maximum times to retry a process before allowing it to fail.
`--max_cpus`	integer	`4`	Maximum number of CPUs that can be requested for any single job.
`--max_memory`	string	`128.GB`	Maximum amount of memory that can be requested for any single job.
`--max_time`	string	`240.h`	Maximum amount of time that can be requested for any single job.
`--max_downloads`	integer	`3`	Maximum number of samples to download at a time
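These ceilings act as clamps: a job may request less than the limit, but never more. The sketch below illustrates the idea for CPUs and memory, with defaults mirroring the table above. It assumes memory strings follow the Nextflow style shown there (`128.GB`) and that units are binary multiples. The function names are hypothetical, and the real clamping happens inside the pipeline's Nextflow configuration rather than in Python.

```python
import re

# Nextflow memory units are binary multiples (1 GB = 1024**3 bytes).
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_memory(text: str) -> int:
    """Convert strings such as '128.GB' or '8 GB' to bytes."""
    match = re.fullmatch(r"(\d+)\s*\.?\s*([KMGT]?B)", text.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised memory string: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def cap_request(cpus: int, memory: str, max_cpus: int = 4, max_memory: str = "128.GB"):
    """Clamp a job's request to the configured ceilings."""
    capped_cpus = min(cpus, max_cpus)
    capped_bytes = min(parse_memory(memory), parse_memory(max_memory))
    return capped_cpus, capped_bytes


# A hypothetical job asking for more than the default limits allow.
cpus, mem = cap_request(cpus=16, memory="256.GB")
print(cpus, mem // 1024**3)  # 4 128
```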

Nextflow Configuration Parameters

Parameters to fine-tune your Nextflow setup.

Parameter	Type	Default	Description
`--nfconfig`	string		A Nextflow compatible config file for custom profiles, loaded last and will overwrite existing variables if set.
`--publish_dir_mode`	string	`copy`	Method used to save pipeline results to output directory. (choices: `symlink`, `rellink`, `link`, `copy`, `copyNoFollow`, `move`)
`--infodir`	string	`${params.outdir}/pipeline_info`	Directory to keep pipeline Nextflow logs and reports.
`--force`	boolean	`false`	Nextflow will overwrite existing output files.
`--cleanup_workdir`	boolean	`false`	After Bactopia is successfully executed, the `work` directory will be deleted.
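Two details in this table are easy to miss: `--publish_dir_mode` accepts only the listed choices, and the default for `--infodir` is derived from `--outdir`. The sketch below makes both explicit. `resolve_nextflow_options` is a hypothetical helper for illustration, and the pipeline performs its own validation against its schema at runtime.

```python
# Choices copied from the --publish_dir_mode row above.
PUBLISH_DIR_MODES = ("symlink", "rellink", "link", "copy", "copyNoFollow", "move")


def resolve_nextflow_options(outdir="bactopia", publish_dir_mode="copy", infodir=None):
    """Validate the publish mode and fill in the outdir-derived infodir default."""
    if publish_dir_mode not in PUBLISH_DIR_MODES:
        raise ValueError(
            f"publish_dir_mode must be one of {PUBLISH_DIR_MODES}, got {publish_dir_mode!r}"
        )
    if infodir is None:
        # Mirrors the documented default `${params.outdir}/pipeline_info`.
        infodir = f"{outdir}/pipeline_info"
    return {"outdir": outdir, "publish_dir_mode": publish_dir_mode, "infodir": infodir}


print(resolve_nextflow_options()["infodir"])               # bactopia/pipeline_info
print(resolve_nextflow_options(outdir="run1")["infodir"])  # run1/pipeline_info
```

Changing `--outdir` therefore moves the logs and reports as well, unless `--infodir` is set explicitly.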

Institutional config options

Parameters used to describe centralized config profiles. These should not be edited.

Parameter	Type	Default	Description
`--custom_config_version`	string	`master`	Git commit id for Institutional configs.
`--custom_config_base`	string	`https://raw.githubusercontent.com/nf-core/configs/master`	Base directory for Institutional configs.
`--config_profile_name`	string		Institutional config name.
`--config_profile_description`	string		Institutional config description.
`--config_profile_contact`	string		Institutional config contact information.
`--config_profile_url`	string		Institutional config URL link.

Nextflow Profile Parameters

Parameters to fine-tune your Nextflow setup.

Parameter	Type	Default	Description
`--condadir`	string		Directory Nextflow should use for Conda environments
`--registry`	string	`quay.io`	Registry to pull Docker containers from.
`--datasets_cache`	string	`<HOME>/.bactopia/datasets`	Directory where downloaded datasets should be stored.
`--singularity_cache`	string		Directory where remote Singularity images are stored.
`--singularity_pull_docker_container`	boolean		Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead.
`--force_rebuild`	boolean	`false`	Force overwrite of existing pre-built environments.
`--queue`	string	`general,high-memory`	Comma-separated name of the queue(s) to be used by a job scheduler (e.g. AWS Batch or SLURM)
`--cluster_opts`	string		Additional options to pass to the executor. (e.g. SLURM: '--account=my_acct_name')
`--container_opts`	string		Additional options to pass to Apptainer, Docker, or Singularity. (e.g. Singularity: '-D `pwd`')
`--disable_scratch`	boolean	`false`	All intermediate files created on worker nodes will be transferred to the head node.

Helpful Parameters

Uncommonly used parameters that might be useful.

Parameter	Type	Default	Description
`--monochrome_logs`	boolean		Do not use coloured log outputs.
`--nfdir`	boolean		Print directory Nextflow has pulled Bactopia to
`--sleep_time`	integer	`5`	The amount of time (seconds) Nextflow will wait after setting up datasets before execution.
`--validate_params`	boolean	`true`	Boolean whether to validate parameters against the schema at runtime
`--help`	boolean		Display help text.
`--wf`	string	`bactopia`	Specify which workflow or Bactopia Tool to execute
`--list_wfs`	boolean		List the available workflows and Bactopia Tools to use with '--wf'
`--show_hidden_params`	boolean		Show all params when using `--help`
`--help_all`	boolean		An alias for --help --show_hidden_params
`--version`	boolean		Display version text.

Composition

This workflow uses the following subworkflows:

amrfinderplus - Find antimicrobial resistance genes and point mutations.
bactopia_assembler - Assemble bacterial genomes using automated assembler selection.
bactopia_datasets - Download and provide pre-compiled datasets required by Bactopia.
bactopia_gather - Search, validate, gather, and standardize input samples.
bactopia_qc - Perform comprehensive quality control on sequencing reads.
bactopia_sketcher - Create genomic sketches and perform rapid taxonomic classification.
bakta - Rapid bacterial genome annotation.
merlin - MinER assisted species-specific bactopia tool seLectIoN.
mlst - Determine multilocus sequence types (MLST) from bacterial assemblies.
prokka - Annotate bacterial genomes with functional information.

Source

View source on GitHub
