bactopia
Tags: bacteria assembly annotation amr mlst genomics pipeline named-workflow
Comprehensive bacterial analysis pipeline for complete genomic characterization.
This workflow performs end-to-end analysis including quality control, assembly, annotation, antimicrobial resistance detection, MLST typing, and optional pathogen-specific analysis through Merlin. It processes raw sequencing reads and produces a complete genomic characterization suitable for downstream analysis.
Pipeline Overview

Looking at the workflow overview above, it might not look like much is happening, but I can assure you that a lot is going on. The workflow is broken down into 8 steps, which are:
- Gather - Collect all the data in one place
- QC - Quality control of the data
- Assembler - Assemble the reads into contigs
- Annotator - Annotate the contigs
- Sketcher - Create a sketch of the contigs, and query databases
- Sequence Typing - Determine the sequence type of the contigs
- Antibiotic Resistance - Determine the antibiotic resistance of the contigs and proteins
- Merlin - Automatically run species-specific tools based on distance
If you are looking for a guide to get started quickly, please check out the Beginner's Guide.
Step 1 - Gather
The main purpose of the gather step is to get all the samples into a single place. This
includes downloading samples from ENA/SRA or NCBI Assembly. The tools used are:
| Tool | Description |
|---|---|
| art | For simulating error-free reads for an input assembly |
| fastq-dl | Downloading FASTQ files from ENA/SRA |
| ncbi-genome-download | Downloading FASTA files from NCBI Assembly |
This gather step also does basic QC checks to help prevent downstream failures.
Failed Quality Checks
| Filename | Description |
|---|---|
| -gzip-error.txt | Sample failed Gzip checks and was excluded from further analysis |
| -low-basepair-proportion-error.txt | Sample failed basepair proportion checks and was excluded from further analysis |
| -low-read-count-error.txt | Sample failed read count checks and was excluded from further analysis |
| -low-sequence-depth-error.txt | Sample failed sequenced basepair checks and was excluded from further analysis |
Samples that fail any of the QC checks will be excluded from further analysis.
Those samples will generate a *-error.txt file with the error message. Excluding
these samples prevents downstream failures that cause the whole workflow to fail.
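To see which samples were dropped, the error files can be tallied per sample. The following is a minimal sketch, not part of Bactopia: the function names are hypothetical, the suffixes are copied from the table above, and it assumes you pass in whatever `*-error.txt` paths you found (their exact location in the output folder is not stated here).

```python
from collections import defaultdict
from pathlib import Path

# Suffixes copied from the gather "Failed Quality Checks" table.
GATHER_ERROR_SUFFIXES = (
    "-gzip-error.txt",
    "-low-basepair-proportion-error.txt",
    "-low-read-count-error.txt",
    "-low-sequence-depth-error.txt",
)


def split_error_filename(filename):
    """Return (sample_name, suffix), or None if this is not a known error file."""
    for suffix in GATHER_ERROR_SUFFIXES:
        if filename.endswith(suffix) and len(filename) > len(suffix):
            return filename[: -len(suffix)], suffix
    return None


def collect_failed_samples(paths):
    """Map each sample name to the sorted error suffixes found for it.

    `paths` is any iterable of file paths (strings or Path objects).
    Files that do not match a known suffix are ignored.
    """
    failures = defaultdict(list)
    for path in paths:
        parsed = split_error_filename(Path(path).name)
        if parsed is not None:
            sample, suffix = parsed
            failures[sample].append(suffix)
    return {sample: sorted(suffixes) for sample, suffixes in failures.items()}
```

A possible call, assuming the results live in `results/`, is `collect_failed_samples(Path("results").rglob("*-error.txt"))`.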
Example Error: Input FASTQ(s) failed Gzip checks
If input FASTQ(s) fail to pass Gzip test, the sample will be excluded from further analysis.
Example Text from <SAMPLE_NAME>-gzip-error.txt
<SAMPLE_NAME> FASTQs failed Gzip tests. Please check the input FASTQs. Further analysis is discontinued.
Example Error: Input FASTQs have a disproportionate number of reads
If the paired-end FASTQs for a sample differ disproportionately in sequenced
basepairs between R1 and R2, the sample will be excluded from further analysis.
You can adjust this minimum proportion using the --min_proportion parameter.
Example Text from <SAMPLE_NAME>-low-basepair-proportion-error.txt
<SAMPLE_NAME> FASTQs failed to meet the minimum shared basepairs. They
shared Y basepairs, with R1 having A bp and R2 having B bp. Further
analysis is discontinued.
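As a rough illustration of this check, the sketch below compares the basepair totals of the two mates. It assumes the proportion is the smaller total divided by the larger one and that a value equal to the threshold passes. Neither detail is confirmed by the text above. The function names are hypothetical, and the 0.5 default mirrors the --min_proportion default listed under Gather Parameters.

```python
def basepair_proportion(r1_bp, r2_bp):
    """Ratio of the smaller mate's basepairs to the larger mate's (0..1)."""
    if r1_bp < 0 or r2_bp < 0:
        raise ValueError("basepair counts must be non-negative")
    larger = max(r1_bp, r2_bp)
    if larger == 0:
        return 0.0
    return min(r1_bp, r2_bp) / larger


def passes_proportion_check(r1_bp, r2_bp, min_proportion=0.5):
    """True when R1 and R2 are balanced enough to continue (assumed rule)."""
    return basepair_proportion(r1_bp, r2_bp) >= min_proportion
```

With R1 at 1,000 bp and R2 at 400 bp the proportion is 0.4, so the sample would be excluded under the default threshold.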
Example Error: Input FASTQ(s) have too few reads
If the input FASTQ(s) for a sample have fewer than the minimum required reads, the
sample will be excluded from further analysis. You can adjust this minimum read
count using the --min_reads parameter.
Example Text from <SAMPLE_NAME>-low-read-count-error.txt
<SAMPLE_NAME> FASTQ(s) contain X total reads. This does not exceed the required
minimum Y read count. Further analysis is discontinued.
Example Error: Input FASTQ(s) have too few sequenced basepairs
If the input FASTQ(s) for a sample fail to meet the minimum number of sequenced
basepairs, the sample will be excluded from further analysis. You can
adjust this minimum basepair count using the --min_basepairs parameter.
Example Text from <SAMPLE_NAME>-low-sequence-depth-error.txt
<SAMPLE_NAME> FASTQ(s) contain X total basepairs. This does not exceed the
required minimum Y bp. Further analysis is discontinued.
Step 2 - QC
The qc module uses a variety of tools to perform quality control on Illumina and
Oxford Nanopore reads. The tools used are:
| Tool | Technology | Description |
|---|---|---|
| bbtools | Illumina | A suite of tools for manipulating reads |
| fastp | Illumina | A tool designed to provide fast all-in-one preprocessing for FastQ files |
| fastqc | Illumina | A quality control tool for high throughput sequence data |
| fastq_scan | Nanopore | A tool for quickly scanning FASTQ files |
| lighter | Illumina | A tool for correcting sequencing errors in Illumina reads |
| NanoPlot | Nanopore | A tool for plotting long read sequencing data |
| nanoq | Nanopore | A tool for calculating quality metrics for Oxford Nanopore reads |
| porechop | Nanopore | A tool for removing adapters from Oxford Nanopore reads |
| rasusa | Nanopore | Randomly subsample sequencing reads to a specified coverage |
Similar to the gather step, the qc step will also stop samples that fail to meet
basic QC checks from continuing downstream.
Failed Quality Checks
| Filename | Description |
|---|---|
| .error-fastq.gz | A gzipped FASTQ file of reads that failed QC |
| -low-read-count-error.txt | Sample failed read count checks and was excluded from further analysis |
| -low-sequence-coverage-error.txt | Sample failed sequenced coverage checks and was excluded from further analysis |
| -low-sequence-depth-error.txt | Sample failed sequenced basepair checks and was excluded from further analysis |
Samples that fail any of the QC checks will be excluded from further analysis.
Those samples will generate a *-error.txt file with the error message. Excluding
these samples prevents downstream failures that cause the whole workflow to fail.
Example Error: After QC, too few reads remain
If, after cleaning reads, a sample has fewer than the minimum required reads, the
sample will be excluded from further analysis. You can adjust this minimum read
count using the --min_reads parameter.
Example Text from <SAMPLE_NAME>-low-read-count-error.txt
<SAMPLE_NAME> FASTQ(s) contain X total reads. This does not exceed the required
minimum Y read count. Further analysis is discontinued.
Example Error: After QC, too little sequence coverage remains
If, after cleaning reads, a sample fails to meet the minimum sequence
coverage required, the sample will be excluded from further analysis. You can
adjust this minimum coverage using the --min_coverage parameter.
Note: This check is only performed when a genome size is available.
Example Text from <SAMPLE_NAME>-low-sequence-coverage-error.txt
After QC, <SAMPLE_NAME> FASTQ(s) contain X total basepairs. This does not
exceed the required minimum Y bp (Zx coverage). Further analysis is
discontinued.
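The arithmetic behind this message can be sketched as follows. This is an illustration under assumptions, not the pipeline's code: coverage is taken to be total basepairs divided by genome size, the required basepairs to be genome size times the minimum coverage, and a sample exactly at the threshold is assumed to pass. The function names are hypothetical, and the default of 10 mirrors the --min_coverage default listed under Gather Parameters.

```python
def required_basepairs(genome_size, min_coverage=10):
    """Basepairs needed to reach min_coverage for a genome of this size."""
    return genome_size * min_coverage


def estimated_coverage(total_bp, genome_size):
    """Average depth implied by the sequenced basepairs."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return total_bp / genome_size


def passes_coverage_check(total_bp, genome_size, min_coverage=10):
    """Apply the coverage check; it is skipped when no genome size is known."""
    if genome_size is None:
        return True  # the check is only performed when a genome size is available
    return total_bp >= required_basepairs(genome_size, min_coverage)
```

For a 2.8 Mbp genome, 10x coverage requires 28,000,000 bp, so a sample with 14,000,000 bp after QC (5x) would be excluded.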
Example Error: After QC, too few sequenced basepairs remain
If, after cleaning reads, a sample fails to meet the minimum number of
sequenced basepairs, the sample will be excluded from further analysis. You can
adjust this minimum basepair count using the --min_basepairs parameter.
Example Text from <SAMPLE_NAME>-low-sequence-depth-error.txt
<SAMPLE_NAME> FASTQ(s) contain X total basepairs. This does not exceed the
required minimum Y bp. Further analysis is discontinued.
Step 3 - Assembler
The assembler module uses a variety of assembly tools to create an assembly of
Illumina and Oxford Nanopore reads. The tools used are:
| Tool | Description |
|---|---|
| Dragonflye | Assembly of Oxford Nanopore reads, as well as hybrid assembly with short-read polishing |
| Shovill | Assembly of Illumina paired-end reads |
| Shovill-SE | Assembly of Illumina single-end reads |
| Unicycler | Hybrid assembly, using short-reads first then long-reads |
Summary statistics for each assembly are generated using assembly-scan.
--short_polish over --hybrid with recent ONT sequencing
Using Unicycler (--hybrid) to create a hybrid assembly works well when you have
low-coverage, noisy long reads. However, if you are using recent ONT sequencing,
you likely have high coverage, and the --short_polish method is likely to yield
better results (and be faster) than --hybrid.
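If you launch runs from a script, the choice between the two strategies can be kept in one place. The helper below is a hypothetical sketch that only assembles an argument list from flags named in this document (--sample, --r1, --r2, --ont, --hybrid, --short_polish). It does not run anything, and any additional options you need are up to you.

```python
def hybrid_assembly_args(sample, r1, r2, ont, strategy="short_polish"):
    """Build a bactopia argument list for a short-read plus long-read sample.

    strategy is "short_polish" (long-read assembly, short-read polishing)
    or "hybrid" (Unicycler hybrid assembly).
    """
    if strategy not in ("short_polish", "hybrid"):
        raise ValueError("strategy must be 'short_polish' or 'hybrid'")
    return [
        "bactopia",
        "--sample", sample,
        "--r1", r1,
        "--r2", r2,
        "--ont", ont,
        f"--{strategy}",
    ]
```

The resulting list can be handed to `subprocess.run` once you have added the output options you normally use.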
Failed Quality Checks
| Filename | Description |
|---|---|
| -assembly-error.txt | Sample failed assembly checks and was excluded from further analysis |
Samples that fail any of the QC checks will be excluded from further analysis.
Those samples will generate a *-error.txt file with the error message. Excluding
these samples prevents downstream failures that cause the whole workflow to fail.
Example Error: Assembled Successfully, but 0 Contigs
If a sample assembles successfully, but 0 contigs are formed, the sample will be excluded from further analysis.
Example Text from <SAMPLE_NAME>-assembly-error.txt
<SAMPLE_NAME> assembled successfully, but 0 contigs were formed. Please
investigate <SAMPLE_NAME> to determine a cause (e.g. metagenomic, contaminants,
etc...) for this outcome. Further assembly-based analysis of <SAMPLE_NAME> will
be discontinued.
Example Error: Assembled successfully, but poor assembly size
If your sample assembles successfully, but the assembly size is less than the minimum
allowed genome size, the sample will be excluded from further analysis. You can
adjust this minimum size using the --min_genome_size parameter.
Example Text from <SAMPLE_NAME>-assembly-error.txt
<SAMPLE_NAME> assembled size (000 bp) is less than the minimum allowed genome
size (000 bp). If this is unexpected, please investigate <SAMPLE_NAME> to
determine a cause (e.g. metagenomic, contaminants, etc...) for the poor assembly.
Otherwise, adjust the --min_genome_size parameter to fit your need. Further
assembly based analysis of <SAMPLE_NAME> will be discontinued.
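The two assembly checks can be reproduced on any FASTA file for a quick sanity check. The sketch below is illustrative only: the function names are hypothetical, the comparison rule (total size strictly below the minimum fails) is an assumption, and the default of 100000 mirrors the --min_genome_size default listed under Gather Parameters.

```python
def contig_lengths(fasta_lines):
    """Return the length of each contig from an iterable of FASTA lines."""
    lengths = []
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            lengths.append(0)          # a header starts a new contig
        elif line and lengths:
            lengths[-1] += len(line)   # sequence lines extend the current contig
    return lengths


def assembly_check(lengths, min_genome_size=100000):
    """Return None if the assembly passes, otherwise a short reason."""
    if not lengths:
        return "0 contigs were formed"
    total = sum(lengths)
    if total < min_genome_size:
        return (
            f"assembled size ({total} bp) is less than the minimum "
            f"allowed genome size ({min_genome_size} bp)"
        )
    return None
```

For a compressed assembly, the lines can come from `gzip.open(path, "rt")`, which yields text lines the same way a regular file handle does.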
Step 4 - Annotator
The annotator step uses either Prokka (default)
or Bakta (via --use_bakta) to annotate
assembled contigs with functional information including genes, proteins, rRNA, tRNA,
and other genomic features.
Step 5 - Sketcher
The sketcher module uses Mash and
Sourmash to create sketches and query
RefSeq and GTDB.
Step 6 - Sequence Typing
The mlst step uses mlst to scan assemblies against
PubMLST typing schemes and determine the sequence type.
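For downstream scripting, a result line can be split into its parts. The sketch below assumes the tab-separated layout used by the mlst tool's report (file, scheme, sequence type, then one `locus(allele)` field per locus). Whether the file written by this pipeline also carries a header row is not stated here, so skip such a row yourself if present. The function name is hypothetical.

```python
import re

# One allele field looks like "arcC(2)"; the allele part is kept as text
# because it is not always a plain number.
ALLELE_PATTERN = re.compile(r"^(?P<locus>[^()]+)\((?P<allele>[^()]*)\)$")


def parse_mlst_line(line):
    """Split one tab-separated MLST result line into a dictionary."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least FILE, SCHEME and ST columns")
    alleles = {}
    for field in fields[3:]:
        match = ALLELE_PATTERN.match(field)
        if match:
            alleles[match.group("locus")] = match.group("allele")
    return {
        "file": fields[0],
        "scheme": fields[1],
        "st": fields[2],
        "alleles": alleles,
    }
```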
Step 7 - Antibiotic Resistance
The amrfinderplus step uses AMRFinder+ to identify
antimicrobial resistance genes and point mutations from both assembled contigs and
annotated protein sequences.
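A quick way to summarize a results table is to count hits per value of one column. The sketch below is a generic helper, not part of the pipeline. It assumes the TSV has a header row, and the default column name `Class` is an assumption about the AMRFinder+ report layout, so check the header of your own file and pass a different name if needed.

```python
import csv
from collections import Counter


def count_by_column(tsv_lines, column="Class"):
    """Count rows per value of `column` in a tab-separated table.

    `tsv_lines` is any iterable of text lines, the first being the header.
    Rows with a missing or empty value are counted under "UNKNOWN".
    """
    counts = Counter()
    for row in csv.DictReader(tsv_lines, delimiter="\t"):
        value = (row.get(column) or "").strip()
        counts[value or "UNKNOWN"] += 1
    return counts
```

With a file on disk, pass an open handle, for example `count_by_column(open(path, newline=""))`.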
Step 8 - Merlin
The merlin step automatically selects and runs species-specific typing tools based on
Mash distance results from the sketcher step. Enable with --ask_merlin. See the
Pathogen-Specific Analysis output section below for the
full list of supported organisms and tools.
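The idea of distance-based tool selection can be sketched in a few lines. Everything in this example is an assumption made for illustration: the organism-to-tool mapping is pieced together from the Pathogen-Specific Analysis table, matching is done on the start of the organism name, and a hit at exactly the threshold is treated as close enough. The 0.1 default mirrors the --merlin_dist default, and `full_merlin` mimics the --full_merlin option.

```python
# Illustrative mapping only, not the pipeline's own lookup table.
TOOLS_BY_ORGANISM = {
    "Escherichia coli": ["clermontyping", "ectyper", "stecfinder"],
    "Shigella": ["shigatyper", "shigapass", "shigeifinder"],
    "Streptococcus pyogenes": ["emmtyper"],
    "Haemophilus influenzae": ["hicap"],
    "Haemophilus parasuis": ["hpsuissero"],
    "Klebsiella": ["kleborate"],
    "Staphylococcus aureus": ["staphtyper", "agrvate", "sccmec"],
}


def select_tools(mash_hits, merlin_dist=0.1, full_merlin=False):
    """Pick tools for organisms whose Mash distance is within merlin_dist.

    mash_hits is an iterable of (organism_name, distance) pairs.
    """
    if full_merlin:
        return sorted({t for tools in TOOLS_BY_ORGANISM.values() for t in tools})
    selected = set()
    for organism, distance in mash_hits:
        if distance > merlin_dist:
            continue
        for prefix, tools in TOOLS_BY_ORGANISM.items():
            if organism.startswith(prefix):
                selected.update(tools)
    return sorted(selected)
```

A sample whose closest hit is Klebsiella pneumoniae at distance 0.02 would get only kleborate, while a distant Escherichia coli hit at 0.3 would be ignored.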
Usage
Bactopia CLI:
bactopia \
--input samples.csv \
--outdir results/
Nextflow:
nextflow run bactopia/bactopia \
--input samples.csv \
--outdir results/
Outputs
Expected Output Files
<BACTOPIA_DIR>
├── <SAMPLE_NAME>
│ ├── main
│ │ ├── annotator
│ │ │ └── prokka
│ │ │ ├── <SAMPLE_NAME>-blastdb.tar.gz
│ │ │ ├── <SAMPLE_NAME>.faa.gz
│ │ │ ├── <SAMPLE_NAME>.ffn.gz
│ │ │ ├── <SAMPLE_NAME>.fna.gz
│ │ │ ├── <SAMPLE_NAME>.fsa.gz
│ │ │ ├── <SAMPLE_NAME>.gbk.gz
│ │ │ ├── <SAMPLE_NAME>.gff.gz
│ │ │ ├── <SAMPLE_NAME>.sqn.gz
│ │ │ ├── <SAMPLE_NAME>.tbl.gz
│ │ │ ├── <SAMPLE_NAME>.tsv
│ │ │ ├── <SAMPLE_NAME>.txt
│ │ │ └── logs
│ │ │ ├── <SAMPLE_NAME>.err
│ │ │ ├── <SAMPLE_NAME>.log
│ │ │ ├── nf.command.{begin,err,log,out,run,sh,trace}
│ │ │ └── versions.yml
│ │ ├── assembler
│ │ │ ├── <SAMPLE_NAME>.fna.gz
│ │ │ ├── <SAMPLE_NAME>.tsv
│ │ │ ├── logs
│ │ │ │ ├── nf.command.{begin,err,log,out,run,sh,trace}
│ │ │ │ ├── shovill-se.log
│ │ │ │ └── versions.yml
│ │ │ └── supplemental
│ │ │ ├── illumina.txt
│ │ │ └── shovill.corrections
│ │ ├── gather
│ │ │ ├── <SAMPLE_NAME>-meta.tsv
│ │ │ └── logs
│ │ │ ├── nf.command.{begin,err,log,out,run,sh,trace}
│ │ │ └── versions.yml
│ │ ├── qc
│ │ │ ├── <SAMPLE_NAME>_SE.fastq.gz
│ │ │ ├── logs
│ │ │ │ ├── <SAMPLE_NAME>-fastp.log
│ │ │ │ ├── nf.command.{begin,err,log,out,run,sh,trace}
│ │ │ │ └── versions.yml
│ │ │ └── supplemental
│ │ │ ├── <SAMPLE_NAME>.fastp.html
│ │ │ ├── <SAMPLE_NAME>.fastp.json
│ │ │ ├── <SAMPLE_NAME>_SE-final.json
│ │ │ ├── <SAMPLE_NAME>_SE-final_fastqc.html
│ │ │ ├── <SAMPLE_NAME>_SE-final_fastqc.zip
│ │ │ ├── <SAMPLE_NAME>_SE-original.json
│ │ │ ├── <SAMPLE_NAME>_SE-original_fastqc.html
│ │ │ └── <SAMPLE_NAME>_SE-original_fastqc.zip
│ │ └── sketcher
│ │ ├── <SAMPLE_NAME>-k21.msh
│ │ ├── <SAMPLE_NAME>-k31.msh
│ │ ├── <SAMPLE_NAME>-mash-refseq88-k21.txt
│ │ ├── <SAMPLE_NAME>-sourmash-gtdb-rs207-k31.txt
│ │ ├── <SAMPLE_NAME>.sig
│ │ └── logs
│ │ ├── nf.command.{begin,err,log,out,run,sh,trace}
│ │ └── versions.yml
│ └── tools
│ ├── amrfinderplus
│ │ ├── <SAMPLE_NAME>.tsv
│ │ └── logs
│ │ ├── nf.command.{begin,err,log,out,run,sh,trace}
│ │ └── versions.yml
│ └── mlst
│ ├── <SAMPLE_NAME>.tsv
│ └── logs
│ ├── nf.command.{begin,err,log,out,run,sh,trace}
│ └── versions.yml
└── bactopia-runs
└── bactopia-<TIMESTAMP>
├── merged-results
│ ├── amrfinderplus.tsv
│ ├── assembly-scan.tsv
│ ├── logs
│ │ ├── amrfinderplus-concat
│ │ │ ├── nf.command.{begin,err,log,out,run,sh,trace}
│ │ │ └── versions.yml
│ │ ├── assembly-scan-concat
│ │ │ ├── nf.command.{begin,err,log,out,run,sh,trace}
│ │ │ └── versions.yml
│ │ ├── meta-concat
│ │ │ ├── nf.command.{begin,err,log,out,run,sh,trace}
│ │ │ └── versions.yml
│ │ └── mlst-concat
│ │ ├── nf.command.{begin,err,log,out,run,sh,trace}
│ │ └── versions.yml
│ ├── meta.tsv
│ └── mlst.tsv
└── nf-reports
├── bactopia-dag.dot
├── bactopia-report.html
└── bactopia-timeline.html
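Scripts that consume these outputs mostly need two things: the list of sample folders and the path to a given file. The sketch below follows the layout shown above, where every top-level folder except bactopia-runs is a sample and the assembly sits at main/assembler/<SAMPLE_NAME>.fna.gz. The function names are hypothetical.

```python
from pathlib import Path

RUN_FOLDER = "bactopia-runs"  # run-level results, not a sample


def sample_names(top_level_dirs):
    """Filter top-level directory names down to sample names."""
    return sorted(name for name in top_level_dirs if name != RUN_FOLDER)


def assembly_path(bactopia_dir, sample):
    """Path to a sample's assembled contigs, following the tree above."""
    return Path(bactopia_dir) / sample / "main" / "assembler" / f"{sample}.fna.gz"
```

The directory names can be gathered with `[p.name for p in Path(bactopia_dir).iterdir() if p.is_dir()]`.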
Quality Control
| File | Description |
|---|---|
| supplemental/*_fastqc.* | FastQC quality control reports for raw and cleaned reads |
| supplemental/*-NanoPlot.* | NanoPlot reports for Nanopore reads |
| supplemental/*.fastp.* | Fastp quality reports (when applicable) |
| supplemental/*_original.json | Quality metrics for original reads |
| supplemental/*_final.json | Quality metrics for final reads |
Assembly
| File | Description |
|---|---|
| *.fasta | Assembled genome sequences |
| assembly-stats.tsv | Assembly quality metrics |
| merged-assembly-stats.tsv | Consolidated assembly statistics |
Annotation
Output format depends on chosen annotation tool (Bakta or Prokka)
| File | Description |
|---|---|
| *.gff.gz | Genome annotation in GFF3 format (compressed) |
| *.gbk.gz | Genome annotation in GenBank format (compressed) |
| *.faa.gz | Protein sequences (compressed) |
| *.fna.gz | Nucleotide sequences from annotation (compressed) |
| *.ffn.gz | Feature nucleotide sequences (compressed) |
| annotation.tsv | Annotation summary tables |
| blastdb.* | BLAST database created from annotation |
Typing
| File | Description |
|---|---|
| mlst.tsv | MLST sequence type results |
| merged-mlst.tsv | Consolidated MLST results |
Antimicrobial Resistance
| File | Description |
|---|---|
| amrfinderplus.tsv | AMR gene detection results |
| amrfinderplus.mutation.tsv | AMR point mutation results |
| merged-amrfinderplus.tsv | Consolidated AMR results |
Comparative Analysis
| File | Description |
|---|---|
| *-k21.msh | Mash sketch files (k=21) |
| *-k31.msh | Mash sketch files (k=31) |
| *-mash-refseq88-*.txt | Mash screening results against RefSeq |
| *.sig | Sourmash signatures |
| sourmash-*.txt | Sourmash classification results |
Pathogen-Specific Analysis
Only created if --ask_merlin is enabled
| File | Description |
|---|---|
| merlin/clermontyping/* | E. coli phylogroup typing |
| merlin/ectyper/* | E. coli serotyping (O and H antigens) |
| merlin/shigatyper/* | Shigella serotype prediction |
| merlin/shigapass/* | Shigella serotype prediction from assemblies |
| merlin/shigeifinder/* | Shigella and EIEC detection |
| merlin/stecfinder/* | STEC detection and typing |
| merlin/emmtyper/* | S. pyogenes emm typing |
| merlin/hicap/* | H. influenzae capsular typing |
| merlin/hpsuissero/* | H. parasuis serotyping |
| merlin/kleborate/* | Klebsiella species typing |
| merlin/staphtyper/* | S. aureus spa typing |
| merlin/agrvate/* | S. aureus agr typing |
| merlin/sccmec/* | S. aureus SCCmec typing |
Merged Results
Run-level aggregated results from all samples
| File | Description |
|---|---|
| samplesheet.tsv | Sample metadata and quality metrics |
Audit Trail
Below are files that can assist you in understanding which parameters and program versions were used.
Logs
Each process that is executed will have a folder named logs. In this folder are helpful
files for you to review if the need ever arises.
| Extension | Description |
|---|---|
| .begin | An empty file used to designate the process started |
| .err | Contains STDERR outputs from the process |
| .log | Contains both STDERR and STDOUT outputs from the process |
| .out | Contains STDOUT outputs from the process |
| .run | The script Nextflow uses to stage/unstage files and queue processes based on given profile |
| .sh | The script executed by bash for the process |
| .trace | The Nextflow trace report for the process |
| versions.yml | A YAML formatted file with program versions |
Nextflow Reports
These Nextflow reports provide a great summary of your run. They can be used to optimize resource usage and estimate expected costs if using cloud platforms.
| Filename | Description |
|---|---|
| bactopia-dag.dot | The Nextflow DAG visualization |
| bactopia-report.html | The Nextflow Execution Report |
| bactopia-timeline.html | The Nextflow Timeline Report |
| bactopia-trace.txt | The Nextflow Trace report |
Parameters
Required Parameters
The following parameters are how you will provide either local or remote samples to be processed by Bactopia.
| Parameter | Type | Default | Description |
|---|---|---|---|
| --samples | string | | A FOFN (via bactopia prepare) with sample names and paths to FASTQ/FASTAs to process |
| --r1 | string | | First set of compressed (gzip) Illumina paired-end FASTQ reads (requires --r2 and --sample) |
| --r2 | string | | Second set of compressed (gzip) Illumina paired-end FASTQ reads (requires --r1 and --sample) |
| --se | string | | Compressed (gzip) Illumina single-end FASTQ reads (requires --sample) |
| --ont | string | | Compressed (gzip) Oxford Nanopore FASTQ reads (requires --sample) |
| --hybrid | boolean | false | Create hybrid assembly using Unicycler (requires --r1, --r2, --ont and --sample) |
| --short_polish | boolean | false | Create hybrid assembly from long-read assembly and short-read polishing (requires --r1, --r2, --ont and --sample) |
| --sample | string | | Sample name to use for the input sequences |
| --accessions | string | | A file containing ENA/SRA Experiment accessions or NCBI Assembly accessions to be processed |
| --accession | string | | A single ENA/SRA Experiment accession or NCBI Assembly accession to be processed |
| --assembly | string | | An assembled genome in compressed FASTA format (requires --sample) |
| --check_samples | boolean | false | Validate the input FOFN provided by --samples |
AMRFinder+ Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --amrfinderplus_ident_min | number | -1 | Minimum proportion of identical amino acids in alignment for hit (0..1) |
| --amrfinderplus_coverage_min | number | 0.5 | Minimum coverage of the reference protein (0..1) |
| --amrfinderplus_organism | string | | Taxonomy group to run additional screens against |
| --amrfinderplus_translation_table | integer | 11 | NCBI genetic code for translated BLAST |
| --amrfinderplus_noplus | boolean | false | Disable running AMRFinder+ with the --plus option |
| --amrfinderplus_report_common | boolean | false | Report proteins common to a taxonomy group |
| --amrfinderplus_report_all_equal | boolean | false | Report all equally-scoring BLAST and HMM matches |
| --amrfinderplus_opts | string | | Extra AMRFinder+ options in quotes. |
| --amrfinderplus_db | string | | A custom AMRFinder+ database to use, either a tarball or a folder |
csvtk concat Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --csvtk_concat_opts | string | | Extra csvtk concat options in quotes |
Assembler Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --shovill_assembler | string | skesa | Assembler to be used by Shovill (choices: skesa, megahit, spades, velvet) |
| --dragonflye_assembler | string | flye | Assembler to be used by Dragonflye (choices: flye, miniasm, raven) |
| --use_unicycler | boolean | | Use Unicycler for paired-end assembly |
| --min_contig_len | integer | 500 | Minimum contig length <0=AUTO> |
| --min_contig_cov | integer | 2 | Minimum contig coverage <0=AUTO> |
| --contig_namefmt | string | | Format of contig FASTA IDs in 'printf' style |
| --shovill_opts | string | | Extra assembler options in quotes for Shovill |
| --shovill_kmers | string | | K-mers to use <blank=AUTO> |
| --dragonflye_opts | string | | Extra assembler options in quotes for Dragonflye |
| --trim | boolean | | Enable adaptor trimming |
| --no_stitch | boolean | | Disable read stitching for paired-end reads |
| --no_corr | boolean | | Disable post-assembly correction |
| --unicycler_mode | string | normal | Bridging mode used by Unicycler (choices: conservative, normal, bold) |
| --min_component_size | integer | 1000 | Graph components smaller than this size (bp) will be removed from the final graph |
| --min_dead_end_size | integer | 1000 | Graph dead ends smaller than this size (bp) will be removed from the final graph |
| --nanohq | boolean | false | For Flye, use '--nano-hq' instead of '--nano-raw' |
| --medaka_model | string | | The model to use for Medaka polishing |
| --medaka_rounds | integer | 0 | The number of Medaka polishing rounds to conduct |
| --racon_rounds | integer | 1 | The number of Racon polishing rounds to conduct |
| --no_polish | boolean | | Skip the assembly polishing step |
| --no_miniasm | boolean | | Skip miniasm+Racon bridging |
| --no_rotate | boolean | | Do not rotate completed replicons to start at a standard gene |
| --reassemble | boolean | false | If reads were simulated, they will be used to create a new assembly. |
| --polypolish_rounds | integer | 1 | Number of polishing rounds to conduct with Polypolish for short read polishing |
| --pilon_rounds | integer | 0 | Number of polishing rounds to conduct with Pilon for short read polishing |
Gather Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --skip_fastq_check | boolean | | Skip minimum requirement checks for input FASTQs |
| --min_basepairs | integer | 2241820 | The minimum amount of basepairs required to continue downstream analyses. |
| --min_reads | integer | 7472 | The minimum amount of reads required to continue downstream analyses. |
| --min_coverage | integer | 10 | The minimum amount of coverage required to continue downstream analyses. |
| --min_proportion | number | 0.5 | The minimum proportion of basepairs for paired-end reads to continue downstream analyses. |
| --min_genome_size | integer | 100000 | The minimum estimated genome size allowed for the input sequence to continue downstream analyses. |
| --max_genome_size | integer | 18040666 | The maximum estimated genome size allowed for the input sequence to continue downstream analyses. |
| --attempts | integer | 3 | Maximum times to attempt downloads |
| --use_ena | boolean | | Download FASTQs from ENA |
| --no_cache | boolean | | Skip caching the assembly summary file from ncbi-genome-download |
Sketcher Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --sketch_size | integer | 10000 | Sketch size. Each sketch will have at most this many non-redundant min-hashes. |
| --sourmash_scale | integer | 10000 | Choose number of hashes as 1 in FRACTION of input k-mers |
| --no_winner_take_all | boolean | | Disable winner-takes-all strategy for identity estimates |
| --screen_i | number | 0.8 | Minimum identity to report. |
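The two size parameters behave differently: the sketch size is a fixed cap on the number of hashes, while the scale keeps roughly one hash per that many k-mers, so the result grows with genome size. The sketch below works through that arithmetic with hypothetical function names. The scaled figure is only an expected value, since which hashes are kept depends on the hash function.

```python
def mash_sketch_hashes(distinct_kmers, sketch_size=10000):
    """A fixed-size sketch keeps at most sketch_size hashes."""
    return min(distinct_kmers, sketch_size)


def expected_scaled_hashes(distinct_kmers, scale=10000):
    """A scaled sketch keeps about 1 in `scale` of the distinct k-mers."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return distinct_kmers // scale
```

For a genome with about 5,000,000 distinct k-mers, the fixed-size sketch holds 10,000 hashes and the scaled one roughly 500.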
MLST Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --mlst_scheme | string | | Don't autodetect, force this scheme on all inputs |
| --mlst_minid | integer | 95 | Minimum DNA percent identity of full allele to consider 'similar' |
| --mlst_mincov | integer | 10 | Minimum DNA percent coverage to report partial allele at all |
| --mlst_minscore | integer | 50 | Minimum score out of 100 to match a scheme |
| --mlst_nopath | boolean | false | Strip filename paths from FILE column |
| --mlst_db | string | | A custom MLST database to use, either a tarball or a directory |
QC Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --use_bbmap | boolean | | Illumina reads will be QC'd using BBMap |
| --use_porechop | boolean | false | Use Porechop to remove adapters from ONT reads |
| --skip_qc | boolean | | The QC step will be skipped and it will be assumed the inputs sequences have already been QCed. |
| --skip_qc_plots | boolean | | QC Plot creation by FastQC or Nanoplot will be skipped |
| --skip_error_correction | boolean | | FLASH error correction of reads will be skipped. |
| --adapters | string | | A FASTA file containing adapters to remove |
| --adapter_k | integer | 23 | Kmer length used for finding adapters. |
| --phix | string | | phiX174 reference genome to remove |
| --phix_k | integer | 31 | Kmer length used for finding phiX174. |
| --ktrim | string | r | Trim reads to remove bases matching reference kmers (choices: f, r, l) |
| --mink | integer | 11 | Look for shorter kmers at read tips down to this length, when k-trimming or masking. |
| --hdist | integer | 1 | Maximum Hamming distance for ref kmers (subs only) |
| --tpe | string | t | When kmer right-trimming, trim both reads to the minimum length of either (choices: f, t) |
| --tbo | string | t | Trim adapters based on where paired reads overlap (choices: f, t) |
| --qtrim | string | rl | Trim read ends to remove bases with quality below trimq. (choices: rl, f, r, l, w) |
| --trimq | integer | 6 | Regions with average quality BELOW this will be trimmed if qtrim is set to something other than f |
| --maq | integer | 10 | Reads with average quality (after trimming) below this will be discarded |
| --minlength | integer | 35 | Reads shorter than this after trimming will be discarded |
| --ftm | integer | 5 | If positive, right-trim length to be equal to zero, modulo this number |
| --tossjunk | string | t | Discard reads with invalid characters as bases (choices: f, t) |
| --ain | string | f | When detecting pair names, allow identical names (choices: f, t) |
| --qout | string | 33 | PHRED offset to use for output FASTQs (choices: 33, 64) |
| --maxcor | integer | 1 | Max number of corrections within a 20bp window |
| --sampleseed | integer | 42 | Set to a positive number to use as the random number generator seed for sampling |
| --ont_minlength | integer | 1000 | ONT Reads shorter than this will be discarded |
| --ont_minqual | integer | 0 | Minimum average read quality filter of ONT reads |
| --porechop_opts | string | | Extra Porechop options in quotes |
| --nanoplot_opts | string | | Extra NanoPlot options in quotes |
| --bbduk_opts | string | | Extra BBDuk options in quotes |
| --fastp_opts | string | | Extra fastp options in quotes |
Bakta Download Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --bakta_db | string | | Tarball or path to the Bakta database |
| --bakta_db_type | string | full | Which Bakta DB to download 'full' (~30GB) or 'light' (~2GB) (choices: full, light) |
| --bakta_save_as_tarball | boolean | false | Save the Bakta database as a tarball |
| --download_bakta | boolean | false | Download the Bakta database to the path given by --bakta_db |
Bakta Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --bakta_proteins | string | | FASTA file of trusted proteins to first annotate from |
| --bakta_prodigal_tf | string | | Training file to use for Prodigal |
| --bakta_replicons | string | | Replicon information table (tsv/csv) |
| --bakta_min_contig_length | integer | 1 | Minimum contig size to annotate |
| --bakta_keep_contig_headers | boolean | false | Keep original contig headers |
| --bakta_compliant | boolean | false | Force GenBank/ENA/DDBJ compliance |
| --bakta_skip_trna | boolean | false | Skip tRNA detection & annotation |
| --bakta_skip_tmrna | boolean | false | Skip tmRNA detection & annotation |
| --bakta_skip_rrna | boolean | false | Skip rRNA detection & annotation |
| --bakta_skip_ncrna | boolean | false | Skip ncRNA detection & annotation |
| --bakta_skip_ncrna_region | boolean | false | Skip ncRNA region detection & annotation |
| --bakta_skip_crispr | boolean | false | Skip CRISPR array detection & annotation |
| --bakta_skip_cds | boolean | false | Skip CDS detection & annotation |
| --bakta_skip_sorf | boolean | false | Skip sORF detection & annotation |
| --bakta_skip_gap | boolean | false | Skip gap detection & annotation |
| --bakta_skip_ori | boolean | false | Skip oriC/oriT detection & annotation |
| --bakta_opts | string | | Extra Bakta options in quotes. Example: '--gram +' |
Prokka Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --prokka_proteins | string | ${projectDir}/data/proteins.faa | FASTA file of trusted proteins to first annotate from |
| --prokka_prodigal_tf | string | | Training file to use for Prodigal |
| --prokka_compliant | boolean | false | Force GenBank/ENA/DDBJ compliance |
| --prokka_centre | string | Bactopia | Sequencing centre ID |
| --prokka_coverage | integer | 80 | Minimum coverage on query protein |
| --prokka_evalue | string | 1e-09 | Similarity e-value cut-off |
| --prokka_opts | string | | Extra Prokka options in quotes. |
| --prokka_debug | boolean | false | Enable debug mode for Prokka |
mashdist Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --mash_sketch | string | | The reference sequence as a Mash Sketch (.msh file) |
| --mash_seed | integer | 42 | Seed to provide to the hash function |
| --mash_table | boolean | false | Table output (fields will be blank if they do not meet the p-value threshold) |
| --mash_m | integer | 1 | Minimum copies of each k-mer required to pass noise filter for reads |
| --mash_w | number | 0.01 | Probability threshold for warning about low k-mer size. |
| --mash_max_p | number | 1.0 | Maximum p-value to report. |
| --mash_max_dist | number | 1.0 | Maximum distance to report. |
| --merlin_dist | number | 0.1 | Maximum distance to report when using Merlin. |
| --full_merlin | boolean | false | Go full Merlin and run all species-specific tools, no matter the Mash distance |
| --mash_use_fastqs | boolean | false | Query with FASTQs instead of the assemblies |
ClermonTyping Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --clermontyping_threshold | integer | 0 | Do not use contigs under this size |
ECTyper Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --ectyper_opid | integer | 90 | Percent identity required for an O antigen allele match |
| --ectyper_opcov | integer | 90 | Minimum percent coverage required for an O antigen allele match |
| --ectyper_hpid | integer | 95 | Percent identity required for an H antigen allele match |
| --ectyper_hpcov | integer | 50 | Minimum percent coverage required for an H antigen allele match |
| --ectyper_verify | boolean | false | Enable E. coli species verification |
| --ectyper_print_alleles | boolean | false | Prints the allele sequences if enabled as the final column |
emmtyper Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --emmtyper_wf | string | blast | Workflow for emmtyper to use. (choices: blast, pcr) |
| --emmtyper_blastdb | string | | Path to custom EMM BLAST DB. |
| --emmtyper_cluster_distance | integer | 500 | Distance between cluster of matches to consider as different clusters |
| --emmtyper_percid | integer | 95 | Minimal percent identity of sequence |
| --emmtyper_culling_limit | integer | 5 | Total hits to return in a position |
| --emmtyper_mismatch | integer | 5 | Threshold for number of mismatches to allow in BLAST hit |
| --emmtyper_align_diff | integer | 5 | Threshold for difference between alignment length and subject length in BLAST |
| --emmtyper_gap | integer | 2 | Threshold gap to allow in BLAST hit |
| --emmtyper_min_perfect | integer | 15 | Minimum size of perfect match at 3' primer end |
| --emmtyper_min_good | integer | 15 | Minimum size where there must be 2 matches for each mismatch |
| --emmtyper_max_size | integer | 2000 | Maximum size of PCR product |
hicap Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--hicap_database_dir | string | | Directory containing locus database |
--hicap_model_fp | string | | Path to prodigal model |
--hicap_full_sequence | boolean | false | Write the full input sequence out to the genbank file rather than just the region surrounding and including the locus |
--hicap_debug | boolean | false | hicap will print debug messages |
--hicap_gene_coverage | number | 0.8 | Minimum percentage coverage to consider a single gene complete |
--hicap_gene_identity | number | 0.7 | Minimum percentage identity to consider a single gene complete |
--hicap_broken_gene_length | integer | 60 | Minimum length to consider a broken gene |
--hicap_broken_gene_identity | number | 0.8 | Minimum percentage identity to consider a broken gene |
Mykrobe Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--mykrobe_species | string | | Species panel to use (choices: sonnei, staph, tb, typhi) |
--mykrobe_kmer | integer | 21 | K-mer length |
--mykrobe_min_depth | integer | 1 | Minimum depth |
--mykrobe_model | string | kmer_count | Genotype model used. (choices: kmer_count, median_depth) |
--mykrobe_report_all_calls | boolean | false | Report all calls |
--mykrobe_opts | string | | Extra Mykrobe options in quotes |
GenoTyphi Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--genotyphi_kmer | integer | 21 | K-mer length |
--genotyphi_min_depth | integer | 1 | Minimum depth |
--genotyphi_model | string | kmer_count | Genotype model used. (choices: kmer_count, median_depth) |
--genotyphi_report_all_calls | boolean | false | Report all calls |
--genotyphi_mykrobe_opts | string | | Extra Mykrobe options in quotes |
Kleborate Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--kleborate_preset | string | kpsc | Preset module to use for Kleborate (choices: kpsc, kosc, escherichia) |
--kleborate_opts | string | | Extra options in quotes for Kleborate |
legsta Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--legsta_noheader | boolean | false | Don't print header row |
LisSero Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--lissero_min_id | number | 95.0 | Minimum percent identity to accept a match |
--lissero_min_cov | number | 95.0 | Minimum coverage of the gene to accept a match |
ngmaster Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--ngmaster_csv | boolean | false | Output comma-separated format (CSV) rather than tab-separated |
pasty Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--pasty_min_pident | integer | 95 | Minimum percent identity to count a hit |
--pasty_min_coverage | integer | 95 | Minimum percent coverage to count a hit |
pbptyper Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--pbptyper_min_pident | integer | 95 | Minimum percent identity to count a hit |
--pbptyper_min_coverage | integer | 95 | Minimum percent coverage to count a hit |
SeqSero2 Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--seqsero2_run_mode | string | k | Workflow to run. 'a' allele mode, or 'k' k-mer mode (choices: a, k) |
--seqsero2_input_type | string | assembly | Input format to analyze. 'assembly' or 'fastq' (choices: assembly, fastq) |
--seqsero2_bwa_mode | string | mem | Algorithms for bwa mapping for allele mode (choices: mem, sam) |
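
Several parameters in these tables accept only a fixed set of choices. A rough sketch of checking values against those choices before launching a run is given below, using the SeqSero2 options as the example. The names `SEQSERO2_CHOICES` and `invalid_choices` are hypothetical, and this does not reproduce the schema validation that `--validate_params` performs.

```python
# Allowed values, copied from the "choices" listed in the SeqSero2 table.
SEQSERO2_CHOICES = {
    "seqsero2_run_mode": {"a", "k"},
    "seqsero2_input_type": {"assembly", "fastq"},
    "seqsero2_bwa_mode": {"mem", "sam"},
}


def invalid_choices(params, choices=SEQSERO2_CHOICES):
    """Return {name: value} for each parameter whose value is not allowed.

    Parameters that have no entry in `choices` are ignored.
    """
    return {
        name: value
        for name, value in params.items()
        if name in choices and value not in choices[name]
    }


if __name__ == "__main__":
    requested = {"seqsero2_run_mode": "k", "seqsero2_bwa_mode": "aln"}
    print(invalid_choices(requested))
```
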
SeroBA Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--seroba_noclean | boolean | false | Do not clean up intermediate files |
--seroba_coverage | integer | 20 | Threshold for k-mer coverage of the reference sequence |
SISTR Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--sistr_full_cgmlst | boolean | false | Use the full set of cgMLST alleles which can include highly similar alleles |
AgrVATE Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--agrvate_typing_only | boolean | false | agr typing only. Skips agr operon extraction and frameshift detection |
spaTyper Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--spatyper_repeats | string | | List of spa repeats |
--spatyper_repeat_order | string | | List spa types and order of repeats |
--spatyper_do_enrich | boolean | false | Do PCR product enrichment |
sccmec Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--sccmec_min_targets_pident | integer | 90 | Minimum percent identity to count a target hit |
--sccmec_min_targets_coverage | integer | 80 | Minimum percent coverage to count a target hit |
--sccmec_min_regions_pident | integer | 85 | Minimum percent identity to count a region hit |
--sccmec_min_regions_coverage | integer | 93 | Minimum percent coverage to count a region hit |
STECFinder Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--stecfinder_use_reads | boolean | false | Paired-end Illumina reads will be used instead of assemblies |
--stecfinder_hits | boolean | false | Show detailed gene search results |
--stecfinder_cutoff | number | 10.0 | Minimum read coverage for gene to be called |
--stecfinder_length | number | 50.0 | Percentage of gene length needed for positive call |
--stecfinder_ipah_length | number | 10.0 | Percentage of ipaH gene length needed for positive gene call |
--stecfinder_ipah_depth | number | 1.0 | Minimum depth for positive ipaH gene call (requires --stecfinder_use_reads) |
--stecfinder_stx_length | number | 10.0 | Percentage of stx gene length needed for positive gene call |
--stecfinder_stx_depth | number | 1.0 | Minimum depth for positive stx gene call (requires --stecfinder_use_reads) |
--stecfinder_o_length | number | 60.0 | Percentage of wz_ gene length needed for positive call |
--stecfinder_o_depth | number | 1.0 | Minimum depth for positive wz_ gene call (requires --stecfinder_use_reads) |
--stecfinder_h_length | number | 60.0 | Percentage of fliC gene length needed for positive call |
--stecfinder_h_depth | number | 1.0 | Minimum depth for positive fliC gene call (requires --stecfinder_use_reads) |
TB-Profiler Profile Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--tbprofiler_call_whole_genome | boolean | false | Call whole genome |
--tbprofiler_mapper | string | bwa | Mapping tool to use. If you are using nanopore data it will default to minimap2 (choices: bwa, minimap2, bowtie2, bwa-mem2) |
--tbprofiler_caller | string | freebayes | Variant calling tool to use (choices: bcftools, gatk, freebayes) |
--tbprofiler_calling_params | string | | Extra variant caller options in quotes |
--tbprofiler_suspect | boolean | false | Use the suspect suite of tools to add ML predictions |
--tbprofiler_no_flagstat | boolean | false | Don't collect flagstats |
--tbprofiler_no_delly | boolean | false | Don't run delly |
--tbprofiler_opts | string | | Extra options in quotes for TBProfiler |
TB-Profiler Collate Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
--tbprofiler_itol | boolean | false | Generate itol config files |
--tbprofiler_full | boolean | false | Output mutations in main result file |
--tbprofiler_all_variants | boolean | false | Output all variants in variant matrix |
--tbprofiler_mark_missing | boolean | false | An asterisk will be used to mark predictions which are affected by missing data at a drug resistance position |
Dataset Parameters
Define where the pipeline should find input data and save output data.
| Parameter | Type | Default | Description |
|---|---|---|---|
--species | string | | Name of species for species-specific dataset to use |
--ask_merlin | boolean | | Ask Merlin to execute species-specific Bactopia tools based on Mash distances |
--coverage | integer | 100 | Reduce samples to a given coverage, requires a genome size |
--genome_size | integer | 0 | Expected genome size (bp) for all samples, required for read error correction and read subsampling |
--use_bakta | boolean | | Use Bakta for annotation, instead of Prokka |
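
Because `--coverage` depends on `--genome_size`, the arithmetic behind coverage-based subsampling is worth spelling out: the target number of bases is coverage multiplied by genome size, and the fraction of reads to keep is that target divided by the total bases sequenced. The sketch below assumes that a genome size of 0 (the default) means no subsampling takes place; the function name `subsample_fraction` is hypothetical and the QC tools may compute this differently.

```python
def subsample_fraction(total_bp, genome_size, coverage=100):
    """Fraction of sequenced bases to keep to reach the requested coverage.

    Assumption: a missing genome size (0) or an empty sample leaves the
    reads untouched, expressed here as a fraction of 1.0.
    """
    if genome_size <= 0 or total_bp <= 0:
        return 1.0
    target_bp = coverage * genome_size
    # Never keep more than everything that was sequenced.
    return min(1.0, target_bp / total_bp)


if __name__ == "__main__":
    # 1 Gbp of reads for a 2.8 Mbp genome, reduced to 100x coverage.
    print(subsample_fraction(total_bp=1_000_000_000, genome_size=2_800_000))
```
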
Optional Parameters
These optional parameters can be useful in certain settings.
| Parameter | Type | Default | Description |
|---|---|---|---|
--outdir | string | bactopia | Base directory to write results to |
--skip_compression | boolean | false | Output files will not be compressed |
--datasets | string | | The path to cache datasets to |
--keep_all_files | boolean | false | Keeps all analysis files created |
Max Job Request Parameters
Set the top limit for requested resources for any single job.
| Parameter | Type | Default | Description |
|---|---|---|---|
--max_retry | integer | 3 | Maximum times to retry a process before allowing it to fail. |
--max_cpus | integer | 4 | Maximum number of CPUs that can be requested for any single job. |
--max_memory | string | 128.GB | Maximum amount of memory that can be requested for any single job. |
--max_time | string | 240.h | Maximum amount of time that can be requested for any single job. |
--max_downloads | integer | 3 | Maximum number of samples to download at a time |
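
The idea of a "top limit" can be illustrated with memory: a job's request is compared with `--max_memory` and the smaller of the two is used. The sketch below parses strings written in the `128.GB` style shown in the table. The helpers `parse_memory` and `cap_memory` are hypothetical, and the 1024-based unit sizes are an assumption about how the values should be interpreted rather than something stated in this document.

```python
import re

# Assumption: binary multiples (1 KB = 1024 bytes).
_UNITS = {"B": 1, "KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_memory(value):
    """Convert a string such as '128.GB' or '8 GB' to a number of bytes."""
    match = re.fullmatch(r"(\d+)\s*\.?\s*([KMGT]?B)", value.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognised memory value: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2).upper()]


def cap_memory(requested, max_memory="128.GB"):
    """Return the request unchanged, or the maximum when the request exceeds it."""
    if parse_memory(requested) <= parse_memory(max_memory):
        return requested
    return max_memory


if __name__ == "__main__":
    print(cap_memory("256.GB"))
    print(cap_memory("16.GB"))
```
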
Nextflow Configuration Parameters
Parameters to fine-tune your Nextflow setup.
| Parameter | Type | Default | Description |
|---|---|---|---|
--nfconfig | string | | A Nextflow-compatible config file for custom profiles. It is loaded last and overwrites existing variables if set. |
--publish_dir_mode | string | copy | Method used to save pipeline results to output directory. (choices: symlink, rellink, link, copy, copyNoFollow, move) |
--infodir | string | ${params.outdir}/pipeline_info | Directory to keep pipeline Nextflow logs and reports. |
--force | boolean | false | Nextflow will overwrite existing output files. |
--cleanup_workdir | boolean | false | After Bactopia is successfully executed, the work directory will be deleted. |
Institutional config options
Parameters used to describe centralized config profiles. These should not be edited.
| Parameter | Type | Default | Description |
|---|---|---|---|
--custom_config_version | string | master | Git commit id for Institutional configs. |
--custom_config_base | string | https://raw.githubusercontent.com/nf-core/configs/master | Base directory for Institutional configs. |
--config_profile_name | string | | Institutional config name. |
--config_profile_description | string | | Institutional config description. |
--config_profile_contact | string | | Institutional config contact information. |
--config_profile_url | string | | Institutional config URL link. |
Nextflow Profile Parameters
Parameters to fine-tune your Nextflow setup.
| Parameter | Type | Default | Description |
|---|---|---|---|
--condadir | string | | Directory Nextflow should use for Conda environments |
--registry | string | quay.io | Registry to pull Docker containers from. |
--datasets_cache | string | <HOME>/.bactopia/datasets | Directory where downloaded datasets should be stored. |
--singularity_cache | string | | Directory where remote Singularity images are stored. |
--singularity_pull_docker_container | boolean | | Force the workflow to pull and convert Docker containers instead of directly downloading Singularity images. |
--force_rebuild | boolean | false | Force overwrite of existing pre-built environments. |
--queue | string | general,high-memory | Comma-separated name of the queue(s) to be used by a job scheduler (e.g. AWS Batch or SLURM) |
--cluster_opts | string | | Additional options to pass to the executor (e.g. SLURM: '--account=my_acct_name') |
--container_opts | string | | Additional options to pass to Apptainer, Docker, or Singularity (e.g. Singularity: '-D pwd') |
--disable_scratch | boolean | false | All intermediate files created on worker nodes will be transferred to the head node. |
Helpful Parameters
Uncommonly used parameters that might be useful.
| Parameter | Type | Default | Description |
|---|---|---|---|
--monochrome_logs | boolean | | Do not use coloured log outputs. |
--nfdir | boolean | | Print directory Nextflow has pulled Bactopia to |
--sleep_time | integer | 5 | The amount of time (seconds) Nextflow will wait after setting up datasets before execution. |
--validate_params | boolean | true | Boolean whether to validate parameters against the schema at runtime |
--help | boolean | | Display help text. |
--wf | string | bactopia | Specify which workflow or Bactopia Tool to execute |
--list_wfs | boolean | | List the available workflows and Bactopia Tools to use with '--wf' |
--show_hidden_params | boolean | | Show all params when using --help |
--help_all | boolean | | An alias for --help --show_hidden_params |
--version | boolean | | Display version text. |
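
When runs are scripted, the parameters in these tables can be assembled into a command line programmatically. The sketch below is one possible way to do that: the helper `build_command` is hypothetical, it assumes the executable is called `bactopia`, and it assumes boolean parameters are passed as bare flags. Input parameters are left out because they are not covered in this section.

```python
def build_command(params, executable="bactopia"):
    """Turn a {name: value} mapping into an argument list.

    True becomes a bare flag, False and None are skipped, and every
    other value is passed as '--name value'.
    """
    argv = [executable]
    for name, value in params.items():
        if value is True:
            argv.append(f"--{name}")
        elif value is False or value is None:
            continue
        else:
            argv.extend([f"--{name}", str(value)])
    return argv


if __name__ == "__main__":
    command = build_command(
        {
            "wf": "bactopia",
            "outdir": "results",
            "max_cpus": 8,
            "cleanup_workdir": True,
            "force": False,
        }
    )
    print(" ".join(command))
```

Returning a list rather than a single string means the result can be handed to `subprocess.run` without shell quoting concerns.
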
Composition
This workflow uses the following subworkflows:
- amrfinderplus - Find antimicrobial resistance genes and point mutations.
- bactopia_assembler - Assemble bacterial genomes using automated assembler selection.
- bactopia_datasets - Download and provide pre-compiled datasets required by Bactopia.
- bactopia_gather - Search, validate, gather, and standardize input samples.
- bactopia_qc - Perform comprehensive quality control on sequencing reads.
- bactopia_sketcher - Create genomic sketches and perform rapid taxonomic classification.
- bakta - Rapid bacterial genome annotation.
- merlin - MinER assisted species-specific bactopia tool seLectIoN.
- mlst - Determine multilocus sequence types (MLST) from bacterial assemblies.
- prokka - Annotate bacterial genomes with functional information.