bactopia_qc
Tags: fastq qc adapter-removal error-correction subsampling fastp bbduk lighter porechop nanoq fastqc nanoplot sample-scope
Automated quality control, error correction, and read subsampling.
A comprehensive QC pipeline that adapts to the input read type:
- Illumina: Adapter/PhiX removal (fastp or BBDuk), Error Correction (Lighter), and Subsampling (Rasusa)
- Nanopore: Adapter removal (Porechop), Quality filtering (Nanoq), and Subsampling (Rasusa)
- Hybrid: Processes both short and long reads through their respective pipelines
- Assembly: Passes through simulated reads from assemblies
Generates quality metrics using fastq-scan and optional quality reports using FastQC (Illumina) and NanoPlot (ONT).
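The branching by read type described above can be pictured as a simple lookup. This is a minimal Python sketch, not Bactopia's implementation: the keys `illumina`, `nanopore`, `hybrid`, and `assembly` and the function `stages_for` are hypothetical names chosen for illustration, and the stage labels are copied from the list above.

```python
from typing import List

# Illustrative only: stage labels mirror the description above. The keys and
# the function name are hypothetical, not identifiers used by Bactopia.
ILLUMINA_STAGES = [
    "adapter/PhiX removal (fastp or BBDuk)",
    "error correction (Lighter)",
    "subsampling (Rasusa)",
]
NANOPORE_STAGES = [
    "adapter removal (Porechop)",
    "quality filtering (Nanoq)",
    "subsampling (Rasusa)",
]


def stages_for(read_type: str) -> List[str]:
    """Return the QC stages applied to a given read type."""
    key = read_type.lower()
    if key == "illumina":
        return list(ILLUMINA_STAGES)
    if key == "nanopore":
        return list(NANOPORE_STAGES)
    if key == "hybrid":
        # Short and long reads each go through their respective pipeline.
        return ILLUMINA_STAGES + NANOPORE_STAGES
    if key == "assembly":
        return ["pass through simulated reads"]
    raise ValueError(f"unknown read type: {read_type!r}")


if __name__ == "__main__":
    for name in ("illumina", "nanopore", "hybrid", "assembly"):
        print(name, "->", stages_for(name))
```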
Inputs
record (
meta: Record,
r1: Path?,
r2: Path?,
se: Path?,
lr: Path?,
fna: Path?
)
| Field | Type | Description |
|---|---|---|
| meta | Record | Groovy Record containing sample information (must include runtype, genome_size, species) |
| r1 | Path? | Illumina R1 reads (paired-end forward) |
| r2 | Path? | Illumina R2 reads (paired-end reverse) |
| se | Path? | Single-end Illumina reads |
| lr | Path? | Long reads (ONT) |
| fna | Path? | Assembly file (FASTA) for assembly-based simulations |
adapters: Path?
phix: Path?
| Name | Type | Description |
|---|---|---|
| adapters | Path? | Filepath for custom adapter sequences (FASTA) |
| phix | Path? | Filepath for custom PhiX sequences (FASTA) |
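To make the shape of the input record concrete, here is a rough Python analogue. It is only a sketch under stated assumptions: `QcInput`, `REQUIRED_META_KEYS`, and `missing_meta_keys` are hypothetical names, and the module itself receives a Nextflow record, not a Python object. The only rule encoded is the one stated in the table: `meta` must include `runtype`, `genome_size`, and `species`.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List, Optional

# Keys the field table says `meta` must include.
REQUIRED_META_KEYS = ("runtype", "genome_size", "species")


@dataclass
class QcInput:
    """Hypothetical Python mirror of the input record (illustration only)."""

    meta: Dict[str, object]
    r1: Optional[Path] = None   # Illumina R1 reads (paired-end forward)
    r2: Optional[Path] = None   # Illumina R2 reads (paired-end reverse)
    se: Optional[Path] = None   # single-end Illumina reads
    lr: Optional[Path] = None   # long reads (ONT)
    fna: Optional[Path] = None  # assembly file (FASTA)

    def missing_meta_keys(self) -> List[str]:
        """List required meta keys that are absent, in documented order."""
        return [key for key in REQUIRED_META_KEYS if key not in self.meta]


if __name__ == "__main__":
    # The runtype string and file names below are placeholders, not values
    # taken from the Bactopia documentation.
    sample = QcInput(
        meta={"runtype": "example-runtype", "genome_size": 2800000},
        r1=Path("sample_R1.fastq.gz"),
        r2=Path("sample_R2.fastq.gz"),
    )
    print(sample.missing_meta_keys())
```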
Outputs
record (
meta: Record,
r1: Path?,
r2: Path?,
se: Path?,
lr: Path?,
fna: Path?,
reads_grouped: Set<Path?>,
error: Set<Path?>,
skipped: Path?,
results: Set<Path>,
logs: Set<Path?>,
nf_logs: Set<Path>,
versions: Set<Path>
)
| Field | Type | Description |
|---|---|---|
| meta | Record | Sample information record |
| r1 | Path? | QC'd Illumina R1 reads (paired-end forward) |
| r2 | Path? | QC'd Illumina R2 reads (paired-end reverse) |
| se | Path? | QC'd single-end Illumina reads |
| lr | Path? | QC'd long reads (ONT) |
| fna | Path? | Assembly file (FASTA) |
| reads_grouped | Set<Path?> | All output FASTQs for publishing |
| error | Set<Path?> | Captured error messages if QC failed (e.g., reads empty after trimming) |
| skipped | Path? | Marker file indicating QC was skipped for this sample |
| results | Set<Path> | All output files to be published |
| logs | Set<Path?> | Optional program-specific log files |
| nf_logs | Set<Path> | Nextflow-specific log files (e.g. .command.{begin |
| versions | Set<Path> | A YAML formatted file with program versions |
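The `error` and `skipped` fields are what tell a downstream step how QC ended for a sample. The sketch below shows one way to interpret them in Python. The function name `qc_status`, the status strings, and the order in which the two fields are checked are all assumptions for illustration and are not taken from Bactopia's code.

```python
from typing import Iterable, Optional


def qc_status(error: Iterable[Optional[str]], skipped: Optional[str]) -> str:
    """Classify a QC result from the `error` and `skipped` output fields.

    `error` stands in for Set<Path?> and may contain None entries;
    `skipped` stands in for Path?.
    """
    # Any captured error file means QC failed (e.g. reads empty after trimming).
    if any(path is not None for path in error):
        return "failed"
    # A marker file means QC was skipped for this sample.
    if skipped is not None:
        return "skipped"
    return "passed"


if __name__ == "__main__":
    # File names are placeholders for illustration.
    print(qc_status(error=[None], skipped=None))
    print(qc_status(error=["sample-error.txt"], skipped=None))
    print(qc_status(error=[], skipped="sample.skipped"))
```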
Parameters
QC Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| --use_bbmap | boolean | | Illumina reads will be QC'd using BBMap |
| --use_porechop | boolean | false | Use Porechop to remove adapters from ONT reads |
| --skip_qc | boolean | | The QC step will be skipped and the input sequences will be assumed to have already been QC'd |
| --skip_qc_plots | boolean | | QC plot creation by FastQC or NanoPlot will be skipped |
| --skip_error_correction | boolean | | Error correction of reads (Lighter) will be skipped. |
| --adapters | string | | A FASTA file containing adapters to remove |
| --adapter_k | integer | 23 | Kmer length used for finding adapters. |
| --phix | string | | phiX174 reference genome to remove |
| --phix_k | integer | 31 | Kmer length used for finding phiX174. |
| --ktrim | string | r | Trim reads to remove bases matching reference kmers (choices: f, r, l) |
| --mink | integer | 11 | Look for shorter kmers at read tips down to this length, when k-trimming or masking. |
| --hdist | integer | 1 | Maximum Hamming distance for ref kmers (subs only) |
| --tpe | string | t | When kmer right-trimming, trim both reads to the minimum length of either (choices: f, t) |
| --tbo | string | t | Trim adapters based on where paired reads overlap (choices: f, t) |
| --qtrim | string | rl | Trim read ends to remove bases with quality below trimq. (choices: rl, f, r, l, w) |
| --trimq | integer | 6 | Regions with average quality BELOW this will be trimmed if qtrim is set to something other than f |
| --maq | integer | 10 | Reads with average quality (after trimming) below this will be discarded |
| --minlength | integer | 35 | Reads shorter than this after trimming will be discarded |
| --ftm | integer | 5 | If positive, right-trim length to be equal to zero, modulo this number |
| --tossjunk | string | t | Discard reads with invalid characters as bases (choices: f, t) |
| --ain | string | f | When detecting pair names, allow identical names (choices: f, t) |
| --qout | string | 33 | PHRED offset to use for output FASTQs (choices: 33, 64) |
| --maxcor | integer | 1 | Max number of corrections within a 20bp window |
| --sampleseed | integer | 42 | Set to a positive number to use as the random number generator seed for sampling |
| --ont_minlength | integer | 1000 | ONT Reads shorter than this will be discarded |
| --ont_minqual | integer | 0 | Minimum average read quality filter of ONT reads |
| --porechop_opts | string | | Extra Porechop options in quotes |
| --nanoplot_opts | string | | Extra NanoPlot options in quotes |
| --bbduk_opts | string | | Extra BBDuk options in quotes |
| --fastp_opts | string | | Extra fastp options in quotes |
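The wording for `--ftm` and `--minlength` is terse, so a small worked example may help. The Python below only restates the arithmetic implied by the two descriptions in the table, using their documented defaults (5 and 35). It is not the trimming tool's own code, and the function names are hypothetical.

```python
def force_trim_modulo(length: int, ftm: int = 5) -> int:
    """Right-trim a read length so that length % ftm == 0.

    A non-positive ftm leaves the length unchanged, following the
    "if positive" wording in the parameter table.
    """
    if ftm <= 0:
        return length
    return length - (length % ftm)


def passes_minlength(length: int, minlength: int = 35) -> bool:
    """Reads shorter than minlength after trimming are discarded."""
    return length >= minlength


if __name__ == "__main__":
    # A 151 bp read loses its final base with the default ftm of 5.
    print(force_trim_modulo(151))   # 150
    # A 34 bp read falls below the default minlength of 35.
    print(passes_minlength(34))     # False
```

With the defaults, a 151 bp read is trimmed to 150 bp, while a read that is already a multiple of 5 bases long is left alone.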
Used By
Subworkflows
- bactopia_qc - Perform comprehensive quality control on sequencing reads.
Workflows
- bactopia - Comprehensive bacterial analysis pipeline for complete genomic characterization.
- cleanyerreads - Quality control and optional host read removal from raw sequencing reads.
- staphopia - Comprehensive analysis pipeline for Staphylococcus aureus isolates.
Citations
If you use this in your analysis, please cite the following.
- Bactopia: Petit III RA, Read TD. Bactopia - a flexible pipeline for complete analysis of bacterial genomes. mSystems 5 (2020)
- BBTools: Bushnell B. BBMap short read aligner, and other bioinformatic tools. (Link)
- fastp: Chen S, Zhou Y, Chen Y, and Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics, 34(17), i884-i890. (2018)
- FastQC: Andrews S. FastQC: a quality control tool for high throughput sequence data. (WebLink)
- fastq-scan: Petit III RA. fastq-scan: generate summary statistics of input FASTQ sequences. (GitHub)
- Lighter: Song L, Florea L, Langmead B. Lighter: Fast and Memory-efficient Sequencing Error Correction without Counting. Genome Biol. 15(11):509 (2014)
- NanoPlot: De Coster W, D'Hert S, Schultz DT, Cruts M, Van Broeckhoven C. NanoPack: visualizing and processing long-read sequencing data. Bioinformatics Volume 34, Issue 15 (2018)
- Nanoq: Steinig E. Nanoq: Minimal but speedy quality control for nanopore reads in Rust. (GitHub)
- Porechop: Wick RR, Judd LM, Gorrie CL, Holt KE. Completing bacterial genome assemblies with multiplex MinION sequencing. Microb Genom. 3(10):e000132 (2017)
- Rasusa: Hall MB. Rasusa: Randomly subsample sequencing reads to a specified coverage. (2019)
Version
BACTOPIA_QC:
- bactopia-qc: 1.0.4