merlin

Tags: species-specific automated mash minmer typing bactopia-tool

MinMER-assisted species-specific tool selection and execution.

This Bactopia Tool, Merlin, uses MinMER distances based on the RefSeq sketch to automatically run species-specific analysis tools. Merlin identifies the closest reference genomes and executes appropriate typing and analysis tools for each detected species.

Usage

Bactopia CLI:

bactopia --wf merlin \
--bactopia /path/to/your/bactopia/results

Nextflow:

nextflow run bactopia/bactopia/workflows/bactopia-tools/merlin/main.nf \
--bactopia /path/to/your/bactopia/results

Outputs

Expected Output Files

<BACTOPIA_DIR>
├── <SAMPLE_NAME>
│   └── tools
│       ├── clermontyping
│       │   ├── <SAMPLE_NAME>.tsv
│       │   ├── logs
│       │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │   │   └── versions.yml
│       │   └── supplemental
│       │       ├── <SAMPLE_NAME>.blast.xml
│       │       ├── <SAMPLE_NAME>.html
│       │       └── <SAMPLE_NAME>.mash.tsv
│       ├── ectyper
│       │   ├── <SAMPLE_NAME>.blast_alleles.txt
│       │   ├── <SAMPLE_NAME>.tsv
│       │   └── logs
│       │       ├── ectyper.log
│       │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │       └── versions.yml
│       ├── kleborate
│       │   ├── <SAMPLE_NAME>.tsv
│       │   └── logs
│       │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │       └── versions.yml
│       ├── merlindist
│       │   └── merlin-<TIMESTAMP>
│       │       ├── <SAMPLE_NAME>-dist.txt
│       │       └── logs
│       │           ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │           └── versions.yml
│       ├── shigapass
│       │   ├── <SAMPLE_NAME>.tsv
│       │   ├── logs
│       │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │   │   └── versions.yml
│       │   └── supplemental
│       │       └── ShigaPass_summary.csv
│       ├── shigatyper
│       │   ├── <SAMPLE_NAME>-hits.tsv
│       │   ├── <SAMPLE_NAME>.tsv
│       │   └── logs
│       │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │       └── versions.yml
│       ├── shigeifinder
│       │   ├── <SAMPLE_NAME>.tsv
│       │   └── logs
│       │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │       └── versions.yml
│       └── stecfinder
│           ├── <SAMPLE_NAME>.tsv
│           └── logs
│               ├── nf.command.{begin,err,log,out,run,sh,trace}
│               └── versions.yml
├── <SAMPLE_NAME>SE
│   └── tools
│       ├── clermontyping
│       │   ├── <SAMPLE_NAME>SE.tsv
│       │   ├── logs
│       │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │   │   └── versions.yml
│       │   └── supplemental
│       │       ├── <SAMPLE_NAME>SE.blast.xml
│       │       ├── <SAMPLE_NAME>SE.html
│       │       └── <SAMPLE_NAME>SE.mash.tsv
│       ├── ectyper
│       │   ├── <SAMPLE_NAME>SE.blast_alleles.txt
│       │   ├── <SAMPLE_NAME>SE.tsv
│       │   └── logs
│       │       ├── ectyper.log
│       │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │       └── versions.yml
│       ├── kleborate
│       │   ├── <SAMPLE_NAME>SE.tsv
│       │   └── logs
│       │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │       └── versions.yml
│       ├── merlindist
│       │   └── merlin-<TIMESTAMP>
│       │       ├── <SAMPLE_NAME>SE-dist.txt
│       │       └── logs
│       │           ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │           └── versions.yml
│       ├── shigapass
│       │   ├── <SAMPLE_NAME>SE.tsv
│       │   ├── logs
│       │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │   │   └── versions.yml
│       │   └── supplemental
│       │       └── ShigaPass_summary.csv
│       ├── shigatyper
│       │   ├── <SAMPLE_NAME>SE-hits.tsv
│       │   ├── <SAMPLE_NAME>SE.tsv
│       │   └── logs
│       │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │       └── versions.yml
│       ├── shigeifinder
│       │   ├── <SAMPLE_NAME>SE.tsv
│       │   └── logs
│       │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │       └── versions.yml
│       └── stecfinder
│           ├── <SAMPLE_NAME>SE.tsv
│           └── logs
│               ├── nf.command.{begin,err,log,out,run,sh,trace}
│               └── versions.yml
├── SRR13039589
│   └── tools
│       ├── clermontyping
│       │   ├── SRR13039589.tsv
│       │   ├── logs
│       │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │   │   └── versions.yml
│       │   └── supplemental
│       │       ├── SRR13039589.blast.xml
│       │       ├── SRR13039589.html
│       │       └── SRR13039589.mash.tsv
│       ├── ectyper
│       │   ├── SRR13039589.blast_alleles.txt
│       │   ├── SRR13039589.tsv
│       │   └── logs
│       │       ├── ectyper.log
│       │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │       └── versions.yml
│       ├── kleborate
│       │   ├── SRR13039589.tsv
│       │   └── logs
│       │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │       └── versions.yml
│       ├── merlindist
│       │   └── merlin-<TIMESTAMP>
│       │       ├── SRR13039589-dist.txt
│       │       └── logs
│       │           ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │           └── versions.yml
│       ├── shigapass
│       │   ├── SRR13039589.tsv
│       │   ├── logs
│       │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │   │   └── versions.yml
│       │   └── supplemental
│       │       └── ShigaPass_summary.csv
│       ├── shigatyper
│       │   ├── SRR13039589-hits.tsv
│       │   ├── SRR13039589.tsv
│       │   └── logs
│       │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │       └── versions.yml
│       ├── shigeifinder
│       │   ├── SRR13039589.tsv
│       │   └── logs
│       │       ├── nf.command.{begin,err,log,out,run,sh,trace}
│       │       └── versions.yml
│       └── stecfinder
│           ├── SRR13039589.tsv
│           └── logs
│               ├── nf.command.{begin,err,log,out,run,sh,trace}
│               └── versions.yml
└── bactopia-runs
    └── merlin-<TIMESTAMP>
        ├── merged-results
        │   ├── clermontyping.tsv
        │   ├── ectyper.tsv
        │   ├── kleborate.tsv
        │   ├── logs
        │   │   ├── clermontyping-concat
        │   │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │   │   └── versions.yml
        │   │   ├── ectyper-concat
        │   │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │   │   └── versions.yml
        │   │   ├── kleborate-concat
        │   │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │   │   └── versions.yml
        │   │   ├── shigapass-concat
        │   │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │   │   └── versions.yml
        │   │   ├── shigatyper-concat
        │   │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │   │   └── versions.yml
        │   │   ├── shigeifinder-concat
        │   │   │   ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │   │   └── versions.yml
        │   │   └── stecfinder-concat
        │   │       ├── nf.command.{begin,err,log,out,run,sh,trace}
        │   │       └── versions.yml
        │   ├── shigapass.tsv
        │   ├── shigatyper.tsv
        │   ├── shigeifinder.tsv
        │   └── stecfinder.tsv
        └── nf-reports
            ├── merlin-dag.dot
            ├── merlin-report.html
            └── merlin-timeline.html

Species-Specific Analysis

Note: The tools executed depend on the detected species.

File | Description
Analysis | results from all executed species-specific tools

Merged Results

File | Description
merlin.tsv | Merged summary of all species-specific analyses
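A quick way to check which merged tables a run produced is to list them with their row counts. The sketch below assumes the layout shown in the expected output tree (bactopia-runs/merlin-<TIMESTAMP>/merged-results/*.tsv) and that each merged TSV starts with a single header line. The function name list_merged_results is made up for illustration and is not part of Bactopia.

```shell
#!/usr/bin/env bash
# list_merged_results: hypothetical helper, not part of Bactopia.
# Prints "<data rows>\t<file>" for every merged TSV under a Bactopia results directory.
# Assumes each TSV has exactly one header line.
list_merged_results() {
    local bactopia_dir="$1"
    local tsv rows
    for tsv in "$bactopia_dir"/bactopia-runs/merlin-*/merged-results/*.tsv; do
        [ -e "$tsv" ] || continue            # the glob matched nothing
        rows=$(( $(wc -l < "$tsv") - 1 ))    # subtract the header line
        if [ "$rows" -lt 0 ]; then rows=0; fi
        printf '%s\t%s\n' "$rows" "$tsv"
    done
}

# Example (placeholder path):
# list_merged_results /path/to/your/bactopia/results
```

A table with zero data rows usually means no sample was routed to that tool, which is expected when the detected species does not match.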

Audit Trail

Below are files that can assist you in understanding which parameters and program versions were used.

Logs

Each process that is executed will have a folder named logs. In this folder are helpful files for you to review if the need ever arises.

Extension | Description
.begin | An empty file used to designate the process started
.err | Contains STDERR outputs from the process
.log | Contains both STDERR and STDOUT outputs from the process
.out | Contains STDOUT outputs from the process
.run | The script Nextflow uses to stage/unstage files and queue processes based on given profile
.sh | The script executed by bash for the process
.trace | The Nextflow trace report for the process
versions.yml | A YAML formatted file with program versions
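When a tool fails or gives surprising results for one sample, the STDERR capture and the versions file are usually the first things to read. The following sketch prints both for a given sample and tool, assuming the directory layout from the expected output tree. The helper name show_tool_logs is hypothetical. It searches for any logs folder below the tool directory because merlindist nests its logs one level deeper.

```shell
#!/usr/bin/env bash
# show_tool_logs: hypothetical helper, not part of Bactopia.
# Usage: show_tool_logs <bactopia_dir> <sample> <tool>
# Prints nf.command.err (if non-empty) and versions.yml from each logs folder found.
show_tool_logs() {
    local tool_dir="$1/$2/tools/$3"
    local logs
    if [ ! -d "$tool_dir" ]; then
        echo "no such directory: $tool_dir" >&2
        return 1
    fi
    find "$tool_dir" -type d -name logs | while IFS= read -r logs; do
        echo "== $logs"
        if [ -s "$logs/nf.command.err" ]; then
            cat "$logs/nf.command.err"
        fi
        if [ -f "$logs/versions.yml" ]; then
            cat "$logs/versions.yml"
        fi
    done
    return 0
}

# Example (placeholder path and sample name):
# show_tool_logs /path/to/your/bactopia/results SRR13039589 ectyper
```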

Nextflow Reports

These Nextflow reports provide a great summary of your run. They can be used to optimize resource usage and estimate expected costs if using cloud platforms.

Filename | Description
merlin-dag.dot | The Nextflow DAG visualization
merlin-report.html | The Nextflow Execution Report
merlin-timeline.html | The Nextflow Timeline Report
merlin-trace.txt | The Nextflow Trace report

Parameters

Required Parameters

Define where the pipeline should find input data and save output data.

Parameter | Type | Default | Description
--bactopia | string | | The path to bactopia results to use as inputs

mashdist Parameters

Parameter | Type | Default | Description
--mash_sketch | string | | The reference sequence as a Mash Sketch (.msh file)
--mash_seed | integer | 42 | Seed to provide to the hash function
--mash_table | boolean | false | Table output (fields will be blank if they do not meet the p-value threshold)
--mash_m | integer | 1 | Minimum copies of each k-mer required to pass noise filter for reads
--mash_w | number | 0.01 | Probability threshold for warning about low k-mer size.
--mash_max_p | number | 1.0 | Maximum p-value to report.
--mash_max_dist | number | 1.0 | Maximum distance to report.
--merlin_dist | number | 0.1 | Maximum distance to report when using Merlin.
--full_merlin | boolean | false | Go full Merlin and run all species-specific tools, no matter the Mash distance
--mash_use_fastqs | boolean | false | Query with FASTQs instead of the assemblies
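To get a feel for how a distance cutoff such as --merlin_dist narrows the candidate references, the sketch below filters a <SAMPLE_NAME>-dist.txt file by distance. It assumes the file uses the standard tab-separated mash dist layout (reference, query, distance, p-value, shared hashes), which may not hold if --mash_table is enabled. The helper filter_by_distance is hypothetical and only illustrates the threshold; it is not how Merlin itself selects tools.

```shell
#!/usr/bin/env bash
# filter_by_distance: hypothetical helper, not part of Bactopia.
# Assumes standard `mash dist` columns: reference, query, distance, p-value, shared hashes.
# Prints "<reference>\t<distance>" for rows at or below the cutoff (default 0.1).
filter_by_distance() {
    local dist_file="$1"
    local max_dist="${2:-0.1}"
    awk -F'\t' -v max="$max_dist" \
        '($3 + 0) <= (max + 0) { print $1 "\t" $3 }' "$dist_file"
}

# Example (placeholder path):
# filter_by_distance /path/to/SAMPLE-dist.txt 0.1
```

Lowering the cutoff in this sketch keeps fewer references, which mirrors what a stricter --merlin_dist would be expected to do. With --full_merlin the distance is ignored and every species-specific tool runs.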

ClermonTyping Parameters

Parameter | Type | Default | Description
--clermontyping_threshold | integer | 0 | Do not use contigs under this size

csvtk concat Parameters

Parameter | Type | Default | Description
--csvtk_concat_opts | string | | Extra csvtk concat options in quotes

ECTyper Parameters

Parameter | Type | Default | Description
--ectyper_opid | integer | 90 | Percent identity required for an O antigen allele match
--ectyper_opcov | integer | 90 | Minimum percent coverage required for an O antigen allele match
--ectyper_hpid | integer | 95 | Percent identity required for an H antigen allele match
--ectyper_hpcov | integer | 50 | Minimum percent coverage required for an H antigen allele match
--ectyper_verify | boolean | false | Enable E. coli species verification
--ectyper_print_alleles | boolean | false | Prints the allele sequences if enabled as the final column

emmtyper Parameters

Parameter | Type | Default | Description
--emmtyper_wf | string | blast | Workflow for emmtyper to use. (choices: blast, pcr)
--emmtyper_blastdb | string | | Path to custom EMM BLAST DB.
--emmtyper_cluster_distance | integer | 500 | Distance between cluster of matches to consider as different clusters
--emmtyper_percid | integer | 95 | Minimal percent identity of sequence
--emmtyper_culling_limit | integer | 5 | Total hits to return in a position
--emmtyper_mismatch | integer | 5 | Threshold for number of mismatch to allow in BLAST hit
--emmtyper_align_diff | integer | 5 | Threshold for difference between alignment length and subject length in BLAST
--emmtyper_gap | integer | 2 | Threshold gap to allow in BLAST hit
--emmtyper_min_perfect | integer | 15 | Minimum size of perfect match at 3' primer end
--emmtyper_min_good | integer | 15 | Minimum size where there must be 2 matches for each mismatch
--emmtyper_max_size | integer | 2000 | Maximum size of PCR product

hicap Parameters

Parameter | Type | Default | Description
--hicap_database_dir | string | | Directory containing locus database
--hicap_model_fp | string | | Path to prodigal model
--hicap_full_sequence | boolean | false | Write the full input sequence out to the genbank file rather than just the region surrounding and including the locus
--hicap_debug | boolean | false | hicap will print debug messages
--hicap_gene_coverage | number | 0.8 | Minimum percentage coverage to consider a single gene complete
--hicap_gene_identity | number | 0.7 | Minimum percentage identity to consider a single gene complete
--hicap_broken_gene_length | integer | 60 | Minimum length to consider a broken gene
--hicap_broken_gene_identity | number | 0.8 | Minimum percentage identity to consider a broken gene

Mykrobe Parameters

Parameter | Type | Default | Description
--mykrobe_species | string | | Species panel to use (choices: sonnei, staph, tb, typhi)
--mykrobe_kmer | integer | 21 | K-mer length
--mykrobe_min_depth | integer | 1 | Minimum depth
--mykrobe_model | string | kmer_count | Genotype model used. (choices: kmer_count, median_depth)
--mykrobe_report_all_calls | boolean | false | Report all calls
--mykrobe_opts | string | | Extra Mykrobe options in quotes

GenoTyphi Parameters

Parameter | Type | Default | Description
--genotyphi_kmer | integer | 21 | K-mer length
--genotyphi_min_depth | integer | 1 | Minimum depth
--genotyphi_model | string | kmer_count | Genotype model used. (choices: kmer_count, median_depth)
--genotyphi_report_all_calls | boolean | false | Report all calls
--genotyphi_mykrobe_opts | string | | Extra Mykrobe options in quotes

Kleborate Parameters

Parameter | Type | Default | Description
--kleborate_preset | string | kpsc | Preset module to use for Kleborate (choices: kpsc, kosc, escherichia)
--kleborate_opts | string | | Extra options in quotes for Kleborate

legsta Parameters

Parameter | Type | Default | Description
--legsta_noheader | boolean | false | Don't print header row

LisSero Parameters

Parameter | Type | Default | Description
--lissero_min_id | number | 95.0 | Minimum percent identity to accept a match
--lissero_min_cov | number | 95.0 | Minimum coverage of the gene to accept a match

ngmaster Parameters

Parameter | Type | Default | Description
--ngmaster_csv | boolean | false | output comma-separated format (CSV) rather than tab-separated

pasty Parameters

Parameter | Type | Default | Description
--pasty_min_pident | integer | 95 | Minimum percent identity to count a hit
--pasty_min_coverage | integer | 95 | Minimum percent coverage to count a hit

pbptyper Parameters

Parameter | Type | Default | Description
--pbptyper_min_pident | integer | 95 | Minimum percent identity to count a hit
--pbptyper_min_coverage | integer | 95 | Minimum percent coverage to count a hit

SeqSero2 Parameters

Parameter | Type | Default | Description
--seqsero2_run_mode | string | k | Workflow to run. 'a' allele mode, or 'k' k-mer mode (choices: a, k)
--seqsero2_input_type | string | assembly | Input format to analyze. 'assembly' or 'fastq' (choices: assembly, fastq)
--seqsero2_bwa_mode | string | mem | Algorithms for bwa mapping for allele mode (choices: mem, sam)

SeroBA Parameters

Parameter | Type | Default | Description
--seroba_noclean | boolean | false | Do not clean up intermediate files
--seroba_coverage | integer | 20 | Threshold for k-mer coverage of the reference sequence

SISTR Parameters

Parameter | Type | Default | Description
--sistr_full_cgmlst | boolean | false | Use the full set of cgMLST alleles which can include highly similar alleles

AgrVATE Parameters

Parameter | Type | Default | Description
--agrvate_typing_only | boolean | false | agr typing only. Skips agr operon extraction and frameshift detection

spaTyper Parameters

Parameter | Type | Default | Description
--spatyper_repeats | string | | List of spa repeats
--spatyper_repeat_order | string | | List spa types and order of repeats
--spatyper_do_enrich | boolean | false | Do PCR product enrichment

sccmec Parameters

Parameter | Type | Default | Description
--sccmec_min_targets_pident | integer | 90 | Minimum percent identity to count a target hit
--sccmec_min_targets_coverage | integer | 80 | Minimum percent coverage to count a target hit
--sccmec_min_regions_pident | integer | 85 | Minimum percent identity to count a region hit
--sccmec_min_regions_coverage | integer | 93 | Minimum percent coverage to count a region hit

StaphSCAN Parameters

Parameter | Type | Default | Description
--staphscan_modules | string | | Comma-separated list of modules to run
--staphscan_db_mlst | string | | Path or tarball to custom MLST database

STECFinder Parameters

Parameter | Type | Default | Description
--stecfinder_use_reads | boolean | false | Paired-end Illumina reads will be used instead of assemblies
--stecfinder_hits | boolean | false | Show detailed gene search results
--stecfinder_cutoff | number | 10.0 | Minimum read coverage for gene to be called
--stecfinder_length | number | 50.0 | Percentage of gene length needed for positive call
--stecfinder_ipah_length | number | 10.0 | Percentage of ipaH gene length needed for positive gene call
--stecfinder_ipah_depth | number | 1.0 | Minimum depth for positive ipaH gene call (requires --stecfinder_use_reads)
--stecfinder_stx_length | number | 10.0 | Percentage of stx gene length needed for positive gene call
--stecfinder_stx_depth | number | 1.0 | Minimum depth for positive stx gene call (requires --stecfinder_use_reads)
--stecfinder_o_length | number | 60.0 | Percentage of wz_ gene length needed for positive call
--stecfinder_o_depth | number | 1.0 | Minimum depth for positive wz_ gene call (requires --stecfinder_use_reads)
--stecfinder_h_length | number | 60.0 | Percentage of fliC gene length needed for positive call
--stecfinder_h_depth | number | 1.0 | Minimum depth for positive fliC gene call (requires --stecfinder_use_reads)

TB-Profiler Profile Parameters

Parameter | Type | Default | Description
--tbprofiler_call_whole_genome | boolean | false | Call whole genome
--tbprofiler_mapper | string | bwa | Mapping tool to use. If you are using nanopore data it will default to minimap2 (choices: bwa, minimap2, bowtie2, bwa-mem2)
--tbprofiler_caller | string | freebayes | Variant calling tool to use (choices: bcftools, gatk, freebayes)
--tbprofiler_calling_params | string | | Extra variant caller options in quotes
--tbprofiler_suspect | boolean | false | Use the suspect suite of tools to add ML predictions
--tbprofiler_no_flagstat | boolean | false | Don't collect flagstats
--tbprofiler_no_delly | boolean | false | Don't run delly
--tbprofiler_opts | string | | Extra options in quotes for TBProfiler

TB-Profiler Collate Parameters

Parameter | Type | Default | Description
--tbprofiler_itol | boolean | false | Generate itol config files
--tbprofiler_full | boolean | false | Output mutations in main result file
--tbprofiler_all_variants | boolean | false | Output all variants in variant matrix
--tbprofiler_mark_missing | boolean | false | An asterisk will be used to mark predictions which are affected by missing data at a drug resistance position

Filtering Parameters

Use these parameters to specify which samples to include or exclude.

Parameter | Type | Default | Description
--include | string | | A text file containing sample names (one per line) to include from the analysis
--exclude | string | | A text file containing sample names (one per line) to exclude from the analysis
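Both filtering parameters take a plain text file with one sample name per line. One way to get a starting list is to write out every sample directory under the Bactopia results folder and then trim it by hand. The sketch below assumes each sample is a top-level directory, as in the expected output tree, and skips bactopia-runs. The helper write_sample_list and the file name samples.txt are hypothetical.

```shell
#!/usr/bin/env bash
# write_sample_list: hypothetical helper, not part of Bactopia.
# Usage: write_sample_list <bactopia_dir> <out_file>
# Writes one sample name per line, skipping the bactopia-runs directory.
write_sample_list() {
    local bactopia_dir="$1"
    local out_file="$2"
    local d name
    : > "$out_file"                          # start with an empty file
    for d in "$bactopia_dir"/*/; do
        [ -d "$d" ] || continue              # the glob matched nothing
        name=$(basename "$d")
        if [ "$name" != "bactopia-runs" ]; then
            echo "$name" >> "$out_file"
        fi
    done
}

# Example (placeholder paths); edit samples.txt before running Merlin:
# write_sample_list /path/to/your/bactopia/results samples.txt
# bactopia --wf merlin --bactopia /path/to/your/bactopia/results --include samples.txt
```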

Optional Parameters

These optional parameters can be useful in certain settings.

Parameter | Type | Default | Description
--outdir | string | bactopia | Base directory to write results to
--skip_compression | boolean | false | Output files will not be compressed
--datasets | string | | The path to cache datasets to
--keep_all_files | boolean | false | Keeps all analysis files created

Max Job Request Parameters

Set the top limit for requested resources for any single job.

Parameter | Type | Default | Description
--max_retry | integer | 3 | Maximum times to retry a process before allowing it to fail.
--max_cpus | integer | 4 | Maximum number of CPUs that can be requested for any single job.
--max_memory | string | 128.GB | Maximum amount of memory that can be requested for any single job.
--max_time | string | 240.h | Maximum amount of time that can be requested for any single job.
--max_downloads | integer | 3 | Maximum number of samples to download at a time

Nextflow Configuration Parameters

Parameters to fine-tune your Nextflow setup.

Parameter | Type | Default | Description
--nfconfig | string | | A Nextflow compatible config file for custom profiles, loaded last and will overwrite existing variables if set.
--publish_dir_mode | string | copy | Method used to save pipeline results to output directory. (choices: symlink, rellink, link, copy, copyNoFollow, move)
--infodir | string | ${params.outdir}/pipeline_info | Directory to keep pipeline Nextflow logs and reports.
--force | boolean | false | Nextflow will overwrite existing output files.
--cleanup_workdir | boolean | false | After Bactopia is successfully executed, the work directory will be deleted.

Institutional config options

Parameters used to describe centralized config profiles. These should not be edited.

Parameter | Type | Default | Description
--custom_config_version | string | master | Git commit id for Institutional configs.
--custom_config_base | string | https://raw.githubusercontent.com/nf-core/configs/master | Base directory for Institutional configs.
--config_profile_name | string | | Institutional config name.
--config_profile_description | string | | Institutional config description.
--config_profile_contact | string | | Institutional config contact information.
--config_profile_url | string | | Institutional config URL link.

Nextflow Profile Parameters

Parameters to fine-tune your Nextflow setup.

Parameter | Type | Default | Description
--condadir | string | | Directory Nextflow should use for Conda environments
--registry | string | quay.io | Registry to pull Docker containers from.
--datasets_cache | string | <HOME>/.bactopia/datasets | Directory where downloaded datasets should be stored.
--singularity_cache | string | | Directory where remote Singularity images are stored.
--singularity_pull_docker_container | boolean | | Instead of directly downloading Singularity images for use with Singularity, force the workflow to pull and convert Docker containers instead.
--force_rebuild | boolean | false | Force overwrite of existing pre-built environments.
--queue | string | general,high-memory | Comma-separated name of the queue(s) to be used by a job scheduler (e.g. AWS Batch or SLURM)
--cluster_opts | string | | Additional options to pass to the executor. (e.g. SLURM: '--account=my_acct_name')
--container_opts | string | | Additional options to pass to Apptainer, Docker, or Singularity. (e.g. Singularity: '-D pwd')
--disable_scratch | boolean | false | All intermediate files created on worker nodes will be transferred to the head node.

Helpful Parameters

Uncommonly used parameters that might be useful.

Parameter | Type | Default | Description
--monochrome_logs | boolean | | Do not use coloured log outputs.
--nfdir | boolean | | Print directory Nextflow has pulled Bactopia to
--sleep_time | integer | 5 | The amount of time (seconds) Nextflow will wait after setting up datasets before execution.
--validate_params | boolean | true | Boolean whether to validate parameters against the schema at runtime
--help | boolean | | Display help text.
--wf | string | bactopia | Specify which workflow or Bactopia Tool to execute
--list_wfs | boolean | | List the available workflows and Bactopia Tools to use with '--wf'
--show_hidden_params | boolean | | Show all params when using --help
--help_all | boolean | | An alias for --help --show_hidden_params
--version | boolean | | Display version text.

Composition

This workflow uses the following subworkflows:

  • bactopia_datasets - Download and provide pre-compiled datasets required by Bactopia.
  • merlin - MinmER assisted species-specific bactopia tool seLectIoN.

Citations

If you use this in your analysis, please cite the following.
