KAT Script Read Below
When using LMP data, in many cases the protocol used to prepare the sequencing library will impose considerable biases. It is good practice to check LMP reads against the PE reads for coherence. They have been prepared from the same genomic DNA, so they should have similar content. Over-representation and absence of motifs are important factors to check. The presence of motifs originating from adaptors (in fact mostly generated from their junction with genomic DNA) can also be spotted.
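As a rough illustration of the coherence check described above, the following Python sketch counts k-mers in two read sets and reports motifs seen in only one of them. It is not KAT's implementation: the function names and the default `k` are hypothetical, reads are assumed to be plain strings already loaded into memory, and reverse complements are not merged into canonical k-mers.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer of length k across an iterable of read strings."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases.
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def compare_kmer_content(lmp_reads, pe_reads, k=5):
    """Return (k-mers only in the LMP reads, k-mers only in the PE reads)."""
    lmp = count_kmers(lmp_reads, k)
    pe = count_kmers(pe_reads, k)
    only_lmp = sorted(set(lmp) - set(pe))
    only_pe = sorted(set(pe) - set(lmp))
    return only_lmp, only_pe


if __name__ == "__main__":
    # Toy data: the second LMP read stands in for adaptor-derived sequence.
    lmp_reads = ["ACGTACGT", "CTGTCTCT"]
    pe_reads = ["ACGTACGT"]
    only_lmp, only_pe = compare_kmer_content(lmp_reads, pe_reads, k=4)
    print("k-mers only in LMP:", only_lmp)
    print("k-mers only in PE:", only_pe)
```

On real data, motifs appearing only in the LMP set would be candidates for adaptor-junction sequence, and motifs missing from it would point to protocol bias.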
Breaking WGS data into k-mers provides a nice way of identifying contamination, organelles or otherwise unexpected content in your reads or assemblies. This section will walk you through how you might be able to identify and extract contamination in your data.
Running this tool will produce a matrix containing distinct k-mer counts at varying frequency and GC value. It will also produce a density plot, such as the one below, that highlights error k-mers shown at very low coverage with a wide GC spread and genuine content between 10-100X with GC spread from 5-25. In this case we also have some unexpected content shown at approx 200X with GC from 15-25.
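To make the shape of that matrix concrete, here is a minimal Python sketch that bins distinct k-mers by GC count and frequency. It assumes the GC value is the number of G and C bases in the k-mer, and it works on in-memory strings; the function name is hypothetical and this is not how KAT itself computes the matrix.

```python
from collections import Counter


def gc_coverage_matrix(reads, k):
    """Map (gc_count, frequency) -> number of distinct k-mers in that cell."""
    counts = Counter()
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Keep only k-mers made entirely of unambiguous bases.
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1

    matrix = Counter()
    for kmer, freq in counts.items():
        gc = kmer.count("G") + kmer.count("C")
        matrix[(gc, freq)] += 1
    return matrix


if __name__ == "__main__":
    reads = ["AAAA", "AAAA", "GGCC"]
    for (gc, freq), n in sorted(gc_coverage_matrix(reads, k=4).items()):
        print(f"GC={gc} frequency={freq} distinct_kmers={n}")
```

In a matrix like this, a cluster of cells at a frequency and GC range separate from the main body of k-mers is what would show up as unexpected content in the density plot.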
You can also use this tool for subsampling the extracted data. This can be useful for reducing expression of highly expressed reads. To do this, add the --frequency option and set a threshold indicating how many of the reads to keep: 1.0 implies keep all, 0.0 means discard all, and 0.5 would imply keeping half of the sequences.
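The semantics of the keep fraction can be sketched in a few lines of Python. This is only an illustration of probabilistic subsampling under the assumption that each read is kept independently with the given probability; the function name and `seed` parameter are hypothetical and the tool's own implementation may differ.

```python
import random


def subsample(records, keep_fraction, seed=None):
    """Keep each record independently with probability keep_fraction."""
    if not 0.0 <= keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be between 0.0 and 1.0")
    rng = random.Random(seed)
    # rng.random() is in [0.0, 1.0), so 1.0 keeps everything and 0.0 keeps nothing.
    return [rec for rec in records if rng.random() < keep_fraction]


if __name__ == "__main__":
    reads = [f"read_{i}" for i in range(1000)]
    kept = subsample(reads, 0.5, seed=1)
    print(f"kept {len(kept)} of {len(reads)} reads")
```

Because each read is drawn independently, a fraction of 0.5 keeps roughly, not exactly, half of the sequences.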
A second use case assumes you already know the contaminant genome and have access to the reference assembly of that contaminant. In this case you can directly inspect your assembly for signs of the contaminant using the following command:
CDC has deposited Illumina reads from 18 outbreak strains into SRA under project SRP072035 so I pulled the data and had a look. I managed to download the readsets in a few minutes (using bionode-ncbi) but it took a really long time to unpack these into fastq files using sra-toolkit.
Note 2: the assemblies (SPAdes fasta and fastg; plus Prokka annotated in GenBank format), and various analyses including trees created using Parsnp (from assemblies) and our RedDog pipeline (mapping of reads to reference genome strain NUHP1 = CP007547) are here on GitHub:
This was detected by our mapping pipeline RedDog, which I used to map the reads to reference genome NUHP1 CP007547 (this may not be the best reference, I just picked one randomly). The assemblies confirm it: genes BD94_0888 to BD94_0962, and the end of BD94_0963, are missing in these 4 strains (although reads do map to BD94_0948, because this is present in a second copy elsewhere in the genome).
The 10 clades highlighted in the tree are those containing >5 aEPEC in our collection, which represent the most common aEPEC lineages. The figures below show that these 10 aEPEC groups are present across the Asian and African GEMS sites; most also appear in non-GEMS collections from Europe as well as North and South America, indicating they are globally distributed.
Common approaches include: (i) sequencing large numbers of isolates using high-throughput Illumina platforms; (ii) the identification of SNPs (single nucleotide polymorphisms) using read mapping approaches (with BWA, SMALT, SAMtools and GATK being popular tools); and (iii) uniform use of RAxML for generating maximum likelihood phylogenies.
Although there is still no real consensus on exact methods for SNP calling, I think most of the tools people are using (i.e. a good, stable read mapper followed by SNP calling with an established tool like SAMtools or GATK, with some basic filtering to remove low-evidence or ambiguous calls) end up with very similar answers (as we saw with the NGS outbreak analysis challenge session held at the ASM NGS meeting in September 2015). All in all, it seems to me that the use of genomics for public health & diagnostic microbiology is in far better shape in this respect than clinical human genomics, which is going through something of a crisis involving wide discrepancies in variant calling as well as uncertainty around data interpretation.
We have two work-horse scripts for plotting trees with data, one based on R (using ape) and one based on Python (using the ete2 package). Both are available on GitHub. Each requires an input tree (newick format) and takes strain information or data (for heatmaps) in CSV format.
Each can do slightly different things. The biggest difference is that the R script is restricted to rectangular trees and works best for plotting associated data as text columns and heatmaps, like this example (taken from Holt et al, PNAS 2013) of a tree of Vietnamese Shigella sonnei, with tips coloured by city of isolation and a heatmap indicating the presence (black) or absence (white) of accessory genes. Example data files (newick tree + accessory gene content matrix) are available in the GitHub repository.
A version of the R script is now included in the SRST2 repository, which can accept MLST and gene content information output by SRST2, calculate a tree from the MLST data, and plot gene content against this tree, either on an individual isolate basis or summarising gene frequencies by ST. Instructions and example data are here: -output-in-r, including how to recreate this figure from the SRST2 paper:
The Python script is best for plotting trees in circular format with simpler, discrete categorical data, e.g. with coloured rings, or colouring branches or clade backgrounds according to tip values; it can also show branch support.
Note: We highly recommend that you first consult the latest KAT documentation on our GitHub repository to get acquainted with many of the concepts mentioned in the Sample Solution below.
In addition to the source code provided, the KAT Sample Solution contains infrastructure-as-code to set up Kubernetes resources that support the application. The Kubernetes cluster is set up using Microk8s as a single-node Kubernetes installation inside the VM. A brief description of the Kubernetes-based supporting subsystems is provided below:
-NGINX
Helm Package Manager: We will use standard Helm Charts for PostgreSQL, Prometheus, Grafana, and Python as well as customized Helm Charts to deploy and manage our application stack.
Infrastructure-as-Code (IaC) Repository: Publicly accessible Git repository hosting the application and Kubernetes IaC.
Grafana + Prometheus: A modern Kubernetes-friendly monitoring and alerting system.
Technology Demonstration
The following graphic shows the application components deployed into the Kubernetes cluster. A detailed description of each of the components, along with links to documentation describing each in detail, is provided in the table below. Again, we encourage you to read through the up-to-date documentation and code within the KAT GitHub repository to best understand the entire solution.
We discuss some of the salient features of the KAT Sample Solution below. We highly recommend that you check out the latest detailed documentation on the BigBitBus Kubernetes Automation Toolkit open-source repository:
All components of the KAT Sample Solution are subject to open-source licenses; please refer to their respective source repositories to learn and read about the licensing terms. The KAT BoosterPack code and documentation license is available here:
Rapid molecular typing of bacterial pathogens is critical for public health epidemiology, surveillance and infection control, yet routine use of whole genome sequencing (WGS) for these purposes poses significant challenges. Here we present SRST2, a read mapping-based tool for fast and accurate detection of genes, alleles and multi-locus sequence types (MLST) from WGS data. Using >900 genomes from common pathogens, we show SRST2 is highly accurate and outperforms assembly-based methods in terms of both gene detection and allele assignment. We include validation of SRST2 within a public health laboratory, and demonstrate its use for microbial genome surveillance in the hospital setting. In the face of rising threats of antimicrobial resistance and emerging virulence among bacterial pathogens, SRST2 represents a powerful tool for rapidly extracting clinically useful information from raw WGS data.
Rapid molecular typing of bacterial pathogens is critical for public health epidemiology, surveillance and infection control [1],[2]. Two key goals of such activities are: (1) to detect the presence of genes linked to clinically relevant phenotypes - including virulence genes, antimicrobial resistance genes or serotype determinants; and (2) to classify isolates into clonal groups, via multi-locus sequence typing (MLST [3]) or detection of clone-specific or other epidemiological markers. Whole genome sequencing (WGS) or 'genomic epidemiology' is increasingly being adopted for these tasks and has the potential to replace current techniques which are mainly based on PCR and/or restriction enzyme digestion coupled with sequencing or size separation via electrophoresis [1],[4]. WGS is particularly attractive as: (1) it can be applied simultaneously to large numbers of bacterial isolates of any species with no need for organism- or target-specific reagents; and (2) the resulting data are readily shareable, can be compared easily with past and future data sets, and are informative for both routine surveillance (monitoring genes and clones) and detailed outbreak investigation (genome-wide phylogenies for transmission analysis) [2],[4].
Currently available methods rely on assembling short reads into longer contiguous sequences (contigs), which can be interrogated using BLAST or other search algorithms to identify genes or alleles of interest (for example, ARG-Annot [18]; ResFinder, PlasmidFinder and MLST typer [19]-[21]; BIGSdb [22],[23]). The reliance on assembly introduces efficiency and sensitivity problems due to the data, time and computational requirements for generating high quality assemblies of bacterial genomes from short reads. There are several assemblers (for example, Velvet [24], SPAdes [25]) that can produce a bacterial genome assembly in minutes to hours with a few gigabytes of memory. However, the production of high quality assemblies with these tools requires quality filtering and other preprocessing of reads as well as optimisation of k-mer length and other parameters, which in practice requires several alternative assemblies to be generated and compared [26],[27], thus multiplying by an order of magnitude the amount of computational time and memory required to produce each genome prior to typing analysis. Further, the quality of even highly optimised assemblies remains highly variable, even for closely related genomes sequenced together in multiplex. Hence assembly-based analyses of genomes sequenced with short-read technology are very difficult to standardise and quality control, which is important to ensure robust, reliable and reproducible assays for use in public health and infection control.