Project report: Hybrid genome assemblies of 26 strains: Escherichia sp. and Pseudomonas sp.
To the attention of: First name LAST NAME, Company - CITY, Country
Analysis/Writing: First name LAST NAME, Bioinformatics Engineer
Corrections/Validation: First name LAST NAME, Ph.D., Operations manager
Scores out of 5 for the results, according to: (Data) data quality and contamination; (Contiguity) number and size of contigs; (Completion) assembly completeness; (Correctness) assembly errors; (Annotation) annotation.
The goal of the project was to assemble the genomes of 26
strains of bacteria using high-throughput sequencing data from Oxford
Nanopore and Illumina technologies.
Overall, the assembly metrics are good for your samples, with good accuracy and contiguity. Sample D46 yields a more fragmented assembly.
Based on the completion metrics, we estimate that the assemblies are complete, although 2 of your samples, D22 and D24, have lower results.
Moreover, the total sizes of the assemblies are close to those of the reference genomes.
This report describes all the bioinformatics analyses performed following high-throughput sequencing with Oxford Nanopore and Illumina technologies. The objective was to produce de novo hybrid assemblies of 26 strains of Escherichia sp. and Pseudomonas sp., and to annotate the assembled genomes.
This process was carried out in the following stages:
Below is a representation of the key steps of the bioinformatics pipeline used to obtain the de novo assemblies.
Summary tables of results:
For each of the samples:
Below is an example of the structure of each sub-folder corresponding to a sample:
/
Sample
├── Sample_assembly.fasta
├── Sample_coverage_plots
│   ├── 500kb.png
│   ├── contig_1.pdf
│   └── …
├── Sample_k2_report.txt
└── Sample_prokka
    ├── mygenome.faa
    ├── mygenome.ffn
    ├── mygenome.gbk
    ├── mygenome.gff
    └── mygenome.tsv
Illumina data cleaning yields reads of excellent quality (>Q30), ensuring the best possible quality for downstream analyses. Read adapters are removed, and the reads are then filtered based on their quality.
Details of the data are available in Appendix 1.
Distribution of raw and filtered Illumina data for all samples by (a) number of reads; (b) number of bases; (c) estimated depth; (d) mean read size; (e) median read size.
Data cleaning recovers on average 69.43% of the sequenced bases (minimum 65.81%, maximum 72.49%) (Figure 3b and Table 1). The average depth estimated from the filtered data, assuming a genome size of 5 Mb, is therefore 130.82 X (minimum 46.86 X, maximum 302.2 X) (Figure 3c and Table 1).
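The two quantities reported above can be reproduced with simple arithmetic. The sketch below is only illustrative: the function names and the input values are hypothetical, and it assumes that depth is estimated as the number of filtered bases divided by the assumed genome size (5 Mb here).

```python
def recovery_rate(raw_bases: int, filtered_bases: int) -> float:
    """Percentage of sequenced bases kept after cleaning."""
    if raw_bases <= 0:
        raise ValueError("raw_bases must be positive")
    return 100 * filtered_bases / raw_bases


def estimated_depth(filtered_bases: int, genome_size_bp: int = 5_000_000) -> float:
    """Estimated sequencing depth (X), assuming depth = bases / genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return filtered_bases / genome_size_bp


# Hypothetical sample: 1 Gb sequenced, 700 Mb kept after cleaning.
rate = recovery_rate(1_000_000_000, 700_000_000)
depth = estimated_depth(700_000_000)
print(f"recovered {rate:.2f}% of bases, estimated depth {depth:.2f} X")
```

Because the estimate depends directly on the assumed genome size, a sample with a larger genome would have a proportionally lower depth for the same number of bases.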
ONT data cleaning ensures good read quality (>Q10) and a minimum read length (>1000 bp), which maximizes the overlap between reads and makes the assembly more robust.
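As an illustration of this kind of filter, here is a minimal sketch that keeps a read only if its mean quality is above Q10 and its length is above 1000 bp. It is not the tool used in the pipeline: the function names are hypothetical, the quality string is assumed to be Phred+33 encoded, and the mean quality is assumed to be computed by averaging error probabilities (one common convention for long reads).

```python
import math


def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean read quality, averaging error probabilities (assumed convention)."""
    if not quality:
        return 0.0
    # Convert each Phred score to an error probability before averaging.
    probs = [10 ** (-(ord(char) - offset) / 10) for char in quality]
    return -10 * math.log10(sum(probs) / len(probs))


def keep_read(sequence: str, quality: str,
              min_quality: float = 10.0, min_length: int = 1000) -> bool:
    """True if the read passes both the length and the quality thresholds."""
    return len(sequence) > min_length and mean_phred(quality) > min_quality


# Hypothetical reads: '5' encodes Q20 and '(' encodes Q7 in Phred+33.
print(keep_read("A" * 1500, "5" * 1500))  # long, good quality
print(keep_read("A" * 500, "5" * 500))    # too short
print(keep_read("A" * 1500, "(" * 1500))  # quality too low
```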
Details of the data are available in Appendix 2.
Distribution of raw and filtered Oxford Nanopore Technologies data for all samples by (a) number of reads; (b) number of bases; (c) depth; (d) mean read size; (e) median read size.
For your dataset, data filtering significantly increases the median and mean read size without a major impact on the number of bases or the theoretical depth (Figure 4 and Table 2). The median depth of the long reads is 85.47 X (Figure 4c and Table 2), with a global median read size of about 3941.5 bp (minimum 1769 bp, maximum 6783 bp) (Figure 4e and Table 2) and a global mean read size of about 7443 bp (minimum 3258 bp, maximum 11618 bp) (Figure 4d and Table 2).
Distribution of the obtained assemblies according to (a) the total size of the assembly; (b) the N50; (c) the L50; (d) the number of contigs; (e) the GC percentage; (f) the percentage of complete orthologous genes found; (g) the percentage of Illumina reads mapped back to the assembly; (h) the percentage of reads assigned to the species. The color of the dots corresponds to the species predominantly found in the sample.
The percentage of reads assigned to the species does not necessarily reflect the proportion of that species in the sample. This may be because k-mer-based classification regularly assigns reads to a taxonomic rank above the species, or because the species is poorly represented in the databases.
Box 1 : Definitions
N50: Given a set of contigs sorted from longest to shortest, the N50 is the length of the shortest contig in the smallest set of contigs that together cover at least 50% of the total assembly length.
L50: The smallest number of contigs whose summed length makes up at least half of the total assembly length.
RMBC (Reads Mapped Back to Contigs): Percentage of reads realigned on the assembly.
Depth of coverage: Number of times a nucleotide position is represented in a dataset of reads.
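The N50 and L50 definitions in Box 1 can be made concrete with a short sketch. The function name and the example contig lengths are hypothetical; the calculation follows the usual convention of sorting contigs from longest to shortest and accumulating lengths until half of the total assembly length is reached.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        # The first contig that brings the running sum to half of the
        # assembly gives the N50 (its length) and the L50 (its rank).
        if running >= half_total:
            return length, count
    raise ValueError("contig_lengths must not be empty")


# Hypothetical assembly of six contigs (total 290 bp, half = 145 bp).
print(n50_l50([80, 70, 50, 40, 30, 20]))
# A single-contig assembly has an L50 of 1 and an N50 equal to its size.
print(n50_l50([5_000_000]))
```

In the first example the two longest contigs sum to 150 bp, which exceeds half of the total, so the N50 is 70 and the L50 is 2.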
Figure 6 is a graphical representation of the distribution of BLAST assignments, based on the species name associated with each contig (Table 4, Appendix 3). In a perfect assembly of a single species without plasmids, the percentage of contigs associated with that species is expected to be close to 100%.
For readability, species with assignment percentages below 1% of the assembled genome size have been masked in this representation.
The results of BLAST may be biased by the presence or absence of the species of interest in the databases.
Table 4 repeats part of Appendix 3 and lists the assembled contigs that are larger than 50 kb. A BLAST search is performed for each contig in the assembly to obtain the name of the species or plasmid associated with the contig sequence.
The complete list of contigs is available in table form in Appendix 3.
Evaluating a de novo assembly is critical but inherently complex. Several criteria must be examined to define the quality of an assembly. To give the most complete view of the important characteristics of the assemblies, we focus on the "4C" (Contiguity, Completion, Correctness, Contamination), highlighting different informative metrics (Figure 5 and Table 3).
Details of these data are available in Appendices 3 and 4.
Contiguity relates to the size and number of contigs. Because the goal of an assembly is to reflect the contiguity of the genome in vivo, the assembly process seeks to maximize the size of the contigs and minimize their number, so as to reflect the actual number and size of chromosomes in the organism. Contiguity errors can occur, for example, due to assembler settings that allow unlinked contigs to be joined or that prevent linked contigs from being joined.
Contiguity is most often measured by the N50 and L50. Here, the assemblies have an average N50 of 5557403.08 bp (about 5.56 Mb) and an average L50 of 1 (Figure 5b, c and Table 3).
Overall, the contiguity metrics of the assemblies are good: few contigs, low L50 and high N50. In addition, the sizes of the assemblies are close to those expected for your species, about 5 Mb for Escherichia and 7 Mb for Pseudomonas. One of your samples, D46, has a more fragmented assembly with 19 contigs.
Contiguity alone does not guarantee the quality of an assembly. It is important to verify that all the information available in the reads is used and that the assembly contains all the orthologous genes conserved in a given clade.
For your assemblies, the most complete dataset for evaluating orthologues contains 440 orthologues (Figure 5f and Table 3). The assemblies contain a minimum of 83% and a maximum of 99.8% of these orthologues.
Furthermore, aligning the Illumina reads to the Nanopore assembly shows that an average of 99.05% of the reads (minimum 94.12%, maximum 99.94%) map back to the assembly and are used to correct it (Figure 5g and Table 3). Together, these two results allow us to confirm that the assemblies are complete.
Almost all of your assemblies have a good proportion of orthologous genes detected and a correct read alignment rate (Figure 5f, g and Table 3). However, sample D24 has a lower detection of orthologous genes and sample D22 has a lower Illumina read alignment rate.
The accuracy of the assembly has to be verified by checking for assembly errors such as unresolved repeat regions. To do this, the depth of coverage along each contig is analyzed to determine whether any regions are covered markedly more than others, which would indicate a problem with the assembly.
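The idea of looking for regions that are more covered than others can be sketched as follows. This is an illustrative example rather than the method used to produce the coverage plots: the function names, the window size and the threshold (twice the median window depth) are all assumptions chosen for the demonstration.

```python
from statistics import median


def window_means(depths, window):
    """Mean depth of consecutive, non-overlapping windows along a contig."""
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def flag_high_depth_windows(depths, window=1000, factor=2.0):
    """Return (start position, mean depth) of windows above factor x median."""
    means = window_means(depths, window)
    if not means:
        return []
    baseline = median(means)
    return [(index * window, mean)
            for index, mean in enumerate(means)
            if mean > factor * baseline]


# Hypothetical contig: uniform 100 X depth with one 1 kb region at 350 X,
# as could happen when a repeated region is collapsed into a single copy.
depths = [100] * 3000 + [350] * 1000 + [100] * 2000
print(flag_high_depth_windows(depths))
```

A contig with homogeneous depth returns an empty list, which matches the expectation for a well-resolved assembly.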
Representation of the coverage of the largest contigs in a sample. Each point represents the depth from the Illumina reads.
Overall, the depth along the largest contigs is homogeneous.
High-throughput sequencing theoretically provides high-quality de novo assembled genomic sequences, but in practice, DNA extracts are often contaminated with sequences from other organisms. Contamination of the raw sequences and of the final assemblies is therefore checked using two complementary approaches. The first performs a taxonomic classification of the raw reads. The second performs a BLAST search on the assembly and keeps the best hit for each contig (Figure 6, Table 3 and Appendix 3).
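The second approach, keeping the best BLAST hit for each contig, can be illustrated with a small sketch. The report does not state how the best hit is chosen, so this example assumes tab-separated BLAST output in the standard 12-column tabular layout (query ID in column 1, subject ID in column 2, bit score in column 12) and assumes that the best hit is the one with the highest bit score. The function name and the example records are hypothetical.

```python
def best_hits(blast_lines):
    """Map each contig to its (subject, bit score) hit with the highest bit score."""
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        contig, subject, bitscore = fields[0], fields[1], float(fields[11])
        # Keep this hit only if it beats the best one seen so far.
        if contig not in best or bitscore > best[contig][1]:
            best[contig] = (subject, bitscore)
    return best


# Hypothetical tabular records for two contigs.
example = [
    "contig_1\tsubject_A\t99.9\t5000\t5\t0\t1\t5000\t1\t5000\t0.0\t9100",
    "contig_1\tsubject_B\t95.0\t4000\t200\t3\t1\t4000\t1\t4000\t0.0\t6500",
    "contig_2\tsubject_C\t98.5\t3000\t45\t1\t1\t3000\t1\t3000\t0.0\t5300",
]
for contig, (subject, score) in best_hits(example).items():
    print(contig, subject, score)
```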
Your samples do not show any major contamination.
Distribution of the features annotated by Prokka for each sample.
Following the evaluations and quality controls of the assemblies, the contigs obtained are annotated.
The annotations that were performed allow us to obtain on average 5446 CDS (minimum of 4634 and maximum of 6715) (Figure 8a, b, Table 5 and Appendix 4). The average rRNA number is 19 (Figure 8c and Table 5). The average number of tRNAs is 85 (Figure 8e and Table 5).
The annotation result files are available in the subfolder annotation by sample. This folder contains the following files:
The table corresponding to Figure 8 is available in Appendix 5.
The goal of the project was to assemble the genomes of 26 strains of bacteria using high-throughput sequencing data from Oxford Nanopore and Illumina technologies.
The reads were first cleaned following the protocol detailed in the Methodology section. Then, the reads were assembled and quality metrics were calculated to evaluate these assemblies.
Overall, the assembly metrics are good for your samples, with good accuracy and contiguity. Sample D46 yields a more fragmented assembly.
Based on the completion metrics, we estimate that the assemblies are complete, although 2 of your samples, D22 and D24, have lower results.
Moreover, the total sizes of the assemblies are close to those of the reference genomes.
This part is written according to the project.
1 rue du Pr. Calmette,
59000 LILLE, France
+33 (0) 3 62 26 37 77
bioinfo@genoscreen.fr
© GenoScreen 2022