Bioinformatics analysis report


Project report : Hybrid genome assemblies of 26 strains : Escherichia sp. and Pseudomonas sp.

To the attention of : First name LAST NAME , Company - CITY, Country

Analysis/Writing : First name LAST NAME, Bioinformatics Engineer

Corrections/Validation : First name LAST NAME, Ph.D, Operations manager

Main results

Figure 1 - Summary of the quality of the results

Scores out of 5 of the results according to (Data) data quality and contamination; (Contiguity) number and size of contigs; (Completion) assembly completeness; (Correctness) assembly errors; (Annotation) annotation.

The goal of the project was to assemble the genomes of 26 strains of bacteria using high-throughput sequencing data from Oxford Nanopore and Illumina technologies.

Overall, the assembly metrics are good for your samples. We observe good accuracy and contiguity. For sample D46, we obtain a more fragmented assembly.

Based on the completion metrics, we can consider the assemblies complete, although 2 of your samples, D22 and D24, have lower results.

Moreover, the total sizes of the assemblies are close to the size of the reference genomes.

Project description

This report describes all bioinformatics analyses that were performed following high-throughput sequencing using Oxford Nanopore and Illumina technologies. The objective was to perform de novo hybrid assemblies of 26 strains of Escherichia sp. and Pseudomonas sp., and to annotate the assembled genomes.

This process was carried out in the following stages:

First, a quality check of the data was performed before and after the cleaning steps, which include detection and removal of adapters as well as trimming of low-quality bases.
Following the cleaning, the genomes of the 26 strains of Escherichia sp. and Pseudomonas sp. were assembled.
The advantage of a hybrid approach with short and long reads is that the assemblies can be corrected with polishing methods, which also corrects sequencing errors.
Then, the quality of these assemblies was evaluated with four categories of metrics: contiguity, correctness, completion and contamination.
Finally, an annotation was performed.

Below is a representation of the key steps of the bioinformatics pipeline used to obtain the de novo assemblies.

Figure 2 - Key steps from the bioinformatics pipeline leading to the hybrid assembly and annotation of the strains

Overview of the data sent

Summary tables of results :

Appendix 1: Statistics on Illumina data (short reads)
Appendix 2: Statistics on Oxford Nanopore data (long reads)
Appendix 3: Information and statistics on final assemblies
Appendix 4: Taxonomic assignment of contigs
Appendix 5: Annotation Information

For each of the samples :

Fasta sequences after assembly and corrections from long and short reads.
The annotation results for each assembly and each of its contigs. Each folder contains annotation files in different standard formats: generic feature format (.gff), GenBank (.gbk), protein sequences (.faa), nucleotide sequences (.ffn) and a simplified tabulated version (.tbl).
A depth of coverage graph along each contig.
Reads assignment report.

Below is an example of the structure of each sub-folder corresponding to the samples :

Sample/
├── Sample_assembly.fasta
├── Sample_coverage_plots
│   ├── 500kb.png
│   ├── contig_1.pdf
│   └── …
├── Sample_k2_report.txt
└── Sample_prokka
    ├── mygenome.faa
    ├── mygenome.ffn
    ├── mygenome.gbk
    ├── mygenome.gff
    └── mygenome.tsv

Results

Quality control and data cleaning

Illumina Data

Figure 3

Illumina data cleaning ensures that the reads are of excellent quality (>Q30), and therefore the best possible quality for downstream analysis. Read adapters are removed first, then the reads are filtered based on their quality.

Details of the data are available in Appendix 1.

Figure 3 - Illumina statistics

Distribution of raw and filtered Illumina data for all samples by (a) number of reads ; (b) number of bases ; (c) estimated depth ; (d) mean read size ; (e) median read size.

Table 1

Data cleaning recovers on average 69.43% of the sequenced bases (minimum 65.81%, maximum 72.49%) (Figure 3b and Table 1). The resulting average depth estimated on the filtered data (based on a genome size of 5 Mb) is 130.82 X (minimum 46.86 X, maximum 302.2 X) (Figure 3c and Table 1).
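The two quantities reported above (percentage of bases recovered and estimated depth) can be reproduced with a short calculation. The sketch below is only illustrative: the function names and the sample values are hypothetical, and the 5 Mb genome size is the assumption stated in the text.

```python
def recovered_percent(raw_bases: int, filtered_bases: int) -> float:
    """Percentage of sequenced bases kept after cleaning."""
    if raw_bases <= 0:
        raise ValueError("raw_bases must be positive")
    return 100.0 * filtered_bases / raw_bases


def estimated_depth(total_bases: int, genome_size_bp: int = 5_000_000) -> float:
    """Theoretical depth (X): total bases divided by the assumed genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return total_bases / genome_size_bp


# Hypothetical sample: 1 Gb sequenced, 700 Mb kept after cleaning.
raw_bases, filtered_bases = 1_000_000_000, 700_000_000
print(recovered_percent(raw_bases, filtered_bases))  # 70.0
print(estimated_depth(filtered_bases))               # 140.0
```

Because the depth depends directly on the assumed genome size, a Pseudomonas sample with a genome closer to 7 Mb would have a lower real depth than the value estimated with 5 Mb.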

Nanopore Data

Figure 4

ONT data cleaning ensures that the reads are of good quality (>Q10) and have a minimal size (>1000 bp), which maximizes the overlap size between reads and makes the assembly more robust.
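As an illustration of the two thresholds above, the following sketch decides whether a single read passes the filter. It is not the tool used in the pipeline: the function names are hypothetical, the quality string is assumed to be Phred+33 encoded, and the mean quality is assumed to be computed from the mean error probability, as is commonly done for long reads.

```python
import math


def mean_phred(quality_string: str, offset: int = 33) -> float:
    """Mean read quality, averaged over error probabilities (assumed convention)."""
    if not quality_string:
        return 0.0
    probabilities = [10 ** (-(ord(char) - offset) / 10) for char in quality_string]
    return -10 * math.log10(sum(probabilities) / len(probabilities))


def keep_read(sequence: str, quality_string: str,
              min_quality: float = 10.0, min_length: int = 1000) -> bool:
    """Keep reads longer than min_length with a mean quality above min_quality."""
    return len(sequence) > min_length and mean_phred(quality_string) > min_quality


# Hypothetical reads: '5' encodes Q20 and '&' encodes Q5 in Phred+33.
print(keep_read("A" * 1500, "5" * 1500))  # long, good-quality read
print(keep_read("A" * 1500, "&" * 1500))  # long, low-quality read
print(keep_read("A" * 500, "5" * 500))    # short read
```

Averaging error probabilities rather than raw Phred scores penalizes reads with a few very poor stretches, which is why it is often preferred for Nanopore data.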

Details of the data are available in Appendix 2.

Figure 4 - Oxford Nanopore Technologies Statistics

Distribution of raw and filtered Oxford Nanopore Technologies data for all samples by (a) number of reads ; (b) number of bases ; (c) depth ; (d) mean read size ; (e) median read size.

Table 2

For your dataset, data filtering significantly increases the median and mean read sizes, without a major impact on the number of bases or the theoretical depth (Figure 4 and Table 2). The median depth of the long reads is 85.47 X (Figure 4c and Table 2), with a global median read size of around 3941.5 bp (minimum 1769 bp, maximum 6783 bp) (Figure 4e and Table 2) and a global mean read size of around 7443 bp (minimum 3258 bp, maximum 11618 bp) (Figure 4d and Table 2).

Assembly

Figure 5

Figure 5 - Statistics on the obtained assemblies

Distribution of the obtained assemblies according to (a) the total size of the assembly ; (b) the N50 ; (c) the L50 ; (d) the number of contigs ; (e) the GC percentage ; (f) the percentage of complete orthologous genes found ; (g) the percentage of Illumina reads mapped back on the assembly ; (h) the percentage of reads assigned to the species. The color of the dots corresponds to the species mostly found in the sample.

Table 3

The percentage of reads assigned to the species does not necessarily reflect the actual proportion of the species in the sample. This may be due to the use of unique k-mers, which regularly results in assignments to a higher taxonomic rank than the species, or to the low representation of the species in the databases.

Box 1 : Definitions

N50 : Given a set of contigs sorted from longest to shortest, the N50 is the length of the contig at which the cumulative length reaches 50% of the total assembly length.
L50 : Given the same sorted set of contigs, the L50 is the smallest number of contigs whose summed length makes up at least half of the total assembly size.
RMBC (Reads Mapped Back to Contigs) : Percentage of reads realigned on the assembly.
Depth of coverage : Number of times a nucleotide position is represented (in a dataset of reads).
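The N50 and L50 definitions above can be made concrete with a small calculation. This is a minimal sketch, not the tool used to produce the report metrics; the function name and the contig lengths are hypothetical.

```python
def n50_l50(contig_lengths):
    """Return (N50, L50) for a list of contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig is required")
    half_total = sum(lengths) / 2
    cumulative = 0
    for count, length in enumerate(lengths, start=1):
        cumulative += length
        # The first contig that brings the cumulative length to half of the
        # assembly gives the N50 (its length) and the L50 (its rank).
        if cumulative >= half_total:
            return length, count


# Hypothetical assembly: one chromosome and two small contigs.
print(n50_l50([5_000_000, 120_000, 80_000]))  # (5000000, 1)
# Hypothetical fragmented assembly.
print(n50_l50([40, 30, 20, 10]))              # (30, 2)
```

An assembly with a single chromosome-sized contig therefore has an L50 of 1 and an N50 equal to the length of that contig.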

Assignment of contigs

Figure 6

Figure 6 is a graphical representation of the distribution of BLAST assignments based on the species names associated with each contig (Table 4, Appendix 3). In a perfect assembly of a given species without plasmids, the percentage of contigs associated with that species is expected to be close to 100%.

For readability, species with assignment percentages below 1% of the assembled genome size have been masked in this representation.
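The masking rule above can be sketched as follows: the share of each species is computed as a percentage of the assembled genome size, and shares below the threshold are hidden. The function name, species labels and contig lengths are hypothetical and only serve as an example.

```python
from collections import defaultdict


def species_shares(contig_assignments, min_percent=1.0):
    """Percentage of the assembly size per species, masking shares below min_percent.

    contig_assignments: iterable of (species_name, contig_length_bp) pairs.
    """
    totals = defaultdict(int)
    for species, length in contig_assignments:
        totals[species] += length
    assembly_size = sum(totals.values())
    if assembly_size == 0:
        return {}
    shares = {species: 100 * size / assembly_size for species, size in totals.items()}
    return {species: share for species, share in shares.items() if share >= min_percent}


# Hypothetical assembly of 5 Mb in total.
contigs = [
    ("Escherichia coli", 4_900_000),
    ("Escherichia coli", 50_000),
    ("Hypothetical plasmid", 45_000),
    ("Other", 5_000),
]
print(species_shares(contigs))  # {'Escherichia coli': 99.0}
```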

Figure 6 - Species assignment distribution of contigs with BLAST

The results of BLAST may be biased by the presence or absence of the species of interest in the databases.

Table 4

Table 4 repeats part of Appendix 3 and lists the assembled contigs that are larger than 50 kb. A BLAST search is performed for each contig in the assembly to obtain the name of the species or plasmid associated with the sequence of the contig.

The complete list of contigs is available in table form in Appendix 3.

General evaluation of assemblies

Evaluating a de novo assembly is critical but inherently complex. Several criteria must be examined to define the quality of an assembly. In order to get the most complete view of the important characteristics of the assemblies, we focus on the “4C” (Contiguity, Completion, Correctness, Contamination), highlighting different informative metrics (Figure 5 and Table 3).

Details of these data are available in Appendices 3 and 4.

Contiguity : How fragmented is the assembly?

Contiguity is related to the size and number of contigs. The goal of an assembly is to reflect the contiguity of the genome in vivo: the assembly process seeks to maximize the size of the contigs and minimize their number, so as to reflect the actual number and size of chromosomes in the organism. Contiguity errors can occur, for example, due to assembler settings that allow unlinked contigs to be joined or that prevent linked contigs from being joined.

Contiguity is most often measured by the N50 and L50. Here, the assemblies have an average N50 of 5557403.08 bp (about 5.56 Mb) and an average L50 of 1 (Figure 5b, c and Table 3).

Overall, the contiguity metrics of the assemblies are good: few contigs, low L50 and high N50. In addition, the sizes of the assemblies are close to those expected for your species, about 5 Mb for Escherichia and 7 Mb for Pseudomonas. One of your samples, D46, has a more fragmented assembly with 19 contigs.

Completion : Is there any information missing?

Contiguity alone does not guarantee the quality of an assembly. It is important to verify that all the information available in the reads is used and that the assembly contains all the orthologous genes conserved in a given clade.

For your assemblies, the most complete dataset available for evaluating orthologues contains 440 orthologues (Figure 5f and Table 3). The assemblies contain a minimum of 83% and a maximum of 99.8% of these orthologues.

Furthermore, aligning the Illumina reads to the Nanopore assembly shows that on average 99.05% of the reads (minimum 94.12%, maximum 99.94%) were used to correct the assembly (Figure 5 and Table 3). Together, these two pieces of information allow us to confirm that the assemblies are complete.

Almost all of your assemblies have a good proportion of orthologous genes detected and a correct read alignment rate (Figure 5f, g and Table 3). However, sample D24 has a lower detection of orthologous genes and sample D22 has a lower Illumina read alignment rate.

Correctness : Are there any errors in the assembly?

The accuracy of the assembly has to be verified by checking for assembly errors such as unresolved repeat regions. To do this, the depth of coverage along each contig is analyzed to determine whether some regions have a markedly higher depth than the rest of the contig, which would indicate a problem with the assembly.
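One simple way to screen a coverage profile for such regions is to compare windowed depth against the contig median. The sketch below is an illustration under stated assumptions, not the method used to produce the coverage plots: the function name, the 1000 bp window and the two-fold threshold are all hypothetical choices, and low-depth windows are flagged as well.

```python
from statistics import median


def flag_depth_outliers(depths, window=1000, fold=2.0):
    """Return (window_start, mean_depth) for windows far from the contig median.

    depths: per-base depth values along one contig.
    A window is flagged when its mean depth is more than `fold` times above
    or below the median of all window means.
    """
    windows = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        windows.append((start, sum(chunk) / len(chunk)))
    if not windows:
        return []
    baseline = median(mean_depth for _, mean_depth in windows)
    if baseline <= 0:
        return []
    return [
        (start, mean_depth)
        for start, mean_depth in windows
        if mean_depth > fold * baseline or mean_depth < baseline / fold
    ]


# Hypothetical contig: uniform 100 X depth with one 1 kb region at 300 X,
# as could be produced by a collapsed repeat.
depths = [100] * 5000 + [300] * 1000 + [100] * 4000
print(flag_depth_outliers(depths))  # [(5000, 300.0)]
```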

Figure 7 - Example of a coverage profile along a contig

Representation of the coverage of the largest contigs in a sample. Each point represents the depth from the Illumina reads.

Overall, the depth along the largest contigs is homogeneous.

Contamination : Have any contaminations generated spurious data?

High-throughput sequencing theoretically provides high-quality de novo assembled genomic sequences, but in practice, DNA extracts are often contaminated with sequences from other organisms. To detect this, contamination of the raw sequences and of the final assemblies is checked using two complementary approaches. The first performs a taxonomic classification of the raw reads. The second runs BLAST on the assembly and keeps the best hit for each contig (Figure 6, Table 3 and Appendix 3).
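The "best hit per contig" step of the second approach can be sketched as below. This assumes BLAST results in the default 12-column tabular layout (query ID in column 1, subject ID in column 2, bit score in column 12) and that "best" means highest bit score. Neither assumption is confirmed by this report, and the function name and example rows are hypothetical.

```python
def best_hit_per_contig(blast_lines):
    """Return {contig_id: (subject_id, bitscore)} keeping the highest bit score."""
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        contig, subject, bitscore = fields[0], fields[1], float(fields[11])
        if contig not in best or bitscore > best[contig][1]:
            best[contig] = (subject, bitscore)
    return best


# Hypothetical tabular rows: two hits for contig_1, one for contig_2.
rows = [
    "contig_1\tsubject_A\t99.0\t1000\t5\t0\t1\t1000\t1\t1000\t0.0\t1800",
    "contig_1\tsubject_B\t99.5\t1000\t2\t0\t1\t1000\t1\t1000\t0.0\t1900",
    "contig_2\tsubject_C\t97.0\t300\t9\t0\t1\t300\t1\t300\t1e-150\t500",
]
for contig, (subject, score) in best_hit_per_contig(rows).items():
    print(contig, subject, score)
```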

Your samples do not show any major contamination.

Annotation

Figure 8

Figure 8 - Structural and functional annotation of genomes

Distribution of features from Prokka for each sample annotation

Table 5

Following the evaluations and quality controls of the assemblies, the contigs obtained are annotated.

The annotations yield on average 5446 CDS per genome (minimum 4634, maximum 6715) (Figure 8a, b, Table 5 and Appendix 5). The average number of rRNAs is 19 (Figure 8c and Table 5). The average number of tRNAs is 85 (Figure 8e and Table 5).

The annotation result files are available in the subfolder annotation by sample. This folder contains the following files:

Sample_ID.gff (GFF3 format) : annotation and sequence
Sample_ID.gbk (Standard Genbank. multi-FASTA) : annotation in GenBank format
Sample_ID.fna : sequence of contigs used for annotation (FASTA)
Sample_ID.faa : protein sequence (FASTA)

The table corresponding to Figure 8 is available in Appendix 5.

Conclusion

The goal of the project was to assemble the genomes of 26 strains of bacteria using high-throughput sequencing data from Oxford Nanopore and Illumina technologies.

The reads were first cleaned following the protocol detailed in the Methodology section. Then, the reads were assembled and quality metrics were calculated to evaluate these assemblies.

Overall, the assembly metrics are good for your samples. We observe good accuracy and contiguity. For sample D46, we obtain a more fragmented assembly.

Based on the completion metrics, we can consider the assemblies complete, although 2 of your samples, D22 and D24, have lower results.

Moreover, the total sizes of the assemblies are close to the size of the reference genomes.