Location: National Cold Water Marine Aquaculture Center
Project Number: 8030-31000-005-021-S
Project Type: Non-Assistance Cooperative Agreement
Start Date: Sep 1, 2020
End Date: Aug 31, 2025
Objective:
To assemble existing long-read sequence data for the Eastern oyster genome, correct base errors with short-read data, scaffold the assembled contigs with Hi-C maps, and build a chromosome-level assembly guided by genetic linkage map data, using advanced bioinformatics pipelines. Substantial improvements in assembly accuracy are expected.
Approach:
A recent read-coverage analysis of resequencing data for the Eastern oyster reference genome revealed that the current assembly is a mosaic of diplotigs and haplotigs. This artifactual redundancy, introduced by the assembly pipeline in certain regions of the genome, can have severe consequences for downstream genomic analyses. To correct these errors and produce an improved Eastern oyster genome assembly, all available sequence data for the reference oyster will be used to create a de novo assembly with updated, state-of-the-art bioinformatic methods (Field et al. [1]). The data for this project already exist in our local institution's disk directory and include single-molecule real-time (SMRT) sequences (Pacific Biosciences), 150 bp Illumina reads from the same DNA source used for SMRT sequencing, and 150 bp Illumina sequences from a Hi-C library (Phase Genomics) prepared from a closely related sibling of the reference oyster. All quality-filtered SMRT sequences will be assembled using Canu [2] and Flye [3]. Assembled contigs will then be error-corrected with two rounds of Arrow [4] using the raw SMRT sequences. Residual insertion and deletion errors will receive a final polish with Pilon [5] using Illumina short reads from the same DNA source. At this point, we will compare the two assemblies on contiguity and accuracy metrics and carry the higher-quality assembly forward. Accuracy will be assessed with Merqury, a tool for reference-free assembly evaluation that compares k-mers in a de novo assembly to those found in unassembled, high-accuracy Illumina reads [6]. After contig error correction, duplicated contigs will be removed using Purge Haplotigs [7]. We expect to carry out at least two successive haplotig purging steps, as this has been shown to remove haplotigs that escape detection in a single round.
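As a rough illustration of the contiguity and accuracy comparison described above, the sketch below computes a contig N50 and a k-mer-based consensus quality value (QV). The QV formula follows the general form used by reference-free k-mer evaluation (the fraction of assembly k-mers also seen in the reads, converted to a per-base error rate and Phred-scaled). The function names and the example numbers are hypothetical, and the project itself would rely on the cited tools rather than code like this.

```python
import math


def n50(contig_lengths):
    """Return the N50: the length of the contig at which the running sum of
    lengths, taken from longest to shortest, first reaches half the total."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


def kmer_qv(total_kmers, shared_kmers, k):
    """Phred-scaled consensus quality from k-mer counts (illustrative).

    total_kmers:  number of k-mers in the assembly
    shared_kmers: number of those k-mers also found in the Illumina reads
    k:            k-mer size
    """
    if not 0 < shared_kmers <= total_kmers:
        raise ValueError("expected 0 < shared_kmers <= total_kmers")
    # Probability that a single base is correct, assuming a k-mer is
    # supported by the reads only if all k of its bases are correct.
    p_base_correct = (shared_kmers / total_kmers) ** (1.0 / k)
    error_rate = 1.0 - p_base_correct
    if error_rate == 0.0:
        return math.inf
    return -10.0 * math.log10(error_rate)


if __name__ == "__main__":
    # Hypothetical contig lengths (bp) for two candidate assemblies.
    assembly_a = [40, 30, 20, 10]
    assembly_b = [60, 25, 10, 5]
    print("N50 A:", n50(assembly_a))
    print("N50 B:", n50(assembly_b))
    # Hypothetical k-mer counts with k = 21.
    print("QV:", round(kmer_qv(1000, 990, 21), 2))
```

Reporting both metrics side by side matters because an assembly can gain contiguity through misjoins or retained haplotigs, so a higher N50 alone would not justify choosing it.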
The Hi-C library sequences will be aligned to all purged contigs to order and orient them into scaffolds using the SALSA2 [8] and Juicebox [9] programs. We will build a final chromosomal index file from the Hi-C-derived scaffolds using Chromonomer [10] and the sequence markers of the Eastern oyster genetic linkage map.
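The idea of using linkage map markers to place scaffolds can be sketched as follows. This is a simplified illustration only, not Chromonomer's actual algorithm: each scaffold is assigned to the linkage group holding most of its markers, given an approximate map position, and oriented according to whether genetic position (cM) rises or falls with physical position (bp). The function name, tuple layout, and marker values are all hypothetical.

```python
from collections import defaultdict


def place_scaffolds(markers):
    """Assign each scaffold a linkage group, mean map position, and orientation.

    markers: iterable of (scaffold, bp_position, linkage_group, cm_position).
    Returns {scaffold: (linkage_group, mean_cm, orientation)} where
    orientation is "+", "-", or "?" when the markers cannot decide.
    """
    by_scaffold = defaultdict(list)
    for scaffold, bp, group, cm in markers:
        by_scaffold[scaffold].append((bp, group, cm))

    placements = {}
    for scaffold, hits in by_scaffold.items():
        # Choose the linkage group supported by the most markers.
        counts = defaultdict(int)
        for _, group, _ in hits:
            counts[group] += 1
        best_group = max(counts, key=counts.get)

        # Keep only markers from that group, sorted by physical position.
        points = sorted((bp, cm) for bp, group, cm in hits if group == best_group)

        # Count marker pairs whose cM order agrees (+1) or disagrees (-1)
        # with their bp order; the sign of the sum gives the orientation.
        score = 0
        for i in range(len(points)):
            for j in range(i + 1, len(points)):
                diff = points[j][1] - points[i][1]
                score += (diff > 0) - (diff < 0)
        orientation = "+" if score > 0 else "-" if score < 0 else "?"

        mean_cm = sum(cm for _, cm in points) / len(points)
        placements[scaffold] = (best_group, mean_cm, orientation)
    return placements


if __name__ == "__main__":
    # Hypothetical markers: (scaffold, bp, linkage group, cM).
    example = [
        ("scaf1", 100, "LG1", 1.0),
        ("scaf1", 500, "LG1", 2.0),
        ("scaf1", 900, "LG1", 3.5),
        ("scaf2", 100, "LG1", 9.0),
        ("scaf2", 800, "LG1", 7.0),
        ("scaf3", 50, "LG2", 4.0),
    ]
    for name, placement in sorted(place_scaffolds(example).items()):
        print(name, placement)
```

A scaffold with a single informative marker is reported with orientation "?", which mirrors the practical limit that the linkage map can anchor such a scaffold to a chromosome but cannot orient it.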