
Title: Sequencing artifacts in the type A influenza databases and attempts to correct them

Author
Suarez, David
Chester, Nikki - Athens Academy
Hatfield, Jason - Athens Academy

Submitted to: Influenza and Other Respiratory Viruses
Publication Type: Peer Reviewed Journal
Publication Acceptance Date: 1/15/2014
Publication Date: 7/1/2014
Publication URL: http://handle.nal.usda.gov/10113/60131
Citation: Suarez, D.L., Chester, N., Hatfield, J. 2014. Sequencing artifacts in the type A influenza databases and attempts to correct them. Influenza and Other Respiratory Viruses. 8(4):499-505. DOI: 10.1111/irv.12239

Interpretive Summary: With modern technology, scientists have determined the genetic sequences of many thousands of influenza viruses. These sequences are placed in public databases so that anyone in the world can use the information in their own research. There are standards for submitting data to the public databases, but the quality of the submitted data depends on the scientist who prepared it. A research project was developed in collaboration with a local high school to examine influenza sequences in the public databases and identify those that likely contained errors. When a likely error was identified, a student contacted the submitting scientist and asked them to review the sequence and correct it if necessary. A total of 1081 sequences were identified, and the students succeeded in getting over 200 of them corrected. This paper describes the types of errors commonly found and provides information on how to prevent them in the future.

Technical Abstract: Type A influenza virus causes a wide range of disease in both humans and animals, and considerable research effort goes into studying and sequencing this virus. Currently, there are over 276,000 gene sequences representing over 65,000 strains in publicly available databases. However, the quality of the sequences submitted to these public databases is determined by the contributor. Many sequence errors are present in the databases; these errors can affect sequence analysis and force individual researchers to curate the data extensively before using it. As part of a high school class project, bioinformatics analysis was performed on all six internal gene segments of influenza A virus. Sequences longer than the accepted length of their segment were selected, with the hypothesis that these sequences would contain an error. A total of 1081 sequences met this criterion, which represents a 0.82% rate of potential errors. Specific attention was given to sequences with additional nucleotides upstream or downstream of the highly conserved non-coding ends of the viral segments. Three types of errors were commonly observed: non-influenza primer sequence was not removed from the sequence; the PCR product was cloned and plasmid sequence was included in the sequence; and Taq polymerase added an adenine at the end of the PCR product. Internal insertions of nucleotide sequence were also commonly observed, but in many cases it was unclear whether the sequence was correct or actually contained an error. Students contacted some of the sequence submitters, alerting them to the issue(s) and requesting a review of their suspect sequence(s). Students also asked the labs to resubmit corrected sequences when appropriate to update the public databases. A total of 215 sequences, or 22.8% of the suspect sequences, were corrected in the public databases in the first year of the student project. Unfortunately, 138 additional sequences with possible errors were added to the databases in the last year. Greater awareness of the need for data integrity in sequences submitted to public databases is needed to fully reap the benefits of the huge amount of sequence data available for analysis.
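
The screening approach described in the abstract (flag sequences longer than the accepted segment length, then look for extra nucleotides outside the conserved non-coding ends) can be illustrated with a minimal Python sketch. This is not the authors' actual pipeline. The segment lengths, the terminal motifs (written in cDNA sense), the 60-nucleotide search window, and the function name `flag_sequence` are all assumptions introduced here for illustration and are not given in the abstract.

```python
import re

# ASSUMPTION: commonly cited full-length sizes (nt) for the six internal
# segments; real values vary by strain and are not given in the abstract.
EXPECTED_LENGTH = {"PB2": 2341, "PB1": 2341, "PA": 2233,
                   "NP": 1565, "M": 1027, "NS": 890}

# ASSUMPTION: conserved termini in cDNA (positive) sense. Position 4 of the
# 5' motif is allowed to be A or G.
FIVE_PRIME = re.compile(r"AGC[AG]AAAGCAGG")
THREE_PRIME = "CCTTGTTTCTACT"

# ASSUMPTION: only look for the termini near each end of the record.
WINDOW = 60


def flag_sequence(segment, seq):
    """Return a list of reasons a database record looks suspect."""
    seq = seq.upper()
    flags = []

    # Criterion from the abstract: longer than the accepted segment length.
    if len(seq) > EXPECTED_LENGTH[segment]:
        flags.append("longer_than_expected")

    # Extra nucleotides upstream of the conserved 5' end
    # (for example leftover primer or plasmid sequence).
    start = FIVE_PRIME.search(seq[:WINDOW])
    if start and start.start() > 0:
        flags.append(f"extra_5prime:{start.start()}")

    # Extra nucleotides downstream of the conserved 3' end
    # (for example a single A added by Taq polymerase).
    tail = seq[-WINDOW:]
    idx = tail.rfind(THREE_PRIME)
    if idx != -1:
        extra = len(tail) - (idx + len(THREE_PRIME))
        if extra > 0:
            flags.append(f"extra_3prime:{extra}")

    return flags


# Synthetic NS-like records (made-up filler, not real viral sequence).
clean = "AGCAAAAGCAGG" + "T" * 865 + "CCTTGTTTCTACT"
suspect = "GGATCC" + "AGCAAAAGCAGG" + "T" * 866 + "CCTTGTTTCTACT" + "A"
print(flag_sequence("NS", clean))
print(flag_sequence("NS", suspect))
```

A flag from a sketch like this only marks a record for human review. As the abstract notes, some apparent insertions may be genuine, so the final judgment rests with the original submitter.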