
findmap.f90 Align sequence reads to reference map, call previous variants, and identify new variants

Downloads Version 2.2 programs, example and test outputs, and executables
(released December 10, 2018)
  • Findmap alignment and Findvar variant calling series
  • Programs mapsim.f90, storemap.f90, map2seq.f90, findmap.f90, findvar.f90, leftmost.f90, and depth2vcf.c included
    • Skip program mapsim when using actual map and variant files
    • Skip simulation program map2seq when using actual fastq reads
    • Programs leftmost and depth2vcf optional for converting to other formats
  • To save space, zip download does not include actual reference genome, variant list, or fastq sequence read files
    • Download findmapV2.2.zip onto a computer with the Unix operating system
    • Type unzip findmapV2.2.zip, and hit enter
  • After unzipping the package, type runsim.script to run the simulation programs, which generate an example simulated reference map, a simulated variant file, simulated fastq files, and other program input and output files; then type runmap.script to run the alignment and variant calling programs
  • Programs will execute in the Test_Output directory using options files there.
    • Options files give more detail on available options and recommended values
    • Output files in Test_Output should match those in Example_Output fairly closely
    • Each new project should be run in a new directory because standard file names are used
  • For quick testing, mapsim.options and map2seq.options are set to just 1 chromosome; reset them to 30 chromosome pairs for cattle (or another number that matches your reference.fa file) before processing a full genome. Similarly, storemap.options sets maxhash to 87654321 and maxdup to 10 million to reduce memory in initial testing, but values of 287654321 and 300 million are recommended for a 3-Gbase genome
  • Findmap output formats .found and .lost now include an extra column for map quality score
    • findvar is not backward compatible to input data aligned by earlier findmap versions
    • Map quality scores are accurate if the detection option is set to 2, but that setting quadruples run time
  • Do not include all of the unmapped contigs in the fasta map file as separate chromosomes because those would increase memory for little gain; from the human reference hg38.fa, we used 25 chromosomes (including 22 autosomes plus X, Y, and mitochondrial DNA) and reformatted the file using fasta.sas (included)
  • Variant list from the 1000 Bull Genomes Project (Daetwyler et al., 2014) should be concatenated across chromosomes and then reformatted to variants.prior using program variant.sas.
    • Main reason to reformat is that indel notation and locations used in findmap differ from vcf notation
    • variant.sas program can also reformat the 00-common_all.vcf human variant file
  • Simulation options
    • For markersim.f90, there is an optional input file, genetic.cor, in the subdirectory Example_Output.
    • For storemap.f90, there is an optional input file, flank.location, in the subdirectory Example_Output; its only purpose is to output the flanking sequence for each location in the input file
  • Mapsim option newvar should be 0 when processing actual data so that all known variants will be used
    • In mapsim.options, newvar can be set to 5 (or other odd number) to exclude every 5th variant and demonstrate detection of those "new" variants
    • Program testvar.sas then tests accuracy of variant calls, both for previously known and newly discovered variants
  • Program map2seq simulates genotypes for the variant list in groups of 4: the 1st and 3rd variants are heterozygous, every 2nd variant is homozygous reference, and every 4th variant is homozygous for the alternate allele. Optionally, it can instead read file genotypes.true to declare variant genotypes for each DNA source
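The repeating group-of-4 genotype pattern described for map2seq can be illustrated with a short Python sketch. This is not the Fortran code from map2seq.f90; the function name and the coding of genotypes as alternate-allele counts (0, 1, 2) are assumptions made only for this illustration.

```python
# Genotype codes are an assumption for illustration: number of alternate alleles.
HOM_REF, HET, HOM_ALT = 0, 1, 2


def simulated_genotype(variant_index):
    """Return the simulated genotype for a 1-based position in the variant list."""
    position_in_group = (variant_index - 1) % 4 + 1
    if position_in_group == 4:
        return HOM_ALT   # every 4th variant: homozygous alternate allele
    if position_in_group == 2:
        return HOM_REF   # every 2nd variant in the group: homozygous reference
    return HET           # 1st and 3rd variants: heterozygous


# Genotypes for the first 8 variants (two complete groups of 4).
pattern = [simulated_genotype(i) for i in range(1, 9)]
print(pattern)
```

With this coding, half of the simulated variants are heterozygous and one quarter fall in each homozygous class.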
Version 2.1 programs, example outputs, and executables
(released October 1, 2018)


Version 2 programs, example outputs, and executables
(released May 31, 2018)


Version 1 programs, example outputs, and executables
(released July 19, 2016; last updated July 28, 2016)


Version 0 programs, example outputs, and executables
(beta version; released January 8, 2016)

Inputs reference.fa Standard fasta format for reference genome:
> as 1st character in line for each new chromosome
50-byte lines of ACGT (or N for unknown bases), or acgt for repeated sections
All programs in this series treat lower and uppercase as the same because storemap identifies, counts, stores, and links repeated k-mers to each other while hashing reference map
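As a rough illustration of the reference.fa conventions above, the following Python sketch reads a fasta file one chromosome at a time and folds lowercase (repeat-masked) bases to uppercase. It is only a sketch under those stated conventions, not the reader used in storemap.f90; the function name and the tiny in-memory example are hypothetical.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) for each chromosome, folding bases to uppercase."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A '>' in the 1st character starts a new chromosome.
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        else:
            # Treat acgt (repeated sections) the same as ACGT.
            chunks.append(line.upper())
    if name is not None:
        yield name, "".join(chunks)


# Hypothetical two-chromosome example held in memory instead of a real file.
example = io.StringIO(">chr1\nACGTacgtNN\nacgt\n>chr2\nTTTT\n")
chromosomes = dict(read_fasta(example))
print(chromosomes)
```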
variants.prior Lists all previously known SNPs and indels
Insertions reported 1 base to left of 1st base where they differ from reference genome, reading left to right
Deletions reported at their detected location, not 1 base to left
Use variant.sas to reformat the 1000 Bull Genomes variant file
Format: chr# location vartype (SNP, INS, or DEL)  variant#  length  alternate_allele
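The six fields listed in the variants.prior format can be read with a few lines of Python. This sketch assumes the fields are separated by whitespace, which the description does not state (the Fortran programs may expect fixed columns), and the example lines are invented for illustration rather than taken from a real variant list.

```python
from typing import NamedTuple


class PriorVariant(NamedTuple):
    chrom: str        # chr# (kept as text here)
    location: int
    vartype: str      # SNP, INS, or DEL
    variant_id: int   # variant#
    length: int
    alt_allele: str


def parse_prior_line(line):
    """Parse one variants.prior row, assuming whitespace-separated fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, found {len(fields)}")
    chrom, location, vartype, variant_id, length, alt_allele = fields
    if vartype not in ("SNP", "INS", "DEL"):
        raise ValueError(f"unknown variant type: {vartype}")
    return PriorVariant(chrom, int(location), vartype,
                        int(variant_id), int(length), alt_allele)


# Hypothetical rows, for illustration only.
snp = parse_prior_line("1 12345 SNP 1 1 G")
ins = parse_prior_line("1 20000 INS 2 3 ACT")
print(snp)
print(ins)
```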
 
fastq.filelist List of DNA source names such as source1, source2, etc., along with numeric IDs
source1.1.fq, source1.2.fq, source2.1.fq, source2.2.fq, etc.
Standard fastq format for paired end reads, with reads 1 and 2 of each pair at same position in 2 separate files for each DNA source
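Because reads 1 and 2 of each pair sit at the same position in their two files, pairs can be recovered by reading both files in step. The Python sketch below shows that idea with standard 4-line fastq records; the function names and the in-memory example reads are hypothetical, and this is not the input routine used by findmap.f90.

```python
import io
from itertools import islice, zip_longest


def read_fastq(handle):
    """Yield (read_name, bases, qualities) from 4-line fastq records."""
    lines = iter(handle)
    while True:
        record = [line.rstrip("\n") for line in islice(lines, 4)]
        if not record:
            return
        if len(record) < 4 or not record[0].startswith("@"):
            raise ValueError("truncated or malformed fastq record")
        yield record[0][1:], record[1], record[3]


def read_pairs(handle1, handle2):
    """Pair the Nth record of file 1 with the Nth record of file 2."""
    for read1, read2 in zip_longest(read_fastq(handle1), read_fastq(handle2)):
        if read1 is None or read2 is None:
            raise ValueError("the two fastq files hold different numbers of reads")
        yield read1, read2


# Hypothetical contents standing in for source1.1.fq and source1.2.fq.
file1 = io.StringIO("@r1/1\nACGT\n+\nIIII\n@r2/1\nGGCC\n+\nIIII\n")
file2 = io.StringIO("@r1/2\nTTAA\n+\nIIII\n@r2/2\nCCGG\n+\nIIII\n")
pairs = list(read_pairs(file1, file2))
print(pairs[0])
```

Raising an error when one file runs out first guards against pairing reads from files that are out of step.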
*.options Program control file with user-defined options

Outputs storemap.unf Hash table, etc., for reference map
reference.unf Unformatted map for faster input
variant.readdepth Number of ref and alt alleles, 1 row/variant
Format: variant# chr# var_location ref# alt#
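One simple use of variant.readdepth is to compute the fraction of reads carrying the alternate allele at each variant. The Python sketch below assumes whitespace-separated fields in the order given above; the function name and the example row are hypothetical.

```python
def alt_allele_fraction(line):
    """Return (variant#, alternate-allele fraction) for one variant.readdepth row."""
    variant_id, chrom, location, ref_count, alt_count = line.split()
    ref_count, alt_count = int(ref_count), int(alt_count)
    depth = ref_count + alt_count
    # A variant with no reads has no defined fraction.
    fraction = alt_count / depth if depth else None
    return int(variant_id), fraction


# Hypothetical row: variant 7 on chromosome 1 with 6 ref and 2 alt reads.
result = alt_allele_fraction("7 1 12345 6 2")
print(result)
```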
individual.readdepth Format: ID#  chip#  #SNPs
Read counts for A and B alleles stored in 1-byte hexadecimal format (input format for imputation program findhap4; VanRaden et al., 2015)
segments.found Alignments, errors, and known variant locations for segments where paired end locations differ by <fraglen
Format: segment# pair# direction chr# segment_location num_alts num_errs (var_locations var_type) (err_locations err_base)
segments.lost Same format, but for segments where paired end locations do not match
segments.newindels Locations and properties of new indels detected (those not already in variants.txt)
Format: segment# pair# direction chr# seg_location indel_size indel_location bases (inserted or deleted)
SNPs.new Summary of new SNPs including read depth and number of alternate alleles found
Locations can have >1 row if differing alternate alleles are observed
Format: chr# SNP_location read_depth num_alt ref_allele alt_allele
indels.new Summary of new indels including read depth and number of alternate alleles found
Locations can have >1 row if differing alternate alleles are observed
Format: chr# indel_location read_depth num_alt ref_allele alt_allele
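Because SNPs.new and indels.new can hold more than one row per location, a downstream script may want to group rows by chromosome and location before using them. The Python sketch below shows one way to do that, assuming whitespace-separated fields in the six-column order given above; the function name and example rows are hypothetical.

```python
from collections import defaultdict


def group_by_location(lines):
    """Group rows by (chr#, location); each entry is (alt_allele, num_alt, read_depth)."""
    grouped = defaultdict(list)
    for line in lines:
        fields = line.split()
        if len(fields) != 6:
            continue  # skip blank or malformed rows
        chrom, location, read_depth, num_alt, ref_allele, alt_allele = fields
        grouped[(chrom, int(location))].append(
            (alt_allele, int(num_alt), int(read_depth)))
    return dict(grouped)


# Hypothetical rows: two different alternate alleles observed at location 500.
rows = [
    "1 500 10 3 A G",
    "1 500 10 2 A T",
    "2 900 8 4 C T",
]
grouped = group_by_location(rows)
print(grouped)
```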
variants.all Combined list of prior and new variants in same format as variants.prior

References 2019 VanRaden, P.M., Bickhart, D.M., and O'Connell, J.R. Calling known variants and identifying new variants while rapidly aligning sequence data. J. Dairy Sci. 102:3216–3229.
2016 VanRaden, P.M., and D.M. Bickhart. Fast single-pass alignment and variant calling using sequencing data. Plant Anim. Genome XXIV Conf., San Diego, CA, Jan. 9–13, W161.

2016 VanRaden, P.M., D.M. Bickhart, and J.R. O'Connell. Identifying and calling insertions, deletions, and single-base mutations efficiently from sequence data. J. Dairy Sci. 99(E-Suppl. 1):140(abstr. 0302).
2015 VanRaden, P.M., C. Sun, and J.R. O'Connell. Fast imputation using medium- or low-coverage sequence data. BMC Genet. 16:82.
2014 Daetwyler, H.D., A. Capitan, H. Pausch, P. Stothard, R. van Binsbergen, R.F. Brøndum, X. Liao, A. Djari, S.C. Rodriguez, C. Grohs, D. Esquerré, O. Bouchez, M.N. Rossignol, C. Klopp, D. Rocha, S. Fritz, A. Eggen, P.J. Bowman, D. Coote, A.J. Chamberlain, C. Anderson, C.P. Van Tassell, I. Hulsegge, M.E. Goddard, B. Guldbrandtsen, M.S. Lund, R.F. Veerkamp, D.A. Boichard, R. Fries, and B.J. Hayes. Whole-genome sequencing of 234 bulls facilitates mapping of monogenic and complex traits in cattle. Nature Genet. 46:858–865.

2014 VanRaden, P.M., and C. Sun. Fast imputation using medium- or low-coverage sequence data. Proc. 10th World Congr. Genet. Appl. Livest. Prod., 179.

License Fortran package findmap.f90 is public domain and was developed with U.S. taxpayer funding. Accurate results are not guaranteed. Please report any bugs to paul.vanraden@usda.gov. You may modify, improve, use, and redistribute the code to anyone for any purpose. Or, you can ask Paul to make changes that could benefit U.S. evaluations and other users.

 Paul VanRaden
 Animal Genomics and Improvement Laboratory
 Agricultural Research Service, USDA