Downloading and interpreting search results:
When constructing a custom reference library, always download both the Genetic Data (FASTA format) and the Metadata (CSV) for your selections from the search results. These files work together to ensure that each sequence is properly linked with its description.
Genetic Data (FASTA format): In FASTA format, the line preceding each nucleotide sequence, called the FASTA definition line, must begin with a greater-than symbol (“>”), followed by a unique sequence identifier. For sequences downloaded from CaMDL, the sequence identifier is always the occurrenceID, a globally unique identifier for each sequence in CaMDL. The same occurrenceID identifies the corresponding record in the associated metadata CSV file. It is the only relational element between the Genetic Data and Metadata files, because sequences alone aren’t guaranteed to be unique.
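As an illustration of how the occurrenceID ties the two downloads together, here is a minimal Python sketch that reads a FASTA file and a metadata CSV and pairs them by identifier. It assumes the CSV has a column named occurrenceID, as described above. The helper names (read_fasta, link_records), the sample identifiers, and the sample rows are hypothetical and are not taken from CaMDL.

```python
import csv
import io


def read_fasta(handle):
    """Yield (identifier, sequence) pairs from an open FASTA file."""
    identifier, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first whitespace-delimited token after ">".
            parts = line[1:].split()
            identifier, chunks = (parts[0] if parts else ""), []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def link_records(fasta_handle, csv_handle, key="occurrenceID"):
    """Pair each sequence with its metadata row; report IDs with no metadata."""
    metadata = {row[key]: row for row in csv.DictReader(csv_handle)}
    linked, missing = {}, []
    for identifier, sequence in read_fasta(fasta_handle):
        if identifier in metadata:
            linked[identifier] = (sequence, metadata[identifier])
        else:
            missing.append(identifier)
    return linked, missing


# Hypothetical sample data: two records share a sequence but not an identifier.
fasta_text = ">occ-0001\nACGT\nACGT\n>occ-0002\nACGTACGT\n>occ-0003\nTTGACA\n"
csv_text = (
    "occurrenceID,scientificName\n"
    "occ-0001,Gadus morhua\n"
    "occ-0002,Salmo salar\n"
)

linked, missing = link_records(io.StringIO(fasta_text), io.StringIO(csv_text))
for identifier, (sequence, row) in linked.items():
    print(identifier, row["scientificName"], sequence)
print("No metadata found for:", missing)
```

In the sample, occ-0001 and occ-0002 carry identical sequences, so only the identifier distinguishes them. With real downloads, replace the io.StringIO objects with open file handles for the two files.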
The FASTA formatted Genetic Data file is ready to be re-formatted to suit the needs of taxonomic assignment. Follow instructions here to format FASTA files into a BLAST database for use with BLAST+ software.
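The linked instructions are not reproduced in this section. As a rough sketch only, a nucleotide BLAST database is typically built with the makeblastdb program that ships with BLAST+. The Python snippet below assembles that command and runs it only if makeblastdb is installed and the input file exists. The file name camdl_sequences.fasta and the database name camdl_reference are placeholders, not names used by CaMDL.

```python
import shutil
import subprocess
from pathlib import Path


def makeblastdb_command(fasta_path, db_name):
    """Build the argument list for creating a nucleotide BLAST database."""
    return [
        "makeblastdb",
        "-in", str(fasta_path),   # input FASTA file
        "-dbtype", "nucl",        # nucleotide (not protein) sequences
        "-out", str(db_name),     # base name for the database files
    ]


# Placeholder names; substitute the downloaded file and a name of your choice.
fasta_path = Path("camdl_sequences.fasta")
command = makeblastdb_command(fasta_path, "camdl_reference")
print(" ".join(command))

# Run only when BLAST+ is installed and the input file is present.
if shutil.which("makeblastdb") is not None and fasta_path.is_file():
    subprocess.run(command, check=True)
```

Building the command as a list rather than a single string avoids shell quoting problems when file paths contain spaces.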
Metadata (csv): Each gene record that shows as a search result has associated metadata. All of the metadata terms are aligned with Darwin Core (DwC) data standard terminology used by the Ocean Biodiversity Information System (OBIS) and the “Minimum Information about any (X) Sequence” (MIxS) specification generated by the Genomic Standards Consortium (Yilmaz et al., 2011).
More information about the Darwin Core Archive biodiversity informatics data standard can be found on the GBIF website (Darwin Core Archives – How-to Guide :: GBIF IPT User Manual).
Metadata terms are listed in Appendix 1 of this document and are available as an Excel spreadsheet in the ‘Resources’ section of the website. Each term is classified as required, recommended, or optional. The spreadsheet gives the term name (the same name that appears in downloaded metadata files), its definition, and an example value.