Pull Reference Sequences

Command

ncbi:pull-reference-sequences

Source

This command uses the following file as its source for the list of ReferenceSequence instances:

ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
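
As a rough illustration of what consuming this file involves, the sketch below reads gzip-compressed, tab-delimited text and skips blank lines and lines beginning with "#". The class and method names (Gene2RefSeqReaderSketch, readRows, gzip) are hypothetical and are not part of the command. It assumes the file has already been downloaded and that the header line is prefixed with "#", which is worth confirming against the file itself.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper class; not part of the ncbi:pull-reference-sequences command.
class Gene2RefSeqReaderSketch {

    /** Reads gzip-compressed, tab-delimited text and returns one String[] per data row. */
    static List<String[]> readRows(InputStream gzipped) {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(gzipped), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Assumption: header and comment lines start with '#'.
                if (line.isEmpty() || line.startsWith("#")) {
                    continue;
                }
                // A limit of -1 keeps trailing empty columns.
                rows.add(line.split("\t", -1));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    /** Compresses text in memory so the reader can be tried without downloading anything. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }
}
```

For a downloaded copy, readRows could be given a java.io.FileInputStream opened on the local gene2refseq.gz path.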

Threading & Synchronization

This command assumes that ReferenceSequence instances do not already exist. Multithreading is used to persist the data more quickly.
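
Because no existing instances need to be looked up or merged, each record can be handed to a worker thread independently. The following is a minimal sketch of that pattern using a fixed thread pool. ParallelPersistSketch, persistAll, and the persistOne callback are hypothetical names standing in for whatever the command actually uses to save a ReferenceSequence, and the thread count and one-hour timeout are arbitrary choices.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

// Hypothetical illustration; not the command's actual persistence code.
class ParallelPersistSketch {

    /**
     * Submits one task per record to a fixed-size pool and waits for all of them.
     * Returns true if every task finished before the timeout.
     */
    static <T> boolean persistAll(List<T> records, int threads, Consumer<T> persistOne) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (T record : records) {
            // No existence check: each record is assumed to be new.
            pool.execute(() -> persistOne.accept(record));
        }
        // Stop accepting new tasks, then wait for the queued ones to finish.
        pool.shutdown();
        try {
            return pool.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            pool.shutdownNow();
            return false;
        }
    }
}
```

The callback must be safe to call from several threads at once. If the assumption that no instances exist does not hold, concurrent inserts of this kind could create duplicates.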

Filtering

The parser applies several filters, combined so that a record must pass every one of them:

G2AParser gene2AccessionParser = G2AParser.getInstance(8);
// Each filter below tests one field of a gene2refseq record.
List<G2AFilter> filters = Arrays.<G2AFilter>asList(
        new G2ATaxonIdFilter(9606),
        new G2AAssemblyFilter("Reference.*Primary Assembly"),
        new G2AProteinAccessionVersionPrefixFilter(Arrays.asList("NP_")),
        new G2AGenomicNucleotideAccessionVersionPrefixFilter(Arrays.asList("NC_")),
        new G2ARNANucleotideAccessionVersionPrefixFilter(Arrays.asList("NM_", "NR_")));
// Combine the filters so that a record must pass all of them.
G2AAndFilter andFilter = new G2AAndFilter(filters);
List<Record> recordList = gene2AccessionParser.parse(andFilter, genes2RefSeqFile);

Here, the command keeps only the records that have all of the following:

  • the human taxonomy identifier (9606)
  • an assembly matching "Reference.*Primary Assembly"
  • a protein accession with the prefix "NP_"
  • a genomic nucleotide accession with the prefix "NC_"
  • an RNA nucleotide accession with the prefix "NM_" or "NR_"
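
To make the combined behaviour concrete, the sketch below expresses the same criteria with plain java.util.function.Predicate objects joined by a logical AND. It is not the G2A filter API: AndFilterSketch, Row, allOf, and humanReferenceFilter are hypothetical names, and it assumes the assembly pattern may match anywhere in the assembly value.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Pattern;

// Hypothetical illustration of AND-combined filters; not the G2A classes.
class AndFilterSketch {

    /** Hypothetical stand-in for one parsed row; not the parser's Record class. */
    static final class Row {
        final int taxonId;
        final String rnaAccession;
        final String proteinAccession;
        final String genomicAccession;
        final String assembly;

        Row(int taxonId, String rnaAccession, String proteinAccession,
                String genomicAccession, String assembly) {
            this.taxonId = taxonId;
            this.rnaAccession = rnaAccession;
            this.proteinAccession = proteinAccession;
            this.genomicAccession = genomicAccession;
            this.assembly = assembly;
        }
    }

    /** Builds a predicate that accepts a row only if every individual check accepts it. */
    static Predicate<Row> allOf(List<Predicate<Row>> checks) {
        return row -> checks.stream().allMatch(check -> check.test(row));
    }

    /** Mirrors the five criteria listed above. */
    static Predicate<Row> humanReferenceFilter() {
        // Assumption: the pattern is searched for within the value (find, not a full match).
        Pattern primaryAssembly = Pattern.compile("Reference.*Primary Assembly");
        List<Predicate<Row>> checks = Arrays.<Predicate<Row>>asList(
                row -> row.taxonId == 9606,
                row -> primaryAssembly.matcher(row.assembly).find(),
                row -> row.proteinAccession.startsWith("NP_"),
                row -> row.genomicAccession.startsWith("NC_"),
                row -> row.rnaAccession.startsWith("NM_") || row.rnaAccession.startsWith("NR_"));
        return allOf(checks);
    }
}
```

A row that fails any single check, for example one with a different taxonomy identifier, is rejected even if it passes the other four.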