This command uses the following file as the source for a list of ReferenceSequence instances:
ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
This command assumes that the ReferenceSequence instances do not already exist. Multithreading is used to persist the data more quickly.
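The command's actual persistence code is not shown in this document. As a rough illustration of the multithreaded approach, the sketch below splits a list into batches and persists each batch on a fixed-size thread pool. The class name `ParallelPersistSketch`, the method `persistInParallel`, and the `persistBatch` callback are hypothetical names chosen for this example; they are not part of the command.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch: not the command's real persistence layer.
class ParallelPersistSketch {

    // Splits items into batches, persists each batch on a worker thread,
    // and returns the total number of items handed to persistBatch.
    public static <T> int persistInParallel(List<T> items, int threads, int batchSize,
            Consumer<List<T>> persistBatch) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> pending = new ArrayList<>();
            for (int start = 0; start < items.size(); start += batchSize) {
                int end = Math.min(start + batchSize, items.size());
                // subList is a read-only view here; items must not be modified while workers run.
                final List<T> batch = items.subList(start, end);
                pending.add(pool.submit(() -> {
                    persistBatch.accept(batch);
                    return batch.size();
                }));
            }
            int persisted = 0;
            for (Future<Integer> result : pending) {
                // get() blocks until the batch is done and rethrows a worker failure.
                persisted += result.get();
            }
            return persisted;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> accessions = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            accessions.add("record-" + i); // placeholder values, not real accessions
        }
        // A thread-safe queue stands in for the database in this sketch.
        Queue<String> store = new ConcurrentLinkedQueue<>();
        int count = persistInParallel(accessions, 4, 3, store::addAll);
        System.out.println("persisted " + count + " records");
    }
}
```

Because the instances are assumed not to exist yet, each batch can be written independently, with no lookup or merge step to coordinate across threads. The thread count and batch size used here are arbitrary.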
The command applies several filters, combined with a logical AND:
G2AParser gene2AccessionParser = G2AParser.getInstance(8);
// A record is kept only if it passes every filter in this list.
List<G2AFilter> filters = Arrays.<G2AFilter>asList(
    new G2ATaxonIdFilter(9606), // NCBI taxon ID for Homo sapiens
    new G2AAssemblyFilter("Reference.*Primary Assembly"),
    new G2AProteinAccessionVersionPrefixFilter(Arrays.asList("NP_")), // RefSeq proteins
    new G2AGenomicNucleotideAccessionVersionPrefixFilter(Arrays.asList("NC_")), // complete genomic molecules
    new G2ARNANucleotideAccessionVersionPrefixFilter(Arrays.asList("NM_", "NR_"))); // mRNA and non-coding RNA
G2AAndFilter andFilter = new G2AAndFilter(filters);
List<Record> recordList = gene2AccessionParser.parse(andFilter, genes2RefSeqFile);
Here, the command is filtering for ReferenceSequence instances that have: