This command uses the following file as the source for a list of Alignment instances:
ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/alignments/GCF_000001405.28_knownrefseq_alignments.gff3
The following files are also downloaded for Alignment Region instances:
ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/*.rna.gbff.gz
Downloading all the "*.rna.gbff.gz" files takes about 7 GB of disk space. Processing this data takes some time because the files are compressed and the parser is memory intensive.
This command does not try to synchronize Alignment instances with previously created instances. Multithreading is heavily used.
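Because the GBFF files are gzip-compressed, they can be read as a stream without first decompressing them to disk. The sketch below is only an illustration of that idea and is not the parser this command uses: the class name GbffGzipSketch and the method countRecords are hypothetical. It relies on the fact that every record in a GenBank flat file begins with a LOCUS line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of the project.
public class GbffGzipSketch {

    // Streams a gzip-compressed GBFF file line by line and counts its records.
    // Only one line is held in memory at a time.
    static long countRecords(Path gzFile) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(Files.newInputStream(gzFile)),
                StandardCharsets.US_ASCII))) {
            long count = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                // Each GenBank flat file record starts with a LOCUS line.
                if (line.startsWith("LOCUS")) {
                    count++;
                }
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java GbffGzipSketch <path to a *.rna.gbff.gz file>
        System.out.println(countRecords(Paths.get(args[0])));
    }
}
```

A count like this can serve as a quick sanity check on a downloaded file before running the full, memory-intensive parse.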
We currently filter GBFF records on sequence accession prefix, source organism name, and feature type:
    List<GBFFFilter> filters = Arrays.asList(new GBFFFilter[] {
            new GBFFSequenceAccessionPrefixFilter(Arrays.asList(new String[] { "NM_", "NR_" })),
            new GBFFSourceOrganismNameFilter("Homo sapiens"),
            new GBFFFeatureSourceOrganismNameFilter("Homo sapiens"),
            new GBFFFeatureTypeNameFilter("CDS"),
            new GBFFFeatureTypeNameFilter("source") });
    GBFFAndFilter gbffFilter = new GBFFAndFilter(filters);
    List<Sequence> sequenceList = gbffMgr.deserialize(gbffFilter, f);
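To make the idea of an AND-composed filter concrete, here is a minimal, self-contained sketch using java.util.function.Predicate. It does not show how GBFFAndFilter is actually implemented. The GbffRecord class, the helper names, and the accession values are all made up for illustration, and only the accession-prefix and organism checks are modeled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch; none of these types belong to the GBFF parser library.
public class AndFilterSketch {

    // Simplified stand-in for a parsed GBFF record.
    static final class GbffRecord {
        final String accession;
        final String organism;

        GbffRecord(String accession, String organism) {
            this.accession = accession;
            this.organism = organism;
        }
    }

    // Accepts records whose accession starts with any of the given prefixes.
    static Predicate<GbffRecord> accessionPrefix(List<String> prefixes) {
        return r -> prefixes.stream().anyMatch(p -> r.accession.startsWith(p));
    }

    // Accepts records whose source organism matches exactly.
    static Predicate<GbffRecord> organism(String name) {
        return r -> name.equals(r.organism);
    }

    // Accepts a record only if every component filter accepts it.
    static Predicate<GbffRecord> and(List<Predicate<GbffRecord>> filters) {
        return r -> filters.stream().allMatch(f -> f.test(r));
    }

    public static void main(String[] args) {
        Predicate<GbffRecord> filter = and(Arrays.asList(
                accessionPrefix(Arrays.asList("NM_", "NR_")),
                organism("Homo sapiens")));

        // Placeholder accessions, not real RefSeq entries.
        System.out.println(filter.test(new GbffRecord("NM_000000.0", "Homo sapiens"))); // true
        System.out.println(filter.test(new GbffRecord("XM_000000.0", "Homo sapiens"))); // false
        System.out.println(filter.test(new GbffRecord("NR_000000.0", "Mus musculus"))); // false
    }
}
```

In this sketch a record is rejected as soon as any one component filter rejects it, so a record passes only when all criteria hold together.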
We include Alignment instances that have: