Pull Alignments

Command

ncbi:pull-alignments

Source

This command uses the following file as the source for a list of Alignment instances:

ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/alignments/GCF_000001405.28_knownrefseq_alignments.gff3

The following files are also downloaded for Alignment Region instances:

ftp://ftp.ncbi.nlm.nih.gov/refseq/H_sapiens/mRNA_Prot/*.rna.gbff.gz

Downloading all of the "*.rna.gbff.gz" files takes about 7 GB of disk space. Processing this data takes some time because the files are compressed and the parser is memory intensive.
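Because the files are compressed and large, it helps to read them as a stream rather than decompressing them to disk or loading them whole. The following is a minimal sketch of that idea using only the Java standard library; it is not the parser this command uses. The class name GbffRecordCounter and its method are hypothetical. It relies on the GenBank flat-file convention that each record ends with a line containing only "//".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.GZIPInputStream;

// Hypothetical helper, not part of this project's code base.
public class GbffRecordCounter {

    // Counts GenBank flat-file records by streaming the gzip-compressed file
    // line by line, so the whole file is never held in memory at once.
    public static long countRecords(Path gzippedGbff) throws IOException {
        try (InputStream in = Files.newInputStream(gzippedGbff);
             GZIPInputStream gzip = new GZIPInputStream(in);
             BufferedReader reader = new BufferedReader(
                     new InputStreamReader(gzip, StandardCharsets.US_ASCII))) {
            long records = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                // A line consisting only of "//" terminates one record.
                if (line.equals("//")) {
                    records++;
                }
            }
            return records;
        }
    }
}
```

A call such as GbffRecordCounter.countRecords(path) on one downloaded file gives a quick sanity check of how many sequence records it holds before the full parse is run.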

Threading & Synchronization

This command does not try to synchronize Alignment instances with previously created instances. It makes heavy use of multithreading.
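This document does not describe how the command distributes its work across threads. As an illustration only, one common pattern for this kind of job is to hand each downloaded file to a fixed-size thread pool, which also caps how many memory-hungry parses run at once. The class ParallelFileProcessor and its method are hypothetical, and the parser is passed in as a plain function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper, not part of this project's code base.
public class ParallelFileProcessor {

    // Applies `parser` to every input on a fixed-size thread pool and
    // returns the results in the same order as the inputs.
    public static <I, O> List<O> processAll(List<I> inputs, Function<I, O> parser, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<O>> tasks = new ArrayList<>();
            for (I input : inputs) {
                tasks.add(() -> parser.apply(input));
            }
            List<O> results = new ArrayList<>();
            // invokeAll blocks until every task has finished and returns
            // the futures in the order the tasks were submitted.
            for (Future<O> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

Choosing a small thread count trades throughput for a lower peak memory footprint, which matters when each parse is expensive.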

Filtering

We currently filter on the following fields:

List<GBFFFilter> filters = Arrays.asList(new GBFFFilter[] {
        new GBFFSequenceAccessionPrefixFilter(Arrays.asList(new String[] { "NM_", "NR_" })),
        new GBFFSourceOrganismNameFilter("Homo sapiens"),
        new GBFFFeatureSourceOrganismNameFilter("Homo sapiens"),
        new GBFFFeatureTypeNameFilter("CDS"),
        new GBFFFeatureTypeNameFilter("source") });
GBFFAndFilter gbffFilter = new GBFFAndFilter(filters);
List<Sequence> sequenceList = gbffMgr.deserialize(gbffFilter, f);

We include Alignment instances that have:

  • a sequence accession prefix of NM_ or NR_
  • a source->organism name of 'Homo sapiens'
  • a feature->source->organism name of 'Homo sapiens'
  • a feature type name of 'CDS'
  • a feature type name of 'source'
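The general idea of combining several filters into one "and" filter can be sketched with java.util.function.Predicate. This is a simplified illustration, not the GBFF filter API itself: the Record class and all method names below are hypothetical. Only the accession prefix and organism checks are modeled, because this document does not say how the feature type filters are applied to a record.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration, not part of this project's code base.
public class AndFilterSketch {

    // Simplified stand-in for a parsed sequence record.
    public static final class Record {
        final String accession;
        final String organism;

        public Record(String accession, String organism) {
            this.accession = accession;
            this.organism = organism;
        }
    }

    // Combines filters so that an item passes only if every filter accepts it.
    public static <T> Predicate<T> and(List<Predicate<T>> filters) {
        return item -> filters.stream().allMatch(f -> f.test(item));
    }

    // Accepts records whose accession starts with any of the given prefixes.
    public static Predicate<Record> accessionPrefix(String... prefixes) {
        return r -> Arrays.stream(prefixes).anyMatch(p -> r.accession.startsWith(p));
    }

    // Accepts records whose organism name matches exactly.
    public static Predicate<Record> organism(String name) {
        return r -> name.equals(r.organism);
    }

    // Mirrors the accession prefix and organism criteria listed above.
    public static Predicate<Record> refSeqHumanFilter() {
        return and(Arrays.asList(
                accessionPrefix("NM_", "NR_"),
                organism("Homo sapiens")));
    }
}
```

Under these assumptions, a record with accession "NM_000014.6" and organism "Homo sapiens" passes the combined filter, while a record with an "XM_" accession or a different organism is rejected.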