Starting with a new release of EMBL (in this case 62), we download the flat files from the EBI's FTP server, using a simple mirror script.
From the flat files we generate a fasta file of all the EMBL sequences minus the ESTs. We also produce a fasta file of the ESTs, just in case.
% mkdir /database/embl/r62/est
% mv /database/embl/r62/est*.dat.gz /database/embl/r62/est/
% zcat /database/embl/r62/est/est*.dat.gz | sp2fasta -g - > embl_r62_est.fasta
% zcat /database/embl/r62/*.dat.gz | sp2fasta -g - > embl_r62.fasta
Unfortunately sp2fasta does not produce the database prefix correctly, so we have to fix it afterwards with fixup.pl. This problem has been fixed in more recent versions of sp2fasta.
% fixup.pl embl_r62_est.fasta
% fixup.pl embl_r62.fasta
From the fasta file of the EMBL sequences, generate a BLAST database. This BLAST database will be used for all the BrassicaDB BLASTN analysis until the next release.
% cd /database/blast/release
% /usr/local/blast2/xdformat -n -o embl_base -t EMBL_r62 embl_r62.fasta
Note that we have to use xdformat here instead of pressdb. This is because pressdb can only handle up to 4 gigabases of sequence and EMBL r62 is approx. 4.5 gigabases of sequence (without the ESTs).
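The size check behind this choice can be sketched as a small shell function. This is only an illustration, not part of the original procedure: the names count_bases and choose_formatter are hypothetical, and treating the pressdb limit as exactly 4,000,000,000 bases is an assumption.

```shell
#!/bin/sh
# Sum the residues in a fasta stream: skip header lines (those starting
# with '>') and add up the lengths of all remaining lines.
count_bases() {
    awk '!/^>/ { total += length($0) } END { printf "%.0f\n", total }'
}

# Assumption: the "4 gigabases" pressdb limit is taken as exactly 4e9 bases.
LIMIT=4000000000

# Hypothetical helper: print which formatter the size of a fasta file suggests.
choose_formatter() {
    bases=$(count_bases < "$1")
    if [ "$bases" -gt "$LIMIT" ]; then
        echo xdformat
    else
        echo pressdb
    fi
}
```

Running "choose_formatter embl_r62.fasta" before formatting would show whether a future release still fits under the limit.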
While we have the flat files it is a good time to verify that we have all the Brassica sequences in BrassicaDB. So we run the brass_embl.pl script over the flat files to extract the Brassica EMBL records.
% zcat /database/embl/r62/est/est*.dat.gz | brass_embl.pl
% mv brass_embl.dat Brassica_ESTs_EMBL_r62.dat
% zcat /database/embl/r62/*.dat.gz | brass_embl.pl
% mv brass_embl.dat Brassica_EMBL_r62.dat
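For readers without access to brass_embl.pl, the kind of filtering it performs can be approximated in awk. This is a rough sketch and not the actual script: it assumes that a Brassica record is one whose OS (organism) line begins with "Brassica", it relies on EMBL records ending with a "//" line, and it writes to standard output rather than to brass_embl.dat. The function name extract_brassica is hypothetical.

```shell
#!/bin/sh
# Read EMBL flat-file records on stdin and print only those whose
# organism (OS) line starts with "Brassica".
extract_brassica() {
    awk '
        { record = record $0 "\n" }        # buffer the current record
        /^OS +Brassica/ { keep = 1 }       # organism line names a Brassica species
        /^\/\/$/ {                         # "//" terminates an EMBL record
            if (keep) printf "%s", record
            record = ""; keep = 0
        }
    '
}
```

It would be used in the same position in the pipeline, for example "zcat /database/embl/r62/*.dat.gz | extract_brassica > Brassica_EMBL_r62.dat".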
A quick check to see how many sequences we have found:
% grep -c "ID " Brassica_ESTs_EMBL_r62.dat
3608
% grep -c "ID " Brassica_EMBL_r62.dat
1061
These records are then parsed to produce an ace file (GenEmbl2ace), and a list of the accessions is generated by loading the ace file into an empty database and exporting the Sequence keyset. This accession list is compared with the list of accessions in the database (obtained in a similar way), and the differences noted for action.
For example:
% GenEmbl2ace.pl Brassica_ESTs_EMBL_r62.dat Brassica_ESTs_EMBL_r62.seq.ace Brassica_ESTs_EMBL_r62.papers.ace
% GenEmbl2ace.pl Brassica_EMBL_r62.dat Brassica_EMBL_r62.seq.ace Brassica_EMBL_r62.papers.ace
% gzip *.dat
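The accession comparison described above can be sketched with sort and comm. This is an illustration under stated assumptions rather than the procedure actually used: it assumes both exported keysets have already been reduced to plain-text files with one accession per line, and the function name missing_from_db is hypothetical.

```shell
#!/bin/sh
# Print accessions that appear in the first list but not in the second.
# Both arguments are files with one accession per line (assumed format).
missing_from_db() {
    embl_list=$1
    db_list=$2
    sorted_embl=$(mktemp)
    sorted_db=$(mktemp)
    # comm needs sorted input; use the same collation for sort and comm.
    LC_ALL=C sort -u "$embl_list" > "$sorted_embl"
    LC_ALL=C sort -u "$db_list" > "$sorted_db"
    # -23 suppresses lines unique to the second file and lines common to
    # both, leaving only lines unique to the first file.
    LC_ALL=C comm -23 "$sorted_embl" "$sorted_db"
    rm -f "$sorted_embl" "$sorted_db"
}
```

Swapping the two arguments lists accessions that are in the database but no longer in the EMBL release, which is the other half of the differences to note for action.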
At this point we have a pair of ace files which contain all the Brassica sequence from EMBL, and a BLAST database of nucleotide sequence against which we can perform further analysis.
Now we do the same thing for the peptide sequence databases, SwissPROT and TrEMBL (see BrassicaDB Peptide Process).
Then the various BLAST analyses required are performed.
Then the updated version of the database is prepared and released to the public.
Some issues arose during the processing of EMBL release 62 that caused unexpected changes in the contents of the database (see details).
We also produce an updated copy of the BLAST database, which includes all the sequence updates, for internal use. This is done by using SynCron to generate a cumulative file of the updates, generating a fasta file from the cumulative file, and appending these sequences to the BLAST database (see the scripts used in this process). During this process we also grab the new Brassica sequences and load them into the database(s).
Last modified: Wed Jun 19 15:18:30 MET DST