Release 3.2b Update - January 2005
The Release 3.2b Heterochromatin annotation update addresses known data bugs that were present in the August 2004 Release 3.2 annotation release. This update addresses the following issues:
- Many FlyBase identifiers (FBgn's) were missing or incorrectly assigned. These have been synchronized with FlyBase and are now correct.
- 63 'Repeat_Region' annotations had either 'CG' or 'CR' IDs in Release 3.2, because these models were once considered valid gene models under the annotation criteria. These annotations now have 'TE' IDs, and all prior identifiers have been added to the synonym table to improve tracking.
- In Release 3, the 'Linked_1' through 'Linked_7' sequence scaffolds were constructed by linking small WGS scaffolds together using cDNA data. However, the composite linked sequences formed by these scaffolds were not available through GenBank. These 7 scaffolds now have their own GenBank accession IDs (AABU01002777-AABU01002783) and can be accessed through NCBI.
- A number of 'miscellaneous RNA' and 'curator comment' annotations have been reclassified as 'ncRNAs' and will be further evaluated in future updates.
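For readers who want to retrieve the seven linked scaffolds in bulk, the accession range quoted above can be expanded into individual IDs. The following is a minimal Python sketch; the function name expand_accession_range is hypothetical and is not part of any DHGP or NCBI tool. It assumes only that the range is inclusive and that the accessions differ in their numeric suffix. This document does not state which accession corresponds to which 'Linked_N' scaffold, so the sketch does not pair them.

```python
import re


def expand_accession_range(first, last):
    """Expand an inclusive accession range such as AABU01002777-AABU01002783.

    Hypothetical helper: assumes both accessions share the same letter
    prefix and have numeric suffixes of equal width.
    """
    pattern = r"([A-Z]+)(\d+)"
    m_first = re.fullmatch(pattern, first)
    m_last = re.fullmatch(pattern, last)
    if not m_first or not m_last:
        raise ValueError("accessions must be letters followed by digits")
    if m_first.group(1) != m_last.group(1):
        raise ValueError("accessions must share the same prefix")
    if len(m_first.group(2)) != len(m_last.group(2)):
        raise ValueError("numeric suffixes must have the same width")

    prefix = m_first.group(1)
    width = len(m_first.group(2))
    start, end = int(m_first.group(2)), int(m_last.group(2))
    if end < start:
        raise ValueError("range end precedes range start")

    # Zero-pad so that leading zeros in the suffix are preserved.
    return [f"{prefix}{n:0{width}d}" for n in range(start, end + 1)]


# The range given in the Release 3.2b notes for the linked scaffolds.
linked_accessions = expand_accession_range("AABU01002777", "AABU01002783")
```

The resulting list can then be passed to whatever retrieval method is preferred, for example pasted into an NCBI search.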
Release 3.2 General Information - August 2004
The Release 3.2 Heterochromatin annotation represents the latest effort to describe the protein-coding genes, non-coding genes, and other features located in the heterochromatin sequence. In this update, the underlying sequence consists of the 20.7 Mb of Release 3 whole-genome-shotgun (WGS) scaffolds from Celera that could not be assembled into the euchromatic arms, plus a few BDGP-sequenced scaffolds.
The sequence finishing and annotation of the heterochromatic region of the genome is being performed by the Drosophila Heterochromatin Genome Project (DHGP; see Hoskins et al. 2002). As sequence gaps are filled, and the heterochromatic scaffolds are finished to high quality and re-annotated, they will be contributed to GenBank and FlyBase and integrated into future releases of the Drosophila genomic sequence. Release 3.2 annotation of the heterochromatic regions was released to GenBank in June 2004 and should be available from FlyBase and GenBank by August 2004.
The confidence we have in the annotated gene models varies considerably; improvements to the gene models will be ongoing, and will require the continued input of the community. If you notice a mistake in annotation, please submit an error report form (also accessed from the gene annotation reports) or write to 'help AT dhgp.org'. Updates may also be submitted as sequence records or as Apollo-generated XML files.
Heterochromatin-specific Details
The WGS3 heterochromatin consists of ~2600 scaffolds that still contain gaps and collapsed repeats, but are otherwise considered relatively high-quality sequence. Some of these have been mapped to particular chromosome arms (i.e. 2h, 3h, 4h, Xh, or Yh), while the remainder have been placed on chromosome U. It is important to note that scaffolds mapped to a particular chromosome arm are provisionally ordered, but not oriented: they are ordered by their experimentally determined cytological locations, but their orientation and exact order remain unclear. Chromosome U consists of unordered, unoriented scaffolds. While the underlying sequence of the scaffolds annotated in Release 3.2 has not changed, the mapping and ordering of these scaffolds on chromosome arms (e.g. 2h, 3h...) may differ from previous releases.
The transition between the euchromatic and heterochromatic regions of the genome is thought to be a gradual one, and there are no objective rules to categorize the sequence in this transitional area as definitively euchromatic or heterochromatic. Currently the boundaries between the euchromatic and heterochromatic portions of the genome are based on cytological data, as described in Hoskins et al. 2002.
Annotation guidelines consistent with FlyBase and the overall Drosophila genome annotation were adhered to whenever possible. However, since these annotations are based on high-quality draft sequence, certain gene models may contain missing or premature stop codons, missing start codons, or gaps within their ORFs. Open reading frames corresponding to fragments of transposable elements are common in heterochromatin; every attempt was made to identify these and exclude them from the gene annotations.
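Users working with these gene models may wish to screen coding sequences for the irregularities listed above before using them downstream. The following is a minimal Python sketch of such a screen, not the DHGP's own procedure. The function name check_orf and the problem labels are hypothetical. It assumes the coding sequence is supplied as a DNA string in frame from its first base, that gaps are represented by 'N' characters, and that the standard stop codons (TAA, TAG, TGA) apply; any trailing incomplete codon is ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code


def check_orf(cds):
    """Return a list of problem labels for one coding sequence.

    Hypothetical helper: labels are illustrative only. The sequence is
    read in frame from its first base; a trailing partial codon is ignored.
    """
    seq = cds.upper()
    problems = []

    # Assumption: sequence gaps appear as runs of 'N'.
    if "N" in seq:
        problems.append("gap")

    # Split into complete codons only.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]

    if not codons or codons[0] != "ATG":
        problems.append("missing_start")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing_stop")
    # Any stop codon before the final codon is premature.
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("premature_stop")

    return problems


clean = check_orf("ATGAAATAA")
premature = check_orf("ATGTAAAAATAA")
gapped = check_orf("AAANNNTAA")
open_ended = check_orf("ATGAAA")
```

An empty list means none of these irregularities were found; it does not by itself indicate that a gene model is correct.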
As the DHGP adds new data and improves the quality of the underlying sequence and assembly in future releases, the quality of the annotations will also improve. The DHGP welcomes any feedback and data from the community that will assist in this effort.