The HSP70 Sequence Database  


Human HSP70 Gene HSPA5

Burcin Uygungil and Laurence A. Moran

The endoplasmic reticulum version of HSP70 is involved in protein folding and assembly in the ER. It also plays a role in the transport of newly synthesized proteins into the ER as well as possibly assisting in the targeting of proteins to the ubiquitin degradation pathway. The protein is known as GRP78 (glucose regulated protein), BiP (binding protein), MIF2, and ER Lumenal Ca2+ binding protein. Since this protein functions by binding transiently to other proteins and is involved in different processes within the ER, the name BiP (Binding Protein) seems to be the most suitable.

BiP is found only in eukaryotes. All of the known homologues contain an N-terminal signal sequence that directs the protein to the lumen of the ER. The signal sequence consists of a basic lysine residue followed by a 14-residue hydrophobic region and a polar region for recognition by the Signal Recognition Particle (SRP). BiP proteins also have a (H/K)DEL ER retention signal at the C-terminus.
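The C-terminal retention signal is easy to test for in a script. The following is a minimal sketch, assuming the signal is exactly the last four residues in one-letter code; the function name and the example fragments are invented for illustration and are not taken from any database record.

```python
import re

# Assumption: the retention signal is exactly the last four residues,
# written in one-letter code, and matches the (H/K)DEL pattern.
ER_RETENTION = re.compile(r"[HK]DEL$")


def has_er_retention_signal(protein: str) -> bool:
    """Return True if the sequence ends in HDEL or KDEL."""
    return bool(ER_RETENTION.search(protein.strip().upper()))


# Toy C-terminal fragments, invented for illustration only.
examples = {
    "ends_in_KDEL": "GSAEKDEL",
    "ends_in_HDEL": "GSAEHDEL",
    "poly_histidine_tail": "GSAEHHHHHH",
}
for name, fragment in examples.items():
    print(name, has_er_retention_signal(fragment))
```

A check like this would flag a record whose translation ends in a run of histidines rather than KDEL.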

BiP genes diverged from all other eukaryotic HSP70's at a very early stage in the evolution of eukaryotic cells. Thus, the internal amino acid sequences of these proteins from different species share certain characteristics that distinguish them from all other members of the HSP70 gene family. These signature sequences permit easy identification and alignment of the sequence records for this gene product. [Alignments]

HSPA5 Database Sequences

The human BiP gene is named HSPA5 [HUGO HSPA5]. The Entrez Gene reference [GeneID 3309] currently (Aug. 2005) points to seven GenBank entries (see table below). The Ensembl Gene Report record [ENSG00000044574] lists the same six sequence records. The HSPA5 gene is located near the tip of the long arm of chromosome 9 with the 5'-end of the gene nearest the telomere [MapView 9q33 - 9q34.1]. The image on the left is taken from the [Ensembl Chromosome 9 website].

Errors, Conflicts, and Polymorphisms

The amino acid sequences of the GenBank records are very similar with only four conflicting amino acid sites. M19645 contains the full-length sequence of the gene cloned from human fetal liver DNA (Ting and Lee 1994). X87949 is a full-length cDNA sequence derived from a cervical carcinoma cDNA library (Chao 1995 unpublished). These two sequences were independently submitted by different labs. They differ from all other human BiP sequences at four sites: a single amino acid gap at position 269 (Δ269(R)); a histidine residue at position 392 (H392(R)); a serine residue at position 413 (S413(R)); and an asparagine residue at position 421 (N421(K)). The predicted deletion removes a highly conserved arginine residue that is present in all other eukaryotic versions of HSP70 with the exception of the prokaryotic derived mitochondrial version. We conclude that Δ269(R) is a sequencing error. The remaining three differences in M19645 and X87949 also occur in highly conserved regions. [see Alignments]
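The conflict labels used here (observed residue, position, consensus residue in parentheses, with Δ for a deletion) can be generated mechanically from a pairwise alignment. The sketch below is one hedged way to do it; the function name, the `first_position` parameter, and the five-residue toy alignment are assumptions made for illustration and are not real BiP sequence.

```python
def conflict_sites(variant: str, consensus: str, first_position: int = 1) -> list[str]:
    """List differences between two aligned sequences as labels like H392(R).

    Both strings must already be aligned and of equal length. A '-' in the
    variant marks a deleted residue. Positions are numbered along the
    consensus, starting at first_position.
    """
    if len(variant) != len(consensus):
        raise ValueError("aligned sequences must have the same length")
    if "-" in consensus:
        # Assumption of this sketch: insertions in the variant are not handled.
        raise ValueError("this sketch assumes no gaps in the consensus")
    labels = []
    for offset, (v, c) in enumerate(zip(variant, consensus)):
        if v == c:
            continue
        position = first_position + offset
        observed = "\u0394" if v == "-" else v  # Greek capital delta for a gap
        labels.append(f"{observed}{position}({c})")
    return labels


# Toy alignment, invented for illustration: a gap and a substitution.
print(conflict_sites("AK-HL", "AKRRL", first_position=267))
```

Numbering along the consensus keeps the positions comparable across records even when one of them carries a deletion.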

Considering that all four of these variant positions are located in otherwise highly conserved regions in members of the HSP70 gene family, it is surprising that both of these independent entries (M19645 and X87949) have the exact same variant sequence. The Asn substitution for Lys at position 421 seems peculiar but we note that a large number of the other hsp70 family members have a Gln residue at position 421 and Gln is similar to Asn. There is a slight possibility that the old M19645 and X87949 records might indicate two different alleles in the human population. However, it is much more likely that these sequences are incorrect.

There are four cDNA sequences: AF188611 (Fife 1999, unpublished), AF216292 (Bermudez-Fajardo 2000, unpublished), AJ271729 (Hansen 2000, direct submission), and BC020235 (NIH MGC Project 2001). All four contain the expected arginine, aspartate, arginine, and lysine residues at the respective positions and they do not show the deletion at position 269. These sequences agree with the coding region of the human genome sequence as represented in the latest build of chromosome 9 [NC_000009].

AF188611 is a partial cDNA sequence that begins at position -10 so it is missing most of the N-terminal signal sequence (Fife 1999 unpublished). A significant and obvious error occurs at the C-terminus where instead of a KDEL sequence, a series of histidine residues is reported. Oddly, the Evidence Viewer does not include these histidine residues and truncates the sequence. It is curious that this record was submitted since it is clearly an error. This brings up the question of how much research an individual lab should do before submitting a sequence. This particular record was submitted well after the first record for this gene, yet there is no indication in the GenBank record that the C-terminus is incorrect.

The record AF216292 (Bermudez-Fajardo 2000 unpublished) is also a complete coding sequence of 1965 bp. In this case the hypothetical translation is the same as that of the 2007 bp sequence AJ271729 (Hansen 2000 direct submission). As seen in the Evidence Viewer, these two nucleotide sequences appear truncated at the 5′ end but the amino acid sequences are correct. The provisional NCBI RefSeq standard is derived from a full-length cDNA (BC020235) that has been submitted by the NIH cDNA sequencing project. The translation does not contain any discrepancies and seems to be a good choice as it agrees with the majority of the other records.

From the contig NT_029366, three different model mRNAs were predicted as supported by alignment with ESTs (XM_044200, XM_044201, and XM_044202), which correspond to the translations found in AF216292 and AJ271729 (NM_005347). Recently, all three of these records were removed and no new predicted mRNA has been suggested in their place. Instead, NCBI has placed the record NM_005347 under the Genome Annotation category. This is inconsistent with their method of organizing records and does not seem to follow any sort of reasoning. In our experience, computationally predicted mRNAs usually fall under the Genome Annotation category, whereas records of the form NM_xxxxxx are usually derived from an independent submission (in this case AJ271729), which essentially provides two differently derived standard records. However, a BLAST2 sequence alignment between NM_005347 and NT_029366 revealed that all 2007 bp of the former record aligned with the corresponding portion of the contig. This indicates that NCBI likely chose NM_005347 to represent the Genome Annotation reference because it was already the correct sequence, thus minimizing redundancy.

| Accession | Source                         | Errors  | Comment                                |
|           | genome                         |         | N-terminal fragment (1993)             |
|           | genome                         | Δ269(R) | GRP78 (1994)                           |
|           | cDNA                           | Δ269(R) | BiP (1995)                             |
|           | cDNA                           | 674...  | begins at aa-10                        |
|           | cDNA                           |         | Grp78; ER luminal Ca2+ binding protein |
|           | cDNA                           |         | old v.1 RefSeq                         |
|           | cDNA                           |         | NIH cDNA Seq. Proj.; new v.2 RefSeq    |
| NC_000009 | human Chromosome 9             |         |                                        |
| NT_29366  | human Chromosome 9             |         | revision history: v.2 June 10, 2002    |
|           | "model sequence", Chromosome 9 |         | source unclear                         |
|           | PIR                            | Δ269(R) |                                        |

Under Locus Link, the record XM_067790 is indicated as being similar to X87949 (*Locus Link LOC132335). However, this HSPA5-like gene is found in the contig NT_022757 and maps to the long arm of chromosome 4 (4q32.3) with the 5' end closer to the centromere (*MapView LOC132335). The N-terminal region contains many non-conservative substitutions. In addition, residues 106-144, 191-294, and 416-469 are missing, and the C-terminal ER localization signal is absent. Because the sequence carries so many defects yet generally resembles HSPA5, it is likely a pseudogene derived from the HSPA5 RNA. Gene finding for the Human Genome Project is led by automated searches for ORFs, so a computer cannot determine whether a sequence is a pseudogene or a real gene when there are no stop codons. In this case, if any protein product were derived from this gene it would probably be non-functional, since many of the conserved regions are altered by a series of point mutations.
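To put the three missing regions in perspective, the total number of absent residues can be tallied from the ranges quoted above. This is a small worked calculation, assuming the ranges are inclusive at both ends (the text does not say); the helper name is hypothetical.

```python
# Ranges of missing residues as quoted in the text for the chromosome 4 copy.
# Assumption: both endpoints are inclusive.
MISSING_RANGES = [(106, 144), (191, 294), (416, 469)]


def residues_in_ranges(ranges):
    """Count residues covered by a list of inclusive (start, end) ranges."""
    return sum(end - start + 1 for start, end in ranges)


print(residues_in_ranges(MISSING_RANGES))  # 39 + 104 + 54 = 197
```

Under that assumption, close to two hundred residues are absent from the predicted product, which is consistent with the pseudogene interpretation.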

The Swiss-Prot/UniProt record [P11021] is accurate. The primary amino acid sequence corresponds to the more recent sequence records and the M19645 + X87949 variants are noted as conflicts.

Introns and Exons

There are eight coding region exons spanning 4576 bp. The latest Entrez Gene entry shows a single predicted transcript. This prediction does not follow directly from the UniGene Cluster (Hs. 310769) since there are a number of EST's that do not correspond to the predicted transcript. A comparison of the exons and introns predicted by UniGene with those in the final record can be seen in this version of Map Viewer [UniGene vs. RNA Transcript]. The decision to ignore the artifacts in the EST database and go with the actual transcript was wise, although it took a few years. It is one of the advantages of manual curation of the human genome.

There are several ESTs that are clearly derived from unprocessed mRNA precursors but the human genome annotators have correctly chosen to ignore them (see evidence viewer in LocusLink).

[Figure: Introns and Exons in the HSPA5 Gene]

Alternative Splicing

The Entrez Gene entry shows EST's [Entrez]. AceView lists 1308 EST sequences for this gene [HSPA5 EST's].

The ECgene database entry for HSPA5 is H9C10987. The database includes 1 RefSeq, 6 mRNA, and 1035 EST's. According to their analysis the gene produces 18 transcript variants encoding 10 distinct proteins. (At high confidence there are "only" 10 transcripts and 6 different proteins.) The 10 proteins are shown in the Gene Structure window. It is obvious that only one of these is correct (#4). Eight of the others are ridiculous fragments or combinations of exons and introns that have no hope of forming an active protein. It's not clear why #8 is a distinct protein since it looks like it could encode a full-length version of BiP.

The asg entry for HSPA5 is NM_005347. This entry also shows several different alternatively spliced transcripts with two examples of intron retention. It's not clear how many different variants there are but it seems similar to the ECgene entry. The idea that the highly conserved BiP would tolerate an insertion of several dozen amino acids in the middle of the hydrophobic core of the protein is so silly that one wonders whether a human has ever looked at this data. The EST's are clearly artifacts.

The PALS database entry for HSPA5 is HS.522394. The figure shows 20 different putative transcripts. The most unusual feature of this database is the fact that the correct transcript isn't even shown. PALS is even more naive than ECgene and asg because it includes EST's from regions that are far outside the genes.

The SpliceInfo Database entry for HSPA5 is ENSG00000044574. This database seems to accept the curated human gene data from NCBI since it shows only a single correct transcript for the human gene. However, there are many different alternative splice forms for the mouse gene based on the available EST's. It's strange that the authors of this database would agree to ignore all of the artifacts in the human EST collection but trust the mouse EST's.


pseudogenes from UCSC
chr 11 (122433748 ...) = HSPA8
chr 4 (165467129 ...) = retro
chr 1 (38843470 ...) = retro

Laurence (Larry) A. Moran
[Dept. of Biochemistry][University of Toronto], Toronto ON, Canada M5S 1A8