
Protein database

Database including protein information
A protein database is a database that stores protein information. Many protein databases are in common use. UniProt is generally regarded as the most comprehensive, with the broadest and most thoroughly annotated content; it incorporates Swiss-Prot, TrEMBL and PIR-PSD (see the UniProt entry for details). Other protein databases include PDB (the Protein Data Bank, established in 1971). There are also domestic Chinese databases, such as SDSPB, which is built and maintained by the bioinformatics science data sharing platform under the Shanghai Bioinformatics Research Center.
Chinese name: Protein database
Foreign name: HPDB
Built: May 2005
Meaning: Displays the three-dimensional structure of biological macromolecules
Interpretation: Database including protein information
Representative: UniProt

Performance and history

The Protein Database (HPDB), built in May 2005, dynamically displays the three-dimensional structures of biological macromolecules. With the mouse, users can enlarge a molecular structure, locate atoms, and measure the distance between atoms, which makes the database suitable for teaching and scientific research. It is aimed at college and technical secondary school students, teachers, and scientific and technical workers in life science, medicine, pharmacy, agriculture, forestry and related fields who work fluently in Chinese. Molecular structure features are described in Chinese, and the original English text is provided for verification. Readers who are comfortable with English are encouraged to access RCSB PDB directly, which reduces network congestion and avoids the inconvenience caused by imperfect translation in HPDB.
HPDB has translated the molecular structure description of each protein into Chinese (except for the molecules most recently added to the database). The description covers the qualitative description of the molecular structure, the sample source, the expression vector, the host, the chemical analysis method, the molecular structure and composition, and so on. This information is stored in the database together with the protein molecular structure data, so HPDB supports queries in Chinese.
Although HPDB has translated the "molecular structure description" part, it makes no changes to the macromolecular structure coordinate data, in order to ensure the reliability and accuracy of the data. The database keeps the original experimental data files verified by RCSB PDB, and retains the PDB file format and the protein molecule ID codes [1].
The Brookhaven Protein Data Bank (PDB) is an archive of three-dimensional structure data of biological macromolecules maintained by Brookhaven National Laboratory in the United States. Its contents include atomic coordinates, references, and the primary and secondary structure information of biological macromolecules, as well as crystallographic structure factors and NMR experimental data. PDB is funded by the National Science Foundation of the United States and other organizations, and provides free services to researchers, educators and students around the world.
PDB was established in 1971, and its holdings began to grow rapidly in the 1990s. According to statistics, the numbers of biological macromolecular structures held in the database in the years 1992 to 1996 were 1007, 1727, 2921, 3821 and 4707 respectively, an average annual increase of roughly 50%. By April 8, 1998, the archive held 7429 atomic coordinate entry files, 1739 structure factor files and 429 NMR restraint files. PDB mainly collects the structural information of proteins, but also includes a small number of three-dimensional structures of nucleic acids and carbohydrates. The experimental techniques used to obtain the structures are mainly X-ray diffraction and NMR [2].

File structure

In the protein crystal structure database PDB, each macromolecular structure is recorded as a separate file, called a PDB entry file. One file describes exactly one macromolecular structure, and each structure is identified by a unique 4-character ID code. Early entry files used the file name suffix ".pdb", with one file per macromolecule. For example, the ID code of crambin, the seed protein of Abyssinian cabbage, is 1CRN, and its entry file is named 1CRN.pdb. After 1997, each biomacromolecule has a group of three related files: a full-text file, a bibliographic file and a graphic file. For example, the ID code of the MINOR COAT PROTEIN is 1G3P, and its three related files are 1G3P.full (full-text file), 1G3P.biblio (bibliographic file) and 1G3P.gif (graphic file); the ID code of the immunoglobulin (IMMUNOGLOBULIN) is 1AP2, and its three related files are 1AP2.full (equivalent to the original .pdb file), 1AP2.biblio and 1AP2.gif, and so on.
Each PDB entry file contains 12 sections: title, remarks, primary structure, heterogens, secondary structure, connectivity annotation, miscellaneous features, crystallography, coordinate transformation, atomic coordinates, connectivity, and bookkeeping. Each line in the file is called a record (or record entry). Each line has 80 columns, and the last character of each record is a line terminator. A PDB file can also be viewed as a collection of record types, which differs from the usual relational database model. In a relational database file, each record is composed of several fields with different data types and formats, and all records share the same field structure. In a PDB file there are many record types, and each type of record has its own format.
Record types fall into one of the following six categories, based on how often the record type appears in a PDB entry file:
  • Single: a record type such as HEADER, END or CRYST1 that appears only once in a file and has no continuation lines.
  • Single continued: a record type such as AUTHOR, CAVEAT or COMPND that conceptually exists once in a file, but whose content may exceed one line. The overflow is placed on subsequent lines, each of which carries a continuation field.
  • Multiple: a record type such as ATOM, CONECT or HELIX that appears many times in a file; the information in this type of record takes the form of a list.
  • Multiple continued: a record type such as FORMUL, HETATM or HETNAM that conceptually exists many times in one entry file. Content that exceeds one line is placed on subsequent lines, each of which carries a continuation field.
  • Grouping: a record type used as a grouping marker for other records, for example ENDMDL, MODEL and TER.
  • Other: the remaining record types, for example JRNL, which gives the literature reference for the coordinate set, and REMARK, which holds general notes.
Each record type is divided into fields at fixed column positions. The definition of a field comprises its data type, field name and description. Columns that are not defined should be left blank.
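As a rough illustration of the categories above, the following Python sketch tallies the record types in a list of lines. It assumes that the record name occupies the first six columns of each line. The `RECORD_CATEGORY` table only repeats the examples named in the list, and the sample lines are made up for the demonstration rather than taken from a real entry.

```python
from collections import Counter

# Category table copied from the examples in the list above.
# Record names that are not listed there are reported as "unlisted".
RECORD_CATEGORY = {
    "HEADER": "single", "END": "single", "CRYST1": "single",
    "AUTHOR": "single continued", "CAVEAT": "single continued",
    "COMPND": "single continued",
    "ATOM": "multiple", "CONECT": "multiple", "HELIX": "multiple",
    "FORMUL": "multiple continued", "HETATM": "multiple continued",
    "HETNAM": "multiple continued",
    "ENDMDL": "grouping", "MODEL": "grouping", "TER": "grouping",
    "JRNL": "other", "REMARK": "other",
}


def record_name(line):
    """Return the record name, assumed to sit in columns 1-6 of the line."""
    return line[:6].strip()


def count_records(lines):
    """Count how many times each record type occurs, skipping blank lines."""
    return Counter(record_name(line) for line in lines if line.strip())


# Hypothetical sample lines, for illustration only.
sample = [
    "HEADER    HYPOTHETICAL EXAMPLE",
    "REMARK   1 ILLUSTRATIVE ONLY",
    "REMARK   2 NOT A REAL ENTRY",
    "TER",
    "END",
]

counts = count_records(sample)
for name, n in counts.items():
    print(name, n, RECORD_CATEGORY.get(name, "unlisted"))
```

Keeping the category table separate from the counting function means the table can be extended without touching the parsing logic.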

Macromolecular structure


Primary structure

Biochemically, the primary structure is defined as the order in which the amino acid residues are arranged. The amino acid is the most basic structural unit of a protein, and there are 20 standard amino acids. An amino acid contains an amino group (NH2) and a carboxyl group (COOH). When the amino group loses one H atom and the carboxyl group loses its OH group, what remains is a residue. Two amino acids can undergo dehydration condensation to form a peptide, creating a peptide bond and a stable peptide plane. Adjacent amino acid residues are joined by peptide bonds, and residues connected one after another in this way form the primary peptide chain.
In the PDB entry file, the primary structure section describes the order of amino acids in each chain of the macromolecule. This section contains four kinds of records: DBREF, SEQADV, SEQRES and MODRES. The SEQRES records describe the ordered arrangement of the amino acid residues. For example, the 1ROG entry file (histocompatibility antigen HLA-B*2705) contains 16 SEQRES records, divided between two chains, A and B, with one record per line. The amino acid residues are listed in sequence order and continue from one line to the next, so that the residues GLY, SER, HIS, ... joined in order form the peptide chain. The following example is an extract from the 1ROG.pdb file:
SEQRES   1 A  183  GLY SER HIS SER MET ARG TYR PHE HIS THR SER VAL SER  1ROG  73
SEQRES   2 A  183  ARG PRO GLY ARG GLY GLU PRO ARG PHE ILE THR VAL GLY  1ROG  74
SEQRES   3 A  183  TYR VAL ASP ASP THR LEU PHE VAL ARG PHE ASP SER ASP  1ROG  75
… … … … …
SEQRES  14 A  183  ARG TYR LEU GLU ASN GLY LYS GLU THR LEU GLN ARG ALA  1ROG  86
SEQRES  15 A  183  NME                                                  1ROG  87
SEQRES   1 B    9  ARG ARG ILE LYS ALA ILE THR LEU LYS                  1ROG  88
As mentioned earlier, each line is a record, and the first field in each line is the record name, "SEQRES". The second field is an integer giving the serial number of the record within the current chain. The third field is the chain identifier; in this example there are two chains, A and B. If there is only one chain, the field is blank. The fourth field is an integer giving the total number of amino acid residues in the chain. The 5th to 17th fields hold the amino acid residue sequence, one residue name per field.
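The field layout just described can be turned into a small parser. The sketch below is only an illustration: the column positions (serial number ending at column 10, chain identifier in column 12, residue count in columns 14-17, residue names in columns 20-70) are assumptions based on the commonly published PDB fixed-column layout, and the function names are invented here. The two sample records are taken from the 1ROG extract, with the trailing entry code and line number left off.

```python
def parse_seqres(line):
    """Split one SEQRES record into fields using assumed fixed columns."""
    if not line.startswith("SEQRES"):
        raise ValueError("not a SEQRES record")
    return {
        "serial": int(line[6:10]),         # record serial number within the chain
        "chain": line[11].strip(),         # chain identifier, blank for a single chain
        "num_residues": int(line[13:17]),  # total number of residues in the chain
        "residues": line[19:70].split(),   # up to 13 residue names per record
    }


def chain_sequences(lines):
    """Concatenate the residue names of all SEQRES records, chain by chain."""
    chains = {}
    for line in lines:
        if line.startswith("SEQRES"):
            record = parse_seqres(line)
            chains.setdefault(record["chain"], []).extend(record["residues"])
    return chains


records = [
    "SEQRES   1 A  183  GLY SER HIS SER MET ARG TYR PHE HIS THR SER VAL SER",
    "SEQRES   1 B    9  ARG ARG ILE LYS ALA ILE THR LEU LYS",
]

chains = chain_sequences(records)
print(chains["B"])
```

Slicing by column rather than splitting on whitespace keeps the parser from mistaking a trailing entry code such as 1ROG for a residue name.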

Heterogens

The heterogen section of the PDB file describes non-standard residues. This section contains four kinds of records: HET, HETNAM, HETSYN and FORMUL. HET records describe non-standard groups that have coordinates in the entry, such as solvent molecules, cofactors and ions, and also cover heterogens whose chemical names are unknown. In the 1G3P entry file, the first HET record is:
HET TRO 21 15
Here "HET" is the record name, TRO is the HET identifier, 21 is the sequence number, and 15 is the number of HETATM records belonging to this group; that is, the record states that the non-standard group TRO accounts for 15 HETATM records in the coordinate section. The HETNAM record gives the chemical name of the compound with a given non-standard group identifier. For example, one of the HETNAM records in the 1G3P file is:
HETNAM SO4 SULFATE ION
This states that the chemical name of the compound with HET identifier SO4 is SULFATE ION. The FORMUL record gives the chemical formula of the non-standard group and the charge it carries.
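The two records quoted above can be read with a few column slices. The following sketch is illustrative only: the column positions (HET identifier in columns 8-10, chain in column 13, sequence number in columns 14-17, HETATM count in columns 21-25; HETNAM identifier in columns 12-14, name from column 16) are assumptions based on the commonly published PDB layout, and the sample lines have been re-spaced by hand to match that layout.

```python
def parse_het(line):
    """Read one HET record using assumed fixed columns."""
    if not line.startswith("HET "):
        raise ValueError("not a HET record")
    return {
        "het_id": line[7:10].strip(),       # identifier of the non-standard group
        "chain": line[12].strip(),          # chain identifier, may be blank
        "seq_num": int(line[13:17]),        # sequence number of the group
        "num_het_atoms": int(line[20:25]),  # number of HETATM records for the group
    }


def parse_hetnam(line):
    """Return (identifier, chemical name) from one HETNAM record."""
    if not line.startswith("HETNAM"):
        raise ValueError("not a HETNAM record")
    return line[11:14].strip(), line[15:70].strip()


het = parse_het("HET    TRO     21      15")
het_id, name = parse_hetnam("HETNAM     SO4 SULFATE ION")
print(het)
print(het_id, "=", name)
```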

Secondary structure

The secondary structure refers to the conformation formed by the coiling and folding of the polypeptide main-chain backbone, with the peptide plane as the unit. Secondary structure comes in three forms:
  • α-helix
  • β-sheet
  • β-turn
In the β-sheet, the peptide planes are folded into a zigzag shape, and the angle between two adjacent peptide planes is 110°. In the PDB entry file, HELIX, SHEET and TURN records describe the secondary structure of the protein. The HELIX record describes the position of an α-helix in the molecule: it gives the name and number of the helix, the residues at which the helix begins and ends, and its total length. The SHEET record describes the position of a β-sheet in the molecule, in a format similar to HELIX. TURN records describe turns.
A macromolecule can be looked up by its 4-character ID code from the PDB home page (http://www.rcsb.org/pdb/index.html). The page for a given protein shows its summary information, 3D structure, sequence details and so on, and offers the PDB entry file for download. For example, the amino acid sequence and secondary structure of 1ROG are as follows:
  1 GSHSMRYFHT SVSRPGRGEP RFITVGYVDD TLFVRFDSDA ASPREEPRAP
    EEEEEEEE EE BTTTB EEEEEETT EE EEEETTT TT EESST
 51 WIEQEGPEYW DRETQICKAK AQTDREDLRT LLRYYNQSEA GSHTLQNMYG
    TTTSS HHHH HHTHHHHHHH HHHHHHHHHH HHHH TT SS S EEEEEEE
101 CDVGPDGRLL RGYHQDAYDG KDYIALNEDL SSWTAADTAA QITQRKWEAA
    EEE SS B EEEEEEEETT EE EEE TTS EE SHHH HHHHHHHHTT
151 RVAEQLRAYL EGECVEWLRR YLENGKETLQ RAX
    TTHHHHHHHH HTTTHHHHHH HHHH SSSSS
Here each amino acid residue is represented by a single letter. In the secondary structure lines (lines 2, 4, 6 and 8), H denotes a helix, B a residue in an isolated β-bridge, E an extended β-strand, G a 3-10 helix, I a π-helix, T a hydrogen-bonded turn, and S a bend.
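The one-letter codes listed above lend themselves to a simple lookup table. The sketch below counts the residues in each state and reports the runs of a given state. The annotation string in it is a short made-up example rather than the 1ROG data, and the function names are invented for the illustration.

```python
from collections import Counter
from itertools import groupby

# One-letter secondary structure codes as listed in the text above.
SS_CODES = {
    "H": "helix",
    "B": "residue in isolated beta-bridge",
    "E": "extended beta-strand",
    "G": "3-10 helix",
    "I": "pi-helix",
    "T": "hydrogen-bonded turn",
    "S": "bend",
}


def summarize(ss):
    """Count residues per secondary structure state, ignoring blanks."""
    counts = Counter(symbol for symbol in ss if symbol in SS_CODES)
    return {SS_CODES[symbol]: n for symbol, n in counts.items()}


def segments(ss, code):
    """Return 1-based (start, end) positions of each run of the given code."""
    runs = []
    position = 1
    for symbol, group in groupby(ss):
        length = len(list(group))
        if symbol == code:
            runs.append((position, position + length - 1))
        position += length
    return runs


# Hypothetical annotation string: one character per residue, blank = no state.
annotation = "  EEEE TT HHHHHH "

print(summarize(annotation))
print(segments(annotation, "H"))
```

The segment function assumes that each character of the annotation lines up with exactly one residue, so blanks must be preserved in the input.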

Connectivity section

This section describes disulfide bonds and other chemical connections. The records that describe chemical connections include SSBOND, CONECT, LINK, HYDBND, CISPEP and others. SSBOND records describe disulfide bonds in protein and peptide structures. The CONECT record expresses bonds between atoms that other records cannot represent. For example, in the 1G3P file, the first CONECT record is: CONECT 49 48 299, where "CONECT" is the record name and the remaining content indicates that atoms 48 and 299 of the ATOM or HETATM records are each bonded to atom 49. The LINK record describes in detail the connections between residues that cannot be inferred from the primary structure; it is essentially a supplement to the CONECT record described above. HYDBND records describe the hydrogen bonds formed between atoms.
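A CONECT record like the one above can be read as a central atom followed by its bonded partners. The sketch below assumes the usual fixed layout in which each serial number occupies five columns starting at column 7, with at most four bonded atoms per record. The example line is built in code so that the column alignment is exact, and the function name is invented for the illustration.

```python
def parse_conect(line):
    """Return (atom serial, list of bonded atom serials) from a CONECT record."""
    if not line.startswith("CONECT"):
        raise ValueError("not a CONECT record")
    serials = []
    # Columns 7-31 hold up to five 5-character integer fields (assumed layout).
    for start in range(6, min(len(line), 31), 5):
        field = line[start:start + 5]
        if field.strip():
            serials.append(int(field))
    return serials[0], serials[1:]


# Rebuild the 1G3P example with each serial right-aligned in five columns.
example = "CONECT" + "".join(f"{serial:>5}" for serial in (49, 48, 299))

atom, bonded = parse_conect(example)
print(atom, bonded)
```

Reading by fixed width matters here because large serial numbers can fill their fields completely and leave no spaces between them.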

Coordinate section

The coordinate section mainly records the coordinates of atoms; the related records include ATOM, HETATM, MODEL and ENDMDL. ATOM records list the atoms of the standard amino acid residues in order from the amino terminus to the carboxyl terminus. From a biochemical point of view, they describe how the atoms of each standard amino acid residue are arranged in space. Take the standard amino acid residue ALA at the first position of the peptide chain as an example: the atoms of the ALA residue and their serial numbers are 1 N, 2 CA, 3 C, 4 O, 5 CB, where:
the A in CA stands for alpha (α);
the B in CB stands for beta (β).
This description fixes the spatial position of a single residue and the relationships between its atoms. Because two adjacent residues in the peptide chain undergo dehydration condensation to form a peptide bond and a stable peptide plane, the relationship between two adjacent amino acid residues in the primary structure can also be determined: the carbonyl carbon atom (C) of the preceding residue and the nitrogen atom (N) of the following residue are joined by the peptide bond, which lies in the peptide plane.
In the 1G3P file, the ATOM records of the first residue, ALA, are:
ATOM      1  N   ALA     1     -10.684   7.361 121.696  1.00 17.19           N
ATOM      2  CA  ALA     1     -10.459   8.273 120.534  1.00 16.43           C
ATOM      3  C   ALA     1     -10.360   9.687 121.079  1.00 16.06           C
ATOM      4  O   ALA     1     -10.826   9.967 122.195  1.00 16.83           O
ATOM      5  CB  ALA     1     -11.607   8.170 119.558  1.00 16.89           C
"ATOM" is the record name. The first record in the example above states that the x, y and z coordinates of the nitrogen atom (N) in residue ALA are -10.684, 7.361 and 121.696 respectively, that its occupancy is 1.00, that its temperature factor is 17.19, and that its element symbol is N. The other ATOM records describe the remaining atoms of residue ALA. In the 1G3P file, the atomic coordinates of the remaining 217 amino acid residues are described in the same way.
The HETATM record gives the spatial coordinates of the atoms that make up non-standard residues (the names of the non-standard residues are defined in the HET records). Its format is the same as that of ATOM. The TER record marks the end of a run of ATOM records, that is, the end of a chain.
The MASTER record is a summary of the records above. Its numbers give, in order, the count of REMARK records, a fixed "0", and the counts of HET, HELIX, SHEET, TURN, SITE, coordinate transformation, atomic coordinate, TER, CONECT and SEQRES records. For example:
MASTER 258 0 2 2 13 0 0 6 1889 1 28 17
The END record marks the end of the file; its format is simply END.
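To make the ATOM fields described above concrete, the following sketch reads the first two ATOM records of 1G3P and measures the distance between the N and CA atoms. The column positions are assumptions based on the commonly published fixed-column layout (serial in columns 7-11, atom name in 13-16, residue name in 18-20, chain in 22, residue number in 23-26, x, y, z in 31-54, occupancy in 55-60, temperature factor in 61-66, element in 77-78). `format_atom` is a hypothetical helper that exists only to lay the example values out in those columns.

```python
import math


def format_atom(serial, name, res_name, res_seq, x, y, z, occ, temp, element):
    """Hypothetical helper: lay out an ATOM record (blank chain ID) in fixed columns."""
    return (
        f"ATOM  {serial:>5}  {name:<3} {res_name:>3}  {res_seq:>4}    "
        f"{x:>8.3f}{y:>8.3f}{z:>8.3f}{occ:>6.2f}{temp:>6.2f}"
        f"{'':10}{element:>2}"
    )


def parse_atom(line):
    """Read one ATOM or HETATM record using assumed fixed columns."""
    return {
        "record": line[0:6].strip(),
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "res_name": line[17:20].strip(),
        "chain": line[21].strip(),
        "res_seq": int(line[22:26]),
        "x": float(line[30:38]),
        "y": float(line[38:46]),
        "z": float(line[46:54]),
        "occupancy": float(line[54:60]),
        "temp_factor": float(line[60:66]),
        "element": line[76:78].strip(),
    }


# Values taken from the first two ATOM records of 1G3P quoted above.
n_atom = parse_atom(format_atom(1, "N", "ALA", 1, -10.684, 7.361, 121.696, 1.00, 17.19, "N"))
ca_atom = parse_atom(format_atom(2, "CA", "ALA", 1, -10.459, 8.273, 120.534, 1.00, 16.43, "C"))

# Euclidean distance between the two atoms, in the units of the coordinates.
distance = math.dist(
    (n_atom["x"], n_atom["y"], n_atom["z"]),
    (ca_atom["x"], ca_atom["y"], ca_atom["z"]),
)
print(round(distance, 3))
```

The same distance calculation is the kind of measurement between atoms that the HPDB viewer offers interactively.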

Visualization

3D structure visualization of biomacromolecules
As described above, the PDB database represents the structure of a biomacromolecule through specific record formats that hold the spatial coordinates of its atoms together with descriptions of how, and in what order, they are connected. With a dedicated viewer such as RasMol, the three-dimensional structure of a macromolecule can be visualized directly from its PDB file. RasMol is a molecular graphics program. It can be embedded in a web browser to open pdb files on the Internet through hyperlinks, and it also runs as a standalone program under Windows, Mac and Unix. The figure shows the main menu window that appears after Rasmenu.exe is run; a molecular graphics display window opens behind it. When a pdb file is opened from the main menu window, its 3D structure is drawn in the second window. The main menu offers a choice of display styles, such as wireframe, sticks, ball-and-stick and ribbons, as well as different coloring schemes. With the mouse, the three-dimensional structure of the molecule can be rotated and viewed from different angles, much like examining a delicate ivory carving from every side [2].