The earlier article "Build phylogenetic tree" roughly introduced the use of MEGA, but some parts were not clear. Here I go through the process in more detail, using the resources currently available. Please point out any mistakes.

Collect homologous sequences

To collect homologous sequences with BLAST, refer to "here". However, sequences collected that way come from many species. If you need to compare homologs in specific species, such as the commonly used model plants Arabidopsis and rice, the search runs into some problems. Here are my solutions for these two species.

1. Arabidopsis

Two websites are mainly used to obtain Arabidopsis gene sequences:

  1. TAIR: https://www.arabidopsis.org/
  2. PlnTFDB: http://plntfdb.bio.uni-potsdam.de/v3.0/

TAIR (The Arabidopsis Information Resource) provides a large amount of data on Arabidopsis, including the complete genome sequence, gene structure, gene product information, gene expression, DNA and seed stocks, genome maps, genetic and physical markers, publications, and information about the Arabidopsis research community.

The Plant Transcription Factor Database (PlnTFDB for short) currently contains 2657 protein models; its Arabidopsis thaliana protein sequences are taken from TAIR.

Sequences for common gene families can be obtained from PlnTFDB. As shown in the figure below, click "Eudicot" and then "Arabidopsis thaliana" to enter the Arabidopsis database.

Click a transcription factor family listed in the table, such as "zf-HD".

Click "Check all" to select all the sequences, and then click "Retrieve" to download them directly in .fasta format.

If there is no desired gene family in the table, you can use TAIR for BLAST search.

Go to the TAIR homepage, enter the gene family name in the search box, select the protein database, and click Search.

Taking my HSP60 as an example, the search returns the following results. Select the gene closest to the one you want, such as the last one.

Click "Send to BLAST", then click "Run BLAST" on the next page. Because I don't know what the parameters do, I ran the BLAST search with the default parameters.

This gives a list of gene sequences with their TAIR accession numbers.

Discard the hits with an E-value greater than 0.01 and save the rest. Because these are accession numbers of genes, the corresponding protein sequences still need to be retrieved.
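If the hit list is long, the E-value filter can be scripted instead of done by hand. The following is only a minimal sketch: it assumes the hits have been copied into tab-separated lines with the accession number in the first column and the E-value in the last column, which is a hypothetical layout and may not match what TAIR actually outputs. The function name `filter_hits` and the example rows are made up for illustration.

```python
def filter_hits(lines, max_evalue=0.01):
    """Return accession numbers of hits whose E-value is <= max_evalue.

    Assumes (hypothetically) one hit per line, tab-separated, with the
    accession number in the first column and the E-value in the last one.
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        accession, evalue = fields[0], float(fields[-1])
        if evalue <= max_evalue and accession not in kept:
            kept.append(accession)  # keep the first occurrence, preserve order
    return kept


if __name__ == "__main__":
    # Made-up example rows, not real BLAST output.
    example = [
        "AT1G00001.1\t250\t1e-80",
        "AT2G00002.1\t40\t0.5",
        "AT3G00003.1\t90\t0.004",
    ]
    print("\n".join(filter_hits(example)))
```

The returned list has one accession number per entry with duplicates removed, so it can be pasted into a batch retrieval form as is.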

Collect the accession numbers above and use TAIR to download the fasta files in bulk.

Open the TAIR bulk retrieval page, https://www.arabidopsis.org/tools/bulk/index.jsp , and click Sequences to start retrieving. Fill in the accession numbers as shown in the figure below, set the parameters, and download the fasta file.

2. Rice

Two websites are mainly used to obtain rice gene sequences:

  1. Rice Genome Annotation Project: http://rice.plantbiology.msu.edu/
  2. National Rice Data Center: http://www.ricedata.cn/gene/

Because the Pfam ID was already obtained with HMMER earlier, searching for the rice sequences is much easier.

On the front page of the Rice Genome Annotation Project, find Protein Domain Search, fill in the Pfam ID in the Pfam profile search box, and click Search.

No more screenshots here. Collect the accession numbers in the "Model" column of the search results.

Open the batch download page of the Rice Genome Annotation Project, http://rice.plantbiology.msu.edu/downloads_gad.shtml , select the data type and output format, fill in the accession numbers, and submit.

The search results are then displayed. Copy and paste them for later use.

Sort the homologous sequences

We need to rename these sequences so that the final evolutionary tree looks as tidy as possible. Sort the sequences by protein length from shortest to longest, then remove the annotation after each accession number and rename the accession number. Be careful to keep the original file for a rainy day.

The Arabidopsis thaliana sequences downloaded in bulk already include the length in the file, in the form LENGTH=1234. After sorting, they can be renamed in the format ATFBA1. Rice is more troublesome. The only way I know is to search the National Rice Data Center by accession number and click the gene ID to get the detailed gene data, including the protein length. After sorting, the rice sequences can be renamed in the format OsFBA1.
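For the Arabidopsis file, the sorting and renaming can be sketched in a few lines of Python. This is an illustration under assumptions, not a required step: it relies on the headers containing LENGTH=1234 as described above and falls back to counting residues when that field is missing. The function names and the prefix argument are hypothetical.

```python
import re


def parse_fasta(text):
    """Split FASTA text into (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def record_length(header, sequence):
    """Use LENGTH=... from the header if present, else count residues."""
    match = re.search(r"LENGTH=(\d+)", header)
    # A trailing "*" (stop symbol) is not counted as a residue.
    return int(match.group(1)) if match else len(sequence.rstrip("*"))


def sort_and_rename(records, prefix):
    """Sort by protein length (shortest first) and rename to prefix + index."""
    ordered = sorted(records, key=lambda rec: record_length(*rec))
    return [(f"{prefix}{i}", seq) for i, (_, seq) in enumerate(ordered, start=1)]
```

For example, `sort_and_rename(parse_fasta(text), "ATFBA")` returns pairs named ATFBA1, ATFBA2, and so on, shortest protein first. The original text is never modified, so the downloaded file stays available as a backup.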

After sorting them all out, follow the example below: leave one blank line between sequences, put the sequences of all species in the same file, and then change the .txt suffix to .fasta.

 >ATFBA1
 sequence

 >JcFBA2
 sequence

 >OsFBA3
 sequence

Click "View" at the top of File Explorer and check "File name extensions"; after that you can change the file suffix. Be sure to keep the original file for a rainy day.
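As an alternative to editing the suffix by hand, the combined file can be written directly with the .fasta suffix. The sketch below assumes the renamed sequences are already available as (name, sequence) pairs; the function names, the 60-character line width, and the output file name are illustrative choices, not requirements of MEGA.

```python
from pathlib import Path


def format_fasta(records, width=60):
    """Render (name, sequence) pairs as FASTA text, one blank line between records."""
    blocks = []
    for name, seq in records:
        # Wrap long sequences so each line holds at most `width` residues.
        wrapped = [seq[i:i + width] for i in range(0, len(seq), width)]
        blocks.append("\n".join([f">{name}"] + wrapped))
    return "\n\n".join(blocks) + "\n"


def write_fasta(records, path):
    """Write the records to a new file at `path`; input files are left untouched."""
    Path(path).write_text(format_fasta(records), encoding="utf-8")
```

Calling `write_fasta(all_records, "combined.fasta")` with the records of every species in one list produces a single file in the layout shown above.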

Build evolutionary tree

1. Sequence alignment

Download the version of the program that matches your system from the MEGA home page. Following the principle of preferring new over old, the latest version, MEGA X (64-bit), is recommended. A backup download is available here.

With a default installation of MEGA X, .fasta files are opened with MEGA X by default. So double-click the sorted .fasta sequence file to open it, and the following interface will pop up.

If the fasta file cannot be opened with MEGA X by default, you can also click "File" and "Open a file" to open the fasta file.

Then click the "W" button at the top and click "Align Protein" to run the sequence alignment with the built-in ClustalW.

Select "OK" in the pop-up window to select all sequences. Then select "OK" in "ClustalW Options" to run the alignment with the default settings.

Be careful not to close the window; wait for the alignment to finish.

Sequence alignment results

Save the alignment results. Click Data and save the file in .meg format, as shown in the figure.

2. Build an evolutionary tree

Select PHYLOGENY, choose the first entry, Construct/Test Maximum ……, and import the .meg file.

After that, leave everything at the default and wait for the analysis to finish to get the evolutionary tree. How long the analysis takes depends on the number of sequences.

3. Beautify the evolutionary tree

Not written yet

4. Export the evolutionary tree

Click "Image" to export the tree as a picture in various formats. BMP format is recommended here. If you can't open the file, you can try Honeyview to browse these pictures.