A PyTorch implementation of BERT for seq2seq tasks using the UniLM scheme. It can also handle tasks such as automatic summarization, text classification, sentiment analysis, NER, and part-of-speech tagging, and it supports the T5 model and GPT-2 for article continuation.

bert_seq2seq

A lightweight framework. If you like it, please give it a star, thank you. If you run into a problem, you can open an issue and I promise to reply.

A distributed-training version has now been refactored: multi-GPU training only requires changing a parameter, with no extra commands or code. See https://github.com/920232796/bert_seq2seq_DDP to learn more.

Welcome to join the discussion group to ask questions, make suggestions, and exchange ideas. QQ group: 975907202

This framework currently supports a variety of NLP tasks. The supported models are:

  1. bert
  2. roberta
  3. roberta-large
  4. gpt2
  5. t5
  6. nezha (Huawei's NEZHA model)
  7. bart (Chinese BART)

The supported tasks are:

  1. seq2seq: generation tasks such as writing poems, couplets, automatic headlines, automatic summarization, etc.
  2. cls_classifier: classification using the [CLS] vector at the start of the sentence, e.g. sentiment analysis, text classification, semantic matching.
  3. sequence_labeling: sequence tagging tasks, e.g. named entity recognition, part-of-speech tagging, Chinese word segmentation.
  4. sequence_labeling_crf: sequence tagging with a CRF loss added for better results.
  5. relation_extract: relation extraction, e.g. triple extraction. (Not exactly the same as Su Jianlin's example.)
  6. simbert: the SimBERT model, which generates similar sentences and judges the similarity of sentence pairs.
  7. multi_label_cls: multi-label classification.

Different models are loaded by setting the model_name parameter, and different tasks are selected by setting the model_class parameter. See the various examples for details.
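As a rough illustration, a minimal sketch of these two parameters in use might look like the following. The import path, the vocabulary file path, and the way word2ix is built are assumptions for illustration only; check the examples folder for the exact usage in your installed version.

# Sketch only: the import path is an assumption; confirm it against the examples folder.
from bert_seq2seq import load_bert

# Build the token-to-id mapping from the downloaded vocab file (placeholder path).
with open("./state_dict/vocab.txt", encoding="utf-8") as f:
    word2ix = {token.strip(): i for i, token in enumerate(f)}

# model_name selects the backbone, model_class selects the task head.
seq2seq_model = load_bert(word2ix, model_name="roberta", model_class="seq2seq")  # generation
cls_model = load_bert(word2ix, model_name="bert", model_class="cls")             # classification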

Summary of pre-trained model download links:

  1. For the roberta model, the weights and vocabulary files can be downloaded from https://drive.google.com/file/d/1iNeYFhCBJWeUsIlnW_2K6SMwXkM4gLb_/view . For details, see the GitHub repository https://github.com/ymcui/Chinese-BERT-wwm , from which the roberta-large model can also be downloaded.
  2. For the bert model (large is not supported at present), download the Chinese pre-trained weights "bert-base-chinese": https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-pytorch_model.bin , and the "bert-base-chinese" vocabulary: https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-chinese-vocab.txt .
  3. NEZHA model weights and vocabulary (currently only base is supported): https://pan.baidu.com/s/1Z0SJbISsKzAgs0lT9hFyZQ , extraction code: 4awe.
  4. For the gpt2 model, see the gpt_test file in the test folder for text-continuation testing. The Chinese gpt2 general model and vocabulary can be downloaded from https://pan.baidu.com/s/1vTYc8fJUmlQrre5p0JRelw , password: f5un.
  5. For the English gpt2 model, the pre-trained model is at https://huggingface.co/pranavpsv/gpt2-genre-story-generator ; see gpt2_english_story_train.py in examples for the training code.
  6. The t5 model is supported in both English and Chinese and can be loaded directly with the transformers package; see the relevant examples in the examples folder. Pre-trained weights: https://github.com/renmada/t5-pegasus-pytorch
  7. The SimBERT model generates similar sentences; the pre-trained model can be bert, roberta, or nezha.
  8. Chinese bart model: https://huggingface.co/fnlp/bart-base-chinese

Some of the code references https://github.com/huggingface/transformers/ and https://github.com/bojone/bert4keras . Many thanks.

Screenshots of some small examples

Gpt2 generation

Input:

It's a lovely day.

Output:

I went there to see a movie with my babies. It was really good! There's nothing to say about the environment. The movie is very exquisite and the sound effect is also very good. I don't know if this store is still open. I hope I can often go to see it if I have time

Write poetry

(screenshot)

bert+crf ner

(input and output screenshots)

News summary text classification (14 classes)

(input and output screenshots)

Medical ner

Input:

If it is used together with other drugs, drug interaction may occur. For details, please consult a doctor or pharmacist. Take it with boiling water, 14g once, three times a day. It can nourish blood, regulate menstruation and relieve pain. It is used for menstruation with small amount, stagger after menstruation, abdominal pain during menstruation, Jianmin Group Yekai Thai Medicine (Suizhou) Co., Ltd. 1, and avoid eating raw and cold food. 2. People with other diseases should take it under the guidance of doctors. 3. If your menstruation is normal, you should go to the hospital if you suddenly have oligomenorrhea or wrong menstruation. 4. For the treatment of dysmenorrhea, it is advisable to take medicine 3 to 5 days before menstruation for one week. If there is a reproductive requirement, it should be taken under the guidance of a doctor. 5. Those who have no relief of dysmenorrhea after taking medicine, or severe dysmenorrhea, should go to the hospital for treatment. 6. If the symptoms are not relieved after taking medicine for 2 weeks, you should go to the hospital. 7. It is forbidden for people with allergies to this product. People with allergies should use it with caution. 8. It is forbidden to use this product when its properties change. 9. Please keep this product out of the reach of children. 10. If you are using other drugs, please consult a doctor or pharmacist before using this product. This product is an over-the-counter drug for gynecology with irregular menstruation. It can nourish blood, regulate menstruation and relieve pain. For less menstruation, posterior dislocation, and abdominal pain during menstruation. It can nourish blood, regulate menstruation and relieve pain. It is used for over-the-counter drugs (Class B) with less menstruation, delayed menstruation, abdominal pain during menstruation and 14g * 5 bags. It is forbidden for pregnant women in the National Medical Insurance Catalogue (Class B). Do not take it for diabetics.

Output:

(screenshot)

Antithetical couplet

(screenshot)

Semantic matching

(screenshot)

Word segmentation

(screenshot)

Install

  1. Install this framework: pip install bert-seq2seq
  2. Install pytorch.
  3. Install tqdm to display progress bars: pip install tqdm
  4. Prepare your own data: just modify the read_data function in the example code to construct the inputs and outputs, then start training (a sketch follows this list).
  5. Run the corresponding *_train.py file under the examples folder. Different tasks use different train.py files; modify the construction of the input and output data, then train. See the various examples in examples.
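
For step 4, a hypothetical sketch of a read_data function for a seq2seq task is shown below. The tab-separated, two-column file format is purely an assumption; adapt the parsing to however your own data is stored.

# Hypothetical example: read (source, target) text pairs from a TSV file.
def read_data(path):
    srcs, tgts = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split("\t")
            if len(parts) != 2:  # skip malformed lines
                continue
            srcs.append(parts[0])
            tgts.append(parts[1])
    return srcs, tgts

src_texts, tgt_texts = read_data("./data/train.tsv")  # placeholder path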

Some function explanations

def load_bert(word2ix, model_name="roberta", model_class="seq2seq")

Loads the bert model. The model_name parameter specifies which type of bert to use; currently bert, roberta, and nezha are supported. The model_class parameter specifies the type of task: seq2seq for generation tasks and cls for text classification.

model.load_pretrain_params(pretrain_model_path)

Loads the bert model parameters. Note that only the encoder parameters are loaded, i.e. the pre-trained weights downloaded from the internet. For example, the seq2seq model consists of the bert parameters plus a fully connected layer; this function loads only the first part.

model.load_all_params(recent_model_path)

Loads all model parameters. After training for a while and saving the model, you can use this function to load the most recent checkpoint and continue training or run tests.
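
Putting the three functions together, a minimal workflow sketch might look like the following. The import path, the way word2ix is built, and all file paths are illustrative assumptions; the *_train.py files under examples show the exact usage.

import torch
from bert_seq2seq import load_bert  # assumed import path; confirm against the examples

vocab_path = "./state_dict/vocab.txt"                  # placeholder paths
pretrain_path = "./state_dict/pytorch_model.bin"       # downloaded pre-trained weights
checkpoint_path = "./state_dict/my_seq2seq_model.bin"  # your own saved checkpoint

# Build the token-to-id mapping from the vocab file.
with open(vocab_path, encoding="utf-8") as f:
    word2ix = {token.strip(): i for i, token in enumerate(f)}

model = load_bert(word2ix, model_name="roberta", model_class="seq2seq")

# Fresh start: load only the encoder weights from the downloaded pre-trained model.
model.load_pretrain_params(pretrain_path)

# ... training loop goes here; save periodically with standard PyTorch:
torch.save(model.state_dict(), checkpoint_path)

# To resume training or run tests later, load every parameter, task head included.
model.load_all_params(checkpoint_path)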

If you want to read related articles, visit my website http://www.blog.zhxing.online/#/ and search for poetry, couplets, NER, or news summary text classification to find the corresponding post. Thank you for your support.

Update records

2021.11.12: Optimized the code to support the roberta-large model.

2021.10.12: Optimized the NER decoding method; the previous coarse-grained decoding had bugs.

2021.08.18: Optimized a lot of code; the framework code is now clearer and a lot of redundant code has been removed.

2021.08.17: Added support for Huawei's NEZHA model; simply change the model_name parameter. Feedback on the results is welcome.

2021.08.15: Added a word segmentation example and added rematch code to the tokenizer.

2021.07.29: Optimized some code to make it more concise.

2021.07.20: Reproduced the SimBERT model, which can output similar sentences; the training data is small, though, so it still needs testing.

2021.03.19: Added model extension support: besides the models provided by the framework, Hugging Face models can be loaded directly for training and prediction.

2021.03.12: Added a Chinese gpt2 training example: Duke of Zhou dream interpretation.

2021.03.11: Added a gpt2 example for article continuation.

2021.03.11: Added a random-sampling decoding method for more diverse generation.

2021.03.08: Beam search now returns n results and randomly selects one as the output.

2021.02.25: Added a semantic matching example.

2021.02.06: Adjusted the device setting mechanism; it is now more convenient.

2021.1.27: Adjusted the framework's code structure. There are many changes; if you find bugs, please open an issue.

2021.1.21: Added a new example: character relation extraction and classification.

2020.12.02: Adjusted some code and added several test files, which make it easy to load a trained model and test the corresponding task.

2020.11.20: Added an example; triple-extraction F1 can now reach 0.7. Added test code for news summary text classification.

2020.11.04: Ran the bert+crf example on an ordinary NER task; the results were good.

2020.10.24: Adjusted a large amount of code and added an automatic summarization example on the THUCNews dataset. Training should now work well; previously the pre-trained parameters sometimes failed to load, which could make the results very poor.

2020.10.23: Adjusted some code structure; some variables in each example are now globals. Changed and streamlined the beam search code, but rhyming in poetry writing is temporarily unsupported and will be added back later.

2020.09.29: Added a training example for the Tianchi medical NER competition (medical ner_train.py). See the competition page for details: https://tianchi.aliyun.com/competition/entrance/531824/information

2020.08.16: Added a new example for joint training of poems and couplets (poetry couplet _train.py); poems and couplets can now be written at the same time. Also added poetry test code so the model can be tested after training.

2020.08.08: This update has many changes: (1) added an automatic summarization example (auto_title.py); (2) added code to simplify the vocabulary, reducing the original ~30k tokens to just over 10k (some tokens never appear); (3) modified some beam search code for better results; (4) fine-grained NER cannot be used for now because of a data problem, so it is temporarily placed in the test folder and can be used once suitable data is found; (5) added a test folder where trained models can be tested to check the results.

2020.06.22: Added an article on Conditional Layer Norm that explains some of the code. http://www.blog.zhxing.online/#/readBlog?blogId=347

2020.06.21: Updated a lot of code and added a triple extraction example (triple extraction _train.py).

2020.06.02: Busy with graduation and a competition recently, so no updates for the time being; updates will resume later.

2020.04.18: After training the bert+crf model, the learning rate of the CRF layer seemed too low and needed to be increased (the CRF layer's learning rate can now be set separately, usually 0.01).

2020.04.13: Added CRF loss to the NER task and got the training example running, but the Viterbi algorithm has not been added yet.

2020.04.11: Planning to add a CRF layer to the NER task.

2020.04.07: Added an NER example.

2020.04.07: Updated PyPI and added models for sequence labeling tasks such as NER.

2020.04.04: Updated the code on PyPI; the latest version is 0.0.6. Please use the latest version, which has fewer bugs.

2020.04.04: Fixed some bugs and added a news-title text classification example.

2020.04.02: Adjusted the repetition and rhyme penalties in beam search for poetry writing, which may give better results.

2020.04.02: Added the Duke of Zhou dream-interpretation task.

2020.04.02: Added the couplet task.

2020.04.01: Added the poetry-writing task.

2020.04.01: After refactoring the code, it takes less time to start training a new task.

Packaging and publishing to PyPI:

python setup.py sdist
twine upload dist/bert_seq2seq-2.3.5.tar.gz
