How to pre-train BERT with your own dataset

This experiment is run in a Colab environment.
Here we train BERT on a text-matching task; the training data is the Ant Financial text-matching dataset.
Download the code and set up the environment

!git clone https://github.com/BonnieHuangxin/Bert_sentence_similarity.git
!mv Bert_sentence_similarity/* ./
!pip install -r sentence_similarity_Bert/requirements.txt
!mv sentence_similarity_Bert/examples/* sentence_similarity_Bert/
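Before moving on, it can help to preview the training data under sentence_similarity_Bert/chinese_data (the directory passed as --data_dir later). The sketch below only lists the files and prints their first few lines; the Ant Financial data is commonly distributed as tab-separated sentence pairs with a 0/1 similarity label, but the exact file names and column layout here are assumptions to be confirmed by the printout.

import os

data_dir = 'sentence_similarity_Bert/chinese_data'  # same directory used as --data_dir below

# List the data files shipped with the repo and print the first few lines of each,
# so the actual format (columns, separator, label position) can be checked by eye.
for name in sorted(os.listdir(data_dir)):
    path = os.path.join(data_dir, name)
    if not os.path.isfile(path):
        continue
    print('==', name, '==')
    with open(path, encoding='utf-8') as f:
        for _, line in zip(range(3), f):
            print(line.rstrip('\n'))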

Convert the TensorFlow BERT model to a PyTorch model
!git clone https://github.com/xieyufei1993/Bert-Pytorch-Chinese-TextClassification.git
!mv Bert-Pytorch-Chinese-TextClassification/* ./

Download the pre-trained Chinese BERT model (TensorFlow format)
!wget https://storage.googleapis.com/bert_models/2018_11_03/chinese_L-12_H-768_A-12.zip
!unzip chinese_L-12_H-768_A-12.zip

# Convert the TensorFlow checkpoint to a PyTorch model
!python convert_tf_to_pytorch/convert_tf_checkpoint_to_pytorch.py \
    --tf_checkpoint_path chinese_L-12_H-768_A-12/bert_model.ckpt \
    --bert_config_file chinese_L-12_H-768_A-12/bert_config.json \
    --pytorch_dump_path chinese_L-12_H-768_A-12/pytorch_model.bin
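A quick sanity check that the conversion worked: load the converted weights and run a single sentence through the encoder. This is a minimal sketch assuming the standalone pytorch_pretrained_bert package (pip install pytorch-pretrained-bert), which reads bert_config.json, vocab.txt and pytorch_model.bin from the checkpoint directory; if the cloned repos ship their own modeling code, the equivalent classes there can be used instead.

import torch
from pytorch_pretrained_bert import BertModel, BertTokenizer

model_dir = 'chinese_L-12_H-768_A-12'  # holds bert_config.json, vocab.txt and the converted pytorch_model.bin

tokenizer = BertTokenizer.from_pretrained(model_dir)
model = BertModel.from_pretrained(model_dir)
model.eval()

# Encode one sentence and check that the pooled output has the expected shape.
tokens = ['[CLS]'] + tokenizer.tokenize('花呗如何还款') + ['[SEP]']
input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

with torch.no_grad():
    _, pooled = model(input_ids, output_all_encoded_layers=False)
print(pooled.shape)  # expected: torch.Size([1, 768])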

Training
!python sentence_similarity_Bert/run_classifier_modify2.py \
    --data_dir=sentence_similarity_Bert/chinese_data \
    --bert_model=chinese_L-12_H-768_A-12 \
    --task_name=mrpc \
    --output_dir=/home/tmp/sim_model \
    --do_train \
    --train_batch_size=32

The fine-tuned model is saved to the directory given by --output_dir (/home/tmp/sim_model here).

Prediction
!python sentence_similarity_Bert/run_classifier_class.py
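run_classifier_class.py is the repo's own prediction entry point. As an illustration of what such a script does, the sketch below scores one sentence pair by hand with pytorch_pretrained_bert. It assumes the training step saved pytorch_model.bin and bert_config.json into --output_dir; the example pair and the label order (index 1 = similar) are assumptions, not something the repo guarantees.

import torch
from pytorch_pretrained_bert import BertForSequenceClassification, BertTokenizer

model_dir = '/home/tmp/sim_model'       # --output_dir from the training step
vocab_dir = 'chinese_L-12_H-768_A-12'   # vocab.txt from the original checkpoint

tokenizer = BertTokenizer.from_pretrained(vocab_dir)
model = BertForSequenceClassification.from_pretrained(model_dir, num_labels=2)
model.eval()

# Build the standard BERT sentence-pair input: [CLS] text_a [SEP] text_b [SEP]
text_a, text_b = '花呗怎么还款', '如何还花呗的钱'
tokens_a, tokens_b = tokenizer.tokenize(text_a), tokenizer.tokenize(text_b)
tokens = ['[CLS]'] + tokens_a + ['[SEP]'] + tokens_b + ['[SEP]']
segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)

input_ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
segment_ids = torch.tensor([segment_ids])

with torch.no_grad():
    logits = model(input_ids, segment_ids)
print(torch.softmax(logits, dim=-1))  # class probabilities; index 1 is assumed to mean "similar"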
