NLP | Running Text Classification with Hugging Face Transformers

The Hugging Face team's transformers library has been updated again and now includes the DistilRoBERTa, DistilBERT, and ALBERT models; these three are worth comparing against the other models. So how do we run it?
First, go to GitHub and search for transformers:
https://github.com/huggingface/transformers
Open this repo,

then git clone it or download it as an archive.

Next, open the repo in PyCharm or another editor.
https://github.com/huggingface/transformers/tree/master/examples
Open run_glue.py in the examples directory.

Find the __main__ block at the bottom and cut all of its code into a separate function main() that takes two parameters: model and dataset (called task in the code below).
dataset is the name of the dataset, which is also the name of the folder the data lives in, and model is the model type. The most important part here is the command-line arguments: since we do not want to pass parameters on the command line, we add a default value to each parser.add_argument call and set required to False, so every argument has a default.
We then set the data dir, the training batch size, and the number of epochs.

def main(model, task):
    parser = argparse.ArgumentParser()

    model_dir = model_to_dir[model]

    ## Required parameters
    data_dir = '/home/socialbird/Downloads/transformers-master/examples/glue_data/{}'.format(task)
    # task = 'RTE'
    train_bs = 8
    eps = 3.0
    parser.add_argument("--data_dir", default=data_dir, type=str, required=False,
                        help="The input data dir. Should contain the .tsv files (or other data files) for the task.")
    parser.add_argument("--model_type", default=model, type=str, required=False,
                        help="Model type selected in the list: " + ", ".join(MODEL_CLASSES.keys()))
    parser.add_argument("--model_name_or_path", default=model_dir, type=str, required=False,
                        help="Path to pre-trained model or shortcut name selected in the list: " + ", ".join(ALL_MODELS))
    parser.add_argument("--task_name", default=task, type=str, required=False,
                        help="The name of the task to train selected in the list: " + ", ".join(processors.keys()))
    parser.add_argument("--output_dir", default='output', type=str, required=False,
                        help="The output directory where the model predictions and checkpoints will be written.")

    ## Other parameters
    parser.add_argument("--config_name", default="", type=str,
                        help="Pretrained config name or path if not the same as model_name")
    parser.add_argument("--tokenizer_name", default="", type=str,
                        help="Pretrained tokenizer name or path if not the same as model_name")
    parser.add_argument("--cache_dir", default="", type=str,
                        help="Where do you want to store the pre-trained models downloaded from s3")
    parser.add_argument("--max_seq_length", default=128, type=int,
                        help="The maximum total input sequence length after tokenization. Sequences longer "
                             "than this will be truncated, sequences shorter will be padded.")
    parser.add_argument("--do_train", action='store_true', default=True,
                        help="Whether to run training.")
    parser.add_argument("--do_eval", action='store_true', default=True,
                        help="Whether to run eval on the dev set.")
    parser.add_argument("--evaluate_during_training", action='store_true', default=True,
                        help="Run evaluation during training at each logging step.")
    parser.add_argument("--do_lower_case", action='store_true',
                        help="Set this flag if you are using an uncased model.")
    parser.add_argument("--per_gpu_train_batch_size", default=train_bs, type=int,
                        help="Batch size per GPU/CPU for training.")
    parser.add_argument("--per_gpu_eval_batch_size", default=8, type=int,
                        help="Batch size per GPU/CPU for evaluation.")
    parser.add_argument('--gradient_accumulation_steps', type=int, default=1,
                        help="Number of updates steps to accumulate before performing a backward/update pass.")
    parser.add_argument("--learning_rate", default=5e-5, type=float,
                        help="The initial learning rate for Adam.")
    parser.add_argument("--weight_decay", default=0.0, type=float,
                        help="Weight decay if we apply some.")
    parser.add_argument("--adam_epsilon", default=1e-8, type=float,
                        help="Epsilon for Adam optimizer.")
    parser.add_argument("--max_grad_norm", default=1.0, type=float,
                        help="Max gradient norm.")
    parser.add_argument("--num_train_epochs", default=eps, type=float,
                        help="Total number of training epochs to perform.")
    parser.add_argument("--max_steps", default=-1, type=int,
                        help="If > 0: set total number of training steps to perform. Override num_train_epochs.")
    parser.add_argument("--warmup_steps", default=0, type=int,
                        help="Linear warmup over warmup_steps.")
    parser.add_argument('--logging_steps', type=int, default=200,
                        help="Log every X updates steps.")
    parser.add_argument('--save_steps', type=int, default=500,
                        help="Save checkpoint every X updates steps.")
    parser.add_argument("--eval_all_checkpoints", action='store_true',
                        help="Evaluate all checkpoints starting with the same prefix as model_name ending and ending with step number")
    parser.add_argument("--no_cuda", default=False, required=False,
                        help="Avoid using CUDA when available")
    parser.add_argument('--overwrite_output_dir', action='store_true', default=True,
                        help="Overwrite the content of the output directory")
    parser.add_argument('--overwrite_cache', action='store_true',
                        help="Overwrite the cached training and evaluation sets")
    parser.add_argument('--seed', type=int, default=42,
                        help="random seed for initialization")
    parser.add_argument('--fp16', action='store_true',
                        help="Whether to use 16-bit (mixed) precision (through NVIDIA apex) instead of 32-bit")
    parser.add_argument('--fp16_opt_level', type=str, default='O1',
                        help="For fp16: Apex AMP optimization level selected in ['O0', 'O1', 'O2', and 'O3']."
                             "See details at https://nvidia.github.io/apex/amp.html")
    parser.add_argument("--local_rank", type=int, default=-1,
                        help="For distributed training: local_rank")
    parser.add_argument('--server_ip', type=str, default='', help="For distant debugging.")
    parser.add_argument('--server_port', type=str, default='', help="For distant debugging.")
    args = parser.parse_args()
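Everything that came after args = parser.parse_args() in the original __main__ block stays inside main() unchanged; as a reminder of the stock run_glue.py structure (not new code), it is roughly:

    # ... the rest of the original __main__ body continues here unchanged:
    # device / distributed setup, set_seed(args),
    # config, tokenizer and model loading via MODEL_CLASSES[args.model_type],
    # then the calls to train(...) and evaluate(...)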


Next, define model_to_dir, a dictionary I use to set the model paths; you can skip it entirely. Running this script loads BERT or another model, so you have to specify the model type and the model path (model_name_or_path). If you have not downloaded a model in advance, you can simply use the model's shortcut name, such as roberta-base.
model_to_dir = {
    'distilbert': 'distilbert-base-uncased',
    'distilroberta': MODEL_DIRS['distilroberta'],
    'albert': 'albert-base-v2',
    'bert': MODEL_DIRS['bert-base'],
    'roberta': 'roberta-base',
    'camembert': 'camembert-base',
    'xlm': 'xlm-mlm-ende-1024',
    'xlnet': 'xlnet-base-cased',
}
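If you want to double-check that an entry resolves to a loadable model, here is a minimal sketch of mine (not part of run_glue.py) using the transformers Auto classes; a shortcut name is downloaded from the hub, while a local directory path is loaded from disk:

from transformers import AutoTokenizer, AutoModelForSequenceClassification

name_or_path = 'roberta-base'  # or a local directory such as model_to_dir['bert']
tokenizer = AutoTokenizer.from_pretrained(name_or_path)
model = AutoModelForSequenceClassification.from_pretrained(name_or_path)
print(model.config.model_type)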

Finally, we also need the processor. The data/processors/glue.py script already contains a number of processors that we can use directly, such as the RTE processor.
To use RTE, you need to know its registered task name, i.e. the standard name of the dataset, which you can find at the bottom of that script.
glue_processors = {
    "cola": ColaProcessor,
    "mnli": MnliProcessor,
    "mnli-mm": MnliMismatchedProcessor,
    "mrpc": MrpcProcessor,
    "sst-2": Sst2Processor,
    "sts-b": StsbProcessor,
    "qqp": QqpProcessor,
    "qnli": QnliProcessor,
    "rte": RteProcessor,
    "wnli": WnliProcessor,
}
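Before training you can also inspect a processor directly; a minimal sketch of mine, assuming the RTE data has already been downloaded into glue_data/RTE:

from transformers import glue_processors

processor = glue_processors['rte']()                # 'rte' is what goes into --task_name
print(processor.get_labels())                       # the label set for the task
examples = processor.get_train_examples('glue_data/RTE')  # reads train.tsv from the data dir
print(examples[0].text_a, examples[0].text_b, examples[0].label)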


Finally, we still need the data itself: grab the download_glue_data.py script, which can be found in the W4ngatang repo on GitHub.
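Save that script next to run_glue.py and run it to fetch the data; the invocation is roughly as follows (check the script's own --help for the exact flags and adjust the paths for your machine):

python download_glue_data.py --data_dir glue_data --tasks RTE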
Then simply run run_glue.py and you are done.
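With the wrapper above, the whole run can be started from Python instead of the command line; a minimal sketch, assuming the main(model, task) function, model_to_dir, and the default arguments defined earlier:

if __name__ == "__main__":
    # model must be a key of model_to_dir / MODEL_CLASSES,
    # task a key of glue_processors (task_name is lower-cased inside run_glue.py)
    main('distilbert', 'RTE')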
