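A detailed guide to using the transformers framework, with bert-base-chinese
Taking the bert-base-chinese model as an example, the model directory is model_name = "C:/Users/Administrator.DESKTOP-TPJL4TC/.cache/modelscope/hub/tiansz/bert-base-chinese"
The bert-base-chinese checkpoint is only a little over 400 MB, which corresponds to roughly a hundred million parameters. That is far below today's models, which routinely take tens of gigabytes and have billions or tens of billions of parameters. BERT's main job is therefore to understand semantics. Its bidirectional encoding comes from the self-attention described in the Transformer paper: every token can attend to the tokens on both sides of it. Because the model can understand semantics, it supports a number of derived capabilities: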
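1. Comparing the similarity of two sentences.
2. Simple QA: give it a passage as the context and ask a question about that passage. It locates the position of the answer inside the context and extracts it. It is not a generative model, and it does not have the parameter count for generation; it simply cuts out the span of text that best matches the question.
3. Named entity recognition (NER).
4. Many other downstream NLP tasks that you define yourself, provided you implement the output-layer logic.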
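Sentence similarity (item 1 above) usually comes down to comparing two sentence vectors. The sketch below shows only the comparison step, using cosine similarity on made-up 4-dimensional vectors; how the vectors are obtained from BERT (for example by pooling the last hidden state) is not covered here, and the function name is an assumption for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up "sentence vectors"; real BERT vectors have hidden_size entries.
vec_a = [1.0, 2.0, 0.0, 1.0]
vec_b = [2.0, 4.0, 0.0, 2.0]  # same direction as vec_a
vec_c = [0.0, 0.0, 3.0, 0.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)  # close to 1: very similar
sim_ac = cosine_similarity(vec_a, vec_c)  # 0: unrelated
```

A value near 1 means the two vectors point in the same direction, and a value near 0 means they are unrelated.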
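Transformers provides three main components, the configuration, the tokenizer and the model, and all three can be instantiated through the same from_pretrained() method.
- Configuration class. Stores the parameters of the model and the tokenizer, such as the vocabulary size, the hidden size and the dropout rate. The configuration class does not depend on the deep learning framework.
- Tokenizer class. Every model has a matching tokenizer, which stores the token-to-index mapping and handles the model-specific encoding and decoding of sequences, for example BPE (Byte Pair Encoding) or SentencePiece (BERT models such as bert-base-chinese use WordPiece). It is also easy to add special tokens, such as CLS and SEP, or to resize the vocabulary.
- Model class. A base class implements the computation graph and the encoding process, that is, the forward pass through a stack of self-attention layers up to the last hidden state. Task-specific wrappers are then built on top of the last layer, such as the derived classes XXXForSequenceClassification and XXXForMaskedLM.
The authors of Transformers also provide a set of Auto Classes for these components, which can work out from a short alias (such as bert-base-cased) which configuration class, tokenizer class and model class to instantiate.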
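To make the idea of "inferring the class from an alias" concrete, here is a toy sketch of one way such a dispatch could work: look up a configuration for the alias, read a model type from it, and pick the class from a registry. Every name below (ToyAutoModel, MODEL_REGISTRY, FAKE_CONFIGS, the checkpoint aliases) is hypothetical; this is not the library's actual implementation.

```python
class ToyBertModel:
    def __init__(self, config):
        self.config = config


class ToyGPT2Model:
    def __init__(self, config):
        self.config = config


# Hypothetical registry: model_type string -> model class.
MODEL_REGISTRY = {"bert": ToyBertModel, "gpt2": ToyGPT2Model}

# Stand-in for the configuration stored with each checkpoint.
FAKE_CONFIGS = {
    "toy-bert-checkpoint": {"model_type": "bert", "hidden_size": 768},
    "toy-gpt2-checkpoint": {"model_type": "gpt2", "hidden_size": 768},
}


class ToyAutoModel:
    @classmethod
    def from_pretrained(cls, name):
        config = FAKE_CONFIGS[name]                       # 1. load the config
        model_cls = MODEL_REGISTRY[config["model_type"]]  # 2. pick the class
        return model_cls(config)                          # 3. instantiate it


model = ToyAutoModel.from_pretrained("toy-bert-checkpoint")
```

The caller never names ToyBertModel directly; the alias alone determines which class is built.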
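Transformers offers two broad families of model architectures: one for natural language generation (NLG) tasks, such as GPT, GPT-2, Transformer-XL, XLNet and XLM, and one mainly for language understanding tasks, such as BERT, DistilBERT, RoBERTa and XLM.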
The tokenizer.encode() method
After several layers of inheritance, the final implementation lives in transformers\tokenization_utils_base.py, in class PreTrainedTokenizerBase(SpecialTokensMixin, PushToHubMixin):
def encode(
    self,
    text: Union[TextInput, PreTokenizedInput, EncodedInput],
    text_pair: Optional[Union[TextInput, PreTokenizedInput, EncodedInput]] = None,
    add_special_tokens: bool = True,
    padding: Union[bool, str, PaddingStrategy] = False,
    truncation: Union[bool, str, TruncationStrategy] = None,
    max_length: Optional[int] = None,
    stride: int = 0,
    padding_side: Optional[bool] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    **kwargs,
) -> List[int]:
    """
    Converts a string to a sequence of ids (integer), using the tokenizer and vocabulary.

    Same as doing `self.convert_tokens_to_ids(self.tokenize(text))`.

    Args:
        text (`str`, `List[str]` or `List[int]`):
            The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the
            `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
            method).
        text_pair (`str`, `List[str]` or `List[int]`, *optional*):
            Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using
            the `tokenize` method) or a list of integers (tokenized string ids using the `convert_tokens_to_ids`
            method).
    """
    encoded_inputs = self.encode_plus(
        text,
        text_pair=text_pair,
        add_special_tokens=add_special_tokens,
        padding=padding,
        truncation=truncation,
        max_length=max_length,
        stride=stride,
        padding_side=padding_side,
        return_tensors=return_tensors,
        **kwargs,
    )
    return encoded_inputs["input_ids"]
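The docstring's "tokenize, then convert tokens to ids" pipeline can be illustrated with a toy character-level tokenizer. This is a minimal sketch, not the library's code: the class ToyCharTokenizer, its vocabulary and its ids are all made up, and real BERT tokenizers use WordPiece and a much larger vocabulary.

```python
class ToyCharTokenizer:
    """Character-level stand-in for a BERT-style tokenizer (hypothetical)."""

    def __init__(self, vocab):
        self.vocab = vocab  # token -> index

    def tokenize(self, text):
        # One token per character, which is roughly how Chinese text is split.
        return list(text)

    def convert_tokens_to_ids(self, tokens):
        unk_id = self.vocab["[UNK]"]
        return [self.vocab.get(token, unk_id) for token in tokens]

    def encode(self, text, add_special_tokens=True):
        ids = self.convert_tokens_to_ids(self.tokenize(text))
        if add_special_tokens:
            # Wrap the sequence as [CLS] ... [SEP].
            ids = [self.vocab["[CLS]"]] + ids + [self.vocab["[SEP]"]]
        return ids


# Made-up vocabulary and ids, for illustration only.
toy_vocab = {"[PAD]": 0, "[UNK]": 1, "[CLS]": 2, "[SEP]": 3, "你": 4, "好": 5}
toy_tokenizer = ToyCharTokenizer(toy_vocab)

ids_with_special = toy_tokenizer.encode("你好吗")  # "吗" is out of vocabulary
ids_plain = toy_tokenizer.encode("你好", add_special_tokens=False)
```

With add_special_tokens=False the result is exactly convert_tokens_to_ids(tokenize(text)); with the default, the special tokens are added around it.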
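What model.eval() does: while a PyTorch module is in training mode its Dropout modules are active, so the same input can produce different outputs from one call to the next (a forward pass on its own does not change the model parameters). Calling eval() switches the model to evaluation mode and deactivates the Dropout modules, which makes inference deterministic. According to the from_pretrained() documentation, a model loaded that way is already put in evaluation mode, so an explicit model.eval() is mainly a safeguard; call model.train() before fine-tuning.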
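The effect of the training/evaluation switch on a dropout layer can be sketched in plain Python. ToyDropout below is a hypothetical stand-in, not torch.nn.Dropout; it assumes the common "inverted dropout" scheme, in which surviving values are scaled by 1 / (1 - p) during training and everything passes through unchanged in evaluation mode.

```python
import random


class ToyDropout:
    """Hypothetical dropout layer with a train/eval switch."""

    def __init__(self, p=0.5, seed=0):
        self.p = p
        self.training = True  # training mode until eval() is called
        self._rng = random.Random(seed)

    def train(self):
        self.training = True
        return self

    def eval(self):
        self.training = False
        return self

    def __call__(self, values):
        if not self.training:
            return list(values)  # evaluation mode: identity
        keep = 1.0 - self.p
        # Training mode: zero each value with probability p, rescale the rest.
        return [v / keep if self._rng.random() < keep else 0.0 for v in values]


values = [1.0, 2.0, 3.0, 4.0]
layer = ToyDropout(p=0.5)

train_out = layer(values)  # random zeros, survivors doubled
layer.eval()
eval_out = layer(values)   # identical to the input
```

Note that neither mode touches any stored weights; only the output changes.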
Python's __call__ method
When you call an object as if it were a function, for example objectA(xxx), the call is redirected to the object's __call__ method.
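A minimal sketch of this redirection, using a tiny base class whose __call__ forwards to forward(). ToyModule and Doubler are hypothetical names chosen for illustration; the pattern is the same one that lets a model object be called like a function.

```python
class ToyModule:
    """Hypothetical base class: calling the object runs forward()."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)

    def forward(self, *args, **kwargs):
        raise NotImplementedError


class Doubler(ToyModule):
    def forward(self, x):
        return 2 * x


doubler = Doubler()
result_call = doubler(21)             # goes through __call__
result_forward = doubler.forward(21)  # calls forward() directly
```

Both calls return the same value; the first one simply takes the detour through __call__.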
In class PreTrainedTokenizerBase:
def __call__(
    self,
    text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
    text_pair: Optional[Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]] = None,
    text_target: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
    text_pair_target: Optional[
        Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]]
    ] = None,
    add_special_tokens: bool = True,
    padding: Union[bool, str, PaddingStrategy] = False,
    truncation: Union[bool, str, TruncationStrategy] = None,
    max_length: Optional[int] = None,
    stride: int = 0,
    is_split_into_words: bool = False,
    pad_to_multiple_of: Optional[int] = None,
    padding_side: Optional[bool] = None,
    return_tensors: Optional[Union[str, TensorType]] = None,
    return_token_type_ids: Optional[bool] = None,
    return_attention_mask: Optional[bool] = None,
    return_overflowing_tokens: bool = False,
    return_special_tokens_mask: bool = False,
    return_offsets_mapping: bool = False,
    return_length: bool = False,
    verbose: bool = True,
    **kwargs,
) -> BatchEncoding:
    """
    Main method to tokenize and prepare for the model one or several sequence(s) or one or several pair(s) of
    sequences.

    Args:
        text (`str`, `List[str]`, `List[List[str]]`, *optional*):
            The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
            (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
            `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
        text_pair (`str`, `List[str]`, `List[List[str]]`, *optional*):
            The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
            (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
            `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
        text_target (`str`, `List[str]`, `List[List[str]]`, *optional*):
            The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
            list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
            you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
        text_pair_target (`str`, `List[str]`, `List[List[str]]`, *optional*):
            The sequence or batch of sequences to be encoded as target texts. Each sequence can be a string or a
            list of strings (pretokenized string). If the sequences are provided as list of strings (pretokenized),
            you must set `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
    """
    ......
This explains calls like the following:
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)
inputs = tokenizer(question, context, return_tensors="pt")
outputs = model(**inputs)
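model() ends up calling model.forward(): torch.nn.Module defines __call__, which runs any registered hooks and then invokes forward().
tokenizer(), however, is not the same as tokenizer.encode(), as the return types show. In the code, tokenizer() performs some checks and returns a BatchEncoding, whereas tokenizer.encode() returns only BatchEncoding["input_ids"]. So the following also runs:
input_ids = tokenizer.encode(question, context, return_tensors="pt")
outputs = model(input_ids)
Note that this variant drops token_type_ids (and attention_mask, which only matters once padding is involved). Without token_type_ids, BERT treats every token of the question/context pair as belonging to the first segment, so the results can differ from model(**inputs).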
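The difference between the two return values is easier to see on a toy example. The sketch below builds, by hand, the kind of dictionary a BERT-style tokenizer produces for a sentence pair, laid out as [CLS] question [SEP] context [SEP]. The function toy_encode_pair is hypothetical, the token ids 7 to 11 are made up, and 101 / 102 are assumed here to be the [CLS] / [SEP] ids.

```python
def toy_encode_pair(question_ids, context_ids, cls_id=101, sep_id=102):
    """Hypothetical pair encoder: [CLS] question [SEP] context [SEP]."""
    input_ids = [cls_id] + question_ids + [sep_id] + context_ids + [sep_id]
    # Segment 0 covers [CLS] + question + [SEP]; segment 1 covers context + [SEP].
    token_type_ids = [0] * (len(question_ids) + 2) + [1] * (len(context_ids) + 1)
    attention_mask = [1] * len(input_ids)  # no padding, so every position is real
    return {
        "input_ids": input_ids,
        "token_type_ids": token_type_ids,
        "attention_mask": attention_mask,
    }


encoding = toy_encode_pair([7, 8], [9, 10, 11])  # what tokenizer(...) resembles
ids_only = encoding["input_ids"]                 # what tokenizer.encode(...) resembles
```

Passing only ids_only to a model loses the 0/1 segment markers in token_type_ids, which is the information that separates the question from the context.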
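About model = xxx.from_pretrained(model_name)
The same pretrained model can serve different downstream tasks. A network consists of three parts: the input layer, the hidden layers in the middle and the output layer. The downstream task lives in the output layer: once we have the result of the last hidden layer, we can build different output layers on top of it to implement different functions. That is why there are several ways to instantiate a model, AutoModelForxxxx or BertForxxxx, and why the output of model(), and possibly the arguments it accepts, differ from class to class. Check the forward() method of the class you are using.
Reading the source helps, since most classes are documented. Take BertForQuestionAnswering as an example: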
from transformers import BertTokenizer, BertForQuestionAnswering
model = BertForQuestionAnswering.from_pretrained(model_name)
......
outputs = model(**inputs)
The class is defined as follows:
@add_start_docstrings(
    """
    Bert Model with a span classification head on top for extractive question-answering tasks like SQuAD (a linear
    layers on top of the hidden-states output to compute `span start logits` and `span end logits`).
    """,
    BERT_START_DOCSTRING,
)
class BertForQuestionAnswering(BertPreTrainedModel):
    def __init__(self, config):
        ......

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        start_positions: Optional[torch.Tensor] = None,
        end_positions: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], QuestionAnsweringModelOutput]:
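The docstring shows that this class handles question-answering tasks. It returns either Tuple[torch.Tensor] or QuestionAnsweringModelOutput, depending on the return_dict argument. By default it returns a QuestionAnsweringModelOutput, which is a dataclass whose fields can be accessed as attributes.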
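Once the start and end logits are available, the answer span still has to be picked. A minimal sketch of one common approach, assuming plain Python lists in place of tensors: choose the (start, end) pair with the highest combined score, subject to end >= start and a maximum length. The function best_span, the max_answer_len parameter, the tokens and the logit values are all made up for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=10):
    """Return (start, end) maximizing start_logits[start] + end_logits[end]."""
    best = (0, 0)
    best_score = float("-inf")
    for start, start_score in enumerate(start_logits):
        # Only consider ends at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best_score:
                best_score = score
                best = (start, end)
    return best


# Made-up tokens and logits for "[CLS] question [SEP] context [SEP]".
tokens = ["[CLS]", "who", "?", "[SEP]", "alice", "wrote", "it", "[SEP]"]
start_logits = [0.1, 0.0, 0.0, 0.0, 5.0, 0.2, 0.1, 0.0]
end_logits = [0.1, 0.0, 0.0, 0.0, 4.0, 0.3, 0.2, 0.0]

start, end = best_span(start_logits, end_logits)
answer_tokens = tokens[start:end + 1]
```

Taking the argmax of each logit vector independently is simpler, but it can yield an end position that comes before the start; the joint search avoids that.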
Another example:
from transformers import BertTokenizer, AutoModel
model = AutoModel.from_pretrained(model_name)
print(type(model)) # <class 'transformers.models.bert.modeling_bert.BertModel'>
The class is defined as follows:
@add_start_docstrings(
    "The bare Bert Model transformer outputting raw hidden-states without any specific head on top.",
    BERT_START_DOCSTRING,
)
class BertModel(BertPreTrainedModel):
    """
    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
    cross-attention is added between the self-attention layers, following the architecture described in [Attention is
    all you need](https://arxiv.org/abs/1706.03762) by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
    Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.

    To behave as an decoder the model needs to be initialized with the `is_decoder` argument of the configuration set
    to `True`. To be used in a Seq2Seq model, the model needs to initialized with both `is_decoder` argument and
    `add_cross_attention` set to `True`; an `encoder_hidden_states` is then expected as an input to the forward pass.
    """

    _no_split_modules = ["BertEmbeddings", "BertLayer"]

    def __init__(self, config, add_pooling_layer=True):
        ......

    def forward(
        self,
        input_ids: Optional[torch.Tensor] = None,
        attention_mask: Optional[torch.Tensor] = None,
        token_type_ids: Optional[torch.Tensor] = None,
        position_ids: Optional[torch.Tensor] = None,
        head_mask: Optional[torch.Tensor] = None,
        inputs_embeds: Optional[torch.Tensor] = None,
        encoder_hidden_states: Optional[torch.Tensor] = None,
        encoder_attention_mask: Optional[torch.Tensor] = None,
        past_key_values: Optional[List[torch.FloatTensor]] = None,
        use_cache: Optional[bool] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        return_dict: Optional[bool] = None,
    ) -> Union[Tuple[torch.Tensor], BaseModelOutputWithPoolingAndCrossAttentions]:
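The docstring shows that this bare model has no task-specific head: it acts as an encoder or, when configured with is_decoder, as a decoder. It returns either Tuple[torch.Tensor] or BaseModelOutputWithPoolingAndCrossAttentions.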
About @dataclass
@dataclass
class QuestionAnsweringModelOutput(ModelOutput):
    """
    Base class for outputs of question answering models.

    Args:
        loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
            Total span extraction loss is the sum of a Cross-Entropy for the start and end positions.
        start_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
            Span-start scores (before SoftMax).
        end_logits (`torch.FloatTensor` of shape `(batch_size, sequence_length)`):
            Span-end scores (before SoftMax).
        hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
            Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
            one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.

            Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
        attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
            Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
            sequence_length)`.

            Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
    """

    loss: Optional[torch.FloatTensor] = None
    start_logits: torch.FloatTensor = None
    end_logits: torch.FloatTensor = None
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None
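The decorator turns the annotated fields into instance attributes by generating an __init__ (along with methods such as __repr__ and __eq__). The generated __init__ is roughly equivalent to:
def __init__(
    self,
    loss: Optional[torch.FloatTensor] = None,
    start_logits: torch.FloatTensor = None,
    end_logits: torch.FloatTensor = None,
    hidden_states: Optional[Tuple[torch.FloatTensor, ...]] = None,
    attentions: Optional[Tuple[torch.FloatTensor, ...]] = None,
):
    self.loss = loss
    self.start_logits = start_logits
    self.end_logits = end_logits
    self.hidden_states = hidden_states
    self.attentions = attentions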
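A small self-contained illustration of what @dataclass generates, using a hypothetical ToyQAOutput with plain Python lists in place of tensors. It is not the library's output class, only a sketch of the decorator's behavior.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyQAOutput:
    """Hypothetical stand-in for a model output class."""

    loss: Optional[float] = None
    start_logits: Optional[List[float]] = None
    end_logits: Optional[List[float]] = None


# The generated __init__ accepts the fields as keyword arguments.
out = ToyQAOutput(start_logits=[0.1, 2.0], end_logits=[0.3, 1.5])

first_start = out.start_logits[0]  # fields are ordinary attributes
text = repr(out)                   # __repr__ is generated as well
same = out == ToyQAOutput(start_logits=[0.1, 2.0], end_logits=[0.3, 1.5])
```

No __init__, __repr__ or __eq__ was written by hand; all three come from the decorator.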
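About @classmethod
@classmethod
def from_pretrained(
    cls,
    pretrained_model_name_or_path: Union[str, os.PathLike],
    *init_inputs,
    cache_dir: Optional[Union[str, os.PathLike]] = None,
    force_download: bool = False,
    local_files_only: bool = False,
    token: Optional[Union[str, bool]] = None,
    revision: str = "main",
    trust_remote_code=False,
    **kwargs,
):
    ............
This is a class method: it can be called without creating an instance, and its first parameter is the class (cls) rather than an object (self). It can access class attributes and other class methods through cls, but because it has no self it cannot reach instance methods or instance attributes unless it creates an instance first.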
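A minimal sketch of the from_pretrained pattern as a class method. ToyTokenizer, ToySubTokenizer and default_vocab_size are hypothetical; the point is that cls refers to whichever class the method was called on, so subclasses get instances of their own type.

```python
class ToyTokenizer:
    default_vocab_size = 8  # class attribute, reachable through cls

    def __init__(self, name, vocab_size):
        self.name = name
        self.vocab_size = vocab_size

    @classmethod
    def from_pretrained(cls, name):
        # No instance exists yet; cls is the class the call was made on.
        return cls(name, cls.default_vocab_size)


class ToySubTokenizer(ToyTokenizer):
    default_vocab_size = 16


base_tok = ToyTokenizer.from_pretrained("toy")
sub_tok = ToySubTokenizer.from_pretrained("toy")
```

The same inherited method builds a ToyTokenizer in one call and a ToySubTokenizer in the other, each with its own class-level default.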
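What with torch.no_grad() does
torch.no_grad() is a PyTorch context manager that disables gradient computation for the code inside its block. Use it when you do not need gradients, especially during inference and in steps such as gradient clipping: only the forward computation runs, with no gradient tracking and no backward pass, which speeds up execution and saves memory. with torch.no_grad() is commonly used together with eval() when running on validation and test sets. In addition, this context manager is thread local; it will not affect computation in other threads.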
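The "thread local" remark can be illustrated with a toy context manager in plain Python. This is only a sketch of the general mechanism, assuming a per-thread flag stored in threading.local(); it is not PyTorch's implementation, and toy_no_grad and is_grad_enabled are hypothetical names defined here.

```python
import threading
from contextlib import contextmanager

_state = threading.local()  # each thread sees its own attributes


def is_grad_enabled():
    # Threads that never set the flag fall back to the default, True.
    return getattr(_state, "grad_enabled", True)


@contextmanager
def toy_no_grad():
    previous = is_grad_enabled()
    _state.grad_enabled = False
    try:
        yield
    finally:
        _state.grad_enabled = previous  # restore on exit, even after errors


seen_in_other_thread = []


def worker():
    seen_in_other_thread.append(is_grad_enabled())


with toy_no_grad():
    inside = is_grad_enabled()  # False in this thread
    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()               # the worker still sees True

after = is_grad_enabled()       # restored to True
```

The flag is switched off only for the thread that entered the block, and it is restored as soon as the block ends.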
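logits
In a neural network, logits usually means the raw output of the model's last layer (typically a fully connected layer). There is one value per neuron in that layer, and these values have not yet passed through any activation function such as softmax or sigmoid. Depending on the goal, the values are fed into different activation functions to obtain different kinds of results, for example softmax for a probability distribution over classes or sigmoid for an independent probability per output.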
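A short pure-Python sketch of turning logits into probabilities with the two activation functions mentioned above. The logit values are made up; the softmax subtracts the maximum before exponentiating, a standard trick for numerical stability that does not change the result.

```python
import math


def softmax(logits):
    """Convert raw scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Squash a single raw score into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


logits = [2.0, 1.0, 0.1]                         # made-up raw scores
probs = softmax(logits)                          # sums to 1 across classes
predicted_class = probs.index(max(probs))        # index of the largest logit
independent = [sigmoid(x) for x in logits]       # each value judged on its own
```

Softmax preserves the ranking of the logits, so the predicted class is simply the position of the largest raw score.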