Loading an ONNX Model for Inference - 老杨说话的地方

Loading an ONNX model requires ONNX Runtime.
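
Before running inference, it can help to check which input and output names the loaded model exposes. The following is a minimal sketch, not part of the original post: `describe_io` is a hypothetical helper name, and it relies only on the `get_inputs()` and `get_outputs()` methods of an ONNX Runtime `InferenceSession`.

```python
def describe_io(sess):
    """Return (input_names, output_names) for an ONNX Runtime session.

    `sess` is expected to be an onnxruntime.InferenceSession; each entry
    returned by get_inputs() / get_outputs() has a `.name` attribute.
    """
    input_names = [arg.name for arg in sess.get_inputs()]
    output_names = [arg.name for arg in sess.get_outputs()]
    return input_names, output_names


# Usage with a real model (the path is a placeholder):
# import onnxruntime as rt
# sess = rt.InferenceSession("your_model_path_and_name.onnx")
# print(describe_io(sess))
```

The names this returns are the keys to use in the feed dictionary passed to `sess.run`.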

Inference code:

import torch
import onnxruntime as rt
from transformers import LlamaTokenizer

def generate_prompt(text):
    return f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{text}

### Response:"""

tokenizer = LlamaTokenizer.from_pretrained("minlik/chinese-llama-7b-merged")
text = '介绍一下北京'  # "Give a brief introduction to Beijing"
prompt = generate_prompt(text)
input_ids = tokenizer.encode(prompt, return_tensors='pt')
x = input_ids.numpy()  # ONNX Runtime expects NumPy arrays, not torch tensors
sess = rt.InferenceSession('your_model_path_and_name.onnx')
max_words = 2048
for i in range(max_words - 1):
    outputs = sess.run(None, {"input": x})
    # outputs[0] holds the prediction scores for every position;
    # outputs[1] is past, which is not needed here.
    predictions = outputs[0]
    predictions = torch.from_numpy(predictions)
    # The scores at the last position are the ones for the next token.
    predicted_index = torch.argmax(predictions[0, -1, :]).item()
    if predicted_index == tokenizer.eos_token_id:  # stop at the end token </s>
        break
    predicted_text = tokenizer.decode([predicted_index])
    print(predicted_text)
    # Append the new token to the prompt and re-encode it for the next step.
    prompt = prompt + predicted_text
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    x = input_ids.numpy()

Notes:

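The tokenizer returns the encoded text as a torch tensor (because of return_tensors='pt'), so it has to be converted to a NumPy array first.
In sess.run(None, {"input": x}), input is the input name that was chosen when the model was converted to ONNX.
predictions = outputs[0] takes the first element of the output, which holds the prediction scores for the next token at every position; the scores at the last position are the ones for the token to be predicted. predicted_index = torch.argmax(predictions[0, -1, :]).item() takes the argmax over that last position (no softmax is involved, since the highest score is also the highest probability), which gives the token's id in the vocabulary. Passing that id to the tokenizer's decode method yields the predicted text.
After a token is predicted, it is appended to everything that came before and the next token is predicted, until the end token </s> is encountered.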
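
The decoding steps described in these notes can also be sketched without re-tokenizing the text at every step, by appending each predicted id directly to the input array. This is a variant for illustration, not the code used in this post: `greedy_decode` and `run_step` are hypothetical names, and `run_step` is assumed to return scores of shape (1, seq_len, vocab_size).

```python
import numpy as np


def greedy_decode(run_step, input_ids, eos_id, max_new_tokens):
    """Greedy decoding that appends each predicted id to the input array.

    run_step: callable mapping an int64 array of shape (1, seq_len) to
              scores of shape (1, seq_len, vocab_size).
    Returns the list of generated token ids (the end token is not included).
    """
    x = np.asarray(input_ids, dtype=np.int64).reshape(1, -1)
    generated = []
    for _ in range(max_new_tokens):
        scores = run_step(x)
        # The scores at the last position are the ones for the next token.
        next_id = int(np.argmax(scores[0, -1, :]))
        if next_id == eos_id:
            break
        generated.append(next_id)
        # Extend the sequence by one token and predict again.
        x = np.concatenate([x, np.array([[next_id]], dtype=np.int64)], axis=1)
    return generated


# With ONNX Runtime, run_step could wrap the session (the input name
# "input" is whatever was chosen at conversion time):
# run_step = lambda ids: sess.run(None, {"input": ids})[0]
# ids = greedy_decode(run_step, x, tokenizer.eos_token_id, 64)
# print(tokenizer.decode(ids))
```

Working on ids rather than on decoded text avoids any mismatch that decoding and re-encoding one token at a time might introduce.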