In the rapidly evolving landscape of large language models (LLMs), enhancing their capabilities and performance is pivotal. Three prominent techniques stand out for achieving this: prompt engineering, retrieval-augmented generation (RAG), and fine-tuning.
Prompting is a versatile way to guide the behavior of a language model without the need for additional training. By structuring the input, or "prompt", in a specific way, we can steer the model's responses toward particular directions or styles. Let's delve into how this can be leveraged for Llama 2.
```python
from torch import cuda, bfloat16
import transformers

model_id = "meta-llama/Llama-2-13b-chat-hf"

# Quantization configuration: load the model in 4-bit to reduce memory usage
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# Load the tokenizer and the quantized model
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto'
)
model.eval()

# Text-generation pipeline with conservative sampling settings
generator = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.1,
    max_new_tokens=1024,
    repetition_penalty=1.1,
    top_k=50,
    top_p=0.9
)
system_message = "You are a helpful assistant"
prompt = "What is 1 + 1?"
# Basic prompt that combines the system message and the user prompt
# using the Llama 2 chat template
basic_prompt = f"""
[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]
"""

# Example prompt: sentiment classification
prompt = """
[INST] <<SYS>>
You are a helpful assistant
<</SYS>>
[/INST]

[INST]
Classify the text into neutral, negative or positive.
Text: I think the food was okay.
Sentiment:
[/INST]
"""
print(generator(prompt)[0]["generated_text"])
# A reasoning question to see how the model works through a problem step by step
prompt = """
[INST] <<SYS>>
You are a helpful assistant
<</SYS>>
[/INST]

[INST]
Do the odd numbers in this group add up to an even number: 3, 5, 15?
[/INST]
"""
print(generator(prompt)[0]["generated_text"])
```

Since the pipeline returns the prompt together with the completion, the output echoes each prompt and appends the model's answer:

```bash
[INST] <<SYS>>
You are a helpful assistant
<</SYS>>
[/INST]

[INST]
Classify the text into neutral, negative or positive.
Text: I think the food was okay.
Sentiment:
[/INST]
Neutral

[INST] <<SYS>>
You are a helpful assistant
<</SYS>>
[/INST]

[INST]
Do the odd numbers in this group add up to an even number: 3, 5, 15?
[/INST]
Hello! I'd be happy to help you with that. Let's break down the problem into steps:
Step 1: Identify the odd numbers in the group.
The odd numbers in the group are: 3, 5, and 15.
Step 2: Add the odd numbers together.
3 + 5 + 15 = 23
Step 3: Determine if the result is odd or even.
23 is an odd number, so the sum of the odd numbers in the group is odd.
Therefore, the odd numbers in the group add up to an odd number.
```
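Since the same template keeps reappearing, it can be handy to wrap it in a small helper. The function below is not part of the original code; it simply reproduces the basic [INST] <<SYS>> template shown above so prompts don't have to be written out by hand each time.

```python
def build_prompt(system_message: str, user_message: str) -> str:
    """Assemble a prompt in the Llama 2 chat format used above."""
    return f"""
[INST] <<SYS>>
{system_message}
<</SYS>>

{user_message} [/INST]
"""

# Example usage with the generator defined earlier
print(generator(build_prompt(system_message, "What is 1 + 1?"))[0]["generated_text"])
```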
Retrieval Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches for question answering. By using an external knowledge source, RAG can retrieve relevant information and then generate a response based on that information. The result is more specific and up-to-date answers. Let's explore how RAG can be implemented with Llama 2.
```python
from torch import cuda, bfloat16
import transformers
import warnings
warnings.filterwarnings("ignore")

model_id = "meta-llama/Llama-2-13b-chat-hf"

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto'
)
model.eval()

generator = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.1,
    max_new_tokens=1024,
    repetition_penalty=1.1,
    top_k=50,
    top_p=0.9
)
# A small knowledge base describing the Zephyr models
knowledge_base = [
    "Zephyr is a series of language models that are trained to act as helpful assistants.",
    "Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO).",
    "We found that removing the in-built alignment of these datasets boosted performance on MT Bench and made the model more helpful.",
    "However, this means that the model is likely to generate problematic text when prompted to do so and should only be used for educational and research purposes."
]

# Write the knowledge base to disk so it can be loaded as a document
with open('knowledge_base.txt', 'w') as fp:
    fp.write('\n'.join(knowledge_base))
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline

# Embed the documents and store them in a FAISS vector index
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
documents = TextLoader("knowledge_base.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
db = FAISS.from_documents(docs, embedding_model)

# Wrap the generation pipeline so LangChain can use it, then build the RAG chain
llm = HuggingFacePipeline(pipeline=generator)
rag = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=db.as_retriever()
)

# Compare the LLM on its own with the RAG pipeline
question = "What is Zephyr?"
print("LLM ONLY", "*" * 10)
print("Q:", question)
print("A:", llm(question))

print("RAG + LLM", "*" * 10)
print("A:", rag(question))

question = 'What model is Zephyr fine-tuned from?'
print("LLM ONLY", "*" * 10)
print("Q:", question)
print("A:", llm(question))

print("RAG + LLM", "*" * 10)
print("A:", rag(question))
```
RAG is a powerful technique that bridges the gap between large-scale information retrieval and detailed text generation, providing more relevant and accurate answers to user queries.
```bash
LLM ONLY **********
Q: What is Zephyr?
A: Zephyr is a lightweight, open-source API gateway built using Node.js and Express. It provides a simple, flexible, and highly performant platform for building and deploying APIs.

Here are some key features of Zephyr:

Overall, Zephyr is a powerful and flexible tool for building and deploying APIs, and it is well-suited for a wide range of applications and use cases.

RAG + LLM **********
A: {'query': 'What is Zephyr?', 'result': ' Zephyr is a series of language models that are trained to act as helpful assistants.'}

LLM ONLY **********
Q: What model is Zephyr fine-tuned from?
A: Zephyr is a pre-trained language model that has been fine-tuned on a specific dataset for a particular task. The exact model and dataset used for Zephyr's fine-tuning are not specified in the provided information. However, based on the description of the model as a "state-of-the-art" language model, it is likely that Zephyr was fine-tuned on a large and diverse dataset such as the Web Text Corpus or the Common Crawl dataset.

It's worth noting that the choice of dataset and fine-tuning strategy can have a significant impact on the performance of a language model, and different models may be optimized for different tasks or domains. If you need more information about Zephyr's fine-tuning, you may want to consult the original research paper or contact the developers of the model.

RAG + LLM **********
A: {'query': 'What model is Zephyr fine-tuned from?', 'result': ' Zephyr is fine-tuned from mistralai/Mistral-7B-v0.1.'}
```
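Before relying on the RAG answers, it can be useful to check which passages the retriever actually hands to the LLM. The snippet below is not part of the original walkthrough; it uses `similarity_search`, the standard LangChain vector-store method, on the FAISS index built above.

```python
# Inspect the top matching documents for a query before they are "stuffed" into the prompt
for doc in db.similarity_search("What is Zephyr?", k=2):
    print(doc.page_content)
```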
Fine-tuning is a pivotal technique in the world of machine learning, especially for large language models like Llama 2. By training a pre-trained model on a smaller, domain-specific dataset, we can improve its performance and make it more specialized for particular tasks.
```python
import pandas as pd
from datasets import load_dataset

# Load the first 1,000 examples of the OpenAssistant Guanaco dataset
dataset = load_dataset("timdettmers/openassistant-guanaco")
df = pd.DataFrame(dataset["train"][:1000]).dropna()
df.to_csv("train.csv")

# Inspect the format of a single training example
print(df['text'].iloc[0])
```
With the dataset prepared, we can move on to the fine-tuning itself, where Llama 2 is trained further on this data to specialize it for the tasks and topics it contains. Parameter-Efficient Fine-Tuning (PEFT) makes this practical by training only a small number of additional parameters, which keeps the memory and compute requirements manageable. The following autotrain command fine-tunes Llama 2 using PEFT; a library-level sketch of what it does appears after the command.
```bash
autotrain llm --train \
--project_name Llama-Chat \
--model abhishek/llama-2-7b-hf-small-shards \
--data_path . \
--use_peft \
--use_int4 \
--learning_rate 2e-4 \
--num_train_epochs 1 \
--trainer sft \
--merge_adapter
```
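For reference, here is a rough sketch of roughly what this command does under the hood: supervised fine-tuning with a LoRA adapter via the `peft` and `trl` libraries (API as of late 2023). The hyperparameters mirror the autotrain flags where possible; the LoRA settings and batch size are illustrative assumptions, not values taken from autotrain itself.

```python
import transformers
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer

model_id = "abhishek/llama-2-7b-hf-small-shards"
train_data = load_dataset("csv", data_files="train.csv", split="train")

# LoRA: train small low-rank adapter matrices instead of all model weights
peft_config = LoraConfig(
    r=16,                           # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_args = transformers.TrainingArguments(
    output_dir="Llama-Chat",
    num_train_epochs=1,             # --num_train_epochs 1
    learning_rate=2e-4,             # --learning_rate 2e-4
    per_device_train_batch_size=2,  # illustrative
)

trainer = SFTTrainer(
    model=model_id,                 # --model
    train_dataset=train_data,
    dataset_text_field="text",      # the column written to train.csv
    peft_config=peft_config,        # --use_peft
    args=training_args,
)
trainer.train()

# Merge the trained adapter back into the base weights (--merge_adapter) and save
trainer.model.merge_and_unload().save_pretrained("Llama-Chat")
```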
After fine-tuning Llama 2, we can leverage its specialized knowledge to make predictions or generate responses. The following code demonstrates how to set up the fine-tuned model for predictions and generate a response to a given prompt.
```python
from torch import cuda, bfloat16
import transformers

# Load the fine-tuned model from the autotrain project directory
model_id = 'Llama-Chat'

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
)
model.eval()

generator = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=500,
    repetition_penalty=1.1
)

# Prompt in the same "### Human: ... ### Assistant:" format as the training data
prompt = "### Human: Write me top five things to do in San Francisco.### Assistant:"
print(generator(prompt)[0]["generated_text"])
```