In the rapidly evolving landscape of large language models (LLMs), enhancing their capabilities and performance is pivotal. Three prominent techniques stand out for achieving this: prompt engineering, retrieval-augmented generation (RAG), and fine-tuning.
Prompting is a versatile way to guide the behavior of a language model without the need for additional training. By structuring the input, or "prompt", in a specific way, we can steer the model's responses toward particular directions or styles. Let's delve into how this can be leveraged for Llama 2.
```python
from torch import cuda, bfloat16
import transformers

model_id = "meta-llama/Llama-2-13b-chat-hf"

# Quantization configuration: load the model in 4-bit to reduce memory usage
bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

# Load the tokenizer and the quantized model
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto'
)
model.eval()

# Text-generation pipeline with conservative sampling settings
generator = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.1,
    max_new_tokens=1024,
    repetition_penalty=1.1,
    top_k=50,
    top_p=0.9
)
system_message = "You are a helpful assistant"
prompt = "What is 1 + 1?"
# Basic prompt that combines the system message and the user prompt
# using the Llama 2 chat template
basic_prompt = f"""
[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]
"""

# Example prompt: sentiment classification
prompt = """
[INST] <<SYS>>
You are a helpful assistant
<</SYS>>
[/INST]

[INST]
Classify the text into neutral, negative or positive.
Text: I think the food was okay.
Sentiment:
[/INST]
"""
print(generator(prompt)[0]["generated_text"])
# A reasoning question to see how the model works through a problem step by step
prompt = """
[INST] <<SYS>>
You are a helpful assistant
<</SYS>>
[/INST]

[INST]
Do the odd numbers in this group add up to an even number: 3, 5, 15?
[/INST]
"""
print(generator(prompt)[0]["generated_text"])
```

Since the pipeline returns the prompt together with the completion, the output echoes each prompt and appends the model's answer:

```bash
[INST] <<SYS>>
You are a helpful assistant
<</SYS>>
[/INST]

[INST]
Classify the text into neutral, negative or positive.
Text: I think the food was okay.
Sentiment:
[/INST]
Neutral

[INST] <<SYS>>
You are a helpful assistant
<</SYS>>
[/INST]

[INST]
Do the odd numbers in this group add up to an even number: 3, 5, 15?
[/INST]
Hello! I'd be happy to help you with that. Let's break down the problem into steps:
Step 1: Identify the odd numbers in the group.
The odd numbers in the group are: 3, 5, and 15.
Step 2: Add the odd numbers together.
3 + 5 + 15 = 23
Step 3: Determine if the result is odd or even.
23 is an odd number, so the sum of the odd numbers in the group is odd.
Therefore, the odd numbers in the group add up to an odd number.
```
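Since the same template keeps reappearing, it can be handy to wrap it in a small helper. The function below is not part of the original code; it simply reproduces the basic [INST] <<SYS>> template shown above so prompts don't have to be written out by hand each time.

```python
def build_prompt(system_message: str, user_message: str) -> str:
    """Assemble a prompt in the Llama 2 chat format used above."""
    return f"""
[INST] <<SYS>>
{system_message}
<</SYS>>

{user_message} [/INST]
"""

# Example usage with the generator defined earlier
print(generator(build_prompt(system_message, "What is 1 + 1?"))[0]["generated_text"])
```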
Retrieval Augmented Generation (RAG) is a technique that combines retrieval-based and generation-based approaches for question answering. By using an external knowledge source, RAG can retrieve relevant information and then generate a response based on that information. The result is more specific and up-to-date answers. Let's explore how RAG can be implemented with Llama 2.
```python
from torch import cuda, bfloat16
import transformers
import warnings
warnings.filterwarnings("ignore")

model_id = "meta-llama/Llama-2-13b-chat-hf"

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto'
)
model.eval()

generator = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.1,
    max_new_tokens=1024,
    repetition_penalty=1.1,
    top_k=50,
    top_p=0.9
)
# A small knowledge base describing the Zephyr models
knowledge_base = [
    "Zephyr is a series of language models that are trained to act as helpful assistants.",
    "Zephyr-7B-β is the second model in the series, and is a fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO).",
    "We found that removing the in-built alignment of these datasets boosted performance on MT Bench and made the model more helpful.",
    "However, this means that the model is likely to generate problematic text when prompted to do so and should only be used for educational and research purposes."
]

# Write the knowledge base to disk so it can be loaded as a document
with open('knowledge_base.txt', 'w') as fp:
    fp.write('\n'.join(knowledge_base))
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.text_splitter import CharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.document_loaders import TextLoader
from langchain.chains import RetrievalQA
from langchain.llms import HuggingFacePipeline

# Embed the documents and store them in a FAISS vector index
embedding_model = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
documents = TextLoader("knowledge_base.txt").load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
db = FAISS.from_documents(docs, embedding_model)

# Wrap the generation pipeline so LangChain can use it, then build the RAG chain
llm = HuggingFacePipeline(pipeline=generator)
rag = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type='stuff',
    retriever=db.as_retriever()
)

# Compare the LLM on its own with the RAG pipeline
question = "What is Zephyr?"
print("LLM ONLY", "*" * 10)
print("Q:", question)
print("A:", llm(question))

print("RAG + LLM", "*" * 10)
print("A:", rag(question))

question = 'What model is Zephyr fine-tuned from?'
print("LLM ONLY", "*" * 10)
print("Q:", question)
print("A:", llm(question))

print("RAG + LLM", "*" * 10)
print("A:", rag(question))
```
RAG is a powerful technique that bridges the gap between large-scale information retrieval and detailed text generation, providing more relevant and accurate answers to user queries.
```bash
LLM ONLY **********
Q: What is Zephyr?
A: Zephyr is a lightweight, open-source API gateway built using Node.js and Express. It provides a simple, flexible, and highly performant platform for building and deploying APIs.

Here are some key features of Zephyr:

Overall, Zephyr is a powerful and flexible tool for building and deploying APIs, and it is well-suited for a wide range of applications and use cases.

RAG + LLM **********
A: {'query': 'What is Zephyr?', 'result': ' Zephyr is a series of language models that are trained to act as helpful assistants.'}

LLM ONLY **********
Q: What model is Zephyr fine-tuned from?
A: Zephyr is a pre-trained language model that has been fine-tuned on a specific dataset for a particular task. The exact model and dataset used for Zephyr's fine-tuning are not specified in the provided information. However, based on the description of the model as a "state-of-the-art" language model, it is likely that Zephyr was fine-tuned on a large and diverse dataset such as the Web Text Corpus or the Common Crawl dataset.

It's worth noting that the choice of dataset and fine-tuning strategy can have a significant impact on the performance of a language model, and different models may be optimized for different tasks or domains. If you need more information about Zephyr's fine-tuning, you may want to consult the original research paper or contact the developers of the model.

RAG + LLM **********
A: {'query': 'What model is Zephyr fine-tuned from?', 'result': ' Zephyr is fine-tuned from mistralai/Mistral-7B-v0.1.'}
```
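Before relying on the RAG answers, it can be useful to check which passages the retriever actually hands to the LLM. The snippet below is not part of the original walkthrough; it uses `similarity_search`, the standard LangChain vector-store method, on the FAISS index built above.

```python
# Inspect the top matching documents for a query before they are "stuffed" into the prompt
for doc in db.similarity_search("What is Zephyr?", k=2):
    print(doc.page_content)
```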
Fine-tuning is a pivotal technique in the world of machine learning, especially for large language models like Llama 2. By training a pre-trained model on a smaller, domain-specific dataset, we can improve its performance and make it more specialized for particular tasks.
```python
import pandas as pd
from datasets import load_dataset

# Load the first 1,000 examples of the OpenAssistant Guanaco dataset
dataset = load_dataset("timdettmers/openassistant-guanaco")
df = pd.DataFrame(dataset["train"][:1000]).dropna()
df.to_csv("train.csv")

# Inspect the format of a single training example
print(df['text'].iloc[0])
```
With the dataset prepared, we can move on to the fine-tuning itself, where Llama 2 is trained further on this data to specialize it for the tasks and topics it contains. Parameter-Efficient Fine-Tuning (PEFT) makes this practical by training only a small number of additional parameters, which keeps the memory and compute requirements manageable. The following autotrain command fine-tunes Llama 2 using PEFT; a library-level sketch of what it does appears after the command.
```bash
autotrain llm --train \
--project_name Llama-Chat \
--model abhishek/llama-2-7b-hf-small-shards \
--data_path . \
--use_peft \
--use_int4 \
--learning_rate 2e-4 \
--num_train_epochs 1 \
--trainer sft \
--merge_adapter
```
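For reference, here is a rough sketch of roughly what this command does under the hood: supervised fine-tuning with a LoRA adapter via the `peft` and `trl` libraries (API as of late 2023). The hyperparameters mirror the autotrain flags where possible; the LoRA settings and batch size are illustrative assumptions, not values taken from autotrain itself.

```python
import transformers
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer

model_id = "abhishek/llama-2-7b-hf-small-shards"
train_data = load_dataset("csv", data_files="train.csv", split="train")

# LoRA: train small low-rank adapter matrices instead of all model weights
peft_config = LoraConfig(
    r=16,                           # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

training_args = transformers.TrainingArguments(
    output_dir="Llama-Chat",
    num_train_epochs=1,             # --num_train_epochs 1
    learning_rate=2e-4,             # --learning_rate 2e-4
    per_device_train_batch_size=2,  # illustrative
)

trainer = SFTTrainer(
    model=model_id,                 # --model
    train_dataset=train_data,
    dataset_text_field="text",      # the column written to train.csv
    peft_config=peft_config,        # --use_peft
    args=training_args,
)
trainer.train()

# Merge the trained adapter back into the base weights (--merge_adapter) and save
trainer.model.merge_and_unload().save_pretrained("Llama-Chat")
```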
After fine-tuning Llama 2, we can leverage its specialized knowledge to make predictions or generate responses. The following code demonstrates how to set up the fine-tuned model for predictions and generate a response to a given prompt.
```python
from torch import cuda, bfloat16
import transformers

# Load the fine-tuned model from the autotrain project directory
model_id = 'Llama-Chat'

bnb_config = transformers.BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type='nf4',
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=bfloat16
)

tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
model = transformers.AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map='auto',
)
model.eval()

generator = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task='text-generation',
    temperature=0.1,
    max_new_tokens=500,
    repetition_penalty=1.1
)

# Prompt in the same "### Human: ... ### Assistant:" format as the training data
prompt = "### Human: Write me top five things to do in San Francisco.### Assistant:"
print(generator(prompt)[0]["generated_text"])
```