Fine-tuning Zephyr 7B GPTQ with 4-Bit Quantization for Custom Data and Inference

Introduction

Model fine-tuning and quantization play pivotal roles in creating efficient and robust machine learning solutions. This blog post walks through fine-tuning the Zephyr 7B GPTQ model with 4-bit quantization, adapting it to custom data and then running inference with the resulting model.

Fine-tuning with zephyr_trainer.py

The zephyr_trainer.py script is our key to unlocking the potential of the Zephyr model. It encompasses the complete fine-tuning process, from data preparation to model training. Here's the workflow:

Initializing the Trainer

Setting up the trainer with specific configurations is the first step toward model optimization.

```python
from llm.zephyr.finetune_gptq.config import Config
# ZephyrTrainer is the class defined in zephyr_trainer.py; module path assumed from the repo layout
from llm.zephyr.finetune_gptq.zephyr_trainer import ZephyrTrainer

# Configurations are loaded into the trainer for initialization
config = Config()
zephyr_trainer = ZephyrTrainer(config)
```

Data Processing

Processing the data correctly is vital for training the model to respond in a professional chatbot manner.

```python
from llm.zephyr.finetune_gptq.prompt_utils import to_chat_text

def process_data_sample(self, example):
    # Convert the example into the chat-style prompt format the chatbot is trained on
    processed_example = to_chat_text(
        example,
        self.config.INSTRUCTION_FIELD,
        self.config.TARGET_FIELD,
    )
    return processed_example
```
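
The training data itself is assembled by the trainer's create_dataset method, which is not reproduced here. A minimal sketch, assuming the data is pulled with Hugging Face datasets and that the config exposes a hypothetical DATASET_ID field, might look like this:

```python
from datasets import load_dataset

def create_dataset(self):
    # DATASET_ID is an assumed config field naming a Hugging Face dataset of instruction/response pairs
    data = load_dataset(self.config.DATASET_ID, split="train")

    # Map each record to the chat-formatted string produced by process_data_sample
    data = data.map(lambda example: {"text": self.process_data_sample(example)})
    return data
```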

Model Preparation and Quantization

The model is equipped with 4-bit quantization and Low-Rank Adaptation (LoRA) modules, which are instrumental in fine-tuning.

```python
from peft import LoraConfig, prepare_model_for_kbit_training, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig, TrainingArguments

def prepare_model(self):

    '''
    Prepares model for finetuning by quantizing it and attaching lora modules to the model

    Returns:
    model - Model ready for finetuning
    peft_config - LoRA Adapter config
    '''

    bnb_config = GPTQConfig(
                                bits=self.config.BITS,
                                disable_exllama=self.config.DISABLE_EXLLAMA,
                                tokenizer=self.tokenizer
                            )

    model = AutoModelForCausalLM.from_pretrained(
                                                    self.config.MODEL_ID,
                                                    quantization_config=bnb_config,
                                                    device_map=self.config.DEVICE_MAP
                                                )

    print("DOWNLOADED MODEL")
    print(model)

    model.config.use_cache=self.config.USE_CACHE
    model.config.pretraining_tp=1
    model.gradient_checkpointing_enable()
    model = prepare_model_for_kbit_training(model)

    print("MODEL CONFIG UPDATED")

    peft_config = LoraConfig(
                                r=self.config.LORA_R,
                                lora_alpha=self.config.LORA_ALPHA,
                                lora_dropout=self.config.LORA_DROPOUT,
                                bias=self.config.BIAS,
                                task_type=self.config.TASK_TYPE,
                                target_modules=self.config.TARGET_MODULES
                            )

    model = get_peft_model(model, peft_config)

    print("PREPARED MODEL FOR FINETUNING")
    print(model)

    return model, peft_config

```
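
For orientation, the configuration fields referenced above (BITS, LORA_R, TARGET_MODULES, and so on) live in the config module imported earlier. A rough sketch of what such a Config class could contain, with purely illustrative values rather than the repository's actual defaults:

```python
class Config:
    # Model and quantization settings (illustrative values, not the repo's defaults)
    MODEL_ID = "TheBloke/zephyr-7B-beta-GPTQ"
    BITS = 4
    DISABLE_EXLLAMA = True
    DEVICE_MAP = "auto"
    USE_CACHE = False

    # LoRA adapter settings
    LORA_R = 16
    LORA_ALPHA = 16
    LORA_DROPOUT = 0.05
    BIAS = "none"
    TASK_TYPE = "CAUSAL_LM"
    TARGET_MODULES = ["q_proj", "v_proj"]

    # Dataset fields and output location
    INSTRUCTION_FIELD = "instruction"
    TARGET_FIELD = "response"
    OUTPUT_DIR = "zephyr-7b-gptq-finetuned"
```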

Training the Model

The actual training is handled by setting up the training arguments and handing everything to TRL's SFTTrainer.

```python
from trl import SFTTrainer

def train(self):
    # Orchestrates the full training run: dataset, model, arguments, then the trainer
    data = self.create_dataset()
    model, peft_config = self.prepare_model()
    training_args = self.set_training_arguments()

    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=data,
        tokenizer=self.tokenizer,
        # Additional parameters (e.g. peft_config) go here
    )
    trainer.train()
```
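
The set_training_arguments call is also part of zephyr_trainer.py but isn't shown above. A minimal sketch, assuming it simply wraps transformers' TrainingArguments with values from the config (the field names below are hypothetical):

```python
def set_training_arguments(self):
    # Standard Hugging Face hyperparameters; the config field names here are assumed, not the repo's
    return TrainingArguments(
        output_dir=self.config.OUTPUT_DIR,
        per_device_train_batch_size=self.config.BATCH_SIZE,
        gradient_accumulation_steps=self.config.GRAD_ACCUMULATION_STEPS,
        learning_rate=self.config.LEARNING_RATE,
        num_train_epochs=self.config.EPOCHS,
        optim=self.config.OPTIM,
        logging_steps=self.config.LOGGING_STEPS,
        fp16=True,
    )
```

With the trainer initialized as shown earlier, calling zephyr_trainer.train() kicks off the whole pipeline.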

Utility with prompt_utils.py

The prompt_utils.py file provides the to_chat_text function, which formats each training example into the system/user/assistant chat template the model expects.

```python
def to_chat_text(example, instruction_field: str, target_field: str):
    # Build a single chat-formatted training string: system prompt, user query, assistant answer
    processed_example = (
        "<|system|>\n You are a support chatbot who helps with user queries chatbot who always responds in the style of a professional.</s>\n<|user|>\n"
        + example[instruction_field]
        + "</s>\n<|assistant|>\n"
        + example[target_field]
    )
    return processed_example
```
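
As a quick illustration with made-up data, a single support ticket is flattened into one training string:

```python
sample = {
    "instruction": "How do I track my order?",
    "response": "You can track your order from the 'My Orders' page after logging in.",
}

# One chat-formatted string with system, user, and assistant turns
print(to_chat_text(sample, "instruction", "response"))
# <|system|>
#  You are a support chatbot ...</s>
# <|user|>
# How do I track my order?</s>
# <|assistant|>
# You can track your order from the 'My Orders' page after logging in.
```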

Inference with finetuned_inference.py

With the model fine-tuned, finetuned_inference.py showcases the model's ability to infer and interact in real-time.

Prompt Generation for Inference

Generating a prompt that the model can respond to is vital for real-world applications. Unlike to_chat_text, the inference prompt supplies only the system and user turns, leaving the assistant turn open for the model to complete.

```python
def generate_prompt(example):
    # Same chat template as training, but the assistant turn is left empty for the model to fill in
    processed_example = (
        "<|system|>\n You are a support chatbot who helps with user queries chatbot who always responds in the style of a professional.\n<|user|>\n"
        + example["instruction"]
        + "\n<|assistant|>\n"
    )
    return processed_example

```

Executing Inference

The model, now fine-tuned and optimized, is ready to generate responses to given prompts.

```python
import time

import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer, GenerationConfig

from llm.zephyr.finetune_gptq.config import Config

if __name__ == '__main__':
    tokenizer = AutoTokenizer.from_pretrained(Config.OUTPUT_DIR)

    inp_str = generate_prompt(
        {
            "instruction": "I have a question about placing an order",
        },
    )

    inputs = tokenizer(inp_str, return_tensors="pt").to("cuda")

    # Load the fine-tuned model with its LoRA adapters from the training output directory
    model = AutoPeftModelForCausalLM.from_pretrained(
        Config.OUTPUT_DIR,
        low_cpu_mem_usage=True,
        return_dict=True,
        torch_dtype=torch.float16,
        device_map="cuda",
    )

    generation_config = GenerationConfig(
        do_sample=True,
        top_k=1,
        temperature=0.1,
        max_new_tokens=256,
        pad_token_id=tokenizer.eos_token_id,
    )

    # Generate a response and report the wall-clock latency
    st_time = time.time()
    outputs = model.generate(**inputs, generation_config=generation_config)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    print(time.time() - st_time)

```

Reference

GitHub - https://github.com/bayjarvis/llm/tree/main/zephyr/finetune_gptq
