From Big Servers to Your Laptop: Running Llama2, Dolly2, and More in Your Local Environment

Machine learning enthusiasts and researchers are constantly advancing the frontiers of technology, crafting larger and more sophisticated models, especially in the domain of Natural Language Processing (NLP). However, not all of us have the resources to run these behemoths. If you've ever been curious about running the more manageable, smaller counterparts of some of the most prominent language models on your own computer, this blog post is for you!

What will we cover?

  1. Setting up the environment.
  2. Loading different language models.
  3. Calculating the memory footprint and FLOPs (Floating Point Operations) for each model.
  4. Prompting the models and measuring response time.

1. Setting Up The Environment

To start, ensure you have the transformers library installed (e.g. pip install transformers, plus accelerate for the device_map="auto" option used later). This library by Hugging Face provides the necessary tools to interact with large pre-trained models.

```python
from transformers import pipeline
import torch
import time
```

2. Loading Different Language Models

Here are the models we'll be working with:

  - EleutherAI/pythia-70m: EleutherAI's tiny 70M-parameter model
  - databricks/dolly-v2-2-8b: Databricks' 2.8B-parameter instruction-tuned model
  - meta-llama/Llama-2-7b-chat-hf: Meta's 7B-parameter chat-tuned model
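Before building the full comparison, here is a minimal sketch of loading one of them with the high-level pipeline API (the task is inferred from the model's Hub metadata; device_map="auto" assumes the accelerate package is installed). The reusable helper with memory and FLOPs reporting follows in section 4.

```python
# Minimal sketch: load the smallest model and generate a continuation.
tiny = pipeline(
    model='EleutherAI/pythia-70m',
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
print(tiny("Hello, my name is"))
```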

3. Calculating Memory and FLOPs

For any deep learning model, it's crucial to know its memory footprint (how much RAM it consumes) and its FLOPs (how computationally intensive it is).

```python
def get_model_size(model):
    # get_memory_footprint() returns the size of the model's
    # parameters and buffers in bytes.
    model_size = model.get_memory_footprint()
    return model_size

def get_model_flops(model):
    # floating_point_ops() estimates the FLOPs of one forward + backward
    # pass; we feed it a dummy batch of max_length tokens.
    model_flops = model.floating_point_ops({
        "input_ids": torch.zeros((1, model.config.max_length))
    })
    return model_flops
```
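As a quick sanity check, here's a hypothetical standalone usage of these helpers, loading the smallest model directly (in bfloat16, to mirror the pipeline setup used later):

```python
from transformers import AutoModelForCausalLM

# Load the smallest model in bfloat16, matching the pipelines below.
small_model = AutoModelForCausalLM.from_pretrained(
    'EleutherAI/pythia-70m', torch_dtype=torch.bfloat16
)
print(get_model_size(small_model) / 1e9, "GB")       # should roughly match the results section
print(get_model_flops(small_model) / 1e9, "GFLOPs")
```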

4. Prompting The Models and Measuring Response Time

We'll now load the models, measure their properties, and prompt them with a simple question.

```python
model_list = [
    'EleutherAI/pythia-70m',
    'databricks/dolly-v2-2-8b',
    'meta-llama/Llama-2-7b-chat-hf',
]

def create_pipeline(model_name):
    # bfloat16 halves memory use relative to float32; device_map="auto"
    # places weights across the available devices (requires accelerate).
    instruct_pipeline = pipeline(
        model=model_name,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        device_map="auto",
    )
    base_model = instruct_pipeline.model
    memory_footprint = get_model_size(base_model)
    print("Memory footprint:", memory_footprint / 1e9, "GB")
    model_flops = get_model_flops(base_model)
    print("FLOPs:", model_flops / 1e9, "GFLOPs")
    return instruct_pipeline

def prompt(pipeline, s):
    # Time a single generation call.
    start = time.time()
    response = pipeline(s)
    end = time.time()
    return response, round(end - start, 2)
```

Prompting the models:

```python
prompt_txt = "Explain self-attention and transformer to me like I'm 5 years old (ELI5)."

for model in model_list:
    print('model:', model, '*' * 10)
    print('prompt:', prompt_txt)
    response, elapsed_time = prompt(create_pipeline(model), prompt_txt)
    print('response:')
    print(response)
    print('took time:', elapsed_time)
```
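A note on output: for standard text-generation pipelines, response is a list of dicts, so extracting just the text would look something like the snippet below (whether the custom Dolly pipeline returns the same shape is an assumption worth verifying).

```python
# response is typically a list like [{'generated_text': '...'}]
print(response[0]['generated_text'])
```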

Results

EleutherAI's pythia-70m

  - Memory: ~0.166 GB
  - FLOPs: ~5.36 GFLOPs
  - Response: "Explain self-attention and transformer to me like I'm 5 years old (ELI5)." (the 70M-parameter model essentially echoes the prompt rather than answering it)
  - Time taken: 0.53 seconds

Databricks' dolly-v2-2-8b

  - Memory: ~5.68 GB
  - FLOPs: ~317.56 GFLOPs
  - Response: "Self-attention is a technique for neural networks to pay attention to individual data points, instead of paying attention to groups of data points. For example, self-attention can look at the word in the middle of a paragraph and the surrounding words, paying attention to individual letters in the word. Transformer is a type of neural network that is especially good at understanding natural language. Transformer uses multi-layer neural networks to analyze the natural language, which can get quite complex. You train a transformer by saying what you want the model to 'understand' and the model will generate the code to do the same thing. Transformer can do word-level natural language processing, sentence level natural language processing, and paragraph level natural language processing. It can also generate summaries, take inputs and generate outputs."
  - Time taken: 7.77 seconds

Meta's Llama-2-7b-chat-hf

  - Memory: ~13.54 GB
  - FLOPs: ~792.88 GFLOPs
  - Response: "Self-attention is like a magic wand that helps a model focus on the most important parts of a message. It's like when you're playing hide and seek with your friends, and you want to find the person who is hiding the best. Self-attention helps the model find the most important words or phrases in a message, just like how you find your friend who is hiding the best. Transformer is like a superhero that helps the model understand the message better. It's like when you have a big pile of toys, and you want to find the toy that you really want to play with. Transformer helps the model understand the message by looking at all the words and phrases together, just like how you look at all the toys in the pile to find the one you want to play with. So, self-attention is like a magic wand that helps the model find the most important parts of a message, and transformer is like a superhero that helps the model understand the message better by looking at all the words and phrases together."
  - Time taken: 388.36 seconds
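As a back-of-the-envelope check on the FLOPs figures: transformers' floating_point_ops defaults to the approximation 6 × sequence length × non-embedding parameter count (one forward plus backward pass), and config.max_length defaults to 20 tokens. Plugging in an approximate non-embedding count for Llama-2-7B (the 6.6e9 figure below is my estimate, not from the original run):

```python
# Rough reconstruction of the reported ~792.88 GFLOPs for Llama-2-7B.
non_embedding_params = 6.6e9   # approximate non-embedding parameter count
tokens = 20                    # transformers' default config.max_length
print(6 * tokens * non_embedding_params / 1e9, "GFLOPs")  # ~792
```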

Conclusion

Running large language models locally can be insightful. However, as models grow in size, so do their computational requirements: on the same machine, response time went from about half a second for the 70M-parameter Pythia to over six minutes for the 7B-parameter Llama-2. Always ensure your machine has the necessary resources before attempting to run these behemoths!

Remember: While larger models might provide more nuanced responses, they also consume significantly more resources. Choose wisely based on your task and available resources.

Happy coding!
