Gemma, an open language model family from Google DeepMind, has been made available to the community via Hugging Face. It comes in two sizes, 2 billion and 7 billion parameters, each with pre-trained and instruction-tuned versions. You can use it through the Hugging Face ecosystem, with TGI (Text Generation Inference) support, or deploy and fine-tune it with Vertex AI Model Garden and Google Kubernetes Engine.
Google and Hugging Face collaborated to ensure optimal integration with the Hugging Face ecosystem. The Hub offers access to four open-access models: two base models and two instruction-tuned ones.
What is Gemma?
Gemma is a family of four new LLMs from Google, built from the research and technology behind Gemini. It comes in two sizes: 2 billion and 7 billion parameters. These models handle a wide range of tasks, including writing, translation, and question answering. They can run on a variety of machines without special size-reduction techniques and support a context of up to 8,000 tokens.
- gemma-2b: Base 2B model.
- gemma-2b-it: Instruction fine-tuned version of the base 2B model.
- gemma-7b: Base 7B model.
- gemma-7b-it: Instruction fine-tuned version of the base 7B model.
How good are the Gemma models? Below is a summary of the base models’ performance on the Open LLM Leaderboard compared to other open models (higher scores indicate better performance):
| Model | License | Commercial use? | Pretraining size [tokens] | Leaderboard score ⬇️ |
|---|---|---|---|---|
| Llama 2 70B Chat (reference) | Llama 2 license | ✅ | 2T | 67.87 |
| Gemma-7B | Gemma license | ✅ | 6T | 63.75 |
| DeciLM-7B | Apache 2.0 | ✅ | unknown | 61.55 |
| PHI-2 (2.7B) | MIT | ✅ | 1.4T | 61.33 |
| Mistral-7B-v0.1 | Apache 2.0 | ✅ | unknown | 60.97 |
| Llama 2 7B | Llama 2 license | ✅ | 2T | 54.32 |
| Gemma 2B | Gemma license | ✅ | 2T | 46.51 |
Prompt format
The base models have no prompt format. Like other base models, they can be used to continue an input sequence with a plausible continuation or for zero-shot/few-shot inference, and they provide a solid foundation for fine-tuning on your own use cases. The instruct versions use a simple conversation structure:
<start_of_turn>user
knock knock<end_of_turn>
<start_of_turn>model
who is there<end_of_turn>
<start_of_turn>user
LaMDA<end_of_turn>
<start_of_turn>model
LaMDA who?<end_of_turn>
This format must be reproduced exactly for effective use.
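For reference, here is a minimal sketch of how this format can be produced programmatically from the tokenizer's chat template, assuming you have access to the gated google/gemma-7b-it repository:

from transformers import AutoTokenizer

# Assumes access to the gated google/gemma-7b-it repository on the Hub
tokenizer = AutoTokenizer.from_pretrained("google/gemma-7b-it")

chat = [
    {"role": "user", "content": "knock knock"},
    {"role": "assistant", "content": "who is there"},
    {"role": "user", "content": "LaMDA"},
]

# The chat template wraps each turn in <start_of_turn>/<end_of_turn>
# and appends the prompt for the model's next turn
print(tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True))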
Using gemma-7b-it with Transformers
Here’s a quick rundown of how to use gemma-7b-it with transformers. Running it in bfloat16 requires about 18 GB of GPU memory, which fits consumer GPUs such as the RTX 3090 or RTX 4090.
from transformers import AutoTokenizer, pipeline
import torch

model = "google/gemma-7b-it"

tokenizer = AutoTokenizer.from_pretrained(model)

# Build a text-generation pipeline in bfloat16 on the GPU
pipe = pipeline(
    "text-generation",
    model=model,
    model_kwargs={"torch_dtype": torch.bfloat16},
    device="cuda",
)

messages = [
    {"role": "user", "content": "What is Fine Tuning? Write in one line."},
]

# Format the conversation with Gemma's chat template
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(
    prompt,
    max_new_tokens=256,
    do_sample=True,
    temperature=0.7,
    top_k=50,
    top_p=0.95,
)

# Print only the newly generated text, stripping the prompt
print(outputs[0]["generated_text"][len(prompt):])
Fine tuning is a machine learning technique that involves adjusting the parameters of a pre-trained model to fit a specific downstream task.
How to Finetune Gemma Models with HuggingFace and PEFT
Users are required to sign a consent form before they can access Gemma model artifacts.
In addition, Gemma models are compatible with torch.compile() and CUDA graphs, resulting in a roughly 4x inference speedup; a sketch is shown after the install command below.
To use Gemma models with transformers, make sure to use a recent transformers release:
pip install -U "transformers==4.38.1"
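As a rough illustration of the torch.compile() path mentioned above (treat this as a sketch; the exact static-cache setup can vary with your transformers version):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/gemma-2b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).to("cuda")

# A static KV cache keeps tensor shapes fixed so CUDA graphs can be captured
model.generation_config.cache_implementation = "static"
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))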
1. Download the Model and Tokenizer
- Once you have submitted the consent form, you can access the model artifacts on the Hugging Face Hub.
- This example downloads the model and tokenizer with a BitsAndBytesConfig for 4-bit weight quantization.
import os
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

model_id = "google/gemma-2b"

# 4-bit NF4 quantization with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

tokenizer = AutoTokenizer.from_pretrained(model_id, token=os.environ['HF_TOKEN'])
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map={"":0}, token=os.environ['HF_TOKEN'])
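As an optional sanity check, you can confirm the 4-bit load by printing the model's memory footprint, which should come out to only a few GB for the 2B model:

# Reported in bytes; 4-bit quantization keeps the 2B model to a few GB
print(f"Memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")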
2. Test the Model
Now, let's test the model before starting the fine-tuning:
text = "Qoute: The greatest glory in living"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
The model produces a reasonable completion, along with some additional tokens:
Quote: The greatest glory in living lies not in never falling, but in rising every time we fall.
-Nelson Mandela
However, this is not the format we would prefer the answer to follow. Let's fine-tune the model so that it generates output in the following format:
Qoute: The greatest glory in living lies not in never falling, but in rising every time we fall.
Author: Nelson Mandela
3. Start the Finetuning Process
OK, let's begin the fine-tuning process. We will use the Abirate/english_quotes dataset of English quotes for this purpose.
from datasets import load_dataset

# Load the quotes dataset and pre-tokenize the "quote" field
data = load_dataset("Abirate/english_quotes")
data = data.map(lambda samples: tokenizer(samples["quote"]), batched=True)
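To see what the training examples look like, you can inspect a sample; each record in this dataset exposes quote, author, and tags fields:

# Peek at one record from the dataset
print(data["train"][0]["quote"])
print(data["train"][0]["author"])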
4. Use LoRA config
Now we define the LoRA config (targeting the attention and MLP projection layers) and set up the SFTTrainer:
import transformers
from peft import LoraConfig
from trl import SFTTrainer

# LoRA configuration: low-rank adapters on the attention and MLP projections
lora_config = LoraConfig(
    r=8,
    target_modules=["q_proj", "o_proj", "k_proj", "v_proj", "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

# Format each example as "Quote: ...\nAuthor: ..."
def formatting_func(example):
    text = f"Quote: {example['quote'][0]}\nAuthor: {example['author'][0]}"
    return [text]

trainer = SFTTrainer(
    model=model,
    train_dataset=data["train"],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=1,
        gradient_accumulation_steps=4,
        warmup_steps=2,
        max_steps=10,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=1,
        output_dir="outputs",
        optim="paged_adamw_8bit"
    ),
    peft_config=lora_config,
    formatting_func=formatting_func,
)
trainer.train()
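After training, you will likely want to persist the adapter. A minimal sketch using the PEFT save API (the output path is just an example):

# Save only the lightweight LoRA adapter weights alongside the tokenizer
trainer.model.save_pretrained("outputs/gemma-2b-quotes-lora")
tokenizer.save_pretrained("outputs/gemma-2b-quotes-lora")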
5. Test the Finetuned Model
After training finishes, we can examine the fine-tuned Gemma model using a prompt similar to the one we used previously:
text = "Qoute: The best and most beautiful things in"
device = "cuda:0"
inputs = tokenizer(text, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Quote: The best and most beautiful things in the world cannot be seen or even touched - they must be felt with the heart.
Author: Helen Keller
Why PEFT?
Training language models, even relatively small ones, is memory- and compute-intensive. This can be a problem for users who rely on freely available compute platforms like Colab or Kaggle for learning and experimentation. The cost of adapting these models to different domains is also a crucial consideration for enterprise users, which is where parameter-efficient fine-tuning (PEFT) offers a scalable and cost-effective alternative.
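To make the savings concrete, here is a minimal sketch (reusing the base model and lora_config defined earlier) that reports how few parameters LoRA actually trains:

from peft import get_peft_model

# Wrap the base model with LoRA adapters and count trainable parameters
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()
# Typically well under 1% of the total parameters are trainable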
PyTorch on GPU and TPU
Hugging Face transformers support for Gemma models is optimized for both PyTorch and PyTorch/XLA, so TPU and GPU users alike can access and experiment with the models as needed. The FSDP via SPMD integration also brings PyTorch/XLA TPU acceleration to other Hugging Face models.
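As a hedged illustration of what the FSDP via SPMD path looks like on TPU (the exact flags come from the PyTorch/XLA integration and may differ across versions), the training arguments are extended roughly like this:

from transformers import TrainingArguments

# FSDP v2 (via SPMD) settings for PyTorch/XLA on TPU
fsdp_config = {
    "fsdp_transformer_layer_cls_to_wrap": ["GemmaDecoderLayer"],
    "xla": True,
    "xla_fsdp_v2": True,
    "xla_fsdp_grad_ckpt": True,
}

args = TrainingArguments(
    per_device_train_batch_size=1,
    output_dir="outputs",
    dataloader_drop_last=True,  # required for SPMD
    fsdp="full_shard",
    fsdp_config=fsdp_config,
)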
Next Steps
We walked through a basic example, adapted from the source notebook, that demonstrates the LoRA fine-tuning technique for Gemma models. The complete Colab for GPU is available here, and the complete script for TPU can be found here.
Let’s Wrap
In this work, a new family of open language models called Gemma is introduced. Gemma performs well on academic benchmarks for language understanding, safety, and reasoning. Two model sizes (2 billion and 7 billion parameters) are available, along with pretrained and fine-tuned checkpoints. On 11 out of 18 text-based tasks, Gemma outperforms similarly sized open models. The release also includes thorough assessments of the models' safety and responsibility, along with a detailed description of how the models were developed. We believe the responsible release of LLMs is critical both for improving the safety of frontier models and for enabling the next wave of LLM innovations.