Imports and Setup
Libraries: The code imports the required libraries: datasets, transformers, peft, trl, and torch.
Logging: Sets up logging to track the training process.
import sys
import logging
import datasets
from datasets import load_dataset
from peft import LoraConfig
import torch
import transformers
from trl import SFTTrainer
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments, BitsAndBytesConfig
# Logging setup (you can customize this as needed)
logging.basicConfig(
    format="%(asctime)s - %(levelname)s - %(name)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger(__name__)
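If you also want the Hugging Face libraries to log at a matching verbosity, an optional addition like the following works (this sketch assumes INFO level; adjust as needed):
# Optional: align the datasets/transformers loggers with the same level
log_level = logging.INFO
logger.setLevel(log_level)
datasets.utils.logging.set_verbosity(log_level)
transformers.utils.logging.set_verbosity(log_level)
transformers.utils.logging.enable_default_handler()
transformers.utils.logging.enable_explicit_format()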
Hyperparameters
Hyperparameters: Defines two dictionaries, training_config and peft_config, to store the hyperparameters for training and for PEFT (Parameter-Efficient Fine-Tuning), respectively.
Training Arguments: Creates a TrainingArguments object from the training_config dictionary.
PEFT Configuration: Creates a LoraConfig object from the peft_config dictionary, specifying the LoRA (Low-Rank Adaptation) settings for efficient fine-tuning.
# Training hyperparameters
training_config = {
    "bf16": True,  # Use mixed precision
    "do_eval": False,
    "learning_rate": 5.0e-06,
    "log_level": "info",
    "logging_steps": 20,
    "logging_strategy": "steps",
    "lr_scheduler_type": "cosine",
    "num_train_epochs": 1,
    "max_steps": -1,
    "output_dir": "./checkpoint_dir",
    "overwrite_output_dir": True,
    "per_device_eval_batch_size": 4,
    "per_device_train_batch_size": 4,
    "remove_unused_columns": True,
    "save_steps": 100,
    "save_total_limit": 1,
    "seed": 0,
    "gradient_checkpointing": True,
    "gradient_checkpointing_kwargs": {"use_reentrant": False},
    "gradient_accumulation_steps": 1,
    "warmup_ratio": 0.2,
}
# PEFT (LoRA) configuration
peft_config = {
    "r": 16,  # LoRA rank
    "lora_alpha": 32,  # LoRA scaling factor
    "lora_dropout": 0.05,
    "bias": "none",
    "task_type": "CAUSAL_LM",
    "target_modules": "all-linear",  # apply LoRA to all linear layers
    "modules_to_save": None,
}
# Create TrainingArguments and LoraConfig objects
train_conf = TrainingArguments(**training_config)
peft_conf = LoraConfig(**peft_config)
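As a quick sanity check on these values, the effective global batch size is the per-device batch size times the gradient accumulation steps times the number of processes; the short sketch below assumes a single GPU.
# Effective global batch size (illustrative; world_size depends on how training is launched)
world_size = 1
effective_batch_size = (
    training_config["per_device_train_batch_size"]
    * training_config["gradient_accumulation_steps"]
    * world_size
)
print(f"Effective train batch size: {effective_batch_size}")  # 4 with the values above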
Model and Tokenizer Loading
Checkpoint Path: Specifies the path to the pre-trained model, here “microsoft/Phi-3-mini-4k-instruct”.
Model Arguments: Defines model_kwargs with settings such as use_cache, torch_dtype (bfloat16 for mixed precision), and the attention implementation (flash_attention_2).
Model and Tokenizer Loading: Loads the pre-trained model using AutoModelForCausalLM.from_pretrained and the tokenizer using AutoTokenizer.from_pretrained.
Tokenizer Configuration: Sets the maximum sequence length, the padding token and its ID, and the padding side for the tokenizer.
# Model checkpoint to fine-tune
checkpoint_path = "microsoft/Phi-3-mini-4k-instruct" # Or other Phi-3 model
# Model loading arguments
model_kwargs = dict(
    use_cache=False,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",  # Flash Attention support
    torch_dtype=torch.bfloat16,
    device_map=None,
)
# Load the model and tokenizer
model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_path)
# Tokenizer configuration
tokenizer.model_max_length = 2048 # Set maximum sequence length
tokenizer.pad_token = tokenizer.unk_token # Use unk as padding token
tokenizer.pad_token_id = tokenizer.convert_tokens_to_ids(tokenizer.pad_token)
tokenizer.padding_side = 'right'
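Note that flash_attention_2 requires the flash-attn package and a supported NVIDIA GPU (Ampere or newer). If that is not available in your environment, falling back to another attention implementation should work, at some cost in speed and memory; the commented lines below are an optional sketch.
# Optional fallback if flash-attn is not installed or unsupported on your GPU:
# model_kwargs["attn_implementation"] = "eager"  # or "sdpa"
# model = AutoModelForCausalLM.from_pretrained(checkpoint_path, **model_kwargs)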
Data Processing
apply_chat_template Function: This function preprocesses each example by:
Adding an empty system message at the beginning if none exists.
Applying the tokenizer's chat template to format the conversation into a single text string.
def apply_chat_template(example, tokenizer):
    messages = example["messages"]
    # Add an empty system message if there is none
    if messages[0]["role"] != "system":
        messages.insert(0, {"role": "system", "content": ""})
    example["text"] = tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=False
    )
    return example
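To see what the template produces, you can run the function on a toy example before mapping it over the full dataset. The message structure below is hand-written to mirror the ultrachat_200k schema; the exact output depends on the tokenizer's chat template.
# Illustrative check on a single hand-written example
toy_example = {
    "messages": [
        {"role": "user", "content": "What is LoRA?"},
        {"role": "assistant", "content": "A parameter-efficient fine-tuning method."},
    ]
}
print(apply_chat_template(toy_example, tokenizer)["text"])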
Data Loading and Processing
Dataset Loading: Loads the “HuggingFaceH4/ultrachat_200k” dataset using the datasets library.
Dataset Splitting: Extracts the train_sft and test_sft splits for training and evaluation.
Data Processing: Applies the apply_chat_template function to both the training and test datasets using the map function, preparing the data for the chat-based fine-tuning task.
# Load the dataset
raw_dataset = load_dataset("HuggingFaceH4/ultrachat_200k")
# Extract train and test splits
train_dataset = raw_dataset["train_sft"]
test_dataset = raw_dataset["test_sft"]
column_names = list(train_dataset.features) # Get column names
# Process the datasets using the apply_chat_template function
processed_train_dataset = train_dataset.map(
    apply_chat_template,
    fn_kwargs={"tokenizer": tokenizer},
    num_proc=10,
    remove_columns=column_names,
    desc="Applying chat template to train_sft",
)
processed_test_dataset = test_dataset.map(
    apply_chat_template,
    fn_kwargs={"tokenizer": tokenizer},
    num_proc=10,
    remove_columns=column_names,
    desc="Applying chat template to test_sft",
)
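Since ultrachat_200k is large, it can be worth doing a quick smoke test on a small subset before launching a full run; the commented lines below are one way to do that (the variable names and subset sizes are purely illustrative).
# Optional: sub-sample for a quick smoke test before a full training run
# small_train_dataset = processed_train_dataset.shuffle(seed=0).select(range(2000))
# small_test_dataset = processed_test_dataset.select(range(200))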
Training
Trainer Initialization: Creates an SFTTrainer object with the following arguments:
model: The loaded pre-trained model.
args: The TrainingArguments object.
peft_config: The LoraConfig object with the LoRA settings.
train_dataset and eval_dataset: The processed training and evaluation datasets.
Other arguments such as max_seq_length, dataset_text_field, tokenizer, and packing.
Training Execution: Starts the training process using trainer.train().
Metrics Logging and Saving: Logs and saves the training metrics.
Saving Trainer State: Saves the trainer state for potential resuming or further analysis.
# Initialize the SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=train_conf,
    peft_config=peft_conf,
    train_dataset=processed_train_dataset,
    eval_dataset=processed_test_dataset,
    max_seq_length=2048,
    dataset_text_field="text",
    tokenizer=tokenizer,
    packing=True,
)
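# Optional sanity check: because peft_config was passed, SFTTrainer wraps the base
# model in a PeftModel, so only the LoRA adapter weights should show up as trainable.
trainer.model.print_trainable_parameters()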
# Train the model
train_result = trainer.train()
# Log and save training metrics
metrics = train_result.metrics
trainer.log_metrics("train", metrics)
trainer.save_metrics("train", metrics)
trainer.save_state()
Evaluation
Tokenizer Adjustment: Changes the tokenizer padding side to 'left' for evaluation.
Evaluation: Runs the evaluation using trainer.evaluate() and obtains the evaluation metrics.
Metrics Logging and Saving: Logs and saves the evaluation metrics.
# Adjust tokenizer padding side for evaluation
tokenizer.padding_side = 'left'
# Evaluate the model
metrics = trainer.evaluate()
# Log and save evaluation metrics
metrics["eval_samples"] = len(processed_test_dataset)
trainer.log_metrics("eval", metrics)
trainer.save_metrics("eval", metrics)
Save the Fine-Tuned Model
Saving Fine-Tuned Model: Saves the fine-tuned model to the specified output directory using trainer.save_model().
# Save the fine-tuned model
trainer.save_model(train_conf.output_dir)
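With a PEFT setup, trainer.save_model() typically stores only the LoRA adapter weights and their config rather than the full model. A rough sketch for reloading the adapter later, and optionally merging it into the base model, might look like this (the merged output path is illustrative):
# Reload the saved adapter and optionally merge it into the base weights
from peft import AutoPeftModelForCausalLM

tuned_model = AutoPeftModelForCausalLM.from_pretrained(
    train_conf.output_dir, torch_dtype=torch.bfloat16
)
merged_model = tuned_model.merge_and_unload()  # folds LoRA weights into the base model
merged_model.save_pretrained("./merged_model")
tokenizer.save_pretrained("./merged_model")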