Questions concerning configuring train_lora.py for custom corpus

Hi! I followed the instructions for fine-tuning on my corpus and (I think) managed to do so successfully after days of debugging. I have A LOT of implementation questions, and the following is half guide, half questions about the process. I want to make it very clear that I am VERY thankful for the existence of this library, but I feel obligated to point out the issues below.

---
 

I first created a new conda environment (with `conda create -n sat-finetune python=3.9` and `conda install pip` just to be safe) and ran the following:

```
git clone https://github.com/segment-any-text/wtpsplit
cd wtpsplit
pip install -r requirements.txt
pip install adapters==0.2.1 --no-dependencies
cd ..
```
 

Then I created the `.pth` dataset as per this format:

```
import torch

torch.save(
    {
        "language_code": {
            "sentence": {
                "sat-dataset": {
                    "meta": {
                        "train_data": ["train sentence 1", "train sentence 2"],
                    },
                    "data": [
                        "test sentence 1",
                        "test sentence 2",
                    ]
                }
            }
        }
    },
    "<path>/sat-dataset.pth"
)
```
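Before saving, the nested layout can be sanity-checked with a few lines of plain Python. This is only a sketch: `check_sat_dataset` is a hypothetical helper (not part of `wtpsplit`), and it checks nothing beyond the layout shown above.

```python
from typing import Any, Dict, List


def check_sat_dataset(obj: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in the nested dataset dict (empty if none).

    Hypothetical helper: it only checks the layout shown above,
    not everything train_lora.py may expect.
    """
    problems = []
    for lang, by_task in obj.items():
        datasets = by_task.get("sentence", {})
        if not datasets:
            problems.append(f"{lang}: no 'sentence' datasets")
        for name, entry in datasets.items():
            train = entry.get("meta", {}).get("train_data")
            test = entry.get("data")
            for label, sentences in (("meta.train_data", train), ("data", test)):
                if not isinstance(sentences, list) or not sentences:
                    problems.append(f"{lang}/{name}: '{label}' must be a non-empty list")
                elif not all(isinstance(s, str) and s.strip() for s in sentences):
                    problems.append(f"{lang}/{name}: '{label}' contains empty or non-string items")
    return problems


example = {
    "language_code": {
        "sentence": {
            "sat-dataset": {
                "meta": {"train_data": ["train sentence 1", "train sentence 2"]},
                "data": ["test sentence 1", "test sentence 2"],
            }
        }
    }
}
print(check_sat_dataset(example))  # prints [] for the layout above
```

For a file that is already saved, passing the result of `torch.load("<path>/sat-dataset.pth")` to the same function would run the same check.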
 

My config is below:

```
{
 "model_name_or_path": "segment-any-text/sat-3l",
 "output_dir": "sat-3l-LL_lora",
 "block_size": 256,
 "eval_stride": 128,
 "do_train": true,
 "do_eval": true,
 "per_device_train_batch_size": 64,
 "per_device_eval_batch_size": 32,
 "gradient_accumulation_steps": 1,
 "eval_accumulation_steps": 8,
 "dataloader_num_workers": 1,
 "preprocessing_num_workers": 1,
 "learning_rate": 3e-4,
 "fp16": false,
 "num_train_epochs": 30,
 "logging_steps": 50,
 "report_to": "wandb",
 "wandb_project": "sentence",
 "save_steps": 100000000,
 "remove_unused_columns": false,
 "do_sentence_training": true,
 "do_auxiliary_training": false,
 "warmup_ratio": 0.1,
 "non_punctuation_sample_ratio": null,
 "prediction_loss_only": true,
 "use_auxiliary": true,
 "ddp_timeout": 3600,
 "use_subwords": true,
 "custom_punctuation_file": "punctuation_xlmr_unk.txt",
 "log_level": "warning",
 "adapter_config": "lora[r=16,alpha=32,intermediate_lora=True]",
 "text_path": "<path>/sat-dataset.pth",
 "weight_decay": 0.0,
 "auxiliary_remove_prob": 0.0,
 "train_adapter": true
}
```

---
 

The first issue that popped up was that `wtpsplit` wasn't installed. To fix this, I added the path of the `wtpsplit` dir to `train_lora.py`:

```
...
import wandb

# Added this line below
sys.path.insert(0, os.path.abspath('<path>/wtpsplit'))

from wtpsplit.models import SubwordXLMConfig, SubwordXLMForTokenClassification
...
```
 

In hindsight, I think this can be fixed with a `pip install .` in the correct directory, but I wasn't sure.
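One way to tell whether the `sys.path` edit is still needed is to ask Python whether it can locate the package. This is a small sketch using only the standard library; `is_importable` is a name made up for illustration.

```python
import importlib.util


def is_importable(package: str) -> bool:
    """True if Python can locate the top-level package on the current sys.path."""
    return importlib.util.find_spec(package) is not None


# 'wtpsplit' is the package name used in the imports above.
if is_importable("wtpsplit"):
    print("wtpsplit is importable; no sys.path edit needed")
else:
    print("wtpsplit not found; install it into this environment or extend sys.path")
```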


---

After this, I received an outdated CUDA version error, because `requirements.txt` installs torch `1.7.1` by default. I tried upgrading to the version on my kernel with the recommended torch command (`conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia`), but this updates numpy to 2.0, among other things, and causes version errors. Downgrading to `torch=1.13.1+cu117` did not help (from a brief Google search, that version itself is buggy), so I progressively downgraded until `torch=1.10.1+cu111` worked.

 

This made CUDA work but then I got an index error along these lines:
 

```
...
(Thousands of the line below)
/pytorch/aten/src/ATen/native/cuda/Indexing.cu:699: indexSelectLargeIndex: block: [864,0,0], thread: [126,0,0] Assertion `srcIndex < srcSelectDimSize` failed.

...
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
```
 

I believe this is because in `train_lora.py` we add the newline token to the tokenizer:

```
...
tokenizer = AutoTokenizer.from_pretrained(args.base_model)
# needed since we create labels in collate_fn based on tokens
tokenizer.add_special_tokens({"additional_special_tokens": [AddedToken("\n")]})
custom_token_id = tokenizer.convert_tokens_to_ids("\n")
# used later to filter out special tokens
special_tokens_ids = set(tokenizer.all_special_ids)
special_tokens_ids.discard(custom_token_id)
...
```
 

but never update the size of the embedding in the backbone model. As a result, the tokenizer produces the input ID of the newline token (250002), but the embedding matrix is not big enough to accommodate it. I thought this was because I had newlines in my sentences, but even after removing them I still received this error (I also realized later that newlines are added anyway in `prepare_dataset`). To fix this, I added this line:

```
...
special_tokens_ids = set(tokenizer.all_special_ids)
special_tokens_ids.discard(custom_token_id)

# Added this line below
backbone.resize_token_embeddings(len(tokenizer))

if "short" in dataset_name:
 one_sample_per_line = True
...
```
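A cheap guard that would surface this mismatch before anything reaches the GPU is to compare the tokenizer size with the number of embedding rows. The sketch below is plain Python. `check_vocab_fits` is a hypothetical helper, and the row count of 250002 is an assumption that is merely consistent with id 250002 being out of range.

```python
def check_vocab_fits(tokenizer_size: int, embedding_rows: int) -> None:
    """Raise early if the tokenizer can emit ids the embedding matrix cannot index."""
    if tokenizer_size > embedding_rows:
        raise ValueError(
            f"tokenizer has {tokenizer_size} entries but the embedding matrix has only "
            f"{embedding_rows} rows; ids >= {embedding_rows} would index out of range"
        )


# Assumed numbers: id 250002 needs at least 250003 rows.
try:
    check_vocab_fits(tokenizer_size=250003, embedding_rows=250002)
except ValueError as err:
    print(err)

# After resizing, both sizes agree and the check passes silently.
check_vocab_fits(tokenizer_size=250003, embedding_rows=250003)
```

With Hugging Face objects, the two numbers would typically come from `len(tokenizer)` and `backbone.get_input_embeddings().num_embeddings`, assuming `backbone` is a standard `transformers` model.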
 
This led to another error I will explain further below but at this point I had a few questions:

 

**1. Should we have newlines in our train / valid sentences?**

**2. Adding the extra embedding for newline feels like a hack and I am pretty sure the original SaT-3l I used as the base model should have had this in the embedding. Was the wrong model used as the base? And is this new embedding for newline changing enough given we freeze the weights of the original model? Furthermore, is this leaking into the TokenClassifier part of the model?**

 

---
 

After this, I tried running `train_lora.py` but the code was stuck. I did some debugging and it was stuck on this line:

```
...
 **kwargs,
)

# Stuck on this line below
trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)

logger.warning(f"Finished training for {lang} {dataset_name}.")
...
```
 
I did some digging and I think it was because I was running on 4 GPUs and the inter-GPU communication was not working. I tried several things (setting `ddp_backend=gloo` in the `config.json` file, setting `os.environ["NCCL_P2P_DISABLE"]="1"` at the top of `train_lora.py`), but the only thing that fixed it was restricting CUDA to one device by setting `os.environ["CUDA_VISIBLE_DEVICES"]="0"` at the top of `train_lora.py`. From my understanding of the paper and the GitHub issues I have read, the paper's experiments were run on TPUs (on Amazon SageMaker) and 1 GPU for fine-tuning, so this seems like an oversight. I feel like I have to ask the question here:

 

**1. Is there a way to run fine-tuning on multiple GPUs?**
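For anyone reproducing the single-GPU workaround described above, the placement of the line matters: the variable has to be set before CUDA is initialized, so putting it above `import torch` is the safe option. A minimal sketch:

```python
import os

# Must run before CUDA is initialized; placing it above `import torch` is the
# safe option. Once CUDA is initialized, changing it has no effect in this process.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # expose only the first GPU

# ... the script's remaining imports (torch, transformers, ...) would follow here.
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints 0
```

Setting the same variable in the shell when launching the script has the same effect without editing the file.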

 

---
 

After I fixed this, the code ran but I received another error involving infinite values:

 

```
ValueError: Input contains infinity or a value too large for dtype('float64')
```
 

When I went through the traceback, I found the error to be in `evaluate.py`, specifically when `compute_metrics` in `train_lora.py` calls `evaluate_sentence`, which in turn calls `get_metrics` here:

```
newline_probs = char_probs[:, positive_index]

# This line below
metrics, info = get_metrics(newline_labels, newline_probs, threshold=threshold)

info["newline_labels"] = newline_labels
```
 

This is because of this line in `get_metrics`:

```
def get_metrics(labels, preds, threshold: float = 0.01):
 # Compute precision-recall curve and AUC

 # This line below
 precision, recall, thresholds = sklearn.metrics.precision_recall_curve(labels, preds)

 pr_auc = sklearn.metrics.auc(recall, precision)
 ...
```
 

because `preds` contains `-np.inf`. I did some more digging and found it was because we call `token_to_char_probs` here in `evaluate_sentence`:

```
...
if "xlm" in model.config.model_type:
 tokens = tokenizer.tokenize(text, verbose=False)

 # This line below
 char_probs = token_to_char_probs(text, tokens, logits, tokenizer, offsets_mapping)

else:
 ....
```
 

This is because `token_to_char_probs` in `utils/__init__.py` initializes the return array `char_probs` as `-np.inf` here:

```
def token_to_char_probs(text, tokens, token_logits, tokenizer, offsets_mapping):
 """Map from token probabalities to character probabilities"""
 
 # This line below
 char_probs = np.full((len(text), token_logits.shape[1]), -np.inf) # Initialize with very low numbers
 ...
```
 
We then only replace the rows whose corresponding character is the last character of a non-special token:

```
...
# Assign the token's probability to the last character of the token
for i in range(valid_offsets.shape[0]):
 start, end = valid_offsets[i]

 # This line below
 char_probs[end - 1] = token_logits[valid_indices[i]]

...
```
 

We are left with a lot of `-np.inf` in the first column when we call `get_metrics`:

```
...
# This line below
newline_probs = char_probs[:, positive_index]

metrics, info = get_metrics(newline_labels, newline_probs, threshold=threshold)
...
```
 

and `sklearn.metrics.precision_recall_curve` really does not like that. To fix this, I changed the offending line in `get_metrics` to pass `sigmoid(preds)`, as is already done in the call to `f1_score` in the same function:

```
def get_metrics(labels, preds, threshold: float = 0.01):
 # Compute precision-recall curve and AUC

 # I changed this line below
 precision, recall, thresholds = sklearn.metrics.precision_recall_curve(labels, sigmoid(preds))
 
 ....
 # Compute F1 score for a specific threshold (e.g., 0.01 after applying sigmoid)
 f1_at_specific_threshold = sklearn.metrics.f1_score(labels, sigmoid(preds) > threshold)
 ...
```
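The chain from the `-np.inf` initialization to the `sigmoid` fix can be reproduced on a toy example. The sketch below uses plain Python lists for a single column. `last_char_scores` and this `sigmoid` are stand-ins written for illustration (not the library's functions), and the tokens and logits are made up.

```python
import math
from typing import List, Sequence, Tuple


def last_char_scores(text_len: int, offsets: Sequence[Tuple[int, int]],
                     token_logits: Sequence[float]) -> List[float]:
    """Toy, single-column version of the fill pattern: every character starts at -inf
    and only the last character of each token receives that token's logit."""
    scores = [-math.inf] * text_len
    for (_start, end), logit in zip(offsets, token_logits):
        scores[end - 1] = logit
    return scores


def sigmoid(x: float) -> float:
    """Logistic function written so that +inf and -inf are handled without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # exp(-inf) == 0.0
    return z / (1.0 + z)


# "Hi you." split into three made-up tokens with made-up logits.
offsets = [(0, 2), (2, 6), (6, 7)]
logits = [-3.0, -2.0, 4.0]
scores = last_char_scores(7, offsets, logits)
probs = [sigmoid(s) for s in scores]

print(scores.count(-math.inf))               # 4 characters never receive a logit
print(all(math.isfinite(p) for p in probs))  # True: -inf maps to exactly 0.0
```

Since `-inf` maps to a probability of exactly 0.0, the untouched characters rank below every real prediction, which is presumably why the curve computation accepts the transformed scores.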
 

With all of these changes, running the initial command with `train_lora.py` and `config.json` worked. The results look OK but I have another few questions at this point:

 


**1. Is the auxiliary objective function necessary for fine-tuning? I tried setting `"use_auxiliary": false` in the `config.json` file but ran into warnings while training, and it errored while loading with `SaT`.**

**2. What is the difference between `one_sample_per_line=True` and `one_sample_per_line=False`? Does it actually make a difference on training results?**

**3. Is the optimal threshold returned by `compute_metrics` really the best threshold in your empirical experience?**

**4. Is corruption mandatory / desirable for validation? From my understanding, the default LoRA configs do not corrupt training samples, but `evaluate_sentence` has this line:**

```
separator = Constants.SEPARATORS[lang_code]

# This line below
sentences = [corrupt(sentence, do_lowercase, do_remove_punct) for sentence in sentences]
text = separator.join(sentences)
```
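To make question 4 concrete, here is a hypothetical stand-in with the same signature as `corrupt`. It is not the library's implementation, whose behaviour may differ. It only illustrates that with both flags set to false, a function of this shape would leave validation sentences unchanged.

```python
import string


def corrupt_sketch(sentence: str, do_lowercase: bool, do_remove_punct: bool) -> str:
    """Hypothetical stand-in for `corrupt`; the library's real behaviour may differ."""
    if do_lowercase:
        sentence = sentence.lower()
    if do_remove_punct:
        # Only ASCII punctuation is stripped here, which is a simplification.
        sentence = sentence.translate(str.maketrans("", "", string.punctuation))
    return sentence


print(corrupt_sketch("Hello, World!", do_lowercase=True, do_remove_punct=True))    # hello world
print(corrupt_sketch("Hello, World!", do_lowercase=False, do_remove_punct=False))  # Hello, World!
```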

 

Once again, thank you for reading this and I hope you can answer my questions / concerns.
