I'm training a model with a Hugging Face Trainer and specified the checkpoint folder via the resume_from_checkpoint parameter. However, when training continues, it still saves checkpoints with names corresponding to the first save steps (e.g. checkpoint-4, even though resuming from checkpoint-4096 should continue the step count from there). The progress bar also shows all of max_steps, even though I don't want training to start from the beginning.
Is this a common problem? How do I fix this?
I save my training arguments in a yaml file:
```yaml
training_args:
  learning_rate: !!float 1e-4
  do_train: true
  per_device_train_batch_size: 8
  per_device_eval_batch_size: 8
  logging_steps: 1024
  output_dir: /path/to/training_output/
  overwrite_output_dir: False
  remove_unused_columns: False
  save_strategy: steps
  evaluation_strategy: steps
  save_steps: 1024
  load_best_model_at_end: True
  warmup_steps: 100
  max_steps: 65536
  seed: 22
  resume_from_checkpoint: /path/to/checkpoint-4096
```

I then train the model by initialising a TrainingArguments object with these as **kwargs.
But the terminal displays:
```
Saving model checkpoint to /path/to/checkpoint-4
```

And the progress bar shows all the steps, even though I need it to start from step 4096.
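As I understand it, the Trainer derives the folder name from its global step counter (`checkpoint-{global_step}` under output_dir, via `PREFIX_CHECKPOINT_DIR`), so a name like checkpoint-4 suggests the counter restarted from zero rather than resuming at 4096. A sketch of that naming scheme (the helper function is mine, not transformers code, which joins paths with os.path.join):

```python
def checkpoint_dir(output_dir: str, global_step: int) -> str:
    # Mirrors the Trainer's naming scheme: PREFIX_CHECKPOINT_DIR is "checkpoint",
    # and the suffix is the current global step, not the step of the loaded checkpoint.
    return f"{output_dir}/checkpoint-{global_step}"

# A resumed run saving every 1024 steps after checkpoint-4096 should produce:
print(checkpoint_dir("/path/to/training_output", 4096 + 1024))
# whereas a run whose counter restarted near zero produces names like:
print(checkpoint_dir("/path/to/training_output", 4))
```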