Need help with PyTorch... cannot resume training

Hi

I'm working with nanoGPT, a repo that lets you build a small version of ChatGPT, and I have run into a problem.


I have to train it on a large amount of data, but since I don't have enough RAM, I need to figure out a way to train the model on a subset of the data first, then move on to the next file.

What I have so far is multiple input files and a script that trains the model. I also have a bash script that runs the training script in a loop, feeding it the input files one by one. But when I resume training, it just doesn't happen. No errors.
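Roughly, the bash loop looks like this (simplified from memory; the file names and the --data_file flag are just placeholders for however I'm actually feeding each file in, since the stock train.py doesn't take a data file argument):

for f in part1.bin part2.bin part3.bin; do
    # one training pass per input file (placeholder flag, my script passes the file its own way)
    python train.py config/train_shakespeare_char.py --data_file="$f"
done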

I'm new to ML and don't fully understand how the script works. The max iterations value is 5000, and training resumes at 5000 steps. Is it because 5000 steps have already been completed that training doesn't resume?

Any input will be helpful

Thanks
 
Have you read the main README of the GitHub repo??? I just quickly skimmed through it and there is a section marked "I only have a macbook" (or other cheap computer) in bold lol.

It says to set the device=cpu flag and use a nightly build of PyTorch or whatever.

It also lists the full set of options you can specify, like max iterations, and the suggested step/iteration values are low numbers.

Good luck.
...

BTW what are your hardware specs?
 
I have an Athlon 3000G CPU, 8 GB of RAM, and 2x 3080 Ti (64 GB of RAM arriving tomorrow).

Yes, I can run the script with a single file, both with and without the cpu flag. But what I want is for it to run with multiple inputs without my intervention.

Is it too much to ask for you to look at the train.py file? Maybe you'll be able to figure out how I can run it with multiple inputs.

Thanks
 
This is just a curiosity to me, I'm not versed in ML either lol, but I'll look at the repo and the script to see if there's anything glaringly obvious...

@draglord so in train.py I see an "init_from" param defaulting to "scratch"... have you tried setting it to "resume" and seeing if that works? Just a shot in the dark from me.

You could post what you are already doing (what your dataset is, what command line you are running to try to resume your training, etc.).
...

@draglord - So maybe leave "init_from" at its default (scratch or gptXXX, whatever) for the FIRST run with "always_save_checkpoint" enabled, then in your subsequent resumptions set "init_from=resume"?
 
Things I tried:

1) Pulling the list of input files in Python, and then looping over the portion of the code after the data has been read (the train = np.memmap part) to do the training for each file, setting "init_from" to scratch in the first pass and then resume in the subsequent passes. What happens is, training goes fine in the first loop, then it doesn't train in the second (and third...) loop. No errors, just that training starts from iteration 5000, and the max iteration argument is already set to 5000 (see my rough sketch of the loop at the end of this post).

2) Wrote a bash script that reads all the input files, then executes train.py in a loop, passing those files as arguments. One thing is I haven't made the "init_from" parameter a variable, so it's either "scratch" or "resume" for every run. If I keep it as scratch, training happens for every file, but it's pointless since the previous results are discarded. If I keep it as resume, I get the same result as above.

Checkpoints are saved every 250 iterations within the 5000-iteration run.
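Here's my rough mental model of why the resumed run does nothing; this is a simplified sketch of what I think the loop in train.py boils down to, not the actual code:

# simplified illustration (not the real train.py): with init_from='resume',
# iter_num is restored from the checkpoint, so it starts at 5000
iter_num = 5000      # loaded from ckpt.pt
max_iters = 5000     # from the config

while True:
    # ... evaluate, maybe save a checkpoint, do one forward/backward/step ...
    iter_num += 1
    if iter_num > max_iters:
        break        # true almost immediately, so effectively no new training happens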
 
If I'm following your comment correctly ("So it's either scratch or resume, always")... have you tried your 2nd approach, just modifying your bash script that reads all the input files so that the FIRST one sets (or leaves) init_from=scratch by default, and then from the 2nd input file onwards it sets init_from=resume? Did you try it that way? (Or am I missing something?)
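Something like this, I mean (untested sketch; the file names and the way you pass each file to train.py are placeholders for whatever your script already does):

first=1
for f in part1.bin part2.bin part3.bin; do
    if [ "$first" -eq 1 ]; then
        opts="--init_from=scratch"
        first=0
    else
        opts="--init_from=resume"
    fi
    echo "training on $f"   # pass "$f" to train.py however you're already doing it
    python train.py config/train_shakespeare_char.py $opts --always_save_checkpoint=True
done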
 
I didn't try this yet, but I have tried something else.

I ran train.py the first time manually, so that I would have a checkpoint file, and then automated it for the subsequent files. That way, if I have the option set to "resume", it won't be affected by a missing checkpoint file.

I haven't integrated this with the bash script yet because I wanted to see if it works first, and it doesn't.
 
Can you paste your command line for a single run of train.py here? Also, can you attach the train.py file you are using? Are you using the latest master branch of the repo? There is a param "always_save_checkpoint = True", do you really have that set to True?
 
Command line run

python train.py config/train_shakespeare_char.py

Yes, using the latest master branch. And no, "always_save_checkpoint" wasn't enabled; trying with it set to True right now.
Just tried it out with "always_save_checkpoint", to no avail.
 

Attachments

  • train_single_run.txt (15.8 KB)
Actually, I don't know how you split your dataset into "multiple files". Have you tried leaving the dataset as it is and running the commands as follows in the script? Also, don't modify the train.py and config/train.......py files themselves; rather, override the options on the command line...

python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0 --always_save_checkpoint=True

then, in a loop until you decide to stop (Ctrl+C), do:

while :; do
    python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0 --always_save_checkpoint=True --init_from=resume
done
 
I split the dataset in prepare.py. There's an option to work with batch data, but I used a streaming dataset and created 3 files from there.
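What I did in prepare.py was roughly this (simplified; in the real script the ids come from encoding the streamed text, here I just fake them so the snippet stands alone):

import numpy as np

# stand-in for the full encoded dataset (uint16 token ids, like prepare.py writes)
ids = np.arange(30_000, dtype=np.uint16)

# split into 3 roughly equal chunks and write each one as its own .bin file
for i, chunk in enumerate(np.array_split(ids, 3)):
    chunk.astype(np.uint16).tofile(f'train_part{i}.bin')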

And yes, it's a good idea to pass the arguments on the command line instead of editing the config file.
 
OK, let me take a step back (for my understanding)... Can you NOT split the dataset that prepare.py generates? So basically run the 3 commands one by one, not in a loop/script?

Try TWO approaches: a simple first one to get some output, then a second one using init_from=resume to get better output...

So for the first approach, just run these 3 commands:

$ python data/shakespeare_char/prepare.py

then

$ python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0

then

$ python sample.py --out_dir=out-shakespeare-char --device=cpu

then see what the output is?

NEXT:

Second approach - to get better training and sample output - do the following:

(You have already generated the unsplit dataset in the first approach, so skip the prepare.py step.)

$ python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0 --always_save_checkpoint=True --init_from=scratch

Then, in a script, run the following command line, maybe repeated 3 times (later you can run more) in a loop:

$ python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0 --always_save_checkpoint=True --init_from=resume
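For example, a quick loop like this (just a sketch):

$ for i in 1 2 3; do python train.py config/train_shakespeare_char.py --device=cpu --compile=False --eval_iters=20 --log_interval=1 --block_size=64 --batch_size=12 --n_layer=4 --n_head=4 --n_embd=128 --max_iters=2000 --lr_decay_iters=2000 --dropout=0.0 --always_save_checkpoint=True --init_from=resume; done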

Finally, run sample.py again to see if it gives better output:

$ python sample.py --out_dir=out-shakespeare-char --device=cpu

Beyond this I have run out of ideas LOL, maybe someone actually experienced with PyTorch and ML can chime in here in this thread...
...

@draglord ^^^
 
Hi

I can already go through with the first approach. That works fine and I can get some output.

I have also gone through the second approach. The training process just never starts.

I think it has something to do with the steps. Since max iterations is 5000, training never proceeds past that.

Thank you for your help. Your effort means a lot. I'll try posting this on the PyTorch forums.
 
I'm trying the nanoGPT quickstart now... will let you know if the second approach works for me or not (if I am able to get nanoGPT set up and running)...
 
So I was able to run nanoGPT... I noticed I don't need to lower "max_iters" to 2000 or 5000; I left it at the 600000 default...

When I run the train.py command (second approach, with init_from=resume), the "iter_num" keeps increasing and a checkpoint gets saved every 250 iterations... if I CTRL+C to cancel the script and then run sample.py, I get some output.

Then when I run train.py again with init_from=resume, the iter_num resumes from the last count and keeps running with regular checkpoints; then I can keep running sample.py again and again...

Question: How do I know the training/model has improved when I resume the train.py runs?
...

@draglord ^^^
 
I guess you'll have to see if the output is better than what it was before.

Also, are you running this on CPU or GPU? I think the config parameters you pass have a default of 5000 iterations.
 
...

So I noticed that as I keep running the train.py script (and the iteration count grows), the ckpt.pt checkpoint file keeps getting saved every 250 iterations and its size stays at 9.7 MB.

Also, if I cancel (CTRL+C) train.py and run sample.py, the output is always around 5.3 KB, but the contents change depending on how far along train.py has run (the iteration count increases slowly toward the default max_iters=600000)...
...

@draglord LOL I don't understand the Shakespearean lingo in the sample.py output :) all I can see is that it changes after each resumption of train.py - I don't know if the quality is better...

I am running on CPU only.
 
@draglord how long do 5000 iterations take to run on your system? Mine completes in under 3 minutes.

Anyway, I guess the approach is to run the train.py command explicitly setting max_iters=600000 (some high number) and lr_decay_iters=600000 (the comments say these 2 values should be the same), let the script run the first time with init_from=scratch, then CTRL+C whenever you need to, and if you want to resume, set --init_from=resume.
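So concretely, something like this (just a sketch, reusing the flags from earlier in the thread; drop --device=cpu if you're running on your GPUs):

$ python train.py config/train_shakespeare_char.py --device=cpu --compile=False --always_save_checkpoint=True --max_iters=600000 --lr_decay_iters=600000 --init_from=scratch

then CTRL+C whenever you want, and to pick up where the checkpoint left off:

$ python train.py config/train_shakespeare_char.py --device=cpu --compile=False --always_save_checkpoint=True --max_iters=600000 --lr_decay_iters=600000 --init_from=resume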
 