I haven’t tried it yet, but will give it a shot.
@Emrebel have you tried nanoGPT???
Whoa @draglord I just did a few-minute training run (max_iters=100 only haha) on the openwebtext dataset (an open-source clone of OpenAI's WebText, the data GPT-2 was trained on) with init_from=gpt2 (the smallest) and also gpt2-xl, and asked "What is OpenAI", "What is your name", and the sample prompt "What is the answer to life, everything"...
It output rambling but kind of understandable text for the latter 2 prompts, but for "What is OpenAI" it actually spit out a somewhat relevant response!
Will try running for a bit longer and ask it some stuff LOL
Note, this is on CPU (I have a 32-core Threadripper and 128GB RAM) with just an RX580 8GB GPU, so I will try running on the GPU too and see if it speeds up, even though its memory is much less. Also, my 1Gbit FTTH connection helps with fast downloads of the model files (around 6.5GB for gpt2-xl; it pulls down some PyTorch .bin checkpoint files).
Will try larger models (gpt2-xl) with slightly higher max_iters values and see what I get.
This is really intriguing stuff.
@draglord can I ask what your dataset (300GB) is, and which config .py file you are using?
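For anyone else who wants to reproduce this kind of quick smoke test, here is roughly what my run looks like written as a nanoGPT-style config override file (just a sketch of my setup: the filename and the exact numbers are my own choices, and it assumes the stock train.py / configurator.py mechanism where the top-level variables can be overridden):

```python
# quick_cpu_smoketest.py -- hypothetical nanoGPT config override file
# used on the command line, e.g.: python train.py quick_cpu_smoketest.py

init_from = 'gpt2'       # start from pretrained GPT-2 weights ('gpt2-xl' for the big one)
dataset = 'openwebtext'  # expects the pre-tokenized train.bin/val.bin under data/openwebtext

# keep the run tiny -- this is a smoke test, not real finetuning
max_iters = 100
eval_interval = 50
eval_iters = 20

# CPU-only settings
device = 'cpu'
dtype = 'float32'        # plain fp32 on CPU
compile = False          # torch.compile generally doesn't help here

# shrink the memory footprint
batch_size = 4
block_size = 256
gradient_accumulation_steps = 1
```

Sampling afterwards is just `python sample.py` pointed at the output directory (init_from='resume') with the prompt in the `start` variable, if I remember the script correctly.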
I didn't try openwebtext yet as I don't have enough RAM.
Tokenizing the data is a RAM-intensive process; training is GPU-intensive. Have you tried those params (reducing batch_size, block_size, and that third one, gradient_accumulation_steps) yet? I think you might be able to run the openwebtext dataset on your existing dual GPUs and try running queries against it. See the sketch below for the kind of values I mean.
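To make that concrete, the overrides I have in mind look roughly like this (illustrative guesses only, not tuned values; the effective batch size stays at batch_size × gradient_accumulation_steps, while peak memory mostly tracks batch_size × block_size):

```python
# memory-friendly overrides for a small GPU -- illustrative guesses, not tuned values
batch_size = 4                    # micro-batch that has to fit in VRAM
block_size = 512                  # shorter context window = much less activation memory
gradient_accumulation_steps = 8   # keep an effective batch of 4 * 8 = 32
                                  # by accumulating gradients over 8 micro-steps
```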
iter 50: loss 3.2720, time 42.04ms, mfu 0.06%
Half-precision float training is supposed to deliver quality close to full-precision models while having lower training and inference times. @vishalrao I changed the dtype from bfloat16 to float32 and the training time dropped from 60 ms to 20 ms per iteration. But I think changing the dtype takes a hit on the quality of the model. I should also let you know I got an out-of-memory error when changing the dtype, so I reduced the batch and block sizes as well.
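For context on what that dtype setting actually does: as far as I remember train.py, it only picks which precision the autocast context (and the GradScaler) uses, roughly like the sketch below, so treat the exact names as my paraphrase rather than the real code.

```python
import torch
from contextlib import nullcontext

device_type = 'cuda'   # 'cpu' disables autocast entirely
dtype = 'bfloat16'     # one of 'float32' | 'bfloat16' | 'float16'

# map the string to a torch dtype and build the mixed-precision context
ptdtype = {'float32': torch.float32,
           'bfloat16': torch.bfloat16,
           'float16': torch.float16}[dtype]
ctx = nullcontext() if device_type == 'cpu' else \
      torch.amp.autocast(device_type=device_type, dtype=ptdtype)

# a GradScaler is only really needed for float16 (it's a no-op when disabled)
scaler = torch.cuda.amp.GradScaler(enabled=(dtype == 'float16'))

# inside the training loop the forward/backward pass then looks like:
# with ctx:
#     logits, loss = model(X, Y)
# scaler.scale(loss).backward()
```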
I am not really sure how you get lower training time on float32.
Yes, that could be a possible reason. I will check if consumer-grade GPUs can support FP16. I think the lower training time on FP32 is relative to BF16 (not FP16), and bfloat16 is maybe not available on consumer gaming GPU hardware (even though they have tensor cores) but only on datacenter GPUs?
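A quick way to check is to just ask PyTorch directly; I believe nanoGPT's train.py picks its default dtype with essentially this test:

```python
import torch

# bfloat16 compute needs an Ampere-or-newer NVIDIA GPU; that includes
# consumer RTX 30xx/40xx cards, not just datacenter parts
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    dtype = 'bfloat16'
else:
    # fp16 works on older CUDA GPUs too, but wants a GradScaler to avoid underflow
    dtype = 'float16'

print(f"using dtype: {dtype}")
```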