How to launch a big cloud VM with GPUs to run AI workloads?

vishalrao

Does anyone here use cloud VMs with GPUs to run AI workloads?

I'm trying to run the new Facebook/Meta LLM, Llama 3.1, specifically the 405b model, using ollama.com, which in GPU mode needs at least 230 GB of total GPU memory.

I have been trying to launch VMs with various cloud providers - Linode, TensorDock and the big 3 (AWS, GCP and Azure) - but in my personal accounts (not corporate), and these mofos are making a fuss by forcing me to request quota increases before I'm allowed to launch such VMs (with many vCPUs like 48 or 64, lots of RAM, and multiple GPUs to meet the VRAM requirement)...

Any suggestions? I just want to launch such a VM for a couple of hours to run the Ollama docker image with the Llama 3.1 model, then delete it...
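For context, this is roughly what I want to run once the VM is up - just the standard Ollama docker image per their docs (a sketch; it assumes the NVIDIA container toolkit is already installed on the VM, and the model tag is whatever ollama.com currently publishes):

  # start the Ollama server container with all GPUs visible
  docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

  # pull and run the 405b model inside the container (this is the ~230 GB part)
  docker exec -it ollama ollama run llama3.1:405b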
 
Any particular reason you're not using the smaller variants? They are surprisingly good. Alternatively, you might look into quantizing the models to Q5_K_M in llama.cpp for minimal loss in quality.
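If you do go the quantisation route, the llama.cpp flow is roughly this - a sketch, assuming you've built llama.cpp and have the HF weights locally; the script and binary names have changed across llama.cpp versions, so check your checkout:

  # convert the HuggingFace checkpoint to a GGUF file
  python convert_hf_to_gguf.py /path/to/Meta-Llama-3.1-70B-Instruct --outfile llama-3.1-70b-f16.gguf

  # quantise to Q5_K_M
  ./llama-quantize llama-3.1-70b-f16.gguf llama-3.1-70b-Q5_K_M.gguf Q5_K_M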
 
No reason, just wanted to experience it for kicks aka shits and giggles lol.

Also to compare the speed difference on various hardware configs for the different model sizes, and the quality of responses for my application use case (text-to-SQL), to see whether it's worth running the full-fat 405b model over the more reasonable 70b one.
 
Check out sqlcoder by defog. It's available on huggingface. They fine-tune llama for text to SQL. We benchmarked the codellama finetune and found the performance on par with GPT4 (with extensive pre and post processing). They have a newer finetune based on llama3 which is supposed to be even better.
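For the comparison itself you can just hit the Ollama HTTP API and swap the model tag between runs - something like this (a rough sketch; the model tag and the schema in the prompt are only placeholders):

  curl http://localhost:11434/api/generate -d '{
    "model": "llama3.1:70b",
    "prompt": "### Schema:\nCREATE TABLE orders (id INT, customer_id INT, total NUMERIC);\n### Task: write SQL for total revenue per customer",
    "stream": false
  }'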
 
btw I just tried the 8b model (5 GB memory requirement) on my Zen 4 laptop and was able to run both CPU and GPU modes... the laptop iGPU (the 780M on the Ryzen 7840U) seems to have doubled the performance (tokens per minute). If an iGPU makes this much of a difference, now I'm even more eager to run it on a proper GPU :D
 
8B model with 5GB memory sounds like a pretty heavy quantisation. Do check the perplexity numbers to see how much you're losing in model quality.
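llama.cpp ships a perplexity tool for exactly that - roughly like this (a sketch; the binary name and the wikitext test file depend on your build and download):

  # lower perplexity = closer to the full-precision model on the same text
  ./llama-perplexity -m llama-3.1-8b-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw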
 
Wheeeee !!! Linode and TensorDock FTW !!!

(attached screenshot: ollama.png)
 
I think I got what I needed. Scratched my itch with this topic for now.

Just wanted to get a feel for the differences between running a model fully on CPU vs GPU vs mixed CPU+GPU - not in like "laboratory conditions" or anything.

So here's what I observed (I know I could have just searched online and let other people tell me, but I wanted to try it for myself):

Running fully on GPU (single or multiple) is at least an order of magnitude (~10x) faster than running only on CPU, and mixed CPU+GPU gets you roughly a 2x speed-up over CPU-only.

Of course, like I said this was not an apples-to-apples or lab-conditions comparison because I couldn't get exactly matching VM specs for the various runs.

I think I mentioned I ran the docker image of ollama.com with the facebook/meta llama 3.1 8b, 70b and 405b models.
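For anyone who wants to poke at the mixed CPU+GPU case: the split is basically how many layers you offload to the GPU. Rough sketch of controlling it, either via the Ollama API or the llama.cpp flag (the layer count is just an example, not a recommendation):

  # Ollama API: offload only ~20 layers to the GPU, the rest stays on the CPU
  curl http://localhost:11434/api/generate -d '{
    "model": "llama3.1:70b",
    "prompt": "hello",
    "options": { "num_gpu": 20 },
    "stream": false
  }'

  # llama.cpp equivalent: -ngl / --n-gpu-layers
  ./llama-cli -m llama-3.1-70b-Q5_K_M.gguf -ngl 20 -p "hello"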
 

Just a follow-up to mention I tried https://fly.io and was able to run the Llama 3.1 405b parameter (230 GB) workload with ollama.com on beefy GPU instances, after some initial hurdles.

They're apparently quite low cost too. Impressed with them especially as they seem to be a small operation.

See my thread on their forum: https://community.fly.io/t/unable-t...-due-to-an-insufficient-resources-error/21398

PS: One of their claims to fame is fast (serverless-style) launches; from what I saw it's primarily because they host a local Docker Hub mirror.
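For anyone wanting to try the same thing on fly.io, the basic shape is a Fly Machine running the Ollama image with a GPU attached. This is only a sketch from memory, so treat the GPU kind strings and flags as approximate and check their current GPU docs (for the 405b model you'd need their bigger multi-GPU configs):

  # app name is just an example
  fly apps create my-ollama

  # GPU kind strings (a100-pcie-40gb, a100-sxm4-80gb, l40s, ...) are per fly.io's current docs
  fly machine run ollama/ollama --vm-gpu-kind a100-sxm4-80gb -p 11434:11434/tcp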
 
Did you explore the AWS p-series VMs? I will soon need to set up a service architecture using a llama3 model, and provision for RAG. Will the AWS p series be good enough? Or is there a better way to do this on-prem?

@vishalrao
 
Hi @asingh, no, I don't think I looked at the p series on AWS - it was the g6 series IIRC - so I can't really advise, sorry.

Probably best to avoid on-prem unless you have your own reasons to prefer it.

AWS is king (I guess), but as a casual explorer I liked the Azure UI - it seems a bit easier to navigate and to set up the config you want.

The AWS and GCP consoles are confusing to me, but then I'm not a full-time devops guy.
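One more note on the quota hassle from earlier in the thread: on AWS you can at least script the lookup and the limit-increase request from the CLI. A rough sketch - I believe L-DB2E81BA is the "Running On-Demand G and VT instances" vCPU quota, but verify the code with the list command first:

  # check how many vCPUs / GPUs / memory a g6.48xlarge actually has
  aws ec2 describe-instance-types --instance-types g6.48xlarge \
    --query "InstanceTypes[].{vCPUs:VCpuInfo.DefaultVCpus,GPUs:GpuInfo.Gpus,MemMiB:MemoryInfo.SizeInMiB}"

  # find the exact quota code, then request an increase (the value is total vCPUs, not instances)
  aws service-quotas list-service-quotas --service-code ec2 --query "Quotas[?contains(QuotaName,'G and VT')]"
  aws service-quotas request-service-quota-increase --service-code ec2 --quota-code L-DB2E81BA --desired-value 192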
 
Has anyone else tried Jarvis Labs? It's a Coimbatore-based company and the instances are quite cheap IMO (although I haven't really compared the rates to other CSPs).
 