How to launch a big cloud VM with GPUs to run AI workloads?

vishalrao

Does anyone here use cloud VMs with GPUs to run AI workloads?

I'm trying to run the new Facebook/Meta LLM, Llama 3.1, specifically the 405b model, using ollama.com, which in GPU mode needs at least 230 GB of total GPU memory.

I have been trying to launch VMs with various cloud providers - Linode, TensorDock and the big 3 (AWS, GCP and Azure) - but in my personal accounts (not corporate), and these mofos are making a fuss by forcing me to request quota increases before I'm allowed to launch such VMs (with many vCPUs like 48 or 64, lots of RAM, and multiple GPUs to meet the VRAM requirement)...

Any suggestions? I just want to launch such a VM for a couple of hours to run the Ollama docker image with the Llama 3.1 model, then delete it...
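For context, this is roughly what I want to run once the VM is up - just the standard Ollama docker image per their docs (a sketch; it assumes the NVIDIA container toolkit is already installed on the VM, and the model tag is whatever ollama.com currently publishes):

  # start the Ollama server container with all GPUs visible
  docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

  # pull and run the 405b model inside the container (this is the ~230 GB part)
  docker exec -it ollama ollama run llama3.1:405b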
 
Any particular reason you're not using the smaller variants? They are surprisingly good. Alternatively, you might look into quantizing the models to Q5_K_M in llama.cpp for minimal loss in quality.
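If you do go the quantisation route, the llama.cpp flow is roughly this - a sketch, assuming you've built llama.cpp and have the HF weights locally; the script and binary names have changed across llama.cpp versions, so check your checkout:

  # convert the HuggingFace checkpoint to a GGUF file
  python convert_hf_to_gguf.py /path/to/Meta-Llama-3.1-70B-Instruct --outfile llama-3.1-70b-f16.gguf

  # quantise to Q5_K_M
  ./llama-quantize llama-3.1-70b-f16.gguf llama-3.1-70b-Q5_K_M.gguf Q5_K_M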
 
No reason, just wanted to experience it for kicks aka shits and giggles lol.

Also to compare the speed difference on various hardware configs for the different model sizes, and the quality of responses for my application use case (text-to-SQL), to see whether it's worth running the full-fat 405b model over the more reasonable 70b one.
 
Check out sqlcoder by defog. It's available on huggingface. They fine-tune llama for text to SQL. We benchmarked the codellama finetune and found the performance on par with GPT4 (with extensive pre and post processing). They have a newer finetune based on llama3 which is supposed to be even better.
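For the comparison itself you can just hit the Ollama HTTP API and swap the model tag between runs - something like this (a rough sketch; the model tag and the schema in the prompt are only placeholders):

  curl http://localhost:11434/api/generate -d '{
    "model": "llama3.1:70b",
    "prompt": "### Schema:\nCREATE TABLE orders (id INT, customer_id INT, total NUMERIC);\n### Task: write SQL for total revenue per customer",
    "stream": false
  }'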
 
btw I just tried the 8b model (5 GB memory requirement) on my Zen 4 laptop and was able to run both CPU and GPU modes... the laptop iGPU (the 780M on the Ryzen 7840U) seems to have doubled the performance (tokens per minute). If an iGPU makes this much of a difference, now I'm even more eager to run it on a proper GPU :D
 
8B model with 5GB memory sounds like a pretty heavy quantisation. Do check the perplexity numbers to see how much you're losing in model quality.
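llama.cpp ships a perplexity tool for exactly that - roughly like this (a sketch; the binary name and the wikitext test file depend on your build and download):

  # lower perplexity = closer to the full-precision model on the same text
  ./llama-perplexity -m llama-3.1-8b-Q4_K_M.gguf -f wikitext-2-raw/wiki.test.raw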
 
Wheeeee !!! Linode and TensorDock FTW !!!

(attached screenshot: ollama.png)
 
I think I got what I needed. Scratched my itch with this topic for now.

Just wanted to get a feel for the differences between running a model fully on CPU vs GPU vs mixed CPU+GPU - not in like "laboratory conditions" or anything.

So here's what I observed (I know I could have just searched online and let other people tell me, but I wanted to try it for myself):

Running fully on GPU (single or multiple) is at least an order of magnitude (~10x) faster than running only on CPU, and mixed CPU+GPU gets you roughly a 2x speed-up over CPU-only.

Of course, like I said this was not an apples-to-apples or lab-conditions comparison because I couldn't get exactly matching VM specs for the various runs.

I think I mentioned I ran the docker image of ollama.com with the facebook/meta llama 3.1 8b, 70b and 405b models.
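For anyone who wants to poke at the mixed CPU+GPU case: the split is basically how many layers you offload to the GPU. Rough sketch of controlling it, either via the Ollama API or the llama.cpp flag (the layer count is just an example, not a recommendation):

  # Ollama API: offload only ~20 layers to the GPU, the rest stays on the CPU
  curl http://localhost:11434/api/generate -d '{
    "model": "llama3.1:70b",
    "prompt": "hello",
    "options": { "num_gpu": 20 },
    "stream": false
  }'

  # llama.cpp equivalent: -ngl / --n-gpu-layers
  ./llama-cli -m llama-3.1-70b-Q5_K_M.gguf -ngl 20 -p "hello"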
 

Just a follow-up to mention I tried https://fly.io and was able to run the Llama 3.1 405b parameter (230 GB) workload with ollama.com on beefy GPU instances, after some initial hurdles.

They're apparently quite low cost too. Impressed with them especially as they seem to be a small operation.

See my thread on their forum: https://community.fly.io/t/unable-t...-due-to-an-insufficient-resources-error/21398

PS: One of their claims to fame is fast (serverless-style) launches; from what I saw it's primarily because they host a local Docker Hub mirror.
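For anyone wanting to try the same thing on fly.io, the basic shape is a Fly Machine running the Ollama image with a GPU attached. This is only a sketch from memory, so treat the GPU kind strings and flags as approximate and check their current GPU docs (for the 405b model you'd need their bigger multi-GPU configs):

  # app name is just an example
  fly apps create my-ollama

  # GPU kind strings (a100-pcie-40gb, a100-sxm4-80gb, l40s, ...) are per fly.io's current docs
  fly machine run ollama/ollama --vm-gpu-kind a100-sxm4-80gb -p 11434:11434/tcp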
 
Did you explore the AWS p-series VMs? I will soon need to set up a service architecture using a llama3 model, and provision for RAG. Will the AWS p series be good enough? Or is there a better way to do this on-prem?

@vishalrao
 
Hi @asingh, no, I don't think I looked at the p series on AWS - it was the g6 series IIRC - so I can't really advise, sorry.

Probably best to avoid on-prem unless you have your own reasons to prefer it.

AWS is king (I guess), but as a casual explorer I liked the Azure UI - it seems a bit easier to navigate and to set up the config you want.

The AWS and GCP consoles are confusing to me, but then I'm not a full-time devops guy.
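One more note on the quota hassle from earlier in the thread: on AWS you can at least script the lookup and the limit-increase request from the CLI. A rough sketch - I believe L-DB2E81BA is the "Running On-Demand G and VT instances" vCPU quota, but verify the code with the list command first:

  # check how many vCPUs / GPUs / memory a g6.48xlarge actually has
  aws ec2 describe-instance-types --instance-types g6.48xlarge \
    --query "InstanceTypes[].{vCPUs:VCpuInfo.DefaultVCpus,GPUs:GpuInfo.Gpus,MemMiB:MemoryInfo.SizeInMiB}"

  # find the exact quota code, then request an increase (the value is total vCPUs, not instances)
  aws service-quotas list-service-quotas --service-code ec2 --query "Quotas[?contains(QuotaName,'G and VT')]"
  aws service-quotas request-service-quota-increase --service-code ec2 --quota-code L-DB2E81BA --desired-value 192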
 
Has anyone else tried Jarvis Labs? It's a Coimbatore-based company and the instances are quite cheap IMO (although I haven't really compared the rates to other CSPs).
 