How to run large LLMs like Llama 3.1 70B or Mixtral 8x22B with limited GPU VRAM?

draglord

I am trying to run large LLMs like Llama 3.1 70B and Mixtral 8x22B locally on my system. I have 3x RTX 3090 and a 4080 Super 16 GB.

I read on Reddit that people were able to run those models on a single 3090 with 4-bit quantization. I am trying to run the same models using 4-bit quantization but have been unsuccessful so far.

I tried to run this model for Llama and this model for Mixtral. I also tried to run the base Llama model by passing 4-bit quantization as a parameter, but no dice.

I was able to run Llama 3.1 8B Instruct and Mistral 7B Instruct locally (without 4-bit quantization).

Does anyone have any idea how I can run those models?
 
python -m vllm.entrypoints.openai.api_server \
--model casperhansen/llama-3-70b-instruct-awq \
--dtype auto \
--tensor-parallel-size 3 \
--engine-use-ray \
--gpu-memory-utilization 0.93

Try with the three 3090s only.
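If you want to pin it to just the 3090s, setting CUDA_VISIBLE_DEVICES in front of the command should do it (assuming the 3090s show up as devices 0-2 in nvidia-smi; adjust the indices to your layout):

CUDA_VISIBLE_DEVICES=0,1,2 python -m vllm.entrypoints.openai.api_server \
--model casperhansen/llama-3-70b-instruct-awq \
--dtype auto \
--tensor-parallel-size 3 \
--engine-use-ray \
--gpu-memory-utilization 0.93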
 

Alright, I'll give vLLM a try.
How many tokens/sec are you getting?
Sorry, I did not see that.
Hi @tr27

I ran the command, and also tried vllm serve, but nothing happens. I tried to hit the localhost:8000 endpoint, but it could never establish a connection. Can you tell me what I could be doing wrong?
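For reference, this is roughly how I was testing the endpoint (the standard OpenAI-compatible completions route that vLLM exposes on port 8000 by default; the prompt is just a placeholder):

curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "casperhansen/llama-3-70b-instruct-awq", "prompt": "Hello", "max_tokens": 32}'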
 
@draglord - how are you running all 4 cards? Single machine or eGPU?
Hi

Yes, a single machine. Risers and PCIe cables.
Did you try ollama.com? There's an option --all-gpus, IINM, and if it overflows GPU VRAM it spills some into regular RAM...
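Something along these lines, if I remember right (the default llama3.1:70b tag in the Ollama library should already be a ~4-bit quant, but double-check that):

ollama run llama3.1:70b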
Yeah, I tried Ollama. Easy-to-use interface, but I have 3x 3090 and a 4080, so it only uses 16 GB on each card.

Plus you cannot download models from HF.
 

What other tools have you tried apart from ollama.com? Have you tried the https://lmstudio.ai app? It can run as a server/service too and is cross-platform. It supports HF model downloads as well.
Or even just the bare llama.cpp backend, but I don't know the feasibility of that...
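If you went that route, it would look roughly like this with a 4-bit GGUF (the model filename here is a placeholder, and you would need llama.cpp built with CUDA support):

./llama-server -m ./models/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 999 \
--port 8080

It splits layers across all visible GPUs by default, and you can lower --n-gpu-layers if it does not all fit in VRAM; the remaining layers run on CPU from system RAM.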