How to run large LLMs like Llama 3.1 70B or Mixtral 8x22B with limited GPU VRAM?

draglord

I am trying to run large LLMs like Llama 3.1 70B and Mixtral 8x22B locally on my system. I have 3x RTX 3090 and a 4080 Super 16 GB.

I read on Reddit that people were able to run those models on a single 3090 with 4-bit quantization. I am trying to run the same models using 4-bit quantization but have been unsuccessful so far.

I tried to run this model for Llama and this model for Mixtral. I also tried to run the base Llama model by passing 4-bit quantization as a parameter, but no dice.

I was able to run Llama 3.1 8B Instruct and Mistral 7B Instruct locally (without 4-bit quantization).

Does anyone have any idea how I can run those models?
 
python -m vllm.entrypoints.openai.api_server \
--model casperhansen/llama-3-70b-instruct-awq \
--dtype auto \
--tensor-parallel-size 3 \
--engine-use-ray \
--gpu-memory-utilization 0.93

Try with the three 3090s only.
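If you want to pin it to just the 3090s, setting CUDA_VISIBLE_DEVICES in front of the command should do it (assuming the 3090s show up as devices 0-2 in nvidia-smi; adjust the indices to your layout):

CUDA_VISIBLE_DEVICES=0,1,2 python -m vllm.entrypoints.openai.api_server \
--model casperhansen/llama-3-70b-instruct-awq \
--dtype auto \
--tensor-parallel-size 3 \
--engine-use-ray \
--gpu-memory-utilization 0.93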
 

Alright, I'll give vLLM a try.
How many tokens/sec are you getting?
Sorry, I did not see that.
Hi @tr27

I ran the command, and also tried vllm serve, but nothing happens. I tried to hit the localhost:8000 endpoint, but it could never establish a connection. Can you tell me what I could be doing wrong?
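For reference, this is roughly how I was testing the endpoint (the standard OpenAI-compatible completions route that vLLM exposes on port 8000 by default; the prompt is just a placeholder):

curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "casperhansen/llama-3-70b-instruct-awq", "prompt": "Hello", "max_tokens": 32}'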
 
@draglord - how are you running all 4 cards? Single machine or eGPU?
Hi

Yes, a single machine. Risers and PCIe cables.
Did you try ollama.com? There's an option --all-gpus, IINM, and if it overflows GPU VRAM it spills some into regular RAM...
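Something along these lines, if I remember right (the default llama3.1:70b tag in the Ollama library should already be a ~4-bit quant, but double-check that):

ollama run llama3.1:70b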
Yeah, I tried Ollama. Easy-to-use interface, but I have 3x 3090 and a 4080, so it only uses 16 GB on each card.

Plus you cannot download models from HF.
 

What other tools have you tried apart from ollama.com? Have you tried the https://lmstudio.ai app? It can run as a server/service too and is cross-platform. It supports HF model downloads as well.
Or even just the bare llama.cpp backend, but I don't know the feasibility of that...
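If you went that route, it would look roughly like this with a 4-bit GGUF (the model filename here is a placeholder, and you would need llama.cpp built with CUDA support):

./llama-server -m ./models/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf \
--n-gpu-layers 999 \
--port 8080

It splits layers across all visible GPUs by default, and you can lower --n-gpu-layers if it does not all fit in VRAM; the remaining layers run on CPU from system RAM.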