How to run large LLMs like Llama 3.1 70B or Mixtral 8x22B with limited GPU VRAM?

I am trying to run large LLMs like Llama 3.1 70B and Mixtral 8x22B locally on my system. I have 3x RTX 3090 and an RTX 4080 Super 16 GB.

I read on Reddit that people were able to run these models on a single 3090 with 4-bit quantization. I am trying to run the same models with 4-bit quantization but have been unsuccessful so far.

I tried to run this model for Llama and this model for Mixtral. I also tried to run the base Llama model by passing 4-bit quantization as a parameter, but no dice.
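For reference, this is roughly the kind of 4-bit load I am attempting with transformers and bitsandbytes (just a sketch, not my exact script; the model ID and quantization settings below are assumptions):

# Sketch of a 4-bit load via transformers + bitsandbytes.
# The model ID and quantization settings are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Meta-Llama-3.1-70B-Instruct"  # assumed model ID

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit on load
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the compute in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across all available GPUs
)
tokenizer = AutoTokenizer.from_pretrained(model_id)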

I was able to run Llama 3.1 8B Instruct and Mistral 7B Instruct locally (without 4-bit quantization).

Does anyone have any idea how I can run these models?
 
python -m vllm.entrypoints.openai.api_server \
  --model casperhansen/llama-3-70b-instruct-awq \
  --dtype auto \
  --tensor-parallel-size 3 \
  --engine-use-ray \
  --gpu-memory-utilization 0.93

Try with just the three 3090s.
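Once the server is up, you can sanity-check it through the OpenAI-compatible API; a minimal sketch (port 8000 is the default, and the model name is taken from the command above):

# Minimal request against the vLLM OpenAI-compatible server.
# Assumes the default port 8000 and the model name from the command above.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="casperhansen/llama-3-70b-instruct-awq",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=32,
)
print(resp.choices[0].message.content)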
 

Alright, I'll give vLLM a try.
How many tokens/s are you getting?
Sorry, I did not see that.
Hi @tr27,

I ran the command (and also tried vllm serve), but nothing happens. I tried to hit the localhost:8000 endpoint, but it could never establish a connection. Can you tell me what I could be doing wrong?
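This is roughly how I am checking whether the server is reachable (a sketch; the default port 8000 and the /health route are assumptions on my side):

# Quick reachability check against the server (default port 8000 assumed).
import requests

r = requests.get("http://localhost:8000/health", timeout=5)
print(r.status_code)  # 200 would mean the server is up and listening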
 