Oh, I had thought about it for LLMs, but 32GB is too little for that. Thanks for the offer.
32GB is too little? 32GB is the max you will get on a consumer-grade GPU (RTX 5090); beyond that you'd have to go for a workstation card, which will cost even more than a 5090.
bro, on a Mac the memory is unified, which is why people look for higher-capacity ones: the RAM acts like VRAM. But LLMs aren't feasible to run under even 80GB of VRAM for anything meaningful and production-ready.
The process needs RAM too (for non-computation stuff); that's why 32GB is too little for AI work.
I don't know what you plan on doing with LLMs, but people have been using LLMs on smaller-capacity GPUs too.
@desi_gamer put together a beautiful article on exactly this, covering all the current flashy stuff being used in the LLM space that can run on a 24GB GPU. Have a look: https://techenclave.com/t/guide-post-training-llms-reinforcement-learning-using-qwen-2-5-14b/411415
You want to run a production-ready AI server for 1.1L? What kind of stupid expectation is that?
What consumer GPU are you getting with 32GB of VRAM?
Your replies make no sense at all. I'm comfortably running pretty decent-sized models on this.
I shared my view on the part that mentioned LLMs, so I responded specifically to that. Not sure why you're reading this as some kind of challenge.
You’re conflating running a model with running it well for production workloads.
On 16-24GB VRAM with RAM offloading, you can run quantized 70B models at around 4-8 tokens/sec or comfortably run 30B models at 30-40 tokens/sec. That works fine for prototyping or light inference. I do exactly that on 12GB VRAM for testing.
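To be concrete, this is roughly what that setup looks like with llama-cpp-python; the model path and layer split below are illustrative assumptions, tune them to your card:

```python
# Minimal sketch: partial GPU offload with llama-cpp-python.
# Assumptions: a quantized GGUF on disk and a 12-24GB card; the
# n_gpu_layers split is illustrative, not a recipe.
from llama_cpp import Llama

llm = Llama(
    model_path="models/qwen2.5-32b-instruct-q4_k_m.gguf",  # hypothetical path
    n_gpu_layers=40,   # layers that fit in VRAM; the rest stay in system RAM
    n_ctx=8192,        # context window; the KV cache for this needs memory too
)

out = llm("Explain PCIe bandwidth in one line.", max_tokens=64)
print(out["choices"][0]["text"])
```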
But the moment you exceed VRAM and start offloading to system RAM, you hit PCIe bandwidth limits. PCIe 4.0 x16 tops out at ~32 GB/s per direction (~64 GB/s aggregate), while GDDR6X VRAM runs at 900+ GB/s. That's well over an order of magnitude of bandwidth gap. For batch inference, fine-tuning, or long context windows where you're constantly shuttling data, that bottleneck compounds. Token generation can drop 50-80% when you're pulling layers from system RAM.
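A quick back-of-envelope makes the gap concrete (the sizes below are assumed round numbers for illustration, not benchmarks):

```python
# Back-of-envelope: per-token floor from streaming offloaded weights.
WEIGHTS_OFFLOADED_GB = 10    # weights that did not fit in VRAM (assumed)
PCIE4_X16_GBPS = 32          # ~per-direction PCIe 4.0 x16 bandwidth
GDDR6X_GBPS = 900            # typical high-end VRAM bandwidth

# Each generated token reads every weight once, so the offloaded part
# has to cross the PCIe bus once per token. No amount of compute fixes that.
pcie_floor_s = WEIGHTS_OFFLOADED_GB / PCIE4_X16_GBPS    # ~0.31 s/token
vram_cost_s = WEIGHTS_OFFLOADED_GB / GDDR6X_GBPS        # ~0.011 s/token

print(f"PCIe adds >= {pcie_floor_s*1000:.0f} ms/token "
      f"vs ~{vram_cost_s*1000:.0f} ms if those layers sat in VRAM")
```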
On 32GB unified, the situation is different but worse for my use case. That 32GB is shared between macOS, applications, CPU, and GPU. The OS alone uses 6-8GB, leaving maybe 24GB effective for GPU compute. You can’t offload to separate system RAM because there is no separate pool. And at 400 GB/s max bandwidth on M3 Max, you’re already slower than dedicated VRAM.
So yes, people run models on smaller setups. I need 64GB+ unified specifically because I’m targeting larger models with longer contexts at production speeds, without constantly fighting memory pressure or offload penalties. If 32GB fits your workflow, that’s fine. It doesn’t fit mine.
I didn't know sharing something like that was a crime.
This is a sale thread, not a discussion thread. What part of this is hard to understand?
Ignoring the fact that you're wrong on multiple levels, you're spamming a post for no reason. You want 64GB RAM, this has 32GB; all you had to do was ignore it and move on.
Your requirements for a production AI server need enterprise cards from Nvidia, going into 3-5 lakhs at minimum. Now, are you going to spam every sale thread on this forum that doesn't meet that criterion?
This has more VRAM than the second-best consumer Nvidia GPU available, if model size is the main issue for you. If you're running just Ollama, you can easily cap OS usage to 4-5GB. That leaves 25GB of RAM for LLMs on the GPU. That's 1GB more than an RTX 5080, which itself costs 50k more than this entire device.
so this thread is detached from the sale now, cool. let me break down why i said “32GB unified is less for LLMs” based on what i’ve actually seen, not theory.
on a mac, 32GB unified means everything shares the same pool: macOS, browser, vscode, docker, and the model. and by default the gpu can only allocate around ~2/3 of total RAM, so on a 32GB system you’re effectively working with ~21GB gpu-allocatable headroom. that’s the first “unified tax” people keep skipping.
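quick math on that (the 2/3 default is the commonly reported figure, treat the exact numbers as assumptions):

```python
# rough headroom math for a 32GB unified mac; the 2/3 default GPU
# allocation ceiling is the commonly reported figure, not a spec sheet
total_gb = 32
gpu_ceiling_gb = total_gb * 2 / 3       # ~21.3GB gpu-allocatable by default
os_and_apps_gb = 7                      # macOS + browser + editor (assumed)
model_budget_gb = min(gpu_ceiling_gb, total_gb - os_and_apps_gb)
print(f"gpu-allocatable: ~{gpu_ceiling_gb:.0f}GB, "
      f"realistic model budget: ~{model_budget_gb:.0f}GB")
```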
and again, i’m not talking about running 70B here. i’m being practical. 8B/13B models can be usable on 32GB unified at normal contexts, sure. but the moment you aim for 30B-ish models or push longer contexts, you’re already near the edge because the weights alone sit around ~20GB in common 4-bit builds, and then KV cache grows with context and stays resident. once memory pressure hits and swap starts, throughput drops hard and you stop doing “LLM work” and start babysitting memory.
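to put a number on the KV cache part, here's a rough sketch; the layer/head counts are typical for a 30B-class GQA model, not any specific model's card:

```python
# kv cache back-of-envelope; config numbers are typical-for-class
# assumptions, check the actual model card before trusting them
n_layers   = 64      # transformer layers
n_kv_heads = 8       # grouped-query attention KV heads
head_dim   = 128
bytes_per  = 2       # fp16 cache

per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per  # K and V
for ctx in (4096, 16384, 32768):
    gb = per_token * ctx / 1024**3
    print(f"{ctx:>6} tokens -> ~{gb:.1f}GB of KV cache on top of weights")
```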
this is the same vibe as video editing. yes, premiere can open on 8GB or 16GB. you can even export something small. but try doing real work with a long timeline, lots of clips, effects, and chrome open, and you’ll spend more time watching it stutter than editing. it’s not about “can it launch”, it’s about whether it stays smooth when you actually use it the way it’s meant to be used.
same reason diffusion comparisons don’t map. diffusion is bursty: big spike during generation, then memory clears. LLMs are stateful: weights stay loaded and KV cache keeps growing while you work. different pattern, different bottleneck.
so yeah, 32GB unified can run local LLMs. but for serious LLM dev, it's "less" because you don't get a clean 32GB model budget: you get a shared pool with a ~21GB gpu allocation ceiling by default, and long context + normal multitasking pushes you into memory pressure fast. and honestly even 64GB starts feeling "less" once you're doing bigger models and long context. for my use, 128GB is where it starts to feel practical. this isn't about budget or expectations, it's just how i work with this stuff.
and i'll end my comment with this: even 64GB unified is "less" for LLMs.
A simple command lets you overcome that limit. But you probably know that, and that's called a bad-faith argument.
https://www.reddit.com/r/LocalLLaMA/comments/186phti/m1m2m3_increase_vram_allocation_with_sudo_sysctl/
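For reference, the knob from that thread can be wrapped like this; a sketch assuming Apple Silicon on a recent macOS (the sysctl name comes from that discussion, the MB value is an assumption, and the limit resets on reboot):

```python
# sketch: raise the GPU wired-memory ceiling on Apple Silicon via sysctl
# (the iogpu.wired_limit_mb knob discussed in the linked Reddit thread;
# older macOS versions exposed it as debug.iogpu.wired_limit)
import subprocess

def set_gpu_wired_limit(mb: int) -> None:
    # asks macOS to let the GPU wire up to `mb` MB of the unified pool
    subprocess.run(["sudo", "sysctl", f"iogpu.wired_limit_mb={mb}"], check=True)

# e.g. allow ~26GB of a 32GB machine for GPU work -- push it too far
# and macOS itself starts compressing/swapping (assumed value, tune it)
set_gpu_wired_limit(26 * 1024)
```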
Again, I have no problem with what you're saying. If you'd said that about 16GB I'd be fine, but you're saying it about 32GB, which is absolutely false; it runs plenty fine. Now, if you're asking about extreme professional loads, that's different. As you keep increasing model size, more and more hardware won't be able to run them. You have to temper your expectations with what you pay for the hardware. If you want more performance, simply pay more and get more RAM.
It's good that you backpedaled from "production" to just "serious LLM dev". The question is why an LLM dev would work on a 3-year-old machine, spend just 1L, and then complain it's not enough. Requirements change, hardware changes.
this is not a sales thread any more, thank you!
I get where you’re coming from, and I’m not denying that 32GB unified runs plenty of models. My comment was a quick reply in a sale thread and it came off more absolute than I intended, so I’m clarifying here on technical grounds for anyone reading later.
I'm aware of the sysctl VRAM allocation tweak. It can help in some cases, but it doesn't create more memory. It just lets the GPU claim a bigger slice of the same 32GB unified pool, which under heavier LLM use often shifts pressure onto macOS and other processes, and you still end up with compression or swap. That's what I'm trying to avoid in my workflow.
What I mean by "32GB unified is less for LLMs" is specifically heavy usage: larger models, long context, sustained sessions, and multitasking while generating, where KV cache growth makes memory pressure show up fast.
For diffusion and smaller LLMs, 32GB unified is usually fine. For my use case, I personally aim for 64GB+ unified for predictable headroom (for LLMs specifically, that is).
Thanks for your kind attention to this matter.
Yet your calculations were based on that limit; removing it adds 5-6GB more RAM.
Bottom line: you were complaining that 32GB isn't enough based on professional requirements, which is what I replied to before. Basic consumer workloads, for 99% of people, run just fine. Now that you're saying basically what I said, I guess we can let this go.