Rendered at 11:29:57 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.
iagooar 16 hours ago [-]
I love my MacBook Pro M5 128GB RAM and I love qwen3.6.
BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.
Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.
If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro.
Thank me later.
jasonjmcghee 7 hours ago [-]
I'm surprised no one has else has mentioned - low power mode.
With no speculative decoding, using high power mode, I get 80 t/s on 35B A3B - and it gets hot and spins up. On low power mode I get 38 t/s - no fans, cool to warm laptop.
If you currently don't use speculative decoding and you start using it, it can nearly offset the difference between high and low power, and it's night and day experience.
I almost always keep my laptop on low power mode.
html5cat 4 hours ago [-]
Awesome idea! Will try it out. Wish there was a way to enable low power on a per-app basis. Scrolling and reading on low power mode is really annoying.
c16 3 hours ago [-]
Will give this a try later. Enjoy working with A3B Coder, but the heat coming out my 32gb M5 is a lot. This might be the trick - Thanks!
anon373839 6 hours ago [-]
Can you mention what inference stack you're using? I've tried MTP several times with that model and it always seems to significantly cut my token generation speed from ~60 tokens/sec to ~40 (M3 Max).
mycall 5 hours ago [-]
It is less efficient use of the GPU and uses more electricity overall, no?
astrostl 13 hours ago [-]
> MacBook Pro M5 128GB RAM
614 GB/s of memory bandwidth
> MacMini M4 with 64GB of RAM
273 GB/s of memory bandwidth (also only currently available with 48GB)
When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible. In this case a hypothetical Mini with 1TB of memory would still be over 2x slower with 27-35B models.
And FWIW I have an M4 Max MBP 128GB that I keep on a Roost laptop stand, with a separate keyboard/mouse/video. It does fire up the cooling jets when running local LLMs, but stays within tolerance for me on noise. I haven't heat-tested it on longer runs, but I imagine the risen airflow helps a ton.
iagooar 4 hours ago [-]
On paper the M4 should be roughly 1/3 of the M5, in practice it is only 1/2. With the right, optimized model like qwen3.6 35B MoE MLX you can get over 40 tok / sec on it. I run dozens of background jobs that are not time-critical on it.
bfjvibybd6cuvu6 2 hours ago [-]
What kind of jobs?
bigyabai 11 hours ago [-]
> When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible.
This is only true when your GPU isn't bottlenecked building a KV cache, which it usually will be on Apple Silicon. The Achilles heel of the M-series chips are their weak, SOC-grade GPU that holds back the Max and Ultra models from having interactive TTFTs on larger models and contexts.
SwellJoe 15 hours ago [-]
I opted to buy a normal 32GB laptop for this very reason. I know how loud and hot the GPUs in my desktop run when running even smallish models like Qwen 27B or Gemma 4 31B (which is a better model for most than Qwen 3.6, despite the benchmarks). I also have a Strix Halo which doesn't get loud, because it has a single huge fan, but it does get hot. So, there's no way a laptop could work as hard as models make them work, and not be unbearable. Tiny fans trying to remove all that heat? They gotta be screaming. No reason to spend all that money on a laptop that I couldn't realistically make use of. I do run a lot of VMs on my desktop, but I can get to those on a VPN.
It's a nice idea to run a model on a laptop so you can work anywhere...but, that's a job for models in the cloud. Not much data has to traverse the network, so it's not a big deal. Or one could also setup a VPN so you can reach a self-hosted model on a big box at home for things that require data privacy.
All that said, there are models that work great on very small devices for some tasks and won't work it to death. Gemma 4 12B QAT 4-bit runs on a 16GB device, maybe even smaller, including a tablet. It's the best self-hostable vision model I've tested for my purposes (categorization, identification, labeling, type stuff), beating much larger models. It's also a decent conversationalist with good prose but it doesn't know much of anything (not a lot of the world fits in 7GB), so it needs search if you want to use it for research. It's a pretty good tool user. I definitely wouldn't want to use it for code, though, beyond very simple stuff.
girvo 14 hours ago [-]
Gemma is better than Qwen at everything except coding, in all my evaluations. Which is a shame because that is what I use them for!
UncleOxidant 13 hours ago [-]
It would be great if the Gemma folks would release a code-focused model. Probably won't happen, but it's fun to dream.
SwellJoe 12 hours ago [-]
The Ornith folks say they're doing that, but haven't released the Gemma-based 31b yet (https://github.com/deepreinforce-ai/Ornith-1). But, also, the Qwen-based 35b MoE Ornith version performs worse than Qwen 3.6 and Qwen AgentWorld on my benchmarks (which are focused on finding security bugs, so not exactly the same as agentic coding, but closely related skills).
That said, the reason they're able to release Ornith branded post-trains of both Gemma and Qwen is because they're open weights under a friendly license. Someone, not just Google, could make a coding focused Gemma post-train. I don't think it's actually much weaker than Qwen 3.6 for coding; Gemma 4 31b outperforms Qwen 3.6 27b by a wide margin on security bug hunting (at least for the specific bugs in my benchmarks, which are mostly relatively difficult bugs from the Mythos-reported bugs).
I'd really love to see a bigger MoE from Google, though. A 70b or 120b MoE would likely be super fun.
ekianjo 12 hours ago [-]
gemma is also worse for tool calling. not just coding
satvikpendem 11 hours ago [-]
That is because they use a different tool calling format than most other models. Unsloth quants fix this in their Gemma releases.
feffe 39 minutes ago [-]
I've never been able to fix the tool calling issues. Running unsloth versions with llama.cpp, constant issues. Have tried many forum fixes, including lots of fixed chat templates, to no avail. It's mostly the edit call that breaks, which often results in "let me just rewrite the whole file from context".
stevenhubertron 9 hours ago [-]
Can you say a bit more about this? The bad tool calling has made me give up on using Gemma for my Hermes and a personal recipe site. I have only downloaded from Ollama.
satvikpendem 8 hours ago [-]
Ollama is not recommended [0], use llama.cpp or more specifically Unsloth Studio which wraps llama.cpp and which has an API mode you can use to hook into Hermes or another agent. Unsloth make both the Studio and the quants which fix various issues with many models [1] as well as implementing new features like MTP and QAT support much sooner than other teams. In general you should read r/LocalLLaMa as it has a lot of updates regarding local models as the field moves fast.
You can limit TDP on Strix Halo so it runs between 32 and 45W which seems to be the sweet spot for heat vs speed.
andai 15 hours ago [-]
> The reason is simple: your fingers will burn and your head will explode from the noise.
So, just buy a mac mini and put it in the other room? ( Like everyone was doing in February? :)
I've been running coding agents on my laptop in yolo mode for the past half year or so (though mostly not local ones, laptop too slow!) and the way I'm doing that without terror is that I just gave them their own Linux user "agent". They're free to nuke their homedir /agent, and they can't touch (or even read) mine.
There's some slight ergonomics issues (I need to sudo into the user to do anything, but I set up an alias for it), sometimes I get issues with permissions or ownership (gave up on "sticky bits" and just made a function I can run once a day when it breaks).
There's enough hassle that I wish I just had a dedicated machine for it, and then I'd just give them root on it. (For giggles I gave claude root on a $3 VPS and that's going just fine...)
But yeah after months of trial and error I reinvented "just buy a mac mini" from first principles...
iagooar 15 hours ago [-]
Just buy a Mac Mini really is good advice if you want to get into real, always-on convenient agentic work.
Soon it is going to be good even for coding using local LLMs. Until then, just run API models on it for coding, local LLMs for "knowledge" work or daily driver agent like Hermes.
marcuskaz 15 hours ago [-]
Except they're not available, 3-4 month wait time.
KiwiJohnno 11 hours ago [-]
I ordered a mac mini m4 pro with 48 gb of ram a couple of weeks ago. Apple said 8-9 weeks.
iagooar 14 hours ago [-]
Buy a refurished or 2nd hand one.
1over137 14 hours ago [-]
Also not really available.
klardotsh 13 hours ago [-]
Especially with anything resembling a usable amount of RAM. Mac Minis and Studios >=64GB are basically permanently sold out everywhere, because everyone, including commercial entities with deeper pockets than most of us plebs, has the exact same idea at the exact same time.
15 hours ago [-]
14 hours ago [-]
roadside_picnic 13 hours ago [-]
In general if you're setting up a local LLM you should assume it's going to be primarily working as a server and talking to various clients. I use my MBP, but that's because I don't travel much anymore so it can happily work as a server at all times. With the right agent setup you can probably manage most things from your phone even if you don't have a seperate machine to use as a client.
I have an older laptop I run a hermes agent on backed by an API based open (non-local) model and Macbook Pro M4 for running another model locally (also using hermes). The agents have a Mattermost (open source version of slack) server they run and I run Mattermost on my phone so I can talk to them and task them with things. In fact, it was through the hermes WhatsApp endpoint that I got the first agent (non-local) to setup the Mattermost server and unboard the second agent (local mbp).
Then I can just chat with them through Mattermost when I need work done. Whenever I need something done I just hope on the Mattermost server and chat with them. I've had them build me multiple research reports (the fully local agent did awesome at this), learn how to use Stable Diffusion on my desktop to generate images, install and perform maintenance on various local services I run (including Open WebUI).
jtbaker 12 hours ago [-]
Nope, have both these machines, can confirm the M5 max blows the M4 mini away. It does get hot, but I use it mostly with an external monitor and keyboard. Conceptually I like the headless model better with a workstation, but work was buying the M5 and can't get it in any other form factor at the monute.
827a 11 hours ago [-]
Apple does not sell a 64GB variant of the M4 Mac Mini. IIRC they never have; its always capped out at 48GB.
If you were planning on getting an M5 128GB; just get a DGX Spark (~$4500) or a 5090-equipped machine (~$4500) plus a Macbook Air (~$1500). You'll come in below the M5 Max 128 pricing (~$6700+ USD) and be happier for it.
The Mac mini was available with 64GB of RAM literally 4 days ago; the option was discontinued on June 25th.
dd8601fn 9 hours ago [-]
I'm using a 64GB M4 Mac Mini.
They pulled them a month or two ago, right after I bought it.
ozim 6 hours ago [-]
DGX Spark everyone is saying performance for the money is not there
Foobar8568 5 hours ago [-]
I have an access to a DGX spark, and while it performs better than my MacBook Pro (M3 Max), the performance on Qwen and Gemma dense models is dog shit, and not worth it.
dgacmu 10 hours ago [-]
That's incorrect, I have one on my desk right now. They've stopped selling it now, but I got one a year and a half ago:
> Apple M4 Pro chip with 14‑core CPU, 20‑core GPU, 16-core Neural Engine
64GB unified memory
2TB SSD storage
10 Gigabit Ethernet
Three Thunderbolt 5 ports, HDMI port, two USB‑C ports, headphone jack
Accessory Kit
$2,649.00
swang 16 hours ago [-]
I have an M4 Max and when I was trying out local LLM work with pi it has probably felt like the hottest I've ever felt any kind of Macbook be. I could feel the radiated heat off it even a few inches away. Honestly felt hotter than any Intel Macbook I've used. Because of that I stopped as I didn't want to harm my laptop in case I need to hold it for 10 years due to all the supply issues/price increases.
dimitrios1 15 hours ago [-]
I tried to run it on a M4 Air for shits and giggles.
After about 1 minute the entire machine basically bricked and I had to hard reset :D
acters 16 hours ago [-]
Would the new upcoming AMD AI ryzen halo desktop be a better value offer? or dgx spark?
You would have to get a third party reseller/scalper or refurbished mac mini to get 64gb of ram ever since apple stopped selling it.
girvo 14 hours ago [-]
My GB10 Spark-alike is absolutely amazingly fun… but it is not cost effective. Step 3.7 Flash is shockingly capable (IQ4_XS and used for web dev mainly), but it cost me $6800 AUD. They’re even more expensive now. The numbers just don’t make sense: with proper triple head MTP I can get it up to ~40tk/s decode and it runs at around 1000+ tk/s prefill.
$6800 is a lot of API credits for GLM, for example, on any provider you want to use.
Now being able to run models uncensored and with privacy has value! But the cost for these is rough today.
I still am going to buy a second one haha
c7b 15 hours ago [-]
My 2c: you don't need the Strix Halo desktop, the chip comes in many rigs, most of them cheaper, the performance difference isn't worth it. It used to be half the price of a DGX Spark or a Mac with 128GB RAM. If you can still find it at that price I'd say it's the best bang for your buck. Otherwise, Macs have 2-3x the memory bandwidth of the DGX Spark, depending on the chip, so I'd prefer them. Unless you're planning on building a cluster. The DGX Spark has two 100GB/s connectors, ideal for clustering. But I haven't checked what else you could get for the price of two DGX Sparks.
brandensilva 9 hours ago [-]
Thoughts on a M5 Ultra 768GB if it drops? What's the price to make it worth it for you over a spark cluster?
I'm wanting to run Kimi 2.6/2.7 GGUF on it and just slap it in the server rack, but trying to decide if a spark cluster makes more sense.
PeterStuer 4 hours ago [-]
The M3 with 512GB is currently sitting at around 30K, used. You can extrapolate from there.
lee_ars 15 hours ago [-]
I'm currently fiddling with a DGX Spark and Qwen3.6-35B-A3B (specifically Qwen3.6-35B-A3B-NVFP4 under vLLM, with EAGLE3 speculative decoding via eagle3-dogacel-vllm), and it's pretty okay in terms of smarts. The speed is relatively usable at about 50 tok/sec with a 256k context window, and it's definitely smart enough to one-shot some basic coding tasks. I had it doing reverse engineering/disassembly of some ancient MS-DOS assembly language games from the 80s and it handled the task well and produced good outputs.
But it's also really easy to trip up. I fed it some of my Ars pieces and asked it to analyze themes and composition, and it got into a looping argument with me over how it was unable to analyze "my" writing because "the user cannot be the article author, the user is the user, the user did not write the article, the article author wrote the article." I was utterly unable to convince it that I was in fact me.
Qwen3.6-35B-A3B hums along at about 50GB of RAM used with --gpu-memory-utilization=0.42. I haven't tried Qwen3.6-27B (I'd likely grab Qwen3.6-27B-FP8, I think), but I'm curious to see if it makes much of a difference.
coder543 12 hours ago [-]
Compared to a dynamic quant like Unsloth's UD-Q4_K_XL, which keeps some important parameters in higher precision, a basic NVFP4 quant seems to do a lot more damage to the model unless it is carefully calibrated.
I would recommend using llama-server if you're just on a single Spark. You get access to dynamic quants like that more easily, the performance is not that different from vLLM most of the time these days, and it is much faster and easier to switch between models.
As far as intelligence goes, Qwen3.6-27B is much smarter than the 35B-A3B model, but that's also not the sort of thing to argue with an AI model about in the first place. Just open a new chat and try again.
Gemma-4-31B is not as good at agentic use cases as Qwen3.6-27B, but it is a fairly balanced model overall, and worth trying out too. Its MTP can nearly triple the performance of the model, where the benefits of MTP or Eagle seem more limited for Qwen3.6-27B in my testing, maybe doubling the speed.
cpburns2009 13 hours ago [-]
Looping is a common problem with the Qwen models. I've had good luck using --repeat-penalty=1.1 with llama.cpp and 27B. vLLM should have a similar option.
rnxrx 14 hours ago [-]
There are also nvfp4 quants of Qwen 3.6 27/35 floating around. I've done benchmarks of both and the quality difference vs fp8/bf16 was barely notable. Honestly the nvfp4 capability is the most interesting feature of the Spark (at least for me).
anon373839 13 hours ago [-]
I use Qwen 3.6 35B-A3B constantly, but I don’t see the type of behavior you mentioned. I’m using Unsloth’s Q8_K_XL quant.
gnerd00 12 hours ago [-]
`llama-server` looping mitigations --repeat-penalty something greater than 1.0, set reasoning/thinking OFF explicitly, prefer a gguf with more than 4bit quant
pkroll 15 hours ago [-]
Check the LLM benchmarks once it's out: it's such a common use case for these kinds of machines, you won't be waiting long.
HSO 4 hours ago [-]
running potentially sota open-weight models locally only became a thing in fall 2023.
if a hardware cycle takes ~3 years then fall 2026 would be the first possible device generation where apple exploits its advantage with the unified ram architecture.
more realistically, spring 2027, since they probably also needed some time to make up their minds to lean into that on the top end.
that`s also how i would interpret the recent rumors on m6 and m7.
naturally, the cooling and all that will be optimized around that.
so the first devices that are actually intended and designed for this use case will come at the earliest this fall and more likely in q1/q2 next year.
you are basically paying the price now to be on the bleeding (sweating) edge
DwarfStar is the only thing I've run that doesn't try and make my Mac Studio 128GB take off. Yes, it gets hot while doing inference but quickly cools down when idling, something I haven't experienced with Ollama, LMStudio or OMLX.
boomskats 13 hours ago [-]
Can you run Qwen 3.6 27B on antirez/ds4 now? I thought it was all about the DeepSeek models.
somewhatrandom9 13 hours ago [-]
No, I don't think Qwen, but I believe he may try and put some version of GLM in it.
Arch-TK 12 hours ago [-]
It's okay, completely wrong thread for this statement, but I wouldn't voluntarily use current MacOS (no idea if the older variants weren't terrible) over anything but ssh. Worse than Windows 11.
amatecha 9 hours ago [-]
"macOS" (or however they spell it now) is pretty bad, but I'm not sure it's possible Apple could ever possibly produce an OS as bad as Windows 11 lol, it's really surprising to me to see someone suggest it's somehow actually worse?! How many times has an Apple OS wiped your hard drive or otherwise been completely borked from a forced update? I know multiple people personally who have experienced this with Windows 10/11, not once with a Mac. Just that alone is like the end of the argument for me, ignoring all the shockingly brutal UI problems.
Tenoke 3 hours ago [-]
>How many times has an Apple OS wiped your hard drive or otherwise been completely borked from a forced update
I use Windows and this has never happened to me. I have had Macbooks I cant open to fix/replace something trivial while I can replace any part easily on a Windows PC/laptop though.
asimovDev 18 minutes ago [-]
>Windows PC/laptop though.
needs to be noted that it's increasingly uncommon to be able to do so. for desktops you have to build everything yourself - prebuilds (either gaming or workstations) have proprietary PSU and motherboards (in case of workstations, sometimes CPU is bound to the motherboard / manufacturer, for example Threadrippers). Windows laptops now often come with soldered RAM and soon will probably be without M.2 slots like Macs.
There is Framework though I guess
braebo 11 hours ago [-]
I could not disagree more.
c7b 14 hours ago [-]
This. Do consider local LLMs, but set aside a dedicated machine for it. Connect via VPN or reverse proxy. If it's not a Mac them I'd also put a server distro on it. No need for a desktop environment, save your RAM.
tedivm 14 hours ago [-]
I have a Linux box with two 3090s and it's been great for running Qwen3.6 27b. I lowered the power on each card down to 250w, and then built a small ducting/fan system to vent the waste heat outside. The machine is pretty much silent, and I'm still getting 110 tokens per second out of it for coding tasks.
Probably USD vs CAD. The parent posted a /ca/ link, which will look really similar to /us/, but the prices will all appear to be higher.
sixothree 6 hours ago [-]
Ah. Thank you. It seemed pretty sticky too, navigating the items via my previous orders even persisted the currency.
overgard 15 hours ago [-]
I'm running an M5 Max 128GB with Qwen 3.6 and unreal engine in the background and it seems to be ok for me. Quite a power drain if it's not plugged in but I haven't seen any thermal issues.
geophile 15 hours ago [-]
That's exactly what I'm doing -- Mini M4 Pro 64GB, qwen3.6.
My hearing is not great, but I think I would have noticed the fan, and I have never heard it. In fact, I had to google to find out if it even has a fan.
trollbridge 9 hours ago [-]
I'm still kicking myself for buying a 32GB M1 Max Studio two years ago when it wouldn't have been that difficult to get a 64GB instead.
trollbridge 9 hours ago [-]
Or just buy an R9700 and put it in the basement?
oceanplexian 16 hours ago [-]
If you want to do coding with a local LLM your best bet is a 6 year old Nvidia 3090 which is substantially more powerful than the highest end overhyped Apple product for 1/5th the price.
ThunderSizzle 1 hours ago [-]
The cheapest 3090s I could find with any sort of guarantee were pushing $1500.
An AMD AI Pro R9700 32GB brand new is $1350 right now.
After some tweaking, I had it running faster than the models the 3090 could run, and it could obviously run with higher context limits and bigger models due to the extra vram.
chorizo 16 hours ago [-]
That’s 24GB VRAM. Not enough to run a 27B model at a useful quant+context size.
You can run 8bit 27B models at 24GB, it's definitely enough for the model size.
SwellJoe 15 hours ago [-]
The 8-bit quantized 27B Qwen 3.6 is 29GB. You absolutely cannot run that entirely on a 24GB GPU.
You could run a 4-bit, which is 16-17GB. But, you'd need a smallish context or you'd need to quantize your KV cache. Something like TurboQuant or RotorQuant might help.
32GB is the lower bound for comfortably running this size model. I'd maybe even say 64GB is right-sized, because a 256k context is nice to have for agentic workflows, and that won't fit on a 32GB card without heavy quantization (but I haven't tried TurboQuant or RotorQuant to know what impact it has on memory use for context).
You could also put some of the model into system RAM, but that defeats the purpose of your argument that a 3090 will outperform a Mac Mini or Mac Studio. If part of a dense model is in system RAM, it absolutely will not outperform a recent unified memory device.
cpburns2009 14 hours ago [-]
A 32gb card does run it nicely. I use unsloth's UD-Q5_K_XL at 256k context (k/v at q8_0), and get ~67 t/s on a 5090. I still need to look into MTP.
adornKey 3 hours ago [-]
Nice. I used Q4_K_M to have some headroom. But yours seems to fit nicely.
pbgcp2026 12 hours ago [-]
[dead]
bityard 15 hours ago [-]
Quantization is a trade-off, though. The quality, while still perhaps good enough for many tasks, is not as good as the full 16-bit weights that the model was designed for/released with.
pbgcp2026 12 hours ago [-]
[dead]
barbacoa 14 hours ago [-]
I'm running qwen 3.6 27b at 8bit quantization and 262k context. It takes 53gb of vram on my system.
jnovek 15 hours ago [-]
I think that’s only true for MoE models. A dense model like 3.6 27b will require more (plus a KV store).
bityard 15 hours ago [-]
No, even MoE models need to fit into (V)RAM. MoE has faster inference because only a subset of layers are used to predict the next token, but the set of layers used changes with every token.
sanderjd 15 hours ago [-]
Yeah seems to me like the mac studios with the unified memory architecture are genuinely good bang for the buck at the moment, because of this memory size consideration?
angoragoats 11 hours ago [-]
So buy two.
iagooar 15 hours ago [-]
My problem is I won't accept anything lower than the 96GB the RTX Pro 6000 Blackwell has. My dream is a workstation with 2x Pro 6000 to run DeepSeek v4 Flash comfortably, possibly qwen 3.6 / ornith on turbo speed.
But man, I have never purchased a computer which is more expensive than a decent family car.
d0gsg0w00f 9 hours ago [-]
I had this dream too. My 2xDGX Sparks arrive in my reality on Monday.
jnovek 15 hours ago [-]
An M1 Ultra has 800gbps unified memory. It’s nothing to do with Apple, it’s their microarchitecture. They’re just about the only game in town with high-bandwidth memory if you want >24GB (for less than $10k, anyway).
murderfs 13 hours ago [-]
A 5090 gets you 32GB with 1.8 TB/s of memory bandwidth for ~$4k, RTX A6000 gets you 48GB at 768 GB/s for ~$3.5k, 2x 3090 gets you 48GB for $2000 or so, and if you're willing to go into the wilderness, there are much cheaper options like the AMD MI50.
jtbaker 9 hours ago [-]
The RTX 5000 Pro 72GB seems like kind of a sleeper to me, and sips < 300W of power, approx 1/2 that of its big bro the RTX 6000. Kind of dream about installing it in a 10" rack, it seems like it might be able to work? @jeffgeerling you out there?
Yeah this is just not the case at all; a 5090 or any of the recent nvidia workstation cards all fit this criteria.
Also, while memory bandwidth is important, it isn’t the only consideration. Apple’s architecture has memory bandwidth equal to a mid-range consumer GPU, but its GPU speed is much, much worse than, say, a 5080 or 5090. This translates into e.g. much slower time to first token on Mac systems compared to dedicated GPUs.
dheera 14 hours ago [-]
32GB V100
t0mpr1c3 7 hours ago [-]
Meh. I'd rather have 2x RTX 5060 Ti.
PeterStuer 4 hours ago [-]
No laptop is thermally designed to handle sustained high workloads. The whole point of a laptop is to keep it thin, quiet and light, the exact opposite of what cooling needs.
xd1936 15 hours ago [-]
Apple does not currently sell a Mac Mini with 64GB RAM.
iagooar 15 hours ago [-]
Get a 2nd hand one. I was lucky enough to get a new one first, last week I get a 2nd hand one in order to run one of my Hermes minions at work.
stevenaenns 15 hours ago [-]
how many tokens/s generation do you get?
iagooar 15 hours ago [-]
Ballpark 25-30 tok / sec on the Mac Mini Pro M4 + qwen3.6 35B. The generation itself is good, prefill is known to be slow on any Apple M-chip architecture. It is really decent.
angoragoats 11 hours ago [-]
They did until 4 days ago, so I’d forgive the OP for not knowing that the option was discontinued.
toephu2 14 hours ago [-]
I just checked apple's website and configured them:
Mac Studio:
Ships:
16–18 weeks
Mac mini:
Ships:
10–12 weeks
Arubis 16 hours ago [-]
Don't forget that your OLED screen will start to color-shift as the heat cooks the panel!
manmal 16 hours ago [-]
There is no MacBook Pro with OLED (yet).
Arubis 16 hours ago [-]
My mistake on tech; it’s a beautiful display. Alas, I speak from experience when it comes to the thermally-caused color shift. Hopefully it’ll be AppleCare covered.
kamranjon 3 hours ago [-]
I completely disagree, it is probably the best platform currently for this - and the way I run it is as a server with tailscale accessible from my coding machine (same as you suggest here) - the difference is that you can stop the server, use it as a video editing rig on a whim, or use it for training instead of inference (yes PyTorch has caught up and Metal is a great platform for this now).
It’s just so flexible, and I even use it in agent mode (ds4) directly on the machine as well sometimes (it’s really not that bad, I’m often running inference for small side projects on my couch), if there is another machine that can do all of this and still function as one of the more ergonomic, well built, and compact laptops out there, I’d love to hear what it is cause I’d likely be interested!
seunosewa 8 hours ago [-]
You can get some work done by using low power mode even when plugged in, and making your fan start running when the temps just start to rise (maybe 40 degrees. Use a third party fan app to set it up
cosmic_cheese 15 hours ago [-]
They really need to release those updated Studios already.
DennisP 14 hours ago [-]
Since they've reduced the max RAM on current Studios from 512GB to 96GB, I'm not holding my breath.
14 hours ago [-]
Matl 15 hours ago [-]
> If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk.
Can confirm this works rather well, most things that integrate with LLMs, (agents, editors), support providing a remote (LAN) URL for Ollama, LM Studio etc.
But you do need a fast LAN connection, otherwise working with agents will be a pain.
Retr0id 15 hours ago [-]
> you do need a fast LAN connection
Huh, how come? Low-latency I can understand, but I was under the impression that token throughputs were still barely exceeding dialup bandwidths.
iagooar 15 hours ago [-]
I disagree LAN connection is the bottleneck. I do even work with it remotely via Tailscale on shaky hotel WIFI and it works fine (or as fine as any other API-based model).
cmgbhm 15 hours ago [-]
A local model on my m2 made me come to that conclusion but I definitely was having “that config is $2k more” regret. Thanks for posting this!
SkitterKherpi 16 hours ago [-]
I am considering getting something like NVIDIA's RTX Spark when it comes out, though even that will be limited to 128GB.
jazzyjackson 16 hours ago [-]
They’ll sell you a bundle, either a pair or a quartet so you can have 256 or 512GB over a 400GB/s network link
I can’t figure out when it makes sense to pay 10k up front for a quantized Llama 3.1 but it’s an interesting option
But yeah, there's a bit of a dearth of models that could fully utilize memory in the 128-256GB bracket at the moment. But things move so fast in this space, I wouldn't base my decision on a generation of models that's just a few months old.
rnxrx 14 hours ago [-]
It depends on what's meant by "fully utilized" but fp8 quants of Nemotron 3 Super, the latest Minimax, Cohere A+ and the Mistral small and (especially) medium variants all sit in that 128-256 category, especially with full context or even moderate concurrency. In fact, in a 192GB environment I work with (Hopper GPUs, fwiw) I was pushed into using 4-bit quants with a couple of those to get the model working with a reasonable context window (..but 256 would have rocked out).
girvo 14 hours ago [-]
Not Llama 3.1, but Step 3.7 Flash is one of the few new high quality models in this size bracket. DeepSeek v4 Flash too
SkitterKherpi 15 hours ago [-]
10k is rather a lot yes. For LLMs you can use a lot of tokens with 10k with less hassle without the machine (and also it's not like electricity is free), but for some other things like video models 10k would get burned very fast. I am looking for something more in the 5k range though.
awesomeusername 16 hours ago [-]
It's out, I'm daily driving one. It's great
SkitterKherpi 15 hours ago [-]
I assume you have the dgx spark? At this point I am not 100% on the difference other than Linux and Windows. The RTX spark should come around Q4, unless I am mistaken.
vikingcat 15 hours ago [-]
Are you running a local LLM on it? Did you buy a whole laptop?
bilekas 13 hours ago [-]
Can you define "serious programming"? Because I use it to implement things I COULD go and figure out like algorithms or test generation or evaluations etc, the "serious" programming I tend to do myself. That is what I'm paid for.
overgard 11 hours ago [-]
Serious programming is using as many agents and loops as possible because anthropic needs you to spend more on tokens
pistoriusp 4 hours ago [-]
Mac Mini in the rack and a Neo in the lap.
stared 13 hours ago [-]
Yes, it gets really hot really fast.
As much as I was tempted to use it on longer projects, I had some reservations about whether it would put too much strain on my MacBook.
jarjoura 15 hours ago [-]
TBF, I just recently picked up this same model, and it's reminding me of the last gen Intel i9 MBP. Just visiting any non-basic website spins up the fans and battery life isn't great either. Yes, this thing is fast, but damn it gets hot just using it for normal tasks.
Still, I don't agree. I think this machine is meant to use local models. You just have to wear pants if you want to keep it directly on your lap. I rarely use it that way anyway. I prefer it plugged into an external display and comfortably sitting on a laptop stand.
KingMob 11 minutes ago [-]
As someone who just upgraded a month ago from the last Intel MBP to a new base M5 MBP, I think your laptop might have a problem. I'm definitely not experiencing any of what you describe when doing normal tasks.
y1n0 13 hours ago [-]
Is there something wrong with the m5s? I have an m4 pro and I’ve never heard the fan on it. I don’t do much with local llms, but I naturally use the web and play games (windows games at that with wine/crossover).
inventor7777 13 hours ago [-]
That seems very unusual for modern Apple Silicon. Our family has:
- M3 Pro MacBook Pro 36GB
- M2 Pro MacBook Pro 16GB
- Mac Studio M4 Max 48GB
and I have not heard the fans on any of them with normal use. The only time I've ever heard automatic fans was when I was using a local 12B model on the M3 MacBook Pro, and when running 70B models on the Studio.
You should consider checking Activity Monitor and making sure that the usual suspects are not causing issues with sustained high CPU. And you can use an app like [Stats](https://mac-stats.com) if you want to see that info while actively using the computer.
15 hours ago [-]
lowbloodsugar 11 hours ago [-]
This is not normal. You have a broken Mac. Make an appointment.
throwaway240403 10 hours ago [-]
No, buy a framework desktop.
verdverm 16 hours ago [-]
Get an OEM Spark instead, mine are silent and can fit 2 qwen/gemma at 8bit or give you room for a bunch of other, smaller models (embed,rerank,etc)
ako 7 hours ago [-]
You could use an external keyboard?
seanmcdirmid 15 hours ago [-]
What sort of M5 are you running? A max? MacMini's don't offer max CPUs.
iagooar 15 hours ago [-]
M5 Max. But I also have a MacMini M4 Pro 64GB. Qwen3.6 runs on the M4 just fine - sure the M5 is at least 2x the speed. If Apple launches a MacMini with an M5, I will be the 1st one to get it.
kristianp 15 hours ago [-]
You're only going to get an incremental improvement with an M5 Pro mini compared to an M4 Pro mini. Memory bandwidth goes from 273GB/s to 307GB/s, about 12.5% improvement for LLMs.
freehorse 13 hours ago [-]
M5's have the neural accelarator that boosts prefill speed a lot. But token generation itself will not change that much, that's true.
iagooar 15 hours ago [-]
I thought they might ship an M5 Max version, but you are probably right.
codazoda 14 hours ago [-]
Today the Mini tops out at 48GB. Gotta go to the Studio to get 64GB.
aurareturn 14 hours ago [-]
Don't buy the Mini or Studio. Both have the M4 which lacks the Neural Accelerators, making prompt processing ~3-4x slower.
mortenjorck 14 hours ago [-]
I assume those don't just work automatically with an off-the-shelf gguf. What do you need in your local inference stack to take advantage of M5's neural accelerators?
wren6991 4 hours ago [-]
Apple muddied the waters by calling them "neural accelerators" but it seems like what they actually added in the M5 generation is tensor instructions for the existing GPU cores. It's not a separate accelerator like the ANE.
llama.cpp's Metal backend does use them when they're available.
aurareturn 14 hours ago [-]
They do work with llama.cpp and MLX automatically.
2Gkashmiri 6 hours ago [-]
Apple Mac Studio (M3 Ultra Chip/28 CPU, 60 GPU/96 GB/1 TB
How is this config?
busymom0 16 hours ago [-]
Also look into buying the Mac mini refurbished from Apple. They come almost brand new, same warranty and you save money.
Fr0styMatt88 15 hours ago [-]
What kind of speed in tk/s do you get with the MacBook?
iagooar 15 hours ago [-]
qwen3.6 27B MLX 8bit -> 15 tok / sec. A bit slow but it is a delightful model to use, and smart too.
qwen3.6 35B A3B MLX 8bit -> 85-90 tok / sec! It is impressively fast and roughly 90% as good as 27B (in my opinion).
samtheprogram 14 hours ago [-]
Are you sure you're running it with MLX?
Abishek_Muthian 8 hours ago [-]
>Sure you can use it in clamshell mode
Wouldn't this damage the MBP display?
My RTX laptop has air intake underneath the keyboard and clamshell mode is surely a recipe for disaster; I've taken numerous measures to ensure that the laptop doesn't stay awake when the lid is down.
singpolyma3 14 hours ago [-]
With 128 you can run 122b ;)
julianlam 9 hours ago [-]
Very surprised an Apple device can have some atrocious ventilation design.
I'm running this model on a Framework 13 and the chassis barely heats up at all while running full tilt.
2Gkashmiri 10 hours ago [-]
How is Mac studio 32gb or 96 gb ram one?
gigatexal 13 hours ago [-]
Same. And your M5 has acceleration that I don’t with my M3 max. I can’t do anything local it gets hotter than an Intel Mac trying to run docker from back in the day.
dzonga 14 hours ago [-]
why not buy one of those "a.i" desktop kits being sold by Nvidia/AMD and just connect to them via network ?
to me that's cheaper than paying an LLM provider such as Anthropic spreading FUD around open weight models & more sustainable too.
Gigachad 13 hours ago [-]
It's still currently way cheaper to pay open router to run qwen for you. And you have the option to use much bigger better models like DeepSeek v4 flash.
zxexz 8 hours ago [-]
[dead]
ActorNightly 15 hours ago [-]
>If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement
Im sorry, but its time to start calling Apple sycophants out. Stop trying to push your tech jewelry on other people. You only buy those computers because they are Apple, you don't know anything about computing or running LLMs, you don't do any real work, so you should probably not give advice on what to buy.
A single 3090 will run Qwen3.6 27b fine, and its VRAM speed is twice of what the best Mac has.
And the build will be cheaper. Decent CPU/Motherboard, 32gb of DDR4 ram, an SSD and a Single 3090 should run max about $4grand. Mac m4 mini is 6grand.
Then, when gpu prices come down (or you find one on a deal), you can upgrade the card, or stick a second one, and benefit from more speed. You can't do that with the trash Apple produces.
Flag me if you want, I don't care. Its embarrasing for the tech community to give advice this bad.
iagooar 14 hours ago [-]
I am not going to flag you, I am much OK with having good arguments.
I just purchased a Mac Mini M4 Pro 64GB for $3k - 2nd hand of course.
I am not a hater of Nvidia and I am planning on building a workstation based on RTX cards. You clearly do not seem to understand how convenient the MacMini actually IS - the form factor, how quiet it is, how durable it is, how well it integrates with other Macs, how well it works as a bridge to a personal agent like Hermes (integration with iMessage, Calendar, Reminders, iCloud, etc).
I am pretty sure I know a thing or two about computing, I have been in the trenches for many, many years and I have had machines of all kinds, shapes and colors. It just so happens that Macs are very capable, very convenient machines that happen to work great in the era of LLMs, too.
But you do you.
lowbloodsugar 11 hours ago [-]
If you are in Apple ecosystem, and have reasons to own one besides inference, then buying a used Mac mini pro isn’t such a bad idea. I just bought a regular Mac mini just to provide a nice front end to my Ubuntu workstation. But if all you want is inference, then a cheap PC with a 32gb 9700 (or two!) in it is far cheaper. This specific thread was about someone who already has a MacBook. A cheap PC and GPU pairs well. Or a spark: slower but more memory. Or fuck it! Get a 5090 or a 6000!
ActorNightly 14 hours ago [-]
>You clearly do not seem to understand how convenient the MacMini actually IS - the form factor, how quiet it is, how durable it is, how well it integrates with other Macs, how well it works as a bridge to a personal agent like Hermes (integration with iMessage, Calendar, Reminders, iCloud, etc).
If you are that locked in to Apple, its pretty easy to buy a used Mac Mini older gen for all the non AI stuff.
But this is a discussion about inference. Buying a Mac anything for any sort of local inference is a COLOSSAL waste of money.
bensyverson 18 hours ago [-]
The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]
Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.
The maths there is pretty undeniable, but it is not where I'd make the split. Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.
I don't know how much serious hands-free agentic coding I will ever do on my MacBook alone, but I do know that I would not have got so far into understanding this without tinkering with local models, llama.cpp, LM Studio, and LM Studio and all that.
I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.
Until, that is, I could poke around with setting it up on my own (secondhand) machine, watching the API calls, understanding some of the terminology. I didn't even buy the machine for that; it's just adequate to the task.
The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.
pizza234 17 hours ago [-]
> Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.
Cloud models are (much) faster, they don't consume so much power/generate heat, they have much bigger (LLM) context, they're much more precise and they have a much wider (engineering) context of the given problem.
Except privacy and use cases that are blocked by cloud models (e.g. reverse engineering), local LLMs are currently an expensive toy.
When I try to program with a local LLM (I'm on a 32/128 GB system), I end up wasting time compared to a cloud LLM.
dofm 17 hours ago [-]
Again, I would not argue against any of this.
And I can't say that I won't switch to openrouter (even just for the same models) at some point.
But one of the things I have found about my own process learning is that some lessons only come to you when you make yourself available to them. And if that means doing things the difficult way, that is what you should do.
wahnfrieden 16 hours ago [-]
Difficult... and wastefully expensive
sanderjd 15 hours ago [-]
Seems like an investment into building expertise, which is likely to have high ROI in the future, rather than a wasteful cost.
dofm 16 hours ago [-]
I mean, it's a (secondhand) computer I bought for other tasks (processing very large photos, compiling large apps quickly). It's running all the time. It can also run LLMs when I want to.
The rest of my life is ultra-frugal so I am relaxed about this.
_puk 16 hours ago [-]
Don't bite. You're right.
Having spent a good weekend learning how to perform latent-steering through playing with pytorch and a local Gemma4 model, there is no way I could have groked any of that in the the way I did without hands on time.
This is on an M3 Max 36GB I've had for a couple of years. No further outlay needed.
monkmartinez 15 hours ago [-]
My thinking is totally aligned with yours, perhaps its because I am trying to do a second act at almost 50 from blue-collar to white collar office work. I have no formal degree, but I have been hobby programming for 20 years. I have made a habit of "letting myself be available to all lessons"... the localllama group has made this journey really fun if nothing else. I have learned an ABSOLUTE ton from this era!
dofm 15 hours ago [-]
I have been contemplating a move in the opposite direction because I have just been exhausted and depressed, so for me, really learning this stuff this way has been about managing those feelings, about a sense of pride and ownership of my processes.
I don't know if it has changed my mind about a career change but as I am sure you can understand, I no longer feel like I am running away defeated.
My very best wishes to you :-)
moffkalast 14 hours ago [-]
People pay thousands for model trains, everyone needs a hobby.
dofm 13 hours ago [-]
Training models vs modelling trains
moffkalast 3 hours ago [-]
Ah yes, the EMD0E9-30B-Union-Pacific.gguf
sanderjd 15 hours ago [-]
> currently
The interesting question is whether that gap will narrow, and if so, how much, and on what timescale.
The exact answer to this question is not knowable, but if you are the kind of person who comes to a site called "hacker news", and you think there is a nonzero chance that the answer is that yes, the gap will narrow and this won't always be an expensive toy, then now seems like a pretty great time to get in the game and start exploring the capabilities.
Abishek_Muthian 8 hours ago [-]
I agree completely. I think local AI is best limited to purpose built SLMs; all this craze around running quantized coding LLMs has taken the attention off SLMs.
AlpacaJones 17 hours ago [-]
The key word there is 'currently'.
smt88 16 hours ago [-]
Economies of scale are a fact of nature and aren’t going to be subverted in the future by even the most advanced local models
kennywinker 16 hours ago [-]
Which is of course why, if you want to render 3d scenes to play a video game, you have to rent time on a mainframe system. I don’t see that changing ever - it’s just economies of scale!
(sarcasm, btw)
Gigachad 13 hours ago [-]
The economies of scale gains are lost because you still have a middle man hosting provider who wants to profit too.
Over the long term it's always been better to buy than to rent, even if the renting option is technically more efficient on the GPUs, you don't have to pay some hosting providers profit margin.
Dylan16807 3 hours ago [-]
If the hosting provider can fit 1000 users onto 100 GPUs, that's enough for quite nice margins and being far cheaper than buying your own GPU.
And for users that aren't running multiple agents 24/7, you should be able to fit a good user:GPU ratio.
Gigachad 3 hours ago [-]
Maybe. The economics work out better than for game streaming. When I looked in to game streaming it ended up being cheaper to buy over the long term. Though games tend to use 100% of the hardware for hours, and they tend to all be used at the same hours of the day and have to be hyper local for latency reasons. Something LLMs don’t have issues with.
oceanplexian 16 hours ago [-]
Things can get both more expensive and cheaper at scale, hence the term.
For example (and relevant to AI) I can generate electricity on my roof at $0.20-25/kWh, batteries included. In California the electric utility can’t offer it cheaper than $0.30-0.50/kWh. Therefore at scale, electricity is actually more expensive.
There are many such examples.
Dylan16807 3 hours ago [-]
Apples and Oranges. The utility uses a weird conflated fee that combines the price of the electricity and the price of connecting your house to the grid. If they split it up your marginal price per kWh would be much less.
sanderjd 15 hours ago [-]
Yeah, I think the fallacy here is the conflation of scale and centralization.
Right now, there is way more scale in centralized AI than there is at the edge. But that could flip. I'd still probably put the probability that it will under 50%. But I'd also put it above zero!
sanderjd 15 hours ago [-]
... said the IBM executive to a young Bill Gates.
bogeholm 16 hours ago [-]
> Cloud models […] don't consume so much power/generate heat
I do realize the cloud is just someone else’s computer right? Power goes in, tokens and heat come out - just in another place
actionfromafar 15 hours ago [-]
The cloud computers produce more tokens per watt. That said, if you have a computer at home running 24/7 for other reasons and you also can use it for some LLM work, why not.
psychoslave 16 hours ago [-]
Anything done local will likely come at higher cost and at scale with less energy efficiency and commodity, with less possibility to fine tune engineer deeply on wider horizon of issues.
That's never the point of keeping local alternatives though.
dofm 16 hours ago [-]
Right.
For me this dates all the way back to installing Slackware 1.0 (0.99pl12!) on an offline 486SX rather than just using the internet-connected workstations in the lab.
Here, I already had a Mac that was powerful enough to run a local LLM, so now I do, because I can.
16 hours ago [-]
17 hours ago [-]
VerifiedReports 15 hours ago [-]
Exactly. The distinction between the various layers in "AI" systems is pretty vague to the newcomer. What is the "model" vs. the engine "running" it vs. weights?
I don't recall any previous tech stack that was barfed onto the scene with so little background or reference material, going from zero to endless undefined jargon... and no primer in sight.
For people who demand an understanding of their tools, it's a lot of work. I recognize the value of "AI" in performing the tasks I'd have to do manually; for example, keeping the data structures of my front- and back-ends in sync in a project. But do I want to interrupt my development and take weeks off to digest all of these tools?
And if I do, I want to run the show and fully understand it. And like you, I think that's best done locally.
Fr0styMatt88 15 hours ago [-]
The most unexpected thing for me was kind of philosophical in a ‘holy shit’ way.
Cloud models still feel ‘magic’, like you send a request off and get something back, like it’s something ‘special’. I used to joke that ChatGPT might be some kind of mechanical turk underneath.
Watching a model run local on your own machine hits different — you realise that yes, it IS just a computer program. Which for me actually makes me appreciate the leap we’ve made MORE, not less. From an information-theoretic point of view, LLMs really are something special.
The fact that they are just programs, that I’ve now experienced first-hand that they’re just programs, makes all those questions around consciousness and intelligence much more interesting.
dofm 15 hours ago [-]
Yep — it hasn't changed how I feel about what LLMs are capable of (and very much not capable of) but this visceral feeling is fascinating.
Like, just watching a computer I already owned act like ChatGPT with the wifi disconnected.
It was the first time I stopped feeling quite so helpless, somehow.
QuercusMax 15 hours ago [-]
Yeah, it's been fun for me running models (mostly Qwen 3.6 27B) on my 48GB M4 MacBook Pro. When i'm using it to run models, it's basically unusable for anything else - I actually do the work on my Macbook Neo. Took me a while to figure out why the models couldn't figure out how to make tool calls - because LMStudio by default uses a 32K input window, which is smaller than OpenCode's prompt, so half of the instructions were being pruned from the middle!
dofm 14 hours ago [-]
Yes — there is a setting for that isn't there. And as soon as you realise there's a setting for that, you have new knowledge.
Qwen barely needs any of Opencode's prompt, in my experience; I think I cut it down to about three general lines I found by googling. Mainly you need only a pre-amble to make sure that the plan mode, plan switch and build mode prompt fragments make sense.
Gemma 4 also needs almost nothing at all, which is fascinating, considering it is not a coding-specialist model. It just seems to be who you need it to be when you ask.
hypfer 2 hours ago [-]
What are those 3 lines you've cut it down to?
QuercusMax 7 hours ago [-]
[dead]
ricardobayes 15 hours ago [-]
For the most part you can just download LM Studio and go from there. It provides a chat interface and an easy-to-use interface to browse, load and use LLM models.
The engine: it is abstracted away by LM Studio, if you want to dig deep it's llama.cpp as the runtime. Weights are the files what you download, they are the models for practical purposes.
dofm 15 hours ago [-]
I definitely would recommend LM Studio as a learning environment, because it surfaces a bunch of things in relatively clear-minded ways. I am very grateful for it.
codazoda 16 hours ago [-]
I agree with the learning aspect, but I have another motivation. I suspect that closed models might become too expensive to run for personal hobbyist use. I’ve been planning to buy a 64GB machine just to allow the limited local models this enables.
bpye 5 hours ago [-]
> The maths there is pretty undeniable, but it is not where I'd make the split. Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.
Seems like a GPU with 12GB+ VRAM is going to be a much more affordable way to achieve that? Even a B580 should get reasonable perf there.
dofm 5 hours ago [-]
No idea. I am a Mac guy, have been for a very long time. I buy them secondhand as a rule.
I guess I would build a powerful home LLM server if I was convinced I really needed one for my purposes for some agentic application or other. At the moment I'd prefer to ride this out with a machine that is also an excellent Mac.
ehnto 9 hours ago [-]
It's also great to have capability to run local models for more brute force tasks. Because you can change the system prompt, you can get local LLMs to do all kinds of high volume tasks without burning through tokens on a hosted model.
Just one example, I needed a bunch of images tagged and organised, with a local vision capable model I could pretty easily set that up and leave it running overnight.
I already had the GPU and memory for gaming, so it was at no cost for me to start running local models. But I feel the long term writing is on the wall, local models will only make more and more sense as they get better and more efficient.
ricardobayes 15 hours ago [-]
I'd say give it some time for the dust to settle. This field badly needs standardized benchmarks even before the conversation around model goodness can start.
ddalex 17 hours ago [-]
I just got Claude to download and install all the models and servers and agents and prepare all the launch scripts for me... no need to learn, just ask it to do it for you
dofm 17 hours ago [-]
Right, but I am a middle-aged bloke who is experiencing existential angst about whether I can carry on in this industry.
I have a pretty deep, maybe paranoid need to be confident I have an intrinsic understanding, and I have found in my life that lessons come to you when you make yourself open to learning.
So I need to build on top of what I know, taking as much of the hard way as I can bear to take at any one time — it has to be not quite difficult enough to put me off.
I can't really explain what I have learned this way that is different, but I feel it in a way that I wouldn't if I'd simply pushed a button.
For the same reason, I have a really basic 3D printer that I've set up myself, set up Klipper, configured how I want it, learned how to calibrate, all that. And now I can say that I feel I have an understanding of 3D printing. I could hold my head above water in a discussion with a real expert, maybe find work in an adjacent field where my insights would keep me grounded.
I can afford a really good printer that has all that set up, and more, has no problems. But I'd just be someone who has a 3D printer.
(Also who am I kidding about the existence of a printer with no problems)
greyskull 15 hours ago [-]
This really resonates with me, and I'm only a decade and change into my career. I use claude a lot day to day. I try to use it sensibly, making me more productive and produce better work. I'm also trying not to lose understanding along the way. I want to be able to actually talk to the conclusions I'm reaching.
I have colleagues that seem perfectly content to delegate too much to the agents, and it saddens me. It feels like there will be swaths of engineers that didn't train some of the critical thinking skills that I take for granted.
I certainly see it in slack discourse around anything more complicated than a feature implementation. Maybe I'm just cynical. Time will tell, I suppose.
bluGill 14 hours ago [-]
You will not live enough to learn everything. Eventually you have to say "I could figure [something] out but I won't take that time." Most things are that way - I probably could learn brain surgery (I used this example because it has a reputation of being a very difficult course of study). I would like to make a lathe from scratch - but I don't have easy access to enough iron ore to get started - even if I start from scrap metal, I probably wouldn't spend months making my own surface plate (...) and so I own a factory made lathe instead.
That is why I'm content to delegate to agents - I have more code/features I want to write than I have time to debug (writing is the easy part).
sanderjd 15 hours ago [-]
For me (about halfway between you and dofm in my career by your own statements in this thread), it's a dream at the moment. I can delegate all the tedious stuff that I've done "the hard way" a thousand times already and feel I have very little of value remaining to learn, so that I can spend more time on all the things that are actually new and thus much more interesting.
greyskull 14 hours ago [-]
It's been a great multiplier for me in similar ways. The "dreamiest" thing has been that it has freed up time that I would normally have spent doing sprint work, to work on things that just don't make the cut until it's bad enough to deprioritize other work.
Over the last few months, I've been digging into performance problems with a high throughput service that my team owns. I started working on the problems in my own time, put out short and medium term improvements that legitimately avoided operational issues, and started developing an alternate architecture that should meaningfully address the problems for the long term.
I've learned new things and made improvements that probably wouldn't have ever gone in otherwise.
sanderjd 14 hours ago [-]
Yes exactly. There is a narrative that it's driving everything toward low quality slop, but in my own work it's exactly the opposite. We're doing work on quality and performance that we never would have gotten to in the past.
I've spent my whole career being frustrated by the pile of low severity bugs and performance issues that "I could fix that if I could only justify putting a couple hours into it!". And now I can just fix all those. Nobody is going to question my use of time to write prompts and do code reviews of those things, when I can to my "real" work simultaneously.
sanderjd 15 hours ago [-]
Yeah, this is just the engineer's mindset. It's not surprising that this is a popular view here, even if it is not (and does not need to be) the mainstream perspective.
greyskull 14 hours ago [-]
> mainstream
What does "mainstream" refer to when we're talking about software development and LLMs? As opposed to "engineers".
sanderjd 13 hours ago [-]
This is a very fair question! When I wrote this comment, I was definitely thinking of the "real" mainstream, i.e. users of llm chat to generate text, not software engineers.
But I think there is (and has always been) also a distinction between the "mainstream" of software developers vs people who are working on new tools and capabilities to be used by that "mainstream".
IMO it is certainly true that the most efficient and cost effective was to do "mainstream" software delivery at the moment is hosted frontier models. But for people thinking about "what's next?", it makes a ton of sense to be exploring different models in anticipation of a possible (but certainly not inevitable) sea change.
swiftcoder 16 hours ago [-]
I don't necessarily think your answer is wrong for all people, but if you work in software... how do you plan to differentiate yourself from everyone else out there, if the depth of your understanding is "Claude can do it for me"?
dofm 16 hours ago [-]
This ultimately is the discussion I am here for.
I mean one of the things I use a local LLM for, because I can, is to generate starter documentation. But I ask it to — I want it to give me overviews, plans, all that. It can make something bespoke for me.
I guess I could also ask it to do the work. But where do you draw the line?
The universal labour-saving device is the great provocation of the next 100 years I think, and both Star Trek and Wall-E have grappled with it.
coldtea 17 hours ago [-]
>no need to learn, just ask it to do it for you
And that's how skills die.
ddalex 5 hours ago [-]
And why is this skill important, if a machine can do it ? What's the last time you ploughed your field with oxen ?
CamperBob2 16 hours ago [-]
When's the last time you shoed a horse?
The reason I delegate so much of local LLM installation and administration to Claude Code is simply because there's no point learning practical things that will work completely differently in a couple of years, or in memorizing procedures that I'll forget long before I need to perform them again.
No longer having to sweat all the details is a Good Thing, not a Bad Thing.
dofm 16 hours ago [-]
I am not sure I disagree, and I certainly don't mean to disagree very fervently.
But I think if you want to really learn to ride well, understand horses well, there might be some benefit in learning how to shoe a horse. At some level it should never only be someone else's job.
verdverm 16 hours ago [-]
At the same time, most people can drive without understanding how a car works.
coldtea 14 hours ago [-]
Yes, and they're all the worse, more at the mercy of car companies and mechanics, and less aware of the world they live and operate in, for it...
saganus 16 hours ago [-]
You actually do need some understanding of how a car works, no?
For example, you need to know it uses gasoline (or diesel), it requires oil changes every certain amount of time, break pad replacement, etc.
You also probably need to know that you can't operate cars over a certain amount of water, that you need a driver's license, stopping at red lights, etc.
Sure, you might not need to be a mechanic, but that's far from not understanding how a car works, which to me sounds similar to knowing how to shoe a horse, which is different than being a horse vet.
WickyNilliams 16 hours ago [-]
If I worked with horses for 8 hours a day I imagine the answer would be "recently"
psychoslave 16 hours ago [-]
Having to shoe a horse never was a general skill.
Maybe a more apt analogy would be a skill like making fire without a lighter.
sanderjd 15 hours ago [-]
Writing software never was never a general skill either though? Or am I misunderstanding your point?
psychoslave 13 hours ago [-]
Yes, LLM are thrown through pretty much everyone digital life whether they like it or not, it's not just devs. It might even unlock exploring things that need code that average user wouldn't have dared to do before.
coldtea 14 hours ago [-]
>When's the last time you shoed a horse?
That skill died too, so what's your point?
CamperBob2 13 hours ago [-]
Skills sometimes do that. What's your point?
coldtea 12 hours ago [-]
Skills are good. They shouldn't do that.
charcircuit 16 hours ago [-]
Except with AI models it's possible to make a backup of them creating a permanent artifact of a skill.
sorokod 17 hours ago [-]
Then what is the point of ddalex?
dofm 16 hours ago [-]
I think if you really don't feel the need to know the "why" of everything, sometimes this might be the right approach. It is quick, pragmatic, gets you started.
Maybe my biggest problem with the world of agentic AI, and the reason I am putting myself through learning it the way I am, is that the need to know the "why" of everything is so fundamental to me, that I don't know if there is any point to me without it.
So this is really the only way I know how to proceed.
sanderjd 15 hours ago [-]
To me, this is just a question of specialization. Not everyone needs to be a "I understand how the system actually works" person. In fact, not many people need to be that person. But every system does need some of that person to exist!
And we happen to be discussing this on a forum where the type of people who will be the specialists for the kinda of systems we're discussing are likely to gather.
I'd be surprised if in my casual discussions out in the real world, I were to run into a lot of people who care exactly how all this works, to the extent that they want to invest significant money into hardware that allows them to run things themselves and dig into what's actually going on. But I'm not at all surprised to come across such people here! (Indeed, it would be very disappointed if I didn't!)
nazgul17 7 minutes ago [-]
I think the more you know of how (many) things work, the slightly better you'll be at using them. From dishwashers to CPUs, from car engines to watercolours, from guitars to kitchen knives... You get the gist. Once you internalize a model of the thing, it becomes closer to an extension of you than a tool. You drive it better and with less friction.
kdkdjduxnd 17 hours ago [-]
[dead]
rusk 17 hours ago [-]
> I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled.
I found LM studio to be a nice starting point. Frindlier and more featureful than Ollama and not as intimidating as llama.cpp (though you will want to use that eventually)
dofm 17 hours ago [-]
LM Studio is also nice because of the way the interface explains things; parameters have explanations and hints. It has been designed by people who really care about making it understandable.
I tried Ollama but I've settled on Unsloth Studio generally; once things really settle down I'll just run the llama-server UI, which is pretty nice.
A friend is tinkering with LLMs for amusement on a 16GB Raspberry Pi 5, and when I explained that llama.cpp now had a typical web chat interface he was so happy — it's amazing what the "table stakes" are now.
not_kurt_godel 15 hours ago [-]
> Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.
Agree having a powerful machine is really worth it in general for professionals, but strong disagree that running local LLMs has anything to do with it. It's hard enough as it is getting a good ROI on your time/money prompting/wrangling with frontier models. IMO leaning on the comparatively limited capabilities of local LLMs is best avoided in favor of keeping your own personal coding skills fresh and continuing to learn new ones.
dofm 15 hours ago [-]
I'm not that bothered about my coding skills, which are fine, and pretty up-to-date considering I'm now an old bloke. I am bothered about building an instinctive understanding that helps me deal with my anxieties and decide whether I want to carry on with this working life or quit.
I needed to do this, this way, in my own time, to put my brain back together. It has worked for me, which is why I recommend it.
YMMV.
ricardobayes 15 hours ago [-]
Unfortunately the local llm bunch is not the most emphatetic one in my experience: you are somehow "expected" to immediately know all this stuff and god forbid you ask the wrong question. I've never seen or felt this level of bullying and weird vibes over tools and LLM models. "My setup works for you or beat it".
sanderjd 15 hours ago [-]
Where has that been your experience? My experience interacting with people about this is almost entirely in HN threads like this one, and I haven't found what you're saying here to be the case.
But if this is the case, as you say, it seems like a good opportunity to build a more welcoming set of entry points into this!
dofm 15 hours ago [-]
There's also a lot of cargo-cult stuff, isn't there? Especially in the Reddit groups. Just do XYZ. And people ask why and they are never around to explain. Because, perhaps, they can't.
(Very reminiscent of 3D printing, where you get a lot of very trivial advice poorly applied, which is an analogy I've now made several times.)
Several of the youtubers are pretty helpful, though; I watched half a dozen things and absorbed the broad pattern and then went for it.
Also I got a lot out of reading HN comments, which is why I am here; tucked away in the corners of these discussions are people who can help. Over time I hope I am one.
sanderjd 15 hours ago [-]
Continuing to learn new ones, like what?
To me, "how do contemporary AI systems work and interact with contemporary hardware and how can I best take advantage of their capabilities?" is the set of skills that are worth learning at this moment.
What else is there? New / additional programming languages? New / additional database systems? frameworks? orchestrators? cloud provider / infra tooling? architectural patterns?
I dunno, all of this seems really boring and "been there done that" to me at this moment in time!
not_kurt_godel 15 hours ago [-]
Yes, that all tracks, and all of those skills are worth maintaining and improving. Great to tinker with LLMs locally hands-on to learn, and having a powerful enough machine to enable that to a reasonable degree is just one of many reasons why it's worth it. I'm just saying that IMO "how can I best take advantage" lands firmly in the bucket of only cloud-hosted frontier models being worth my time. I would speculate that holds true for a large portion of the wider HN audience but YMMV of course.
sanderjd 15 hours ago [-]
Maybe. I felt this way a year ago and definitely two years ago. But now my sense is that it's played out at this point, and the valuable thing to build expertise on now - precisely because I think it's coming rather than here - is local / open weights / hybrid models and harnesses.
oceanplexian 16 hours ago [-]
Honestly your best bet is to buy a $20 Claude subscription, ask Claude to set it all up with Pi and llama.cpp and come back in 20 minutes after a cup of coffee. This is also a good idea because it will help set expectations of what a local model can do vs. a frontier model.
mullen 16 hours ago [-]
This is what I did after struggling to get llama.cpp working at a decent speed on my M1 Macbook. The secret is to very specific with your needs and targeted in what you are using llama.cpp for. Mine setup is just about strictly for qwen3-coder and now, I get a fairly decent speed out of it.
I also installed Cursor to check Claude and it all worked out well.
kristianp 8 hours ago [-]
Are you talking about Qwen3 Coder 30b a3b Instruct from August 2025, which is a non-reasoning model? Or the more recent "Qwen3 Coder Next" from Feb this year with 80b params, 3b active? I found Qwen3 coder next to be quite good on openrouter [1], but couldn't run it locally.
I don't know why we're even talking about Qwen3.6 for writing code when qwen3-coder exists. My experience is there's no contest. I'm using 30b with 96k context on a dedicated server.
fouc 8 hours ago [-]
For agentic workflows like tool use, editing codebases, multi-turn debugging?
cyanydeez 17 hours ago [-]
I've setup to local paradigms for local coding:
- opencode with it's webui
- deer-flow with it's research/powered front end
They both run websites so you don't have to baby sit them (eg, keep your mac open). I've build a pdf compressor over a few days by first having deer flow try and research the frameworks and pipeline. It stalls out because its not really a fluid programmer. Once it stalls out, I transferred it (manually for now) to opencode and it's refactoring it because it's just a collective bundle of sticks and it needs a lot of testing to tweak out the limited scop context. LLMs can't really hold large scopes (locally anyway, from what I've read from HN, it's possible with longer context).
It'll complete in a few days with maybe 3-4 hours of full attention interaction, but it's running 3x that without my attention. Obviously, if I paid more attention it'd run quicker, but since it's local, it's not pumping out large volumes of code, it's mostly looping over tests and capabilities as observed.
It's running Qwen3.6 35B MoE on a AMD 128GB strix halo. If I switched to the dense models, perhaps it'd be smarter, but the trade off seems to be much slower gen.
dofm 17 hours ago [-]
> - opencode with it's webui
Have you tried Paseo?
I have opencode in a VM, and the paseo daemon running in the VM, and then the Paseo Mac app. Really nice.
(You can also use the Opencode GUI to frame a remote opencode web interface)
c-hendricks 17 hours ago [-]
You can also just add OpenCode web as a PWA, if that's what you mean by "frame".
I'm gonna check out paseo, but am not looking forward to all the ram the agent needs + all the ram paseo needs
c-hendricks 12 hours ago [-]
Have checked out Paseo, not sure what it offers over opencode web though. Definitely seems great if you're using other harnesses, but it seems like all it has over opencode web is split views and native apps. Neither of those really matter to me, plus you lose some opencode goodies. The preview urls are a neat idea, but our dev servers at work are mostly port independent and required to be on a certain subdomain for auth.
bsder 13 hours ago [-]
> I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.
Hello, my brother, just know that you have a fellow passenger in life at the same age who thinks the same thing. I agree that the local stuff is helping my understanding a LOT.
However, my gut feel as someone who got to experience the TeleBomb after the DotBomb is that the obfuscation is INTENTIONAL--it's neither you nor your age. I remember asking people to explain to me what the OC-768 startup endgame was when roughly 10 OC-768 links could carry the world's traffic at the time--and everybody giving me blank looks. The AI Bubble has the EXACT same feel as the Telecom Bubble--just bigger.
What I really wish is that I could find a VPS-type provider where I could toss things into their NVIDIA/AMD machines for an hour or two. Alas, all of the providers seem to want massive paperwork and huge minimum purchases.
I can't wait for the bubble to pop so that we mere mortals can finally build with this stuff.
porphyra 18 hours ago [-]
You can also run Qwen 3.6 27B dense model on DGX Spark with comparable performance [1][2] for about $4000 (Asus Ascent GX10 is $3999 at various retailers).
In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10.
Alternatively you could run it on Strix Halo for $1,000 less, and while it may be slightly slower you won't have to deal with NVIDIA's shit on Linux and worrying about having to use their custom kernels or Ubuntu.
esperent 17 hours ago [-]
> 48GB of VRAM with, say, two 3090s
So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.
fluoridation 17 hours ago [-]
>Plus I assume it's considerably more effort to get it working.
Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.
lee_ars 14 hours ago [-]
The tweet you link shows "Qwen 3.6 35b NVFP4 - 256k ctx, 110 tok/s", but I'm getting only half that, around 50 tok/sec, on a DGX Spark with Qwen3.6-35B-A3B-NVFP4 (via vLLM) plus speculative decode w/EAGLE3. I'd be ecstatic to see 110 tok/sec and I wish they had some more sourcing for the exact config, because it's double what I'm getting.
edit - after actually reading the tweets (had to use xcancel) and visiting the source git repo, switching to MTP for speculative decode makes things a hell of a lot faster, and the abliterated model plus dflash makes it even faster! I'm now seeing 70-90 tok/sec for most stuff. I like!
porphyra 11 hours ago [-]
I think Atlas might also be slightly faster than vLLM:
The model they reference can be easily run with 24gb+ of VRAM, and there are other similar models capable of running easily on 16gb of VRAM. It's not like 128gb is a requirement here.
bitexploder 17 hours ago [-]
For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4, you could probably optimize it further. RAM is not a limitation but overall memory bandwidth. Q8 is slower. 35B A3B Qwen is quite speedy, but a little less accurate. With Qwen 3.6 27B dense I can squeeze a 9B parameter model and use that for fast analysis or code scanning while 27B is churning on a task in the background. It is tight, but totally reasonable.
The real sweet spot for Qwen 27B is getting it on something like a Dual 3090 system or some other config where it can blaze at 50-80 t/s and that costs well under 6K currently. It is a surprisingly capable model. Using something like GLM for orchestration, specs, task farming and then letting Qwen churn is relatively inexpensive.
Overall I recommend people try models of this class out using OpenCode and some for pay service to experiment with them and understand how they work. I find they are very useful.
Long term, I am convinced enough that if I wanted to use local models for any number of reasons I would be okay investing in a dual GPU box. The Mac is not fast enough for me and M5 Max is just too expensive relative to GPU linux box. Still, it is nice to have the models local ON the laptop and it is useful for what I care about locally.
aunty_helen 15 hours ago [-]
I was doing some benchmarking last night on 2 3090s. The systems but old but I’m seeing 11tks 27b, 15tks 35b MoE.
The limited context is problematic. I’m not exactly sure what it’s got available but hermes was hit and miss on a prospecting job.
It does seem to be doing useful work but it’s not API call level quality
coder543 12 hours ago [-]
> The systems but old but I’m seeing 11tks 27b, 15tks 35b MoE
If that's accurate, then you must be doing something wrong/weird. On a single RTX 3090, I'm seeing substantially higher performance. Dual GPU won't necessarily give a ton of performance improvement, but it shouldn't hurt performance.
With llama-bench, I just measured Qwen3.6-27B at 41 tok/s and Qwen3.6-35B-A3B at 153 tok/s on one RTX 3090. (Those results are without MTP. With MTP, I'm seeing about 65 to 70 tok/s for Qwen3.7-27B.)
I'm using the unsloth UD-Q4_K_XL quant. If you're using bf16 for some reason, that could explain the low performance and inability to have enough context despite having 48GB of VRAM, I guess, but... don't do that.
coder543 13 hours ago [-]
> For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4
Are you running with MTP enabled? I have seen some people on M5 hardware report 20+ t/s on Qwen3.6-27B using MTP... and I think that was a regular M5, not even M5 Pro.
bitexploder 12 hours ago [-]
Nope. MLX in LMStudio. The simplest config with zero tuning effort.
coder543 12 hours ago [-]
Unsloth Studio is also very low effort, and a lot better than LM Studio in my opinion. (Performance, compatibility with Gemma 4, actually open source, etc.)
CMay 16 hours ago [-]
At 24GB, Gemma 4 31B QAT will be better and give more concise answers. This post is mostly about unquantized results, so it's less relevant and I can't say much about as I haven't tested Qwen or Gemma via cloud API or unquantized locally. All I can say is locally, quantized in a 24GB scenario, Gemma 4 31B is better in my tests which are mostly reasoning or C programming related.
Gemma 4 is the only model series at this parameter scale I've seen correctly answer some of these. One of the answers even made me re-evaluate what I thought the correct answer was, which I did not expect.
When I look at the Artificial Analysis numbers, I can see that some things about Qwen 3.6 look inflated as a result of either metrics that weren't measured yet for Gemma 4 31B, or for metrics that just aren't going to be relevant in a lot of the essential tasks. In a lot of the relevant metrics, Gemma 4 is either better or on par.
Then once it's all quantized all those benchmark results will be hurt, and Gemma 4 QAT has better quantized performance. I think it's more competitive unquantized than people give it credit for and way better quantized than people give it credit for.
Qwen 3.6 clearly isn't legitimately bad and maybe it's quite nice at fp16, but it was a disaster quantized in a 24GB scenario by comparison.
thewebguyd 18 hours ago [-]
I'd go for at least 32GB+. It'll fit in 24GB but leaves you little to no room for context, and that's at 4-bit quantization.
If you want to run unquantized, you definitely need 128GB.
Catloafdev 17 hours ago [-]
Nobody runs unquantized, there's literally no reason to. Q8 would be the largest anyone actually runs on consumer hardware for inference.
14 hours ago [-]
bityard 14 hours ago [-]
Halving the precision of the weights is not a free lunch...
Catloafdev 12 hours ago [-]
Q8 is virtually lossless. The quantization is much more noticeable around Q4 and below. FP16->Q8 on consumer hardware is 2x the speed at ~99.99% the quality.
rvba 3 hours ago [-]
Any source that confirms the 99.99% quality?
bitexploder 17 hours ago [-]
It also comes down to inference speed, not "can I run this". 8-bit quant is quite a bit slower on an M5 Pro.
gchamonlive 17 hours ago [-]
[dead]
Numerlor 17 hours ago [-]
And if you go for actual GPUs it'll run much faster, I'd say 24gb may be pushing it for context, but my 5090 with 32GB VRAM is usually somewhere between 60 to 100 tok/s with mtp and 2-3k tok/s for prompt processing. I'm not sure what they cost now but it's definitely still quite far from the macbook, and there's also some other 32GB GPUs that are considerably more affordable
nok22kon 17 hours ago [-]
a computer with 24 GB VRAM is at least $3000
daemonologist 16 hours ago [-]
A 7900 XTX is about $850, and the rest of the computer basically just needs to boot Linux. You could easily build such a machine for $1500.
Even that isn't strictly necessary - you can get perfectly acceptable performance by splitting a model between multiple older 12 or 16 GB cards.
sleepyeldrazi 17 hours ago [-]
I can't speak for the US, but in Germany (where hardware is usually more expensive, not less), I got my 3090 3 months ago for 750 euro and have been running the iq4_nl 27B using q4 kv (which after recent patches in llama.cpp is in my xp indistinguishably accurate from q8 of f16) at full ctx, with MTP at 2, peaking around 70 t/s on small ctx, around 50 t/s when im around 64k and ends around 40 t/s near the cap. The rest of the PC is a 50 euro ddr3 16gb i5 4th gen box, absolutely nothing special. And this setup is often more useful than dsv4pro (and sometimes kimi, but not glm) for research and ML work.
danilocesar 16 hours ago [-]
I can't find a 3090 for less than 2k CADs (or 1200 eur). Is this the average price in Germany? It's pretty cheap.
sleepyeldrazi 2 hours ago [-]
I got it off kleinanzeigen, its a ebay-like site (but mostly 'pick it up yourself' instead of delivery). Looking at it right now, i do see multiple sales for 850-900. I did spot the 750 one after frequenting the site for a week or two, so it may be a bit of a 'better than average' deal, and it seems most are in the 1k euro range, but there are a handful available under.
As of writing this, it shows 24 offers between 700 and 950.
akman 15 hours ago [-]
I'm also curious, as this could pay for a trip out there, especially if buying for friends.
throw1234567891 17 hours ago [-]
But the tokens or credits are gone. MacBook stays. You can run other models on the same MacBook. What I read people burn every month on saas… for that money you break even on that MacBook in 5 months.
Edit: it’s not just “data privacy”, when you are using Claude, you are shipping EVERYTHING to Anthropic. It’s crazy.
wilsonnb3 16 hours ago [-]
Companies are already shipping everything to Microsoft or Google and 17 other companies, just the cost of doing business.
throw1234567891 16 hours ago [-]
Sure, but no one gets everything. Just that one.
DANmode 16 hours ago [-]
That’s at today-prices.
If the cost doubles, or 4x, which is seems to need to for them to go profitable, what then?
wahnfrieden 16 hours ago [-]
It's much slower, and often quantized
throw1234567891 2 hours ago [-]
Okay, and?
acchow 16 hours ago [-]
That $6700 is a $5000 upgrade over a base model Macbook Pro.
$5000 in US Treasuries (currently at 4.89%) yields $244.5/yr. That's more than enough to cover the annual Claude Pro subscription ($200/yr) which includes Claude Code with lots of Sonnet usage (far better than Qwen 3.6)
neonstatic 15 hours ago [-]
I think the argument isn't that local is cheaper - it's that local is doable and delivers unparalleled privacy.
iosjunkie 10 hours ago [-]
And your government can’t take it away on a Friday afternoon.
stymaar 17 hours ago [-]
> The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]
Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument.
Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.
boutell 17 hours ago [-]
That 3090 is going to burn 750W and it will still cap you at a 4 bit quant and ~48K context. Here's someone who worked through it:
Flies though (50-70tps is impressive for a model this smart)
I went through roughly the same process to get it working on my M2 Macbook Pro... at awful speeds of course, since models like this one are mostly bound by memory bandwidth.
stymaar 16 hours ago [-]
> That 3090 is going to burn 750W
The 3090's TPD is 350W, but given that LLM's token generation isn't compute bound, people usually undervolt these cards to reduce power consumption. IIRC you can get as low as 200-250W without any degradation. Caveat these figures are without speculative decoding and at batch size =1.
4chandaily 16 hours ago [-]
This is correct. I have (4) 3090s in my inference server, and they are each capped at 250w. I run Qwen 3.5 122B-A10 at about 45-50tok/s on this and am quite happy with it. At idle it draws around 95-105w for all four, which is a bit high, but tolerable.
hughw 15 hours ago [-]
My eyes glaze over reading all the AI produced verbiage.
I did find a few useful parameter settings I've already discovered using my single 3090 and ollama.
I'm just remarking that the LLMs overwhelm me with minutiae, especially as I'm working on code design. I frequently ask it to restate concisely, and that helps.
[edited to mention ollama as a nice alt]
nozzlegear 18 hours ago [-]
Just putting it out there: I run Qwen 3.6 on my M1 Mac Studio with 64gb. It's quantized and all that, but I agree with TFA: it's the sweet spot for local development right now.
dmayle 17 hours ago [-]
For that price you can put together a PC with 128GB of ram ($2000) and an RTX 5090 ($3600) and get 70-100 tokens per second instead of 45
montebicyclelo 17 hours ago [-]
Isn't the directionality important. I.e. it is currently possible to run useful / great models locally, but on high end machines; and in a few years we will likely be able to run even better models on standard machines.
organsnyder 18 hours ago [-]
I run Qwen 3.6 on my Framework Desktop 128GB, and it's very performant. I know Framework has had to raise the price since I preordered mine, but they're still well under half the cost of that Macbook.
andy99 17 hours ago [-]
I get ~55 Tok/s on my framework desktop with the 35B A3B q8 model, and so far am also very happy with the coding performance.
cyanydeez 17 hours ago [-]
did you upgrade to MTP?
imrehg 4 hours ago [-]
On the MoE versions of these models the MTP versions have only marginal benefit. In my trials the speed-up is <20% (not the ~2x that happens with some other setup/models) and usually more like 10%. Ie. something like 13 -> 15 token/s... on my device.
I still use the MTP version as it _feels_ slightly better quality, and because the unsloth quantizations I can get have more variety to fit into the various systems at hand... but that's not for the MTP aspect, unfortunately.
In the article they did have ~2x performance on the 27B (which might be something to retry, though on my Framework that would bring it from 5 -> 10 token/s so still "excrutiating" speed, probably).
YMMV for sure.
bityard 14 hours ago [-]
There are several variants of Qwen 3.6, the MoE models are performant on Strix Halo, but the 27B dense model (the one spoken about in TFA, and generally regarded as the best of the group in terms of quality) is not so performant: https://kyuz0.github.io/amd-strix-halo-toolboxes/
elorant 16 hours ago [-]
You can get an AMD Strix Halo with half that price even after hardware price adjustments. Besides you don't need 128GB of RAM to run a 27B model.
dannyw 18 hours ago [-]
I’m running the same model on a 48GB MBP with a q4 quant and it’s pretty decent. You definitely don’t 128GB. That’s the scale for 70B models at q8 or something.
dom96 17 hours ago [-]
I've been running it on my 48GB MBP too and it's not particularly great. Super slow and not near enough to the quality provided by even Claude Sonnet.
doodlesdev 17 hours ago [-]
How much does one of those cost in the US? Here in Brazil, your notebook is worth as much as a used Honda Fit, which seems absolutely insane. For comparison, the ThinkPad I'm currently running cost me 1/20 of how much this MBP costs here, leaving me with over $8.000 to spend with LLM inference (if I actually spent money with that).
dannyw 17 hours ago [-]
I purchased mine for approximately $4400 AUD before the price hikes. That unit is now ~$5100 AUD.
I use my MBP essentially as my workstation, it's almost always plugged in. I have a MBA (M4, 24GB RAM) that I picked up for ~A$1500 or so, and that's an amazing daily driver. I don't do local LLM inference on that unit, I can just hit my own APIs (via LM Studio) on the MBP over Tailscale.
DrammBA 7 hours ago [-]
> I’m running the same model on a 48GB MBP with a q4 quant and it’s pretty decent.
Context size?
shockembopper 11 hours ago [-]
I’ve got qwen3.6 27b running on my media server atm. Given that I built on top of what I already had, it didn’t cost me nearly that amount. I’ve been running 2x 5060 ti 16gbs, and when using text only and nvfp4, I can run the model with 200k context length and roughly 50-60 toks. It’s very good, and costed me about $800 after buying the gpus from microcenter.
pimeys 3 hours ago [-]
Yes. It is very expensive now. I'm still so so happy I decided last summer to bite the bullet and pre-ordered the Framework Desktop 128GB model.
I paid 2424 euros in total for this machine. And it can easily run the models discussed in the comments and in the article. It's tiny, and runs CachyOS like a champ. Over 4000 euros less than the price you listed.
I have a 1500 dollar machine that can run it at 50 tok/s (3 V100s)
Dig1t 17 hours ago [-]
How did you buy 3 V100's for $1500??
sixdimensional 6 hours ago [-]
Not OP and just guessing, but probably SXM2 GPU modules for the V100. Those can be acquired fairly inexpensively, but there is work to do to get them working together and the V100 has some limitations on the types of models you can run.
jeffybefffy519 12 hours ago [-]
I still dont trust the Anthopic and OpenAI are not training on my code. I even just thinking keeping track of what code you have received in prompts and to train/not train on it seems like an impossibly difficult task.
andrekandre 12 hours ago [-]
am i right in assuming your code is closed-source?
i'd expect anything on github for example to be already in their training set or is training on actual usage more useful to them?
redox99 16 hours ago [-]
I bought 2 used 3090s some years ago for $500 each. They're probably a bit more expensive now, but I guess for something like $2000 you can build a barebones 2x3090 PC which will be way faster than a Macbook. (you're fine with very basic hardware outside the GPUs)
stared 13 hours ago [-]
All experiments with Qwen 3.6 required no more than 48GB Apple Silicon. I believe you can go even further with more aggressive quantizations - one can go down even further.
In any cases, from the economic point of view, running models on laptops make little sense. Even at the pure cost of energy consumption, it might be hard to beat pricing at tokens generated at scale.
At the same time, it is a breaktrough, that will change the game. Previously such vibe coding on consumer device was not hard or costly - it was impossible.
trentor 17 hours ago [-]
Runs fine on 2x4080s or on two 5060/5070s with 16GBVRAM... and faster than on the mac.
dvduval 17 hours ago [-]
Absolutely for the average developer the token speed is just going to be too slow for it to be workable. I think we’re looking at 2028 when memory becomes cheaper again and they’ll be a lot more people using local models.
cyanydeez 17 hours ago [-]
AMD started their 128GB Halo Strix at a pretty damn good point at ~2.5k; I got mine after the first memory bump at $3k.
I think you might be a little to into the stew here.
zdragnar 17 hours ago [-]
I got mine at the same price point, and I've been pretty pleased with it. Tailscale lets me use it from my ultrabook / lightweight laptop, no burning lap or crazy fan noises. Desktops with the amd ai+ 395 are still fairly affordable for what they can do.
I'm running Lemonade on Nixos on my Framework Desktop. I had been trying other tools out before finding Lemonade, but Lemonade really made it plug-and-play.
Insanity 18 hours ago [-]
But you have to factor in that this device will last you 5-10 years. That said, I wouldn't spend almost $7k USD on this macbook lol.
petilon 18 hours ago [-]
Memory requirements of newer models will increase, so while the hardware may last 10 years it won't be able to run the latest models for 10 years.
roadside_picnic 18 hours ago [-]
My experience working in the open model space pretty deeply (both LLMs and diffusion models) for years now is that it is not quite as simple as that.
In the open model space an insane amount of effort goes into getting more powerful models to run with the same or less RAM. For example in the diffusion world many things that could not be run on easily under 24GB of VRAM actually run much better today with much less VRAM than they did a few years ago. You can do many things today with 8-16GB of VRAM that would not have been possible. At the same time the most advanced open models, like LTX 2.3 for video gen, still seem to respect 24GB of VRAM as the upper bound.
Similarly the standard "big" but localish open model for LLMs back in the day was Llama 3 70B, this was both a much worse and much larger model than Qwen 3.6 27B
So in two different spaces I've witnessed the "RAM required to run the best" decreasing or at least remaining stable, while the performance being achieved in both areas is astounding (LTX 2.3 is faster, better and more capable than the Wan 2.2 model that held popularity before it).
The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.
prima-facie 16 hours ago [-]
> The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.
I'm thinking of getting a SoC machine with 128GB RAM but the bandwidth is limited to 256 GBps. Would you even consider such a machine a decent investment, or should I wait for the newer gen of chips? Thanks!
roadside_picnic 14 hours ago [-]
It depends on your use case. There's a lot of hype around machines like the DGX spark (I'm assuming this is the type of device you're referring to) because they look awesome, and are priced reasonably well. However all of these have notoriously low memory bandwidth despite the high ram.
These devices, especially the DGX line, are fantastic if you are interested in low-level CUDA programming. The DGX spark can be used to prototype CUDA code/libraries for GPUs that most of us couldn't think about affording. If you want to learn how to program for datacenter level GPUs then these are the best way to get that at home. Sure your code will run very slow compared to the real thing, but you can take that code and, theoretically, run it on the real thing. For anything else though, I feel there are better options.
If you're interested in pure inference I'm pretty partial to Apple devices. The M4 Max gets you 546 GB/s, the M5 MAX 614 GB/s, and the M3 ultra (you'd have to buy used at this point) 819 GB/s. Plus you have a very useful computer even if you realize you don't want a full time home inference server. Additionally these devices require very low power (if you're running high end consumer GPUs you do have to think about what your energy costs are per hour and how warm you like your room).
If you're interested inference and training, or already have a pretty beefy desktop PC, or simply demand the most token/s you can get, then GPUs are the way to go. The downside is they're still pretty memory restricted (but honestly the options for what you can run on any RTX N090 are pretty good). You'll get blazing inference and prefill speeds on these devices. The only down side is, if you are using them heavily, you will see it on your energy bill and feel it in your room.
The "should I wait" question is also potentially applicable. The world of consumer hardware is looking increasingly bleak (and expensive) but if Apple does release a new "Ultra" model we could be looking at inference speeds very close to GPUs (there's still limitations to these devices that makes training preferable on GPU)
prima-facie 12 hours ago [-]
Thanks for the detailed response, I really appreciate it.
What I had in mind was an AMD Strix Halo machine, but it seems to have none of the advantages you mentioned. It's neither high bandwidth, nor does it have CUDA support, nor does it have support from the big OEMs. All the boards are from relatively obscure Chinese vendors.
It seems like all the major OEMs have rallied behind Nvidia, if you look at the upcoming RTX Spark laptops.
petilon 17 hours ago [-]
> insane amount of effort goes into getting more powerful models to run with the same or less RAM
The same can be said about operating system memory requirements. I am sure Linux and Windows kernel developers can confirm. Yet 30 years ago Solaris used to run comfortably in 16 MB of RAM, today you need 512 times that to run Linux.
Insanity 18 hours ago [-]
You raise a fair point, but I'm not convinced it'll offer a meaningful difference in performance as long as we're stuck with the current AI paradigm.
bluGill 18 hours ago [-]
Will they? Or will we find ways to optimize models and need less? Only time will tell.
simonw 18 hours ago [-]
It can't run the latest models today - GLM-5.2 class models already need 1TB+ of RAM.
... but, the models that WILL run on 128GB (or 64GB or even 32GB) models today are a huge improvement on the best models that would run in the same amount of memory six months ago.
If you find three finds that also have a 128GB MacBook, you can chain them together (the MacBooks, not your friends) and make it work.
You could also run GLM-5.2 on a single MacBook if you stream the active parameters from disk, but even with speculative decoding, you'd probably only get in the order of 1 token per second, so this is not really practical for most applications.
godwinsonsucks 16 hours ago [-]
[dead]
naikrovek 13 hours ago [-]
Available models aren’t really trending upward in size. Not like I thought they would, anyway.
They’re trending to be the right size to be good.
Qwen3.6-35B is not as good as Qwen3.6-27B. The larger model is faster, but a lot dumber; it gets caught in loops, makes crazy mistakes, and is just not as good. It’s bigger, but it is nowhere near as good as the 27B variant.
zargon 10 hours ago [-]
Qwen3.6-35B-A3B is worse than 27B because it's an MoE and 27B is dense. 35B only passes each token through 3B of its total parameters, whereas 27B sends each token through all 27B parameters.
cyanydeez 17 hours ago [-]
I think you have too much faith in context AGI.
at 128GB, you can find almost it's entire context for Qwen3.6 35B MoE.
Again, I think you have too much faith in extrapolation. It's like you got a baby at 0 months, then measured it at 12 months and expect it to be a giant.
someperson 18 hours ago [-]
In 5-10 years, incremental cloud tokens will be far cheaper (likely but not guaranteed).
jubilanti 17 hours ago [-]
[flagged]
colinsane 17 hours ago [-]
i like that people are taking the privacy argument seriously, after however many decades. i think there are other arguments to be made for running these locally which are less settled, but IMO the Fable debacle drives it home: the surest way to embrace this technology without worry that it will be taken away from you down the road is to physically own the compute.
r_lee 16 hours ago [-]
if you need to ensure that, then just back up the model and buy hardware if the need arises
colinsane 16 hours ago [-]
that's somewhere between saying "use Android, just switch to Graphene if/when they lock it down", and saying "just switch to postmarketOS/Ubuntu Touch/whatever flavor of Linux takes off".
i've watched friends try that route; i've been through this before. taking a downgrade is never fun: if it's a thing you're likely to care about in the future, then sometimes it's better to place yourself in the right ecosystem early.
r_lee 14 hours ago [-]
I just don't see how with the whole open weight system this situation would happen or that it'd be likely enough to warrant this
in terms of privacy, yes that's a real application, but someone taking it all away? I don't see it happening.
it's not an OS or a device, it's just a box/thing that runs a model, it's really commodity stuff we're talking about
more realistic concern would be that the open labs wouldn't be able to compete in the future thus development ends, but that means you can't host models that don't come out so...
again maybe I misunderstood but I just don't see why this would be worth it just for that one concern
ricardobayes 15 hours ago [-]
Oh definitely. I've seen GLM 5.2 go for around $4 per million output tokens.
17 hours ago [-]
oldfuture 18 hours ago [-]
a lot of credits? we can’t predict any price change for them
ant6n 6 hours ago [-]
Doesnt it run on the Macbook Neo... just slower?
AnimalMuppet 18 hours ago [-]
How many credits would it buy? How long would it take to use them up? What's the payback period?
From what I understand, for a developer, $5000/month is maybe the high end, but $5000/year is fairly standard. (Is that accurate?) So if it pays back in 15 months, that's pretty decent. If it pays back in two months, that's spectacular.
dminik 17 hours ago [-]
Using some rough napkin (well, spreadsheet) math, if you ran Qwen 27B for every minute every day at the current price of $0.195/$1.56 with a 2:1 input to output ratio (eg. agentic coding) at the advertised 22 tps it would take you just about 11 years to get to ~$5000 spent.
Disclaimer: There's a 35% sale from Alibaba right now. And I'm not accounting for input tokens going faster than output tokens.
eli 17 hours ago [-]
Are you comparing the cost of hosted Opus to running Qwen 3.6 locally? That doesn't really seem fair.
You're welcome to make your substantive points thoughtfully, just not aggressively.
kllrnohj 18 hours ago [-]
> maybe tell us how much a non-Apple system that you can run that (probably similarly or faster) would cost?
Ryzen AI Max 395+ with 128GB of unified memory can be found around $3-4k.
But 27B isn't that large, either, especially if you are ok with the quantized models. So this laptop choice seems to more be a "because they had it" rather than "this is what's necessary for this particular workflow"
h4ny 18 hours ago [-]
That's my point. You can run Qwen3.6 27B with MTP and whatever else you want to bolt onto it at 256k context for much less than even a Ryzen AI Max 395+ with 128GB would cost. Even unquantized you don't need 128 GB so given your comment and the downvotes maybe I didn't word my original comment properly for this?
onion2k 18 hours ago [-]
None of the examples reflect 'real work', at least not what I'd consider real work. Being able to nail a zero-shot greenfield project is relatively easy even for a small model. There's not much context to build up and it can fall back to similar examples in the training data easily. So long as you're not asking it to invent something wholly new it'll probably manage.
The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.
janalsncm 17 hours ago [-]
> Being able to nail a zero-shot greenfield project is relatively easy even for a small model
Not really germane to your comment but I hope I don’t sound old when I say I remember a time when spinning up a PoC was a week of work, and a statement like yours was pure science fiction.
cyanydeez 17 hours ago [-]
I love the ability to spin up any repo on github by pointing a local model at it with zero cost beyond the heat & electricity.
onion2k 16 hours ago [-]
[dead]
ai_fry_ur_brain 16 hours ago [-]
Yeah, and we still do take a week for people that actually care.
If I start prompting away the core of a new project I lose interest in the entire thing almost straight away. I hate it. The next day I could care less about it. In fact it just makes me lazy, like a fat person who drives everywhere.
I love typing code and thinking for myself. Im going to continue to do that. I still dont know anyone who's shipped anything truly useful with this garbage tech, let alone with a local 30b param model. So much cope in these comments.
Spending 6k on hardware to run the worlds most mediocre model truly does make you an incredibly stupid person, so Im not really suprised by these comments of people saying these tiny models are helping them so much.
Its like a special needs kid all of sudden got the ability to code, of course they'd be impressed by basically all the code it produces.
j_bum 15 hours ago [-]
I mean, have you looked for examples of things that people using local models to build and ship? Or are you just assuming it doesn’t happen?
I’ve used Qwen 3.6 27B for many things at work, and I’m regularly able use it for reasonably scoped tasks.
I’m not saying these models are perfect.
But you are complaining about people on the extreme, while at the same shouting from the opposite extreme.
hollowturtle 14 hours ago [-]
In what era spinning up a PoC required a week of work? Especially on the web. I've been a developer for roughly 20 years and that has never been the case, to the point that I believe people impressed by LLMs are the same who had a very low productivity. Today we have game jams as short as 3 days and talented people are able to produce very good PoC, with some almost complete!
janalsncm 10 hours ago [-]
1) It depends entirely on the concept you are trying to prove and how experienced you are in that domain.
2) Not every team will have someone with 20 years of experience in a particular domain eager to spin up a PoC.
spiralcoaster 14 hours ago [-]
So what you're saying is that all PoC's are guaranteed to take less than a week of work.
What are you even saying? Are you aware that there is a massive range in the scope of projects? You must work on some incredibly simple CRUD apps if this is your take.
hollowturtle 4 hours ago [-]
These people work mostly in CRUD apps and they're telling you they how feel productive. Btw exploratory ideas even for hard problems come out already after a hackaon of a day or a game jam of 3 days
Aurornis 16 hours ago [-]
> and it can fall back to similar examples in the training data easily.
This is an underrated consideration when evaluating the small models: The further you deviate from standard example code, the more their weaknesses show.
My experience is that Qwen3.6 produced some amazing results for a small model when I tried it with simple apps that are widely reproduced everywhere. If you want a React TODO app or to set up a little boilerplate app with shadcn and other popular tools, it will produce something that looks not too bad.
Then when I started straying outside of common tasks and into some of my more niche work, it would spin for hours and go in circles before finally producing some groan-inducing output that wasn't usable.
If you're looking for a model to help with simple refactoring or small tasks where you provide very explicit instructions for exactly what you want, but you don't want to do all of the typing yourself, they can do a lot of good work, though. But you're right that once you get into long context sessions involving topics off the beaten path, the weaknesses are very apparent.
The quantizations that are popular for making these models fit on smaller hardware make the problems worse. When you read it about online there is almost a consensus that 4-bit quants are lossless and that you can use q8_0/q8_0 kv cache quantization without any real loss, but in my experience with real projects there's a substantial degradation in long context performance with any of these quants.
CMay 15 hours ago [-]
This is my experience too. Qwen optimizes for a lot of scenarios which masks their weaker generalization compared to US frontier models.
Never go below an fp16 kv cache unless you've already tested it in advance with your model on a verified task that you know it can successfully complete. People should also test the difference using the exact same seed value so they can see how the tokens diverge. If you have memory constraints, sometimes you can still use an fp16 kv cache and use storage for an agentic buffer to work your task with mixed abstractions rather than having everything in memory.
For 4-bit weight quants, Gemma 4 31B QAT is where people should be looking instead of Qwen 3.6.
Zambyte 16 hours ago [-]
I have been using pi (and previously the codex cli) with Qwen 3.6 27b with 100k context for my development at work, and I have been very blown away by how well it works. It's not perfect, but it's enough to accelerate my normal development flow. I mostly use it for writing Go and C#.
sosodev 17 hours ago [-]
In my experience, even with basic project concepts the small models struggle to spin up greenfield stuff. There's just too many decisions to be made and they're not good at that.
Modifying existing code is way easier if you don't expect it to be smart about it. Don't say "add X feature" and let it explore the codebase and build its own understanding. Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Now you've done the hardest part of making the decisions and it just has to follow instructions while coloring within the lines.
fluoridation 17 hours ago [-]
>Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines".
Is that not how you would work with any model, local or not? I wouldn't trust it to make the right decisions unattended. I just know the moment I look away it's going to do something utterly braindead.
tenuousemphasis 14 hours ago [-]
Claude Opus with xhigh thinking is surprisingly good at figuring our details. Granted I'm only using it for little hobby projects, nothing overly complicated.
verdverm 16 hours ago [-]
I had good results doing an open box reimplementation. Gave qwen access to my old projects and it rebuilt it on JAX.
There are several general types of tasks that a Gemma 4 12B class model works for me, including: 1) design a large project composed of small libraries that can be coded and tested in isolation. 2) clean up old coding projects: add README files, comment code, show an example of using a new API and have it update API use, etc.
All small-scale stuff. For large integrated projects I am finding DeepSeek v4 Pro commercial API to be very inexpensive and helps me produce good results.
internet101010 7 hours ago [-]
Exactly. If the repo has all of the knowledge living inside of it that window fills up fast, even when using something like codegraph.
> In my limited experiments Qwen 3.5 (maybe 3.6 is loads better)
1. Maybe you should tell us what those limited experiments are.
2. Maybe you should actually try 3.6 because it's huge difference in most cases. Don't forget to tell us quants and don't forget to tell us scope.
3. Maybe actually show us data compared to frontier models instead of this... vibe comment. Pretty tired of this kind of comments on HN that doesn't require logic or evidence. Just vibes. Like the pelican riding a bicycle crap that everyone has taken for granted but has no objective way of assessing goodness.
snapcaster 16 hours ago [-]
Nobody owes you a scientifically rigorous write up
doodlesdev 17 hours ago [-]
I feel like I'm going insane seeing people buy these 128gb MBP for thousands of dollars to run models that are objectively much worse than SOTA and spending so much more. The amount spent on a 128gb M5 MAX can buy you a damned new car here. What the hell am I missing? Are developers in other countries living in such different worlds?
(I'm aware the price is, in absolute terms, more expensive where I live compared to the USA. That reinforces what I think, because anyone sane that would've bought one of those in another country would sell them as soon as they landed here and save that money.)
JeremyNT 17 hours ago [-]
I also don't understand why people in this price bracket are buying Mac laptops instead of desktop computers with GPUs? Just to flex that it's portable?
mft_ 14 hours ago [-]
(I'm not one of the people you're speaking of with a 128gb M5 but) if you want to run one of the medium-sized open-weights models (Qwen 27b, 35b, Gemma 4 26b, 31b) or larger, you get into an interesting optimisation space.
* yes, you can run it on an older/smaller GPU plus system RAM but performance will suffer
* if you want optimal GPU performance you need the model in VRAM plus context, so 24GB (3090, 4090) or 32GB (5090) cards, plus a system that's reasonable powerful to plug them in to. Ideally you'd have a multiple cards working together but for optimal performance this means either 2x 3090 or nvidia's workstation cards.
* you can go for a 128gb Strix Halo system, but the memory bandwidth isn't great and they're becoming increasingly more expensive (5.5k EUR for HP laptop, 3.9k EUR for GMKtec EVO-X2 mini PC)
* you can go for a 128gb DGX Spark (5k EUR+) which also has unspectacular memory bandwidth or RTX Spark (price unclear but probably not cheaper)
* or go for a Mac with a decent CPU and a good amount of RAM (bandwidth varies by model, but typically a bit better than Strix Halo/DGX Spark and worse than bespoke GPUs.
As usual with such questions, there are of course cheaper paths (if you want to accept the tradeoffs) but Macs are reasonable vs. competition for these workloads.
pletnes 6 hours ago [-]
And with a mac, there are no cuda drivers to fiddle with.
girvo 5 hours ago [-]
But prompt processing is terrible
jeroenhd 17 hours ago [-]
A mac with a boatload of RAM can run models that will exceed the limits of any GPU not worth at least twice the Apple hardware itself.
You get fewer tokens per second, but at some point the balance between quality and quantity makes the large model size worth the spend.
When you're spending this kind of money, you may as well treat yourself to a pretty screen and some decent speakers. Nothing the competition doesn't offer these days, but you get them for free with the car-priced RAM upgrade so why go for less.
ctkhn 15 hours ago [-]
I don't even travel a ton but portability is huge. It's not a flex, it's a functional thing that lets me move around within my house or work while I'm at my parents or traveling or anywhere else. Other than my media collection that lives on my home server, I want most of my files to come with me on my laptop.
FuckButtons 11 hours ago [-]
The fact that I can take it with me? That I don’t need internet to still have access to deepseek? The fact that electricity is expensive and an mbp uses ~10% of the power that an equivalent vram set up would using gpu’s. Also, in order to get the same vram I would need to spend a similar amount, but wouldn’t also have a machine that was useful for other workloads that need a huge amount of ram.
indemnity 10 hours ago [-]
Potentially going to sound privileged here, but why not both?
Personally when going on the road I like portability (14" MBP or MBA), but at home I want raw non-thermally throttled power.
LeBit 16 hours ago [-]
I think it is because desktop computers with GPUs with enough VRAM to run interesting models are insanely expensive, hard to source and consume a lot of electricity and dissipate a lot of heat.
ilogik 17 hours ago [-]
What GPU can I buy with >100GB of memory?
verdverm 16 hours ago [-]
DGX Spark is one, but really depends on how much you want to spend
aurareturn 14 hours ago [-]
273GB/s bandwidth vs 614 GB/s of the M5 Max. And you're getting a whole laptop.
$5k for DGX Spark as well.
verdverm 14 hours ago [-]
Prompt processing time is better on the spark, which aligns more with coding (more reading than writing).
I spent less than $4k, OEM are better boxes for cooling, no apple markup, I get a real Linux system for stuff like k3s.
aurareturn 12 hours ago [-]
Yes, it's better on the Spark but the M5 is a lot closer than before with neural accelrators. After prompt processing, token generation speed on the M5 Max is 2.3x faster.
No Apple markup but you get the Nvidia market up instead. Prior to the recent Apple price increase due to RAM shortage, an M5 Max 128GB was a bargain if you want to run local LLMs.
redox99 16 hours ago [-]
Yeah, it's a much better idea to buy many used 3090s. 4090s or 5090s if you can afford it. Way faster.
aurareturn 14 hours ago [-]
Probably depends on what you're trying to do.
You need an expensive motherboard, cooling, PSU(s) to use multiple high end GPUs together. Then there is the noise and the fact that you can't bring it on an airplane.
bastardoperator 15 hours ago [-]
I have a bunch of computers and gadgets, why settle on one?
satvikpendem 10 hours ago [-]
Unified memory.
btbuildem 14 hours ago [-]
I think it's silly to go for a laptop form factor. Last fall I put together a workstation with two second-hand 3090s in it (paid $850CDN each, now the best I can find is $1200). With 48GB VRAM it's reasonable - and I've been using Qwen 3.6 27B for various tasks around building KGs from text corpora / reasoning about them.
I've ran comparisons against everything that's available on OpenRouter (well, as of few weeks ago), and for $0/tok, the local 27B Qwen can't be beat. Sure, it's slower, and yeah, the office is a few degrees warmer than it ought to be -- but nobody can pull the plug, nobody is watching over my shoulder, and the results are on par with SOTA.
Can't wait for a similarly sized Qwen 3.7 - from what I've seen so far, it's a leap ahead of the previous version.
Gigachad 12 hours ago [-]
I think it still makes sense to wait. Hardware is currently hyper expensive and cloud models are subsidized. Waiting 2 years or so once memory prices have dropped and datacenters start wanting a profit would get you a usable setup that's more economical.
whichquestion 13 hours ago [-]
How much electricity does running your local models take?
alemanek 8 hours ago [-]
If your workflow benefits from the speed it quickly pays for itself when factoring in developer salaries here in the US. I recently switched companies and they bought me an M5 Max 128GB as my dev machine.
Builds and local test runs are 3 times faster than the Windows laptop option. The machine will pay for itself just based on that within 3 months. I can spin up a local kubernetes cluster and do full integration tests while I am working on other things as well.
It isn’t a strictly Mac vs Windows thing though. It looks like the culprit is the MDM software on the Windows machines is just crazy slow and constantly getting in the way.
If I was paid less it would definitely make less sense for the company to pay for this machine.
v1ne 4 hours ago [-]
Don't worry. Once IT Security discovers that they miss their trusty endpoint security products on your Mac, they'll add it and you'll be in the same ballpark as the Windows machine. Been there, received that, and learnt that Microsoft Defender exists for macOS, too.
bellowsgulch 17 hours ago [-]
> Are developers in other countries living in such different worlds?
Yes. Your people earn an order of magnitude less income than Americans.
adamors 17 hours ago [-]
Yes they are, 6k is peanuts to a lot of people.
verdverm 16 hours ago [-]
It's not always about the price or being the cheapest. For me, it's about freedom, both to play and from the govt/corp censorship.
reilly3000 15 hours ago [-]
It’s an asset on my balance sheet that’s already appreciating nicely and will likely be resale-able for what I paid for it for the next 7-10 years. I am on an Apple monthly installment plan so $5k is $416/month for 1 year, no interest. I’m able to run DS4 scale models and other open models without quantization, often multiple at once.
Imagine its value if war broke out over Taiwan / Greater China, or really any of the dark scenarios with global connectivity or the truthiness of commercially available models. It is a very, very difficult piece of equipment to make at any other moment in history. I wish I could have purchased more. I saw the signs and price trends and out of stocks as they unfolded. No doubt others with the means are stockpiling.
simplyluke 14 hours ago [-]
> will likely be resale-able for what I paid for it for the next 7-10 years
There is not a period in the history of computing where this is true of consumer hardware over a decade for anything other than hardware already at the very bottom of its depreciation curve. It is surprising to me that you state that as an obvious assumption.
I suppose if your base case is Taiwan war that may be true, but there's a lot of folks who seem to be assuming the current hardware crunch will go on indefinitely when the natural state of hardware is getting cheaper over time.
znpy 16 hours ago [-]
> Are developers in other countries living in such different worlds?
Yes. Back in the my days at $faang in europe it was not uncommon to hear people getting 120-160 k€/year in compensation and we were “poor” compared to us engineers at the same faang (4-500 k$/year total compensation) with a bit of seniority…
doodlesdev 15 hours ago [-]
That makes a lot of sense! I have no idea how I'd use that much money, so maybe the 128gb MBP for messing around with local LLMs wouldn't sound so absurd :)
zx76 16 hours ago [-]
I see a lot of people writing about how expensive the hardware to run these local models is - but see no mentions of the Intel Arc Pro B50/B60/B70 which seem like decent value if you're not interested in Apple kit (as much as anything can be decent value in the current status quo).
I just got a B70 with 32GB RAM for the equivalent of $1200 (incl. sales tax and import duties to my non-US location, so presumably it could be cheaper elsewhere). The memory bandwidth is 608 GB/s. For M5 Max (32-core GPU) it's 460 GB/s and for M5 Max (40-core GPU) it's 614 GB/s. A 3090 is still faster at ~900 GB/s but you're getting 32GB VRAM for a lot less than equivalent Nvidia cards. It's about 1/3 the bandwidth of a 5090 for 1/3 the cost, but with the same 32GB VRAM. If you're interested in being able to run bigger quants with some context and stay on a lower budget then it's an appealing trade off.
I'm still exploring using these local models so don't want to spend the equivalent of $5 000 - $10 000 just to test it out. I don't mind slightly slower perf to do some experimentation more affordably.
I actually got an B50 16GB (with meager 70w TDP!) first to test an Intel card with my stack - it worked easily with Ubuntu & Vulkan. I'd read a lot about hassles and people writing them off as unusable but it seems like these are often with SYCL which doesn't even seem to outperform vulkan and so why bother? (The B50 was just $370 inclusive tax and duties). Literally `apt install` the vulkan libraries and it worked with default xe driver in 26.04 and the vulkan build of llama.cpp. The SR-IOV PF/VF also just works with qemu/kvm, no tricks required. Since I got it fwupdmgr has updated the firmware twice so Intel is presumably actually trying to support these products.
bblb 8 hours ago [-]
I got B70 few days ago. Running on CachyOS. 9070XT on PCIe x16 and B70 on the x4.
ROCm nightly was pretty easy to setup and get up running. The 9070XT has been a decent card for my use cases.
But the SYCL ecosystem versions. Absolutely horrendous and everything is hundred commits behind. Vulkan is probably the only way forward with this card.
kristianp 8 hours ago [-]
Interesting that Intels latest consumer GPUs only have 10 and 12GB respectively for the B570 and B580.
mashygpig 13 hours ago [-]
It's fun to run a model locally, but I don't think the economics make sense for anyone just trying to use models atm. It's absurdly cheap to use the same model via openrouter in comparison.
Seriously, just put $10 into openrouter and play with models that are cheap but bigger than what you'd reasonably be able to run locally like deepseek v4 flash (unquantized). You'll be surprised by how far that $10 goes for a model better than what you'd be able to run. Even further on the model you would be able to run locally. Then think of how many long it would take to match the cost of spend + power on doing it locally...
Saris 12 hours ago [-]
Even with deepseek v4 flash I burned though $5 in credits in a day just playing around with Hermes, and qwen 3.6 35B is significantly more expensive.
I can run qwen 3.6 35B on my gaming PC at around 50 tok/s and other than power cost of a tiny bit extra per month, it's hardware I already owned from years ago.
I'm not really sure why qwen 3.6 35B is so expensive on openrouter, it seems abnormally high for what hardware it takes to run it.
Perenti 9 hours ago [-]
If you're not good at prompting yet, that $10 doesn't go very far. The local model allows me to learn what works and what doesn't without paying for tokens. Then when I know how not to waste them, I'll try a paid model.
alentred 4 hours ago [-]
There is one side effect of running your LLM locally: you stop thinking about the token budget. I often run `/goal` with no limits, or script an endless loop in bash to run opencode, etc. Sometimes I just brute force the task by throwing a /goal at it. Maybe it's not the most efficient use either, but it's nice to have the option.
SchemaLoad 12 hours ago [-]
Agreed, I'm waiting for the time when 48GB+ ram is just the standard that computers come with rather than being the absolute top tier option. It just doesn't make sense to spend extra on a local AI computer right now when the same money would last for a decade of API pricing.
boppo1 2 hours ago [-]
Have you considered this may never happen? What if datacenters continue to swallow all capacity?
an0malous 8 hours ago [-]
Those are all pre-rugpull prices though. Give it a year.
imrehg 5 hours ago [-]
I'm having a decently good time time with `qwen3.6-35b-a3b-mtp` (unsloth's multi-token prediction version) and and `qwen-agentworld-35b-a3b`.
On a 2021 M1 Pro (32GB RAM) I can get either of them as `IQ4_NL` quantized models (the first with reduced context, around 160k; the second can do the whole 264k with RAM left over), running something like 30tokens/s.
On a Framework 13 AMD AI HX370 it can use the same, but both on Q8_0 quantization, full context window, parallelism. Speed is just ~15tokens/s so slower, but definitely smarter than the lower quantized siblings.
Both of them are good developer partners for an engineer who wants more of a second pair of eyes and a rubber duck, rather than a model to just do everything for them. Pretty good for my brain dumping, some commit reviews, sanity checks, just always assume that every claim has to be checked and re-checked.
The only problem is really the context loading, that's pretty slow (starts off around 300token/s on empty context, by the time we get to something like 70-80k which is just a bit of repo discovery, it can run around 80 prompt token/s or less, so there's always a lot more waiting around. Local tools need to bump all of their timeouts, and have to be mindful that there's unlikely to be really meaningful parallelism on these machines with local models.
I'm still figuring out how to approach these things, though. Definitely better than glorified autocomplete or search tool (and too slow for the former, pretty decent for the latter). Their limited skill and performance make it more in line with other tools like my IDE or editors, that they are still in the "tools" compartment of my thinking, rather than "independent, cognitively active entities". Which feels like a good thing.
nunodonato 4 hours ago [-]
what are you using agentworld for?
cpburns2009 15 hours ago [-]
Before you run and go purchase a unified memory computer (e.g., DGX Spark, Mac, Ryzen AI Max 395 / Strix Halo), be aware dense models generally run slow on these machines. Dedicated GPUs run dense models significantly better. Look for benchmarks for your prospective machine. If you really want one of these, you'll be better off running Qwen 3.6 35B or another sparse MoE model.
amlord 17 minutes ago [-]
Tried looking at it, but needs a much beefier machine than I have RN.
Hopefully we're looking at a future where local models become more & more realistic to use for reducing remote AOI spend.
beastman82 18 hours ago [-]
FWIW I'm running gemma4 31b on my 5090 and it's pretty great as well.
QAT, MTP, 128k context.
I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.
kofu 18 hours ago [-]
My experience also aligns with this. I'm running gemma4 31B on a 4090 through llm.cpp with unsloth models.
I also run Qwen 3.6. Qwen is good for thinking and planning as it is faster, but Gemma4's generated code is much higher quality in the first try (Rust, C++ and C#). so it needs less revisions to be at a level I'm comfortable for merging.
beastman82 17 hours ago [-]
I second unsloth models. I'm using them over blackwell-oriented nvfp4 models as they are (empirically) top quality and performance.
kroaton 12 hours ago [-]
NVFP4 will be better if the model provider actually post-trained properly after quantizing.
girvo 5 hours ago [-]
Which basically only Nvidia does, because it’s very expensive.
Though I’m currently working on QADing the smaller Qwen 3.5 models from FP16 teacher to NVFP4 student, to hopefully eventually apply it to 3.6 27B… harder to get right than I expected though!
17 hours ago [-]
nozzlegear 16 hours ago [-]
I can't Gemma4 to actually finish a turn properly, it's always ending abruptly or making malformed tool calls. It's probably something I've misconfigured in oMLX or Opencode.
Huh. Same problem, and I run with llama.cpp. In my case, Gemma4-31B (4-bit quant though) will just stop sometimes.
accrual 18 hours ago [-]
Nice. I flip flop between Qwen 3.5 9B Q6_M and Gemma4 12B Q4_K_M on a 4080 Super. They run at about the same speed and I can have them review each other's plan or diffs. For smaller projects I find them very capable, and I can step up to a better quant for slightly more challenging work.
boppo1 2 hours ago [-]
Have you tried qwen 27b q4_K_XL? It's a little bigger than the 4080 but not too much
nok22kon 17 hours ago [-]
you can probably run Gemma4 26B on your card also at 4 bit. World of a difference compared with 12B.
zingar 16 hours ago [-]
Where does “big model highly quantized” start getting worse than “smaller model less quantized”? Is there a general formula or is it just trial and error?
nok22kon 12 hours ago [-]
paper is a bit old, but matches current empirical recommandation: a good starting point is the biggest model you can fit at 4 bit
Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?
kllrnohj 18 hours ago [-]
You don't need nearly that much RAM to run Qwen 3.6 27B, though. qwen3.6:27b-q4_K_M is only 17GB, for example.
DanHulton 17 hours ago [-]
This is what I run on an M5 MacBook Air 32GB. Works great.
I’m not having it build whole features from scratch, though. I give it pretty explicit instructions closer to the class or function level, and it still saves me an immense amount of time, while I’m very connected to the code that’s written.
Definitely the sweet spot for me.
rhdunn 18 hours ago [-]
A 27B model can fit easily on a 32GB VRAM card (e.g. 5090) or a 32GB computer in RAM at FP8/Q8 (unsloth have 28.6GB Q8 files).
For 24GB VRAM cards (e.g. 4090) you can use Q6_K (22.5GB) or Q5_K_M (19.5GB) quants, possibly offloading some of the weights to RAM.
jboss10 15 hours ago [-]
For the 35B model, ofloading to RAM doesn't slow it down much. If you have a nice CPU and a weak GPU, it will be fast enough to use.
__s 18 hours ago [-]
I'm on 128GB ram strix halo, bought framework desktop for a few thousand CAD back when everyone was calling framework desktop overpriced
wpm 18 hours ago [-]
It wasn't $10k a month ago
bahmboo 13 hours ago [-]
I work with a lot of 3D graphics and geo stuff so I can hit the ceiling with my 48 GB mac. It's not all LLM work. I prioritized more storage than RAM with my budget. Being able to run local llms has greatly helped me understand how they work. For day to day dev I pay for Gemini or Claude.
mr_mitm 17 hours ago [-]
Think commercial. My company invested in a local rig since privacy is important to our customers and sometimes I want to use these models on private data.
Gigachad 12 hours ago [-]
Even in that case it would make more sense to put the hardware in a server rack shared with everyone rather than inside macbooks.
At any rate it makes a stolen backpack or spilled drink a lot less damaging.
mr_mitm 4 hours ago [-]
Obviously the rig is not a macbook but indeed a server rack. I'm just saying that we're using this model for local development.
scotty79 15 hours ago [-]
Qwen3.6 runs great on GPU with 24GB VRAM. You could get used 3090 for it.
spike021 18 hours ago [-]
Certainly won't work on my M4 Pro with 24GB lol
MatthiasPortzel 17 hours ago [-]
I’m using it on a 48GB machine and it causes some lag, so it might be worse on 24, but it should run.
Unsloth recommends 18GB of RAM for Qwen3.6-27B (for their version of the model).
I'm still rocking my nvidia 2060, which I had purchased for $400 at the time.
I struggle to imagine purchasing multiple 1k+ cards on my own dime.
XCSme 14 hours ago [-]
Considering the cloud version, all three models compared in the article (Qwen 3.6 35BA3b, 3.6 27B and DeepSeek V4 Flash), have very similar performance[0], BUT on cloud, for some reason DeepSeek V4 Flash is 10-20x cheaper than the Qwen models.
If Qwen models are so much easier to run, why are the providers charging more than V4 Flash?
Also confused by this. Deepseek V4 flash is so much better than Qwen 3.6 yet cheaper to use.
ctkhn 15 hours ago [-]
I have been running qwen 3.6 35b a3b with opencode on my macbook pro 16" with m3 max and 64gb ram, and it's been great for local planning and coding. To be honest I have been on and off wishing I had future proofed with the 128gb after seeing how powerful 64gb is. On the other hand, I also haven't run up against a wall with a model that is just slightly larger than qwen.
Xeoncross 15 hours ago [-]
What is the speed on responses? (t/s)
The full 128GB is surely helpful in keeping browsers, editors and other things running since even 20-35GB models + k/v caches can eat up a lot of the core 64GB in my experience.
LeifCarrotson 14 hours ago [-]
I've also been running Qwen 3.6 35B A3b on my Windows laptop (64 GB RAM, a 4GB GPU) and it's at least tolerable. It's not fast - a few tokens per second, slower than reading speed - but I can give it a task and come back later. That was a $600 laptop off eBay a few years ago, not a $6,000 machine.
Are these unified memory Macs and giant 24GB desktop GPUs achieving dozens or hundreds of tokens per second commensurate with their 10x-20x cost?
jaggederest 11 hours ago [-]
35b A3b runs ~100 tokens a second on the best M5 Max gpu setup.
letmetweakit 32 minutes ago [-]
Any chance to run this on a RTX 3090 and 64GB of regular RAM with decent context size?
jimmaswell 9 hours ago [-]
My partner has been trying various models on our server but we haven't gotten anything to run at a usable speed. Q30H engineering sample (Xeon 8570) with two cpus, 56 cores per CPU, 768GB DDR5 RAM running at 5600MHz, two old 3090s in it at the moment with an NVLink and we could put our third in there. We built this server before the prices skyrocketed because we happened across some Tyan boards on Woot that were absurdly cheap for what they are (the motherboards should be $1000+ but we got them for a few hundred).
This thing sounds like it should be a monster but we keep running into issues of the old GPU architecture, lack of support for AMX or AMX not being as big of a help as you'd hope when it does work, etc. Apparently we only got 5 tokens per second trying to set up Qwen 3.6 27B, and a similarly bad result trying to run GLM 5.2 which fits in memory but the custom kernels we had to try to contrive were too slow. I feel like this system should have tons of potential, especially if something was designed to let the AMX and huge system memory shine.
Does anyone have any suggestions? This thing was fun to set up and it's really cool but it's been a bit disappointing not getting any big tangible results so far.
We have a similar system on a single-cpu Tyan board with 256GB RAM that I'm hoping we might be able to use in conjunction with the first one if EXO ever gets good Linux support for GPU/RDMA over InfiniBand.
danielrmay 6 hours ago [-]
Yes, this should be a monster machine. Ampere is an older generation, so I expect that's where some of your issues have been
christina97 9 hours ago [-]
Start with a quant, you can run the Qwen 27B model at 4-bit on one 3090, presumably 6/8-bit on 2x3090.
starefossen 16 hours ago [-]
We have have had the same experience (qwen3.6 rocks) when we are evaluating local models for our developers in the Norwegian Government https://github.com/navikt/mlx-workspace
mips_avatar 15 hours ago [-]
I think the sweet spot right now is 2x 3090s and a pcie 4 motherboard with 64-128 gb of ddr4 ram, you can build this right now for $3k and it runs qwen 27b/35b stupid fast at int4.
tasoeur 7 hours ago [-]
I know how to build PCs but suck at picking parts, would you happen to have a recommended build or links to people who've done similar ones? Heck I'll click on an affiliate link to support the author of the build :-)
I love it because the watercooled 3090s are completely silent even under load. Facebook marketplace is definitely the move for a lot of the parts unfortunately, since you ideally would have higher end parts that are 2-3 years old.
androiddrew 10 hours ago [-]
Dual AMD Radeon AI Pro 9700s (600 watts total 64GB of vram) runs Qwen 3.6 27B at FP8 with mtp on vLLM at 50ish TPS for decode. Cards cost $1300 a piece. Enough KV cache to fully max out two concurrent sessions.
It was super rough going to get started with them back in January, but right now the cards purrrr and I haven't even tried tuning yet. You need to use a patched vLLM image with aiter but besides that things are finally working on the ROCm front.
ThunderSizzle 1 hours ago [-]
Agreed. I have a single 9700 and I'm able to fit Q6 27B at 30tps or Q5 35B at 100tps very easily via llamacpp running vulkan.
The results are impressive considering the amount of people trashing AMD and still trying to recommend 3090s. I hope to buy a 2nd one at some point, but I also hate the version hell of vLLM, the R9700, the ROCM version, and Qwen3.6 all not agreeing with each other. I haven't gotten vLLM to run properly for Qwen3.6, since the version that runs on a 9700 doesn't support 3.6 yet.
I'm trying to quickly hack out a optimized path for just Qwen3.6 to run against rocm natively (e.g. my own inference server for 9700s basically) and see if it can perform better than llamacpp vulkan's results.
Word of caution - the last llamacpp with good performance was b9209 from a month ago. After that, for some reason, vulkan performance dropped by 10x, which has made me lose confidence in llamacpp in the long run.
Having said all that, 3x is 96GB for 4k and peak 900 watts. A 96GB Blackwell is $12k and peak 600 watss. And they will have a similar memory throughput (minor negative to the AMD cards for split processing). It's crazy how price efficient the r9700 is compared to the Nvidia cards.
rhgraysonii 18 hours ago [-]
I have been having pretty good success with Qwen 3.5 9B for "nontrivial but not challenging work all things considered" -- it runs great on my 24gb unified memory m4 pro MacBook Pro. What do the baseline specs look like Mac-wise for getting this model to run? Am I looking at a 96gb? 128? 256?
MatthiasPortzel 17 hours ago [-]
I posted this elsewhere, but Unsloth says the 27B model should run in 18GB. That leaves little RAM for other tasks, but it depends on your tolerance for slowness I suppose. I haven’t tried it in 24GB so report back if you do.
It got rather tangled up when I tried it with one of my coding tests, which is a simple wordpress plugin, but I frustrate the model by asking it to write code for older PHP, break WP coding conventions and use a rather bespoke method for arranging code in objects. So it is sort of a hybrid of a green field and brown field task; a bit muddy.
It did not do as well as Qwen 3.6 35B, but the way it worked through its thoughts was interesting.
TBH I struggled to understand what DeepReinforce are doing that is materially different; the explanation of their training technique goes over my head at this point.
jensC 16 hours ago [-]
It is also available with Ollama now and I am equally impressed too.
rhgraysonii 18 hours ago [-]
Thanks! I was thinking of doing the 128gb to have some future proofing. I figure at this point, it's akin to a mechanic keeping great tools around, when it comes to having this sort of homelab and exposing it for your own uses. And great practice for building the next era of user facing computing that will be around as this proliferates.
dofm 18 hours ago [-]
I would not buy a 64GB model again, probably, if this were to remain particularly important to me. But I gather memory bandwidth is pretty important here.
So for example I'd favour a used M1 Max over a used M2 Pro, at least based on my naïve understanding. Not quite sure where the balance changes.
There appear to be some hardware improvements with the M3 and up regarding the Apple Neural Engine which I'd hope would show up in MLX performance; I remember seeing some optimisations in image generation models that are only possible on later hardware.
The GPU cores are progressively better I believe, but the memory bandwidth is lower. Though perhaps the M4 can get closer to actually saturating said bandwidth.
(And I must reiterate that my understanding of this stuff is pretty naïve.)
freehorse 17 hours ago [-]
Used M1 max is still a good choice because its memory bandwidth only got surpassed by generation m4 and later (except with ultra variants which are more expensive). Its prefill speed is not great though, and that is an issue for running larger contexts, which only substantially improved with m5. Moreover, up to m3 they only have thunderbolt 4, not 5, which means that they lack RDMA support which would make stacking machines more effective. So unless you go higher price for m4+ max, or any m ultra, m1 max is pretty decent still compared to m2 and m3 max, definitely better than pro variants, if you can find in a decent price and want to experiment without caring much about time to first token and large contexts.
Note the drop in performance for the base (binned) m3 max version. You are better off with full m1 max than the binned m3 max, even price aside.
The issue I have with my m1 max is that with 64gb you cannot run really decent MoE models, ie the ones you can run like qwen 35B-A3B have only 3b active parameters and are much less capable than qwen 27b in my testing. So I end up running the 27b one, but it runs relatively slow (though still usable at 10-20 tok/s) and I would have been better off a used nvidia gpu setup for dense models. I assume 35B-A3B has its use cases, eg as subagents, just that I cannot find them. With a higher amount of ram I could probably run bigger MoE models which could be more comparable, though prefill would still be an issue (and prob a bigger one). The only hopeful thing is that there are performance hacks appearing (speculative decoding and prefill) that seem to start improving inference speed once getting implemented, so I am mildly hopeful.
(I must also iterate that my understanding is not very deep either)
dofm 16 hours ago [-]
Good reply, those two links are v. useful and I had missed them.
ljosifov 15 hours ago [-]
Running 27B dense model on M5 128GB is ok, but one can do better.
On M5 128GB one can make use of the ram and use sparse MoE. For example, DeepSeek-V4-Flash will fit, served by DwarfStar (https://github.com/antirez/ds4). One will probably improve 2x the token/sec speed, given DS4F 13B activated params in the MoE are ~1/2 of the ~27B of the dense Qwen.
27B Of the Qwen fit even on a cheaper 24GB card, e.g. amd 7900xtx (<$1K?) or slightly dearer nvidia 3090 (with cuda). With ~900 GB/s bandwidth they will likely be ~50% faster than the M5 with 600 GB/s.
brandall10 14 hours ago [-]
This is discussed in the article:
"My personal impression is that within these quantizations Qwen 3.6 27B is as good as (or maybe slightly better than) DwarfStar4. Though, I won’t be surprised if for longer context projects DS4 has an edge."
drnick1 15 hours ago [-]
Works beautifully on a 3090, very usable speed. Don't expect Opus 4.8-level performance, but there are some things you just need to keep local.
ljosifov 15 hours ago [-]
True - they are workhorses. Not super bright, but good enough for lots of everyday tasks. I've found sweet spot to be turning thinking off, as it adds small or no value, while increasing the token count and waiting time. Last 27B I used was https://huggingface.co/Jackrong/Qwopus3.6-27B-Coder-GGUF - specifically post-train adapted a bit to run with thinking off. I saw today the 35B-A3B MoE from the same HF acc is out, downloading that rn to try.
kroaton 12 hours ago [-]
Please don't use that garbage. Just use the base Qwen models or Nex/Orinth, as those are the only properly post-trained finetunes. The Qwopus models are marketing.
aand16 12 hours ago [-]
Can you expand on why Qwopus is not recommended and what "Nex/Orinth" brings to the table?
kroaton 12 hours ago [-]
"DeepSeek-V4-Flash will fit"
At Q2, 2bit? Lobotomized to death.
ljosifov 2 hours ago [-]
Hobbled - but not to death, the few times I use it (usually on a plane). I tried 2bit of a 20% REAP reduced experts. :-O That's the biggest that fits on my own h/w (3yrs old M2 Max 96gb). It's coherent, it does work, doesn't fall apart on casual use. IDK if better than dense 27b. Think 27b was slower on the same h/w. DS4F has got 1M context window. Nowadays with weeks long run hermes sessions, I get to 300k-400k context depths easily. The speed decline profile of DS4F with context depth increase is superior to any other model I try. (I try them all - love this stuff) Only previous model coming close on that is nemotron-cascade-2 (only 30b-a3b) - that also has 1M context window.
jboss10 15 hours ago [-]
I don't understand the talk about how expensive the hardware is. These models can run on very old or old and low end. I've been running Qwen3.6-35B Q4 on an old 1080 GPU(8GB vram) with 32GB sys RAM. I have a i7-12700.
It does about 30 tok/s which is enough for me. It's about half what the online models do, but it's enough.
I've heard their 9B models are also good, but they aren't much faster if you have the ram and a nice cpu.
These qwen3.6 models are the first ones I find can do much. GPT OSS was good, and Gemma4 is better. Gemma knows more facts, but qwen3.6 is smarter.
CMay 15 hours ago [-]
The MoE models hold up better on old hardware, but the dense models like this post promotes are in fact better. This isn't unique to Qwen. Are the dense models better-enough to use given the performance costs? It depends on what you are doing.
If a model runs fast enough for your use case and does exactly what you need it to, then you don't need a much slower model that might be more accurate. If you do anything more complicated, the dense models become more necessary and they are much more computationally heavy by comparison.
On your hardware an Unsloth quant of Gemma 4 26BA4B QAT would likely give you better results, but because it has 4B active parameters instead of Qwen's 3B active parameters, it will probably run slower.
felooboolooomba 15 hours ago [-]
Mind sharing the command line you use to rig it up?
schmuhblaster 5 hours ago [-]
I've worked extensively with the slightly less able cousin, the 35B A3B model and tuned my own harness around making it work well with local or non-sota models. The results are quite promising [0], if one sticks to a plan-execute approach. After a bit of fiddling with llama.cpp I was able to get it to work through a small change on a real codebase from work on a 32GB M5 (typical python FastAPI backend, so nothing out of the ordinary). While that's somewhat encouraging, the whole local experience was still far from pleasant with all the noise and heat.
Call me back when you can run these models on 16GB of RAM and any recent i5/i7. Until then, there’s no point on using these toy models.
guax 16 hours ago [-]
Its so funny, these "toy models" would be the wet dreams of researchers not 5 years ago.
Progress marches without mercy.
kgeist 14 hours ago [-]
Yeah people don't realize these "toy models" now completely destroy gpt-4o on most tasks, and no one called gpt-4o a toy model back in the day... It was OpenAI's flagship model from 2024 to 2025.
Gigachad 12 hours ago [-]
Tbh in 2024 most were calling these models useless for programming and a scam. It wasn't until this year things really changed. My experience with Qwen 3.6 is it can do things, and it's super impressive it can do things, but it's not any more productive than doing it myself.
Catloafdev 18 hours ago [-]
Hello, it's the internet calling, today is that day.
Edit: it's gonna be slow if you're not using any VRAM. But it's possible. Software isn't going to speed that up anytime soon, it's just a hardware bandwidth limit.
giancarlostoro 18 hours ago [-]
You need it to run in about 8 GB so you have extra space for the context window.
jboss10 15 hours ago [-]
They can be ran on 32GB with 8GB VRAM. I don't think these will be on 16GB for a while. (35B MoE)
TheCycoONE 15 hours ago [-]
I have 32GB of RAM with 16GB VRAM and I haven't had a lot of luck running larger models like this. Are you able to expand on that?
slim 14 hours ago [-]
use llama.cpp with cuda
TheCycoONE 13 hours ago [-]
The problem may be that it's a 7800XT which handles memory contention by freezing.
kpw94 18 hours ago [-]
> What it does:
>
> --jinja for tool calling support
Pretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year
meta-level 2 hours ago [-]
why does everyone imply you need a $10k laptop which then starts burning when you run Qwen 3.6? Get any other system with enough VRAM for a third of the price. Framework Desktop (Strix Halo 128GB) still costs under 4k nowadays, is nearly silent even on 100% GPU + CPU. (also it gets only slightly 'warm', but with a desktop you don't care anyway, I guess).
paintbox 2 hours ago [-]
But how will I signal my status to other people then?
On a serious note, I run my models on desktop pc, simple api and i can use them wherever whenever.
pkroll 15 hours ago [-]
Since no one else posted it... I have open-webui pointed at a linux box with 128 gig of ram and an RTX Pro 6000, and after a couple of runs on trivia, had it do one of Open WebUI's conversation starters: "Show me a code snippet of a website's sticky header in CSS and JavaScript."
72.06 t/s. That's the full Qwen 3.6 27B model BF16, using MTP, running on Ollama. Yes I know I should bite the bullet and get vllm running on that box.
That was, also, at a 570 watt limit: I normally run a little less, but when I first tried this I actually forgot I had set the limit to 300 (it's a hot day, I figured why fight the A/C?), and at 300 watts the same question came back at 69.38 t/s. (The extra power matters more for compute bound things, the difference in generating LTX2.3 videos is considerably higher... but still not linear.)
HotGarbage 18 hours ago [-]
And AI companies will continue to buy up all the silicon to make this prohibitively expensive to run at home.
dofm 18 hours ago [-]
It will run (somewhat slowly) on a five year old M1 Max with 64GB RAM.
Personally I prefer the 35B MoE model, which is fast enough to be interactively useful, and capable, but I would probably use the 27B if I wanted to generate whole applications like that.
I am unconvinced that most "local" AI applications need anything much more powerful than the Gemma 4 12B model. Local agentic coding is a small niche, but there are plenty of ways a local model can help with development tasks.
I would really like to see a 12B or 16B Qwen 3.6.
I am currently playing with Ornith 1.0 in the MoE configuration, which is based on the 35B variant of Qwen 3.5; I am not sure if it is better than the 3.6 version.
Benchmarks say it is; my own silly tests either suggest otherwise or suggest that I have to talk to it a bit differently.
sleepyeldrazi 18 hours ago [-]
I need to ask, since I have desperately wanted to make Gemma 4 12B work, but im not sure if its the quant (i usually up it to q8, which is a lot higher than iq4_nl that i use for 3.6 27B) or the model itself, but it just starts confusing itself really quickly when I give it coding tasks. And quickly starts failing tool calls.
I really want to have a model that i can run locally on my 24gb m4 pro mbp for when i don't have internet to connect to my 3090 running the qwen, and i love how gemma 4 models 'feel', but i can't make them be competent. I am in the middle of finetuning both qwen3.5 9B and gemma 4 12B just to try and make those bridge closer to 27B for coding/agentic tasks (and am trying to ternarize and DQT 27B so that it fits in ~9gb pre-KV).
How do you run the gemma? What do you use it for (and in what harness), maybe llama.cpp and pi-mono just aren't for this model and that's what i'm doing wrong.
dofm 17 hours ago [-]
It sounds to me like you're further along on this than I am, if you are fine tuning?
I am still mostly tinkering/learning rather than spilling out code, and I feel quite slow on it. So it doesn't matter too much to me if it is really slow. More the journey than the destination if that makes sense. I'm stubborn.
I have tried the Gemma 4 12B model (Unsloth's QAT version) with search/browse tools in LM Studio and Unsloth Studio, when I am trying to understand a new thing.
Basically I get it to write introductory starter documentation for me to absorb, because my big personal problem, these days, is focussing enough to start a project and then digging in; I need the help.
I have found its limits on obscure packages (that it sometimes makes up) but before that it's a bit like stumbling on a blog post that happens to be really right for your particular need. Good enough to work through.
It's stuff I could ask Perplexity to do, or ChatGPT, to be fair, I just like LM Studio for this and have the inquisitiveness to want to run it locally.
In your case: I don't believe it's the quant. I'm sure it's the model — it has good coding knowledge but it's clearly not specialised. It might be good enough at writing Python/PHP/JavaScript at a novice level. It is also quite good on WordPress tooling and functions.
But I wouldn't bother with it for agentic coding if you've got experience elsewhere. Might be interesting to see what you can do with the 9B Ornith model?
Qwen 3.6 MoE in its Unsloth version is another matter. Impressive and I am trying to find ways to support my old brain doing what I've done before.
blopker 17 hours ago [-]
I've been working with local models for the past year. There's so many possibilities, but I don't think coding is one. Coding requires so many layers beyond inference; I spent so much time trying to replicate what Claude Code does end to end locally. Understanding all the layers and keeping up with the advancements feels like a slog. Even this article messes up and misunderstands what some of the settings are doing. Qwen in particular seems to work at first, then often gets stuck in thought loops when used for actual work.
However, text-to-speech, speech-to-text, and non-code LLM use cases are so useful to have local, and don't require big hardware.
Having a universal reliable inference engine interface, I think, is the big unlock that needs to happen before app devs can ship these features.
Personal concrete use case: meeting recording app. This uses Parakeet + Qwen to create local transcriptions and post-cleanup, respectively.
Right now this app has to download and manage all these models, then bundle an inference engine to run them. It's a lot of code that probably should belong to the OS, or at least a standard interface.
While apps can offload some of this to llama.cpp or a similar process over http, that's another set of setup for the user to do before they can have a useful app.
Anyway, if you're getting started on a Mac, I'd suggest trying out oMLX (https://github.com/jundot/omlx) before messing with llama.cpp. In particular they have community benchmarks so you can see what kind of performance you're likely to get: https://omlx.ai/benchmarks. I wished each one had more configuration details though.
iwontberude 17 hours ago [-]
> I don't think coding is one
Certainly this is falsifiable easily by any of us doing it on a regular basis
> Qwen stuck in thought loops
This does happen when context is not managed effectively; creating plans, using subagents and compactions strategically resolves this
blopker 16 hours ago [-]
Sure, local coding is clearly _possible_, but it's not practical for most people. I've yet to see a reliable setup, if you have one, I'd love to see.
> creating plans, using subagents and compactions
Yes, these are all things that Claude Code does for you. However, for the thought loop issue, these are not the fixes. The canonical fix is to limit the number of thought tokens (llama.cpp's `--reasoning-budget`) or try to mess with the various penalty parameters. In any case, it's not a solved problem as far as I can tell.
jjcm 17 hours ago [-]
I'd also look at the qwopus distil if you're using qwen 3.6 27b. It's a nice refinement of the current 27b with slightly better stats.
We need machines designed around wide memory + sustained inference thermals, not gaming/creator chassis we're borrowing. Until then "local dev" means clamshell + external fans.
blueside 16 hours ago [-]
i have been trying several open source models for the last few years. running qwen 3.6 27b on my 4090 is the first local llm i have used that made me start to second question if anthropic and openai are actually worth the (already) insane valuations.
don't get me wrong, the frontier models are leaps and bounds ahead of what qwen/kimikgemma are doing - but i don't need to drive a ferrari to the grocery store everytime either.
MangoCoffee 16 hours ago [-]
Running LLMs locally for development doesn’t make sense to me. The hardware gets outdated in just a few years. Even hyperscalers replace their GPUs faster than they can buy them, plus the cost of running it locally, isn’t cheap. the cost saving just ain't there.
kgeist 14 hours ago [-]
From the perspective of LLM inference, you currently mostly care about:
- Memory bandwidth; BUT the requirements are currently capped because models have stopped growing at around 1-1.5 trillion parameters for quite a while now. You only need more bandwidth if you're optimizing for the highest possible concurrency (i.e. you're a cloud provider). Also, MoE exists.
- Support for native low-precision math (like FP4 and FP8); BUT once your GPU supports native FP4 (Blackwell+), there's generally no reason for GPUs to go lower because of the obvious quality degradation.
- VRAM capacity - just like memory bandwidth, it's practically capped by 1-1.5 trillion parameter models and is unlikely to need much more in the near future. Also, the current trend is toward miniaturization: modern 30B-class models (which require far less VRAM), now completely destroy 200B-class models from just two years ago on most tasks. We also have better understanding now how to compress contexts.
Most model improvements currently seem to come from RL/harness-based methods, not from scaling models or running new algorithms that require fundamentally new GPUs.
So I don't see why GPUs that exist today must become "outdated" in a few years. They'll be seen as outdated by hyperscalers because they need to serve the maximum number of users as cheaply as possible, so of course they'll replace their GPUs with newer ones that have higher memory bandwidth or more tensor cores. But you don't need that for local inference.
logankeenan 15 hours ago [-]
3090 was released six years ago and is still very relevant for running models locally.
guax 16 hours ago [-]
> replace their GPUs faster than they can buy them
How does that work? They have negative GPUs now!
jboss10 15 hours ago [-]
Qwen 3.6 35B runs on 32GB with a 1080. That GPU is from 2017.
mbgerring 18 hours ago [-]
Something I find really confusing from this post is the MLX versions of the model running much slower. As I understand it, these model versions are meant to take advantage of Apple Silicon and MacOS APIs, and should produce better/faster results. Any insight into what’s happening here?
grokkedit 4 hours ago [-]
I've been using it with a couple of tools (like context7) as a documentation/helper, without giving it direct access to writing code, in marimo. it works great, albeit a little slow on my server (m1 max 64gb ram), at 8bit with omlx
Otternonsenz 17 hours ago [-]
Is there any hope for people that cant even run 27B parameters, Qwen3.6 or otherwise? Are there any quantized models that do well with tool calling at smaller parameter sizes?
I do not have a crazy rig, a modest gaming one at that, but in trying to understand more about agents and their capabilities, I am SOL with my 16 GB of RAM and 8GB of VRAM. I can get most small, non tool calling models to perform well, but I've had major issues with anything over 9B doing anything more than reasoning (egregiously slow at higher parameter counts).
And so far, I cant get even Pi to extend itself or do any meaningful work with any of the models I currently can get to run.
fumeux_fume 17 hours ago [-]
I suspect with those specs, you're not in the game right now for reliably using local models for code generation. The easiest way in is a MacBook with at least 32GB of RAM. This should be able to run a 4bit quantization of qwen 3.6 using the MLX format really well.
Otternonsenz 16 hours ago [-]
Now that I’m dipping more into this space, am gonna see what I can upgrade with the motherboard I have, but RAM pricing as it is, I’ll need to be smart about when I upgrade.
I very much appreciate the frank response, as it makes me feel less defeated at knowing my understanding of how it should work is not the full issue, hahaha
fumeux_fume 16 hours ago [-]
M series macs are usually used for running these LLMs locally because the GPU and CPU share the same pool of RAM at very low latency. If you upgrade your RAM on a different kind of chipset without the Unified Memory Architecture, then it'll be much slower to produce all the tokens you need. Just another data point to add to your upgrade equation.
jboss10 15 hours ago [-]
I have 8GB VRAM, but 32GB sys ram. I can run qwen 3.6 35B at 30 tok/s. I also use pi, and it's smart enough to extend itself(multishot and maybe a few tries)
For you, you could try gemma-4-26B-A4B
jboss10 15 hours ago [-]
I have 8GB VRAM but 32GB RAM. Qwen 3.6 35B runs nicely.
You should look at gemma-4-26B-A4B. 16+8=24gb and Q4 is about 16GB. Not much context left, but might run.
fluoridation 17 hours ago [-]
I think at 16 GB you'd struggle to run the regular development tools nowadays, forget about any interesting inference.
Otternonsenz 16 hours ago [-]
Fully agreed, and my hope is as open models grow and change, that getting some amount of this working on Pro-sumer hardware will be more attainable.
But certainly seems like we are a few years away from that, sadly.
Am I also screwed in being able to train my own small model or adjust another one with such a non-workhorse PC?
fluoridation 16 hours ago [-]
Training requires even beefier hardware than inference.
spaqin 11 hours ago [-]
I got a 32GB of RAM and a 6GB VRAM card; tried both 27B and 35B, with pi. And it's a laptop. Speed isn't exactly a concern for me, I can enjoy the real life while the agent is doing its thing. And while they appear smart enough on the first glance, once it reads a file that's more than 100 lines it loses all memory of anything I asked it to do. The lack of failure state or any indication what might be wrong here is just frustrating. Guess local models aren't for me, unless I move to Silicon Valley and redeem my free MacBook at a local Startbucks.
jadbox 17 hours ago [-]
[dead]
PeterStuer 5 hours ago [-]
Been running it on a 9950x3D with 96GB and a 4090. Speedwise it is fine. Bit while not completely useless, for software development it is unsurprisingly a dramatic downgrade from the Opus I use as my daily driver.
IronWolve 17 hours ago [-]
I think things are moving fast, tested that new vibethink-3B, works on many small tasks/fast, and playing with ornith-35B with a draft vibethinker-3b as a draft gave me some good speed/results.
Was just trying to see how small I could go and get acceptable results, but yeah, larger Qwen 3.6 with MTP is going to be better. Cant wait to see how AI model (unsloth/local-llm/heretic/reaper/etc communities) are tweaking/engineering quality down into smaller models. Lots of new things coming out.
marcuskaz 15 hours ago [-]
When is Amazon Bedrock going to get these newer models?
Offloading compute to them is much easier, except its still a limited set of open models. Most companies are already running in AWS, so it's an easy add, models run in a trusted location, just another line item on the Amazon bill. You don't have to talk anyone into signing up with a new vendor. Plus you don't have to worry about local hardware at all.
SamInTheShell 14 hours ago [-]
This is probably the first small model I got through some simple web game tests without having to reset the context. It tends to opt to overwrite an entire file instead of doing edits... which editing is where most of these small models fall apart along with getting stuck in repeating loops. Only 24k tokens in so far, it did some decent newbie work.
decide1000 5 hours ago [-]
A lot of replies here are about Mac devices and their support for these 27B models. I own a MacBook but use a Lenovo Thinkstation PGX to run my models. It has a gb10 Blackwell gpu and 128gb unified memory. You can connect multiple ones.
prasanthabr 17 hours ago [-]
Has anyone considered a home server? Assuming mobility is not important if we pick components to match a similar hardware would it be more value for money?
6 minutes ago [-]
Greenpants 5 hours ago [-]
I specifically chose a Mac Studio 128GB as my home server that's also running LLMs to be always online, in part due to the minimal idle power consumption and mostly fan-less operation. It's definitely expensive, especially nowadays, but I can still recommend Mac Minis as a cheaper alternative for someone to just get started with an affordable, always-on home server that won't annoy any housemates. I think both are in some sweet spot in terms of value for money, depending on what you're looking for in a home server. If image or video generation is your thing, look further though, definitely look into a proper GPU then. Macs are quite slow at that. They're just great at MoE LLMs because it's mostly a matter of (V)RAM size.
cpburns2009 8 hours ago [-]
Generally speaking a home server/workstation set up is going to provide better performance at lower cost. You don't sacrifice much mobility either so long as you have an internet connection and can either SSH tunnel or use Tailscale (never used, just know it's popular).
drillsteps5 16 hours ago [-]
A decent gaming machine perfectly doubles as your friendly local inference server. Just start llama-server with the model of your choosing and start chatting with it through its Web interface or connect any chat completion-compatible client (agentic or not) which will use REST to send requests and receive responses. From any device on your network. Voila.
LeBit 16 hours ago [-]
Which components are you thinking about?
prasanthabr 15 hours ago [-]
Am unsure - was hoping someone tried this and there is a tested component list of consumer grade pc parts that can do the trick
mark_l_watson 14 hours ago [-]
I can come close to agreeing because queen-3.6-27b is my second favorite for local coding. I am using gemma4:26b-a4b-it-qat-48k (the "-48k" is from my modifying a model run with Ollama to always use a 48K context size). On a 32G Mac I use gemma4:26b-a4b-it-qat-48k and OpenCode and on my 16G MacBook Air I use gemma4:12b-it-qat-16k ("-16k" is my resizing context size) and little-coder. I break up projects into small libraries because local coding works better for me using small code bases.
I find that for local coding, I need to spend a lot of time building concise SKILLs for specific things I work on and try to only enable one or two skills per coding session.
To the author of the linked article nice job, and if you feel like adding to it, please add details on your setup.
brandall10 14 hours ago [-]
Curious why OpenCode instead of a more 'full-fat' version of Pi with the larger model?
I feel like the amount of context bloat that OpenCode puts these small models into the dumb zone too quickly. The system prompt alone is 9k tokens, and when you add your own setup it can easily creep up to 15k.
mark_l_watson 11 hours ago [-]
I disabled many built in skills and increased the context size. I also use little-coder that is based in pi.
trey-jones 10 hours ago [-]
Qwen3.6 was the first model I ran locally that seemed smart, but qwen3-coder:30b is way, way more responsive and more accurate for writing code according to my tests, including human-eval. If you can run one than you can almost certainly run the other. If you haven't tried qwen3-coder I would definitely recommend it. It feels positively snappy compared to every other local model I've tried. All you need is 32G VRAM and some heat dissipation.
dom96 17 hours ago [-]
What do folks use to keep on top of new model releases that are appropriate to their system? i.e. the models that will actually work on the MacBook Pro with 48GB of RAM or whatever their specs are.
I've seen sites here and there but they feel like quick little toys that don't get updated, so they always suggest old models.
simplyluke 14 hours ago [-]
The open source models have gotten heavily conflated with local development. While that is cool and I'm excited about the future of local LLMs, it is not necessary to play around with these models. Without shilling for companies I don't have a relationship with, there are a number of companies who will give you an API just like Anthropic/OpenAI and you pay per token, albeit much cheaper than the frontier labs.
I've been using the full GLM 5.2 model this way (through opencode) at work for the past week. It's quite impressive.
Alifatisk 14 hours ago [-]
Shouldn’t we call them open weight models?
simplyluke 14 hours ago [-]
That's probably more precise.
seemaze 18 hours ago [-]
I was interested to see that Qwen3.5-122B-A10B narrowly beat Qwen3.6-27B on Donato Capitella's SWEBench-verified-mini run with a similar 128GB UMA architecture.
Many people in LocalLLaMA Reddit community has been reporting the same, that 3.5 122B-A10B is on par or slightly better. And a 3.6 or 3.7 od the 122B is one of the models people want to see the most.
SkitterKherpi 17 hours ago [-]
27-30B in general seems to be the level where you actually start having decent models. I just wish consumer hardware hadn't stagnated so much that we can't easily go higher than that, and that even running those requires a $5k machine now.
blagui 13 hours ago [-]
How you can do dev in 2026 using 64k context and without sub agents?
The benchmark seemed fine until I saw that.
If you use sub agents, they will overwrite the cache and each request will trigger full reprocessing. Have fun with that as it will crash the t/s metrics on each prefill on top of the max 64k including input + output is a major blocker.
If you push the context higher and add parallel slots the requirements will be far higher and the numbers less shiny.
zedascouves 15 hours ago [-]
Just tried on some arduino code. after 10 minutes i got a list of improvements to my code.
I ran those throu opus saking if it was good advice and was not impressed:
I read the actual qr_scanner.ino. Short answer: partially, but I'd push back on most of it. That review reads like
generic ESP boilerplate advice written against an imagined version of your code — several of its "fixes" are already
in your file, and its headline "critical" claim misreads what the code does. Going point by point:...
aand16 18 hours ago [-]
I've come from the future to say Qwen 3.7 27B is just around the corner and slaps!
lor_louis 18 hours ago [-]
Do no give me hope like that.
layer8 18 hours ago [-]
Are RAM prices down?
mendeza 18 hours ago [-]
I am eagerly waiting!
jensC 16 hours ago [-]
Me too, I am on a Jetson Orion 64GB (about 50W max). Using the nvidia graphic cards for AI seem to be so power hungry that it was not a choice I could take with todays environmental problems.
NamlchakKhandro 5 minutes ago [-]
Huh?
alfiedotwtf 16 hours ago [-]
Qwen 3.7 120B will kill off Antropic’s IPO
hollowturtle 14 hours ago [-]
> Real work
Ok that's the part I'm interested in, don't care about minesweeper clones....
> Make a landing page selling candles for women that are into wellbeing and SPA.
can't be serious...
kristopolous 10 hours ago [-]
Help me improve local model performance with petsitter!
It basically exploits the face that time can be traded for intelligence with local models
I have 24GB of VRAM (via a RTX 4090) and run Qwen3.6-35b:iq4, so it's importance-aware quantization and isn't nearly as dumb as it sounds like, fitting the 35b into 18 GB so you have some left over. So far I've had no issues, other than it taking a while for things like image gen, which I found out if you're gonna do with any alacrity, just have a cloud model do it.
For anything else local, including writing some automation scripts and such, it works great.
Zambyte 16 hours ago [-]
Can you link the model? I also have a 24gb card (7900 XTX). I've been having success with the dense 27b model, but I'd like to see if the 35b iq4 is any better.
Has anyone managed to cleanly integrate Web search into local models (run with llama.cpp)? The biggest limitation of the class of models that fit into one or two consumer GPUs is that they lack world knowledge, but presumably this can be remedied by enabling access to use the Internet.
kroaton 13 hours ago [-]
You're late to the party, mate; we've been doing this for years. Grab a SearXNG instance, stand up an MCP server for it, and expose the tool into your system prompt. Or use Brave Search. Or Exa if you want to pay. Any of them work. The model will pick it up straight away.
Even llama.cpp's bundled web UI handles it fine. Dead simple.
Havoc 11 hours ago [-]
Searxng is the ghetto solution. Commercial uruky is good. Basically Kagi except you can also run api calls over it
Neither is going to return much knowledge. Basically just relevant url so you need a second tool to grab them and there bot walls get tricky
mwowow 12 hours ago [-]
Working fine with LM Studio + Web search plugin
markdog12 17 hours ago [-]
I've tested it extensively for actual local development for my job, and hard disagree here. It's a waste of time to use it. Wish it were not true.
beastman82 17 hours ago [-]
I posted elsewhere but if you have more space try gemma4 31b
kopirgan 8 hours ago [-]
Lost count of number of times I read this or similar:
For me it’s the first local model that actually makes sense as a general intelligence.
recursivedoubts 14 hours ago [-]
I would like to offer someone the next openclaw: a GUI for the mac that allows people to manage and install local models with a single click, provides GUI tools for tweaking important aspects of them, and also provides a good command line interface to those models.
hollowturtle 14 hours ago [-]
ollama is a good starting point
fossheart 9 hours ago [-]
> I recommend llama.cpp - a direct, open source tool that allows running models on various devices. You don’t need Ollama, and frankly - I would recommend against using that on ethical grounds.
I had faced roadblocks while integrating with openclaw using ollama (Was trying to experiment with `qwen3-vl:2b`). I was tracking the issue back to openclaw at that time, I didn't even consider investigating ollama.
I attached a threads post here where I'm talking to meta ai to expand on both scenarios (not to use ollama, but llama.cpp & my take on the why this is the way it is - ie. commercial gains)
How does llama.cpp use the GPU efficiently as opposed to MLX?
Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?
TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.
If I can generate voice at the same time as video, that would be useful.
dannyw 18 hours ago [-]
Llama.cpp uses the GPU very effectively because inference of LLMs is very rudimentary and basically as simple as your GPU memory bandwidth. That's essentially the baseline performance ceiling, with model-specific optimisations like MTP potentially increasing it.
The neural cores aren't suitable for LLMs/transformers and isn't used in LLM inference. On the M5 and later chips, it comes with neural accelerators, aka Tensor Cores, which speed up the 'prefill' (i.e. processing your context window) part, but don't do anything for inference.
The MLX vs GGUF debate is mostly irrelevant. The GGUF pathways are optimised for apple silicon to the extent of practically identical performance to MLX. MLX is just one way of using Apple GPUs, it comes with many optimisations in the box, but they're not hard and they're no longer MLX-exclusive.
drillsteps5 16 hours ago [-]
I honestly don't get the hostility against local models in this thread (and in some other threads recently).
I haven't seen anyone make an argument they are as good as SotA (OpenAI, Anthropic). It's just they are approaching state where they are "as good" for some _limited_ set of use cases. Which will allow us to resolve 2 primary issues with these SotA models: privacy and vendor lock-in. Plus, they're very useful for education purposes, you get to explore what things looks like under the hood, play with various models, tools, maybe put something simple together yourself.
You get Macbook - great. You got gaming rig with a decent GPU - great (set it up as a dedicated server that you connect to through simple REST).
What exactly is wrong with any of that?
simplyluke 14 hours ago [-]
> I honestly don't get the hostility against local models in this thread
Consider that there are literally trillions of dollars being wagered on this not being the future state of computing. Not even speculating that HN is being astroturfed (though I see no reason it wouldn't be by interested parties), but many of the US tech employees here have direct financial incentives in various forms to be rooting for the failure of open source and optionally local models.
agenticup 5 hours ago [-]
qwen 3.6 27b and qen35b a3b work like magic, if we get dpark speculative decoding versions of these models it will further improve the throughput
LoganDark 2 hours ago [-]
I see OpenCode mentioned in the article, and I would strongly warn against using it for local development because it disrespects caching (the content of the first turn / system prompt is NOT stable). I use Pi which works much better.
narrator 16 hours ago [-]
In hindsight, the Mac 512gb for about $10k was a total steal given that to run GLM 5.2 you need a 4x H100 to get the necessary amount of VRAM. Yeah the h100 is 2 to 8 times faster, but it's $20k a month to rent a 4xH100 VPS.
konart 5 hours ago [-]
>Real work
This part should have featured something about real work. But instead it features a paragraph about one-shot bs that creates "something".
Unless your work is to create thousands wordpress tremplates to sell - this is not a "real work".
Give it a repository (any kind of OSS project will do for an example) and a github issue requesting a knew feature or describing a confirmed bug. (you can and probably should write a prompt for LLM shough, don't just provide the issue itself)
And then whatch it go.
And then judge the result and it's quality.
Sorry, but from my experience 27B is just useless. You do get a result and some times it does work, but most of the times it is not event on junior dev level. And it takes it a lot of time to do the thing, unless you have an extremely expensive machine.
hypfer 4 hours ago [-]
If your expectation is to treat it as a coworker, then you're right.
If your expectation is to treat it as a tool, then you're wrong.
I guess that's where the disconnect lies.
konart 1 hours ago [-]
Define "a tool" for me and we can talk.
I already have tools for autocomplete, working with structured data and many more. Deterministic tools.
Obviously you do not expect something like that from a model with some harness. It can read some input (user's or other tools) and give you some output.
My expectation is that this tool, given some meaning full input (instructions, expectations, motivations and an optional source files to work with), will produce something that will actually be aligned with the input.
For example: consider I have a services that has some sort of events created now and then. I what those events to be available for other services. So I decide it to have a transactional outbox and an observer that will pull events from the outbox and put them into a kafka topic.
My expectation is that I can give this tool some context (source code and description), state my instructions, expectations, motivations, design decisions and have an implementation as a result.
My other expectation is that given my context etc and agent's context (skills etc) were correct and adequate - the outout will also be correct and adequate.
Its feasible but that laptop is not available for most devs.
I do have access for a 64 gb ram mac mini but most people don't.
alansaber 15 hours ago [-]
Is qwen finetuned/RL'd on any agent harness? Or does it just work well enough off the bat with opencode?
cpburns2009 11 hours ago [-]
If Qwen is finetuned for a hardness, it'll be Qwen Code. Qwen 27b works well enough in OpenCode though which is what I use. My one complaint is it likes to get cute with bash commands instead of OpenCode's built-in tools. I use a skill to steer that.
anonym29 18 hours ago [-]
Strix Halo user here. While Qwen 3.6 27B exhibits remarkable intelligence density, I will still take unsloth's dynamic IQ2_XXS of Minimax M2.7 over Q8_0 Qwen 3.6 27B any day of the week, and this isn't just because of generation speed either. I wrote my own custom harness, and I get hallucinated tool call parameters and bizarre invocations with Q3.6 27B even at Q8_0, but no issues with the IQ2_XXS of M2.7.
BoredomIsFun 17 hours ago [-]
> I get hallucinated tool call parameters and bizarre invocations
tweaking sampler might help
felooboolooomba 15 hours ago [-]
What's the minimum requirement for a Nvidia card to run it? For let's say 10 t/s.
zerolines 15 hours ago [-]
Yup, been rocking theQwen3.6-35B-A3B-MTP-GGUF locally with 88tk/s it's amazing.
devin 15 hours ago [-]
If I have 10k to spend, what should I buy for the best local model experience?
wolttam 14 hours ago [-]
You can buy a pair of DGX Sparks and run Deepseek V4 Flash at ~60-70TPS (once DSpark support matures over the next few days).
That will get you a near-frontier experience. DSv4 Flash launched in April with capabilities on par with GLM 5.0, which launched in February.
simplyluke 14 hours ago [-]
I really think giving it a year for the hardware market to come back to earth and spending a fraction of that for API access to the same models is a better use of the money.
devin 14 hours ago [-]
Implicit in your answer is the belief that they will come back to earth. I wonder how realistic that belief is.
simplyluke 14 hours ago [-]
We have decades upon decades of hardware getting dramatically cheaper year over year for the same performance, and ~1 year of the inverse due to dramatic buildout for AI.
It's a surprising example of the recency bias to me to assume anything other than the market returning to its historic norm, even if the AI buildout doesn't slow, producers will scale factories to meet that demand.
v3ss0n 14 hours ago [-]
3.5 122B is much better. 27 B is bad at Long context and Svelte
mannyv 16 hours ago [-]
FYI token speed is somewhat irrelevant for agentic development. You let it run, then you come back. The whole point is that it's asynchronous. If it takes 4 hours, 8 hours, 16 hours...who cares?
kmike84 16 hours ago [-]
You care if you run it on a laptop. It's getting hot, fans are spinning, and you may want to use laptop for other things while the agent is working.
mannyv 15 hours ago [-]
I have a Studio 128gb, so it's not an issue.
rvz 6 hours ago [-]
When reading the comments, it seems that in the AI race to zero, Apple was already at the finish line. as predicted.
So it will be no surprise that there will be a time where everyone will be able to run a local model, say GLM 5.2 locally on their machine. Like it or not.
happyash1 6 hours ago [-]
Qwen is so good a model.
Go7hic 4 hours ago [-]
goat
m3kw9 8 hours ago [-]
Hmm, i used it and it can't get past a simple coding test that 5.5 passes with light reasoning
cat_plus_plus 17 hours ago [-]
Gemma4 31B with MTP enabled is faster and I feel a bit stronger at coding. Either one can run in 32GB VRAM or unified RAM with some tuning (3 bit weights, 8 bit kv cache)
I'm running the NVFP4 alongside Gemma4 at the same quant on an OEM Spark
colinsane 16 hours ago [-]
AgentWorld is _fantastic_. i just migrated "down" from the 122B A10B qwen model to agentworld (35B A3B) because it feels as capable, easier to steer, and it's 3x faster.
also i like that if i drop more sophisticated tools into my harness (e.g. any of the NLP/RAG-based search tools in place of grep/rg), the agent will actually reach for them and make progress faster; previous models have been reluctant to embrace new tools.
ascii0eks84 18 hours ago [-]
Very capable lora adapters are surfacing but it seems they are very niche.
DenisM 18 hours ago [-]
Can you share more? It’s the first I hear of lora outside research papers. Practical applications would be great to see.
Lora if effective could be a great reason to run local models.
mikert89 18 hours ago [-]
none of these local models are good for development, complete waste of time. nobody has $100k+ hardware sitting around at home to actually run a good model
jlongr 18 hours ago [-]
skill issue
mikert89 15 hours ago [-]
the models suck
dmezzetti 17 hours ago [-]
Local models are great for a lot of things past just software development. We need to move towards solving other real world problems vs just building software. I've been focused on that with TxtAI (https://github.com/neuml/txtai) for 6 years now.
rusk 18 hours ago [-]
Spent a week trying to get sensible results out of llama 3.3 At one point it even simulated doing the work, log output and everything and when I challenged it about the missing artefacts it actually started questioning my intelligence. Seems appropriate for a Zuck enterprise.
Qwen on the other hand got straight to work with astonishing competency on the same system.
From what I read llama3 needs beefier compute to reliably invoke tools, which I presume relates to it focussing more on simulating AGI rather than being a useful tool.
culi 18 hours ago [-]
You might find this helpful. llama is not anywhere near the Pareto distribution (performance vs cost)
Llama3.1 instruct seems to be doing okay on that page, mostly because it's dirt cheap.
am17an 18 hours ago [-]
llama 3? Are you from 2023?
Frankybeatz 33 minutes ago [-]
[flagged]
hendry 2 hours ago [-]
[dead]
Nasser_CAD 8 hours ago [-]
[flagged]
cloudcanalx 5 hours ago [-]
[dead]
modgate 5 hours ago [-]
[flagged]
6 hours ago [-]
15 hours ago [-]
ShizuhaLabs 15 hours ago [-]
[flagged]
modgate 5 hours ago [-]
[flagged]
so_it_be 13 hours ago [-]
[dead]
Getchowned 15 hours ago [-]
[dead]
dhanush_2905 13 hours ago [-]
[dead]
suthakamal 17 hours ago [-]
[flagged]
CurbStomper 17 hours ago [-]
[dead]
Reuben_Santoso 11 hours ago [-]
[dead]
sourcegrift 13 hours ago [-]
[dead]
217 18 hours ago [-]
This is kind of like saying grass is green to be honest
madduci 18 hours ago [-]
Like everybody got 128 GB RAM..
sleepyeldrazi 18 hours ago [-]
I've been running it almost since launch on a 3090 (24gb vram), you really don't need that much. Second hand those are really cheap and i get 50-70 t/s (with MTP at 2), full ctx. IQ4_NL (unsloth) on this model seems suspiciously competent, and after the (by now not so recent) updates to q4 KV on llama.cpp, I just keep going back to it after dsv4pro disappointed me for the 100th time because it gave up on a task.
dofm 18 hours ago [-]
Doesn't need it at Q4 at least; it'll run in 64GB.
intothemild 16 hours ago [-]
Q6 can run with 256k at Q4 on 32gb easy.
200k @ K : Q5_0 V: 4_1 (which is a bit of a sweet spot)
BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.
Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.
If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro.
Thank me later.
With no speculative decoding, using high power mode, I get 80 t/s on 35B A3B - and it gets hot and spins up. On low power mode I get 38 t/s - no fans, cool to warm laptop.
If you currently don't use speculative decoding and you start using it, it can nearly offset the difference between high and low power, and it's night and day experience.
I almost always keep my laptop on low power mode.
614 GB/s of memory bandwidth
> MacMini M4 with 64GB of RAM
273 GB/s of memory bandwidth (also only currently available with 48GB)
When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible. In this case a hypothetical Mini with 1TB of memory would still be over 2x slower with 27-35B models.
And FWIW I have an M4 Max MBP 128GB that I keep on a Roost laptop stand, with a separate keyboard/mouse/video. It does fire up the cooling jets when running local LLMs, but stays within tolerance for me on noise. I haven't heat-tested it on longer runs, but I imagine the risen airflow helps a ton.
This is only true when your GPU isn't bottlenecked building a KV cache, which it usually will be on Apple Silicon. The Achilles heel of the M-series chips are their weak, SOC-grade GPU that holds back the Max and Ultra models from having interactive TTFTs on larger models and contexts.
It's a nice idea to run a model on a laptop so you can work anywhere...but, that's a job for models in the cloud. Not much data has to traverse the network, so it's not a big deal. Or one could also setup a VPN so you can reach a self-hosted model on a big box at home for things that require data privacy.
All that said, there are models that work great on very small devices for some tasks and won't work it to death. Gemma 4 12B QAT 4-bit runs on a 16GB device, maybe even smaller, including a tablet. It's the best self-hostable vision model I've tested for my purposes (categorization, identification, labeling, type stuff), beating much larger models. It's also a decent conversationalist with good prose but it doesn't know much of anything (not a lot of the world fits in 7GB), so it needs search if you want to use it for research. It's a pretty good tool user. I definitely wouldn't want to use it for code, though, beyond very simple stuff.
That said, the reason they're able to release Ornith branded post-trains of both Gemma and Qwen is because they're open weights under a friendly license. Someone, not just Google, could make a coding focused Gemma post-train. I don't think it's actually much weaker than Qwen 3.6 for coding; Gemma 4 31b outperforms Qwen 3.6 27b by a wide margin on security bug hunting (at least for the specific bugs in my benchmarks, which are mostly relatively difficult bugs from the Mythos-reported bugs).
I'd really love to see a bigger MoE from Google, though. A 70b or 120b MoE would likely be super fun.
[0] https://sleepingrobots.com/dreams/stop-using-ollama/
[1] https://github.com/unslothai/unsloth/discussions/4921
So, just buy a mac mini and put it in the other room? ( Like everyone was doing in February? :)
I've been running coding agents on my laptop in yolo mode for the past half year or so (though mostly not local ones, laptop too slow!) and the way I'm doing that without terror is that I just gave them their own Linux user "agent". They're free to nuke their homedir /agent, and they can't touch (or even read) mine.
There's some slight ergonomics issues (I need to sudo into the user to do anything, but I set up an alias for it), sometimes I get issues with permissions or ownership (gave up on "sticky bits" and just made a function I can run once a day when it breaks).
There's enough hassle that I wish I just had a dedicated machine for it, and then I'd just give them root on it. (For giggles I gave claude root on a $3 VPS and that's going just fine...)
But yeah after months of trial and error I reinvented "just buy a mac mini" from first principles...
Soon it is going to be good even for coding using local LLMs. Until then, just run API models on it for coding, local LLMs for "knowledge" work or daily driver agent like Hermes.
I have an older laptop I run a hermes agent on backed by an API based open (non-local) model and Macbook Pro M4 for running another model locally (also using hermes). The agents have a Mattermost (open source version of slack) server they run and I run Mattermost on my phone so I can talk to them and task them with things. In fact, it was through the hermes WhatsApp endpoint that I got the first agent (non-local) to setup the Mattermost server and unboard the second agent (local mbp).
Then I can just chat with them through Mattermost when I need work done. Whenever I need something done I just hope on the Mattermost server and chat with them. I've had them build me multiple research reports (the fully local agent did awesome at this), learn how to use Stable Diffusion on my desktop to generate images, install and perform maintenance on various local services I run (including Open WebUI).
If you were planning on getting an M5 128GB; just get a DGX Spark (~$4500) or a 5090-equipped machine (~$4500) plus a Macbook Air (~$1500). You'll come in below the M5 Max 128 pricing (~$6700+ USD) and be happier for it.
They pulled them a month or two ago, right after I bought it.
> Apple M4 Pro chip with 14‑core CPU, 20‑core GPU, 16-core Neural Engine 64GB unified memory 2TB SSD storage 10 Gigabit Ethernet Three Thunderbolt 5 ports, HDMI port, two USB‑C ports, headphone jack Accessory Kit $2,649.00
After about 1 minute the entire machine basically bricked and I had to hard reset :D
You would have to get a third party reseller/scalper or refurbished mac mini to get 64gb of ram ever since apple stopped selling it.
$6800 is a lot of API credits for GLM, for example, on any provider you want to use.
Now being able to run models uncensored and with privacy has value! But the cost for these is rough today.
I still am going to buy a second one haha
I'm wanting to run Kimi 2.6/2.7 GGUF on it and just slap it in the server rack, but trying to decide if a spark cluster makes more sense.
But it's also really easy to trip up. I fed it some of my Ars pieces and asked it to analyze themes and composition, and it got into a looping argument with me over how it was unable to analyze "my" writing because "the user cannot be the article author, the user is the user, the user did not write the article, the article author wrote the article." I was utterly unable to convince it that I was in fact me.
Qwen3.6-35B-A3B hums along at about 50GB of RAM used with --gpu-memory-utilization=0.42. I haven't tried Qwen3.6-27B (I'd likely grab Qwen3.6-27B-FP8, I think), but I'm curious to see if it makes much of a difference.
I would recommend using llama-server if you're just on a single Spark. You get access to dynamic quants like that more easily, the performance is not that different from vLLM most of the time these days, and it is much faster and easier to switch between models.
As far as intelligence goes, Qwen3.6-27B is much smarter than the 35B-A3B model, but that's also not the sort of thing to argue with an AI model about in the first place. Just open a new chat and try again.
Gemma-4-31B is not as good at agentic use cases as Qwen3.6-27B, but it is a fairly balanced model overall, and worth trying out too. Its MTP can nearly triple the performance of the model, where the benefits of MTP or Eagle seem more limited for Qwen3.6-27B in my testing, maybe doubling the speed.
if a hardware cycle takes ~3 years then fall 2026 would be the first possible device generation where apple exploits its advantage with the unified ram architecture.
more realistically, spring 2027, since they probably also needed some time to make up their minds to lean into that on the top end.
that`s also how i would interpret the recent rumors on m6 and m7.
naturally, the cooling and all that will be optimized around that.
so the first devices that are actually intended and designed for this use case will come at the earliest this fall and more likely in q1/q2 next year.
you are basically paying the price now to be on the bleeding (sweating) edge
I use Windows and this has never happened to me. I have had Macbooks I cant open to fix/replace something trivial while I can replace any part easily on a Windows PC/laptop though.
needs to be noted that it's increasingly uncommon to be able to do so. for desktops you have to build everything yourself - prebuilds (either gaming or workstations) have proprietary PSU and motherboards (in case of workstations, sometimes CPU is bound to the motherboard / manufacturer, for example Threadrippers). Windows laptops now often come with soldered RAM and soon will probably be without M.2 slots like Macs.
There is Framework though I guess
https://github.com/tedivm/qwen36-27b-docker
My hearing is not great, but I think I would have noticed the fan, and I have never heard it. In fact, I had to google to find out if it even has a fan.
An AMD AI Pro R9700 32GB brand new is $1350 right now.
After some tweaking, I had it running faster than the models the 3090 could run, and it could obviously run with higher context limits and bigger models due to the extra vram.
You could run a 4-bit, which is 16-17GB. But, you'd need a smallish context or you'd need to quantize your KV cache. Something like TurboQuant or RotorQuant might help.
32GB is the lower bound for comfortably running this size model. I'd maybe even say 64GB is right-sized, because a 256k context is nice to have for agentic workflows, and that won't fit on a 32GB card without heavy quantization (but I haven't tried TurboQuant or RotorQuant to know what impact it has on memory use for context).
You could also put some of the model into system RAM, but that defeats the purpose of your argument that a 3090 will outperform a Mac Mini or Mac Studio. If part of a dense model is in system RAM, it absolutely will not outperform a recent unified memory device.
But man, I have never purchased a computer which is more expensive than a decent family car.
https://www.microcenter.com/product/709071/pny-nvidia-rtx-pr...
Also, while memory bandwidth is important, it isn’t the only consideration. Apple’s architecture has memory bandwidth equal to a mid-range consumer GPU, but its GPU speed is much, much worse than, say, a 5080 or 5090. This translates into e.g. much slower time to first token on Mac systems compared to dedicated GPUs.
Mac Studio: Ships: 16–18 weeks
Mac mini: Ships: 10–12 weeks
It’s just so flexible, and I even use it in agent mode (ds4) directly on the machine as well sometimes (it’s really not that bad, I’m often running inference for small side projects on my couch), if there is another machine that can do all of this and still function as one of the more ergonomic, well built, and compact laptops out there, I’d love to hear what it is cause I’d likely be interested!
Can confirm this works rather well, most things that integrate with LLMs, (agents, editors), support providing a remote (LAN) URL for Ollama, LM Studio etc.
But you do need a fast LAN connection, otherwise working with agents will be a pain.
Huh, how come? Low-latency I can understand, but I was under the impression that token throughputs were still barely exceeding dialup bandwidths.
I can’t figure out when it makes sense to pay 10k up front for a quantized Llama 3.1 but it’s an interesting option
But yeah, there's a bit of a dearth of models that could fully utilize memory in the 128-256GB bracket at the moment. But things move so fast in this space, I wouldn't base my decision on a generation of models that's just a few months old.
As much as I was tempted to use it on longer projects, I had some reservations about whether it would put too much strain on my MacBook.
Still, I don't agree. I think this machine is meant to use local models. You just have to wear pants if you want to keep it directly on your lap. I rarely use it that way anyway. I prefer it plugged into an external display and comfortably sitting on a laptop stand.
- M3 Pro MacBook Pro 36GB
- M2 Pro MacBook Pro 16GB
- Mac Studio M4 Max 48GB
and I have not heard the fans on any of them with normal use. The only time I've ever heard automatic fans was when I was using a local 12B model on the M3 MacBook Pro, and when running 70B models on the Studio.
You should consider checking Activity Monitor and making sure that the usual suspects are not causing issues with sustained high CPU. And you can use an app like [Stats](https://mac-stats.com) if you want to see that info while actively using the computer.
llama.cpp's Metal backend does use them when they're available.
How is this config?
qwen3.6 35B A3B MLX 8bit -> 85-90 tok / sec! It is impressively fast and roughly 90% as good as 27B (in my opinion).
Wouldn't this damage the MBP display?
My RTX laptop has air intake underneath the keyboard and clamshell mode is surely a recipe for disaster; I've taken numerous measures to ensure that the laptop doesn't stay awake when the lid is down.
I'm running this model on a Framework 13 and the chassis barely heats up at all while running full tilt.
to me that's cheaper than paying an LLM provider such as Anthropic spreading FUD around open weight models & more sustainable too.
Im sorry, but its time to start calling Apple sycophants out. Stop trying to push your tech jewelry on other people. You only buy those computers because they are Apple, you don't know anything about computing or running LLMs, you don't do any real work, so you should probably not give advice on what to buy.
A single 3090 will run Qwen3.6 27b fine, and its VRAM speed is twice of what the best Mac has. And the build will be cheaper. Decent CPU/Motherboard, 32gb of DDR4 ram, an SSD and a Single 3090 should run max about $4grand. Mac m4 mini is 6grand.
Then, when gpu prices come down (or you find one on a deal), you can upgrade the card, or stick a second one, and benefit from more speed. You can't do that with the trash Apple produces.
Flag me if you want, I don't care. Its embarrasing for the tech community to give advice this bad.
I just purchased a Mac Mini M4 Pro 64GB for $3k - 2nd hand of course.
I am not a hater of Nvidia and I am planning on building a workstation based on RTX cards. You clearly do not seem to understand how convenient the MacMini actually IS - the form factor, how quiet it is, how durable it is, how well it integrates with other Macs, how well it works as a bridge to a personal agent like Hermes (integration with iMessage, Calendar, Reminders, iCloud, etc).
I am pretty sure I know a thing or two about computing, I have been in the trenches for many, many years and I have had machines of all kinds, shapes and colors. It just so happens that Macs are very capable, very convenient machines that happen to work great in the era of LLMs, too.
But you do you.
If you are that locked in to Apple, its pretty easy to buy a used Mac Mini older gen for all the non AI stuff.
But this is a discussion about inference. Buying a Mac anything for any sort of local inference is a COLOSSAL waste of money.
Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.
[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...
I don't know how much serious hands-free agentic coding I will ever do on my MacBook alone, but I do know that I would not have got so far into understanding this without tinkering with local models, llama.cpp, LM Studio, and LM Studio and all that.
I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.
Until, that is, I could poke around with setting it up on my own (secondhand) machine, watching the API calls, understanding some of the terminology. I didn't even buy the machine for that; it's just adequate to the task.
The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.
Cloud models are (much) faster, they don't consume so much power/generate heat, they have much bigger (LLM) context, they're much more precise and they have a much wider (engineering) context of the given problem.
Except privacy and use cases that are blocked by cloud models (e.g. reverse engineering), local LLMs are currently an expensive toy.
When I try to program with a local LLM (I'm on a 32/128 GB system), I end up wasting time compared to a cloud LLM.
And I can't say that I won't switch to openrouter (even just for the same models) at some point.
But one of the things I have found about my own process learning is that some lessons only come to you when you make yourself available to them. And if that means doing things the difficult way, that is what you should do.
The rest of my life is ultra-frugal so I am relaxed about this.
Having spent a good weekend learning how to perform latent-steering through playing with pytorch and a local Gemma4 model, there is no way I could have groked any of that in the the way I did without hands on time.
This is on an M3 Max 36GB I've had for a couple of years. No further outlay needed.
I don't know if it has changed my mind about a career change but as I am sure you can understand, I no longer feel like I am running away defeated.
My very best wishes to you :-)
The interesting question is whether that gap will narrow, and if so, how much, and on what timescale.
The exact answer to this question is not knowable, but if you are the kind of person who comes to a site called "hacker news", and you think there is a nonzero chance that the answer is that yes, the gap will narrow and this won't always be an expensive toy, then now seems like a pretty great time to get in the game and start exploring the capabilities.
(sarcasm, btw)
Over the long term it's always been better to buy than to rent, even if the renting option is technically more efficient on the GPUs, you don't have to pay some hosting providers profit margin.
And for users that aren't running multiple agents 24/7, you should be able to fit a good user:GPU ratio.
For example (and relevant to AI) I can generate electricity on my roof at $0.20-25/kWh, batteries included. In California the electric utility can’t offer it cheaper than $0.30-0.50/kWh. Therefore at scale, electricity is actually more expensive.
There are many such examples.
Right now, there is way more scale in centralized AI than there is at the edge. But that could flip. I'd still probably put the probability that it will under 50%. But I'd also put it above zero!
I do realize the cloud is just someone else’s computer right? Power goes in, tokens and heat come out - just in another place
That's never the point of keeping local alternatives though.
For me this dates all the way back to installing Slackware 1.0 (0.99pl12!) on an offline 486SX rather than just using the internet-connected workstations in the lab.
Here, I already had a Mac that was powerful enough to run a local LLM, so now I do, because I can.
I don't recall any previous tech stack that was barfed onto the scene with so little background or reference material, going from zero to endless undefined jargon... and no primer in sight.
For people who demand an understanding of their tools, it's a lot of work. I recognize the value of "AI" in performing the tasks I'd have to do manually; for example, keeping the data structures of my front- and back-ends in sync in a project. But do I want to interrupt my development and take weeks off to digest all of these tools?
And if I do, I want to run the show and fully understand it. And like you, I think that's best done locally.
Cloud models still feel ‘magic’, like you send a request off and get something back, like it’s something ‘special’. I used to joke that ChatGPT might be some kind of mechanical turk underneath.
Watching a model run local on your own machine hits different — you realise that yes, it IS just a computer program. Which for me actually makes me appreciate the leap we’ve made MORE, not less. From an information-theoretic point of view, LLMs really are something special.
The fact that they are just programs, that I’ve now experienced first-hand that they’re just programs, makes all those questions around consciousness and intelligence much more interesting.
Like, just watching a computer I already owned act like ChatGPT with the wifi disconnected.
It was the first time I stopped feeling quite so helpless, somehow.
Qwen barely needs any of Opencode's prompt, in my experience; I think I cut it down to about three general lines I found by googling. Mainly you need only a pre-amble to make sure that the plan mode, plan switch and build mode prompt fragments make sense.
Gemma 4 also needs almost nothing at all, which is fascinating, considering it is not a coding-specialist model. It just seems to be who you need it to be when you ask.
Seems like a GPU with 12GB+ VRAM is going to be a much more affordable way to achieve that? Even a B580 should get reasonable perf there.
I guess I would build a powerful home LLM server if I was convinced I really needed one for my purposes for some agentic application or other. At the moment I'd prefer to ride this out with a machine that is also an excellent Mac.
Just one example, I needed a bunch of images tagged and organised, with a local vision capable model I could pretty easily set that up and leave it running overnight.
I already had the GPU and memory for gaming, so it was at no cost for me to start running local models. But I feel the long term writing is on the wall, local models will only make more and more sense as they get better and more efficient.
I have a pretty deep, maybe paranoid need to be confident I have an intrinsic understanding, and I have found in my life that lessons come to you when you make yourself open to learning.
So I need to build on top of what I know, taking as much of the hard way as I can bear to take at any one time — it has to be not quite difficult enough to put me off.
I can't really explain what I have learned this way that is different, but I feel it in a way that I wouldn't if I'd simply pushed a button.
For the same reason, I have a really basic 3D printer that I've set up myself, set up Klipper, configured how I want it, learned how to calibrate, all that. And now I can say that I feel I have an understanding of 3D printing. I could hold my head above water in a discussion with a real expert, maybe find work in an adjacent field where my insights would keep me grounded.
I can afford a really good printer that has all that set up, and more, has no problems. But I'd just be someone who has a 3D printer.
(Also who am I kidding about the existence of a printer with no problems)
I have colleagues that seem perfectly content to delegate too much to the agents, and it saddens me. It feels like there will be swaths of engineers that didn't train some of the critical thinking skills that I take for granted.
I certainly see it in slack discourse around anything more complicated than a feature implementation. Maybe I'm just cynical. Time will tell, I suppose.
That is why I'm content to delegate to agents - I have more code/features I want to write than I have time to debug (writing is the easy part).
Over the last few months, I've been digging into performance problems with a high throughput service that my team owns. I started working on the problems in my own time, put out short and medium term improvements that legitimately avoided operational issues, and started developing an alternate architecture that should meaningfully address the problems for the long term.
I've learned new things and made improvements that probably wouldn't have ever gone in otherwise.
I've spent my whole career being frustrated by the pile of low severity bugs and performance issues that "I could fix that if I could only justify putting a couple hours into it!". And now I can just fix all those. Nobody is going to question my use of time to write prompts and do code reviews of those things, when I can to my "real" work simultaneously.
What does "mainstream" refer to when we're talking about software development and LLMs? As opposed to "engineers".
But I think there is (and has always been) also a distinction between the "mainstream" of software developers vs people who are working on new tools and capabilities to be used by that "mainstream".
IMO it is certainly true that the most efficient and cost effective was to do "mainstream" software delivery at the moment is hosted frontier models. But for people thinking about "what's next?", it makes a ton of sense to be exploring different models in anticipation of a possible (but certainly not inevitable) sea change.
I mean one of the things I use a local LLM for, because I can, is to generate starter documentation. But I ask it to — I want it to give me overviews, plans, all that. It can make something bespoke for me.
I guess I could also ask it to do the work. But where do you draw the line?
The universal labour-saving device is the great provocation of the next 100 years I think, and both Star Trek and Wall-E have grappled with it.
And that's how skills die.
The reason I delegate so much of local LLM installation and administration to Claude Code is simply because there's no point learning practical things that will work completely differently in a couple of years, or in memorizing procedures that I'll forget long before I need to perform them again.
No longer having to sweat all the details is a Good Thing, not a Bad Thing.
But I think if you want to really learn to ride well, understand horses well, there might be some benefit in learning how to shoe a horse. At some level it should never only be someone else's job.
For example, you need to know it uses gasoline (or diesel), it requires oil changes every certain amount of time, break pad replacement, etc.
You also probably need to know that you can't operate cars over a certain amount of water, that you need a driver's license, stopping at red lights, etc.
Sure, you might not need to be a mechanic, but that's far from not understanding how a car works, which to me sounds similar to knowing how to shoe a horse, which is different than being a horse vet.
Maybe a more apt analogy would be a skill like making fire without a lighter.
That skill died too, so what's your point?
Maybe my biggest problem with the world of agentic AI, and the reason I am putting myself through learning it the way I am, is that the need to know the "why" of everything is so fundamental to me, that I don't know if there is any point to me without it.
So this is really the only way I know how to proceed.
And we happen to be discussing this on a forum where the type of people who will be the specialists for the kinda of systems we're discussing are likely to gather.
I'd be surprised if in my casual discussions out in the real world, I were to run into a lot of people who care exactly how all this works, to the extent that they want to invest significant money into hardware that allows them to run things themselves and dig into what's actually going on. But I'm not at all surprised to come across such people here! (Indeed, it would be very disappointed if I didn't!)
I found LM studio to be a nice starting point. Frindlier and more featureful than Ollama and not as intimidating as llama.cpp (though you will want to use that eventually)
I tried Ollama but I've settled on Unsloth Studio generally; once things really settle down I'll just run the llama-server UI, which is pretty nice.
A friend is tinkering with LLMs for amusement on a 16GB Raspberry Pi 5, and when I explained that llama.cpp now had a typical web chat interface he was so happy — it's amazing what the "table stakes" are now.
Agree having a powerful machine is really worth it in general for professionals, but strong disagree that running local LLMs has anything to do with it. It's hard enough as it is getting a good ROI on your time/money prompting/wrangling with frontier models. IMO leaning on the comparatively limited capabilities of local LLMs is best avoided in favor of keeping your own personal coding skills fresh and continuing to learn new ones.
I needed to do this, this way, in my own time, to put my brain back together. It has worked for me, which is why I recommend it.
YMMV.
But if this is the case, as you say, it seems like a good opportunity to build a more welcoming set of entry points into this!
(Very reminiscent of 3D printing, where you get a lot of very trivial advice poorly applied, which is an analogy I've now made several times.)
Several of the youtubers are pretty helpful, though; I watched half a dozen things and absorbed the broad pattern and then went for it.
Also I got a lot out of reading HN comments, which is why I am here; tucked away in the corners of these discussions are people who can help. Over time I hope I am one.
To me, "how do contemporary AI systems work and interact with contemporary hardware and how can I best take advantage of their capabilities?" is the set of skills that are worth learning at this moment.
What else is there? New / additional programming languages? New / additional database systems? frameworks? orchestrators? cloud provider / infra tooling? architectural patterns?
I dunno, all of this seems really boring and "been there done that" to me at this moment in time!
[1] https://openrouter.ai/qwen/qwen3-coder-next
- opencode with it's webui
- deer-flow with it's research/powered front end
They both run websites so you don't have to baby sit them (eg, keep your mac open). I've build a pdf compressor over a few days by first having deer flow try and research the frameworks and pipeline. It stalls out because its not really a fluid programmer. Once it stalls out, I transferred it (manually for now) to opencode and it's refactoring it because it's just a collective bundle of sticks and it needs a lot of testing to tweak out the limited scop context. LLMs can't really hold large scopes (locally anyway, from what I've read from HN, it's possible with longer context).
It'll complete in a few days with maybe 3-4 hours of full attention interaction, but it's running 3x that without my attention. Obviously, if I paid more attention it'd run quicker, but since it's local, it's not pumping out large volumes of code, it's mostly looping over tests and capabilities as observed.
It's running Qwen3.6 35B MoE on a AMD 128GB strix halo. If I switched to the dense models, perhaps it'd be smarter, but the trade off seems to be much slower gen.
Have you tried Paseo?
I have opencode in a VM, and the paseo daemon running in the VM, and then the Paseo Mac app. Really nice.
(You can also use the Opencode GUI to frame a remote opencode web interface)
I'm gonna check out paseo, but am not looking forward to all the ram the agent needs + all the ram paseo needs
Hello, my brother, just know that you have a fellow passenger in life at the same age who thinks the same thing. I agree that the local stuff is helping my understanding a LOT.
However, my gut feel as someone who got to experience the TeleBomb after the DotBomb is that the obfuscation is INTENTIONAL--it's neither you nor your age. I remember asking people to explain to me what the OC-768 startup endgame was when roughly 10 OC-768 links could carry the world's traffic at the time--and everybody giving me blank looks. The AI Bubble has the EXACT same feel as the Telecom Bubble--just bigger.
What I really wish is that I could find a VPS-type provider where I could toss things into their NVIDIA/AMD machines for an hour or two. Alas, all of the providers seem to want massive paperwork and huge minimum purchases.
I can't wait for the bubble to pop so that we mere mortals can finally build with this stuff.
In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10.
[1] https://x.com/MiaAI_lab/status/2070859135399182444
[2] https://github.com/MiaAI-Lab/Qwen3.6-27B-NVFP4-vLLM
So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.
Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.
edit - after actually reading the tweets (had to use xcancel) and visiting the source git repo, switching to MTP for speculative decode makes things a hell of a lot faster, and the abliterated model plus dflash makes it even faster! I'm now seeing 70-90 tok/sec for most stuff. I like!
https://flowtivity.ai/blog/120-tok-s-1m-context-private-ai-d...
The real sweet spot for Qwen 27B is getting it on something like a Dual 3090 system or some other config where it can blaze at 50-80 t/s and that costs well under 6K currently. It is a surprisingly capable model. Using something like GLM for orchestration, specs, task farming and then letting Qwen churn is relatively inexpensive.
Overall I recommend people try models of this class out using OpenCode and some for pay service to experiment with them and understand how they work. I find they are very useful.
Long term, I am convinced enough that if I wanted to use local models for any number of reasons I would be okay investing in a dual GPU box. The Mac is not fast enough for me and M5 Max is just too expensive relative to GPU linux box. Still, it is nice to have the models local ON the laptop and it is useful for what I care about locally.
The limited context is problematic. I’m not exactly sure what it’s got available but hermes was hit and miss on a prospecting job.
It does seem to be doing useful work but it’s not API call level quality
If that's accurate, then you must be doing something wrong/weird. On a single RTX 3090, I'm seeing substantially higher performance. Dual GPU won't necessarily give a ton of performance improvement, but it shouldn't hurt performance.
With llama-bench, I just measured Qwen3.6-27B at 41 tok/s and Qwen3.6-35B-A3B at 153 tok/s on one RTX 3090. (Those results are without MTP. With MTP, I'm seeing about 65 to 70 tok/s for Qwen3.7-27B.)
I'm using the unsloth UD-Q4_K_XL quant. If you're using bf16 for some reason, that could explain the low performance and inability to have enough context despite having 48GB of VRAM, I guess, but... don't do that.
Are you running with MTP enabled? I have seen some people on M5 hardware report 20+ t/s on Qwen3.6-27B using MTP... and I think that was a regular M5, not even M5 Pro.
Gemma 4 is the only model series at this parameter scale I've seen correctly answer some of these. One of the answers even made me re-evaluate what I thought the correct answer was, which I did not expect.
When I look at the Artificial Analysis numbers, I can see that some things about Qwen 3.6 look inflated as a result of either metrics that weren't measured yet for Gemma 4 31B, or for metrics that just aren't going to be relevant in a lot of the essential tasks. In a lot of the relevant metrics, Gemma 4 is either better or on par.
Then once it's all quantized all those benchmark results will be hurt, and Gemma 4 QAT has better quantized performance. I think it's more competitive unquantized than people give it credit for and way better quantized than people give it credit for.
Qwen 3.6 clearly isn't legitimately bad and maybe it's quite nice at fp16, but it was a disaster quantized in a 24GB scenario by comparison.
If you want to run unquantized, you definitely need 128GB.
Even that isn't strictly necessary - you can get perfectly acceptable performance by splitting a model between multiple older 12 or 16 GB cards.
As of writing this, it shows 24 offers between 700 and 950.
Edit: it’s not just “data privacy”, when you are using Claude, you are shipping EVERYTHING to Anthropic. It’s crazy.
If the cost doubles, or 4x, which is seems to need to for them to go profitable, what then?
$5000 in US Treasuries (currently at 4.89%) yields $244.5/yr. That's more than enough to cover the annual Claude Pro subscription ($200/yr) which includes Claude Code with lots of Sonnet usage (far better than Qwen 3.6)
Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument.
Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.
https://github.com/noonghunna/qwen36-27b-single-3090
Flies though (50-70tps is impressive for a model this smart)
I went through roughly the same process to get it working on my M2 Macbook Pro... at awful speeds of course, since models like this one are mostly bound by memory bandwidth.
The 3090's TPD is 350W, but given that LLM's token generation isn't compute bound, people usually undervolt these cards to reduce power consumption. IIRC you can get as low as 200-250W without any degradation. Caveat these figures are without speculative decoding and at batch size =1.
I did find a few useful parameter settings I've already discovered using my single 3090 and ollama.
I'm just remarking that the LLMs overwhelm me with minutiae, especially as I'm working on code design. I frequently ask it to restate concisely, and that helps.
[edited to mention ollama as a nice alt]
I still use the MTP version as it _feels_ slightly better quality, and because the unsloth quantizations I can get have more variety to fit into the various systems at hand... but that's not for the MTP aspect, unfortunately.
In the article they did have ~2x performance on the 27B (which might be something to retry, though on my Framework that would bring it from 5 -> 10 token/s so still "excrutiating" speed, probably).
YMMV for sure.
I use my MBP essentially as my workstation, it's almost always plugged in. I have a MBA (M4, 24GB RAM) that I picked up for ~A$1500 or so, and that's an amazing daily driver. I don't do local LLM inference on that unit, I can just hit my own APIs (via LM Studio) on the MBP over Tailscale.
Context size?
I paid 2424 euros in total for this machine. And it can easily run the models discussed in the comments and in the article. It's tiny, and runs CachyOS like a champ. Over 4000 euros less than the price you listed.
We can all send a thank you letter for our friendly billionaires such as Sam Altman for the price situation we're in today: https://www.mooreslawisdead.com/post/sam-altman-s-dirty-dram...
i'd expect anything on github for example to be already in their training set or is training on actual usage more useful to them?
In any cases, from the economic point of view, running models on laptops make little sense. Even at the pure cost of energy consumption, it might be hard to beat pricing at tokens generated at scale.
At the same time, it is a breaktrough, that will change the game. Previously such vibe coding on consumer device was not hard or costly - it was impossible.
I think you might be a little to into the stew here.
I haven't tried it with https://lemonade-server.ai/ yet but I just might give it a shot.
In the open model space an insane amount of effort goes into getting more powerful models to run with the same or less RAM. For example in the diffusion world many things that could not be run on easily under 24GB of VRAM actually run much better today with much less VRAM than they did a few years ago. You can do many things today with 8-16GB of VRAM that would not have been possible. At the same time the most advanced open models, like LTX 2.3 for video gen, still seem to respect 24GB of VRAM as the upper bound.
Similarly the standard "big" but localish open model for LLMs back in the day was Llama 3 70B, this was both a much worse and much larger model than Qwen 3.6 27B
So in two different spaces I've witnessed the "RAM required to run the best" decreasing or at least remaining stable, while the performance being achieved in both areas is astounding (LTX 2.3 is faster, better and more capable than the Wan 2.2 model that held popularity before it).
The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.
I'm thinking of getting a SoC machine with 128GB RAM but the bandwidth is limited to 256 GBps. Would you even consider such a machine a decent investment, or should I wait for the newer gen of chips? Thanks!
These devices, especially the DGX line, are fantastic if you are interested in low-level CUDA programming. The DGX spark can be used to prototype CUDA code/libraries for GPUs that most of us couldn't think about affording. If you want to learn how to program for datacenter level GPUs then these are the best way to get that at home. Sure your code will run very slow compared to the real thing, but you can take that code and, theoretically, run it on the real thing. For anything else though, I feel there are better options.
If you're interested in pure inference I'm pretty partial to Apple devices. The M4 Max gets you 546 GB/s, the M5 MAX 614 GB/s, and the M3 ultra (you'd have to buy used at this point) 819 GB/s. Plus you have a very useful computer even if you realize you don't want a full time home inference server. Additionally these devices require very low power (if you're running high end consumer GPUs you do have to think about what your energy costs are per hour and how warm you like your room).
If you're interested inference and training, or already have a pretty beefy desktop PC, or simply demand the most token/s you can get, then GPUs are the way to go. The downside is they're still pretty memory restricted (but honestly the options for what you can run on any RTX N090 are pretty good). You'll get blazing inference and prefill speeds on these devices. The only down side is, if you are using them heavily, you will see it on your energy bill and feel it in your room.
The "should I wait" question is also potentially applicable. The world of consumer hardware is looking increasingly bleak (and expensive) but if Apple does release a new "Ultra" model we could be looking at inference speeds very close to GPUs (there's still limitations to these devices that makes training preferable on GPU)
What I had in mind was an AMD Strix Halo machine, but it seems to have none of the advantages you mentioned. It's neither high bandwidth, nor does it have CUDA support, nor does it have support from the big OEMs. All the boards are from relatively obscure Chinese vendors.
It seems like all the major OEMs have rallied behind Nvidia, if you look at the upcoming RTX Spark laptops.
The same can be said about operating system memory requirements. I am sure Linux and Windows kernel developers can confirm. Yet 30 years ago Solaris used to run comfortably in 16 MB of RAM, today you need 512 times that to run Linux.
... but, the models that WILL run on 128GB (or 64GB or even 32GB) models today are a huge improvement on the best models that would run in the same amount of memory six months ago.
If you find three finds that also have a 128GB MacBook, you can chain them together (the MacBooks, not your friends) and make it work.
You could also run GLM-5.2 on a single MacBook if you stream the active parameters from disk, but even with speculative decoding, you'd probably only get in the order of 1 token per second, so this is not really practical for most applications.
They’re trending to be the right size to be good.
Qwen3.6-35B is not as good as Qwen3.6-27B. The larger model is faster, but a lot dumber; it gets caught in loops, makes crazy mistakes, and is just not as good. It’s bigger, but it is nowhere near as good as the 27B variant.
at 128GB, you can find almost it's entire context for Qwen3.6 35B MoE.
Again, I think you have too much faith in extrapolation. It's like you got a baby at 0 months, then measured it at 12 months and expect it to be a giant.
i've watched friends try that route; i've been through this before. taking a downgrade is never fun: if it's a thing you're likely to care about in the future, then sometimes it's better to place yourself in the right ecosystem early.
in terms of privacy, yes that's a real application, but someone taking it all away? I don't see it happening.
it's not an OS or a device, it's just a box/thing that runs a model, it's really commodity stuff we're talking about
more realistic concern would be that the open labs wouldn't be able to compete in the future thus development ends, but that means you can't host models that don't come out so...
again maybe I misunderstood but I just don't see why this would be worth it just for that one concern
From what I understand, for a developer, $5000/month is maybe the high end, but $5000/year is fairly standard. (Is that accurate?) So if it pays back in 15 months, that's pretty decent. If it pays back in two months, that's spectacular.
Disclaimer: There's a 35% sale from Alibaba right now. And I'm not accounting for input tokens going faster than output tokens.
You're welcome to make your substantive points thoughtfully, just not aggressively.
Ryzen AI Max 395+ with 128GB of unified memory can be found around $3-4k.
But 27B isn't that large, either, especially if you are ok with the quantized models. So this laptop choice seems to more be a "because they had it" rather than "this is what's necessary for this particular workflow"
The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.
Not really germane to your comment but I hope I don’t sound old when I say I remember a time when spinning up a PoC was a week of work, and a statement like yours was pure science fiction.
If I start prompting away the core of a new project I lose interest in the entire thing almost straight away. I hate it. The next day I could care less about it. In fact it just makes me lazy, like a fat person who drives everywhere.
I love typing code and thinking for myself. Im going to continue to do that. I still dont know anyone who's shipped anything truly useful with this garbage tech, let alone with a local 30b param model. So much cope in these comments.
Spending 6k on hardware to run the worlds most mediocre model truly does make you an incredibly stupid person, so Im not really suprised by these comments of people saying these tiny models are helping them so much.
Its like a special needs kid all of sudden got the ability to code, of course they'd be impressed by basically all the code it produces.
I’ve used Qwen 3.6 27B for many things at work, and I’m regularly able use it for reasonably scoped tasks.
I’m not saying these models are perfect.
But you are complaining about people on the extreme, while at the same shouting from the opposite extreme.
2) Not every team will have someone with 20 years of experience in a particular domain eager to spin up a PoC.
What are you even saying? Are you aware that there is a massive range in the scope of projects? You must work on some incredibly simple CRUD apps if this is your take.
This is an underrated consideration when evaluating the small models: The further you deviate from standard example code, the more their weaknesses show.
My experience is that Qwen3.6 produced some amazing results for a small model when I tried it with simple apps that are widely reproduced everywhere. If you want a React TODO app or to set up a little boilerplate app with shadcn and other popular tools, it will produce something that looks not too bad.
Then when I started straying outside of common tasks and into some of my more niche work, it would spin for hours and go in circles before finally producing some groan-inducing output that wasn't usable.
If you're looking for a model to help with simple refactoring or small tasks where you provide very explicit instructions for exactly what you want, but you don't want to do all of the typing yourself, they can do a lot of good work, though. But you're right that once you get into long context sessions involving topics off the beaten path, the weaknesses are very apparent.
The quantizations that are popular for making these models fit on smaller hardware make the problems worse. When you read it about online there is almost a consensus that 4-bit quants are lossless and that you can use q8_0/q8_0 kv cache quantization without any real loss, but in my experience with real projects there's a substantial degradation in long context performance with any of these quants.
Never go below an fp16 kv cache unless you've already tested it in advance with your model on a verified task that you know it can successfully complete. People should also test the difference using the exact same seed value so they can see how the tokens diverge. If you have memory constraints, sometimes you can still use an fp16 kv cache and use storage for an agentic buffer to work your task with mixed abstractions rather than having everything in memory.
For 4-bit weight quants, Gemma 4 31B QAT is where people should be looking instead of Qwen 3.6.
Modifying existing code is way easier if you don't expect it to be smart about it. Don't say "add X feature" and let it explore the codebase and build its own understanding. Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Now you've done the hardest part of making the decisions and it just has to follow instructions while coloring within the lines.
Is that not how you would work with any model, local or not? I wouldn't trust it to make the right decisions unattended. I just know the moment I look away it's going to do something utterly braindead.
https://github.com/verdverm/pge-jax
All small-scale stuff. For large integrated projects I am finding DeepSeek v4 Pro commercial API to be very inexpensive and helps me produce good results.
1. Maybe you should tell us what those limited experiments are.
2. Maybe you should actually try 3.6 because it's huge difference in most cases. Don't forget to tell us quants and don't forget to tell us scope.
3. Maybe actually show us data compared to frontier models instead of this... vibe comment. Pretty tired of this kind of comments on HN that doesn't require logic or evidence. Just vibes. Like the pelican riding a bicycle crap that everyone has taken for granted but has no objective way of assessing goodness.
(I'm aware the price is, in absolute terms, more expensive where I live compared to the USA. That reinforces what I think, because anyone sane that would've bought one of those in another country would sell them as soon as they landed here and save that money.)
* yes, you can run it on an older/smaller GPU plus system RAM but performance will suffer
* if you want optimal GPU performance you need the model in VRAM plus context, so 24GB (3090, 4090) or 32GB (5090) cards, plus a system that's reasonable powerful to plug them in to. Ideally you'd have a multiple cards working together but for optimal performance this means either 2x 3090 or nvidia's workstation cards.
* you can go for a 128gb Strix Halo system, but the memory bandwidth isn't great and they're becoming increasingly more expensive (5.5k EUR for HP laptop, 3.9k EUR for GMKtec EVO-X2 mini PC)
* you can go for a 128gb DGX Spark (5k EUR+) which also has unspectacular memory bandwidth or RTX Spark (price unclear but probably not cheaper)
* or go for a Mac with a decent CPU and a good amount of RAM (bandwidth varies by model, but typically a bit better than Strix Halo/DGX Spark and worse than bespoke GPUs.
As usual with such questions, there are of course cheaper paths (if you want to accept the tradeoffs) but Macs are reasonable vs. competition for these workloads.
You get fewer tokens per second, but at some point the balance between quality and quantity makes the large model size worth the spend.
When you're spending this kind of money, you may as well treat yourself to a pretty screen and some decent speakers. Nothing the competition doesn't offer these days, but you get them for free with the car-priced RAM upgrade so why go for less.
Personally when going on the road I like portability (14" MBP or MBA), but at home I want raw non-thermally throttled power.
$5k for DGX Spark as well.
I spent less than $4k, OEM are better boxes for cooling, no apple markup, I get a real Linux system for stuff like k3s.
No Apple markup but you get the Nvidia market up instead. Prior to the recent Apple price increase due to RAM shortage, an M5 Max 128GB was a bargain if you want to run local LLMs.
You need an expensive motherboard, cooling, PSU(s) to use multiple high end GPUs together. Then there is the noise and the fact that you can't bring it on an airplane.
I've ran comparisons against everything that's available on OpenRouter (well, as of few weeks ago), and for $0/tok, the local 27B Qwen can't be beat. Sure, it's slower, and yeah, the office is a few degrees warmer than it ought to be -- but nobody can pull the plug, nobody is watching over my shoulder, and the results are on par with SOTA.
Can't wait for a similarly sized Qwen 3.7 - from what I've seen so far, it's a leap ahead of the previous version.
Builds and local test runs are 3 times faster than the Windows laptop option. The machine will pay for itself just based on that within 3 months. I can spin up a local kubernetes cluster and do full integration tests while I am working on other things as well.
It isn’t a strictly Mac vs Windows thing though. It looks like the culprit is the MDM software on the Windows machines is just crazy slow and constantly getting in the way.
If I was paid less it would definitely make less sense for the company to pay for this machine.
Yes. Your people earn an order of magnitude less income than Americans.
Imagine its value if war broke out over Taiwan / Greater China, or really any of the dark scenarios with global connectivity or the truthiness of commercially available models. It is a very, very difficult piece of equipment to make at any other moment in history. I wish I could have purchased more. I saw the signs and price trends and out of stocks as they unfolded. No doubt others with the means are stockpiling.
There is not a period in the history of computing where this is true of consumer hardware over a decade for anything other than hardware already at the very bottom of its depreciation curve. It is surprising to me that you state that as an obvious assumption.
I suppose if your base case is Taiwan war that may be true, but there's a lot of folks who seem to be assuming the current hardware crunch will go on indefinitely when the natural state of hardware is getting cheaper over time.
Yes. Back in the my days at $faang in europe it was not uncommon to hear people getting 120-160 k€/year in compensation and we were “poor” compared to us engineers at the same faang (4-500 k$/year total compensation) with a bit of seniority…
I just got a B70 with 32GB RAM for the equivalent of $1200 (incl. sales tax and import duties to my non-US location, so presumably it could be cheaper elsewhere). The memory bandwidth is 608 GB/s. For M5 Max (32-core GPU) it's 460 GB/s and for M5 Max (40-core GPU) it's 614 GB/s. A 3090 is still faster at ~900 GB/s but you're getting 32GB VRAM for a lot less than equivalent Nvidia cards. It's about 1/3 the bandwidth of a 5090 for 1/3 the cost, but with the same 32GB VRAM. If you're interested in being able to run bigger quants with some context and stay on a lower budget then it's an appealing trade off.
I'm still exploring using these local models so don't want to spend the equivalent of $5 000 - $10 000 just to test it out. I don't mind slightly slower perf to do some experimentation more affordably.
I actually got an B50 16GB (with meager 70w TDP!) first to test an Intel card with my stack - it worked easily with Ubuntu & Vulkan. I'd read a lot about hassles and people writing them off as unusable but it seems like these are often with SYCL which doesn't even seem to outperform vulkan and so why bother? (The B50 was just $370 inclusive tax and duties). Literally `apt install` the vulkan libraries and it worked with default xe driver in 26.04 and the vulkan build of llama.cpp. The SR-IOV PF/VF also just works with qemu/kvm, no tricks required. Since I got it fwupdmgr has updated the firmware twice so Intel is presumably actually trying to support these products.
ROCm nightly was pretty easy to setup and get up running. The 9070XT has been a decent card for my use cases.
But the SYCL ecosystem versions. Absolutely horrendous and everything is hundred commits behind. Vulkan is probably the only way forward with this card.
Seriously, just put $10 into openrouter and play with models that are cheap but bigger than what you'd reasonably be able to run locally like deepseek v4 flash (unquantized). You'll be surprised by how far that $10 goes for a model better than what you'd be able to run. Even further on the model you would be able to run locally. Then think of how many long it would take to match the cost of spend + power on doing it locally...
I can run qwen 3.6 35B on my gaming PC at around 50 tok/s and other than power cost of a tiny bit extra per month, it's hardware I already owned from years ago.
I'm not really sure why qwen 3.6 35B is so expensive on openrouter, it seems abnormally high for what hardware it takes to run it.
On a 2021 M1 Pro (32GB RAM) I can get either of them as `IQ4_NL` quantized models (the first with reduced context, around 160k; the second can do the whole 264k with RAM left over), running something like 30tokens/s.
On a Framework 13 AMD AI HX370 it can use the same, but both on Q8_0 quantization, full context window, parallelism. Speed is just ~15tokens/s so slower, but definitely smarter than the lower quantized siblings.
Both of them are good developer partners for an engineer who wants more of a second pair of eyes and a rubber duck, rather than a model to just do everything for them. Pretty good for my brain dumping, some commit reviews, sanity checks, just always assume that every claim has to be checked and re-checked.
The only problem is really the context loading, that's pretty slow (starts off around 300token/s on empty context, by the time we get to something like 70-80k which is just a bit of repo discovery, it can run around 80 prompt token/s or less, so there's always a lot more waiting around. Local tools need to bump all of their timeouts, and have to be mindful that there's unlikely to be really meaningful parallelism on these machines with local models.
I'm still figuring out how to approach these things, though. Definitely better than glorified autocomplete or search tool (and too slow for the former, pretty decent for the latter). Their limited skill and performance make it more in line with other tools like my IDE or editors, that they are still in the "tools" compartment of my thinking, rather than "independent, cognitively active entities". Which feels like a good thing.
Hopefully we're looking at a future where local models become more & more realistic to use for reducing remote AOI spend.
QAT, MTP, 128k context.
I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.
Though I’m currently working on QADing the smaller Qwen 3.5 models from FP16 teacher to NVFP4 student, to hopefully eventually apply it to 3.6 27B… harder to get right than I expected though!
https://huggingface.co/google/gemma-4-31B-it/discussions/118
https://arxiv.org/abs/2212.09720
Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?
I’m not having it build whole features from scratch, though. I give it pretty explicit instructions closer to the class or function level, and it still saves me an immense amount of time, while I’m very connected to the code that’s written.
Definitely the sweet spot for me.
For 24GB VRAM cards (e.g. 4090) you can use Q6_K (22.5GB) or Q5_K_M (19.5GB) quants, possibly offloading some of the weights to RAM.
At any rate it makes a stolen backpack or spilled drink a lot less damaging.
Unsloth recommends 18GB of RAM for Qwen3.6-27B (for their version of the model).
https://unsloth.ai/docs/models/qwen3.6
Sent from my 8gb M2 Mac mini.
I struggle to imagine purchasing multiple 1k+ cards on my own dime.
If Qwen models are so much easier to run, why are the providers charging more than V4 Flash?
[0]: https://aibenchy.com/compare/qwen-qwen3-6-35b-a3b-medium/qwe... <-- compare how the three models draw hamsters svgs, lol
The full 128GB is surely helpful in keeping browsers, editors and other things running since even 20-35GB models + k/v caches can eat up a lot of the core 64GB in my experience.
Are these unified memory Macs and giant 24GB desktop GPUs achieving dozens or hundreds of tokens per second commensurate with their 10x-20x cost?
This thing sounds like it should be a monster but we keep running into issues of the old GPU architecture, lack of support for AMX or AMX not being as big of a help as you'd hope when it does work, etc. Apparently we only got 5 tokens per second trying to set up Qwen 3.6 27B, and a similarly bad result trying to run GLM 5.2 which fits in memory but the custom kernels we had to try to contrive were too slow. I feel like this system should have tons of potential, especially if something was designed to let the AMX and huge system memory shine.
Does anyone have any suggestions? This thing was fun to set up and it's really cool but it's been a bit disappointing not getting any big tangible results so far.
We have a similar system on a single-cpu Tyan board with 256GB RAM that I'm hoping we might be able to use in conjunction with the first one if EXO ever gets good Linux support for GPU/RDMA over InfiniBand.
I love it because the watercooled 3090s are completely silent even under load. Facebook marketplace is definitely the move for a lot of the parts unfortunately, since you ideally would have higher end parts that are 2-3 years old.
It was super rough going to get started with them back in January, but right now the cards purrrr and I haven't even tried tuning yet. You need to use a patched vLLM image with aiter but besides that things are finally working on the ROCm front.
The results are impressive considering the amount of people trashing AMD and still trying to recommend 3090s. I hope to buy a 2nd one at some point, but I also hate the version hell of vLLM, the R9700, the ROCM version, and Qwen3.6 all not agreeing with each other. I haven't gotten vLLM to run properly for Qwen3.6, since the version that runs on a 9700 doesn't support 3.6 yet.
I'm trying to quickly hack out a optimized path for just Qwen3.6 to run against rocm natively (e.g. my own inference server for 9700s basically) and see if it can perform better than llamacpp vulkan's results.
Word of caution - the last llamacpp with good performance was b9209 from a month ago. After that, for some reason, vulkan performance dropped by 10x, which has made me lose confidence in llamacpp in the long run.
Having said all that, 3x is 96GB for 4k and peak 900 watts. A 96GB Blackwell is $12k and peak 600 watss. And they will have a similar memory throughput (minor negative to the AMD cards for split processing). It's crazy how price efficient the r9700 is compared to the Nvidia cards.
https://unsloth.ai/docs/models/qwen3.6
Qwen 3.6 27B will run in full offload with a 4-bit quantisation in 64GB on an M1 Max. It is quite slow.
I don't know about 48GB but 64GB should be enough.
It got rather tangled up when I tried it with one of my coding tests, which is a simple wordpress plugin, but I frustrate the model by asking it to write code for older PHP, break WP coding conventions and use a rather bespoke method for arranging code in objects. So it is sort of a hybrid of a green field and brown field task; a bit muddy.
It did not do as well as Qwen 3.6 35B, but the way it worked through its thoughts was interesting.
TBH I struggled to understand what DeepReinforce are doing that is materially different; the explanation of their training technique goes over my head at this point.
So for example I'd favour a used M1 Max over a used M2 Pro, at least based on my naïve understanding. Not quite sure where the balance changes.
There appear to be some hardware improvements with the M3 and up regarding the Apple Neural Engine which I'd hope would show up in MLX performance; I remember seeing some optimisations in image generation models that are only possible on later hardware.
The GPU cores are progressively better I believe, but the memory bandwidth is lower. Though perhaps the M4 can get closer to actually saturating said bandwidth.
(And I must reiterate that my understanding of this stuff is pretty naïve.)
A very useful resource for characteristics and comparative performance of all M variants, if anybody is interested, is https://github.com/ggml-org/llama.cpp/discussions/4167?sort=...
Its sister discussion for nvidia gpus is https://github.com/ggml-org/llama.cpp/discussions/15013
Note the drop in performance for the base (binned) m3 max version. You are better off with full m1 max than the binned m3 max, even price aside.
The issue I have with my m1 max is that with 64gb you cannot run really decent MoE models, ie the ones you can run like qwen 35B-A3B have only 3b active parameters and are much less capable than qwen 27b in my testing. So I end up running the 27b one, but it runs relatively slow (though still usable at 10-20 tok/s) and I would have been better off a used nvidia gpu setup for dense models. I assume 35B-A3B has its use cases, eg as subagents, just that I cannot find them. With a higher amount of ram I could probably run bigger MoE models which could be more comparable, though prefill would still be an issue (and prob a bigger one). The only hopeful thing is that there are performance hacks appearing (speculative decoding and prefill) that seem to start improving inference speed once getting implemented, so I am mildly hopeful.
(I must also iterate that my understanding is not very deep either)
On M5 128GB one can make use of the ram and use sparse MoE. For example, DeepSeek-V4-Flash will fit, served by DwarfStar (https://github.com/antirez/ds4). One will probably improve 2x the token/sec speed, given DS4F 13B activated params in the MoE are ~1/2 of the ~27B of the dense Qwen.
27B Of the Qwen fit even on a cheaper 24GB card, e.g. amd 7900xtx (<$1K?) or slightly dearer nvidia 3090 (with cuda). With ~900 GB/s bandwidth they will likely be ~50% faster than the M5 with 600 GB/s.
"My personal impression is that within these quantizations Qwen 3.6 27B is as good as (or maybe slightly better than) DwarfStar4. Though, I won’t be surprised if for longer context projects DS4 has an edge."
It does about 30 tok/s which is enough for me. It's about half what the online models do, but it's enough.
I've heard their 9B models are also good, but they aren't much faster if you have the ram and a nice cpu.
These qwen3.6 models are the first ones I find can do much. GPT OSS was good, and Gemma4 is better. Gemma knows more facts, but qwen3.6 is smarter.
If a model runs fast enough for your use case and does exactly what you need it to, then you don't need a much slower model that might be more accurate. If you do anything more complicated, the dense models become more necessary and they are much more computationally heavy by comparison.
On your hardware an Unsloth quant of Gemma 4 26BA4B QAT would likely give you better results, but because it has 4B active parameters instead of Qwen's 3B active parameters, it will probably run slower.
[0] https://deepclause.substack.com/p/how-to-make-small-models-p...
Progress marches without mercy.
https://github.com/ikawrakow/ik_llama.cpp
Edit: it's gonna be slow if you're not using any VRAM. But it's possible. Software isn't going to speed that up anytime soon, it's just a hardware bandwidth limit.
>
> --jinja for tool calling support
Pretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year
On a serious note, I run my models on desktop pc, simple api and i can use them wherever whenever.
72.06 t/s. That's the full Qwen 3.6 27B model BF16, using MTP, running on Ollama. Yes I know I should bite the bullet and get vllm running on that box.
That was, also, at a 570 watt limit: I normally run a little less, but when I first tried this I actually forgot I had set the limit to 300 (it's a hot day, I figured why fight the A/C?), and at 300 watts the same question came back at 69.38 t/s. (The extra power matters more for compute bound things, the difference in generating LTX2.3 videos is considerably higher... but still not linear.)
Personally I prefer the 35B MoE model, which is fast enough to be interactively useful, and capable, but I would probably use the 27B if I wanted to generate whole applications like that.
I am unconvinced that most "local" AI applications need anything much more powerful than the Gemma 4 12B model. Local agentic coding is a small niche, but there are plenty of ways a local model can help with development tasks.
I would really like to see a 12B or 16B Qwen 3.6.
I am currently playing with Ornith 1.0 in the MoE configuration, which is based on the 35B variant of Qwen 3.5; I am not sure if it is better than the 3.6 version.
Benchmarks say it is; my own silly tests either suggest otherwise or suggest that I have to talk to it a bit differently.
I really want to have a model that i can run locally on my 24gb m4 pro mbp for when i don't have internet to connect to my 3090 running the qwen, and i love how gemma 4 models 'feel', but i can't make them be competent. I am in the middle of finetuning both qwen3.5 9B and gemma 4 12B just to try and make those bridge closer to 27B for coding/agentic tasks (and am trying to ternarize and DQT 27B so that it fits in ~9gb pre-KV).
How do you run the gemma? What do you use it for (and in what harness), maybe llama.cpp and pi-mono just aren't for this model and that's what i'm doing wrong.
I am still mostly tinkering/learning rather than spilling out code, and I feel quite slow on it. So it doesn't matter too much to me if it is really slow. More the journey than the destination if that makes sense. I'm stubborn.
I have tried the Gemma 4 12B model (Unsloth's QAT version) with search/browse tools in LM Studio and Unsloth Studio, when I am trying to understand a new thing.
Basically I get it to write introductory starter documentation for me to absorb, because my big personal problem, these days, is focussing enough to start a project and then digging in; I need the help.
I have found its limits on obscure packages (that it sometimes makes up) but before that it's a bit like stumbling on a blog post that happens to be really right for your particular need. Good enough to work through.
It's stuff I could ask Perplexity to do, or ChatGPT, to be fair, I just like LM Studio for this and have the inquisitiveness to want to run it locally.
In your case: I don't believe it's the quant. I'm sure it's the model — it has good coding knowledge but it's clearly not specialised. It might be good enough at writing Python/PHP/JavaScript at a novice level. It is also quite good on WordPress tooling and functions.
But I wouldn't bother with it for agentic coding if you've got experience elsewhere. Might be interesting to see what you can do with the 9B Ornith model?
Qwen 3.6 MoE in its Unsloth version is another matter. Impressive and I am trying to find ways to support my old brain doing what I've done before.
However, text-to-speech, speech-to-text, and non-code LLM use cases are so useful to have local, and don't require big hardware.
Having a universal reliable inference engine interface, I think, is the big unlock that needs to happen before app devs can ship these features.
Personal concrete use case: meeting recording app. This uses Parakeet + Qwen to create local transcriptions and post-cleanup, respectively.
Right now this app has to download and manage all these models, then bundle an inference engine to run them. It's a lot of code that probably should belong to the OS, or at least a standard interface.
While apps can offload some of this to llama.cpp or a similar process over http, that's another set of setup for the user to do before they can have a useful app.
Anyway, if you're getting started on a Mac, I'd suggest trying out oMLX (https://github.com/jundot/omlx) before messing with llama.cpp. In particular they have community benchmarks so you can see what kind of performance you're likely to get: https://omlx.ai/benchmarks. I wished each one had more configuration details though.
Certainly this is falsifiable easily by any of us doing it on a regular basis
> Qwen stuck in thought loops
This does happen when context is not managed effectively; creating plans, using subagents and compactions strategically resolves this
> creating plans, using subagents and compactions
Yes, these are all things that Claude Code does for you. However, for the thought loop issue, these are not the fixes. The canonical fix is to limit the number of thought tokens (llama.cpp's `--reasoning-budget`) or try to mess with the various penalty parameters. In any case, it's not a solved problem as far as I can tell.
Jackrong has a few different ones available depending on what you're trying to do: https://huggingface.co/Jackrong
don't get me wrong, the frontier models are leaps and bounds ahead of what qwen/kimikgemma are doing - but i don't need to drive a ferrari to the grocery store everytime either.
- Memory bandwidth; BUT the requirements are currently capped because models have stopped growing at around 1-1.5 trillion parameters for quite a while now. You only need more bandwidth if you're optimizing for the highest possible concurrency (i.e. you're a cloud provider). Also, MoE exists.
- Support for native low-precision math (like FP4 and FP8); BUT once your GPU supports native FP4 (Blackwell+), there's generally no reason for GPUs to go lower because of the obvious quality degradation.
- VRAM capacity - just like memory bandwidth, it's practically capped by 1-1.5 trillion parameter models and is unlikely to need much more in the near future. Also, the current trend is toward miniaturization: modern 30B-class models (which require far less VRAM), now completely destroy 200B-class models from just two years ago on most tasks. We also have better understanding now how to compress contexts.
Most model improvements currently seem to come from RL/harness-based methods, not from scaling models or running new algorithms that require fundamentally new GPUs.
So I don't see why GPUs that exist today must become "outdated" in a few years. They'll be seen as outdated by hyperscalers because they need to serve the maximum number of users as cheaply as possible, so of course they'll replace their GPUs with newer ones that have higher memory bandwidth or more tensor cores. But you don't need that for local inference.
How does that work? They have negative GPUs now!
I do not have a crazy rig, a modest gaming one at that, but in trying to understand more about agents and their capabilities, I am SOL with my 16 GB of RAM and 8GB of VRAM. I can get most small, non tool calling models to perform well, but I've had major issues with anything over 9B doing anything more than reasoning (egregiously slow at higher parameter counts).
And so far, I cant get even Pi to extend itself or do any meaningful work with any of the models I currently can get to run.
I very much appreciate the frank response, as it makes me feel less defeated at knowing my understanding of how it should work is not the full issue, hahaha
For you, you could try gemma-4-26B-A4B
You should look at gemma-4-26B-A4B. 16+8=24gb and Q4 is about 16GB. Not much context left, but might run.
But certainly seems like we are a few years away from that, sadly.
Am I also screwed in being able to train my own small model or adjust another one with such a non-workhorse PC?
Was just trying to see how small I could go and get acceptable results, but yeah, larger Qwen 3.6 with MTP is going to be better. Cant wait to see how AI model (unsloth/local-llm/heretic/reaper/etc communities) are tweaking/engineering quality down into smaller models. Lots of new things coming out.
Offloading compute to them is much easier, except its still a limited set of open models. Most companies are already running in AWS, so it's an easy add, models run in a trusted location, just another line item on the Amazon bill. You don't have to talk anyone into signing up with a new vendor. Plus you don't have to worry about local hardware at all.
I find that for local coding, I need to spend a lot of time building concise SKILLs for specific things I work on and try to only enable one or two skills per coding session.
To the author of the linked article nice job, and if you feel like adding to it, please add details on your setup.
I feel like the amount of context bloat that OpenCode puts these small models into the dumb zone too quickly. The system prompt alone is 9k tokens, and when you add your own setup it can easily creep up to 15k.
I've seen sites here and there but they feel like quick little toys that don't get updated, so they always suggest old models.
I've been using the full GLM 5.2 model this way (through opencode) at work for the past week. It's quite impressive.
https://pi-local-coding-bench.dev
The benchmark seemed fine until I saw that.
If you use sub agents, they will overwrite the cache and each request will trigger full reprocessing. Have fun with that as it will crash the t/s metrics on each prefill on top of the max 64k including input + output is a major blocker.
If you push the context higher and add parallel slots the requirements will be far higher and the numbers less shiny.
I ran those throu opus saking if it was good advice and was not impressed:
I read the actual qr_scanner.ino. Short answer: partially, but I'd push back on most of it. That review reads like generic ESP boilerplate advice written against an imagined version of your code — several of its "fixes" are already in your file, and its headline "critical" claim misreads what the code does. Going point by point:...
Ok that's the part I'm interested in, don't care about minesweeper clones....
> Make a landing page selling candles for women that are into wellbeing and SPA.
can't be serious...
It basically exploits the face that time can be traded for intelligence with local models
https://github.com/day50-dev/Petsitter
For anything else local, including writing some automation scripts and such, it works great.
Source: https://chatgpt.com/share/6a42dd8a-4e28-83e8-9ef7-6ba56d665c...
If you want to play a hyperbolic minesweeper, Hyperrogue features that https://hyperrogue.fandom.com/wiki/Minefield
Even llama.cpp's bundled web UI handles it fine. Dead simple.
Neither is going to return much knowledge. Basically just relevant url so you need a second tool to grab them and there bot walls get tricky
For me it’s the first local model that actually makes sense as a general intelligence.
> https://sleepingrobots.com/dreams/stop-using-ollama/
I had faced roadblocks while integrating with openclaw using ollama (Was trying to experiment with `qwen3-vl:2b`). I was tracking the issue back to openclaw at that time, I didn't even consider investigating ollama.
I attached a threads post here where I'm talking to meta ai to expand on both scenarios (not to use ollama, but llama.cpp & my take on the why this is the way it is - ie. commercial gains)
https://www.threads.com/@riojos/post/DaMXIs4k4w8
Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?
TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.
If I can generate voice at the same time as video, that would be useful.
The neural cores aren't suitable for LLMs/transformers and isn't used in LLM inference. On the M5 and later chips, it comes with neural accelerators, aka Tensor Cores, which speed up the 'prefill' (i.e. processing your context window) part, but don't do anything for inference.
The MLX vs GGUF debate is mostly irrelevant. The GGUF pathways are optimised for apple silicon to the extent of practically identical performance to MLX. MLX is just one way of using Apple GPUs, it comes with many optimisations in the box, but they're not hard and they're no longer MLX-exclusive.
I haven't seen anyone make an argument they are as good as SotA (OpenAI, Anthropic). It's just they are approaching state where they are "as good" for some _limited_ set of use cases. Which will allow us to resolve 2 primary issues with these SotA models: privacy and vendor lock-in. Plus, they're very useful for education purposes, you get to explore what things looks like under the hood, play with various models, tools, maybe put something simple together yourself.
You get Macbook - great. You got gaming rig with a decent GPU - great (set it up as a dedicated server that you connect to through simple REST).
What exactly is wrong with any of that?
Consider that there are literally trillions of dollars being wagered on this not being the future state of computing. Not even speculating that HN is being astroturfed (though I see no reason it wouldn't be by interested parties), but many of the US tech employees here have direct financial incentives in various forms to be rooting for the failure of open source and optionally local models.
This part should have featured something about real work. But instead it features a paragraph about one-shot bs that creates "something".
Unless your work is to create thousands wordpress tremplates to sell - this is not a "real work".
Give it a repository (any kind of OSS project will do for an example) and a github issue requesting a knew feature or describing a confirmed bug. (you can and probably should write a prompt for LLM shough, don't just provide the issue itself)
And then whatch it go.
And then judge the result and it's quality.
Sorry, but from my experience 27B is just useless. You do get a result and some times it does work, but most of the times it is not event on junior dev level. And it takes it a lot of time to do the thing, unless you have an extremely expensive machine.
If your expectation is to treat it as a tool, then you're wrong.
I guess that's where the disconnect lies.
I already have tools for autocomplete, working with structured data and many more. Deterministic tools.
Obviously you do not expect something like that from a model with some harness. It can read some input (user's or other tools) and give you some output.
My expectation is that this tool, given some meaning full input (instructions, expectations, motivations and an optional source files to work with), will produce something that will actually be aligned with the input.
For example: consider I have a services that has some sort of events created now and then. I what those events to be available for other services. So I decide it to have a transactional outbox and an observer that will pull events from the outbox and put them into a kafka topic.
My expectation is that I can give this tool some context (source code and description), state my instructions, expectations, motivations, design decisions and have an implementation as a result.
My other expectation is that given my context etc and agent's context (skills etc) were correct and adequate - the outout will also be correct and adequate.
I do have access for a 64 gb ram mac mini but most people don't.
tweaking sampler might help
That will get you a near-frontier experience. DSv4 Flash launched in April with capabilities on par with GLM 5.0, which launched in February.
It's a surprising example of the recency bias to me to assume anything other than the market returning to its historic norm, even if the AI buildout doesn't slow, producers will scale factories to meet that demand.
So it will be no surprise that there will be a time where everyone will be able to run a local model, say GLM 5.2 locally on their machine. Like it or not.
I'm running the NVFP4 alongside Gemma4 at the same quant on an OEM Spark
also i like that if i drop more sophisticated tools into my harness (e.g. any of the NLP/RAG-based search tools in place of grep/rg), the agent will actually reach for them and make progress faster; previous models have been reluctant to embrace new tools.
Lora if effective could be a great reason to run local models.
Qwen on the other hand got straight to work with astonishing competency on the same system.
From what I read llama3 needs beefier compute to reliably invoke tools, which I presume relates to it focussing more on simulating AGI rather than being a useful tool.
https://arena.ai/leaderboard/code/webdev/pareto?license=open...
https://arena.ai/leaderboard/text/pareto?license=open-source
200k @ K : Q5_0 V: 4_1 (which is a bit of a sweet spot)