
128 experts at 17B active parameters. This is going to be fun to play with!


Does the entire model have to be loaded in VRAM? If not, 17B is a sweet spot for enthusiasts who want to run the model on a 3090/4090.


Yes. MoE models typically use a different set of experts at each token. So while the compute is similar to that of a dense model the size of the "active" parameters, the VRAM requirement covers all experts and is much larger. You could technically run inference while swapping experts in and out of VRAM, but the latency would be pretty horrendous.
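
A toy illustration of why: with top-k gating, the router scores all experts for each token and activates only k of them, but which k differs from token to token, so every expert must stay resident to serve arbitrary inputs. A minimal sketch in Python/NumPy (all sizes hypothetical, not this model's actual config or router):

    # Toy top-k MoE routing: each token activates its own subset of experts.
    import numpy as np

    rng = np.random.default_rng(0)
    n_experts, top_k, d_model = 8, 2, 16  # toy sizes (the model discussed has 128 experts)

    router_w = rng.standard_normal((d_model, n_experts))
    experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

    for t, x in enumerate(rng.standard_normal((4, d_model))):
        logits = x @ router_w                 # per-token routing scores
        chosen = np.argsort(logits)[-top_k:]  # top-k experts for THIS token
        gate = np.exp(logits[chosen])
        gate /= gate.sum()                    # softmax over the chosen experts
        y = sum(g * (x @ experts[e]) for g, e in zip(gate, chosen))
        print(f"token {t}: routed to experts {sorted(chosen.tolist())}")

Running it, different tokens land on different expert subsets, which is exactly why only the compute scales with the active parameters while the memory scales with all of them.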


I think prompt processing also needs all the weights, since every prompt token routes to its own set of experts.


Oh, for perf reasons you'll want it all in VRAM or unified memory (rough math sketched below). This isn't a great local model for 99% of people.

I’m more interested in playing around with quality given the fairly unique “breadth” play.

And servers running this should be very fast and cheap, since only 17B parameters are active per token.
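
Back-of-envelope on the VRAM point above: since all experts have to stay resident, the weights footprint tracks the total parameter count, not the 17B active. A sketch (the total sizes below are assumed for illustration; the thread doesn't state the real total):

    # Weights-only memory estimate; ignores KV cache and activations.
    # Total parameter counts below are hypothetical, not this model's.
    def weights_gb(total_params_billion: float, bits_per_param: int) -> float:
        return total_params_billion * bits_per_param / 8  # GB, at 1e9 bytes/GB

    for total_b in (100, 400):  # assumed totals, in billions
        for bits, label in ((16, "fp16"), (4, "4-bit quant")):
            print(f"{total_b}B total @ {label}: ~{weights_gb(total_b, bits):.0f} GB")

Even aggressively quantized, an MoE at these scales lands well beyond a single 24 GB consumer GPU, which is the point about it not being a great local model.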



