NVIDIA · gpu

Verdict · buy

RTX 5090 for local LLM inference: the new high-water mark

32 GB VRAM, Blackwell sm_120, and enough bandwidth to run 70B quants locally without the usual ritual of swapping layers. Worth the jump from a 4090 if you live in llama.cpp.

Product
NVIDIA GeForce RTX 5090
Published
2026-04-24
Price
$1,999
Score
9 / 10

Pros

  • 32 GB VRAM clears 70B-class models in Q4 quant with headroom to spare
  • Bandwidth uplift is real on long-context generation, not just matmul benchmarks
  • sm_120 compute capability unlocks newer CUDA paths still bottlenecked on Ada

Cons

  • Needs fresh CUDA 13.x + cuDNN 9 toolchain; old pipelines will complain until you update
  • xformers on current PyPI still force-downgrades Torch — SageAttention is the workaround
  • Power draw is serious; plan the PSU and case airflow before the PO
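The bandwidth pro above can be sanity-checked with a back-of-envelope decode roofline. All figures here are assumptions for illustration, not measurements: spec-sheet bandwidths (1008 GB/s for the 4090, 1792 GB/s for the 5090) and a quant assumed to fit fully resident at ~30 GB. Single-token decode has to stream every resident weight from VRAM, so bandwidth divided by model size bounds tokens per second.

```python
# Decode roofline: generating one token streams every resident weight from
# VRAM, so memory bandwidth / model bytes is an upper bound on tokens/sec.
# All numbers below are illustrative assumptions, not measurements.

def decode_roof_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on single-stream decode speed (ignores KV-cache traffic)."""
    return bandwidth_gb_s / model_gb

MODEL_GB = 30.0  # assumed size of a quant that sits fully resident in 32 GB

for name, bw_gb_s in [("RTX 4090", 1008.0), ("RTX 5090", 1792.0)]:
    print(f"{name}: <= {decode_roof_tok_s(bw_gb_s, MODEL_GB):.0f} tok/s")
```

The spec-sheet bandwidth ratio (~1.8x) is the honest ceiling on single-stream decode speedup, which is consistent with the review's point that short-prompt throughput gains are smaller than the headline numbers suggest.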
What we tested

A 5090 in a single-GPU workstation running local inference on a mix of llama.cpp, vLLM, and ComfyUI. Workloads picked to reflect what a practitioner actually does, not what a leaderboard cares about:

  • 70B Q4 quant on llama.cpp with 32k context, streaming.
  • SDXL + Flux image generation through ComfyUI with a LoRA stack.
  • Wan 2.2 video generation via Wan2GP — the kind of workload that stresses both VRAM and sustained compute.

What you'll feel

The honest shift versus a 4090 is headroom. 24 GB on the 4090 means you spend mental cycles deciding what quantization tier to accept, or juggling layer offload. 32 GB on the 5090 means you run the thing and move on.

For pure throughput on short-prompt generation, the difference is smaller than the spec sheet suggests — both cards are memory-bandwidth-bound on a lot of real workloads. Where the 5090 earns its premium is when the context window gets long, the batch gets fat, or the model size starts scraping the ceiling.

Setup notes (if you're upgrading)

  • CUDA 13.2 + cuDNN 9.20 is the current known-good combo. Don't mix with a CUDA 12 install; dependency resolution gets ugly.
  • Skip xformers. Install SageAttention 2.2.0 instead — it doesn't force a torch downgrade and the perf is within noise.
  • cu128 or cu130 PyTorch builds are mandatory for sm_120.

Who should buy

  • Anyone running 70B-class models locally for real work.
  • Image/video generation shops; the bandwidth is noticeable.
  • Fine-tune workstations that previously had to dance around VRAM.

Who should skip

  • 24 GB was already enough for you. Don't upgrade out of envy.
  • You run everything in the cloud. A 5090 is a workstation buy, not a datacenter play.

Bottom line

If you're buying a new workstation for local AI work in 2026, this is the default. The price is the price; the 32 GB removes an entire class of compromise.
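Where long context bites is the KV cache, not just the weights. Here is a minimal sizing sketch for the 32k-context run, assuming Llama-3-70B-style dimensions (80 layers, 8 KV heads under GQA, head dim 128, fp16 cache entries), all of which are illustrative assumptions rather than measured values:

```python
# Rough KV-cache sizing for a 32k-context run. Architecture dimensions
# below assume a Llama-3-70B-style model (80 layers, 8 KV heads via GQA,
# head dim 128) with fp16 cache entries; these are assumptions for
# illustration, not measurements of any specific checkpoint.

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_len: int, bytes_per_elt: int = 2) -> int:
    """Total K and V tensor bytes across all layers for one sequence."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elt

gib = kv_cache_bytes(80, 8, 128, 32 * 1024) / 2**30
print(f"KV cache at 32k context: {gib:.1f} GiB")  # prints 10.0 GiB
```

Roughly 10 GiB of cache on top of the weights is why a 24 GB card forces quantized KV caches, layer offload, or shorter contexts at this model class, and why the extra 8 GB on the 5090 changes what fits.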