
Nvidia GPU Generations -- An AI Inference Perspective

A reference for understanding Nvidia’s consumer GPU lineup through the lens of local LLM inference rather than gaming performance. The two use cases share hardware but reward different specifications.

This document serves two distinct purposes and it is worth being clear about both upfront.

For readers choosing hardware: each generation section identifies the best-in-class card per VRAM tier — the card that delivers the most inference performance for that memory capacity. These are acquisition targets. If you are buying, this is what you should be aiming for.

For LocoLLM and Colmena: the benchmark lab deliberately inverts this. Colmena runs the worst-in-class card per VRAM tier — the floor, not the ceiling. This is a constraint-based research discipline. By benchmarking the slowest representative of each tier, results are honest baselines that readers with better hardware can only improve on. More importantly, it keeps the research focused on optimisation rather than hardware. If a technique only works on a top-tier card, it is not a useful technique for most people. If it works on the floor card, it works everywhere in that tier.

The constraint is intentional. “Just buy a better card” is not a solution. The goal is to find out what is actually achievable within a given VRAM budget, using whatever card represents the worst realistic starting point. That discipline surfaces optimisations that comfortable hardware conceals.


Before the tables, the metrics worth understanding:

Memory bandwidth — how fast weights move from VRAM to compute units. This is the primary bottleneck for LLM inference. More bandwidth = more tokens per second.

VRAM — how large a model you can load. The hard ceiling. A model that doesn’t fit in VRAM doesn’t run (or runs agonisingly slowly via system RAM offload).

Tensor Cores — dedicated hardware for mixed-precision matrix operations. Enable QLoRA fine-tuning and accelerate inference on supported frameworks. Arrived with Turing (RTX 2000 series). Absent from GTX cards including the GTX 1600 series.

CUDA Compute Capability — the version floor that frameworks require. Ollama’s minimum is Compute 5.0 (Maxwell). Modern tools like bitsandbytes, Unsloth, and current quantisation kernels require 6.0+ (Pascal). Cards below Compute 5.0 are unsupported by any current inference framework. Maxwell (5.x) works for basic Ollama inference but is excluded from advanced tooling.

CUDA Core count — largely irrelevant for inference. Matters for gaming and rendering. Don’t optimise for this.
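The first two metrics combine into useful back-of-envelope arithmetic: a model’s quantised size determines whether it fits, and bandwidth divided by that size caps decode speed. A minimal sketch — the 4.5 bits/weight average and 1.5 GB overhead figure are illustrative assumptions, not measurements:

```python
def model_gb(params_b: float, bits_per_weight: float = 4.5) -> float:
    """Approximate quantised size: Q4 variants average ~4.5 bits/weight
    once scales and higher-precision layers are counted (assumption)."""
    return params_b * bits_per_weight / 8

def fits(vram_gb: float, params_b: float, overhead_gb: float = 1.5) -> bool:
    """Weights plus KV cache and runtime overhead must fit in VRAM.
    The 1.5 GB overhead is an illustrative figure."""
    return model_gb(params_b) + overhead_gb <= vram_gb

def max_tokens_per_sec(bandwidth_gb_s: float, params_b: float) -> float:
    """Every weight is read roughly once per generated token, so decode
    speed is capped at bandwidth / model size. Real throughput is lower
    (KV cache reads, kernel overhead)."""
    return bandwidth_gb_s / model_gb(params_b)

print(fits(8, 8), round(max_tokens_per_sec(448, 8)))  # 8B Q4 on a 448 GB/s card
print(fits(8, 13))                                    # 13B Q4 needs the 12 GB tier
```

This is why the tables below lead with bandwidth rather than CUDA cores: for a model that fits, bandwidth is the only number in the denominator.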


Pre-Maxwell (600/700 Kepler Series) — Not Supported


Kepler (600 series and most 700 series) sits at Compute 3.x. No current inference framework supports it. These cards will not run Ollama, bitsandbytes, or any modern AI tooling. There is no workaround. If you own one, it is not useful for inference.

Exception within the 700 series: The GTX 750 and GTX 750 Ti are Maxwell architecture (Compute 5.0) despite carrying a 700 series number. They are covered in the Maxwell section below.
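The support floors can be expressed as a quick lookup. The architecture-to-capability mapping below follows the figures in this document (representative values — individual cards within a family can differ); it is a sketch, not a driver query:

```python
# Representative compute capability per consumer architecture (per this doc).
COMPUTE_CAP = {
    "Kepler": 3.5, "Maxwell": 5.0, "Pascal": 6.1,
    "Turing": 7.5, "Ampere": 8.6, "Ada": 8.9,
}

# Minimum compute capability each tool requires (per this doc).
FRAMEWORK_FLOOR = {
    "ollama": 5.0,        # basic inference floor
    "bitsandbytes": 6.0,  # modern quantisation kernels
}

def supported(arch: str, framework: str) -> bool:
    return COMPUTE_CAP[arch] >= FRAMEWORK_FLOOR[framework]

print(supported("Maxwell", "ollama"))        # Ollama runs on Maxwell
print(supported("Maxwell", "bitsandbytes"))  # but advanced tooling excludes it
print(supported("Kepler", "ollama"))         # Kepler: no current framework
```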


GTX 700 Maxwell / GTX 900 Series — Maxwell (2014—2016)


The true compute floor for Ollama is Compute 5.0, which means Maxwell qualifies. No Tensor Cores, GDDR5 memory throughout, and bitsandbytes/Unsloth will not run here — but Ollama does, and that makes these cards legitimate LocoBench data points for the bottom of each VRAM tier.

The GTX 750 Ti is technically a 700 series card but Maxwell architecture, making it the single lowest-compute card that Ollama will use. The 900 series is fully Maxwell.

Note on the GTX 970: Avoid for inference. The advertised 4 GB is actually 3.5 GB of full-speed VRAM plus 0.5 GB of slow VRAM accessed via a narrower bus. Models that exceed 3.5 GB will suffer a severe performance cliff. It is not a true 4 GB card.

| Card | VRAM | Memory BW | Compute | AI Notes |
| --- | --- | --- | --- | --- |
| GTX 750 Ti | 2 GB GDDR5 | 86 GB/s | 5.0 | True floor. Minimum compute Ollama accepts. TinyLlama only. |
| GTX 950 | 2 GB GDDR5 | 105 GB/s | 5.2 | 2 GB ceiling — TinyLlama, Phi-2 territory only. |
| GTX 960 2 GB | 2 GB GDDR5 | 112 GB/s | 5.2 | Same VRAM ceiling as 950, slightly more bandwidth. |
| GTX 960 4 GB | 4 GB GDDR5 | 112 GB/s | 5.2 | Floor of 4 GB tier in this generation. Low bandwidth. |
| GTX 970 | 3.5 GB GDDR5 | 196 GB/s | 5.2 | Avoid — memory architecture defect. Not a true 4 GB card. |
| GTX 980 | 4 GB GDDR5 | 224 GB/s | 5.2 | Better 4 GB option than 960. Twice the bandwidth. |
| GTX 980 Ti | 6 GB GDDR5 | 336 GB/s | 5.2 | Only 6 GB card in this generation. Reasonable bandwidth. |
| GTX Titan X | 12 GB GDDR5 | 336 GB/s | 5.2 | 12 GB VRAM is the headline. Bandwidth modest for the tier. |

AI sweet spot in this generation: There isn’t one for acquisition purposes. These cards are useful as LocoBench floor representatives — particularly the GTX 960 4 GB (floor of 4 GB Maxwell), GTX 980 Ti (only 6 GB option), and GTX Titan X (floor of 12 GB tier predating Pascal). The Titan X’s 12 GB at $150—200 AUD secondhand is marginal value given Pascal alternatives, but as a benchmark data point for the absolute bandwidth floor of the 12 GB tier it has research utility.

Colmena relevance: Maxwell cards are worth acquiring cheaply as tier floor representatives. None should be prioritised over Pascal or later equivalents. If a GTX 960 4 GB or GTX 980 Ti appears for under $30—40 AUD, it rounds out the Maxwell tier data. The Titan X at current secondhand prices ($150—200 AUD) does not represent good value when a Pascal 1080 Ti offers more bandwidth for a similar outlay.


GTX 1000 Series — Pascal (2016—2017)

Compute capability 6.x. No Tensor Cores. Still functional for inference but heading toward framework deprecation. The floor for anything worth running today.

| Card | VRAM | Memory BW | AI Notes |
| --- | --- | --- | --- |
| GTX 1050 Ti | 4 GB GDDR5 | 112 GB/s | Minimum viable. Runs Q4 3B-4B models. Reference floor. |
| GTX 1060 6 GB | 6 GB GDDR5 | 192 GB/s | Useful step up from 4 GB. Fits most Q4 7B models. |
| GTX 1070 | 8 GB GDDR5 | 256 GB/s | 8 GB headroom but low bandwidth hurts token speed. |
| GTX 1080 | 8 GB GDDR5X | 320 GB/s | Better bandwidth, still no Tensor Cores. |
| GTX 1080 Ti | 11 GB GDDR5X | 484 GB/s | Interesting: 11 GB and high bandwidth. Aging fast. |
| Titan X Pascal | 12 GB GDDR5X | 480 GB/s | 12 GB VRAM attractive but Pascal lifespan limited. |

AI sweet spot in this generation: GTX 1060 6 GB for minimum viable 7B inference. GTX 1080 Ti if you find one cheaply — the bandwidth still holds up. Everything else is either too little VRAM or too little bandwidth to be worth acquiring now.


GTX 1600 Series — Turing Without Tensor Cores (2019)


Turing architecture but deliberately stripped of RT Cores and Tensor Cores to protect RTX pricing. Compute capability 7.5. Better efficiency than Pascal, GDDR6 memory on top models.

| Card | VRAM | Memory BW | AI Notes |
| --- | --- | --- | --- |
| GTX 1650 | 4 GB GDDR5/6 | 128/192 GB/s | Budget floor. 75 W, no external power. |
| GTX 1650 Super | 4 GB GDDR6 | 192 GB/s | Better than base 1650, same VRAM ceiling. |
| GTX 1660 | 6 GB GDDR5 | 192 GB/s | Fine, but 1660 Super/Ti are better. |
| GTX 1660 Super | 6 GB GDDR6 | 336 GB/s | Big bandwidth jump over base 1660. Good value. |
| GTX 1660 Ti | 6 GB GDDR6 | 288 GB/s | Top of GTX line. Turing efficiency, no Tensor Cores. |

AI sweet spot in this generation: GTX 1660 Super — same VRAM as the Ti but higher bandwidth. The GTX 1660 Ti is the better-known card but the Super edges it on the metric that matters. No Tensor Cores means inference only, no QLoRA training.

Note on the 1650: Low power draw and no external power connector make it practical in office machines (Optiplex, ThinkCentre). That’s its value, not inference performance.


RTX 2000 Series — Turing With Tensor Cores (2018—2019)


The RTX brand debut. First generation Tensor Cores enable mixed-precision training and accelerated inference. Compute capability 7.5. The first generation worth considering for QLoRA fine-tuning.

| Card | VRAM | Memory BW | AI Notes |
| --- | --- | --- | --- |
| RTX 2060 | 6 GB GDDR6 | 336 GB/s | Tensor Cores but 6 GB is tight for 7B models. |
| RTX 2060 Super | 8 GB GDDR6 | 448 GB/s | The sleeper. 8 GB + 448 GB/s + Tensor Cores. Excellent value secondhand. |
| RTX 2070 | 8 GB GDDR6 | 448 GB/s | Identical to 2060 Super on inference metrics. |
| RTX 2070 Super | 8 GB GDDR6 | 448 GB/s | Same again. CUDA cores irrelevant for inference. |
| RTX 2080 | 8 GB GDDR6 | 448 GB/s | Minimal gain over 2060 Super for inference. |
| RTX 2080 Super | 8 GB GDDR6 | 496 GB/s | Only meaningful step up in this tier. |
| RTX 2080 Ti | 11 GB GDDR6 | 616 GB/s | Good bandwidth, 11 GB useful. Premium pricing. |
| Titan RTX | 24 GB GDDR6 | 672 GB/s | 24 GB VRAM, high bandwidth. Rare and expensive. |

AI sweet spot in this generation: RTX 2060 Super. Punches well above its name and price. Matches the 2070 Super and 2080 on inference-relevant specs while trading on a lower reputation. The 2080 Super is the only card in this range that meaningfully improves on it — and the price gap rarely justifies the modest bandwidth gain.


RTX 3000 Series — Ampere (2020—2022)

Second generation Tensor Cores, significant efficiency gains, DLSS 2.0. Compute capability 8.6. The generation where VRAM choices became fragmented — Nvidia made some unusual decisions.

| Card | VRAM | Memory BW | AI Notes |
| --- | --- | --- | --- |
| RTX 3050 | 8 GB GDDR6 | 224 GB/s | Low bandwidth hurts. Not recommended. |
| RTX 3060 | 12 GB GDDR6 | 360 GB/s | Key card: 12 GB unlocks a larger model tier. Bandwidth lower than 2060 Super. |
| RTX 3060 Ti | 8 GB GDDR6 | 448 GB/s | Faster than 3060 for 8 GB models, less VRAM headroom. |
| RTX 3070 | 8 GB GDDR6 | 448 GB/s | Same bandwidth as 2060 Super, newer architecture. |
| RTX 3070 Ti | 8 GB GDDR6X | 608 GB/s | Meaningful bandwidth step. Still 8 GB ceiling. |
| RTX 3080 10 GB | 10 GB GDDR6X | 760 GB/s | Large bandwidth jump. 10 GB is an awkward VRAM amount. |
| RTX 3080 12 GB | 12 GB GDDR6X | 912 GB/s | Better: 12 GB + near 1 TB/s bandwidth. Strong card. |
| RTX 3090 | 24 GB GDDR6X | 936 GB/s | Prosumer ceiling. 24 GB enables 13B+ models at speed. |
| RTX 3090 Ti | 24 GB GDDR6X | 1008 GB/s | Marginal gain over 3090 at significant cost premium. |

AI sweet spots in this generation:

RTX 3060 — the cheapest path to 12 GB VRAM. Counterintuitively slower than the 2060 Super for models that fit in 8 GB due to lower bandwidth. Its value is entirely in the VRAM headroom.
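That crossover is mechanical under the bandwidth-bound approximation: among the cards a model fits on, the one with more bandwidth wins. A hypothetical sketch (the 1.5 GB overhead figure is an assumption):

```python
CARDS = {"RTX 2060 Super": (8, 448), "RTX 3060": (12, 360)}  # (VRAM GB, GB/s)

def better_card(model_gb: float, overhead_gb: float = 1.5) -> str:
    """Pick the highest-bandwidth card among those the model fits on.
    The overhead figure covers KV cache and runtime (assumption)."""
    candidates = [(bw, name) for name, (vram, bw) in CARDS.items()
                  if model_gb + overhead_gb <= vram]
    if not candidates:
        return "doesn't fit either card"
    return max(candidates)[1]

print(better_card(4.7))  # ~8B Q4: fits both, so the 2060 Super's 448 GB/s wins
print(better_card(7.9))  # ~13B Q4: only the 3060's 12 GB holds it
```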

RTX 3080 12 GB — if budget allows. Near 1 TB/s bandwidth plus 12 GB VRAM is a genuinely strong combination.

RTX 3090 — the consumer ceiling worth waiting for. 24 GB VRAM handles 13B models comfortably. Secondhand pricing remains elevated by demand from AI workloads, though it should ease as 40-series supply filters into the used market.

A note on scope: For a lab focused on small models — 3B to 13B quantised — 12 GB is the practical ceiling where new VRAM headroom stops unlocking meaningfully different use cases. The 3060 is the workhorse tier. The 3090 is documented for completeness and to answer the question of whether results scale predictably to larger VRAM, not because 24 GB is necessary for the lab’s primary purpose. Everything above 12 GB in this document is reference material, not a recommendation.


RTX 4000 and 5000 Series — Reference Only


The sections below cover Ada Lovelace (40-series) and Blackwell (50-series). They are included as a reference for bandwidth progression and VRAM tiers, not as acquisition targets for a lab built on opportunistic secondhand purchasing.

Two things make these generations practically out of scope. First, pricing: 40-series secondhand values remain elevated, and the 50-series has no meaningful secondhand market at all. Second, and more fundamentally, Nvidia has changed its approach to consumer memory. High-bandwidth GDDR6X and GDDR7 are increasingly allocated to datacenter and professional products, while consumer cards — particularly in the 60-class — receive narrower memory buses and lower bandwidth than their predecessors. The 4060 Ti is slower for inference than a 2060 Super from 2019. The 5060 Ti 16 GB launched, sold out, and was quietly designated end of life by major AIB partners within months. This is not an accident; it reflects where Nvidia’s memory supply priorities sit.

The 40 and 50 series tables are here so readers with newer hardware can locate themselves in the bandwidth progression and extrapolate from Colmena’s floor-card results. They are not recommendations.


RTX 4000 Series — Ada Lovelace (2022—2024)


Third generation Tensor Cores, DLSS 3, AV1 encode. Compute capability 8.9. The generation where Nvidia made memory bus decisions that directly penalise LLM inference on the lower-end cards — and introduced 16 GB as a meaningful new VRAM tier at the high end.

| Card | VRAM | Memory BW | AI Notes |
| --- | --- | --- | --- |
| RTX 4060 | 8 GB GDDR6 | 272 GB/s | Lower bandwidth than a 2060 Super. Not recommended for inference. |
| RTX 4060 Ti 8 GB | 8 GB GDDR6 | 288 GB/s | Same bus problem. Skip unless price is exceptional. |
| RTX 4060 Ti 16 GB | 16 GB GDDR6 | 288 GB/s | The 16 GB VRAM is tempting; the 288 GB/s is not. Slowest path to 16 GB. |
| RTX 4070 | 12 GB GDDR6X | 504 GB/s | Returns to reasonable bandwidth. Solid 12 GB card. |
| RTX 4070 Super | 12 GB GDDR6X | 504 GB/s | Same bandwidth as 4070, slightly more CUDA cores — irrelevant for inference. |
| RTX 4070 Ti | 12 GB GDDR6X | 504 GB/s | No bandwidth gain over 4070. Premium for gaming features that don’t help here. |
| RTX 4070 Ti Super | 16 GB GDDR6X | 672 GB/s | First 16 GB card with meaningful bandwidth. Good inference card at the right price. |
| RTX 4080 | 16 GB GDDR6X | 717 GB/s | Strong 16 GB card. A step up from the 4070 Ti Super’s 672 GB/s. |
| RTX 4080 Super | 16 GB GDDR6X | 736 GB/s | Modest step over the 4080. Same tier, modest premium. |
| RTX 4090 | 24 GB GDDR6X | 1008 GB/s | Ada ceiling. 24 GB VRAM, 1 TB/s+ bandwidth. Powerful and expensive accordingly. |

AI sweet spots in this generation:

RTX 4070 Super — the lowest Ada card where the bandwidth story holds up. Its 504 GB/s sits between the 3070 (448 GB/s) and 3070 Ti (608 GB/s) while bringing Ada efficiency and a longer framework support horizon.

RTX 4070 Ti Super — opens the 16 GB tier without the bandwidth penalty of the 4060 Ti 16 GB. If 16 GB is the target and budget is flexible, this is the floor to look for.

RTX 4080 / 4080 Super — the practical 16 GB high watermark for Ada. Not meaningfully different from each other; buy whichever is cheaper.

What to avoid: The 4060 and 4060 Ti in any configuration. The 128-bit memory bus makes them slower for inference than cards released years earlier. The 4060 Ti 16 GB is particularly frustrating — the VRAM is there, the bandwidth is not.


RTX 5000 Series — Blackwell (2025)

Fourth generation RT Cores, fifth generation Tensor Cores. GDDR7 memory across the range — the first consumer generation where the memory technology itself is a step change. Compute capability 12.0. FP4 precision support is new to this generation, with potential benefits for quantised inference as frameworks mature to use it.

The 50-series has no meaningful secondhand market at time of writing. This section is included as a reference for the bandwidth ceiling and VRAM tiers the generation introduces, and for future reference as pricing eventually normalises.

| Card | VRAM | Memory BW | AI Notes |
| --- | --- | --- | --- |
| RTX 5070 | 12 GB GDDR7 | 672 GB/s | Same VRAM as the 4070, substantially more bandwidth. Matches the 4070 Ti Super. |
| RTX 5070 Ti | 16 GB GDDR7 | 896 GB/s | Strong 16 GB card. Approaches 4090-class bandwidth for token generation. |
| RTX 5080 | 16 GB GDDR7 | 960 GB/s | 16 GB at near 1 TB/s. Within 5% of the 4090’s bandwidth. |
| RTX 5090 | 32 GB GDDR7 | 1792 GB/s | First consumer 32 GB card. 1.79 TB/s — 78% more than the 4090. |

AI notes on this generation:

The bandwidth gains from GDDR7 are real and inference-relevant. The 5090 leads every consumer card on token generation throughput by a meaningful margin. The 5080 at 960 GB/s sits just below the 4090’s 1008 GB/s — near enough to be comparable, though it trades the 4090’s 24 GB for 16 GB.

The 5070’s 12 GB VRAM is a missed opportunity. Nvidia held VRAM constant while doubling bandwidth — a deliberate segmentation decision. For 12 GB workloads it is fast; for 13B+ models it still can’t load them.

The 5090’s 32 GB tier is genuinely new. No previous consumer card has reached it. It opens 30B-class models in quantised form and enables larger context windows on 13B models that previously fit but ran out of KV-cache headroom.
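The KV-cache headroom point can be quantified: the cache stores a key and value vector per layer per token. A sketch using a Llama-2-13B-like shape (40 layers, 40 heads of dimension 128 — illustrative assumptions, and models with grouped-query attention need far less):

```python
def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_len: int, bytes_per_elem: int = 2) -> float:
    """KV cache = 2 (K and V) x layers x kv-heads x head-dim x context
    tokens x bytes per element (fp16 = 2, unless the runtime quantises
    the cache)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem / 1e9

# Llama-2-13B-like shape: 40 layers, 40 KV heads of dim 128 (assumption).
print(round(kv_cache_gb(40, 40, 128, 4096), 2))   # ~3.36 GB at 4k context
print(round(kv_cache_gb(40, 40, 128, 16384), 2))  # ~13.42 GB at 16k context
```

Cache size scales linearly with context length, which is why a model that fits at short context can exhaust a 24 GB card at long context — and why 32 GB changes what is practical.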

FP4 inference support lands in this generation but framework maturity lags. As bitsandbytes, llama.cpp, and Unsloth add FP4 kernels, 50-series cards will see throughput gains that 40-series cannot match architecturally. This is the long-term reason to prefer Blackwell, ahead of any near-term secondhand market.

The 5060 Ti 16 GB situation: This card launched in April 2025 at around AUD $750-850, with GDDR7 on a 128-bit bus yielding approximately 448 GB/s — the same as a 2060 Super, but with GDDR7 efficiency gains. It sold quickly and was designated end of life by major AIB partners within months, with Nvidia apparently halting GPU supply to that SKU. Production has shifted to 8 GB variants. The 16 GB version may reappear secondhand but treat any remaining new stock with caution regarding long-term availability of drivers and support. It is a better 16 GB option than the 4060 Ti 16 GB on bandwidth, but the EOL status complicates it as a Colmena acquisition.


Cross-Generation Observations

A few findings in this data contradict the obvious assumption that newer = better for AI:

The 3060 is slower than the 2060 Super for any model that fits in 8 GB. 360 GB/s vs 448 GB/s. The 3060’s only advantage is VRAM capacity.

The 2060 Super matches the 2070, 2070 Super, and 2080 on every inference-relevant metric. Three tiers of naming and pricing separate them; inference performance does not.

The 1080 Ti’s bandwidth (484 GB/s) still competes with cards released years later. Pascal architecture limits its future, but the raw numbers hold up.

CUDA core count predicts nothing about inference speed. The 2080’s 2944 cores versus the 2060 Super’s 2176 produce identical inference throughput on identical bandwidth.

The 4060 Ti 16 GB is slower than the 2060 Super despite being four years newer and having twice the VRAM. 288 GB/s vs 448 GB/s. The VRAM is real; the inference performance is not. It is the starkest example of Nvidia’s memory bus decisions working against this use case.

Nvidia is increasingly treating high-bandwidth memory as a datacenter resource. The pattern across 40 and 50 series consumer cards is narrower buses, lower bandwidth, and deliberate segmentation of VRAM. The 5060 Ti 16 GB being quietly discontinued months after launch is consistent with this. For local inference, the secondhand 30-series market remains better value per GB/s than new consumer hardware in the 60-class.
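The value-per-GB/s comparison can be made explicit. The prices below are hypothetical placeholders, not market data — substitute current listings before drawing conclusions:

```python
# (bandwidth GB/s, hypothetical secondhand price AUD) - placeholder prices.
listings = {
    "RTX 2060 Super":    (448, 250),
    "RTX 3060":          (360, 300),
    "RTX 4060 Ti 16 GB": (288, 700),
}

# Rank by dollars per GB/s of memory bandwidth - lower is better value
# for inference on models that fit.
for card, (bw, price) in sorted(listings.items(),
                                key=lambda kv: kv[1][1] / kv[1][0]):
    print(f"{card}: ${price / bw:.2f} per GB/s")
```

The caveat from the cross-generation observations still applies: $/GB/s only ranks cards within a VRAM tier, since a cheap fast card is worthless for a model it cannot load.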


The VRAM Quality Ladder

Most inference guides recommend a “4 GB minimum” as received wisdom. Colmena will produce the evidence behind that claim by running a fixed evaluation set across every tier in the stack.

The same 15 prompts — spanning reasoning, factual recall, instruction following, and multi-step tasks — will be run at every VRAM tier using the best-fit model for that tier at Q4 quantisation. Results will be scored for coherence and accuracy and published as a comparable chart.

The expected finding, and what the data should confirm:

| VRAM | Model | Expected quality |
| --- | --- | --- |
| 2 GB | TinyLlama 1.1B Q4 | Functional for simple tasks. Reasoning gaps. Narrow use cases. |
| 3 GB | Phi-2 2.7B Q4 | Noticeably better. Still limited on complex reasoning. |
| 4 GB | Phi-3 Mini 3.8B Q4 | Meaningful step up. Usable for most everyday tasks. |
| 6 GB | Qwen2 4B Q4 | Solid general capability. |
| 8 GB | Llama3 8B Q4 | Comfortable general purpose. Most users stop here. |
| 12 GB | Llama 2 13B Q4 | Clear quality improvement. Longer coherent context. |
| 24 GB | Llama3 70B (low-bit quant — Q4 exceeds 24 GB) | Ceiling of consumer hardware usefulness. |

The goal is not to confirm the obvious — it is to show where the cliff is steep and where it is gradual. The 2 GB to 3 GB jump and the 3 GB to 4 GB jump may tell very different stories. That granularity is what makes running the full tier stack worthwhile.

This data is also relevant to fine-tuned model evaluation. If Burro produces a fine-tuned 1.1B model that outperforms the base Phi-2 2.7B on domain tasks despite running on a 2 GB card, that is a meaningful finding about what fine-tuning can recover at the quality floor.


The Colmena Tier Stack

Colmena’s card selection isn’t arbitrary — each card is the floor of its VRAM tier, not the best available. Conservative baselines mean: if it runs here, it runs on your card. Community submissions extend each tier upward.

The tier stack deliberately extends down to 2 GB. This is not because 2 GB is a useful inference target — it isn’t — but because documenting where quality degrades and why is a research output in its own right. Most inference guides assert a “4 GB minimum” without evidence. Colmena will show the data behind that claim.

| Card in Colmena | VRAM | Bandwidth | Status | Tier Role |
| --- | --- | --- | --- | --- |
| GTX 950 | 2 GB | 105 GB/s | Watching | Floor of 2 GB tier (Maxwell). TinyLlama 1.1B only. Quality cliff reference point. |
| GTX 1060 3 GB | 3 GB | 192 GB/s | Watching | Floor of 3 GB tier (Pascal). Phi-2 2.7B territory. Quality cliff reference point. |
| GTX 960 4 GB | 4 GB | 112 GB/s | Watching | Floor of 4 GB tier (Maxwell, Compute 5.2, Ollama only) |
| GTX 980 Ti | 6 GB | 336 GB/s | Watching | Floor of 6 GB Maxwell tier (Ollama only) |
| GTX Titan X | 12 GB | 336 GB/s | Watching | Floor of 12 GB Maxwell tier (Ollama only) |
| GTX 1050 Ti | 4 GB | 112 GB/s | Acquired | Floor of 4 GB tier (Pascal, no Tensor Cores) |
| GTX 1060 6 GB | 6 GB | 192 GB/s | Pending | Floor of 6 GB tier (Pascal, no Tensor Cores) |
| RTX 2060 Super (x3) | 8 GB | 448 GB/s | Acquired | Floor of 8 GB Turing bandwidth (Tensor Cores) |
| RTX 3060 AORUS Elite | 12 GB | 360 GB/s | Acquired | Floor of 12 GB tier (Ampere, Tensor Cores) |
| RTX 3090 | 24 GB | 936 GB/s | Pending | Reference only — 24 GB scaling data, not a recommendation |
| RTX 4060 Ti 16 GB | 16 GB | 288 GB/s | Pending | Floor of 16 GB tier — documents the bandwidth penalty, not the ideal |

Three 2060 Supers in the 8 GB tier is intentional. Running identical hardware in parallel validates that LocoBench results are consistent and not an artefact of a single card’s condition.
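The consistency check across the three cards reduces to comparing relative spread against a tolerance. A sketch with hypothetical tokens/sec figures (the 5% tolerance is an assumption, not a LocoBench rule):

```python
# Hypothetical tokens/sec from three identical 2060 Supers - illustrative.
runs = {"card_a": 41.2, "card_b": 40.8, "card_c": 41.5}

mean = sum(runs.values()) / len(runs)
spread = (max(runs.values()) - min(runs.values())) / mean  # relative spread

# A large spread would suggest one card's condition (thermals, power
# limit, PCIe slot) is skewing the tier baseline.
print(f"mean {mean:.1f} t/s, spread {spread:.1%}")
assert spread < 0.05, "tier result not reproducible across identical cards"
```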

The 4060 Ti 16 GB is slated as the bandwidth floor of the 16 GB tier precisely because it is the worst representative — readers with a 4070 Ti Super or 4080 know their card is substantially faster, and the delta is documented clearly. The same logic that makes the 3060 useful despite being slower than the 2060 Super applies here.

The 3090 sits outside the lab’s small-model focus. It is included to answer one question: do results from the lower tiers scale predictably when VRAM is no longer the constraint? Once that question is answered, the 24 GB tier has limited ongoing relevance for this project.

The bandwidth delta within each tier allows readers to extrapolate to their specific card. Colmena runs LocoBench sequentially due to host RAM constraints (8 GB DDR3); results are identical to parallel runs — the benchmarks simply don’t happen simultaneously.


Part of the LocoLLM documentation. For hardware acquisition context see meet-the-lab.md.