LLM Inference – Consumer GPU performance

Posted on August 22, 2024 by Jon Allman

Table of Contents

  • Introduction
  • Test Setup
  • GPU Performance
  • Final Thoughts

Introduction

In our ongoing effort to assess hardware performance for AI and machine learning workloads, today we’re publishing results from the built-in benchmark tool of llama.cpp, focusing on a variety of NVIDIA GeForce GPUs, from the RTX 4090 down to the now-ancient (in tech terms) GTX 1080 Ti. Although this round of testing is limited to NVIDIA graphics cards, we plan to expand our scope in future benchmarks to include AMD offerings.

If you’re interested in how NVIDIA’s professional GPUs performed using this benchmark, see our companion article, LLM Inference – Professional GPU performance.


Test Setup

Test Platform

CPU: AMD Ryzen Threadripper PRO 7985WX 64-Core
CPU Cooler: Asetek 836S-M1A 360mm Threadripper CPU Cooler
Motherboard: ASUS Pro WS WRX90E-SAGE SE
BIOS Version: 0404
RAM: 8x Kingston DDR5-5600 ECC Reg. 1R 16GB (128GB total)
GPUs:
  • NVIDIA GeForce RTX 4090 24GB
  • NVIDIA GeForce RTX 4080 SUPER 16GB
  • NVIDIA GeForce RTX 4080 16GB
  • NVIDIA GeForce RTX 4070 Ti SUPER 16GB
  • NVIDIA GeForce RTX 4070 Ti 12GB
  • NVIDIA GeForce RTX 4070 SUPER 12GB
  • NVIDIA GeForce RTX 4070 12GB
  • NVIDIA GeForce RTX 4060 Ti 8GB
  • NVIDIA GeForce RTX 4060 8GB
  • NVIDIA GeForce RTX 3080 Ti 12GB
  • NVIDIA GeForce RTX 2080 Ti 11GB
  • NVIDIA GeForce GTX 1080 Ti 11GB
Driver Version: 560.70
PSU: Super Flower LEADEX Platinum 1600W
Storage: Samsung 980 Pro 2TB
OS: Windows 11 Pro 23H2 Build 22631.3880

We used llama.cpp build 3140 with CUDA 12.2.0 and Microsoft’s Phi-3-mini-4k-instruct model in 4-bit GGUF format. The prompt processing and token generation tests used the benchmark’s default sizes of 512 and 128 tokens, respectively. Each test was repeated 25 times and the results were averaged.
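For readers who want to run a comparable test, the sketch below assembles a llama-bench command line that matches the settings described above. The executable path and the model filename are placeholders chosen for illustration, not values taken from our test bench, and flag names can differ between llama.cpp builds, so check `llama-bench --help` on your build before relying on it.

```python
import shlex

# Hypothetical paths: neither the executable location nor the model
# filename is specified in this article.
LLAMA_BENCH = "./llama-bench"
MODEL_PATH = "Phi-3-mini-4k-instruct-q4.gguf"


def build_bench_command(n_prompt=512, n_gen=128, repetitions=25):
    """Return the argument list for a single llama-bench invocation."""
    return [
        LLAMA_BENCH,
        "-m", MODEL_PATH,         # model file to benchmark
        "-p", str(n_prompt),      # prompt processing test size, in tokens
        "-n", str(n_gen),         # token generation test size, in tokens
        "-r", str(repetitions),   # number of repetitions to average over
    ]


command = build_bench_command()
# Print a shell-ready string; pass `command` to subprocess.run to execute it.
print(shlex.join(command))
```

Building the command as a list rather than a single string keeps it safe to hand to `subprocess.run` without shell quoting problems.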

GPU Performance

Prompt processing chart for consumer GPUs

Within the latest GeForce RTX 4000 series, the cards rank in the prompt processing test as we would expect from their positioning within the product stack. The RTX 4090 was 28.5% faster than the RTX 4080 SUPER, which in turn was only 6.2% faster than the standard RTX 4080. Especially with its hefty 24GB of VRAM, the RTX 4090 continues to be a great choice for LLMs, but the RTX 4080 SUPER is also worth considering since it actually has a lower MSRP than the RTX 4080. At the 16GB mark, the RTX 4070 Ti SUPER is a worthwhile alternative to the RTX 4080 SUPER, with a lower overall cost but a similar price/performance ratio. This all assumes, however, that prompt processing performance matters more to you than token generation, which is an unlikely scenario in many workflows.

Interestingly, the last-generation RTX 3080 Ti came out ahead of both the RTX 4070 SUPER and the RTX 4070, and the venerable RTX 2080 Ti managed to edge out the RTX 4060 Ti. Finally, with its complete lack of tensor cores, the GTX 1080 Ti truly shows its age, scoring five times slower than its closest competition, the RTX 4060.

When we compare these results with the technical specifications of these GPUs, it becomes clear that FP16 performance has a direct impact on how quickly they are able to process prompts in the llama.cpp benchmark. FP16 performance is almost exclusively a function of the number of tensor cores and the tensor core generation a GPU was built with. This explains why, with its complete lack of tensor cores, the GTX 1080 Ti’s FP16 performance is anemic compared to the rest of the GPUs tested. However, the fact that the RTX 3080 Ti came out ahead of the RTX 4070 SUPER indicates that FP16 performance is not the only factor at work during prompt processing, and the token generation results below shed some light on what else we should be considering.

GPU                 FP16 (TFLOPS)   Tensor Core Count   Tensor Core Generation
RTX 4090            82.58           512                 4th
RTX 4080 SUPER      52.22           320                 4th
RTX 4080            48.74           304                 4th
RTX 4070 Ti SUPER   44.10           264                 4th
RTX 4070 Ti         40.09           240                 4th
RTX 3080 Ti         34.10           320                 3rd
RTX 4070 SUPER      35.48           224                 4th
RTX 4070            29.15           184                 4th
RTX 2080 Ti         26.90           544                 2nd
RTX 4060 Ti         22.06           136                 4th
RTX 4060            15.11           96                  4th
GTX 1080 Ti         0.18            0                   N/A
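
As a rough illustration of how closely the FP16 specification tracks the measured results, the sketch below compares the FP16 gap between two cards with the prompt processing gap reported earlier in this section. It uses only figures from this article; the helper name `percent_more` is our own, and the comparison is a back-of-the-envelope check rather than a performance model.

```python
# FP16 throughput in TFLOPS, copied from the table above.
FP16_TFLOPS = {
    "RTX 4090": 82.58,
    "RTX 4080 SUPER": 52.22,
    "RTX 4080": 48.74,
}

# Prompt processing gaps quoted in the text: (faster card, slower card, percent).
MEASURED_GAPS = [
    ("RTX 4090", "RTX 4080 SUPER", 28.5),
    ("RTX 4080 SUPER", "RTX 4080", 6.2),
]


def percent_more(a, b):
    """Return how much larger a is than b, as a percentage."""
    return (a / b - 1.0) * 100.0


for faster, slower, measured in MEASURED_GAPS:
    spec_gap = percent_more(FP16_TFLOPS[faster], FP16_TFLOPS[slower])
    print(f"{faster} vs {slower}: FP16 +{spec_gap:.1f}%, measured +{measured:.1f}%")
```

The RTX 4080 SUPER's lead over the RTX 4080 is close to what the FP16 figures suggest, while the RTX 4090's measured lead is well below its FP16 advantage, which is consistent with FP16 throughput not being the only factor in play.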

Once again, the RTX 4090 continues to show its dominance by landing at the top of the token generation chart, but surprisingly, the RTX 3080 Ti took second place, jumping up four positions compared to the prompt processing results, with a score functionally equivalent to the much newer RTX 4080 SUPER.

Again, if we refer back to the technical specifications of these GPUs, we can see how the older RTX 3080 Ti was able to achieve this result: through its notable memory bandwidth. Although the two RTX 4080 variants have considerably more FP16 compute capability than the RTX 3080 Ti (~50 TFLOPS vs. ~35 TFLOPS), the roughly 25% higher memory bandwidth of the RTX 3080 Ti allows it to come out just ahead of the newer GPUs during token generation.
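
The "roughly 25%" figure can be checked against the memory bandwidth values listed in the table later in this section. The short sketch below does that arithmetic; the function name `bandwidth_advantage` is our own and is used only for illustration.

```python
# Memory bandwidth in GB/s, copied from the bandwidth table in this article.
MEMORY_BANDWIDTH_GBPS = {
    "RTX 3080 Ti": 912.4,
    "RTX 4080 SUPER": 736.3,
    "RTX 4080": 716.8,
}


def bandwidth_advantage(card, other):
    """Return how much more bandwidth `card` has than `other`, in percent."""
    ratio = MEMORY_BANDWIDTH_GBPS[card] / MEMORY_BANDWIDTH_GBPS[other]
    return (ratio - 1.0) * 100.0


for other in ("RTX 4080 SUPER", "RTX 4080"):
    advantage = bandwidth_advantage("RTX 3080 Ti", other)
    print(f"RTX 3080 Ti vs {other}: +{advantage:.1f}% memory bandwidth")
```

The two results sit a little below and a little above 25%, which is where the rounded figure in the text comes from.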

Compared to the prompt processing results, we also see that the token generation test narrowed the performance gap between certain models. For example, in prompt processing, the percentage increase from an RTX 4070 Ti SUPER to the RTX 4080 SUPER was about 22%, but during token generation, the increase was only 8%. Likewise, the RTX 4070 Ti’s 25% lead over the RTX 4070 in prompt processing shrinks to almost nothing in token generation, where the two cards achieve nearly identical scores.

But as with the prompt processing results, if we look at where older cards like the RTX 2080 Ti and GTX 1080 Ti land on the chart, we can see that memory bandwidth is not the be-all and end-all specification for token generation, and FP16 compute performance still has a role to play. Otherwise, we’d expect to see the GTX 1080 Ti achieve a better result, considering its memory bandwidth is comparable to that of the RTX 4070 and its variants.

One somewhat anomalous result is the unexpectedly low tokens per second that the RTX 2080 Ti was able to achieve. The RTX 2080 Ti has more memory bandwidth and higher FP16 performance than the RTX 4060 series GPUs, but achieves similar results. We suspect this is the result of either software optimizations for the newer generations of GPUs or increased overhead from using a higher number of less capable tensor cores (544 second-gen cores vs. 96-136 fourth-gen cores).

GPU                 Memory Bandwidth (GB/s)
RTX 4090            1008.4
RTX 3080 Ti         912.4
RTX 4080 SUPER      736.3
RTX 4080            716.8
RTX 4070 Ti SUPER   672.3
RTX 4070 Ti         504.2
RTX 4070 SUPER      504.2
RTX 4070            504.2
RTX 2080 Ti         616
RTX 4060 Ti         288
RTX 4060            272
GTX 1080 Ti         484.4
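
To see how differently the two specifications order these cards, the sketch below ranks each GPU by the FP16 and memory bandwidth figures from the two tables in this article. It describes the spec sheets only, not our benchmark scores, and the helper name `rank_by` is our own.

```python
# card: (FP16 TFLOPS, memory bandwidth in GB/s), copied from the two tables.
SPECS = {
    "RTX 4090": (82.58, 1008.4),
    "RTX 4080 SUPER": (52.22, 736.3),
    "RTX 4080": (48.74, 716.8),
    "RTX 4070 Ti SUPER": (44.10, 672.3),
    "RTX 4070 Ti": (40.09, 504.2),
    "RTX 3080 Ti": (34.10, 912.4),
    "RTX 4070 SUPER": (35.48, 504.2),
    "RTX 4070": (29.15, 504.2),
    "RTX 2080 Ti": (26.90, 616.0),
    "RTX 4060 Ti": (22.06, 288.0),
    "RTX 4060": (15.11, 272.0),
    "GTX 1080 Ti": (0.18, 484.4),
}


def rank_by(index):
    """Rank cards from best (1) to worst on the spec at the given tuple index.

    Cards with identical values get consecutive ranks in table order,
    because Python's sort is stable.
    """
    ordered = sorted(SPECS, key=lambda card: SPECS[card][index], reverse=True)
    return {card: position for position, card in enumerate(ordered, start=1)}


fp16_rank = rank_by(0)
bandwidth_rank = rank_by(1)

for card in SPECS:
    print(f"{card:<18} FP16 rank {fp16_rank[card]:>2}   bandwidth rank {bandwidth_rank[card]:>2}")
```

The RTX 3080 Ti stands out here: it sits in the middle of the pack on FP16 but second on memory bandwidth, which mirrors its jump up the token generation chart.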

Final Thoughts

These results emphasize an important consideration when choosing GPUs for LLM usage: while raw memory capacity is very important, it is not the only factor that should be taken into account. It’s also important to consider the memory bandwidth and overall compute performance of a GPU in order to get a comprehensive understanding of a GPU’s suitability for LLMs.

This is just the starting point for our LLM testing series. Future updates will include more topics, such as inference with larger models, multi-GPU configurations, testing with AMD & Intel GPUs, and model training as well. We’re eager to hear from you – if there’s a specific aspect of LLM performance you’d like us to investigate, please let us know in the comments!

Tags: GPU, GTX 1080 Ti, LLM, NVIDIA, RTX 2080 Ti, RTX 3080 Ti, RTX 4060, RTX 4060 Ti, RTX 4070, RTX 4070 SUPER, RTX 4070 Ti, RTX 4070 Ti SUPER, RTX 4080, RTX 4080 SUPER, RTX 4090

© Copyright 2024 - Puget Systems, All Rights Reserved.