AI Inference

Inference can be deployed in many ways, depending on the use case. Offline processing of data is best done at larger batch sizes, which can deliver optimal GPU utilization and throughput. However, increasing throughput also tends to increase latency. Generative AI and large language model (LLM) deployments aim to deliver great user experiences by lowering latency. Developers and infrastructure managers therefore need to balance throughput against latency to deliver a great user experience and the best possible throughput while containing deployment costs.

When deploying LLMs at scale, a typical way to balance these concerns is to set a time-to-first-token (TTFT) limit and optimize throughput within that limit. The data presented in the Large Language Model Low Latency section show the best throughput at a time limit of one second, which enables high throughput at low latency for most users while optimizing compute resource use.
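The selection rule described above (fix a TTFT limit, then take the highest throughput that stays within it) can be sketched as follows. This is a minimal illustration, not the procedure used to produce the published numbers: the `Measurement` type, the `best_within_ttft` helper, and every value in `sweep` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Measurement:
    """One benchmark point; all values used below are hypothetical."""
    batch_size: int
    throughput_tokens_per_sec: float
    ttft_seconds: float


def best_within_ttft(points: Sequence[Measurement],
                     ttft_limit_s: float) -> Optional[Measurement]:
    """Return the highest-throughput point whose TTFT is within the limit."""
    eligible = [p for p in points if p.ttft_seconds <= ttft_limit_s]
    if not eligible:
        return None  # no configuration satisfies the limit
    return max(eligible, key=lambda p: p.throughput_tokens_per_sec)


# Hypothetical sweep: larger batches raise throughput but also raise TTFT.
sweep = [
    Measurement(batch_size=1, throughput_tokens_per_sec=250.0, ttft_seconds=0.05),
    Measurement(batch_size=16, throughput_tokens_per_sec=2800.0, ttft_seconds=0.40),
    Measurement(batch_size=64, throughput_tokens_per_sec=7200.0, ttft_seconds=0.95),
    Measurement(batch_size=256, throughput_tokens_per_sec=11000.0, ttft_seconds=2.60),
]

choice = best_within_ttft(sweep, ttft_limit_s=1.0)
print(choice)
```

With a one-second limit, the batch-256 point is excluded even though it has the highest raw throughput, so the batch-64 point is selected.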


MLPerf Inference v5.0 Performance Benchmarks

Offline Scenario, Closed Division

Network	Throughput	GPU	Server	GPU Version	Target Accuracy	Dataset
Llama3.1 405B	13,886 tokens/sec	72x GB200	NVIDIA GB200 NVL72	NVIDIA GB200	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	Subset of LongBench, LongDataCollections, Ruler, GovReport
	1,538 tokens/sec	8x B200	SYS-421GE-NBRT-LCC	NVIDIA B200-SXM-180GB	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	Subset of LongBench, LongDataCollections, Ruler, GovReport
	574 tokens/sec	8x H200	Cisco UCS C885A M8	NVIDIA H200-SXM-141GB	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B	98,858 tokens/sec	8x B200	NVIDIA DGX B200	NVIDIA B200-SXM-180GB	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	OpenOrca (max_seq_len=1024)
	35,453 tokens/sec	8x H200	ThinkSystem SR680a V3	NVIDIA H200-SXM-141GB	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	OpenOrca (max_seq_len=1024)
Mixtral 8x7B	128,795 tokens/sec	8x B200	SYS-421GE-NBRT-LCC	NVIDIA B200-SXM-180GB	99% of FP16 ((OpenOrca)rouge1=45.5989, (OpenOrca)rouge2=23.3526, (OpenOrca)rougeL=30.4608, (gsm8k)Accuracy=73.66, (mbxp)Accuracy=60.16)	OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
	63,515 tokens/sec	8x H200	ThinkSystem SR780a V3	NVIDIA H200-SXM-141GB	99% of FP16 ((OpenOrca)rouge1=45.5989, (OpenOrca)rouge2=23.3526, (OpenOrca)rougeL=30.4608, (gsm8k)Accuracy=73.66, (mbxp)Accuracy=60.16)	OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
Stable Diffusion XL	30 samples/sec	8x B200	NVIDIA DGX B200	NVIDIA B200-SXM-180GB	FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801]	Subset of coco-2014 val
	19 samples/sec	8x H200	AS-4125GS-TNHR2-LCC	NVIDIA H200-SXM-141GB	FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801]	Subset of coco-2014 val
RGAT	450,175 samples/sec	8x H200	ThinkSystem SR780a V3	NVIDIA H200-SXM-141GB	99% of FP32 (72.86%)	IGBH
GPT-J	21,626 tokens/sec	8x H200	ThinkSystem SR780a V3	NVIDIA H200-SXM-141GB	99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881)	CNN Dailymail (v3.0.0, max_seq_len=2048)
ResNet-50	773,300 samples/sec	8x H200	ThinkSystem SR680a V3	NVIDIA H200-SXM-141GB	76.46% Top1	ImageNet (224x224)
RetinaNet	15,200 samples/sec	8x H200	AS-4125GS-TNHR2-LCC	NVIDIA H200-SXM-141GB	0.3755 mAP	OpenImages (800x800)
DLRMv2	654,489 samples/sec	8x H200	HPE Cray XD670 with Cray ClusterStor	NVIDIA H200-SXM-141GB	99% of FP32 (AUC=80.31%)	Synthetic Multihot Criteo Dataset
3D-UNET	55 samples/sec	8x H200	HPE Cray XD670 with Cray ClusterStor	NVIDIA H200-SXM-141GB	99.9% of FP32 (0.86330 mean DICE score)	KiTS 2019

Server Scenario, Closed Division

Network	Throughput	GPU	Server	GPU Version	Target Accuracy	MLPerf Server Latency Constraints	Dataset
Llama3.1 405B	8,850 tokens/sec	72x GB200	NVIDIA GB200 NVL72	NVIDIA GB200	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	TTFT/TPOT: 6000 ms/175 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
	1,080 tokens/sec	8x B200	SYS-A21GE-NBRT	NVIDIA B200-SXM-180GB	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	TTFT/TPOT: 6000 ms/175 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
	294 tokens/sec	8x H200	Cisco UCS C885A M8	NVIDIA H200-SXM-141GB	99% of FP16 ((GovReport + LongDataCollections + 65 Sample from LongBench)rougeL=21.6666, (Remaining samples of the dataset)exact_match=90.1335)	TTFT/TPOT: 6000 ms/175 ms	Subset of LongBench, LongDataCollections, Ruler, GovReport
Llama2 70B Interactive	62,266 tokens/sec	8x B200	SYS-A21GE-NBRT	NVIDIA B200-SXM-180GB	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	TTFT/TPOT: 450 ms/40 ms	OpenOrca (max_seq_len=1024)
	20,235 tokens/sec	8x H200	G893-SD1	NVIDIA H200-SXM-141GB	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	TTFT/TPOT: 450 ms/40 ms	OpenOrca (max_seq_len=1024)
Llama2 70B	98,443 tokens/sec	8x B200	NVIDIA DGX B200	NVIDIA B200-SXM-180GB	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	TTFT/TPOT: 2000 ms/200 ms	OpenOrca (max_seq_len=1024)
	33,072 tokens/sec	8x H200	NVIDIA H200	NVIDIA H200-SXM-141GB-CTS	99.9% of FP32 (rouge1=44.4312, rouge2=22.0352, rougeL=28.6162)	TTFT/TPOT: 2000 ms/200 ms	OpenOrca (max_seq_len=1024)
Mixtral 8x7B	129,047 tokens/sec	8x B200	SYS-421GE-NBRT-LCC	NVIDIA B200-SXM-180GB	99% of FP16 ((OpenOrca)rouge1=45.5989, (OpenOrca)rouge2=23.3526, (OpenOrca)rougeL=30.4608, (gsm8k)Accuracy=73.66, (mbxp)Accuracy=60.16)	TTFT/TPOT: 2000 ms/200 ms	OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
	61,802 tokens/sec	8x H200	NVIDIA H200	NVIDIA H200-SXM-141GB-CTS	99% of FP16 ((OpenOrca)rouge1=45.5989, (OpenOrca)rouge2=23.3526, (OpenOrca)rougeL=30.4608, (gsm8k)Accuracy=73.66, (mbxp)Accuracy=60.16)	TTFT/TPOT: 2000 ms/200 ms	OpenOrca (5k samples, max_seq_len=2048), GSM8K (5k samples of the train split, max_seq_len=2048), MBXP (5k samples, max_seq_len=2048)
Stable Diffusion XL	29 samples/sec	8x B200	SYS-A21GE-NBRT	NVIDIA B200-SXM-180GB	FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801]	20 s	Subset of coco-2014 val
	18 samples/sec	8x H200	NVIDIA H200	NVIDIA H200-SXM-141GB-CTS	FID range: [23.01085758, 23.95007626] and CLIP range: [31.68631873, 31.81331801]	20 s	Subset of coco-2014 val
GPT-J	21,813 queries/sec	8x H200	Cisco UCS C885A M8	NVIDIA H200-SXM-141GB	99.9% of FP32 (rouge1=42.9865, rouge2=20.1235, rougeL=29.9881)	20 s	CNN Dailymail (v3.0.0, max_seq_len=2048)
ResNet-50	676,219 queries/sec	8x H200	G893-SD1	NVIDIA H200-SXM-141GB	76.46% Top1	15 ms	ImageNet (224x224)
RetinaNet	14,589 queries/sec	8x H200	AS-4125GS-TNHR2-LCC	NVIDIA H200-SXM-141GB	0.3755 mAP	100 ms	OpenImages (800x800)
DLRMv2	590,167 queries/sec	8x H200	HPE Cray XD670 with Cray ClusterStor	NVIDIA H200-SXM-141GB	99% of FP32 (AUC=80.31%)	60 ms	Synthetic Multihot Criteo Dataset

MLPerf™ v5.0 Inference Closed: Llama3.1 405B 99% of FP16, Llama2 70B Interactive 99.9% of FP32, Llama2 70B 99.9% of FP32, Mixtral 8x7B 99% of FP16, Stable Diffusion XL, ResNet-50 v1.5, RetinaNet, RNN-T, RGAT, 3D U-Net 99.9% of FP32 accuracy target, GPT-J 99.9% of FP32 accuracy target, DLRM 99% of FP32 accuracy target: 5.0-0011, 5.0-0033, 5.0-0041, 5.0-0051, 5.0-0053, 5.0-0056, 5.0-0058, 5.0-0060, 5.0-0070, 5.0-0072, 5.0-0074. MLPerf name and logo are trademarks. See https://mlcommons.org/ for more information.
Llama2 70B Max Sequence Length = 1,024
Mixtral 8x7B Max Sequence Length = 2,048
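To make the TTFT/TPOT limits in the Server Scenario table concrete, the sketch below checks a single request against them. It is only an illustration: MLPerf applies its limits according to its own rules, which this code does not reproduce. The `tpot_ms` definition (average time per output token after the first) and the request timings in the usage lines are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyLimit:
    ttft_ms: float  # time to first token
    tpot_ms: float  # time per output token


# Limits copied from the Server Scenario table above.
LIMITS = {
    "Llama3.1 405B": LatencyLimit(ttft_ms=6000, tpot_ms=175),
    "Llama2 70B Interactive": LatencyLimit(ttft_ms=450, tpot_ms=40),
    "Llama2 70B": LatencyLimit(ttft_ms=2000, tpot_ms=200),
    "Mixtral 8x7B": LatencyLimit(ttft_ms=2000, tpot_ms=200),
}


def tpot_ms(total_latency_ms: float, ttft_ms: float, output_tokens: int) -> float:
    """Average time per output token after the first (assumed definition)."""
    if output_tokens < 2:
        return 0.0
    return (total_latency_ms - ttft_ms) / (output_tokens - 1)


def meets_limit(network: str, ttft_ms: float,
                total_latency_ms: float, output_tokens: int) -> bool:
    """True if one request stays within both limits for the given network."""
    limit = LIMITS[network]
    per_token = tpot_ms(total_latency_ms, ttft_ms, output_tokens)
    return ttft_ms <= limit.ttft_ms and per_token <= limit.tpot_ms


# Hypothetical request: 400 ms to first token, 4,500 ms total, 101 output tokens,
# which gives (4500 - 400) / 100 = 41 ms per output token.
print(meets_limit("Llama2 70B Interactive", 400, 4500, 101))
print(meets_limit("Llama2 70B", 400, 4500, 101))
```

The same hypothetical request fails the 450 ms/40 ms interactive limits but passes the 2000 ms/200 ms limits, which shows how much tighter the interactive constraint is.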

LLM Inference Performance of NVIDIA Data Center Products

B200 Inference Performance - Per User

Model	Attention	MoE	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
DeepSeek R1 671B	TP8	EP8	1,024	2,048	253 output tokens/sec/user	8x B200	DGX B200	FP4	TensorRT-LLM	NVIDIA B200

Attention: Tensor Parallelism = 8
MoE: Expert Parallelism = 8
TensorRT-LLM version: internal release
Batch Size = 1
Input tokens not included in TPS calculations

B200 Inference Performance - Max Throughput

Model	Attention	MoE	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
DeepSeek R1 671B	DP8	EP8	1,024	2,048	30,389 output tokens/sec	8x B200	DGX B200	FP4	TensorRT-LLM	NVIDIA B200

Attention: Data Parallelism = 8
MoE: Expert Parallelism = 8
TensorRT-LLM version: internal release
Input tokens not included in TPS calculations
Max concurrency use case

H200 Inference Performance - High Throughput

Model	PP	TP	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Llama v3.1 405B	1	8	128	128	3,874 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 405B	1	8	128	2048	5,938 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 405B	1	8	128	4096	5,168 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 405B	8	1	2048	128	764 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.14a	NVIDIA H200
Llama v3.1 405B	1	8	5000	500	669 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 405B	1	8	500	2000	5,084 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 405B	1	8	1000	1000	3,400 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 405B	1	8	2048	2048	2,941 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 405B	1	8	20000	2000	535 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200

Llama v3.1 70B	1	1	128	128	4,021 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 70B	1	1	128	2048	4,166 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 70B	1	2	128	4096	6,527 output tokens/sec	2x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 70B	1	1	2048	128	466 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 70B	1	1	5000	500	560 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 70B	1	2	500	2000	6,848 output tokens/sec	2x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 70B	1	1	1000	1000	2,823 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 70B	1	2	2048	2048	4,184 output tokens/sec	2x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 70B	1	2	20000	2000	641 output tokens/sec	2x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200

Llama v3.1 8B	1	1	128	128	29,526 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 8B	1	1	128	2048	25,399 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 8B	1	1	128	4096	17,371 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 8B	1	1	2048	128	3,794 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 8B	1	1	5000	500	3,988 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 8B	1	1	500	2000	21,021 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 8B	1	1	1000	1000	17,538 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 8B	1	1	2048	2048	11,969 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Llama v3.1 8B	1	1	20000	2000	1,804 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200

Mistral 7B	1	1	128	128	31,938 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mistral 7B	1	1	128	2048	27,409 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mistral 7B	1	1	128	4096	18,505 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mistral 7B	1	1	2048	128	3,834 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mistral 7B	1	1	5000	500	4,042 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mistral 7B	1	1	500	2000	22,355 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mistral 7B	1	1	1000	1000	18,426 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mistral 7B	1	1	2048	2048	12,347 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mistral 7B	1	1	20000	2000	1,823 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200

Mixtral 8x7B	1	1	128	128	17,158 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mixtral 8x7B	1	1	128	2048	15,095 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mixtral 8x7B	1	2	128	4096	21,565 output tokens/sec	2x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mixtral 8x7B	1	1	2048	128	2,010 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mixtral 8x7B	1	1	5000	500	2,309 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mixtral 8x7B	1	1	500	2000	12,105 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mixtral 8x7B	1	1	1000	1000	10,371 output tokens/sec	1x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mixtral 8x7B	1	2	2048	2048	14,018 output tokens/sec	2x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mixtral 8x7B	1	2	20000	2000	2,227 output tokens/sec	2x H200	DGX H200	FP8	TensorRT-LLM 0.15.0	NVIDIA H200

Mixtral 8x22B	1	8	128	128	25,179 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.14.0	NVIDIA H200
Mixtral 8x22B	1	8	128	2048	32,623 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.15.0	NVIDIA H200
Mixtral 8x22B	1	8	128	4096	25,753 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mixtral 8x22B	1	8	2048	128	3,095 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.15.0	NVIDIA H200
Mixtral 8x22B	1	8	5000	500	4,209 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.15.0	NVIDIA H200
Mixtral 8x22B	1	8	500	2000	27,430 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mixtral 8x22B	1	8	1000	1000	20,097 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.15.0	NVIDIA H200
Mixtral 8x22B	1	8	2048	2048	15,799 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.17.0	NVIDIA H200
Mixtral 8x22B	1	8	20000	2000	2,897 output tokens/sec	8x H200	DGX H200	FP8	TensorRT-LLM 0.14.0	NVIDIA H200

TP: Tensor Parallelism
PP: Pipeline Parallelism
For more information on pipeline parallelism, see the Llama v3.1 405B blog
Output tokens/second on Llama v3.1 405B is inclusive of time to generate the first token (tokens/s = total generated tokens / total latency)
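The formula in the note above (tokens/s = total generated tokens / total latency) can be sketched as follows, alongside a decode-only variant for contrast. The decode-only definition and all timings here are assumptions for illustration, not values or methods taken from the tables.

```python
def output_tokens_per_sec(total_generated_tokens: int, total_latency_s: float) -> float:
    """Throughput inclusive of time to first token, as in the note above."""
    if total_latency_s <= 0:
        raise ValueError("total_latency_s must be positive")
    return total_generated_tokens / total_latency_s


def decode_only_tokens_per_sec(total_generated_tokens: int,
                               total_latency_s: float,
                               ttft_s: float) -> float:
    """Assumed alternative: exclude the first token and the time taken to produce it."""
    decode_time_s = total_latency_s - ttft_s
    if decode_time_s <= 0:
        raise ValueError("total latency must exceed time to first token")
    return (total_generated_tokens - 1) / decode_time_s


# Hypothetical request: 2,048 generated tokens, 10 s total, 2 s to first token.
inclusive = output_tokens_per_sec(2048, 10.0)
decode_only = decode_only_tokens_per_sec(2048, 10.0, 2.0)
print(f"inclusive: {inclusive:.1f} tokens/s, decode only: {decode_only:.1f} tokens/s")
```

Including the time to first token lowers the reported figure, and the gap grows with longer inputs, since prefill takes a larger share of total latency.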

GH200 Inference Performance - High Throughput

Model	PP	TP	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Llama v3.1 70B	1	1	128	128	3,637 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Llama v3.1 70B	1	4	128	2048	10,358 output tokens/sec	4x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.13.0	NVIDIA GH200 96GB
Llama v3.1 70B	1	4	128	4096	6,628 output tokens/sec	4x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.13.0	NVIDIA GH200 96GB
Llama v3.1 70B	1	1	2048	128	425 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Llama v3.1 70B	1	1	5000	500	422 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Llama v3.1 70B	1	4	500	2000	9,091 output tokens/sec	4x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.13.0	NVIDIA GH200 96GB
Llama v3.1 70B	1	1	1000	1000	1,746 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Llama v3.1 70B	1	4	2048	2048	4,865 output tokens/sec	4x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.13.0	NVIDIA GH200 96GB
Llama v3.1 70B	1	4	20000	2000	959 output tokens/sec	4x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.13.0	NVIDIA GH200 96GB

Llama v3.1 8B	1	1	128	128	29,853 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Llama v3.1 8B	1	1	128	2048	21,770 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Llama v3.1 8B	1	1	128	4096	14,190 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Llama v3.1 8B	1	1	2048	128	3,844 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Llama v3.1 8B	1	1	5000	500	3,933 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Llama v3.1 8B	1	1	500	2000	17,137 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Llama v3.1 8B	1	1	1000	1000	16,483 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Llama v3.1 8B	1	1	2048	2048	10,266 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Llama v3.1 8B	1	1	20000	2000	1,560 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB

Mistral 7B	1	1	128	128	32,498 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mistral 7B	1	1	128	2048	23,337 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mistral 7B	1	1	128	4096	15,018 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mistral 7B	1	1	2048	128	3,813 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mistral 7B	1	1	5000	500	3,950 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mistral 7B	1	1	500	2000	18,556 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mistral 7B	1	1	1000	1000	17,252 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mistral 7B	1	1	2048	2048	10,756 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mistral 7B	1	1	20000	2000	1,601 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB

Mixtral 8x7B	1	1	128	128	16,859 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mixtral 8x7B	1	1	128	2048	11,120 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mixtral 8x7B	1	4	128	4096	30,066 output tokens/sec	4x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.13.0	NVIDIA GH200 96GB
Mixtral 8x7B	1	1	2048	128	1,994 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mixtral 8x7B	1	1	5000	500	2,078 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mixtral 8x7B	1	1	500	2000	9,193 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mixtral 8x7B	1	1	1000	1000	8,849 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mixtral 8x7B	1	1	2048	2048	5,545 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB
Mixtral 8x7B	1	1	20000	2000	861 output tokens/sec	1x GH200	NVIDIA Grace Hopper x4 P4496	FP8	TensorRT-LLM 0.17.0	NVIDIA GH200 96GB

TP: Tensor Parallelism
PP: Pipeline Parallelism

H100 Inference Performance - High Throughput

Model	PP	TP	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Llama v3.1 70B	1	1	128	128	3,378 output tokens/sec	1x H100	DGX H100	FP8	TensorRT-LLM 0.17.0	H100-SXM5-80GB
Llama v3.1 70B	1	2	128	4096	3,897 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 0.17.0	H100-SXM5-80GB
Llama v3.1 70B	1	2	2048	128	774 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 0.15.0	H100-SXM5-80GB
Llama v3.1 70B	1	2	500	2000	4,973 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 0.17.0	H100-SXM5-80GB
Llama v3.1 70B	1	2	1000	1000	4,391 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 0.17.0	H100-SXM5-80GB
Llama v3.1 70B	1	2	2048	2048	2,898 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 0.17.0	H100-SXM5-80GB
Llama v3.1 70B	1	4	20000	2000	920 output tokens/sec	4x H100	DGX H100	FP8	TensorRT-LLM 0.17.0	H100-SXM5-80GB

Mixtral 8x7B	1	1	128	128	15,962 output tokens/sec	1x H100	DGX H100	FP8	TensorRT-LLM 0.17.0	H100-SXM5-80GB
Mixtral 8x7B	1	2	128	2048	23,010 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 0.15.0	H100-SXM5-80GB
Mixtral 8x7B	1	2	128	4096	14,237 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 0.17.0	H100-SXM5-80GB
Mixtral 8x7B	1	1	2048	128	1,893 output tokens/sec	1x H100	DGX H100	FP8	TensorRT-LLM 0.17.0	H100-SXM5-80GB
Mixtral 8x7B	1	2	5000	500	3,646 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 0.17.0	H100-SXM5-80GB
Mixtral 8x7B	1	2	500	2000	18,186 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 0.14.0	H100-SXM5-80GB
Mixtral 8x7B	1	2	1000	1000	15,932 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 0.14.0	H100-SXM5-80GB
Mixtral 8x7B	1	2	2048	2048	10,686 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 0.17.0	H100-SXM5-80GB
Mixtral 8x7B	1	2	20000	2000	1,757 output tokens/sec	2x H100	DGX H100	FP8	TensorRT-LLM 0.17.0	H100-SXM5-80GB

TP: Tensor Parallelism
PP: Pipeline Parallelism

L40S Inference Performance - High Throughput

Model	PP	TP	Input Length	Output Length	Throughput	GPU	Server	Precision	Framework	GPU Version
Llama v3.1 8B	1	1	128	128	9,105 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.17.0	NVIDIA L40S
Llama v3.1 8B	1	1	128	2048	5,366 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.17.0	NVIDIA L40S
Llama v3.1 8B	1	1	128	4096	3,026 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.17.0	NVIDIA L40S
Llama v3.1 8B	1	1	2048	128	1,067 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.17.0	NVIDIA L40S
Llama v3.1 8B	1	1	5000	500	981 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.17.0	NVIDIA L40S
Llama v3.1 8B	1	1	500	2000	4,274 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.17.0	NVIDIA L40S
Llama v3.1 8B	1	1	1000	1000	4,055 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.17.0	NVIDIA L40S
Llama v3.1 8B	1	1	2048	2048	2,225 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.17.0	NVIDIA L40S
Llama v3.1 8B	1	1	20000	2000	328 output tokens/sec	1x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.17.0	NVIDIA L40S

Mixtral 8x7B	4	1	128	128	15,278 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.15.0	NVIDIA L40S
Mixtral 8x7B	2	2	128	2048	9,087 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.15.0	NVIDIA L40S
Mixtral 8x7B	1	4	128	4096	5,736 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.17.0	NVIDIA L40S
Mixtral 8x7B	4	1	2048	128	2,098 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.15.0	NVIDIA L40S
Mixtral 8x7B	2	2	5000	500	1,558 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.15.0	NVIDIA L40S
Mixtral 8x7B	2	2	500	2000	7,974 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.15.0	NVIDIA L40S
Mixtral 8x7B	2	2	1000	1000	6,579 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.15.0	NVIDIA L40S
Mixtral 8x7B	2	2	2048	2048	4,217 output tokens/sec	4x L40S	Supermicro SYS-521GE-TNRT	FP8	TensorRT-LLM 0.15.0	NVIDIA L40S

TP: Tensor Parallelism
PP: Pipeline Parallelism
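In the high-throughput tables above, the GPU count in each row equals PP × TP (for example, PP = 2 and TP = 2 on 4x L40S). The sketch below uses that relationship to normalize throughput per GPU. The helper names are hypothetical, and per-GPU normalization is an illustrative convenience rather than a metric the tables report.

```python
def gpu_count(pp: int, tp: int) -> int:
    """GPUs used by one model instance, matching the PP and TP columns above."""
    return pp * tp


def per_gpu_throughput(tokens_per_sec: float, pp: int, tp: int) -> float:
    """Output tokens/sec divided evenly across the GPUs of one instance."""
    return tokens_per_sec / gpu_count(pp, tp)


# (model, PP, TP, input length, output length, output tokens/sec)
# taken from the L40S Mixtral 8x7B rows above.
rows = [
    ("Mixtral 8x7B", 4, 1, 128, 128, 15278),
    ("Mixtral 8x7B", 2, 2, 128, 2048, 9087),
    ("Mixtral 8x7B", 1, 4, 128, 4096, 5736),
]

for model, pp, tp, isl, osl, tps in rows:
    print(f"{model} PP={pp} TP={tp} ISL={isl} OSL={osl}: "
          f"{gpu_count(pp, tp)} GPUs, {per_gpu_throughput(tps, pp, tp):,.2f} tokens/sec/GPU")
```

These three rows use different input and output lengths, so the per-GPU figures are not a like-for-like comparison of the parallelism settings themselves.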

Inference Performance of NVIDIA Data Center Products

H200 Inference Performance

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
Stable Diffusion v2.1 (512x512)	1	4.33 images/sec	-	231.26	1x H200	DGX H200	24.10-py3	INT8	Synthetic	TensorRT 10.5.0.26	NVIDIA H200
	4	6.8 images/sec	-	588.08	1x H200	DGX H200	24.10-py3	INT8	Synthetic	TensorRT 10.5.0.26	NVIDIA H200
Stable Diffusion XL	1	0.86 images/sec	-	1157.27	1x H200	DGX H200	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA H200
ResNet-50v1.5	8	20,801 images/sec	62 images/sec/watt	0.38	1x H200	DGX H200	25.01-py3	INT8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
	128	65,045 images/sec	107 images/sec/watt	1.97	1x H200	DGX H200	25.01-py3	INT8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
EfficientNet-B0	8	16,769 images/sec	77 images/sec/watt	0.48	1x H200	DGX H200	25.01-py3	INT8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
	128	56,981 images/sec	122 images/sec/watt	2.25	1x H200	DGX H200	25.01-py3	INT8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
EfficientNet-B4	8	4,507 images/sec	14 images/sec/watt	1.78	1x H200	DGX H200	25.01-py3	INT8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
	128	8,991 images/sec	15 images/sec/watt	14.24	1x H200	DGX H200	25.01-py3	INT8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
HF Swin Base	8	5,090 samples/sec	11 samples/sec/watt	1.57	1x H200	DGX H200	25.01-py3	Mixed	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
	32	8,204 samples/sec	12 samples/sec/watt	3.9	1x H200	DGX H200	25.01-py3	Mixed	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
HF Swin Large	8	3,382 samples/sec	6 samples/sec/watt	2.37	1x H200	DGX H200	25.01-py3	INT8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
	32	4,676 samples/sec	7 samples/sec/watt	6.84	1x H200	DGX H200	25.01-py3	INT8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
HF ViT Base	8	9,006 samples/sec	19 samples/sec/watt	0.89	1x H200	DGX H200	25.01-py3	FP8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
	64	15,640 samples/sec	23 samples/sec/watt	4.09	1x H200	DGX H200	25.01-py3	FP8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
HF ViT Large	8	3,439 samples/sec	6 samples/sec/watt	2.33	1x H200	DGX H200	25.01-py3	FP8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
	64	5,471 samples/sec	8 samples/sec/watt	11.7	1x H200	DGX H200	25.01-py3	FP8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
QuartzNet	8	6,741 samples/sec	25 samples/sec/watt	1.19	1x H200	DGX H200	25.01-py3	Mixed	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
	128	34,280 samples/sec	92 samples/sec/watt	3.73	1x H200	DGX H200	25.01-py3	INT8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200
RetinaNet-RN34	8	3,015 images/sec	8 images/sec/watt	2.65	1x H200	DGX H200	25.01-py3	INT8	Synthetic	TensorRT 10.8.0.40	NVIDIA H200

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
QuartzNet: Sequence Length = 256
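The Batch Size, Throughput, Latency, and Efficiency columns in the table above appear to be related by simple arithmetic: latency is close to batch size divided by throughput, and dividing throughput by efficiency gives an implied average power. The sketch below checks this against the two ResNet-50v1.5 rows. Both relationships are assumptions inferred from the published numbers, not a stated measurement method, and the efficiency column is rounded, so the implied power is approximate.

```python
def latency_ms(batch_size: int, throughput_per_sec: float) -> float:
    """Time to process one batch, in milliseconds, at the given throughput."""
    return batch_size / throughput_per_sec * 1000.0


def implied_power_watts(throughput_per_sec: float, efficiency_per_sec_per_watt: float) -> float:
    """Power implied by throughput and the (rounded) per-watt efficiency."""
    return throughput_per_sec / efficiency_per_sec_per_watt


# ResNet-50v1.5 rows from the H200 table above:
# (batch size, images/sec, images/sec/watt, published latency in ms)
resnet50_rows = [
    (8, 20801, 62, 0.38),
    (128, 65045, 107, 1.97),
]

for batch, throughput, efficiency, published_ms in resnet50_rows:
    computed_ms = latency_ms(batch, throughput)
    power_w = implied_power_watts(throughput, efficiency)
    print(f"batch {batch}: computed {computed_ms:.2f} ms "
          f"(published {published_ms} ms), implied power about {power_w:.0f} W")
```

For both rows the computed latency rounds to the published value, which is what makes the batch-size and latency trade-off in these tables easy to reason about.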

GH200 Inference Performance

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
Stable Diffusion v2.1 (512x512)	1	4.27 images/sec	-	234.4	1x GH200	NVIDIA P3880	24.09-py3	INT8	Synthetic	TensorRT 10.4.0.26	GH200 96GB
	4	5.82 images/sec	-	687.91	1x GH200	NVIDIA P3880	24.09-py3	INT8	Synthetic	TensorRT 10.4.0.26	GH200 96GB
Stable Diffusion XL	1	0.68 images/sec	-	1149.44	1x GH200	NVIDIA P3880	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	GH200 96GB
ResNet-50v1.5	8	21,533 images/sec	63 images/sec/watt	0.37	1x GH200	NVIDIA P3880	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	GH200 96GB
	128	63,043 images/sec	99 images/sec/watt	2.03	1x GH200	NVIDIA P3880	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	GH200 96GB
EfficientNet-B0	8	16,695 images/sec	67 images/sec/watt	0.48	1x GH200	NVIDIA P3880	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	GH200 96GB
	128	56,674 images/sec	113 images/sec/watt	2.26	1x GH200	NVIDIA P3880	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	GH200 96GB
EfficientNet-B4	8	4,531 images/sec	13 images/sec/watt	1.77	1x GH200	NVIDIA P3880	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	GH200 96GB
	128	8,784 images/sec	14 images/sec/watt	14.57	1x GH200	NVIDIA P3880	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	GH200 96GB
HF Swin Base	8	5,106 samples/sec	10 samples/sec/watt	1.57	1x GH200	NVIDIA P3880	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	GH200 96GB
	32	8,197 samples/sec	12 samples/sec/watt	3.9	1x GH200	NVIDIA P3880	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	GH200 96GB
HF Swin Large	8	3,403 samples/sec	6 samples/sec/watt	2.35	1x GH200	NVIDIA P3880	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	GH200 96GB
	32	4,846 samples/sec	6 samples/sec/watt	6.6	1x GH200	NVIDIA P3880	24.12-py3	Mixed	Synthetic	TensorRT 10.7.0	GH200 96GB
HF ViT Base	8	8,990 samples/sec	18 samples/sec/watt	0.89	1x GH200	NVIDIA P3880	24.12-py3	FP8	Synthetic	TensorRT 10.7.0	GH200 96GB
	64	15,562 samples/sec	21 samples/sec/watt	4.11	1x GH200	NVIDIA P3880	24.12-py3	FP8	Synthetic	TensorRT 10.7.0	GH200 96GB
HF ViT Large	8	3,707 samples/sec	6 samples/sec/watt	2.16	1x GH200	NVIDIA P3880	24.12-py3	FP8	Synthetic	TensorRT 10.7.0	GH200 96GB
	64	5,703 samples/sec	7 samples/sec/watt	11.22	1x GH200	NVIDIA P3880	24.12-py3	FP8	Synthetic	TensorRT 10.7.0	GH200 96GB
QuartzNet	8	6,688 samples/sec	22 samples/sec/watt	1.2	1x GH200	NVIDIA P3880	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	GH200 96GB
	128	34,272 samples/sec	85 samples/sec/watt	3.73	1x GH200	NVIDIA P3880	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	GH200 96GB
RetinaNet-RN34	8	2,945 images/sec	4 images/sec/watt	2.72	1x GH200	NVIDIA P3880	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	GH200 96GB

H100 Inference Performance

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
Stable Diffusion v2.1 (512x512)	1	4.22 images/sec	-	236.8	1x H100	DGX H100	24.10-py3	INT8	Synthetic	TensorRT 10.5.0.26	H100-SXM5-80GB
	4	6.41 images/sec	-	624.6	1x H100	DGX H100	24.10-py3	INT8	Synthetic	TensorRT 10.5.0.26	H100-SXM5-80GB
Stable Diffusion XL	1	0.83 images/sec	-	1210.08	1x H100	DGX H100	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	H100-SXM5-80GB
ResNet-50v1.5	8	21,588 images/sec	63 images/sec/watt	0.37	1x H100	DGX H100	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
	128	59,535 images/sec	99 images/sec/watt	2.15	1x H100	DGX H100	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
EfficientNet-B0	8	16,351 images/sec	67 images/sec/watt	0.49	1x H100	DGX H100	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
	128	55,498 images/sec	116 images/sec/watt	2.31	1x H100	DGX H100	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
EfficientNet-B4	8	4,550 images/sec	12 images/sec/watt	1.76	1x H100	DGX H100	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
	128	8,144 images/sec	15 images/sec/watt	15.72	1x H100	DGX H100	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
HF Swin Base	8	5,072 samples/sec	9 samples/sec/watt	1.58	1x H100	DGX H100	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
	32	7,706 samples/sec	11 samples/sec/watt	4.15	1x H100	DGX H100	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
HF Swin Large	8	3,299 samples/sec	6 samples/sec/watt	2.42	1x H100	DGX H100	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
	32	4,463 samples/sec	7 samples/sec/watt	7.17	1x H100	DGX H100	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
HF ViT Base	8	9,078 samples/sec	17 samples/sec/watt	0.88	1x H100	DGX H100	25.02-py3	FP8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
	64	15,210 samples/sec	22 samples/sec/watt	4.21	1x H100	DGX H100	25.02-py3	FP8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
HF ViT Large	8	3,440 samples/sec	6 samples/sec/watt	2.33	1x H100	DGX H100	25.02-py3	FP8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
	64	5,363 samples/sec	8 samples/sec/watt	11.93	1x H100	DGX H100	25.02-py3	FP8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
QuartzNet	8	6,767 samples/sec	22 samples/sec/watt	1.18	1x H100	DGX H100	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
	128	35,389 samples/sec	77 samples/sec/watt	3.62	1x H100	DGX H100	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB
RetinaNet-RN34	8	2,827 images/sec	8 images/sec/watt	2.83	1x H100	DGX H100	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	H100-SXM5-80GB

L40S Inference Performance

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
Stable Diffusion v2.1 (512x512)	1	2.49 images/sec	-	401.48	1x L40S	Supermicro SYS-521GE-TNRT	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA L40S
	4	2.91 images/sec	-	1372.72	1x L40S	Supermicro SYS-521GE-TNRT	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA L40S
Stable Diffusion XL	1	0.37 images/sec	-	2678.19	1x L40S	Supermicro SYS-521GE-TNRT	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA L40S
ResNet-50v1.5	8	23,472 images/sec	78 images/sec/watt	0.34	1x L40S	Supermicro SYS-521GE-TNRT	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	NVIDIA L40S
	32	37,069 images/sec	109 images/sec/watt	0.86	1x L40S	Supermicro SYS-521GE-TNRT	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	NVIDIA L40S
BERT-BASE	8	8,412 sequences/sec	26 sequences/sec/watt	0.95	1x L40S	Supermicro SYS-521GE-TNRT	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	NVIDIA L40S
	128	13,169 sequences/sec	38 sequences/sec/watt	9.72	1x L40S	Supermicro SYS-521GE-TNRT	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	NVIDIA L40S
BERT-LARGE	8	3,188 sequences/sec	10 sequences/sec/watt	2.51	1x L40S	Supermicro SYS-521GE-TNRT	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	NVIDIA L40S
	24	4,034 sequences/sec	12 sequences/sec/watt	31.73	1x L40S	Supermicro SYS-521GE-TNRT	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	NVIDIA L40S
EfficientDet-D0	8	4,696 images/sec	17 images/sec/watt	1.7	1x L40S	Supermicro SYS-521GE-TNRT	24.11-py3	INT8	Synthetic	TensorRT 10.6.0.26	NVIDIA L40S
EfficientNet-B0	8	20,534 images/sec	106 images/sec/watt	0.39	1x L40S	Supermicro SYS-521GE-TNRT	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	NVIDIA L40S
	32	41,526 images/sec	140 images/sec/watt	0.77	1x L40S	Supermicro SYS-521GE-TNRT	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA L40S
EfficientNet-B4	8	5,149 images/sec	17 images/sec/watt	1.55	1x L40S	Supermicro SYS-521GE-TNRT	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	NVIDIA L40S
	16	6,116 images/sec	18 images/sec/watt	2.62	1x L40S	Supermicro SYS-521GE-TNRT	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	NVIDIA L40S
HF Swin Base	8	3,843 samples/sec	11 samples/sec/watt	2.08	1x L40S	Supermicro SYS-521GE-TNRT	24.12-py3	INT8	Synthetic	TensorRT 10.7.0.23	NVIDIA L40S
	16	4,266 samples/sec	12 samples/sec/watt	7.5	1x L40S	Supermicro SYS-521GE-TNRT	24.11-py3	INT8	Synthetic	TensorRT 10.6.0.26	NVIDIA L40S
HF Swin Large	8	1,932 samples/sec	6 samples/sec/watt	4.14	1x L40S	Supermicro SYS-521GE-TNRT	24.11-py3	Mixed	Synthetic	TensorRT 10.6.0	NVIDIA L40S
	16	2,141 samples/sec	6 samples/sec/watt	7.47	1x L40S	Supermicro SYS-521GE-TNRT	24.11-py3	INT8	Synthetic	TensorRT 10.6.0	NVIDIA L40S
HF ViT Base	8	5,799 samples/sec	17 samples/sec/watt	1.38	1x L40S	Supermicro SYS-521GE-TNRT	24.11-py3	FP8	Synthetic	TensorRT 10.6.0	NVIDIA L40S
HF ViT Large	8	1,926 samples/sec	6 samples/sec/watt	4.15	1x L40S	Supermicro SYS-521GE-TNRT	24.11-py3	FP8	Synthetic	TensorRT 10.6.0	NVIDIA L40S
Megatron BERT Large QAT	8	4,213 sequences/sec	13 sequences/sec/watt	1.9	1x L40S	Supermicro SYS-521GE-TNRT	24.11-py3	INT8	Synthetic	TensorRT 10.6.0	NVIDIA L40S
	24	5,097 sequences/sec	15 sequences/sec/watt	4.71	1x L40S	Supermicro SYS-521GE-TNRT	24.11-py3	INT8	Synthetic	TensorRT 10.6.0	NVIDIA L40S
QuartzNet	8	7,643 samples/sec	32 samples/sec/watt	1.05	1x L40S	Supermicro SYS-521GE-TNRT	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	NVIDIA L40S
	128	22,595 samples/sec	65 samples/sec/watt	5.66	1x L40S	Supermicro SYS-521GE-TNRT	24.12-py3	INT8	Synthetic	TensorRT 10.7.0.23	NVIDIA L40S
RetinaNet-RN34	8	1,463 images/sec	7 images/sec/watt	5.47	1x L40S	Supermicro SYS-521GE-TNRT	24.12-py3	INT8	Synthetic	TensorRT 10.7.0.23	NVIDIA L40S

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256
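
For many rows in these tables, the reported latency appears to be close to batch size divided by throughput. The source does not state this relationship, so the following Python sketch treats it as an assumption and checks it against two ResNet-50v1.5 rows from the L40S table above. The function name `estimated_latency_ms` is hypothetical and used only for illustration.

```python
def estimated_latency_ms(batch_size: int, throughput_per_sec: float) -> float:
    """Estimate per-batch latency in milliseconds as batch_size / throughput.

    Assumption (not stated in the source): one batch is in flight at a time,
    so the time per batch is the batch size divided by items per second.
    """
    if throughput_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return batch_size / throughput_per_sec * 1000.0


# (batch size, throughput in images/sec, reported latency in ms),
# copied from the ResNet-50v1.5 rows of the L40S table.
resnet50_l40s = [
    (8, 23_472, 0.34),
    (32, 37_069, 0.86),
]

for batch, throughput, reported in resnet50_l40s:
    estimate = estimated_latency_ms(batch, throughput)
    print(f"batch={batch:>3}  estimated={estimate:.2f} ms  reported={reported:.2f} ms")
```

Rows where the estimate and the reported latency diverge noticeably may have been measured under different conditions and are worth re-checking against the original data before drawing conclusions.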

L4 Inference Performance

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
Stable Diffusion v2.1 (512x512)	1	0.82 images/sec	-	1221.73	1x L4	GIGABYTE G482-Z54-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA L4
Stable Diffusion XL	1	0.11 images/sec	-	9098.4	1x L4	GIGABYTE G482-Z54-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA L4
ResNet-50v1.5	8	9,649 images/sec	134 images/sec/watt	0.83	1x L4	GIGABYTE G482-Z54-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA L4
	32	10,101 images/sec	111 images/sec/watt	16.27	1x L4	GIGABYTE G482-Z54-00	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	NVIDIA L4
BERT-BASE	8	3,323 sequences/sec	46 sequences/sec/watt	2.41	1x L4	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA L4
	24	4,052 sequences/sec	56 sequences/sec/watt	5.92	1x L4	GIGABYTE G482-Z54-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA L4
BERT-LARGE	8	1,081 sequences/sec	15 sequences/sec/watt	7.4	1x L4	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA L4
	13	1,314 sequences/sec	19 sequences/sec/watt	9.9	1x L4	GIGABYTE G482-Z54-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA L4
EfficientNet-B4	8	1,844 images/sec	26 images/sec/watt	4.34	1x L4	GIGABYTE G482-Z54-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA L4
HF Swin Base	8	1,221 samples/sec	17 samples/sec/watt	6.55	1x L4	GIGABYTE G482-Z54-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA L4
HF Swin Large	8	621 samples/sec	9 samples/sec/watt	12.89	1x L4	GIGABYTE G482-Z54-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA L4
HF ViT Base	16	1,844 samples/sec	26 samples/sec/watt	4.34	1x L4	GIGABYTE G482-Z54-00	25.02-py3	FP8	Synthetic	TensorRT 10.8.0.43	NVIDIA L4
HF ViT Large	8	617 samples/sec	9 samples/sec/watt	12.96	1x L4	GIGABYTE G482-Z54-00	25.02-py3	FP8	Synthetic	TensorRT 10.8.0.43	NVIDIA L4
Megatron BERT Large QAT	24	1,789 sequences/sec	25 sequences/sec/watt	13.42	1x L4	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA L4
QuartzNet	8	3,886 samples/sec	54 samples/sec/watt	2.06	1x L4	GIGABYTE G482-Z54-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA L4
	128	6,144 samples/sec	85 samples/sec/watt	20.83	1x L4	GIGABYTE G482-Z54-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA L4
RetinaNet-RN34	8	355 images/sec	5 images/sec/watt	22.51	1x L4	GIGABYTE G482-Z54-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA L4

512x512 image size, 50 denoising steps for Stable Diffusion v2.1
BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

A40 Inference Performance

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
ResNet-50v1.5	8	11,177 images/sec	40 images/sec/watt	0.72	1x A40	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
	128	15,473 images/sec	52 images/sec/watt	8.27	1x A40	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
BERT-BASE	8	4,257 sequences/sec	15 sequences/sec/watt	1.88	1x A40	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA A40
	128	5,667 sequences/sec	19 sequences/sec/watt	22.59	1x A40	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA A40
BERT-LARGE	8	1,573 sequences/sec	5 sequences/sec/watt	5.08	1x A40	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA A40
	128	1,966 sequences/sec	7 sequences/sec/watt	65.11	1x A40	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA A40
EfficientNet-B0	8	11,130 images/sec	61 images/sec/watt	0.72	1x A40	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
	128	20,078 images/sec	67 images/sec/watt	6.38	1x A40	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
EfficientNet-B4	8	2,145 images/sec	8 images/sec/watt	3.73	1x A40	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
	128	2,689 images/sec	9 images/sec/watt	47.59	1x A40	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
HF Swin Base	8	1,697 samples/sec	6 samples/sec/watt	4.71	1x A40	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
	32	1,842 samples/sec	6 samples/sec/watt	17.38	1x A40	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
HF Swin Large	8	959 samples/sec	3 samples/sec/watt	8.34	1x A40	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
	32	1,010 samples/sec	3 samples/sec/watt	31.68	1x A40	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
HF ViT Base	8	2,175 samples/sec	7 samples/sec/watt	3.68	1x A40	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
	64	2,324 samples/sec	8 samples/sec/watt	27.54	1x A40	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
HF ViT Large	8	694 samples/sec	2 samples/sec/watt	11.53	1x A40	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
	64	750 samples/sec	2 samples/sec/watt	85.34	1x A40	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
Megatron BERT Large QAT	8	2,059 sequences/sec	7 sequences/sec/watt	3.89	1x A40	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA A40
	128	2,650 sequences/sec	9 sequences/sec/watt	48.31	1x A40	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA A40
QuartzNet	8	4,388 samples/sec	21 samples/sec/watt	1.82	1x A40	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
	128	8,453 samples/sec	28 samples/sec/watt	15.14	1x A40	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A40
RetinaNet-RN34	8	706 images/sec	2 images/sec/watt	11.34	1x A40	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A40

BERT Base: Sequence Length = 128 | BERT Large: Sequence Length = 128
HF Swin Base: Input Image Size = 224x224 | Window Size = 224x224. HF Swin Large: Input Image Size = 224x224 | Window Size = 384x384
HF ViT Base: Input Image Size = 224x224 | Patch Size = 224x224. HF ViT Large: Input Image Size = 224x224 | Patch Size = 384x384
Megatron BERT Large QAT: Sequence Length = 128
QuartzNet: Sequence Length = 256

A30 Inference Performance

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
ResNet-50v1.5	8	10,261 images/sec	71 images/sec/watt	0.78	1x A30	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
	128	16,465 images/sec	101 images/sec/watt	7.77	1x A30	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
BERT-BASE	8	4,334 sequences/sec	26 sequences/sec/watt	1.85	1x A30	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA A30
	128	5,820 sequences/sec	35 sequences/sec/watt	21.99	1x A30	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA A30
BERT-LARGE	8	1,500 sequences/sec	10 sequences/sec/watt	5.33	1x A30	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA A30
	128	2,053 sequences/sec	13 sequences/sec/watt	62.34	1x A30	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA A30
EfficientNet-B0	8	8,993 images/sec	81 images/sec/watt	0.89	1x A30	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
	128	17,119 images/sec	105 images/sec/watt	7.48	1x A30	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
EfficientNet-B4	8	1,875 images/sec	13 images/sec/watt	4.27	1x A30	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
	128	2,397 images/sec	15 images/sec/watt	53.4	1x A30	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
HF Swin Base	8	1,646 samples/sec	10 samples/sec/watt	4.86	1x A30	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
	32	1,851 samples/sec	11 samples/sec/watt	17.28	1x A30	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
HF Swin Large	8	907 samples/sec	6 samples/sec/watt	8.82	1x A30	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
	32	1,000 samples/sec	6 samples/sec/watt	32	1x A30	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
HF ViT Base	8	2,058 samples/sec	13 samples/sec/watt	3.89	1x A30	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
	64	2,271 samples/sec	14 samples/sec/watt	28.18	1x A30	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
HF ViT Large	8	675 samples/sec	4 samples/sec/watt	11.86	1x A30	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
	64	708 samples/sec	4 samples/sec/watt	90.34	1x A30	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
QuartzNet	8	3,434 samples/sec	29 samples/sec/watt	2.33	1x A30	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
	128	9,997 samples/sec	73 samples/sec/watt	12.8	1x A30	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
RetinaNet-RN34	8	703 images/sec	4 images/sec/watt	11.39	1x A30	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A30
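
In these tables, a row with an empty Network column continues the network named in the row above it. The Python sketch below shows one way to load such rows for analysis, assuming the columns are tab-separated as on this page. The helper name `parse_rows` and the dictionary keys are hypothetical, and only the first five columns are read.

```python
import csv
import io

# Two rows copied from the A30 table. "\t" separates the columns, and the
# second row leaves the Network column empty.
SAMPLE = (
    "ResNet-50v1.5\t8\t10,261 images/sec\t71 images/sec/watt\t0.78\n"
    "\t128\t16,465 images/sec\t101 images/sec/watt\t7.77\n"
)


def parse_rows(text):
    """Parse tab-separated rows, carrying the network name down to blank cells."""
    rows = []
    current_network = None
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if not fields:
            continue  # skip empty lines
        # A blank first column means "same network as the previous row".
        network = fields[0].strip() or current_network
        current_network = network
        # "10,261 images/sec" -> 10261.0
        throughput = float(fields[2].split()[0].replace(",", ""))
        rows.append({
            "network": network,
            "batch_size": int(fields[1]),
            "throughput": throughput,
            "latency_ms": float(fields[4]),
        })
    return rows


for row in parse_rows(SAMPLE):
    print(row)
```

Rows whose Efficiency column is "-" carry no numeric value there, which is one reason this sketch leaves that column unparsed.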

A10 Inference Performance

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
ResNet-50v1.5	8	8,499 images/sec	57 images/sec/watt	0.94	1x A10	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
	128	10,654 images/sec	71 images/sec/watt	12.01	1x A10	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
BERT-BASE	8	3,109 sequences/sec	21 sequences/sec/watt	2.57	1x A10	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA A10
	128	3,822 sequences/sec	26 sequences/sec/watt	33.49	1x A10	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.5.0	NVIDIA A10
BERT-LARGE	8	1,086 sequences/sec	7 sequences/sec/watt	7.36	1x A10	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.6.0	NVIDIA A10
	128	1,265 sequences/sec	8 sequences/sec/watt	101.17	1x A10	GIGABYTE G482-Z52-00	24.10-py3	INT8	Synthetic	TensorRT 10.6.0	NVIDIA A10
EfficientNet-B0	8	9,679 images/sec	65 images/sec/watt	0.83	1x A10	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
	128	14,418 images/sec	96 images/sec/watt	8.88	1x A10	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
EfficientNet-B4	8	1,633 images/sec	11 images/sec/watt	4.9	1x A10	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
	128	1,863 images/sec	12 images/sec/watt	68.72	1x A10	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
HF Swin Base	8	1,214 samples/sec	8 samples/sec/watt	6.59	1x A10	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
	32	1,258 samples/sec	8 samples/sec/watt	25.44	1x A10	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
HF Swin Large	8	623 samples/sec	4 samples/sec/watt	12.84	1x A10	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
	32	656 samples/sec	4 samples/sec/watt	48.75	1x A10	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
HF ViT Base	8	1,370 samples/sec	9 samples/sec/watt	5.84	1x A10	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
	64	1,503 samples/sec	10 samples/sec/watt	42.59	1x A10	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
HF ViT Large	8	453 samples/sec	3 samples/sec/watt	17.68	1x A10	GIGABYTE G482-Z52-00	25.02-py3	Mixed	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
Megatron BERT Large QAT	8	1,566 sequences/sec	10 sequences/sec/watt	5.11	1x A10	GIGABYTE G482-Z52-00	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	NVIDIA A10
	128	1,801 sequences/sec	12 sequences/sec/watt	71.06	1x A10	GIGABYTE G482-Z52-00	24.12-py3	INT8	Synthetic	TensorRT 10.7.0	NVIDIA A10
QuartzNet	8	3,842 samples/sec	26 samples/sec/watt	2.08	1x A10	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
	128	5,867 samples/sec	39 samples/sec/watt	21.82	1x A10	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
RetinaNet-RN34	8	516 images/sec	4 images/sec/watt	15.5	1x A10	GIGABYTE G482-Z52-00	25.02-py3	INT8	Synthetic	TensorRT 10.8.0.43	NVIDIA A10
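
The Efficiency column is expressed as throughput per watt, so dividing throughput by efficiency gives a rough implied power draw. The source does not define how power was measured, and the efficiency values are rounded, so the following Python sketch is only an approximation under that assumption. It uses the batch-size-8 ResNet-50v1.5 rows from the A40, A30, and A10 tables above, and the function name `implied_power_watts` is hypothetical.

```python
def implied_power_watts(throughput_per_sec: float, efficiency_per_watt: float) -> float:
    """Approximate power draw as throughput divided by throughput-per-watt.

    Assumption (not confirmed by the source): the Efficiency column equals
    throughput divided by average power during the run.
    """
    if efficiency_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return throughput_per_sec / efficiency_per_watt


# (throughput in images/sec, efficiency in images/sec/watt) for
# ResNet-50v1.5 at batch size 8, copied from the tables above.
resnet50_batch8 = {
    "A40": (11_177, 40),
    "A30": (10_261, 71),
    "A10": (8_499, 57),
}

for gpu, (throughput, efficiency) in resnet50_batch8.items():
    watts = implied_power_watts(throughput, efficiency)
    print(f"{gpu}: roughly {watts:.0f} W implied")
```

Because the efficiency figures are rounded to whole numbers, the implied wattage can be off by a few percent, especially for rows with single-digit efficiency values.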

Inference Performance of NVIDIA GPUs in the Cloud

A100 Inference Performance in the Cloud

Network	Batch Size	Throughput	Efficiency	Latency (ms)	GPU	Server	Container	Precision	Dataset	Framework	GPU Version
ResNet-50v1.5	8	13,768 images/sec	-	0.58	1x A100	GCP A2-HIGHGPU-1G	23.10-py3	INT8	Synthetic	-	A100-SXM4-40GB
	128	30,338 images/sec	-	4.22	1x A100	GCP A2-HIGHGPU-1G	23.10-py3	INT8	Synthetic	-	A100-SXM4-40GB
BERT-LARGE	8	2,308 sequences/sec	-	3.47	1x A100	GCP A2-HIGHGPU-1G	23.10-py3	INT8	Synthetic	-	A100-SXM4-40GB
	128	4,045 sequences/sec	-	31.64	1x A100	GCP A2-HIGHGPU-1G	23.10-py3	INT8	Synthetic	-	A100-SXM4-40GB

BERT-Large: Sequence Length = 128


Training to Convergence

Deploying AI in real-world applications requires training networks to convergence at a specified accuracy. This is the best methodology to test whether AI systems are ready to be deployed in the field to deliver meaningful results.


AI Pipeline

NVIDIA Riva is an application framework for multimodal conversational AI services that deliver real-time performance on GPUs.
