NVIDIA today announced a new AI software stack, TensorRT-LLM, which boosts large language model inference performance across its GPUs.
NVIDIA TensorRT-LLM Delivers Up To 8x Gain In Large Language Model Performance On Hopper AI GPUs
NVIDIA's TensorRT-LLM is a highly optimized, open-source library that delivers the fastest inference performance for large language models on NVIDIA AI GPUs such as Hopper. NVIDIA has worked with the open-source LLM community to optimize the library for its GPUs, utilizing the latest AI kernels and cutting-edge techniques such as SmoothQuant, FlashAttention & fMHA. The open-source release also includes ready-to-run, inference-optimized versions of state-of-the-art LLMs such as GPT-3 (175B), Llama, Falcon (180B), & Bloom, just to name a few.
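To illustrate the SmoothQuant technique mentioned above, here is a minimal NumPy sketch of its core idea (not NVIDIA's implementation): activation outliers are migrated into the weights via a per-channel scale, so both tensors become easier to quantize to 8 bits while the matrix product stays mathematically identical. All shapes and values below are illustrative.

```python
import numpy as np

def smooth(acts, weights, alpha=0.5):
    """Rescale activations/weights per channel; the matmul result is unchanged."""
    act_max = np.abs(acts).max(axis=0)       # per-input-channel activation magnitude
    w_max = np.abs(weights).max(axis=1)      # per-input-channel weight magnitude
    # The smoothing scale balances the two dynamic ranges (alpha trades them off).
    scale = act_max**alpha / w_max**(1 - alpha)
    return acts / scale, weights * scale[:, None]

rng = np.random.default_rng(0)
# Activations with one outlier channel (a pattern common in real LLMs).
X = rng.normal(size=(4, 8)) * np.array([1, 1, 1, 50, 1, 1, 1, 1])
W = rng.normal(size=(8, 16))

X_s, W_s = smooth(X, W)
assert np.allclose(X @ W, X_s @ W_s)        # product preserved up to float error
print(np.abs(X).max(), np.abs(X_s).max())   # the outlier is tamed for INT8
```

Because `X @ W == (X / s) @ (diag(s) W)` exactly, the rescaling is free at inference time once the scale is folded into the weights.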
TensorRT-LLM also performs automatic parallelization across multiple servers connected over NVLink and InfiniBand. Previously, a large language model had to be manually partitioned across multiple servers/GPUs; with TensorRT-LLM, that is no longer necessary.
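As a conceptual sketch (not the TensorRT-LLM API) of the tensor parallelism the library now sets up automatically: a linear layer's weight matrix is split column-wise across devices, each device multiplies its own shard locally, and the partial outputs are gathered, the step that runs over NVLink/InfiniBand in a real multi-GPU deployment.

```python
import numpy as np

def column_parallel_matmul(x, w, num_devices):
    shards = np.array_split(w, num_devices, axis=1)  # one weight shard per "GPU"
    partials = [x @ shard for shard in shards]       # local compute on each device
    return np.concatenate(partials, axis=1)          # all-gather of the outputs

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 64))     # activations
w = rng.normal(size=(64, 256))   # full weight matrix

out = column_parallel_matmul(x, w, num_devices=4)
assert np.allclose(out, x @ w)   # sharded result matches the single-device matmul
```

The point of automating this is that choosing the split axes and communication pattern per layer is exactly the manual work users previously had to do themselves.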
One of the biggest updates that TensorRT-LLM brings is a new scheduler known as in-flight batching, which allows work to enter and exit the GPU independently of other tasks. It enables dynamic processing of several smaller queries while large, compute-intensive requests are still running on the same GPU. This keeps the GPU better utilized and delivers large throughput gains on GPUs such as the H100, up to 2x.
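A toy model of the in-flight (continuous) batching idea described above, with illustrative names and numbers: finished sequences leave the batch after every decode iteration and queued requests immediately take the freed slots, instead of the whole batch draining before new work starts.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    tokens_left: int   # decode iterations this request still needs

def run_inflight(requests, max_batch=4):
    """Return the number of decode iterations needed to serve all requests."""
    queue, active, steps = deque(requests), [], 0
    while queue or active:
        # Refill free batch slots as soon as they open (the "in-flight" part).
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        for r in active:            # one decode iteration for every active sequence
            r.tokens_left -= 1
        active = [r for r in active if r.tokens_left > 0]
        steps += 1
    return steps

# One long request no longer stalls the short ones queued behind it:
reqs = [Request(0, 10)] + [Request(i, 2) for i in range(1, 5)]
print(run_inflight(reqs, max_batch=4))  # static batching would need 12 iterations here
```

In this example the fifth request slips into the slots freed by the three short requests while the long one is still decoding, which is where the throughput gain comes from.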
The TensorRT-LLM stack is also optimized around the Hopper Transformer Engine and its FP8 compute capabilities. The library offers automatic FP8 conversion, a DL compiler for kernel fusion, & a mixed-precision optimizer, along with support for the SmoothQuant algorithm, enabling 8-bit quantization without accuracy loss.
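For a rough sense of what FP8 conversion involves, here is a simplified NumPy simulation of per-tensor scaled FP8 (E4M3) quantization, the kind of cast the Transformer Engine performs in hardware. This sketch ignores subnormals and exact exponent limits and only models the ±448 range and the 3-bit mantissa that dominate rounding error; it is not NVIDIA's implementation.

```python
import numpy as np

def quantize_e4m3(x):
    x = np.clip(x, -448.0, 448.0)   # 448 is the largest E4M3 normal value
    m, e = np.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16) / 16       # keep 3 mantissa bits after the implicit leading bit
    return np.ldexp(m, e)

def fp8_quantize(x):
    """Scale a tensor into FP8 range, quantize, and scale back."""
    scale = 448.0 / np.abs(x).max()
    return quantize_e4m3(x * scale) / scale, scale

x = np.linspace(-3.0, 3.0, 101)
q, scale = fp8_quantize(x)
# Relative rounding error is bounded by 2**-4 for this mantissa width.
assert np.all(np.abs(q - x) <= np.abs(x) / 16 + 1e-9)
```

With only 3 mantissa bits, FP8 leans heavily on good scaling factors, which is why automatic conversion and calibration matter.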
Coming to the performance figures, NVIDIA compares the A100 against the H100 as of August and against the H100 running TensorRT-LLM. In GPT-J 6B inference, the H100 already offered a 4x gain over the A100; with TensorRT-LLM, performance doubles again, for an 8x gain in this specific test. In Llama 2, TensorRT-LLM delivers up to a 5x gain over the A100 and nearly a 2x gain over the standard H100 without TensorRT-LLM.
NVIDIA states that it is working with leading companies running inference workloads, such as Meta, Grammarly, Deci, and Anyscale, to accelerate their LLMs using TensorRT-LLM. As for availability, TensorRT-LLM is in early access now, with a full release expected next month. As for support, TensorRT-LLM will run on all NVIDIA data center & AI GPUs in production today, including the A100, H100, L4, L40, L40S, HGX, and Grace Hopper.