NVIDIA today announced a new AI software stack, TensorRT-LLM, which boosts large language model inference performance across its GPUs.
NVIDIA TensorRT-LLM Delivers Up To 8x Gain In Large Language Model Performance On Hopper AI GPUs
NVIDIA's TensorRT-LLM is a highly optimized, open-source library that delivers the fastest inference performance for large language models on NVIDIA AI GPUs such as Hopper. NVIDIA has worked with the open-source LLM community to optimize the library for its GPUs, utilizing the latest AI kernels and cutting-edge techniques such as SmoothQuant, FlashAttention & fMHA. The open-source release also includes ready-to-run, inference-optimized versions of state-of-the-art LLMs such as GPT-3 (175B), Llama, Falcon (180B), & Bloom, just to name a few.
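To illustrate the SmoothQuant technique mentioned above, here is a minimal NumPy sketch of its core idea (not NVIDIA's implementation): activation outliers are migrated into the weights via a per-channel scale, so both tensors become easier to quantize to 8 bits while the matrix product stays mathematically identical. All shapes and values below are illustrative.

```python
import numpy as np

def smooth(acts, weights, alpha=0.5):
    """Rescale activations/weights per channel; the matmul result is unchanged."""
    act_max = np.abs(acts).max(axis=0)       # per-input-channel activation magnitude
    w_max = np.abs(weights).max(axis=1)      # per-input-channel weight magnitude
    # The smoothing scale balances the two dynamic ranges (alpha trades them off).
    scale = act_max**alpha / w_max**(1 - alpha)
    return acts / scale, weights * scale[:, None]

rng = np.random.default_rng(0)
# Activations with one outlier channel (a pattern common in real LLMs).
X = rng.normal(size=(4, 8)) * np.array([1, 1, 1, 50, 1, 1, 1, 1])
W = rng.normal(size=(8, 16))

X_s, W_s = smooth(X, W)
assert np.allclose(X @ W, X_s @ W_s)        # product preserved up to float error
print(np.abs(X).max(), np.abs(X_s).max())   # the outlier is tamed for INT8
```

Because `X @ W == (X / s) @ (diag(s) W)` exactly, the rescaling is free at inference time once the scale is folded into the weights.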
TensorRT-LLM also performs automatic parallelization across multiple servers connected over NVLink and InfiniBand. Previously, a large language model had to be manually partitioned across multiple servers/GPUs; with TensorRT-LLM, that is no longer necessary.
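As a conceptual sketch (not the TensorRT-LLM API) of the tensor parallelism the library now sets up automatically: a linear layer's weight matrix is split column-wise across devices, each device multiplies its own shard locally, and the partial outputs are gathered, the step that runs over NVLink/InfiniBand in a real multi-GPU deployment.

```python
import numpy as np

def column_parallel_matmul(x, w, num_devices):
    shards = np.array_split(w, num_devices, axis=1)  # one weight shard per "GPU"
    partials = [x @ shard for shard in shards]       # local compute on each device
    return np.concatenate(partials, axis=1)          # all-gather of the outputs

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 64))     # activations
w = rng.normal(size=(64, 256))   # full weight matrix

out = column_parallel_matmul(x, w, num_devices=4)
assert np.allclose(out, x @ w)   # sharded result matches the single-device matmul
```

The point of automating this is that choosing the split axes and communication pattern per layer is exactly the manual work users previously had to do themselves.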
One of the biggest updates that TensorRT-LLM brings is a new scheduler known as in-flight batching, which allows work to enter and exit the GPU independently of other tasks. It enables dynamic processing of several smaller queries while large, compute-intensive requests are still running on the same GPU. This keeps the GPU better utilized and delivers large throughput gains on GPUs such as the H100, up to 2x.
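A toy model of the in-flight (continuous) batching idea described above, with illustrative names and numbers: finished sequences leave the batch after every decode iteration and queued requests immediately take the freed slots, instead of the whole batch draining before new work starts.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    tokens_left: int   # decode iterations this request still needs

def run_inflight(requests, max_batch=4):
    """Return the number of decode iterations needed to serve all requests."""
    queue, active, steps = deque(requests), [], 0
    while queue or active:
        # Refill free batch slots as soon as they open (the "in-flight" part).
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        for r in active:            # one decode iteration for every active sequence
            r.tokens_left -= 1
        active = [r for r in active if r.tokens_left > 0]
        steps += 1
    return steps

# One long request no longer stalls the short ones queued behind it:
reqs = [Request(0, 10)] + [Request(i, 2) for i in range(1, 5)]
print(run_inflight(reqs, max_batch=4))  # static batching would need 12 iterations here
```

In this example the fifth request slips into the slots freed by the three short requests while the long one is still decoding, which is where the throughput gain comes from.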
The TensorRT-LLM stack is also optimized around the Hopper Transformer Engine and its FP8 compute capabilities. The library offers automatic FP8 conversion, a DL compiler for kernel fusion, & a mixed-precision optimizer, along with support for the SmoothQuant algorithm, enabling 8-bit quantization without accuracy loss.
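For a rough sense of what FP8 conversion involves, here is a simplified NumPy simulation of per-tensor scaled FP8 (E4M3) quantization, the kind of cast the Transformer Engine performs in hardware. This sketch ignores subnormals and exact exponent limits and only models the ±448 range and the 3-bit mantissa that dominate rounding error; it is not NVIDIA's implementation.

```python
import numpy as np

def quantize_e4m3(x):
    x = np.clip(x, -448.0, 448.0)   # 448 is the largest E4M3 normal value
    m, e = np.frexp(x)              # x = m * 2**e with 0.5 <= |m| < 1
    m = np.round(m * 16) / 16       # keep 3 mantissa bits after the implicit leading bit
    return np.ldexp(m, e)

def fp8_quantize(x):
    """Scale a tensor into FP8 range, quantize, and scale back."""
    scale = 448.0 / np.abs(x).max()
    return quantize_e4m3(x * scale) / scale, scale

x = np.linspace(-3.0, 3.0, 101)
q, scale = fp8_quantize(x)
# Relative rounding error is bounded by 2**-4 for this mantissa width.
assert np.all(np.abs(q - x) <= np.abs(x) / 16 + 1e-9)
```

With only 3 mantissa bits, FP8 leans heavily on good scaling factors, which is why automatic conversion and calibration matter.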
Coming to the performance figures, NVIDIA compares the A100 against the H100 as of August and against the H100 running TensorRT-LLM. In GPT-J 6B inference, the H100 already offered a 4x gain over the A100; with TensorRT-LLM, performance doubles again, for an 8x gain in this specific test. In Llama 2, TensorRT-LLM delivers up to a 5x gain over the A100 and nearly a 2x gain over the standard H100 without TensorRT-LLM.
NVIDIA states that it is working with leading companies running inference workloads, such as Meta, Grammarly, Deci, and Anyscale, to accelerate their LLMs using TensorRT-LLM. As for availability, TensorRT-LLM is in early access now, with a full release expected next month. As for support, TensorRT-LLM will run on all NVIDIA data center & AI GPUs in production today, including the A100, H100, L4, L40, L40S, HGX, and Grace Hopper.