NVIDIA has announced that TensorRT-LLM is coming to Windows soon and will bring a huge AI boost to PCs running RTX GPUs.
NVIDIA RTX GPU-Powered PCs To Get Free AI Performance Boost In Windows With Upcoming TensorRT-LLM Support
Back in September, NVIDIA announced its TensorRT-LLM library for data centers, which offered up to an 8x gain on the industry's top AI GPUs such as the Hopper H100 and the Ampere A100. Taking full advantage of the tensor-core acceleration featured on NVIDIA's GeForce RTX & RTX Pro GPUs, the latest release will deliver up to 4x faster performance in LLM inferencing workloads.
As we explained earlier, one of the biggest updates TensorRT-LLM brings is a new scheduler known as in-flight batching, which allows work to enter and exit the GPU independently of other tasks. The GPU can dynamically process several smaller queries while handling large, compute-intensive requests at the same time. TensorRT-LLM makes use of optimized open-source models that allow for higher speedups as batch sizes increase. Starting today, these optimized open-source models are available to the public for download at developer.nvidia.com.
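To make the scheduling idea concrete, here is a minimal toy simulation of in-flight (continuous) batching. It is not NVIDIA's implementation; the function name, the step-counting, and the request tuples are all illustrative assumptions. The point it demonstrates is the one described above: finished sequences exit the batch immediately and queued requests join at every decoding step, so short queries are not stalled behind a long one.

```python
from collections import deque

def in_flight_batching(requests, max_batch=4):
    """Toy simulation of in-flight batching (not NVIDIA's code).

    requests: list of (request_id, tokens_to_generate) tuples.
    Each loop iteration models one decoding step that produces one
    token for every active request. Finished requests leave the batch
    at once, and waiting requests fill the freed slots immediately,
    instead of waiting for the whole batch to drain (static batching).
    """
    queue = deque(requests)
    active = {}              # request_id -> tokens remaining
    completed = []
    steps = 0
    while queue or active:
        # Fill free batch slots with waiting requests at every step.
        while queue and len(active) < max_batch:
            rid, length = queue.popleft()
            active[rid] = length
        # One decoding step for every active request.
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]          # exits immediately, freeing its slot
                completed.append(rid)
        steps += 1
    return steps, completed

# Four short queries alongside one long request: the short ones finish
# and free their slots while the long request keeps generating.
steps, order = in_flight_batching(
    [("long", 8), ("a", 2), ("b", 2), ("c", 2), ("d", 2)]
)
```

With a static batch of 4, request "d" would have to wait until the entire first batch finished before starting; here it slots in as soon as the short requests complete, which is the independence between tasks the scheduler provides.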
The added AI acceleration from TensorRT-LLM will help drive daily productivity tasks such as engaging in chat, summarizing documents and web content, drafting emails and blogs, and analyzing data or generating large amounts of content from what is available to the model.
So how will TensorRT-LLM help consumer PCs running Windows? In a demo, NVIDIA compared an open-source pre-trained LLM such as LLaMa 2 with the same model accelerated by TensorRT-LLM. A base model like LLaMa 2 draws on a large, generalized dataset (such as Wikipedia), so it has no up-to-date information beyond its training cutoff and no knowledge of domain-specific datasets it was never trained on. It certainly won't know about any data stored on your personal devices or systems, so you won't get the specific answers you are looking for.
There are two approaches to solving this problem. One is fine-tuning, where the LLM is optimized around a specific dataset, which can take a lot of time depending on the dataset's size. The other is RAG, or Retrieval-Augmented Generation, which uses a localized library filled with the dataset you want the LLM to consult, then leverages the model's language-understanding capabilities to answer from that dataset alone.
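The RAG flow described above can be sketched in a few lines. This is a deliberately simplified illustration, not the demo's actual pipeline: the word-overlap scoring, function names, and sample documents are all assumptions (real systems typically retrieve with vector embeddings), but the two stages, retrieve from a local library and then augment the prompt, are the same.

```python
def retrieve(query, documents, k=2):
    """Toy retrieval step: score each local document by word overlap
    with the query and return the top-k matches. Production RAG uses
    embedding similarity, but the flow is identical."""
    q_words = set(query.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def build_prompt(query, documents):
    """Augment the user's question with retrieved passages so the LLM
    answers from the local dataset rather than only its training data."""
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical local library standing in for the article repository.
docs = [
    "Alan Wake 2 supports full ray tracing and DLSS 3.5 on GeForce RTX GPUs.",
    "The H100 data center GPU uses the Hopper architecture.",
    "RAG pairs a retriever with a generator model.",
]
prompt = build_prompt("Which NVIDIA tech is in Alan Wake 2?", docs)
```

The assembled prompt is what actually gets sent to the LLM, which is why the model can answer questions about content it was never trained on.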
In the example, a question is asked about the NVIDIA tech integrations in Alan Wake 2. The standard LLaMa 2 model is unable to find a proper answer, but the TensorRT-LLM-accelerated model, fed data from 30 GeForce News articles in a local repository, provides the required information without any problems. TensorRT-LLM not only returns a relevant answer but does so faster than the base LLaMa 2 model. Furthermore, NVIDIA confirmed that TensorRT-LLM can be used to accelerate almost any model. This is just one of the many use cases where NVIDIA TensorRT-LLM can leverage AI to deliver faster, more productive PC experiences on Windows, so stay tuned for more announcements in the future.