Generative AI, the branch of artificial intelligence (AI) that is designed to generate new data, images, code, or other types of content that humans do not explicitly program, is rapidly becoming pervasive across nearly all facets of business and technology.

Inferencing, the process of using a trained AI model to generate predictions, make decisions, or produce outputs based on input data, plays a crucial role in generative AI because it enables the practical application and real-time generation of content or responses. It enables near-instantaneous content creation and interactive experiences and, when properly designed and managed, does so with resource efficiency, scalability, and contextual adaptation. It allows generative AI models to support applications ranging from chatbots and virtual assistants to context-aware natural language generation and dynamic decision-making systems.

Earlier this year, Dell Technologies and NVIDIA introduced a groundbreaking project for generative AI, with a joint initiative to bring generative AI to the world’s enterprise data centers. This project delivers a set of validated designs for full-stack integrated hardware and software solutions that enable enterprises to create and run custom AI large language models (LLMs) using unique data that is relevant to their own organization.

An LLM is an advanced type of AI model that has been trained on an extensive dataset, typically using deep learning techniques, and that is capable of understanding, processing, and generating natural language text. However, AI built on public or generic models is not well suited for an enterprise to use in its business. Enterprise use cases require domain-specific knowledge to train, customize, and operate LLMs.

Dell Technologies and NVIDIA have designed a scalable, modular, and high-performance architecture that enables enterprises everywhere to create a range of generative AI solutions that apply specifically to their businesses, reinvent their industries, and give them competitive advantage.

This design for inferencing is the first in a series of validated designs for generative AI that focus on all facets of the generative AI life cycle, including inferencing, model customization, and model training. While these designs are focused on generative AI use cases, the architecture also applies to more general AI use cases.

This guide describes the Dell Validated Design for Generative AI Inferencing with NVIDIA.

It describes the validated design and reference architecture for a modular and scalable platform for generative AI in the enterprise. The guide focuses specifically on inferencing, which is the process of using a trained model to generate predictions, make decisions, or produce outputs based on input data for production outcomes. Subsequent guides will address validated designs for model customization and training.

A Scalable and Modular Production Infrastructure with NVIDIA for Artificial Intelligence Large Language Model Inferencing

This design guide can be read alongside the associated white paper, Generative AI in the Enterprise. The white paper provides an overview of generative AI, including its underlying principles, benefits, architectures, and techniques; the various types of generative AI models and how they are used in real-world applications; the challenges and limitations of generative AI; and descriptions of the various Dell and NVIDIA hardware and software components to be used in the series of validated designs to be released.

What is inferencing?

Inferencing in AI refers to the process of using a trained model to generate predictions, make decisions, or produce outputs based on input data. It applies the learned knowledge and patterns acquired during the model’s training phase to respond with new and unique content.

During inferencing, the trained model processes input data through its computational algorithms or neural network architecture to produce an output or prediction. The model applies its learned parameters, weights, or rules to transform the input data into meaningful information or actions.
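As a minimal sketch of what applying learned parameters to input data means, the following Python example runs a single forward pass through a tiny two-class linear model. The weights, biases, and input features are hypothetical values chosen for illustration; in a real deployment they would come from training and would number in the billions for an LLM.

```python
import math

# Hypothetical "learned" parameters. In a real system these come from training.
WEIGHTS = [[0.8, -0.5], [-0.3, 0.9]]  # one row of weights per output class
BIASES = [0.1, -0.1]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def infer(features):
    """Apply the fixed parameters to one input. Nothing is updated here."""
    scores = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(WEIGHTS, BIASES)
    ]
    return softmax(scores)


probabilities = infer([1.0, 2.0])
predicted_class = probabilities.index(max(probabilities))
```

The point of the sketch is that inferencing only reads the parameters: the same weights are reused for every request, which is why inferencing can be replicated and scaled independently of training.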

Inferencing is the culminating and operational stage in the life cycle of an AI system. After training a model on relevant data to learn patterns and correlations, inferencing allows the model to generalize its knowledge and make predictions or generate responses that are accurate and appropriate to the specific context of the business.

For example, in a natural language processing task like sentiment analysis, the model is trained on a labeled dataset with text samples and corresponding sentiment labels (positive, negative, or neutral). During inferencing, the trained model takes new, unlabeled text data as input and predicts the sentiment associated with it.
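The sentiment analysis example can be sketched in a few lines of Python. This is an illustrative toy, not the approach used in the validated design: it assumes a small hand-written labeled dataset and a simple Naive Bayes word-count model, where a production system would use an LLM. The dataset and the function names train and predict are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training set: (text sample, sentiment label).
TRAINING_DATA = [
    ("great product and fast delivery", "positive"),
    ("i love the great support", "positive"),
    ("terrible quality and slow delivery", "negative"),
    ("i hate the slow support", "negative"),
    ("the package arrived on monday", "neutral"),
    ("the order contains two items", "neutral"),
]


def train(samples):
    """Training phase: count how often each word appears under each label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in samples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocabulary


def predict(model, text):
    """Inferencing phase: score new, unlabeled text against each label."""
    word_counts, label_counts, vocabulary = model
    total_samples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_samples)
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        for word in text.lower().split():
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_DATA)
label = predict(model, "great fast support")
```

Training runs once over the labeled samples, while predict can then be called repeatedly on new text without touching the learned counts.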

Inferencing can occur in various contexts and applications, such as image recognition, speech recognition, machine translation, recommendation systems, and chatbots. It enables AI systems to provide meaningful outputs, help with decision-making, automate processes, or interact with users based on the learned knowledge and patterns captured by the model during training. Generative AI inferencing is what allows AI systems to produce coherent and contextually relevant responses or content in real time.

Inferencing and AI model development workflow

The following figure shows a typical workflow for generative AI development, depicting where inferencing fits into the overall work stream. While this process can vary from organization to organization, the basic flow is generally consistent.

Figure 1. Generative AI workflow

In step 1, the business must establish its strategy for generative AI by considering its goals and objectives, identifying the problems it wants to solve or opportunities to create, and defining the use case or cases to address.

Step 2 consists of data preparation and curation. It may include data cleansing and labeling, data aggregation, anonymizing of data or generation of synthetic data if necessary, and generally ensuring that the dataset is well-managed, high-quality, and readily available for model training and model customization. Software tools such as Machine Learning Operations (MLOps) platforms can help in the data preparation phase.
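A minimal sketch of the cleansing and anonymizing activities in step 2 might look like the following. It assumes the dataset is a list of free-text records and that email addresses are the only personal data to mask; the function name prepare, the [EMAIL] placeholder, and the regular expression are illustrative assumptions, and real pipelines typically rely on dedicated data preparation tooling.

```python
import re

# Simplified email pattern for illustration; it is not a full address validator.
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def prepare(records):
    """Normalize whitespace, mask email addresses, and drop empty or duplicate records."""
    cleaned, seen = [], set()
    for record in records:
        text = " ".join(record.split())            # collapse runs of whitespace
        text = EMAIL_PATTERN.sub("[EMAIL]", text)  # anonymize email addresses
        key = text.lower()                         # case-insensitive duplicate check
        if text and key not in seen:
            seen.add(key)
            cleaned.append(text)
    return cleaned


raw_records = [
    "  Contact  jane.doe@example.com today ",
    "contact jane.doe@example.com today",
    "",
    "Reset the router",
]
curated = prepare(raw_records)
```

Masking before deduplication means two records that differ only in the masked value collapse into one, which may or may not be desirable depending on the use case.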

In step 3, the real work begins, especially if we are training a model from scratch, which requires a substantial amount of labeled data relevant to the use case, heavy computational resources, and potentially significant time for training. This step is where validated, high-performance, and accelerated infrastructure can make a significant difference in the time and efficiency to complete the training phase. We can also evaluate existing models and select a pretrained model if it is applicable to the business, or use a pretrained model as the basis for the next step of model customization.

Step 4 consists of customization of a trained model, whether it is one that you have trained from scratch or acquired as a pretrained model. Customization methods include fine-tuning, prompt learning that can include both prompt tuning and P-tuning, transfer learning, and reinforcement learning. These methods are discussed in more detail in the Generative AI in the Enterprise white paper.

Step 5 consists of inferencing, the subject of this validated design. This step is where you deploy and operate the trained model to generate business outcomes on an ongoing basis, scaling up or scaling out the computing resources as necessary to match demand. The inferencing step may be iterative as well, as new data and new model customization and fine-tuning opportunities are identified to optimize the outcomes of the inferencing operations in practice.
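Scaling out to match demand can be reasoned about with simple arithmetic. The sketch below estimates how many identical model replicas are needed for a given peak load. The function name, the headroom default, and the sample numbers are assumptions for illustration only; actual per-replica throughput depends on the model, hardware, and serving software, and should be measured rather than assumed.

```python
import math


def replicas_needed(peak_requests_per_second, requests_per_replica, headroom=0.25):
    """Estimate the replica count for a peak load, with spare capacity.

    headroom is the fraction of extra capacity reserved for bursts.
    """
    if requests_per_replica <= 0:
        raise ValueError("requests_per_replica must be positive")
    required = peak_requests_per_second * (1 + headroom) / requests_per_replica
    return max(1, math.ceil(required))


# Hypothetical figures: 100 requests/s at peak, 8 requests/s per replica.
replica_count = replicas_needed(100, 8)
```

With these sample figures the estimate is 16 replicas: 100 requests per second plus 25 percent headroom is 125, and 125 divided by 8 rounds up to 16.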

Inferencing use cases

Inferencing using LLMs for natural language generation in generative AI has numerous practical use cases across various domains. While the Generative AI in the Enterprise white paper discussed some use cases for generative AI for various industries, some notable examples of use cases based specifically on inferencing include:

Natural language generation—Generative AI models can be used for text generation tasks such as document writing, dialogue generation, summarization, or content creation for marketing and advertising purposes.

Chatbots and virtual assistants—Generative AI powers conversational agents, chatbots, and virtual assistants by generating natural language responses based on user queries or instructions.

Personalized recommendations—Generative AI can generate personalized recommendations for products, movies, music, or content based on user preferences, behavior, and historical data.

Data augmentation—Generative AI can generate synthetic data samples to augment existing datasets, increasing the diversity and size of the training data for machine learning models.

Customer service and troubleshooting—In addition to chatbots and virtual assistants, inferencing has several applications in customer service and troubleshooting environments, including:

Self-service knowledge bases—Generative AI can automatically generate and update knowledge base articles, FAQs, and troubleshooting guides. When customers encounter issues, they can search the knowledge base to find relevant self-help resources that provide step-by-step instructions or solutions.

Contextual responses, problem solving, and proactive troubleshooting—Generative AI can analyze customer queries or problem descriptions and generate contextually relevant responses or troubleshooting suggestions. By understanding the context, the AI system can offer tailored recommendations, guiding customers through the troubleshooting process.

Interactive diagnostics—Generative AI can simulate interactive diagnostic conversations to identify potential issues and guide customers towards resolution. Through a series of questions and responses, the system can narrow down the problem, offer suggestions, or provide next steps for troubleshooting.

Intelligent routing and escalation—Generative AI models can intelligently route customer inquiries or troubleshoot specific issues based on their complexity or severity. They can determine when a query must be escalated to human support agents, ensuring efficient use of resources and timely resolution.

Sentiment analysis and customer sentiment monitoring—Generative AI can analyze customer sentiment and emotional cues from their messages or interactions. This analysis allows organizations to monitor customer satisfaction levels, identify potential issues, and take proactive measures to address concerns.
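The intelligent routing and escalation use case listed above can be sketched as a small decision function. This is a hedged illustration rather than the design's implementation: the escalation terms, queue names, confidence threshold, and the idea that the model reports a confidence score alongside its answer are all assumptions, and a keyword check stands in here for the severity judgment that a generative model would make.

```python
from dataclasses import dataclass

# Hypothetical severity terms that always warrant a human support agent.
ESCALATION_TERMS = {"outage", "data loss", "security", "refund"}


@dataclass
class RoutingDecision:
    queue: str   # "human_agent" or "automated_reply" (illustrative queue names)
    reason: str  # short explanation that can be logged for auditing


def route(inquiry, model_confidence, confidence_threshold=0.7):
    """Decide whether an inquiry is answered automatically or escalated."""
    text = inquiry.lower()
    matched = sorted(term for term in ESCALATION_TERMS if term in text)
    if matched:
        return RoutingDecision("human_agent", "severity terms: " + ", ".join(matched))
    if model_confidence < confidence_threshold:
        return RoutingDecision("human_agent", "low model confidence")
    return RoutingDecision("automated_reply", "within automated scope")


decision = route("We have an outage in production", model_confidence=0.95)
```

Checking severity before confidence means a high-confidence answer never suppresses escalation of a serious issue, which is one reasonable ordering among several.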
