A guest post by Nomuka Luehr
Dell Technologies has announced a joint initiative with NVIDIA to bring generative AI to enterprise customers, called Project Helix.
Project Helix is a full-stack solution that enables enterprises to create and run custom AI models built with the knowledge of their business. We'll get to the specifics of the announcement in this series, but first, let's set the stage.
What is Generative AI in simple terms?
Generative AI is a type of AI system capable of generating text, images, or other media in response to prompts. Generative models learn the patterns and structure of the input data, and then generate new content that is similar to the training data but with some degree of novelty.
The most prominent frameworks for approaching generative AI include generative adversarial networks (GANs) and generative pre-trained transformers (GPTs).
GPTs are artificial neural networks that are based on the transformer architecture, pre-trained on large datasets of unlabeled text, and able to generate novel human-like text.
GANs consist of two parts: a generator network that creates new data samples and a discriminator network that evaluates whether the samples are real or fake. (This series will not focus on GANs)
Generative AI has many potential applications, including in creative fields such as art, music, and writing, as well as in fields such as healthcare, finance, and gaming. We have all seen this in action recently with the release of ChatGPT and Bing Chat.
Where do Large Language Models (LLMs) fit into the picture?
Glad you asked! GPTs (Generative Pre-trained Transformers) are a type of LLM (Large Language Model). The datasets used to train a GPT can come from a variety of sources, including internal documents, customer interactions, and publicly available information. The GPT will then use this data to learn the patterns and structure of the language, allowing it to generate new content that is similar to the training data.
GPTs are trained on large amounts of text data; LLMs are the outcome of that kind of training. A Large Language Model (LLM) consists of a neural network with many parameters (typically billions of weights or more), trained on large quantities of text using self-supervised or semi-supervised learning.
LLMs emerged around 2018 and have since become capable of performing well at a wide variety of tasks.
LLMs are general-purpose models which excel at a wide range of tasks, as opposed to being trained for one specific purpose. They demonstrate considerable general knowledge about the world and can “memorize” many facts during training.
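For readers who like to see what that self-supervised training signal looks like in practice, here is a minimal, purely illustrative sketch using the Hugging Face transformers library. The GPT-2 model and the sample sentence are assumptions for the example, not anything specific to this announcement.

```python
# A minimal sketch of self-supervised learning on raw text: the labels are the
# input tokens themselves, and the model learns to predict each next token.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Dell Technologies and NVIDIA announced Project Helix."
inputs = tokenizer(text, return_tensors="pt")

# Passing the input ids as labels gives the next-token prediction loss
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # the quantity minimized during training
```

No labelled data is needed here, which is exactly why this style of training scales to the very large text corpora that LLMs are built on.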
How can LLM’s help businesses?
Artificial Intelligence (AI) will become integral to modern business operations, and Large Language Models (LLMs) have emerged as one of the most powerful types of AI models available today, with business cases ranging from conversational agents and chatbots for customer service to audio and visual content creation, software programming, security, fraud detection, threat intelligence, natural language interaction, and translation.
LLMs can help enable a myriad of new applications and business opportunities. Enterprise customers can (and will) use LLMs to empower their company’s business intelligence and unlock the value of AI in ways that were previously not possible.
There will be few areas of business and society that will not be impacted in some way by this technology.
Why should I consider developing my own LLMs instead of using other services?
While public generative AI models such as ChatGPT, Google Bard AI, Microsoft Bing Chat, and a host of other, more specialized offerings are intriguing, you can't download these GPT models and further train or fine-tune them (they are not open source).
You can access these models through a few different means. For example, OpenAI offers a paid API that lets developers use GPT models in their applications, with usage metered in tokens. Embeddings and tokens (words or word pieces, and how many of them you send) are the main means available for providing these models with additional context about your specific data.
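As a hedged illustration of that kind of API access, the sketch below uses the OpenAI Python SDK (pre-1.0 interface). The model names, the sample document, and the idea of pasting your own text into the prompt as context are illustrative assumptions only.

```python
# Illustrative only: accessing a hosted GPT model via the OpenAI Python SDK
import openai

openai.api_key = "YOUR_API_KEY"

# Embeddings: turn your own text into a vector you can store and search over
doc = "Our refund policy allows returns within 30 days of purchase."
emb = openai.Embedding.create(model="text-embedding-ada-002", input=doc)
vector = emb["data"][0]["embedding"]  # a list of floats

# Completions: provide your data as context in the prompt (usage metered in tokens)
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": f"Answer using this policy: {doc}"},
        {"role": "user", "content": "Can I return an item after two weeks?"},
    ],
)
print(chat["choices"][0]["message"]["content"])
```

Note that in this pattern your data leaves your environment and travels to the provider, which is one of the trade-offs discussed below.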
Companies such as OpenAI and Microsoft are starting to work with other organizations to implement their GPT/LLM models on-premises with customers, and this landscape is rapidly evolving. However, there is a compelling need for enterprises to develop their own Large Language Models (LLMs) that are trained on known data sets, or developed or fine-tuned from known pre-trained models.
Benefits of developing your own LLM – Iterate, Re-Train, Improve
Developing your own large language model as opposed to using an existing one can provide several commercial and business benefits. However, it’s important to note that creating a large language model from scratch requires resources and expertise. Here are some potential benefits:
Customization and Optimization: Developing your own model allows you to train it on specific data to meet your unique business needs. You can tailor it to understand industry-specific jargon, customer interaction styles, or the nuances of your particular products and services.
Data Security and Privacy: When you use a third-party model, you often need to send your data to the provider’s servers, which may raise privacy concerns. By developing your own model, you can keep your data in-house, enhancing data security and privacy.
Control Over Updates and Maintenance: Owning the model means you control when and how to update it, allowing for quicker reactions to changing business needs or customer feedback.
Competitive Advantage: A unique, effective language model can be a powerful tool that sets your business apart from competitors. It can improve the customer experience, drive efficiencies, and even become a product or service you can sell.
Reduced Long-Term Costs: While the initial investment might be high, you could save money in the long run by not paying licensing or usage fees to a third-party provider.
Intellectual Property: The algorithms, training data, and resulting models can become valuable intellectual property assets for your business.
What about a pre-trained model?
Training your own language model can give you greater control over the training data, as well as the ability to fine-tune the model for your specific needs. However, it can also be time-consuming and resource-intensive, as training a language model requires a significant amount of computing power and data.
On the other hand, using a pre-trained language model can save time and resources, as well as provide a strong foundation for your NLP tasks. Pre-trained language models, such as GPT-3 and BERT, have been trained on large amounts of high-quality data, and can be fine-tuned for specific tasks with smaller amounts of task-specific data. Additionally, pre-trained models often have a range of pre-built functionalities, such as sentence encoding and language translation, that can be readily used.
Ultimately, the decision to train your own language model or use a pre-trained one should be based on your specific needs and resources. If you have ample computing power and high-quality data that is specific to your use case, training your own language model may be the best choice. However, if you have limited resources or need a strong foundation for your NLP tasks, using a pre-trained language model may be the way to go.
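To make the "start from a pre-trained model" option concrete, here is a minimal sketch using the Hugging Face transformers library. The model name (bert-base-uncased) and the sentiment-style classification task are illustrative assumptions, not a recommendation for any particular use case.

```python
# A minimal sketch: load a pre-trained model and attach a new task head,
# ready to be fine-tuned on your own task-specific data.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # new, untrained classification head
)

inputs = tokenizer("Great product, fast delivery!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits)  # meaningful only after fine-tuning on labelled examples
```

The pre-trained backbone already encodes general language understanding; fine-tuning only has to teach the model your specific task.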
How is Dell Technologies helping?
Dell and NVIDIA have already been leading the way in delivering joint innovations for artificial intelligence and high-performance computing and are actively collaborating in this new space to enable customers to create and operate Generative AI models for the Enterprise.
Dell has industry-leading servers with NVIDIA compute and infrastructure accelerators, data storage systems, networking, management, reference designs, and the experience of helping numerous enterprises of all types and sizes with their AI and Infrastructure Solutions initiatives.
NVIDIA has state-of-the-art, pre-trained foundation models, NVIDIA AI Enterprise software, system software to manage many networked systems simultaneously, and expertise in building, customizing, and running Generative AI.
We are now partnering on a new generative AI project called Project Helix, a joint initiative between Dell and NVIDIA, to bring Generative AI to the world’s enterprise data centers. Project Helix is a full-stack solution that enables enterprises to create and run custom AI models built with the knowledge of their business.
Dell is designing extremely scalable, highly efficient infrastructure solutions that allow enterprises everywhere to create a new wave of generative AI solutions that will reinvent their industries and offer a competitive advantage.
The complete announcement and white paper are available here.
Unpacking Project Helix
In our upcoming blog series, we will explore the world of Generative AI and LLMs, including training and fine-tuning models, reinforcement learning, and general AI training and inferencing.
Ultimately, this series aims to provide an overview of the key concepts involved in AI model development and usage, specifically focusing on LLMs, their applications, and how Dell Technologies can enable you to succeed.
Part 1: Generative AI and LLMs – Introduction and Key Concepts (Transformers and training types)
We will explore the key concept behind LLMs and GPTs: the Transformer. We will look at what makes up a transformer architecture and why it has been a game changer, along with business and technical challenges to be aware of.
Part 2: LLM Training Types and Techniques
We will focus on the training of LLMs, including the different types of training available and the tools and techniques used to create these models. We will explore how training data is collected and processed, and the importance of fine-tuning LLMs for specific tasks.
Part 3: Pre-Trained Model Fine-Tuning and Transfer Learning (working with a pre-trained model)
We will delve into working with pre-trained models, including fine-tuning and transfer learning. We will also touch on reinforcement learning, a type of machine learning that involves training models to make decisions based on rewards or penalties, and discuss how these techniques can be used to optimize operations and improve decision-making processes in industries such as healthcare and finance.
Part 4: Inferencing
We will discuss the general training and inferencing of AI models.
Part 5: Project Helix – An Overview
We will explore the particular advantages that Dell and NVIDIA bring to the table and how this will enable enterprises to use purpose-built Generative AI on-premises to solve specific business challenges.
We will look at how Project Helix can deliver full-stack Generative AI solutions built on the best of Dell infrastructure and software, in combination with the latest NVIDIA accelerators, AI software, and AI expertise.
We will also cover how it assists enterprises with the entire Generative AI lifecycle, from infrastructure provisioning, large model training, and pre-trained model fine-tuning to multi-site model deployment and large model inferencing.
Finally, we will cover how it ensures the security and privacy of sensitive and proprietary company data, compliance with government regulations, and the ability to develop safer and more trustworthy AI – a fundamental requirement of enterprises today.
Let’s cover some key concepts
Natural language processing (NLP) is a branch of artificial intelligence that deals with the interaction between computers and human languages. LLM stands for large language model, which is a type of neural network that can learn from large amounts of text data and generate natural language outputs. GPT (one of the most common transformer models) stands for Generative Pre-trained Transformer.
A transformer is a type of Neural Network that can process sequential data, such as natural language, by using a mechanism called self-attention (this part is key, which we’ll cover later).
We will cover the Transformer architecture first, and then look at parameters.
What is a Transformer? “Attention is All You Need”
The “transformer” is a type of model architecture used in the field of deep learning, particularly for tasks involving natural language processing (NLP). It was introduced by Vaswani et al. in a 2017 paper titled “Attention is All You Need”. Since then, numerous variations and improvements upon the original transformer model have been introduced.
Standard Transformer: Introduced in the “Attention is All You Need” paper. The standard transformer model uses a mechanism called self-attention (or scaled dot-product attention) and consists of an encoder-decoder structure.
BERT (Bidirectional Encoder Representations from Transformers): Developed by Google, BERT is a transformer encoder commonly used for tasks such as text classification. It reads the entire sequence of words at once, making it bidirectional, which allows the model to learn context from both past and future words in the text.
GPT (Generative Pretrained Transformer): Developed by OpenAI, GPT is a large-scale, unsupervised, transformer-based language model. Unlike BERT, it’s an autoregressive model that generates text sequentially from left to right.
Transformer-XL (Transformer with Extra Long context): This variant introduces a recurrence mechanism to the Transformer model to enable it to handle longer-term dependencies, making it more suitable for tasks such as text generation.
RoBERTa (Robustly Optimized BERT approach): RoBERTa is a variant of BERT that keeps the model architecture but modifies key hyperparameters and the training approach. It removes the next-sentence pretraining objective and trains with much larger mini-batches and learning rates.
T5 (Text-to-Text Transfer Transformer): T5 is a transformer model from Google that casts all NLP tasks into a unified text-to-text-format. This allows the model to use the same approach to handle different tasks, such as translation, summarization, and classification.
DistilBERT: This is a smaller, faster, cheaper, and lighter version of BERT. It retains about 97% of BERT's language-understanding performance while being roughly 40% smaller and 60% faster.
ALBERT (A Lite BERT): ALBERT is another variant of BERT that reduces the model size (but not the model architecture) by sharing parameters between layers. It also introduces a new self-supervised loss for sentence-order prediction.
The list goes on (LLaMA, the many models hosted on Hugging Face, and more).
During inference, the Transformer model takes in an input sequence (e.g., a sentence in natural language) and generates a corresponding output sequence (e.g., a translated sentence in a different language). The attention mechanism in the Transformer model helps it focus on the most relevant parts of the input sequence to generate more accurate output sequences.
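As a concrete (and purely illustrative) example of that inference flow, the sketch below uses off-the-shelf Hugging Face pipelines: a GPT-style model generating text left to right, and a T5 model translating an English sentence into German. The model choices and prompts are examples, not recommendations.

```python
# Illustrative inference with off-the-shelf pre-trained transformers
from transformers import pipeline

# GPT-style autoregressive generation: input sequence in, continuation out
generator = pipeline("text-generation", model="gpt2")
print(generator("Generative AI in the enterprise will", max_new_tokens=20)[0]["generated_text"])

# T5-style text-to-text inference: English sentence in, German sentence out
translator = pipeline("translation_en_to_de", model="t5-small")
print(translator("Project Helix brings generative AI to the enterprise data center.")[0]["translation_text"])
```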
What makes up a transformer architecture?
Tokens: In the context of natural language processing (NLP) and transformers, tokens are the smallest units of language that a model can understand and process. They can range from a single character to a whole word or even more in some languages (short code sketches after this list illustrate tokens, embeddings, positional encoding, and self-attention).
Embeddings: Once our text is broken into tokens, we need a way to represent these tokens numerically, so the model can process them. This is done through embeddings, which are learned by the model during training. Embeddings represent tokens as vectors in a high-dimensional space (certainly beyond the scope of this post!) where similar words have similar embeddings.
Positional Encoding: In addition to the token embeddings, Transformers use positional encodings to capture the order of words in a sentence. This is important because unlike models like RNNs and LSTMs, Transformers do not process tokens sequentially, so they need another way to understand word order.
Self-Attention Mechanism: This is a key part of the Transformer architecture. It allows the model to weigh the importance of each token in the context of every other token in the sentence. It helps the model understand the context and relationships between words.
Layers: The Transformer model is made up of multiple layers, each consisting of a self-attention mechanism and a feed-forward neural network. The output from one layer is fed as input to the next, allowing the model to learn complex relationships between tokens.
Training and Fine-Tuning: Transformers are trained in two steps. First, they are pre-trained on a large-scale dataset to learn general language understanding. During this phase, the model learns both weights and embeddings. Then, they are fine-tuned on a smaller, task-specific dataset. During fine-tuning, the model updates its weights and embeddings to better suit the specific task.
Token Limit: Transformers have a maximum sequence length, or token limit, due to the self-attention mechanism which increases computational cost with the number of tokens. This is a fundamental aspect of the architecture and is something to consider when working with these models.
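To make a few of those components more concrete, here are three small, self-contained sketches. They use the Hugging Face transformers library, PyTorch, and NumPy; the model name (bert-base-uncased) and the toy sentences and sizes are illustrative assumptions rather than anything specific to Project Helix.

The first sketch shows tokenization, the embedding lookup, and the model's token limit:

```python
# Tokens, embeddings, and the token limit
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("Transformers power modern language models")
ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)                      # sub-word tokens, the smallest units the model sees

# Each token id maps to a learned vector (its embedding)
embeddings = model.get_input_embeddings()(torch.tensor([ids]))
print(embeddings.shape)            # (1, number_of_tokens, 768) for this model

print(tokenizer.model_max_length)  # the token limit: 512 for BERT-style models
```

The second sketch implements the sinusoidal positional encoding described in "Attention is All You Need", which injects word-order information into the token embeddings:

```python
# Sinusoidal positional encoding
import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]    # position of each token in the sequence
    i = np.arange(d_model)[None, :]      # index of each embedding dimension
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    # even dimensions use sine, odd dimensions use cosine
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

print(positional_encoding(seq_len=4, d_model=8).round(2))
```

The third sketch is a toy version of scaled dot-product self-attention, the mechanism that lets each token weigh every other token when building its representation (real models use learned query/key/value projections and many attention heads):

```python
# Scaled dot-product self-attention on toy data
import numpy as np

def self_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                           # how much each token attends to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                                        # weighted sum of value vectors

x = np.random.rand(3, 4)  # 3 tokens, 4-dimensional vectors
print(self_attention(x, x, x).shape)                          # (3, 4)
```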
Parameters
Parameters in a machine learning model, including Transformers, are the parts of the model that are learned from the data during training. In Transformers, there are two main types of parameters: weights and biases.
Weights: These are the values that determine how much each input feature (in this case, each element of the embeddings) contributes to the output. In the self-attention mechanism, for instance, weights are used to calculate the attention scores. These scores are essentially the weights assigned to each word when considering its influence on other words. The weights in the model are adjusted during training to minimize the difference between the model's predictions and the actual values.
Biases: These are additional parameters that are added to the outputs of the weighted sum of inputs. They allow the output to be shifted by a constant value, regardless of the input values. Like weights, biases are also learned during training.
The combination of weights and biases forms the learned parameters of the model. The learning process involves iteratively adjusting these parameters to reduce the model’s error on the training data. Once the model is trained, these parameters are used to make predictions on new, unseen data.
In a Transformer model, weights and biases are present in various parts, including the self-attention layers and the feed-forward neural networks. The embeddings are also parameters of the model that are learned during training.
In large Transformer models, there can be hundreds of millions or even billions of parameters. This large number of parameters is part of what allows these models to capture complex patterns in the data, but it also makes them computationally intensive to train and requires a lot of data to avoid overfitting.
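If you want to see a parameter count for yourself, the short sketch below sums the learned parameters (weights, biases, and embeddings) of a pre-trained model; the model name is just an example.

```python
# Count the learned parameters of a pre-trained transformer
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")
total_parameters = sum(p.numel() for p in model.parameters())
print(f"{total_parameters:,}")  # roughly 110 million for bert-base-uncased
```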
The first GPT model was introduced in 2018 and had 117 million parameters. Since then, OpenAI has released several improved versions of GPT with more parameters and capabilities, such as GPT-2 (1.5 billion parameters) and GPT-3 (175 billion parameters); the parameter count of GPT-4 has not been publicly disclosed. Other organizations have also created their own GPT-inspired models, such as EleutherAI's GPT-Neo (2.7 billion parameters), Cerebras' Cerebras-GPT family (up to 13 billion parameters), Salesforce's EinsteinGPT (for CRM), and Bloomberg's BloombergGPT (for finance).
Why have transformers been a game changer for LLMs?
Scalability: One of the main advantages of transformers is their scalability. They can be parallelized across multiple GPUs, which allows for the training of much larger models compared to traditional recurrent neural networks (RNNs). Transformers have led to state-of-the-art performance in a wide range of NLP tasks, such as machine translation, sentiment analysis, and question-answering systems.
Transfer learning: Transformers can be pre-trained on large amounts of text data and fine-tuned for specific tasks (we cover this in much greater detail later in the series, and a minimal sketch follows this list). This has made it easier to develop high-performing models for different applications with relatively small amounts of task-specific data.
Versatility: Transformers have paved the way for LLMs like GPT and BERT (Bidirectional Encoder Representations from Transformers). These models have demonstrated remarkable abilities in understanding and generating human-like text.
Real-world applications: Transformers have led to numerous real-world applications, such as chatbots, virtual assistants, content generation, and many others, making them an essential part of the AI landscape.
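Here is the transfer-learning sketch referenced above: start from a pre-trained model and fine-tune it on a small, task-specific dataset using the Hugging Face Trainer. The model, the public IMDB dataset, and the hyperparameters are illustrative assumptions chosen to keep the example small, not a production recipe.

```python
# A minimal transfer-learning / fine-tuning sketch with Hugging Face
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# A small slice of a labelled dataset, tokenized into the model's input format
dataset = load_dataset("imdb", split="train[:1%]")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=256),
    batched=True,
)

args = TrainingArguments(output_dir="finetuned-model", num_train_epochs=1,
                         per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=dataset).train()
```

The key point is that only a relatively small amount of task-specific data and compute is needed, because the pre-trained backbone already carries general language understanding.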
Making an informed decision
One can certainly develop a language model without understanding (or having deep knowledge of) these details. However, having an understanding can help you solve problems, make more informed decisions, and potentially create more effective models.
If you’re looking to build your own language model, or start with a pre-trained model, understanding the Transformer architecture can be incredibly helpful. Whether it’s for problem-solving, parameter tuning, model customization, or just staying up-to-date with the latest in NLP, knowing the inner workings of these models can provide valuable insights.
Furthermore, if you have enterprise data that you'd like to leverage, you can fine-tune a pre-trained model on your specific data or even train a model from scratch. You can also combine structured and unstructured data, create a knowledge graph, or employ a hybrid approach, depending on the nature of your data and the specific use case you have in mind.
LLMs are powerful AI models that have demonstrated remarkable capabilities in natural language processing and other domains. Their success is due in large part to the Transformer architecture, which enables the models to effectively capture long-range dependencies in sequential data such as text.
In the next post, we'll take a look at LLM training types and techniques.