LLMs : Large Language Models Learnings & Laws

Introduction

Artificial Intelligence has been taking huge strides in the software development process: not only does it help write the codebase, it has also come to handle deployments and the automated management of code in production. While eating up a lot of jobs in the tech community, it has come to perform many tasks that weren't easy to find in a single developer, whether related to Machine Learning, Data Structures, Web Development, DevOps, Cloud, SRE, or many other roles. It all started with models trained for text manipulation and generation, which learn to understand an input and predict an answer from the training data they were fed, developing further abilities as they were scaled up. These are the Large Language Models (LLMs). In this blog, we will learn about different models, the learnings adopted by these models, comparisons of their efficiency, and laws that we can use to make our own tools and contribute to this revolution.

Large Language Models

Large Language Models (LLMs) are a specific category of neural network models characterized by having an exceptionally high number of parameters, often in the billions. These parameters are essentially the variables within the model that allow it to process and generate text. LLMs are trained on vast quantities of textual data, which provides them with a broad understanding of language patterns and structures. The main goal of LLMs is to comprehend and produce text that closely resembles human-written language, enabling them to capture the subtle complexities of both syntax (the arrangement of words in a sentence) and semantics (the meaning conveyed by those words).

These models undergo training with a simple objective: predicting the subsequent word in a sentence. However, they develop a range of emergent abilities during this training process.

Language Modelling

Language modeling is about explicitly learning the probability distribution of the words in a language. The model predicts the next token based on the previous tokens in the sequence, using various statistical methods or deep learning techniques.
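To make "learning a probability distribution over tokens" concrete, here is a minimal sketch of a count-based bigram model in Python, a toy example of the statistical approach (the corpus is made up; modern LLMs replace the counting with a neural network):

```python
from collections import defaultdict, Counter

# Toy corpus: in practice this would be billions of tokens.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each token follows each previous token (a bigram model).
bigram_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigram_counts[prev][nxt] += 1

def next_token_distribution(prev):
    """Estimated probability distribution over the next token given `prev`."""
    counts = bigram_counts[prev]
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}

print(next_token_distribution("the"))
# e.g. {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```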

Tokenization

The first step in this process is tokenization, where the input text is broken down into smaller units called tokens. Tokens can be as small as individual characters or as large as whole words. The choice of token size can significantly affect the model's performance. Some models even use subword tokenization, where words are broken down into smaller units that capture meaningful linguistic information.

For example, we could take the phrase Large Language Model and split it into tokens wherever we find white space:

["Large", "Language", "Model"]

Alternatively, we may also include the white spaces as tokens if we want to keep them:

["Large", " ", "Language", " ", "Model"]

Model Architecture and Attention

The core of a language model is its architecture. Recurrent Neural Networks (RNNs) were traditionally used for this task, as they are capable of processing sequential data by maintaining an internal state that captures the information from previous tokens.

Vanishing Gradient Problem (VGP)

In machine learning, the vanishing gradient problem is encountered when training artificial neural networks with gradient-based learning methods and backpropagation. In such methods, during each iteration of training, each of the neural network weights receives an update proportional to the partial derivative of the error function with respect to the current weight. The problem is that in some cases, the gradient will be vanishingly small, effectively preventing the weight from changing its value. In the worst case, this may completely stop the neural network from further training.
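A tiny illustration of why this happens: the sigmoid activation's derivative never exceeds 0.25, so chaining it across many layers (a simplified stand-in for backpropagation through a deep network, or through time in an RNN) shrinks the gradient exponentially:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# During backpropagation, the small per-layer derivatives multiply
# together, so the gradient signal shrinks exponentially with depth.
x = 0.0
grad = 1.0
for step in range(1, 31):
    grad *= sigmoid(x) * (1 - sigmoid(x))  # sigmoid's derivative at x=0 is 0.25
    if step % 10 == 0:
        print(f"after {step} layers, gradient factor ≈ {grad:.3e}")
# after 10 layers, gradient factor ≈ 9.537e-07
# after 20 layers, gradient factor ≈ 9.095e-13
# after 30 layers, gradient factor ≈ 8.674e-19
```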

Transformer Adoption to Overcome VGP

The vanishing gradient problem makes it especially hard for RNNs to handle long sequences. To overcome these limitations, transformer-based models have become the standard for language modeling tasks. These models use a mechanism called attention, which allows them to weigh the importance of different tokens when making predictions. This lets them capture long-range dependencies between tokens and generate high-quality text.
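Here is a minimal NumPy sketch of the scaled dot-product attention at the heart of the transformer; real models add learned projection matrices, multiple heads, and masking:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal single-head attention: each position attends to every position."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # similarity of each query to each key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V               # weighted sum of the values

# 4 tokens, embedding dimension 8 (random stand-ins for learned projections).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 8)
```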

Training

The model is trained on a large dataset, adjusting its parameters to maximize the probability of correctly predicting the next token of a sentence.

Typically a model is trained on a very large general dataset of text from the Internet, such as The Pile or Common Crawl. Sometimes more specific datasets are also used, such as the StackOverflow Posts dataset.

The model learns to predict the next token in a sequence by adjusting its parameters to maximize the probability of outputting the correct next token from the training data.

Once the model is trained, it can predict the next token in a sequence: the sequence is fed into the model, which outputs a probability distribution over the possible subsequent tokens, and the next token is then chosen based on that distribution.
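To sketch that last step, assume a hypothetical four-word vocabulary and made-up model scores (logits); converting them to probabilities and choosing the next token looks like this:

```python
import numpy as np

vocab = ["Large", "Language", "Model", "Learning"]
# Hypothetical raw scores (logits) a trained model might output for the
# next token after the sequence "Large Language".
logits = np.array([0.1, 0.2, 3.0, 1.5])

# Softmax turns the logits into a probability distribution over the vocab.
probs = np.exp(logits - logits.max())
probs = probs / probs.sum()

greedy = vocab[int(np.argmax(probs))]       # always pick the most likely token
sampled = np.random.choice(vocab, p=probs)  # or sample for more variety
print(dict(zip(vocab, probs.round(3))), greedy)
```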

Fine Tuning

Fine-tuning is the process of taking pre-trained models and further training them on smaller, specific datasets to refine their capabilities and improve performance in a particular task or domain. Fine-tuning is about taking general-purpose models and turning them into specialized models. It bridges the gap between generic pre-trained models and the unique requirements of specific applications, ensuring that the language model aligns closely with human expectations. Think of OpenAI's GPT-3, a state-of-the-art large language model designed for a broad range of natural language processing (NLP) tasks.
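Below is a minimal sketch of fine-tuning using the Hugging Face transformers library, assuming GPT-2 as a small stand-in for a pre-trained LLM and a hypothetical domain corpus file my_domain_corpus.txt; a real fine-tuning run would tune many more settings:

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling,
                          Trainer, TrainingArguments)
from datasets import load_dataset

# GPT-2 as a stand-in for a larger pre-trained model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Hypothetical small, domain-specific text file.
dataset = load_dataset("text", data_files={"train": "my_domain_corpus.txt"})
tokenized = dataset["train"].map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-finetuned", num_train_epochs=1),
    train_dataset=tokenized,
    # mlm=False -> causal (next-token) objective; the collator also builds labels
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # further trains the pre-trained weights on the small dataset
```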

Few-Shot Learning

Few-shot learning refers to a machine learning paradigm where a model is given only a small number of examples before making predictions, which can be particularly useful when labeled data is limited. These examples "teach" the model how to reason and act as "filters" to help the model search for relevant patterns in the dataset.

The advantage of few-shot learning is fascinating as it suggests that the model can be quickly reprogrammed for new tasks.

The few-shot examples are helping the model search for relevant patterns in the dataset. The dataset, which is effectively compressed into the model's weights, can be searched for patterns that strongly respond to these provided examples. These patterns are then used to generate the model's output. The more examples provided, the more precise the output becomes.
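In practice, few-shot learning with an LLM usually means placing the examples directly in the prompt (in-context learning), without updating any weights. A sketch with a made-up sentiment-labeling task:

```python
# A hypothetical few-shot prompt: the examples are part of the input text,
# not extra training. The model infers the pattern (here, sentiment labels)
# and continues it.
prompt = """\
Review: The plot was gripping from start to finish.
Sentiment: positive

Review: I walked out halfway through.
Sentiment: negative

Review: The soundtrack alone is worth the ticket.
Sentiment:"""
# Feeding `prompt` to an LLM, the most likely continuation is " positive".
```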

Scaling Laws

Scaling laws refer to the relationship between the model's performance and factors such as the number of parameters, the size of the training dataset, the compute budget, and the network architecture. They were discovered through extensive experimentation and are described in the Chinchilla paper. These laws provide insights into how to optimally allocate resources when training these models.

The main elements characterizing a language model are:

  1. The number of parameters (N) reflects the model's capacity to learn from data. More parameters allow the model to capture complex patterns in the data.

  2. The size of the training dataset (D) is measured in the number of tokens.

  3. FLOPs (floating point operations) measure the compute budget used for training.
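To make the relationship concrete, here is a rough back-of-the-envelope sketch. It assumes two widely cited approximations: training compute C ≈ 6·N·D FLOPs for transformers, and the Chinchilla finding that compute-optimal training uses roughly 20 tokens per parameter:

```python
# Back-of-the-envelope sketch using two commonly cited approximations:
#   compute C ≈ 6 * N * D FLOPs (N = parameters, D = training tokens)
#   Chinchilla heuristic: compute-optimal training uses D ≈ 20 * N tokens.
N = 70e9    # e.g. a 70-billion-parameter model
D = 20 * N  # ≈ 1.4 trillion tokens (Chinchilla's actual training budget)
C = 6 * N * D
print(f"tokens: {D:.2e}, training compute: {C:.2e} FLOPs")
# tokens: 1.40e+12, training compute: 5.88e+23 FLOPs
```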

Emergent Abilities in LLMs

Emergent abilities in LLMs refer to the sudden appearance of new capabilities as the size of the model increases. These abilities, which include performing arithmetic, answering questions, summarizing passages, and more, are not explicitly trained in the model. Instead, they seem to arise spontaneously as the model scales, hence the term "emergent."

LLMs are probabilistic models that learn patterns in natural language.

When these models are scaled up, they not only improve quantitatively in their ability to learn patterns, but they also exhibit qualitative changes in their behavior.

Earlier, models needed to be architecturally upgraded and given various task-specific training to gain new capabilities. Now, simply scaling these models can produce new functionalities on its own, without investing much time in those task-specific efforts; developers mainly need to monitor and keep track of the features and behaviors that emerge from scaling.

As LLMs grow, their performance on some tasks can transition rapidly and unpredictably from near-zero to sometimes state-of-the-art. This phenomenon suggests that these abilities are emergent properties of the model's scale rather than being explicitly programmed into the model.

Thus, just scaling up these models can lead to the spontaneous development of new capabilities. But this self-learning can also give rise to unethical or incorrect outputs, which can spread wrong information, mislead people, or even hurt people with racial or gendered remarks. These issues are known as hallucinations and biases in LLMs, and they need to be monitored and corrected by developers after scaling up.

Hallucinations and Biases in LLMs

Hallucinations in LLMs refer to instances where the model generates outputs that do not align with real-world facts or context. This can lead to the propagation of misinformation, especially in critical sectors like healthcare and education where the accuracy of information is of utmost importance. Similarly, bias in LLMs can result in outputs that favor certain perspectives over others, potentially leading to the reinforcement of harmful stereotypes and discrimination.

For example, if someone asks who won the Nobel Prize in Physics in 2050, an LLM may confidently come up with a name, even though that year's prize hasn't even been announced.

Bias in AI and LLMs is another significant issue. It refers to these models' inclination to favor specific outputs or decisions based on their training data. If the training data is predominantly from a specific region, the model might show a bias toward that region's language, culture, or perspectives. If the training data contains inherent biases, such as gender or racial bias, the AI system might produce skewed or discriminatory outputs.

For example, ChatGPT and many other LLMs have been blamed for bias around gender, color, and other attributes. One of the most commonly observed cases involved image generation tools such as Midjourney and DALL·E by OpenAI: whenever users asked for an image of a nurse, it generated a woman, but when they asked for a doctor, it generated a man, which offended many people.

Mitigating hallucinations and bias in AI systems involves refining model training, using verification techniques, and ensuring the training data is diverse and representative. Finding a balance between maximizing the model's potential and avoiding these issues remains challenging.

These kinds of hallucinations can be useful in creative domains like media and fiction writing, enabling the generation of unique and innovative content.

I hope you found some value in this article. Soon I will be releasing a blog explaining various kinds of LLM models with their amazing specs, the whole timeline of upgrades that took place in LLMs, the Transformer's working & mechanisms in LLMs, and emergent abilities in LLMs, & it's going to be LEGEN.....DARY🔥🔥

If you liked my article, please react to it, and connect with me on Twitter if you are also a tech enthusiast. I would love to collaborate with people and share the experience of tech😄😄.

My Twitter Profile:

Aryan_2407
