An Introduction to Large Language Models

Author: Theodore Odeluga

Introduction

Large language models (LLMs) are currently proving to be the most promising area for the future of artificial intelligence, as seen in examples such as OpenAI’s flagship product ChatGPT.

Out of the company’s portfolio (including DALL·E, OpenAI Five and OpenAI Codex), GPT is the project that has gained the most attention, and it provides another fascinating illustration of how artificial intelligence has captured the public’s imagination.

Other notable milestones in recent AI history include Deep Blue, the chess computer that beat Garry Kasparov, the then world champion, back in 1997, and AlphaGo, the software developed by DeepMind, which in 2016 beat Lee Sedol, one of the world’s strongest players of Go, a complex game of strategy that originated in China more than 2,500 years ago.

In this article I’ll try to cut through the sensational headlines and noise surrounding AI and present what I think are the most interesting aspects of the subject, as gleaned from others’ research and my own exploration.

The aim here is to filter out the more hyperbolic claims made on behalf of LLM systems and focus on the key characteristics of these important tools.

Before we get into specifics, it’s worth beginning with a working definition to better understand GPT and similar applications.

A large language model (like GPT) is essentially a data and instruction set used for forming statements, based on the probability of one word following another in a given context of communication (such as a statement or question), according to the conventions of human language.

Within this method, each word is assessed on the basis of its next-in-line probability (as defined by the context*). This in turn creates a framework – or model – of common structures in human communication.

Armed with this statistical information, artificially intelligent software can predict a plausible next word in a sentence or phrase when constructing the response to a query or comment.
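
To make the idea concrete, here is a minimal sketch (in Python, with an invented ten-word “corpus”) of predicting a next word from simple word-pair statistics; real LLMs learn vastly richer statistics over billions of documents:

```python
from collections import Counter, defaultdict

# A tiny invented "training corpus" - real models learn from billions of words.
corpus = "the cat sat on the mat the cat ate the fish".split()

# Count how often each word follows another (bigram counts).
follows = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    follows[current_word][next_word] += 1

def predict_next(word):
    """Return the most probable next word and its estimated probability."""
    counts = follows[word]
    best, count = counts.most_common(1)[0]
    return best, count / sum(counts.values())

print(predict_next("the"))  # ('cat', 0.5) - "cat" follows "the" twice out of four
```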

Applications of this technique include everything from chatbots to smart assistants and, of course, programs such as ChatGPT.

*A point of clarification: the term context in large language models is derived from the term context length.

The context length of a large language model defines the number of text-based tokens the LLM can “remember” when generating responses (more about tokens later in the article).
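
As a minimal sketch of what a context limit means in practice, the snippet below simply keeps the most recent tokens of a conversation; the token list and the limit of four are invented for illustration:

```python
def trim_to_context(tokens, context_length):
    """Keep only the most recent tokens that fit in the model's context window."""
    return tokens[-context_length:]

conversation = ["Hello", ",", " how", " are", " you", " today", "?"]
print(trim_to_context(conversation, 4))  # [' are', ' you', ' today', '?']
```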

This will be the only occasion where the term "context" is associated with its more specific meaning for large language models.

Most usage of the term in this article will be based on the general (and more familiar) meaning. I only make the differentiation here to avoid confusion.

Since its release in the spring of 2023, GPT-4 has attracted considerable attention from both the mainstream and technical media, with much speculation about its potential and much discussion of the improvements in its abilities since version 3.5.

ChatGPT (the GPT standing for Generative Pre-trained Transformer) is a generative artificial intelligence application capable of working with text, images, audio and video.

It is termed generative because GPT can ‘generate’ data in response to other data.

It does this by first being ‘trained’ on a large amount of information (hence the ‘pretrained’ aspect).

One can think of ‘training’ in AI terms as simply the way an artificial intelligence ‘learns’ (more details below).

An LLM like GPT will study these vast amounts of data to understand what is expected in different forms of communication.

In large language models, the “large” aspect is very loosely defined.

At present, the datasets LLMs learn from can encompass a large share of everything ever published on the internet, while the “large” can equally refer to a larger number of parameters in the model.

There is no standard size for the amount of training data required by an LLM.

Neural Networks

To facilitate its learning, GPT-4 uses a construct known as a ‘neural network’, a computerized structure loosely modelled on the human brain.

This network runs on a collection of computers, or a set of interconnected processors.

The ‘architecture’ or design of this imitative system mimics the brain’s neuron structure – those interconnected nodes which transmit information from one part of the brain to another via electrochemical signals.

Through ‘reinforcement learning’ – that is, providing a ‘reward’ for useful results and exacting a penalty for failures – performance is improved via a repetitive procedure built around solving problems.

Reinforcement learning in a computer mirrors the way humans learn; like the computer, we learn by gathering information, separating the useful data from the non-useful as we go, and improving how we perform through trial and error.

Not unlike a computer running a neural network, when we learn, we experiment with data, mentally labelling results as useful if they lead to solutions or non-useful if they don’t.

Neural networks follow a broadly similar process, adjusting their internal parameters (their ‘knowledge’) over time as an aggregation of ever-improving results leads toward optimum capability.
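
The smallest possible illustration of this trial-and-error loop is a single artificial ‘neuron’ that nudges its weights whenever its output is wrong. The data (the logical AND function) and the learning rate are arbitrary choices for this sketch; real networks adjust millions or billions of weights:

```python
# A single artificial "neuron" learning the logical AND function by trial and error.
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]  # (inputs, target)

w1, w2, bias = 0.0, 0.0, 0.0
learning_rate = 0.1

for epoch in range(20):                          # repeat the "trials" many times
    for (x1, x2), target in data:
        output = 1 if (w1 * x1 + w2 * x2 + bias) > 0 else 0
        error = target - output                  # the reward/penalty signal
        # Nudge each weight in the direction that reduces the error.
        w1 += learning_rate * error * x1
        w2 += learning_rate * error * x2
        bias += learning_rate * error

print(w1, w2, bias)  # settles on weights that implement AND, e.g. 0.2, 0.1, -0.2
```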

As mentioned, the hardware equivalent of the brain’s neurons is essentially a collection of interconnected computers or several processors working together.

The point of this elaborate arrangement is to enable the reproduction of relevant information, with output that is responsive, realistic and believable in the context of a previous query or statement.

For this, an LLM draws on its huge database of pre-training information, much of it gathered by scraping millions of pages online and expanded with each new round of training.

The ‘Transformer’ aspect of GPT refers to the underlying neural network architecture, which ‘transforms’ an input sequence into a contextually relevant output sequence.
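
At the heart of the Transformer architecture is a mechanism called ‘attention’, which weighs how relevant every other input word is when producing each output. Below is a bare-bones, pure-Python sketch of scaled dot-product attention over three made-up word vectors; production models do this with large learned matrices and many parallel ‘heads’:

```python
import math

def attention(queries, keys, values):
    """Bare-bones scaled dot-product attention over lists of plain vectors."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        # Score every key against this query (dot product, scaled by vector size).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        # Softmax turns the scores into attention weights that sum to 1.
        exps = [math.exp(s) for s in scores]
        weights = [e / sum(exps) for e in exps]
        # The output is a weighted blend of the value vectors.
        outputs.append([sum(w * v[i] for w, v in zip(weights, values))
                        for i in range(len(values[0]))])
    return outputs

# Three made-up 2-dimensional word vectors attending to one another.
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(attention(vecs, vecs, vecs))
```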

Patterns

Machine learning algorithms also learn by using “patterns”.

In AI terms, a pattern is just a way to categorize information so other information of the same type is more recognizable.

In this way, patterns are useful for predicting what data to expect when working in a particular context. Different contexts might include types of numbers (such as primes, evens or decimals), specific subjects in photographs (breeds of cat, say, or car designs), or certain types of word (adjectives, verbs, nouns etc.).

Going back to the description near the start, an AI uses this kind of approach – along with statistics for analyzing the context, or occasion, in which certain words are more likely to be used – to predict the next word in a sentence.
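
As a trivial illustration of a pattern as a category, the snippet below tags numbers with the number ‘patterns’ mentioned above; the function and categories are purely illustrative:

```python
def categorize(n):
    """Tag a number with the simple 'patterns' it matches."""
    patterns = []
    if n % 2 == 0:
        patterns.append("even")
    if n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1)):
        patterns.append("prime")
    return patterns

for n in [2, 7, 8, 9]:
    print(n, categorize(n))  # 2: even+prime, 7: prime, 8: even, 9: neither
```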

Machine Learning

ChatGPT was initially and specifically designed to process text.

It does this using a machine learning (ML) technique called Natural Language Processing (more about NLP below).

Machine learning is a branch of artificial intelligence where the focus is on the computer guiding itself with pre-supplied data instead of direct human instruction.

Machine learning comes in four main forms: supervised, semi-supervised, unsupervised and reinforcement learning.

In supervised ML an algorithm is supplied with data that is “labelled” – that is, it comes with additional information.

This additional detail can include identification of the data by category, a description of the format (e.g. video, images or text) and the provision of contextual information.

Using the algorithm as its method, the computer processes the data along with these additional aids to complete its task successfully.
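
A minimal sketch of supervised learning, with an invented labelled dataset: each datapoint carries a category label, and a simple nearest-neighbour rule uses those labels to classify new, unseen data:

```python
# Labelled training data: (feature vector, label). The invented features might
# be weight in kg and ear length in cm; the labels "supervise" the learning.
labelled = [((4.0, 6.5), "cat"), ((5.0, 7.0), "cat"),
            ((30.0, 12.0), "dog"), ((25.0, 11.0), "dog")]

def classify(point):
    """Assign the label of the closest labelled example (1-nearest-neighbour)."""
    def distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5
    nearest = min(labelled, key=lambda example: distance(example[0], point))
    return nearest[1]

print(classify((4.5, 6.0)))    # 'cat'
print(classify((28.0, 12.5)))  # 'dog'
```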

In unsupervised learning, data is provided without any labelling.

Instead, the computer develops a system of categorization for different types of information by looking for patterns.

Identification of patterns here relies on the results of data analysis correlating with each other.

The clearer the relationships between each of the different datapoints, the easier it is for the computer to recognize the pattern.

A datapoint is a vector or “datum” (singular piece of data).

In data science (and other areas of mathematics), vectors are sets of numeric values which represent measurable characteristics, observations and conditions.

In ML, datapoints correspond with information in all its forms.
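
Putting these ideas together, the sketch below treats each unlabelled datapoint as a two-dimensional vector and lets a bare-bones k-means-style loop discover two groups on its own; the data and starting centres are invented for illustration:

```python
# Unlabelled datapoints, each a 2-dimensional vector of measured features.
points = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (8.0, 8.2), (7.9, 8.0), (8.3, 7.8)]

def kmeans(points, centres, rounds=10):
    """Bare-bones k-means: assign each point to its nearest centre, re-average."""
    for _ in range(rounds):
        clusters = [[] for _ in centres]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centres]
            clusters[distances.index(min(distances))].append(p)
        centres = [tuple(sum(vals) / len(vals) for vals in zip(*cluster))
                   for cluster in clusters if cluster]
    return centres

# Two deliberately bad starting guesses; the loop still finds the two groups.
print(kmeans(points, centres=[(0.0, 0.0), (10.0, 10.0)]))
# Roughly [(0.97, 1.03), (8.07, 8.0)] - one centre per natural cluster.
```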

In semi-supervised learning, the computer extrapolates from a small amount of data that is labelled and applies it to a dataset that is unlabeled.

It then continues processing much as described for supervised learning, using the small labelled sample as a kind of template or framework for understanding the rest of the overall information.

Reinforcement learning is based on an algorithm with a goal and a set of rules. The intention is that the rules will guide the computer toward the goal.

Through a system of “punishments” and “rewards”, the algorithm will correct errors the computer makes and reward achievements.

Through this “carrot-and-stick” approach, the machine continuously improves its capacity to achieve the goal, recording efficiencies (ways to achieve better results) and discarding the methods or routes which don’t lead to good results.
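
Here is a minimal sketch of that reward-and-punishment loop, using an invented “world” of five squares in a row where the goal is the right-most square. The agent keeps a score (a “Q-value”) for each move in each square, rewarding moves that reach the goal and mildly penalising every other step:

```python
import random

# A "world" of five squares in a row; the goal is square 4.
# Actions: move left (-1) or right (+1).
GOAL, SQUARES, ACTIONS = 4, 5, (-1, +1)
q = {(s, a): 0.0 for s in range(SQUARES) for a in ACTIONS}  # learned move scores

for episode in range(200):
    state = 0
    while state != GOAL:
        if random.random() < 0.1:
            action = random.choice(ACTIONS)                     # explore occasionally
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])  # exploit best known move
        next_state = min(max(state + action, 0), SQUARES - 1)   # stay inside the world
        reward = 10.0 if next_state == GOAL else -1.0           # carrot vs stick
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # Nudge this move's score toward the reward plus discounted future value.
        q[(state, action)] += 0.5 * (reward + 0.9 * best_next - q[(state, action)])
        state = next_state

# The learned policy: the best action in every non-goal square is +1 (move right).
print({s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(SQUARES - 1)})
```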

Natural Language Processing

Natural Language Processing is a system that exploits all of the above capabilities to enable a computer to work with human language.

NLP facilitates a machine’s “understanding” of human language by defining its key components (syntax, semantics, grammar and so on).

This begins with unstructured data that is refined through preprocessing (preparing the data by simplifying words to their root form and removing stop words such as “the”, “for” and “with”), training (feeding software with sample data) and finally, deployment (integrating a model into a production environment).
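
A minimal sketch of the preprocessing stage, with a deliberately naive suffix-stripper standing in for real stemming and an invented, tiny stop-word list:

```python
STOP_WORDS = {"the", "for", "with", "a", "is", "of"}  # a tiny illustrative list

def stem(word):
    """Very naive stemming: strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    """Lowercase, split, drop stop words and reduce words to a rough root form."""
    return [stem(w) for w in text.lower().split() if w not in STOP_WORDS]

print(preprocess("The cats played with the ball for hours"))
# ['cat', 'play', 'ball', 'hour']
```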

Text Memory

While a large language model’s capability with human language is considerable, it must contend with how much text it can work with. This is determined by its text memory.

An LLM’s text memory is based on the maximum number of “tokens” it can handle. A token in this context is the smallest unit of text in human language (as processed by an LLM).

This is often based on single words, but tokens can also be based on segments such as single letters, special or punctuation characters or even parts of words.
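
One way to see real tokens is OpenAI’s open-source tiktoken tokenizer (assuming the package is installed); encoding a sentence shows how common words map to single tokens while rarer words split into sub-word fragments:

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by GPT-4-era models

text = "Tokenization splits text into pieces."
token_ids = enc.encode(text)
print(len(token_ids), "tokens")
print([enc.decode([t]) for t in token_ids])
# e.g. ['Token', 'ization', ' splits', ' text', ' into', ' pieces', '.']
```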

Steerability

Steerability refers to the way a user can influence the ‘personality’ and behavior of a chat program.

Through use of a simple prompt, the user can direct the application to communicate in the style of a desired personality and even provide knowledgeable responses to specific technical queries.
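
In chat-style APIs this steering is typically done with a ‘system’ message placed ahead of the user’s messages. The sketch below shows the idea as plain data; the field names follow the common OpenAI-style chat format, but treat the details as illustrative:

```python
# A steering ("system") prompt followed by the user's actual question.
messages = [
    {"role": "system",
     "content": "You are a patient physics teacher. Answer in plain English, "
                "using one everyday analogy per answer."},
    {"role": "user",
     "content": "Why is the sky blue?"},
]
# Sending this list to a chat-completion endpoint steers the model's
# "personality" and style for the whole conversation.
```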

Steering the model this way doesn’t stop the LLM behind the software from providing false or distorted information, however (more about ‘hallucinations’ below), but in ChatGPT this capability improved noticeably between version 3.5 and version 4.0.

Multimodal Capacity

As well as its responsiveness to directed ‘personalities’, GPT’s versatility comes from a multimodal capacity.

The term multimodal here pertains to the program’s ability to process media other than the text it was limited to in previous versions.

As a result, queries can be fed to the model using “mixed media” (e.g. text plus images). For example, it can explain or describe the contents of an image in response to a related question.
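
A mixed-media query can be sketched in the same way. In OpenAI-style chat formats a single user message can carry both a text part and an image part; the field names and the placeholder URL below are illustrative:

```python
# One user message mixing a text part and an image part (the URL is a placeholder).
messages = [
    {"role": "user",
     "content": [
         {"type": "text",
          "text": "What breed of cat is shown in this photo?"},
         {"type": "image_url",
          "image_url": {"url": "https://example.com/cat.jpg"}},
     ]},
]
# A multimodal model can then answer the text question *about* the image.
```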

Multimodality represents a major milestone in the evolution of LLMs and opens up a new set of possibilities for different forms of human-AI communication, along with new opportunities in the types of problems mass-market AI systems can solve.

Parameters

Technically successful LLMs rely on effective processing of parameters. In statistics, a parameter is a value describing one of the defining characteristics of a population.

At the outset of training, a dataset’s major parameters would be largely unknown.

At this point, a system would take a sample datum from the information to make an inference about each parameter in question.

Each inference made by testing a random sample in this context would be an incremental piece of a “statistical puzzle”.

Through this testing, the system would develop its understanding of information in the overall model.

This in turn would better define the selected parameter being studied.

With better knowledge of critical parameters, more accurate predictions could be made about important aspects of the data.

To clarify one important concept: a data model is simply an abstract conceptual representation of the different elements of related information comprising a dataset.

The data model effectively describes the relationships between these associated informational elements.

In real-world terms, then, a more concrete form of data model could be anything from the fields of a spreadsheet to the records of an enterprise-level database.

Each of these systems is an embodiment of the definitive representation (or model) of a dataset.

Going back to the concept of a population, the “population” in this sense is the collection of unique individuals each represented by a record in the dataset.

To recap, while at the start of analysis, definitive parameters are unknown, through the process of trial-and-error testing of samples (not unlike the process of testing a series of theoretical hypotheses against a developing body of knowledge on a previously unresearched subject), an understanding of the overall data would itself develop, enabling more accurate predictions about that data in future.
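
The sketch below mirrors that process for the simplest possible parameter, a population mean: starting from no knowledge, increasingly large random samples make the estimate increasingly accurate. The “population” here is invented for illustration:

```python
import random

random.seed(1)
# An invented "population" whose true mean (the unknown parameter) is 50.
population = [random.gauss(50, 10) for _ in range(100_000)]

for sample_size in (10, 100, 1_000, 10_000):
    sample = random.sample(population, sample_size)   # take a random sample
    estimate = sum(sample) / sample_size              # infer the parameter from it
    print(f"after a sample of {sample_size:>6}, estimated mean = {estimate:.2f}")
# Larger samples pull the estimate ever closer to the true value of ~50.
```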

Safeguards and guardrails

Jailbreaking is the term given to removing the restrictions built into a large language model to block its ability to provide illegal or unethical information.

As with legitimate use of an LLM, this is accomplished through carefully crafted prompts, sometimes framed as a fictional scenario to “fool” the system into thinking it’s only discussing something illicit hypothetically.

The advantage of removing these restrictions or “guardrails” in an LLM is to make the model more flexible, enabling functionality not possible with the previous limitations in place.

However, the disadvantages (apart from the risk of generating illegal or unethical content) include voiding any agreement on the part of the model’s developer to provide support, failures in the product’s performance, and incompatibility with other software the model might need to work with.

As ever in technology, the benefits and risks of AI systems come down to engineering payoffs and tradeoffs.

'Slowness'

Immediately following the release of GPT-4.0, initial excitement gave way to disappointment among some early adopters, who noticed a decline in speed from 3.5 to 4.0.

The ‘slowness’ of an LLM is down to a number of factors.

A greater context size, additional safeguard rules and greater popularity adding to the demands on OpenAI’s servers all mean that version 4.0 of GPT can do more, but at the expense of speed, owing to a bigger workload.

It’s expected that performance will improve over time with newer iterations.

Hallucinations

Hallucinations are the major limitation of LLM applications.

In AI terms, hallucinations are the generation of incorrect or nonsensical data in response to user queries.

GPT-4.0 has improved in this respect but isn’t completely free of the problem.

Distortions in LLM output come about through a range of issues, mainly connected with the input that has gone into them – for example, training data, much of which has come from the internet and hasn’t been quality assured – and even human error, such as flawed prompt engineering.

Limitations are also based on the fact that LLM systems don’t always have all the information required readily available.

All of the above simply underlines that while LLMs are extremely powerful, human oversight can’t be taken out of the loop entirely when using them.

Conclusion

2023 will go down in history as a pivotal landmark in the development of AI, and for the foreseeable future (the “foreseeable” being quite limited given the speed and ingenuity of the advancement) LLMs in particular look set to be a major vehicle for the most common and practical implementations of the technology. If the race is only just getting started, it’s safe to say we’ve seen just the tip of the iceberg, and I’m sure that in years to come LLMs will still be full of surprises.