Generative AI Neural Networks: Definition, Architecture, Types & Models

In the context of Generative AI, neural networks are a type of machine learning algorithm that emulates aspects of human reasoning by performing tasks such as understanding language, recognizing images, and generating new content. The Multilayer Perceptron, whose modern training method was popularized in 1986 by David Rumelhart, Geoffrey Hinton, and Ronald Williams, is often considered the first true neural network of this kind. It featured hidden layers capable of transforming and combining raw data inputs into new features. A key computational principle enabling this and all other deep learning variants of neural networks is called backpropagation, which estimates how to change the weights of the many connections in the neural net in order to reduce errors between predicted and actual results.

A generic neural network consists of three kinds of layers. The first layer is the Input Layer, in which numerical training data is fed into the system. This layer is followed by one or more Hidden Layers made up of units (nodes) that transform the data through mathematical computations. The last layer is called the Output Layer, where the neural network’s final predictions or classifications for the original input data are produced.

Diagram: how data flows through a neural network, from the input layer (receiving raw data), through hidden layers (processing via weights and activation functions), to the output layer (generating the final results).

Neural networks can be divided into three broad categories: feedforward, convolutional, and recurrent networks. Feedforward networks are the most basic type: they pass the input through the hidden layers a single time to produce an output, and, as the name implies, information flows in only one direction, from the input toward the output. Convolutional Neural Networks (CNNs) are more complex feedforward networks used to process visual inputs and are also common in understanding audio inputs. Recurrent Neural Networks (RNNs) process sequential data such as language, leveraging memory of earlier items in a sequence to inform later predictions.

Generative Adversarial Networks (GANs) pair two networks, typically built on convolutional and feedforward structures, to create hyperrealistic data. In GANs, one neural network generates new data and another evaluates it, providing feedback that pushes the generator to create increasingly realistic data until the evaluator can no longer discern generated data from real data. GANs are most often applied not to sequential data such as language or audio, but to visual data such as images and videos.

The most commonly used architecture for Generative AI is called the transformer. Transformers leverage parallel processing and attention mechanisms that allow the model to focus on the most relevant parts of the input data, overcoming limitations of earlier sequential models like RNNs. As a result, transformer models make fuller use of the generalizable learning of neural networks and are the basis for large language models (LLMs) such as GPT-4, as well as for image-generation systems such as Stable Diffusion and DALL-E.

Generative AI uses neural networks that have sizeable numbers of layers (deep learning) in order to automate processes through something called representation learning. For example, a three-layer feedforward neural network might classify images between cats and dogs by first recognizing whether the object is fuzzy or smooth, then distinguishing whether it has whiskers or a long nose, and finally classifying it as either a cat or a dog based on the features identified in the previous layers. Humans can see all the intermediate data transformations at work in this kind of system, but in deep learning models these intermediate learning patterns are so high-dimensional and abstract that they cannot be easily interpreted, even by the developers of the model.

Neural network processes try to minimize the difference between the actual target and the predicted output. They do this by utilizing feedback and error-correcting tasks based on three training strategies: supervised learning, which uses labeled datasets; unsupervised learning, which uses unlabeled datasets; and reinforcement learning, which is based on rewarding desired behavior.

The data transformations in each layer of the neural network are shaped by something called an activation function. Activation functions are a crucial processing element because they determine whether, and how strongly, a particular neuron should be activated given its weighted inputs. In other words, they decide what each neuron passes on when the neural network has patterned the input data as expected. Activation functions introduce non-linear properties to the neural network, which ensure that it can learn complex patterns, relationships, and features in data.

Neural networks take a tremendous amount of training to work properly. They are therefore frequently used in generative AI models that produce complex outputs such as deepfake images (artwork, design, etc.), photorealistic animated images (such as those used in insurance cost analysis or real estate listings), audio clips that imitate the tone and pitch of specific voices (such as voice synthesis for people with speech impairments), and of course text such as news stories or essays.

Neural networks, especially deep learning models, take intensive computing power to build and maintain. Graphics processing units (GPUs) are commonly employed because they handle the parallel matrix operations of neural network training far more efficiently than central processing units (CPUs). However, smaller versions of neural networks can be run on laptops or even cellphones.

While GPUs are most commonly associated with building and running neural networks, application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs) are also frequently used. FPGAs are attractive in the AI sector because they can be reprogrammed for multiple different computational uses. ASICs cannot change their function once produced, but they are the most efficient in terms of cost and energy use, so they are increasingly adopted for AI models once those models reach sufficient scale. This diagram describes in more detail the three most common hardware options for running AI algorithms.

Neural networks can take factors from the real world and leverage those factors in order to create entirely new outputs such as realistic imagery of non-existent people, convincing deepfake videos, and natural-sounding text that mimics a human author. The results are often astonishing, such as this image generated by Stability AI’s Stable Diffusion model. As the competitive marketplace for generative AI expands, companies have noticed the utility of applying the technology through using neural networks in familiar, but very expensive, sectors such as education and healthcare. McKinsey & Company estimates that generative AI in these two sectors alone has the potential to create value ranging from $200 billion to over $600 billion annually, through efficiencies that represent as much as two-thirds of tasks performed by professionals.

Generative AI neural networks operate within the broader ecosystem of generative technologies, which also includes transformers and other advanced architectures. Explore the complete landscape of generative AI models and architectures here.

What Are Neural Networks in Generative AI?

Neural networks are algorithms that facilitate machine learning by imitating processes of the human brain. They consist of layers of interconnected nodes, referred to as neurons, which perform computations on data.

A neural network’s three main components are an input layer, intermediate (hidden) layers, and the output layer.

  • Input Layer: The input layer obtains the data that the neural network needs to analyze. If the input variable is a photograph, it will contain the raw pixel values of the image.
  • Hidden Layers: The intermediate layers conduct the majority of the work. Each neuron computes a weighted sum of the inputs it receives (the weights being what the network has learned), applies an activation function to that sum, and sends the result to the next layer.
  • Output Layer: The input and hidden layers’ computations lead to the output layer’s generation of results. In image applications, the output might represent the likelihood of an object’s existence or the final generation of a new image. A minimal sketch of a single neuron’s computation follows this list.
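As a concrete (and deliberately tiny) illustration of the neuron computation described above, the following Python sketch computes one neuron's output from made-up inputs, weights, and bias; the numbers are hypothetical and only show the weighted-sum-plus-activation pattern.

```python
import numpy as np

# Hypothetical neuron with three inputs: output = activation(weighted sum + bias).
inputs  = np.array([0.5, -1.2, 3.0])   # values arriving from the previous layer
weights = np.array([0.8,  0.1, -0.4])  # learned connection strengths
bias    = 0.2

z = np.dot(weights, inputs) + bias     # weighted sum of inputs plus bias
output = max(0.0, z)                   # ReLU activation decides how strongly the neuron "fires"
print(output)
```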

Neural networks’ learning is driven by the difference between predicted and actual outcomes, expressed through loss functions that quantify the network’s estimation errors. The model adjusts the weights and biases of its internal nodes to minimize these losses according to a chosen optimization procedure.

This self-adjusting ability allows neural networks to adapt their model outputs based on their learning patterns, making them extremely valuable for generative AI tasks involving image, text, audio, or video creation.

For example, in image creation, if the model generates an image containing a cat with five legs when instructed to paint a cat, the model will compare its output to the parameters or tags for standard cats it has been trained on. Because the image it created does not conform to the learning model’s expectations, the neural network will gradually adjust its inner workings to reduce such erroneous outputs in the future.

How Do Neural Nets Mimic the Human Brain?

Neural networks mimic the human brain in the way they are structured and how they process information.

Biological neurons in the human brain communicate with each other using connected receptors, similar to how artificial neurons in a neural network communicate with one another over “weights.” While biological neurons signal the arrival of a new message (input data) through electrical pulses, artificial neurons rely on mathematical functions. Once a signal reaches an artificial neuron, it processes the message and sends it to the next connected artificial neuron.

The brain itself is made up of billions of neurons that work in parallel, processing information simultaneously to recognize and understand patterns. Neural networks follow this model by having numerous artificial neurons grouped into different layers to detect and comprehend complex patterns in the data.

Diagram: a side-by-side comparison of how the human brain and a layered artificial neural network process input, highlighting their structural and functional parallels.

As a simplified analogy of how neural networks model the human brain, consider how our brains process sensory data. If we hear the word “dog,” our brain draws on past experiences to understand what a “dog” looks and sounds like. Maybe it recalls an image of our dog barking in the garden: it recognizes the sound of the barking, retrieves the meaning of “dog,” imagines a dog, and links the barking sound to that dog.

Neural networks do something similar but in a more mechanical way. Relevant data inputs are sent to the first set of artificial neurons (the input layer). These neurons pass the information to the next sets of neurons (hidden layers) which then together comb through various possibilities and arrive at a solution that is passed on to the output layer.

Which type of neural network do we use for which type of data? Input data’s format decides which type of neural network is used. Simple sequential data is easy to handle with RNNs. For multidimensional data like images, CNNs are essential. RNNs are popularly used for voice-command applications (for instance, say “Ok Google” followed by a question) while CNNs work for facial recognition.

Neural networks also contain enabling factors that mimic the human brain in an imperfect, mechanical form. The “weights” that modify input signals in an artificial neuron are similar to how experiences affect the strength of connections between biological neurons. The “activation function” that determines whether a neuron produces output is similar to the “threshold potential” in real neurons, which, once exceeded, triggers an electrical signal.

This overall process, with its different layers and the manipulation of signals by weights, works as follows. First, the data flows through a series of artificial neurons in different layers or “paths,” as in the example of recognizing a dog from a bark.

Second, some “paths” between different layers get more attention based on importance. These receive and modify input signals based on the weights assigned to each artificial neuron; the higher the weight, the more important the data is in delivering a successful output. This second step is analogous to the human brain using prior experiences to adjust the strength of pathways between neurons to recognize a dog bark better. This process of neural alteration in the human brain is described as “learning.” Neural networks mimic it by reviewing the model’s final decisions and adjusting weights to produce better outcomes in the future. The more data a model is trained on, the better its outputs, facilitating improved and more accurate identifications and predictions.

What’s the Role of Neural Networks in GenAI Model Outputs?

Neural networks in generative AI (GenAI) are responsible for identifying intricate patterns in the training data and generating outputs that display similar patterns. The outputs are sometimes nearly indistinguishable from human creations, even when the content is fabricated (so-called “hallucinations”) or faked. A neural network’s role in ensuring GenAI communicates seamlessly and logically lies in its capacity to remember and organize contextual information so that outputs are consistent with prior statements or actions.

Neural networks play a central role in GenAI by approximating complex functions that translate inputs into desired outputs. They learn from vast data collections to model relations between input and output, including sophisticated correlations. Further, they “remember” information from earlier portions of the input to ensure that outputs are internally consistent. The two key technical aspects of neural networks that are crucial for GenAI outputs are the multilayer structure of the networks, which allows them to learn complex patterns via representation learning, and recurrent or attention mechanisms that carry informative context forward for later use.

Representation learning is commonly achieved by deep neural networks with many layers. The more complex the output a neural network must generate, the more layers it often needs. Each layer learns some structure in the data relevant to the task at hand, with early layers learning straightforward patterns and later layers learning more complex structures. This diagram shows a simple three-layer setup, with one hidden layer, with inputs going to three neurons, followed by an output node.

Neural networks used for generative AI outputs of text or images rely on recurrent (or, in transformers, attention) mechanisms that allow the network to “remember” past patterns that influenced earlier outputs, so that later outputs can follow the flow of what came before. This is particularly important for GenAI outputs such as narratives or arguments that must remain logically consistent throughout.

As an illustration, the GPT-3 model behind the original ChatGPT uses 96 self-attention layers and roughly 175 billion parameters to accurately interpret requests. The image below shows the architecture of the transformer neural network.

GANs use two neural networks: a Generator that creates new data resembling the training data, and a Discriminator that assesses how similar the generated data is to the real data. According to GAN inventor Ian Goodfellow, they can be thought of as “building an image one pixel at a time”.

The GANs process illustrated below sets the creation and refinement of a new image into five stages of operation between the two neural networks.

Infographic: the five-stage GAN training process, in which the generator turns noise into low-fidelity images, the discriminator evaluates them against real images, and feedback refines the generator until the output approaches a realistic image.

Input Layer to Output Layer: How Data Flows

In a neural network, data flows through the architecture from the input layer, through any hidden layers, and finally to the output layer. The structure of a neural network can usually be visualized as in the diagram below. The nodes of the network are artificial neurons, connected by weighted links; each weight determines the strength of the influence one neuron has on another. For example, in a three-layer MLP neural network, the activation of one neuron is computed from a weighted sum of the activation values of the neurons in the previous layer, plus a bias term.
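Written compactly (the symbols here are generic notation, not taken from any diagram in this article), that computation for a single neuron j in layer l is:

a_j^{(l)} = \sigma\left( \sum_i w_{ij}^{(l)} \, a_i^{(l-1)} + b_j^{(l)} \right)

where a_i^{(l-1)} are the previous layer’s activations, w_{ij}^{(l)} are the connection weights, b_j^{(l)} is the bias term, and \sigma is the activation function.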

The data flow can be simply visualized as a series of arrows pointing from the input at one end of the diagram to the output at the other, as follows.

  • Input Layer: In this first stage, data enters the neural network’s input layer, which is comprised of neurons that receive the various features of the data as external inputs. In computer vision, these features are pixel values. For text generation, they are tokenized, indexed, or embedded numerical representations of a word or an entire sequence of words.
  • Hidden Layers: The next stage is the system’s hidden layers, which sit between the input and output layers. These take weighted sums of inputs from the previous layer, apply an activation function, and send the outputs to the next layer. Networks can have many hidden layers and many neurons per layer. Increasing the number of hidden layers and nodes is how deep learning got its name and why it has become so successful, as increasingly complex and non-linear calculations can be performed on incoming data. However, because many different architectures can produce the same output, simply adding more layers does not guarantee success.
  • Output Layer: The last stage is the output layer, whose neurons produce the network’s completed answer. For a natural language generation task, this is a predicted response made of decoded word tokens; for visual art tasks, it is the completed image. A small end-to-end sketch of this flow follows this list.
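To make the full input-to-output flow concrete, here is a minimal NumPy sketch of one forward pass through a network with a single hidden layer; the layer sizes and random weights are hypothetical stand-ins for what a trained model would have learned.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

# Hypothetical network: 4 input features -> 5 hidden units -> 3 output scores.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 5)), np.zeros(5)
W2, b2 = rng.normal(size=(5, 3)), np.zeros(3)

x = np.array([0.2, -0.7, 1.5, 0.0])   # input layer: raw feature values
h = relu(x @ W1 + b1)                 # hidden layer: weighted sums passed through an activation
scores = h @ W2 + b2                  # output layer: one score per class or token
print(scores)
```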

It is common to talk about different neurons performing “calculations” on input data. This is shorthand: in reality, each neuron simply applies fixed mathematical operations, a weighted sum followed by an activation function, to its inputs.

Additional layers can be added to neural networks to create more complex architectures for additional functionality. These include input pipelines for preprocessing raw data, skip or residual connections which allow information to skip certain layers and avoid vanishing gradients, output layers for directing activation values to loss functions, and more.

To extend the basic flowchart of the model architecture, feedback loops can be added that allow outputs to be rerouted as additional inputs, which is useful for RNNs, reinforcement learning, and variational models.

Why Activation Functions Matter in Neural Networks

Activation functions in neural networks introduce non-linearities that enable the networks to solve complex problems and decipher intricate data patterns. Without activation functions, a multi-layer neural network would behave similar to a single-layer linear model, as multiple layers without activation can be collapsed into one layer.

The choice of activation function has a profound impact on a model’s output. For example, choosing the correct activation functions allows GANs to generate hyper-realistic images rather than blobby or cartoonish creations. The three most common activation functions are ReLU, sigmoid, and softmax.

  • ReLU (Rectified Linear Unit), defined as f(x) = max(0, x). ReLU outputs zero for any negative input and a linear output for any positive input, so activations occur only on the positive side of the input axis. It is the most widely used activation function because it limits the vanishing gradient issue of the sigmoid and tanh functions that hampers deep learning. In other words, it allows gradients to flow further back through the network, enabling quicker learning. One downside is that neurons can get stuck outputting zero for every input, a problem known as “dying ReLU,” which leaves them effectively unusable. This video from the StatQuest YouTube channel explains ReLU and its downside.
  • Sigmoid, defined as f(x) = 1 / (1 + e^(-x)). The sigmoid function squashes any input into an output between 0 and 1, so the outputs can be interpreted as probabilities. A downside is that it limits the gradient it passes backward. Remember that during the backpropagation training process, gradients are passed back to the previous layers; the sigmoid’s derivative is never larger than 0.25, so gradients shrink as they are multiplied back through each layer, and learning in early layers becomes vanishingly slow. This episode from the 3Blue1Brown YouTube channel explains the intuition behind the sigmoid function.
  • Softmax, defined as f(x_i) = e^(x_i) / Σ_{j=1..K} e^(x_j) for i = 1 to K. The softmax activation function ensures that outputs are between 0 and 1 and that they sum to 1. This makes softmax interpretable for classification-type problems, where you want probabilities that must sum to 1. The downside is that the vanishing-gradient problem associated with sigmoid also affects softmax.
Infographic: the ReLU, Sigmoid, and Softmax activation functions, with each function’s formula and graph. ReLU passes only positive values, Sigmoid maps inputs to a 0–1 range, and Softmax converts outputs into probabilities across multiple classes.
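The three functions above are simple enough to implement directly. This short NumPy sketch mirrors the formulas in the list; the sample input values are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))      # [0. 0. 3.]  -> only positive inputs pass through
print(sigmoid(x))   # values squashed into (0, 1)
print(softmax(x))   # non-negative values that sum to 1
```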

Historical Development and Milestones of Neural Networks

Neural networks are designed to process information in a way that mimics the human brain. They seek to imitate how information flows through neurons, with “synthetic neurons” passing data through different layers, modifying it based on learned parameters, and ultimately delivering an output. Beginning in the 1940s and 1950s, researchers in computer science developed mathematical formulas and algorithms that mimicked this structure. The structure of neural networks has evolved multiple times since, with dramatic shifts in functionality, complexity, and applications.

This is how the historical development of neural networks that led to generative AI progressed.

  • 1958: Perceptron. Frank Rosenblatt of the Cornell Aeronautical Laboratory, with funding from the US Office of Naval Research, created the first perceptron, a neural network built from artificial neurons and a feedback-driven learning rule. The perceptron was capable of learning to solve specific problems by adjusting its connections, mimicking biological neurons. It was first simulated in software and later implemented as dedicated hardware (the Mark I Perceptron). The concept of the perceptron was formalized in 1958. The architecture comprised a single layer of inputs and one output, and it was limited to tasks that were linearly separable, meaning the output could be derived mathematically from the input parameters and some weighting factors. And, similarly to biological neurons, the perceptron was designed to ‘fire’ only when the input supplied enough information for it to complete the assigned task. This challenged researchers, as learning suddenly became less of a simple mathematical equation and more like something the computer had to be trained to resolve.
  • 1960s: MLPs and problems. In the 1960s, researchers including Rosenblatt tried to overcome the perceptron’s limitations by developing multi-layered perceptrons (MLPs) with multiple computational nodes between inputs and outputs. However, each node became a new decision-making layer that required tuning, which quickly overwhelmed existing computational power. By 1970, researchers widely believed it was impractical to train neural networks with several layers or more complex nodes. Even in 1986, when Geoffrey Hinton and his colleagues published a landmark paper describing backpropagation, which enabled neural networks with multiple layers to learn, computers still lacked the processing speed to make it fully practical.
  • 2012: CNNs. Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton at the University of Toronto leveraged newly affordable computing power to classify images far better than earlier models. They used the enhanced GPUs now commonly found in PCs and gaming systems to train a Convolutional Neural Network (CNN) whose multiple layers could learn to capture hierarchies in images. In the years leading up to 2012, a host of new algorithms and architectures had been introduced to the sector, including sparse autoencoders, representation learning approaches, restricted Boltzmann machines, deep belief networks, and so on, each with their own pros and cons. A series of prizes and research funding at Stanford University, MIT, and other institutions soon resulted in new funds and focus streams that allowed researchers to explore variants of these emerging architectures.
  • 2014: GANs. Ian Goodfellow introduced Generative Adversarial Networks (GANs) in 2014 while a PhD student at the University of Montreal. GANs disrupted all previous models because they changed the training structure. Rather than a single model being trained on a specific set of inputs and outputs, a GAN pits two competing neural networks against each other: one generates and the other discriminates, catching flaws and honing the process. GANs have become some of the most powerful architectures for building generative AI capabilities.
  • 2018 And Beyond: Transformers And Natural Language Models. The popularity of Google’s search engine provided billions of queries’ worth of training data that researchers could leverage. The 2017 paper Attention Is All You Need, by researchers at Google Brain, Google Research, and the University of Toronto, introduced the Transformer architecture, which allowed models to learn relationships between different words in unstructured text data without relying on recurrent layers.

Generative AI made possible by neural networks is frequently the outcome of stacks or hybrid variations of several different architectures, such as the design of CLIP, OpenAI’s Contrastive Language-Image Pre-Training model.

In the 1960s and 1980s, two architectures were sometimes used in parallel. Today, researchers are using stacks that add 10 or more layers of neural networks. For example, Capgemini’s note on the AI art generation process followed by VQGAN and CLIP states that VQGAN has 46 layers of multiple building blocks with 192 channels, while CLIP has up to 224 channels.

Understanding how neural networks work is essential for grasping the foundations of generative AI, but seeing these concepts in action brings deeper clarity. From GPT-based text generators to diffusion models for image creation, today’s leading platforms bring neural architectures to life through practical tools. To explore which tools and platforms are leading the way, and how they build on the concepts you’ve just learned, check out our guide to Generative AI Tools and Platforms.

From Perceptrons to Deep Learning: A Timeline

This timeline presents a chronological view of the evolution of neural networks and highlights key moments in research, technology, and industry that have driven progress over the years.

Neural Networks (NNs) began in 1943 at the hands of Warren McCulloch and Walter Pitts. They produced a paper entitled ‘A Logical Calculus of the Ideas Immanent in Nervous Activity’, which detailed how neurons communicate and collectively respond to stimuli in real life.

Frank Rosenblatt, a research psychologist at the Cornell Aeronautical Laboratory, conducted research in the 1950s based on the aforementioned work. In 1958, he presented his “Perceptron”, the first fully functional, machine-based implementation of its kind. The functional Perceptron used simplified counterparts of concepts such as dendrites, axons, and synapses, together with adjustable electrical signals, to mimic how real neurons collectively respond to stimuli.

However, broader interest grew only gradually. Rosenblatt’s public demonstrations and the surrounding publicity, arriving at a time when new ways of thinking about science and machines were circulating widely, helped attract numerous scientists who were eager to develop artificial intelligence (AI) machines at a rapid pace.

The first major set of scientific breakthroughs in NNs came in the early-to-mid 1980s. David Rumelhart, Geoffrey Hinton, and Ronald Williams, working with the parallel distributed processing group centered at the University of California, San Diego, addressed the main problem with perceptrons: the inability of single-layer networks to learn tasks that are not linearly separable, including many pattern recognition tasks.

The invention of backpropagation, which allows NNs to pass error information back to nodes earlier in the network rather than only to the last layer, tackles the problem of complex tasks. It lets the network as a whole be tuned to perform a task instead of making ad-hoc adjustments to individual nodes. Backpropagation also eases the process of adjusting the artificial synapses (weights) between nodes, which is how NNs “learn”. This video explains backpropagation in an easily digestible way.

The increasing importance of computers and the Internet in everyday life, along with popular examples such as the Flappy Bird-playing AIs built in the early 2010s, raised interest and investment in NNs. Significant progress throughout the decade laid the groundwork for the NNs that we have today.

Noteworthy advancements in the 2010s include the “ImageNet moment” of 2012, when a deep convolutional network (AlexNet) decisively won the ImageNet Large Scale Visual Recognition Challenge and demonstrated the power of GPU-trained deep learning. Researchers and companies, including Google, rapidly expanded their AI and machine learning programs in response, producing advances in self-driving cars, deep learning NNs, and speech recognition. This flurry of activity is widely regarded as a watershed moment for NNs.

Looking toward the future, work already under way on ethical AI, explainable AI, sustainable AI, AI for social good, quantum AI, brain-computer interfaces (BCIs), and artificial general intelligence (AGI) will shape NN progress. It is difficult to lay out firm timelines for these specialties, as they are still very much in their infancy, but researchers generally consider their success closely tied to NNs. Future NNs will likely sit at the synergistic intersection of BCIs, quantum computing, and ethical AI.

Key Breakthroughs That Enabled Generative AI

The following landmark developments in neural network research have been key breakthroughs that enabled generative AI.

  • Multi-layer Networks: In 1986, David Rumelhart and colleagues published their work on the backpropagation training algorithm. It democratized the design of multi-layer neural networks, particularly feedforward types containing one or more hidden layers, enabling the processing of more complex data. Multi-layer networks laid the groundwork for generative AI by allowing networks to learn rich internal representations from which new outputs resembling the training data could later be produced.
  • CNNs: Yann LeCun and colleagues developed Convolutional Neural Networks (CNNs) in the late 1980s and 1990s, inspired by biological processes in which cells respond to stimuli in their local receptive field. CNNs enabled complex, automated feature identification in images, making tasks like automatic image classification easier. This CNN video explains them well.
  • GANs: Ian Goodfellow and colleagues at the University of Montreal invented Generative Adversarial Networks (GANs) in 2014. GANs consist of two competing neural networks that generate and evaluate new patterns. GANs revolutionized generative AI by providing a coherent, scalable way to produce hyper-realistic fake images, video, and even data for algorithm training.
  • Variational Autoencoders (VAEs): VAEs, a more controllable and interpretable alternative to GANs, were first proposed by D.P. Kingma and M. Welling in 2014. They are probabilistic latent-variable models that can generate realistic outputs similar to the original training input. VAEs have also been used in med-tech to identify links between known diseases and genetic structures.
  • Transformers: The final major breakthrough in neural network research that undergirded advancements in generative AI was the invention of the transformer architecture in 2017 by Ashish Vaswani and colleagues at Google Brain. Transformers innovated with self-attention and multi-head attention. Self-attention allows the network to focus on the most relevant tokens in input data elements such as words, while multi-head attention runs several attention operations in parallel so the model can capture different kinds of relationships at once. Transformers are much more compute-efficient than older recurrent neural networks, and they can scale into massive architectures and effectively learn from both small and large datasets. Transformers enable new optimized outputs and further refinement in the generation of text, images, and even code. A minimal sketch of the self-attention computation follows this list.
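Since self-attention is the core operation named in the last bullet, here is a minimal NumPy sketch of scaled dot-product attention. The sequence length, embedding size, and random projection matrices are hypothetical; a real transformer runs several such heads in parallel (multi-head attention) inside much larger layers.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                 # project tokens to queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # how relevant each token is to every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                               # each output is a relevance-weighted mix of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                          # 4 hypothetical token embeddings, 8 dimensions each
Wq, Wk, Wv = [rng.normal(size=(8, 8)) for _ in range(3)]
print(self_attention(X, Wq, Wk, Wv).shape)           # (4, 8): one updated vector per token
```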

Breakthrough milestones in core neural network technology that enabled generative AI include the following.

| Breakthrough | Developer | Year | Description |
| --- | --- | --- | --- |
| Multi-layer networks | David Rumelhart, Geoffrey Hinton, and Ronald Williams | 1986 | Backpropagation algorithms that made it practical to train neural networks containing multiple hidden layers, enabling representation of more complex input patterns |
| Convolutional Neural Networks | Yann LeCun and colleagues | 1989 | A hierarchical approach in which early layers detect local image features and later layers combine them into larger structures, allowing faster feature identification and tasks such as automatic image classification |
| Generative Adversarial Networks | Ian Goodfellow | 2014 | A method of generating data that pits a generator against a discriminator, allowing realistic synthetic data to be produced with clarity and precision through an adversarial, evolution-like process |
| Variational Autoencoders | D.P. Kingma and M. Welling | 2014 | A probabilistic model with a neural network structure that identifies statistical regularities in data, allowing for faster identification of differences and potential problems in images, such as cancerous growths |
| Transformer architecture | Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. | 2017 | A neural network architecture based on self-attention and feed-forward layers, delivering state-of-the-art precision, recall, and F1 scores |

Backpropagation and Gradient Descent Explained

Backpropagation and gradient descent are two key algorithms that allow neural networks to learn by reducing prediction error, or “loss”. Here’s a simple explanation of each.

  • Backpropagation: Backpropagation is an algorithm that calculates the contribution of each weight in a neural network to the overall error. It does this by analyzing how changes to each weight impact the final output when an input is fed into the model. Based on these analyses, backpropagation assigns a “blame score” to each weight for the model’s error, which is technically referred to as the gradient of the loss function with respect to that weight.
  • Gradient descent: Gradient descent is used to update the weights of a neural network based on the blame scores, to reduce loss in future predictions. It does this by moving each weight a small step in the direction opposite its gradient, scaled by the “learning rate”. When a learning rate is too high, it risks skipping over the optimal weight setting, while a very low learning rate increases training time.

Even though they are distinct operations, backpropagation and gradient descent are often bundled into one, because backpropagation itself does not change weights but simply calculates how much they should change based on their effect on model output error. These graphs conceptually explain the core ideas behind backpropagation and gradient descent.
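A tiny worked example makes the loop concrete. The sketch below fits a single weight w in the model prediction = w * x by repeatedly computing the gradient (the “blame score”) and stepping opposite to it; the data point and learning rate are made up for illustration.

```python
# One-parameter model: prediction = w * x; loss = (prediction - target)**2.
x, target = 2.0, 10.0          # hypothetical training example: the best w is 5.0
w, learning_rate = 0.0, 0.05

for step in range(20):
    prediction = w * x
    loss = (prediction - target) ** 2
    grad = 2 * (prediction - target) * x   # backpropagation: d(loss)/d(w), the weight's blame score
    w -= learning_rate * grad              # gradient descent: step opposite the gradient
print(w, loss)                             # w approaches 5.0 and the loss approaches zero
```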

As a simple example, consider a sports player attempting to make a shot. If they throw a ball and miss, they need to know which direction they should aim and how much force they should apply so they can successfully hit the target the next time. Backpropagation functions like a coach that observes where the ball landed in relation to its target and gives the player feedback on the amount and direction they should adjust their aim. Gradient descent then functions like the player who takes that advice and adjusts the next shot.

Emergence of Deep Neural Networks and Beyond

Deep learning is an extension of traditional neural networks with many more hidden layers, sometimes numbering in the hundreds or even thousands. Deep neural networks (DNNs) take advantage of this depth to automatically discover hierarchical patterns within the training data.

For most of the history of neural networks, people used limited numbers of shallow layers due to limited computational power and high training times. However, this changed in the 2010s with a combination of rapid improvements in computational power and breakthroughs in AI research.

This chart shows how the rise of deep learning tracks with breakthroughs in hardware, algorithms, and architectures.

Deep neural networks more closely mimic human brains by stacking hundreds or even thousands of layers of “artificial neurons” to process an image, so that the lowest layers capture simple, local patterns while the higher layers capture progressively more complex and abstract ones.

This video from UPenn gives a great overview of how DNNs are designed and the reasons they work.

Once researchers worked out the mathematical foundations for DNNs, GPUs made the implementation possible. GPUs are ideal for the matrix operations that drive deep learning algorithms, enabling smooth, parallel computation when feeding data inputs through the network and evaluating many features across layers of neurons. At the same time, better AI architectures like convolutional and recurrent neural networks exploit structure in the data, such as spatial locality and sequence order, making large-scale training practical and reducing latency.

DNNs have different properties than shallower networks. For example, because learned features are distributed across many layers, the outputs of deep networks tend to be less sensitive to small errors in any individual weight adjustment. This is beneficial because a model’s parameters must be adjusted repeatedly based on training data to improve its predictive ability, and deep networks are proportionately less vulnerable to slight mistakes made in the adjustment process.

Shallower networks generally must capture the mapping from inputs to outputs in only a few transformations, while deeper networks can rely on more hierarchical information processing, which generally produces smoother outputs with less noise.

“Dying ReLU” neurons that output zero for every input, and poor weight initialization, are two examples of challenges introduced when moving from shallow to deep neural networks. Researchers at the Stanford AI Lab and NYU’s Courant Institute, among others, are working to improve methods for solving these problems.

DNNs can be thought of as allowing generative AI systems to be designed so that they learn simpler patterns, before moving on to applying those simpler patterns to even more complicated tasks. This is a key reason that DNNs form the basis for the majority of current neural network architectures, with the remaining special cases generally working in conjunction with DNNs.

Types of Neural Networks Used in Generative AI

The main types of neural networks used in generative AI are Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks, Generative Adversarial Networks (GANs), and Transformers. Each of these architectures has been designed to uniquely process specific data types such as images, sequences, and language.

CNNs take images as input, RNNs and LSTMs are used for sequence generation and analysis, GANs create synthetic data, and Transformers serve as the primary architecture for language and multimodal tasks.

These models have unique roles as described in the subsections below.

  • CNNs for Image Generation and Understanding: CNNs effectively identify visual patterns in two or three dimensions. Their role in generative AI is to understand and create images, frequently used in GANs.
  • RNNs and LSTMs for Text and Audio Sequences: RNNs have a feedback loop structure for processing sequences of data. LSTMs are an RNN variant that mitigate learning issues over long sequences. Both are used in generative AI for analysis and creation of text as well as audio.
  • GANs for Hyper-Realistic Data Generation: GANs effectively create hyper-realistic synthetic data with wider applications. Their output can replicate actual images and videos indistinguishably from real ones.
  • Transformers for Language and Multimodal Tasks: Transformers model the structure and meaning of language by converting words into numerical tokens and using attention to capture the relationships among them. They are the primary architecture for language generation and, increasingly, multimodal tasks.

The diagrams below show the basic structures of these typical neural networks, with the unique elements of each architecture highlighted.

Infographic: structural comparison of four major neural network types: CNNs for image recognition, RNNs and LSTMs for sequential text or audio, GANs for generating synthetic data through generator-discriminator training, and Transformers for language and multimodal processing.

CNNs for Image Generation and Understanding

CNNs have unique mathematical properties that allow them to recognize patterns with visual geometry. Their model operates like a map, with layers of landmarks performing different functions in guiding how the neural network understands data. So, for example, when given an image as input, the various structures of a CNN recognize and label different parts of the image before piecing the whole picture back together.

CNNs are often used in GANs for producing and understanding images, because they excel at pattern recognition in visual data. With their multi-layer structures, CNNs in GANs help the generator determine what to create, and the discriminator determine whether what was created is usable or needs to be created again.

Following are examples of generative AI tools that use CNNs.

  • Google DeepDream: This computer vision tool generates a dreamlike psychedelic effect in pictures including patterns of animal faces. “GoogleTwiST” is a variation of the model that adds the user’s Twitter handle, which this illustration shows applied to a famous portrait of Napoleon.
  • Convolutional Neural Network Image Analysis Using TensorFlow: A basic TensorFlow framework uses a CNN to analyze and recreate images.
  • Auto-Painting Tools: CNNs in generative AI are transforming photographs and videos into works of art that look painted by renowned artists. These tools use a combination of CNNs and GANs. The GANs create hyper-realistic images using style transfer technology, while the CNNs identify patterns that help the tools imitate the brush strokes of famous painters.

RNNs and LSTMs for Text and Audio Sequences

RNNs carry information from earlier items in a sequence forward so that it can inform how later items are interpreted and how the sequence is likely to continue. In the example below, the first picture of a cow in a pasture is “remembered” even after the third picture. This memory of a series of items across time makes RNNs well suited to analyzing time series data.

Researchers originally tried to recognize sequential patterns with regular neural networks, without much success. RNNs were born in 1986, when the conventional feedforward structure was replaced with a feedback loop that allowed the network to retain information from previous data points and use it to process the current data point.

However, RNNs began to struggle when they had to retain the information of long sequences, often misplacing the importance of non-consecutive data points. This issue was resolved with the development in 1997 of Long Short-Term Memory (LSTM) networks, the RNN variant whose name has become nearly synonymous with recurrent architectures in the industry.

RNNs are used to generate text, audio, and video in generative AI. While RNNs are being phased out in favor of better-performing architectures, researchers note that they are important in understanding the development of modern AI.

Following are examples of generative AI tools that use RNNs or LSTMs.

  • Google Translate LSTM: The original Google Translate utilized LSTMs before the advent of Google’s Transformer architecture, which offers superior performance. A heat map generated from Google Translate data shows how many LSTMs were used to translate a particular word into German.
  • Text-Based LSTM Poetry Generators: Generative AI poetry tools that create text based on specific inputs can use LSTMs to analyze and produce poetry. The basic framework below was created in Python and TensorFlow to utilize RNNs and LSTMs for learning and producing poetry.
  • Audio Generators: To produce varieties of sounds at different frequencies, researchers at South Korea’s Sungkyunkwan University propose a modified RNN structure that applies to neurofeedback audio generators and other such tools. This diagram shows basic neural oscillator RNN frameworks.
  • Story Generator Projects: AI-based automatic storytelling for film and game production is in its nascent stages. One early-stage research proposal, by the MIT Neurobiology Department’s Alyssa McCarthy, would use a BIM representation of the story to build the natural language processing (NLP) capabilities of a Transformer, along with plot maps tied to distinct emotional temperatures.

GANs for Hyper-Realistic Data Generation

The two-network structure of GANs is used to produce hyper-realistic data in diverse applications. The typical components of a GAN model are shown below.

GANs use a generator and discriminator to create synthetic data that looks indistinguishable from true data. The generator creates fake data that gets fed into the discriminator alongside real data. The discriminator then has to correctly identify the true data from the fake data.

The cycle continues, with the generator receiving feedback from the discriminator’s judgments on each round. This feedback helps the generator improve its output quality until the data produced is comparable to the real data. A minimal sketch of this training loop appears below.
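The adversarial loop just described can be sketched in a few lines of TensorFlow. The layer sizes, batch size, and flattened 28x28 “images” below are hypothetical placeholders, not any production GAN design.

```python
import tensorflow as tf

latent_dim, data_dim, batch_size = 64, 28 * 28, 32   # hypothetical sizes

# Generator: turns random noise into a flattened fake "image".
generator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(latent_dim,)),
    tf.keras.layers.Dense(data_dim, activation="sigmoid"),
])

# Discriminator: outputs the probability that its input is real.
discriminator = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(data_dim,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

bce = tf.keras.losses.BinaryCrossentropy()
g_opt, d_opt = tf.keras.optimizers.Adam(1e-4), tf.keras.optimizers.Adam(1e-4)

def train_step(real_images):                          # real_images: (batch_size, data_dim)
    noise = tf.random.normal([batch_size, latent_dim])

    # 1. Train the discriminator: real samples are labeled 1, generated samples 0.
    with tf.GradientTape() as d_tape:
        fake_images = generator(noise, training=True)
        d_loss = bce(tf.ones((batch_size, 1)), discriminator(real_images, training=True)) + \
                 bce(tf.zeros((batch_size, 1)), discriminator(fake_images, training=True))
    d_grads = d_tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(d_grads, discriminator.trainable_variables))

    # 2. Train the generator: it improves when the discriminator mistakes its output for real.
    with tf.GradientTape() as g_tape:
        fake_images = generator(noise, training=True)
        g_loss = bce(tf.ones((batch_size, 1)), discriminator(fake_images, training=True))
    g_grads = g_tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(g_grads, generator.trainable_variables))
    return d_loss, g_loss
```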

GANs are extremely valuable in generative AI because hyper-realistic synthetic data can be used when training real models. This is especially useful when there is not enough real data, which is frequently the case for medical images such as X-rays and MRI scans.

These are examples of tools that utilize GANs in generative AI.

  • Deep Fake Technology: GANs create hyper-realistic false images or videos that imitate the voice or actions of a real person.
  • Neural Networks that Generate Art: GANs that create art using generative artificial intelligence can adapt the output for various media that are indistinguishable from traditional artworks. The DreamUp and Artify generative art tools shown below are capable of producing art that mimics the brushstrokes of renowned physical painters.
  • Synthetic Data Generation for Research Purposes: GANs are able to produce synthetic data that retains the characteristics of the original dataset without compromising data privacy. This diagram shows how a GAN generates new examples based on a hidden distribution of the real dataset. The generated synthetic examples can then be used in various domains and industries.

Transformers for Language and Multimodal Tasks

Transformers, introduced in the paper Attention Is All You Need by researchers from Google in late 2017, are a versatile architecture used in generative AI for text, image, and audio tasks.

Transformers use multi-headed attention techniques to analyze particular sections of input text and discern which words carry the most meaning. This information is used to determine the relationships among the words, much as a collection of words and their surrounding text are analyzed to understand the main point of news articles, research papers, and poetry.

The strength of this technique grows with parallelization, or handling multiple contexts simultaneously. This competence has helped Transformers attain state-of-the-art performance in a range of natural language processing (NLP) tasks, including text generation, information extraction, translation, summarization, and question-answering.

This graphic illustrates the pipeline of how input data passes through a transformer neural network.

In addition to NLP tasks, Transformers are utilized in generative AI for computer vision and audio processing. Researchers at Microsoft, Google, and other organizations have built video-processing models based on transformers that deploy groundbreaking predictive tracking techniques.

Inside Imagination, an emerging augmented reality (XAR) artwork project that seeks to merge the physical and digital spaces through contextual storytelling, plans to utilize a video-based transformer trained on a vast amount of video data.

These are examples of tools that use transformers and related sequence models.

  • Google Smart Compose: Google’s Gmail email service uses an LSTM- and GRU-powered Smart Compose feature that suggests message completions to accelerate writing. This analysis shows how Google’s LSTM architecture converts real-world data into easy-to-use predictive models.

Generative AI studies typically refer to the foundational works by Raffel et al. (2020), who introduced the T5 (Text-to-Text Transfer Transformer) model, and OpenAI’s 2019 GPT-2 paper.

Overview: Different Architectures for Different Tasks

The major types of neural nets used in generative AI are Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, Generative Adversarial Networks (GANs), and Transformers. Each architecture serves specific input/output needs and is suited for particular types of generative tasks.

  • CNNs are designed to process data with a grid-like topology, such as visual imagery, where relationships between pixels can be leveraged to identify patterns.
  • RNNs and LSTMs retain information from previous data inputs and are essential for tasks involving sequences, such as natural language processing and audio or video stream generation.
  • GANs pit two neural networks against one another, a generator and a discriminator, to produce hyper-realistic content with ever-improving accuracy and plausibility.
  • Transformers are designed for easy parallelization of language modeling tasks but have been adapted for various other tasks.

All of these architectures can be built as deep learning models, though RNNs and LSTMs are now used far less often, having largely been superseded by Transformers. It should be noted that deep learning models are simply neural networks with more hidden layers between input and output, and often more complex architectures.

Each of these AI models works by creating a statistical representation of the training material, allowing the model to emulate the core characteristics of that material.

CNNs for Image Generation and Understanding

Convolutional Neural Networks (CNNs) are a specialized type of feedforward neural network designed primarily for processing and generating visual data. They are highly flexible and capable of handling visual content in different modalities, such as videos, photos, and even 3D images.

Positional arrays of numbers, such as pixel values in 2D images or volumetric data in 3D images, make up the input data for CNNs. They work by examining local parts of the input, recognizing patterns, and pooling those patterns together to learn the bigger picture. These visual patterns range from very basic features like edges or object parts to more advanced patterns such as entire shapes or descriptors of whole objects.

Although CNNs can be both sequential feedforward networks and recursive in structure, they primarily employ a deeper, more complex design that allows them to analyze data in greater detail. Multiple layers of convolutional analysis are stacked on top of one another like flat, stacked shelves in a rack, with each shelf representing a layer of analysis that identifies, pools, and passes on the visual features detected by the preceding layer.

In generative AI, CNNs can either create images as standalone generators or be embedded in larger image-generation architectures such as Stable Diffusion and image-oriented GAN variants. CNNs can also enhance image quality through upscaling techniques.

The three main types of layers in CNNs are convolutional, pooling, and fully connected layers, serving the following functions (a minimal sketch of all three follows this list).

  1. Convolutional layers. The core components of CNNs are the convolutional layers. They apply mathematical operations called convolutions to detect local patterns such as edges. Each convolutional layer is made up of feature detectors, known as filters or kernels, each of which responds to a specific local pattern. These filters are applied to small regions of the input image, and the results are collated into a feature map of the observed pattern. In the images below, the pixels in light blue represent the filters, which are moved across the image to detect various patterns.
  2. Pooling layers. CNNs are designed to detect spatial patterns in images. Pooling layers reduce the number of parameters and decrease computation times for deeper convolutional layers without losing relevant spatial information. By collapsing the output of the immediately previous layer into smaller sections, pooling layers produce smaller, more manageable feature maps.
  3. Fully connected layers. The last layers of a CNN are fully connected layers, which link the extracted features back into a standard neural network structure that produces the final desired outputs.
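As a rough illustration of how those three layer types fit together, here is a minimal Keras-style sketch; the image size, filter counts, and 10-class output are hypothetical, not taken from any specific tool mentioned in this article.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),  # convolutional: detect local patterns
    tf.keras.layers.MaxPooling2D(),                                             # pooling: shrink the feature maps
    tf.keras.layers.Conv2D(32, 3, activation="relu"),                           # deeper filters combine simpler patterns
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(10, activation="softmax"),                            # fully connected: final class probabilities
])
model.summary()
```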

    This diagram traces how an image moves through the various layers of a CNN and what operations are performed at each stage.

    In addition to being designed with new tasks in mind, CNNs can be simply fine-tuned modifications of feedforward neural networks. Doing so allow them to retain a majority of the computational efficiency that CNNs are prized for, while delivering somewhat greater flexibility in analyzing visual inputs.

    According to McKinsey, CNNs are especially effective in completing image translation tasks (changing an image from one format to another), coloring specific items in an image, or restoring damaged image sections. They have numerous applications in self-driving cars, robotics, healthcare (to name just a few) which will often involve generative AI tasks or applications.

    The sub-type of CNNs known as stacked CNNs combines supervised, unsupervised, and semi-supervised learning methods to generate data rather than merely classify particular images. This type of network places layers of standard CNNs both before and after additional processing layers: the early layers identify and understand the features of the original input images, and the later layers produce entirely new images that represent the learned features.

    Sampling methods that create generative images by exploring the latent space of various generative neural networks are slower and have a two-to-three-times greater chance of producing unrecognizable outputs. Stacked CNNs more often produce higher-quality images in an acceptable amount of time, making them popular base architectures for image-generating AI.

    RNNs and LSTMs for Text and Audio Sequences

    Recurrent Neural Networks (RNNs) and Long Short-Term Memory cells (LSTMs) are specific types of neural network architectures commonly used for text and audio generation in generative AI.

    RNNs are a type of artificial neural network designed for working with time-series or sequential data. They have the ability to loop back on themselves, feeding some of the information from previous steps back into the network. This allows them to understand context and the relevant relationships within a sequence of data.

    However, RNNs tend to have difficulty retaining context when the relevant information is not immediately adjacent in the sequence, for example when the meaning of a later word depends on something mentioned much earlier in the prompt. While additional mechanisms can be added to RNNs to help in this area, researchers have found that they are often unable to learn complicated tasks, which makes them less useful than other architectures.

    While LSTMs include the same basic looping structure of RNNs, they have an additional structure of input, output, and forget gates. These gates determine what information to add, what to output, and what to discard, respectively. This allows them to keep track of the context of information much more easily. As a result, such architectures perform better than RNNs on complex language tasks.

    RNNs have been the backbone of Natural Language Processing (NLP) tasks, including language understanding, language generation, and language translation, since their advent in the 1980s. Researchers began to report mixed results on their efficacy in the 2000s, as the limitations of RNNs became increasingly apparent and other architectures such as CNNs and GANs began to be used for specific generative AI applications. However, when combined with additional mechanisms, RNNs and LSTMs remain popular for a range of natural language processing tasks and are even used for speech and video, enabling numerous applications across industries.

    In the video below, Kelsey Houston-Edwards provides a detailed explanation of how RNNs work.

    Unlike CNNs with fixed-size inputs and outputs, RNNs have variable input and output sizes, which makes them better suited to generating sequences. An early application was character-based text generation, in which a model was trained on text one character at a time. Once the model had learned from a sufficient amount of text, it could begin generating sentences. This is a simple example of how RNNs process sequential data.

    This diagram illustrates how RNNs process an input sequence one element at a time. At each step, the current input is combined with the previous hidden state to yield an updated hidden state. The current hidden state is then passed to the output layer to produce the expected output.
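
    A hedged sketch of that character-based setup is shown below, using an LSTM cell to carry the hidden state forward. The vocabulary, hidden size, and toy training string are placeholder assumptions.

```python
import torch
import torch.nn as nn

class CharRNN(nn.Module):
    """Predicts the next character from the characters seen so far."""
    def __init__(self, vocab_size: int, hidden_size: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, tokens, hidden=None):
        x = self.embed(tokens)                 # (batch, seq_len, hidden)
        out, hidden = self.lstm(x, hidden)     # hidden state carries context forward
        return self.head(out), hidden          # logits over the next character

text = "hello world"                           # toy stand-in for a real training corpus
vocab = sorted(set(text))
ids = torch.tensor([[vocab.index(c) for c in text]])
model = CharRNN(vocab_size=len(vocab))
logits, _ = model(ids[:, :-1])                 # predict each next character
loss = nn.functional.cross_entropy(logits.reshape(-1, len(vocab)), ids[:, 1:].reshape(-1))
print(loss.item())
```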

    With a feedback loop in their architecture, RNNs remember the previous network state. This enables them to maintain contextual relationships among variables, resulting in emergent behavior based on prior input. This makes RNNs far better than CNNs at handling sequential time-based inputs such as text or audio.

    RNNs are not without their drawbacks, notably vanishing gradients and overfitting, and they require larger training datasets than CNNs. Long Short-Term Memory (LSTM) networks were introduced in 1997 by Sepp Hochreiter and Jürgen Schmidhuber to address these drawbacks. Studies demonstrated that LSTMs dramatically reduce the risk of vanishing gradients, and they are also better than plain RNNs at selectively memorizing or forgetting inputs.

    The diagram below illustrates the basic operating principle of LSTM cells. Three gates in the architecture control information flow to and from the LSTM cell. The picture also depicts the memory cell, which retains information over both long and short periods of time. Though LSTMs are more complicated than RNNs, this additional complexity makes them better for many time-series tasks.

    The ease with which LSTMs can remember and forget previous inputs makes them better at identifying relationships among variables that appear much earlier in a sequence. This makes them advantageous for Natural Language Processing tasks.

    This video from Andrew Ng’s Deep Learning Specialization course at Coursera explains how LSTMs work.

    In the graphic below, we see that LSTMs were not only an advance over standard RNN training techniques but also continued to improve in their own right. The graphic shows that LSTMs began solving the vanishing-gradient problem of RNNs soon after their development.

    Generative Pre-trained Transformer (GPT) language models, of which OpenAI’s ChatGPT is one, do not use RNNs or LSTMs internally; as the name indicates, they are built on the transformer architecture described in the sections below.

    Common uses of RNNs and LSTMs for text and audio include speech-to-text applications, music generation, and language translation.

    GANs for Hyper-Realistic Data Generation

    GANs (Generative Adversarial Networks) are a type of neural network architecture introduced in 2014 by Ian Goodfellow and colleagues to generate realistic data. GANs utilize a unique architecture in which two different networks – called the generator and the discriminator – are trained simultaneously and in competition with one another. The generator creates synthetic data intended to look realistic, while the discriminator determines whether each sample is real or fake. The two models are effectively pitted against one another in an adversarial game: as the generator improves its realism, the discriminator must improve its ability to distinguish real from generated data to stay one step ahead.

    According to Goodfellow, a key insight that enabled GANs as a generative model was the observation that “When training a model with a given data distribution, the model won’t create any data point worse than the worst one in the training data it is imitating, because any point in data space that it generates which has some observable feature worse than that of the worst training data will be easily identifiable and thus will be classified as a fake by the discriminator.”

    GANs work as follows:

    1. The generator produces synthetic data samples.
    2. This output goes to the discriminator along with a batch of real data samples from the training dataset. The discriminator then decides whether it thinks each input is real or fake.
    3. The discriminator then provides feedback to the generator, indicating how confidently it judged each sample to be real or fake compared with the ground truth of whether the sample was generated or drawn from the training set.
    4. The generator uses this feedback to improve its outputs.

    This feedback loop between the generator and discriminator continues until the generator produces data that closely resembles the training data or until neither network continues to improve.
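
    The four steps above can be compressed into a short PyTorch sketch of a single training iteration. The tiny fully connected networks, the binary cross-entropy objective, and the synthetic "real" data are illustrative assumptions; real GANs for images use convolutional generators and discriminators.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, batch = 16, 2, 64
generator = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))
discriminator = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1))

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(batch, data_dim) * 0.5 + 2.0          # stand-in for real training samples

# Steps 1-2: the generator produces fakes; the discriminator scores real and fake batches.
fake = generator(torch.randn(batch, latent_dim))
d_loss = bce(discriminator(real), torch.ones(batch, 1)) + \
         bce(discriminator(fake.detach()), torch.zeros(batch, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Steps 3-4: the generator is updated with the discriminator's feedback
# (it is rewarded when its fakes are scored as real).
g_loss = bce(discriminator(fake), torch.ones(batch, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```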

    Diagram of how Generative Adversarial Networks work with generator and discriminator flow
    This infographic visually explains how Generative Adversarial Networks (GANs) operate. It depicts two core components — the generator and discriminator — working together to improve data generation. The generator produces fake outputs based on noise, while the discriminator evaluates whether the data is real or generated. A feedback loop between both networks helps improve the quality of the generated images over time. This is a foundational structure in generative AI, especially for image synthesis tasks.

    The breakthrough power of GANs is now used in many Generative AI applications.

    • Deepfakes: GANs can produce hyper-realistic images of non-existent human faces. These were initially used, and still often are, to create pornographic or otherwise objectionable deepfake videos featuring non-consenting individuals’ faces. In recent years, more positive applications have emerged, such as reconstructing the likenesses of missing persons in videos.
    • Automobile Design: The US car manufacturer Ford has begun utilizing GANs to assist in design. Designers feed artificial lighting conditions and projective transformations into simulations, which lead to new design features.

    GANs require a comparatively small amount of input data to work well, and they have fewer parameters than the transformer models that have become the standard for most generative text AI. Analysts project that GANs will continue to be used in Generative AI APIs and products.

    Transformers for Language and Multimodal Tasks

    Transformers are a neural network architecture introduced in 2017 by researchers at Google in a paper titled “Attention is All You Need”. This architecture was developed for natural language tasks but has since been adapted for image and audio applications to create multimodal models like OpenAI’s DALL-E 2 which generates digital images from user prompts given in natural language.

    This diagram shows a simplified Transformer neural network architecture consisting of an encoder and decoder component each with multiple identical stacked layers.

    Transformers are built on self-attention mechanisms that allow them to dynamically weigh relationships across sections of input data. In contrast to the recurrence in RNNs where input sequences are run one after the other, self-attention in transformers enables input sequences to be run together, with the model determining their importance.

    Transformers are usually built with an encoder and decoder stack structure. As explained in a related article on transformers, the encoder takes input, while the decoder produces output. The number of encoder and decoder blocks varies according to the application. The original transformer architecture utilized a 6-layer encoder and 6-layer decoder. BERT uses encoders only, while GPT-3 and other autoregressive language models use decoders only. Multimodal LLMs such as Flamingo utilize various encoder-decoder combinations.

    Given their flexibility, scalability, and high parallelizability, transformers dominate the generative AI field today. Their self-attention mechanisms allow them to account for the significance of different components in an input, enabling them to generate coherent output sequences.

    Another reason for their prevalence is their suitability for training on large datasets, which is a requirement of generative AI systems. This means their architecture can be easily adapted for specific applications or optimization.

    Transformers pair well with large datasets because they can compute using matrix operations rather than sequential ones. This is evident in the training of autoregressive text generation models on massive datasets like Common Crawl (over 60 terabytes of data) and the LAION datasets used for image generation.

    Two key features of transformers contribute to their effectiveness: the ability to model long sequences of input data, and attention mechanisms, which allow the model to focus its “attention” on the necessary parts of the input. While RNNs struggle to maintain context in long input sequences, transformers can process thousands of words, or even whole images, at once, allowing them to generate more precise and better-informed output.

    Attention mechanisms are built on the “query-key-value” system that lies at a transformer’s core. For every element of the input, the model computes a query vector (what that element is looking for), a key vector (what it offers), and a value vector (the information it carries). The transformer then compares each query against all keys, and uses the degree of match to take a weighted sum of the value vectors, producing the output associated with that input element.

    While attention is not exclusive to transformers (it is also used in CNNs and RNNs), the type of attention within transformers is more sophisticated than in other architectures. This is because all elements of the input can be linked to one another, allowing the model to consider the relationship of each piece of information in the batch with every other item.

    This image shows how the ‘Query-Key-Value’ attention mechanism works in a transformer network.
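
    The query-key-value computation described above can be written in a few lines of NumPy. This is a single-head simplification with random projection matrices; production transformers use multiple attention heads and learned projections.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Weights each value by how well its key matches the query."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # query-key similarity
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over keys
    return weights @ V                                         # weighted sum of values

seq_len, d_model = 4, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model))      # embeddings for 4 input tokens
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = scaled_dot_product_attention(x @ Wq, x @ Wk, x @ Wv)
print(out.shape)   # (4, 8): one context-aware vector per input token
```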

    Multimodal systems are the other major category of applications for transformers. Multimodal generative AI is the production of outputs based on two or more different types of input—sound, sight, text, etc—capable of “understanding” and creating in the same way humans do, by correlating information across different sensory inputs.

    DALL-E 2 and Claude AI from Anthropic are examples of multimodal systems. Multimodal systems have the advantage of making machines much more efficient. They can, for example, produce better results in tasks such as locating objects in images using descriptive words, or creating descriptive articles based on audio/video input.

    Transformers enable multimodal capabilities because they learn patterns in how users express ideas while improving accuracy in generating responses. For example, a user may describe an object using shapes, colors, or attributes. Through training, the model can learn how to process these inputs simultaneously while correlating them with relevant outputs. A more complex example is a word that could mean an object, action, and an adjective. The model can learn, through the context of usage in input data, how to generate outputs that correspond to each of these meanings.

    Most research into transformers to date has been within the realm of natural language. This is likely due to the enormous success of AI in text generation, chatbots, search engines, and other applications. However, as funding increases for non-text generative AI projects, multimodal transformers are likely to become much more popular.

    Models such as OpenAI’s CLIP, a multimodal image and text model, are proving the concept of using transformers for images is valid. CLIP associates pictures with relevant text by predicting which words a human would assign to them. Previous attempts to create similar models were inefficient. In the case of CLIP, the use of transformers facilitates the simultaneous training of two modalities. This allows them to better align the different features of the two data types compared to their predecessors.

    Attention Mechanisms: The Game Changer

    Neural network attention mechanisms are systems that dynamically highlight the most relevant information within a data set. They are called ‘attention mechanisms’ because they allow models to prioritize and focus on the parts of the input data that are relevant to a particular task.

    This diagram shows how attention weights are calculated in the transformer architecture to identify focus areas in the training data.

    The first use of attention in neural networks is credited to Dzmitry Bahdanau in 2015. In his research, he describes an alignment model that can focus on certain words when predicting the next one. Researchers quickly realized how useful this innovation was for a range of tasks in AI.

    Attention has allowed for breakthrough improvements in text generation models such as transformers, enabling these systems to focus on entire phrases instead of analyzing sequences one element at a time in a single direction.

    Attention-based architectures still make use of traditional feed-forward layers, but many experts agree they represent a second golden age of AI because of their vastly improved processing efficiency and output quality. Attention mechanisms rely on parallelization rather than sequential processing, allowing significantly faster training times.

    Attention mechanisms also greatly expand the context windows that generative AI models can handle. OpenAI’s GPT-3 supported a context window of 2,048 tokens for analyzing user prompts. In early 2023, the company increased the context window to 32,768 tokens in its next-generation GPT-4 model.

    Encoder-Decoder Architectures: The Foundation of LLMs

    Encoder-decoder architectures are types of neural networks used for sequence-to-sequence tasks, where inputs and outputs can have different lengths. They were originally developed for machine translation but have been adapted to many other applications, including generative AI text models like ChatGPT and Claude. These models break down the creation of output sequences into two steps: first encoding the input into a fixed-length latent vector and then decoding it into the output sequence, as illustrated below.

    In traditional encoder-decoder models, the encoder is a stack of bi-directional RNN (BRNN) layers. Instead of processing the input only in its original order, the BRNN lets the encoder learn both the preceding and following context of each input element in every layer. The BRNN takes variable-length sequential input and encodes it into a fixed-size latent vector (often called a context vector) that summarizes the relevant information in the input sequence.

    In traditional encoder-decoder models, the decoder is made up of stacked RNN layers. The first input to the sequential layers of the decoder is a special start of sequence symbol. The latent vector created by the encoder is supplied to the first decoder layer as well. Each layer of the decoder takes input from the previously generated output as well as the latent vector. At the end of the decoder, an output layer generates the final output sequence.

    Architecture for traditional encoder-decoder models. Input tokens are passed sequentially into the encoder layers, which generate the latent vector z that is the input to the first decoder layer. The next token to the decoder is fed back from the previous output. This process continues until a stop sequence token is generated.
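
    A minimal sketch of this traditional RNN encoder-decoder loop is shown below, using GRU cells for brevity. The vocabulary size, hidden size, and special token IDs are arbitrary assumptions, and the untrained model will emit essentially random tokens; the point is the shape of the computation.

```python
import torch
import torch.nn as nn

vocab, hidden = 20, 32
embed = nn.Embedding(vocab, hidden)
encoder = nn.GRU(hidden, hidden, batch_first=True)
decoder = nn.GRU(hidden, hidden, batch_first=True)
output_layer = nn.Linear(hidden, vocab)

src = torch.randint(0, vocab, (1, 7))            # variable-length input sequence
_, latent = encoder(embed(src))                  # fixed-size latent (context) vector

SOS, EOS, generated = 1, 2, []                   # assumed start/stop token IDs
token = torch.tensor([[SOS]])                    # start-of-sequence symbol
state = latent
for _ in range(10):                              # decode one token at a time
    out, state = decoder(embed(token), state)    # previous output + latent context feed the decoder
    token = output_layer(out[:, -1]).argmax(dim=-1, keepdim=True)
    if token.item() == EOS:                      # stop when the stop-sequence token appears
        break
    generated.append(token.item())
print(generated)
```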

    This traditional RNN encoder-decoder model showed significant weaknesses, which led to the development of improved architectures. These included difficulty learning long-range dependencies in input sequences and the limited capacity of the fixed-size latent vector.

    Then in 2017, Google researchers introduced the transformer-based encoder-decoder model. In this new architecture, self-attention is applied at both the encoder and decoder stages, allowing each output token’s relation to all input tokens to be computed. In the decoder stage, each output token can also attend to all previously generated output tokens. This architecture resolves much of the long-range dependency learning and latent-vector capacity limitations of the traditional RNN architectures.

    Language models such as GPT and Claude use the decoder portion of transformer models as the core of their architecture. The input to these models is a sequence of tokens that represent the request given by the user. Their multi-layered architecture enables the models to understand nuances in the user request and produce relevant output sequences via the usage of self-attention mechanisms.

    Because the transformer decoder architecture is more capable than previous RNN encoders and decoders, most modern large language models no longer use recurrent components at all. Instead, a tokenizer breaks the input down into sub-units, and a stack of transformer decoder layers carries out the main generation work.

    LLM models such as ChatGPT and Claude operate under a two-step process when generating output.

    1. Initial tokenization
    2. The subsequent generation of one token at a time based on a sampling method

    This can be viewed as a sequential generation of each word in the sentence being constructed, where the text generated so far is used as the basis for predicting the next component.

    Tokenization is done by splitting the input text into its constituent components according to the rules of the encoding algorithm being used. In the ChatGPT models, the Byte Pair Encoding (BPE) scheme splits text into words and frequent sub-word pieces, which are then represented in numerical form for modeling purposes.
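
    For illustration, the open-source tiktoken library exposes byte-pair encodings compatible with OpenAI models; the specific encoding chosen below is an assumption made for the example.

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")       # one of the BPE encodings used by recent OpenAI models
tokens = enc.encode("Neural networks power generative AI.")
print(tokens)                                    # a list of integer token IDs
print([enc.decode([t]) for t in tokens])         # the subword pieces each ID represents
```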

    This tokenization process, turning natural language into machine language, is crucial to all neural network architectures that interact with users in natural languages. It is particularly vital to the workings of RNNs, which treat their inputs as one-dimensional sequences comprising multiple components.

    In the subsequent token-generation process, the input tokens (the user prompt) form the start of a sequence, and each newly generated token is appended to the end of that sequence for the next pass through the neural network. While the tokenization methods used by RNNs and transformers differ somewhat, both operate in a similar way: language, and the other formats that generative AI models work in, are reduced to numerical codes.
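
    The per-token generation step can be sketched as sampling from a probability distribution over the vocabulary. The logits below are dummy values standing in for a real model's output, and the temperature value is arbitrary.

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, rng=np.random.default_rng()):
    """Turn raw model scores into a probability distribution and sample one token ID."""
    scaled = np.asarray(logits) / temperature          # lower temperature -> more deterministic
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    return rng.choice(len(probs), p=probs)

vocab_logits = [2.0, 0.5, -1.0, 1.2]                   # dummy scores for a 4-token vocabulary
prompt_ids = [3, 1]                                    # tokenized user prompt
for _ in range(5):                                     # append one sampled token at a time
    prompt_ids.append(int(sample_next_token(vocab_logits)))
print(prompt_ids)
```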

    As an example of how RNN, transformer, and hybrid models can be combined, MidJourney uses a processing pipeline that inverts the usual order: a transformer architecture sits at the front end, where natural language prompts are processed. When the text is passed to the convolutional components that perform the actual image generation, it is converted into the numeric form that machine learning models can follow.

    How Neural Networks Power Modern GenAI Models

    Neural networks power modern generative AI models through underlying mathematical and logical functions that mimic aspects of human cognition. The functions that enable neural networks to recognize patterns and learn from data are based on statistical models from machine learning, a branch of the research field known as artificial intelligence (AI).

    The structure of a neural network has been designed with specific requirements to enable it to process information more like a human brain. The key components that determine how neural networks power GenAI models are the layers, weights, activation functions, training, loss functions, and optimizers.

    • Layers: Neural networks consist of multiple layers. The input layer accepts the initial data, the hidden layers contain the majority of the neurons, and the output layer returns the results. Deep networks simply add many hidden layers.
    • Weights: Each connection between neurons in different layers has an adjustable weight that is determined by the learning and training of the neural network. If a connection helps the network identify a characteristic of the input data (whether an image, text, etc.), its weight is increased; otherwise, it is decreased.
    • Activation Functions: Each neuron applies an activation function that determines its output given its weighted inputs. The outputs from one layer are fed into the next layer through these activation functions, which essentially define the learning capabilities of the network.
    • Training: After defining the components above, the neural network is “trained” so that it makes better and better predictions. An initial guess is made about the output based on the existing weights of the connections. If the guess is wrong, the weights are readjusted. This process iterates until the errors are minimized.
    • Loss Functions: The training process depends on a “loss function” that quantifies how far the neural network’s prediction or output is from the actual outcome. The loss function is an objective function that serves as a target that the optimizer seeks to minimize.
    • Optimizers: These are the algorithms that adjust the weights in neural networks in order to minimize the loss function. Minimizing a loss function is mathematically an optimization problem, and solving it means finding the parameter values that reduce the loss as much as possible.

    The training behavior of a neural network is complex, and so are the training processes applied to it. In the background, over hundreds of thousands or millions of iterations, the neural network incrementally “learns” by fine-tuning its parameters (the weights of the connections between neurons) based on the examples it is fed. The optimizer carries out the procedure for reducing the loss, and these interrelated components ultimately produce novel, high-fidelity content.
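
    A toy NumPy forward pass ties these components together: weights, an activation function, a loss, and one optimizer step. The layer sizes, activation choice, and learning rate are arbitrary assumptions, and only the output-layer weights are updated, for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(3, 4)), rng.normal(size=(4, 1))   # weights between layers

def relu(z):                       # activation function applied between layers
    return np.maximum(0.0, z)

x = np.array([[0.2, -0.1, 0.7]])   # input layer: one training example with 3 features
y = np.array([[1.0]])              # ground-truth target

hidden = relu(x @ W1)              # hidden layer
pred = hidden @ W2                 # output layer
loss = np.mean((pred - y) ** 2)    # loss function: mean squared error

grad_pred = 2 * (pred - y) / y.size        # optimizer step: gradient descent on W2 only
W2 -= 0.1 * hidden.T @ grad_pred
print(loss)
```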

    In this way, neural networks power modern generative AI: training eventually converges on nearly unchanging values for each parameter, at which point the network effectively possesses a good understanding of the characteristics of the data it was trained on. At inference time, it can synthesize new content, with outputs that often seem human-like in their sophistication and creativity.

    Deep Layers and Representation Learning

    Neural networks learn concepts and features of the input data at different levels of abstraction depending on their architecture. In a feedforward neural network, data is fed in at the input layer, and the neural net looks for patterns as this information makes its way through the different layers until it is finally output. As the data passes from one layer to the next, the neural network looks for and builds upon increasingly abstracted features.

    This diagram illustrates how neural networks look for features at different levels of abstraction.

    Infographic showing how neural networks learn low-, mid-, and high-level visual features
    This clean, flat-style infographic visualizes how neural networks learn patterns across layers. It begins with the input layer that accepts raw data, such as a pixelated cat image. As the data passes through hidden layers, it undergoes feature extraction: low-level edges, mid-level shapes, and high-level object parts are identified. These features are combined to produce a final classification in the output layer. This visual is especially relevant in image recognition tasks, explaining how artificial neurons mimic brain-like feature processing hierarchies.

    While deep learning neural networks can have more than 100 layers, a commonly cited range for many practical networks is between 5 and 50 layers, because a realistic and useful hierarchy of features can be built with these numbers. For example, in a CNN for image generation, the first few layers may recognize patterns such as edges and curves. The next few layers may identify abstracted shapes such as circles and squares. Shapes may then be combined into parts of objects, and so forth, until a full object is recognized. If these intermediate layers are skipped, it may be impossible to reconstruct the information needed for generating output.

    The depth of a neural network creates the potential for representation learning, where the network identifies and learns features of the data without the need for human intervention. For example, natural language processing (NLP) deep learning models such as GPT-4 can autonomously identify the definitions of words from the positions and contexts in which they appear throughout an entire corpus of text. The ability for generative AI models to automatically learn representations of data that are more useful and refined helps improve their overall performance.

    Loss Functions and Optimization Techniques

    Neural networks learn from the outputs they generate and measure that learning via loss functions. When a neural network generates content, it does so by predicting values until it converges on an acceptable output. After generating an output from the input data, the quality of that prediction is assessed using a mathematical function called a loss function.

    The loss function is then used to calculate the degree to which the output produced by the neural network differs from the expected or correct output, referred to as the ‘ground truth’ in machine learning. This discrepancy is called the ‘loss’, with a greater loss indicating that the generated output is less similar to the correct output.

    Every generative AI model with neural networks has a specific loss function that corresponds to its application in order to assess and guide the learning process. Thus, the learning process in generative AI neural networks is guided by loss functions within a model’s training and fine-tuning phases.

    Adapting base neural architectures using supervised or reinforcement fine-tuning enables domain precision. See how it’s implemented in our model fine-tuning strategy breakdown.

    The calculated loss is then used to update the variables (also called parameters) of the model so as to reduce the loss, bringing the model’s output closer to the ground truth. The methods for adjusting the model’s weights based on the calculated loss are known as ‘optimization techniques’.

    The most common optimization setups used in generative AI have two components. The first is a loss function, often Mean Squared Error (MSE) or Binary Cross-Entropy (BCE) for generative tasks. The second is a gradient-based update rule: the gradient of the loss with respect to the model’s parameters indicates the direction in which each parameter should move to reduce the loss. Together, these two components guide the model’s learning.
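
    For reference, the two loss functions named above can be written directly in NumPy; the predictions and labels below are dummy values.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean Squared Error: average squared difference between prediction and ground truth."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    """Binary Cross-Entropy: negative average log-probability assigned to the true class."""
    y_true, y_pred = np.asarray(y_true), np.clip(np.asarray(y_pred), eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(mse([1.0, 0.0, 1.0], [0.9, 0.2, 0.8]))             # small loss: predictions are close
print(binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.8]))  # discriminator-style real/fake labels
```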

    Common Loss Functions in Generative Tasks

    Loss functions quantify the difference between the generated data and the real data in neural networks. The generative model is updated to minimize this loss, increasing accuracy in replicating the form of the input data.

    The type of loss function used varies by generative AI model architecture and output for tasks like natural language processing (NLP), image recognition, and audio generation. The models’ optimization algorithms then minimize the loss during training.

    Some common loss functions used in generative models include the following.

    • Mean Squared Error (MSE) in GANs and VAEs for image generation
    • BCE or Binary Cross Entropy in GANs for text-generation and image generation
    • Categorical Cross Entropy in GANs for style transfer tasks
    • Contrastive Loss in GANs in Protein folding tasks and VQGANs for image generation
    • Contextual Loss in GANs for image denoising tasks

    Additional details regarding each common loss function are provided below.

    • Mean Squared Error (MSE): The MSE function takes the square of the difference between the real value and the predicted value, then averages these squared differences. MSE is most commonly used in regression problems where the output is a real number, so it appears directly in relatively few generative models. It is, however, often used in VAEs as the reconstruction term, quantifying how much the decoder’s reconstruction differs from the original input data, while a separate regularization term encourages the encoded latent representations to resemble normal distributions in the latent space.
    • Binary Cross Entropy (BCE): Also known as log loss or logistic loss, BCE measures how well a model predicts the probability that a binary class variable takes the value 1, for example “cat” in “cat vs. dog”. In technical terms, BCE is the negative average of the logarithm of the predicted probability assigned to the true class. Variants of BCE are often used in the discriminators of GANs, including GANs for text generation. A related measure, the Kullback-Leibler divergence (KL loss), quantifies how much one probability distribution differs from another; it is used in VAEs to push the encoded latent distribution toward a normal prior, complementing the reconstruction role that MSE plays elsewhere in the VAE. Wasserstein GANs instead use the Wasserstein (earth-mover) distance to reduce the chance of training failure.
    • Categorical Cross Entropy (CCE): CCE is the generalization of binary cross-entropy to situations with multiple possible classes but only one “true” class per example. It is computed from the predicted probability assigned to the true class and assumes the classes are mutually exclusive. CCE is often used in GANs for image reconstruction and style transfer tasks, and it is the loss function most likely to be used for multi-class, one-hot encoded datasets such as audio waveforms.
    • Contrastive Loss: Contrastive loss is a relatively recent loss function built around the concept of Siamese networks. In general terms, networks trained with contrastive loss learn similarities and differences among data points, such that inputs with the same label are projected closer together in the encoded space while inputs with different labels are projected farther apart. It can be thought of as learning a force that pushes unlike representations apart while pulling like representations together. VQGANs and some GANs use contrastive loss.
    • Contextual Loss: Contextual loss focuses on the large structures within the image and preserves differences in contrast, brightness, and color during image processing. Contextual loss is implemented in GANs for image denoising.

    Backpropagation Simplified: How Models Learn

    Backpropagation is an algorithm used in training neural networks. It is a supervised learning approach. Backpropagation takes outputs from the neural network, compares them to desired output values (the labels or answers), and uses the differences to compute gradients for each weight in the model. It does this for all examples in a batch and uses the average gradient to update the model weights. This video by 3Blue1Brown gives an excellent visual description of how gradients are used to update weights during the training of a neural network.

    Simply put, during training:

    1. The neural network makes a prediction based on the current weights.
    2. The prediction is compared to the actual result (the correct answer) to determine error.
    3. The backpropagation algorithm adjusts the weights to minimize the error.

    To adjust the weights, the algorithms use two parameters: gradients and learning rates.

    • Gradient: The gradient is the derivative of the Loss Function with respect to the weights. Understanding and computing the gradient is fundamental to helping neural networks learn. Gradients tell neural networks how far off their predictions are from the correct answer; they give the direction and steepness of the slope of the loss function. This image shows how the 3Blue1Brown team estimates the direction to adjust the weights in the training set by imagining that they’re blindfolded and using a stick to feel around for the topography.
    • Learning Rate: The scaling factor for weight adjustment. If the learning rate is too high, the model can over-adjust and miss the optimal values. If it’s too low, the model will take too long to converge, or never converge at all. Models often apply an exponentially decaying learning rate so the model makes big jumps at the beginning and smaller and smaller jumps toward the end of training.

    The weight update, expressed mathematically, looks like this:

    w_new = w_old − η · ∂L/∂w

    where η is the learning rate and L is the loss function. In backpropagation, the partial derivatives of the loss function with respect to the weights and biases are calculated using the chain rule.

    This simple illustration showing a linear regression model demonstrates the concept of backpropagation in a straightforward manner. For a dataset with two inputs and an output, you can visualize how the algorithm continuously attempts to decrease the error by adjusting weights at each input node, thereby moving the model’s prediction closer to the actual outcome.
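
    A hedged sketch of that linear-regression picture is shown below: gradient descent repeatedly nudges two weights and a bias toward values that minimize the mean squared error. The synthetic data, learning rate, and iteration count are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))                          # two input features
y = X @ np.array([3.0, -2.0]) + 0.5                    # "actual outcomes" from a known rule

w, b, lr = np.zeros(2), 0.0, 0.1                       # weights, bias, learning rate (eta)
for step in range(200):
    pred = X @ w + b                                   # 1. predict with current weights
    error = pred - y                                   # 2. compare with the correct answers
    grad_w = 2 * X.T @ error / len(y)                  # 3. gradient of the MSE loss w.r.t. weights
    grad_b = 2 * error.mean()
    w -= lr * grad_w                                   # weight update: w_new = w_old - eta * dL/dw
    b -= lr * grad_b
print(w, b)                                            # converges toward [3.0, -2.0] and 0.5
```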

    Real-World Use Cases of Neural Networks in GenAI

    Neural networks are able to generate human-like text, images, videos, and voices thanks to the machine learning that underlies generative AI. The real-world use cases of generative AI neural networks span all industries, including healthcare, entertainment, customer service, finance, and education, where they assist humans with tasks such as automation, analysis, and diagnosis. Industry-specific tools such as Wavemaker in advertising or Synthesia in human resources use generative AI neural networks in a similar manner to become more efficient and effective.

    Generative AI uses neural networks to produce original media. Some of the typical outputs include the following.

    • Text: ChatGPT, Jasper, PanelsAI
    • Image: Midjourney, DALL-E, Stable Diffusion
    • Voice: ElevenLabs, Resemble
    • Video: Synthesia, Pictory
    • 3D Objects: Open3D, Kaedim
    • Music: Aiva, Amper
    • AR/VR Environments: GenVA
    Infographic listing six major types of generative AI with icons and tool examples
    This structured infographic visually presents six main categories of generative AI: text (e.g., ChatGPT, PanelsAI, Jasper), image (Midjourney, DALL-E, Stable Diffusion), voice (ElevenLabs, Resemble), video (Synthesia, Pictory), 3D objects (Open3D, Kaedim), and AR/VR environments (GenVA). Each category is paired with a simple, flat-style icon and examples of tools used within that segment, providing a concise summary of the generative AI landscape. The layout is clean, responsive, and ideal for web publishing.

    Neural networks have already found application across multiple industries, and have become core tools in generative AI systems for text generation, image creation, video production, and more. Some major real-world use cases are the following.

    • Text Creation: ChatGPT is a conversational text assistant for natural dialog and generating written content. Jasper is a text generation tool with a similar focus on marketing and sales use cases.
    • Image Generation: Midjourney and DALL-E allow users to create art or images of concepts via prompts. Stable Diffusion is focused on open-source development of similar capabilities.
    • Voice Generation: ElevenLabs creates voice content that imitates human conversation with accurate emotional pacing. Customers can ‘clone’ voices from content that they provide.
    • Video Creation: Pictory automatically generates videos from articles that are input by users.
    • 3D Object Creation: Kaedim can automatically turn 2D drawings into 3D models for gaming.
    • Music Generation: Aiva composes symphonies, usually for use as production scores.
    • AR/VR Environments: GenVA creates immersive environments for training and entertainment purposes.

    Generative AI is being actively integrated into vertical industry-specific tasks for efficiency and efficacy. For instance, Precision AI’s use of generative AI neural networks in the agriculture sector is said to save farmers up to $90,000 per year per machine. In healthcare, Google DeepMind has developed a tool to spot eye diseases that is up to 99% accurate.

    While understanding the structure and function of neural networks helps clarify how generative AI models operate, it’s just as important to evaluate where these models can deliver real-world value. Technical insight is only the first step — strategic use case mapping ensures that AI is applied where it actually solves business problems. To explore how to identify and prioritize the right AI opportunities, visit our guide on Generative AI Use Case Mapping.

    Individual firms utilize generative AI for different purposes and in different ways. Many recent tools have been created for particular purposes, but the big players are beginning to build platforms that can efficiently and effectively serve a range of use cases.

    For example, Microsoft has integrated OpenAI’s capabilities into many of its products like Azure, Copilot, and the new Bing, and has developed a new product called Semantic Kernel to help embed semantics into older products. Others are taking it a step further by creating an ‘ecosystem’ to better address productized generative AI needs.

    Google Cloud’s new Dataplex tool is intended to bring data science and artificial intelligence together. The mission of the team behind Dataplex is to use their past experiences building self-driving cars, virtual assistants, and robots to create a single self-service environment. This would simplify the transitions between data science and artificial intelligence posed by the generative AI pipeline.

    Adobe is using its Sensei platform to help democratize video editing through automating certain lower-value tasks via generative AI and assist creators in building higher-value assets via intelligent new editing capabilities. Adobe is implementing generative AI across its product line, including through its Photoshop and audio tools.

    Neural Networks in Text Generation Tools

    Neural Networks are at the core of text generation tools, enabling them to accurately produce coherent human-like text from a wide variety of prompts. While each tool’s internal workings may differ, they all use the foundational principles of prediction, training on datasets of examples, and user prompt interpretation to produce the desired text output.

    Practical examples of how text tools use neural networks include auto-completion, summarization, and creative writing assistance. Auto-completion is a core feature found in document editing tools such as Microsoft Word and Google Docs.

    When users begin typing a word or phrase, the software uses generative algorithms to make suggestions regarding what is likely to follow. Depending on the sophistication of the software, the generated text can be selected phrases, sentences, or entire paragraphs.

    Another example is the use of generative AI to summarize large data sets for analysis and reporting purposes. Applications such as ChatGPT, Knackmaps, and Abstract are able to take long reams of text and distill that information into a concise report.

    Similarly, generative AI in the domain of creative writing – such as Sudowrite and Jasper.ai – can assist authors in generating coherent story plots and arcs by providing contextualized suggestions on likely words and phrases that should follow.

    Two of the most popular general-purpose text generation tools, Jasper AI and Google Bard, produce different outputs because they are built on different underlying models. Jasper uses a transformer-based model, while Google Bard is built on LaMDA, which is itself a transformer-based neural network.

    Transformers have been the dominant architecture for language generation since their inception. At a high level, the model behind Google Bard resembles a standard transformer architecture. According to Google’s LaMDA research paper, its enhanced interactivity and contextualization capabilities are enabled by two layers in the LaMDA architecture.

    The first layer is designed for dialog generation, while the second layer is designed for search-based activities. In an illustration, the LaMDA model and transformer architecture are functionally the same with regard to input processing and output generation. However, in models such as LaMDA and GPT-3 where the intent is for more ‘natural dialogue’, the interactivity, contextualization, and predictability capabilities are built out further with additional neural layers.

    Image, Video, and Audio Creation with Neural Models

    Neural networks enable AI to create images, videos, and audio by learning the probabilities of different pixels, frames, voices, and sounds, allowing them to synthesize new outputs based on learned examples.

    Neural networks allow generative AI content such as images, audio, and video to be created across different platforms using different underlying networks. The list below covers the primary media types and the underlying architectures used to create image, audio, and video outputs.

    • Images: Neural networks generate images using large training datasets of existing images. The most widely used neural networks to create images are CNN-based GANs, along with diffusion models, such as Stable Diffusion.
    • Audio: Several different types of neural networks can generate audio. GANs are often used to create music, with examples such as Google’s Tone Transfer, which moves melodies to different instruments, and Alysia, which generates whole songs including lyrics. Other platforms, such as ElevenLabs for voice cloning, and jazz improvisation tools use different types of neural networks for audio synthesis.
    • Video: GANs and, increasingly, transformer-based models are used to generate video. GANs create intermediate steps between current and desired images to create video. Transformer-based models such as CogVideo and Google’s Imagen Video delay frame generation until they have larger contextual information.

    Essentially, neural networks work by understanding and learning the probabilities of pixels, frames, voices, or sounds that correlate with the examples on which they were trained. They use these learned probabilities to generate new outputs that are similar to but never identical to any of the training inputs and that resemble the structure of the training inputs.

    While their outputs differ, the general steps for using neural networks to generate images, audio, and video are similar. At a high level, three steps are involved.

    1. Input/Training: Generative AI models receive sample images, audio, or video as training input. For images, these may be labeled datasets such as human faces, animals, or buildings. For audio, they may be labeled datasets of human voices in particular tones or of music in specific genres. For video, labeled datasets may capture human movements, gestures, or actions carried out in particular environments.
    2. Neural Network Process: The generative AI model’s underlying neural network draws upon various computational processes to internally extract patterns and rules from the training inputs that govern how these different datasets correlate with each other and their different features. The most basic way of visualizing how a model may learn from images is through a neural network structure that analyzes and extracts relevant features, drawing correlations between these different features (using hidden layers), and at the output layer, uses all the previously learned relationships to generate new images, audio, or video.
    3. Output: Output generators in neural networks create entirely new content based on examples in their training sets by recognizing the patterns and rules that underlie their example input.

    In this way, neural networks implicitly capture the latent variables that govern a dataset to produce new data. This is visualized below, which simplifies the processes in an overly linear manner but shows the core concept of generative AI.

    The following are specific examples of how each media type uses neural networks.

    • Image Creation: Tools like Midjourney generate entire images based on text prompts using diffusion models based on CNNs.
    • Audio Creation: Tools like ElevenLabs use auto-encoding neural networks to identify internal and external vocal characteristics.
    • Video Creation: Tools like RunwayML use GANs to create what is titled a video cube in which different frames are generated, interpolated, and sequenced between generative input and desired output.

    We now look at how Midjourney, ElevenLabs, and RunwayML use generative AI neural networks to create their respective outputs.

    • Midjourney: Midjourney uses a diffusion model to produce images from text prompts. The user’s prompt is digested by a text encoder, and the model starts from a ‘noise’ image (random pixels with no clear meaning); the diffusion process then removes the noise in small steps, guided by the encoded prompt, until the desired image emerges.
    • RunwayML: RunwayML has used a type of multilayered neural network called Sketch-RNN. This simplifies the video creation process by interpreting video frames as sketches and allowing full generative output from a simple, rapid sketch input.
      • Demo of Sketch-RNN trained on human movement inputs.
    • Eleven Labs: ElevenLabs uses the neural network architecture called an auto-encoder as the base for its voice cloning and generation tool. Its networks break down vocal sounds into spectrograms, which are visual representations of how sound changes over time. The network then generates new audio as spectrograms, which are decoded into the final audio output.
      • Vocal Cloning Example Using Elements of an Auto-Encoding Neural Network.

    Despite differing outputs and functional mechanisms, all of these tools and platforms depend on the same fundamental underlying principles of artificial neural networks and generative artificial intelligence.

    Neural networks form the backbone of most content-generating AI models—from blog post synthesis to social caption creation. To see real-world implementations of these models in action, check our deep dive into generative AI’s impact on content creation workflows.

    Healthcare, Finance, and Education Applications

    Neural networks are used in generative AI solutions across industries such as healthcare, finance, and education. In these sectors, generative AI applied through advanced neural networks assists domain experts in interpreting large quantities of information and can automate processes to improve productivity and speed.

    • Healthcare. Neural networks aid in diagnosing diseases via image recognition and predictive modeling for outcomes and drug development.
    • Finance. Neural networks predict trends in markets and stocks, analyze credit ratings, and assess financial risks.
    • Education. Neural networks adapt educational content according to learners’ needs, assist in grading mechanisms, and monitor dropout rates through various analytical models.

    Three potential healthcare, finance, and education applications of generative AI based on neural networks include the following.

    • AI Diagnosing Diseases via Image Recognition. Generative AI applied through deep neural networks is being used to automate diagnosis of diseases and improve treatment outcomes, particularly in radiology and dermatology. The Geisinger healthcare system has developed an AI model that predicts cardiovascular conditions and mortality risk in patients based on a patient’s existing echocardiogram images.
    • Generative AI to Predict Stock Trends. Predicting stock markets and trends under uncertainty is supported by AI/ML models that continuously improve via neural networks. Charles River Development, a State Street company, has adopted AI algorithms that use risk premia to predict market volatility.
    • Generative AI Adapting Educational Content. A 2022 study from Universitat Jaume I in Spain entitled “Metrics to measure diversity in educational tests” examined Content-Based Recommender Systems for personalized educational content that integrate textual, audiovisual, contextual, motivational, and personal characteristics. Their proposed system could use generative AI adapted from existing image creation models to create proper, on-the-fly educational content based on various student inputs.

    AI Diagnosing Diseases via Image Recognition

    AI is diagnosing diseases via image recognition, using neural networks to analyze medical images for abnormalities. This follows the same process as traditional radiology, but results are delivered far more quickly thanks to neural networks trained on image recognition tasks.

    AI tools are designed to interpret medical images, including X-rays, CT scans, MRIs, ultrasounds, pathology slides, and nuclear medicine scans, in order to swiftly detect disease. For example, Radiology Assistant is an AI tool that interprets lung CT scans to search for potential early signs of COVID-19 infection.

    AI assists healthcare providers with targeted image analysis that quickly highlights areas of concern, which is then reviewed by a human analyst for diagnosis and subsequent treatment planning. Algorithms for automatic assessment of cardiovascular diseases, diabetic retinopathy, and skin cancer are already in the works. Research into the application of AI in genomics, patient data, and drug design is also burgeoning in order to better understand and treat diseases.

    Traditionally, human interpretation of medical images was time-consuming and prone to errors given the high volumes of scans reviewers were required to read. Even with the second human review now required in many healthcare systems, there have been documented cases of radiological errors and misread medical images, as highlighted in this research by J C Paul et al.

    Neural networks, especially Convolutional Neural Networks (CNNs), are able to perform image classification tasks by detecting and classifying features within sub-regions of images. These algorithms can be trained on historical datasets of medical images aligned with radiology reports to learn the associations between them and flag the same areas of concern in future scans. This is facilitated by two-dimensional matrices, or filters, applied across multiple layers to extract an ever-increasing level of detail.
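
    As a simplified illustration of how such a filter sweeps across sub-regions of an image, the NumPy snippet below convolves a hand-written vertical-edge kernel over a toy 6x6 image. The kernel and image are assumptions for demonstration only, not anything resembling a clinical model.

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a filter over every sub-region of the image and record its response."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.zeros((6, 6))
image[:, 3:] = 1.0                               # a bright region next to a dark one
vertical_edge = np.array([[1, 0, -1],
                          [1, 0, -1],
                          [1, 0, -1]])           # responds strongly at vertical boundaries
print(convolve2d(image, vertical_edge))          # large-magnitude values where the edge sits
```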

    GenAI Automating Financial Predictions

    Investors and managers need to understand where money is going in order to determine how much they should spend and how they should spend it. Forecasting trends is key to this understanding, whether to determine how fast sales are expected to grow or to understand how consumer preferences are changing, and GenAI has the capability to automate this.

    Financial models forecast future patterns in order to help decision-makers figure out how to respond based on current trends. These models run on large-scale data that comes from multiple internal and third-party historical sources. Leading financial centers manage vast amounts of computation on behalf of their clients to provide varied and advanced models, which they then expose to their clients through user interfaces and data feeds. Having reliable forecasts gives firms a competitive advantage.

    Neural networks are used in automated financial forecasting, not just to increase the accuracy of estimates, but also to reduce the time and effort needed to build and recalibrate models by leveraging the ability of deep learning methods to automatically generate predictors that in traditional econometric models had to be specified a priori.

    In a typical workflow, multiple data sources are aggregated, and financial prediction features are generated. These features are then fed into neural networks that can model the data’s non-linear structure. The outputs are estimates of future patterns, which are subsequently used in financial analyses including budgeting, cash flow forecasting, and reporting. This diagram outlines the basic flow between data sources and neural networks for financial predictions, as used by various financial services providers.
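    A minimal sketch of that flow is shown below, with made-up data sources, a toy feature window, and a small feed-forward network standing in for a production forecasting model. The column choices, window size, and model size are illustrative assumptions.

```python
# Toy end-to-end flow: aggregate sources -> build features -> predict next-period value.
# The data, the 6-month window, and the model size are illustrative assumptions.
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
sales = rng.normal(100, 10, size=120)          # internal monthly sales (stand-in data)
rates = rng.normal(3, 0.5, size=120)           # third-party macro series (stand-in data)

def make_features(sales, rates, window=6):
    X, y = [], []
    for t in range(window, len(sales) - 1):
        X.append(np.concatenate([sales[t - window:t], rates[t - window:t]]))
        y.append(sales[t + 1])                  # target: next month's sales
    return (torch.tensor(np.array(X), dtype=torch.float32),
            torch.tensor(y, dtype=torch.float32))

X, y = make_features(sales, rates)
model = nn.Sequential(nn.Linear(X.shape[1], 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):                            # tiny training loop for illustration
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(X).squeeze(-1), y)
    loss.backward()
    opt.step()
print(float(loss))                              # in-sample error of the toy forecaster
```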

    Infographic of a neural network using financial data sources to generate forecasts
    This infographic illustrates the role of neural networks in financial forecasting. It shows the end-to-end flow: starting with economic, market, and company data, which is processed into features used by a neural network. The model then generates financial predictions. The clean flat-style visual is structured for professional content, showing labeled stages and arrows in a web-optimized format ideal for articles about finance and AI integration.

    Prediction methodologies in financial forecasting can be separated into quantitative and qualitative models. Quantitative methods utilize numerical data and statistical methods for business modeling predictions, while qualitative methods rely on the judgment and experience of individuals. Quantitative forecasting numbers are derived from complex models that employ numerical data input from a variety of sources, some of which consist of feedback data from previous forecasts.

    Neural networks are particularly suited for quantitative forecasting, with several variations of neural networks being capable of building forecasting models based upon economic variables with which they were originally not trained. These methods include feed-forward, recurrent, long short-term memory (LSTM), and convolutional neural networks (CNN).

    LSTMs are commonly used for sequence data such as financial time series because they capture not only current patterns but also the earlier patterns needed for accurate predictions. In financial forecasting, the patterns produced by LSTMs correspond to the various time intervals that affect the original input data. In addition, there are pre-trained LSTM models and reference implementations that can be leveraged when training new predictive models.
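    A minimal sketch of the sequence-modeling idea, using PyTorch's LSTM layer over a stand-in price series; the window length and hidden size are arbitrary choices for illustration.

```python
# Minimal sketch of an LSTM forecaster over a price series (illustrative only).
import torch
import torch.nn as nn

class PriceLSTM(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, time, 1)
        out, _ = self.lstm(x)             # hidden state carries context from earlier steps
        return self.head(out[:, -1, :])   # predict the next value from the last time step

series = torch.cumsum(torch.randn(8, 30, 1), dim=1)   # stand-in for 8 windows of 30 prices
next_value = PriceLSTM()(series)
print(next_value.shape)                   # torch.Size([8, 1])
```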

    Multimedia data can improve forecasts, and as such convolutional neural networks (CNNs) are increasingly finding applications in financial forecasting. CNNs can read time series data such as price movements and, through text analysis, incorporate data such as tweets and Google Trends, deriving simple but insightful signals for predicting future price movements.

    Attention mechanisms are another pattern-extraction method, sometimes layered on top of CNNs, and are expected to be especially helpful in financial prediction for weighing the importance of inputs such as debt ratios and economic indicators like GDP growth. While attention mechanisms have yet to see widespread practical use in financial forecasting, many such applications are likely to find their way into GenAI financial tools in the near future.

    As such, GenAI tools are either already being used for financial predictions, or some version of them is likely to be used by financial institutions within a few years.

    Common Platforms & Tools That Use Neural Networks

    Common platforms and tools that use neural networks for generative AI include OpenAI’s ChatGPT, GitHub Copilot, Google’s Gemini, Midjourney, Stability AI’s Stable Diffusion, DeepMind’s DreamerV3, ElevenLabs, DALL-E, Jasper, Synthesia, HourOne, and Runway’s Gen-2 video editor. Each of these platforms uses different neural network architectures depending on the tasks it is designed for, from text and code generation to hyper-realistic image, audio, and video generation.

    Many of these platforms provide APIs for third-party developers that allow them to access the generative AI models built on neural networks. While not all of these products are open-source or available for public use, there are many open-source generative AI neural network models in the research community that developers can build and iterate on.

    Some of the top generative AI platforms and tools, their neural architectures, and their intended uses are as follows:

    • OpenAI ChatGPT: A large language model (LLM) based on transformers that uses natural language processing (NLP) techniques to mimic human conversation. Initially trained on internet data and continuously updated with user data, it is designed to hold conversations and answer questions.
    • OpenAI Codex: An LLM based on the same transformer architecture as ChatGPT, but focused on understanding and generating code for hundreds of programming languages. The Codex model forms the backend of GitHub Copilot, which provides assistance to software developers. Codex can even translate plain English to code.
    • Google Gemini: Google’s next-generation cross-modal AI system that handles images, 3D, video, audio, and text. Transforms data inputs into a shared ‘thought’ space for more human-like generative responses.
    • Stability AI’s Stable Diffusion: A latent diffusion model that generates high-quality images from text prompts. Stable Diffusion uses different neural networks to improve image generation over the diffusion process in latent space.
    • Icon AI: A text-to-video tool that generates short videos and gifs by interpreting natural language prompts. Icon’s neural image models work on single or multi-frame sections of a video.
    • Eleven Labs: A neural text-to-speech (TTS) product that allows users to generate high-quality voiceovers. Their technology uses machine learning models trained on audio and linguistic data to generate speech that conveys meaning, emotion, and even emphasis. Users can create “Ultras,” which directly clone the voice of any historical, political, or fictional character.
    • Runway Go: A suite of tools for editing images and videos. Their Generative Video product uses a diffusion model to generate new frames that match the context of existing footage, creating an interpolation-like effect.
    • Synthesia: A platform for creating videos with AI avatars based on text input.
    • Soundraw: AI music generator. Users can customize the length of the track, instruments, genres, and moods. AI-generated tracks can then be downloaded.
    • CanvaAnimator: An application that lets users generate animated GIFs using neural networks.

    While ChatGPT is the generative AI platform with the broadest user base, there are many equally interesting generative tools that use neural networks in different ways. Most of the products listed above are designed for end users and thus may not offer much technical clarity. Google Scholar and arXiv, which hosts preprints of papers that have been submitted to (or rejected by) academic journals, often provide some of the most transparent technical details on leading-edge generative AI neural network architectures.

    Top GenAI Tools and Their Underlying Architectures

    This is a list of popular generative AI tools and their core neural network architectures. Representative platforms include large language models (LLMs) for text generation, image and video generation tools, text-to-speech and voice synthesis systems, and more general multi-modal platforms. Tools aimed at developers, as well as tools that require no coding or deep AI knowledge, have been included.

    Tool | Type | Core Neural Architecture
    OpenAI API | Large Language Model (LLM) | Transformer (with various sparse components for efficiency)
    Claude (Anthropic) | LLM | Transformer
    ChatGPT | LLM | Transformer (with RLHF training)
    Bard (Google) | LLM | Transformer (modifications including Reinforcement Transformers)
    Notion AI | LLM | Transformer (with various additional features)
    WriteSonic | NLP-focused LLM | Transformer (with custom additions)
    Jasper | NLP-focused LLM | Transformer (with additional NLP features)
    DeepAI Text Generation | General NLP LLM | Transformer (and similar auto-regressive architectures)
    Georgie (DeepAI) | Conversational LLM | Transformer (and similar architectures)
    Runway | Image and Video Creation | Various CNNs and GANs
    Stability AI tools (Stable Diffusion, DreamStudio) | Image Generation | Transformer and other sparse attention mechanisms
    Craiyon | Image Generation | Stable Diffusion + add-ons
    Scikit-Speech | Audio Generation | RNNs, CNNs, and GANs
    Cleanvoice | Audio noise suppression | Deep CNNs
    Speechify | Text to Speech | Variants of WaveNet, Tacotron, Clonation, and others
    ElevenLabs | Voice Synthesis | Multi-faceted, with variants of Tacotron and others
    X(S)AI | Multi-modal | Not publicly known, likely hybrid with Reinforcement Transformers
    OpenAI Codex | Programming Tool | LLM with additional components
    Canva | Art Generation | Hybrid with various options for back-end
    StyleGAN | High-Fidelity Image Generation | GAN
    Designify | Background removal | Stable Diffusion + other options
    Runway ML / Runway Gen-2 | Video generation | Various GANs and CNNs
    Aiblins | Text and Image generation | Multi-modal architecture (not publicly known)
    Mimo | Language Learning | Multi-modal

    For more developer-dedicated tools that use neural networks, MLJAR and MindsDB offer autoML platforms with many tools including various neural network implementations. For the no-coding crowd, platforms such as Let’s Enhance, ParagraphAI, and HypotenuseAI offer easy-to-use interfaces for generative AI functionalities.

    GPT-4o, Claude, Midjourney, ElevenLabs: Who Uses What?

    These platforms use the following model bases, which map onto the types of neural network discussed earlier.

    • GPT-4o (OpenAI): transformer neural network
    • Claude (Anthropic): transformer neural network
    • Midjourney (self-described): custom diffusion model based on latent diffusion
    • ElevenLabs (self-described): uses a novel clipping mechanism in a fine-tuned transformer
    • Runway Gen-2 (self-described): latent diffusion model
    • Magician AI (self-described): GAN based model

    Substantial differences exist across these platforms beyond their model basis, with different feature sets, operational focuses, user experiences, and pricing. A few highlights are as follows.

    • OpenAI’s GPT-4o tool. The latest iteration of the foundational model family behind ChatGPT. It is a decoder-only transformer, which means it receives prompts and generates outputs sequentially, one token at a time, rather than holistically. It is multimodal, supporting text, audio, image, and video content, and it supports plugins that connect it to external data sources like the web.
    • Anthropic’s Claude platform. Claude is a transformer similar in functionality to OpenAI’s GPT products, with a focus on more conscientious and transparent AI interactions. It offers text generation and (limited) image capabilities. Its major differentiators are responsive self-critique and user adjustability of behavioral parameters. It can process lengthy prompts (input plus output) of up to 75,000 words in a single request; one pro user claims to have successfully uploaded an entire book to Claude 2 for summarization.
    • Midjourney generative AI for images and videos. Midjourney is a paid-subscription image creation tool that provides a community on Discord where users collaborate on generating AI artwork. The platform is known for highly artistic rendering of images, with an emphasis on color and painterly effects, and it can also render video clips and image variations in rapid sequence. While often described as a diffusion-based model, it is a custom multimodal latent diffusion model with added features for interaction and collaboration that make it quite different from other similar platforms.
    • Eleven Labs. Eleven Labs is unique among the major generative AI platforms in being focused specifically on generating conversation via synthetic speech. It is highly advanced, boasting the ability to mimic the speech patterns of identifiable individuals, including auto-generating a simulated preview of their voice. Users can “upload a sample,” which appears to be the first few seconds of audio from their chosen speaker; it is unclear whether that source is a voice actor or a publicly available recording of the person speaking in a clearly identifiable context.
    • Runway’s Gen-2 tool. Runway is an artistic image and video creation tool like Midjourney, but with a greater emphasis on animation. Users can submit video clips that “magically” expand beyond their original frame in time or space. It is a diffusion-based model, meaning it starts from noise and generates recognizable imagery, in this case still images or sequences for video clips. While Runway asserts it is “the world’s first video generator,” the technical basis for that claim is unclear.
    • Magician AI. Magician is one of the leaders in the growing category of image generation tools designed for use with augmented reality (AR) and virtual reality (VR).

     

    Which Platforms Offer Transparency vs Black Box?

    A key trade-off for different AI architectures in varying applications is transparency versus performance. Users of various generative AI tools receive different levels of insight into how they work, depending on the choices of the underlying organizations.

    Offering transparency into how internal generative AI works provides companies with a means to garner trust from users. The more the inner workings of the platforms are made clear, the less paranoia exists regarding inherent biases, potential for data leaks, and the reliability or safety of their outputs. This is especially true in industries like healthcare that deal with lives and data security. Transparency can also be a means to commercial success, as it can lead to collaborations between research and market-oriented entities like Microsoft.

    On the other hand, transparency often leads to performance trade-offs. This has especially been the case with older generations of AI: many early approaches could be described precisely by explicit mathematical equations, yet struggled to achieve high performance. Even today, the methods for improving performance are not always compatible with increased transparency. This is especially true for deep learning models, which develop layers of complexity that produce highly accurate but, in many ways, mysterious “black box” outputs.

    This section defines four levels of transparency, and then discusses which generative AI platforms fall into which categories, along with their transparency trade-offs.

    More formally, these four levels of transparency can be defined as follows.

    • Architecture Definitions. Basic disclosure of the types of algorithms, neural network architectures like CNNs or RNNs, and layers of the computational models. This is the least detailed type of transparency, as it might not correspond clearly to how the platform is actually functioning in practice.
    • Parameter Disclosures. Beyond architecture definitions, this includes the dimensions and other characteristics of the various components that comprise AI systems, including neural layers, transformer units, hidden states, and so forth. For example, while we may not understand what an AI has learned at a given stage, it is helpful to know how much it has learned. Parameters may only provide helpful insight into the generative AI’s competencies at high levels, given that the sheer number of parameters models function with can be enormous.
    • Research Paper and Technical Detail Disclosures. Full-length research paper disclosures may be the only avenue for transparency at a given time, particularly in fast-evolving fields like deep learning and generative AI where companies are still determining how much they want to share externally. Usually referring to disclosures included with the generative AI tool’s launch, these papers contain the dimensions and characteristics of algorithms, architecture, and parameters, as well as extensive detail on how they work together. At times they may also include extensive details on how the platform performs and compares to other platform alternatives, the strengths and weaknesses of the technologies used, and how they might be improved in the future.
    • Interactivity. Indications of what a generative AI platform is focusing on, improving, or restricting, as well as insight into how it works and how it interfaces with its surrounding software architecture. Such elements provide transparency in a way that goes beyond words, equations, or numbers.

    While these four levels of transparency are helpful in thinking about how generative AI systems can be understood, determining where a given platform actually falls in practice can be quite nuanced.

    The neural network in this AI system diagram shows that even a comparatively simple learning architecture involves many arcane elements that are difficult to grasp without the technical jargon of research papers. One of the simplest neural networks consists of an input layer that receives data, one or more hidden layers where processing occurs, and an output layer that produces meanings or categories based on the data received.

    The key generative AI developers, tools, and models today straddle all levels of transparency and performance in varying combinations. Business and analyst surveys from McKinsey and Deloitte illustrate the degree to which many companies across industries run a range of generative AI tools.

    Platforms are highly differentiated in their transparency levels. Those built on highly researched architectures, such as DeepMind’s Gato or the advanced transformer architectures most plainly represented by Google’s Bard AI, have opted to publicly share research papers and technical details. While the field is moving toward secrecy because of the competitive advantages secrecy provides, there is also a middle ground: maintaining some measure of transparency without forsaking performance.

    At least in the near term, this is balanced out by many cheaper generative AI tools that offer poor transparency yet do not deliver the performance of higher-end brands. Whether transparency hurts or helps, the users of these tools rarely turn to them for more than specific outputs, and for very practical reasons these brands have not disclosed much beyond basic architectural definitions.

    The issue of developing trust in generative AI processes through transparency has a stronger manifestation on the user side. Generative AI is not broadly understood, and because of its black-box nature many people suspect it is untrustworthy. Even for companies like Google or OpenAI that have shown extensive willingness to share their work through research papers, many users still find the underlying mathematical models inscrutable.

    But the field has been shifting toward more transparent and trustworthy norms, not only to dispel user doubt, but because finding new ways to enhance performance often involves improving transparency. Consider the early smart home market: many customers focused solely on whether the product worked for them and never researched what “smart home integration” actually meant, yet transparent features that displayed how the system worked in real time, or showed specifics on how customer needs were met, proved vital to customer retention and brand loyalty.

    Something similar, if less visible, has been happening in generative AI, particularly around producing outputs that students and professionals can trust. Signals early in the generation process that let users iteratively check outputs for correctness, along with basic transparency that helps verify outputs are not biased, have been necessary before end products could be used widely and visibly enough to build customer retention.

    This trend has been positive for both customers and the models: most generative AI tools have compressed what might otherwise have been a decades-long rollout by doing what many other industries have moved toward, ensuring that processes and products are clear enough to develop customer loyalty.

    APIs and SDKs Built on Neural Networks

    Application programming interfaces (APIs) and software development kits (SDKs) provide a simplified way for others to access the capabilities of neural networks without needing to build their own models. These platforms have pre-trained generative AI models that developers can call or integrate into their software products.

    OpenAI, Hugging Face, Replicate, Google Cloud, Amazon Web Services, and Stability AI are notable platforms offering APIs and SDKs built on neural architectures.

    • OpenAI: APIs for ChatGPT and DALL-E allow integration of advanced text and image generation capabilities into third-party products. Speech-to-text is available via Whisper, and code generation and understanding via Codex.
    • Hugging Face: Developers can access thousands of pre-trained models ranging from text, audio, and image generation to translation models built on top of various neural architectures, simply by registering on their platform.
    • Google’s Vertex AI: Offers access to large language models such as PaLM; current public offerings use the Transformer architecture. Vertex AI supports a multi-modal approach that combines neural networks with symbolic AI by allowing users to include hard-coded logic in model outputs.
    • Stability AI: Provides its models via a unified API that allows generative AI developers to choose diverse generative functions, including image, text, and video generation.

    Your choice of API or SDK will depend on the generative AI tasks you are interested in. OpenAI and Hugging Face remain good starting points for exploration, as in the sketch below. Though development is more complicated, Google’s Vertex AI is a powerful choice for enterprises looking to embed large language or multi-modal models in their operations.
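    As a minimal example of this route, the sketch below generates text through Hugging Face’s transformers pipeline. The small “gpt2” checkpoint is chosen only because it is freely downloadable; hosted APIs such as OpenAI’s follow a similar request-and-response pattern.

```python
# Minimal sketch: calling a pre-trained generative model through Hugging Face's
# transformers library. The "gpt2" checkpoint is just a small, freely available example.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator("Neural networks are", max_new_tokens=30, num_return_sequences=1)
print(result[0]["generated_text"])
```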

    Challenges of Neural Networks in GenAI Systems

    Deploying neural networks for generative AI comes with several structural challenges, including the following.

    • Bias in results generated: results that add to rather than reduce societal prejudices.
    • Hallucinations: false information that seems plausible.
    • Overfitting: models too tailored to training data.
    • Trustworthiness: lack of explainability, transparency and data privacy.

    We define each of these challenges in generative AI below and explore the impact they have on machine learning systems.

    • Bias in Results: Neural networks do not invent biased opinions on their own. Rather, according to a 2021 Stanford paper entitled Social & Structural Biases in Natural Language Processing (NLP) Datasets, biases from society’s prejudices manifest in training data.

    The end result is output that is either prejudiced or stereotyped. An example that circulates in the press is that a neural network image generator was improbably much more likely to associate Black individuals with negative words such as war, riots, or violence than it was to associate Black individuals with positive words.

    There is debate over whether bias in large language models (LLMs) is a big problem. A 2022 study from the AI researchers working group Model Cards for Model Reporting found that the majority of their respondents (62%) thought it was, while others believed mitigative efforts would work.

    The threat of bias in results raises the ethical challenge of ensuring that results of generative AI platforms do not perpetuate prejudices.

    Mitigation efforts include retraining models on controlled data sets that have been stripped of problematic content or adding filters that aim to disguise or remove unwanted characteristics from outputs.

    • Hallucinations: Hallucinations refer to the generation of information or results that are untrue or inaccurate but made to appear plausible. These are frequent challenges for neural networks, especially in language-generation tools such as ChatGPT.

    The reason hallucinations happen is intertwined with the reason trust in LLMs is low. Outputs are based on patterns detected in the training data and not dependent on real-time awareness or information fact-checks. The solution is to expand the training data so that the patterns it contains ensure more accurate generative output. This is expensive and time-consuming, but it—along with better prompting techniques—is the best solution.

    • Overfitting: A third challenge is overfitting, which refers to a model that is too well-adjusted to training data. The predictions and outputs of such a model may be accurate for its training data, but less useful when asked to analyze and assess a new data set.

    Overfitting tends to occur when researchers embed excessive parameters into neural networks relative to the amount and variety of training data.

    The primary means of addressing overfitting is through testing. If users share their data and test results regularly with the GenAI provider, then it is easy for the provider to pick up patterns of results that are too tailored to their training dataset.

    In fact, if the provider is disciplined about establishing the testing regime early on in a model’s life cycle, the problem of overfitting can often be detected before it begins. Researchers can compare model evaluation scores on their test set against their training set. If they observe scores improving on the training set but decreasing on the test set, they know their neural network is becoming overfitted.
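    A minimal sketch of that train-versus-test comparison is shown below; the loss values are made up purely for illustration.

```python
# Illustrative overfitting check: compare training and validation loss per epoch
# and flag the point where validation loss starts rising while training loss keeps falling.
train_loss = [0.90, 0.70, 0.55, 0.45, 0.38, 0.33, 0.29, 0.26]   # made-up curves
val_loss   = [0.92, 0.75, 0.62, 0.58, 0.57, 0.59, 0.63, 0.68]

best_epoch = min(range(len(val_loss)), key=val_loss.__getitem__)
for epoch in range(best_epoch + 1, len(val_loss)):
    if val_loss[epoch] > val_loss[best_epoch] and train_loss[epoch] < train_loss[best_epoch]:
        print(f"Possible overfitting from epoch {epoch}: "
              f"train={train_loss[epoch]:.2f}, val={val_loss[epoch]:.2f}")
        break
```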

    • Lack of Trustworthiness: The issue of bias in results creates problems of trustworthiness for users, who want generative AI results to be reliable. As biased outputs become more widespread, the perceived reliability of generative AI results declines.

    Researchers have focused on improving the reliability of generative AI outputs. In fact, improving reliability and trustworthiness have become priority areas for researchers and tech firms working in the field of ethics. To this end, the most promising avenue appears to be having more rigorous testing procedures and protocols during the model training and outputs verification phases.

    Tech companies are starting to create protocols to give consumers the option of getting clearer explanations for the functionality of generative AI models. For example, users can select a form of the desired output, whether that be imaginative or practical, purposeful or whimsical. These selections tell the prompt what biases to avoid or emphasize, thereby enhancing the trustworthiness of the output’s intent.

    This thankfully leads us back to bias: answers cannot merely be free of prejudices but must also emphasize fairness and ethical reliability. As such, bias cannot be completely eliminated from generative AI, because whoever decides what counts as fair may have biases or preferences of their own. Fairness should ideally be based on a composite of societal values, which is itself subjective and changes regularly.

    Data privacy has always been an issue, especially as different generative AI systems are asked to generate content involving protected or sensitive material, whether documents relevant to the financial industry or art heavily inspired by religious themes. But as Data Bridge research noted, the pandemic forced the emergence of a more data-driven culture, which raised the risk of leaks of sensitive and confidential data.

    OpenAI, in a July 2023 collaboration with System 1, allowed researchers and consumers to probe their models for sensitive personally identifiable information (PII) and improper content generation. This is another area where bias reduction and privacy can work in tandem: examining patterns in generative AI outputs can produce better training data and verification protocols that improve not just bias awareness but also data privacy.

    Lack of transparency in modeling algorithms is a big issue for improving the ethical and usable aspects of generative AI. The MAGIC (Model-Agnostic Global Importance Criteria) framework helps researchers quantitatively gauge various model-building choices’ contributions to the final results, thereby improving the transparency of generative AI outputs.

    The structure of neural networks impacts ethical risk—such as bias propagation or data leakage. Learn how design decisions intersect with responsibility in our analysis on ethical challenges in generative AI.

    Bias, Hallucination, and Overfitting

    Neural networks in generative AI face several major challenges.

    • Data bias refers to neural networks producing outputs that reflect biases present in their training data, leading to distorted, stereotypical, or improper results.
    • Hallucinations refer to software outputs that contain false or invented information which is stated as fact. Hallucinations are inaccurate data generated by neural networks that are presented with high confidence; these are particularly common in text generation.
    • Overfitting occurs when the neural network becomes too focused on the training data, and learns intricacies and noise to the point that it negatively impacts the performance in new, unseen situations.

    Overfitting can cause subtle inaccuracies, but hallucinations and bias can pose more serious problems.

    Bias is inherently difficult to manage because it requires subjective determinations of what is or is not appropriate, and generative AI models built on a neural network make decisions based on training data that may contain biases. One tool that the Generative AI community is using to identify bias is the AI Bias Inventory tool developed by researchers at University College London and Ben Gurion University in Israel. This data collection tool enables researchers to identify and potentially reduce bias in responses in six areas, as illustrated below.

    Hallucinations, which lead to the production of inaccurate information stated as fact with confidence, pose a different but equally serious problem for generative AI and its users. Users and even some researchers have had difficulty distinguishing between high-level outputs with low accuracy and the same outputs stated with a confident tone.

    According to a survey on the accuracy of AI-generated text conducted in 2022 by researchers Susan Michniewicz and Eric S. McCoy at the University of Pennsylvania and Columbia University, some 94% of respondents thought AI-generated text articles contained entirely accurate information, when in fact the AI often fabricated or misrepresented sources.

    Bar chart showing survey responses on AI-generated text accuracy
    This infographic visually summarizes survey data on how users perceive the accuracy of AI-generated text. The horizontal bar chart reveals that a significant 94% of respondents found AI-generated text “entirely accurate,” while only a small percentage found it mostly or somewhat accurate. None of the participants rated it as “not accurate.” This concise visual effectively supports discussions on the reliability of generative AI outputs and public trust in AI content, and is well-suited for tech articles or research-based blogs.

    The outputs of generative AI tools in art or image creation can also be biased, overfit, or hallucinatory, as OpenAI’s image generation tool DALL-E 2 demonstrated when it produced “distorted pizza slices” while attempting to generate an accurate response to user input. As discussed further in the “Challenges” section of this article, biased or distorted outputs can be dangerous, and generative AI experts (and even governmental authorities) are increasingly focused on implementing safeguards against such things.

    Explainability and Trust Issues in Deep Models

    Deep learning models, including neural networks, are often considered “black boxes” whose exact functioning cannot be discerned by human understanding. This lack of clarity and interpretability is a major concern for researchers, regulators, and everyday users in sectors like finance, healthcare, and self-driving vehicles that require certainty and trust in AI decision-making.

    The difficulty in interpreting deep learning models is due to three main reasons.

    • Opacity: The vast number of computations and parameters (up to hundreds of billions in the case of foundational models) used by these systems are simply too complex for humans to clearly follow and relate to original input data.
    • Decision Traceability: Even if one were to know the precise methodology that a neural network was using to arrive at a decision, these techniques would not be able to provide a clear audit trail on how the specific inputs got transformed into the final output in the model’s many layers.
    • Trust Concerns: Without certainty on how a model is deciding things, there are fears it could be vulnerable to exploitation or hacking, malfunction, unintended biases, or a myriad of other potential issues that could create danger or diminish reliability.

    A number of techniques have emerged to try and combat the opacity problem in interpreting how neural networks arrive at specific predictions or outputs. These largely revolve around “mapping the black box” or reducing a complex model’s behavior to simpler analogs that are easier for humans to understand. Common methods include feature visualization, saliency maps, and usage of surrogate models.
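    As a rough illustration of one of these techniques, here is a minimal saliency-map sketch in PyTorch. The tiny untrained classifier and random image are stand-ins, so the resulting map is meaningless, but the mechanics (taking the gradient of the winning score with respect to the input) mirror what real explainability tooling does.

```python
# Minimal saliency-map sketch: how strongly each input pixel influences the
# model's top prediction. The tiny model and random image are stand-ins.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # stand-in classifier
image = torch.rand(1, 1, 28, 28, requires_grad=True)

scores = model(image)
scores[0, scores.argmax()].backward()        # gradient of the winning class score
saliency = image.grad.abs().squeeze()        # high values = pixels the decision leans on
print(saliency.shape)                        # torch.Size([28, 28])
```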

    Model Interpretability: Why It Matters

    Model interpretability refers to the degree to which a human can understand the cause of a decision made by a machine learning algorithm.

    Model interpretability is important in human-centric fields like healthcare, law, or finance where AI systems cannot make arbitrary and unknown decisions with real-world consequences. Researchers in such fields would like to know the reasoning behind a model’s output to act in accordance or correct the model’s logic should the reasoning be flawed. Furthermore, regulations (e.g. the EU AI Act) are beginning to require this kind of transparency for decision-making AI systems.

    For generative AI, understanding model logic is useful during development to debug models that generate poor or undesirable outputs. Because outputs are often highly complex, it is easier to grasp the model’s understanding of data by analyzing the example outputs it generates. The following are common methods to understand and analyze generative AI systems.

    • Attention maps: Especially for transformer-based models, attention maps illustrate the parts of the input data the model placed most importance on when producing specific outputs. Modifying the inputs to analyze their effect on outputs is another way to look into neural networks’ decision-making. For instance, researchers at EleutherAI noticed that images created by the DALL-E image generator changed depending on how they capitalized the first word. They hypothesized this capitalized first word serves as a ‘context delimiter’ indicating a switch between more informational phrases to words in the prompt intended for instruction. They tested their hypothesis by adjusting the prompt, which produced different output styles.
    • Neuron visualization: Another, simpler method of analyzing models qualitatively is to visualize the activation of certain neurons in the network. A neuron or unit in an artificial neural network is a basic component (e.g. a mathematical function) that analyzes input features in a particular way; activations show the intensity of the signal flowing through that neuron. Plotting these flows and outputs can help identify whether a specific neuron is being used to measure specific traits of the input data. For example, researchers at Google found a specific neuron in the Inception v1 image-classification network that activated only when it received images of the National Cathedral in Washington, D.C. This was unexpected because the neuron behaves as a ‘class detector’.

    This diagram shows attention maps for two different phrases fed to Latent Diffusion Models. It shows the parts of the input text that received the highest attention from the AI model to determine the final visual output. The relative weight is indicated by the height of the colored bar. The specific words “invisible red cloak” receive more attention than the word “demon” for the first phrase, whereas the word “of” receives the most attention for the second.
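    For transformer models served through open libraries, the raw attention weights behind such maps can be pulled out directly. The following is a minimal sketch using Hugging Face’s transformers library; the DistilBERT checkpoint and the aggregation (averaging heads in the last layer) are illustrative choices, not the method used by any particular image model.

```python
# Sketch of pulling attention weights out of a small pre-trained transformer,
# the raw material behind attention-map visualizations. Model choice is illustrative.
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased", output_attentions=True)

inputs = tokenizer("an invisible red cloak", return_tensors="pt")
outputs = model(**inputs)
attn = outputs.attentions[-1]                 # last layer: (batch, heads, tokens, tokens)
per_token = attn.mean(dim=1)[0].sum(dim=0)    # rough per-token attention mass
for token, weight in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), per_token):
    print(f"{token:>12s}  {weight.item():.3f}")
```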

    Fighting Data Leakage and Privacy Breaches

    The core strength of neural networks in generative AI could be a double-edged sword when it comes to user privacy. Information about the data that was used to train a model can sometimes be extracted or “leaked”. This could manifest as a generative answer that contains personal information, or as a third party being able to determine something about the individual whose data was included in training the model. This might happen with “model inversion” which seeks to reverse engineer an input by finding the trained parameters that generate a desired output.
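    To make the model-inversion idea concrete, here is a toy sketch that optimizes an input until a stand-in classifier strongly predicts a chosen class. Because the model here is untrained, the recovered input is meaningless; real attacks apply the same loop to trained models to approximate sensitive training examples.

```python
# Toy model-inversion sketch: starting from a blank input, optimize it so the model
# assigns it a chosen label, approximating what the model "thinks" that class looks like.
# The untrained stand-in model makes the result meaningless; real attacks target trained models.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # stand-in for a trained classifier
model.eval()

target_class = 3
x = torch.zeros(1, 1, 28, 28, requires_grad=True)
opt = torch.optim.Adam([x], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    loss = -model(x)[0, target_class]        # push the target class score up
    loss.backward()
    opt.step()
# x now approximates an input the model strongly associates with the target class.
```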

    An example of such a leak occurred on OpenAI’s DALL-E image generation platform, where users were able to see pictures created by others. While multifactor authentication added later reduced the risk, there is also the potential for a malicious insider to cause serious problems: in a 2019 LinkedIn data leak, 187,000 records containing PII were made accessible by a disgruntled employee.

    To mitigate these risks, leading developers are implementing stringent security protocols, both technical and personnel-related. Some top-tier strategies include programming models like GPT-4 to “refuse” requests for sensitive information, storing personally identifiable information (PII) separately from auxiliary model training data, and continuously monitoring frequent and unusual requests alongside model outputs.

    This diagram by Wynford offers a simplified view of how chatbots like ChatGPT, which employ neural networks, can keep user data private.

    Future of Neural Networks in Generative AI

    The future of neural networks in generative AI promises varied and exciting developments as research is focused on improvements to overcome current challenges and to harness the incredible potential of these models.

    Four major trends are emerging in the future of neural networks in generative AI, which are discussed below.

    • Sparse models
    • Multi-modal learning
    • Agentic AI systems
    • Neural-symbolic integration

    Next-Gen Architectures: Sparse Models and Beyond

    Current model architectures often take the form of dense networks with numerous connections involving a billion or more parameters. But researchers are exploring ways to design sparse models that contain significantly fewer and more selectively chosen connections while still achieving similarly high levels of performance.

    • Adaptive Sparse Transformers have been shown to reinforce important attention patterns and prune unimportant ones to optimize performance. They have been suggested as a follow-up to large transformer models like those behind ChatGPT that could be less costly to run while still achieving high performance.
    • Sparse mixture of experts (SMoEs) activate only a fraction of their parameters for each input, whereas regular neural networks use their full complement. SMoEs can become very similar to current dense architectures when their inputs are not diverse. Nevertheless, SMoEs can conditionally scale up to parameter counts that other models cannot reach, provided their inputs remain diverse. SMoEs have been shown to achieve impressive state-of-the-art performance on various NLP benchmarks.
    • GrokNLG is a mixture of local and global expert networks to efficiently handle natural language processing tasks including relation extraction, coreference resolution, and entity linking.

    These sparse modeling techniques offer better efficiency without sacrificing accuracy. They make it possible to build much larger and more powerful models that can handle multi-modal tasks, operate with less data thanks to fewer active parameters, and consume less energy. A minimal sketch of the routing idea follows.
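    The sketch below shows top-1 expert routing, the core mechanism behind sparse mixture-of-experts layers. The sizes and the single-expert routing rule are simplified assumptions; production MoE layers typically add top-k routing and load balancing.

```python
# Minimal sparse mixture-of-experts sketch: a router picks one "expert" MLP per token,
# so only a fraction of the layer's parameters run for any given input. Sizes are illustrative.
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    def __init__(self, dim=16, num_experts=4):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))

    def forward(self, x):                          # x: (tokens, dim)
        choice = self.router(x).argmax(dim=-1)     # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = choice == i
            if mask.any():
                out[mask] = expert(x[mask])        # only the chosen expert runs
        return out

tokens = torch.randn(8, 16)
print(TinyMoE()(tokens).shape)                     # torch.Size([8, 16])
```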

    Multi-Modal Learning and Brain-like AI Systems

    In response to fast-growing user demand for capabilities beyond text, researchers are now developing tools to handle multi-modal data in a more integrated way. This includes processing and generating images, video, 3D scenes, sounds, music, or other time-based audio signals alongside text.

    There is also growing interest in designing AI systems that mimic the brain, to better process data and improve reasoning ability. A recent paper by researchers at DeepMind and the University of California Berkeley, for example, lays out a roadmap for building AI systems that contain sensory, cognitive, and motor systems similar to those in the brain. With multi-modal learning emerging as a core theme, these neuroscientific principles can be used to improve learning efficiency and perceptual creativity. This could allow future models to better understand the interplay between sensations and concepts, make better inferences, and generate a wider variety of output types.

    Agentic AI: Proactively Learning Like a Human

    Neural network-based generative AI systems have gotten very good at responding quickly and providing smart outputs based on user prompts. However, these systems are still fundamentally reactive. A research paper from the Stanford Center for Research on Foundation Models has defined three levels of agency or proactivity in how these models build and refine knowledge.

    • Level 0 involves AI providing outputs without any significant learning from contextual data.
    • Level 1 involves new capabilities being possibly added based on the history of previous interactions.
    • Level 2 represents the highest capabilities where the output is based on a continually evolving set of knowledge.

    ChatGPT, for instance, does not yet have the ability to remember specific user characteristics or preferences across sessions. Though future generative AI like Google’s Gemini is rumored to incorporate this feature, it is guaranteed to raise an entirely new set of ethical issues with user data protection.

    While big strides have been made in allowing generative AI systems to learn at very high levels, achieving human-like learning potential is still very much a work in progress. Factors such as applying prior knowledge to new scenarios, working around conceptual roadblocks, or having the capability to generate and employ meaningful abstractions are still limited. Future developments will help generative AI achieve this higher-level learning.

    One of the key challenges hindering this growth is the need to ensure that AI does not learn or repeat harmful concepts or ideas.

    Neural-Symbolic Integration: Codifying Knowledge

    Many researchers are exploring neural-symbolic integration, which links high-level reasoning, expressed as explicit relationships and rules (often captured in knowledge graphs), with the low-level pattern recognition performed by neural networks.

    Current generative AI models are highly effective at the pattern recognition aspect of neural-symbolic integration. There is much room for improvement, however, at performing higher-level reasoning. This is part of the issue with the current models’ tendency to experience hallucinations, or generate grossly incorrect or illogical results.

    Fusing the reasoning systems employed in knowledge graphs with neural networks is advantageous because it helps neuro-symbolic models follow multi-step instructions and generate more useful outputs of greater accuracy.

    Techniques such as Reinforcement Learning from Human Feedback (RLHF), which incorporate human moral and ethical judgments into the training of neural network models to steer outputs toward what users want and need, further support research in this area.

    Powerful neural architectures are only as useful as their ability to integrate into existing systems. Whether you’re connecting to pipelines, dashboards, or real-time analytics, our guide on generative AI system integration covers critical technical bridges.

    Next-Gen Architectures: Sparse Models and Beyond

    While multi-modal learning, brain-in-a-box systems, and models combining symbolic AI with neural approaches are all fascinating possibilities for future neural networks in generative AI, researchers are pursuing even more advanced neural network architectures. Some of the top areas of focus include enhancing efficiency with sparse models and continued development of memory-augmented architectures.

    Sparse models utilize only a fraction of their parameters at any given time, while all parameters of traditional deep learning models are in use at all times. Researchers are exploring both Mixture of Experts models (MoE), which activate different subsets of parameters for different tasks, as well as sparse attention mechanisms within transformers. Sparse transformers have recently been used in speech systems and other generative AI used for audio, but there is optimism in the AI community that they will be explored more widely for text and visual data.

    There are significant computational advantages to sparse models. Dense deep learning architectures assess every parameter on every pass, which is computationally expensive. By activating only a subset of parameters, researchers are able to achieve significant improvements in both speed and efficiency, as highlighted in this graphic from a Google Research presentation on Mixture of Experts.

    Mixture of Experts models, such as a sparse-activation encoder-decoder architecture, introduce very different parameter counts. For example, a model initially built with 100 billion parameters might be enhanced with an extra 600 million to make it sparse and able to tackle very different tasks. Those additional 600 million parameters represent the “experts” in the model: the specific parameter sets that are activated for specific tasks.

    Google DeepMind’s Chinchilla-70B model has posted eye-popping performance and efficiency numbers. In benchmarks run by groups not associated with Google DeepMind, Chinchilla-70B achieved top results with stronger efficiency than all but one comparable generative model, as illustrated in the chart below.

    Memory-augmented architectures utilize external memory to provide the model with additional context it cannot keep track of internally. GPT-3 is only modestly memory-augmented, but researchers speculate that models with more capable memory will be built. Researchers at Facebook AI Research (FAIR) have developed memory-augmented neural networks (MANNs) that can store and reuse information across episodes of interaction.

    MANNs could have a wide range of applications: for example, in customer service bots that need to remember prior interactions, as well as in healthcare bots that need reminders of treatments previously prescribed. They show potential in generative AI for memory-augmented architectures by applying schema for episodic simulation of decision-making.

    Multi-modal Learning and Brain-like AI Systems

    Multi-modal learning in AI refers to the ability of a system to process and integrate data from different modalities, such as text, images, audio, and video. In Generative AI, this enables models to generate richer content that combines various data types. For example, Google’s Gemini and Anthropic’s Claude 3 are both multi-modal systems that can work with prompts combining text and rich imagery.

    Brain-like AI systems are designed around the concept of artificial general intelligence (AGI), where a machine takes on human-like aspects of cognition. Researchers have made progress in simulating the brain’s architecture and data processing abilities. For example, protocols for neurosymbolic systems that combine logic-based reasoning (the “symbolic” part) with neural networks (the “neuro” part) are being explored as a way to incorporate the brain’s enormous storage and processing power into AI systems.

    Memory layers used in neural networks are designed to replicate some of the brain’s data storage and processing abilities, specifically short- and long-term memory. In the 2014 paper “Neural Turing Machines”, researchers from DeepMind proposed models with explicit memory structures that could store information for use in future processing. Though this application is still a work in progress, it draws upon typical human behavior that prefers to forget unimportant memories and only keep the most relevant to future decision-making.

    Researchers from the University of Southern California have developed Deep-Q-Learning-based AI models whose training simulates the manner in which humans and animals learn and remember. Similarly, Google Research has a team that is trying to make reinforcement learning more brain-like by shortening the time the AI has to wait to receive feedback on its actions.

    Overall neural networks are becoming more efficient and capable as generative AI systems use insights from neuroscience that speed the learning process and improve model accuracy.

    Fusion of Symbolic and Neural Approaches

    Symbolic reasoning in AI is a method that involves manipulating symbols and the relationships between them to derive new conclusions. In simple terms, it allows computers to reason and make decisions based on logical rules. For example, if a computer can store data such as “All humans are mortal” and “Socrates is a human”, it can derive new information such as “Socrates is mortal”. Learning through this method is not flexible or intuitive, but it allows for clear reasoning.
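    The “Socrates” example above can be captured in a few lines of toy forward-chaining code. This is a sketch of symbolic inference in general, not of any specific neuro-symbolic system; the fact and rule encodings are made up for illustration.

```python
# Toy symbolic-reasoning sketch: derive new facts from rules via forward chaining.
facts = {("human", "Socrates")}
rules = [
    (("human",), "mortal"),   # rule: if X is human, then X is mortal
]

changed = True
while changed:
    changed = False
    for premises, conclusion in rules:
        for predicate, subject in list(facts):
            if predicate in premises and (conclusion, subject) not in facts:
                facts.add((conclusion, subject))   # derive a new fact
                changed = True

print(facts)   # {('human', 'Socrates'), ('mortal', 'Socrates')}
```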

    A recent wave of generative AI applications in reasoning and decision-making has explored blending the symbolic logic of traditional AI with the pattern recognition capabilities of neural networks. BAIR researchers note that “transformers learn to represent world states in a way that allows them to perform complex reasoning tasks by simulating aspects of classical algorithms like breadth-first search”.

    This approach means that neural generative AI can perform a wider range of human-like reasoning and decision-making tasks. It is relatively new, so whether it improves GenAI is still unclear, though reports have begun to emerge with some success stories. The recent Cooperative AI project from Google’s DeepMind, for example, produced agents that could self-organize into structured teams and complete complex tasks requiring agent collaboration.

    From Transformers to Agents: Evolution Path

    Multimodal neural networks that move between the realms of text, image, audio, and video data, such as OpenAI’s GPT-4, are currently at the forefront of generative AI. Experts predict that the next generation of generative AI will be increasingly focused on new architectures such as sparse models and “agents” which can rapidly adapt to users’ needs.

    Sparse models allow for increased model scalability while maintaining efficiency by concentrating a model’s focus on only the most useful pathways for solving specific tasks, instead of activating all pathways. Although they are still under research, some experts anticipate that meta-learning models which are taught to interactively change their own learning approaches may be a form of sparse learning.

    Generative AI agents represent a more substantial leap beyond current architectures. Agents are not simply more advanced versions of transformers or CNNs; they represent a fundamentally different way of creating AI models that offer broader capabilities. As analytics firm ABI Research notes, “agents automate a significant number of tasks that previously required human intervention”.

    Agents go beyond current generative AI models in three key ways; a toy agent loop illustrating them follows the list.

    1. Planning: Generative AI agents would be able to plan out the steps needed for a goal and execute multi-step tasks, whereas current models are limited to single-step actions.
    2. Goal Pursuit: Agents will be able to intelligently pursue complex goals and modify their behavior based on whether their original intentions are satisfied. For example, if the goal is to purchase air travel tickets for an individual and that booking fails, the agent could shift to an alternative such as making hotel reservations, which might otherwise have been treated as a separate tool-use task.
    3. Tool Use: Generative AI agents will link with external tools, leveraging capabilities beyond their own usually narrow domain. For example, one could imagine agents with the ability to operate finance or Excel-like software, perhaps based on models developed for interaction with automated customer service, enabling automated accounting without human intervention.
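    The following toy loop sketches how these three capabilities fit together. The “planner” and the tools are hard-coded stand-ins rather than a real LLM or external services, so the whole example is conceptual.

```python
# Conceptual agent-loop sketch covering planning, goal pursuit (replanning on failure),
# and tool use. The planner and tools are toy stand-ins, not a real LLM integration.
def plan(goal, failed=None):
    if failed == "book_flight":
        return ["book_hotel"]                 # replan: fall back to an alternative step
    return ["book_flight", "send_itinerary"]

tools = {
    "book_flight": lambda: False,             # pretend the flight booking fails
    "book_hotel": lambda: True,
    "send_itinerary": lambda: True,
}

steps = plan("arrange travel")
completed = []
while steps:
    step = steps.pop(0)
    if tools[step]():                         # tool use: call an external capability
        completed.append(step)
    else:
        steps = plan("arrange travel", failed=step)   # goal pursuit: replan on failure

print(completed)                              # ['book_hotel'] in this toy run
```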

    Economist Jeffrey Dyer suggests that Google and Amazon will be the first to bring generative AI agents to market, as they already have “robot helpers” in the form of Google Assistant and Alexa, social data footprints as well as engagement with tools like Google’s API and Alexa Voice Services to create language models that interact with modern software.

    The existing generative AI model types, including fine-tuned versions of transformers, will continue to exist and thrive as the underlying neural network architectures which power them evolve. But the development of agents that are safer, explainable, and can leverage existing tools, may lead to an explosion in generative AI capabilities.

    While this article focused on neural networks, generative AI is a vast field that spans multiple architectures and use cases. Read the full generative AI guide here to understand how these models compare and complement each other.

    Are neural networks essential for generative AI?

    Yes, neural networks are essential for generative AI. Generative AI refers to systems leveraging machine learning to produce text, imagery, or other output forms based on user prompts. Neural networks are one of the two core underlying computational approaches to machine learning. Probabilistic graphical models are the other approach, but have a more limited usage in GenAI.

    Neural networks consist of many layers of interconnected nodes that perform simple mathematical functions to help find patterns in data. They learn complex relationships within data and can produce outputs that reflect such learned relationships, which is the fundamental requirement for any generative AI. This graphic illustrates the connection between machine learning, AI, neural networks, and generative AI.

    Concentric circle diagram showing AI, Machine Learning, Neural Networks, and Generative AI
    This concentric circle infographic illustrates the layered relationship among key AI concepts. The outermost ring represents the broad domain of Artificial Intelligence (AI), followed by a smaller circle for Machine Learning, then Neural Networks, and at the core, Generative AI. This visualization helps clarify how each subfield nests within the larger one, providing a clean conceptual breakdown for readers trying to understand how generative models relate to foundational AI technologies.

    That being said, there are generative AI tools built on probabilistic graphical models. For example, Google’s Dreambooth used for text-to-image generation relies on such models. These have yet to achieve the impressive levels of performance that neural networks have, and continue to be primarily used in research settings. Thus, in building generative AI products and systems, neural networks have proven the most useful and are critical and essential for the foreseeable future.

    Is every generative AI model built on the same neural net type?

    No, every generative AI model is not built on the same neural net type. This is because generative AI models designed for different output types (text vs image vs audio vs video) and even those designed for similar outputs often require different types of neural networks to achieve the best results based on their architectures.

    Among models designed for similar outputs, OpenAI’s various versions of DALL-E and Midjourney’s models contain subtle differences that make each uniquely effective as an image generator, while Google’s Imagen, Synthesis AI, and Jasper Art use different underlying architectures. In text generation, while GPT-4 and Claude 3 are both large language models (LLMs), they achieve their results through different internal structures and training methodologies.

    Do neural networks always require huge training data?

    Neural networks mostly require huge training datasets, but there are some recent examples of successful applications using much less data. The general trend since the 2010s has been to increase the amount of training data used in generative AI to improve performance, with ample data now compiled for numerous industries and applications.

    That said, researchers at Stanford have recently used a wide mix of neural architectures and generative modeling techniques to successfully train new models capable of cancer detection in GI (gastrointestinal) pathology with just 500 cancer cases and 100 controls (normal tissue samples). The models were able to learn the underlying characteristics of cancerous and normal tissue as assessed by human pathologists, a breakthrough whose details the researchers are still confirming.

    For models involving protein folding, “AlphaFold 2” from DeepMind can predict the structure of a protein using only one sequence, whereas earlier versions of the AI, as well as other protein structure prediction AIs around the world, required datasets of thousands or millions of sequences to fold proteins. There is still a lot to be learned as researchers attempt to replicate these successes across other areas of medical and biological research.

    In general, models trained on smaller datasets are much more prone to overfitting: they memorize the training data instead of generalizing and learning how to properly predict outputs from a wider swath of inputs. Researchers at Google have found that as model size (the number of parameters) increases, even small training datasets can lead to surprisingly high-quality outputs. They found this in their models for Google’s YouTube recommendations, the BERT natural language processing AI, and the Twiletnet music generation AI.
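
    As a rough, self-contained illustration of that overfitting risk (using a simple polynomial fit rather than a neural network, purely to keep the example short), the sketch below fits a tiny training set and checks the error on held-out validation data. A model that drives training error down while validation error stays high is memorizing rather than generalizing. All data here is synthetic.

        import numpy as np

        # Synthetic illustration of overfitting on a tiny dataset. A very flexible
        # model (a degree-9 polynomial standing in for an over-parameterized network)
        # is fit to 12 noisy training points and then scored on held-out data.
        rng = np.random.default_rng(1)

        def make_data(n):
            x = rng.uniform(-1, 1, size=n)
            y = 0.8 * x + rng.normal(scale=0.1, size=n)   # simple signal + noise
            return x, y

        x_train, y_train = make_data(12)    # deliberately tiny training set
        x_val, y_val = make_data(200)       # held-out data the model never sees

        def mse(coeffs, x, y):
            return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

        for degree in (1, 9):
            coeffs = np.polyfit(x_train, y_train, degree)
            print(f"degree {degree}: train MSE {mse(coeffs, x_train, y_train):.4f}, "
                  f"validation MSE {mse(coeffs, x_val, y_val):.4f}")

        # The flexible degree-9 fit drives training error down by chasing noise,
        # but its validation error is typically far worse than the simple fit's --
        # the same memorization failure under-sized datasets cause in neural nets.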

    Can a neural network generate both text and images?

    Yes, a neural network can generate both text and images, but only if it has been designed for multimodal outputs. The original Transformer architecture used by Google’s BERT did not have this capability. But current large language models (LLMs) like OpenAI’s GPT-4, Google’s Gemini, and Anthropic’s Claude can generate images as outputs when given the right multimodal input.

    The reason multimodal models can create both text and images is that all outputs in generative AI are ultimately digital representations of data. Multimodal models use neural network encoders and decoders to find correlations between the different modalities.
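
    As a toy sketch of that encoder idea (loosely inspired by, but far simpler than, contrastive approaches such as CLIP), the code below projects an image-like feature vector and a text-like feature vector into one shared embedding space and scores their similarity. All dimensions, weights, and “features” are random placeholders.

        import numpy as np

        # Toy sketch of the shared-embedding idea behind multimodal models:
        # separate encoders map each modality into one common vector space,
        # where related image/text pairs should land close together.
        rng = np.random.default_rng(2)

        W_image = rng.normal(size=(512, 64))   # "image encoder": 512 raw features -> 64-d embedding
        W_text = rng.normal(size=(300, 64))    # "text encoder": 300-d text features -> 64-d embedding

        def embed(features, W):
            v = features @ W
            return v / np.linalg.norm(v)       # unit length, so a dot product is cosine similarity

        image_features = rng.normal(size=(512,))   # stand-in for pixel-derived data
        text_features = rng.normal(size=(300,))    # stand-in for an encoded caption

        similarity = embed(image_features, W_image) @ embed(text_features, W_text)
        print(f"image/text similarity: {similarity:.3f}")

        # Training a real multimodal model adjusts both encoders so that matching
        # image/caption pairs score high and mismatched pairs score low; decoders
        # then generate a chosen modality from the shared representation.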

    Other current multimodal models include Meta’s LLaMA family and, most recently, Google DeepMind’s Gato model, which claimed the broadest multimodal ability (1,700+ capabilities) at launch. Pictured is one of Google DeepMind’s Gato outputs showing a text-based idea for a skateboard-ramp-based floor plan.

    Are all neural networks deep learning models?

    Not necessarily. While all deep learning models are neural networks, the opposite is not true: a neural network can be shallow rather than deep. The difference between the two is how many layers the architecture has.

    In general, a machine learning model is any software algorithm that helps systems learn to make decisions based on data. Neural networks are algorithms structured as layers of nodes, giving them brain-like problem-solving abilities. And deep learning uses neural networks with multiple layers.

    As an example, the neural network designed by Frank Rosenblatt in 1958, known as the perceptron, is not a deep learning algorithm because it has only two layers. Compared to modern generative AI neural networks, Rosenblatt’s design was very simplistic, but it established one of the key building blocks of the technology today.
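
    For reference, a minimal sketch of such a two-layer, Rosenblatt-style perceptron (inputs connected directly to a single thresholded output, trained with the classic error-driven update) is shown below. The AND-gate data is an arbitrary toy example.

        import numpy as np

        # Minimal Rosenblatt-style perceptron: one layer of weights, a hard-threshold
        # output, and the classic error-driven update rule -- no hidden layers, so it
        # is a neural network but not a deep learning model. Toy AND-gate data.
        X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
        y = np.array([0, 0, 0, 1])             # AND of the two inputs

        w = np.zeros(2)                        # input weights
        b = 0.0                                # bias term
        lr = 0.1                               # learning rate

        for epoch in range(20):
            for xi, target in zip(X, y):
                prediction = 1 if xi @ w + b > 0 else 0   # hard threshold
                error = target - prediction
                w += lr * error * xi                      # perceptron learning rule
                b += lr * error

        print([1 if xi @ w + b > 0 else 0 for xi in X])   # expected: [0, 0, 0, 1]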

    Do platforms like ChatGPT run on neural nets?

    Yes, platforms like ChatGPT run on neural nets. ChatGPT is a foundation model based on the Generative Pre-Trained Transformer (GPT) architecture developed by OpenAI.

    ChatGPT contains layers of transformer neural networks and is used for conversational AI, natural language processing (NLP) tasks, and even generating computer code from user prompts. ChatGPT mimics human-like conversations by predicting sequences of text through training on vast stores of internet data.
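
    The “predicting sequences of text” behavior can be pictured as a loop that repeatedly samples the next token from a probability distribution and appends it to the context. In the sketch below, a hand-written bigram table stands in for a transformer’s learned distribution; it is only meant to show the autoregressive generation loop, not how GPT models are actually built or trained.

        import random

        # Toy sketch of autoregressive text generation: repeatedly predict the next
        # token from the current context and append it. The hard-coded bigram table
        # is a stand-in for a real transformer's learned probability distribution.
        NEXT_TOKEN_PROBS = {
            "the":    {"cat": 0.6, "model": 0.4},
            "cat":    {"sat": 0.7, "ran": 0.3},
            "model":  {"writes": 1.0},
            "writes": {"code": 1.0},
            "sat":    {"<end>": 1.0},
            "ran":    {"<end>": 1.0},
            "code":   {"<end>": 1.0},
        }

        def generate(prompt, max_tokens=10):
            tokens = list(prompt)
            for _ in range(max_tokens):
                dist = NEXT_TOKEN_PROBS.get(tokens[-1], {"<end>": 1.0})
                choices, weights = zip(*dist.items())
                nxt = random.choices(choices, weights=weights)[0]  # sample next token
                if nxt == "<end>":
                    break
                tokens.append(nxt)
            return tokens

        print(" ".join(generate(["the"])))   # e.g. "the cat sat" or "the model writes code"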

    Can neural networks be trained without GPUs?

    Not at scale. Neural networks can technically be trained without GPUs, using CPUs or alternative accelerators such as TPUs (Tensor Processing Units), but CPU-only training is feasible only at the smallest and most basic scale, such as with shallow models for research or academic purposes. Training modern deep neural networks for generative AI on large, complex datasets without the massively parallel processing capabilities of GPUs or comparable accelerators is prohibitively time-consuming, even for the most powerful CPUs.

    The general formula for the time it takes to train a neural network multiplies the time needed to process a single data sample by the number of samples and the number of iterations/epochs. This simplifies to the following formula:

        training time ≈ time per sample × number of samples × number of epochs

    Line chart showing training time decreasing as batch size increases for different epoch counts
    This line chart illustrates the relationship between batch size and training time across different numbers of training epochs: 5, 10, 20, and 50. Each curve shows a decreasing trend, demonstrating that larger batch sizes significantly reduce training time in machine learning model training. However, the reduction becomes less significant after a certain threshold. This graph is particularly useful for understanding how computational efficiency scales with training configuration in neural networks.
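
    A back-of-the-envelope version of that calculation is sketched below, with batch size folded in the way the chart above suggests. Every number (per-sample compute time, per-step overhead, dataset size) is an illustrative assumption, not a measurement from any real hardware or model.

        import math

        # Rough training-time estimate: time ~= epochs x steps_per_epoch x time_per_step,
        # where each step processes one batch. All constants are illustrative assumptions.
        def estimated_training_seconds(num_samples, epochs, batch_size,
                                       seconds_per_sample=0.002, overhead_per_step=0.05):
            steps_per_epoch = math.ceil(num_samples / batch_size)
            seconds_per_step = overhead_per_step + batch_size * seconds_per_sample
            return epochs * steps_per_epoch * seconds_per_step

        for batch_size in (8, 32, 128, 512):
            t = estimated_training_seconds(num_samples=100_000, epochs=10, batch_size=batch_size)
            print(f"batch size {batch_size:>3}: ~{t / 3600:.2f} hours")

        # Larger batches amortize the fixed per-step overhead, so the estimate drops
        # quickly at first and then flattens -- the diminishing-returns trend in the chart.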

    The speedup factor is in the 20-100x range when comparing TPUs or GPUs to CPUs. For example, in a Google research paper from 2017 that tested the training of speech recognition systems on Tensor Processing Units versus CPUs, TPU use delivered the same level of accuracy in just 1/1000th of the time. In fact, one major reason companies like Google and NVIDIA developed widely available TPUs, GPUs, and other hardware accelerators was the need for additional processing power to build high-tech consumer products.

    With the growing interest in generative AI in the past few years, Google made its TPU technology available for rent via cloud services such as Google Cloud Vertex AI in 2022. Similarly, in March 2023 AWS announced the ability to use its own Trainium chips for fine-tuning generative AI models at a lower cost.