B is for BERT


From A to I to Z: Jaid’s Guide to Artificial Intelligence

BERT is a machine learning framework for natural language processing. It trains AI to interpret the meaning of everyday human speech and text, and to formulate a natural-sounding and appropriate response.

BERT trains AI by exposing it to a huge amount of language from a wide range of texts — books, newspaper columns, journal articles… the list goes on. The AI practices natural language by analyzing the text and reacting to it in a human-like way.

Before BERT came along, language processing models could only process human language in one direction at a time — from right to left or from left to right, which made it challenging for AI to understand ambiguity.

For example, if somebody said they’d seen the man with the telescope, AI would struggle to decide whether this meant that the person had seen the man through a telescope or carrying a telescope.

By contrast, BERT can process language bidirectionally, reading from right to left and left to right simultaneously. This makes it possible to train AI to glean the meaning of unclear or ambiguous words from the surrounding context and to understand whether two sentences are connected or unrelated. That, in turn, helps it assess the speaker or writer’s emotional state and determine their intentions more accurately, even when a word, phrase or sentence has several possible meanings.

Some facts:

BERT stands for Bidirectional Encoder Representations from Transformers. Transformers is a deep learning model created by Google in 2017. It uses a process — called attention — that connects every input and output and works out the relationship between them.

Transformers was a huge breakthrough in natural language processing, because it does not have to process data strictly in sequence, one word after another. This made training much easier to run in parallel, which in turn made it practical to pre-train machine learning models on much larger volumes of data.

BERT, for instance, is pre-trained on the entire English version of Wikipedia (about 2.5 billion words) and BooksCorpus (about 800 million words), a large collection of unpublished books.

Google open-sourced BERT in 2018 with the aim of advancing research and development of natural language processing. Anyone — business or individual — can use BERT to train AI.


This article on Google’s blog offers a detailed analysis of how BERT works, what makes it different, and how you can use it to quickly train AI to do a variety of tasks.

In this video, applied AI engineer Dale Markowitz takes a deep dive into the inner workings of Transformers and how it has shaped BERT and other machine learning models.

Jaid’s perspective

BERT represents a quantum leap in natural language processing, because it enables AI to enrich its understanding of human communication with semantic and emotional cues. From a business standpoint, this means it’s possible to take customer service to a whole new level: empathetic, human-like, personal… but also far more efficient.

Want to know more?

BERT brings together linguistic theory, technology and an ecosystem that combine to give performant, versatile and cost-effective language models. Let’s break this down to understand the essence of BERT.

The Theory

Masked Language Models

The concepts behind masked language models predate the technology. Key ideas derive from distributional semantics, the branch of linguistics that studies how terms relate to one another through the contexts in which they appear. It is often summed up in the quotation: “You shall know a word by the company it keeps.” (John Firth, 1957)

For the phrase “The cat sat on the mat,” we can swap “cat” for “dog” and the sentence still makes sense, which tells us the two words play similar roles. A masked language model learns in the same way: a word is hidden during training, the model has to predict it from the words around it, and in doing so it learns how words relate to one another.
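As a rough illustration of the masking idea, the Python sketch below hides one word and then looks through a tiny corpus for words that appear in the same context. The helper names (mask_token, context_counts) and the three-sentence corpus are assumptions invented for this example. BERT itself predicts the hidden word with a neural network rather than by counting.

```python
MASK = "[MASK]"

def mask_token(tokens, position):
    """Return a copy of tokens with one position hidden, plus the hidden word."""
    masked = list(tokens)
    target = masked[position]
    masked[position] = MASK
    return masked, target

def context_counts(sentences, masked):
    """Count which words fill the masked slot in sentences sharing its context."""
    position = masked.index(MASK)
    counts = {}
    for sentence in sentences:
        words = sentence.lower().rstrip(".").split()
        if len(words) != len(masked):
            continue
        # The context matches when every unmasked word is identical.
        if all(m == MASK or m == w for m, w in zip(masked, words)):
            counts[words[position]] = counts.get(words[position], 0) + 1
    return counts

# Made-up corpus for illustration only.
corpus = [
    "The cat sat on the mat.",
    "The dog sat on the mat.",
    "The cat sat on the sofa.",
]
masked, target = mask_token("the cat sat on the mat".split(), 1)
candidates = context_counts(corpus, masked)
print(masked, target, candidates)
```

Both “cat” and “dog” turn up as candidates for the hidden slot, which is the sense in which a word is known by the company it keeps.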

The Technology

Word Embeddings

Computers handle text by processing numbers; deep learning models do not take raw text as input. However, simply replacing each word with a number or with a vector that marks only which word it is (one-hot encoding) does not produce the rich input required to capture the complexity of language for most use cases. A performant language model usually relies on word embeddings.

Word embeddings are high-dimensional vectors that take the numerical representation of text a step further. They allow a representation of text that captures the relationships between words.
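To make the contrast concrete, here is a minimal Python sketch comparing one-hot vectors with embeddings. The three-word vocabulary and the 3-dimensional embedding values are invented for illustration; real embeddings are learned from data and have hundreds of dimensions.

```python
import math

vocab = ["cat", "dog", "mat"]

def one_hot(word):
    """One position per vocabulary word: 1.0 for the word itself, 0.0 elsewhere."""
    return [1.0 if word == v else 0.0 for v in vocab]

def cosine(a, b):
    """Cosine similarity: 1.0 for the same direction, 0.0 for orthogonal vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings, made up for this example (not from any real model).
embedding = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.1],
    "mat": [0.1, 0.2, 0.9],
}

one_hot_sim = cosine(one_hot("cat"), one_hot("dog"))
cat_dog = cosine(embedding["cat"], embedding["dog"])
cat_mat = cosine(embedding["cat"], embedding["mat"])
print(one_hot_sim, cat_dog, cat_mat)
```

With one-hot encoding every pair of different words looks equally unrelated, whereas the embedding vectors place “cat” much closer to “dog” than to “mat.”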

Attention and Transformers

Whilst early word embedding technologies such as word2vec worked well for short sentences, they have limitations. On longer sentences, models built on them lose track of the relationships between words that are far apart.

By weighting the input data by importance, attention acts as a filter for the relevant context. It addresses the problem of forgetfulness by giving the model access to all parts of the input sequence, not just the last.

Attention came to the fore with the 2017 Google paper “Attention Is All You Need.”

Transformers rely solely on a self-attention mechanism, thus removing the need for the recurrence used in earlier language models.
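The weighting described above can be sketched in a few lines of Python. This is a simplified version of scaled dot-product self-attention, assuming the queries, keys and values are all the raw word vectors; a real Transformer first passes them through learned projection matrices and uses several attention heads. The 2-dimensional word vectors are made up for the example.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Scaled dot-product self-attention with queries = keys = values = vectors.

    Learned projections are omitted, so this is an illustration only.
    """
    scale = math.sqrt(len(vectors[0]))
    outputs, all_weights = [], []
    for query in vectors:
        # Score the query against every position in the sequence.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        # Each output is a weighted mix of all input vectors.
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors))
                 for d in range(len(query))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hypothetical word vectors, invented for illustration.
sentence = {"cat": [1.0, 0.0], "sat": [0.0, 1.0], "mat": [0.9, 0.1]}
outputs, weights = self_attention(list(sentence.values()))
print(weights[0])
```

Every word gets a weight for every other word in the sentence, however far apart they are, and no word has to be processed before another. That is what lets this approach do without recurrence.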


In 2018, Google trained a model based on Transformers: Bidirectional Encoder Representations from Transformers (BERT). Training was performed on 64 TPU chips, with each pre-training run taking 4 days to complete at a cost of $7k.

The Ecosystem


This post from Google summarizes the nature of the release:

This week, we open sourced a new technique for NLP pre-training called Bidirectional Encoder Representations from Transformers, or BERT. With this release, anyone in the world can train their own state-of-the-art question answering system (or a variety of other models) in about 30 minutes on a single Cloud TPU, or in a few hours using a single GPU. The release includes source code built on top of TensorFlow and a number of pre-trained language representation models.

Transfer Learning

The release of BERT was the beginning of a modeling framework that could be utilized and built upon for various applications. Since the first BERT model was released, thousands more transformer-based models have been made available. These models can be “fine-tuned” for specific applications, such as text categorization or question-answering, meaning that the underlying language model trained on a large general corpus of text can be used as a basis to train custom models for domain-specific problems.
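The idea of reusing a general model for a specific task can be sketched with a toy example in Python. The assumptions here are significant: the “pre-trained” part is a hand-written table of 2-dimensional word vectors (PRETRAINED), the dataset is four invented reviews, and only a small classifier on top is trained while the base stays frozen. Fine-tuning a real BERT model usually updates the base model’s weights as well, and is normally done with a deep learning library rather than by hand.

```python
import math

# Hypothetical "pre-trained" word vectors standing in for a frozen language model.
PRETRAINED = {
    "great": [1.0, 0.2], "love": [0.9, 0.1], "helpful": [0.8, 0.3],
    "terrible": [-1.0, 0.2], "hate": [-0.9, 0.1], "slow": [-0.7, 0.3],
}

def encode(text):
    """Average the frozen vectors of the known words (the reusable base model)."""
    vectors = [PRETRAINED[w] for w in text.lower().split() if w in PRETRAINED]
    if not vectors:
        return [0.0, 0.0]
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def predict(weights, bias, features):
    """Logistic regression head: probability that the text is positive."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def train_head(examples, epochs=200, lr=0.5):
    """Fit only the small task-specific head; PRETRAINED is never modified."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for text, label in examples:
            features = encode(text)
            error = predict(weights, bias, features) - label
            weights = [w - lr * error * x for w, x in zip(weights, features)]
            bias -= lr * error
    return weights, bias

# A tiny, made-up domain-specific dataset: 1 = positive, 0 = negative.
reviews = [("great service", 1), ("love it", 1),
           ("terrible support", 0), ("hate waiting", 0)]
weights, bias = train_head(reviews)
score_unseen_positive = predict(weights, bias, encode("really helpful agent"))
score_unseen_negative = predict(weights, bias, encode("slow response"))
print(score_unseen_positive, score_unseen_negative)
```

The words “helpful” and “slow” never appear in the training reviews, yet the classifier scores them sensibly because the base vectors already encode how they relate to the words it did see. That carry-over of general knowledge into a small, task-specific model is the point of transfer learning.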

Beyond BERT

ChatGPT employs Transformers in its large language model.

Want to understand how BERT is used to train Jaid’s AI platform? Contact us today to learn more, and we’ll show you how Jaid utilizes AI to empower companies and teams to focus on what matters most.