Natural language processing, or NLP, is a field that combines computer science and linguistics. Its main focus is giving computers the ability to understand, interpret, and generate human language in a meaningful and useful way. This includes recognizing speech, translating text, and retrieving information from large data sets (as a search engine does).
Today, NLP is closely associated with artificial intelligence (AI), which often leverages machine learning (ML) techniques to process and analyze human language. Many people see the ultimate goal of NLP as bridging the gap between human language and machine understanding.
A brief history of NLP and how it relates to AI
The field of natural language processing has been largely revolutionized by modern advances in AI, like machine learning (specifically deep learning). Before these advances, NLP was mainly carried out via symbolic (i.e. rule-based) and statistical approaches to processing and analyzing language.
Symbolic rules were often handcrafted by linguistic experts (an expensive and time-consuming process); these rules encoded information about things like how to analyze sentence structure, or how to use a dictionary for language translation.
Statistical approaches, on the other hand, involve analyzing vast amounts of text in order to detect statistical patterns and probabilities related to the order of words in a sequence (e.g. how likely is the word “am” to follow the word “I”) and other linguistic elements. These statistical approaches are more closely related to the most popular and cutting-edge NLP approaches used today—they make use of AI, and can learn patterns and representations directly from raw (i.e. unlabeled) training data through a process called unsupervised (or self-supervised) learning. The key difference is that state-of-the-art models are much more powerful than traditional statistical approaches.
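The “how likely is ‘am’ to follow ‘I’” idea can be made concrete with a toy bigram model. This is a minimal sketch, not a production approach: the corpus, function name, and counts below are invented for illustration.

```python
from collections import Counter

# A tiny toy corpus; real statistical models were trained on vast text collections.
corpus = "i am happy . i am tired . you are happy .".split()

# Count single words and adjacent word pairs (bigrams).
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev_word, word):
    """Estimate P(word | prev_word) directly from raw counts."""
    return bigrams[(prev_word, word)] / unigrams[prev_word]

# In this corpus, "i" is always followed by "am", so P("am" | "i") = 1.0,
# while "am" is followed by "happy" only half the time, so P("happy" | "am") = 0.5.
```

Modern neural language models learn far richer versions of exactly this kind of conditional probability over word sequences.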
The advent of deep learning techniques and neural network architectures made it possible to surpass many limitations of previous NLP approaches. Deep learning models can also make use of their novel, complex architectures to achieve a much richer understanding of language (and much higher proficiency with processing and using language, too). The result is much better performance in all kinds of language-related tasks compared to earlier methods.
Today, most NLP is carried out via deep learning models, which make use of intricate neural network architecture to excel at language-related tasks.
How is NLP used today?
Natural language processing is used in all kinds of language-related computing tasks, ranging from the obvious (e.g. AI-powered chatbots like ChatGPT) to the not-so-obvious (like amending user queries in search engines). Some of the most common NLP applications include:
- Virtual assistants/chatbots/smart devices: NLP powers Siri, Alexa, Google Assistant, and more, enabling them to respond to both written and spoken language, answer questions, and perform tasks.
- Summarization: Many AI tools (especially browser-based ones like Brave’s Leo) are capable of extracting content from webpages/videos and generating succinct summaries—processes powered by NLP.
- Search engine functionality: NLP enables search engines to detect misspelled words or odd phrasing, and to return better search results by analyzing language patterns to understand context and user intent.
- Autocorrect/spellcheck: NLP makes it possible for messaging apps, word processors, and other similar tools to analyze grammar/spelling to improve text accuracy and readability.
- Language translation: Language translation apps and tools rely on NLP to understand and translate text.
- Email filtering: Many email providers use NLP tools to analyze email content to filter out spam, detect potential threats, and highlight important messages.
- Sentiment analysis: Companies use NLP to detect sentiment from text (e.g. reviews, comments, surveys, etc.) to better understand users’ opinions, satisfaction, and other trends.
- Content categorization: NLP makes it possible to predict the category of an article or webpage based on its text using a predefined taxonomy, which is useful in advertising and news.
While not a complete list of NLP use cases, these are some of the more common ones.
How does natural language processing work?
As a simplified overview, these are some of the common steps in the modern NLP sequence, which we’ll explore one by one:
- Data collection and preprocessing
- Tokenization and word embedding
- Model selection and architecture design
- Model training
- Model deployment
Data collection and preprocessing
The first step to building any NLP-capable ML model is to gather data—in this case text. The text needs to be preprocessed, which includes various tasks like dividing up the data as needed, formatting the text, handling special characters, removing “noise” (i.e. junk data) from the set, and more.
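A minimal cleaning pass might look like the sketch below. The exact steps vary widely by project; the function name and the specific normalizations here (lowercasing, stripping special characters, collapsing whitespace) are just illustrative choices.

```python
import re

def preprocess(text):
    """A minimal preprocessing pass: lowercase, strip special
    characters ("noise"), and normalize whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)  # replace special characters with spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text

preprocess("Hello, NLP!!")  # → "hello nlp"
```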
Tokenization and word embedding
In this context, “tokenization” refers to the process of breaking down large bodies of text into smaller units like phrases, words, or even characters. These smaller units are called tokens, and they serve as an improved way to represent raw data.
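A crude word-level tokenizer can be sketched in a few lines. Note this is an illustrative simplification: production systems typically use learned subword tokenizers (such as byte-pair encoding) rather than a regular expression.

```python
import re

def word_tokenize(text):
    """Naive tokenizer: keeps runs of word characters as tokens,
    and treats each punctuation mark as its own token."""
    return re.findall(r"\w+|[^\w\s]", text)

word_tokenize("NLP breaks text into tokens.")
# → ['NLP', 'breaks', 'text', 'into', 'tokens', '.']
```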
Once the text has been tokenized, the tokens must be converted into numerical representations (known as word embeddings or word vectors) which are represented in a high-dimensional space. There are several different techniques to convert words/tokens into embeddings (which we won’t cover here), but the goal is to capture semantic relationships between words based on their context in the text—and to represent that relationship with a numerical value based on how near or distant two words are in the embedding space. Language models are better able to understand and reason about words when they’re encoded as word embeddings.
Encoded word embeddings
Word embeddings should capture both:
- Semantic similarity: Words that have similar meanings or are often used in similar contexts should have similar embeddings (i.e. be closer to each other in the embedding space). For example, “cat” and “dog” should be more closely related than “cat” and “car.”
- Contextual information: Embeddings should be able to capture the nuance of words that have multiple meanings that depend on the context of their usage (e.g. “cool shirt,” which could either mean that a shirt looks stylish, or is made of a lightweight material).
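The “closer in the embedding space” idea is usually measured with cosine similarity. The 3-dimensional vectors below are hand-written for illustration only; real embeddings have hundreds of dimensions and are learned from data.

```python
import math

# Hypothetical toy embeddings (real ones are learned, not hand-written).
embeddings = {
    "cat": [0.9, 0.8, 0.1],
    "dog": [0.85, 0.75, 0.2],
    "car": [0.1, 0.2, 0.95],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# "cat" should be more similar to "dog" than to "car" in this toy space.
```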
Model selection and architecture design
Engineers must choose an appropriate ML model based on the problem at hand, the set goal or outcome, and available resources. Transformers—a type of neural network architecture capable of processing data in parallel rather than in sequence—represent the state-of-the-art approach to NLP. Popular language models like BERT and GPT are transformers, for example. However, particular tasks or constraints (e.g. computational footprint or memory requirements) may lead engineers to choose different models. For example, sequence-to-sequence (seq2seq) models may be chosen for their ability to excel at language translation (though state-of-the-art translation may also employ transformers).
Learn more about different types of neural network architectures.
Training the model
Once a model has been chosen, it needs to be trained, using the preprocessed data set, to reliably predict the next token in a sequence. As a by-product of this learning, the model also comes to detect patterns and structures in text data, including context, syntax, semantics, and even nuances of language usage.
During training, model parameters are optimized (i.e. weights are adjusted) to minimize the chosen loss function (basically to help the model better accomplish its designated goal).
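The weight-adjustment loop can be sketched with a deliberately tiny example: gradient descent on a single weight, minimizing a squared-error loss. Real models optimize millions or billions of weights against a cross-entropy loss over next-token predictions; the data and learning rate here are invented for illustration.

```python
# Toy gradient descent: fit one weight w so that the prediction w * x
# approximates the target y, minimizing mean squared-error loss.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # true relationship: y = 2x

w = 0.0              # initial weight
learning_rate = 0.05

for step in range(200):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= learning_rate * grad  # nudge the weight to reduce the loss

# After training, w has converged toward 2.0.
```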
Note that the training stage entails a much heavier workload (in terms of time, financial costs, and computing resources) than the final stage when the model is deployed.
Deploying the model
After the training stage is complete, and the model weights are no longer updated, it’s time for deployment. Once deployed, these models can interact with users through natural language interfaces (like chatbots or other types of software), understand and respond to user inquiries, analyze large amounts of text data for insights, accomplish language translation tasks, and more.
The future of NLP, and the importance of quality data
This article covers some of the basics of the field of natural language processing, and how its development has accelerated thanks to deep learning networks and transformer architecture. As with many contemporary AI applications, keep in mind that things are quickly changing and evolving. With any AI model, but especially with user-facing NLP applications, the quality of training data can greatly impact—for better or worse—the quality and performance of the model. If you’re interested in building AI applications, or NLP-capable models in particular, check out the Brave Search API to learn more about Brave’s high-quality data feeds for AI.