
Natural language processing, NLP - definition, analysis, and use cases

Technologies to bridge the understanding gap between humans and computers

2023.06.09 by Sangsun Moon

What is natural language processing (NLP)?

NLP, which stands for natural language processing, is the processing of data in the form of natural language text or speech, sometimes referred to as "language in".

In other words, NLP is a machine learning technique that gives computers the ability to interpret, manipulate, and understand human language. It sits at the intersection of computer science, artificial intelligence, and linguistics.

Simply put, NLP is the study of how computers interact with human language. The computer finds the intent in a conversation with a user and aims to provide accurate information in response. This has allowed chatbots in particular to perform remarkably well.

Principles of natural language processing, NLP

NLP algorithms allow computers to take human language as input, understand and analyze it, and iterate to find the best results. This is, for example, the basic way a chatbot processes voice or text input that arrives as unstructured data.

In layman's terms, it's about analyzing voice or text input that arrives as unstructured data and producing output based on it. The process in which NLP algorithms analyze the conversation is called "natural language understanding" (NLU). The step that follows, producing output from the result of that analysis, is called "natural language generation" (NLG).
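The split between understanding and generation can be sketched with a toy chatbot. This is only an illustration: the intents, keywords, and canned replies below are hypothetical, and real systems replace both steps with trained models.

```python
import re

# Hypothetical intents and keywords, chosen only for illustration.
INTENT_KEYWORDS = {
    "greeting": {"hello", "hi"},
    "opening_hours": {"open", "hours", "close"},
}

# Hypothetical canned replies, one per intent.
RESPONSES = {
    "greeting": "Hello! How can I help you?",
    "opening_hours": "We are open from 9am to 6pm.",
    "unknown": "Sorry, I did not understand that.",
}


def understand(utterance: str) -> str:
    """NLU step: map raw text to an intent label by keyword overlap."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    best_intent, best_overlap = "unknown", 0
    for intent, keywords in INTENT_KEYWORDS.items():
        overlap = len(words & keywords)
        if overlap > best_overlap:
            best_intent, best_overlap = intent, overlap
    return best_intent


def generate(intent: str) -> str:
    """NLG step: turn the intent into a reply (here, a fixed template)."""
    return RESPONSES[intent]


def reply(utterance: str) -> str:
    """Full round trip: unstructured text in, text out."""
    return generate(understand(utterance))
```

Calling `reply("What hours are you open?")` goes through `understand` first and `generate` second, which mirrors the NLU-then-NLG order described above.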

Preprocessing text

In order for a computer to recognize unstructured text, some preprocessing is required. This is very important to improve the accuracy of NLP algorithms.

Machine Learning in Natural Language Processing
Source: Machine Learning in Natural Language Processing (2012)

1. Tokenization

  • Split a sentence or corpus into tokens, the smallest units of meaning.
  • This is the step that lets the computer recognize the text.
  • English is generally split on whitespace, while Korean is mainly split into morphemes within each word.

2. Part-of-speech tagging (POS tagging)

  • Tag each morpheme with its part of speech

3. Lemmatization (headword extraction)

  • Techniques for finding the lemma (dictionary form) of a word

4. Stemming

  • Remove prefixes and suffixes from words to obtain stems

5. Handle Stop Words

  • Handle words that contribute little to the actual semantic analysis, such as particles and word endings.
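The steps above can be sketched for English text using only the Python standard library. This is a minimal sketch under stated assumptions: the stop-word list, the lemma dictionary, and the suffix rules are tiny hand-written examples rather than a real resource, and part-of-speech tagging is left out because it needs a trained tagger or a dictionary.

```python
import re

# Illustrative resources (assumptions, not taken from any real library).
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in"}
LEMMAS = {"was": "be", "were": "be", "better": "good", "mice": "mouse"}
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text: str) -> list[str]:
    """Tokenization: split English text into lowercase word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def lemmatize(token: str) -> str:
    """Lemmatization: look up the dictionary form, if we know one."""
    return LEMMAS.get(token, token)


def stem(token: str) -> str:
    """Stemming: naive suffix stripping that keeps at least 3 characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list[str]:
    """Tokenize, drop stop words, then lemmatize and stem what is left."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return [stem(lemmatize(t)) for t in tokens]


print(preprocess("The chatbots were answering the questions"))
```

Production pipelines usually rely on an established stemmer and a full lexicon; the point here is only the order of the steps.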

Natural Language Processing, NLP and Deep Learning

NLP is a field that draws on both linguistics and artificial intelligence. Research into NLP began in the 1950s, shortly after the invention of the digital computer, but it was after the development of machine learning that NLP research made its greatest strides.

Convolutional Neural Network-based text classification network
A schematic of a convolutional neural network, which utilizes deep learning techniques to process natural language.

Recurrent neural networks (RNNs), convolutional neural networks (CNNs), and other architectures use deep learning techniques to process and understand human language. With deep learning, NLP models learn complex patterns, hierarchies, and representations of textual data. This enables them to perform tasks such as language understanding, generation, translation, and sentiment analysis.
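The core operation of a CNN text classifier can be sketched in plain Python: a kernel slides over windows of token vectors, and max-over-time pooling keeps the strongest response. This is a toy with hand-set numbers, not a trained network; the embeddings and kernel values are assumptions, and a real model learns them and stacks many kernels before a classification layer.

```python
def conv1d_max_pool(embeddings, kernel):
    """Apply one convolution kernel over a token sequence, then max-pool.

    embeddings: list of token vectors, one per token.
    kernel: list of weight vectors, one per position in the window.
    """
    width = len(kernel)
    if len(embeddings) < width:
        raise ValueError("sequence is shorter than the kernel width")

    responses = []
    for start in range(len(embeddings) - width + 1):
        window = embeddings[start:start + width]
        # Dot product between the kernel and this window of token vectors.
        score = sum(
            weight * value
            for k_vec, tok_vec in zip(kernel, window)
            for weight, value in zip(k_vec, tok_vec)
        )
        responses.append(max(score, 0.0))  # ReLU activation
    return max(responses)  # max-over-time pooling


# Four tokens with 2-dimensional embeddings (hand-set for illustration).
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
# One kernel spanning two consecutive tokens.
kernel = [[1.0, 0.0], [0.0, 1.0]]
feature = conv1d_max_pool(embeddings, kernel)
```

Each kernel produces one such feature per sentence, so the output size no longer depends on sentence length, which is what makes this layout convenient for classification.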

NLP models leverage the scalability and expressive power of deep learning architectures to capture complex linguistic structures and semantic relationships. By combining deep learning and NLP techniques, natural language processing has been able to make significant advances, opening the door to a wide range of applications, which we'll discuss below.

NLP Use Cases & Korean Chatbot Cases

Everyday natural language processing

NLP techniques are also being used in everyday life.

Natural language processing applied to email filters

Email filters

  • Spam filters that look for specific words or phrases that suggest a message is spam.
  • Gmail sorts emails into three categories based on their content: Primary, Social, and Promotions.
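A phrase-matching spam filter of the kind described in the first bullet can be sketched as follows. The phrase list and the threshold are made-up values for illustration; real filters combine many more signals and are usually learned from labeled mail.

```python
# Hypothetical phrases that hint at spam (assumption for illustration).
SPAM_PHRASES = ("free money", "act now", "winner", "click here")


def spam_score(message: str) -> int:
    """Count how many known spam phrases appear in the message."""
    text = message.lower()
    return sum(1 for phrase in SPAM_PHRASES if phrase in text)


def is_spam(message: str, threshold: int = 2) -> bool:
    """Flag the message once enough spam phrases are present."""
    return spam_score(message) >= threshold
```

Requiring two hits instead of one is a simple way to avoid flagging an ordinary email that happens to contain a single suspicious phrase.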

Smart assistants

  • Apple's Siri, Amazon's Alexa, Samsung's Bixby
  • Understand the question and generate a response relevant to the situation.
  • Learning more about your users allows for more personalized interactions

Search engine optimization

  • Drive search results based on user intent
  • Recognize user search intent rather than exact queries
  • Analyze ambiguous queries with NLP to connect them to relevant objects

Autocomplete text

  • Predict user intent to complete words or suggest related words
  • Adapts to your unique language habits the more you use it
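Both bullets can be illustrated with a small bigram model that counts which word tends to follow which. It is a sketch, not how any particular keyboard works: the class name `Autocomplete` and its methods are hypothetical, and real products use far richer language models.

```python
from collections import Counter, defaultdict


class Autocomplete:
    """Suggest the next word from bigram counts seen so far."""

    def __init__(self):
        # Maps a word to a counter of the words that followed it.
        self._next_words = defaultdict(Counter)

    def learn(self, text: str) -> None:
        """Update the counts from a sentence the user has typed."""
        words = text.lower().split()
        for current, following in zip(words, words[1:]):
            self._next_words[current][following] += 1

    def suggest(self, word: str, k: int = 3) -> list[str]:
        """Return up to k most frequent followers of the given word."""
        counts = self._next_words.get(word.lower(), Counter())
        return [candidate for candidate, _ in counts.most_common(k)]


model = Autocomplete()
for sentence in ("see you soon", "see you later", "see you soon"):
    model.learn(sentence)
```

Because `learn` can be called after every message, the suggestions drift toward the phrases a given user types most often.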

Translation

  • Online translators using NLP translate languages more accurately and minimize grammatical errors.
  • When translating from another language into your native language, the translator can automatically recognize which language the text you enter is written in.

SaaS and NLP

AI and machine learning can be leveraged to enhance the performance, functionality, and security of SaaS solutions. SaaS provides the infrastructure and models for cloud-based solutions, while AI provides data analysis, prediction, and task automation. Chatbots, in particular, are a business model that brings many benefits to SaaS customer service because they can provide fast and consistent responses to inbound queries.

Some of the most popular NLP tools include NLTK, Gensim, Stanford NLP, spaCy, and OpenNLP. These tools let you integrate NLP capabilities into your SaaS platform. For example, spaCy is a library that provides pretrained models for analyzing text data. spaCy tokenizes text to create a Doc object, which is then processed through several steps in a processing pipeline. After the analysis, a Visual Builder component can be used to build a SaaS front end that displays the results of the NLP analysis.
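The tokenize-then-pipeline flow can be tried with a minimal spaCy script. This sketch assumes spaCy v3 is installed; it uses a blank English pipeline so no pretrained model has to be downloaded, which also means it shows tokenization and sentence splitting only, not tagging or entity recognition.

```python
import spacy

# A blank English pipeline: tokenizer only, no pretrained components.
nlp = spacy.blank("en")
# Add one rule-based pipeline step that marks sentence boundaries.
nlp.add_pipe("sentencizer")

# Calling the pipeline tokenizes the text into a Doc, then runs each step.
doc = nlp("spaCy tokenizes text. The pipeline then processes the Doc.")
tokens = [token.text for token in doc]
sentences = [sent.text for sent in doc.sents]

print(nlp.pipe_names)
print(tokens)
print(sentences)
```

Swapping `spacy.blank("en")` for a pretrained pipeline loaded with `spacy.load` would add components such as a tagger and an entity recognizer to the same flow.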

NLP model examples

Beyond that, NLP plays a role in simplifying and automating a wide range of business processes, many of which involve large amounts of unstructured text, such as emails and surveys. With NLP, businesses can better analyze and leverage large amounts of data to make the right decisions.

  • Process, analyze, and archive large documents
  • Analyze customer feedback or call center recordings (speech recognition)
  • Run a chatbot for automated customer service
  • Answer structured questions: who, what, when, and where
  • Text classification and extraction

Verizon's Business Service Assurance (BSA) group is using natural language processing and deep learning to automate the handling of customer request comments. The group receives more than 100,000 requests per month, and until it developed an automation tool, staff had to read and act on every request individually. Once GTS, Verizon's IT group, built the AI-Enabled Digital Worker for Service Assurance (AIEDWSA), the tool could not only read and decode repair tickets but also report on the status of existing tickets and track their progress.

Generate engagement metrics using natural language processing based on meeting footage

Spiky is an American startup that develops artificial intelligence-based analytics tools to improve sales calls, training, and coaching sessions. Spiky's automated coaching platform uses video recordings of meetings to generate engagement metrics. It creates contextual and behavior-based analytics within the video content, and provides a range of communication and content-related metrics from both verbal and non-verbal sources. In this way, the platform aims to improve sales teams' performance and customer engagement skills. This kind of sentiment analysis resembles how NLP models work: the better they can read people's "intentions", the higher the level at which they can perform.

Healthcare, on the other hand, deals with human health, so accuracy and efficiency are critical. Traditionally, patients communicate their conditions to their doctors via electronic health records (EHRs), but this process can lead to omissions and to miscommunication about the severity of symptoms. In response, WellSpan Health in Pennsylvania built a voice-based tool that guides the interaction between patients and doctors. Patients get a healthcare portal that they can interact with in their native language. This gives doctors an accurate picture of the patient's condition, reduces the chance of human error, and makes it easier for patients to understand their health status.

Large Language Model (LLM)

A large language model (LLM) is a natural language processing model with parameters in the billions. Chatbots like ChatGPT, which is based on the GPT-3.5 architecture, are a product of advances in NLP technology. ChatGPT generates coherent, contextually relevant text to deliver compelling, creative storytelling and personalized responses.

This was preceded by extensive training on vast amounts of data to capture the complex patterns, syntactic structures, and semantic relationships of the language. As NLP technology advances, it is notable that "intent" can be found even across a wide range of training data. This was also an important factor in the success of Korean chatbots, which I'll discuss below.

Korean chatbots

Natural language processing in NAVER
NAVER HyperCLOVA creation pipeline

NAVER's hyperscale AI HyperCLOVA, unveiled in 2021, has 204 billion parameters. The model is said to have learned more than 6,500 times more Korean data than GPT-3, and by training on data that is 97% Korean, NAVER has built the world's largest Korean-language large language model.

In addition, the KoGPT API, developed by Kakao, lexically and contextually understands the Korean language entered by the user and generates sentences suitable for the user's intent. Thanks to its ability to generate sentences based on context and intent, it can be used for product introductions and sentiment analysis.

However, not all Korean NLP examples have been successful. Iruda, which started with the concept of "my first AI friend," eventually shut down in January 2021 after a controversy, and its corpus database was scrapped. Iruda worked by picking sentences from the corpus that were relevant to the keywords in the conversation and posting them in the chat box. As a result, quite a few conversations lacked any context, and it was hard to see Iruda as an NLP chatbot with the ability to understand intent.

Furthermore, it was largely unprepared to deal with malicious, abusive behavior from users. Iruda had been trained on 100 million pieces of corpus data, but not all of the private or sensitive conversations in that data had been processed.

Natural language processing (NLP) limitations and how to overcome them

Limitations of NLP

When an NLP service fails, the cause is often insufficient refinement of the training corpus rather than a problem with model performance.

Why is Korean corpus data refinement so challenging? Text preprocessing is essential before NLP algorithms can run. Within preprocessing, tokenization is the process of dividing sentences and corpora into tokens, which are the smallest semantic units. Korean is mainly divided into morphemes within words, and this consumes far more tokens than other languages. As a result, Korean requires more cost and resources, which puts it in a somewhat unfavorable position compared to English.
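The effect of morpheme-level splitting on token count can be seen with a toy example. This is a rough sketch, not a real morphological analyzer: the particle list is a tiny hand-picked assumption, and the simple suffix rule would wrongly split ordinary words that merely end in one of these syllables.

```python
# Tiny hand-picked list of Korean particles (assumption for illustration),
# sorted so that longer particles are tried before shorter ones.
PARTICLES = sorted(
    ("에서", "은", "는", "이", "가", "을", "를", "에"), key=len, reverse=True
)


def split_particle(word: str) -> list[str]:
    """Split a trailing particle off a word, if one from the list matches."""
    for particle in PARTICLES:
        if word.endswith(particle) and len(word) > len(particle):
            return [word[: -len(particle)], particle]
    return [word]


def morpheme_tokens(sentence: str) -> list[str]:
    """Whitespace split first, then split each word into stem and particle."""
    return [piece for word in sentence.split() for piece in split_particle(word)]


sentence = "나는 학교에서 공부를 했다"  # "I studied at school"
whitespace_count = len(sentence.split())
morpheme_count = len(morpheme_tokens(sentence))
print(whitespace_count, morpheme_count)
```

Even with this crude rule, the same sentence yields noticeably more tokens once particles are separated, which is the cost the paragraph above refers to.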

Tokens have long been a barrier to developing a Korean large language model, one that many chatbots like Iruda have tried but failed to overcome. Despite this, Bard, launched by Google, is changing things with its early launch in Korean.

How to overcome the limitations of NLP

DataHunt has also taken part in a Korean NLP project. For the O2O housekeeping service app Daerijubu, we explored ways to leverage natural language data that expressed user intent.

DataHunt's Chatbot Data NER/NLP Processing

DataHunt started by cleaning up customer review data and text data such as booking requests, chat history, and introductions and greetings from housekeepers. However, because of the sheer volume of natural language data, it was likely to contain typos and a lot of content that was unnecessary for understanding intent.

So, after preprocessing the data with a spell-checking API, we went on to revise the labels for the users' requests. Analyzing the labels to derive user preferences allowed us to match providers and users. Finding the user's intent in the vast amount of natural language data was a critical step in our success.

NLP is only as good as the quality of the data it's trained on. At a time when chatbots for customer service are becoming a major business, a well-crafted Korean chatbot could be the breakthrough that gives you an edge in the market.

DataHunt's know-how

There is a saying that "you have to listen to Korean to the end". This is because, in general, the real intent of a Korean sentence comes at the end. To capture and use this context well, the accuracy of NER and text analysis is important when processing text data. For the same reason, DataHunt applied NER (Named Entity Recognition) to improve the chatbot and matching performance of Daerijubu.
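What NER produces can be shown with a rule-based sketch over an English booking request. The entity labels and patterns below are hypothetical and are not DataHunt's actual schema; production NER, especially for Korean, relies on trained models rather than regular expressions.

```python
import re

# Hypothetical entity labels and patterns for a housekeeping booking request.
PATTERNS = {
    "TIME": re.compile(r"\b\d{1,2}(?::\d{2})?\s?(?:am|pm)\b", re.IGNORECASE),
    "DAY": re.compile(
        r"\b(?:mon|tues|wednes|thurs|fri|satur|sun)day\b", re.IGNORECASE
    ),
    "SERVICE": re.compile(r"\b(?:cleaning|laundry|cooking)\b", re.IGNORECASE),
}


def extract_entities(text: str) -> list[tuple[str, str]]:
    """Return (label, value) pairs in the order they appear in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


entities = extract_entities("Please book cleaning on Friday at 10am")
print(entities)
```

The structured pairs are what a matching system can act on, since the service, day, and time are now separate fields instead of free text.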

That said, we believe that NLP is all about analyzing intent and context, which is why we take a hybrid approach in which AI does not complete all of the work on its own and human hands go over it once more. We use AI to uncover intent and context within a data set or corpus, and then have human experts tag the data to ensure completeness and improve accuracy. We insist on this human-in-the-loop approach because we are focused solely on improving accuracy.
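One common way to organize a human-in-the-loop workflow is to route low-confidence model output to human reviewers. The sketch below assumes the model returns a confidence score with each label; the function name, the 0.9 threshold, and the sample data are illustrative assumptions, and the article does not state how DataHunt selects items for human tagging.

```python
def route_for_review(predictions, threshold=0.9):
    """Split model predictions into auto-accepted and human-review queues.

    predictions: iterable of (text, label, confidence) tuples.
    threshold: assumed cut-off below which a human checks the label.
    """
    accepted, needs_review = [], []
    for text, label, confidence in predictions:
        target = accepted if confidence >= threshold else needs_review
        target.append((text, label))
    return accepted, needs_review


# Hypothetical model output for two user messages.
predictions = [
    ("book cleaning on Friday", "booking_request", 0.97),
    ("maybe next week if possible", "booking_request", 0.55),
]
accepted, needs_review = route_for_review(predictions)
```

Raising the threshold sends more items to people, trading labeling cost for accuracy, which fits a workflow where accuracy is the priority.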

DataHunt has diverse experience in improving the performance of NLP models. We also conduct ongoing research based on real-world case studies. Our beliefs and tenacity are most evident in our data processing work, where accuracy is everything.

Summary

  • NLP is the study of the interaction between computers and human language.
  • In order for a computer to recognize unstructured text, the text goes through tokenization, part-of-speech tagging, lemmatization, stemming, and stop-word processing. However, Korean is at a disadvantage compared to other languages because it consumes a lot of tokens.
  • For NLP models to better understand natural language and generate accurate responses, the quality of the data they are trained on is critical. To achieve this, techniques such as NER and text analysis can be utilized when processing text data.