What is reinforcement learning?

Artificial intelligence that learns from its own experience and chooses the best course of action

Sangsun Moon

Reinforcement learning definition

Humans learn and master skills by interacting with their environment. Without explicit teaching, we can recognize our surroundings and control our bodies. Learning through interaction is the idea underlying nearly all theories of learning and intelligence. Among such approaches, reinforcement learning refers to learning methods that focus on goal-directed learning from interaction.

Reinforcement learning, a type of machine learning, is a learning method in which the learner is not told which action to take, but instead discovers through its own actions how to act so as to maximize the reward it receives.


Unlike supervised and unsupervised learning, the most popular machine learning methodologies, reinforcement learning has several distinctive characteristics:

  1. Trial and Error
  2. Reinforcement Signal
  3. Delayed Reward
  4. Exploration and Exploitation

Reinforcement learning finds the optimal behavior through extensive trial and error. As the agent interacts with the environment, it experiences rewards and penalties and improves its behavior based on these results. Through this iterative process, the agent learns a strategy that maximizes its rewards. The consequences of its actions may also arrive as temporally delayed rewards, which encourages the agent to anticipate the future consequences of its current actions. In other words, the agent must weigh short-term rewards against long-term goals. It must also balance exploitation, continually choosing the best action it already knows, with exploration, trying other actions that may turn out to be better.
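To make trial and error and the exploration/exploitation balance concrete, here is a minimal sketch of an epsilon-greedy agent on a multi-armed bandit. The bandit setting, the function name `run_bandit`, the reward means, and the value of `epsilon` are all illustrative assumptions, not something taken from this article.

```python
import random

def run_bandit(true_means, epsilon=0.1, steps=5000, seed=0):
    """Epsilon-greedy agent on a bandit with noisy (Gaussian) rewards."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    estimates = [0.0] * n_arms   # running estimate of each arm's mean reward
    counts = [0] * n_arms        # how often each arm has been tried

    for _ in range(steps):
        if rng.random() < epsilon:
            arm = rng.randrange(n_arms)                            # explore
        else:
            arm = max(range(n_arms), key=lambda a: estimates[a])   # exploit
        reward = rng.gauss(true_means[arm], 1.0)                   # reinforcement signal
        counts[arm] += 1
        # Incremental average: move the estimate toward the observed reward.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates, counts

# Hypothetical arms: the third one pays best on average.
estimates, counts = run_bandit([0.2, 0.5, 1.5])
best_arm = max(range(len(counts)), key=lambda a: counts[a])
print("pull counts:", counts, "most-pulled arm:", best_arm)
```

With a small `epsilon` the agent spends most of its pulls on the arm it currently believes is best, while the occasional random pull keeps it from locking onto a poor early guess.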

Supervised vs. Unsupervised vs. Reinforcement learning

There are three types of machine learning: supervised, unsupervised, and reinforcement learning. While all three are commonly used methodologies, they have distinct differences. Here's a look at the pros and cons of each.

  • Supervised learning is typically the most accurate type of learning, but it requires labeled data.
  • Unsupervised learning is useful for finding patterns in data, but it can be less accurate.
  • Reinforcement learning is the most difficult type of learning to implement, but it can help solve problems that are difficult or complex to solve with other types of learning.

Based on this, supervised learning is often used for tasks like image classification, natural language processing, and fraud detection. Unsupervised learning is often used for tasks like clustering, anomaly detection, and dimensionality reduction. Reinforcement learning, on the other hand, is often used for tasks like gameplay, robotics, and financial trading.

Why reinforcement learning is important

Reinforcement Learning Algorithms and Applications - TechVidvan

Many AI leaders are turning their attention to reinforcement learning. With the growing popularity of machine learning, why is reinforcement learning gaining traction?

Overcoming the limitations of unsupervised and supervised learning

Unsupervised learning is a method of extracting features from unlabeled data, but it lacks strict guidelines for its objectives, which can yield limited results. Supervised learning creates predictive models from labeled data, but it has the limitation of requiring prior knowledge and a large amount of training data. Reinforcement learning, on the other hand, can provide a more scalable learning method, as the agent teaches itself optimal behavior through trial and error.

Solve complex problems and adapt to real-time learning

Reinforcement learning is useful for solving complex and challenging problems. For example, in real-world systems like autonomous vehicles or robots, reinforcement learning can be applied to help agents learn optimal behavior as they interact with their environment. This enables the system to adapt to changing environments and perform at its best.

Reinforcement learning also has the ability to learn and adapt in real time. Agents experience the consequences of their actions and earn rewards through their interactions with the environment. This ability to learn and adapt in real time allows them to flexibly respond to changing environments.

Wide range of applications

Reinforcement learning can be utilized in a variety of applications. For example, fields as diverse as game theory, neuroscience, economics, and robotics have applied reinforcement learning to solve problems and build the optimal strategy. This diversity makes reinforcement learning even more useful and applicable.

For these reasons, reinforcement learning has gained prominence in machine learning and continues to be developed and utilized. Reinforcement learning is a powerful learning method that can solve complex problems, learn and adapt in real time, and has a wide range of applications.

How reinforcement learning works


To understand how reinforcement learning works, you first need to understand its core components.


Agent

In reinforcement learning, an agent is the decision-making entity. The agent interacts with the environment, chooses actions, and receives rewards. It has a policy that it uses to interpret its current state and choose the best action.

The policy is at the core of everything the agent does: it is the function that determines what the agent does next.


Environment

The environment is the stage on which the agent learns. The environment determines the state in which the agent acts and the rewards it receives. This allows the agent to learn by predicting the reward that will follow its behavior.


State

The state represents the situation the agent is in as it interacts with its environment. The state can contain a variety of information depending on the environment, and the agent's decisions are based on its current state.

The value of a state is its long-term attractiveness, determined by the states that are expected to follow and the rewards available in those states. For example, a state may yield little immediate reward but still be valuable if it leads to a later state with a greater reward.


Action

In reinforcement learning, the agent chooses among possible actions. An action is a change the agent decides to make to the environment. In each state, the agent decides which of the available actions to take.


Reward

A reward is a signal that an agent receives when it takes a certain action. Rewards play an important role in motivating an agent to achieve a desired goal. Positive rewards can be given for good behavior, and negative rewards can be given for bad behavior.

The ultimate goal of the agent is to maximize the total reward it earns, so the reward signal is what distinguishes good from bad behavioral outcomes for the agent.
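The way the agent, environment, state, action, and reward fit together can be sketched as a simple interaction loop. The `CorridorEnv` class below is a hypothetical toy environment invented for illustration (a five-cell corridor with the goal at the far end), and its `reset`/`step` interface is an assumption modeled loosely on common RL toolkits rather than on any specific library.

```python
import random

class CorridorEnv:
    """Hypothetical toy environment: cells 0..4, goal in the last cell."""
    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action: 0 = move left, 1 = move right (walls clamp the position)
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else -0.1   # positive at the goal, small penalty otherwise
        return self.state, reward, done

def run_episode(env, policy, max_steps=100):
    """Agent-environment loop: observe state, act, receive reward, repeat."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent decides
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward
        if done:
            break
    return total_reward

rng = random.Random(0)

def random_policy(state):
    return rng.choice([0, 1])

def always_right(state):
    return 1

print("always right:", run_episode(CorridorEnv(), always_right))
print("random walk: ", run_episode(CorridorEnv(), random_policy))
```

The policy that always moves right reaches the goal in four steps and earns a total of about 0.7, while a random policy wanders and collects more step penalties along the way.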

Reinforcement learning algorithm classification

What are the components of an Agent?

Before we get into the taxonomy of reinforcement learning, we need to understand the components of a reinforcement learning agent, which form the basis of the taxonomy. There are three main components of a reinforcement learning agent:

  1. Policy
    The behavior pattern of an agent. It tells us what action to take in a given state; that is, it is a function that maps states to actions.
    Policies are broadly divided into deterministic and stochastic policies.
    Deterministic policies give a single action for a given state, while stochastic policies give a probability distribution over actions for a given state.
  2. Value function
    A function that predicts how much reward a state and action will return later.
    That is, a weighted sum of all the rewards the agent will receive in the future after being in that state and taking that action.
    A discount factor γ is used to express a preference for rewards that come sooner over rewards that come later.
  3. Model
    The agent's prediction of what the next state and reward will be. It can be divided into a state model and a reward model.
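As a rough illustration of the first two components, the sketch below shows a deterministic policy, a stochastic policy, and the discounted sum of rewards that a value function tries to predict. The state names, action names, probabilities, and the parameter name `gamma` for the discount factor are all made up for this example.

```python
import random

def discounted_return(rewards, gamma=0.9):
    """Weighted sum of future rewards: r0 + gamma*r1 + gamma^2*r2 + ..."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total

# Deterministic policy: exactly one action per state (hypothetical states).
deterministic_policy = {"s0": "right", "s1": "right"}

# Stochastic policy: a probability distribution over actions per state.
stochastic_policy = {"s0": {"left": 0.2, "right": 0.8}}

def sample_action(distribution, rng):
    """Draw one action according to the policy's probabilities."""
    actions, probs = zip(*distribution.items())
    return rng.choices(actions, weights=probs, k=1)[0]

rng = random.Random(0)
print(deterministic_policy["s0"])
print(sample_action(stochastic_policy["s0"], rng))

# The same reward arriving later is worth less once it is discounted.
print(discounted_return([1.0, 0.0, 0.0]))   # reward now
print(discounted_return([0.0, 0.0, 1.0]))   # same reward two steps later
```

A reward of 1 received immediately contributes its full value, whereas the same reward two steps later contributes only about 0.81 when the discount factor is 0.9.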

Model-Free vs. Model-based

Classify reinforcement learning algorithms

The first criterion for distinguishing between reinforcement learning algorithms is the existence of a model for the environment. Having a model has its own advantages and disadvantages.

  • Pros: Allows for planning
    If you know how the environment will change in response to your actions, you can anticipate the changes and work out the optimal behavior before you actually act. With this kind of planning, agents can act much more efficiently.
  • Cons: Difficult or impossible to implement
    Models frequently fail to reflect the environment properly, and those errors carry over directly into the agent. Creating an accurate model can be just as hard as, if not harder than, creating a good agent.

Because of this trade-off, agents are divided into two groups: agents that use a model are called model-based, and agents that don't are called model-free.
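Planning with a model can be sketched with value iteration, one classic model-based technique. The example assumes the agent already has a perfect, deterministic model of a hypothetical five-cell corridor (`corridor_model`); the names, rewards, and discount factor are illustrative assumptions, and real models are rarely this accurate.

```python
def corridor_model(state, action, length=5):
    """Assumed-known model: returns (next_state, reward, done) for a move."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), length - 1)
    done = next_state == length - 1
    return next_state, (1.0 if done else -0.1), done

def backup(values, state, action, gamma):
    """One-step lookahead that queries the model instead of acting for real."""
    next_state, reward, done = corridor_model(state, action)
    return reward + (0.0 if done else gamma * values[next_state])

def value_iteration(n_states=5, actions=(0, 1), gamma=0.9, tol=1e-8):
    values = [0.0] * n_states   # the last state is terminal and stays at 0
    while True:
        delta = 0.0
        for state in range(n_states - 1):
            best = max(backup(values, state, a, gamma) for a in actions)
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:         # stop once a full sweep changes nothing
            return values

values = value_iteration()
# The plan is read off the values before any real action is taken.
plan = [max((0, 1), key=lambda a: backup(values, s, a, 0.9)) for s in range(4)]
print([round(v, 3) for v in values])
print(plan)
```

Here the state values come out at roughly 0.458, 0.62, 0.8, and 1.0 for the four non-terminal cells, and the resulting plan is to move right everywhere. A model-free agent would have to discover the same behavior by acting in the environment instead.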

Value-Based vs. Policy-Based

The second distinction between reinforcement learning algorithms is the use of value functions and policies.

If the value function is perfect, the optimal policy follows naturally: in each state, the agent simply selects the highest-value action. This is called an implicit policy. Algorithms that only learn the value function and have an implicit policy, like DQN, are called value-based agents.
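A value-based agent with an implicit policy can be sketched with tabular Q-learning, a much simpler relative of DQN. The corridor task, the `step` function, and all hyperparameters below are assumptions chosen for illustration only.

```python
import random

def step(state, action, length=5):
    """Hypothetical corridor dynamics: 0 = left, 1 = right, goal in the last cell."""
    next_state = min(max(state + (1 if action == 1 else -1), 0), length - 1)
    done = next_state == length - 1
    return next_state, (1.0 if done else -0.1), done

def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, length=5, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(length)]   # q[state][action]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:
                action = rng.randrange(2)                          # explore
            else:
                action = max((0, 1), key=lambda a: q[state][a])    # exploit
            next_state, reward, done = step(state, action, length)
            # Move the estimate toward reward + discounted best next value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q

q = q_learning()
# The policy is never stored: it is implied by picking the highest-value action.
greedy_policy = [max((0, 1), key=lambda a: q[s][a]) for s in range(4)]
print(greedy_policy)
```

Only the table of action values is learned here. The behavior falls out of taking the argmax over that table, which is what makes the policy implicit.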

Conversely, if we have a perfect policy, we don't need a value function - after all, the value function is just an intermediate calculation used to create the policy. Agents that only learn a policy without a value function are called policy-based. Examples include Policy Gradient methods.
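For comparison, a policy-based agent adjusts the policy's parameters directly. The sketch below applies a basic policy gradient update (REINFORCE-style, with a running-average baseline) to a softmax policy on a hypothetical three-armed bandit. The task, the names, and the learning rate are illustrative assumptions.

```python
import math
import random

def softmax(prefs):
    """Turn preferences into action probabilities."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]

def policy_gradient_bandit(true_means, steps=5000, lr=0.05, seed=0):
    rng = random.Random(seed)
    prefs = [0.0] * len(true_means)   # policy parameters, one per action
    baseline = 0.0                    # running average reward (reduces variance)
    for t in range(1, steps + 1):
        probs = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=probs, k=1)[0]
        reward = rng.gauss(true_means[action], 1.0)
        baseline += (reward - baseline) / t
        advantage = reward - baseline
        for a in range(len(prefs)):
            # Gradient of log pi(action) with respect to prefs[a] for a softmax policy.
            grad_log = (1.0 if a == action else 0.0) - probs[a]
            prefs[a] += lr * advantage * grad_log
    return softmax(prefs)

final_probs = policy_gradient_bandit([0.2, 0.5, 1.5])
print([round(p, 3) for p in final_probs])
```

No action values are stored at any point. Actions that led to better-than-average reward simply become more probable, so the policy should end up putting most of its probability on the best-paying arm.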

Value-based agents have the advantage of being able to utilize data more efficiently. Policy-based agents, on the other hand, can learn more reliably because they optimize directly for what they want. Agents that learn both a value function and a policy are called actor-critic agents.

Reinforcement learning use cases

Reinforcement learning specializes in solving more complex problems than traditional machine learning methodologies. Let's look at what reinforcement learning is being used for through several case studies.


Robotics

Because robots operate in dynamic, ever-changing environments, it is difficult to predict what will happen next. Reinforcement learning makes robots more robust in settings such as industrial sites and helps them acquire complex behaviors adaptively. In particular, together with computer vision, it is being applied with the goal of replacing repetitive manual inspection and quality control on production assembly lines.

Product assembly and defect inspection

Multi-task data collection, where different robots collect data on different tasks (source: Google Research blog)

Several manufacturers have used reinforcement learning to improve their product assembly process, fully automating it so that managers no longer need to intervene manually. In addition to object detection and object tracking models, deep reinforcement learning models are being trained on multimodal data to identify missing pieces, dents, cracks, scratches, and overall damage in images spanning millions of data points.

Manage inventory

Inventory control performed using computer vision (Source: 9 Reinforcement Learning Real-Life Applications)

Advances in computer vision for tracking inventory in real time have enabled automated inventory management in large warehouses. Deep reinforcement learning agents help find empty containers so that restocking is fully optimized.


Agents in reinforcement learning naturally learn to distinguish between rewarding and punishing actions, and to behave in ways that maximize their rewards. A foundation model, in turn, is a type of machine learning model that represents and generates complex data and is utilized in a variety of fields, including natural language processing and computer vision.

Adaptive Agent Team

DeepMind's AdA (Adaptive Agent) is a new algorithm that uses such foundation models to improve the performance of reinforcement learning. The algorithm first trains a foundation model on a large dataset, and then uses the model to generate tasks for the agent to perform. The results of the agent's actions are then used to update the foundation model.

The Google DeepMind team once described how they trained an agent to play Go. The agent was able to play Go at a superhuman level, despite never having been explicitly taught how to play, and became famous around the world for showing the potential for reinforcement learning to be used in a variety of tasks.

Recently, DeepMind developed AlphaDev based on AlphaZero, a reinforcement learning model trained for games like Go and chess. The main change was that the AI was trained to treat finding a faster algorithm as a kind of game to win. The game consists of selecting computer instructions, placing them in order, and running the result as an algorithm, with the goal of reducing computation time.

Autonomous driving

Driving a vehicle in an open-world environment requires a machine learning model trained on all the scenes and scenarios that can occur in the real world. However, since it is nearly impossible to obtain data for every driving scenario, reinforcement learning can be used to train the model. Following the exploration-and-exploitation principle, reinforcement learning models are trained in a dynamic environment, learning policies from their own experience while minimizing traffic disruption. Based on this, self-driving cars can identify driving zones, handle traffic, and make decisions such as staying within speed limits and avoiding collisions.

Here's how reinforcement learning works in autonomous driving.


Inside the lab where Waymo is building the brains for its driverless cars - The Verge

Waymo's reinforcement learning simulates real-world road conditions to accumulate experience with a wide variety of driving situations, and uses that experience to develop the driving strategy that best fits each identified situation. Waymo's self-driving cars learn from a large number of simulations and are able to drive in a stable and predictable way in real-world road conditions. Trained through reinforcement learning, Waymo's self-driving vehicles have demonstrated a high level of safety and efficiency compared to human drivers.


Imitation Learning for Autonomous Driving in TORCS | PPT

For Tesla, on the other hand, it is not clear whether reinforcement learning is used for autonomous driving; so far the company has said it uses imitation learning. Imitation learning is a methodology that collects data from experts and trains the AI to do things the way the experts do. Unlike the reinforcement learning described above, it belongs to the family of supervised learning methods, in which a computer discovers an algorithm from data paired with the desired results.
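The supervised nature of imitation learning can be sketched with behavior cloning in its simplest tabular form: for each observed state, copy the action the expert chose most often. The states, actions, and demonstration data below are invented for illustration and say nothing about how any real driving system is built.

```python
from collections import Counter, defaultdict

def behavior_cloning(demonstrations):
    """Fit a policy from expert (state, action) pairs by majority vote per state."""
    votes = defaultdict(Counter)
    for state, action in demonstrations:
        votes[state][action] += 1
    return {state: counter.most_common(1)[0][0] for state, counter in votes.items()}

# Hypothetical expert demonstrations: labeled examples, no reward signal involved.
expert_data = [
    ("red_light", "brake"),
    ("red_light", "brake"),
    ("red_light", "go"),      # an occasional expert mistake is outvoted
    ("green_light", "go"),
    ("green_light", "go"),
]

cloned_policy = behavior_cloning(expert_data)
print(cloned_policy)
```

The learner never receives a reward and never explores. It only fits the labeled examples, which is why the quantity and coverage of expert data matter so much for this approach.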

Experts say that because Tesla has over 8 billion miles of real-world driving data, imitation learning would have an advantage over reinforcement learning for the company. However, given Tesla's recent hiring of reinforcement learning experts, the company may be looking to combine the two approaches for future autonomous driving. If Tesla were to apply reinforcement learning methods to its self-driving system, it would collect and analyze real-time data to learn the best behavior in different driving situations. This would allow Tesla vehicles to make more sophisticated judgments and reactions when maneuvering, and to respond better to changes in the driving environment or traffic conditions.


Reinforcement learning is being used in a variety of NLP domains, including text summarization, question answering, translation, and dialog generation. Here, an agent can understand the state of a sentence and construct a set of actions that maximize the value it adds.

Reinforcement learning to NLP

Survey on reinforcement learning for language processing

To summarize, here's how reinforcement learning can be applied to NLP tasks.

Text generation

Dialogue systems

  • Predict the most appropriate response so that a chatbot or virtual assistant system learns how to react to user input.
  • The agent's response is evaluated based on a reward function that can take into account the quality of the response and the user's satisfaction.

Sentiment analysis

  • Predict the sentiment of text to learn how to classify text as positive, negative, or neutral.
  • The agent's predictions are evaluated with a reward function based on how well it classifies the text.

Text summarization

RLHF (Reinforcement Learning from Human Feedback)

One important application of reinforcement learning in natural language processing is learning from human feedback. In Reinforcement Learning from Human Feedback (RLHF), a human annotator provides feedback in the form of a preference comparison between multiple generated responses. This feedback is used to create a reward model that guides the reinforcement learning process, encouraging the model to generate responses that better match human preferences.

For example, RLHF applied to ChatGPT can be summarized in three steps.

RLHF applied to ChatGPT

  1. Supervised Fine-Tuning (SFT)
    Fine-tuning a pre-trained LM with a small sample dataset curated by human labelers to learn human-intended policies
  2. Reward Model (Mimic Human Preferences)
    Collect a dataset in which human labelers rank the answer candidates generated by the model fine-tuned in the SFT step, indicating which answers are best. Use this dataset to train a new reward model.
  3. Reinforcement Learning for SFT Models with Proximal Policy Optimization (PPO)
    Feed the SFT model with input from multiple users, let it interact with the reward model, and iterate on reinforcement learning.
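The reward model step can be illustrated with a pairwise preference loss, a form commonly used when a reward model is trained from human rankings. This is only a sketch: the function names and the example scores are hypothetical, and the text above does not state the exact loss used for ChatGPT.

```python
import math
from itertools import combinations

def preference_loss(reward_preferred, reward_other):
    """Pairwise logistic loss: -log(sigmoid(r_preferred - r_other))."""
    margin = reward_preferred - reward_other
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def ranking_loss(rewards_best_to_worst):
    """Average pairwise loss over every (better, worse) pair in a human ranking."""
    pairs = list(combinations(rewards_best_to_worst, 2))
    return sum(preference_loss(better, worse) for better, worse in pairs) / len(pairs)

# Hypothetical reward-model scores for three answers, listed in human-ranked order.
agrees_with_humans = [2.0, 1.0, 0.0]      # the model scores the preferred answers higher
disagrees_with_humans = [0.0, 1.0, 2.0]   # the model scores them in the reverse order

print(round(ranking_loss(agrees_with_humans), 3))
print(round(ranking_loss(disagrees_with_humans), 3))
```

The loss is small when the reward model scores the answers in the same order the labelers did and large when it reverses them, so minimizing it pushes the reward model toward the human preferences.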

Through this process, OpenAI sought to mitigate alignment issues inherent in LLMs, such as not following the user's explicit instructions, hallucinations that produce non-existent or incorrect facts, the difficulty humans have in understanding how the model arrived at a particular decision or prediction, and biased or toxic answers from language models trained on biased or toxic data. You can read more about the fine-tuning of the RLHF method used in ChatGPT in this paper.

Limitations of Reinforcement Learning

Reinforcement learning is a promising machine learning methodology because it can solve more complex problems with less data than supervised learning and more data than unsupervised learning, but it also has some limitations.

  • Huge data needs
    The quality and quantity of data are crucial because reinforcement learning relies on data. However, collecting data through interaction with the environment is complex and can be costly and time-consuming.
  • Defining complex reward functions
    A poorly defined reward function can lead to unintended consequences through unexpected behavior or incorrect reward signals. Also, because reinforcement learning models select behavior based on rewards, they can become biased toward immediate rewards without considering long-term goals.
  • Adaptability to environmental changes
    A well-trained model should be able to act appropriately in situations it sees for the first time, but that requires gaining new experience through proper exploration along the way. Agents act based on their environment and current state, so if the environment is constantly changing, it can be difficult to make good decisions.
  • Expense/Resources
    The computational complexity of the learning algorithm is enormous. As the size and depth of the learning model increase, so do the computing resources required, and with them the time and cost of training.

Conclusion: Since the reinforcement learning methodology is all about finding the best outcome, the data leading up to the conclusion is important.

As seen in the limitations of Reinforcement Learning from Human Feedback (RLHF), variation in dataset quality can be an obstacle that keeps reinforcement learning from achieving optimal results. To improve the accuracy of reinforcement learning, you need consistent, high-quality datasets. The datasets used for training should reflect a variety of environments and situations, and continuous efforts are needed to improve the quality of the data.

In addition to data consistency, efforts should continue to be made to improve algorithms and models through research and development. Reinforcement learning still faces many challenges and requires new methods and approaches to solve problems. These research and development efforts will further enhance the expertise and utility of reinforcement learning.

Policy development to create more rewards, which is the core of reinforcement learning, and consistency in data quality are major challenges that require continuous research and multifaceted development. By actively responding to these challenges and continuing to innovate, DataHunt is striving to become an important driver for the development of machine learning and the utilization of artificial intelligence.
