Reinforcement Learning
Reinforcement Learning (RL) is a type of Machine Learning where an agent learns by trial and error through interacting with an environment. It isn't given labelled examples or a dataset to study. Instead, it takes actions, sees what happens, and receives rewards for good choices and penalties for bad ones — gradually learning a strategy that earns the most reward over time.
Think of how you train a dog with treats, or how a child learns to ride a bike: there's no answer key, just feedback from trying. The agent learns what works by experiencing the consequences of its own actions.
💡 In one line: Reinforcement Learning is learning by doing — an agent improves by earning rewards and avoiding penalties through trial and error.
How Reinforcement Learning Works
RL is built around a continuous feedback loop between an agent and its environment:
- The agent observes the current state of the environment.
- It chooses an action based on its current strategy.
- The environment responds with a reward (or penalty) and a new state.
- The agent updates its strategy to favour actions that lead to higher rewards.
- This loop repeats many times, often thousands of iterations or more, as the agent refines its strategy (its policy) towards the one that earns the most reward.
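The loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `TwoDoorEnv`, `AveragingAgent`, and `run_loop` are hypothetical names invented here, the environment has a single state, and the "strategy update" is a simple running average of reward per action rather than any particular published algorithm.

```python
import random


class TwoDoorEnv:
    """Hypothetical one-state environment: door 1 pays off more often than door 0."""

    def __init__(self, seed=0):
        self.rng = random.Random(seed)

    def reset(self):
        return 0  # the only state

    def step(self, action):
        win_prob = 0.8 if action == 1 else 0.2
        reward = 1.0 if self.rng.random() < win_prob else -1.0  # reward or penalty
        next_state = 0
        return next_state, reward


class AveragingAgent:
    """Tracks the average reward of each action and mostly picks the best one."""

    def __init__(self, n_actions=2, explore_prob=0.1, seed=1):
        self.rng = random.Random(seed)
        self.explore_prob = explore_prob
        self.avg_reward = [0.0] * n_actions
        self.counts = [0] * n_actions

    def choose_action(self, state):
        # Occasionally try a random action; otherwise pick the best average so far.
        if self.rng.random() < self.explore_prob:
            return self.rng.randrange(len(self.avg_reward))
        return max(range(len(self.avg_reward)), key=lambda a: self.avg_reward[a])

    def update(self, state, action, reward, next_state):
        # With a single state, only the action and reward matter here.
        self.counts[action] += 1
        # Incremental mean: move the estimate towards the new reward.
        self.avg_reward[action] += (reward - self.avg_reward[action]) / self.counts[action]


def run_loop(env, agent, n_steps=2000):
    state = env.reset()                                   # observe the current state
    for _ in range(n_steps):
        action = agent.choose_action(state)               # choose an action
        next_state, reward = env.step(action)             # receive reward and new state
        agent.update(state, action, reward, next_state)   # update the strategy
        state = next_state                                # repeat from the new state
    return agent


agent = run_loop(TwoDoorEnv(), AveragingAgent())
print(agent.avg_reward)  # the estimate for door 1 should end up higher than for door 0
```

The agent is never told which door is better. It only sees rewards, and its averages gradually reflect the difference.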
Key Terms
| Term | Meaning |
|---|---|
| Agent | The learner or decision-maker (e.g. a robot, a game player) |
| Environment | The world the agent interacts with |
| State | The current situation the agent is in |
| Action | A choice the agent can make |
| Reward | Feedback signal — positive for good actions, negative for bad |
| Policy | The agent's strategy for choosing actions in each state |
A Simple Example
Imagine teaching an AI to play a maze game. At first it knows nothing:
- It tries moving in random directions (actions).
- Hitting a wall gives a small penalty; moving closer to the exit gives a small reward; reaching the exit gives a big reward.
- Over many attempts, it learns which paths earn the most reward and which to avoid.
Nobody told the agent the correct route — it discovered the best strategy purely through trial, error, and reward.
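As a rough sketch of how such a maze agent could be trained, the code below uses tabular Q-learning on a tiny grid. Everything specific is an assumption made for illustration: the 3x3 layout with no inner walls, the reward values (-1 for hitting the outer wall, +0.1 for moving closer, +10 for the exit), and the learning settings `alpha`, `gamma`, and `epsilon`.

```python
import random

SIZE = 3                      # hypothetical 3x3 maze with no inner walls
START, EXIT = (0, 0), (2, 2)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def distance_to_exit(cell):
    return abs(cell[0] - EXIT[0]) + abs(cell[1] - EXIT[1])


def step(state, action):
    """Return (next_state, reward, done) for one move."""
    dr, dc = ACTIONS[action]
    row, col = state[0] + dr, state[1] + dc
    if not (0 <= row < SIZE and 0 <= col < SIZE):
        return state, -1.0, False          # hit the outer wall: small penalty
    next_state = (row, col)
    if next_state == EXIT:
        return next_state, 10.0, True      # reached the exit: big reward
    closer = distance_to_exit(next_state) < distance_to_exit(state)
    return next_state, (0.1 if closer else 0.0), False


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, max_steps=200, seed=0):
    rng = random.Random(seed)
    names = list(ACTIONS)
    # One value estimate per (state, action) pair, all starting at zero.
    q = {((r, c), a): 0.0 for r in range(SIZE) for c in range(SIZE) for a in names}
    for _ in range(episodes):
        state = START
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.choice(names)                        # random move
            else:
                action = max(names, key=lambda a: q[(state, a)])  # best known move
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[(next_state, a)] for a in names)
            # Nudge the estimate towards reward + discounted value of the next state.
            q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
            state = next_state
            if done:
                break
    return q


def greedy_path(q, max_steps=20):
    """Follow the highest-valued action from START and record the cells visited."""
    state, path = START, [START]
    for _ in range(max_steps):
        action = max(ACTIONS, key=lambda a: q[(state, a)])
        state, _, done = step(state, action)
        path.append(state)
        if done:
            break
    return path


q_table = train()
path = greedy_path(q_table)
print(path)  # the cells visited when always taking the highest-valued action
```

After training, following the highest-valued action from the start should lead to the exit, even though no route was ever given to the agent.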
Exploration vs. Exploitation
A core challenge in RL is balancing two needs:
- Exploration — trying new, untested actions to discover better rewards.
- Exploitation — sticking with actions already known to give good rewards.
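Too much exploration wastes time on bad moves; too much exploitation may miss a better strategy. A common way to balance the two is to explore often early in training and shift towards exploitation as the agent's estimates become more reliable.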
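One simple and widely used way to mix the two is an epsilon-greedy rule: with probability `epsilon` the agent picks a random action, otherwise it picks the action with the best estimate so far. The sketch below compares three settings on a made-up three-option problem. The payout probabilities, step count, and function name `run_bandit` are assumptions chosen for illustration.

```python
import random


def run_bandit(epsilon, n_steps=5000, seed=0):
    """Play a 3-option game with an epsilon-greedy rule; return the average reward."""
    rng = random.Random(seed)
    win_probs = [0.2, 0.5, 0.8]          # hidden from the agent; option 2 is best
    estimates = [0.0] * len(win_probs)   # running average reward per option
    pulls = [0] * len(win_probs)
    total = 0.0
    for _ in range(n_steps):
        if rng.random() < epsilon:
            arm = rng.randrange(len(win_probs))                            # explore
        else:
            arm = max(range(len(win_probs)), key=lambda a: estimates[a])   # exploit
        reward = 1.0 if rng.random() < win_probs[arm] else 0.0
        pulls[arm] += 1
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]  # incremental mean
        total += reward
    return total / n_steps


always_exploit = run_bandit(epsilon=0.0)   # never tries anything new
always_explore = run_bandit(epsilon=1.0)   # always picks at random
balanced = run_bandit(epsilon=0.1)         # mostly exploits, sometimes explores
print(always_exploit, always_explore, balanced)
```

In this setup the pure exploiter never leaves the first option, because ties are broken towards the lowest index and it never samples the others. The pure explorer earns only the average payout. The balanced agent finds the best option and mostly sticks with it.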
Common Reinforcement Learning Algorithms
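- Q-Learning: learns the expected long-term reward (the Q-value) of taking each action in each state.
- SARSA: similar to Q-Learning, but it updates using the action the agent actually takes next, whereas Q-Learning assumes the best available next action.
- Deep Q-Networks (DQN): combine Q-Learning with neural networks, which estimate Q-values when there are too many states to store in a table.
- Policy Gradient methods: learn the policy directly by shifting it towards actions that earned more reward, rather than learning action values first.
- Actor-Critic: pair a policy-based actor that chooses actions with a value-based critic that scores them, which tends to make learning more stable.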
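The difference between the first two algorithms is easiest to see in their update rules. The sketch below writes both as small Python functions over a dictionary of value estimates. The function names, the state labels `"s0"` and `"s1"`, and the values of `alpha` (learning rate) and `gamma` (discount factor) are assumptions for illustration.

```python
def q_learning_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """Q-Learning: the target uses the best-valued action in the next state."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)


def sarsa_update(q, state, action, reward, next_state, next_action, alpha=0.1, gamma=0.9):
    """SARSA: the target uses the action the agent actually takes next."""
    old = q.get((state, action), 0.0)
    target = reward + gamma * q.get((next_state, next_action), 0.0)
    q[(state, action)] = old + alpha * (target - old)


ACTIONS = ["left", "right"]
# Hypothetical current estimates for the next state "s1".
start_q = {("s1", "left"): 1.0, ("s1", "right"): 5.0}

q_learning_q = dict(start_q)
sarsa_q = dict(start_q)

# Same transition for both: from "s0", action "right", reward 0, landing in "s1".
q_learning_update(q_learning_q, "s0", "right", 0.0, "s1", ACTIONS)
# Suppose the agent's next move is the exploratory, lower-valued "left".
sarsa_update(sarsa_q, "s0", "right", 0.0, "s1", next_action="left")

print(q_learning_q[("s0", "right")])  # about 0.45, based on the best next action
print(sarsa_q[("s0", "right")])       # about 0.09, based on the action actually taken
```

Given the same experience, Q-Learning is more optimistic here because it assumes the best next action, while SARSA's estimate reflects what the agent really did next, including its exploratory moves.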
Pros and Cons of Reinforcement Learning
| ✅ Pros (Advantages) | ⚠️ Cons (Challenges) |
|---|---|
| Learns without labelled data | Needs a huge number of trials to learn |
| Handles complex, sequential decision-making | Training can be slow and computationally expensive |
| Adapts to changing environments | Designing the right reward signal is tricky |
| Can discover strategies humans never thought of | Poorly designed rewards lead to unwanted behaviour |
| Excellent for control, games, and robotics | Hard to apply safely in the real world during learning |
⚠️ Reward design matters: if the reward is set up carelessly, the agent may "cheat" — finding a high-reward shortcut that wasn't the intended goal.
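A small worked example shows how this can happen, using a variant of the maze from earlier. Assume, purely for illustration, that moving closer to the exit earns +1, moving away costs nothing, the exit earns +10, and each attempt allows up to 100 moves. The function name `episode_rewards` and all numbers are hypothetical.

```python
def episode_rewards(closer_reward, away_reward, exit_reward, move_budget=100):
    """Total reward for two behaviours under a given reward design.

    direct: three moves closer, then a fourth move that reaches the exit.
    pacing: step away, step closer, repeated until the move budget runs out.
    """
    direct = [closer_reward] * 3 + [exit_reward]
    pacing = [away_reward, closer_reward] * (move_budget // 2)
    return sum(direct), sum(pacing)


# Careless design: moving away costs nothing, so pacing back and forth pays.
careless = episode_rewards(closer_reward=1.0, away_reward=0.0, exit_reward=10.0)
print(careless)   # (13.0, 50.0): pacing earns more than finishing the maze

# One possible repair: moving away costs as much as moving closer earns.
repaired = episode_rewards(closer_reward=1.0, away_reward=-1.0, exit_reward=10.0)
print(repaired)   # (13.0, 0.0): reaching the exit is now the better choice
```

Under the careless design, an agent that maximises reward would learn to pace back and forth instead of leaving the maze, which is exactly the kind of shortcut the warning describes.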
Applications of Reinforcement Learning
| Domain | Use |
|---|---|
| Games | Mastering board games and video games at superhuman level |
| Robotics | Teaching robots to walk, grasp, and balance |
| Self-driving | Learning driving decisions and control |
| Finance | Automated trading strategies |
| Operations | Optimising logistics, energy use, and scheduling |
| Recommendations | Adapting suggestions based on user feedback over time |
Summary
- Reinforcement Learning trains an agent to make decisions by trial and error, using rewards and penalties as feedback.
- It works through a continuous loop: observe state → take action → receive reward + new state → update strategy.
- Key concepts include the agent, environment, state, action, reward, and policy, plus the exploration vs. exploitation trade-off.
- Common algorithms include Q-Learning, SARSA, Deep Q-Networks, Policy Gradient methods, and Actor-Critic.
- RL excels at sequential decision-making in games, robotics, and control, but it needs many trials and careful reward design to work well.