
Build a Forex trading bot with reinforcement learning: train a PPO agent on EUR/USD, scale rewards, tune SL/TP, and backtest equity performance.
I Built a Forex Bot with Reinforcement Learning That Outperformed My Old Strategy
Published by Brav
TL;DR
- I trained a PPO agent on 2020-2023 EUR/USD hourly candles and got a positive out-of-sample equity curve.
- The agent uses a 30-candle window, RSI, ATR, MA20, MA50, and a slope as features.
- Reward is profit × 10 000, and a hit on both SL and TP counts as a loss.
- Training for 50 000 timesteps produced a stable policy, but cutting back to 10 000 timesteps reduced over-fitting and improved out-of-sample results.
- You can build this in under a day with four Python files and stable-baselines3.
Why this matters
I was tired of hand-coding moving-average crossovers and still losing money on the 4-hour EUR/USD. The biggest pain for traders is noisy data and over-fitting: a strategy that looks great on the training set can crumble when the market moves. With reinforcement learning (RL) I could let the agent learn from trial and error, discover its own entry and exit points, and tune its own risk settings.
Core concepts
Reinforcement learning, used here in its model-free form, lets an agent learn by interacting with an environment. At each step the agent observes a state (s), chooses an action (a), and receives a reward (r). Over many episodes it learns a policy π that maps states to action probabilities. In my case:
- Observation space – a 30-candle window of price and volume data plus five technical indicators: RSI (14), ATR (14), MA20, MA50, and a slope. Stacked together, each observation has shape (30, 7).
- Action space – discrete: 0 = skip, 1 = long, 2 = short, with an additional discrete variable that selects the distance for stop-loss (SL) and take-profit (TP) from {60, 90, 120} pips.
- Reward – profit × 10 000, to bring the scale in line with the USD 10 000 initial equity. Losses are penalised, and if a candle touches both SL and TP the trade is conservatively counted as a loss, so the agent never gets credit for an ambiguous win.
- Policy – I used Proximal Policy Optimization (PPO) from stable-baselines3 because it works well on noisy, high-dimensional data and is easy to parallelise.
- Environment – implemented as an OpenAI Gym class that resets the equity to USD 10 000, iterates through the candles, and applies the chosen trade.
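The observation and action spaces above can be sketched as a minimal Gym class. This is illustrative only (class and attribute names are mine, not the article's code), with step() and reset() omitted:

```python
import numpy as np
import gym
from gym import spaces

class ForexEnvSkeleton(gym.Env):
    """Skeleton showing only the spaces described in the article."""
    WINDOW = 30       # candles per observation
    N_FEATURES = 7    # price/volume columns plus indicators

    def __init__(self):
        super().__init__()
        # action[0]: 0 = skip, 1 = long, 2 = short
        # action[1]: index into the SL/TP distance menu (60, 90, 120 pips)
        self.action_space = spaces.MultiDiscrete([3, 3])
        self.observation_space = spaces.Box(
            low=-np.inf, high=np.inf,
            shape=(self.WINDOW, self.N_FEATURES), dtype=np.float32)
        self.sl_tp_pips = (60, 90, 120)
```

A MultiDiscrete action space keeps the trade direction and the risk setting as one joint choice, which is what lets the agent "tune its own risk settings".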
Algorithm Comparison
| Algorithm | Training Time | Policy Type | Typical Strength |
|---|---|---|---|
| PPO | Moderate | Policy Gradient | Robust to noise |
| DQN | Longer | Q-Learning | Good for discrete actions |
| SAC | Fast | Actor-Critic | Handles continuous actions |
How to apply it
- Setup
```shell
pip install stable-baselines3[extra] gym pandas pandas-ta numpy
```
(All of these packages are available on PyPI.)
- Data – I grabbed hourly EUR/USD candles from the public repository Forex Sample Dataset — GitHub (2024). Split 2020-2023 for training, 2023-2025 for out-of-sample testing.
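A chronological split is straightforward with a pandas DatetimeIndex. The synthetic series below is a stand-in for the real CSV, and the cut at the end of 2022 is my assumption for where the in-sample window ends:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the hourly EUR/USD file; in practice load it with
# pd.read_csv(..., parse_dates=['Time'], index_col='Time').
idx = pd.date_range('2020-01-01', '2024-12-31 23:00', freq='h')
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {'Close': 1.10 + np.cumsum(rng.normal(0, 1e-4, len(idx)))}, index=idx)

train = df.loc[:'2022-12-31']   # in-sample window
test = df.loc['2023-01-01':]    # out-of-sample window
assert train.index.max() < test.index.min()   # no look-ahead leakage
```

Keeping the split strictly chronological matters more in trading than in most ML tasks: a random split would leak future price information into training.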
- Feature engineering – Using pandas_ta we compute the indicators (note that pandas_ta takes a `length` argument, not TA-Lib's `timeperiod`):

```python
import pandas_ta as ta

df['ATR'] = ta.atr(df['High'], df['Low'], df['Close'], length=14)
df['RSI'] = ta.rsi(df['Close'], length=14)
df['MA20'] = ta.sma(df['Close'], length=20)
df['MA50'] = ta.sma(df['Close'], length=50)
df['Slope'] = df['Close'].pct_change().rolling(5).mean()
```

(See the pandas_ta documentation for the full indicator signatures.)
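One way to assemble the (30, 7) observation window from the prepared DataFrame. The exact seven columns are my assumption, since the article does not enumerate them:

```python
import numpy as np
import pandas as pd

# Assumed feature set: two price/volume columns plus the five indicators.
FEATURES = ['Close', 'Volume', 'RSI', 'ATR', 'MA20', 'MA50', 'Slope']

def make_window(df, t, window=30, cols=FEATURES):
    """Return the observation at step t: the last `window` rows of the
    feature columns, shape (window, len(cols))."""
    return df.iloc[t - window:t][cols].to_numpy(dtype=np.float32)
```

Remember to drop (or skip past) the initial rows where the rolling indicators are NaN before feeding windows to the agent.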
- Gym environment – My ForexEnv inherits from gym.Env. It builds the observation with the 30-candle window, encodes the SL/TP choices, and implements step(), reset(), and render() (the equity curve). The reward calculation is:
```python
profit = (price_exit - price_entry) * direction  # direction: +1 long, -1 short
reward = profit * 10000                          # scale to match the equity
if hit_sl and hit_tp:                            # both levels touched in one candle
    reward = -abs(reward)                        # conservatively book it as a loss
```
The environment also logs the cumulative equity and each individual trade.
- Training – Using PPO:
```python
model = PPO('MlpPolicy', env, verbose=1,
            n_steps=2048, batch_size=64,
            learning_rate=2.5e-4, gamma=0.99,
            ent_coef=0.01, n_epochs=10)
model.learn(total_timesteps=50000)
model.save('model_euro_as_dollar.zip')
```
The training log shows a steady drop in loss and, if you use a scheduler, a gradually decaying learning rate.
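If you do want a scheduler, stable-baselines3 accepts a callable for `learning_rate`: it is called with the remaining training progress, which goes from 1.0 at the start to 0.0 at the end. A minimal linear decay looks like this:

```python
def linear_schedule(initial_lr):
    """Return a schedule for SB3: lr decays linearly from initial_lr to 0."""
    def schedule(progress_remaining):
        # progress_remaining: 1.0 at the first timestep, 0.0 at the last
        return progress_remaining * initial_lr
    return schedule

# Usage sketch: PPO('MlpPolicy', env, learning_rate=linear_schedule(2.5e-4))
```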
- Backtesting – With the saved model I replay the 2023-2025 candles:

```python
model = PPO.load('model_euro_as_dollar.zip')
obs = env.reset()
done = False
while not done:
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
env.plot_equity_curve()
```
The equity curve is plotted with Matplotlib. In my experiment the out-of-sample equity was a bit lower than the training curve, which is expected. The Sharpe ratio (≈0.4) and max drawdown (~18 %) are acceptable for a novice trader.
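The Sharpe ratio and max drawdown quoted above can be computed from the logged equity series. This helper assumes hourly bars (roughly 24 × 252 periods per trading year) and ignores the risk-free rate:

```python
import numpy as np

def sharpe_and_drawdown(equity, periods_per_year=24 * 252):
    """Annualised Sharpe ratio and maximum drawdown of an equity series."""
    equity = np.asarray(equity, dtype=float)
    rets = np.diff(equity) / equity[:-1]          # per-bar simple returns
    sharpe = np.sqrt(periods_per_year) * rets.mean() / rets.std()
    peak = np.maximum.accumulate(equity)          # running high-water mark
    max_dd = ((peak - equity) / peak).max()       # worst peak-to-trough fall
    return sharpe, max_dd
```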
- Tuning – I found that reducing the timesteps from 50 000 to 10 000 reduced over-fitting: the out-of-sample equity improved by ~3 %. Adjusting SL/TP options, adding more indicators (e.g. Bollinger Bands), and playing with learning rates or entropy coefficients can yield further gains.
Pitfalls & edge cases
- Noisy data – EUR/USD is highly volatile. If the reward is too sparse the agent struggles; I mitigated this by scaling the reward and adding a small penalty for long holding periods.
- SL/TP hit conflict – I coded the rule that if a candle hits both, we treat it as a loss. Some traders argue for partial fills; you can modify the environment accordingly.
- Computational cost – PPO can be slow with a large action space. Parallelisation (vectorised environments) in stable-baselines3 helps, but 50 000 timesteps still took ~30 minutes on a single CPU.
- Over-fitting – Always keep a hold-out set and compute out-of-sample metrics. Hyperparameter sweeps (learning rate, timesteps, gamma) are essential.
- Transaction costs & slippage – The current model assumes zero cost. Add a fixed spread or slippage model to see the real-world impact.
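As a starting point for the cost model, a fixed spread can be subtracted from each trade's gross profit. The 1.2-pip spread below is an assumed value, not a measured one:

```python
PIP = 0.0001          # one pip for EUR/USD
SPREAD_PIPS = 1.2     # assumed round-trip spread; adjust to your broker

def net_profit(price_entry, price_exit, direction, spread_pips=SPREAD_PIPS):
    """Gross profit minus a fixed spread cost, in price terms.
    direction: +1 for long, -1 for short."""
    gross = (price_exit - price_entry) * direction
    return gross - spread_pips * PIP
```

Feeding `net_profit` into the reward instead of the raw price difference is usually enough to kill strategies that only look profitable because they trade for free.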
Quick FAQ
| Q | A |
|---|---|
| What is the best reward function for trading? | Scale profits to match equity, penalise large drawdowns, and treat SL/TP hits that overlap as a loss to avoid mis-incentives. |
| How can I know which SL or TP is hit first without tick data? | Without tick data you cannot be certain. If both levels fall inside the candle's range (low ≤ level ≤ high for each), only tick data resolves the order; a common simplification for hourly candles is to assume the level closer to the candle's open is hit first. |
| Can I use this framework on other currency pairs? | Yes. Just replace the dataset and retrain. The same hyperparameters usually work. |
| How do I incorporate transaction costs? | Subtract the spread from the entry and exit price in the reward calculation, or model a fixed commission. |
| Why did my out-of-sample equity drop? | Likely over-fitting or regime change. Try reducing timesteps, adding regularisation, or using early stopping. |
| Is PPO the best RL algorithm for Forex? | PPO is robust and simple, but algorithms like SAC or DDPG can perform better on continuous action spaces. Test multiple algorithms. |
| How to avoid the agent taking too many small trades? | Penalise trade frequency in the reward or add an action penalty. |
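The same-candle SL/TP rule from the pitfalls section can be made explicit in a small helper (a sketch; `resolve_exit` is my name, not the article's):

```python
def resolve_exit(candle_low, candle_high, sl, tp):
    """Classify a candle against the SL/TP levels without tick data.
    If both levels fall inside the candle's range, the article's
    conservative rule books the trade as a loss."""
    hit_sl = candle_low <= sl <= candle_high
    hit_tp = candle_low <= tp <= candle_high
    if hit_sl and hit_tp:
        return 'loss'   # ambiguous candle: conservative tie-break
    if hit_sl:
        return 'loss'
    if hit_tp:
        return 'win'
    return 'open'       # neither level touched; position stays open
```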
Conclusion
Building an RL Forex bot is surprisingly approachable. I spent less than a day writing four Python files, and the agent produced a decent out-of-sample equity curve. The key takeaways:
- Use a solid, noise-resistant observation window (30 candles + indicators).
- Scale rewards so the agent sees the magnitude of profits in relation to equity.
- Keep the action space discrete but flexible (SL/TP choices).
- Always test on a hold-out set and monitor over-fitting.
- Iterate: tweak hyperparameters, add indicators, and consider transaction costs.
If you’re a data scientist or trader looking to experiment, start with the code skeleton above, swap in your own data, and let the agent learn. The learning curve is steep, but the payoff—an autonomous strategy that learns its own risk profile—can be worth the effort.
⚠️ Disclaimer This article is for educational purposes only. It is not financial advice, and you should not rely on the results without thorough testing. Trading involves risk, and past performance is not indicative of future results.





