Flappy Bird Reinforcement Learning Agent

This project implements a reinforcement learning agent for the Flappy Bird game using the REINFORCE (Policy Gradient) algorithm. The agent learns to play Flappy Bird by maximizing the cumulative reward through trial and error.

Features

Policy Gradient (REINFORCE) algorithm implementation
Neural network policy with multiple hidden layers
Checkpoint saving and resuming for training
Evaluation with statistical analysis
Video recording of gameplay
Dynamic plotting of training progress

Dependencies

Python 3.13
PyTorch
NumPy
Matplotlib
Gymnasium
Flappy Bird Gymnasium
ImageIO
tqdm
PIL (Pillow)
Hugging Face Hub (for model uploading)

Installation

Create a virtual environment:

python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Or manually:

pip install torch numpy matplotlib gymnasium flappy-bird-gymnasium imageio tqdm pillow huggingface-hub

Usage

Training

Run the training script with default settings (trains and evaluates):

python agent.py

The agent will train for 100,000 episodes by default. Checkpoints are saved every 100 episodes to checkpoints/checkpoint.pth, and the final model is saved to checkpoints/final_model.pth.

Configuration

Modify the hyperparams dictionary in agent.py to adjust:

no_of_episodes: Number of training episodes
gamma: Discount factor
lr: Learning rate
fps: Frames per second for video output
save_videos: Whether to save training/evaluation videos
train: Whether to run training (default: True)
evaluate: Whether to run evaluation (default: True)
push_hf: Whether to push model to Hugging Face Hub (default: False)
hf_repo: Hugging Face repository name (required if push_hf is True)
eval_episodes: Number of evaluation episodes (default: 5)
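
As a rough illustration of how these keys fit together, the dictionary might look like the sketch below. Only no_of_episodes, train, evaluate, push_hf, and eval_episodes use defaults stated in this README; the gamma, lr, fps, and save_videos values are placeholders, and validate_hyperparams is a hypothetical helper that is not part of agent.py.

```python
# Illustrative defaults only; values marked "assumed" are not taken from agent.py.
hyperparams = {
    "no_of_episodes": 100_000,  # default stated in the Training section
    "gamma": 0.99,              # assumed placeholder
    "lr": 1e-4,                 # assumed placeholder
    "fps": 30,                  # assumed placeholder
    "save_videos": True,        # assumed placeholder
    "train": True,
    "evaluate": True,
    "push_hf": False,
    "hf_repo": None,            # required when push_hf is True
    "eval_episodes": 5,
}


def validate_hyperparams(params):
    """Hypothetical helper: reject inconsistent settings before a run starts."""
    if not 0.0 < params["gamma"] <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    if params["lr"] <= 0:
        raise ValueError("lr must be positive")
    if params["push_hf"] and not params["hf_repo"]:
        raise ValueError("hf_repo is required when push_hf is True")
    return params


validate_hyperparams(hyperparams)
```

A check like this would catch a missing hf_repo before a long training run starts rather than at upload time.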

Examples

Train only (modify hyperparams):

hyperparams["evaluate"] = False

Evaluate only (modify hyperparams):

hyperparams["train"] = False
hyperparams["eval_episodes"] = 10

Train and push to HF (modify hyperparams):

hyperparams["push_hf"] = True
hyperparams["hf_repo"] = "your-username/flappy-bird-rl-agent"

Resuming Training

If a checkpoint exists in checkpoints/, training automatically resumes from the last saved checkpoint.
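
One common way to implement this kind of resume logic in PyTorch is sketched below. The dictionary keys (policy_state, optimizer_state, episode) and the two function names are assumptions for illustration; the actual checkpoint layout used by agent.py may differ.

```python
import os

import torch
from torch import nn

CHECKPOINT_PATH = "checkpoints/checkpoint.pth"  # path named in the Training section


def save_checkpoint(path, policy, optimizer, episode):
    """Write model weights, optimizer state, and the episode counter to disk."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    torch.save(
        {
            "policy_state": policy.state_dict(),
            "optimizer_state": optimizer.state_dict(),
            "episode": episode,
        },
        path,
    )


def load_checkpoint(path, policy, optimizer):
    """Restore state if a checkpoint exists; return the episode to resume from."""
    if not os.path.exists(path):
        return 0  # no checkpoint: start from scratch
    state = torch.load(path, map_location="cpu")
    policy.load_state_dict(state["policy_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["episode"] + 1
```

Saving the optimizer state alongside the weights matters for optimizers such as Adam, whose running moment estimates would otherwise restart from zero after a resume.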

Evaluation

The script automatically evaluates the trained model after training. To evaluate a saved model without retraining, set hyperparams["train"] = False as shown in the Examples section, or modify the evaluate_policy call in agent.py.

Uploading to Hugging Face

To upload your trained model and project to Hugging Face Hub:

Get your Hugging Face token:
- Go to Hugging Face Settings
- Create a new token with "Write" permissions
- Copy the token

Set the token as an environment variable:
```
export HF_TOKEN=your_token_here
```

Modify hyperparams in agent.py:

hyperparams["push_hf"] = True
hyperparams["hf_repo"] = "your-username/flappy-bird-rl-agent"

The entire project (excluding unnecessary files) will be uploaded to your repository.

Output

Training plots: output/rewards_vs_episodes.png, output/loss_vs_episodes.png
Training video: output/flappy_bird_training.mp4 (if save_videos=True)
Evaluation video: output/flappy_bird_evaluation.mp4
Checkpoints: checkpoints/

Results

Training Progress

Figure 1: Total reward per episode during training

Figure 2: Loss per episode during training

Evaluation

The evaluation video shows the trained agent playing Flappy Bird. To create a 3-second GIF from the last 3 seconds of the video:

ffmpeg -sseof -3 -i output/flappy_bird_evaluation.mp4 -t 3 -vf "fps=10,scale=320:-1:flags=lanczos" output/flappy_bird_evaluation.gif

Figure 3: GIF of the evaluation gameplay

Final Evaluation Statistics

Episode 1: Total Reward: 597.100000000053
Episode 2: Total Reward: 2084.89999999937
Episode 3: Total Reward: 428.4000000000229
Episode 4: Total Reward: 22.700000000000063
Episode 5: Total Reward: 152.89999999999645

Evaluation Statistics:
Mean Reward: 657.20
Standard Deviation: 741.78
Max Reward: 2084.90
Mean Score: 446.00
Max Score: 446.00
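
The reward statistics can be reproduced from the five episode totals with the standard library, as in the sketch below. The reported standard deviation is consistent with the population formula (dividing by N, which is also the default of NumPy's np.std) rather than the sample formula. The summarize function is a hypothetical name, not the project's evaluation code.

```python
import statistics

# Per-episode totals copied from the evaluation log above, rounded to one decimal.
rewards = [597.1, 2084.9, 428.4, 22.7, 152.9]


def summarize(values):
    """Return mean, population standard deviation, and maximum of the rewards."""
    return {
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population std (divides by N)
        "max": max(values),
    }


summary = summarize(rewards)
print(f"Mean Reward: {summary['mean']:.2f}")         # 657.20
print(f"Standard Deviation: {summary['std']:.2f}")   # 741.78
print(f"Max Reward: {summary['max']:.2f}")           # 2084.90
```

With only five episodes and a standard deviation larger than the mean, these figures describe a high-variance policy; a larger eval_episodes value would give a more stable estimate.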

Architecture

The policy network consists of:

Input layer: Matches observation space dimension
Hidden layers: 256, 128, 64 neurons with ReLU activation
Output layer: Softmax over action space
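
A minimal PyTorch sketch of a network with this shape is shown below. The build_policy name and the sizes obs_dim=12 and n_actions=2 are assumptions for illustration; the real dimensions should be read from the environment's observation and action spaces, and the class used in agent.py may be organized differently.

```python
import torch
from torch import nn


def build_policy(obs_dim, n_actions):
    """MLP policy: obs_dim -> 256 -> 128 -> 64 -> n_actions, softmax output."""
    return nn.Sequential(
        nn.Linear(obs_dim, 256), nn.ReLU(),
        nn.Linear(256, 128), nn.ReLU(),
        nn.Linear(128, 64), nn.ReLU(),
        nn.Linear(64, n_actions),
        nn.Softmax(dim=-1),  # action probabilities sum to 1
    )


# Assumed sizes for illustration; read the real ones from the environment's spaces.
policy = build_policy(obs_dim=12, n_actions=2)
probs = policy(torch.zeros(1, 12))
# Sample an action from the categorical distribution defined by the policy output.
action = torch.distributions.Categorical(probs=probs).sample()
```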

Algorithm

REINFORCE uses Monte Carlo policy gradients to update the policy parameters by maximizing the expected cumulative reward. The agent samples actions from the current policy, collects trajectories, and updates the policy using the discounted returns.
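
The two core computations can be sketched in plain Python as follows: the discounted return G_t = r_t + gamma * G_{t+1} for each step, and the loss -sum_t log pi(a_t | s_t) * G_t whose gradient gives the policy update. The function names are illustrative, and whether agent.py also normalizes returns or subtracts a baseline is not stated in this README.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_loss(log_probs, returns):
    """Monte Carlo policy-gradient loss: -sum_t log pi(a_t | s_t) * G_t."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

Minimizing this loss raises the log-probability of actions that were followed by high returns, which is the sense in which the update maximizes expected cumulative reward.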
