Flappy Bird Reinforcement Learning Agent
This project implements a reinforcement learning agent for the Flappy Bird game using the REINFORCE (Policy Gradient) algorithm. The agent learns to play Flappy Bird by maximizing the cumulative reward through trial and error.
Features
- Policy Gradient (REINFORCE) algorithm implementation
- Neural network policy with multiple hidden layers
- Checkpoint saving and resuming for training
- Evaluation with statistical analysis
- Video recording of gameplay
- Dynamic plotting of training progress
Dependencies
- Python 3.13
- PyTorch
- NumPy
- Matplotlib
- Gymnasium
- Flappy Bird Gymnasium
- ImageIO
- tqdm
- PIL (Pillow)
- Hugging Face Hub (for model uploading)
Installation
Create a virtual environment:
python -m venv env
source env/bin/activate  # On Windows: env\Scripts\activate
Install dependencies:
pip install -r requirements.txt
Or manually:
pip install torch numpy matplotlib gymnasium flappy-bird-gymnasium imageio tqdm pillow huggingface-hub
Usage
Training
Run the training script with default settings (trains and evaluates):
python agent.py
The agent will train for 100,000 episodes by default. Checkpoints are saved every 100 episodes to checkpoints/checkpoint.pth, and the final model is saved to checkpoints/final_model.pth.
Configuration
Modify the hyperparams dictionary in agent.py to adjust:
- no_of_episodes: Number of training episodes
- gamma: Discount factor
- lr: Learning rate
- fps: Frames per second for video output
- save_videos: Whether to save training/evaluation videos
- train: Whether to run training (default: True)
- evaluate: Whether to run evaluation (default: True)
- push_hf: Whether to push model to Hugging Face Hub (default: False)
- hf_repo: Hugging Face repository name (required if push_hf is True)
- eval_episodes: Number of evaluation episodes (default: 5)
Examples
Train only (modify hyperparams):
hyperparams["evaluate"] = False
Evaluate only (modify hyperparams):
hyperparams["train"] = False
hyperparams["eval_episodes"] = 10
Train and push to HF (modify hyperparams):
hyperparams["push_hf"] = True
hyperparams["hf_repo"] = "your-username/flappy-bird-rl-agent"
Resuming Training
If a checkpoint exists in checkpoints/, training automatically resumes from the last saved checkpoint.
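A save-and-resume cycle of this kind can be sketched as follows. This is a minimal illustration, not the code in agent.py: the helper names save_checkpoint and load_checkpoint and the dictionary keys are hypothetical, and it assumes the checkpoint stores the model state, the optimizer state, and the last completed episode.

```python
import os

import torch
import torch.nn as nn

CHECKPOINT_PATH = "checkpoints/checkpoint.pth"  # path named in this README


def save_checkpoint(path, model, optimizer, episode):
    """Write model, optimizer, and episode counter to disk (hypothetical layout)."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    torch.save(
        {
            "episode": episode,
            "model_state": model.state_dict(),
            "optimizer_state": optimizer.state_dict(),
        },
        path,
    )


def load_checkpoint(path, model, optimizer):
    """Restore state if a checkpoint exists; return the episode to start from."""
    if not os.path.exists(path):
        return 0  # no checkpoint yet: start from the first episode
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model_state"])
    optimizer.load_state_dict(state["optimizer_state"])
    return state["episode"] + 1  # continue after the last saved episode
```

With helpers like these, a training loop would call load_checkpoint once before the first episode and save_checkpoint every 100 episodes.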
Evaluation
The script automatically evaluates the trained model after training. To evaluate a saved model separately, modify the evaluate_policy call in agent.py.
Uploading to Hugging Face
To upload your trained model and project to Hugging Face Hub:
Get your Hugging Face token:
- Go to Hugging Face Settings
- Create a new token with "Write" permissions
- Copy the token
Set the token as an environment variable:
export HF_TOKEN=your_token_here
Modify hyperparams in agent.py:
hyperparams["push_hf"] = True
hyperparams["hf_repo"] = "your-username/flappy-bird-rl-agent"
The entire project (excluding unnecessary files) will be uploaded to your repository.
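For reference, an upload of this kind can be sketched with the huggingface_hub client. This is an assumption about how such a push might look, not the code in agent.py: the function name push_project is hypothetical, and the ignore patterns are only a guess at what "unnecessary files" covers.

```python
import os


def push_project(repo_id: str, folder: str = ".") -> None:
    """Upload a project folder to the Hugging Face Hub (illustrative sketch)."""
    token = os.environ.get("HF_TOKEN")
    if not token:
        raise RuntimeError("HF_TOKEN is not set; export it before pushing")

    # Imported here so the token check above runs even without the package.
    from huggingface_hub import HfApi

    api = HfApi(token=token)
    # Create the repository if it does not exist yet.
    api.create_repo(repo_id=repo_id, exist_ok=True)
    # Assumed exclusions: the virtual environment and Python bytecode caches.
    api.upload_folder(
        folder_path=folder,
        repo_id=repo_id,
        ignore_patterns=["env/*", "__pycache__/*", "*.pyc"],
    )
```

Calling push_project("your-username/flappy-bird-rl-agent") with HF_TOKEN exported would then upload the current directory.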
Output
- Training plots: output/rewards_vs_episodes.png, output/loss_vs_episodes.png
- Training video: output/flappy_bird_training.mp4 (if save_videos=True)
- Evaluation video: output/flappy_bird_evaluation.mp4
- Checkpoints: checkpoints/
Results
Training Progress
Figure 1: Total reward per episode during training
Figure 2: Loss per episode during training
Evaluation
The evaluation video shows the trained agent playing Flappy Bird. To create a 3-second GIF from the last 3 seconds of the video:
ffmpeg -sseof -3 -i output/flappy_bird_evaluation.mp4 -t 3 -vf "fps=10,scale=320:-1:flags=lanczos" output/flappy_bird_evaluation.gif
Figure 3: GIF of the evaluation gameplay
Final Evaluation Statistics
Episode 1: Total Reward: 597.100000000053
Episode 2: Total Reward: 2084.89999999937
Episode 3: Total Reward: 428.4000000000229
Episode 4: Total Reward: 22.700000000000063
Episode 5: Total Reward: 152.89999999999645
Evaluation Statistics:
Mean Reward: 657.20
Standard Deviation: 741.78
Max Reward: 2084.90
Mean Score: 446.00
Max Score: 446.00
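The reward statistics above can be reproduced from the five episode rewards. The sketch below uses the rewards rounded to one decimal and Python's standard statistics module rather than whatever agent.py uses internally. The reported standard deviation matches the population form (dividing by N), which is also the NumPy default.

```python
import statistics

# Episode rewards from the evaluation run above, rounded to one decimal.
rewards = [597.1, 2084.9, 428.4, 22.7, 152.9]

mean_reward = statistics.fmean(rewards)
std_reward = statistics.pstdev(rewards)  # population standard deviation
max_reward = max(rewards)

print(f"Mean Reward: {mean_reward:.2f}")        # Mean Reward: 657.20
print(f"Standard Deviation: {std_reward:.2f}")  # Standard Deviation: 741.78
print(f"Max Reward: {max_reward:.2f}")          # Max Reward: 2084.90
```

With only five episodes and a spread this wide, the mean is dominated by the single 2084.9 run, so a larger eval_episodes value gives a more stable estimate.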
Architecture
The policy network consists of:
- Input layer: Matches observation space dimension
- Hidden layers: 256, 128, 64 neurons with ReLU activation
- Output layer: Softmax over action space
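A network with this shape can be sketched in PyTorch as follows. The class name PolicyNetwork and the sizes obs_dim=12 and n_actions=2 are placeholders for illustration; in practice they would be read from env.observation_space.shape[0] and env.action_space.n.

```python
import torch
import torch.nn as nn


class PolicyNetwork(nn.Module):
    """MLP policy: observation -> 256 -> 128 -> 64 -> action probabilities."""

    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Softmax turns the final layer's logits into a probability distribution.
        return torch.softmax(self.layers(obs), dim=-1)


# Placeholder sizes (assumed, not taken from agent.py).
policy = PolicyNetwork(obs_dim=12, n_actions=2)
probs = policy(torch.zeros(1, 12))

# Sample an action from the policy and keep its log-probability for the update.
dist = torch.distributions.Categorical(probs=probs)
action = dist.sample()
log_prob = dist.log_prob(action)
```

Sampling from the distribution, rather than taking the most probable action, is what lets the agent explore during training.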
Algorithm
REINFORCE uses Monte Carlo policy gradients to update the policy parameters by maximizing the expected cumulative reward. The agent samples actions from the current policy, collects trajectories, and updates the policy using the discounted returns.
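The update described above can be sketched in a few lines. The function names are illustrative rather than taken from agent.py, and the sketch omits refinements such as return normalization or a baseline, which the source does not mention. For each step t, the discounted return is G_t = r_t + gamma * G_{t+1}, and the loss is the negative sum of log pi(a_t | s_t) * G_t over the episode.

```python
def discounted_returns(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for reward in reversed(rewards):  # accumulate from the final step backward
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(log_probs, returns):
    """Negative sum of log-probability times return over one episode.

    Works with plain floats; with PyTorch scalar tensors as log_probs the
    same expression yields a loss that can be backpropagated.
    """
    return -sum(lp * g for lp, g in zip(log_probs, returns))


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
print(returns)  # [1.75, 1.5, 1.0]
```

Minimizing this loss raises the probability of actions that were followed by high returns and lowers it for actions followed by low returns.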