Flappy Bird Reinforcement Learning Agent

This project implements a reinforcement learning agent for the Flappy Bird game using the REINFORCE (Policy Gradient) algorithm. The agent learns to play Flappy Bird by maximizing the cumulative reward through trial and error.

Video Preview

Features

  • Policy Gradient (REINFORCE) algorithm implementation
  • Neural network policy with multiple hidden layers
  • Checkpoint saving and resuming for training
  • Evaluation with statistical analysis
  • Video recording of gameplay
  • Dynamic plotting of training progress

Dependencies

  • Python 3.13
  • PyTorch
  • NumPy
  • Matplotlib
  • Gymnasium
  • Flappy Bird Gymnasium
  • ImageIO
  • tqdm
  • PIL (Pillow)
  • Hugging Face Hub (for model uploading)

Installation

  1. Create a virtual environment:

    python -m venv env
    source env/bin/activate  # On Windows: env\Scripts\activate
    
  2. Install dependencies:

    pip install -r requirements.txt
    

    Or manually:

    pip install torch numpy matplotlib gymnasium flappy-bird-gymnasium imageio tqdm pillow huggingface-hub
    

Usage

Training

Run the training script with default settings (trains and evaluates):

python agent.py

The agent will train for 100,000 episodes by default. Checkpoints are saved every 100 episodes to checkpoints/checkpoint.pth, and the final model is saved to checkpoints/final_model.pth.

Configuration

Modify the hyperparams dictionary in agent.py to adjust:

  • no_of_episodes: Number of training episodes
  • gamma: Discount factor
  • lr: Learning rate
  • fps: Frames per second for video output
  • save_videos: Whether to save training/evaluation videos
  • train: Whether to run training (default: True)
  • evaluate: Whether to run evaluation (default: True)
  • push_hf: Whether to push model to Hugging Face Hub (default: False)
  • hf_repo: Hugging Face repository name (required if push_hf is True)
  • eval_episodes: Number of evaluation episodes (default: 5)
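As a rough sketch, the dictionary might look like the following. The defaults for train, evaluate, push_hf, and eval_episodes come from the list above, and 100,000 episodes is the default stated under Training. The values for gamma, lr, fps, save_videos, and hf_repo are illustrative placeholders, not the project's actual defaults.

```python
# Hypothetical layout of the hyperparams dictionary in agent.py.
# Entries marked "placeholder" are assumptions made for illustration only.
hyperparams = {
    "no_of_episodes": 100_000,  # default stated under Training
    "gamma": 0.99,              # placeholder discount factor
    "lr": 1e-4,                 # placeholder learning rate
    "fps": 30,                  # placeholder video frame rate
    "save_videos": True,        # placeholder
    "train": True,              # documented default
    "evaluate": True,           # documented default
    "push_hf": False,           # documented default
    "hf_repo": None,            # placeholder; required only when push_hf is True
    "eval_episodes": 5,         # documented default
}

# Guard against the one dependency between options that the list describes.
if hyperparams["push_hf"] and not hyperparams["hf_repo"]:
    raise ValueError("hf_repo must be set when push_hf is True")
```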

Examples

Train only (modify hyperparams):

hyperparams["evaluate"] = False

Evaluate only (modify hyperparams):

hyperparams["train"] = False
hyperparams["eval_episodes"] = 10

Train and push to HF (modify hyperparams):

hyperparams["push_hf"] = True
hyperparams["hf_repo"] = "your-username/flappy-bird-rl-agent"

Resuming Training

If a checkpoint exists at checkpoints/checkpoint.pth, training automatically resumes from it instead of starting over.
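The resume pattern can be sketched as below. This is an assumption about the general shape of the logic, not the code in agent.py: it uses pickle as a stand-in for torch.save and torch.load so that it runs without PyTorch, and the helper names save_checkpoint and load_checkpoint are hypothetical.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, episode, state):
    """Store the last finished episode index together with the training state."""
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump({"episode": episode, "state": state}, f)

def load_checkpoint(path):
    """Return (start_episode, state); start from episode 0 if no checkpoint exists."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as f:
        ckpt = pickle.load(f)
    # Resume with the episode after the one that was saved.
    return ckpt["episode"] + 1, ckpt["state"]

# Demonstration in a temporary directory so no real checkpoint is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoints", "checkpoint.pth")
    fresh_start, fresh_state = load_checkpoint(path)
    save_checkpoint(path, episode=199, state={"weights": [0.1, 0.2]})
    resumed_start, resumed_state = load_checkpoint(path)
```

In a PyTorch project the saved state would typically hold the model and optimizer state dictionaries, so that both the weights and the optimizer's internal statistics survive a restart.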

Evaluation

The script evaluates the trained model automatically after training. To evaluate a saved model without retraining, set hyperparams["train"] = False as shown under Examples, or modify the evaluate_policy call in agent.py.

Uploading to Hugging Face

To upload your trained model and project to Hugging Face Hub:

  1. Get a Hugging Face access token with write permission from your account settings.

  2. Set the token as an environment variable:

    export HF_TOKEN=your_token_here
    
  3. Modify hyperparams in agent.py:

    hyperparams["push_hf"] = True
    hyperparams["hf_repo"] = "your-username/flappy-bird-rl-agent"
    

The entire project (excluding unnecessary files) will be uploaded to your repository.

Output

  • Training plots: output/rewards_vs_episodes.png, output/loss_vs_episodes.png
  • Training video: output/flappy_bird_training.mp4 (if save_videos=True)
  • Evaluation video: output/flappy_bird_evaluation.mp4
  • Checkpoints: checkpoints/

Results

Training Progress

Figure 1: Total reward per episode during training

Figure 2: Loss per episode during training

Evaluation

The evaluation video shows the trained agent playing Flappy Bird. To create a 3-second GIF from the last 3 seconds of the video:

ffmpeg -sseof -3 -i output/flappy_bird_evaluation.mp4 -t 3 -vf "fps=10,scale=320:-1:flags=lanczos" output/flappy_bird_evaluation.gif

Figure 3: GIF of the evaluation gameplay

Final Evaluation Statistics

Episode 1: Total Reward: 597.100000000053
Episode 2: Total Reward: 2084.89999999937
Episode 3: Total Reward: 428.4000000000229
Episode 4: Total Reward: 22.700000000000063
Episode 5: Total Reward: 152.89999999999645

Evaluation Statistics:
Mean Reward: 657.20
Standard Deviation: 741.78
Max Reward: 2084.90
Mean Score: 446.00
Max Score: 446.00
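The reward statistics can be reproduced from the five episode totals with the standard library. The reported standard deviation matches the population standard deviation (dividing by N, which is also NumPy's default) rather than the sample standard deviation. How agent.py computes these values internally is an assumption; this sketch only checks the arithmetic.

```python
import statistics

# Per-episode totals copied from the evaluation log.
rewards = [
    597.100000000053,
    2084.89999999937,
    428.4000000000229,
    22.700000000000063,
    152.89999999999645,
]

mean_reward = statistics.fmean(rewards)
std_reward = statistics.pstdev(rewards)  # population standard deviation
max_reward = max(rewards)

print(f"Mean Reward: {mean_reward:.2f}")
print(f"Standard Deviation: {std_reward:.2f}")
print(f"Max Reward: {max_reward:.2f}")
```

The spread is larger than the mean, so five episodes give only a coarse estimate; raising eval_episodes would tighten it.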

Architecture

The policy network consists of:

  • Input layer: Matches observation space dimension
  • Hidden layers: 256, 128, 64 neurons with ReLU activation
  • Output layer: Softmax over action space
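To illustrate what the softmax output layer does, here is a minimal pure-Python sketch that turns raw network outputs (logits) into action probabilities and samples an action from them. The two logits assume Flappy Bird's two actions (do nothing, flap); the function names are hypothetical, and the project itself presumably does this with PyTorch tensors.

```python
import math
import random

def softmax(logits):
    """Convert logits to probabilities; subtracting the max avoids overflow."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_action(logits, rng=random):
    """Sample an action index and return it with its log-probability."""
    probs = softmax(logits)
    action = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return action, math.log(probs[action])

# Example with assumed logits: index 0 = do nothing, index 1 = flap.
probs = softmax([2.0, 0.0])
action, log_prob = sample_action([2.0, 0.0], random.Random(0))
```

Sampling from the distribution, rather than always taking the most likely action, is what gives a policy gradient agent its exploration during training.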

Algorithm

REINFORCE uses Monte Carlo policy gradients to update the policy parameters by maximizing the expected cumulative reward. The agent samples actions from the current policy, collects trajectories, and updates the policy using the discounted returns.
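The discounted returns and the resulting loss can be sketched in plain Python as follows. This is a textbook form of REINFORCE under stated assumptions: the function names are hypothetical, and agent.py may differ in details such as normalizing the returns or averaging the loss.

```python
def discounted_returns(rewards, gamma):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each return reuses the one that follows it.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def reinforce_loss(log_probs, returns):
    """Negative sum of log pi(a_t | s_t) * G_t; minimizing it raises expected return."""
    return -sum(lp * g for lp, g in zip(log_probs, returns))

# Three steps with reward 1.0 each and gamma = 0.5.
returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
loss = reinforce_loss([-1.0, -1.0, -1.0], returns)
```

Because the returns are only known once an episode ends, the update happens once per episode, which is what makes this a Monte Carlo method.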
