---
license: mit
library_name: pytorch
tags:
  - pytorch
  - lip-reading
  - computer-vision
  - video-classification
  - reproduction
  - 3dcvt
---

# 3DCvT on LRW-1000

This repository provides the released checkpoint and evaluation artifacts for an unofficial PyTorch reproduction of:

**A Lip Reading Method Based on 3D Convolutional Vision Transformer**

Code repository:

- https://github.com/DPInnovationWorks/3DCvT_LipReading

## Model Summary

- Task: Chinese word-level lip reading
- Dataset: LRW-1000
- Number of classes: 1184 (vocabulary generated from this processed split)
- Framework: PyTorch
- Architecture: 3D CNN + CvT + BiGRU

## Released Files

- `best_model.pth`: released checkpoint
- `sha256.txt`: checksum for the checkpoint
- `logs/train.log`: selected training log
- `results/per_class_acc_lrw1000_val.csv`: per-class validation summary
- `plots/learning_curve.png`: learning curve exported from training
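
The downloaded checkpoint can be checked against `sha256.txt` before use. The following is a minimal sketch, not part of the release: it assumes `sha256.txt` uses the common `sha256sum` layout (hex digest first, optionally followed by a file name), which this release does not confirm. The helper names `sha256_of`, `expected_digest`, and `verify` are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_digest(checksum_file):
    """Return the first whitespace-separated token of the checksum file.

    Assumption: the file follows the `sha256sum` layout "<hex digest>  <name>".
    """
    return Path(checksum_file).read_text().split()[0].lower()


def verify(checkpoint="best_model.pth", checksum_file="sha256.txt"):
    """Return True when the checkpoint digest matches the recorded one."""
    return sha256_of(checkpoint) == expected_digest(checksum_file)
```

Calling `verify()` from the directory that holds both files returns `True` on a match. If `sha256.txt` uses a different layout, only `expected_digest` needs to change.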

## Training Setup

Training settings from the released run:

- GPUs: 1
- Per-step batch size: 128
- Gradient accumulation: 2 steps
- Effective batch size: 256 (128 × 2)
- Epochs: 120
- Optimizer: Adam
- Weight decay: 1e-4
- Learning rate: 6e-4
- Warmup epochs: 5
- Mixed precision: AMP enabled
- `torch.compile`: disabled
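
The effective batch size follows from the per-step batch size, the accumulation factor, and the GPU count. The sketch below illustrates that arithmetic and shows after which micro-batches an optimizer step would occur under gradient accumulation. It is a plain-Python illustration with hypothetical function names, not the training loop from the code repository.

```python
def effective_batch_size(per_step_batch, accumulation_steps, num_gpus=1):
    """Samples contributing to one optimizer update."""
    return per_step_batch * accumulation_steps * num_gpus


def optimizer_step_indices(num_micro_batches, accumulation_steps):
    """0-based indices of micro-batches after which the optimizer would step."""
    return [
        i for i in range(num_micro_batches)
        if (i + 1) % accumulation_steps == 0
    ]


# Values from the settings listed above: batch 128, accumulation 2, 1 GPU.
print(effective_batch_size(128, 2, 1))  # 256
print(optimizer_step_indices(6, 2))     # [1, 3, 5]
```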

## Evaluation Result

| Dataset | Split | Metric | Value |
| --- | --- | --- | --- |
| LRW-1000 | Validation | Top-1 Accuracy | 55.29% |
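
Top-1 accuracy is the fraction of samples whose highest-scoring class equals the ground-truth label. The sketch below shows the computation on made-up scores; it assumes one list of class scores per sample and is not the evaluation script from the code repository.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose arg-max class index equals the label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = 0
    for row, label in zip(scores, labels):
        # Index of the highest score in this row.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Toy data (illustrative only): 4 samples, 3 classes.
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.6, 0.3, 0.1],
    [0.2, 0.2, 0.6],
    [0.5, 0.4, 0.1],
]
toy_labels = [1, 0, 2, 1]
print(f"{top1_accuracy(toy_scores, toy_labels):.2%}")  # 75.00%
```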

## Intended Use

This checkpoint is intended for:

- research reproduction
- benchmark comparison
- qualitative inference demos

It is not intended as a production-ready commercial lip-reading system.

## Limitations

- Accuracy depends on preparing inputs with the same preprocessing pipeline used for training
- This release does not include the raw LRW-1000 dataset
- Users must obtain the dataset themselves, under the dataset's own terms
- This processed split uses 1184 classes in the generated vocabulary

## Usage

Example inference command:

```bash
python inference.py \
  --dataset lrw1000 \
  --pkl_path /path/to/sample.pkl \
  --checkpoint /path/to/best_model.pth \
  --gpu 0
```
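
To run the same command over several samples from Python, the argument list can be built programmatically. This is a hedged sketch: `build_inference_command` is a hypothetical helper, it uses only the flags shown in the command above, and it assumes `inference.py` is in the current working directory.

```python
import shlex
import sys


def build_inference_command(pkl_path, checkpoint, gpu=0,
                            dataset="lrw1000", script="inference.py"):
    """Build the argument list for the inference command shown above."""
    return [
        sys.executable, script,
        "--dataset", dataset,
        "--pkl_path", str(pkl_path),
        "--checkpoint", str(checkpoint),
        "--gpu", str(gpu),
    ]


# Placeholder paths, as in the shell example.
cmd = build_inference_command("/path/to/sample.pkl", "/path/to/best_model.pth")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` once real paths are substituted.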

## Notes

- The checkpoint is released for reproducibility
- When possible, use the version of the code repository that matches this checkpoint
- Before release, the local checkpoint and training log were named `best_model_for_lrw1000.pth` and `train_lrw1000.log`

## Citation

If you use this release, please cite the original paper:

```bibtex
@article{wu2022lip,
  title={A Lip Reading Method Based on 3D Convolutional Vision Transformer},
  author={Wu, Jiafeng and others},
  journal={IEEE Access},
  year={2022}
}
```