---
license: mit
library_name: pytorch
tags:
- pytorch
- lip-reading
- computer-vision
- video-classification
- reproduction
- 3dcvt
---

# 3DCvT on LRW-1000

This repository provides the released checkpoint and evaluation artifacts for an unofficial PyTorch reproduction of:

**A Lip Reading Method Based on 3D Convolutional Vision Transformer**

Code repository:

- https://github.com/DPInnovationWorks/3DCvT_LipReading

## Model Summary

- Task: Chinese word-level lip reading
- Dataset: LRW-1000
- Number of classes: 1184 in this processed split
- Framework: PyTorch
- Architecture: 3D CNN + CvT + BiGRU

## Released Files

- `best_model.pth`: released checkpoint
- `sha256.txt`: checksum for the checkpoint
- `logs/train.log`: selected training log
- `results/per_class_acc_lrw1000_val.csv`: per-class validation summary
- `plots/learning_curve.png`: learning curve exported from training
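
The checksum file can be used to confirm that a downloaded checkpoint is intact. The sketch below is one way to do that with the Python standard library. It assumes `sha256.txt` follows the common `sha256sum` layout (`<hex digest>  <file name>` per line), which this card does not confirm; the function names are illustrative and are not part of the code repository.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_digest(checksum_file, filename):
    """Look up `filename` in a checksum file.

    Assumption: each line is "<hex digest>  <file name>", as written by
    the `sha256sum` tool. Adjust the parsing if the layout differs.
    """
    for line in Path(checksum_file).read_text().splitlines():
        parts = line.split()
        if len(parts) >= 2 and Path(parts[-1].lstrip("*")).name == filename:
            return parts[0].lower()
    return None


def verify_checkpoint(checkpoint="best_model.pth", checksum_file="sha256.txt"):
    """Return True only if the checkpoint matches the recorded digest."""
    expected = expected_digest(checksum_file, Path(checkpoint).name)
    return expected is not None and expected == sha256_of_file(checkpoint)
```

Calling `verify_checkpoint()` from the directory that holds both files returns `False` when the digest is missing or does not match, in which case the checkpoint should be downloaded again before use.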

## Training Setup

Training settings from the released run:

- GPUs: 1
- Per-step batch size: 128
- Gradient accumulation steps: 2
- Effective batch size: 256 (128 × 2)
- Epochs: 120
- Optimizer: Adam
- Weight decay: 1e-4
- Learning rate: 6e-4
- Warmup epochs: 5
- Mixed precision: AMP enabled
- `torch.compile`: disabled
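
The relationship between the batch settings above, and one possible reading of the warmup setting, can be sketched in plain Python. The warmup shape is an assumption: this card gives only the number of warmup epochs and the learning rate, so the linear ramp and the constant rate after warmup are illustrative and may not match the schedule used in the released run.

```python
def effective_batch_size(per_step, accumulation_steps, num_gpus=1):
    """Samples contributing to one optimizer update."""
    return per_step * accumulation_steps * num_gpus


def warmup_lr(epoch, base_lr=6e-4, warmup_epochs=5):
    """Learning rate for a zero-indexed epoch under linear warmup.

    Assumption: the rate ramps linearly to `base_lr` over the warmup
    epochs. Any decay applied after warmup is not modelled here; the
    function simply returns `base_lr` from then on.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr


if __name__ == "__main__":
    print(effective_batch_size(per_step=128, accumulation_steps=2))
    for epoch in range(7):
        print(epoch, warmup_lr(epoch))
```

When reproducing the run on hardware that cannot hold 128 samples per step, the same effective batch size can be kept by lowering the per-step batch size and raising the accumulation steps so that their product stays at 256.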

## Evaluation Result

| Dataset | Split | Metric | Value |
| --- | --- | --- | --- |
| LRW-1000 | Validation | Top-1 Accuracy | 55.29% |
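
For readers comparing their own predictions against the figure above, the following sketch shows how top-1 accuracy and a per-class breakdown are conventionally computed from label and prediction lists. It is not taken from the repository's evaluation code, and the function names are illustrative.

```python
from collections import Counter


def top1_accuracy(labels, predictions):
    """Fraction of samples whose highest-scoring class equals the label."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    if not labels:
        return 0.0
    correct = sum(1 for y, p in zip(labels, predictions) if y == p)
    return correct / len(labels)


def per_class_accuracy(labels, predictions):
    """Map each class seen in `labels` to its own accuracy."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    totals = Counter(labels)
    hits = Counter(y for y, p in zip(labels, predictions) if y == p)
    return {cls: hits[cls] / totals[cls] for cls in totals}


if __name__ == "__main__":
    # Toy data, unrelated to LRW-1000.
    y_true = [0, 0, 1, 2]
    y_pred = [0, 1, 1, 0]
    print(top1_accuracy(y_true, y_pred))
    print(per_class_accuracy(y_true, y_pred))
```

Top-1 accuracy weights every sample equally, so on an imbalanced validation set it can differ from the unweighted mean of the per-class accuracies.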

## Intended Use

This checkpoint is intended for:

- research reproduction
- benchmark comparison
- qualitative inference demos

It is not intended as a production-ready commercial lip-reading system.

## Limitations

- Performance depends on using the matching preprocessing pipeline
- This release does not include the raw LRW-1000 dataset
- Users must obtain the dataset according to its own terms
- This processed split uses a generated vocabulary of 1184 classes rather than the nominal 1,000 classes of LRW-1000, so results may not be directly comparable with numbers reported on the standard label set

## Usage

Example inference command:

```bash
python inference.py \
  --dataset lrw1000 \
  --pkl_path /path/to/sample.pkl \
  --checkpoint /path/to/best_model.pth \
  --gpu 0
```

## Notes

- The checkpoint is released for reproducibility
- Please use the matching code version when possible
- Before release, the checkpoint and training log were named `best_model_for_lrw1000.pth` and `train_lrw1000.log` locally

## Citation

If you use this release, please cite the original paper:

```bibtex
@article{wu2022lip,
  title={A Lip Reading Method Based on 3D Convolutional Vision Transformer},
  author={Wu, Jiafeng and others},
  journal={IEEE Access},
  year={2022}
}
```