---
license: mit
library_name: pytorch
tags:
- pytorch
- lip-reading
- computer-vision
- video-classification
- reproduction
- 3dcvt
---

# 3DCvT on LRW-1000

This repository provides the released checkpoint and evaluation artifacts for an unofficial PyTorch reproduction of:

**A Lip Reading Method Based on 3D Convolutional Vision Transformer**

Code repository:

- https://github.com/DPInnovationWorks/3DCvT_LipReading

## Model Summary

- Task: Chinese word-level lip reading
- Dataset: LRW-1000
- Number of classes: 1184 in this processed split
- Framework: PyTorch
- Architecture: 3D CNN + CvT + BiGRU

## Released Files

- `best_model.pth`: released checkpoint
- `sha256.txt`: checksum for the checkpoint
- `logs/train.log`: selected training log
- `results/per_class_acc_lrw1000_val.csv`: per-class validation summary
- `plots/learning_curve.png`: learning curve exported from training

## Training Setup

Training settings from the released run:

- GPUs: 1 GPU
- Per-step batch size: 128
- Gradient accumulation: 2
- Effective batch size: 256
- Epochs: 120
- Optimizer: Adam
- Weight decay: 1e-4
- Learning rate: 6e-4
- Warmup epochs: 5
- Mixed precision: AMP enabled
- `torch.compile`: disabled

## Evaluation Result

| Dataset | Split | Metric | Value |
| --- | --- | --- | --- |
| LRW-1000 | Validation | Top-1 Accuracy | 55.29% |

## Intended Use

This checkpoint is intended for:

- research reproduction
- benchmark comparison
- qualitative inference demos

It is not intended as a production-ready commercial lip-reading system.

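
Before using the checkpoint for any of the purposes above, it can be worth confirming that the downloaded `best_model.pth` matches the digest in `sha256.txt`. The following is a minimal sketch using only the Python standard library. The layout of `sha256.txt` is not documented here, so the parser assumes `sha256sum`-style lines (`<hex digest>  <filename>`) or a single bare digest. The function names `sha256_of_file`, `expected_digest`, and `verify_checkpoint` are hypothetical helpers and are not part of the code repository.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def expected_digest(checksum_file, filename):
    """Look up the digest recorded for `filename` in a checksum file.

    Assumption: each non-empty line is "<hex digest>  <filename>" (the
    sha256sum layout) or just "<hex digest>". If no line names `filename`
    but the file holds exactly one entry, that entry is used.
    """
    digests = []
    for line in Path(checksum_file).read_text().splitlines():
        parts = line.split()
        if not parts:
            continue
        digest = parts[0].lower()
        # sha256sum may prefix the name with "*" in binary mode.
        name = Path(parts[-1].lstrip("*")).name if len(parts) > 1 else None
        if name == filename:
            return digest
        digests.append(digest)
    return digests[0] if len(digests) == 1 else None


def verify_checkpoint(checkpoint_path, checksum_file):
    """Return True if the checkpoint's digest matches the recorded one."""
    expected = expected_digest(checksum_file, Path(checkpoint_path).name)
    return expected is not None and sha256_of_file(checkpoint_path) == expected
```

With both files in the working directory, `verify_checkpoint("best_model.pth", "sha256.txt")` should return `True` for an intact download. The single-entry fallback is a design choice for the case where the checksum file records the checkpoint under a different filename than the one on disk.
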
## Limitations

- Performance depends on using the matching preprocessing pipeline
- This release does not include the raw LRW-1000 dataset
- Users must obtain the dataset according to its own terms
- This processed split uses 1184 classes in the generated vocabulary

## Usage

Example inference command:

```bash
python inference.py \
  --dataset lrw1000 \
  --pkl_path /path/to/sample.pkl \
  --checkpoint /path/to/best_model.pth \
  --gpu 0
```

## Notes

- The checkpoint is released for reproducibility
- Please use the matching code version when possible
- Local source artifact names were `best_model_for_lrw1000.pth` and `train_lrw1000.log`

## Citation

If you use this release, please cite the original paper:

```bibtex
@article{wu2022lip,
  title={A Lip Reading Method Based on 3D Convolutional Vision Transformer},
  author={Wu, Jiafeng and others},
  journal={IEEE Access},
  year={2022}
}
```