RaikkonenWu/3dcvt-lrw1000

	---
	license: mit
	library_name: pytorch
	tags:
	- pytorch
	- lip-reading
	- computer-vision
	- video-classification
	- reproduction
	- 3dcvt
	---

	# 3DCvT on LRW-1000

	This repository provides the released checkpoint and evaluation artifacts for an unofficial PyTorch reproduction of the paper "A Lip Reading Method Based on 3D Convolutional Vision Transformer" (IEEE Access, 2022).

	Code repository:

	- https://github.com/DPInnovationWorks/3DCvT_LipReading

	## Model Summary

	- Task: Chinese word-level lip reading
	- Dataset: LRW-1000
	- Number of classes: 1184 in this processed split
	- Framework: PyTorch
	- Architecture: 3D CNN + CvT + BiGRU

	## Released Files

	- `best_model.pth`: released checkpoint
	- `sha256.txt`: checksum for the checkpoint
	- `logs/train.log`: selected training log
	- `results/per_class_acc_lrw1000_val.csv`: per-class validation summary
	- `plots/learning_curve.png`: learning curve exported from training
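
The checkpoint can be checked against `sha256.txt` before loading it. The following is a minimal sketch, assuming `sha256.txt` follows the common `sha256sum` layout (a 64-character hex digest at the start of each line); the actual layout of the file is not documented here. Because the checksum may have been generated under the local artifact name rather than `best_model.pth`, the sketch matches on the digest only and ignores the file name. The function names are hypothetical.

```python
import hashlib
import string
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def listed_digests(checksum_file):
    """Collect every 64-character hex token that starts a line.

    Assumption: the file uses a sha256sum-style layout, '<hex>  <name>'.
    """
    digests = set()
    for line in Path(checksum_file).read_text().splitlines():
        parts = line.split()
        if parts and len(parts[0]) == 64 and all(
            c in string.hexdigits for c in parts[0]
        ):
            digests.add(parts[0].lower())
    return digests


def verify_checkpoint(checkpoint="best_model.pth", checksum_file="sha256.txt"):
    """True if the checkpoint's digest appears in the checksum file."""
    return sha256_of_file(checkpoint) in listed_digests(checksum_file)
```

Calling `verify_checkpoint()` from the directory that holds both files returns `False` if the download is truncated or altered.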

	## Training Setup

	Training settings from the released run:

	- GPUs: 1
	- Per-step batch size: 128
	- Gradient accumulation: 2 steps
	- Effective batch size: 256 (128 × 2)
	- Epochs: 120
	- Optimizer: Adam
	- Weight decay: 1e-4
	- Learning rate: 6e-4
	- Warmup epochs: 5
	- Mixed precision: AMP enabled
	- `torch.compile`: disabled
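
The arithmetic behind these settings can be sketched in plain Python. This is an illustration only, not the training code: the card states 5 warmup epochs and a learning rate of 6e-4 but not the warmup shape or the schedule used afterwards, so the linear ramp and the constant rate after warmup are both assumptions, and the function names are hypothetical.

```python
PER_STEP_BATCH = 128   # from the settings above
ACCUM_STEPS = 2        # from the settings above
BASE_LR = 6e-4         # from the settings above
WARMUP_EPOCHS = 5      # from the settings above


def effective_batch_size(per_step=PER_STEP_BATCH, accum=ACCUM_STEPS, gpus=1):
    """Samples contributing to each optimizer update."""
    return per_step * accum * gpus


def is_update_step(micro_step, accum=ACCUM_STEPS):
    """True for the 0-indexed micro-batches after which the optimizer steps."""
    return (micro_step + 1) % accum == 0


def lr_at_epoch(epoch, base_lr=BASE_LR, warmup_epochs=WARMUP_EPOCHS):
    """Learning rate for a 0-indexed epoch.

    Assumption: linear warmup to base_lr, then constant. The released run
    may well decay the rate after warmup; that is not stated in the card.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size())
    print("warmup rates:", [lr_at_epoch(e) for e in range(WARMUP_EPOCHS + 1)])
```

With accumulation over 2 micro-batches, the optimizer steps after every second forward/backward pass, so each update sees 256 samples while only 128 are resident on the GPU at a time.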

	## Evaluation Result

	| Dataset | Split | Metric | Value |
	| --- | --- | --- | --- |
	| LRW-1000 | Validation | Top-1 Accuracy | 55.29% |
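
For reference, top-1 accuracy and a per-class breakdown can be computed from predicted and true class indices as sketched below. This is a generic illustration, not the repository's evaluation script; the function names are hypothetical, and the column layout of `results/per_class_acc_lrw1000_val.csv` is not assumed.

```python
from collections import Counter


def top1_accuracy(predictions, labels):
    """Fraction of samples whose predicted class equals the true class."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def per_class_accuracy(predictions, labels):
    """Map each true class to the fraction of its samples predicted correctly."""
    totals = Counter(labels)
    hits = Counter(y for p, y in zip(predictions, labels) if p == y)
    return {cls: hits[cls] / count for cls, count in totals.items()}
```

The unweighted mean of the per-class values equals the overall top-1 accuracy only when every class has the same number of validation samples, so the two figures should not be expected to match exactly.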

	## Intended Use

	This checkpoint is intended for:

	- research reproduction
	- benchmark comparison
	- qualitative inference demos

	It is not intended as a production-ready commercial lip-reading system.

	## Limitations

	- Performance depends on using the matching preprocessing pipeline
	- This release does not include the raw LRW-1000 dataset
	- Users must obtain the dataset according to its own terms
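	- The generated vocabulary for this processed split has 1184 classes, whereas the LRW-1000 benchmark defines 1000 word classes, so the reported accuracy may not be directly comparable with published 1000-class results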

	## Usage

	Example inference command:

	```bash
	python inference.py \
	  --dataset lrw1000 \
	  --pkl_path /path/to/sample.pkl \
	  --checkpoint /path/to/best_model.pth \
	  --gpu 0
	```
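
To inspect the checkpoint outside `inference.py`, a sketch along the following lines may help. The internal layout of `best_model.pth` is not documented here, so everything about it is an assumption: the file may be a bare state dict or a dictionary that wraps one under a key such as `state_dict` or `model`, and parameter names may carry a `module.` prefix if the model was saved from a `DataParallel` wrapper. The helper names are hypothetical.

```python
def unwrap_state_dict(checkpoint):
    """Return the parameter mapping from a loaded checkpoint object.

    Assumption: if the checkpoint is a wrapper dict, the weights sit under
    one of the keys below; otherwise the object is returned unchanged.
    """
    if isinstance(checkpoint, dict):
        for key in ("state_dict", "model", "model_state_dict"):
            if isinstance(checkpoint.get(key), dict):
                return checkpoint[key]
    return checkpoint


def strip_prefix(state_dict, prefix="module."):
    """Drop a leading prefix (e.g. from DataParallel) from parameter names."""
    return {
        (name[len(prefix):] if name.startswith(prefix) else name): value
        for name, value in state_dict.items()
    }


def load_state_dict_from_file(path):
    """Load a checkpoint on CPU and return a cleaned parameter mapping."""
    import torch  # imported lazily so the helpers above work without PyTorch

    checkpoint = torch.load(path, map_location="cpu")
    return strip_prefix(unwrap_state_dict(checkpoint))
```

Listing the keys and tensor shapes of the returned mapping is a quick way to confirm that the checkpoint matches the model definition in the code version being used.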

	## Notes

	- The checkpoint is released for reproducibility
	- Please use the matching code version when possible
	- Local source artifact names were `best_model_for_lrw1000.pth` and `train_lrw1000.log`

	## Citation

	If you use this release, please cite the original paper:

	```bibtex
	@article{wang2022lip,
	title={A Lip Reading Method Based on 3D Convolutional Vision Transformer},
	author={Wang, Huijuan and Pu, Gangqiang and Chen, Tingyu},
	journal={IEEE Access},
	year={2022}
	}
	```