	---
	license: mit
	language:
	- zh
	- en
	tags:
	- audio
	- speech-alignment
	- wav2vec2
	- ctc
	- forced-alignment
	---

	# 🌊 FlexAligner: Robust Speech-Text Alignment Framework

	FlexAligner is a robust two-stage speech-text alignment framework designed for "non-ideal" real-world speech data, where the audio and the transcript do not match exactly.

	---

	## 🌟 Key Features

	- Robustness to mismatched data: Unlike traditional forced aligners such as MFA (Montreal Forced Aligner), FlexAligner automatically identifies and skips segments where audio and text do not match (e.g., laughter, prolonged background noise, or untranscribed words), so a local mismatch does not turn into cumulative error.
	- Two-stage architecture:
	    1. Stage 1 (CTC Chunking): Macro-segmentation that locates reliable "speech islands."
	    2. Stage 2 (CE Alignment): Micro-alignment that uses dynamic hop calibration for sub-millisecond boundary regression.
	- Eliminating temporal drift: A self-calibrating decoding algorithm for long recordings keeps phoneme boundaries at the end of the file strictly aligned with the audio samples.
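The two ideas in the list above can be illustrated with a small sketch. This is not FlexAligner's implementation: `find_islands` swaps in a plain thresholded run detector for the CTC chunking stage, and `frame_to_seconds` shows one plausible reading of "dynamic hop", where the hop is derived from the file's own sample and frame counts. The function names, the confidence scores, the threshold, and the minimum run length are all hypothetical. The 16 kHz sample rate and 320-sample nominal stride are the standard Wav2Vec 2.0 values and are assumed here.

```python
from typing import List, Sequence, Tuple


def find_islands(confidence: Sequence[float], threshold: float = 0.5,
                 min_frames: int = 3) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges where confidence stays >= threshold."""
    islands = []
    start = None
    for i, score in enumerate(confidence):
        if score >= threshold:
            if start is None:
                start = i  # a new run of reliable frames begins
        elif start is not None:
            if i - start >= min_frames:
                islands.append((start, i))
            start = None
    # Close a run that reaches the last frame.
    if start is not None and len(confidence) - start >= min_frames:
        islands.append((start, len(confidence)))
    return islands


def frame_to_seconds(frame: int, num_frames: int, num_samples: int,
                     sample_rate: int = 16000) -> float:
    """Map a frame boundary to seconds using a hop computed from the file itself."""
    hop = num_samples / num_frames  # samples per frame, generally non-integer
    return frame * hop / sample_rate


# Hypothetical per-frame confidence scores (illustrative values only).
scores = [0.1, 0.9, 0.8, 0.95, 0.2, 0.1, 0.7, 0.9, 0.85, 0.9]
print(find_islands(scores))  # → [(1, 4), (6, 10)]

# A 10 s file at 16 kHz yields 499 Wav2Vec 2.0 frames. With the nominal
# 320-sample hop, the last boundary falls short of the end of the file;
# with the calibrated hop, it lands on the final sample.
nominal_end = 499 * 320 / 16000
calibrated_end = frame_to_seconds(499, 499, 160000)
print(nominal_end, calibrated_end)
```

With a fixed hop, the gap between the last frame boundary and the true end of the file is small for one short clip, but it matters when boundaries are expected to line up with samples, which is the situation the temporal-drift feature targets.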



	---

	## 📦 Model Components

	This repository contains the weights for the two core components of FlexAligner:

	- `hf_phs/`: CTC chunking model based on Wav2Vec 2.0 (Stage 1).
	- `ce2/`: High-precision frame-level cross-entropy alignment model (Stage 2).
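If you want the weights on disk before running the aligner (for example, on a machine without network access at inference time), a minimal sketch using `huggingface_hub.snapshot_download` follows. The helper names `component_patterns` and `fetch_weights` are hypothetical, and the sketch assumes the two folders sit at the top level of the repository as listed above. Whether FlexAligner accepts the resulting local paths in its config is not confirmed here.

```python
from pathlib import Path
from typing import Dict, List, Sequence

REPO_ID = "USTCPhonetics/FlexAligner"
COMPONENTS = ("hf_phs", "ce2")  # folder names as listed in this model card


def component_patterns(components: Sequence[str] = COMPONENTS) -> List[str]:
    """Build glob patterns that restrict the download to the given folders."""
    return [f"{name}/*" for name in components]


def fetch_weights(repo_id: str = REPO_ID,
                  components: Sequence[str] = COMPONENTS) -> Dict[str, Path]:
    """Download only the listed folders and return their local paths."""
    # Imported lazily so the helper above works without the package installed.
    from huggingface_hub import snapshot_download

    root = Path(snapshot_download(repo_id=repo_id,
                                  allow_patterns=component_patterns(components)))
    return {name: root / name for name in components}


if __name__ == "__main__":
    for name, path in fetch_weights().items():
        print(name, path)
```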

	---

	## 🚀 Quick Start

	### CLI Usage
	After installation, you can run the hosted models directly from the command line:

	```bash
	flex-align input.wav transcript.txt --dynamic -o output.TextGrid
	```
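To align a whole folder, the command above can be wrapped in a short script. The sketch below uses only the arguments shown in this README (`--dynamic` and `-o`). It assumes each `.wav` has a transcript with the same stem and a `.txt` extension, which is a convention chosen for the example, not something the CLI requires. `build_command` and `align_folder` are hypothetical helper names.

```python
import subprocess
from pathlib import Path
from typing import List


def build_command(wav: Path, transcript: Path, out: Path) -> List[str]:
    """Mirror the documented invocation: flex-align WAV TXT --dynamic -o OUT."""
    return ["flex-align", str(wav), str(transcript), "--dynamic", "-o", str(out)]


def align_folder(folder: Path) -> List[Path]:
    """Align every WAV in `folder` that has a same-named .txt transcript."""
    outputs = []
    for wav in sorted(folder.glob("*.wav")):
        transcript = wav.with_suffix(".txt")
        if not transcript.exists():
            continue  # skip audio that has no transcript
        out = wav.with_suffix(".TextGrid")
        # check=True raises CalledProcessError if flex-align exits non-zero.
        subprocess.run(build_command(wav, transcript, out), check=True)
        outputs.append(out)
    return outputs
```

Calling `align_folder(Path("recordings"))` would then write one TextGrid next to each aligned WAV file and return the list of output paths.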

	### Python API
	Or integrate it into your Python pipeline:

	```python
	from flexaligner import FlexAligner

	# Pass the Hugging Face repo ID; the weights are downloaded and loaded automatically
	aligner = FlexAligner(config={
	    "chunk_model_path": "USTCPhonetics/FlexAligner",
	    "use_dynamic_hop": True,
	})

	# Arguments: audio file, transcript file, output TextGrid path
	aligner.align("test.wav", "test.txt", "result.TextGrid")
	```
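Once a TextGrid has been written, the boundaries can be read back for downstream processing. The sketch below is a minimal regex-based reader for Praat's long ("full text") TextGrid layout. It assumes FlexAligner writes that layout, which this README does not state; files in Praat's short format would need a different parser. `read_intervals` is a hypothetical helper, and the sample text is made up for illustration.

```python
import re
from typing import List, Tuple

# One interval in Praat's long TextGrid layout. Praat escapes a double
# quote inside a label by doubling it, hence the `""` alternative.
INTERVAL_RE = re.compile(
    r'intervals \[\d+\]:\s*'
    r'xmin = (?P<xmin>[-+.\deE]+)\s*'
    r'xmax = (?P<xmax>[-+.\deE]+)\s*'
    r'text = "(?P<text>(?:[^"]|"")*)"'
)


def read_intervals(textgrid: str) -> List[Tuple[float, float, str]]:
    """Return (start, end, label) for every interval, across all tiers."""
    return [
        (float(m["xmin"]), float(m["xmax"]), m["text"].replace('""', '"'))
        for m in INTERVAL_RE.finditer(textgrid)
    ]


# Made-up excerpt in the long format, for illustration only.
sample = '''
        intervals [1]:
            xmin = 0
            xmax = 0.32
            text = ""
        intervals [2]:
            xmin = 0.32
            xmax = 0.575
            text = "n"
'''

for start, end, label in read_intervals(sample):
    if label:  # empty labels are typically unlabeled stretches
        print(f"{label}\t{start:.3f}\t{end:.3f}")
```

For a real file, pass the file contents, for example `Path("result.TextGrid").read_text(encoding="utf-8")`. The encoding is also an assumption, since Praat itself can write UTF-16.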

	---

	## 📜 License
	This project is licensed under the MIT License. You are free to use it in academic research or commercial projects.

	---