Moeeldouma committed on
Commit 894c1aa · verified · 1 Parent(s): 1f7e1a3

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +2 -51
README.md CHANGED
@@ -95,57 +95,8 @@ This project spans all four domains. Here is where each phase of our work falls:

  ### XTTS-v2 Model Components

- ```
-                       XTTS-v2 Architecture
- ================================================================
-
- Input Text             Reference Audio (Speaker)
-     |                              |
-     v                              v
- +------------------+    +--------------------+
- | VoiceBPE         |    | Mel Spectrogram    |
- | Tokenizer        |    | Extraction         |
- | (6,681 tokens)   |    | (80-channel, 22kHz)|
- +------------------+    +--------------------+
-     |                     |                 |
-     v                     v                 v
- +--------+         +--------------+    +----------+
- | [ar] + |         | Conditioning |    | ResNet50 |
- | Token  |         | Encoder      |    | Speaker  |
- | IDs    |         | (6 attn blks)|    | Encoder  |
- +--------+         +--------------+    +----------+
-     |                     |                 |
-     |                     v                 |
-     |              +--------------+         |
-     |              | Perceiver    |         |
-     |              | Resampler    |         |
-     |              | (2 layers,   |         |
-     |              |  32 latents) |         |
-     |              +--------------+         |
-     |                     |                 |
-     |              gpt_cond_latent  speaker_embedding
-     |                 (1024-dim)        (512-dim)
-     v                     |                 |
- +---------------------------------------------+
- |              GPT-2 Transformer              |
- | 30 layers | 1024 hidden | 16 heads | ~350M  |
- |                                             |
- | Text tokens + conditioning --> audio codes  |
- +---------------------------------------------+
-                        |
-                   Audio Codes
-                 (1026 codebook)
-                        |
-                        v
- +---------------------------------------------+
- |           HiFiGAN Vocoder (~10M)            |
- | Upsampling: [8, 8, 2, 2] = 256x             |
- | + Speaker conditioning at each layer        |
- +---------------------------------------------+
-                        |
-                        v
-                 24kHz Waveform
- ```
+ ![XTTS-v2 Architecture](docs/images/architecture_diagram.png)
+ *Figure 2: XTTS-v2 architecture — text and reference audio are processed through separate pathways, combined in the GPT-2 transformer, and decoded to a 24kHz waveform.*

  ### Key Specifications