Experimenting with Meta AI seamlessM4T_v2_large

My experimental environment is deployed on GCP, using a Compute Engine instance with 1 x NVIDIA T4 GPU. The first thing to do after logging into the host is, of course, to confirm that the instance you launched really does have a GPU:

$ lspci | grep -i nvidia

If nothing went wrong, the CLI will show:

00:04.0 3D controller: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)

Next, make sure CUDA is installed on the host. This basically just means following the instructions on the NVIDIA website step by step:

CUDA Toolkit 12.3 Update 1 Downloads

$ wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2204/x86_64/cuda-keyring_1.1-1_all.deb
$ sudo dpkg -i cuda-keyring_1.1-1_all.deb
$ sudo apt-get update
$ sudo apt-get -y install cuda-toolkit-12-3

Installing the CUDA Toolkit this way also installs the NVIDIA driver compatible with that version of the toolkit.


Check for NVIDIA GPU and Driver Status

After completing the CUDA installation according to NVIDIA's official instructions, you can use the nvidia-smi command to inspect GPU usage, including the driver version, GPU utilization, memory usage, and so on.

$ nvidia-smi

If everything went smoothly, your CLI will show:

Tue Dec  5 07:22:39 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.23.08              Driver Version: 545.23.08    CUDA Version: 12.3     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla T4                       Off | 00000000:00:04.0 Off |                    0 |
| N/A   76C    P0              34W /  70W |      2MiB / 15360MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

List All Graphics Cards

You can additionally use the following command to check whether the host has a graphics card installed and, if so, which model it is.

$ lspci | grep VGA

lspci is a common command on Linux systems that lists all devices on the machine's PCI buses. The PCI bus is an internal data-transfer bus used to connect and control a computer's internal components.

00:03.0 Non-VGA unclassified device: Red Hat, Inc. Virtio SCSI

Using Python Libraries for Specific Checks

Besides the CLI commands above, we can also write Python code that uses PyTorch's API to confirm that CUDA is available.

import torch
print(torch.cuda.is_available())

If PyTorch can successfully use the GPU, the CLI will print the value True.
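A slightly fuller version of this check also reports the detected device name, and degrades gracefully when PyTorch is not installed; a minimal sketch:

```python
# Probe CUDA availability via PyTorch, falling back gracefully
# when torch is not installed or no GPU is visible.
def cuda_status() -> str:
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if torch.cuda.is_available():
        # e.g. "cuda: Tesla T4" on the instance used above
        return f"cuda: {torch.cuda.get_device_name(0)}"
    return "cpu only"

print(cuda_status())
```

On the T4 instance above this should print something like "cuda: Tesla T4".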


Hello Worlds

Next we can use the m4t_predict CLI tool to run the various experiments.


S2ST (Speech-to-Speech Translation)

The following command translates Chinese speech directly into Japanese speech:

$ m4t_predict --src_lang cmn ./sample-001.mp3 --task s2st --tgt_lang jpn --output_path ./sample-001-jp.mp3

If the Nvidia driver and CUDA Toolkit were installed successfully, running the command produces output like this:

2023-12-05 09:25:19,764 INFO -- seamless_communication.cli.m4t.predict.predict: Running inference on device=device(type='cuda', index=0) with dtype=torch.float16.
Using the cached checkpoint of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached checkpoint of vocoder_v2. Set `force` to `True` to download again.
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
2023-12-05 09:25:33,054 INFO -- seamless_communication.cli.m4t.predict.predict: text_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(1, 200), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2023-12-05 09:25:33,054 INFO -- seamless_communication.cli.m4t.predict.predict: unit_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(25, 50), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2023-12-05 09:25:33,054 INFO -- seamless_communication.cli.m4t.predict.predict: unit_generation_ngram_filtering=False
2023-12-05 09:25:38,684 INFO -- seamless_communication.cli.m4t.predict.predict: Saving translated audio in jpn
2023-12-05 09:25:39,403 INFO -- seamless_communication.cli.m4t.predict.predict: Translated text in jpn: 絶対に元気だ ⁇ ちょっと一回 ⁇ 絶対に元気だ ⁇
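The same S2ST call can also be made from Python through the package's Translator class. The sketch below is hedged: the import path and predict signature follow the seamless_communication README, and it assumes the package plus a CUDA device are available (the import guard lets the snippet load even without them).

```python
# Sketch of the equivalent S2ST call via the Python API.
# Assumes `pip install seamless_communication` and a CUDA GPU.
try:
    import torch
    from seamless_communication.inference import Translator
    HAVE_SEAMLESS = True
except ImportError:
    HAVE_SEAMLESS = False

def speech_to_speech(audio_path: str, tgt_lang: str = "jpn"):
    if not HAVE_SEAMLESS:
        raise RuntimeError("seamless_communication is not installed")
    translator = Translator(
        "seamlessM4T_v2_large",   # model checkpoint, as in the CLI run
        "vocoder_v2",             # vocoder checkpoint, as in the logs above
        device=torch.device("cuda:0"),
        dtype=torch.float16,
    )
    # Returns the translated text and the generated speech output.
    text_output, speech_output = translator.predict(audio_path, "s2st", tgt_lang)
    return text_output, speech_output
```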


S2TT (Speech-to-Text Translation)

This translates an audio file directly into text output:

$ m4t_predict ./sample-002.wav --task s2tt --tgt_lang jpn

On success it prints a response like this:

2023-12-05 12:28:54,018 INFO -- seamless_communication.cli.m4t.predict.predict: Running inference on device=device(type='cuda', index=0) with dtype=torch.float16.
Using the cached checkpoint of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached checkpoint of vocoder_v2. Set `force` to `True` to download again.
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
2023-12-05 12:29:41,769 INFO -- seamless_communication.cli.m4t.predict.predict: text_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(1, 200), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2023-12-05 12:29:41,770 INFO -- seamless_communication.cli.m4t.predict.predict: unit_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(25, 50), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2023-12-05 12:29:41,770 INFO -- seamless_communication.cli.m4t.predict.predict: unit_generation_ngram_filtering=False
2023-12-05 12:29:55,754 INFO -- seamless_communication.cli.m4t.predict.predict: Translated text in jpn: 彼はあなたを助けて ⁇ あなたのビジョンでそれを見ることができます ⁇ 

T2TT (Text-to-Text Translation)

seamlessM4T_v2_large can also translate directly between different written languages:

$ m4t_predict 'Hi! How are you?' --task t2tt --tgt_lang jpn --src_lang eng

As shown below, t2tt translates "Hi! How are you?" directly into the Japanese "こんにちは!元気ですか?":

2023-12-05 08:43:03,156 INFO -- seamless_communication.cli.m4t.predict.predict: Running inference on device=device(type='cuda', index=0) with dtype=torch.float16.
Using the cached checkpoint of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached checkpoint of vocoder_v2. Set `force` to `True` to download again.
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
2023-12-05 08:43:49,793 INFO -- seamless_communication.cli.m4t.predict.predict: text_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(1, 200), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2023-12-05 08:43:49,794 INFO -- seamless_communication.cli.m4t.predict.predict: unit_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(25, 50), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2023-12-05 08:43:49,794 INFO -- seamless_communication.cli.m4t.predict.predict: unit_generation_ngram_filtering=False
2023-12-05 08:43:50,864 INFO -- seamless_communication.cli.m4t.predict.predict: Translated text in jpn: こんにちは!元気ですか?


T2ST (Text-to-Speech Translation)

The t2st task can translate text in any supported language into audio in a specified target language (e.g. Japanese):

$ m4t_predict '這就是人生' --task t2st --tgt_lang jpn --src_lang cmn_Hant --output_path t2st.wav

The command above translates the Chinese source text "這就是人生" directly into a Japanese audio file.

2023-12-05 08:50:48,246 INFO -- seamless_communication.cli.m4t.predict.predict: Running inference on device=device(type='cuda', index=0) with dtype=torch.float16.
Using the cached checkpoint of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached checkpoint of vocoder_v2. Set `force` to `True` to download again.
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
2023-12-05 08:50:59,107 INFO -- seamless_communication.cli.m4t.predict.predict: text_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(1, 200), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2023-12-05 08:50:59,107 INFO -- seamless_communication.cli.m4t.predict.predict: unit_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(25, 50), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2023-12-05 08:50:59,108 INFO -- seamless_communication.cli.m4t.predict.predict: unit_generation_ngram_filtering=False
2023-12-05 08:51:00,745 INFO -- seamless_communication.cli.m4t.predict.predict: Saving translated audio in jpn
2023-12-05 08:51:00,747 INFO -- seamless_communication.cli.m4t.predict.predict: Translated text in jpn: これが人生

The quality of the generated audio is quite impressive: for almost any language seamlessM4T_v2_large supports, text in one language can be smoothly translated into an audio file in any other supported language.


ASR (Automatic Speech Recognition)

Automatic speech recognition uses machine learning and deep learning to convert speech data into text. An ASR system can recognize human speech and transcribe it into readable text.

$ m4t_predict ./sample-002.wav --task asr --tgt_lang jpn

The content of the input audio file is "Hi! how are you? I am looking for a local cuisine recommendation."

2023-12-05 09:04:50,753 INFO -- seamless_communication.cli.m4t.predict.predict: Running inference on device=device(type='cuda', index=0) with dtype=torch.float16.
Using the cached checkpoint of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached tokenizer of seamlessM4T_v2_large. Set `force` to `True` to download again.
Using the cached checkpoint of vocoder_v2. Set `force` to `True` to download again.
/usr/local/lib/python3.10/dist-packages/torch/nn/utils/weight_norm.py:30: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
2023-12-05 09:05:02,815 INFO -- seamless_communication.cli.m4t.predict.predict: text_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(1, 200), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2023-12-05 09:05:02,815 INFO -- seamless_communication.cli.m4t.predict.predict: unit_generation_opts=SequenceGeneratorOptions(beam_size=5, soft_max_seq_len=(25, 50), hard_max_seq_len=1024, step_processor=None, unk_penalty=0.0, len_penalty=1.0)
2023-12-05 09:05:02,815 INFO -- seamless_communication.cli.m4t.predict.predict: unit_generation_ngram_filtering=False
2023-12-05 09:05:04,688 INFO -- seamless_communication.cli.m4t.predict.predict: Translated text in jpn: 彼はあなたを助けて ⁇ あなたのビジョンでそれを見ることができます ⁇

I suspect this is because my English accent is far from standard: the generated result contains a great many errors, recognizing my audio as "彼はあなたを助けて ⁇ あなたのビジョンでそれを見ることができます".


Summary of the Experiments

  • For text-to-audio conversion, where accent and pronunciation problems do not exist, conversion between text in any source language and speech in any target language within seamlessM4T_v2_large's supported range is close to flawless.
  • For audio input, when I fed native speakers' audio from YouTube into the Speech-to-Speech Translation and Speech-to-Text Translation tasks, the conversion between source and target languages was also very impressive.
  • On Speech-to-Speech and Speech-to-Text tasks, non-standard pronunciation makes the translated results quite poor.
  • Unfortunately, with Chinese as the source language, every test I have run so far still shows very large errors when translating into audio in other target languages.

Common Issues

ValueError: The input audio cannot be decoded. See nested exception for details.

In short, the input audio format (e.g. mp3) is not supported by seamlessM4T_v2_large, so you first need to convert the mp3 file into wav format:

$ ffmpeg -i input.mp3 -acodec pcm_s16le -ar 44100 -ac 2 output.wav
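If you have many mp3 files to convert, the same ffmpeg invocation can be driven from Python. A small sketch (the helper names are my own) that keeps the conversion parameters in one place and builds the exact command used above:

```python
import subprocess

def ffmpeg_wav_cmd(src: str, dst: str, rate: int = 44100, channels: int = 2) -> list[str]:
    """Build the ffmpeg command used above: 16-bit PCM WAV output."""
    return ["ffmpeg", "-i", src,
            "-acodec", "pcm_s16le",
            "-ar", str(rate),
            "-ac", str(channels),
            dst]

def convert_to_wav(src: str, dst: str) -> None:
    # Raises CalledProcessError if ffmpeg fails (e.g. unreadable input).
    subprocess.run(ffmpeg_wav_cmd(src, dst), check=True)

print(ffmpeg_wav_cmd("input.mp3", "output.wav"))
```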
