
There is a README.md at https://github.com/m-bain/whisperX/blob/main/README.md.
The GitHub repository also contains the source code, along with the installation recipe and instructions on how to run it.
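For reference, the recipe there is short. A minimal sketch (check the README for the current instructions; the install method has changed between releases):

pip install whisperx
whisperx ./test/en_cosmetic.wav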

It says little about what the actual results look like, though, so here is some information from my own run.

My first question was: does it give timestamp information at the utterance level, not only at the word level? Many articles only describe the accuracy of word-level timestamps obtained via phoneme networks. The answer is yes: it provides both utterance-level and word-level timestamps.

The following figure shows the input wav data. It consists of 4 utterances separated by 3 short, somewhat ambiguous pauses (silences).

[Figure: waveform of en_cosmetic.wav — 4 utterances separated by 3 short pauses]

And the result was:

root@c6991d090164:/whiperX# whisperx ./test/en_cosmetic.wav
INFO:speechbrain.utils.quirks:Applied quirks (see `speechbrain.utils.quirks`): [disable_jit_profiling, allow_tf32]
INFO:speechbrain.utils.quirks:Excluded quirks specified by the `SB_DISABLE_QUIRKS` environment (comma-separated list): []
config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.37k/2.37k [00:00<00:00, 13.8MB/s]
vocabulary.txt: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 460k/460k [00:00<00:00, 1.15MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2.20M/2.20M [00:00<00:00, 2.64MB/s]
model.bin: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 484M/484M [01:01<00:00, 7.92MB/s]
No language specified, language will be first be detected for each audio file (increases inference time).
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.5.0.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../usr/local/lib/python3.10/dist-packages/whisperx/assets/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.6.0+cu124. Bad things might happen unless you revert torch to 1.x.
>>Performing transcription...
/usr/local/lib/python3.10/dist-packages/pyannote/audio/utils/reproducibility.py:74: ReproducibilityWarning: TensorFloat-32 (TF32) has been disabled as it might lead to reproducibility issues and lower accuracy.
It can be re-enabled by calling
   >>> import torch
   >>> torch.backends.cuda.matmul.allow_tf32 = True
   >>> torch.backends.cudnn.allow_tf32 = True
See https://github.com/pyannote/pyannote-audio/issues/1370 for more details.

  warnings.warn(
Detected language: en (1.00) in first 30s of audio...
Transcript: [2.208 --> 31.267]  This powerful formula works to minimize fine lines and wrinkles while enhancing elasticity and hydration. Infused with collagen, hyaluronic acid, adenosine, and natural extracts, it supports your skin's health and promotes a youthful glow. The lightweight, non-sticky texture absorbs effortlessly, leaving you feeling refreshed and soft. Reson Collagen 100 Ampoule is here to provide your skin with the care it deserves.
Downloading: "https://download.pytorch.org/torchaudio/models/wav2vec2_fairseq_base_ls960_asr_ls960.pth" to /root/.cache/torch/hub/checkpoints/wav2vec2_fairseq_base_ls960_asr_ls960.pth
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 360M/360M [00:57<00:00, 6.60MB/s]
>>Performing alignment...
root@c6991d090164:/whiperX#

 

It downloads some resources such as the vocabulary and the binary model, detects the language, and performs the transcription, printing the overall time span and the transcript text. After that it downloads the wav2vec2 model (the alignment model behind the "X" in whisperX) and performs the word-level alignment that refines the timestamp information.
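The same two-stage pipeline can also be driven from Python. A minimal sketch following the usage example in the README; "small" matches the default model the CLI run above downloaded, and the device and file path are assumptions for my container:

```python
import whisperx

device = "cuda"
audio = whisperx.load_audio("./test/en_cosmetic.wav")

# stage 1: batched Whisper transcription -> utterance-level segments
model = whisperx.load_model("small", device, compute_type="float16")
result = model.transcribe(audio, batch_size=16)

# stage 2: wav2vec2 forced alignment -> word-level timestamps
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata,
                        audio, device, return_char_alignments=False)

print(result["segments"][0]["start"], result["segments"][0]["end"])
```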

It produces 5 text output files: json, srt, tsv, txt, and vtt.
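Assuming the default output directory (the current one) and the usual naming after the input file, the outputs for this run would be:

en_cosmetic.json  en_cosmetic.srt  en_cosmetic.tsv  en_cosmetic.txt  en_cosmetic.vtt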

I used the json file and re-aligned the resulting information to visualize the silences between utterances. It looks like the following:

[Figure: re-aligned json output showing utterance timestamps, word-level timestamps with confidence scores, and the silences between utterances]

It contains utterance-level entries with their timestamps, i.e., the start and end time of each utterance, followed by the word-level information with an additional confidence score.
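Computing those gaps from the json is straightforward. A minimal sketch, assuming whisperX's json layout (a top-level "segments" list whose entries carry "start", "end", "text", and a nested "words" list with per-word "score") and the en_cosmetic.json filename from the run above:

```python
import json

# load whisperX's json output
with open("en_cosmetic.json") as f:
    result = json.load(f)

segments = result["segments"]  # utterance-level entries
for prev, cur in zip(segments, segments[1:]):
    gap = cur["start"] - prev["end"]  # silence between consecutive utterances
    print(f"{prev['end']:.3f} --> {cur['start']:.3f}  pause of {gap:.3f}s")

# word-level entries live under each segment, with a confidence score;
# .get() is used because some words may lack timestamps
for w in segments[0]["words"]:
    print(w["word"], w.get("start"), w.get("end"), w.get("score"))
```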

It detects the silences between utterances well. The result looks very good.

 
