Mac M1 convert video to text using whisper and ffmpeg

Original link: https://blog.kelu.org/tech/2023/05/29/mac-video-to-text.html

In work and life, we are often asked to watch a video and write up a summary of it.

At present I mostly watch Bilibili, which already has plug-ins that can summarize video content for you. For example, Glarity Summary, which I recorded in "Some Data Summary of ChatGPT" a few days ago, can use ChatGPT to generate summaries for Google searches, YouTube videos, and various web pages.

Since work is involved, there are some videos we cannot upload to online services to summarize, so the summaries have to be generated locally. My current approach is to convert the video to audio and then to text; after desensitizing the text, I hand it, together with a specific prompt, to GPT to generate the data I want. It is not a perfect solution, but it solves the problem for now. This article briefly records the process.

There are still many areas in this solution that could be optimized. For example, whisper runs on the CPU here; I have not been able to resolve the errors when running on the GPU, and will look into it when I have time.

I suspect it may be a Python version problem: the environment I set up in "Mac M1 running conda and jupyter notebook memo" uses Python 3.8, which may not work.

1. ffmpeg

install ffmpeg

```shell
brew install ffmpeg
```

Convert video to audio:

```shell
ffmpeg -i "input.mp4" -vn -acodec libmp3lame output.mp3
ffmpeg -i "input.mov" -vn -acodec libmp3lame output.mp3
```
  • -i input.mov specifies the input file path and file name.
  • -vn tells FFmpeg not to include the video stream, only the audio stream.
  • -acodec libmp3lame specifies the audio codec as libmp3lame, which is used to encode the audio stream into MP3 format.
  • output.mp3 specifies the output MP3 file path and file name.
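For batch use, the same command can be assembled from Python with subprocess (a sketch of my own, not from the original workflow; it assumes ffmpeg is on the PATH):

```python
import subprocess

def video_to_mp3_cmd(src: str, dst: str) -> list[str]:
    """Assemble the ffmpeg command shown above for one input file."""
    return ["ffmpeg", "-i", src, "-vn", "-acodec", "libmp3lame", dst]

cmd = video_to_mp3_cmd("input.mp4", "output.mp3")
print(" ".join(cmd))
# To actually run the conversion:
# subprocess.run(cmd, check=True)
```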


Split the audio into 30-second segments:

```shell
ffmpeg -i output.mp3 -f segment -segment_time 30 -c copy output_%03d.mp3
```
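As a sanity check, the number of files the segment muxer emits can be predicted from the total duration (the 3-hour figure below is only an example, not from the original article):

```python
import math

def segment_count(duration_s: float, segment_time: int = 30) -> int:
    """How many output_###.mp3 files a given duration yields."""
    return math.ceil(duration_s / segment_time)

print(segment_count(10_800))  # a 3-hour recording -> 360 segments
```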


2. whisper

https://github.com/openai/whisper

Whisper is a multi-task speech recognition model that can perform multilingual speech recognition, speech translation, and language identification. It is a Transformer sequence-to-sequence model trained on a large and diverse audio dataset, and it is compatible with Python 3.8-3.11 and recent PyTorch versions. It comes in several model sizes, including English-only variants, and its performance varies by language. It can be used from the command line or from Python, and its code and model weights are released under the MIT license.

Whisper has several model sizes. On my Mac M1 (CPU mode) the small model is quite fast, while the medium model is very slow.

Activate the virtual environment and install whisper:

```shell
conda activate ~/Workspace/pytorch-test/env
pip install --upgrade git+https://github.com/openai/whisper.git
```

Write a hello-world script and try it out:

```python
import whisper

# model = whisper.load_model("small")
model = whisper.load_model("medium")

audio = whisper.load_audio("/Users/kelu/Desktop/output_000.mp3")
audio = whisper.pad_or_trim(audio)

# make log-Mel spectrogram and move to the same device as the model
mel = whisper.log_mel_spectrogram(audio).to("cpu")

# detect the spoken language
_, probs = model.detect_language(mel)
print(f"Detected language: {max(probs, key=probs.get)}")

# decode the audio
# options = whisper.DecodingOptions(fp16=False, prompt="以下是普通话的句子")  # add a prompt for Simplified Chinese
options = whisper.DecodingOptions(fp16=False)
result = whisper.decode(model, mel, options)

# print the recognized text
print(result.text)
```

Downloading the model will take some time:


My two audio files are also kept here as a backup: output_1.mp3, output_2.mp3.


Write a loop over all the segments:

```python
import whisper

options = whisper.DecodingOptions(fp16=False, prompt="以下是普通话的句子")
model = whisper.load_model("medium")

for i in range(361):
    file_name = f"output_{i:03d}.mp3"
    audio = whisper.load_audio("/Users/kelu/Desktop/" + file_name)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to("cpu")
    result = whisper.decode(model, mel, options)
    print(result.text)
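The loop above hard-codes the segment count (361), which has to match the ffmpeg output exactly. A more robust variant, a sketch of my own rather than part of the original article, discovers the files with glob:

```python
import glob
import os

def list_segments(directory: str, pattern: str = "output_*.mp3") -> list[str]:
    """Return segment files in the zero-padded order ffmpeg wrote them."""
    return sorted(glob.glob(os.path.join(directory, pattern)))

# The transcription loop then becomes:
# for path in list_segments("/Users/kelu/Desktop"):
#     audio = whisper.load_audio(path)
#     ...
```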

3. Unresolved issues

I tried to run whisper using MPS as in the article in reference 1, but it didn't work. There are many discussions about this online; I will follow up when I have the energy.
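As a first diagnostic step (my own addition, not from the original article), PyTorch itself can report whether the MPS backend is usable before whisper ever gets involved:

```python
import importlib.util

def mps_status() -> str:
    """Best-effort report on the availability of PyTorch's MPS backend."""
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch
    if not torch.backends.mps.is_built():
        return "torch built without MPS support"
    if not torch.backends.mps.is_available():
        return "MPS backend not available on this machine"
    return "MPS available"

print(mps_status())
```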

References
