Automating Video Transcription

Vitalii Shloda
3 min read · Jul 12, 2024


In a world where time is precious, watching lengthy videos can be a challenge. Whether it's educational content, meetings, or webinars, you often need to grasp the main points quickly without watching the entire video. Enter VideoToText, a project that automates text extraction from video files so the content can be read in a fraction of the time.

The Issue

Many of us face situations where watching a full video isn’t feasible, yet understanding its content is necessary. Manual transcription is both time-consuming and impractical. Automating this process with VideoToText can save substantial time and effort by quickly and accurately converting video to text.

Technology Stack

VideoToText leverages a powerful technology stack (a short sketch of how these tools fit together follows the list):

• yt-dlp: A command-line tool for downloading videos from YouTube and other sites.

• FFmpeg: A versatile tool for processing audio and video files, used here to extract audio from videos.

• OpenAI Whisper: A state-of-the-art speech recognition model for transcribing audio to text.

• Docker: Ensures easy setup and consistent environment deployment.
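
To make the roles of the first two tools concrete, here is a minimal, hypothetical sketch of chaining yt-dlp (via its Python API) and FFmpeg from Python to produce the 16 kHz mono audio that Whisper expects. The helper names download_video and extract_audio are illustrative assumptions, not the project's actual code:

import subprocess
from yt_dlp import YoutubeDL

def download_video(url, out_dir='files'):
    # Download the video with yt-dlp and return the local file path
    opts = {'outtmpl': f'{out_dir}/%(id)s.%(ext)s'}
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)

def extract_audio(video_path):
    # Strip the audio track with FFmpeg as mono 16 kHz WAV, the format Whisper works with
    audio_path = video_path.rsplit('.', 1)[0] + '.wav'
    subprocess.run(
        ['ffmpeg', '-y', '-i', video_path, '-vn', '-ac', '1', '-ar', '16000', audio_path],
        check=True
    )
    return audio_path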

Dockerfile Breakdown

The Dockerfile ensures all dependencies are correctly installed and the Whisper model is pre-downloaded for efficiency:

# Use the official Python 3.12 image based on Slim
FROM python:3.12-slim

# Set environment variables to non-interactive mode for apt
ENV DEBIAN_FRONTEND=noninteractive

# Install system dependencies: FFmpeg for audio extraction, git for installing Whisper from GitHub
RUN apt clean && apt update && apt install -y ffmpeg git

# Install Python libraries
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

# Reinstall Whisper from its GitHub repository to pick up the latest version
RUN pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

# Define build-model arguments
ARG WHISPER_MODEL=small

# Set environment variables
ENV WHISPER_MODEL=$WHISPER_MODEL

# Download the Whisper model and cache it
RUN python -c "import whisper; whisper.load_model('$WHISPER_MODEL')"

# Copy the application files
COPY ./ /app

# Set the working directory
WORKDIR /app

CMD ["python", "main.py"]

Getting Started

To start using VideoToText:

1. Clone the Repository:

git clone https://github.com/vshloda/VideoToText
cd VideoToText

2. Build the Docker Image:

docker compose build

3. Run the Docker Container:

For YouTube videos:

docker compose run --rm app python main.py --url "https://www.youtube.com/watch?v=example"

For local video files:

docker compose run --rm app python main.py --file "files/video.mp4"
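
What main.py does with these flags lives in the repository; purely as an illustration, and reusing the hypothetical download_video and extract_audio helpers sketched earlier plus the convert_audio_to_text_whisper function shown in the next section, the entry point might look roughly like this:

import argparse
import os

import whisper

def main():
    parser = argparse.ArgumentParser(description='Transcribe a video to text')
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument('--url', help='URL of a YouTube (or other) video')
    group.add_argument('--file', help='path to a local video file')
    args = parser.parse_args()

    # Resolve the input to a local video file, then strip its audio track
    video_path = download_video(args.url) if args.url else args.file
    audio_path = extract_audio(video_path)

    # Load the model chosen at build time and run the segmented transcription
    model = whisper.load_model(os.getenv('WHISPER_MODEL', 'small'))
    print(convert_audio_to_text_whisper(model, audio_path, 'txt'))

if __name__ == '__main__':
    main()

The output format is hard-coded to 'txt' here for brevity; the script in the repository is the source of truth for the available options.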

Segmented Audio Processing

The transcription process splits the audio into 30-second segments, which matches the fixed-length window that Whisper's decoder expects and makes it easy to show progress and attach timestamps to each segment. Here's the code that segments the audio and transcribes each piece:

# imports needed by this snippet (format_timestamp is defined elsewhere in the project)
import whisper
from tqdm import tqdm

def split_audio(audio, segment_length=30, sample_rate=16000):
    segments = []
    audio_length = len(audio) // sample_rate
    for start in range(0, audio_length, segment_length):
        end = min(start + segment_length, audio_length)
        segment = audio[start * sample_rate:end * sample_rate]
        segments.append((segment, start, end))
    return segments

def convert_audio_to_text_whisper(model, audio_path, output_format):
    print("Transcribing audio...")

    # load audio
    audio = whisper.load_audio(audio_path)
    audio_length = len(audio) / whisper.audio.SAMPLE_RATE

    # split audio into 30-second segments
    segments = split_audio(audio, segment_length=30)

    # initialize the text result
    result_list = []
    result_txt = ''

    for segment, start, end in tqdm(segments, desc="Transcribing segments"):
        # pad the last, shorter segment to the 30 seconds Whisper expects
        if len(segment) < 16000 * 30:
            segment = whisper.pad_or_trim(segment)

        # make log-Mel spectrogram and move to the same device as the model
        mel = whisper.log_mel_spectrogram(segment).to(model.device)

        # detect the spoken language
        _, probs = model.detect_language(mel)
        detected_language = max(probs, key=probs.get)
        print(f"Detected language: {detected_language}")

        # decode the audio
        options = whisper.DecodingOptions()
        result = whisper.decode(model, mel, options)

        if output_format == 'txt':
            result_txt += ' ' + result.text.strip()
        else:
            result_list.append({
                "start": format_timestamp(start),
                "end": format_timestamp(end),
                "text": result.text.strip()
            })

    print("Transcription completed.")
    if output_format == 'txt':
        return result_txt
    else:
        return result_list
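
The code above calls a format_timestamp helper that isn't shown in the post. A minimal version, assuming segment boundaries are reported as HH:MM:SS, could look like this:

def format_timestamp(seconds):
    # Convert a position in seconds to an HH:MM:SS label
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f'{hours:02d}:{minutes:02d}:{secs:02d}'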

The Whisper Model

Whisper comes in several model sizes (tiny, base, small, medium, and large) that trade accuracy for speed and memory use. Choose the one that fits your needs and set it with the WHISPER_MODEL environment variable, which the Dockerfile also exposes as a build argument so the chosen model is baked into the image:

model = whisper.load_model(os.getenv('WHISPER_MODEL', 'small'))

Conclusion

VideoToText automates video content transcription, enabling quick comprehension of main points without watching the entire video. Future enhancements could include a summarization feature to extract only the essential information.

Check out the VideoToText project on GitHub to get started.

Written by Vitalii Shloda

Software Engineer. I write about backend, data and other amazing stuff