Automating Video Transcription

Vitalii Shloda
3 min read · Jul 12, 2024


In a world where time is precious, watching lengthy videos can be a challenge. Whether it's educational content, meetings, or webinars, you often need to grasp the main points quickly without watching the entire video. Enter VideoToText, a project that automates text extraction from video files so the content can be read in a fraction of the time.

The Issue

Many of us face situations where watching a full video isn’t feasible, yet understanding its content is necessary. Manual transcription is both time-consuming and impractical. Automating this process with VideoToText can save substantial time and effort by quickly and accurately converting video to text.

Technology Stack

VideoToText leverages a powerful technology stack (a short sketch of how these tools fit together follows the list):

• yt-dlp: A command-line tool for downloading videos from YouTube and other sites.

• FFmpeg: A versatile tool for processing audio and video files, used here to extract audio from videos.

• OpenAI Whisper: A state-of-the-art speech recognition model for transcribing audio to text.

• Docker: Ensures easy setup and consistent environment deployment.
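
To make the roles of the first two tools concrete, here is a minimal, hypothetical sketch of chaining yt-dlp (via its Python API) and FFmpeg from Python to produce the 16 kHz mono audio that Whisper expects. The helper names download_video and extract_audio are illustrative assumptions, not the project's actual code:

import subprocess
from yt_dlp import YoutubeDL

def download_video(url, out_dir='files'):
    # Download the video with yt-dlp and return the local file path
    opts = {'outtmpl': f'{out_dir}/%(id)s.%(ext)s'}
    with YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)

def extract_audio(video_path):
    # Strip the audio track with FFmpeg as mono 16 kHz WAV, the format Whisper works with
    audio_path = video_path.rsplit('.', 1)[0] + '.wav'
    subprocess.run(
        ['ffmpeg', '-y', '-i', video_path, '-vn', '-ac', '1', '-ar', '16000', audio_path],
        check=True
    )
    return audio_path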

Dockerfile Breakdown

The Dockerfile ensures all dependencies are correctly installed and the Whisper model is pre-downloaded for efficiency:

# Use the official Python 3.12 image based on Slim
FROM python:3.12-slim

# Set environment variables to non-interactive mode for apt
ENV DEBIAN_FRONTEND=noninteractive

# Install system dependencies: FFmpeg for audio extraction, git for installing Whisper from GitHub
RUN apt clean && apt update && apt install -y ffmpeg git

# Install Python libraries
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt

# Reinstall Whisper from its GitHub repository to pick up the latest version
RUN pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git

# Define build-model arguments
ARG WHISPER_MODEL=small

# Set environment variables
ENV WHISPER_MODEL=$WHISPER_MODEL

# Download the Whisper model and cache it
RUN python -c "import whisper; whisper.load_model('$WHISPER_MODEL')"

# Copy the application files
COPY ./ /app

# Set the working directory
WORKDIR /app

CMD ["python", "main.py"]

Getting Started

To start using VideoToText:

1. Clone the Repository:

git clone https://github.com/vshloda/VideoToText
cd VideoToText

2. Build the Docker Image:

docker compose build

3. Run the Docker Container:

For YouTube videos:

docker compose run --rm app python main.py --url "https://www.youtube.com/watch?v=example"

For local video files:

docker compose run --rm app python main.py --file "files/video.mp4"
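
What main.py does with these flags lives in the repository; purely as an illustration, and reusing the hypothetical download_video and extract_audio helpers sketched earlier plus the convert_audio_to_text_whisper function shown in the next section, the entry point might look roughly like this:

import argparse
import os

import whisper

def main():
    parser = argparse.ArgumentParser(description='Transcribe a video to text')
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument('--url', help='URL of a YouTube (or other) video')
    group.add_argument('--file', help='path to a local video file')
    args = parser.parse_args()

    # Resolve the input to a local video file, then strip its audio track
    video_path = download_video(args.url) if args.url else args.file
    audio_path = extract_audio(video_path)

    # Load the model chosen at build time and run the segmented transcription
    model = whisper.load_model(os.getenv('WHISPER_MODEL', 'small'))
    print(convert_audio_to_text_whisper(model, audio_path, 'txt'))

if __name__ == '__main__':
    main()

The output format is hard-coded to 'txt' here for brevity; the script in the repository is the source of truth for the available options.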

Segmented Audio Processing

The transcription process splits the audio into 30-second segments, which matches the fixed-length window that Whisper's decoder expects and makes it easy to show progress and attach timestamps to each segment. Here's the code that segments the audio and transcribes each piece:

# imports needed by this snippet (format_timestamp is defined elsewhere in the project)
import whisper
from tqdm import tqdm

def split_audio(audio, segment_length=30, sample_rate=16000):
    segments = []
    audio_length = len(audio) // sample_rate
    for start in range(0, audio_length, segment_length):
        end = min(start + segment_length, audio_length)
        segment = audio[start * sample_rate:end * sample_rate]
        segments.append((segment, start, end))
    return segments

def convert_audio_to_text_whisper(model, audio_path, output_format):
    print("Transcribing audio...")

    # load audio
    audio = whisper.load_audio(audio_path)
    audio_length = len(audio) / whisper.audio.SAMPLE_RATE

    # split audio into 30-second segments
    segments = split_audio(audio, segment_length=30)

    # initialize the text result
    result_list = []
    result_txt = ''

    for segment, start, end in tqdm(segments, desc="Transcribing segments"):
        # pad the last, shorter segment to the 30 seconds Whisper expects
        if len(segment) < 16000 * 30:
            segment = whisper.pad_or_trim(segment)

        # make log-Mel spectrogram and move to the same device as the model
        mel = whisper.log_mel_spectrogram(segment).to(model.device)

        # detect the spoken language
        _, probs = model.detect_language(mel)
        detected_language = max(probs, key=probs.get)
        print(f"Detected language: {detected_language}")

        # decode the audio
        options = whisper.DecodingOptions()
        result = whisper.decode(model, mel, options)

        if output_format == 'txt':
            result_txt += ' ' + result.text.strip()
        else:
            result_list.append({
                "start": format_timestamp(start),
                "end": format_timestamp(end),
                "text": result.text.strip()
            })

    print("Transcription completed.")
    if output_format == 'txt':
        return result_txt
    else:
        return result_list
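
The code above calls a format_timestamp helper that isn't shown in the post. A minimal version, assuming segment boundaries are reported as HH:MM:SS, could look like this:

def format_timestamp(seconds):
    # Convert a position in seconds to an HH:MM:SS label
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f'{hours:02d}:{minutes:02d}:{secs:02d}'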

The Whisper Model

Whisper comes in several model sizes (tiny, base, small, medium, and large) that trade accuracy for speed and memory use. Choose the one that fits your needs and set it with the WHISPER_MODEL environment variable, which the Dockerfile also exposes as a build argument so the chosen model is baked into the image:

model = whisper.load_model(os.getenv('WHISPER_MODEL', 'small'))

Conclusion

VideoToText automates video content transcription, enabling quick comprehension of main points without watching the entire video. Future enhancements could include a summarization feature to extract only the essential information.

Check out the VideoToText project on GitHub to get started.

Written by Vitalii Shloda

Software Engineer. I write about backend, data and other amazing stuff