Automating Video Transcription
In a world where time is precious, watching lengthy videos can be a challenge. Whether it's educational content, meetings, or webinars, we often need to grasp the main points quickly without viewing the entire video. Enter VideoToText, a project that automates text extraction from video files, allowing their content to be read quickly.
The Issue
Many of us face situations where watching a full video isn’t feasible, yet understanding its content is necessary. Manual transcription is both time-consuming and impractical. Automating this process with VideoToText can save substantial time and effort by quickly and accurately converting video to text.
Technology Stack
VideoToText leverages a powerful technology stack:
• yt-dlp: A command-line tool for downloading videos from YouTube and other sites.
• FFmpeg: A versatile tool for processing audio and video files, used here to extract audio from videos.
• OpenAI Whisper: A state-of-the-art speech recognition model for transcribing audio to text.
• Docker: Ensures easy setup and a consistent deployment environment.
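Chained together, these tools form a simple pipeline: download the video, extract the audio, and feed it to Whisper. Here's a minimal sketch of that flow (not the project's actual code; it assumes yt-dlp and ffmpeg are on the PATH and uses example file names):
import subprocess
import whisper

def transcribe_from_url(url: str, model_name: str = "small") -> str:
    # 1. Download the video with yt-dlp (the output name is fixed for this example).
    subprocess.run(["yt-dlp", "-f", "mp4", "-o", "video.mp4", url], check=True)

    # 2. Extract mono 16 kHz audio with FFmpeg, the format Whisper expects.
    subprocess.run(
        ["ffmpeg", "-y", "-i", "video.mp4", "-vn", "-ac", "1", "-ar", "16000", "audio.wav"],
        check=True,
    )

    # 3. Transcribe the audio with Whisper.
    model = whisper.load_model(model_name)
    return model.transcribe("audio.wav")["text"]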
Dockerfile Breakdown
The Dockerfile ensures all dependencies are correctly installed and the Whisper model is pre-downloaded for efficiency:
# Use the official Python 3.12 slim image
FROM python:3.12-slim
# Set environment variables to non-interactive mode for apt
ENV DEBIAN_FRONTEND=noninteractive
# Install system dependencies: FFmpeg for audio extraction, git for installing Whisper from GitHub
RUN apt clean && apt update && apt install -y ffmpeg git
# Install Python libraries
COPY requirements.txt /app/requirements.txt
RUN pip install --no-cache-dir -r /app/requirements.txt
RUN pip install --upgrade --no-deps --force-reinstall git+https://github.com/openai/whisper.git
# Define the Whisper model to pre-download as a build argument (defaults to small)
ARG WHISPER_MODEL=small
# Set environment variables
ENV WHISPER_MODEL=$WHISPER_MODEL
# Download the Whisper model and cache it
RUN python -c "import whisper; whisper.load_model('$WHISPER_MODEL')"
# Copy the application files
COPY ./ /app
# Set the working directory
WORKDIR /app
CMD ["python", "main.py"]
Getting Started
To start using VideoToText:
1. Clone the Repository:
git clone https://github.com/vshloda/VideoToText
cd VideoToText
2. Build the Docker Image:
docker compose build
3. Run the Docker Container:
For YouTube videos:
docker compose run --rm app python main.py --url "https://www.youtube.com/watch?v=example"
For local video files:
docker compose run --rm app python main.py --file "files/video.mp4"
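Both commands route through main.py, which accepts either --url or --file. The project's actual entry point isn't shown here, but a hypothetical, simplified sketch of how those flags could be wired looks like this (note that Whisper's transcribe() shells out to FFmpeg internally, so it can read a video file directly):
import argparse
import os
import subprocess
import whisper

parser = argparse.ArgumentParser(description="Transcribe a video to text")
source = parser.add_mutually_exclusive_group(required=True)
source.add_argument("--url", help="video URL to download with yt-dlp")
source.add_argument("--file", help="path to a local video file")
args = parser.parse_args()

if args.url:
    # Download the video first; the file name is just for this example.
    path = "files/downloaded.mp4"
    subprocess.run(["yt-dlp", "-f", "mp4", "-o", path, args.url], check=True)
else:
    path = args.file

model = whisper.load_model(os.getenv("WHISPER_MODEL", "small"))
# transcribe() invokes FFmpeg under the hood to decode the input, so it accepts
# the video file directly; the real project extracts the audio explicitly and
# uses the segmented loop shown in the next section.
print(model.transcribe(path)["text"])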
Segmented Audio Processing
The transcription process splits the audio into 30-second segments, the window size Whisper's encoder works on, which keeps memory use predictable and lets progress be reported per segment. Here's the code for audio segmentation:
import whisper
from tqdm import tqdm

def split_audio(audio, segment_length=30, sample_rate=16000):
    segments = []
    # total length in whole seconds
    audio_length = len(audio) // sample_rate
    # slice the waveform into consecutive segment_length-second chunks,
    # keeping the start/end offsets (in seconds) alongside each chunk
    for start in range(0, audio_length, segment_length):
        end = min(start + segment_length, audio_length)
        segment = audio[start * sample_rate:end * sample_rate]
        segments.append((segment, start, end))
    return segments
def convert_audio_to_text_whisper(model, audio_path, output_format):
    print("Transcribing audio...")
    # load audio
    audio = whisper.load_audio(audio_path)
    audio_length = len(audio) / whisper.audio.SAMPLE_RATE
    # split audio into 30-second segments
    segments = split_audio(audio, segment_length=30)
    # initialize the text result
    result_list = []
    result_txt = ''
    for segment, start, end in tqdm(segments, desc="Transcribing segments"):
        # pad the final, shorter segment up to the 30 seconds Whisper expects
        if len(segment) < 16000 * 30:
            segment = whisper.pad_or_trim(segment)
        # make log-Mel spectrogram and move to the same device as the model
        mel = whisper.log_mel_spectrogram(segment).to(model.device)
        # detect the spoken language
        _, probs = model.detect_language(mel)
        detected_language = max(probs, key=probs.get)
        print(f"Detected language: {detected_language}")
        # decode the audio
        options = whisper.DecodingOptions()
        result = whisper.decode(model, mel, options)
        if output_format == 'txt':
            result_txt += ' ' + result.text.strip()
        else:
            # format_timestamp is a project helper (not shown here) that
            # renders the start/end offsets as readable timestamps
            result_list.append({
                "start": format_timestamp(start),
                "end": format_timestamp(end),
                "text": result.text.strip()
            })
    print("Transcription completed.")
    if output_format == 'txt':
        return result_txt
    else:
        return result_list
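As a quick sanity check of the slicing logic, here is a small, hypothetical example that uses NumPy silence in place of real audio: a 75-second clip at 16 kHz splits into two full 30-second segments plus a 15-second remainder, which the transcription loop above then pads out to 30 seconds.
import numpy as np

# 75 seconds of silence at 16 kHz stands in for a real waveform.
audio = np.zeros(75 * 16000, dtype=np.float32)

for segment, start, end in split_audio(audio, segment_length=30):
    print(f"{start}s - {end}s: {len(segment)} samples")

# 0s - 30s: 480000 samples
# 30s - 60s: 480000 samples
# 60s - 75s: 240000 samples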
The Whisper Model
Whisper offers several model sizes (tiny, base, small, medium, and large) to balance speed, memory use, and accuracy. Choose the one that fits your needs and set it via the WHISPER_MODEL environment variable:
model = whisper.load_model(os.getenv('WHISPER_MODEL', 'small'))
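Larger models are more accurate but noticeably slower on CPU, so a GPU helps when one is available. Here's a small sketch that places the model on a GPU when PyTorch reports one (whether CUDA is usable inside the container depends on how Docker is set up, which is beyond the scope of this article):
import os
import torch
import whisper

# Pick the model size from the environment, falling back to "small",
# and move it to the GPU when PyTorch can see one.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = whisper.load_model(os.getenv("WHISPER_MODEL", "small"), device=device)
print(f"Model loaded on {device}")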
Conclusion
VideoToText automates video content transcription, enabling quick comprehension of main points without watching the entire video. Future enhancements could include a summarization feature to extract only the essential information.
Check out the VideoToText project on GitHub to get started.