Collection video transcriptions at scale with Whisper

Simon Loffler
ACMI LABS
Dec 9, 2022


Our ACMI Labs team has spent the past few weeks prototyping, building, and integrating automated video transcriptions into XOS, our museum operating system, using OpenAI Whisper.

We did this to help uncover the details of what’s in our video archive — a lot of which contains amateur films and home movies with limited documentation and cataloguing information.

We’ve been dreaming about this for a long time but have been previously limited by not having the digital infrastructure, only having our collection in physical analogue formats, and the unattainable cost of third party transcription services at such scale.

So our team was very excited when we posted about the initial release of Whisper on Slack one morning:

A screenshot from ACMI’s Slack backchannel: Seb’s first post noting that OpenAI had released Whisper for audio and video transcription.

Evaluation

To quickly test Whisper we trimmed a collection video to the first 5 minutes using ffmpeg and uploaded it to a publicly accessible S3 media storage bucket.
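
For reference, the trim step is a one-liner around ffmpeg; here’s a minimal sketch (the input filename is illustrative — the -t 300 and -c copy flags keep the first 300 seconds without re-encoding):

# Trim a collection video to its first 5 minutes before uploading it to S3.
# The input filename is illustrative.
import subprocess

subprocess.run([
    "ffmpeg", "-i", "012345.mp4",
    "-t", "300",        # keep only the first 300 seconds
    "-c", "copy",       # copy the streams rather than re-encoding
    "012345_first_5_minutes.mp4",
], check=True)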

We then opened up a Google Colab Notebook, installed Whisper from source, and pointed it at the video. The Notebook looked like this:

# whisper.ipynb

# Install dependencies
! pip install git+https://github.com/openai/whisper.git

# 5 minute test
import whisper

model = whisper.load_model("base")
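# Whisper reads the input with ffmpeg, so a publicly accessible URL works here too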
result = model.transcribe("https://public.s3.amazonaws.com/012345_first_5_minutes.mp4")
print(result["text"])

Infrastructure

The results from the “base” model were good enough for us to continue evaluating longer videos, and to start preparing XOS and its Videos API to accept transcriptions.

Because the memory requirements of this model were only ~1GB, we were also able to integrate Whisper directly into our XOS infrastructure. This enabled us to run background tasks both in our Azure Cloud and on local computers, which was very desirable given that we have more than 4,000 digitised items to process.

Our node pool of cloud computers is CPU-only, so we couldn’t take advantage of the 16x transcription speed of Whisper on a GPU, but even on our CPUs we saw transcriptions happening at ~2.5x video playback speed.

In practice our cloud infrastructure only managed to transcribe videos shorter than 2,700 seconds (45 minutes) before our pods hit their limits and were destroyed.

So for a few weeks we also ran transcriptions on MacBook Pro M1 laptops, using our XOS Videos API to GET the video data, run the transcriptions, and POST the transcribed JSON data back to XOS (a sketch of that loop follows the results below). The final results were:

  • 3,406 videos transcribed
  • 1,093 hours of footage
  • 4,924,635 words
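
The laptop workers were essentially a small loop around the XOS Videos API. Here’s a minimal sketch of the idea — the endpoint paths, query parameters, and payload field names below are illustrative placeholders, not the real API:

# transcribe_worker.py — illustrative sketch only: the endpoint paths,
# query parameters, and payload field names are placeholders
import requests
import whisper

XOS_VIDEOS_API = "https://xos.example.org/api/videos/"  # placeholder URL
model = whisper.load_model("base")

def transcribe_one():
    # GET a video that doesn't yet have a transcription (hypothetical filter)
    response = requests.get(XOS_VIDEOS_API, params={"transcribed": "false"})
    response.raise_for_status()
    videos = response.json()["results"]
    if not videos:
        return False  # nothing left to do

    video = videos[0]
    # Run Whisper over the video file (ffmpeg can read the URL directly)
    result = model.transcribe(video["file_url"])

    # POST the transcription JSON back to the video record
    requests.post(
        f"{XOS_VIDEOS_API}{video['id']}/transcription/",
        json={"text": result["text"], "segments": result["segments"]},
    ).raise_for_status()
    return True

if __name__ == "__main__":
    while transcribe_one():
        pass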

The transcribed data has the full text in a JSON field, along with segments that each carry start and end timecodes in seconds:

{
  "text": " When Prime Minister Menzi announced a new form of conscription in November 1964, it was accepted by most of the population without question.",
  "language": "en",
  "segments": [
    {
      "id": 0,
      "start": 0.0,
      "end": 23.32,
      "seek": 0,
      "text": " When Prime Minister Menzi announced a new form of conscription in November 1964, it was",
      "tokens": [
        50364, 1133, 9655, 6506, 6685, 3992, 7548, 257, 777, 1254, 295, 1014,
        12432, 294, 7674, 34314, 11, 309, 390, 51530, 51530, 9035, 538, 881,
        295, 264, 4415, 1553, 1168, 13, 51718, 51718
      ],
      "avg_logprob": -0.22478310267130533,
      "temperature": 0.0,
      "no_speech_prob": 0.2515774071216583,
      "compression_ratio": 1.1864406779661016
    },
    {
      "id": 1,
      "start": 23.32,
      "end": 27.080000000000002,
      "seek": 0,
      "text": " accepted by most of the population without question.",
      "tokens": [
        50364, 1133, 9655, 6506, 6685, 3992, 7548, 257, 777, 1254, 295, 1014,
        12432, 294, 7674, 34314, 11, 309, 390, 51530, 51530, 9035, 538, 881,
        295, 264, 4415, 1553, 1168, 13, 51718, 51718
      ],
      "avg_logprob": -0.22478310267130533,
      "temperature": 0.0,
      "no_speech_prob": 0.2515774071216583,
      "compression_ratio": 1.1864406779661016
    }
  ]
}
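
Because the start and end values are plain seconds, turning a stored transcription into a timecoded view (like the prototype and the failure examples further down) only takes a few lines; a minimal sketch:

# print_timecodes.py — walk the segments of a stored transcription and
# print each one with an HH:MM:SS start timecode
import json

def print_timecoded(transcription_path):
    with open(transcription_path) as f:
        transcription = json.load(f)
    for segment in transcription["segments"]:
        hours, remainder = divmod(int(segment["start"]), 3600)
        minutes, seconds = divmod(remainder, 60)
        print(f"{hours:02}:{minutes:02}:{seconds:02} {segment['text'].strip()}")

# e.g. print_timecoded("012345_transcription.json")  # illustrative filename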

Prototypes

We built a first prototype as a Django template view in XOS, which allowed ACMI staff to jump straight from our video collection admin to a graphical representation of the transcribed data.

A screenshot of the XOS video view
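
In Django terms that first prototype isn’t much more than a template view; a sketch of the shape (the model, field, and template names here are illustrative, not the actual XOS code):

# views.py — illustrative only: model, field, and template names are
# placeholders rather than the actual XOS code
from django.views.generic import DetailView

from .models import Video  # hypothetical model holding the transcription JSON

class VideoTranscriptionView(DetailView):
    """Render a graphical, timecoded view of a video's transcription."""
    model = Video
    template_name = "videos/transcription.html"

    def get_context_data(self, **kwargs):
        context = super().get_context_data(**kwargs)
        # Pass the stored segments straight to the template for rendering
        context["segments"] = self.object.transcription.get("segments", [])
        return context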

Staff testing led us to explore the Whisper utils for exporting .srt and .vtt subtitle files for media players and HTML video elements. We exposed those on the XOS Videos API too, so we can now deploy auto-generated closed captions to in-gallery devices as well as to HTML video players in XOS and on our website.
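
Under the hood that export is only a couple of calls; a minimal sketch, assuming the write_srt/write_vtt helpers that whisper.utils shipped with at the time (newer Whisper releases may organise these helpers differently):

# export_subtitles.py — write .srt and .vtt files from a Whisper result
import whisper
from whisper.utils import write_srt, write_vtt

model = whisper.load_model("base")
result = model.transcribe("012345_first_5_minutes.mp4")

# Subtitle files for media players and HTML video elements
with open("012345.srt", "w", encoding="utf-8") as srt_file:
    write_srt(result["segments"], file=srt_file)

with open("012345.vtt", "w", encoding="utf-8") as vtt_file:
    write_vtt(result["segments"], file=vtt_file)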

Production

Whilst it’s still very much an experiment, we’ve now deployed video transcriptions to the ACMI website.

See an example of it on the work page for Patterns by Margaret Haselgrove: https://www.acmi.net.au/works/78690--patterns/

Left: The ACMI website work page for Patterns. Right: the pop-up video view showing auto-generated transcriptions and closed captions.

Results

Overall we have been really happy with the accuracy of the “base” model transcriptions. Things Whisper struggles with in our videos are:

  • Names of people
  • Names of places
  • Music without lyrics in between sections of dialogue
  • Our name — it transcribes ACMI as ACME!

But failed transcriptions usually follow a recognisable pattern:

00:00:30 Oo...
00:00:31 Oo...
00:00:32 Oo...
00:00:35 Oo...

00:05:06 I don't know.
00:05:10 I don't know.
00:05:14 I don't know.
00:05:18 I don't know.
00:05:22 I don't know.
00:05:32 I don't know.

So it’ll be reasonably easy to run some queries over the transcription data to select videos that require a larger Whisper model for better accuracy.
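
As a starting point, a heuristic as simple as “how often does the most common segment text repeat?” catches the patterns above; a rough sketch (the 50% threshold is arbitrary):

# flag_failed_transcriptions.py — rough heuristic for spotting the
# repeated-segment failure pattern shown above
from collections import Counter

def looks_failed(transcription, threshold=0.5):
    """Return True if a large share of segments repeat the same text."""
    texts = [segment["text"].strip() for segment in transcription["segments"]]
    if not texts:
        return True  # no segments at all is also suspicious
    _, most_common_count = Counter(texts).most_common(1)[0]
    return most_common_count / len(texts) >= threshold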

Organisational benefits

Our Collections team are excited at the prospect of using the auto-generated transcriptions to help catalogue videos, as we will always have limited cataloguing staff time.

Our video creation team can also envisage using transcriptions to generate semi-automated super-cuts from our video collection.

Next steps

  • We’d like to develop a versioning strategy to keep track of which models we used for video transcriptions
  • We’d like to explore transcriptions on GPUs — both in new cloud node pools, and also locally on external GPUs connected to laptops
  • We’d like to build a video search for transcriptions

We’d love to hear what you think of it all, what you’d like us to build next, and what you’re building with Whisper or other transcription services.
