Captions and alternative text: to automate or not to automate?

Last month I was fortunate to have the chance to travel to Montreal, Canada for the 13th annual Web for All (W4A) conference. The quality of papers was exceptional and there were many great presentations providing lots of food for thought around a number of accessibility topics. I really enjoy attending conferences where cutting-edge research challenges your thinking and this year was no exception.

While many different topics and viewpoints were discussed both formally and informally, the one I’ve kept thinking about since returning home relates to the benefits and issues of automated accessibility processes. This was highlighted by a great paper on automated captions and the timeliness of Facebook’s recently announced automated alternative text (AAT) feature.

Facebook's AAT: promise and peril at the same time

To start with the latter: the feature, available to most English-language users of the Facebook mobile app, uses image recognition software to generate alternative text automatically for people who are blind or vision impaired. As Matt King, a Facebook engineer who is blind, explained in a TechCrunch article, the benefits, if it works, are significant: “You just think about how much of your news feed is visual — and most of it probably is — and so often people will make a comment about a photo or they’ll say something about it when they post it, but they won’t really tell you what is in the photo.” In essence, automated alternative text aims to fill that gap.

There are many demonstrations of its success around the web, and for the most part it seems to get things right with startling accuracy. With fairly detailed descriptions such as “One person frowning in front of a blue car”, there are examples where the alternative text is both informative and profoundly beneficial to people who are blind or vision impaired.
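To make that concrete, here is a rough sketch of how such a description might be assembled from an image classifier's output. Facebook hasn't published its AAT implementation, so the function and the tag/confidence values below are purely illustrative; the composition step simply mirrors the hedged "Image may contain" phrasing the feature is reported to use.

```python
# Illustrative sketch only: compose_alt_text and the tag/confidence pairs
# are hypothetical, not Facebook's actual AAT implementation.

def compose_alt_text(tags: dict[str, float], threshold: float = 0.8) -> str:
    """Build a hedged alt string from concept/confidence pairs,
    keeping only the concepts the model is reasonably sure about."""
    confident = [concept for concept, score in tags.items() if score >= threshold]
    if not confident:
        # Admitting uncertainty beats guessing wrongly for screen reader users.
        return "Image: no confident description available"
    return "Image may contain: " + ", ".join(confident)

# Hypothetical classifier output for the example photo above:
tags = {"one person": 0.95, "frowning": 0.88, "blue car": 0.91, "outdoor": 0.97}
print(compose_alt_text(tags))
# Image may contain: one person, frowning, blue car, outdoor
```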

However, before we declare victory and pledge never again to dabble with what goes inside alt="" markup, an important question to ask is: what does the blind and vision impaired community have to say? Judging by the responses on the AppleVis website, the answer appears to be quite a lot, and not all of it positive. Early comments from users indicate frustration with the feature, particularly because it appears to be limited to specific English language settings, and because its accuracy, while remarkable at times, currently lacks consistency across a broad range of photo types. Facebook has indicated that additional languages will be supported and that AAT's accuracy will improve over time, but the reality is that such tools only work if they can be relied on for accuracy, and as long as AAT continues to confuse an open refrigerator with a group of people smiling, it will remain in the 'not quite there yet' category.

This story seems familiar somehow… YouTube, 2009 perhaps?

If you’re thinking to yourself that this story sounds strangely familiar, you’re not alone. Using automation to assist with accessibility is not new, even for social media. In 2009, YouTube introduced its automated captions feature to a great deal of excitement from the Deaf and hearing impaired community and from developers. By letting Google’s speech recognition technology caption videos automatically, the feature had the potential to make millions of hours of inaccessible video available to people with hearing-related disabilities within a day or two: developers no longer had to worry about what was then considered a tedious and difficult process, and people with disabilities could gain access to a wealth of online content.

The reality, however, was that, as with Facebook’s AAT, the technology couldn’t be relied upon. If a video had good quality audio, was essentially a talking head, and the speaker had an American accent, the automated captions did a reasonable job. Stray even slightly from these criteria, though, and the captions became a hilarious jumble of random words, a fact highlighted by videos on YouTube itself. At the time the feature was also limited to certain languages, and Google pledged, as Facebook is doing now, that additional languages would be supported and that quality would improve over time.

Fast forward to 2016 and, to some degree, this has proven true. YouTube’s automated captions have certainly improved since launch, and while it remains risky to publish them without checking the output, they are much better than they were and have produced some unexpected benefits for developers.

The middle ground

At accessibility conferences I often hear arguments about whether automation should be used at all, with the camps generally ranging from ‘automation is inaccurate, does more harm than good and is a useless gimmick’ to ‘automation is a developer’s salvation: it saves time and money, and ultimately something is better than nothing’. In the Professional Certificate in Web Accessibility course I teach, students have to caption a two-minute video, and we give them the choice of captioning from scratch, using automated captions, or combining the two, that is, using the automated captions and then editing out the errors.

It’s this ‘middle ground’ approach that turned my attention to a great paper at W4A entitled ‘The Effects of Automatic Speech Recognition Quality on Human Transcription’ by Yashesh Gaur (Carnegie Mellon University), Walter S. Lasecki (University of Michigan), Jeffrey P. Bigham (Carnegie Mellon University) and Florian Metze (Carnegie Mellon University). The research focused on finding a ‘sweet spot’ in which automated and manual captioning could work together for real-time captioning. The presentation made the point that even under the best of circumstances, automated captions are still likely to have around a 15% error rate, which is not good enough to rely on in its own right. A second demonstration showed what happens when a person tries to caption manually in real time: the result was far worse, as humans simply can’t keep up with the dialogue of a video, even at a fairly slow speaking pace.
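For anyone unfamiliar with how figures like that 15% are measured, the usual metric is word error rate (WER): the word-level edit distance between the automated transcript and a human reference, divided by the length of the reference. A minimal sketch:

```python
# Word error rate (WER): word-level Levenshtein distance between a
# reference transcript and an ASR hypothesis, divided by reference length.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Classic dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting every reference word
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting every hypothesis word
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

A 15% WER may sound small, but it means roughly one word in seven is wrong, which is very noticeable in live captions.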

The third option is to use the automated captions but, instead of relying on them as-is, edit them on the fly as the video plays. The result was impressive: for the most part, errors could be identified and corrected while the video was playing. The research suggests this is workable as long as the error rate sits in the range of roughly 15-35%; beyond that, the automated captions offer no real benefit.

While I’ve heard some debate over the level of accuracy required to make such a process work, and even debate within my own organisation about whether familiarity with the video and other optimal conditions would be needed, it does raise the point that using automation doesn’t have to be an either/or decision. Indeed, since YouTube’s 2009 launch of automated captions, the availability of free caption editors, including YouTube’s own feature and great tools like Amara, means that tweaking automated captions is no longer the onerous process it once was. And because the automated captions get the timing right with pinpoint accuracy, they make a developer’s work much easier in many scenarios. Naturally this won’t help much if the video features a teenager screaming at a camera while flying past on a skateboard, but for professionally produced video the automation has reached a point where it’s likely to be helpful.
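To illustrate the ‘keep the timing, fix the words’ workflow, here is a small sketch. The cue structure is a simplified stand-in loosely modelled on WebVTT; in practice you would do this in an editor such as Amara or YouTube’s own caption tool rather than in code.

```python
# Simplified caption-editing sketch: the Cue structure and apply_corrections
# helper are illustrative, not any real caption tool's API.

from dataclasses import dataclass

@dataclass
class Cue:
    start: str  # e.g. "00:00:01.000"
    end: str    # e.g. "00:00:04.000"
    text: str

def apply_corrections(cues: list[Cue], corrections: dict[int, str]) -> list[Cue]:
    """Replace the text of mis-recognised cues; the automated timings,
    which are usually accurate, are left untouched."""
    return [Cue(c.start, c.end, corrections.get(i, c.text))
            for i, c in enumerate(cues)]

# Hypothetical automated output with one recognition error:
auto = [
    Cue("00:00:01.000", "00:00:04.000", "welcome to the weather for all conference"),
    Cue("00:00:04.000", "00:00:07.000", "today we look at automated captions"),
]
fixed = apply_corrections(auto, {0: "welcome to the Web for All conference"})
```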

Likewise for Facebook’s AAT: if developers can use the feature as a first run at alternative text, leaving them to focus on editing the text rather than creating it from scratch, automation may again prove useful, especially for image-rich content such as a photo gallery, presumably the very scenario Facebook had in mind.
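A sketch of what that first-run-then-edit workflow could look like, with invented field names rather than any real CMS’s API:

```python
# Illustrative only: treat machine-generated alt text as an unreviewed
# draft that a human must confirm or correct before it is published.

from dataclasses import dataclass
from typing import Optional

@dataclass
class AltText:
    draft: str                   # machine-generated first pass
    final: Optional[str] = None  # human-approved text; None until reviewed

    @property
    def publishable(self) -> Optional[str]:
        # Only reviewed text is safe to expose to screen reader users.
        return self.final

alt = AltText(draft="Image may contain: one person, smiling, indoor")
alt.final = "A smiling presenter on stage at the W4A conference"
```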

Use it, then fix it

Regardless of which side of the fence you sit on when it comes to automation, it’s clear that these processes are here to stay, and I suspect we’ll see even more uses of automation for accessibility in the future. The key message, in my view, isn’t about avoiding automated solutions because of their accuracy issues. It’s about embracing benefits such as the accurate timing of automated captions or a somewhat-accurate first draft of alternative text, then making absolutely sure the text is corrected before it is published so that people with disabilities can rely on it. In essence, it’s fine to use it, but it’s imperative to fix it, so that inaccurate automation isn’t seen as yet another accessibility problem when the underlying technology can ultimately help. With more innovation in this space, I’m looking forward to seeing what accessibility benefits automation brings us next.