WCAG Audio Description Requirements for Government Videos

[00:00:05] Speaker A: Three, two, one.

[00:00:11] Speaker B: Today, we're diving into a topic that, on paper, sounds pretty straightforward until you actually have to implement it. New federal accessibility requirements tied to WCAG 2.1 will expand expectations beyond traditional closed captioning to include audio description for government communications. Now, many of you watching have probably experienced audio description before, whether you realized it or not, possibly on YouTube or Netflix. It's the additional narration track that describes what's happening on screen during natural pauses in dialogue, which helps visually impaired audiences understand scenes, actions, or context that isn't spoken. Which sounds great, and it is. But for a lot of organizations, it also raises the question: wait, how exactly are we supposed to do that? To help us break this down, we're joined by Nathan, founder of Castus, and Matthew, director of engineering at Castus as well. Now, Castus has been a key player in the government and community media space, particularly around cloud and captioning solutions used by municipalities, public access channels, and state agencies. Nathan, Matthew, thanks so much for joining us today.

[00:01:23] Speaker C: And on our inaugural podcast. Yeah, super exciting.

[00:01:28] Speaker B: Ton of questions for both of you.

[00:01:30] Speaker C: And then I understand we're going to get to a little bit of show and tell, is that right?

[00:01:33] Speaker A: Absolutely.

[00:01:34] Speaker C: Excellent. So, Nathan, I know you've been involved in this for a very long time. Can you tell us about the state of things and where they are right now?

[00:01:41] Speaker D: So this is a very exciting topic for us. Especially over the last couple of years, talk about audio descriptions has emerged. So we spent a lot of time researching and discussing with customers what potential workflows are going to be.
And of course, seeing as we're already in broadcast with cable playout scheduling, streaming, and captioning, we absolutely want to position ourselves as a leader for audio description services. So there's a big bill to fill. There are a lot of questions, there's a lot of hype, there's a lot of concern as to how we're going to create audio descriptions for all of our content.

[00:02:14] Speaker C: For everyone out there listening and watching, perhaps we should set some kind of technical guidelines. Right, because there are going to be some acronyms. WCAG. I mean, it doesn't really roll off the tongue. What does that stand for?

[00:02:24] Speaker D: WCAG. This is the 2.1 AA-level guideline, specifically. As we all know, years ago, captioning became a requirement. And in this new 2.1 AA, this is where audio descriptions become a requirement. When you think about audio descriptions, a lot of folks may or may not have seen them. There have been some misconceptions: is that where the closed captions say "[APPLAUSE]"? No, no, no. This is a narration track that describes what's being seen during gaps of silence. So if you think about this, that's going to be difficult for a human to do, maintaining that workflow and consistency one to five times a week. That's a pretty big burden. So this is definitely where AI comes in to save the day.

[00:03:08] Speaker C: So you're saying AI is being used for good?

[00:03:10] Speaker D: It's being used for good, yeah. It is still scary how good it really is.

[00:03:14] Speaker C: A lot of the municipalities and city governments and folks we've been talking to are concerned about what the law does spell out and what it doesn't spell out. For example, does your back catalog of content need to have this introduced?

[00:03:28] Speaker A: Absolutely.

[00:03:28] Speaker D: So the good news is that, as of April 24 this year, towns with populations above 50,000 only have to start creating audio descriptions on new content after April 24th. That means you do not have to go back through all of your archives from years or decades ago and create audio descriptions for all of that. It's just everything from the 24th moving forward.

[00:03:46] Speaker C: So no back catalogs have to be done?

[00:03:48] Speaker D: No back catalogs. You're good to go as long as you're doing everything moving forward past that April 24 deadline. Now, the way the law is written, it speaks about having audio descriptions that describe key visual elements. That covers everything from who's saying what, to people on screen, to items, like if there's a special presentation, maybe an award ceremony, and of course PowerPoint presentations. We all know there are a lot of presentations during council meetings, finance committee meetings, and all of that sort of stuff. So really, the rule of thumb we've used is: if you close your eyes and listen to the meeting, can you tell what's going on without an audio description to help you understand what's being seen on screen? If you can't, you're not meeting the guidelines.

[00:04:33] Speaker C: So it sounds like you've just inadvertently defined the difference between closed captioning and audio description. A lot of times closed captioning or open captions can have screen direction or what the talent is doing. But you're saying that if you can

[00:04:46] Speaker B: close your eyes, what you see in

[00:04:48] Speaker C: your mind's eye is something that you should have as an audio description, that closed captioning just doesn't handle.

[00:04:52] Speaker D: Absolutely.
Another important thing to point out is that when you're watching a video with audio descriptions, there is no minimum word count. In the caption world, when captioning became mandatory, it had to be 95% or greater accuracy. With audio descriptions, there's no definition like that, and no minimum word count. Where it's very clear is that the key visual elements are identified, so that you can listen to the video and have a good understanding of what's going on in it.

[00:05:26] Speaker C: Well, from a technical perspective, one of the things we've always had to deal with when introducing automated captioning is that some names may get spelled wrong, places may be misspelled. How do you see that affecting the audio description tech that you're deploying?

[00:05:44] Speaker A: So this is something that we've actually pioneered a lot in, without giving away too much information about how our super cool technology works. There are several steps in this process, some AI related and some not, to course correct and make predictions that are as accurate as possible. There are special things in there for building profiles on people it sees. Let's say the software sees a placard while someone's speaking towards the beginning of a city council meeting. It starts creating profiles on people so that it can remember and reflect that information, maybe at the end of the meeting, when a placard isn't visible. So this is so much more than just "let's give it to the AI and see what it can figure out." We have specific steps as part of the process to confirm, I guess is probably the best way to describe it, that what it's seeing is accurate.

[00:06:48] Speaker C: That's outstanding. I think we've kept everyone in suspense enough.

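The profile idea Speaker A outlines (learn a name while a placard is visible, then reuse it later when the placard is out of frame) can be illustrated with a small sketch. This is not Castus's implementation, which the speakers decline to describe. The `Sighting` type, the `person_key` field, and the `label_sightings` function are hypothetical names invented here, and the sketch assumes some upstream step already groups detections of the same person under one key.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sighting:
    """One hypothetical detection of a person on screen."""
    time_s: float                # position in the video, in seconds
    person_key: str              # assumed stable ID for the same face or voice
    placard_text: Optional[str]  # name read from a visible placard, if any


def label_sightings(sightings):
    """Name each sighting, reusing names learned from earlier placards."""
    profiles = {}  # person_key -> name learned so far
    labeled = []
    for s in sorted(sightings, key=lambda item: item.time_s):
        if s.placard_text:
            # A placard is visible: create or update this person's profile.
            profiles[s.person_key] = s.placard_text
        # Fall back to a generic label when no profile exists yet.
        labeled.append((s.time_s, profiles.get(s.person_key, "Unidentified speaker")))
    return labeled


# Invented sample data for illustration only.
sightings = [
    Sighting(12.0, "person-1", "Mayor Cassandra Chase"),
    Sighting(95.0, "person-2", None),
    Sighting(3400.0, "person-1", None),  # placard no longer visible
]
result = label_sightings(sightings)
```

In this sketch the third sighting still resolves to the mayor's name because the profile was built from the placard seen at the start of the meeting.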
[00:06:51] Speaker B: I say we get plugged in and

[00:06:53] Speaker C: then come back and show everyone exactly what the software does.

[00:06:57] Speaker E: Title card: Huntsville City Schools logo on light background. Huntsville City Schools Board of Education meeting.

[00:07:09] Speaker D: So these are important because this is the beginning of the video, and if you just have this title, just the graphic, you have no context as to what, where, date, that sort of information.

[00:07:18] Speaker E: So board members Matthews and colleagues observe moment of silence at dais. Huntsville City Schools board members and audience bow heads after Pledge of Allegiance.

[00:07:28] Speaker F: All right, this time I'll turn it over to Senate Coordinator Dr. Rachel Evans for sighted sighting, Shed 30.

[00:07:34] Speaker E: Dr. Evans approaches podium, smiling.

[00:07:36] Speaker F: Holy bait.

[00:07:37] Speaker A: So here's an example. There wasn't a placard, and it's still doing speaker identification.

[00:07:45] Speaker C: It's pretty impressive.

[00:07:46] Speaker E: Poses with TVA $5,000 STEM grant check.

[00:07:49] Speaker D: Now, looking at the scene, it's very difficult to understand what's going on in this presentation. And it was even able to pick out the fact that there was a $5,000 check.

[00:07:56] Speaker E: So after this next check, Dr. Sutton hands certificate to school counselor. Both smiling.

[00:08:03] Speaker D: Now it knows that that's a certificate, not a check. What's also really important to note is that we uploaded this video to our new cloud AI audio description platform. We wanted to see what we were going to get back without touching anything whatsoever. The results came back like this. Of course, when we show off the user interface, there is a full-blown editor. Once it scans and analyzes the video, it creates a text version of what our tech is going to turn into the text-to-voice audio track.
But before that actually happens, it gives you a text version of everything to approve. But this is untouched. You know, our AI built this the way it is. We did zero human intervention on it.

[00:08:41] Speaker C: Is the data secure? Are your models being trained on this content after it's been pushed to the cloud?

[00:08:47] Speaker A: So it is extremely secure. Castus takes data security very seriously. We are actually not retraining on this data. We would definitely explicitly let our customers know if we decided to do that, but we don't want there to be any accidental invasion of privacy. So we're not retraining with the data. This is all stuff that we built before, with data that was independent of what our customers provide. Our customers can definitely feel safe and secure using our product. We're not going to use your content to improve our product. We're not going to be lax with security either.

[00:09:31] Speaker C: And what does the software give you as an output? What's that process?

[00:09:34] Speaker A: So when you actually use this in practice, there are two steps. You're first going to submit it for preprocessing and analysis, and you're actually going to get a VTT back. Now, you can manually inspect this VTT if you want, but we've conveniently given you an editor. You have your timeline, and you'll see where all your descriptions are on that timeline.

[00:09:54] Speaker D: You go to the video in your library, you click the action dropdown on that file, and then you choose Generate Audio Descriptions. It's that simple. Once you select Generate Audio Descriptions, you're presented with the screen here, which brings up your bank of minutes and hours that you've pre-purchased. You're going to get notifications as jobs are complete and ready to review. And this is the wonderful editor that we built. There is a full video and audio timeline so you can preview.
Because we knew that, to remember what the audio description is describing, it's very important that you have video and audio to reference.

[00:10:24] Speaker A: Sure.

[00:10:24] Speaker D: So you can tell if it's quality, if it's accurate. And on the right-hand side you have all of the text windows, represented with the time codes, of what it's going to say. So this is just text. Once we click Publish, our software takes that text and converts it to a synthesized voice. That's the audio description.

[00:10:43] Speaker C: You can change those voices?

[00:10:45] Speaker D: You can change the voices, yep. In your account you select what voice type you want: male, female, accent, whatever. However, the more we've been doing this for customers, the more we realize it is important to have a voice that does sound somewhat AI, because you don't want someone who's visually impaired misinterpreting it. This is an audio description, not someone in the video. So you do want it to be distinctly different.

[00:11:08] Speaker C: And what are we looking at in terms of turnaround time?

[00:11:11] Speaker A: You would be shocked how fast this is. It takes longer to upload the video than it does to process it. I don't know, like 15 minutes, I think, is what the actual analysis took for the most recent two-hour samples I was running.

[00:11:24] Speaker D: And there's one more thing that we want to define: what they mean when they say "during gaps of silence." The reality is that sometimes when there's no speaking, there's applause, there's white noise, there's background noise. We spent a lot of time defining what a gap of silence is. So you'll notice in our example it's talking over applause. Sometimes the voice descriptions will describe when you and I are just having, like, banter conversations. It's not important audio.
It's going to take that as a moment of silence to provide audio descriptions, because it knows it's not interrupting something important, which is pretty cool. So again, these are real-world samples that we got from our customers. I want to play this one because it's particularly impressive. I'm going to just let it speak for itself.

[00:12:08] Speaker G: Welcome to this meeting of the Lakewood City Council.

[00:12:13] Speaker E: Scout troop leader guides children toward dais front. Cub Scouts in uniform gather at dais for pledge. Scouts salute as troop leader Troop 21 stands at dais to the fray.

[00:12:25] Speaker F: The United States of America.

[00:12:28] Speaker G: Great job, Scouts. Can we give them a round of applause?

[00:12:31] Speaker E: Scouts and troop leaders gather at dais. Mayor Cassandra Chase reviews notes at dais.

[00:12:35] Speaker G: Good evening everyone.

[00:12:36] Speaker B: And.

[00:12:37] Speaker D: Okay, so I think that's enough. It really gets the point across. What I loved about this clip that we're watching is that it's standard definition. It's four by three. For all of you that are not seeing the screen or the video: the entire audience stands up and blocks the wide shot, but in between people you can see what look to be kids, Cub Scouts, walking on. And it saw that and described that. It wasn't Boy Scouts; it knew it was Cub Scouts. And then we cut around to a close-up, like a medium shot, looking down the podium at the Scouts, and it's recognizing that the badges on their shoulders say Troop 21. And then it proceeds to cut to the mayor, Cassandra Chase, who was not even introduced in this video. It has her name placard in front of her. They've just put their name placards at the bottom of the dais in front of each member. And that's enough for our AI to know Cassandra Chase is the mayor.

[00:13:29] Speaker C: Now, for folks who are viewing this on another VOD platform like YouTube, how does YouTube handle the audio description?

[00:13:35] Speaker A: There are a couple of ways you can do this. I think some YouTube channels are actually allowed a proper secondary audio track, but from my understanding, access is extremely limited and it's an experimental technology. But we allow you to have your secondary audio track burned in, so you can just upload a second piece of content, and you're legally covered doing that.

[00:14:01] Speaker D: And the unfortunate, I guess the bad news is that this WCAG requirement does not fall on us, the vendors. It actually falls on the organizations themselves, the city governments, to comply, which is what makes it very challenging. The OTT platforms do support secondary audio programming. Castus integrates with all of those, so those will be accepted. Facebook does not even accept SAP, secondary audio programming. Good thing that this law is only about VOD, not live stream. But you know, captioning was for file-based content only at first, too.

[00:14:37] Speaker B: So what I do want to hear

[00:14:38] Speaker C: a little bit about is that we haven't seen any other companies come to market with something like this, or it's prohibitively expensive. So can you tell me where the genesis of this came from and how bleeding edge this technology is?

[00:14:52] Speaker D: That's what we're most excited about, because we're not borrowing anything from anyone. We're able to introduce a brand new, unseen product and service that people have only been able to think about how it might operate. And with that, we have the luxury of being able to bring down the cost to our customers so they can create more accessible content, not just the descriptions, but afford to add captions, translations, dubbing, meeting summarization, audio descriptions, and extended audio descriptions without breaking the budget, without leaving them thinking, how am I going to afford all this? How are we going to be able to afford to have videos compliant?
It's going to make us bankrupt. What we've done is create services and solutions that allow them to meet the needs of this entire WCAG requirement more effectively and more efficiently from a cost standpoint, but also, first and foremost, to provide value to their community and their citizens, so that they have the information they need and everyone can stay in tune with the content that they're creating.

[00:15:49] Speaker C: So did you have to say, you know what, we're going to invent our best practices for this new emerging standard and preach that, right, evangelize these best practices so everyone gets started on the right foot?

[00:16:01] Speaker A: So because our software provides speaker identification and other stuff like that, we look to, you know, make placards visible, make sure lower thirds don't block placards, and make sure your slides are clearly visible, so that your content is best described for your visually impaired audience.

[00:16:25] Speaker C: That's almost the tail wagging the dog. Because now you're saying that for all of the setups that city chambers and different locations have, all your PTZ presets, or all the camera people that are doing the work, you have to frame a little bit differently to get those name placards, to make sure that nothing's being obstructed by a lower third, etc.

[00:16:43] Speaker A: And it's definitely not game over. Let's say you don't follow our best practices: the software will still create the best audio descriptions it possibly can with the information that it sees. And it's actually what it's hearing, too, that's combined to make the best audio description possible.

[00:17:01] Speaker D: Your framing: thinking about not relying on a wide shot as frequently, getting close-ups of people as they're speaking, addressing people by their names and titles. You know, "Deputy Mayor Geiger." PowerPoint presentations.
Feed them into the production switcher. Because those are some very key visual elements that are important when they're talking about budgets, or comparing last year's budgets to this year's, or spending, or contracts they're approving, whatever it may be. And then we reached out to some of our customers that are still on analog cameras. They're still four by three, because we wanted to see how and if we could break the system. And I don't want to say to our surprise, but we're looking at the results going, dang, dude, it reads better than I do. I couldn't even tell what was on screen, and it's able to interpret and describe what's being seen on screen.

[00:17:44] Speaker A: I mean, I think the first time I watched that example you're talking about, my brain took longer to catch up with what was happening than it took the AI.

[00:17:53] Speaker D: We were definitely celebrating after this.

[00:17:55] Speaker A: Yeah, yeah, those were pretty exciting. Better results, I mean, really early on, than I could have hoped for.

[00:18:01] Speaker C: Looking down the road, you could see where this could at some point be mandated for live broadcast as well.

[00:18:07] Speaker D: I mean, that was one thing we initially thought of when playing with this. It was like, oh my gosh, we could actually give this to sports teams, and they could have their commentator be a voiceover: our audio description service, as long as the numbers on their jerseys are clearly labeled, could narrate the whole game.

[00:18:24] Speaker A: I want to also point out: be on the lookout for live stream support. That is definitely a big focus of mine, and ours, for the future. And I can totally see WCAG requiring this.

[00:18:40] Speaker D: Yeah, we know that it's not required today, but we know things change, and with change, we want to be prepared.

[00:18:47] Speaker A: You know, now that we've pioneered this technology, we'll be way ahead of the due date for other checkpoints in the future. So we plan to support live way before any law requires it, giving our customers some breathing room.

[00:19:01] Speaker D: And their point was like, well, the law says that if everything is described by the person speaking, then we're good. And I said, well, technically, yeah, but there's a lot on these slides: charts and graphs and numbers and everything. You're now putting your faith in it all being described by this one presenter. Why would you risk that?

[00:19:23] Speaker A: Yeah, I mean, when something isn't explicit, me personally, I get more scared. I'm going to insure myself by using something that was designed to solve this problem.

[00:19:32] Speaker D: So are you telling me you're going to go through and watch these presentations after the fact and preview them? Preview the 48-minute presentation about zoning prior to publishing it on YouTube to make sure, and that means paying close attention to it, that every single element on that screen was described audibly?

[00:19:50] Speaker B: You know who does that? That's called QC. And you know what, you pay for QC.

[00:19:54] Speaker D: Yes, you do. I think you and I, we've probably all done QC, and there's a reason why we're not doing it anymore. Yeah, it's so fun. This is the insurance, too, that you don't have to sit there. It's doing it much faster than any human ever could, and it's truly 100% paying attention to audio and video to ensure that everything is accurately described to the level that WCAG requires, and making no mistakes doing it.

[00:20:21] Speaker C: Nathan, Matthew, thank you so much for joining us. Thank you so much for bringing us this technology, bringing the world this technology, and doing it in a way that's not going to cause everyone to pull out their hair.
So thank you so much for joining us today.

[00:20:34] Speaker A: Yeah, thank you. Absolutely.

[00:20:35] Speaker D: It was a pleasure.

[00:20:37] Speaker B: Thanks for watching.

[00:20:38] Speaker C: Broadcast to post. Don't forget to follow Keycode Media on social and contact us about your [email protected].
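
The two technical points from the episode, that the analysis step returns a VTT file of timed descriptions and that descriptions are placed in gaps between important speech, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sample cue times are invented, the speech intervals are assumed to come from some separate speech-detection step, and `parse_cues` and `find_gaps` are hypothetical helper names, not part of the Castus product.

```python
import re

# WebVTT timestamps look like hh:mm:ss.ttt, with the hours part optional.
TIMESTAMP = re.compile(r"(?:(\d+):)?(\d{2}):(\d{2})\.(\d{3})")


def to_seconds(stamp):
    """Convert a WebVTT timestamp string to seconds."""
    hours, minutes, seconds, millis = TIMESTAMP.fullmatch(stamp).groups()
    return int(hours or 0) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def parse_cues(vtt_text):
    """Return (start, end, text) for each cue in a simple WebVTT document."""
    cues = []
    for block in vtt_text.strip().split("\n\n"):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:
                # Ignore any cue settings that follow the end timestamp.
                start, end = [part.strip().split()[0] for part in line.split("-->")]
                cues.append((to_seconds(start), to_seconds(end), " ".join(lines[i + 1:])))
                break
    return cues


def find_gaps(speech, min_len=2.0):
    """Return (start, end) gaps of at least min_len seconds between speech intervals."""
    gaps = []
    cursor = 0.0
    for start, end in sorted(speech):
        if start - cursor >= min_len:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    return gaps


# Invented sample: description text echoes the demo, timings are made up.
SAMPLE = """WEBVTT

00:00:01.000 --> 00:00:04.500
Title card: Huntsville City Schools Board of Education meeting

00:00:27.000 --> 00:00:29.000
Dr. Evans approaches podium, smiling
"""

cues = parse_cues(SAMPLE)
gaps = find_gaps([(5.0, 26.0), (30.0, 40.0)])

# Check that every description cue sits fully inside some gap.
all_fit = all(any(g0 <= s and e <= g1 for g0, g1 in gaps) for s, e, _ in cues)
```

A reviewer could run a check like `all_fit` on an exported VTT before publishing to confirm that no description talks over important dialogue. As Speaker D notes, what counts as a gap in practice is broader than literal silence, since applause and banter may also qualify.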
