Robert Bleidt – Articles

Why are more consumers reading captions instead of listening?

Recent surveys, some funded by the captioning industry, report that more consumers are reading captions instead of listening to the dialogue on TV programs, and younger viewers are much more likely to do so than older viewers. Given the latter, it’s likely that hearing loss, the original reason for captions, is not driving this. What is, then? Should the audio industry do anything about it?

I believe the trend to reading captions is not caused by one single problem, but a combination of factors. In the 20th Century, dialogue intelligibility was not really considered a problem. What has changed since then?

For one, we did not have video streaming to mobile devices you could carry anywhere. TV viewing was done where there was a controlled listening environment with acceptance of the TV playing by those present. That’s not always the case today when you are watching on your personal mobile device without earphones. In this situation, captions are a good solution to overcome a loud environment or avoid disturbing others.

Also, today’s younger demographics have different habits and preferences. Many would rather text than talk, for example. So, if the audio is intelligible but captions are still their preference, there’s not an engineering problem to solve.

However, there are still plenty of technical factors to consider. Let’s begin with the content creation process:

Challenges in content creation

Making more with less. At the end of the last century, the days of the big three TV networks in the U.S. were over and we were on our way to John Malone’s “500 channel universe.” Since then, audiences have fragmented into more and more channels, driving the production of more content. According to the FX Network1, the number of scripted TV shows has almost tripled since 2002. Also, this is just “scripted” shows, not including the “reality” shows that have become prevalent due to fragmentation. This means there’s less money and less time to make the sound mix for a show, since there is a smaller audience. And, there may be more mixes to make today.

Today, most content has a 5.1 audio mix, and some content, particularly for theatrical release, has an Atmos mix. Consider the mixes needed for an “A” feature film:

  • Atmos theatrical mix
  • Atmos near-field mix
  • 5.1 near-field mix
  • Stereo near-field mix
  • Audio description mix (for the visually impaired)

These mixes would be adapted (duplicated) for each language the film is released in, often with picture edits for country censorship or regional preferences. Episodic series may not have a theatrical mix, and budget-constrained series may not make an Atmos mix. The volume of work that must be done for less time and money means less review of each mix.

Creative Concerns. My feeling is that sound mixers aspire to the practices of the most prestigious version of their craft – the big-budget “A” feature film played in a top cinema. For a sound mixer, there is no Academy Award for Most Intelligible Dialogue, only for Best Sound. The goal of the sound mixer is to implement the creative vision of the director or showrunner and perhaps gain the recognition of his peers. I’m not saying the mixer will ignore intelligibility concerns, but they may be weighed against creative concerns and judged in the ideal environment of a dubbing stage, or even worse, a near-field room. I have talked to film mixers who strongly object to the possibility of letting consumers adjust the dialogue level in their films. Often they switch from friendly to a steely-eyed stare and say “if we wanted you to hear that dialogue clearly, we would have mixed it that way.” But, hearing it on a dubbing stage does not translate to the consumer’s environment. Why?

Dynamic Range. Before the introduction of digital audio in about 2000, there was a single mix for “TV” – a stereo mix for transmission over an analog TV signal with perhaps 40 dB of usable dynamic range, and up until the last years of the 20th Century, theatrical releases of films had optical sound tracks with similar dynamic range. Sound mixers worked within these constraints while mixing. With digital audio, 80 dB of dynamic range became possible and many mixers took advantage of it. This is far too much dynamic range for home listening except in a true purpose-built home theater. Consumers will have to “ride the gain”, turning the volume down on loud effects, and then up to hear dialogue.

It is possible to reduce this dynamic range during playback by using the Dynamic Range Control setting built into modern codecs such as AC-3, AC-4, MPEG-H, or xHE-AAC. Unfortunately, the user interface of most consumer devices at best has this hidden in a settings menu as “night mode”, or at worst it’s just not implemented. Should you use “night mode” if you are watching during the day? If consumers even know of this setting, they likely do not know what it does.

Dynamic range control, or DRC, works by sending scaling gains in each frame of audio that can be applied in the decoder. In AC-3 and AAC-LC, these gains are derived from a model of an analog compressor (based on one in a Neve console). The compressor settings are controlled by encoder profiles such as “Film Light” or “Film Standard”. In newer codecs such as MPEG-H or xHE-AAC, the entire program is analyzed and the gains are calculated digitally with infinite look-ahead.

DRC does have the advantage that the scaling gains are sent separately, so that the content itself is not modified. For the few consumers that have true home cinemas, they can listen to the full-range content. Consumers in mainstream living rooms can use a medium DRC setting to avoid sound effects being too loud in comparison to the dialogue. A very strong DRC setting can be used for mobile listening in a noisy environment.
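As a rough illustration of how a decoder might apply these side-channel gains, here is a minimal Python sketch. The function name `apply_drc`, the `strength` parameter, and the per-frame list layout are assumptions made for this example, not part of any codec specification. Real decoders also smooth the gains between frames, which is omitted here.

```python
def apply_drc(frames, gains_db, strength=1.0):
    """Scale each audio frame by its transmitted DRC gain.

    frames:   list of frames, each a list of float samples
    gains_db: one DRC gain in dB per frame (negative = cut, positive = boost)
    strength: hypothetical user setting; 0.0 leaves the content untouched
              (full range), 1.0 applies the gains exactly as sent
    """
    out = []
    for frame, gain_db in zip(frames, gains_db):
        # Convert the (possibly scaled-back) dB gain to a linear multiplier.
        scale = 10.0 ** (strength * gain_db / 20.0)
        out.append([sample * scale for sample in frame])
    return out


# A loud frame is cut by 20 dB and a quiet frame is boosted by 20 dB.
frames = [[1.0, -1.0], [0.1, 0.2]]
gains_db = [-20.0, 20.0]
home_cinema = apply_drc(frames, gains_db, strength=0.0)  # full range
living_room = apply_drc(frames, gains_db, strength=0.5)  # medium setting
mobile = apply_drc(frames, gains_db, strength=1.0)       # strong setting
```

Because the gains travel next to the audio rather than being baked into it, the same stream can serve all three listeners. Only the `strength` value changes.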

Differences in consumer listening conditions

Differences in listening conditions. Feature films intended for theatrical release are usually mixed in a dubbing stage, a cinema with 100-300 seats, with a large mixing console at the rear. The speakers today are similar to those of a top-line cinema – usually a full Dolby Atmos installation. The monitoring level is usually 85 dBA SPL. Speakers in the room are equalized for the SMPTE “X-curve”, which is a 3 dB per octave attenuation above 2 kHz, to account for screen attenuation and room response.

Former Technicolor dubbing stage at Paramount Pictures
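The X-curve roll-off described above can be put into numbers with a small Python sketch. It models only the simplified description given here (flat up to 2 kHz, then 3 dB per octave). The function name and parameters are illustrative assumptions, and the actual SMPTE curve has further details this does not capture.

```python
import math


def x_curve_attenuation_db(freq_hz, knee_hz=2000.0, slope_db_per_octave=3.0):
    """Approximate X-curve attenuation in dB at a given frequency.

    Flat below the knee; above it, attenuation grows by a fixed number
    of dB for every doubling of frequency (one octave).
    """
    if freq_hz <= knee_hz:
        return 0.0
    octaves_above_knee = math.log2(freq_hz / knee_hz)
    return slope_db_per_octave * octaves_above_knee


for f in (1000, 2000, 4000, 8000, 16000):
    print(f"{f:>6} Hz: {x_curve_attenuation_db(f):.1f} dB down")
```

Under this simplified model, 8 kHz sits two octaves above the knee and is therefore about 6 dB down relative to the midrange.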

With the introduction of DVD releases, mixes began to be made in “near-field rooms”2 of similar size to a control room or mastering room for music – about the size of a large consumer living room. These rooms have a typical listening distance of 2-3 meters and can be very dry. I once asked the engineer of a prominent immersive one what the acoustic treatment was, and he said “eight inches of 703 fiberglass on everything except the floor.” Above the low bass frequencies, that’s anechoic. The monitoring level varies. The Netflix spec3 calls for 79 or 82 dB SPL, while other uses may be at the 85 dB theatrical level.

Historically, near-field mixes were made of theatrical releases so they could be released on Blu-ray or streaming services. “Made for TV” programs such as episodic dramas typically only had a near-field mix. The usual practice in mixing has been to make a theatrical mix – a process that can take weeks for a feature film – and then spend a day or two tweaking a near-field mix.

Let’s examine the differences of a near-field mix and typical consumer living room listening:

Consumers typically play their TV at much lower volumes. Studies have found typical listening levels of 60-65 dB SPL. A study by Benjamin4 found the mean listening level was 58 dBA SPL.

Consumer listening rooms are typically much more reverberant. Studies over decades have shown consumer rooms to have average reverberation times of about 400 ms. A 2010 paper by Holman and Green5, from Audyssey test data, shows this continues, though they comment that regions that construct homes with concrete, brick, or stone walls and floors may have higher RT60 times.

Consumer background noise levels are much higher. Professional rooms are typically built to an NR/NC ambient noise spec of 15 or 20. While a very quiet suburban house can sometimes make NC 20, that is not the case with HVAC noise, pets, appliances, other occupants, or activities in other rooms.

Consumers today have rear-firing TV speakers and $200 sound bars. Back in the 20th Century, TVs had front-firing speakers. A console TV of the 1970s typically had a 5 or 6 inch speaker with rather limited bass and treble response, but reasonable response in the voice range. Later decades brought stereo sound and better speakers, but as flat-panel TVs were introduced, marketing considerations led to ever thinner TVs with ever smaller bezels, relegating speakers to the rear of the TV. Today, the solution to this problem for the typical consumer, should he or she feel the need for better sound, is to purchase a $200 sound bar to sit under the $600 TV. This typically provides two or three modest speakers that are at least front-firing. Note that although a sound bar may carry a Dolby or DTS logo, that does not mean it’s capable of reproducing 5.1 channel sound, only that it can decode those signals to stereo or LCR.

Other options available to consumers in the past were to buy an Audio-Video Receiver and set of individual speakers. At the low end of the market, these were packaged together into a “Home Theater In a Box” system. Today, high-end soundbars have eclipsed the AVR and HTIB categories, since they offer much more convenience. A sound bar in the $1500 range can actually do a good job of reproducing a surround or Atmos experience, almost as good as a much more expensive AVR + speaker system. But, this is outside the budget of a typical consumer buying a $600 TV.

The website rtings.com measures the frequency response of each TV it tests. One of the best sounding TVs (green below) it tested recently was flat within 6 dB in the speech range. One of the worst (blue below) was flat within 15 dB in this range. Contrast this with a typical near-field monitoring speaker such as a Genelec, which is typically flat within half a dB in this range.

Frequency response of best-sounding 2023 TV (green) and representative poor-sounding TV (blue) from rtings.com

Distribution Issues

Mixdown. One other concern is how a sound mix is converted to other formats. A program may be mixed in an immersive or surround format, and then converted for stereo (or for mobile phones, often mono) playback in three ways: A dedicated stereo mix can be made by the sound mixer, probably starting from the immersive mix. A professional renderer/decoder can automatically render a stereo version that can be checked by the sound mixer. Or, a full immersive or surround encoded bitstream can be sent to a consumer’s device and downmixed to stereo there. This latter case may not be something auditioned during post-production and QC.

The latter case can happen since typically, a TV set advertises its capabilities internally or over HDMI so that a streaming service knows what variant of content to send. The TV may have an immersive decoder, and the user may select a premium version of the program which has HDR video and immersive sound. But, the user only has TV speakers or a two-channel soundbar, requiring the TV’s decoder to downmix the immersive content (and render audio objects if any) to stereo.

Downmixing in most consumer audio decoders is done by simply adding the appropriate channels together using standardized gains. In some formats such as MPEG-H, an energy-preserving downmix is used that avoids the comb filtering produced in many cases with a passive downmix.

Most of the time, automatic downmixing works OK. However, there is the potential for elements of some mixes to add during downmix so they more strongly mask the phantom center stereo dialogue.
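A passive downmix of this kind can be sketched in a few lines of Python. The -3 dB center and surround gains below follow the commonly used ITU-style coefficients, but treat them as an assumption for illustration: actual decoders may read different gains from bitstream metadata. The function name and signal layout are hypothetical, and the LFE channel is left out of this sketch.

```python
import math

MINUS_3_DB = 1.0 / math.sqrt(2.0)  # about 0.707


def downmix_to_stereo(left, right, center, left_sur, right_sur,
                      center_gain=MINUS_3_DB, surround_gain=MINUS_3_DB):
    """Passively fold five full-range channels into a stereo pair.

    Each argument is a list of samples of equal length. The center is
    split equally into both outputs; each surround goes to its own side.
    """
    lo = [l + center_gain * c + surround_gain * ls
          for l, c, ls in zip(left, center, left_sur)]
    ro = [r + center_gain * c + surround_gain * rs
          for r, c, rs in zip(right, center, right_sur)]
    return lo, ro


# One sample of dialogue in the center, and the same effect sample
# panned to both the front and surround channels.
dialogue_only = downmix_to_stereo([0.0], [0.0], [1.0], [0.0], [0.0])
effects_only = downmix_to_stereo([1.0], [1.0], [0.0], [1.0], [1.0])
print(dialogue_only[0][0])  # dialogue level in the left output
print(effects_only[0][0])   # effects level in the left output
```

In this toy case the correlated effect adds coherently to about 1.71 in each output, while the dialogue arrives at about 0.71. This shows one way that elements which sat comfortably apart in the surround mix can end up louder relative to the dialogue after a simple sum.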

Compression in distribution. Legacy distribution channels such as broadcasting and cable TV are subject to loudness control laws (such as the CALM Act of 2010 in the U.S.) that limit the loudness of advertisements. Although the CALM Act has “safe harbor” provisions that avoid the need for it, the simplest means of assuring compliance with the law is for each party in the distribution chain to compress the audio to a constant loudness level. This can have the perceived effect of amplifying the background elements of the audio, potentially making dialogue more difficult to understand.

TV broadcasting in the U.S. is typically composed of three tiers of distribution – Network, Local Affiliate, and Cable/Satellite. While networks can ensure compliance of most content by other means, the latter two tiers are usually limited to compression as their way to make sure there is no violation. Since 2015, this distribution chain has been disrupted by the industry move to IP delivery and consumers shifting their budget to streaming services, as shown in this graph6 for U.S. households:

There are proposals to extend the CALM Act to streaming services as well. Fortunately, most streaming content is stored for on-demand playback, and thus the loudness of all items can be normalized with static gains instead of real-time compression.
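The difference from real-time compression can be shown with a minimal Python sketch of static-gain normalization. It assumes the integrated loudness of each stored item has already been measured by an external meter. The function names and the -24 target used as a default are assumptions for illustration, since each service sets its own target.

```python
def static_normalization_gain_db(measured_loudness, target_loudness=-24.0):
    """Return the single gain (in dB) that moves a program to the target.

    measured_loudness: integrated loudness of the whole program, assumed
                       to come from a separate measurement step
    target_loudness:   hypothetical delivery target in the same units
    """
    return target_loudness - measured_loudness


def apply_static_gain(samples, gain_db):
    """Apply one fixed gain to every sample of the program."""
    scale = 10.0 ** (gain_db / 20.0)
    return [s * scale for s in samples]


# A program measured at -18 needs a fixed 6 dB cut to reach -24.
gain_db = static_normalization_gain_db(-18.0)
normalized = apply_static_gain([0.5, -0.25, 0.1], gain_db)
```

Since every sample is scaled by the same factor, the balance between dialogue and background elements within the program is unchanged, unlike with compression applied at each stage of a broadcast chain.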

Recently, some operators such as Amazon Prime Video7 and German public broadcaster ARD8 have introduced alternate programming streams with boosted dialog. In Amazon’s case, they say “While Dialogue Boost was built with the needs of customers who are hard of hearing in mind, anyone can use the feature to suit their personal listening preferences.”

TV manufacturers are beginning to “add value” by including correction technology to boost dialogue levels for more intelligibility. Samsung says its TVs offer AI-based sound processing that includes “Active Voice Amplifier, which adjusts and improves dialogue clarity with the speaker and surrounding noise in mind…”9

Thus, there are many technical factors that could influence dialogue intelligibility. Chances are, it is a combination of factors for each unintelligible item. This matter warrants study and consideration across the industry, and perhaps some research on actual consumer use cases that trigger intelligibility concerns. That said, there are some ideas that might be helpful:

  • Consider QC checks of dialogue in a simulated consumer environment. Listening could be done in an immersive near-field room with 3D reverb to simulate a 400 ms RT60, and perhaps 35-40 dBA SPL of masking noise. If an average response curve for TV sets could be established, it could be applied to the playback audio, similar to an “Auratone check” as is common in music mixing.
  • Ensure that playback devices default to DRC profiles that are appropriate for their device class.

How can dialogue intelligibility be objectively measured? That is perhaps a subject for another article.

Also, I’m part of an industry group that might establish some recommendations to improve this situation. Stay tuned…


References:

  1. “Peak TV Tally: According to FX Research, A Record 559 Original Scripted Series Aired in 2021”, Variety, https://variety.com/2022/tv/news/original-tv-series-tally-2021-1235154979/ ↩︎
  2. The critical distance boundary between near-field and far-field listening depends on the Q (directivity) of the speaker and the reverberation time of the room. Except at the very lowest frequencies, cinema listening is far-field. Listening in a living room is near field at middle frequencies in most cases. Listening in a dry near field mix room is always near field. ↩︎
  3. “Netflix Sound Mix Specifications & Best Practices v1.4”, https://partnerhelp.netflixstudios.com/hc/en-us/articles/360001794307-Netflix-Sound-Mix-Specifications-Best-Practices-v1-4 ↩︎
  4. Preferred Listening Levels and Acceptance Windows for Dialog Reproduction in the Domestic Environment, Eric Benjamin, Audio Engineering Society Convention Paper 6233, October 2004. ↩︎
  5. First Results from a Large-Scale Measurement Program for Home Theaters, Tomlinson Holman and Ryan Green, Audio Engineering Society Convention Paper 8310, November 2010. ↩︎
  6. “Traditional pay TV US home penetration to fall below 50% in 2023”, nScreenMedia, https://nscreenmedia.com/traditional-pay-tv-us-home-penetration-2023/ ↩︎
  7. https://www.aboutamazon.com/news/entertainment/prime-video-dialogue-boost ↩︎
  8. https://www-ard-de.translate.goog/die-ard/08-30-Fernsehen-besser-verstehen-Mehr-ARD-Programme-mit-Tonspur-Klare-Sprache-100/?_x_tr_sl=auto&_x_tr_tl=en&_x_tr_hl=en-US&_x_tr_pto=wapp&_x_tr_hist=true ↩︎
  9. https://news.samsung.com/global/interview-ai-driven-sound-innovation-redefining-the-tv-audio-experience ↩︎

Copyright © 2023 Robert L. Bleidt. All Rights Reserved. The information presented here is just the author’s opinion and not legal or business advice or necessarily the view of his employer. You agree to hold the author and his employer harmless from your actions if you use this information. Copying of this document is not permitted but linking to it is.