TalkBack can read images even if your phone is offline – thanks to the on-device Gemini Nano

A screenshot from the Android Developers Blog showing the new TalkBack functionality.
TalkBack, the indispensable Android feature for people who are blind or have low vision, gets a lot more useful – and powerful – thanks to the Gemini Nano with multimodality model.

There's an extensive piece on the Android Developers Blog, where the team opens up about the latest enhancement to the screen reader feature from the Android Accessibility Suite.



TalkBack includes a feature that provides image descriptions when developers haven’t added descriptive alt text. Previously, this feature relied on a small machine learning model called Garcon, which generated brief and generic responses, often lacking specific details like landmarks or products.
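For app developers, the fallback never has to kick in at all: supplying alt text directly remains the first-class path. Here is a minimal Kotlin sketch of how an app labels an image so TalkBack reads the developer's text instead of generating one; `photoView` and the description string are illustrative placeholders.

```kotlin
import android.widget.ImageView

// A minimal sketch: `photoView` and the description string are
// illustrative placeholders, not code from TalkBack itself.
fun labelImageForTalkBack(photoView: ImageView) {
    // TalkBack announces contentDescription when the view gains focus;
    // model-generated descriptions are only a fallback for images
    // that ship without one.
    photoView.contentDescription =
        "Panorama of the Sydney Opera House and Harbour Bridge at night"
}
```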

The introduction of Gemini Nano with multimodal capabilities presented an ideal opportunity to enhance TalkBack’s accessibility features. Now, when users opt in on eligible devices, TalkBack leverages Gemini Nano’s advanced multimodal technology to automatically deliver clear and detailed image descriptions in apps like Google Photos and Chrome, even when the device is offline or experiencing an unstable network connection.
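Google hasn't published the exact binding TalkBack uses to reach the model, so the sketch below is a stand-in rather than a real API: the `NanoVision` interface and its `describe` call are hypothetical, and only illustrate the shape of an on-device flow that never touches the network.

```kotlin
import android.graphics.Bitmap

// Hypothetical interface standing in for whatever on-device binding
// TalkBack actually uses; nothing here is Google's real API.
interface NanoVision {
    suspend fun describe(image: Bitmap, prompt: String): String
}

suspend fun describeForTalkBack(model: NanoVision, image: Bitmap): String =
    // The whole call stays on-device, which is why descriptions keep
    // working offline or on a flaky connection.
    model.describe(image, prompt = "Describe this image for a screen reader user.")
```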

Google's team provides an example that illustrates how Gemini Nano improves image descriptions. First, Garcon is presented with a panorama of the Sydney, Australia, shoreline at night – and it might read: "Full moon over the ocean". Gemini Nano with multimodality, however, can paint a richer picture, with a description like: "A panoramic view of Sydney Opera House and the Sydney Harbour Bridge from the north shore of Sydney, New South Wales, Australia". Sounds far better, right?

"Utilizing an on-device model like Gemini Nano was the only practical solution for TalkBack to automatically generate detailed image descriptions, even when the device is offline."

– Lisie Lillianfeld, product manager at Google

When implementing Gemini Nano with multimodality, the Android accessibility team had to choose between more detailed output and faster inference, a trade-off partly determined by input image resolution: Gemini Nano currently accepts images at either 512 or 768 pixels.
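To make that trade-off concrete, here is a hedged Kotlin sketch that scales a bitmap to one of the two input sizes before inference. The helper and its parameters are invented for illustration; only the 512- and 768-pixel figures come from the blog post.

```kotlin
import android.graphics.Bitmap

const val FAST_INPUT_PX = 512      // first token arrives sooner, less detail
const val DETAILED_INPUT_PX = 768  // slower to start, richer descriptions

// Hypothetical helper: downscales the longer edge to the chosen input
// size (never upscales) before the image is handed to the model.
fun prepareForModel(source: Bitmap, preferDetail: Boolean = true): Bitmap {
    val target = if (preferDetail) DETAILED_INPUT_PX else FAST_INPUT_PX
    val scale = (target.toFloat() / maxOf(source.width, source.height)).coerceAtMost(1f)
    return Bitmap.createScaledBitmap(
        source,
        (source.width * scale).toInt(),
        (source.height * scale).toInt(),
        /* filter = */ true,
    )
}
```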

While the 512-pixel resolution generates the first token almost two seconds faster than the 768-pixel option, the resulting descriptions are less detailed. The team ultimately prioritized providing longer, more detailed descriptions, even at the cost of increased latency. To reduce the impact of this delay on the user experience, the tokens are streamed directly to the text-to-speech system, allowing users to begin hearing the response before the entire text is generated.
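That streaming trick is easy to picture in code. Here is a sketch, assuming the model's output arrives as a Kotlin `Flow` of text chunks (`tokens` is a stand-in, not TalkBack's real plumbing), queued into Android's standard `TextToSpeech` API:

```kotlin
import android.speech.tts.TextToSpeech
import kotlinx.coroutines.flow.Flow

// Speaks a description while it is still being generated: chunks are
// queued into TTS at punctuation boundaries instead of waiting for
// the full text. `tokens` is a hypothetical stream of model output.
suspend fun speakWhileGenerating(tokens: Flow<String>, tts: TextToSpeech) {
    val buffer = StringBuilder()
    var utteranceId = 0
    tokens.collect { token ->
        buffer.append(token)
        // Flush on punctuation so the user hears the opening words of
        // the description within the time-to-first-token, not after
        // the whole response has finished.
        if (token.endsWith(".") || token.endsWith(",")) {
            tts.speak(buffer.toString(), TextToSpeech.QUEUE_ADD, null, "desc-${utteranceId++}")
            buffer.clear()
        }
    }
    if (buffer.isNotEmpty()) {
        tts.speak(buffer.toString(), TextToSpeech.QUEUE_ADD, null, "desc-$utteranceId")
    }
}
```

Queueing with `QUEUE_ADD` keeps the chunks in order, so the spoken result sounds seamless even though it was synthesized piecemeal.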

While I'm not yet fully aboard the AI hype train, AI-powered features like this are stunning – just think about the potential! And then there are stories like this one that make you want to tone down this "wonderful" progress of ours.
