OpenAI has added voice and image capabilities to ChatGPT, allowing ChatGPT Plus and Enterprise users to speak their questions aloud or provide pictures as a reference for queries.
The new voice capability, announced on Monday, September 25, is powered by a text-to-speech model that generates human-like audio from text input. In the other direction, OpenAI uses Whisper, its open-source speech recognition system, to transcribe the user's spoken words into text for ChatGPT.
To get started with voice, Plus users can navigate to Settings, then New Features, in the mobile app and opt into voice conversations. Users can also choose from five voices, each created in collaboration with professional voice actors.
The new voice technology, while versatile, also poses new risks, such as the potential for impersonation or fraud. To mitigate these risks, OpenAI is initially limiting the technology to voice chat applications and collaborating with voice actors and partners like Spotify to ensure responsible and beneficial use.
The new ability to understand and discuss images opens up numerous possibilities, such as troubleshooting technical issues, planning meals based on the contents of your fridge, or analysing complex data graphs for work-related tasks. Users can capture or select images directly in the app, and on iOS or Android they can also discuss multiple images or use a drawing tool to guide the assistant.
Vision-based models present unique challenges, and OpenAI has collaborated with organisations like Be My Eyes to understand usage and limitations as part of its testing process. OpenAI has also taken technical measures to limit ChatGPT’s ability to analyse and make direct statements about people, in order to respect individuals’ privacy.
This image-understanding capability is powered by multimodal versions of GPT-3.5 and GPT-4, which apply their language reasoning skills to a wide range of images, including photographs, screenshots, and documents containing both text and images.
The new voice and image capabilities will first be rolled out to Plus and Enterprise users over the next two weeks, starting with voice on iOS and Android.