Excerpt from PhocusWire

For those of you not keeping up: at the end of 2022, ChatGPT gave us a computer that can understand and generate English. Not just English, but human language in general. On the face of it, that can seem like a small step. It’s not.

Yes, it struggles with facts and up-to-date data, but that part is solvable and often not as critical as you’d think.

If that wasn’t revolutionary enough for you, the next shift is generative artificial intelligence (GenAI) going multi-modal. ChatGPT started out as text-to-text. Multi-modal is the other stuff:

  • Text-to-Audio. Generating voice from text. Companies like ElevenLabs, Google and Amazon can now create both pre-trained voices and custom cloned voices that are in many cases indistinguishable from a human voice. Everybody has access to these tools - it’s really easy.
  • Audio-to-Text. Apple, Amazon and Google have been doing this for years. Good luck if you have an accent. Today, via GenAI, it's vastly improved - even with that accent. Otter.ai and OpenAI’s Whisper are two of my favorites (there’s a quick Whisper sketch just after this list). ChatGPT now has Audio-In > Audio-Out as an option so you can bypass the text altogether.
  • Text-to-Image. This area has been improving rapidly in the last year. OpenAI’s DALL-E was fun to play with a year ago. Today, the free Bing.com/create is really close to creating usable photo-realistic images. Midjourney is also right there. SHOULD you be generating images for your travel marketing campaigns? I’m staying out of that argument.
  • Image-to-Text. This just happened in the last couple of weeks as part of ChatGPT. Take a photo, upload a sketch, whatever you like, and ask the GenAI to describe, deduce, tell a story … anything you want.
  • Text-to-Video and Video-to-Text. These aren’t really any different from the technology for images, except there’s a lot more data (more compute, more $) to process. They’re not ready for prime time yet, but they’re coming. And they’re fun to play with - just like images were a year ago.
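To make the audio-to-text piece concrete, here is a minimal sketch using OpenAI’s open-source Whisper model in Python. It assumes the openai-whisper package and ffmpeg are installed, and "interview.mp3" is a hypothetical stand-in for your own recording - a sketch, not a production recipe.

```python
# A minimal sketch of audio-to-text with OpenAI's open-source Whisper model.
# Assumptions: the openai-whisper package and ffmpeg are installed, and
# "interview.mp3" is a hypothetical placeholder for your own audio file.
import whisper

model = whisper.load_model("base")          # small checkpoint; larger ones are more accurate
result = model.transcribe("interview.mp3")  # returns a dict with the transcript and segments
print(result["text"])                       # the full transcript as plain text
```

That really is the whole thing: a few lines and you have a usable transcript, accent and all.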

Take any one of these modes in isolation and it’s significant on its own. But when you connect them all together seamlessly, it starts to get really interesting.

Consider the new Ray-Ban Meta smart glasses. They’re basically a multi-modal large language model (LLM - think ChatGPT) you wear on your face. You can do the basics like listen to music or take photos and videos. But really, they’re a frictionless way to connect what you see, hear and say to the world’s most powerful (in a practical sense) computer, an LLM.

Across all of the human senses, it’s a fair argument that vision and hearing are how we consume most data, and voice and text are how we communicate most of it. All of these are now functioning and connected.

Today, we’re already using many of these tools, but they often don’t function very cohesively. It’s still a bit patched together. This will change.

Click here to read the complete article at PhocusWire.