Associate Director
San Francisco Bay Area
I spend a lot of time exploring the future of human interaction with Generative AI-enabled products. Recently, I’ve had fun experimenting with two innovative music generation platforms: Udio and Suno.
Generative AI models are increasingly multi-modal, expanding beyond text and images to modalities like video and music. For designers, these music models are worth exploring for their interaction design choices—and what these might mean for other Generative AI products in the future.
The “magical” aspect of Generative AI continues to fascinate me. Even after years of using these models, their ability to generate compelling content from just a few words still strikes me as remarkable. After seeing users post creative and quirky songs made on Udio and Suno, I decided to hop on and create some myself. Both platforms were impressive: within seconds, an idea becomes a song. It’s truly amazing.
Using Udio or Suno for fun or personal creative expression—without specific output requirements—works exceptionally well. But how well do these tools perform in professional contexts where precise outputs are needed?
For the designers building these platforms, there are many questions around how to give users more control over the output—without creating a tug-of-war with the underlying foundation model.
Generative AI and the Evolution of UX
Before music generation models existed, some of the first widely adopted generative models were built for image generation. The two most well-known of these, Midjourney and OpenAI’s DALL-E, took different approaches to the user experience.
The most widely available version of DALL-E, embedded in ChatGPT, relies entirely on a text interface to generate images. This allows for a faster, more natural way of interacting with the tool than Midjourney, which built its interface into the Discord messaging app. Midjourney’s Discord interface requires users to type simple text commands (such as /imagine to generate an image) into a small chat window, which can feel clunky.
But Midjourney has its advantages, too. Its more traditional UI buttons let users easily zoom out and expand an image in various directions, or upscale and vary images in different ways. These buttons often open an additional text window, so refinement becomes a mix of conversational prompting and button-based controls.
Weighing the two approaches, DALL-E is often easier to prompt and better at representing every element of a complicated text input. It’s the tool I reach for when I need something quick or when several elements of a prompt must work together. Midjourney, however, allows for more refinement; it’s the tool I use when I have more time for precise iteration with its robust controls.
In contrast to these standalone models, Adobe has trained its own Generative AI model and integrated it into Creative Cloud. Tools such as Photoshop offer far greater editing capabilities than DALL-E or Midjourney, but Adobe’s underlying Firefly model, though improving, still tends to lag behind those standalone models in performance.
The power of the underlying foundation model is a critical factor in determining how much can be accomplished through conversational input versus UI controls. A robust model can generate impressive results from just a short prompt. It may even understand the user’s intent well enough to enable an iterative back-and-forth as the model and user work together toward a better result. However, these models typically disappoint when you rely on text prompts alone to reach a specific result.
Balancing Complexity and Control
Against the backdrop of these early image generation models, it’s interesting to watch how user experience and design evolve as we expand into new modalities like music. One way to collaborate with Suno and Udio is to extend a song beyond your initial creation. Both platforms can write lyrics for you, and both let you insert your own. You can write the first set of lyrics and then let the AI build on your work to extend the song. This kind of back-and-forth collaboration is emerging as a useful interaction pattern with these models.
However, getting exactly what you want can be challenging when you’re after something specific, like a particular style or song structure. When it comes to finessing the finer details, such as which part of the song becomes the chorus or the bridge, or how the style is fine-tuned, you are forced to trust in the magic of the model. And that can be hit or miss.
As you iterate, some improvements come with regressions elsewhere, and changes you never asked for can creep into the output. Power users of ChatGPT will be familiar with this pattern, where making one desired change often brings another unintended one. Models can be frustrating to guide toward an exact result. We are still in the early days of GenAI, and there’s plenty of room for innovation, especially in designing interfaces that let users dial in their exact preferences.
The Future of Human Interaction with AI
Designing for Generative AI requires striking the right balance between letting the foundation model shine and presenting options for finer user control. While the initial wow factor is great, the true magic happens when users become more adept at steering the technology to meet their needs.
As these tools evolve, they promise to become even more integral to both casual creativity and professional work. Innovative interaction models will continue to emerge as new tools launch. Because the technology is developing so quickly, it’s crucial for businesses to keep experimenting to understand its readiness and applications, and equally important for designers to create new interaction patterns that make the tools useful for both casual creativity and precise professional workflows.