On August 28, music fans will tune in to the MTV Video Music Awards, and on September 18, TV fans will tune in to the Emmy Awards. Millions of viewers tune in and log on to watch their favorite stars arrive and win awards. More often than not, they’re interested in what celebrities are wearing or what their latest hit song is. What if viewers could easily identify the designer of Lady Gaga’s outfit as she walked the red carpet – or even find similar colors, styles and fabrics to purchase in real time, all at the click of a button on the remote?
As TV product placement investments flow from automotive, fashion, electronics and consumer packaged goods brands, specialty and mainstream products of all types saturate our televisions. Real-time multimedia search could deliver superior returns for brands and content providers alike. Viewers could instantly request more information or connect with local businesses selling thousands of different wares.
More immediately, as online news publishers move away from the written word toward more video-based content, each video could serve as a curator of content for the page that surrounds it. Videos could deliver ads, coverage and content to each user that is entirely relevant to the user experience without any human intervention. These futuristic scenarios may be closer than you think. A new and innovative approach to search is underway at Santa Clara, CA-based startup Haileo (http://www.haileo.com/). The company has developed what its executives call the Haileo Brain.
I recently spoke with co-founders Vwani Roychowdhury and Nima Sarshar to learn more about the technology’s immediate potential for product marketers and its future potential as a media game-changer.
Defining Context through Entities, Objects and Intent Categories
The Haileo Brain is a machine representation of the real world, modeled after the human brain. It comprises images, audio signals and text phrases that capture different attributes of objects and experiences; these attributes are in turn linked to one another by similar intent or by functional and contextual relationships. Then, just as humans understand a video or a movie scene by recognizing familiar objects and reliving the external signals inside their brains, the Haileo Brain understands multimedia by first recognizing familiar signals in the video and then stitching together a wide range of elements (already stored inside it) to generate a contextual, intent-driven and sequential summary of the content. This contextual summary, in conjunction with other data such as a user’s profile, can then be used to target ads or related content at the right time and the right location as the video is watched.
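The article does not describe Haileo's internals, but the "sequential summary" idea can be sketched in a few lines of illustrative Python. The `recognize` function below is a placeholder for the (unspecified) signal-recognition step, and the frame format is invented for illustration:

```python
# Hypothetical sketch of a contextual, sequential video summary.
# recognize() is a stand-in for matching a frame's signals against
# objects already stored in the "brain" -- not Haileo's actual code.

def recognize(frame):
    # Placeholder: pretend the labels were produced by signal recognition.
    return frame["labels"]

def contextual_summary(frames):
    """Produce a time-ordered summary: (timestamp, recognized labels)."""
    return [(f["t"], recognize(f)) for f in frames]

# Invented example video: two annotated frames.
video = [
    {"t": 0.0, "labels": ["fashion runway"]},
    {"t": 5.0, "labels": ["luxury apparel", "jewelry"]},
]

print(contextual_summary(video))
# [(0.0, ['fashion runway']), (5.0, ['luxury apparel', 'jewelry'])]
```

Because the summary is time-stamped, an ad server could use it to place related content at the moment the corresponding object appears on screen.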
How does one create such a versatile brain? Currently, search engines such as Google and Bing get help from the thousands of marketers who bid for search terms and perform SEO, and from the hundreds of millions of users who run queries. All of this data helps them build an associative database of the phrases and images people choose and of the related products and services. With the visual signal playing a dominant role in a video-centric world, the current paradigm may not be that useful, and alternate means of processing multiple signals in a common framework are considered too futuristic. This is where the Haileo Brain aims to take a giant leap. As Sarshar puts it, “Most objects and experiences are documented on the web on multiple sites, complete with images, text and audio signals. We crawl the web and automatically aggregate the relevant visual signals and link them to related visual, audio and textual signals.” Roychowdhury adds, “This is a new kind of science – distilling entities from the web – that allowed us to make a breakthrough.”
To fully understand this technology, let’s address Haileo’s definitions of three critically important terms:
- Entity: the underlying tangible thing that exists offline (a PlayStation Portable 3000, for example)
- Object: any atomic multimedia signal (for example, a caption or other text source, an image, a video frame, an audio file, etc.) that is an attribute of an entity (an object relating to a PSP-3000, for example, could be an image of a game, a video file of an advertisement, or a customer review).
- Intent Categories: a group of entities that share a similar intent or represent the same category of experience (for example, entities such as a fashion-runway scene, luxury apparel and jewelry would fall under the intent of fashion; similarly, entities such as a stadium, a tennis court and rackets would fall under the intent of sports).
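These three layers form a simple hierarchy, which can be sketched as a toy data model. All class and field names here are illustrative, not Haileo's actual schema:

```python
from dataclasses import dataclass, field

# Illustrative data model only -- names are hypothetical, not Haileo's.

@dataclass
class MediaObject:
    """An atomic multimedia signal that is an attribute of an entity."""
    kind: str    # "image", "video_frame", "audio", "text", ...
    source: str  # where the signal was found, e.g. a URL

@dataclass
class Entity:
    """The underlying tangible thing that exists offline."""
    name: str
    objects: list = field(default_factory=list)  # MediaObject attributes

@dataclass
class IntentCategory:
    """A group of entities sharing a similar intent or experience."""
    name: str
    entities: list = field(default_factory=list)

# Example from the definitions above: sports-related entities.
stadium = Entity("stadium", [MediaObject("image", "http://example.com/stadium.jpg")])
sports = IntentCategory("sports", [stadium, Entity("tennis court"), Entity("racket")])

print([e.name for e in sports.entities])
# ['stadium', 'tennis court', 'racket']
```

The point of the hierarchy is that many heterogeneous signals (images, audio, text) attach to one entity, and many entities roll up into one intent.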
As Sarshar explains, “The challenge is to understand what an image or a video segment is on many different levels.” He continues, “Object-level understanding may not be enough on many occasions. An image of a basketball court, even when correctly identified at the object level, does not tell the complete story: one also needs to know that one can show ads for tickets to local games, jerseys, sports shoes and so on. This is where the Haileo Brain kicks in and provides a contextual map of the intent space.”
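Sarshar's basketball-court example can be made concrete with a toy lookup from an object-level label, through its intent category, to candidate ads. The mappings below are invented for illustration and are not Haileo's data:

```python
# Toy illustration of object-level vs. intent-level understanding.
# Both dictionaries are invented examples, not Haileo's intent map.

OBJECT_TO_INTENT = {
    "basketball court": "sports",
    "fashion runway": "fashion",
    "luxury apparel": "fashion",
}

INTENT_TO_ADS = {
    "sports": ["tickets to local games", "jerseys", "sports shoes"],
    "fashion": ["designer outfits", "jewelry", "similar fabrics"],
}

def ads_for_label(label: str) -> list:
    """Map a recognized object label to intent-driven ad candidates."""
    intent = OBJECT_TO_INTENT.get(label)
    return INTENT_TO_ADS.get(intent, [])

print(ads_for_label("basketball court"))
# ['tickets to local games', 'jerseys', 'sports shoes']
```

A recognized "basketball court" by itself is just a label; it is the intermediate hop through the intent space ("sports") that surfaces commercially relevant content.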
The post continues tomorrow with details of the performance and speed of Haileo and current applications for the technology.