By Steven Brykman, Propelics
The trend with Apple products has always been toward smaller, faster devices with greater storage capacity. The only Apple product that has gotten consistently and significantly larger over time (other than desktop displays) is the iPhone. It's the one time Apple got it wrong: iPhones started out the smallest they would ever be and have only gotten bigger since.
I’m not saying Apple erred by going bigger. Everyone I know who bought a new iPhone says the same thing: "I love it. Now every time I look at an iPhone 5 I think, 'Look at how small that is!' I can barely type on that thing. But now I don’t even need my iPad anymore."
Regardless, screen size is not what really matters. In the long run, what will make all the difference is something that doesn’t require a screen at all, nor even a user interface. I’m talking about vocal interaction with a "virtual assistant" like Siri or Cortana.
A key concern of technology has always been finding ways to improve how we interact with it: from keyboard to mouse/trackpad to Google Glass to Leap Motion devices to voice interaction. Of these, vocal interaction is the most obvious, intuitive and natural -- the most similar to how we interact with people in the real world.
It may also be the most represented interaction modality in science fiction -- everything from Douglas Adams' "share and enjoy" to Kubrick/Clarke's "I know that you and Frank were planning to disconnect me, and I'm afraid that is something I cannot allow to happen." As more everyday objects become Internet-enabled, it's inevitable that one day it will seem normal to hold a conversation with your dishwasher: "What do you mean, I'm not using enough detergent?"
Making Devices More Human
There are a number of ways to achieve more "human" interactions with our devices. Answering questions that require culling data from the Internet will take three things:
- a means by which all our randomly stored/displayed data (aka the Internet) can be somehow organized and repackaged into a usable standard format (beyond Wikipedia entries and Google search results)
- a more intelligent means of gathering this information
- the ability to present this collected data back to the user in natural language
We are already witnessing a trend toward improving information gathering and repackaging. Companies like Crimson Hexagon treat the Web like a "mass focus group," providing actionable information to corporations based upon opinions expressed on social media. (Though admittedly, even social media offers a certain level of formatting.)
The ultimate goal is to be able to ask our devices any question about anything and receive an actual answer -- not just a Web search -- phrased in perfect English (or the language the user speaks).
Some of this technology is already here. Ask Siri for a picture of President Obama and Siri delivers. Ask Siri how old President Obama is and again, Siri scores, answering, "Barack Obama is 53." Impressive.
Needs More Context
But follow the image search with the more ambiguous "How old is he?" and Siri fumbles, searching the Web for "how old is he." There's no "Obama" in the follow-up because Siri is not smart enough to consider context in this instance.
With the exception of specific lines of questioning, the conversation with Siri ends after one question. Siri somehow does well with directions, deriving appropriate context when asked, "Where is the nearest Salvation Army?" followed by, "Give me directions." Siri and Google also both do well with weather: Google Search's "smart, conversational" style lets users follow up "What's the weather like?" with questions such as "How about this weekend?" and the system understands the context. Siri does the same.
But ask Siri a seemingly simple question like "What bands are playing in Boston tonight," and it assumes you’re asking it to identify a song: "I’m not familiar with that tune." Ask it again and for whatever reason it seems to grasp the meaning a little better: "OK, I found this on the Web for 'What bands are playing in Boston tonight.'"
The results refer the user to websites rather than listing the names of the bands. Not so smart, after all. Or is it just that the source isn’t there? Perhaps that sort of information has yet to be packaged into a readable format?
Cross-platform, the results are inconsistent. Ask Cortana (Windows' voice assistant) for the tallest building in the world and it responds "the Burj Khalifa." Considerably better than Siri’s answer: "Here’s some information." (Although Siri provides the correct result visually.) Meanwhile, Google Now provides the best response of all, speaking the top three results along with their respective heights.
When it comes to formats in which an answer is delivered, again we find much cross-platform variation. Ask Google Now, "How do you say 'hello' in Spanish?" and it tells you the answer out loud. Say "repeat" and it knows to repeat the word. Add, "What about French?" and it reads the French version aloud.
Siri, on the other hand, simply conducts a Web search and displays the result, with no audio. Adding "What about French?" brings up a Wikipedia entry for French's mustard!
In a Siri-Cortana battle last spring, Pete Pachal found that: "When asked about restaurants, both shot back a list of results quickly. It was only Cortana who understood follow-up questions, though -- telling me how far one of the results was, doing her (sic) best to find the menu and calling the place -- while Siri had no idea what I was talking about. Point Cortana."
The current formula that describes our contextual voice interactions with devices, then, is this: Replace any ambiguous nouns with the more specific noun from the previous question. Simplistic, but better than nothing.
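That substitution heuristic can be sketched in a few lines of Python. To be clear, this is an illustrative toy, not how any real assistant is implemented; the pronoun set and the function name are my own inventions.

```python
# Toy sketch of the "replace ambiguous nouns" heuristic described above.
# The pronoun set and function name are illustrative, not any real
# assistant's API.

AMBIGUOUS = {"he", "she", "it", "they", "them"}

def resolve_followup(question: str, prior_entity: str) -> str:
    """Swap ambiguous pronouns for the specific noun from the prior question."""
    words = question.rstrip("?").split()
    resolved = [prior_entity if w.lower() in AMBIGUOUS else w for w in words]
    return " ".join(resolved) + "?"

print(resolve_followup("How old is he?", "President Obama"))
# -> How old is President Obama?
```

Even this crude word-for-word swap turns the failing follow-up into a query any of the assistants can answer, which is roughly the behavior Google Now and Cortana exhibit today.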
The larger picture involves the use of personal context: incorporating a user's preferences, activity, locations and past searches to produce an even richer and more rewarding exchange between man and machine. The question "Which movies are playing near me today?" yields a list of movies, but the follow-up question "Which ones will I like?" naturally remains unanswerable. (Though Google Now derives the context correctly, converting the question to "Which movies will I like?")
Eventually our devices will incorporate a wide range of data into their responses. By recalling which restaurants we visited and reviewed favorably in the past, for instance, they will know which ones to recommend in the future. The same goes for more transient events: "Which bands are playing tonight?" implies "Which bands are playing tonight that I will enjoy, based upon my existing musical preferences (pulled from my iTunes library, Spotify playlists and Pandora stations)?"
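The implied filtering step could look something like the sketch below. Everything here is a hypothetical stand-in -- the function name, the hard-coded lists -- since a real assistant would pull listings and preferences from live services rather than literals.

```python
# Hypothetical sketch: filter tonight's event listings against the
# user's known musical preferences (e.g., artists from a music library).
# Names and data are stand-ins, not any real service's API.

def bands_i_will_like(bands_tonight, favorite_artists):
    """Return only the bands that appear among the user's favorite artists."""
    favorites = {artist.lower() for artist in favorite_artists}
    return [band for band in bands_tonight if band.lower() in favorites]

tonight = ["The Shins", "Polka Explosion", "Wilco"]
library = ["Wilco", "The Shins", "Radiohead"]
print(bands_i_will_like(tonight, library))
# -> ['The Shins', 'Wilco']
```

In practice the matching would need to be far fuzzier than exact name comparison -- similar-artist data, play counts, genres -- but the principle is the same: the assistant answers the question the user meant, not the one they literally asked.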
One wonders if the technology will become so prescient that eventually we’ll just let our devices plan our evenings: make the reservations, buy the tickets, even order the food.
As a father of three kids with no time on his hands, frankly I’m looking forward to the day this scenario is possible: "Hey Siri, set me up with dinner and a movie for Friday night: Chinese with a high hot and sour soup rating and an Apatow comedy with an emotional subtext. No farcical Lampoon crap. Also, don’t make me walk more than a half-mile. Daddy’s dogs are barking."
Steven Brykman is a digital strategist and UX architect focusing on mobile products, with a diverse background in writing and literature. Currently with Propelics, a leader in enterprise mobility, he helps a wide range of enterprises determine the direction of their mobile apps. He spent much of the last decade as creative technologist/lead strategist for his own design company, Got Your Nose, improving user experiences for such companies as Scholastic and Bell Canada and leading the mobile strategy direction for companies including Rabbit Bandini, Guinness and Nintendo. He also co-founded Apperian, a Boston-based technology startup focused on mobile application development, and served as lead UI architect and strategist for all iPhone, iPad and Android applications, including Apperian's flagship product, EASE.