When Jay-Z, Madonna, and Google’s head of Webspam Matt Cutts leave comments on your blog, your first impulse may be to question whether they are fakes. I deleted the Jay-Z and Madonna comments immediately, but hesitantly published the comment from Matt Cutts, even though there was something just a little off about it.
In the past, I have received comments from the inventors of some of the search patents I blog about. Gilberto Gil, in his capacity as Brazil’s Minister of Culture, commented on a post I published about Remix, a type of musical mashup of old music with new material added to it. I’ve even interacted with Matt Cutts in the comments on other blogs more than once. So a comment from him on my blog wasn’t completely out of the question.
But a tweet from Matt Cutts a little later in the day, after I approved his comment, confirmed my worst suspicion: the comment had been created by an imposter.
In the online world, how do you tell whether or not something was written by the person claiming to be the author? In a time when search engines are looking to tie together authorship with content, how do they investigate? How can they come up with a mathematical approach to uncovering such deception on a large scale?
What’s My Line?
What’s My Line was a very long-running television show, featuring a lineup of celebrities tasked with uncovering the occupations or identities of mystery guests by asking only yes or no questions. The show started in the early 1950s and ran until 1967 and then in syndication until 1975. A live stage version of the show started in Los Angeles in 2004, moved to New York in 2006, and runs to this day.
A real-life What’s My Line with questions about identity and impersonation is a challenge that search engines face more frequently these days, as social networks and authorship become a larger part of the search ecosystem. Instead of celebrity judges asking yes or no questions, search engines turn to algorithms to investigate. Before we hop into some of the approaches they might use, let’s look quickly at why it’s even an issue.
Google’s Agent Rank And Bing’s Author Authority
In 2007, Google published a patent filing titled Agent Rank (US Patent 7,565,358), which described how digital signatures could be associated with pages and blog posts published by authors. It also covered signatures for comments, and metadata that could be used to point out where content was syndicated elsewhere on the web. When content was published, Google would track the time and date of its publication. If someone copied the content, Google could use the time at which the signed content was published as a signal of which version was the original and which was a copy.
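To make the timestamp idea concrete, here is a minimal sketch of how the earliest signed copy might be treated as the original. The SignedCopy class, its fields, and the pick_original function are my own hypothetical names, and the rule of simply ignoring unsigned copies is an assumption rather than anything the patent filing spells out.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class SignedCopy:
    """One copy of a piece of content found on the web (hypothetical model)."""
    url: str
    author_signature: Optional[str]  # None when the copy carries no signature
    published_at: datetime


def pick_original(copies: List[SignedCopy]) -> Optional[SignedCopy]:
    """Return the earliest signed copy, or None if no copy is signed.

    Assumption: unsigned copies are ignored, because their publication
    time cannot be tied to an author.
    """
    signed = [c for c in copies if c.author_signature is not None]
    if not signed:
        return None
    return min(signed, key=lambda c: c.published_at)


copies = [
    SignedCopy("https://example.com/post", "sig-author-a", datetime(2012, 7, 1, 9, 0)),
    SignedCopy("https://scraper.example/post", "sig-author-b", datetime(2012, 7, 2, 14, 30)),
    SignedCopy("https://mirror.example/post", None, datetime(2012, 6, 30, 8, 0)),
]
original = pick_original(copies)
print(original.url if original else "no signed copy found")
```

A real system would presumably treat the timestamp as one signal among many instead of a hard rule, since publication dates can be wrong or manipulated.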
Google appears to have been taking advantage of such digital signatures in the form of Authorship markup introduced through Google Plus. You don’t have to use Google Plus as a social network to take advantage of authorship markup, but it’s possible that contributions and interactions on the social network can help authors improve a reputation score that might be associated with them. That reputation score may influence user rankings in Google’s social search, and may eventually play a role in user rankings for web search.
Google’s announcement of authorship markup also told us that content from authors may include a badge in web search results that shows the face of the author, under the premise that a picture of a real person signals that the content was created by a real person. These authorship badges show up regardless of whether searchers are signed into their Google account or not.
Google isn’t alone in possibly using the authority of authors to rank content in social search, and perhaps even web search. In a pending patent application filed in May, Ranking Authors in Social Media Systems (US Patent application 20120117059), Microsoft gives us a look at user topical signals that might be used to rank content created by specific authors, which might include:
- Raw count of topical posts;
- How often an author is cited by other authors;
- How often an author cites themselves;
- Number of times an author is replied to;
- Total number of posts authored in the system;
- How often they are mentioned by other users;
- Number of links an author has shared;
- How often they use explicitly denoted keywords (e.g., hash tags);
- A similarity index comparing an author’s recent content to their previous content;
- A timestamp of an author’s first post on the topic;
- A timestamp of their most recent post on the topic;
- A count of friends / followers who also post on the topic;
- A count of an author’s social media friends/followers who posted on the topic before the author posted on the topic;
- A count of an author’s social media friends / followers who posted on the topic after the author posted on the topic; or
- Other signals.
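The patent application, as summarized here, lists signals but not how they would be combined. Purely as an illustration, the sketch below folds a handful of them into a single number. The AuthorSignals class, the topical_authority function, and the formula itself are hypothetical choices of mine, including the assumption that self-citations should be discounted.

```python
from dataclasses import dataclass


@dataclass
class AuthorSignals:
    """A small subset of the listed signals for one author and one topic."""
    topical_posts: int
    total_posts: int
    citations_by_others: int
    self_citations: int
    replies_received: int
    mentions: int


def topical_authority(s: AuthorSignals) -> float:
    """Combine the signals into one illustrative score (not Microsoft's formula)."""
    if s.total_posts == 0:
        return 0.0
    # Share of the author's output that is about the topic.
    focus = s.topical_posts / s.total_posts
    # Attention the author receives from other people.
    attention = s.citations_by_others + s.replies_received + s.mentions
    # Assumption: heavy self-citation reduces the score.
    return focus * attention / (1 + s.self_citations)


specialist = AuthorSignals(topical_posts=80, total_posts=100, citations_by_others=40,
                           self_citations=1, replies_received=30, mentions=10)
self_promoter = AuthorSignals(topical_posts=20, total_posts=200, citations_by_others=2,
                              self_citations=19, replies_received=5, mentions=3)

ranked = sorted([("specialist", specialist), ("self_promoter", self_promoter)],
                key=lambda pair: topical_authority(pair[1]), reverse=True)
print([name for name, _ in ranked])
```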
Google’s approach includes verification of authors using HTML markup, with a way of pointing a Google profile to a page about an author, and a link back from that domain to the Google profile page. Google likely also uses a number of signals like those described in the Microsoft patent, and possibly even an approach that assigns topical scores to different contributions and interactions on the social network itself.
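The two-way linking described above can be sketched as a simple reciprocity check: the profile must link to the author page, and the author page must link back. The function names and example URLs below are hypothetical, and the sketch assumes the links on each page have already been extracted. It does not fetch anything or parse any markup.

```python
from typing import Iterable
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lowercase the scheme and host and drop a trailing slash for comparison."""
    parts = urlsplit(url)
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{parts.path.rstrip('/')}"


def is_reciprocal(profile_url: str, author_page_url: str,
                  links_on_profile: Iterable[str],
                  links_on_author_page: Iterable[str]) -> bool:
    """True only when each page links to the other."""
    profile_links = {normalize(u) for u in links_on_profile}
    author_page_links = {normalize(u) for u in links_on_author_page}
    return (normalize(author_page_url) in profile_links
            and normalize(profile_url) in author_page_links)


verified = is_reciprocal(
    "https://profiles.example/1234",
    "https://blog.example/about/",
    links_on_profile=["https://blog.example/about"],
    links_on_author_page=["https://profiles.example/1234/"],
)
print(verified)
```

Requiring the link in both directions matters because anyone can point their own page at someone else’s profile, but only the profile owner can add the link back.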
Yahoo also recently weighed in with an approach that adds a slightly different touch to the questions of identity and impersonation, with a patent filing titled Trust Based Moderation (US Patent Application 20120180138). The reputations of contributors to a social network are compared to the reputations of people filing abuse reports about those contributions.
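How the two sets of reputations would be compared isn’t spelled out in this summary, so the following is only a guess at one possible rule: hide a contribution when the combined reputation of the people reporting it outweighs the reputation of the person who posted it. The should_hide function and the numeric reputation scale are hypothetical.

```python
from typing import Sequence


def should_hide(author_reputation: float,
                reporter_reputations: Sequence[float]) -> bool:
    """Hide content when reporters collectively outweigh the author.

    Assumption: reputations are non-negative numbers on a shared scale,
    and summing reporter reputations is a reasonable way to pool them.
    """
    return sum(reporter_reputations) > author_reputation


# A trusted contributor reported by two low-reputation accounts stays visible.
print(should_hide(5.0, [1.0, 1.5]))
# A low-reputation contributor reported by the same two accounts is hidden.
print(should_hide(2.0, [1.0, 1.5]))
```

Under a rule like this, an impersonator with a brand-new account would be easy to report, while a crowd of throwaway accounts would have a hard time silencing an established author.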
Google On Finding Impersonators
Google’s Eric Schmidt has noted during presentations at a couple of public events that the purpose behind Google Plus is for it to act as an identity service. For example, no direct method exists to verify that you are who you say you are on Twitter, and that service has had some celebrity impersonators. Not everyone is going to sign up with Google Plus and use Authorship markup. Google does allow you to include links to your other profiles on sites like Twitter, Facebook, LinkedIn, Flickr, and others, and doing so might give Google more faith that the content from those services is from you.
But sometimes people impersonate others on social networks. On July 17, 2012, Google was granted a patent titled Detecting impersonation on a social network (US Patent 8,225,413). The patent was originally filed in June of 2009, before Google Plus launched. Someone might attempt to assume the identity of another person by creating a profile page that substantially copies profile information for that other person. They could do this out of malice or in an attempt to trick other people who might know or know of the individual being impersonated.
The patent points out a number of signals that Google might use to try to determine if a profile is that of an impersonator or the victim of impersonation. An analysis of two profiles with substantially similar information might be performed to distinguish which is real and which is a fake. These signals might include:
- Little or no interactions with others since the creation of the profile (impersonator)
- Frequent activity including profile updates, friend additions, and other activity (victim)
- Membership in deleted or flagged groups (impersonator)
- Membership and activity in active groups (victim)
- Content defaming the person associated with the profile (impersonator)
- Flagging of the profile by others in the social network (impersonator)
- The presence of pornography (impersonator)
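The signals above can be sketched as a simple tally: each impersonator signal adds a point, each victim signal subtracts one, and the profile with the higher total is the more likely fake. The Profile class, the equal weighting, and the LOW_INTERACTION_THRESHOLD value are hypothetical. The patent, as described here, names the signals but this scoring scheme is my own.

```python
from dataclasses import dataclass
from typing import Optional

LOW_INTERACTION_THRESHOLD = 3  # hypothetical cutoff for "little or no" interaction


@dataclass
class Profile:
    name: str
    interactions: int
    recent_updates: int  # profile updates, friend additions, and similar activity
    flagged_group_memberships: int
    active_group_memberships: int
    has_defaming_content: bool
    times_flagged: int
    has_pornography: bool


def impersonation_score(p: Profile) -> int:
    """Higher means more impersonator-like, lower means more victim-like."""
    score = 0
    score += 1 if p.interactions < LOW_INTERACTION_THRESHOLD else 0
    score -= 1 if p.recent_updates > 0 else 0
    score += 1 if p.flagged_group_memberships > 0 else 0
    score -= 1 if p.active_group_memberships > 0 else 0
    score += 1 if p.has_defaming_content else 0
    score += 1 if p.times_flagged > 0 else 0
    score += 1 if p.has_pornography else 0
    return score


def likely_impersonator(a: Profile, b: Profile) -> Optional[Profile]:
    """Of two near-identical profiles, return the more suspicious one, or None on a tie."""
    score_a, score_b = impersonation_score(a), impersonation_score(b)
    if score_a == score_b:
        return None
    return a if score_a > score_b else b


real = Profile("established profile", 250, 12, 0, 4, False, 0, False)
fake = Profile("copied profile", 0, 0, 1, 0, True, 3, False)
suspect = likely_impersonator(real, fake)
print(suspect.name if suspect else "undecided")
```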
The identity of people creating content on the web and participating in social networks is playing an increasing role in how the search engines rank content in social search, and will likely also play a role in the rankings of pages in web search and in deciding whether duplicated content might be filtered out of search results. Chances are that if there are two or more copies of substantially duplicated content on the web, and one of those copies is connected to a social profile that Google or Bing trusts, that copy will be the one that shows up in search results, with the others filtered out.
Rumors of a commenting system from Google that would allow people to leave comments on sites outside of Google started spreading a month or so ago. Chances are that those comments would be part of Google Plus, and might allow for those comments to be published on the page or blog post that they are responses to as well as on Google Plus. They would also likely be tied to the commenter’s Google Plus account, so that the commenter’s identity can be verified by bloggers like me. Those comments could also influence the reputation scores of the people leaving the comments, and might even impact the ranking of the page being commented upon, in social search and possibly even web search.
In such a case, if you write a post about search or SEO, and the REAL Matt Cutts leaves a comment on your post, that could cause the ranking of your page to increase.
In all these ways, the search engines are attempting to integrate the signals that social networks can send into making results more accurate, relevant, and meaningful for everyone, as well as ensuring that good content is credited to those who have spent the time and energy creating it.
Image: Identity Unknown — Original Billboard Image from Shutterstock
Editor’s Note: This article first appeared in the Fall 2012 issue of Search Marketing Standard magazine and is now available for all to read.