Andrew Tomkins on Web Search and Online Communities
April 5, 2007
I was very disappointed by the talk. I had expected a lot more from someone who possibly determines the future of web search at one of the leading search engines in the world. The talk started off with Andy giving a slide show of Flickr images that rate high on "interestingness". That was the good part: the pictures were cool. But thereafter he went into why Flickr is a better social network than other networks. He gave some quantifiable metrics, such as the size of the largest strongly connected component in the relationship graph and the number of nodes with degree greater than some threshold k, with k parameterized on the X-axis. For Flickr, it seems there are a fair number of people with 450 friends or more, while for another social networking site (à la LinkedIn/Orkut) the number is an order of magnitude smaller. I did not buy his argument that this indicates Flickr is a more successful social network. Maintaining 450 friends is very difficult (ask me! I don't even interact with many of my 800-odd friends on Orkut). Besides, the nature of the two social networks is very different. He also touched upon how social networks are starting to interact with each other (Upcoming and Flickr).
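For the curious, the two metrics Andy showed can be computed from a plain directed edge list (who has added whom as a contact). This is only a sketch: the toy graph, the choice of out-degree, and the iterative Kosaraju implementation are my own, not anything from the talk.

```python
from collections import defaultdict

def largest_scc_size(edges):
    """Size of the largest strongly connected component (Kosaraju's algorithm)."""
    graph, rgraph, nodes = defaultdict(list), defaultdict(list), set()
    for u, v in edges:
        graph[u].append(v)
        rgraph[v].append(u)
        nodes.update((u, v))

    # Pass 1: record finish-time order with an iterative DFS on the graph.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, it = stack[-1]
            advanced = False
            for nxt in it:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph[nxt])))
                    advanced = True
                    break
            if not advanced:
                order.append(node)
                stack.pop()

    # Pass 2: DFS the reversed graph in reverse finish order; each tree is one SCC.
    visited, best = set(), 0
    for start in reversed(order):
        if start in visited:
            continue
        visited.add(start)
        size, stack = 0, [start]
        while stack:
            node = stack.pop()
            size += 1
            for nxt in rgraph[node]:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append(nxt)
        best = max(best, size)
    return best

def nodes_with_min_degree(edges, k):
    """Count nodes with out-degree (number of contacts) of at least k."""
    deg = defaultdict(int)
    for u, _ in edges:
        deg[u] += 1
    return sum(1 for d in deg.values() if d >= k)

edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
print(largest_scc_size(edges), nodes_with_min_degree(edges, 2))  # 3 1
```

Sweeping `nodes_with_min_degree` over a range of k values gives exactly the kind of degree-distribution curve Andy plotted.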
He then went into how the Internet is growing and the amount of data being generated. A back-of-the-envelope calculation (6B people typing away at computers for 4 hours every day) gives an upper bound of about 150 petabytes of human-generated data. However, that data is more and more decentralized. The data that passes through the Yahoo! network is only about 11% of the web's data right now, and that share is falling fast; nobody else even comes close (according to Andy). At the same time, one can now consume only the data one wants, thanks of course to del.icio.us, RSS feeds, and better personalization algorithms. Both content production and content consumption are becoming more and more decentralized and democratized. At current rates, storing the data being created would cost about $25M, and that figure should fall. Smaller players can crawl and store the content present on the web. This is great for entrepreneurs, because it means they can match the Big 3 (GOOG, MSFT, YHOO) at least in terms of storage!
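The talk didn't spell out the remaining assumptions behind the 150 PB figure, but a per-year window and a sustained typing rate of ~5 bytes/second (roughly 60 words per minute) reproduce it; both of those are my guesses, not Andy's numbers.

```python
# Back-of-the-envelope bound on human text production, as in the talk.
# Assumptions beyond the talk (mine): a one-year window and ~5 bytes/sec typing.
PEOPLE = 6e9           # world population, circa 2007
HOURS_PER_DAY = 4      # hours each person spends typing, per the talk
BYTES_PER_SEC = 5      # about 60 words per minute of ASCII text

seconds_per_year = HOURS_PER_DAY * 3600 * 365
bytes_per_year = PEOPLE * seconds_per_year * BYTES_PER_SEC
petabytes = bytes_per_year / 1e15
print(f"{petabytes:.0f} PB/year")  # ~158 PB, in the ballpark of the quoted 150 PB
```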
Some of the latest developments in search involve special treatment of specialized domains. For instance, if you search for weather, movies, or flight information, current-generation search engines can figure out the domain of the query and provide a custom UI for the results based on that domain; they might, say, give movie timings at theaters near your ZIP code. This is going to become more ubiquitous as special treatment is added for many more domains. However, I am not sure how many domains can be supported with a hand-crafted, rule-based treatment for each of them. Integrating search results across different types and genres of media (images, video, text), and ranking the combined result set, remain a challenge. We are also going to see more and more social features added as days go by. Crawling and collecting data in the light of new programming models (like Ajax) is going to be a challenge; Andy was not aware of any good solutions to this problem.
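To make my scaling worry concrete, here is a toy version of the kind of rule-based query classifier being described. The trigger words and domain names are invented for illustration; real engines use far richer signals, but even this sketch shows why maintaining one rule set per domain gets unwieldy as domains multiply.

```python
# Illustrative only: a toy rule-based domain classifier for search queries.
# Every new domain needs its own hand-curated trigger set, which is the
# maintenance burden I'm skeptical about.
RULES = {
    "weather": {"weather", "forecast", "temperature"},
    "movies": {"movie", "showtimes", "theater"},
    "flights": {"flight", "airline", "arrival", "departure"},
}

def classify(query: str) -> str:
    tokens = set(query.lower().split())
    for domain, triggers in RULES.items():
        if tokens & triggers:
            return domain    # route to a domain-specific results UI
    return "general"         # fall back to ordinary web results

print(classify("weather in san jose"))    # weather
print(classify("flight status to delhi")) # flights
print(classify("andrew tomkins talk"))    # general
```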
The last part of the talk was a real disappointment. He started talking about some of his recent research on estimating properties of a hidden corpus from the answers to a number of queries. While there is no doubt that the research is worthwhile, this was perhaps the wrong forum to dive into mathematical equations, and so suddenly after talking about general technology. I got the feeling he brought it up just to show that he still does some technical work :-).
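I don't recall the exact method he presented, but one classic flavor of this problem is estimating a corpus's size via capture-recapture: issue two independent queries, treat their result sets as random samples, and use the overlap (the Lincoln-Petersen estimator). The sketch below simulates that idea; it is my illustration, not Andy's technique.

```python
import random

def estimate_corpus_size(corpus_ids, sample_size, seed=0):
    """Lincoln-Petersen capture-recapture estimate of a hidden corpus's size.

    Two independent random samples stand in for the result sets of two
    queries; the size of their overlap reveals the total: N ~ |A|*|B|/|A&B|.
    """
    rng = random.Random(seed)
    a = set(rng.sample(corpus_ids, sample_size))
    b = set(rng.sample(corpus_ids, sample_size))
    overlap = len(a & b)
    if overlap == 0:
        return float("inf")  # no overlap: samples too small to estimate
    return len(a) * len(b) / overlap

hidden = list(range(100_000))  # the "hidden" corpus, unknown to the estimator
print(round(estimate_corpus_size(hidden, 2000)))  # roughly 100,000
```

With 2,000-document samples from a 100,000-document corpus the expected overlap is about 40, so the estimate lands near the true size, give or take sampling noise.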
Overall, the talk didn't meet my expectations. The last one, by Raghu Ramakrishnan, was far better.
[If I have missed out something, please point it out in the comments. Thanks!]