Sunday, February 24, 2008

Back from CI Foo

I got back from Collective Intelligence Foo Camp late last night. As promised, here is a summary of some of the highlights, the major thoughts I took away from the conference.

Tim O'Reilly tends to leave his terms intentionally weakly defined -- as he did with the term "Web 2.0" -- so there were several attempts to come up with a compact understanding of what is and is not "collective intelligence".

The simplest definition came from Kim Rachmeler at the end of the second day:
The network knows what the nodes don't.
This nicely captures the idea that the whole has to be more than the sum of its parts. There is an emergent property to collective intelligence where problems can be solved that might be difficult to solve any other way.

More detailed definitions came from some of the other sessions. One interesting question was whether massively parallelizing a task across many humans -- something that was referred to as Man-Reduce as opposed to Map-Reduce -- is collective intelligence. In the extreme case, each of the tasks could be independent and the system could merely be collating the results. Is that collective intelligence?
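
For concreteness, here is a minimal sketch of that Man-Reduce pattern in Python, where ask_worker is a hypothetical call to a single human (nothing like this was shown at the session; the open question is whether the collate step alone makes it collective intelligence):

from concurrent.futures import ThreadPoolExecutor

def ask_worker(task):
    # Hypothetical call: post one independent task to one human worker
    # (for example through a crowdsourcing service) and wait for the answer.
    raise NotImplementedError

def man_reduce(tasks, collate):
    # "Map" each task to a different person in parallel, then merely
    # collate the answers at the end.
    with ThreadPoolExecutor(max_workers=8) as pool:
        answers = list(pool.map(ask_worker, tasks))
    return collate(answers)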

This led down a path where some asserted that there was nothing in collective intelligence that could not be accomplished, given enough time, by one person. However, a compelling counter-example came up in the form of recommendations (e.g. "Customers who bought X also bought..."). To reproduce that system, a single person would have to somehow become many different people with different tastes and preferences. That is hard to imagine.

There was also an attempt to apply organizational theory to collective intelligence, resulting in a detailed taxonomy that listed several methods of aggregating the individuals such as voting, averaging, weighted voting by reputation, mimicry (also referred to as organizational memory), and clustering/hierarchies. This led to a conclusion that there may be nothing new in collective intelligence other than the fact that communication between individuals is now internet-enabled, allowing it to occur at a scale and speed impossible before, yet that scale and speed does create something new and important.
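
A toy sketch of a few of those aggregation methods (my own illustration, not anything presented at the session):

from collections import Counter

def majority_vote(answers):
    # One participant, one vote.
    return Counter(answers).most_common(1)[0][0]

def average(estimates):
    # Classic wisdom-of-crowds averaging of numeric estimates.
    return sum(estimates) / len(estimates)

def weighted_vote(answers, reputations):
    # Weight each vote by the voter's reputation score.
    totals = Counter()
    for answer, reputation in zip(answers, reputations):
        totals[answer] += reputation
    return totals.most_common(1)[0][0]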

Finally, there was a curious idea of attempting to develop a programming language for human nodes in a network, something that might provide a more rigorous theoretical framework for analyzing what is and is not possible in these systems. One thought here was to start by grounding that programming language in what is possible using MTurk rather than the full range of possible human actions, something that might have immediate practical output as well.

Looking back at the discussion, I like the definition "The network knows what the nodes don't." However, if you accept this definition, it seems you have to accept that the system as a whole can do things that individuals alone cannot. I think that has to lead to the conclusion that systems that merely massively parallelize a task among humans (i.e. "Man-Reduce") do not represent collective intelligence.

One interesting theme that came up in many sessions was a concern about manipulation of systems trying to use collective intelligence. We observed that small systems typically were not subject to attack because there was little value in doing so, but that some big systems (such as Amazon reviews) appeared resistant to attacks due to the amount of effort required to overwhelm the large number of legitimate participants. However, in large systems where there is a "winner-takes-all" (e.g. Digg or Google where gaming the system to get top rank will result in a massive amount of lucrative traffic), the benefits of manipulation can justify even quite costly efforts at spamming.

The discussions on manipulation often led into discussions of reputation. Should one participant in these systems get one vote? Or should trustworthy people or experts be given higher weight? And how do we find those trustworthy people or experts? If we nominate experts using equal votes, have we solved the problem or merely transferred it to a meta level? Some suggested a TrustRank-like method of transferring reputation in a network of participants. Others noted that, since community sites tend to start with a loyal, dedicated audience but get diluted over time, a simple seniority-based system of trust and reputation might be able to preserve that core as the community grows.
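
To make the TrustRank-like suggestion a bit more concrete, here is a minimal sketch of propagating trust from a small hand-vetted seed set through a who-vouches-for-whom graph (a simplified power iteration in the spirit of TrustRank, not the actual algorithm):

def propagate_trust(graph, seeds, damping=0.85, iterations=20):
    # graph: dict mapping each user to the list of users they vouch for.
    # seeds: a small set of hand-vetted, trustworthy users.
    trust = {user: (1.0 / len(seeds) if user in seeds else 0.0) for user in graph}
    for _ in range(iterations):
        new_trust = {user: ((1 - damping) / len(seeds) if user in seeds else 0.0)
                     for user in graph}
        for user, endorsed in graph.items():
            if not endorsed:
                continue
            share = damping * trust[user] / len(endorsed)
            for other in endorsed:
                # Trust flows only from already-trusted users outward.
                new_trust[other] = new_trust.get(other, 0.0) + share
        trust = new_trust
    return trust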

There were many other discussions in and out of sessions, but these were the major themes and questions that stick in my mind. I much enjoyed the many other conversations I had, including the opportunity to talk at length with Hal Varian and Rodney Brooks, as well as the chance to see old friends and new. A great experience overall.

Update: Roger Ehrenberg posts detailed notes ([1] [2]) on some of the CI Foo sessions.

Update: Matthew Hurst posts some initial thoughts from the conference. He also quotes from the Wikipedia page for collective intelligence which defines the term much more broadly than I did.

Update: Adam Siegel and Andrew Turner also post write-ups.

Thursday, February 21, 2008

Yahoo deploys large scale Hadoop cluster

Yahoo's Eric Baldeschwieler reports that Yahoo is now running a 10k+ core Hadoop cluster that holds over 5 petabytes of data.

Very cool. It appears Yahoo has now achieved most of the Hadoop design requirements they laid out back in July 2006.

Note, however, that 10k cores is quite a bit smaller than Google's cluster appears to be. Google used 11k machine years of computation in their cluster in Sept 2007, suggesting that they have at least 133k machines in their clusters (and likely substantially more).
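
The back-of-envelope behind that estimate, assuming all of that computation happened within the single month:

machine_years = 11_000      # computation Google reported using in Sept 2007
months = 1                  # assume the work was done within that one month
machines = machine_years * 12 / months
print(round(machines))      # ~132,000; the slightly higher reported figure gives 133k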

Please see also my July 2006 post, "Yahoo building Google FS clone?", and my April 2007 post, "Yahoo Pig and Google Sawzall".

The madness of a growing crowd

Nick Carr makes an observation about problems that occur in "self-regulating, super-democratic communities" as they grow. An excerpt:
What we've seen happen with self-regulating communities, both real and virtual, is that they go through a brief initial period during which their performance improves - a kind of honeymoon period, when people are on their best behavior and rascals are quickly exposed and put to rout - but then, at some point, their performance turns downward. They begin, naturally, to decay.

Leave them alone long enough, and they're far more likely to collapse than to reach perfection.
As I said before, there is a repeating pattern with community sites. They start with great buzz and joy from an enthusiastic group of early adopters, then fill with crud and crap as they attract a wider, less idealistic, more mainstream audience.

When there is an incentive to abuse a system, people will abuse it. From the start, community systems need to be designed to filter out crap and spam. Without intervention, without tools to surface the good and bury the bad, a sea of noise will drown out any wisdom that could have been found.

Please see also my older post, "Getting the crap out of user-generated content", where I quoted Xeni Jardin as saying, "Openness has its downside: When you invite the whole world to your party, inevitably someone pees in the beer."

Please see also my other posts, "Buying votes on Digg", "Digg struggles with spam", and "Summing collective ignorance".

Clever exploit of DRAM to attack disk encryption

Security guru Ed Felten posts about "Cold Boot Attacks on Disk Encryption", a sideways attack on BitLocker, FileVault, and other disk encryption programs.

Some excerpts from his post:
The root of the problem lies in an unexpected property of today's DRAM memories ... Virtually everybody, including experts, will tell you that DRAM contents are lost when you turn off the power. But this isn't so.

Our research shows that data in DRAM actually fades out gradually over a period of seconds to minutes, enabling an attacker to read the full contents of memory by cutting power and then rebooting into a malicious operating system ... If you cool the DRAM chips, for example by spraying inverted cans of "canned air" dusting spray on them, the chips will retain their contents for much longer.

This is deadly for disk encryption products because they rely on keeping master decryption keys in DRAM. This was thought to be safe because the operating system would keep any malicious programs from accessing the keys in memory, and there was no way to get rid of the operating system without cutting power to the machine, which "everybody knew" would cause the keys to be erased.

Our results show that an attacker can cut power to the computer, then power it back up and boot a malicious operating system (from, say, a thumb drive) that copies the contents of memory ... search through the captured memory contents, find any crypto keys that might be there, and use them to start decrypting hard disk contents.

There seems to be no easy fix for these problems. Fundamentally, disk encryption programs now have nowhere safe to store their keys.
Awesomely clever. Ed links to more details.
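
For a rough feel for the "search through the captured memory contents" step, here is a toy sketch that flags high-entropy regions of a memory dump as candidate key material; the researchers' actual tools are far more precise, looking for the structure of AES key schedules rather than raw randomness:

import math
from collections import Counter

def shannon_entropy(window: bytes) -> float:
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def candidate_key_offsets(image: bytes, window=32, threshold=4.7):
    # Key material looks random; most of RAM does not. Slide a 256-bit
    # window over the dump and report offsets that look nearly random.
    for offset in range(0, len(image) - window + 1, 16):
        if shannon_entropy(image[offset:offset + window]) > threshold:
            yield offset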

Update: In a second post, Ed discusses how easy it is to attack laptops when they are in sleep mode.

Wednesday, February 20, 2008

Going to CI Foo

I am very much looking forward to going to Collective Intelligence Foo Camp this weekend.

CI Foo is a Tim O'Reilly gathering hosted by Google to discuss how we can build systems that use "networked computers and humans working together to solve interesting problems."

It looks to be a remarkable event. The list of people invited includes Hal Varian, Peter Norvig, Rodney Brooks, and Luis von Ahn. Recommender experts such as John Reidl, Eric Horvitz, and Paul Resnick will be there. Brent Smith and Kim Rachmeler are coming from Amazon Personalization. Gary Flake, Blaise Aguera y Arcas, David Gedye, and Matt Hurst are coming from Microsoft Live Labs. Caterina Fake, Joshua Schachter, and many other startup innovators will be attending.

Should be fun! If you see me there, please say hello!

Update: As I did for the last Foo Camp event I got to attend, I will try to post a summary of some of the conversations afterward.

Update: Caterina Fake did not attend, but Larry Page made a brief appearance.

Update: Four days later, I have posted a summary of some of the discussions I was part of at CI Foo.

Friday, February 15, 2008

Ranking using Indiana University's user traffic

Mark Meiss gave a great talk at WSDM 2008 on a fascinating project where they tapped and anonymized all traffic into and out of Indiana University, then analyzed the behavior of search users. Their paper, "Ranking Web Sites with Real User Traffic" (PDF), covers most of the talk and includes additional details.

The work has several surprising conclusions. First, and most importantly, the authors argue that "PageRank is quite a poor predictor of traffic ranks for the most popular portion of the Web." They say this is because the basic assumptions of PageRank simply are not true.

Specifically, PageRank assumes every link from a page is followed with equal probability, but their data shows that "a few [links] carry a disproportionate amount of traffic while most carry very little traffic." When they attempted to compensate for this with a version of PageRank they called Weighted PageRank (where the links are weighted based on click traffic), they found it helped only a little.

This led the authors to conclude that the other two assumptions of PageRank -- that people are equally likely to end their session on (jump from) any particular page and equally likely to start a new session on (jump to) any particular page -- are false and problematic. From the paper:
People are much more likely to jump to a few very popular sites than to the great majority of other sites.

People follow many more links from a few very popular hubs than from the great majority of less popular sites.

Some sites are much more likely to be the starting or ending points of surfing sessions.
Finally, the authors suggest that relevance rank should be informed by click data, but note that "such steps are likely to amplify the search bias toward already popular sites." In the talk, an audience member also noted that such steps may be susceptible to click spam, which is even easier to do than link spam for those wanting to manipulate search results.
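
For reference, here is a minimal sketch of that weighted variant, where a page's rank is split across its outlinks in proportion to observed clicks rather than uniformly (my own simplification, not the paper's exact formulation):

def weighted_pagerank(clicks, damping=0.85, iterations=50):
    # clicks: dict mapping page -> {linked page: observed click count}.
    pages = set(clicks) | {p for links in clicks.values() for p in links}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, links in clicks.items():
            total_clicks = sum(links.values())
            if total_clicks == 0:
                continue
            for target, count in links.items():
                # Split this page's rank by click share, not uniformly.
                new_rank[target] += damping * rank[page] * count / total_clicks
        rank = new_rank
    return rank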

It is worth pointing out, however, as I have done before, that naive PageRank has been under assault by spammers for many years and almost certainly is no longer used by any of the search engines in its original form, at least not without layers upon layers of effort to eliminate link spam and ferret out whatever meaning remains in the chaos that the link graph has become. As compelling as this paper's conclusions are, it could be that their version of PageRank naively followed links so manipulated by link spammers that its relevance rank was thoroughly confused, which would produce the poor correlation between PageRank and highly trafficked sites that they saw.

In addition to the thoughts on PageRank, the paper had several other very interesting results. They noted that only 5% of traffic originated from search hosts, "a surprisingly small fraction." They noted that 54% of traffic "does not have a referrer page, meaning that users type the URL directly, click on a bookmark, or click on a link in e-mail", which is a much higher fraction than one might expect. Finally, they noted strong recency and 24-hour trends in the traffic data, saying that "47% of the clicks at any given time are predicted by the clicks from the previous day at the same time" and that, though the clicks from the previous three hours are a strong predictor of clicks for the current hour, after four hours, "the requests from the previous day yield higher precision and recall."
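
A small sketch of the kind of predictor those last numbers suggest, blending clicks from the last few hours with clicks from the same hour the day before (the weights here are made up for illustration, not taken from the paper):

def predict_clicks(history, url, hour):
    # history: dict mapping (url, hour index) -> observed click count.
    recent = sum(history.get((url, hour - h), 0) for h in (1, 2, 3)) / 3.0
    same_hour_yesterday = history.get((url, hour - 24), 0)
    # Recent hours dominate at first; beyond a few hours, the daily
    # cycle becomes the better signal, per the paper's observation.
    return 0.6 * recent + 0.4 * same_hour_yesterday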

In all, an excellent paper, probably my favorite of the conference. Do not miss it. It is well worth reading.

For more on the thoughts in the paper on using click data for relevance rank, please see also my earlier post, "Actively learning to rank", that discusses an excellent KDD 2007 paper by Filip Radlinski and Thorsten Joachims.

Update: Mark's talk is now available online.

Oren Etzioni at WSDM 2008

University of Washington Professor Oren Etzioni gave a fun and well done keynote talk on the second day of WSDM 2008 titled "Machine Reading at Web Scale".

Oren described his motivation as looking out at what search would be like in 2020. After quoting Alan Kay as saying, "The best way to predict the future is to invent it," Oren argued that "instead of merely returning pages", computers should "read 'em."

Obviously, we are not going to solve the entire natural language understanding problem today (or, for that matter, in the next few decades), but Oren pointed out that we can make much progress just with "some level of understanding" of the pages.

Specifically, Oren discussed some of the ideas from TextRunner, which extracts facts in the form of (noun phrase, relation/verb, noun phrase) tuples. For example, TextRunner might learn "(Tesla, invented, coil transformer)" by scouring web documents. The demo and paper (PDF) have details.
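
A toy illustration of the tuple format, using a single hand-written pattern (nothing like the self-supervised extractor TextRunner actually uses):

import re

RELATIONS = "invented|founded|acquired|discovered"
PATTERN = re.compile(rf"(?P<arg1>[A-Z][\w ]*?)\s+(?P<rel>{RELATIONS})\s+(?P<arg2>[\w ]+)")

def extract(sentence):
    # Return a (noun phrase, relation, noun phrase) tuple, or None.
    match = PATTERN.search(sentence)
    return (match.group("arg1"), match.group("rel"), match.group("arg2")) if match else None

print(extract("Tesla invented the coil transformer."))
# ('Tesla', 'invented', 'the coil transformer')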

Oren then talked about how they achieved what appear to be very high accuracy rates (nearly 90% for concrete facts) using a combination of a voting scheme (called "urns" and detailed in an IJCAI 2005 paper (PDF)) and, when data is too sparse for urns, what he called a "statistical 'type check'" that looks at whether the relationship appears to make sense when compared to similar data (e.g. "Does 'Pinkerton' behave like a mayor" in terms of having similar relations as other entities labeled as a mayor).
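
A crude sketch of the redundancy intuition behind urns: treat each independent sentence supporting a fact as weak evidence and combine them (the real urns model is a proper probabilistic treatment, not this noisy-or shortcut, and 0.5 is an arbitrary number):

def fact_confidence(num_supporting_sentences, p_single_correct=0.5):
    # The fact is wrong only if every independent extraction is wrong.
    return 1 - (1 - p_single_correct) ** num_supporting_sentences

for k in (1, 3, 10):
    print(k, round(fact_confidence(k), 3))   # 0.5, 0.875, 0.999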

Update: Oren's talk is now available online.

Tuesday, February 12, 2008

Hector Garcia-Molina at WSDM 2008

Stanford Professor and search guru Hector Garcia-Molina gave a keynote talk yesterday at WSDM 2008 titled "Web Information Management: Past, Present and Future".

I would post a summary of the talk, but Erik Selberg already did a good write-up, so let me just point people there.

I particularly enjoyed the part of the talk where Hector listed several topics and rank-ordered them in terms of importance and difficulty. The topics were Beyond Search (meaning better understanding of documents and searcher intent), Information Integration (meaning combining and summarizing information extracted from multiple documents), Monetization, Data Mining, Personalization, Mobile, Privacy, and Scaling.

After explicitly saying he was trying to be controversial and incite debate with this ranking, Hector placed Beyond Search and Information Integration as #1 and #2 respectively for importance and #2 and #1 for difficulty.

A few people took the bait and argued for different orderings, but a strong point made by two people was that several of the topics only exist to serve other goals. Specifically, Data Mining, Scaling, and Personalization support Beyond Search and Information Integration by helping us better understand documents and ferret out the subtleties in user intent. As one of the two put it, "Data Mining is Beyond Search."

I also want to note that Hector used an unusually narrow definition of personalization, focusing exclusively on personalized navigation and expressing annoyance at changing navigational links because it makes learning an interface more difficult. While I agree that personalized navigation can be annoying, personalization has many more useful applications, such as focusing your attention on products or information that are likely to be particularly useful. Personalization and recommendations can be part of Beyond Search by helping people find and discover when user intent is unclear or unknown.

Finally, one minor point. Hector at one point said the "low hanging fruit has been taken" in search and advertising. I am not sure this is true. As our tools change, what is low hanging fruit also changes. As Tjalling Koopmans once said:
Your perception of a thing that is a viable problem to think about is shaped by the tool you can use.

Sometimes the solution to important problems ... [is] just waiting for the tool. Once this tool comes, everyone just flips in their head.
As we can more and more easily process massive data sets, as our tools improve, new low hanging fruit will appear.

Update: Hector's talk is now available online.

Saturday, February 09, 2008

Trouble in the land of social networks?

In "Generation MySpace Is Getting Fed Up", Spencer Ante and Catherine Holahan at BusinessWeek report on slowing traffic and low advertising revenues at major social networking sites.

Some excerpts:
Social networking was supposed to be the Next Big Thing on the Internet. MySpace, Facebook, and other sites have been attracting millions of new users, building sprawling sites that companies are banking on to trigger an online advertising boom.

Trouble is, the boom isn't booming anymore ... Many people are spending less time on social networking sites or signing off altogether ... The average amount of time each user spends on social networking sites has fallen by 14% over the last four months ... MySpace, the largest social network, has slipped from a peak of 72 million users in October to 68.9 million in December.

"What you have with social networks is the most overhyped scenario in online advertising," says Tim Vanderhook, CEO of Specific Media, which places ads for customers on a variety of Web sites.

Besides the slowing user growth and declining time spent on these sites, users appear to be growing less responsive to ads, according to several advertisers and online placement firms. If advertisers can't figure out how to reverse these trends, social networking could end up as a niche market in the online ad world .... Social networks have some of the lowest response rates on the Web.
Please see also Mike Masnick's excellent posts over at TechDirt, "Surprise, Surprise, Social Networking Ads Suck" and "Bill Gates Joins The Growing Social Network Exodus".

Please see also my recent posts, "Targeted advertising and Facebook" and "Facebook Beacon attracts disdain, not dollars", which discuss how much harder it is to make advertising effective on social networking sites because of the lack of commercial intent.

For the opposite point of view, you might be interested in what Josh Bernoff at Forrester Research has to say in his post, "Why Social Applications Will Thrive In A Recession".

Update: One month later, Bob Cringely writes:
I am beginning to think that Internet social networking is another CB radio, destined to crash and burn.

Social networking has a lot of problems as both a business and a cultural phenomenon. To start with there is generally no true business model ... Most are vying simply for eyeballs and hoping for Google ads to pay the bills until Time Warner or News Corp make them an acquisition offer they can't refuse. That might be okay for Facebook or MySpace and maybe Linked-In, but there are more than 350 general-purpose social networks out there and I will guarantee you that no more than 5 percent of those will be still operating two years from today.

It's not that I don't see value to social networks, it's that I generally don't see ENOUGH value.

What will likely happen to social networking is that some applications will survive on a more modest basis than now ... [and] 70 percent or so of most social networking functionality -- the really useful functionality -- will be sucked into the dominant portal/search/e-mail/chat/social networks like MSN and Yahoo.
Update: Six weeks later, The Economist writes, "Social networking will become a ubiquitous feature of online life. That does not mean it is a business."

Ask's BigNews and a PageRank for news

Ask launched a new news site called BigNews. Stories are ranked by importance (aka "BigFactor") which is determined in part by "monitoring mentions in articles, multimedia and blogs" and "the level of [Web] discussions and which ones are making the most noise."

Sounds a bit like PageRank in that it weighs documents by their mentions, but focuses on the leading edge of the Web, on recent documents and new links between documents.

But it is unclear if BigNews builds a link graph and does the full iterative computation or not. For example, when a document that has high BigFactor mentions another article, does it count for more (that is, does some of the BigFactor from the first transfer to the second)? It probably should.
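
If it did do the full iterative computation, it might look something like the sketch below: PageRank-style propagation over a graph of recent mentions, with a freshness decay for older stories. This is entirely my speculation about BigFactor, not Ask's actual method.

def big_factor(mentions, age_hours, damping=0.85, half_life=24.0, iterations=30):
    # mentions: dict mapping story -> list of stories it mentions.
    # age_hours: dict mapping story -> hours since publication.
    stories = set(mentions) | {s for targets in mentions.values() for s in targets}
    freshness = {s: 0.5 ** (age_hours.get(s, 0.0) / half_life) for s in stories}
    score = dict(freshness)
    for _ in range(iterations):
        new_score = {s: (1 - damping) * freshness[s] for s in stories}
        for story, targets in mentions.items():
            if not targets:
                continue
            share = damping * score[story] / len(targets)
            for target in targets:
                # A mention from a high-scoring story transfers some of
                # its score, so it counts for more.
                new_score[target] += share
        score = new_score
    return score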

It is also unclear how new this is. Googler Krishna Bharat said that Google News determines the importance of news articles, in part, by the popularity of the article and the PageRank of the source. While he did not say so explicitly, I would find it surprising if their measure of the popularity of an article did not include both clicks on the article and incoming links to the article.

Interesting even so. BigNews is another example of automating a front page of news, allowing the page to change rapidly in response to new articles and important trends. The next step may be to make BigFactor ranks unique to an individual -- reflecting individual tastes, preferences, and interests -- so every reader sees a different front page of news.

[Found via Vanessa Fox and Philipp Lenssen]

Friday, February 08, 2008

Going to WSDM 2008

I will be at WSDM 2008, a conference on web search and data mining, at Stanford University early next week.

Looks like a great conference! Please say hello if you bump into me there!

Please see also my earlier post, "Papers from WSDM 2008 on click position bias and social bookmark data", that discusses a couple of the papers from the conference.

Update: Most of the WSDM 2008 talks are now available online.

Friday, February 01, 2008

The brain as a spam filter

Over at the Scientific American Mind Matters blog, Andrew McCollough and Ed Vogel from U of Oregon posted a thought-provoking article, "Working Memory: They Found Your Brain's Spam Filter".

Some excerpts:
Our mental 'inbox' of working memory is ... constrained ... Several decades of research have indicated that our capacity to hold information "in mind" for immediate use is limited to a mere three or four items.

There are at least two primary explanations for this severe limitation in working memory capacity. First, it could be that working memory capacity is essentially determined by storage space, and that some people have larger "hard drives" than others do.

The alternative explanation is that capacity depends not on the amount of storage space but on how efficiently that space is used. Thus high-capacity individuals might simply be better at keeping irrelevant information out of mind, whereas low capacity individuals may allow more irrelevant information to clutter up the mental inbox. High-capacity individuals may just have better spam filters.

Some of our recent work has provided evidence favoring this mental spam filtering idea. In one experiment... high-capacity people were excellent at controlling what information was represented in working memory: They let in information about relevant objects but completely filtered out irrelevant objects. Low-capacity individuals, by contrast, had much weaker control over what information entered the mental "in box"; they let in information about both relevant and irrelevant objects roughly equally.

Surprisingly, these results mean that we found that low capacity people were actually holding more total information in mind than high capacity individuals were -- but much of the information they held was irrelevant to the task.
More information is not better. To be most productive on a task, we want to maximize our ability to filter for relevant information, not maximize our ability to acquire information.

This result is not surprising, but it does nicely frame the problem for those of us working in information retrieval. Not only is precision more important than recall, not only should we help people filter data and focus their attention, but we may also want to explicitly help people form and retain a working set of knowledge to apply to their task.

For example, search engines are increasing their support for re-finding, helping people remember and return to information found in the past. This usually is done as a web history, not an explicit representation of a past working set of knowledge, but it does help people build a working set, get back to items that may have dropped out of their working set, and swap back in the working set for an old task to which they are returning.

More generally, I think it is useful to think of searchers as skimming and filtering the information on pages as they try to build a small set of relevant information for their task. This may suggest methods we might consider to filter and help focus attention, such as highlighting parts of a page that are particularly likely to be useful, explicitly attempting to determine what may be distracting on a page for specific types of tasks and reducing those distractions, and carrying information and history across pages as people work on their tasks.
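
One way to play with that idea in code: a small, fixed-capacity working set that only admits items above a relevance threshold and evicts the least relevant item when full. This is a toy model of the "spam filter" analogy above, not any real system.

class WorkingSet:
    def __init__(self, capacity=4, relevance_threshold=0.5):
        # A capacity of three or four items, matching the research above.
        self.capacity = capacity
        self.relevance_threshold = relevance_threshold
        self.items = {}   # item -> relevance score

    def offer(self, item, relevance):
        if relevance < self.relevance_threshold:
            return False   # filter out the irrelevant, like spam
        if len(self.items) >= self.capacity:
            weakest = min(self.items, key=self.items.get)
            if self.items[weakest] >= relevance:
                return False
            del self.items[weakest]
        self.items[item] = relevance
        return True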

[SciAm article found via Mark Thoma]

Marissa Mayer on social search

Doug Sherrets at VentureBeat posts a great interview with Googler Marissa Mayer on social search. An excerpt of my favorite part:
VB: How would you implement social search?

MM: One thing we have tried ... is labeling -- have users annotate the search results they see and have those annotations be shared with people on their social network or with people of like mind and interest ... There have been a few topical areas that have had a lot of traction, but overall the annotation model needs to evolve.

Another classic thing to try is "other users like you," where you build implicit social connections between users who are like each other, much like Amazon does with books. Examples: "Other users like you also searched for" or "other people who did this search also did searches."

I think there is the possibility of taking a social network and combining some element of annotations and searches done. For example, if I have 400 friends on Facebook, and I knew 10 of them all searched for one topic today, that might interest me.

If we took Web History and allowed that data to influence rankings, such that pages that your friends have visited were now bumped up in your search ranking, that that might be a good augmentation to something like personalized search. In essence, it's a fusion of personalized and social search ... We have not done anything like that to date.
I would post my thoughts on this, but Danny Sullivan has already done most of my work for me, providing excellent commentary on Marissa's interview. Some excerpts from Danny's article:
Marissa talked about the existing Google Co-Op service being a success ... in terms of [tagging]. Personally, Co-Op seems to have largely failed to catch on, from what I've seen. We also know that Yahoo's experiment with tagging search results didn't go very far.

Letting people in your network know what you are looking for raises serious privacy issues.

Marissa raised the idea of Amazon-like recommendations [for web search]. To some degree, Google already does this through its personalized search feature ... Yahoo (among others) had thoughts of doing exactly the same thing years ago but never moved forward with it.
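
For what it is worth, the Web History idea Marissa describes could be sketched as a simple re-ranking step that boosts results people in your network have visited. This is my own toy illustration, not anything Google has said it does.

def rerank_with_friends(results, friend_visits, boost=0.3):
    # results: list of (url, base score) pairs from the normal ranker.
    # friend_visits: dict mapping url -> number of friends who visited it.
    def social_score(result):
        url, base_score = result
        return base_score * (1 + boost * friend_visits.get(url, 0))
    return sorted(results, key=social_score, reverse=True)
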
For more on tagging and web search, please see also Danny Sullivan's older post, "Tagging Not Likely The Killer Solution For Search" and a post from Sun Researcher Stephen Green, "Tags, keywords, and inconsistency".

For more on personalized web search, please see also my older posts, "Effectiveness of personalized search" and "Potential of web search personalization".