Monday, December 24, 2007

Papers from WSDM 2008 on click position bias and social bookmark data

The WSDM conference is being held Feb 11-12 at Stanford University. I am not sure I will make it down from Seattle for it, but, if you are in the SF Bay Area and are interested in search and data mining on the Web, it is an easy one to attend.

Most of the papers for the conference do not appear to be publicly available yet, but, of the ones I could find, I wanted to highlight two of them.

Microsoft researchers Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey wrote "An Experimental Comparison of Click Position-Bias Models" (PDF) for WSDM 2008. The work looks at models for how "the probability of click is influenced by ... [the] position in the results page".

The basic problem here is that just putting a search result high on the page tends to get it more clicks even if that search result is less relevant than ones below it. If you are trying to learn which results are relevant by looking at which ones get the most clicks, you need to model and then attempt to remove the position bias.

The authors conclude that a "cascade model", which assumes "that the user views search results from top to bottom, deciding whether to click each result before moving to the next", most closely fits searcher click behavior near the top of the search results. However, their "baseline model" -- which assumes "users look at all results and consider each on its merits, then decide which results to click" (that is, position does not matter) -- seemed most accurate for items lower in the search results.
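
To make the contrast concrete, here is a small sketch (my own illustration, not the authors' code) of the click probabilities each model assigns to a list of results; the relevance numbers are invented:

    # Minimal sketch contrasting the two models (my illustration, not the
    # paper's code). `relevance` holds each result's attractiveness in [0, 1];
    # the numbers below are made up for the example.

    def baseline_click_probs(relevance):
        # Baseline model: position does not matter, each result is clicked
        # according to its own relevance.
        return list(relevance)

    def cascade_click_probs(relevance):
        # Cascade model: the user scans top to bottom and stops at the first
        # click, so a result is only reached if nothing above it was clicked.
        probs, p_reach = [], 1.0
        for r in relevance:
            probs.append(p_reach * r)
            p_reach *= (1.0 - r)
        return probs

    relevance = [0.4, 0.4, 0.4]             # identical results in positions 1-3
    print(baseline_click_probs(relevance))  # [0.4, 0.4, 0.4]
    print(cascade_click_probs(relevance))   # [0.4, 0.24, 0.144] -- strong position bias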

The authors say this suggests there may be "two modes of results viewing": one where searchers click the first thing that looks relevant in the top results, and another where, failing to find anything good there, they shift to scanning all the results before clicking anything.

By the way, if you like this paper, don't miss Radlinski & Joachims' work on learning relevance rank from clickstream data. It not only discusses positional bias in click behavior in search results, but also attempts the next and much more ambitious step of optimizing relevance rank by learning from click behavior. The Craswell et al. WSDM 2008 paper does cite some older work by Joachims and Radlinski, but not this fun and more recent KDD 2007 paper.

The second paper I wanted to point out is by Paul Heymann, Georgia Koutrika and Hector Garcia-Molina at Stanford, "Can Social Bookmarks Improve Web Search?" I was not able to find that paper, but a slightly older tech report (PDF) with the same title is available (found via ResourceShelf). This paper looks at whether data from social bookmark sites like del.icio.us can help us improve Web search.

This is a question that has been subject to much speculation over the last few years. On the one hand, social bookmark data may provide high quality labels on web pages because the labels tend to be created by people for their own use (to help with re-finding). On the other hand, manually labeling the Web is a gargantuan task, and it is unclear whether the data is substantially different from what we can extract automatically.

Unfortunately, as promising as social bookmarking data might seem, the authors conclude that it is not likely to be useful for Web search. While they generally find the data to be of high quality, they say the data only covers about 0.1% of the Web, only a small fraction of those are not already crawled by search engines, and the tags in social bookmarking data almost always are "obvious in context" and "would be discovered by a search engine." Because of this, the social bookmarking data "are unlikely to be numerous enough to impact the crawl ordering of a major search engine, and the tags produced are unlikely to be much more useful than a full text search emphasizing page titles."

Update: It looks like I will be attending the WSDM 2008 conference after all. If you are going, please say hello if you see me!

Friday, December 21, 2007

Interactive machine learning talk

Dan Olsen at BYU gave a talk, "Interactive Machine Learning", at UW CS a couple months back.

Dan's group is doing some clever work that combines machine learning and HCI. The UW CS talk is good but long. If you are short on time, first take a look at the fun short demo videos Dan's group produced.

I particularly recommend seeing the clever "Screen Crayons" (WMV) application for annotating documents and the "Teaching Robots to Drive" (WMV) demo of an intuitive interface for training a robot car. The "Image Processing with Crayons" (WMV) demo is also good for getting a quick introduction to the core idea.

Tuesday, December 18, 2007

Findory turns off the lights

Findory turned off its last webserver today. Sadness.

Previous posts ([1] [2] [3] [4] [5] [6]) have more details on the shutdown and Findory's history.

Monday, December 17, 2007

Microsoft and intelligent agents on the desktop

John Markoff quotes Microsoft Chief Research Officer Craig Mundie on the opportunity that multicore processors on our desktops create for AI and personalized software agents:
In the future, Mr. Mundie said, parallel software will take on tasks that make the computer increasingly act as an intelligent personal assistant.

"My machine overnight could process my in-box, analyze which ones were probably the most important, but it could go a step further," he said. "It could interpret some of them, it could look at whether I've ever corresponded with these people, it could determine the semantic context, it could draft three possible replies. And when I came in in the morning, it would say, hey, I looked at these messages, these are the ones you probably care about, you probably want to do this for these guys, and just click yes and I'll finish the appointment."
Craig had more extensive thoughts at the July 2007 Microsoft Analyst meeting on personalized assistants running on your desktop PC.

Eric Enge interviews Sep Kamvar

Eric Enge posted an interview with Google personalization guru Sep Kamvar.

Some highlights of Sep's answers below:
The two signals that we use right now are the search history and the location. We constantly experiment with other signals, but the two signals that have worked best for us are location and search history.

Some signals that you expect would be good signals, turn out not to be that good. So for example, we did one experiment with Orkut, and we tried to personalize search results based on the community that users had joined. It turns out that while people were interested in the Orkut communities, they didn't necessarily search in line with those Orkut communities.

It actually harkened back to another experiment that we did, where in our first data launch of personalized search we allowed everybody to just check off categories that were of interest to them. People did that, and people would check off categories like literature. Well, they were interested in literature, but they actually didn't do any searching in literature. So, what we thought would be a very clean signal, actually turned out to be a noisy signal.

When I think about what I am interested in, I don't necessarily think about what I am interested in that I search for and what I am interested in that I don't search for. That's something that we found was better learned algorithmically rather than directly.

A signal should be very closely aligned with search and what you are searching for in order for it to be useful to personalizing search ... In addition, we've found that your more recent searches are much more important than searches from a long time ago.
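
Sep does not describe how Google weights the history, but one common way to make recent searches count for more is a simple exponential decay over the search history. A toy sketch, under that assumption and with an arbitrary half-life:

    import math
    import time

    # Toy sketch of recency weighting for a search history (my illustration;
    # the interview gives no details of Google's actual scheme). Older searches
    # contribute less to the interest profile via exponential decay.

    def interest_profile(history, now=None, half_life_days=7.0):
        """history: list of (timestamp_in_seconds, topic) pairs."""
        now = now if now is not None else time.time()
        decay = math.log(2) / (half_life_days * 86400)
        profile = {}
        for ts, topic in history:
            weight = math.exp(-decay * (now - ts))
            profile[topic] = profile.get(topic, 0.0) + weight
        return profile
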
For more on the problems with explicitly extracting preferences -- as Sep found when explicitly asking for each user's category interests in an early version of Google Personalized Search -- please see my post, "Explicit vs. implicit data for news personalization", and the links from that post.

For more on trying to use signals not closely aligned with search, please see my earlier post, "Personalizing search using your desktop files".

For more on the importance of focusing on recent searches for personalized search, please see also my past posts, "The effectiveness of personalized search" and "The many paths of personalization".

Thursday, December 13, 2007

Geoffrey Hinton on the next generation of NNets

AI guru Geoffrey Hinton recently gave a brilliant Google engEdu talk, "The Next Generation of Neural Networks".

If you have any interest in neural networks (or, like me, got frustrated and lost all interest in the mid-1990s), set aside an hour and watch the talk. It is well worth it.

The talk starts with a short description of the history of neural networks, focusing on the frustrations encountered, and then presents Boltzmann machines as a solution.

Geoffrey clearly is motivated by trying to imitate the "model the brain could be using." For example, after fixing the output of a model to ask it to "think" of the digit 2, he enthusiastically describes the model as his "baby", the internal activity of one of the models as its "brain state", and the output of different forms of digits it recognizes as a 2 as "what is going on in its mind."

The talk is also full of enjoyably opinionated lines, such as when Geoffrey introduces alternating Gibbs sampling as a learning method and says, "I figured out how to make this algorithm go 100,000 times faster," adding, with a wink, "The way you do it is instead of running for [many] steps, you run for one step." Or when he dismissively calls support vector machines "a very clever type of perceptron." Or when he criticizes locality sensitive hashing as being "50 times slower [with] ... worse precision-recall curves" than a model he built for finding similar documents. Or when he said, "I have a very good Dutch student who has the property that he doesn't believe a word I say" when talking about how his group is verifying his claim about the number of hidden unit layers that works best.
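
The "one step" trick is contrastive divergence. A minimal CD-1 update for a binary restricted Boltzmann machine, written as my own toy sketch rather than anything from the talk, looks roughly like this:

    import numpy as np

    # Minimal contrastive-divergence (CD-1) sketch for a binary RBM -- my own
    # toy illustration of the "run for one step" idea, not Hinton's code.
    # Shapes: data is (n_examples, n_visible); W is (n_visible, n_hidden).

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_update(W, b_vis, b_hid, v0, lr=0.1, rng=np.random):
        # Positive phase: hidden probabilities given the data.
        h0_prob = sigmoid(np.dot(v0, W) + b_hid)
        h0_sample = (rng.rand(*h0_prob.shape) < h0_prob).astype(float)
        # One step of alternating Gibbs sampling: reconstruct the visibles,
        # then re-infer the hiddens from the reconstruction.
        v1_prob = sigmoid(np.dot(h0_sample, W.T) + b_vis)
        h1_prob = sigmoid(np.dot(v1_prob, W) + b_hid)
        # Approximate gradient: <v h>_data - <v h>_reconstruction.
        n = v0.shape[0]
        W += lr * (np.dot(v0.T, h0_prob) - np.dot(v1_prob.T, h1_prob)) / n
        b_vis += lr * (v0 - v1_prob).mean(axis=0)
        b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
        return W, b_vis, b_hid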

You really should set aside an hour and watch the whole thing but, if you can't spend that much time or can't take the level of detail, don't miss the history of NNets in the first couple minutes, the description and demo of one of the digit recognition models starting at 18:00 (slide 19), and the discussion of finding related documents using these models starting at 31:40 (slide 28).

On document similarity, it was interesting that a Googler asked a question about finding similar news articles in the Q&A at the end of the talk. The question was about dealing with substantial changes in the types of documents you see -- big news events, presumably, that cause a bunch of new kinds of articles to enter the system -- and Geoffrey addressed it by saying that small drifts could be handled incrementally, but very large changes would require regenerating the model.

In addition to handwriting recognition and document similarity, some in Hinton's group have done quite well using these models in the Netflix contest for movie recommendations (PDF of ICML 2007 paper).

On a lighter note, that is the back of Peter Norvig's head that we see at the bottom of the screen for most of the video. We get two AI gurus for the price of one in this talk.

BellKor ensemble for recommendations

The BellKor team won the progress prize in the Netflix recommender contest. They have published a few papers ([1] [2] [3]) on their ensemble approach that won the prize.

The first of those papers is particularly interesting for the feel it gives of the do-whatever-it-takes method they used. Their solution consisted of tuning and "blending 107 individual results .... [using] linear regression."
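
Mechanically, that kind of blend is straightforward: treat each individual model's predictions on a held-out probe set as a feature and fit a linear regression against the true ratings. A rough sketch of the idea (not BellKor's actual pipeline):

    import numpy as np

    # Rough sketch of linear blending (illustrative only). Each column of
    # `preds` holds one individual model's predicted ratings on a held-out
    # probe set; `truth` holds the actual ratings. The blend weights come from
    # ordinary least squares, and new predictions are a weighted sum.

    def fit_blend(preds, truth):
        X = np.column_stack([np.ones(len(truth)), preds])  # intercept + model outputs
        weights, _, _, _ = np.linalg.lstsq(X, truth, rcond=None)
        return weights

    def blend(preds, weights):
        X = np.column_stack([np.ones(preds.shape[0]), preds])
        return X.dot(weights)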

This work is impressive and BellKor deserves kudos for winning the prize, but I have to say that I feel a little queasy reading this paper. It strikes me that this type of ensemble method is difficult to explain, makes it hard to understand why it works, and is likely to be subject to overfitting.

I suspect not only will it be difficult to know how to apply the results of this work to different recommendation problems, but that it may even require redoing most of the team's tuning effort so far if we merely swap in a different sample of the Netflix rating data for our training set. That seems unsatisfying to me.

It probably is unsatisfying to Netflix as well. Participants may be overfitting to the strict letter of this contest. Netflix may find that the winning algorithm actually is quite poor at the task at hand -- recommending movies to Netflix customers -- because it is overoptimized to this particular contest data and the particular success metric of this contest.

In any case, you have to admire the Bellkor team's tenacity. The papers are worth a read, both for seeing what they did and for their thoughts on all the different techniques they tried.

Sunday, December 09, 2007

Facebook Beacon attracts disdain, not dollars

I have been watching the uproar over Facebook Beacon over the last couple weeks with some amusement.

The system was intended to aggregate purchase histories from some online retailers, a poorly thought out attempt to deal with the lack of purchase intent that makes it difficult for Facebook to generate much revenue from advertising.

Om Malik calls ([1] [2] [3]) Facebook Beacon a "privacy nightmare", a "fiasco", and a "major PR disaster". He goes on to write that, even after the latest changes, "I don't think it's easy to trust Facebook to do the right thing."

Dare Obasanjo says "Facebook Beacon is unfixable" because "affiliate sites are pretty much dumping their entire customer database into Facebook ... without their customers permission" and accuses Facebook of "violations of user privacy to make a quick buck."

Eric Eldon at VentureBeat writes that Facebook is dealing with "a revolt against the feature, because it sends messages to your friends about your purchase and other online behavior .... whether or not you are logged in to Facebook and whether or not you have approved any data sharing."

The NYT quotes one Facebook user as saying, "Just because I belong to Facebook, do I now have to be careful about everything else I do on the Internet?" and quotes another as saying, "I feel like my trust in Facebook has been violated."

Facebook faces a hard challenge here. Facebook users are not coming to Facebook thinking of buying things. Because of this lack of commercial intent, most advertisements are likely to be perceived as irrelevant and useless to Facebook users. That will lead to low value to advertisers and low revenues for Facebook.

So, Facebook will struggle desperately to get the revenues promised by their $15B valuation, doing things that almost certainly will annoy and anger their users.

However, Facebook's success is largely due to riding a fad. People flock to whatever social networking site all their friends are using. It wasn't always Facebook. MySpace was once considered by some crowds to be the place to be.

I wonder if all the annoying things that Facebook starts to do with its advertising will be what makes Facebook become uncool. It may be what makes Facebook's fickle audience want to find some new place to hang, somewhere that doesn't suck, and makes Facebook become yesterday's news.

One Wikipedia to rule them all

John Battelle notes a study that reports:
In December 2005 ... 2% of the [top] links proposed by Google and 4% of those proposed by Yahoo came from Wikipedia.

Today 27% of Google's results on the first link alone come from Wikipedia, as do 31% of Yahoo's.
Nick Carr once wrote of this trend, saying:
Could it be that, counter to our expectations, the natural dynamic of the web will lead to less diversity in information sources rather than more?

And could it be that Wikipedia will end up being Google's most formidable competitor? After all, if Google simply points you to Wikipedia, why bother with the middleman?
Update: A week later, the NYT reports that Google started testing a "Wikipedia competitor" called Knol. Google VP Udi Manber "said the goal of Knol was to cover all topics, from science to medicine to history, and for the articles to become 'the first thing someone who searches for this topic for the first time will want to read.'"

Google shared storage and GDrive

The WSJ reports that "a service that would let users store on its computers essentially all of the files they might keep .... could be released [by Google] as early as a few months from now."

Many others have launched products that provide limited storage on the cloud, including AOL, Microsoft, and Yahoo, but Google appears to have an unusual take on it, backing up all of a user's files and providing search over them.

One way to view this might be as an extension of the "search across computers" feature of Google Desktop Search. That feature already allows "you to search your home computer from your work computer" by copying an index of the "files that you've been working with recently" and "your web history" to Google, but it is limited to only a fraction of your desktop files.

From the sound of the WSJ article -- "a service that would let users store on its computers essentially all of [their] files" -- Google not only will let Google Desktop Search index all of your files, but also will copy the original files to the Google cloud.

Please see also my older posts ([1] [2]) on Google's GDrive.

Please see also Philipp Lenssen's post of screen shots from a leaked copy of a version of GDrive (codenamed Platypus) that is only available inside of Google. It apparently replicates and synchronizes all your files across multiple machines.

Friday, December 07, 2007

Google Reader feed recommendations

Googler Steve Goldberg announces that Google Reader has launched feed recommendations based on "what other feeds you subscribe to, as well as your Web History data."

My Google Reader recommendations are quite good. My top recommendations are Seattlest, Slog (from The Stranger, a Seattle free newspaper), Natural Language Processing Blog, Machine Learning etc, and Daniel Lemire's blog.

Pretty hard to complain about those. Nice work, Nitin Shantharam and Olga Stroilova.

It is worth noting that Bloglines has had recommendations for some time, but their recommendations suffer badly from a popularity bias. For example, my top Bloglines recommendations are Slashdot, Dilbert, NYT Technology, Gizmodo, and BBC World News.

[Found via Sriram Krishnan]

Saturday, November 17, 2007

Who cares about grandma?

Jeremy Crane at Compete.com reports that
The top 1% of searchers performs a full 13% of all searches in a given month.

If you extend this to the top 20% the number of queries increase to roughly 70%.
John Battelle argues that the focus on making things easy for common users -- make it work for grandma, as I frequently advocate -- could be misguided.

Since power users make the majority of searches, a search engine that targets power users could attract a majority of searches without attracting the majority of visitors.

However, there is some debate in the comments to John's post about whether Jeremy measured the right thing. What matters most is not the number of searchers, but the ad revenue from those searches.

It is unclear whether these power users who are making the majority of searches are actually the most profitable visitors. Some commenters argue, only anecdotally, that power users may be the least likely to click on ads.

It is an interesting question and one that begs for hard data. Does anyone know if 20% of searchers generate 70% of advertising revenue? Is that 20% the same 20% that does 70% of the searches? Alternatively, is there a negative relationship between the number of searches performed by a user and the average revenue per search?
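
With access to query and ad logs, the check itself is simple arithmetic. A sketch of the kind of computation I have in mind, using hypothetical per-user totals:

    # Sketch of the concentration check the questions above ask for, using
    # hypothetical per-user totals. Given (searches, ad_revenue) per user,
    # compute what share of revenue comes from the top fraction of users
    # ranked by search volume.

    def revenue_share_of_top_searchers(users, top_fraction=0.2):
        """users: list of (num_searches, ad_revenue) tuples."""
        ranked = sorted(users, key=lambda u: u[0], reverse=True)
        cutoff = max(1, int(len(ranked) * top_fraction))
        total_revenue = sum(r for _, r in ranked) or 1.0
        top_revenue = sum(r for _, r in ranked[:cutoff])
        return top_revenue / total_revenue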

Update: At least for banner advertising, AOL EVP Dave Morgan apparently has some data, writing:
Ninety-nine percent of Web users do not click on ads on a monthly basis. Of the 1% that do, most only click once a month. Less than two tenths of one percent click more often. That tiny percentage makes up the vast majority of banner ad clicks.

Who are these “heavy clickers”? They are predominantly female ... older ... [and] Midwesterners ... They look at sweepstakes far more than any other kind of content. Yes, these are the same people that tend to open direct mail and love to talk to telemarketers.

What does all of this mean? It means that while clickers may be valuable audiences, they are by no means representative of the Web at large.
[Found via Danah Boyd via Jeremy Pickens]

Personalizing the newspaper

I do not agree with everything AP CEO Tom Curley said in his Nov. 1 speech, but I am going to shamelessly pick out the parts I agree with in my excerpts below.

First, some excerpts on personalized news:
The perfect paper or newscast is becoming possible -- at least in the reader's or viewer's eyes. What is it you really want to know? We can personalize content now.

We’re not stuck on those 15-ton behemoths that miraculously manufacture a one-size-fits-all package over several hours that gets delivered over even more hours at great cost or captive of a 22-minute time slot engineered to reach a vast range of content tastes.
The economies of scale with mass production of print newspapers or television broadcasts are much smaller on the Web. On the Web, we have the opportunity to print a different newspaper for each reader, giving each reader a personalized front page.

Next, some excerpts on personalized advertising:
The structure for advertising is changing from mass to targeted.

When you drop a cookie on someone in the digital space, the ads you serve that viewer become up to 200 times more valuable ... The future is about serving ads to people, not to pages or programs.
Offline, we have no opportunity to show different advertisements to different people. The newsprint page, the TV broadcast, the billboard, all are static. Online, we can identify each viewer of the ad space and show something that is likely to be relevant (and maybe even helpful) to that viewer.

Just like Amazon shows a different page to each user -- a store for every customer -- newspapers should build a different page for each reader. Newspapers have gone far too long trying to apply the old static offline model to the online world.

Please see also my Nov 2004 post, "It's the content itself", on a much older speech by Tom Curley calling for personalized news.

[Tom Curley speech found via TechDirt]

Show advertising people might want

Tim Lee at TechDirt has a great post about personalization of television advertising:
As the quality of Google ads got better, users started to discover that Google ads were actually useful and relevant, and they got in the habit of looking at them.

Ads are content, and they're a lot more effective if they contain information people actually want.

With a little ingenuity, TV networks could be using devices like TiVo the same way Google uses click-through statistics: as a way to gather data on user attitudes toward different ads.

Display a different set of ads to each viewer, with the ads chosen based on the individual viewer's show-watching and ad-skipping history as well as some basic demographic characteristics. For example, users who frequently skip car ads would be shown fewer car ads. Viewers under 40 would never be shown ads for adult diapers, and all-male households would never be shown ads for feminine hygiene products.

Viewers would find ads more useful and less irritating, while advertisers would be willing to pay higher rates for ads that were precisely targeted at relevant subgroups ... Show users ads they actually find entertaining and useful.
We are all bombarded by advertising in our daily lives. Junk mail, ads in magazines, TV ads, it is all ineffective mass market noise pummeling us with things we don't want. It is a useless waste of time, a missed opportunity to capture a fleeting glimpse of attention.

Things change if we view advertising as content. Advertising can be useful information about products and services we actually want. The advertisements we see should be helpful and interesting, not annoying and irrelevant.

Personalizing advertising -- targeting advertising to individual interests -- can make advertisements relevant, useful, and helpful. By learning from what each person likes and does not like, personalized advertising can use that fleeting glimpse of our attention to show us something we actually might need.

Please see also my March 2005 post, "Personalized TV advertising", and my Nov 2005 post, "Is personalized advertising evil?"

Thursday, November 08, 2007

Microsoft Dryad and Google MapReduce

Michael Isard at Microsoft Research gave a Google Tech Talk, "Dryad: A general-purpose distributed execution platform".

Dryad might be considered Microsoft's answer to Google's MapReduce. Both are distributed programming platforms designed for large scale computation over massive amounts of data. Michael directly points to MapReduce in his abstract, and the talk focuses on the "more general computations" Dryad supports.

Do not miss the Q&A at the end of the talk (starting at 43:10) with the Google engineers poking at the scalability and reliability of parts of Dryad.

Rick Dalzell retires

Amazon CIO Rick Dalzell retired after ten years at the company.

It was a great pleasure working with Rick during the early days of Amazon. As Jeff Bezos said, "He has been a coach and mentor to many of us."

Thank you, Rick, and congratulations on your retirement!

Tuesday, November 06, 2007

UW CS professors to lead Google Fremont office

John Cook at the Seattle PI has the scoop that UW CS professors Brian Bershad and Craig Chambers "are joining Google's new development office in Seattle's Fremont neighborhood" and that "Bershad will lead the new Fremont office."

I have to say, this is quite the coup for Google. Shiva Shivakumar at Google Kirkland is quoted in the article as saying, "These two are spectacular", an opinion that matches my own experience with Brian and Craig during my years at UW CS.

By the way, I first heard rumors of this from Erik Selberg. Very cool to see those rumors confirmed.

Update: John Cook at the Seattle PI posts details of his tour of the Google Fremont office.

Friday, November 02, 2007

Targeted advertising and Facebook

Stefanie Olsen at CNet asks, "Can Facebook feed its ad brains?", in an article that describes some of the challenges Facebook will have as it attempts to drive substantial revenue from its website.

Some excerpts:
Facebook must figure out how to serve the right ads to the right people in real time.

That's no small task. In fact, it's a massive computing problem and one that very few companies apart from Google and Amazon have mastered.

Now Facebook must figure out how to take billions of data points about its members and turn that into an automatic ad machine .... With advertising, it's all about matching the right person to the right ad.

Machine learning in online advertising might involve trying many different techniques on affinity groups to figure out which work best ... No one obvious technique is the silver bullet for social networks -- no one has solved the problem of serving ads in that setting before.
I am quoted a couple times in the article arguing that purchase intent is weak on Facebook. People do not come to Facebook on a mission to buy a product. This makes it particularly challenging to get Facebook users to perceive the advertising not as annoying but as useful, which is almost certainly a necessity for their ad platform to generate the unusually high revenues that appear to be expected in their $15B valuation.

For more on advertising and dealing with weak purchase intent, please see my earlier post, "What to advertise when there is no commercial intent?".

Update: Eric Eldon posts details on what Facebook has launched so far, saying:
Users will be able to add advertisers as friends ... [and] click on ... buttons ... [that] share the advertising information with their friends on Facebook ... This sounds far-fetched as a successful advertising strategy.

Facebook ... will also run what it is calling SocialAds, which take social actions from your friends -- such as a movie rental -- and combine that information with an advertiser's message to display the most "relevant" ads to you ... The risk, however, is that users will bristle when they get hit with such ads.

Wednesday, October 31, 2007

Personalized search for movies

Seung-Taek Park and David Pennock from Yahoo Research had a good paper on personalized search at KDD 2007, "Applying Collaborative Filtering Techniques to Movie Search for Better Ranking and Browsing" (PDF).

While the paper focuses on personalized search for movies, the techniques discussed are applicable to other types of search.

The authors start with some motivation that probably sounds familiar to readers of this blog:
Recommender systems are widely used ... to overcome information overload ... Information retrieval systems list relevant items ... only if a user asks for it ... Recommender systems predict the needs of a user ... and recommend items ... even though the user does not specifically request it.

We build a prototype personalized movie search engine called MAD6 ... MAD6 combines both information retrieval and collaborative filtering techniques for better search and navigation.
I like the technique MAD6 uses for personalized search. They use an "item-based collaborative filtering algorithm to calculate a user's expected ratings" on items in search results, fill in any gaps with average ratings from the general population, then re-rank the items.

For example, if a searcher rated Terminator and Terminator 2 highly, the personalized search results would first order the search results by relevance to the search terms and popularity, then re-rank Terminator, Terminator 2, and anything related to those two movies higher in the search results. In the example in the paper, this resulted in the top 5 search results for a query for [arnold action] being Terminator 2, Commando, True Lies, Last Action Hero, and Terminator.
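
A stripped-down sketch of that re-ranking step, my own illustration of the idea rather than MAD6 itself, with invented item names and ratings:

    # Stripped-down sketch of the re-ranking idea (my illustration, not MAD6).
    # `results` come back from the text search already ordered by relevance;
    # `predicted` holds the item-based collaborative filter's expected rating
    # for this user where it has one, and `global_avg` is the population's
    # average rating, used to fill the gaps.

    def personalize(results, predicted, global_avg, default=3.0):
        def score(item):
            return predicted.get(item, global_avg.get(item, default))
        # Python's sort is stable, so items with equal expected ratings keep
        # their original relevance order.
        return sorted(results, key=score, reverse=True)

    results = ["movie-a", "movie-b", "movie-c"]         # ordered by text relevance
    predicted = {"movie-c": 4.5, "movie-a": 3.9}        # this user's expected ratings
    global_avg = {"movie-b": 3.1}                       # population average fills the gap
    print(personalize(results, predicted, global_avg))  # ['movie-c', 'movie-a', 'movie-b']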

As the authors report, this order was significantly different from the norm for a search for [arnold action] on other search engines. In their tests, they found their personalized rank performed very well on navigational queries -- when people already know what they are looking for -- but not as well on less directed informational queries.

The paper explains why (where GRank is a general rank that orders by popularity, PRank is the personalized search, and Web is a Web search):
When navigational queries are submitted, participants are more satisfied with PRank and Web than GRank. However, when informational queries are submitted, participants prefer GRank rather than PRank and Web.

One possible explanation is that, when participants submit navigational queries, they may have very clear target movies in their minds. These movies may be their favorites and are more likely rated before the test.

However, when informational queries are submitted, participants may not have clear target movies and [fewer] returned items ... [may] be rated ... Then ... the item-based algorithm may be inaccurate due to the lack of user information ... The item-based algorithm suffers from a cold start problem. We believe users' satisfaction of PRank will increase as users provide more ratings.
This result is unfortunate. A goal of recommender systems is to enhance discovery of unfamiliar items. If PRank is performing poorly on informational queries, it is failing at this task.

This result is surprising to me though. It should be possible to tune PRank to only modify the rankings when it has sufficient evidence that the change would be an improvement, otherwise falling back to GRank. First do no harm. PRank should only make a change when the majority of people will see the change as an improvement.
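
One simple way to implement that caution, sketched below purely as my own illustration: shrink the personalized score toward the general score when the item-based filter has little evidence for an item, so that with no evidence the ranking reduces to GRank.

    # Toy "first do no harm" sketch (my own, not from the paper). The weight on
    # the personalized score grows with the number of the user's rated items
    # that supported the prediction; with no support it falls back to GRank.

    def blended_score(grank_score, prank_score, n_rated_neighbors, k=10.0):
        w = n_rated_neighbors / (n_rated_neighbors + k)
        return (1.0 - w) * grank_score + w * prank_score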

More generally, it should be possible to tune the recommender to favor serendipity and enhance discovery in informational queries while also supporting re-finding in navigational queries. Serendipity largely reflects the amount of surprise in the recommendations -- pushing away from the popular and toward the unusual -- while re-finding is merely surfacing or annotating items seen before. It should be possible to do both.

A very interesting paper and a worthwhile read. I love the approach of layering an item-based recommender on top of search results to create a form of personalized search (Findory made an attempt ([1] [2]) at doing something similar in web search). By looking at specific actions, not only can the personalized search act at a finer level of detail than Google Personalized Search ([1] [2]), but also it can adapt immediately to short-term trends, what you are searching for right now.

By the way, let me note that the PDF for this paper used to be publicly available, but appears to have been pulled. It now only is available with KDD or ACM Digital Library membership. I often do not write about papers that cannot easily be downloaded, but this one is sufficiently interesting that I wanted to make sure people knew about it.

Also, this paper, as many others, cites Sarwar et al., 2001 as the first work on item-based collaborative filtering. As I have said before, that may not be accurate.

Update: Seung-Taek Park in the comments gave an alternative location (PDF) for downloading the paper. I changed the link at the beginning of this post to point directly to the PDF file. Thanks, Seung-Taek!

Monday, October 29, 2007

Google Tech Talk on similarities

Yury Lifshits gave a Google Tech talk, "Similarity Search: A Web Perspective", surveying various algorithms for finding similar items. Slides (PDF) from the talk are available.

In the talk, Yury mentions his tutorial, "Algorithms for Nearest Neighbor Search", which should be of interest to those who want to dive in deeper.

Thursday, October 25, 2007

Facebook and the value of blocking Google

In his post, "The $15 billion nonsense", Nick Carr nails it:
Extrapolating Facebook's true worth from Microsoft's investment is a ridiculous exercise ... The investment ... was a price Microsoft had to pay to nail down the partnership.

Partnering ... is far more about gaining future strategic options and blocking the advance of ... Google ... than about making a financial gain.
This was $240M to gain share for Microsoft's advertising platform at the expense of Google and Yahoo. It says nothing about the market value of Facebook.

Update: Mike Masnick reports on rumors that some crazy hedge funds might have gone in at the same $15B valuation for Facebook. As Mike says, "Those hedge funds don't get any of those additional benefits that Microsoft gets." If true, this news would support a $15B valuation for Facebook, at least if you believe those hedge funds are responsible stewards of their investors' money.

Update: John Battelle says, "I think no one ... has truly grokked what Facebook has a shot at doing - Adsense driven not by search queries, but by personal profile." Call me a skeptic on this one. I think personalized advertising will produce much higher returns by reaching back to the last action with purchase intent than by targeting a coarse personal profile.

Tuesday, October 23, 2007

Searchers say, please read my mind

Greg Sterling at Search Engine Land reports on a Kelton Research poll that found considerable frustration among search engine users and a desire for a search engine that can "read their minds":
72.3 percent of Americans experience "search engine fatigue" when researching a topic on the Internet ... More than three out of four (75.1 percent) of those who experience search engine fatigue report getting up and physically leaving their computer without the information they were seeking.

Kelton asked survey respondents whether they wished that search engines like Google could, in effect, read their minds, delivering the results they were actually looking for ... That capability is something that 78 percent of all survey-takers "wished" for.
As Greg Sterling says, this sounds like a common hunger for personalized search.

See also a March 2005 post where, in response to Udi Manber's statement that "we are not in the mind reading business," I said, "If you need to read minds ... well then you better read minds. [Searchers will] think it's your fault, not theirs, if you don't give them what they need."

Monday, October 22, 2007

The Web titans arrive in Seattle

Todd Bishop at the Seattle PI writes about the branch offices Yahoo and Google have opened in Seattle. Some excerpts:
Yahoo's plan to put a large engineering outpost in Bellevue -- down the road from Google's Kirkland branch and Microsoft's Redmond campus -- means that all three giants of Internet search and Web services will have large offices in the region.

Added to other existing players -- such as Amazon.com, RealNetworks, InfoSpace and Adobe's Fremont branch -- they promise to turn the region into more of a center for Web-based technology.
I am quoted in the article with some thoughts on the impact to technology firms in Seattle and computer science at the University of Washington.

Friday, October 19, 2007

Google News, Krishna Bharat, and RecSys 2007

Google researcher and creator of Google News Krishna Bharat just gave the keynote talk at the Recommender Systems 2007 conference.

The talk had a pleasantly idealistic focus on increasing access to knowledge. Krishna clearly sees helping people find news information as a noble and important mission.

Krishna devoted most of the early part of the talk to discussing the history of writing and information broadcast, ending with the claim that the Web is creating a change in news consumption and access as revolutionary as radio and television were.

This revolution comes from universal and easy access to news and lower costs of producing news. Krishna saw this as having a very broad impact, saying, "The Internet can (and will) do a lot for democracy."

Even so, Krishna warned of challenges, saying, "Technology enables free speech but doesn't guarantee it," and expressing concerns about censorship. He said the goal of Google News was to ensure the spread of knowledge, multiple perspectives, and differing opinions.

To provide multiple perspectives, Google News crawls a broad list of sources, ranks and clusters them, then explicitly exposes the clusters to readers. That makes it easy for people to see the difference in, for example, how a hostage crisis is covered in South Korea and Pakistan.

The clustering attempts to group stories on the same event together. Krishna made the interesting comment that the clusters will change with time, with old and new stories shifting clusters as follow-up stories on an event appear. They use a technique Krishna only broadly described as an agglomerative hierarchical clustering algorithm.
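
Krishna gave no implementation details beyond naming the family of algorithms, but a bare-bones average-linkage agglomerative clustering over article term vectors, sketched here with an invented similarity threshold, captures the general idea:

    import math

    # Toy average-linkage agglomerative clustering over article term vectors
    # (my sketch of the general technique, not Google News code). Articles are
    # dicts of term -> tf-idf weight; clusters keep merging while the most
    # similar pair is above `threshold`.

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def cluster_stories(articles, threshold=0.3):
        clusters = [[i] for i in range(len(articles))]

        def avg_sim(c1, c2):
            pairs = [(i, j) for i in c1 for j in c2]
            return sum(cosine(articles[i], articles[j]) for i, j in pairs) / len(pairs)

        while len(clusters) > 1:
            best, (x, y) = max(
                ((avg_sim(clusters[i], clusters[j]), (i, j))
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                key=lambda t: t[0])
            if best < threshold:
                break
            clusters[x] = clusters[x] + clusters[y]
            del clusters[y]
        return clusters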

Krishna provided more details on how Google determines the relevance of stories and authority of sources. He started by describing how human editors determine the relevance of stories, a long list that included scope/impact, urgency, lack of negativity, unexpectedness, lack of ambiguity, the "human element", ability of the audience to identify with the story, elite (e.g. celebrity) references, consonance, continuity, market forces, local bias, and ideological bias.

Krishna then said that Google determines article relevance by looking at the authority of the source, timeliness of the article, whether it is an original piece, placement by the editors on the source page, the apparent scope and impact, and the popularity of the article.

A big piece of determining the relevance of a story is determining the authority of the source. Google estimates that by looking at the characteristics of all the articles produced by the source (including the number of non-duplicate stories, length of the articles, breadth of the articles, number of important/breaking stories, click rate by Google News readers, and the average quality of the writing), the PageRank of the news website, and real world data on the news company (e.g. number of employees).
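
Krishna did not say how those signals are combined. Purely as an illustration, a weighted sum over normalized source-level features would look something like the sketch below; the feature names and weights are my own stand-ins, not Google's.

    # Illustrative only: combine source-level signals into an authority score
    # with a weighted sum. Feature names and weights are invented stand-ins.

    AUTHORITY_WEIGHTS = {
        "num_original_stories": 0.25,
        "avg_article_length":   0.10,
        "topic_breadth":        0.15,
        "breaking_story_count": 0.20,
        "reader_click_rate":    0.20,
        "site_pagerank":        0.10,
    }

    def source_authority(features):
        """features: dict of feature name -> value normalized to [0, 1]."""
        return sum(AUTHORITY_WEIGHTS[name] * features.get(name, 0.0)
                   for name in AUTHORITY_WEIGHTS)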

Krishna did see personalization and recommendations for news as a long term goal, saying we want to "get the right news to the right audience." And, Krishna has been interested in this for a long time, all the way back to his 1995 work on the Krakatoa Chronicle. As for more recent work, Krishna summarized the WWW 2007 paper, "Google News Personalization".

Overall, Krishna focused on Google's mission of making information universally accessible and useful. He clearly wants to help people find news and be informed about world events, using whatever tools, personalization or otherwise, serve that mission.

Predictive accuracy is not enough

Sun Labs researcher Paul Lamere posts a list of properties of a good music recommender system:
A good recommendation has three aspects:

familiarity - to help us gain trust in the recommender
novelty - without new music, the recommendation is pointless
relevance - the recommended music has to match my taste
The emphasis on credibility is insightful. Recommendations that are too obscure may be perceived as low quality because the user cannot easily and quickly evaluate them.

This is just one of many examples of the yawning gap between predictive accuracy -- accurately predicting what people want to buy next -- and perceived quality -- the usefulness of the recommendations to users.

Saturday, October 13, 2007

Advances driven by the future of computing

UW CS Professor extraordinaire Ed Lazowska gave a talk recently, "Computer Science: Past, Present and Future", that covered some of the large new opportunities that will be created by advances in computing in the next ten years.

What I found most insightful in the talk was Ed's discussion of how many fields will go from data poor to data rich, a process driven by cheaper sensors and the availability of massive parallel computing power to analyze, summarize, and understand the incoming flood of sensor data. He specifically held up oceanography, medicine, astronomy, and biology as fields that will be forever changed by this shift.

If you are short on time, the slides (PDF) are available from a similar talk Ed gave at FCRC 2007. Slide 37 summarizes the major opportunities and some details on those opportunities start on slide 46.

Update: Oops, sorry, the video from Ed's talk is not available yet. It should be available in just a few days.

Update: The video from the talk is now available.

Attending RecSys 2007

I will be at the ACM Recommender Systems 2007 conference Oct 19-20 next week in Minneapolis. If you will be there, please say hello if you see me.

Wednesday, October 10, 2007

Highly available distributed hash storage from Amazon

Amazon CTO Werner Vogels announces an awesome SOSP 2007 paper he co-authored with nine other people at Amazon, "Dynamo: Amazon's Highly Available Key-value Store" (PDF).

Like the Google File System and Google Bigtable papers, this Amazon paper describes a critical part of Amazon's infrastructure, a reliable, fast, and scalable storage system.

Some excerpts from the paper:
There are many services on Amazon's platform that only need primary-key access to a data store ... [and] using a relational database would lead to inefficiencies and limit scale and availability.

Dynamo uses a synthesis of well known techniques to achieve scalability and availability: Data is partitioned and replicated using consistent hashing, and consistency is facilitated by object versioning. The consistency among replicas during updates is maintained by a quorum-like technique and a decentralized replica synchronization protocol. Dynamo employs a gossip based distributed failure detection and membership protocol ... Storage nodes can be added and removed from Dynamo without requiring any manual partitioning or redistribution.

In the past year, Dynamo has been the underlying storage technology for a number of the core services in Amazon's ecommerce platform.
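
To make "consistent hashing" and replica placement concrete, here is a toy hash ring in Python, my own sketch of the general technique rather than Dynamo's implementation; the virtual node count and replica count are arbitrary.

    import bisect
    import hashlib

    # Toy consistent-hash ring (a sketch of the general technique named in the
    # excerpt, not Dynamo's code). Each node gets several virtual points on the
    # ring; a key belongs to the first N distinct nodes clockwise from its
    # hash, which is also where replicas would be placed.

    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    class HashRing:
        def __init__(self, nodes, vnodes=8):
            self.ring = sorted((_hash("%s#%d" % (n, i)), n)
                               for n in nodes for i in range(vnodes))
            self.points = [p for p, _ in self.ring]

        def preference_list(self, key, n_replicas=3):
            idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
            distinct_nodes = len(set(n for _, n in self.ring))
            nodes, i = [], idx
            while len(nodes) < min(n_replicas, distinct_nodes):
                node = self.ring[i % len(self.ring)][1]
                if node not in nodes:
                    nodes.append(node)
                i += 1
            return nodes

    ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
    print(ring.preference_list("cart:12345"))  # the three distinct nodes for this key
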
The paper overall is fascinating, not only for the discussion of Amazon's needs and how the authors thought about them when building Dynamo, but also for critical discussion of other distributed databases. The related work section (Section 3) is worth its weight in gold.

A paper on an Amazon distributed database immediately invites comparison with the Google File System and Bigtable. Unlike Google FS (and also unlike distributed databases like C-store), Amazon's Dynamo is, oddly, optimized for writes. From the paper:
Many traditional data stores execute conflict resolution during writes and keep the read complexity simple.

Dynamo targets the design space of ... a data store that is highly available for writes ... We push the complexity of conflict resolution to the reads in order to ensure that writes are never rejected.
This choice surprises me and does appear to lead to some problems later. The 15ms average read latency seems high if any service needs to make more than a dozen database queries when serving a web page; latencies like that make this a pretty heavyweight data store service.

As they described in the paper, at least a few groups at Amazon needed a much lighter weight service that could be hit thousands of times, so they had to use rather extreme parameters (requiring all replicas be available for writes, and only one for reads) to force Dynamo to work for them. With those parameters, they effectively turned off the high availability for writes and pushed the complexity of conflict resolution back away from reads, which makes me wonder if write optimization really should have been part of Dynamo's design goals in the first place.
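
The trade-off those groups made is easier to see in terms of Dynamo-style quorum parameters. A small illustrative sketch (not Amazon's code), where N is the replica count, W the write acknowledgements required, and R the read acknowledgements required:

    # Illustrative sketch of Dynamo-style quorum parameters (not Amazon's
    # code). R + W > N gives consistency between reads and writes. The setup
    # described above (all replicas for writes, one for reads) is W = N, R = 1:
    # reads become cheap and conflict-free, but a single unreachable replica
    # blocks writes, giving up the "always writable" property.

    def quorum_properties(n, r, w):
        return {
            "consistent_reads": r + w > n,
            "replicas_that_may_fail_for_writes": n - w,
            "replicas_that_may_fail_for_reads": n - r,
        }

    print(quorum_properties(n=3, r=2, w=2))  # a typical balanced quorum setting
    print(quorum_properties(n=3, r=1, w=3))  # the cheap-reads, fragile-writes setup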

Another difference with Google FS and Bigtable is that Google's systems organize data as a map, supporting high performance range scans over the data. On the one hand, Google may have more need for that, with its building of search indexes and analyzing massive text and log data. On the other hand, Amazon has massive text and log data too, and Dynamo seems like it may not be able to help with large scale data indexing and analysis tasks.

Both the lack of range scans and the optimization for writes over reads appear to stem from the authors' focus on the needs of the shopping cart. They repeatedly return to it as a motivating example. It is not clear to me why they chose to focus on that task over their other needs.

I had a couple of other surprises. Dynamo relies on random, uniform distribution of the keys for load balancing -- something that seems likely to run into problems with highly skewed access patterns -- rather than supporting additional replication of frequently accessed data. More seriously, Dynamo is limited to a few hundred nodes because they punted on some of the hard problems of ensuring consistency in metadata (like their routing table) at larger scale.

Overall, a very interesting paper and system from Amazon. I love how Amazon has adapted the motivation of P2P distributed hash tables like Chord and Pastry to an environment with all trusted machines like an Amazon data center, taking advantage of that to reduce latency and improve reliability. I also am impressed by how remarkably configurable Dynamo is -- from the underlying storage to the number of replicas to the means of conflict resolution -- so that it can adapt to the widely varying needs of different groups at Amazon.

By the way, unlike Google, Yahoo, and Microsoft, Amazon publishes academic papers only rarely. They deserve kudos for doing so here. With this paper, Amazon is revealing some of the remarkable challenges in large scale computing they face. As people are attracted to those challenges, perhaps this will be the start of more openness from Amazon.

Saturday, October 06, 2007

Starting Findory: Hardware go boom

Computers go down, a lot more often than we might like.

For most of Findory's four years, it ran on six servers. In that time, those servers suffered one drive failure, one bad memory chip, and four power supply failures.

Of these, only two of the power supply failures caused outages on the site, one lasting an hour and one a painful eight hours. There were a few other short outages due to network problems in the data center, typically problems that took the entire data center offline temporarily.

Findory's six servers were all cheap commodity Linux boxes, typically with a single-core, low-end AMD processor, 1GB of RAM, and a single IDE disk. Findory was cheap, cheap, cheap.

The reliability of Findory over its lifetime was perfectly acceptable, even quite good compared to other little startups with similar levels of traffic, but I think it is interesting to think about what may have been able to prevent the outages without considerable expense.

Surprisingly, RAID disks would not have helped much, though they would have made it easier to recover from the one drive failure that did occur on a backend machine. Better redundancy on the network may have helped, but would have been expensive. Splitting servers and replicating across data centers may have helped, but would have been both expensive and complicated for a site of Findory's size and resources.

Probably the biggest issue was that Findory did not have a hot standby running for its small database. A hot standby database would have avoided both the one hour outage and the eight hour outage. Those outages were caused by losing first one and then a second power supply on the database machine.

Looking back at all of these, I think it was particularly silly not to have the database hot standby. The cost of that would have been minimal and, not only would it have avoided the outage, but it would have reduced the risk of data loss by having a constant database backup. I almost added the hot standby many times, but kept holding off on it. While I may have mostly gotten away with it, it clearly was a mistake.

Other thoughts? Anyone think it was foolish not to run in RAID and not to be split across data centers? Or, maybe you have the opposite opinion, that small startups should not worry much about uptime and small outages are just fine? How about the hardware failure rates, are they typical in your experience?

Please see also my other posts in the "Starting Findory" series.

Thick, thin, and Office Live

The new Office Live is out and seems to be getting widely panned.

Mike Arrington says, "Microsoft has failed to understand the real power of Google Docs - easy, no hassle document creation, collaboration and access from the browser."

Richard MacManus writes, "Is this what Microsoft's answer to the Web Office is - tacked on features to its all-powerful desktop suite?"

But, the ever contrarian Nick Carr responds to Richard:
Well, yes, that's precisely what Microsoft's answer is. And while MacManus is right that Microsoft's offering is "messy" and even "muddled," one should not underestimate the company's ability to shape the market by tacking features onto its "all-powerful desktop suite." It's a strategy, after all, that has served the company well many times in the past.

Microsoft's online offering does not have to be better than, say, Google's; it just has to be (a) more convenient for typical business users with (b) good enough functionality.
Why should Microsoft do more than bolt features onto Office if those features are sufficient to undermine the appeal of switching elsewhere?

If you are collaborating a lot on documents, you could switch to Google Docs. Or, you could just get by using the beast with which you and all your colleagues already are familiar, Microsoft Office, with all its additional features. Office Live appears to be an attempt to add just enough collaboration to reduce the appeal of switching to Google Docs. Not a bad strategy.

As for Mike's point, well, yes, but Microsoft does not want everything to move to the browser. Their entire business depends on a thick client -- a PC loaded with code and data -- having more value than a thin client.

I am reminded of Upton Sinclair's words: "It is difficult to get a man to understand something when his salary depends on his not understanding it." Likewise, it will be difficult to get Microsoft to promote apps running in the browser when their business depends on maintaining the value of a heavyweight desktop machine.

Update: I want to be clear that there is a difference between analyzing Microsoft's strategy and endorsing it. I am not rooting for Microsoft or anyone else here, just trying to understand why they do what they do.

Thursday, October 04, 2007

Web 2.0 is dead and spammy, long live Web 3.0?

While I am no more a fan of the name Web 3.0 than Web 2.0, Jason Calacanis has an entertaining rant on where the Web is going:
Web 3.0 throttles the "wisdom of the crowds" from turning into the "madness of the mobs" we've seen all to often, by balancing it with a respect of experts. Web 3.0 leaves behind the cowardly anonymous contributors and the selfish blackhat SEOs that have polluted and diminished so many communities.
This reminds me of what Xeni Jardin wrote back in Oct 2005:
Web 2.0 is very open, but all that openness has its downside: When you invite the whole world to your party, inevitably someone pees in the beer.

These days, peed-in beer is everywhere. Blogs begat splogs -- junk diaries filled with keyword-rich text to lure traffic for ad revenue.
See also my previous post, "Growth, crap, and spam", where I said, "There seems to be a repeating pattern with Web 2.0 sites. They start with great buzz and joy from an enthusiastic group of early adopters, then fill with crud and crap as they attract a wider, less idealistic, more mainstream audience."

See also my previous posts, "Community, content, and the lessons of the Web" and "Getting the crap out of user-generated content".

[Calacanis post found via Nick Carr]

Recommender systems and diversity

Knowledge@Wharton recently published an article, "Reinforcing the Blockbuster Nature of Media: The Impact of Online Recommenders". The article discusses research work by Kartik Hosanagar and Dan Fleder at Wharton on whether recommender systems improve diversity of sales and help people discover items that otherwise might be buried in the long tail.

An excerpt from the article:
Recommenders -- perhaps the best known is Amazon's -- tend to drive consumers to concentrate their purchases among popular items rather than allow them to explore and buy whatever piques their curiosity, the two scholars suggest.

Hosanagar and Fleder argue that online recommenders "reinforce the blockbuster nature of media." And they warn that, by deploying standard designs, online retailers may be recreating the very phenomenon -- circumscribed media purchasing choices -- that some of them have bragged about helping consumers escape.
I am briefly quoted in the article arguing for a somewhat milder conclusion, saying:
Linden, reached via email, declares himself untroubled by Hosanagar and Fleder's findings. "Recommendation algorithms easily can be tuned to favor the back catalog -- the long tail -- as Netflix does," he argues. Netflix, the online DVD purveyor, consciously highlights obscure titles in designing its recommender.

Linden also argues that, in the absence of online recommenders, consumers would turn to even cruder tools, like traditional bestseller lists. "You have to ask what content would otherwise be in place of the recommendations and whether that content would have greater diversity," he says.
Hosanagar and Fleder have two papers detailing their work, a very long paper titled "Blockbuster Culture's Next Rise or Fall: The Impact of Recommender Systems on Sales Diversity" and a shorter ACM article, "Recommender systems and their impact on sales diversity".

The papers are an interesting read. What I found most surprising about their work was that, in their simulations, a recommender algorithm that did compensate for bestseller bias (called r4 in their paper) still reduced diversity. Although I had questions about their simulation model (which I have already discussed with Dan), I think their work should serve as an additional caution to those working on recommender systems to be concerned about the impact that algorithmic choices can have on diversity, especially if one of the business goals of the recommendations is to drive movement in the back catalog.

Please also see comments on this research work from recommender researcher and U of Michigan Professor Paul Resnick, particularly his thoughts on the simulation framework used.

Wednesday, October 03, 2007

Revisiting Yahoo Answers

Looking at Yahoo Answers again two years after its launch, I think I was wrong to think of it as a question answering website.

Despite the name, Yahoo Answers is a discussion forum. People are using it like a newsgroup, chatting about various topics. They are not using Yahoo Answers as much to generate authoritative answers to questions.

For example, looking at the "popular" answers right now, I see these questions: "Rudy Giuliani vs. Hillary Clinton?" "What's the last CD that you started out LOVING, but you played it so much you 'retired' it indefinitely?" "Egypt: when did I feel sad in Egypt?" "Name 5 one hit wonders of the eighties?"

These are not questions that have any objective answer. Rather, they are a discussion started with a subject line in the form of a question. People are "answering" a question not to provide an answer, but to engage in a conversation and chat with other people.

If Yahoo Answers is a discussion forum, it has a few implications. First, it should be compared not with the now-defunct Google Answers or Google's NLP question answering work, but with Google Groups, Slashdot, and other popular forums.

Second, those seeking to emulate Yahoo Answers probably would be mistaken to focus on question answering over community. People using Yahoo Answers are seeking conversations, not truth. The site is successful because it creates fun discussions and entertains people, not because it yields knowledge.

Third, Yahoo Answers itself may want to emphasize features that favor discussion and deemphasize features focused on generating high quality answers. In particular, while the "best answer" feature creates a fun, winner-takes-all type of contest, ending the discussions after a short period of time may be undesirable if the discussion is still attracting attention. In addition, if promoting discussion is the goal, features that help people find discussions they want to join would be beneficial; search and browse might favor surfacing good discussions over finding direct answers to questions.

I have to admit, I am late to the party in understanding the true purpose of Yahoo Answers. Two years ago, when Yahoo Answers launched, Gary Price said, "Will Yahoo Answers simply be the next generation of an online bulletin board?" Looks like he was right on.

See also my Dec 2005 post, "Yahoo Answers and the wisdom of the crowd", where I, mistakenly it appears, focused on Yahoo Answers as a question answering tool.

Sunday, September 30, 2007

Findory to shut down November 1

Update: The shut down of Findory has been rescheduled to occur on November 22, 2007.

Update: The shut down of Findory has been postponed. The website will remain active past November 1, 2007. More information when I can.

Findory will be shutting down on November 1, 2007. I posted the following on the site:
Thank you, Findory readers, for the last four years. I hope you not only found interesting news on Findory you otherwise would have missed, but also, just by using Findory, took pleasure in helping others find news they might enjoy.

Sadly, all good things must come to an end. On November 1, 2007, Findory will be shutting down.

It was a wonderful experiment building Findory. Information personalization is in our future.

Some day, online newspapers will focus on your interests, building you your own unique, customized front page of news. Some day, search engines will learn from what you do to help you find what you need. Some day, your computer will adapt to your needs to help with the task at hand. Some day, information overload will be tamed.

But not today. Findory will be shutting down on November 1. The website will no longer carry news, blogs, videos, podcasts, or favorites. The daily e-mails will cease. To ease the transition for users of Findory Inline and the Findory API, empty feeds will be served for a couple weeks into November.

I am sorry to see Findory go. Though it will not be Findory doing so, I continue to believe that the future will be personalized.

- Greg Linden, Founder, Findory.com
Please also see my earlier post, "Findory rides into the sunset".

KFS, a Google FS clone

Rich Skrenta discusses and endorses Kosmix's KFS, an alpha-stage Google File System clone.

There appear to be a few of these early stage clones, the biggest being the Yahoo-supported Hadoop.

The Yahoo train wreck?

Analyst Henry Blodget calls Yahoo's declining Comscore numbers "an absolute disaster", says the company is a "train wreck", and predicts that "if the company can't reverse this trend in short order, its only hope will be to sell itself."

Yowsers. It's not all that bad, is it?

Like Microsoft, Yahoo may have been a feeble competitor lately, often appearing distracted, slow to react, and unable to do more than follow in most areas.

Yet, like Microsoft, Yahoo remains a giant in the field. In fact, not only does Yahoo have second place market share in many Web products, but also, according to at least one study, Yahoo has the strongest search brand around.

Yahoo's biggest problem at the moment seems to be that they try to do too much and end up doing nothing well. Even in their core business of advertising, they are playing second fiddle to Google. There's no reason for people to use the second best when switching costs are so low, but Yahoo seems bizarrely content to stay behind the leader even as it sees its audience trickling away.

There are as many opinions on what Yahoo should focus on as Yahoo has products, but, I have to say, I am amazed Yahoo has not done more with personalization.

Yahoo has hundreds of millions of signed in users with long histories of what they wanted and enjoyed. Rather than chasing Google's tail, they could lead in personalized advertising, search, e-mail, and news. Yahoo could use their knowledge of what their users need to focus attention, surface relevant information, and be as helpful as possible.

It is something Yahoo could do better than anyone else and would make Yahoo different. As Jeff Bezos used to say about Amazon, when shoppers try other stores, the experience should seem "hollow and pathetic". "Why doesn't this store know me?", shoppers should ask. "Why doesn't it know what I want?"

Yahoo should be the same way. Yahoo should know you. Elsewhere, the experience should feel vaguely unpleasant, like jumping from talking to your friends to being alone in a group of strangers. "Why is this site showing me that?", users should ask. "Doesn't it know I don't like that?"

Please also see my earlier post, "Yahoo post-Semel and the long road ahead".

Please also see my May 2006 post, "Yahoo home page cries out for personalization".

Friday, September 28, 2007

Starting Findory: Infrastructure and scaling

[After a long break, I am returning to my "Starting Findory" series, a group of posts about my experience starting and building Findory.]

From the beginning of Findory, I was obsessed with performance and scaling.

The problem with personalization is that it breaks the most common strategy for scaling, caching. When every visitor sees a different page, there are far fewer good opportunities to cache.

No longer can you just grab the page you served the last guy and serve it up again. With personalization, every time someone asks for a page, you have to serve it up fresh.

But you can't just serve up any old content fresh. With personalization, when a visitor asks for a web page, first you need to ask, who is this person and what do they like? Then, you need to ask, what do I have that they might like?

So, when someone comes to your personalized site, you need to load everything you need to know about them, find all the content that that person might like, rank and layout that content, and serve up a pipin' hot page. All while the customer is waiting.

Findory works hard to do all that quickly, almost always in well under 100ms. Time is money, after all, both in terms of customer satisfaction and the number of servers Findory has to pay for.

The way Findory does this is that it pre-computes as much of the expensive personalization as it can. Much of the task of matching interests to content is moved to an offline batch process. The online task of personalization, the part while the user is waiting, is reduced to a few thousand data lookups.

Even a few thousand database accesses could be prohibitive given the time constraints. However, much of the content and pre-computed data is effectively read-only data. Findory replicates the read-only data out to its webservers, making these thousands of lookups lightning fast local accesses.
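To make the offline/online split concrete, here is a minimal sketch in Python. None of this is Findory's actual code; the data structures and scoring are made up. The point is only the shape of the idea: an offline batch job does the expensive matching and publishes a read-only index that is replicated to every webserver, so the online request path is reduced to cheap local lookups and a quick merge.

import pickle

# --- Offline batch job (run periodically, never while a user waits) ---
# Pre-compute, for every topic, a ranked list of candidate articles.
# In a real system this matching step is the expensive part.
def build_topic_index(articles):
    index = {}
    for article_id, topics in articles.items():
        for topic, score in topics.items():
            index.setdefault(topic, []).append((score, article_id))
    for topic in index:
        index[topic].sort(reverse=True)  # best candidates first
    return index

def publish_index(index, path="topic_index.pkl"):
    # The read-only index is copied out to every webserver,
    # so online lookups are purely local.
    with open(path, "wb") as f:
        pickle.dump(index, f)

# --- Online request path (while the user waits) ---
# Reduced to a handful of dictionary lookups plus a cheap merge.
def personalized_page(user_topics, index, page_size=10):
    candidates = []
    for topic in user_topics:
        candidates.extend(index.get(topic, [])[:50])
    candidates.sort(reverse=True)
    seen, page = set(), []
    for score, article_id in candidates:
        if article_id not in seen:
            seen.add(article_id)
            page.append(article_id)
        if len(page) == page_size:
            break
    return page

# Toy usage
articles = {
    "a1": {"search": 0.9, "databases": 0.2},
    "a2": {"search": 0.4},
    "a3": {"databases": 0.8},
}
index = build_topic_index(articles)
print(personalized_page(["search", "databases"], index))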

Read-write data, such as each reader's history on Findory, is in MySQL. MyISAM works well for this task since the data is not critical and speed is more important than transaction support.

The read-write user data in MySQL can be partitioned by user id, making the database trivially scalable. The online personalization task scales independently of the number of Findory users. Only the offline batch process faced any issue of scaling as Findory grew, but that batch process can be done in parallel.
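For the read-write side, partitioning by user id amounts to something like the following sketch. The shard hosts are hypothetical, and a real deployment also needs a plan for resharding as machines are added, but the basic property holds: all of one reader's history lives on a single shard, so a page request touches exactly one database no matter how many total users the site has.

import hashlib

# Hypothetical list of MySQL hosts holding read-write user data.
SHARDS = ["db0.example.com", "db1.example.com",
          "db2.example.com", "db3.example.com"]

def shard_for_user(user_id: str) -> str:
    # Hash the user id so users spread evenly across shards.
    # Every request for this user goes to the same single shard.
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

print(shard_for_user("reader-42"))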

In the end, it is blazingly fast. Readers receive fully personalized pages in under 100ms. As they read new articles, the page changes immediately, no delay. It all just works.

Even so, I wonder if I have been too focused on scaling and performance. For example, there have been some features in the crawl, search engine, history, API, and Findory Favorites that were not implemented because of the concern about how they might scale. That may have been foolish.

The architecture, the software, the hardware cluster, these are just tools. They serve a purpose, to help users, and have little value on their own. A company should focus on users first and infrastructure second. Despite the success in the design of the core personalization engine, perhaps I was too focused on keeping performance high and avoiding scaling traps when I should have been giving readers new features they wanted.

Please see also a post on O'Reilly Radar, "Database War Stories #8: Findory and Amazon".

If you have not seen them already, please see also the other posts in the "Starting Findory" series.

Wednesday, September 26, 2007

Reputation from edits and edit reversal

An enjoyable WWW 2007 paper out of UC Santa Cruz, "A Content-Driven Reputation System for the Wikipedia" (PDF), builds on a great but simple idea: High quality authors usually do not have their Wikipedia edits reversed.

From the paper:
In our system, authors gain reputation when the edits they perform to Wikipedia articles are preserved by subsequent authors, and they lose reputation when their edits are rolled back or undone in short order.

Most reputation systems are user-driven: they are based on users rating each other's contributions or behavior ... In contrast, ... [our] system ... requires no user input ... authors are evaluated on the basis of how their contributions fare.

A content-driven reputation system has an intrinsic objectivity advantage over user-driven reputation systems. In order to badmouth (damage the reputation of) author B, an author A cannot simply give a negative rating to a contribution by B. Rather, to discredit B, A needs to undo some contribution of B, thus running the risk that if subsequent authors restore B's contribution, it will be A's reputation, rather than B's, to suffer, as A's edit is reversed. Likewise, authors cannot simply praise each other's contributions to enhance their reputations: their contributions must actually withstand the test of time.
A fun demo of the technique is available that colors the text of some Wikipedia articles based on the reputation of the authors, providing some measure of how trustworthy particular passages of text might be.
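As a much cruder sketch of the same idea (the paper's actual algorithm measures how much of an edit's text survives and for how long; this toy version just counts survivals and reversions):

from collections import defaultdict

# Crude content-driven reputation: authors gain when an edit survives,
# lose when it is reverted shortly afterwards. (The real system is far
# more careful; this only illustrates the incentive structure.)
reputation = defaultdict(float)

def edit_survived(author):
    reputation[author] += 1.0

def edit_reverted(author, reverter):
    # The edited author pays a penalty. The reverter stakes nothing up
    # front, but a revert is itself an edit: if the community restores
    # the original text, the penalty lands on the reverter instead.
    reputation[author] -= 2.0

# Toy sequence: A's edit stands, B's edit is undone by C,
# then C's revert is itself undone and B's text survives.
edit_survived("A")
edit_reverted("B", "C")
edit_reverted("C", "D")
edit_survived("B")
print(dict(reputation))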

It is curious how this simple but clever technique seems less susceptible to gaming. I was trying to think of ways the system could be manipulated -- Would people retaliate for having their edits reversed? Would they make lots of non-controversial but useless edits to increase their reputation? -- but these and other obvious attacks seem like they would carry a fairly high risk of damaging your own reputation once people caught on and reversed the changes.

I also was trying to think how this might be applied elsewhere. For example, on eBay, rather than have sellers and buyers rate each other with inane things like "A++++!", perhaps eBay seller reputation could be determined instead by how often the transaction is reversed? What if eBay implemented a 30-day unconditional return policy on all transactions, then reported buyer reputation based on payment rate and seller reputation based on return rate?

Netflix and the KDD Cup 2007

There is an interesting group of papers in the KDD Cup 2007 on the Netflix Prize recommender contest. The full papers are particularly good and detail techniques that are doing well in the contest, but also don't miss the much lighter introductory paper, "The Netflix Prize" (PDF), and its discussion and charts of the progress on the prize.

On a related note, you might also be interested in Simon Funk's enjoyable write-ups ([1] [2] [3]) on his explorations into using SVD for the Netflix contest. I particularly liked his focus on the speed of and twiddles to SVD.
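Funk's write-ups describe an incremental, gradient-descent approximation of SVD trained only on the observed ratings. A bare-bones sketch in that spirit, not his actual code and with made-up parameters, looks something like this:

import random

def factorize(ratings, n_users, n_items, k=10, lr=0.01, reg=0.02, epochs=30):
    # Approximate the ratings matrix as U * V^T by stochastic gradient
    # descent over the observed entries only, in the spirit of Simon
    # Funk's incremental SVD for the Netflix Prize (much simplified).
    random.seed(0)
    U = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[random.gauss(0, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for user, item, rating in ratings:
            pred = sum(U[user][f] * V[item][f] for f in range(k))
            err = rating - pred
            for f in range(k):
                u, v = U[user][f], V[item][f]
                U[user][f] += lr * (err * v - reg * u)
                V[item][f] += lr * (err * u - reg * v)
    return U, V

# Toy data: (user, movie, rating)
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
U, V = factorize(ratings, n_users=3, n_items=3, k=2)
print(round(sum(U[0][f] * V[2][f] for f in range(2)), 2))  # predicted rating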

By the way, I am no longer playing with the contest, but I should admit that I never got anywhere near the performance of these contenders. For more on that, see my post, "Latest on the Netflix prize". Interesting that my prediction in that post -- that additional data on the movies from another source will be necessary -- so far has turned out to be wrong.

Microsoft Research Beyond Search grants

Microsoft Research is offering $1M in research grants for "research in the area of Semantic Computing, Internet Economics and Online Advertising".

A juicy dataset is provided for that research, a "Microsoft adCenter Search query log excerpt with 100 million search queries along with ad click logs sampled over a few months, and a Live Search query log excerpt with 15 million search queries with per-query search result clickthroughs."

Sadly, the grants and access to the data are open only to academic research groups. I suppose that is to be expected after the ruckus that followed the now-defunct AOL Research group's naive attempt to offer their search logs more widely. I guess non-academics will just have to buy search logs from ISPs.

By the way, I thought it was amusing, when looking over the terms of Microsoft's request for proposals, to see that participants are discouraged from "relying exclusively on non-Microsoft technologies." Ah, yes, what good is research if it doesn't use Microsoft products?

[Found via ResourceShelf]

Friday, September 21, 2007

Hotmap and map attention data

Hotmap is a fun demo that shows a heat map of what areas people are viewing frequently on Microsoft Live Search Maps.

From the About page:
Hotmap shows where people have looked at when using Virtual Earth, the engine that powers Live Search Maps: the darker a point, the more times it has been downloaded.
It is a pretty cool idea. The heat maps clearly focus on high population areas, roads, coastlines, rivers, country borders, and other items of interest.

Seems like there could be a bunch of unusual and useful applications here, especially if you take into account time series (that people look at some map tiles after looking at other map tiles).
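The underlying aggregation is presumably something like counting tile requests and mapping counts to darkness. A toy sketch, with a log scale so a few wildly popular tiles do not wash out everything else (Hotmap's real pipeline is surely more involved):

import math
from collections import Counter

# Count tile requests from a (hypothetical) log of viewed map tiles.
tile_views = Counter()

def record_view(tile_x, tile_y, zoom):
    tile_views[(tile_x, tile_y, zoom)] += 1

def darkness(tile, max_count):
    # Log-scale the counts so a handful of extremely popular tiles
    # (downtowns, landmarks) do not flatten the rest of the map.
    count = tile_views.get(tile, 0)
    if count == 0 or max_count == 0:
        return 0.0
    return math.log1p(count) / math.log1p(max_count)

# Toy usage
for _ in range(500):
    record_view(10, 20, 12)   # a popular downtown tile
record_view(11, 20, 12)       # a quiet suburban tile
max_count = max(tile_views.values())
print(darkness((10, 20, 12), max_count), darkness((11, 20, 12), max_count))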

Danyel Fisher at Microsoft Research has two articles on the project: "Hotmap: Looking at Geographic Attention" and "How We Watch the City: Popularity and Online Maps". The articles are light reads, plenty of pretty pictures of heat maps.

As the second of the two papers mentions, there are some amusing examples of where people direct their attention. Here in Seattle, a "small, very bright point on the shore of Lake Washington points out Bill Gates' house."

For related work that uses GPS log data rather than map search log data, also make sure to check out "The Microsoft Multiperson Location Survey" and "Predestination: Inferring Destinations from Partial Trajectories".

Update: More on Hotmap from Matthew Hurst, Todd Bishop, and directly from Danyel Fisher.

Update: One month later, the NYT reviews a commercial device called the Dash Express that apparently "broadcasts information about its travels back to the Dash network", allowing users to "warn each other through the network [anonymously] the second they hit a traffic slowdown."

E-mail versus social networks

Insightful thoughts from Om Malik:
E-mail has most ... of our attention .... [and] has all the elements needed for a social ecosystem, namely the address book.

Yahoo might have taken the wrong approach to ... social networking ... It should have started from within Yahoo's email service, which has some 250 million subscribers.

[E-mail should be] something better, something that doesn't make us all groan every time we open our inbox.
It probably is not fair to pick on Yahoo here. Microsoft and Google also seem to have had only limited success with their social network apps while letting their e-mail apps languish.

But, I think Om has an excellent point. Rather than try to replace e-mail apps with social apps, most people might be better served by bringing more social features to our e-mail apps.

On this topic, you might be interested in checking out "Inner circle: people centered email client", a fun CHI 2005 paper out of Microsoft Research.

Update: Two months later, Saul Hansell at the NYT posts about the Yahoo Inbox 2.0 project, an extension to Yahoo Mail "that can automatically determine the strength of your relationship to someone by how often you exchange e-mail and instant messages with him or her" and displays "other information about your friends ... much like the news feed on Facebook."
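As a rough illustration of the kind of scoring Hansell describes, and emphatically not Yahoo's actual method, relationship strength from message traffic might look something like this sketch, with made-up weights:

from collections import Counter

def relationship_strength(messages, me):
    # Score contacts by how often we exchange messages with them.
    # `messages` is a list of (sender, recipient) pairs; the weights
    # below are invented purely for illustration.
    scores = Counter()
    for sender, recipient in messages:
        if sender == me:
            scores[recipient] += 1.0   # mail I send signals my interest
        elif recipient == me:
            scores[sender] += 0.5      # mail I receive counts for less
    return scores.most_common()

messages = [("me", "alice"), ("alice", "me"), ("me", "alice"),
            ("bob", "me"), ("me", "carol")]
print(relationship_strength(messages, "me"))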

Tuesday, September 18, 2007

Google's PowerPoint launches

Google launches Google Presentations, lightweight PowerPoint-like functionality integrated into Google Docs.

Philipp Lenssen posts a lengthy review. More details are available from the Google Help pages.

As you might expect in an online app, the focus appears to be on collaboration, sharing, and virtual conferencing (using chat and synchronized online viewing of the presentation).

Stepping back and looking at the bigger picture here, I find myself getting to the point where my entire day is spent in the browser. Even on machines where I have Microsoft Office installed, I often find it faster to quickly view documents using the GMail integration with Google Docs than to open other applications.

I was skeptical that Google would get us to that point, but they have. Google appears to be making remarkable progress chipping away at the utility of a desktop PC environment.

Monday, September 17, 2007

The Google phone company

Bob Cringely's latest column proposes that Google spend several billion to buy the 700 MHz band, sell "Google Cubes" that act as small fileservers, WiFi points, and the mesh of a 700 MHz network, and then "overnight" become the "biggest and lowest-cost ISP" and the "biggest and lowest-cost mobile phone company" while "dominating local- and location-based search".

Bob is thinking big today, it appears.

Friday, September 14, 2007

Tech talk: Searching for Evil

Security research guru Ross Anderson has a talk up on Google Video, "Searching for Evil", that, among other things, surveys some of the more unusual Web-based financial schemes.

If you only have a few minutes, jump to 20:23 to check out Ross' frightening examples of some phishing-like schemes that are popping up on the web. The first example shows how people recruit mules on the Web to sit in the middle of a fraudulent financial transaction, with the person who accepted a too-good-to-be-true job offer getting badly screwed in the end.

If you have more time to dive in deeper and watch the whole thing, I enjoyed Ross' discussion at the beginning of the talk about using evolutionary game theory in simulations of network attacks. He refers to a WEIS 2006 paper, "The topology of covert conflict" (PDF), for more details. That paper starts to "build a bridge between network science and evolutionary game theory" and to "explore ... sophisticated [network] defensive strategies" including "cliques ... the cell structure often used in revolutionary warfare" which turn out to be "remarkably effective" for defending a network against adaptive attackers.

Similarly, though not mentioned in his talk, Ross has an ESAS 2007 paper, "New Strategies for Revocation in Ad-Hoc Networks" (PDF), which looks at how to "remove nodes that are observed to be behaving badly" from ad-hoc networks. The authors come up with a remarkable conclusion that "the most effective way of doing revocation in general ad-hoc networks is the suicide attack ... [where] a node observing another node behaving badly simply broadcasts a signed message declaring both of them to be dead."
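Stripped of the cryptographic details, the suicide revocation idea reduces to something like this toy sketch (the signatures here are fake placeholders; the paper's actual scheme is far more careful):

# Toy sketch of "suicide" revocation in an ad-hoc network: a node that
# observes misbehavior broadcasts a signed note declaring both itself and
# the offender dead. The cost to the accuser deters frivolous accusations.
# (Signatures are faked; a real scheme needs actual crypto and much more.)

alive = {"A", "B", "C", "D"}

def sign(node, message):
    return f"signed({node}:{message})"   # stand-in for a real signature

def verify(node, message, signature):
    return signature == f"signed({node}:{message})"

def suicide_revoke(accuser, offender):
    message = f"{accuser} and {offender} are dead"
    return accuser, message, sign(accuser, message)

def on_broadcast(accuser, message, signature):
    # Every node that hears a validly signed suicide note removes
    # both parties from its view of the network.
    if verify(accuser, message, signature) and accuser in alive:
        parts = message.split()
        alive.discard(parts[0])
        alive.discard(parts[2])

on_broadcast(*suicide_revoke("A", "B"))  # A sacrifices itself to expel B
print(alive)  # only C and D remain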

Wednesday, September 12, 2007

Netflix prize and the value of experimenting

The Netflix Prize leaderboard continues to be a fascinating proof of the value of experimentation when working with big data.

The top entries include teams of graduate students from around the world, with eastern Europe particularly well represented. The second best entry at the moment is from undergraduates from Princeton (kudos, Dinosaur Planet).

Some of the teams disclose information about their solutions, enough to make it clear that the teams are playing with a wide variety of techniques.

I love the "King of the Hill" approach to these kinds of problems. There should be no sacred cows, no egos preventing people from trying and testing new techniques. From the seasoned researcher to the summer intern, anyone should be able to try their hand at the problem and build on what works.

Please also see my July 2007 post, "Netflix Prize enabling recommender research", and my June 2007 post, "Latest on the Netflix Prize".

See also my April 2006 post, "Early Amazon: Shopping cart recommendations", for an example from the early days of Amazon of the value of A/B testing and experimentation.

Update: The KDD Cup 2007 papers are available. They give a nice flavor for the approaches (mostly based on twiddles to SVD) that are currently near the lead.

Tuesday, September 11, 2007

Leaked information on Google Reader

According to Philipp Lenssen, an internal Google talk with confidential information on Google Reader was briefly available on Google Video.

Philipp posts a summary from someone referred to as "Fanboy". Ionut Alex Chitu also posts two summaries ([1] [2]) of the content of the talk. Worth a look.

There is mention of planned social sharing features, details on the internal operations of Google Reader, and various statistics on feeds and feed reading. It also sounds like they plan on launching feed recommendations soon.

Impressive that the team working on Google Reader is so small, just seven people.

Update: Three months later, the feed recommendations have launched.

Actively learning to rank

Filip Radlinski and Thorsten Joachims had a paper at KDD 2007, "Active Exploration for Learning Rankings from Clickthrough Data" (PDF), with a good discussion of strategies for experimenting with changes to the search results to maximize a search engine's ability to learn from clickstream data.

Some excerpts:
[When] learning rankings of documents from search engine logs .... all previous work has only used logs collected passively, simply using the recorded interactions that take place anyway. We instead propose techniques to guide users so as to provide more useful training data for a learning search engine.

[With] passively collected data ... users very rarely evaluate results beyond the first page, so the data obtained is strongly biased toward documents already ranked highly. Highly relevant results that are not initially ranked highly may never be observed and evaluated.

One possibility would be to intentionally present unevaluated results in the top few positions, aiming to collect more feedback on them. However, such an ad-hoc approach is unlikely to be useful in the long run and would hurt user satisfaction.

We instead introduce ... changes ... [designed to] not substantially reduce the quality of the ranking shown to users, produce much more informative training data and quickly lead to higher quality rankings being shown to users.
The strategy they propose is to come up with some rough estimate of the cost of ranking incorrectly, then twiddle with the search results in such a way that the data produced will help us minimize that cost.
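Their actual algorithm is cost-based and more subtle, but the general flavor of active exploration can be sketched with something much simpler: occasionally swap the adjacent pair of results whose relative order the ranker is least sure about, so the resulting clicks yield informative pairwise feedback without badly degrading the ranking shown. A toy version:

import random

def explore_ranking(ranked_docs, uncertainty, epsilon=0.5):
    # Occasionally swap the adjacent pair of results whose relative order
    # the ranker is least certain about, so clicks on that pair give
    # informative pairwise feedback. A simplified stand-in for the
    # paper's cost-based exploration strategy, not their algorithm.
    docs = list(ranked_docs)
    if len(docs) < 2 or random.random() > epsilon:
        return docs  # usually show the current best ranking unchanged
    i = max(range(len(docs) - 1),
            key=lambda j: uncertainty.get((docs[j], docs[j + 1]), 0.0))
    docs[i], docs[i + 1] = docs[i + 1], docs[i]
    return docs

ranking = ["d1", "d2", "d3", "d4"]
uncertainty = {("d1", "d2"): 0.1, ("d2", "d3"): 0.8, ("d3", "d4"): 0.3}
print(explore_ranking(ranking, uncertainty, epsilon=1.0))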

There are a bunch of questions raised by the paper that could use further discussion: Is the loss function proposed a good one (in particular, with how it deals with lack of data)? How do other loss functions perform on real data? How much computation does the proposed method require to determine which experiment to run? Are there simpler strategies that require less online computation (while the searcher is waiting) that perform nearly as well on real data?

But, such quibbles are beside the point. The interesting thing about this paper is the suggestion of learning from clickstream data, not just passively from what people do, but also actively by changing what people see depending on what we need to learn. The system should explore the data, constantly looking for whether what it believes to be true actually is true, constantly looking for improvements.

On a broader point, this paper appears to be part of an ongoing trend in search relevance ranking away from link and text analysis and toward analysis of searcher behavior. Rather than trying to get computers to understand the content and whether it is useful, we watch people who read the content and look at whether they found it useful.

People are great at reading web pages and figuring out which ones are useful to them. Computers are bad at that. But, people do not have time to compile all the pages they found useful and share that information with billions of others. Computers are great at that. Let computers be computers and people be people. Crowds find the wisdom on the web. Computers surface that wisdom.

See also my June 2007 post, "The perils of tweaking Google by hand", where I discussed treating every search query as an experiment where results are frequently twiddled, predictions made on the impact of those changes, and unexpected outcomes result in new optimizations.