Sunday, March 30, 2008

Massively distributed Ajax profiling

Emre Kiciman and Ben Livshits have a SOSP 2007 paper, "AjaxScope: A Platform for Remotely Monitoring the Client-Side Behavior of Web 2.0 Applications" (PDF), with the clever idea of using web browser clients to help do massively distributed instrumentation and performance optimization of Web 2.0 applications.

Some excerpts from the paper:
AjaxScope ... [allows] instrumenting and remotely monitoring the client-side execution of web applications within users' browsers .... at the source code level to improve developers' visibility into ... correctness and performance.

We take advantage of ... the ability to dynamically serve new, different versions of the code each time any user runs a web application ... to ... provide differently instrumented code per user and per execution of an application.

Instrumentation is dynamically added or removed from a program as its real-world behavior is observed across users. We [also] distribute instrumentation and run-time analyses across many users' execution of an application ... A combination of these techniques allows us to take ... runtime monitoring policies that would normally impose a prohibitively high overhead and instead spread the overhead across users and time.

Because our prototype can parse and rewrite standard JavaScript code, it does not require changes to the server-side infrastructure of web applications, nor does it require any extra plug-ins or extensions on the client browser.

To our knowledge, AjaxScope is the first to extend the developer's visibility into web application behavior onto the end-user's desktop ... in that it allows developers to gain insight into application behavior in a wide-area setting.
The paper goes on to give several examples of instrumentation with AjaxScope, including reporting client-side errors, detecting infinite loops, optimizing string concatenation, finding memory leaks, and looking for opportunities for static caching.
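To make the serving-side idea concrete, here is a rough Python sketch of an instrumenting proxy in the spirit of AjaxScope. The logging endpoint, sampling rate, and injected snippet are all made up for illustration; their actual system parses and rewrites the JavaScript itself and runs as a server-side proxy.

```python
import random

# Hypothetical sketch of the AjaxScope idea (not their implementation):
# a proxy sits between the web server and the browser, and for a sampled
# fraction of requests it serves JavaScript with extra instrumentation
# injected, spreading the monitoring overhead across many users.

ERROR_REPORTER = """
window.onerror = function (msg, url, line) {
  // Report client-side errors back to a hypothetical logging endpoint.
  (new Image()).src = '/ajaxscope/log?m=' + encodeURIComponent(msg) +
                      '&u=' + encodeURIComponent(url) + '&l=' + line;
  return false;
};
"""

def rewrite_response(js_source, sampling_rate=0.01):
    """Return instrumented JavaScript for a sampled subset of requests."""
    if random.random() < sampling_rate:
        return ERROR_REPORTER + js_source   # instrumented version for this request
    return js_source                        # unmodified version for everyone else
```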

Update: About a year later, Microsoft made parts of AjaxView, a similar toolkit, available for download.

The death and life of newspapers

Eric Alterman writes an article, "Out of Print", in the March 31 New Yorker about "the death and life of the American newspaper."

An excerpt:
Until recently, newspapers were accustomed to operating as high-margin monopolies. To own the dominant, or only, newspaper in a mid-sized American city was, for many decades, a kind of license to print money.

Most managers in the industry have reacted to the collapse of their business model with a spiral of budget cuts, bureau closings, buyouts, layoffs, and reductions in page size and column inches ... Molly Ivins complained ... the newspaper companies' solution to their problem was to make "our product smaller and less helpful and less interesting" ... [and] that may help explain ... the dwindling number of Americans who buy and read a daily paper.
While the localized monopolies newspapers enjoyed are not coming back, I do think there is a lot newspapers could do to make their product more comprehensive, more helpful, and more interesting. In particular, I think newspapers should focus on local news and advertising and make their websites adaptive to the needs and interests of their readers.

For more on that, please see my earlier posts, "Newspapers and local content", "People who read this article also read", "Personalizing the newspaper", and "What to advertise when there is no commercial intent?"

Friday, March 28, 2008

Hadoop Summit notes

James Hamilton posts detailed notes ([1] [2] [3] [4] [5]) on the talks at the Hadoop Summit.

Hadoop is an open source implementation of Google's GFS and MapReduce. Yahoo is pushing a lot of the development of Hadoop.

The notes cover talks on Hadoop, Yahoo Pig, Microsoft Dryad, Yahoo HBase, Facebook Hive, Yahoo ZooKeeper, and a few applications using these platforms.

James' notes are excellent, long, and detailed. I found several noteworthy tidbits in there.

Most of the Hadoop clusters are small, 2k nodes or less, often just tens of nodes. They have plans to expand to 5k. However, this seems to conflict with Yahoo's earlier announcement that they have Hadoop running on a "10,000 core Linux cluster".

They have hit problems with scheduling jobs -- "FIFO scheduling doesn't scale for large, diverse user bases", "We're investing heavily in scheduling to handle more concurrent jobs" -- but an attempt to deal with this by breaking one cluster into many virtual clusters called Hadoop on Demand is not working well -- "HoD scheduling implementation has hit the wall ... HoD was a good short term solution but not adequate for current usage levels. It's not able to handle the large concurrent job traffic Yahoo! is currently experiencing."

Joins are "hard to write in Hadoop" and even harder to optimize, which is part of the motivation for the higher level Pig language built on top of Hadoop. While the Pig team argues that their language is simpler and easier to use than SQL, they do have plans to write an SQL-like processing layer on top of Pig.

On the topic of joins, there apparently was a brief discussion comparing Hadoop/MapReduce data processing to more traditional databases, where the primary difference mentioned was that databases have indexes while Hadoop and MapReduce have to create the equivalent of those indexes ad hoc for each job. While that is not a new point, considering it again makes me wonder if there might be a middle ground here where we retain older extracts and other intermediate results and reuse them for similar computations over the same data. That would act much like an index, but with the advantages of being built on demand and targeted to the current workload.

If there are other tidbits you found of interest in the notes (or you attended the conference and found other things to be of interest), please add a comment to this post!

Update: James adds a summary post with a few more thoughts.

Update: As Doug Cutting and Chad Walters point out in the comments, a 2k node cluster easily could have 10k cores, so I struck my statement that only having 2k nodes seems to be in conflict with Yahoo's earlier announcement.

Talk on disk as the new RAM

Northeastern Professor Gene Cooperman recently gave a curious Google engEdu tech talk, "Disk-Based Parallel Computation, Rubik's Cube, and Checkpointing".

Gene's starting point is that "disk is the new RAM" and the "disks of a cluster can serve as if they were RAM" because the aggregate bandwidth to 50 disks is about 5 GB/second, roughly the same as the bandwidth to RAM.

The talk just gets more fun from there, with Gene claiming that "a compute cluster with 32 quad core nodes, each with 500G of local disk, is a good approximation of ... a single computer with 10 terabytes of RAM and 200 CPU cores."

The premise is, of course, outlandish. The obvious issue is that the latency characteristics of 10T of RAM are totally different from those of 32 500G disks.

But, as long as we can batch the reads and writes to the disk, this difference does not matter. Gene gives a few classes of algorithms -- breadth first state-space search, some algorithms that involve millions of accesses to hash tables, some types of pointer chasing -- that they have found amenable to the model.
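A minimal sketch of that batching idea, assuming a sorted on-disk key/value file (file format and names invented, not Gene's code): buffer the lookups, sort them, and answer them all in one sequential pass over the table instead of doing millions of random seeks.

```python
# Minimal sketch of "batch accesses, then stream the disk": answer many
# lookups against a disk-resident table with one sequential scan of a
# sorted "key\tvalue" file rather than one seek per lookup.

def batched_lookup(sorted_table_path, requested_keys):
    """Answer many lookups with a single sequential pass over a sorted file."""
    pending = sorted(set(requested_keys))   # sort the batched requests
    results = {}
    i = 0
    with open(sorted_table_path) as table:
        for line in table:                  # one sequential pass, no seeks
            key, value = line.rstrip("\n").split("\t", 1)
            while i < len(pending) and pending[i] < key:
                i += 1                      # requested key not present in table
            if i < len(pending) and pending[i] == key:
                results[key] = value
                i += 1
            if i >= len(pending):
                break                       # all requests answered
    return results
```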

This has parallels to MapReduce and the changes we need to do to algorithms to make them work well in a MapReduce framework, as one Googler pointed out during the Q&A.

If you only have time for a few minutes and want the gist of the talk, I would recommend you at least watch from 31:48 for 5-10 minutes.

Wednesday, March 26, 2008

Using dwell times for search relevance

UNC Professor Diane Kelly gave a Google engEdu talk recently titled "Relevance Feedback: Getting the Most out of Your User".

I found the discussion of implicit relevance feedback in search -- using click and display times to determine whether a page was relevant -- that starts around 33:00 to be particularly interesting.

One noteworthy tidbit is that Diane says (at 58:00) that their work found "no direct relationship between display time and usefulness" of web pages.

I was surprised to hear that. I thought dwell times were a good indicator of the relevance of a page.

There appear to be more details on that in her 2004 SIGIR paper, "Display time as implicit feedback: Understanding task effects" (PDF). That paper, in a small user study, found "great variation between subjects in the relationship of display time and usefulness rating ... [and a] lack of any significant relationship between these two factors ... [which] indicates that using display times averaged over a group of users as a measure of usefulness is unlikely to work."
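The kind of check behind that finding is simple to sketch. With made-up dwell times and usefulness ratings (not Diane's data), it comes down to asking whether the two quantities correlate:

```python
from statistics import mean

# Sketch of the analysis: given per-view display times and explicit
# usefulness ratings (synthetic sample data), measure how well average
# dwell time tracks usefulness.

# (user, url, seconds_displayed, usefulness_rating 1-5) -- made-up sample
views = [
    ("u1", "a", 12.0, 1), ("u1", "b", 95.0, 5),
    ("u2", "a", 80.0, 2), ("u2", "b", 15.0, 4),
]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

dwell = [v[2] for v in views]
useful = [v[3] for v in views]
print("correlation of dwell time and usefulness:", round(pearson(dwell, useful), 2))
```

Note that in this toy sample the two users have opposite reading habits, which is exactly the kind of between-subject variation the paper says washes out any group-level relationship.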

Please see also Joachims et al. 2005, "Accurately interpreting clickthrough data as implicit feedback", which discusses (in Section 2) this study and other studies on the usefulness of the time spent reading a page for determining relevance.

Please see also my earlier posts, "Actively learning to rank" and "Ranking using Indiana University's user traffic".

Using IMDb data for Netflix Prize

In a post titled "More data usually beats better algorithms", Anand Rajaraman talks about the success a team from his data mining class at Stanford found in the Netflix recommendation contest by using IMDb data to supplement the Netflix ratings training set.

Some excerpts:
Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?

Team B got much better results, close to the best results on the Netflix leaderboard! I'm really happy for them, and they're going to tune their algorithm and take a crack at the grand prize.

But the bigger point is, adding more, independent data usually beats out designing ever-better algorithms to analyze an existing data set. I'm often surprised that many people in the business, and even in academia, don't realize this.
I do suspect that adding more data from IMDb, Amazon, or some other source will be necessary for anyone to win the Netflix Prize. As I said after working on the data a bit last year, "In my analyses, data simply seemed too sparse in some areas to make any predictions, and supplementing with another data set seemed like the most promising way to fill in the gaps."
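As a toy illustration of supplementing sparse ratings with IMDb data (made-up data and weights, certainly not the Stanford team's algorithm), genre overlap can stand in for similarity when two movies have few or no raters in common:

```python
# Sketch of blending sparse ratings with IMDb genre overlap. The scoring
# is a rough preference score, not a calibrated rating prediction.

ratings = {                 # user -> {movie: rating}
    "u1": {"Heat": 5, "Ronin": 4},
    "u2": {"Heat": 4, "Amelie": 5},
}
genres = {                  # from IMDb: movie -> set of genres
    "Heat": {"crime", "thriller"},
    "Ronin": {"action", "thriller"},
    "Amelie": {"comedy", "romance"},
}

def genre_similarity(a, b):
    """Jaccard overlap of IMDb genres -- available even with zero co-raters."""
    ga, gb = genres.get(a, set()), genres.get(b, set())
    return len(ga & gb) / len(ga | gb) if ga | gb else 0.0

def preference(user, movie, alpha=0.5):
    """Blend the user's ratings with genre overlap into a rough score."""
    rated = ratings.get(user, {})
    if not rated:
        return None
    score = sum(r * (alpha + (1 - alpha) * genre_similarity(movie, m))
                for m, r in rated.items())
    return score / len(rated)

print(preference("u1", "Amelie"), preference("u1", "Heat"))
```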

Please see also my previous posts, "The advantages of big data and big clusters", "Better understanding through big data", "Netflix Prize enabling recommender research", and "The Netflix Prize and big data".

Update: Yehuda Koren from the currently top-ranked BellKor team swung by and commented:
Our experience with the Netflix data is different.

IMDB data and the likes gives more information only about the movies, not about the users ... The test set is dominated by heavily rated movies (but by sparsely rating users), thus we don't really need this extra information about the movies.

Our experiments clearly show that once you have strong CF models, such extra data is redundant and cannot improve accuracy on the Netflix dataset.
Update: Anand has a follow-up post. Some choice excerpts:
Why not have more data and better algorithms?

Scalable algorithms involve only a fixed number of sequential scans and sorts of data (since large data sets must necessarily reside on disk and not RAM). Most algorithms that require random access to data or take time greater than O(N log N) are not scalable to large data sets. For example, they cannot easily be implemented using methods such as MapReduce.

Thus, choosing a more complex algorithm can close the door to using large data sets, at least at reasonable budgets that don't involve terabytes of RAM.

In the case of web search, Google made a big leap by adding links and anchor text, which are independent data sets from the text of web pages. In the case of AdWords, the CTR data was an independent data set from the bid data. And Google ... became a domain registrar so they could add even more data about domain ownership and transfers into their ranking scheme. Google consistently has believed and bet on more data, while trumpeting the power of their algorithms.

You might think that it won't help much to add more of the same data, because diminishing returns would set in ... In many important cases, adding more of the same data makes a bigger difference than you'd think ... [such as when] the application sees the data embedded in a very-high-dimensional space.

Friday, March 21, 2008

Designing for internet scale

James Hamilton wrote a LISA 2007 paper, "On Designing and Deploying Internet-Scale Services" (PDF) with a remarkable brain dump of good advice on building large scale services.

The paper does read like a laundry list, so let me point at what I consider the most important recommendations.

Expect failures and test for them, to the point of never shutting down the system normally -- "crash only software" -- and designing and testing your system so that the "operations team [is] willing and able to bring down any server in the service at any time without draining the workload first."

Automate everything and keep it simple. Failover should be automatic, nothing should require human intervention, and this should be constantly tested. Complexity and dependencies are your enemy when trying to manage all possible failure modes automatically, and James rightly insists on keeping it simple, to the point that he argues that anything that would add complexity without producing order of magnitude improvements should not be implemented.

Test live with rollback, versioning, and incremental rollouts. As long as problems can be quickly reverted, rolling out code live early and often is not expensive or dangerous. More dangerous are "big-bang changes" where many things change at once, making it hard to determine where problems lie and hard to revert to the older version of the code base.

While I agree with almost everything James wrote, I found a few areas where I disagree based on my own experience at Amazon.

First, James wrote that "we like shipping on a 3-month cycle." That hardly seems like early and often to me. I think code should be shipped constantly, multiple times a week or even multiple times a day across different services. On the Web, there is no reason not to ship frequently, and launching continuously forces developers to eliminate dependencies on other code in their launch while adding healthy pressure for high quality tools for monitoring and debugging the health of the system, versioning, rapid rollbacks, incremental rollouts, and live experiments.

Second, James says we should not "affinitize requests or clients to specific servers". Depending on what he means here, we may disagree, because I think this potentially could conflict with simplifying database design. In particular, it is much easier to do database replication (and James acknowledges that "database scaling remains one of the hardest problems in designing internet scale services") if we stick a user to a specific server for their writes. Then, we can just guarantee the much simpler eventual consistency with rather loose requirements on what eventual means. The counter-argument here is that sticking users to specific servers can create hot spots, but I think that largely can be avoided if we pick the server to stick to for writes in part based on load.
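A toy sketch of that write affinity (server names and load numbers invented): route a user's first write to the least loaded replica, then keep routing that user's writes to the same place, which keeps replication simple because each user has a single writer and everything else only needs eventual consistency.

```python
# Sketch of load-aware write affinity: sticky per-user write routing,
# with the server chosen by current load at assignment time.

server_load = {"db-1": 0.72, "db-2": 0.31, "db-3": 0.55}   # invented load numbers
assignments = {}   # user_id -> server chosen for that user's writes

def write_server_for(user_id):
    """Pick the least-loaded server the first time, then stick with it."""
    if user_id not in assignments:
        assignments[user_id] = min(server_load, key=server_load.get)
    return assignments[user_id]

print(write_server_for("alice"), write_server_for("alice"))  # same server both times
```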

Finally, James advocates for a manual "big red switch" that allows us to throttle our system manually. I think this is in conflict with the idea of automating everything. What I would prefer to see is a system that automatically responds to higher load by using more resources and to lower load by using fewer, with monitoring, warnings, and the ability to override. As James says, "People make mistakes. People need sleep. People forget things." I think it is optimistic to believe that, in the middle of a crisis, people could quickly decide how to optimally throttle back a system so it can maintain the highest level of quality of service under unusual load. But an automated system that has even a limited understanding of the cost of its parts could do that.
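As a rough sketch of what I mean by automated throttling (thresholds and feature names invented for illustration), the system itself can decide which optional work to shed as load rises, with a manual override kept around for the truly unexpected:

```python
# Toy automated throttle in place of a manual "big red switch": degrade
# optional features as measured load rises, keeping core results up.

def service_level(cpu_utilization, manual_override=None):
    """Return which optional work to keep at the current load."""
    if manual_override is not None:
        return manual_override          # humans can still force a level
    if cpu_utilization < 0.70:
        return {"recommendations": True,  "spell_check": True}
    if cpu_utilization < 0.90:
        return {"recommendations": True,  "spell_check": False}
    return {"recommendations": False, "spell_check": False}   # core results only

print(service_level(0.65), service_level(0.95))
```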

In all, a great paper, thought-provoking and well worth reading.

Attending ICWSM 2008

I will be at the "International Conference on Weblogs and Social Media" March 30 to April 2 here in Seattle.

If you are interested in weblogs, the networks formed between weblogs, and using those networks for helping people find information and other people, it looks to be a fun conference.

For anyone in Seattle or the SF Bay Area, this is an easy one to attend. Please say hello if you see me there!

Geeking with Greg inside Microsoft

I have started an internal version of Geeking with Greg inside Microsoft. It will be similar to this blog, but cover more day-to-day work, crazy ideas, and things that are not public.

If you are internal to the Redmond giant and are interested, it is here.

If you are outside Microsoft, sorry about that. Happens to us all. I wish I could get to Steve Yegge's internal version of his blog at Google, for example, but no luck for me either.

Interview on MIX Online

I have a video interview on MIX Online titled "Greg Linden: Trends in Collective Intelligence and Centralization".

I ramble on for about 20 minutes on topics in personalization, recommendations, and some thoughts on MapReduce and Dryad. Nothing earth-shattering in there, but perhaps worth twenty minutes if you just can't get enough Greg Linden in your day.

Watching the video requires downloading Microsoft Silverlight, sorry about that. If you do go ahead and install Silverlight, you might as well check out its nifty multiresolution image zoom demo on the Hard Rock site too.

Update: In the comments to this post, Simon Lyall pointed out that the video is available in many formats at the bottom of the page (under the "Downloads" section), so Silverlight is not required to watch it.

Friday, March 14, 2008

Re-finding in Firefox 3

It appears we will be seeing some interesting new support for re-finding in the next version of Firefox. From the release notes for Firefox 3 Beta 4:
Type in all or part of the title, tag or address of a page to see a list of matches from your history and bookmarks

Results are returned according to their frecency (a combination of frequency and recency of visits to that page)... An adaptive learning algorithm further tunes the results to your patterns.
[Found via Larry Dignan]
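Mozilla's actual frecency formula uses bucketed bonuses, but the idea is easy to sketch: every past visit contributes a weight that decays with age, so both frequency and recency matter. Here is a simplified stand-in (half-life and data are made up):

```python
import time

# Simplified stand-in for a "frecency" score: each past visit contributes
# weight that decays with age, so a page visited often long ago can lose
# to a page visited once yesterday.

HALF_LIFE_DAYS = 30.0   # invented decay constant

def frecency(visit_timestamps, now=None):
    now = now or time.time()
    score = 0.0
    for ts in visit_timestamps:
        age_days = (now - ts) / 86400.0
        score += 0.5 ** (age_days / HALF_LIFE_DAYS)   # recent visits count more
    return score

now = time.time()
often_but_old = [now - d * 86400 for d in (120, 130, 140, 150)]
once_but_recent = [now - 1 * 86400]
print(frecency(often_but_old, now), frecency(once_but_recent, now))
```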

Update: About two months later, Todd Bishop quotes Mozilla's Mike Beltzner as saying:
I confidently predict that the Awesome Bar is going to change the way people navigate the Web.

It's solving a new problem on the Web -- that the Web has gotten so big, and revisitation is more popular than novel navigation. People go back to sites they've been to before much more than they go to new Web sites.

Understanding how to simplify revisitation is a really big thing for the Web. It totally will change the way you browse.

Talk on Hadoop at Google Fremont

UW CS graduate student Aaron Kimball is giving a talk, "Welcome to the new era of cloud computing", here in Seattle on Hadoop and tools for large scale distributed computation.

The talk is at 6:30pm on April 30 at the Google Fremont office. It is open to anyone and is free. Sounds like fun! Sign up soon if you are interested. There are only a limited number of slots available.

[Found via Kevin Merritt]

Saturday, March 01, 2008

People who read this article also read

I have an article in the March 2008 issue of IEEE Spectrum titled "People Who Read This Article Also Read...".

The article is on personalized news. An excerpt from the introduction:
So far, few newspaper sites look different from the pulp-and-ink papers that spawned them. Editors still manually choose and lay out news stories. Often, the front page changes only once a day, just like the print version, and it shows the same news to all readers.

There's no need for that uniformity. Every time a Web server generates a news page, for example, in response to a reader's clicking on a link, it can create that page from scratch. An online news site can change minute by minute. And it can even generate different front pages, essentially producing millions of distinct editions, each one targeting just one person -- you.

The most interesting and important way to customize a site is to create a page of stories based on your unique interests culled from information about your past reading behavior. There's already a model for that -- the recommendation systems used by Amazon, TiVo, and Netflix. Using information on past purchases, movie ratings, or items viewed, these systems steer consumers to items from among the thousands or millions they have on offer. Newspapers can and should borrow this idea.

It could transform the industry. Based on articles viewed, these systems could highlight the ones they think a reader would find most interesting, even presenting them in order, with the most interesting article first. No longer would readers have to skim pages of news to find what they needed. No longer would reporters have to battle for the limited space on the front page.
The article goes on to discuss recommender systems in general, the techniques used by Findory and Google News, and the long-term goal of personalizing information.
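The simplest version of "people who read this article also read" is just co-visitation counts over reading histories. Here is a toy sketch with made-up data; it is not Findory's or Google News' actual algorithm, which the article describes in more detail:

```python
from collections import defaultdict
from itertools import combinations

# Minimal "people who read this article also read" sketch: count how often
# pairs of articles are read by the same person, then recommend the
# strongest co-read articles for whatever the reader is looking at now.

reading_history = {
    "u1": ["fed-rate-cut", "housing-slump", "mariners-opener"],
    "u2": ["fed-rate-cut", "housing-slump"],
    "u3": ["housing-slump", "mariners-opener"],
}

co_reads = defaultdict(lambda: defaultdict(int))
for articles in reading_history.values():
    for a, b in combinations(set(articles), 2):
        co_reads[a][b] += 1
        co_reads[b][a] += 1

def also_read(article, k=2):
    """Return the k articles most often read by people who read this one."""
    related = co_reads.get(article, {})
    return sorted(related, key=related.get, reverse=True)[:k]

print(also_read("fed-rate-cut"))
```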

Does the entropy of search logs indicate that search should be easy?

Qiaozhu Mei gave a fun talk at WSDM 2008 with dramatic, sometimes outlandish, but always thought-provoking statements about web search, based on an analysis he did with Ken Church of the Live Search query logs.

The paper, "Entropy of Search Logs: How Hard is Search? With Personalization? With Backoff?" (ACM page) covers much of the talk and adds more detail. An excerpt:
Although there are lots of pages out there, there are not that many pages that people actually go to ... [We] estimate the entropy of URLs (and queries and IP addresses) based on logs from Microsoft's www.live.com. We find that it takes just 22 bits to guess the next URL (or next query or next IP address). That is ... millions, not billions.

Large investments in clusters in the cloud could be wiped out if someone found a way to capture much of the value of billions [of indexed pages] in a small cache of millions.
So, this is one of the dramatic, thought-provoking, outlandish claims. Qiaozhu is arguing that his analysis of how many pages people actually go to in Live Search seems to indicate that a relatively small cache of pages (millions) would be sufficient to satisfy almost all searchers. During the talk, he took this even further, saying that there might be some possibility of fitting a search index of all the pages you might care about on a flash drive.

On the one hand, there is little doubt there is a long tail of infrequently executed queries and that many (and perhaps even most) queries can be satisfied from a reasonably large cache if it is well structured.

On the other hand, there appear to be large logical leaps in using entropy to make claims about the difficulty of the problems in search. Information entropy provides a lower bound on how much space is needed to encode information, the maximum compression possible for the information. The analyses in the paper, such as the finding that the URLs accessed through Live Search in the last two years have an entropy of 22.44 bits, provide the minimum average number of bits required to encode that data. To take that minimum and re-state it as the actual amount of effort required (e.g. "It takes around 22 bits to guess the next url") seems like a fairly large leap of faith.
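The entropy numbers themselves are easy to reproduce on any log. A small sketch with made-up click counts shows both the calculation and the "2^H" reading that turns 22 bits into "millions, not billions" -- and why it is an average-case lower bound rather than the cost of handling any particular request:

```python
import math
from collections import Counter

# Shannon entropy of clicked URLs from a made-up log. 2**H is the size of
# a uniform distribution with the same entropy -- the sense in which
# "22 bits" maps to millions of effective URLs. It bounds the average
# coding cost, not the cost of any single request.

clicked_urls = ["en.wikipedia.org/wiki/Seattle"] * 50 + \
               ["www.imdb.com/title/tt0468569"] * 30 + \
               ["example.com/rare-page-%d" % i for i in range(20)]

counts = Counter(clicked_urls)
total = sum(counts.values())
entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
print("entropy: %.2f bits, effective vocabulary: %.0f urls" % (entropy, 2 ** entropy))
```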

Practical issues -- such as new or updated pages and the need to respond immediately to changes in behavior around a major news event -- also get in the way here and make it seem unlikely that "large investments in expensive computer systems in the cloud to index billions of pages ... could ... be wiped out [because] ... a small cache of a few million pages ... [captures] much of the value."

And, other looks at similar data have said there is little potential to capture all of the value of a search engine with a large cache. For example, in SIGIR 2007, Baeza-Yates et al. wrote that:
In the query log we used, 88% of the unique queries are singleton queries, and 44% are singleton queries out of the whole volume. Thus, out of all queries in the stream composing the query log, the upper threshold on hit ratio is 56%.

This is because only 56% of all the queries comprise queries that have multiple occurrences. It is important to observe, however, that not all queries in this 56% can be cache hits because of compulsory misses. This is different from capacity misses, which happen due to space constraints on the amount of memory the cache uses. If we consider a cache with infinite memory, then the hit ratio is 50%. Note that for an infinite cache there are no capacity misses.
Yet there remains much to consider here. While there may be large numbers of singleton queries, do those queries matter? Are there clicks on them? What would be the impact of dropping many of them (either by mapping them to similar queries that are in the cache or by returning no results)? The measures of entropy given by Qiaozhu imply that many of these queries do not matter to users, part of why I found the paper so thought-provoking. Looking at cache size with respect to user experience -- satisfying the queries that matter -- seems like fertile ground for further investigation.

The WSDM paper also had an interesting discussion of how much personalization could help with search, where "personalization" here is limited to looking at how users from the same IP range (e.g. grouping everyone in the 156.111.*.* block together). The authors found that the variation (as measured by entropy) of the URLs clicked for a query for an IP range was much less than the entropy overall. In giving an example of how MSG can be disambiguated to mean either "Madison Square Garden" or "Mono-sodium Glutamate", they imply that this is primarily due to the implicit information about location contained in IP addresses.

In this personalization/localization analysis and some of the other analyses in the paper, I did wonder if the drops in entropy are at least partially explained by reducing the data size. That is, when I limit to queries from a small IP range, there may be only a few people who did that query and clicked results, which would give me low entropy, but only because I have just a few examples. Even so, the conclusion makes intuitive sense, that adding information about IP address to queries can help disambiguate some queries, especially local searches.
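That small-sample concern is easy to demonstrate: the plug-in entropy estimate drops as the groups get tinier even when the underlying click distribution has not changed at all. A synthetic sketch (all data generated, nothing from the paper):

```python
import math, random
from collections import Counter

# Small-sample caveat: the plug-in estimate of H(clicked URL | group)
# is biased low when each (query, IP block) group has only a few clicks,
# even though every group draws from the same 3-bit uniform distribution.

random.seed(0)
URLS = ["url%d" % i for i in range(8)]   # true entropy: log2(8) = 3 bits

def plugin_entropy(samples):
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def conditional_entropy(num_groups, clicks_per_group):
    groups = [[random.choice(URLS) for _ in range(clicks_per_group)]
              for _ in range(num_groups)]
    total = num_groups * clicks_per_group
    return sum((len(g) / total) * plugin_entropy(g) for g in groups)

print("big groups:  %.2f bits" % conditional_entropy(10, 1000))   # close to 3 bits
print("tiny groups: %.2f bits" % conditional_entropy(1000, 2))    # far below 3 bits
```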

Finally, the paper also includes brief but fun discussions of the difficulty of query suggestions, the variation in what people click on in the search results, and time of day and time of week effects. Definitely a fun and thought-provoking read.

Update: Video of Qiaozhu's talk is now available.

Update: Please see also my earlier post on a WWW 2007 paper by Dou et al. where they use "click entropy ... as a simple measurement on whether the query should be personalized."