Wednesday, November 30, 2011

Browsing behavior for web crawling

A recent paper out of Yahoo, "Discovering URLs through User Feedback" (ACM), describes the value of using the pages people browse to and click on (captured in Yahoo's toolbar logs) to inform their web crawler about new pages to crawl and index.

From the paper:
Major commercial search engines provide a toolbar software that can be deployed on users' Web browsers. These toolbars provide additional functionality to users, such as quick search option, shortcuts to popular sites, and malware detection. However, from the perspective of the search engine companies, their main use is on branding and collecting marketing statistics. A typical toolbar tracks some of the actions that the user performs on the browser (e.g., typing a URL, clicking on a link) and reports these actions to the search engine, where they are stored in a log file.

A Web crawler continuously discovers new URLs and fetches their content ... to build an inverted index to serve [search] queries. Even though the basic mechanism of a crawler is simple, crawling efficiently and effectively is a difficult problem ... The crawler not only has to continuously enlarge its repository by expanding its frontier, but also needs to refresh previously fetched pages to incorporate in its index the changes on those pages. In practice, crawlers prioritize the pages to be fetched, taking into account various constraints: available network bandwidth, peak processing capacity of the backend system, and politeness constraints of Web servers ... The delay to discover a Web page can be quite long after its creation and some Web sites may be only partially crawled. Another important challenge is the discovery of hidden Web content ... often ... backed by a database.

Our work is the first to evaluate the benefits of using the URLs collected from a Web browser toolbar as a form of user feedback to the crawling process .... On average, URLs accessed by the users are more important than those found ... [by] the crawler ... The crawler has a significant delay in discovering URLs that are first accessed by the users ... Finally, we [show] that URL discovery via toolbar [has a] positive impact on search result quality, especially for queries seeking recently created content and tail content.
The paper goes on to quantify the surprisingly large number of URLs found by the toolbar that are useful, not private, and not excluded by robots.txt. Importantly, many of these are deep web pages, visible only by running a query against a database, and hard to ferret out of that database any way other than watching which pages people actually visit.
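
To make the mechanics concrete, below is a minimal sketch (mine, not the paper's actual pipeline) of feeding toolbar-logged URLs into a crawl frontier while skipping URLs the crawler already knows about and URLs excluded by robots.txt. All the names here (enqueue_toolbar_urls, frontier, "ExampleCrawler") are hypothetical:

```python
# Minimal sketch, not the paper's pipeline: treat toolbar-logged URLs as
# crawl candidates, dropping ones already discovered or disallowed by
# each site's robots.txt.
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

_robots_cache = {}  # host -> RobotFileParser, one fetch per host

def allowed_by_robots(url, user_agent="ExampleCrawler"):
    """Check the URL against its host's robots.txt, caching the parser."""
    parts = urlsplit(url)
    host = f"{parts.scheme}://{parts.netloc}"
    rp = _robots_cache.get(host)
    if rp is None:
        rp = RobotFileParser(host + "/robots.txt")
        rp.read()  # synchronous fetch; a real crawler would do this politely
        _robots_cache[host] = rp
    return rp.can_fetch(user_agent, url)

def enqueue_toolbar_urls(toolbar_urls, known_urls, frontier):
    """Add novel, crawlable toolbar URLs to the crawl frontier."""
    for url in toolbar_urls:
        if url in known_urls:
            continue  # the crawler already found this one on its own
        if not allowed_by_robots(url):
            continue  # excluded by robots.txt, per the paper's filtering
        frontier.append(url)
        known_urls.add(url)
```

A real system would also need the paper's other filters, such as dropping likely-private URLs (session ids, login pages), before anything reaches the frontier.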

Also interesting are the metrics on pages the toolbar data finds first. People often send links to new web pages by e-mail or text message. Eventually, those links might appear on the web, but eventually can be a long time, and many of the URLs found first in the toolbar data ("more than 60%") are found well before the crawler manages to discover them ("at least 90 days earlier than the crawler").

Great paper out of Yahoo Research and a great example of how useful behavior data can be. It uses big data to help people help others find what they found.

Monday, November 28, 2011

What mobile location data looks like to Google

A recent paper out of Google, "Extracting Patterns From Location History" (PDF), is interesting not only for confirming that Google is studying how to use location data from mobile devices for a variety of purposes, but also for its description of the data they can get.

From the paper:
Google Latitude periodically sends [a user's] location to a server which shares it with his registered friends.

A user's location history can be used to provide several useful services. We can cluster the points to determine where he frequents and how much time he spends at each place. We can determine the common routes the user drives on, for instance, his daily commute to work. This analysis can be used to provide useful services to the user. For instance, one can use real-time traffic services to alert the user when there is traffic on the route he is expected to take and suggest an alternate route.

Much previous work assumes clean location data sampled at very high frequency ... [such as] one GPS reading per second. This is impractical with today's mobile devices due to battery usage ... [Inferring] locations by listening to RF-emissions from known wi-fi access points ... requires less power than GPS ... Real-world data ... [also] often has missing and noisy data.

17% of our data points are from GPS and these have an accuracy in the 10 meter range. Points derived from wifi signatures have an accuracy in the 100 meter range and represent 57% of our data. The remaining 26% of our points are derived from cell tower triangulation and these have an accuracy in the 1000 meter range.
The paper goes on to describe how they clean the data and pin noisy location trails to roads. But the most interesting tidbit for me was how few of their data points come from GPS and how much they have to rely on less accurate cell tower and WiFi hotspot triangulation.
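
As a toy version of the "where he frequents" clustering (my own illustration, not the paper's algorithm; the 150 meter thresholds are assumptions I picked), one could greedily group nearby fixes and discard points whose reported accuracy is too coarse, which for this data would throw out the cell tower fixes:

```python
# Toy sketch, not Google's method: greedily cluster a location history into
# frequented places, ignoring fixes with a coarse accuracy radius.
import math

EARTH_RADIUS_M = 6_371_000.0

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def cluster_places(points, radius_m=150.0, max_accuracy_m=150.0):
    """points: iterable of (lat, lon, accuracy_m) fixes.
    Returns a list of (centroid_lat, centroid_lon, num_points) clusters."""
    clusters = []  # each entry: [sum_lat, sum_lon, count]
    for lat, lon, acc in points:
        if acc > max_accuracy_m:
            continue  # e.g., ~1000 m cell tower fixes are too noisy here
        for c in clusters:
            clat, clon = c[0] / c[2], c[1] / c[2]
            if haversine_m(lat, lon, clat, clon) <= radius_m:
                c[0] += lat
                c[1] += lon
                c[2] += 1
                break
        else:
            clusters.append([lat, lon, 1])
    return [(c[0] / c[2], c[1] / c[2], c[2]) for c in clusters]
```

With a quarter of the data coming from 1000 meter cell tower fixes, some gate like this matters; averaging everything together would smear distinct places into one blob.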

A lot of people have assumed mobile devices would provide nice trails of accurate and frequently sampled locations. But, if the Googlers' data is typical, it sounds like location data from mobile devices is going to be very noisy and very sparse for a long time.

Tuesday, November 15, 2011

Even more quick links

Even more of what has caught my attention recently:
  • Spooky but cool research: "Electrical pulses to the brain and muscles ... activate and deactivate the insect's flying mechanism, causing it to take off and land ... Stimulating certain muscles behind the wings ... cause the beetle to turn left or right on command." ([1])

  • Good rant: "Our hands feel things, and our hands manipulate things. Why aim for anything less than a dynamic medium that we can see, feel, and manipulate? ... Pictures Under Glass is old news ... Do you seriously think the Future Of Interaction should be a single finger?" ([1])

  • Googler absolutely shreds traditional QA and argues that the important thing is getting a good product, not implementing a bad product correctly to spec. It's a long talk; if you're short on time, the talk starts at 6:00, the meat starts at 13:00, and the don't-miss parts are at 17:00 and 21:00. ([1])

  • "There has been very little demand for Chromebooks since Acer and Samsung launched their versions back in June. The former company reportedly only sold 5,000 units by the end of July, and the latter Samsung was said to have sold even less than that in the same timeframe." ([1])

  • With the price change to offer Kindles at $79, Amazon is now selling them below cost ([1])

  • Personalization applied to education, using the "combined data power of millions of students to provide uniquely personalized learning to each." ([1] [2] [3] [4] [5] [6])

  • It is common to use human intuition to choose algorithms and tune their parameters, but this is the first I've ever heard of using games to crowdsource algorithm design and tuning ([1])

  • Great slides from a RecSys tutorial by Daniel Tunkelang that really capture the importance of UX and HCIR in building recommendation and personalization features ([1])

  • Bing finally figured out that when judges disagree with clicks, clicks are probably right ([1])

  • Easy to forget, but the vast majority of US mobile devices still are dumbphones ([1])

  • Finally, finally, Microsoft produces a decent mobile phone ([1])

  • Who needs a touch screen when any surface can be a touch interface? ([1])

  • Impressive augmented reality research demo using Microsoft Kinect technology ([1])

  • Very impressive new technique for adding objects to photographs, reproducing lighting, shadows, and reflections, and requiring just a few corrections and hints from a human about the geometry of the room. About as magical as the new technology for reversing camera shake to bring blurry pictures back into focus. ([1] [2])

  • Isolation isn't complete in the cloud -- your neighbors can hurt you by hammering the disk or network -- and some startups have decided to go old school and move back to owning their own hardware ([1] [2])

  • "The one thing that Siri cannot do, apparently, is converse with Scottish people." ([1])

  • Amazon grew from under 25,000 employees to over 50,000 in two years ([1])

  • Google Chrome is pushing Mozilla into bed with Microsoft? Really? ([1])

  • Is advice Steve Jobs gave to Larry Page the reason Google is killing so many products lately? ([1])

  • Why does almost everyone use the default software settings? Research says it appears to be a combination of minimizing effort, an assumption of implied endorsement, and (bizarrely) loss aversion. ([1])