Saturday, July 30, 2005

Findory in Forbes

Forbes published a review of Findory in their Meta Blogs section of Forbes.com Best of the Web.

My favorite line: "The site may know what you like better than you do."

Friday, July 29, 2005

Propagating trust paper

Yahoo Research recently redesigned their website, spurring me to take a peek at their recent publications.

A bunch of interesting papers up there, but I wanted to comment on one of them, "Propagation of trust and distrust" by Guha et al.

The paper is about predicting how much people will trust other people in a large social network like, say, Yahoo 360.

The basic idea is similar to the TrustRank paper in that trust and distrust are propagated out across each connection in the network to predict trustworthiness of nodes across the entire network. One difference is that the Guha et al. paper is focused on predicting "trust between any two people in the system with high accuracy", not just a global measure of trust.
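To give a feel for what propagating trust across connections might look like, here is a minimal Python sketch. It is not the algorithm from the Guha et al. paper. It simply multiplies weights along short paths, refuses to propagate through people who are distrusted, and discounts longer paths. The function name `propagate_trust`, the `decay` factor, and the toy network are all assumptions made up for illustration.

```python
def propagate_trust(direct, steps=3, decay=0.5):
    """Estimate pairwise trust by propagating direct trust along paths.

    direct: dict mapping (truster, trustee) -> weight in [-1, 1],
            where negative values represent distrust.
    Returns a dict with an estimated score for every pair reached.
    """
    people = sorted({p for pair in direct for p in pair})
    # Start with the directly stated opinions.
    total = dict(direct)
    frontier = dict(direct)
    for step in range(1, steps):
        next_frontier = {}
        for (a, b), w_ab in frontier.items():
            for c in people:
                w_bc = direct.get((b, c))
                # Only follow opinions held by people that a trusts so far;
                # nothing is propagated through a distrusted person.
                if w_bc is None or c == a or w_ab <= 0:
                    continue
                next_frontier[(a, c)] = next_frontier.get((a, c), 0.0) + w_ab * w_bc
        # Longer paths count for less.
        for pair, w in next_frontier.items():
            total[pair] = total.get(pair, 0.0) + (decay ** step) * w
        frontier = next_frontier
    return total

network = {
    ("alice", "bob"): 1.0,
    ("bob", "carol"): 1.0,
    ("bob", "dave"): -1.0,
}
scores = propagate_trust(network)
```

In this toy network, Alice never stated an opinion about Carol or Dave, but she ends up with a positive score for Carol and a negative one for Dave because of what Bob thinks. That is the pairwise flavor of prediction the paper is after, as opposed to a single global score per person.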

Applications at Yahoo are numerous. I could imagine trying to apply it to finding authoritative and trustworthy reviews in Yahoo Local and Yahoo Shopping, determining the reputation of sellers in Yahoo Shopping and Yahoo Auctions, and discovering useful weblogs in My Yahoo.

By the way, Prabhakar Raghavan is one of the co-authors of the paper. Prabhakar Raghavan was CTO at Verity and was recently poached by Yahoo to head Yahoo Research.

Thanks, Ravi Kumar, for pointing me to a copy of the paper.

Thursday, July 28, 2005

Google Sawzall

Google Labs has a new paper out, "Interpreting the Data: Parallel Analysis with Sawzall".

Sawzall is a high level, parallel data processing scripting language built on top of MapReduce. The system allows Google to do distributed, fault tolerant processing of very large data sets.
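For readers who have not seen this style of processing, here is a rough Python analogy. It is not Sawzall syntax and not Google's implementation. It only shows the general shape: a function looks at each shard of records in isolation and emits partial results, and a separate aggregation step merges them. The names `process_shard` and `aggregate` are made up for illustration, threads stand in for machines, and there is no fault tolerance here.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def process_shard(records):
    """Per-shard phase: look at each record in isolation and emit counts."""
    emitted = Counter()
    for line in records:
        for word in line.split():
            emitted[word.lower()] += 1
    return emitted

def aggregate(partials):
    """Aggregation phase: merge the partial results from every shard."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

# Two tiny shards of input; a real job would read from many machines.
shards = [
    ["data yummy data", "gimme data"],
    ["yummy yummy"],
]
with ThreadPoolExecutor() as pool:
    word_counts = aggregate(pool.map(process_shard, shards))
```

Because each shard is processed independently, a failed shard could in principle be rerun without redoing the rest of the job, which is the property that makes this style attractive on large clusters.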

Here's an excerpt from Section 14 "Utility" of the paper:
Although it has been deployed only for about 18 months, Sawzall has become one of the most widely used programming languages at Google.

One measure of Sawzall's utility is how much data processing it does. We monitored its use during the month of March 2005.

During that time, on one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2x10**15 bytes of data (2.8PB) and wrote 9.9x10**12 bytes (9.3TB) ... The jobs collectively consumed almost exactly one machine-century.
2.8 petabytes in one month. Yowsers.

Data. Yummy, yummy data. Gimme, gimme, gimme!

Update: Brian Dennis describes Sawzall, MapReduce, and other powerful tools at Google as "major force multipliers."

Feedster and My AOL

Feedster announces a deal with AOL to power feed search and subscriptions on the new My AOL. Congrats, Scott and Scott!

Chris Sherman posts a detailed review of My AOL. An excerpt on the threat to Yahoo:
In launching the My AOL portal, AOL has taken another step toward demolishing the "walled garden" of content the service has been known for .... The launch of My AOL is a direct shot over Yahoo's bow .... AOL now offers a compelling alternative for a start page for many people.
I like the way My AOL focuses on casual users, providing a nice default page and making it easy for people to do small and quick customizations. MSN's start.com/3 seems to have a similar approach.

Sucking up the talent

Ben Elgin at BusinessWeek reports on the "hiring binge" at Google and Yahoo that is "swallowing up some of tech's best and brightest."

The article mentions some of the high profile hires, including Adam Bosworth (BEA -> Google), Louis Monier (Altavista & eBay -> Google), Larry Tesler (Apple & Amazon -> Yahoo), Usama Fayyad (Microsoft & Revenue Science -> Yahoo), and Rob Pike (Bell Labs -> Google).

Moreover CEO James Pitkow has a great quote in the article, saying that Google has "created a Willy Wonka effect" since "engineers want to work on the coolest problems with the smartest people."

While Google and Yahoo may have been able to attract top talent, the article is less positive toward Microsoft's efforts:
Not only has [Microsoft] lost several top minds to Google in recent years, the Redmond (Wash.) company is also facing tougher competition for talent coming out of universities, even in its own backyard. Oren Etzioni, a professor of computer science at the University of Washington in Seattle, says Google has hired most of the top one-third of his search class in each of the past two years.
Little startups have been impacted as well. JotSpot CEO Joe Kraus is quoted as saying, "If you're talking to someone great, they're invariably talking to Google, and they often have an offer."

We've found that to be true at Findory. The people we are most interested in bringing on board typically also are being pursued by Google, Amazon, and Microsoft.

[BusinessWeek article via Danny Sullivan]

Google and classified advertising

Steve Outing at Poynter writes about how real estate agents are moving from using newspaper classifieds to using search engine advertising:
Google has begun pursuing the real estate category directly, now employing three regional sales teams in the U.S. that are focused on the real estate and other classifieds categories. The search giant has been making presentations to major brokerage firms such as Century 21, RE/MAX, and Coldwell Banker ....

Already suffering by assaults on merchandise, auto, and housing rental categories by Craigslist (and other online competitors), now the real estate category looks about to be hit very hard by migration of local real estate spending to search advertising -- a form of advertising that was almost non-existent just three years ago.
This is all part of a larger trend of small merchants using AdWords and AdSense to find and target interested audiences directly instead of using newspaper classifieds, eBay, or more traditional forms of mass market advertising.

See also my earlier post, "Google, small businesses, and eBay".

Wednesday, July 27, 2005

MSN Screensaver

MSN launches a screensaver that displays news and weather.

Does this remind anyone else of Pointcast?

Tuesday, July 26, 2005

Flickr and feeds in Yahoo 360

Yahoo 360 launched several new features yesterday, including integrating Flickr photos and a feed reader.

[via Gary Price]

My Google, now with RSS feeds

When My Google was announced a couple months ago, some people expressed disappointment that it couldn't handle feeds.

Yesterday, Sean Knapp posted on the Google Blog that the latest version of My Google offers a bunch of new content including arbitrary RSS/Atom feeds.

Unfortunately, it feels a little klutzy to me. Adding feeds is awkward, there are no feed recommendations or related feeds, and I see no excerpts on the RSS feeds (titles only). It seems less powerful than My Yahoo or Bloglines. But it also doesn't seem like it is trying to be as easy to use and friendly for casual users as My AOL or the MSN Start.com prototype.

Good to see this from Google though. With this launch, all the search giants -- Yahoo, Ask, MSN, Google, and AOL -- now have a web-based feed reader.

Monday, July 25, 2005

Yahoo, data mining, and advertising

Fred Vogelstein at Fortune writes about "Yahoo's Brilliant Advertising Solution".

Here's an excerpt on data mining for Yahoo's targeted advertising:
[Yahoo] can also predict the probable response rate to the ads on each segment of the portal. It can predict what time of day the ads are likely to be most effective.

And increasingly, by analyzing "click streams" on its network, Yahoo can spot potential buyers at various stages of the consideration process.

In other words, by looking at the billions of user clicks that flow through its servers every day, Yahoo is getting better and better at figuring out that a given pattern -- say, a user who's looked up football on Yahoo Sports, checked out adventure movies on Yahoo Entertainment, and compared truck prices on Yahoo Autos -- means the browser is interested in buying a Jeep and is just beginning to think about a purchase.
Advertising is information, information about products and services you don't know about and might want. Advertising should be relevant, useful, and interesting. Targeted, personalized advertising wastes less of your time and focuses you in on the information you need.

Yahoo gets 28 cents per page view

In a brief note, John Battelle mentions a report from analyst Safa Rashtchy that claims that Yahoo gets $.28 in revenue per page view.

$.28 per page view would be pretty remarkable, so I investigated. It appears to be true. In Yahoo's Q2 conference call slides (PDF), they say they got 3.14B page views in Q2 2005 and made $875M in revenue.

That's $.28 per page view. Impressive.

Seven months ago, it made a bit of a splash that Google was earning a dime per search. Yahoo's current number of $.28 per page view is quite a bit of progress on top of that. It appears the potential to monetize traffic with online advertising just keeps getting better and better.


Update: Umm... No, they don't.

In the comments to this post, Henrik Torstensson points out that the 3.14B page views is per day, not per quarter, so the revenue per page view is closer to $0.003. John Battelle, in his post, quoted Safa Rashtchy as saying $.28 "of revenue per average daily page," which I then misinterpreted.

Sorry, folks, my mistake. I got this one totally wrong.
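For anyone who wants to check the arithmetic, here is a quick sketch in Python using the numbers from the slides. The only thing assumed beyond those numbers is a 91-day quarter (April through June).

```python
quarterly_revenue = 875e6   # dollars of revenue in Q2 2005
daily_page_views = 3.14e9   # page views per day, not per quarter
days_in_quarter = 91        # April (30) + May (31) + June (30)

# The mistaken reading: treat the daily figure as the quarterly total.
mistaken = quarterly_revenue / daily_page_views

# The corrected reading: scale daily page views up to the full quarter.
corrected = quarterly_revenue / (daily_page_views * days_in_quarter)
```

The first figure rounds to $.28 and the second to about $0.003, roughly a factor of 91 apart, which is exactly the per-day versus per-quarter mix-up.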

Your cell phone knows you

Ryan Singel at Wired reports on some work at the MIT Media Lab to use data from cell phones to track and predict the behavior of their owners.
Cell phones know whom you called and which calls you dodged, but they can also record where you went, how much sleep you got and predict what you're going to do next.

[Nathan] Eagle's Reality Mining project logged 350,000 hours of data over nine months about the location, proximity, activity and communication of volunteers, and was quickly able to guess whether two people were friends or just co-workers ... Eagle's algorithms were able to predict what people ... would do next and be right up to 85 percent of the time.

Users create their own automatic life diary that could be searched and queried. "I can go ask it, 'How much sleep did I get in October?' 'When was the last time I had lunch with Adam?' 'Where did I go after that?'" said Eagle.
Microsoft Research has a project with similar goals called MyLifeBits.

Every TV show at your fingertips

Cory Doctorow writes about a remarkable PVR from Promise TV that:
[grabs] the entire broadcast TV multiplex -- all the channels being broadcast in the UK -- slices them up according to the free, over-the-air electronic programming guide, and stores an entire month's worth.

Why program a TiVo to get certain shows for you when you can record every single show on the air, all at once, and then use recommendations, search, a grid, or any other means you care to name to figure out which of those thousands and thousands and thousands of hours of programming you want to watch.
Clever idea. Massive drives are getting cheap enough that recording every show broadcast is a real possibility.

MSN Virtual Earth launches

MSN launched Virtual Earth today.

So far, the product seems closer to Google Maps than Google Earth.

Chris Sherman has a detailed review.

Saturday, July 23, 2005

MSN blog search coming soon?

Niall Kennedy speculates that MSN will be announcing a weblog search in the next couple days.

Just a few weeks ago, Steve Rubel uncovered screen shots of Yahoo's upcoming weblog search.

The entry of the search giants into blog search was widely predicted. The only thing I find surprising is that Google is absent.

For more on what we'll see in the future of weblog search, see my earlier post, "Personalized blog search".

Update: Stefanie Olsen at CNet reports that Yahoo has launched blog search in Korea. Blog search will launch in the US "in the coming weeks."

Friday, July 22, 2005

RSS for the mainstream

Steve Outing at Poynter Online posts that some feed readers are going out of their way to hide the word RSS from their users. He quotes Rob Jan de Heer as saying:
We never mentioned the word RSS. Ever. Because this product is meant for the average computer user, not the computer wizards most of us are. And let's face it, most people have never heard of RSS.
Exactly right. Why expose things called RSS, Atom, and XML to readers at all? Do they care what these data formats are? No, only geeks like us care. Mainstream readers just want to read news.

See also my earlier post, "XML is for geeks".

Statistical bug isolation

Warning: If you read this weblog only because you are interested in personalization and not because you're a computer science geek, you probably won't be interested in this post.

One of the few nice things about flying around the country is that I get to catch up on some of my reading.

Alex Aiken from Stanford CS will be giving a talk on Cooperative Bug Isolation at University of Washington. It sounds really cool:
This talk presents techniques for the systematic monitoring of thousands to millions of distributed program executions. We discuss how to exploit this information in a particular application: Using partial information gathered from many program executions to automatically isolate the causes of bugs.
As much as I'd like to attend, I won't be able to, so I pulled up one of his papers instead, Scalable Statistical Bug Isolation.

The paper is great. It's a simple but really compelling idea. Instrument the code with predicates (e.g. (x == NULL)). Run the code in production watching for crashes or other failure scenarios. Do credit assignment back to the predicates to determine the cause of the crashes.

Sounds easy, but there are a few nice tricks in here. They adaptively sample the data based on frequency of execution, making the performance impact of the instrumentation undetectable. The credit assignment is much harder than you might think at first -- as they explain in depth -- but they discover a nifty way of floating the root cause of the bugs to the top, eliminating redundancies, and avoiding predicates that indicate behaviors caused by multiple bugs or that are merely correlated with (but not causes of) the bugs.
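To make the credit assignment step a little more concrete, here is a toy Python sketch. It is a simplification under my own assumptions, not the paper's full method: it scores each predicate by how much the crash rate rises when the predicate is true, compared with the crash rate of all runs where the predicate was sampled at all. The paper's redundancy elimination and sampling are left out, and `score_predicates` and the sample runs are made up for illustration.

```python
def score_predicates(runs):
    """Rank predicates by how much observing them true raises the crash rate.

    runs: list of (observations, crashed) pairs, where observations maps a
          predicate name to True/False for each predicate sampled in that run.
    """
    stats = {}  # predicate -> [true_fail, true_total, seen_fail, seen_total]
    for observations, crashed in runs:
        for predicate, was_true in observations.items():
            s = stats.setdefault(predicate, [0, 0, 0, 0])
            s[2] += crashed  # bool counts as 0 or 1
            s[3] += 1
            if was_true:
                s[0] += crashed
                s[1] += 1
    scores = {}
    for predicate, (tf, tt, sf, st) in stats.items():
        if tt == 0:
            continue  # never observed true, so nothing to blame
        failure = tf / tt   # crash rate when the predicate was true
        context = sf / st   # crash rate whenever the predicate was sampled
        scores[predicate] = failure - context
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

runs = [
    ({"x == NULL": True,  "n > 10": True},  True),
    ({"x == NULL": True,  "n > 10": False}, True),
    ({"x == NULL": False, "n > 10": True},  False),
    ({"x == NULL": False, "n > 10": False}, False),
]
ranking = score_predicates(runs)
```

Here `x == NULL` floats to the top because every run where it was true crashed, while `n > 10` scores zero because it is true equally often in crashing and non-crashing runs. Subtracting the baseline is one simple way to avoid blaming predicates that just happen to sit on a code path that crashes anyway.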

What I found particularly compelling was that they were able to detect the root cause of bugs that caused heap corruption. As most software engineers know, heap corruption is painful to debug because, by the time the program crashes, the stack trace may be nowhere near the code that originally corrupted the heap. Many systems, including Microsoft Windows, rely on sending stack traces back from users to help diagnose bugs that made it into production, but the stack trace usually doesn't give you the data you need to track memory corruption bugs down. Their technique does.

Anyway, if you're a computer dork like me and feel like geeking out, the paper is worth downloading. It's a good read.

Update: Professor Ben Liblit (first author on the paper) contacted me about this post to recommend another paper, "Bug Isolation Via Remote Program Sampling", that covers the technical details of how they instrument the programs and to mention explicitly that people can help out their effort by downloading and using some of their instrumented binaries for Fedora Linux.

Wednesday, July 20, 2005

Findory API

Did you know that Findory has an API?

It's a simple REST API, accepting parameters through an HTTP GET request and returning results in RSS XML format. Our RSS page provides a nice GUI for picking the parameters you need to get the data you want.
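As a rough idea of what calling an API of this style looks like, here is a small Python sketch. The endpoint `http://api.example.com/rss` and the parameter names `q` and `source` are placeholders I made up for illustration, not Findory's actual URL or parameters, and a canned RSS document stands in for the network call so the sketch runs offline.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "http://api.example.com/rss"

def build_request_url(**params):
    """Encode the parameters into the query string of a GET request."""
    return BASE_URL + "?" + urlencode(sorted(params.items()))

def parse_rss_items(rss_xml):
    """Pull the item titles and links out of an RSS 2.0 response body."""
    root = ET.fromstring(rss_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

url = build_request_url(q="harry potter", source="blogs")

# A canned response stands in for fetching `url` over HTTP.
sample_response = """<?xml version="1.0"?>
<rss version="2.0"><channel><title>Example</title>
<item><title>First post</title><link>http://example.com/1</link></item>
<item><title>Second post</title><link>http://example.com/2</link></item>
</channel></rss>"""
items = parse_rss_items(sample_response)
```

Because the results come back as ordinary RSS, anything that already knows how to read a feed can consume them, which is much of the appeal of keeping the API this simple.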

On our API page, we included some ideas on things you could build. I particularly like the ideas for mashups, such as combining Amazon's and Findory's APIs to provide product information, news stories, and blog articles about the latest Harry Potter and other top selling titles.

Like everything at Findory, our API is simple and easy to use. Just as it should be.

Now, go forth and innovate! If you build something cool using our API, please let us know! We'd love to see it!

Update: Just a few hours later, Jason Dowdell and Pete Freitag created two mashups using Findory's API. Wow, that was fast! Great work!

I love the detail page for the latest Harry Potter book with blog posts about the book listed at the bottom of the page. Very, very cool.

Tuesday, July 19, 2005

Yahoo Shopping VP on personalization

Brian Smith has an interview with Yahoo Shopping GM/VP Rob Solomon.

I thought this tidbit on community and personalization was particularly interesting:
The next generation [shopping search] has to do with community features and more robust ratings and review. We will focus on bringing more content into the experience ... It's about search, content, community, and personalization.
Search helps people find something when they know exactly what they want. Content, ratings, and reviews help shoppers differentiate between products. Personalization helps surface interesting items buried in the long tail. Combined, shoppers can find and discover the right product at the right price.

See also my previous post, "Tyranny of choice and the long tail".

[interview found via Gary Price]

Findory at SES

Findory will be part of the "Meet the Blog & Feed Search Engines" panel on August 9 at the Search Engine Strategies conference.

Mark Fletcher (Bloglines/Ask), Peter Hirshberg (Technorati), Scott Rafer (Feedster), and Jim Pitkow (Moreover) also are on the panel.

Many of the other panels look interesting, including the panels on search history with Jonathan Leblang (A9), Marissa Mayer (Google), Tim Mayer (Yahoo), Jim Rainey (Ask Jeeves), and Grant Ryan (Eurekster) and on news search with Neil Budde (Yahoo), Jim Pitkow (Moreover), and Chris Tolles (Topix.net).

Looking forward to it!

Update: I had to come back early from SES, but it was great to see so many people there, including Gary Price, Marissa Mayer, Nathan Stoll, Jonathan Leblang, DeWitt Clinton, Chris Sherman, Michael Bazeley, Rich Skrenta, Jim Pitkow, Mark Fletcher, Jason Calacanis, Gary Stein, Neil Budde, Dave Sifry, Scott Rafer, and many others. Fun to bump into Ruben Ortega on the plane flight down as well.

Sunday, July 17, 2005

Building on the wisdom of the crowds

Chris Anderson posts some interesting thoughts on aggregation and recommendation systems:
Amazon and Google ... have built their brands on the power of their filters ... Google's search algorithms [and] Amazon's recommendations ... are nothing more than the wisdom of the crowds, the statistically measured opinions of millions of ... people. That's why we trust them.
Chris makes a great point. The power of people drives Amazon's and Google's systems. Algorithms may sort, sift, and aggregate the data, but the source of the knowledge is the actions of individuals.

Some talk about social software as if it is unique in the ability to tap into the wisdom of the crowds. Sharing is explicit in social software systems -- people have to explicitly state relationships and explicitly share recommendations -- so perhaps it is not obvious that there are more subtle ways to share this knowledge.

Systems like Amazon's recommendations and Google's PageRank implicitly share the wisdom of the crowds. Collaborative filtering-based systems like Amazon's recommendations use algorithms to find people with similar interests and share their wisdom. Google's PageRank algorithm mines the link graph -- links people made between web pages -- to surface authoritative and useful sources of information.

Implicit systems like Amazon's and Google's surface the wisdom of the crowds anonymously, quietly, and with no effort from users.
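Here is a toy Python sketch of the implicit idea. It is not Amazon's actual algorithm, just a minimal "people who bought this also bought that" counter. Nobody states a relationship or writes a recommendation. The suggestions fall out of what people did. The function names and the sample purchase histories are made up for illustration.

```python
from collections import Counter
from itertools import permutations

def also_bought(histories):
    """Count how often each pair of items shows up in the same history."""
    related = {}
    for items in histories:
        for a, b in permutations(set(items), 2):
            related.setdefault(a, Counter())[b] += 1
    return related

def recommend(related, item, n=2):
    """Return the n items most often seen alongside `item`."""
    return [other for other, _ in related.get(item, Counter()).most_common(n)]

# Each list is one anonymous shopper's purchase history.
histories = [
    ["harry potter", "eragon", "narnia"],
    ["harry potter", "eragon"],
    ["harry potter", "cookbook"],
    ["eragon", "narnia"],
]
related = also_bought(histories)
top_pick = recommend(related, "harry potter", n=1)
```

The shoppers here never interact and never label anything, yet someone looking at one book gets pointed to the title most often bought alongside it. A real system would also need to correct for items that are popular with everyone, which this sketch ignores.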

Thursday, July 14, 2005

Personalized blog search

In "Finding a blog in a haystack", Stephen Baker at BusinessWeek writes that "industry observers say it's likely a question of months before [MSN, Google, and Yahoo] dive into blog search."

Stephen mentions a couple interesting problems in blog search, timeliness (old news is no news) and relevance rank. Here is an excerpt on blog search relevance rank:
The challenge, though, is figuring out which posts should rank atop the results. Should it simply be the most recent? Or from a popular blogger? Or perhaps a blogger already bookmarked by the user, or even by those on the user's buddy list?
From current blog search engines, it's just the most recent posts that make it to the top. In the future, it will have to be more.

Readers want to see the most relevant, useful, and interesting news quickly and easily. Sorting by date isn't enough, not with millions of spam blogs, personal diaries, and junk blogs out there. Sorting by popularity or reputation is getting closer, but it discards the best part of weblogs, the sparkling gems stuck out in the long tail. Sorting by what your friends like or using a social network might work if everyone has an extensive network, but most people won't bother with the work of setting that up.

What you need is a way to sort by what is most relevant to you. You need a system that surfaces the blog posts that are most interesting to you and people like you. You need a site that learns what you like and focuses you in on the gems. You need a personalized relevance rank.

See also my previous post, "A relevance rank for news and weblogs".

[Found on Findory]

Monday, July 11, 2005

On-demand DVDs and the long tail

Amazon recently bought CustomFlix, a business that produces DVDs on demand. It's a move that will allow Amazon to push even further into the eclectic, low volume, long tail of movies.

[via TechDirt]

Personalized news on mobile phones

Michael Kanellos at CNet reports on BuddyBuzz, a project at Stanford that is trying to do personalized news on mobile phones.

The instructions for BuddyBuzz give some feel for how the app works.

Personalization is one way to make the most of scarce screen real estate, a problem that is particularly serious on mobile devices.

Friday, July 08, 2005

Yahoo and being underfoot

Yahoo is moving quickly these days, and it appears many smaller companies are at risk of being caught underfoot.

Yahoo launched MyWeb2.0 a week ago. Unfortunately, this puts del.icio.us, a nifty social bookmark site that just got funding, in a bad position, not only competing against clones like de.lirio.us, but also competing against a search giant.

And, as Gary Price reports, Yahoo HotJobs is adding job listings spidered from other sites, creating a job metasearch engine similar to the core idea behind many job search vertical startups like Indeed.com.

Now, Steve Rubel discovers that Yahoo will soon be launching a feed search engine similar to Technorati and Feedster. Rumors popped up five months ago about a Yahoo RSS Search after some folks spotted a suspicious Yahoo crawler in their logs. It is not surprising to find that the rumor is true.

In some ways, it is good for a startup to see the entry of a big company into its area since it attracts attention and legitimizes the field. Apple's support for podcasting is a big win for any startup working on podcasts. The entry of Yahoo into social tagging supports those startups betting that tagging will go mainstream.

But competing directly against these giants is scary if you have no differentiator. Back in January 2005, in "Will Technorati die?", I wondered how Technorati would survive the entry of the search giants into blog search. Since then, Technorati has launched some interesting differentiators, including aggressively going after blog tagging and some clever features for surfacing interesting parts of their data.

The recent pace of innovation has been exciting. But, for the little guys, standing still in all this excitement risks getting caught underfoot.

Update: For yet another example, Yahoo Real Estate just launched maps of rental listings. When they take the obvious next step of doing this for home sales, the feature will be similar to the products offered by Redfin and a few other startups.

Update: For more on Yahoo HotJobs and the many startups in the space, see Charlene Li's excellent article.

The good and bad of contextual ads

Mark Glaser at OJR has an interesting article on "The good, bad and ugly of contextual ads".

Among other things, the article points out that infrequent crawling can lead to problems where advertising is targeted to content that is no longer on the page. The problem is even worse for personalized sites, since the pages the crawler sees may be nothing like the pages each reader sees.

See also my previous post, "Personalized advertising on Findory".

Saturday, July 02, 2005

Findory growth

Another quarter, another Findory.com traffic graph!

That's one pretty chart. Nice exponential growth trend, eh? We love you guys too! Keep on using Findory!

This graph is the same as our previous graphs ([1] [2]) with Q2 2005 added.

Friday, July 01, 2005

Profile of Findory

Alarm:clock has an excellent profile of Findory with some details on our business. My favorite line: "It's a lean, mean, personalization machine."