Thursday, August 31, 2006

Google Personalized Search and Bigtable

One tidbit I found curious in the Google Bigtable paper was this hint about the internals of Google Personalized Search:
Personalized Search generates user profiles using a MapReduce over Bigtable. These user profiles are used to personalize live search results.
This appears to confirm that Google Personalized Search works by building high-level profiles of user interests from their past behavior.

I would guess it works by determining subject interests (e.g. sports, computers) and biasing all search results toward those categories. That would be similar to the old personalized search in Google Labs (which was based on Kaltix technology), where you had to specify your profile explicitly, except that now the profile is generated implicitly from your search history.
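To make that guess concrete, here is a minimal sketch of what profile-based biasing might look like. To be clear, this is my speculation, not Google's algorithm; the categorizer, weights, and blend factor are all invented for illustration:

    # My guess at profile-based biasing, not Google's actual algorithm.
    # The categorizer, weights, and blend factor are invented.
    from collections import Counter

    def build_profile(search_history, categorize):
        """Estimate interest in each subject category from past queries."""
        counts = Counter(categorize(q) for q in search_history)
        total = sum(counts.values())
        return {cat: n / total for cat, n in counts.items()}

    def personalized_score(result, profile, blend):
        """Mix the base relevance score with the user's category interest."""
        interest = profile.get(result["category"], 0.0)
        return (1 - blend) * result["score"] + blend * interest

    # A user whose history skews toward sports sees sports results boosted.
    profile = build_profile(
        ["mariners score", "nba trades", "python tutorial"],
        categorize=lambda q: "computers" if "python" in q else "sports")
    results = [{"title": "MSDN", "category": "computers", "score": 0.7},
               {"title": "ESPN", "category": "sports", "score": 0.6}]
    results.sort(key=lambda r: personalized_score(r, profile, blend=0.5),
                 reverse=True)   # ESPN now outranks MSDN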

My concern with this approach is that it does not focus on what you are doing right now, what you are trying to find, your current mission. Instead, it is a coarse-grained bias of all results toward what you generally seem to enjoy.

This problem is worse if the profiles are not updated in real time. The quote from the Bigtable paper suggests that the profiles are generated in an offline build, meaning they probably cannot adapt immediately to changes in behavior.

See also my earlier posts, "Google Bigtable paper", "A real personalized search from Google", and "More on Google personalized search".

Google Bigtable paper

Google has just posted a paper they are presenting at the upcoming OSDI 2006 conference, "Bigtable: A Distributed Storage System for Structured Data".

Bigtable is a massive, clustered, robust, distributed database system that is custom built to support many products at Google. From the paper:
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers.

Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth.

A Bigtable is a sparse, distributed, persistent multidimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
The paper is quite detailed in its description of the system, APIs, performance, and challenges.
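The data model is simple enough to mock up in a few lines. Here is a toy in Python that captures only the shape of the map -- the (row key, column key, timestamp) indexing into uninterpreted bytes, with sorted rows -- and none of the real system's distribution, persistence, or performance:

    # Toy model of Bigtable's data model only. None of the real
    # system's distribution, persistence, or compression is here.
    import time

    class ToyBigtable:
        def __init__(self):
            self._cells = {}   # (row, column) -> list of (timestamp, value)

        def put(self, row, column, value, timestamp=None):
            ts = timestamp if timestamp is not None else time.time()
            versions = self._cells.setdefault((row, column), [])
            versions.append((ts, value))
            versions.sort(key=lambda tv: tv[0], reverse=True)  # newest first

        def get(self, row, column):
            """Most recent version of a cell, or None."""
            versions = self._cells.get((row, column))
            return versions[0][1] if versions else None

        def scan(self, row_prefix):
            """Rows sort lexicographically, so prefix scans are natural."""
            for (row, column) in sorted(self._cells):
                if row.startswith(row_prefix):
                    yield row, column, self.get(row, column)

    # The paper's webtable example: reversed URLs as row keys.
    t = ToyBigtable()
    t.put("com.cnn.www", "contents:", b"<html>...</html>")
    t.put("com.cnn.www", "anchor:cnnsi.com", b"CNN")
    print(list(t.scan("com.cnn")))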

On the challenges, I found this description of some of the real world issues faced particularly interesting:
One lesson we learned is that large distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures assumed in many distributed protocols.

For example, we have seen problems due to all of the following causes: memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using (Chubby for example), overflow of GFS quotas, and planned and unplanned hardware maintenance.
Make sure also to read the related work section that compares Bigtable to other distributed database systems.

See also my previous posts, "Google's BigTable", "C-Store and Google BigTable", and "I want a big, virtual database".

Social software is too much work

Nick Carr has a good post today, "Social software in perspective", where he quotes from two posts ([1] [2]) critical of social software and then he says:
The crux of the problem is that, in most cases, social software is an extremely inefficient way for a person to get something done.

The crowd may enjoy the product of other people's inputs, but for the rather small group of individuals actually doing the work, it demands the investment of a lot of time for very little personal gain. It's a fun diversion for a while - and then it turns into drudgery.

It's very easy to confuse fads for trends ... Out in the real world, hardly anyone has even heard of Flickr or Digg or Delicious.
People are lazy, appropriately so. If you ask them to do work, most of them won't do it. From their point of view, you're only of value to them if you save them time.

See also my previous posts, "Yahoo gets social with MyWeb", "Implicit vs. explicit sharing", and "Summing collective ignorance".

Wednesday, August 30, 2006

Findory interview at Search Engine Lowdown

Garrett French at Search Engine Lowdown posted a long interview with me about Findory and personalization.

Monday, August 28, 2006

Google expanding in Bellevue?

John Cook at the Seattle PI reports that Google "is now taking a serious look at gobbling up nearly all of a 20-story office building under construction in downtown Bellevue."

If true, this would be a substantial expansion for Google in the Seattle area. John noted that "Google could house more than 1,000 employees" in the new building, nearly an order of magnitude increase from their current Seattle area presence.

Many of those hires probably would come from nearby Microsoft, University of Washington computer science, and Amazon.com.

[Found via Barry Schwartz]

Starting Findory: Marketing

Ah, marketing. Is there anything that techies like less?

It is obviously naively idealistic, but I think we geeks wish marketing were unnecessary. Wouldn't it be nice if people could easily and freely get the information they need to make informed decisions?

Sadly, information is costly, and the time spent analyzing information even more so. People generally do use advertisements to discover new products and rely on shortcuts such as brand reputation as part of their decision-making.

As much as we might hate it, marketing is important.

Marketing also is absurdly expensive. It is mostly out of reach for a self-funded startup. Though I recognized the need, Findory.com did almost no traditional marketing.

There were limited experiments with some advertising. For the most part, these tests showed the advertising spend to be relatively ineffective. The customer acquisition costs came out to a few dollars, cheap compared to what many are willing to pay, but more than a self-funded startup reasonably could afford.

Even without substantial advertising, for the first two years, Findory grew at about 100% per quarter. Most of this was from word of mouth and viral marketing features.

Findory tried to accelerate word of mouth by focusing marketing time and effort on PR with reporters and bloggers, sharing data through RSS feeds and APIs, and providing Findory content to websites and weblogs with Findory Inline.

Findory's growth has stalled lately, casting some doubt on the strategy of pursuing word of mouth and viral marketing alone. Again, the question comes up over whether to spend time and treasure on non-traditional or traditional marketing.

Thursday, August 24, 2006

Amazon launches utility computing service

Amazon launched the Amazon Elastic Compute Cloud, a utility computing service, in limited beta. From the page:
Amazon EC2 is a web service that provides resizable compute capacity ... It is designed to make web-scale computing easier for developers.

Just as Amazon Simple Storage Service (Amazon S3) enables storage in the cloud, Amazon EC2 enables "compute" in the cloud ... It provides you with complete control of your computing resources and lets you run on Amazon's proven computing environment ... Amazon EC2 changes the economics of computing by allowing you to pay only for capacity that you actually use.
Not sure what this has to do with selling merchandise online, but it is pretty interesting.

See also the Alexa Web Search Platform, which also allows computing and storage on Amazon's servers, and the Sun Grid Compute Utility.

[Found via Philipp Lenssen]

Monday, August 21, 2006

Seattle internet startups ordered by traffic

About a week ago, John Cook at the Seattle PI posted a great list of Seattle area internet startups.

I thought it might be fun to see a version of this list ordered by web traffic. With much thanks to John for the original, here is a reworked version that includes Alexa ranks and is sorted by Alexa rank:
  • Zillow, Alexa rank: 976
    Predictive online real estate technology.

  • 43Things/Robot Co-op, Alexa rank: 2233
    Social networking through sites such as 43things and 43people.

  • Newsvine, Alexa rank: 3492
    Blogging and social media.

  • Judy's Book, Alexa rank: 7590
    Local search and social networking.

  • Wetpaint, Alexa rank: 10987
    Wikis.

  • Trumba, Alexa rank: 11146
    Online calendars.

  • Jobster, Alexa rank: 13034
Job search and social networking.

  • PayScale, Alexa rank: 14497
    Predictive salary technology.

  • Mpire, Alexa rank: 14660
    Comparison shopping search engine.

  • GarageBand.com, Alexa rank: 15247
    Social networking music service.

  • Farecast, Alexa rank: 17716
    Predictive airfare technology.

  • Findory, Alexa rank: 19248
    Personalized news service.

  • HouseValues/HomePages, Alexa rank: 21720
    Online real estate.

  • Redfin, Alexa rank: 22117
    Online real estate.

  • BuddyTV, Alexa rank: 23795
    Social networking TV community.

  • Blue Dot, Alexa rank: 38110
    Social networking recommendation service.

  • Mixxer, Alexa rank: 41301
    Mobile music and social networking.

  • Bag Borrow or Steal, Alexa rank: 44630
    Luxury goods borrowing service.

  • PixPo, Alexa rank: 57770
    Online video broadcasting.

  • Vizrea, Alexa rank: 58137
    Digital media distribution and social networking.

  • Broadband Sports, Alexa rank: 62649
Online video.

  • PixPulse, Alexa rank: 63524
    Mobile media publishing and social networking.

  • TripHub, Alexa rank: 88059
    Online group travel community.

  • Pluggd, Alexa rank: 111033
    Podcasting.

  • HomeMovie.com, Alexa rank: 138783
    Online video.

  • Melodeo, Alexa rank: 186463
    Podcasting.

  • Musicmobs, Alexa rank: 193467
    Sharing and discovery online music service.

  • PhoneSherpa, Alexa rank: 276364
    User generated ringtones and graphics.

  • Snapvine, Alexa rank: 301367
    Audio comments for blogs and social networking sites.

  • Weedshare, Alexa rank: 302232
    Online music swapping service.

  • Jookster, Alexa rank: 308432
    Social search.

  • Cdigix, Alexa rank: 328498
    Digital media distribution.

  • Super Oyster, Alexa rank: 349830
    Ticket buying marketplace that allows people to buy out others on an online waiting list.

  • Sampa, Alexa rank: 378389
    Web site building tools and storage.

  • Inrix, Alexa rank: 389879
    Real time and predictive traffic technology.

  • Healia, Alexa rank: 404933
    Personalized health search engine.

  • Curious Office, Alexa rank: 409317
    Internet incubator.

  • Avvo, Alexa rank: 411395
    Consumer-oriented online legal service. (Stealth)

  • Cozi, Alexa rank: 486398
    Digital home technology.

  • GoGoMo, Alexa rank: 490065
Digital media distribution and storage.

  • Pelago, Alexa rank: 561566
    Mobile social networking. (Stealth)

  • SecondSpace, Alexa rank: 562914
    Online consumer service. (Stealth)

  • Yodio, Alexa rank: 633350
    Podcasting.

  • NewsCloud, Alexa rank: 640911
    Social media.

  • PrestoGifto, Alexa rank: 689970
    Customized T-shirts, coffee mugs and other products for online merchants.

  • SnapTune, Alexa rank: 741315
    Online radio service.

  • GridNetworks, Alexa rank: 851439
    Digital media delivery.

  • Ontela, Alexa rank: 981609
    Mobile software platform.

  • SmartSheet, Alexa rank: 1359158
    Collaboration tools for small business owners.

  • Pheromone Trail, Alexa rank: 1415162
    Social search.

  • Bill Monk, Alexa rank: 1438348
    Mobile bill paying and online trading service.

  • Positive Motion, Alexa rank: 1484923
    User-generated flash cards on mobile phones.

  • Peppers and Pollywogs, Alexa rank: 1847159
    Online party planning service.

  • TextPayMe, Alexa rank: 2311186
    Mobile bill paying.

  • ImageKind, Alexa rank: 2372741
    Online art community.

  • GimmeNow.com, Alexa rank: 3609578
    Online shopping service.

  • SwitchGear, Alexa rank: 4312922
    Digital home software. (Stealth)

  • Joingle, Alexa rank: 999999999
    Unknown.

  • DigWorks, Alexa rank: 999999999
    Digital media.

  • Grouped.com, Alexa rank: 999999999
    Social networking dating site.

  • Shelfari, Alexa rank: 999999999
    Unknown.

  • One Degree, Alexa rank: 999999999
    (Stealth)

  • Treemo, Alexa rank: 999999999
Mobile social networking (formerly HyperMob).
Even though Alexa traffic data is not very reliable (see my earlier post), I think this new version of John's list is fun and interesting.
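By the way, reproducing the reordering is trivial once you have the ranks. A tiny sketch, with a few ranks hardcoded since I gathered them by hand (999999999 is the no-data sentinel, as in the list above):

    # Sites with no Alexa data carry the sentinel rank 999999999,
    # as in the list above, so they sort to the bottom.
    NO_DATA = 999999999

    startups = [("Zillow", 976), ("Findory", 19248), ("Shelfari", NO_DATA)]

    for name, rank in sorted(startups, key=lambda s: s[1]):
        shown = str(rank) if rank != NO_DATA else "no data"
        print(name + ": Alexa rank " + shown)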

I have to say, I am pleased with how high Findory is in this list. Roughly tied with Farecast, GarageBand, HouseValues, and Redfin, it seems.

It also appears that Findory has the highest web traffic by far of any Seattle area self-funded internet startup; I am fairly sure that all the other startups near or above the level of Findory have multiple millions in funding. Go, little startup, go!

Friday, August 18, 2006

Starting Findory: Customer feedback

Findory customer service is light, averaging a couple e-mails per day. They are mostly suggestions, a few requests for help, and a few complaints.

Of the few complaints, the most common are various forms of rants about either liberal bias or conservative bias. Partially this is due to our political climate right now where any site at all related to news is bombarded by absurdity from the extreme fringes.

However, accusations of bias also may be due to Findory explicitly trying to not pigeonhole people. The idea of personalized news has been around for a decade or so. One common criticism of the idea is that personalization may pigeonhole people, showing them only what they want to see. On Findory, opinion articles are not selected based on a particular view, with the result that people are exposed to viewpoints they might prefer to ignore.

It is an interesting question whether Findory should have given people what they wanted -- let people put on their blinders and pigeonhole themselves -- or if it is doing the right thing for the long-term by helping people discover a breadth of information and viewpoints.

On the requests for help, the most common is somewhat amusing. People often expect Findory to be harder to use than it is. It surprises me, but some write in and ask, "How do I set it up? What do I do?"

Perhaps Findory is too simple. "Just read articles! That's it," I often say. "Findory learns from what you do." Everyone has been trained to expect sites to be more difficult -- lengthy registration and configuration, for example -- and it appears it can be confusing when all those barriers are removed.

On the suggestions, the most common are requests for a feed reader (which we did as Findory Favorites), interest in rating articles and sources (which we have prototyped a few times but never launched because it seems to change the focus away from reading), a desire to see news photos inline on the page (potentially costly, but something we are exploring), extending the crawl (always working on it), and support for non-English languages (prohibitively expensive due to the changes required in the recommendation engine).

Customer feedback is useful. Many Findory features have been implemented as a direct result of suggestions from Findory readers.

However, just listening to customers is not enough. Customers will tend to suggest iterative improvements and request more features. Customers will not offer ideas for big jumps, big ideas, or major new products. Customers will not balance requests for new features against simplicity and usability.

When looking at customer feedback, I find it is important to look beyond the words to try to divine the intent. The best solution may be something completely different than what was suggested.

See also my other posts in the Starting Findory series.

Talk on fast autocompletion

Holger Bast gave a fun talk at Google called "Type Less, Find More: Fast Autocompletion Search with a Succinct Index".

The demo at the beginning is a nice demonstration of the value of doing prefix matches on multiple search terms. It shows a potential direction for future search engines, away from the one-shot deal of "enter keywords, get results" and toward an interactive dialogue where the search engine constantly suggests possible results and refinements.

The talk enters a second stage around 19:00 where Holger spends most of his time on the challenging problem of how to build indexes to support prefix search efficiently. Quite interesting as well.
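As a much simpler illustration of the basic idea, prefix completion over a sorted vocabulary can be done with binary search, and expanding only the last, still-being-typed term gives the multi-term behavior from the demo. This sketch ignores everything that makes Holger's succinct index interesting, but it shows the shape of the problem:

    # Prefix completion over a sorted vocabulary via binary search.
    # This ignores the clever succinct index from the talk; it only
    # illustrates multi-term prefix matching.
    import bisect

    def complete(vocab, prefix):
        """All words in the sorted list vocab that start with prefix."""
        lo = bisect.bisect_left(vocab, prefix)
        hi = bisect.bisect_left(vocab, prefix + "\uffff")
        return vocab[lo:hi]

    def candidate_queries(vocab, terms):
        """Expand only the last, still-being-typed term."""
        *done, partial = terms
        return [" ".join(done + [w]) for w in complete(vocab, partial)]

    vocab = sorted(["search", "searching", "seattle", "semantic", "sigir"])
    print(candidate_queries(vocab, ["faceted", "sea"]))
    # ['faceted search', 'faceted searching', 'faceted seattle']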

Holger also had two papers at the SIGIR conference on this work, one (PDF) with the same name as his Google talk but with more detail and another called "When You're Lost for Words: Faceted Search with Autocompletion" (PDF).

Wednesday, August 16, 2006

New personalized web search at Findory

Findory just launched a new alpha test version of its personalized web search. You can get to it by using the search box in the upper right corner of the screen and selecting "Web" from the drop down menu.

In the web search results, any item with a personalized icon after the title is recommended by Findory. Findory recommends and reorders web search results based on each person's previous searches, the search results on which that person has clicked, and what other web searchers have found interesting.
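To give a flavor of what this kind of reranking can look like, here is a highly simplified sketch, not Findory's actual production code; the signals, weights, and domain-level granularity are all stand-ins:

    # Simplified sketch, not Findory's production code. Promote results
    # from domains this searcher clicked before and, more weakly,
    # domains other searchers found interesting.
    from urllib.parse import urlparse

    def rerank(results, my_domains, popular_domains, boost=2.0):
        def score(indexed):
            i, result = indexed
            domain = urlparse(result["url"]).netloc
            s = -i                      # start from the engine's own order
            if domain in my_domains:
                s += boost              # personal history signal
                result["personalized"] = True   # gets the icon
            if domain in popular_domains:
                s += boost / 2          # other searchers' clicks
            return s
        ranked = sorted(enumerate(results), key=score, reverse=True)
        return [r for _, r in ranked]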

The obvious question to ask about this is whether it is effective. That is somewhat hard to answer given the data, traffic, and resources of a tiny startup, but I do have some evaluation numbers.

My initial analysis used a larger data set and attempted to determine how often personalized web search could narrow in on the result on which the searcher would click. In this test, 52% of the time, the item that would be clicked was in the candidate recommendation set.

This first test merely shows that there is potential to recommend the item that would be clicked. It does not show that the recommender would successfully favor that item over other candidates or that, in reordering the search results, it would not improperly push the to-be-clicked item down by promoting other, less relevant items.

The second, more complicated analysis tried to get to these questions. It fired off queries to Google, reordered the results using the recommendations, and then looked at the new ranking of the item which would be clicked.

The results from this were promising. 32% of the time, the clicked result was in the top 5 of the reordered search results, slightly higher than the norm for the unchanged Google search results. About 3% of the time, the reordering moved the clicked result to the top slot or one of the top 5 slots when it was not there already. Only 1% of the time did it move the clicked item from the top slot (not a good thing to do), and it never moved the clicked item out of the top 5 results.

Because major search engines have limits on being hammered with queries, I was only able to run this second analysis over a small randomized sample of 500 queries. A longer run would be desirable, as would being able to A/B test variants on the algorithm under heavy, live traffic. As a tiny startup, both are impossible, at least for now.
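For those curious what the second analysis looks like mechanically, here is a sketch of that kind of offline replay. The log format and the rerank function are stand-ins, not Findory's internals:

    # Offline replay of the second analysis. The log format and the
    # rerank function are stand-ins; rerank must return a permutation
    # of the result list it is given.
    def evaluate(log, fetch_results, rerank):
        in_top5 = promoted = demoted_from_top = 0
        for query, clicked_url, user in log:
            before = fetch_results(query)           # engine's ranking
            after = rerank(before, user)            # personalized ranking
            pos_before = before.index(clicked_url)
            pos_after = after.index(clicked_url)
            if pos_after < 5:
                in_top5 += 1
            if pos_after < 5 <= pos_before or pos_after == 0 < pos_before:
                promoted += 1       # moved into the top slot or top 5
            if pos_before == 0 < pos_after:
                demoted_from_top += 1               # the bad case
        n = float(len(log))
        return {"clicked result in top 5": in_top5 / n,
                "promoted into top slot or top 5": promoted / n,
                "demoted from top slot": demoted_from_top / n}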

If you want to try out Findory personalized web search, you can get to it using the search box at the top right corner of every Findory page. Select "Web" from the drop down, enter your query, and look for results with a personalized icon after the title. Click on a few search results if you like, then search again for the same or a related term.

Here are direct links to a few searches -- "video games", "amazon", "windows live" -- to get you started.

Please take it easy on this. It is pretty cool, but it is just an alpha stage product by a tiny little startup. I hope you enjoy trying it, and please let me know if you have any thoughts or comments.

Update: If you find yourself getting kicked out to a Google search result page instead of seeing the results on Findory, that is because Google failed to return search results to Findory. It is rare, only occurring when the Google API coughs up a smurf. When this happens, rather than display an error page, Findory asks Google to serve up the results directly.

Update: If you want to try making Findory the default web search in Firefox for a little while, there is a Firefox plug-in available.

Feed recommendations on My Yahoo

Don Loeb (ex-Yahoo, now at FeedBurner) reports that My Yahoo is now doing "feed recommendations based on a user's overall yahoo usage and RSS subscriptions."

I was not able to see the new module on my My Yahoo page, so either I missed it or this is rolling out as an A/B test.

Sounds cool though. The My Yahoo page should adapt and learn based on usage rather than requiring manual configuration. Less work is good.

See also my previous post, "Yahoo home page cries out for personalization".

[Don Loeb post via Matt McAlister]

Tuesday, August 15, 2006

Chris Sherman on social search

Chris Sherman at Search Engine Watch posts a nice critique of social search, the efforts to layer community-provided content or add social networking features to web search.

Some excerpts:
Social search ... incorporates both automated software as well as human judgments ... That's what makes social search intriguing -- and fundamentally flawed ...

No matter how many people get involved with bookmarking, tagging, [or] voting ... the scale and scope of the web means that most content will be unheralded by social search efforts ... People-mediated search will never be as comprehensive as algorithmic search.

Another problem arises with tagging ... Tags are not a panacea for categorizing and organizing the web ... Without [a] controlled vocabulary, tagging ultimately remains a chaotic, messy process.

Another factor is human laziness ... We've always had the ability to add tags and other metadata to our Microsoft Office documents, and yet how many people do this?

We also have a problem with idiots and spammers. Some people ... do a poor job of labeling content. Others will deliberately mislabel content ... In both cases, it's difficult ... to recognize this spuriously labeled content ... it's difficult to filter the noise from the signal.
On this last point, it is the mainstream audience, and the profits from targeting that broad audience, that attract spam. If sites using tagging and other forms of community-generated content enter the mainstream, they will be flooded with spam.

When that happens, the problem will be how to automatically filter all the tag and user-content spam, a problem that looks little different than automatically filtering web spam.

See also the article Chris posted a day later, "Who's Who in Social Search".

See also my previous posts, "Yahoo gets social with MyWeb" and "Different visions of the future of search".

Joel Spolsky on management methods

Joel Spolsky (of Joel on Software) posted a good series ([1] [2] [3]) on management styles.

The first style offered is "Command and Control". Joel describes it as, "Primarily, the idea is that people do what you tell them to do, and if they don't, you yell at them until they do."

The second style is "Econ 101", which "is the style used by people who know just enough economic theory to be dangerous" and "assumes that everyone is motivated by money, and that the best way to get people to do what you want them to do is to give them financial rewards and punishments to create incentives."

The third style given is the "Identity" method where "you have to make your employees identify with the goals of the organization, so that they are highly motivated, then you need to give them the information they need to steer in the right direction" yielding teams where "people have a sense of loyalty and commitment to their coworkers."

Both Command-and-Control and Econ 101 appear to be coercive, top-down methods. One uses fear, the other money, to bend employees to the will of management.

Joel emphasizes the morale effects of the three strategies in his essays, saying that coercive strategies are bad for morale, productivity, and retention.

I would add that a key point on the third, the Identity method, is that it is not just about convincing people that management knows what is best. Rather, decisions are emergent. People closest to the ground, closest to the problem, are making the decisions.

One of my common points of disagreement with MBAs at Stanford Business School -- and, unsurprisingly, we argued a lot -- was about these and similar management strategies. The majority of the Stanford MBAs favored one of the top-down, coercive approaches.

The core of their argument usually appealed to the importance of unity of action. To use a military metaphor, the idea is that it does not matter much which hill you charge up as long as you all charge up the same hill. Coordination yields overwhelming force against competitors.

I was in the minority arguing for collaborative, identity-based approaches. I argued that it is more important to pick the best hill to climb. Even if some on your team stumble up neighboring hills, that is okay.

At the core, the question is whether you prioritize exploration and optimization over coordination and control.

If you favor exploration, optimization, creativity, and innovation, you will prefer collaborative, bottom-up, identity-based management.

If you favor control, coordination, and overwhelming force, you will prefer command-and-control, top-down styles of management.

At this point in this discussion, which I have had many times, some argue that the best strategy depends on the industry. For example, some may claim that creativity is unnecessary on a factory floor.

The book "The Human Equation" (ironically out of Stanford Business School) convincingly refutes this belief. It provides several compelling examples of large productivity gains in factory settings such as auto assembly plants using less coercive management methods.

So, why do coercive methods persist? I think this is a point where individual incentives of managers come into play.

Collaborative management methods pay off in the long-term for the company, but are difficult and time-consuming in the short-term for individual managers. Managers and employees often are in a position for a short time, perhaps just a couple years, and may discount long-term gains.

As a result, it can be more beneficial for an individual manager to spend their time building an empire, which has direct and short-term benefit for the manager, and then move on, leaving their wreckage behind.

I have watched others do this before and, while it is not a strategy I have any admiration or respect for, I sadly have to admit it appeared to be quite good for the manager's career.

In the end, it appears that it is in the best long-term interests of a company to encourage collaborative management styles, but the short-term interests of individual managers may push them toward coercive styles. Particularly in larger organizations, where individual goals easily can depart from the needs of the company, this should be an area of concern.

Sunday, August 13, 2006

Slides from sponsored search talk

Jan Pedersen (Chief Scientist at Yahoo) posted the slides (PDF) from his SIGIR AIRWeb invited talk, "Sponsored Search: Theory and Practice".

It is an interesting overview of web search advertising, including history of sponsored search, results of game theoretic analysis of efficient web advertising auctions, and various forms of click fraud. Well worth a look.

By the way, if there are any other game theory geeks out there that like punishing themselves on a Sunday afternoon, I spent a little time trying to track down Jan's reference to "Edleman et al." on slide 21. My best guess is that it is a reference to Edelman et al., "Internet Advertising and the Generalized Second Price Auction: Selling Billions of Dollars Worth of Keywords" (PDF), a November 2005 paper out of Harvard, Stanford, and Berkeley.

This Edelman et al. paper makes the claim that the Generalized Second Price (GSP) auction used by Google and Yahoo "does not have an equilibrium in dominant strategies, and truth-telling is generally not an equilibrium strategy." The difference is in what advertisers pay for their advertising slots. As the authors explain, "GSP essentially charges bidder i the bid of bidder i + 1, VCG [Vickrey-Clarke-Groves] charges bidder i the externality that he imposes on others, i.e., the decrease in the value of clicks received by other bidders because of i's presence." As examples in the paper show, under some scenarios, GSP gives advertisers some opportunities to shave their bids instead of bidding at their true value.
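A toy example makes the difference concrete. Take two slots with 200 and 100 expected clicks and three bidders with true per-click values of $10, $4, and $2, everyone bidding truthfully (numbers invented, in the spirit of the paper's examples):

    # Two slots with 200 and 100 expected clicks; three bidders with
    # true per-click values of $10, $4, $2, all bidding truthfully.
    clicks = [200, 100]
    bids = [10, 4, 2]          # sorted high to low

    def gsp_payments(bids, clicks):
        """Slot i pays the next-highest bid for each of its clicks."""
        return [bids[i + 1] * clicks[i] for i in range(len(clicks))]

    def total_value(bids, clicks):
        """Value realized when bidders fill slots in bid order."""
        return sum(b * c for b, c in zip(bids, clicks))

    def vcg_payments(bids, clicks):
        """Each winner pays the loss its presence imposes on the others."""
        payments = []
        for i in range(len(clicks)):
            without_me = total_value(bids[:i] + bids[i + 1:], clicks)
            others_with_me = total_value(bids, clicks) - bids[i] * clicks[i]
            payments.append(without_me - others_with_me)
        return payments

    print(gsp_payments(bids, clicks))   # [800, 200]
    print(vcg_payments(bids, clicks))   # [600, 200]
    # Truthful bidder 1 pays $800 under GSP but only $600 under VCG,
    # which is why GSP bidders have an incentive to shade their bids.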

The authors go on to argue that VCG "would reduce incentives for strategizing and make life easier for advertisers" and "new entrants such as Ask Jeeves and Microsoft Network have a comparative advantage over the established players in implementing VCG." However, they temper that by admitting that "VCG is hard to explain to typical advertising buyers" and "the revenue consequences of switching to VCG are uncertain."

In the end, I was not able to determine if the theoretical inefficiencies of GSP described in the paper were substantial compared to implementation problems like various forms of click fraud. Interesting nonetheless.

Anyway, definitely check out Jan's slides and afterward, if you also want to beat yourself with a game theory pain stick, download the Edelman et al. paper.

Friday, August 11, 2006

Google as Napoleonic France

A cute article in The Economist compares Google's current strategic position to the war with Napoleonic France. Some excerpts:
Prince Klemens von Metternich, foreign minister of the Austrian Empire during the Napoleonic era ..., would have no trouble recognising Google.

To him, ... [Google] would closely resemble the Napoleonic France that in his youth humiliated Austria and Europe's other powers. Its rivals -- Yahoo ... eBay ... Microsoft -- would look a lot like Russia, Prussia, and Austria.

Metternich responded by forging an alliance among those three monarchies to create a "balance of power" against France. Google's enemies, he might say, ought now to do the same thing.

The alliances are already being struck. Last year Yahoo and Microsoft announced that they would connect their instant-messaging systems ... [and] "voice chat" ... In May, Yahoo! and eBay struck an alliance ...
All Google needs is an overextension into the hinterlands of Microsoft to make the analogy complete.

The Economist article focuses on the first tier players -- Yahoo, Microsoft, AOL, and eBay -- but I suspect the second tier might be in play too.

In particular, I think Amazon -- with its e-commerce, payment systems, clever APIs, associates advertising program, and fledgling A9 web search engine -- may also be attractive for those seeking strategic alliances in the search war.

See also my previous posts, "Amazon Omakase and personalized ads", "Yahoo and eBay, Amazon and Microsoft", and "Amazon web search switches to Microsoft".

Blog spam at AIRWeb

There was quite a group of experts on hand at the SIGIR AIRWeb workshop, researchers from Google, Yahoo, Microsoft, AOL, Technorati, Bloglines, Snap, and more.

I wanted to follow up on one particularly interesting tidbit to come out of the group discussions. The group was asked, "What is your biggest concern in weblog spam?"

By far, the biggest concern was the proliferation of spam weblogs with fake content that looks real.

These appear to take a few forms. One is to manually steal content from other places and post it to your blog without any reference to the original creator of the content.

Another technique is to post everything from an RSS feed or a combination of feeds to a blog, again without any credit to the original sources.

Another is to create a weblog that is just nonsense but not immediately obviously nonsense. This can be done by stitching together sentences from feeds or by being even more clever and making a good random text generator (like this example or this one).

Once the splog is created, it is typically used for ad revenue (using AdSense or whatever), link spam (to improve PageRank), or both.

It is a hard problem to solve. For the stolen content, crawlers need to do a good job of identifying the original, authoritative source of the content. For random content, crawlers may be able to recognize statistical outliers or grammatical errors in the content, but, in the worst case, may have to try to parse and minimally understand the content.
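For the stolen-content case, the standard starting point is near-duplicate detection. Here is a minimal sketch using w-shingling and Jaccard resemblance; production systems are far more elaborate and still have to decide which copy is the original:

    # Near-duplicate detection by w-shingling, a standard starting
    # point for the stolen-content case. Real systems are far more
    # elaborate and still must decide which copy is the original.
    def shingles(text, w):
        words = text.lower().split()
        return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}

    def resemblance(a, b, w=5):
        """Jaccard similarity of the two documents' shingle sets."""
        sa, sb = shingles(a, w), shingles(b, w)
        return len(sa & sb) / len(sa | sb) if sa and sb else 0.0

    original = "findory learns from the articles that you read"
    suspect = "did you know findory learns from the articles that you read"
    if resemblance(original, suspect, w=3) > 0.5:
        print("likely copied; credit the earlier crawl as the source")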

Spam spam spam spam. Lovely spam! Wonderful spam!

YouTube and ease of use

In an article, "Missing the point about YouTube", John Dvorak says:
Nobody -- and I mean nobody -- made it easy until YouTube ... It's brain dead simple.

The first thing you notice about YouTube is the lack of barriers to entry. You can sign up quickly and upload anything in any format right away.

The YouTube formula for success is simply ease-of-use and convenience.
[via TechDirt]

See also my previous post, "Yahoo Video, metasearch, and grandma".

List of Seattle internet startups

John Cook at the Seattle PI posts a good list of Seattle internet startups.

Wednesday, August 09, 2006

Web spam, AIRWeb, and SIGIR

I am giving a very short talk at AIRWeb (a workshop on web spam) at SIGIR.

I thought some readers of this weblog might be interested in web spam but unable to make it to this workshop. It might be fun to discuss the topic a bit here.

I wanted to use my short talk to bring up three topics for discussion:

First, I wanted to talk about the scope of weblog junk and spam, especially if "junk and spam" weblogs are loosely defined as "any weblog not of general interest."

The primary data points on which I will focus are that Technorati reported 19.6M weblogs in Oct 2005, but the dominant feed reader, Bloglines, reported that only 1.4M of those weblogs have any subscribers on Bloglines and a mere 37k have twenty or more subscribers.

This seems to suggest that the overwhelming majority of weblogs are not of general interest -- over 90% have no Bloglines subscribers at all, and over 99% have fewer than twenty. The quality of the long tail of weblogs may be much worse than previously described.

Second, I wanted to bring up the profit motive behind spam. Specifically, I will mention that scale attracts spam -- that the tipping point for attracting spam seems to be when the mainstream pours in -- and that this has implications for many community-driven sites that currently only have an early adopter audience.

Third, I wanted to discuss how "winner takes all" encourages spam. When spam succeeds in getting the top slot, everyone sees the spam. It is like winning the jackpot.

If different people saw different search results -- perhaps using personalization based on history to generate individualized relevance ranks -- this winner takes all effect should fade and the incentive to spam decline.

What do you think? I would enjoy getting a discussion on web spam going in the comments!

Update: The talk went well. Let me briefly summarize some of the comments I received at the workshop on what I said above.

On the Technorati vs. Bloglines numbers, a few people correctly pointed out that this could be seen as recall vs. precision and that, for some applications, it may be important to list every single weblog, even if that weblog has no readers. At least a couple others disputed whether weblogs without readers were important at all. One mentioned that readers can be faked, which might give spammers a way to attack this type of filter.

On the "winner takes all" and personalization as a potential solution, some seemed skeptical that there was enough variation in individual perceptions of relevance to make a big enough impact. Others seemed intrigued by the possibility of using user behavior and recommender systems to filter out spam.

I enjoyed talking about web spam in such a prestigious group! Great fun!

Update: See also the paper, "Adversarial Information Retrieval on the Web" (PDF), which gives a good summary of the discussions at the workshop.

More on privacy and anonymized data sets

On the timely issue of preserving privacy in anonymized data sets, Dan Frankowski (researcher at GroupLens, intern at Google, currently attending SIGIR) recently gave a talk at Google called "You Are What You Say: Privacy Risks of Public Mentions". I had a chance to watch it this morning.

The talk discusses some interesting examples of how anonymized data sets can be combined with other public data sets to reveal private information.

The most dramatic example given in the talk is how a former governor's medical records were revealed by combining information from an anonymized data set of medical records with publicly available voting records.

Unfortunately, the conclusion of the talk largely is that it is "hard to preserve privacy" against all forms of this kind of attack while preserving the usefulness of the data set.

Update: Dan gave a nearly identical talk at SIGIR today.

Friday, August 04, 2006

A chance to play with big data

A couple fun new data sets are being made available by the search giants.

First, in a humorously titled post, "All Our N-gram are Belong to You", folks at Google Research announced that they "processed 1,011,582,453,213 words of running text and are publishing the counts for all 1,146,580,664 five-word sequences that appear at least 40 times." Very cool.

The massive Googly data set is available on six DVDs -- probably about 30G of compressed data -- but not as a download.
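The computation itself is the easy part; the scale is what required MapReduce over a trillion words. In miniature:

    # The same computation in miniature: count n-word sequences and
    # keep those above a threshold. The scale, not the idea, is what
    # required MapReduce.
    from collections import Counter

    def ngram_counts(tokens, n=5, min_count=40):
        grams = (tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
        counts = Counter(grams)
        return {g: c for g, c in counts.items() if c >= min_count}

    tokens = "to be or not to be".split()
    print(ngram_counts(tokens, n=2, min_count=2))   # {('to', 'be'): 2}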

Second, the new AOL Research site has posted a list of APIs and data collections from AOL.

Of most interest to me is a data set of "500k User Queries Sampled Over 3 Months" that apparently includes {UserID, Query, QueryTime, ClickedRank, DestinationDomainUrl} for each of 20M queries. Drool, drool!
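Here is a quick sketch of slurping that log, assuming one tab-separated record per line in the field order above; the exact file format and filename are my guesses:

    # Assumes one tab-separated record per line in the field order
    # above; the exact file format and filename are my guesses.
    import csv
    from collections import Counter, namedtuple

    Record = namedtuple("Record",
                        "user_id query query_time clicked_rank destination_url")

    def read_log(path):
        with open(path, newline="") as f:
            for row in csv.reader(f, delimiter="\t"):
                if len(row) == 5:
                    yield Record(*row)

    # Example analysis: queries per user.
    queries_per_user = Counter(rec.user_id
                               for rec in read_log("user-queries.txt"))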

You know, just the other day, I was watching a Google Tech Talk where a researcher was lamenting the difficulty of getting access to big data. It is exciting to see two of the giants, Google and AOL, making this kind of data available.

Update: Sadly, AOL has now taken the 500k data set offline. This is a loss to the academic research community, which otherwise has no access to this kind of data.

The move seems to be a response to a bunch of inflammatory blog posts ([1] [2] [3] [4] [5]) that make outlandish claims like:
AOL has released very private data about its users without their permission ... The ability to analyze all searches by a single user will often lead people to easily determine who the user is, and what they are up to. The data includes personal names, addresses, social security numbers and everything else someone might type into a search box.

So much for the privacy of AOL's users ... This is identity theft just waiting to happen, that's what this is.

I expect it's a matter of time before a major national newspaper prints an interview with somebody identified and embarrassed in this manner.
Never mind that no one actually has come up with an example where someone could be identified. Just the theoretical possibility is enough to create a privacy firestorm in some people's minds.

I am as concerned about privacy as any tech geek, but most of my concern is focused on things like millions of credit cards being leaked and millions of social security numbers being lost.

If someone comes up with a clear example of a privacy violation from this AOL data, I would be convinced. Until then, this looks to me like the mob of the blogosphere getting distracted in the shadows and missing the big privacy picture.

Unfortunately, the research community now will be denied a tool that could have helped push forward the state of information retrieval. Research that could have been accelerated will now be stalled. We all will suffer from the loss.

Update: I have brief quotes in articles in the New York Times and the Red Herring on this topic.

Update: This morning, in a front page article, New York Times reporters track down a specific AOL user using the released data set and ask her about her AOL searches.

Update: It now appears the entire AOL Research site has been taken offline, including access to their publications, other data sets, and APIs. That is disappointing.

Update: The NYT reports that AOL's CTO, a senior researcher, and a senior manager have been dismissed. It appears AOL Research is being shut down.

Some may cheer AOL getting a firm spanking over this privacy issue, but I think the long-term costs are grave. I suspect this pretty much eliminates any future access for the academic research community to large scale data sets. After this, the only work on big data will be at the search giants.

Hindering academic research will slow progress on building the next generation of search. It is hard to see the cost of difficulty finding the information you need -- the productivity loss of a few minutes a day over millions of people is difficult to measure -- but it is a cost we will all be paying.

Update: Annalee Newitz at Wired Magazine adds some perspective with a list of severe privacy violations from the past and present. [via Bruce Schneier]

Update: A couple days later, Katie Hafner at the NYT writes that "Researchers Yearn to Use AOL Logs, but They Hesitate".

Update: Seven weeks later, the NYT also adds perspective by writing about recent examples of severe privacy violations including "2.6 million current and former Circuit City credit card account holders", "names, Social Security numbers, and dates of birth on roughly 28 million veterans", "names, addresses and credit and debit card numbers of some 243,000 customers of Hotels.com", "names and Social Security numbers — and in some instances medical histories — of some 51,000 current and former patients of PSA HealthCare", ChoicePoint's leak of 145,000 accounts, and CardSystems' leak of 40M credit card numbers. Read more at the "Chronology of Data Breaches".

Update: Seven months later, Henry Blodget writes:
ISPs happily sell clickstream data ... It's a big business.

They don't sell your name--just your clicks -- but the clicks are tied to you as a specific user (User 1, User 2, etc.).

How much are your clicks worth? According to [Compete CTO] David [Cancel], about 40 cents a month per user (per customer) ... and he estimates that there are 10-12 big buyers of this data. In other words, your ISP is probably making about $5 a month ($60 a year) off your clickstreams.

Someone [in the audience] points out that this is just as bad as the AOL search thing. "It's much more!" David says -- his excited eyes indicating that he's a happy customer. Someone else observes that the benefits/drawbacks of this are in the eye of the beholder: for the ISPs it's awesome.

Going to SIGIR next week

I will be at SIGIR next week here in Seattle. It should be a good time for us information retrieval geeks.

If you read this weblog and are attending the conference, please say hello if you get a chance. I should be there for most of the conference, including Sunday night and the AIRWeb workshop on Thursday.

Looking forward to it!

Google and tiny acquisitions

An interesting quote from Google CEO Eric Schmidt:
Reporter: "How many acquisitions do you do?"

Schmidt: "It's one or two a week it seems. Most acquisitions: They are very small. 1-2-3 people and you never, never hear about them."
I was just talking with someone today about how many large companies seem dysfunctional with their acquisition strategies.

I recall seeing some business research that showed that the majority of large acquisitions result in a net loss in value for the merged companies and most of the remainder fail to create measurable value. Only a tiny minority turn out to be a long-term win for shareholders.

Very small acquisitions, by contrast, are easily integrated and typically create value. They also tend to be relatively inexpensive -- few investors to buy off -- which further reduces risk.

Nevertheless, managers tend toward the big mergers, apparently because managers' personal incentives often tend to encourage empire building over maximizing long-term value.

Google's acquisition strategy sounds remarkably sane compared to most.

[via Haochi in the Google Blogoscoped forums]

Update: A month later, Alan Sipress at the Washington Post wrote an article, "Google Goes to Market", about Google's strategy of tiny acquisitions. [via Barry Schwartz]

Thursday, August 03, 2006

Starting Findory: Talking to the press

I wanted to do a piece on talking to the press. I am a tech geek, not a marketing dweeb, so talking to the press is not my strong point. But, having been forced to do it, I think I have learned a few things.

The biggest thing I have learned is to focus on what reporters want. It seems to me that reporters are looking to write a solid, interesting story, supported by good sourcing and strong quotes. A typical story seems to consist of a few paragraphs of explanation followed by some quotes from someone that express an opinion on the reporter's explanation.

My strategy is to only try to make a few key points. I try to say a few things that describe my impression of the overall situation, intending that piece to help with the explanation that is in the reporter's own words.

Then, I try to give just two or three pieces of opinion, my view on the situation. At this point, I try to talk slowly and carefully, keeping in mind that the reporter probably is trying to transcribe an exact quote. Sometimes, I repeat the quote or something fairly similar to the quote.

One other thing that I have found important is that reporters do not always ask the questions you think they should. That is okay. You can usually take a question that you feel is poorly chosen or misdirected and simply answer a variant on it, the question you thought should have been asked. That variant is just as likely to make it into a quote as a more direct answer and may be more appropriate and more useful to the reporter.

Finally, I have found that most press coverage of tiny, little startups like Findory will be positive. After all, if they were going to write something negative, they would not bother writing about Findory at all. Yes, being ignored is an issue, but it is good to know that, when you are talking to the press, if you get any coverage, you are likely to get fairly positive coverage.

Maybe it comes from working on a news site, but I tend to have a lot of sympathy for the press. I think reporters are struggling with a difficult job and try to help them as much as I can. However, helping reporters is not a selfless act. The more helpful I am, the more likely it is that I get in the story, and any mention of Findory may benefit the company.

More on Google and VoIP using WiFi

Back in August 2005, I posted a speculative piece about how Google Talk and free city-wide internet wireless access (like Google WiFi) could be combined to allow free phone calls:
What if I had a phone that works over WiFi? ... What if there was city-wide WiFi coverage? Or WiFi coverage equivalent to cell phone networks (covering cities and major highways)? [Then] my WiFi phone would work everywhere.

Google just launched Google Talk, a VoIP application. Google is rumored to be thinking about a nationwide free WiFi network. Combine these two, add a WiFi phone to the mix, and am I about to get free mobile calling nationwide?
Since then, there have been some developments around this idea.

First, there are new WiFi phones appearing on the market. These phones sound easy to use and come preloaded with the necessary VoIP software.

Second, the NYT did a report last month on VoIP using WiFi and the potential threat to cellular networks.

Third, Katie Fehrenbacher at GigaOM tested using the Google WiFi network in Mountain View for making calls.

It is not clear whether VoIP using WiFi will be built out by Google, eBay/Skype, or a startup like Fon, but it is looking very likely to get built by someone.

Wednesday, August 02, 2006

Digg and community participation

Interesting stats (from a not-yet-released DuggTrends article) posted by Richard MacManus on Digg:
Top 10 users contributed 1792 of the frontpage stories - i.e 29.8%

Top 100 contributed 3324 stories i.e 55.28%

There are 444,809 registered users, out of these only 2287 contributed one or more story for the period of 6/19/2006 9:31:28 PM to 7/30/2006 4:41:34 PM
It appears Digg is run by a very small community, just a couple hundred users. In this sense, it may be similar to the much older Kuro5hin.

Update: The DuggTrends article is up now.

Tuesday, August 01, 2006

Amazon Omakase and personalized ads

Amazon appears to have launched the first stage of a Google AdSense competitor. It is called Amazon Omakase and is based on Amazon Associates.

From the Omakase Links FAQ:
With Omakase Links, Associates can now automatically display the products and content that visitors to the page are most likely to buy.

Adding Omakase Links to your pages is easy. Use the Build Links tool to select the appearance and behavior of the ad, and then simply cut-and-paste the code into your template or Web page.

Your page will now display Omakase Links and after a short learning period, the ads will be optimized based on what the Associate has been successful with in the past; what that user has been interested in; and what the site is about.

Because Omakase Links optimize on more than just the page itself, Associates may see a range of different products in their links but they will also see that the links learn what their visitors want.

In fact, because Omakase Links aim to show the right product to the right person, each person visiting their site may see different products.
Unlike Google AdSense, Amazon Omakase appears to be personalized. Not only are the ads picked based on the content of the site, but also based on each individual visitor's interests; different visitors may see different ads depending on their interests and needs.
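Amazon has not said how Omakase works, but "optimizing on more than just the page itself" presumably means blending signals something like this entirely hypothetical sketch:

    # Entirely hypothetical; Amazon has not published the algorithm.
    # Blend a page-content match with the visitor's own Amazon history.
    def pick_ads(products, page_keywords, visitor_history, k=5):
        def score(p):
            content = len(page_keywords & p["keywords"])   # page signal
            personal = 2 if p["category"] in visitor_history else 0
            return content + personal          # history can outweigh the page
        return sorted(products, key=score, reverse=True)[:k]

    products = [
        {"title": "Python book", "category": "books", "keywords": {"code"}},
        {"title": "Digital SLR", "category": "cameras", "keywords": {"photo"}},
    ]
    # On a programming blog, a visitor who has been browsing cameras
    # still sees the camera ad first, much like the Update below describes.
    ads = pick_ads(products, page_keywords={"code", "software"},
                   visitor_history={"cameras"})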

See also my Feb 2006 post, "Amazon version of AdSense?", where I said "Amazon.com, with their expertise in personalization, is well positioned" to create "a future of personalized advertising" where ads are "helpful and relevant".

[Amazon Omakase found via E-media Tidbits]

Update: The first three links in this post (to Amazon Omakase, Amazon Associates, and the Omakase FAQ) require login. Not sure if a normal Amazon account works, but an Amazon Associates account does.

Update: I just saw an example of Omakase in action on a weblog. The Amazon ads were for products strongly related to my recent browsing and purchase history at Amazon, but unrelated to the content of the blog. All five of the ads struck me as relevant and I clicked through on two of the five to find out more about the products. Very interesting.

Starting Findory: Startups are hard

There is no doubt about it. Startups are hard. You have to do everything -- and I mean everything -- yourself.

You have to do all the technical work, including prototyping, research, coding, web development, design, data analysis, metrics, system administration, networking, hardware procurement, and database administration.

There is marketing, including advertising, press relations, viral campaigns, framing the product, and pitching to influencers.

There is business development, including handling requests for technology licensing, talking about cross marketing deals, exploring feed licensing, and investigating partnerships.

There is management, including project management, building a culture, choosing founders and advisors, hiring, mentoring, setting expectations, and building teams.

There is legal, including creation of and modifications to the corporate structure, non-disclosure agreements, licensing agreements, employee incentive packages, patents, intellectual property assignments, and frameworks for partnerships.

There is finance, including accounting, taxes, licensing, managing cash flow, business planning, and pitching potential investors.

There is customer service, including building help pages, creating customer self-service features, and handling incoming contacts with suggestions, complaints, requests for information, and requests for help.

All this for incredible risk. Often, it is your own money being poured into the company. Salaries typically are limited or nonexistent, and there are no benefits.

That incredible risk comes with many rewards. Though unlikely, there is a large potential payoff dangling out at the horizon. Startups are building something new, the exciting tip of the cycle of creative destruction. And, running a startup, you have much freedom in what you decide to pursue.

However, that freedom can be a curse. There are so many degrees of freedom, it can be overwhelming.

What should the company do? Even more important, what should it not do? Does it have enough money to pursue X? Can a competitor better implement feature Y? What really matters to the company?

There are so many options, but so little time. You must keep moving, keep making decisions, but always be willing to stop or reverse if something looks wrong.

You learn a lot, but it is incredibly difficult. It is a remarkable challenge, an incredible experience.

eBay, scammers, and self-governance

A couple interesting articles on eBay this morning.

The first is from Mike at TechDirt who talks about scammers using web robots to cheaply build a high reputation at eBay:
They use bots to scan eBay and buy $0.01 "buy it now" items ... Any new scamming user can build up a nice looking feedback page with tons of successful deals -- all at just a penny a shot.

The bots can create tons of new users as well, all of which are quickly building up good eBay reputations.

Then, they can waltz in with the real scam and drop the account, and move right on to the next "primed" account their bot has set up for them.
The second is from Nick Carr who quotes from a new book called "Who Controls the Internet?":
When Goldsmith and Wu look beneath eBay's "self-governing facade," they find "a far different story -- a story of heavy reliance on the iron fist of coercive governmental power."

eBay maintains a large and aggressive internal security force -- numbering almost a thousand -- and this force works in close harmony with national law-enforcement agencies to police the eBay community.

"Perpetually threatened by cheaters and fraudsters, eBay established an elaborate hand-in-glove relationship with the police and other governmental officials who can arrest, prosecute, incapacitate, and effectively deter these threats to its business model ... Without this powerful hidden-hand help of governments in the places where it does business, eBay's thriving 'self-governing' community could not survive."