Saturday, January 31, 2009

How Google crawls the deep web

A googol of Googlers published a paper at VLDB 2008, "Google's Deep-Web Crawl" (PDF), that describes how Google pokes and prods at web forms to see if it can find values to submit that yield interesting data from the underlying database.

An excerpt from the paper:
This paper describes a system for surfacing Deep-Web content; i.e., pre-computing submissions for each HTML form and adding the resulting HTML pages into a search engine index.

Our objective is to select queries for millions of diverse forms such that we are able to achieve good (but perhaps incomplete) coverage through a small number of submissions per site and the surfaced pages are good candidates for selection into a search engine's index.

We adopt an iterative probing approach to identify the candidate keywords for a [generic] text box. At a high level, we assign an initial seed set of words as values for the text box ... [and then] extract additional keywords from the resulting documents ... We repeat the process until we are unable to extract further keywords or have reached an alternate stopping condition.

A typed text box will produce reasonable result pages only with type-appropriate values. We use ... [sampling of] known values for popular types ... e.g. zip codes ... state abbreviations ... city ... date ... [and] price.
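
To make the probing loop concrete, here is a rough sketch of the idea in Python. To be clear, this is my own toy rendering, not Google's system; the submit_form and extract_keywords callbacks, the seed list, and the stopping thresholds are all assumptions.

def iterative_probing(submit_form, extract_keywords, seed_words,
                      max_keywords=500, max_rounds=10):
    """Grow a set of candidate keywords for a generic text box.

    submit_form(word) fetches the result page for one form submission and
    extract_keywords(page) pulls candidate words out of that page; both are
    assumed helpers supplied by the caller, not anything from the paper.
    """
    candidates = set(seed_words)
    tried = set()
    for _ in range(max_rounds):
        new_words = set()
        for word in candidates - tried:
            tried.add(word)
            page = submit_form(word)
            new_words.update(extract_keywords(page))
        new_words -= candidates
        # Stop when probing yields nothing new or we have enough candidates.
        if not new_words or len(candidates) >= max_keywords:
            break
        candidates.update(new_words)
    return candidates
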
Table 5 in the paper shows the effectiveness of the technique: they are able to retrieve a significant fraction of the records in small and normally hidden databases across the Web with 500 or fewer submissions to the form. The authors also say that "the impact on our search traffic is a significant validation of the value of Deep-Web content."

Please see also my April 2008 post, "GoogleBot starts on the deep web".

Friday, January 23, 2009

My father, Ted Linden, 1938 - 2009

My father, Theodore Linden, died on Tuesday, January 20, 2009, a few months after being diagnosed with brain cancer.

My father valued reason and rationality above all else, which made particularly cruel the cancer that pulled his mind from him in pieces. During his lifetime, that mind had remarkable influence, including seminal work in computer security and object-oriented programming, helping to build the earliest graphical operating systems at Xerox, and designing early AIs for robotic vehicles.

The eyes of a child are not a mirror for the life of a father, but my earliest memories of my father's work are of the computer terminal he brought home with him from Xerox PARC. We would dial in to the mainframe using a funky acoustic modem, the kind where you would pick up a phone handset and plop it down into little soft cups so the computer could talk with little beeps into the phone. The reason I loved that terminal as a young child was that Dad would occasionally let me use it to play Haunt, an early, clever, and amusing game set in a haunted house that, to this day, remains my favorite text adventure of all time.

Later, he took me into Xerox and let me try a Xerox Star. I still remember the enormous removable hard drives, shaped like the saucer section of the Star Trek Enterprise, nearly a meter across. A child being a child, the main thing that sticks in my mind, again, is the games, such as Maze War, Draw, Trek, and Pinball. Those early experiences made me want to write computer games myself and eventually pushed me into computer science. History may not repeat, but it echoes, and I look on curiously as I see my own son starting down this same path.

The window through which a child sees is small. As an adult, I can more fully see the impact my father had. The Xerox Star and Alto were the foundation for all modern operating systems. The desktop metaphor, the mouse, all that we see on our computers in our daily lives today was built on the shoulders of research at Xerox. My father's work at Xerox (ACM) was a part of that on which others now stand.

After Xerox, in the early 1980's, he worked on the DARPA Autonomous Land Vehicle project (IEEE), an early attempt to build a car that drove itself. His research included replanning when the robot encountered unexpected conditions ([1]), work that much later would influence the planning systems used in the Mars Rover (IEEE). The ALV was too ambitious a project to succeed with the technology available in the 1980's, but, in 2005, he looked on with pride and joy at the remarkable success of the nearby Stanford team when their car drove itself 150 miles down twisty desert roads in the DARPA Grand Challenge.

Even as an adult, I did not know about all that my father did until recently. He lived his life with a quiet competence -- things just got done -- and never bragged or boasted. I was surprised, for example, to discover that he did seminal work in computer security (ACM) and object-oriented programming (ACM) way back in the 1970's. All of this from a man who was the son of a millwright, only in America.

My father spent much of his life in the study of intelligent systems. His children follow in his footsteps. My sister has devoted her life's work to neurobiology and the study of the mind at University College London. I work in artificial intelligence and machine learning. As we look back at my father and a life well lived, we can only hope we might have a fraction of the positive impact he did. Ted Linden will be remembered and sorely missed.

If you knew my father or his work, please comment on this post. I would much enjoy hearing from you.

Update: Thank you, all of you, who posted comments on this post. I appreciate your thoughts and stories about my father.

Update: Two months later, on March 21, 2009, we held my father's memorial. It was our celebration and remembrance of his life. Below is my speech on that day:
In his retirement, in the wee hours of the morning, it was not unusual to find my father out of bed, downstairs, playing Civilization in his office. Basking in the gentle glow of his computer screen, he would relive the never-ending struggles of men and nations against each other, the competition for wealth, power, and influence we see throughout history.

Earlier, when I was at college, we both started looking at game theory. As part of my work, I built little simulations of people interacting that showed that, even when people have short-term incentives to hurt each other, they often cooperate. My Dad saw this and started devouring texts on the topic, looking at game theory, evolutionary economics, the incentives created by societies, and the emergence of cooperation. In all of this, he was seeking to understand how normally competing actors suddenly could start working together.

Throughout his life, my father was fascinated with the goodness of man. Why are people good to each other? Is it somehow innate? Can it be derived from reason? Is it part of the rules that successful societies build? How is it that altruism can exist?

Perhaps it was born of his Jesuit upbringing, looking to reason to explain the nature of man. It lived on in his career in computers and early artificial intelligence. He built robotic cars that attempted to learn about and navigate through the world. He studied how little programs running on different computers, without really trusting the other programs to do the right thing, could share knowledge to protect each other from outside threats. In his career, much of his work involved trying to build robots and agents that understand the world and cooperate with each other.

In all of it, he was motivated by the same questions: Why are people good? Why do people cooperate? In this world, how can altruism exist?

Like so many before him, my father gained only partial answers to these questions. He saw rational people as able to see the benefits of cooperation in the long term. He noted that the societies we build reward cooperation and penalize cheating. He read of the benefits of altruism that accrue not only to ourselves, but to our families and societies. And he wondered if this was enough to explain the kindness we see around us.

Looking back as a son, I am inspired by my father's pursuit of such great questions. A life inquiring about the very essence of human nature, that is a life well lived.

Monday, January 19, 2009

Preferred sites feature in Google search

Alex Chitu describes an experimental feature currently in A/B testing on Google where searchers can add "a list of sites you want to appear more often when you search."

More information on the Google help page for the new feature.

Seems like an interesting blend of SearchWiki (which lets people manually reorder Google search results) and Google Personalized Search (which automatically reorders search results based on your past search and visit behavior).

Friday, January 16, 2009

2008 prediction and looking to 2009

In January of 2008, I laid out my prediction for what would happen in the rest of the year. Unfortunately, it turned out to be largely correct.
We will see a dot-com crash in 2008 ... The crash will be driven by a recession and prolonged slow growth in the US. Global investment capital will flee to quality.

Most startups will find their revenue models were unrealistic and will rapidly have to seek change. Many will jump over to advertising, but the advertising market will have constricted.

The big players will not be immune from this contagion. Google, in particular, will find its one-trick pony lame, with the advertising market suddenly stagnant or contracting and substantial new competition ... Google and Yahoo will do ... substantial layoffs.
Parts of the prediction did not come true in 2008, but may in 2009.
It will be more prolonged and deeper than the crash of 2000.

Desperate competition with dwindling opportunity will drive profits in online advertising to near zero. Google ... will find their available cash dropping.

It is likely that those with little to lose will attempt scary new forms of advertising. The Web will become polluted with spyware, intrusiveness, and horrible annoyances. None of this will work, of course, and there will be lawsuits and new privacy legislation, but we will have to endure it while it lasts.
Underlying the prediction of a dot-com crash was a prediction about the broader economy. While the direction was correct, I have to admit, I got parts of that prediction wrong. In particular, I did not expect huge multinational banks to be so rash as to make gambles that could drive them to insolvency, nor did I expect the normally more considered Europeans to follow the foolhardy practices of us Americans. I did expect hedge fund failures and a housing price fall, but I was as amazed as everyone else that the ratings on bonds would become completely unreliable and that one particularly prestigious fund was actually just a Ponzi scheme. Where I was wrong, it appears my mistake was assuming Wall Street still had some lines it would not cross.

I do want to say that I am in no way happy that my 2008 prediction turned out to be largely true. It is awful, truly awful.

It also is sad to look back at the irrationality and willful blindness of early 2008. A glance over the comments on my January 2008 post gives a feel for how people reacted to anyone who tried to send up the red flags. It is not true that no one saw this recession coming -- economists like Nouriel Roubini, Paul Krugman, Robert Shiller, and Dean Baker were quite vocal -- but they were marginalized and maligned.

I hope that the early warning I tried to give had at least some positive impact. It seemed that some startups were looking to secure funding rounds earlier than they might have otherwise and to build up cash, but, as much as I might hope it was helpful, I have no evidence that what I posted influenced any startups to be more cautious.

As for predictions for 2009, given the vitriol my 2008 post attracted, I am going to leave this game to others. If you really do want to see things fairly close to what I think will happen in 2009, please see [1], [2], [3], [4], [5], and [6].

Tuesday, January 06, 2009

Blending the link and query-click graphs

A fun paper out of Yahoo Research, "Dr. Searcher and Mr. Browser: A Unified Hyperlink-Click Graph" (PDF), looks at the value of combining two graphs that search engines typically use as part of static and dynamic ranking, the query-click bipartite graph (which shows what pages people click on immediately after searching) and the link graph (which shows hyperlinks between pages on the Web).

The query-click graph is a bipartite graph with queries on one side and clicked documents on the other. A query (e.g. [greg linden]) is linked to a document (e.g. http://glinden.blogspot.com) if people who make the query click on that document. The links are usually weighted by the probability that someone clicks on the document given that search query. Search engines get the query-click graph by parsing their query logs. Random walks of the query-click graph have been a popular research topic for finding similar queries and similar documents.
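
For illustration, the weighted bipartite graph can be estimated directly from click counts in the query log. This is just my own toy sketch of the data structure, not anything from the paper:

from collections import defaultdict

def build_query_click_graph(click_log):
    """Build {query: {url: P(click on url | query)}} from (query, clicked_url) pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for query, url in click_log:
        counts[query][url] += 1
    graph = {}
    for query, urls in counts.items():
        total = sum(urls.values())
        graph[query] = {url: clicks / total for url, clicks in urls.items()}
    return graph

# e.g. build_query_click_graph([("greg linden", "http://glinden.blogspot.com")])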

The hyperlink graph is a graph where web pages are nodes and a link from one page to another is a directed edge. Search engines get a link graph by crawling the Web and parsing the HTML of all the pages. Random walks of the link graph are used to find related documents and, in algorithms such as PageRank, to compute the authority of web pages.
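
As a reminder of how those random walks are used, here is a simplified PageRank-style power iteration. Again, this is a toy sketch of the textbook algorithm, not anything resembling a production ranking system:

def pagerank(links, damping=0.85, iterations=50):
    """links: {page: [pages it links to]}.  Returns an authority score per page."""
    pages = set(links) | {p for outs in links.values() for p in outs}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            outs = links.get(page, [])
            if outs:
                share = damping * rank[page] / len(outs)
                for out in outs:
                    new_rank[out] += share
            else:
                # A page with no outlinks spreads its rank uniformly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank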

The authors of this Yahoo paper had the idea of combining the two graphs and doing random walks of the combined graph. In particular, they model searchers starting with a query, then clicking, then randomly walking the hyperlink graph, until they eventually search again. The goal is to more accurately model searcher behavior and more accurately find the pages people actually want when they search.
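
Here is a toy Monte Carlo version of that combined walk, as I understand the model. The structure and the probability of issuing a new search are my own guesses for illustration, not the algorithm from the paper:

import random

def unified_walk(query_click, links, steps=10000, search_prob=0.15):
    """Simulate a searcher on the combined graph.

    query_click: {query: {url: click probability}}, links: {url: [outlinked urls]}.
    With probability search_prob (or when the walker is stuck), the walker issues
    a new query and clicks a result; otherwise it follows a random hyperlink.
    Returns how often each page was visited.
    """
    queries = list(query_click)
    visits = {}
    page = None
    for _ in range(steps):
        stuck = page is None or not links.get(page)
        if stuck or random.random() < search_prob:
            query = random.choice(queries)
            urls = list(query_click[query])
            weights = [query_click[query][u] for u in urls]
            page = random.choices(urls, weights=weights)[0]
        else:
            page = random.choice(links[page])
        visits[page] = visits.get(page, 0) + 1
    return visits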

I really like the idea and think it should be useful, but I cannot help but notice that the combined hyperlink and query-click graphs are trying to approximate how people actually move across the Web. If a search engine had good data on how people actually moved across the Web, it might not need this approximation.

For more on using data on how people actually move across the Web, please see my earlier posts, "Google Toolbar and the actual surfer model" and "Search trails and relevance".