Geeking with Greg

Friday, October 13, 2006

Talk on Google AdWords and AdSense

Shiva Shivakumar (Director, Google Kirkland) who "led AdSense through beta, launch and hypergrowth" gave a talk at University of Washington Computer Science called "Google Ad Systems" (video available).

The talk is a light introduction to AdWords and AdSense. It mostly covers the history of the products with some brief discussions of the challenges around relevance, scale, optimal auction pricing, and click fraud.

The talk is worthwhile, a good overview of the work involved in building systems like AdSense and AdWords.

I tend to follow this stuff fairly closely, so the only thing that was new to me in the talk was hearing that AdWords is still running on top of a massive MySQL deployment. AdSense, in contrast, is running on top of the Google data infrastructure, GFS and Bigtable. Shiva was not clear about whether the difference was merely due to legacy issues or whether there is something special about the AdWords data access patterns that makes MySQL preferable.

Shiva's talk unfortunately only touches briefly on auction theory and click fraud issues. If you are interested in more details there, you might dive into Jan Pedersen's talk and a related paper I discussed in an earlier post.

Wednesday, October 11, 2006

Yahoo's troubles

Saul Hansell at the NYT tells us that "At Yahoo, All Is Not Well". Some selected excerpts:

Yahoo would seem to have a strong hand. It is the world's most popular Web site, with more than 400 million monthly users ... But in recent months the company has suffered some embarrassing setbacks.

From video programming to social networking -- areas of interest to users and advertisers alike -- the company is losing its initiative. And each time a product fails in the market or is late, Yahoo loses some ability to do more deals and hire more talented employees.

Yahoo has been stymied because its text advertising business has been largely frozen until it completes a new software system. The upgrade is more than a year late ... Yahoo's [old] system produces much less money from every page than Google ... Google has $11 billion in cash and a market value of $131 billion, while Yahoo has $4 billion in cash and is worth $34 billion.

Current and former Yahoo employees say the company has been bogged down by bureaucracy and internal squabbling .... Companies that try to do deals with Yahoo also say they find it to be slow, demanding and inconsistent in negotiations.

Yahoo's faltering image and plunging stock price may also be hurting its ability to recruit talented people ... Yahoo's existing employees are grumbling that with the stock price so low, many of their options have become worthless. Some Yahoo veterans have bolted for trendier start-ups.

Of all the problems at Yahoo, I think the lengthy delays in competing against Google AdWords and AdSense are the worst.

The business is advertising. It is not tagging, sharing, chatting, or socializing. Ads drive everything. To fail to compete on advertising is to fail.

For more on what Microsoft and Yahoo should do to compete with Google, see my previous post, "Kill Google, Vol. 3".

For more on the failure of Microsoft and Yahoo to compete with Google, see my previous posts, "Yahoo and MSN cannot compete?" and "Yahoo gives up?"

Monday, October 09, 2006

Progress on Netflix recommendations contest

Seven days after the start of the Netflix contest, the first entry that beats the performance of Netflix's recommender engine, Cinematch, has appeared on the leaderboard.

See also my earlier post, "Netflix offers $1M prize for improved recs".

Update: John Chandler-Pepelnjak notes in the comments to this post that a new entry for a team called "The Thought Gang" just qualified for the "Progress Prize" of $50k. Excellent.

Update: Two weeks after the start of the Netflix contest, there are now six entries that beat Netflix Cinematch. The top entry, "NIPS Reject", has a nearly 2% improvement. Impressive.

Friday, October 06, 2006

YouTube is not Googly

Kevin Delaney at the WSJ says that "Google Inc. is in talks to acquire popular video-sharing site YouTube Inc. for roughly $1.6 billion."

This is a horrible idea.

YouTube is a collection of uploaded content. They have no interesting technology. All Google would be buying is YouTube's existing content and user base.

Google has never been about owning content and users before. Google makes it easier to find other people's content. Google's core strength is in helping people find and discover information, not in controlling information and people.

This merger would be classic diworsification. If it happens, GooTube will be exactly what Microsoft and Yahoo have been waiting for, a lovely little distraction for Google.

See also my previous post, "The problem with YouTube".

Update: On this note, Danny Sullivan points out an LA Times article that said, "Google admitted this year that its internal audits discovered that the company had been spending too much time on new services to the detriment of its core search engine."

Update: Mark Cuban calls Google buying YouTube "crazy" and "moronic" in his post, "Some thoughts on YouTube and Google". [via Don Dodge]

Update: It's official, Google bought YouTube. All hail GooTube.

Update: Om Malik says Google is a loser in this because he thinks "this is Compaq-DEC, Skype-eBay kind of a deal for them in the long run." John Battelle notes that "this marks Google's first significant out of brand acquisition, the company's first true brand-management challenge."

Update: When we look back at this merger in a few years, I suspect it will be seen as when Google jumped the shark. It is the day Google loses its focus on technology and begins a stumbling effort at trying to become a media company.

Update: Om Malik adds, "It is the distraction factor ... The copyright issues and all those other problems are going to strain google where it is weakest - management and control."

Update: Microsoft CEO Steve Ballmer says, "Right now, there's no business model for YouTube that would justify $1.6 billion. And what about the rights holders? At the end of the day, a lot of the content that's up there is owned by somebody else. The truth is what Google is doing now is transferring the wealth out of the hands of rights holders into Google." [via Todd Bishop]

Wednesday, October 04, 2006

The advantages of big data and big clusters

Ionut Alex Chitu points to a UC Berkeley talk, "Theorizing from data: Avoiding the capital mistake" by Googler and AI guru Peter Norvig.

I particularly enjoyed Peter's thoughts on the advantages of big data and big clusters. Near the beginning of the talk, Peter said:

Rather than argue about whether this algorithm is better than that algorithm, all you have to do is get ten times more training data. And now all of a sudden, the worst algorithm ... is performing better than the best algorithm on less training data.

Worry about the data first before you worry about the algorithm.

Later, near the end of the talk, Peter extended this point:

Is it just that Google has more data and more machines? But it couldn't be just the more data because [in some competitions] ... everyone got the same data.

So, I think that having more machines is a very important part because it allows us to turn around the experiments much faster than the other guys. So, it's not the online performance where you are actually doing the search that matters, but it's the -- gee, I have an idea, I think we should change this -- and we can get the answer in two hours which I think is a big advantage over someone else who takes two days. And, I think it also helps that we took an engineering approach of -- well, we'll try anything.

Amazon was similar in many ways. While Amazon did not have Google's mighty parallel processing tools and massive cluster, Amazon did have big data (transactional and log data) which it used extensively for website visible features like personalization and search query refinements and backend work like supply chain optimizations. In addition, Amazon was very early if not the first to do website A/B tests, a framework for rapidly testing new algorithms and designs live on Amazon.com, which encouraged behavior like Google's "try anything" engineering approach.

I find Peter's words particularly interesting when thinking about the Netflix contest. Netflix may be demonstrating how to do a Google-like experimental effort if you do not have Google-scale resources. The Netflix contest uses other people's machine resources and the power of many minds "trying anything" to attempt to find improvements to the Netflix recommender system.

See also notes by Brian Mingus on what appears to be the same talk a few weeks later at U of Colorado at Boulder.

[Ionut post found via Philipp Lenssen]

Monday, October 02, 2006

Google Reader redesigns

Google Reader, a feed reader similar to Ask.com's Bloglines, recently redesigned and is getting rave reviews ([1] [2] [3]).

If you try Google Reader and are in the mood to experiment, please also give Findory's feed reader a try. Unlike other feed readers, it constantly recommends other interesting articles and feeds. It uses what other Findory readers found to help you discover things you might otherwise miss.

See also my previous post, "RSS sucks and information overload", where I said, "The problem is that the current generation of feed readers merely reformat RSS for display."

See also my earlier post with details on Findory's feed reader.

A9 redesigns, simplifies

A9.com, Amazon's web search startup, has done a major redesign that completely changes the site.

It now appears to be a metasearch engine, like Dogpile or Metacrawler, that gives a lot of control over which search engines are used.

I actually like the site better than before -- it seems cleaner and more usable to me -- but the functionality seems minimal, merely showing search results side-by-side from many search engines. Despite the spiffy AJAX UI on top, this is the kind of thing that has been around for a decade.

I would love to see A9 go a big step further and automatically decide which of thousands of search engines to query based on the information need of the searcher, then combine and rerank the results. That is a hard problem, but a very interesting one.
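A minimal sketch of just the combine-and-rerank step, assuming reciprocal rank fusion as the merging rule. This is an illustrative choice, not anything A9 is known to do, and it says nothing about the harder problem of picking which engines to query. The engine names, URLs, and the constant `k=60` are all hypothetical.

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked result lists into one ranking.

    Each URL scores sum(1 / (k + rank)) over the lists that return it,
    so results ranked highly by several engines rise to the top.
    """
    scores = defaultdict(float)
    for results in ranked_lists.values():
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    # Sort by descending score; break ties by URL for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))

# Hypothetical results from three made-up engines.
results_by_engine = {
    "engine_a": ["a.example", "b.example", "c.example"],
    "engine_b": ["b.example", "a.example", "d.example"],
    "engine_c": ["b.example", "c.example", "a.example"],
}
merged = reciprocal_rank_fusion(results_by_engine)
```

A result that only one engine returns, like `d.example` here, ends up below results that several engines agree on, which is the basic behavior you would want before adding any per-searcher signals.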

Danny Sullivan has some scathing comments about A9's redesign. Danny says, "Frankly, A9's always felt like some type of Amazon plaything, a way for Amazon to say they were in search but also pretend it was all just an experiment, if it failed to succeed. I think the failure is now apparent, and Amazon seems to be cutting its losses pretty dramatically."

Ouch, but there is a lot of truth in Danny's words. A9 spent millions going nowhere instead of attacking the interesting and hard problems in personalized search, federated search, and personalized advertising.

See also my previous post, "What will become of A9?"

Update: The punches keep coming. Paul Kedrosky says Amazon has been "wasting their time in nowhere search efforts." Joe at TechDirt says, "[Amazon] realizes it doesn't have much to bring to search." Ouch.

Netflix offers $1M prize for improved recs

This is interesting. Netflix is offering $1M in a contest to anyone who can improve the predictive accuracy of their recommendation engine by 10%.

From their Rules page:

We're quite curious, really. To the tune of one million dollars.

We've developed our world-class movie recommendation system: Cinematch. Its job is to predict whether someone will enjoy a movie based on how much they liked or disliked other movies. We use those predictions to make personal movie recommendations based on each customer's unique tastes. And while Cinematch is doing pretty well, it can always be made better.

Now there are a lot of interesting alternative approaches to how Cinematch works that we haven't tried. Some are described in the literature, some aren't. We're curious whether any of these can beat Cinematch by making better predictions. Because, frankly, if there is a much better approach it could make a big difference to our customers and our business.

So, we thought we'd make a contest out of finding the answer. It's "easy" really. We provide you with a lot of anonymous rating data, and a prediction accuracy bar that is 10% better than what Cinematch can do on the same training data set. (Accuracy is a measurement of how closely predicted ratings of movies match subsequent actual ratings.)

If you develop a system that we judge most beats that bar on the qualifying test set we provide, you get serious money and the bragging rights. But (and you knew there would be a catch, right?) only if you share your method with us and describe to the world how you did it and why it works.

Serious money demands a serious bar. We suspect the 10% improvement is pretty tough, but we also think there is a good chance it can be achieved. It may take months; it might take years.

There is no cost to enter, no purchase required, and you need not be a Netflix subscriber. So if you know (or want to learn) something about machine learning and recommendation systems, give it a shot. We could make it really worth your while.

Sounds like fun. I wonder how much time I'll end up wasting on this one.

If you are thinking of entering the contest, you might be interested to know that much of the Internet Movie Database (IMDb) database is available for download. Another good source for movie content is Amazon Web Services.
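For anyone sizing up the 10% bar, here is a rough sketch of how it could be checked, assuming accuracy is scored as root mean squared error (RMSE) between predicted and actual ratings. The ratings below are made up for illustration and the function names are hypothetical, not part of any Netflix tooling.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty lists of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))

def improvement(baseline_rmse, entry_rmse):
    """Fractional RMSE reduction relative to the baseline (0.10 means 10%)."""
    return (baseline_rmse - entry_rmse) / baseline_rmse

# Hypothetical ratings on a 1-5 star scale, not real contest data.
actual = [4, 3, 5, 1]
baseline = [3, 3, 4, 3]
entry = [4, 3, 4, 2]

gain = improvement(rmse(baseline, actual), rmse(entry, actual))
beats_bar = gain >= 0.10
```

Because the bar is relative to the baseline's error, an entry has to cut the error itself by a tenth, which is much harder on real data than this toy example suggests.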

[Contest found via Pete Abilla and John Krystynak]

Update: I should explicitly point out that this Netflix data is by far the largest ratings data set available to the research community. Most work on recommender systems outside of companies like Amazon or Netflix has had to make do with the relatively small 1M rating MovieLens data or the 3M EachMovie data set. This Netflix data set is 100M ratings. It will be enormously useful for recommender system research.

Update: The comments on this post are starting to get pretty interesting.

Friday, September 29, 2006

Not walloped by Wallop

I have been playing with Wallop, a social networking startup spun out of Microsoft Research, for a couple days now. So far, I am underwhelmed.

First, I have to admit, I am in no way in their target demographic. I am too old and boring, and my dating days are way behind me.

That being said, Wallop looks like a confusing mess to me. The entire site is a giant Flash app with a non-standard and non-intuitive user interface. There are menus and buttons scattered everywhere, but most of them don't do what you want; right-clicking is the primary way of taking action. Everything is rounded, buttons, menus, pictures, everything.

All this stuff is supposed to be hip, I guess. To me, it is just befuddling.

The default view when you first go to Wallop is your "profile", which is just a blob of text. In a circular pattern around that, there is a picture of yourself, a menu bar with a bunch of buttons that are effectively tabs for changing what is displayed, and then little windows around that show you a subset of the content from the tabs (e.g. the title of one of your uploaded MP3s).

The focus of Wallop seems to be on sharing pictures and music. They prominently feature a toolbar for playing uploaded MP3s. There are also communication tools for use inside of Wallop, a messaging system titled "Conversations" and a weblog.

There does seem to be a lot of potential for lurking and ad hoc communication. For example, I can pick people I don't know, look at all their public photos (which is the default setting when you upload photos) and comment on any one I want. In general, it appears you can tag or comment on about any item on Wallop.

There seems to be much less emphasis on the network than on other sites. The network tab shows people in a circular pattern radiating out from you. Looking at friend-of-a-friend relationships seemed slow and cumbersome. I could not find a way to search the network for people I know.

So far, I'm not sure what the appeal would be. It's confusing, cumbersome, not useful, and not fun. I don't get it.

See also my previous post on Wallop, "Microsoft's new Wallop startup", that includes links to academic papers on Wallop published while it was still at Microsoft Research.

See also Kari Lynn Dean's 2003 Wired article on Wallop, "Will Microsoft Wallop Friendster?".

Thursday, September 28, 2006

WebSpam talk and SIMS 141 speakers

I really enjoyed this "WebSpam" (link is correct, talk is mislabeled on Google Video) talk by Marc Najork from Microsoft Research.

It covers a lot of web spam techniques, examples of spammers' tricks, their motivations, and some of the countermeasures. Light and fun. Slides from the talk (PDF) are also available.

This talk is from Marti Hearst's SIMS 141 Fall 2005 class at UC Berkeley. There are slides and videos from many other talks available on that page. She had a remarkable set of speakers that included Jan Pedersen, Daniel Rose, Susan Dumais, Peter Norvig, Sep Kamvar, Bradley Horowitz, John Battelle, and Sergey Brin.

The quality of the videos on the SIMS 141 course page is low. There are higher quality videos for some of the talks on Google Video. A few of these talks on Google Video appear to be mislabeled; look at the comments at the bottom of the page to get the right titles.

Of the other talks, I particularly enjoyed "User Experience Issues in Web Search" by Daniel Rose from Yahoo Search. Peter Norvig's talk was also a lot of fun, but the low quality of the video available from the SIMS 141 page keeps me from recommending it. Finally, if you haven't looked at the Stuff I've Seen desktop search project from Microsoft Research, definitely look over the slides (PDF) from Susan Dumais' talk.

Unfortunately, Sep Kamvar's SIMS 141 talk on search personalization at Google appears not to have been recorded. Darn. I very much would have liked to see that.

Marc Najork also was at SIGIR AIRWeb 2006. I wrote up some notes on one particularly interesting discussion that involved Marc and many others. I also have other notes on the short presentation I did at that workshop.

For more on the Stuff I've Seen project at Microsoft Research, please also see some of my previous posts, including "Using the desktop to improve search", "Finding and discovering", and "Google Memex".

Thanks to Nathan Weinberg for reminding me to go back to the SIMS 141 page and mentioning the Google Video versions of some of the talks. Thanks to Danny Sullivan for originally pointing out the videos and slides available from the SIMS 141 page. And, thank you, Professor Marti Hearst, for making this remarkable content from your class available to all of us.

Update: The titles on the copies of the talks on Google Video have been fixed. They are now all correct.

Wednesday, September 27, 2006

Potential of web search personalization

There are some great data points on the potential of personalized web search in a KDD 2006 paper, "A Large-Scale Analysis of Query Logs for Assessing Personalization Opportunities", by Steve Wedig and Omid Madani from Yahoo Research.

The paper starts with what should by now be a familiar-sounding motivation for personalized search:

Interacting with search engines has traditionally been an impersonal affair, with the returned results a function only of the query entered.

Unfortunately the average query length is consistently reported to be around two, so many queries are too short to disambiguate the user's information need. Moreover, users often view only the first page of results, which makes precision critically important.

These limitations have motivated researchers to look beyond the query and consider how a search's context can provide further evidence about the user's information need.

To determine the potential for personalized search, the researchers analyzed "six months of query logs from the Yahoo! search engine" that "contained about 1.35 million cookies, 26 million searches, and 20 million clicks." Their goal was to determine "the extent of short and long term history available" and the "consistency and convergence rate" of users' interests.

Right at the beginning, the authors distinguish between using a searcher's short-term history to change search results, which they call "adjustment", and modifying search results using a profile built from their long-term history, which they refer to as "personalization".

Frequent readers of this weblog would know that I would call the first personalization and the second "probably not worth doing". But this paper does a good job quantifying the potential impact of both the short-term and long-term approaches to personalized search.

In particular, the authors looked at the number of searchers who had enough information for profiles built from long-term history. In their analysis, 50% of queries to Yahoo Search came from "users who performed at least 100 queries over the 6 month period." That seems promising.

However, later in the paper, they analyze the number of queries necessary for a user's interests to clearly converge and become distinct from the population as a whole. They determined it required "a few hundred queries". Less than 25% of queries and less than 3% of users appeared to have that much data.

This does not mean that a long-term, profile-based approach to personalization is not worth doing, but it does mean that it would only impact a minority of the queries and users.
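To make the gap between "share of queries" and "share of users" concrete, here is a small sketch of that kind of coverage calculation. It is not the paper's method, just an illustration; the toy log, the threshold, and the function name `coverage` are hypothetical.

```python
from collections import Counter

def coverage(query_log, min_queries):
    """Return (share of queries, share of users) for heavy users.

    query_log is an iterable of user identifiers, one entry per query.
    A heavy user is one with at least min_queries entries in the log.
    """
    per_user = Counter(query_log)
    total_queries = sum(per_user.values())
    if total_queries == 0:
        return 0.0, 0.0
    heavy = [n for n in per_user.values() if n >= min_queries]
    return sum(heavy) / total_queries, len(heavy) / len(per_user)

# Hypothetical toy log: one heavy user and four light ones.
log = ["u1"] * 6 + ["u2"] * 2 + ["u3", "u4", "u5"]
query_share, user_share = coverage(log, min_queries=5)
```

In this toy log, one user out of five accounts for more than half of the queries. The same skew is why a small percentage of users can still represent a meaningful fraction of queries.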

The short-term approach, which they call "adjustment", appears to have potential to influence many queries. The researchers talk a bit about some promising approaches for that in the last part of the paper, including focusing on less common clickthroughs, clickthroughs that users tend to return to, and related clickthroughs. They claim that "with short-term adjustment, a single click ... could dramatically improve results for the rest of your search, even without any prior user history."

In the end, it is probably worth doing both approaches, but this paper is useful for understanding some of the limitations of each. Well worth reading.

For more on personalized web search, please also see some of my previous posts: "Beyond the commons: Personalized web search", "Google Personalized Search and Bigtable", "More on Google personalized search", and "New personalized web search at Findory".

By the way, if you like this post, you may also be interested in my post, "Recommending advertisements", on another of Omid Madani's papers.

Update: If you have trouble downloading the paper from Yahoo Research, you can also get it from the ACM.

Management and incentives at Google

Googler Steve Yegge has some interesting tidbits on Google's management buried in one of his posts. Some extended excerpts:

Google's process probably does look like chaos ...

What's to stop engineers from leaving all the trouble projects, leaving behind bug-ridden operational nightmares? What keeps engineers working towards the corporate goals if they can work on whatever they want? How do the most important projects get staffed appropriately?

Google drives behavior through incentives. Engineers working on important projects are, on average, rewarded more than those on less-important projects. You can choose to work on a far-fetched research-y kind of project that may never be practical to anyone, but the work will have to be a reward unto itself. If it turns out you were right and everyone else was wrong (the startup's dream), and your little project turns out to be tremendously impactful, then you'll be rewarded for it. Guaranteed.

The rewards and incentives are too numerous to talk about here, but the financial incentives range from gift certificates and massage coupons up through giant bonuses and stock grants, where I won't define "giant" precisely, but think of Google's scale and let your imagination run a bit wild, and you probably won't miss the mark by much.

Google is a peer-review oriented culture, and earning the respect of your peers means a lot there. More than it does at other places, I think. This is in part because it's just the way the culture works; it's something that was put in place early on and has managed to become habitual. It's also true because your peers are so damn smart that earning their respect is a huge deal.

Another incentive is that every quarter, without fail, they have a long all-hands in which they show every single project that launched to everyone, and put up the names and faces of the teams (always small) who launched each one, and everyone applauds. Gives me a tingle just to think about it. Google takes launching very seriously, and I think that being recognized for launching something cool might be the strongest incentive across the company.

The perks are over the top, and the rewards are over the top, and everything there is so comically over the top that you have no choice, as an outsider, but to assume that everything the recruiter is telling you is a baldfaced lie, because there's no possible way a company could be that generous to all of its employees.

The thing that drives the right behavior at Google, more than anything else, more than all the other things combined, is gratitude. You can't help but want to do your absolute best for Google; you feel like you owe it to them for taking such incredibly good care of you.

Beautiful. I love it. I wish I had been able to push things further in this direction when I was at Amazon.

On Steve's comments about gratitude, see also my earlier post, "Free food at Google", where I said, "Perks can be seen as a gift exchange, having an impact on morale and motivation disproportionate to their cost."

For an interesting comparison to Microsoft, see my July 2004 post, "Microsoft cuts benefits". After the predictable drop in morale and loss of key people from cutting benefits, Microsoft reversed their policy, which I described in my May 2006 post, "Microsoft drops forced rank, increases benefits".

See also some of my other posts about Google's management structure, "First, kill all the managers" and "Google's rules of management".

For a comparison to Amazon, I have a couple posts on their management practices, one critical of "two pizza teams" and one praising some of Amazon's non-monetary rewards.

Update: Dare Obasanjo at Microsoft has a quite different take on all of this: "A company pays you at worst 'what they think they can get away with' and at best 'what they think you are worth', neither of these should inspire gratitude."

Humans and algorithms and humans

John Battelle interviews Googler Matt Cutts. Some interesting excerpts from Matt on algorithms based on user data:

When savvy people think about Google, they think about algorithms, and algorithms are an important part of Google. But algorithms aren't magic ... quite often ... [they] are based on human contributions in some way.

The simplest example is that hyperlinks on the web are created by people ... Google News ranks based on which stories human editors around the web choose to highlight. Most of the successful web companies benefit from human input, from eBay's trust ratings to Amazon's product reviews and usage data. Or take Netflix's star ratings ... [they] are done by people, and they converge to pretty trustworthy values after only a few votes.

Findory is similar in that its recommendations are based on what humans find and discover. The knowledge of what is good and what is not comes from readers; it is people sharing what they found with each other.

Findory's personalization is like what happens on social networking sites, but all the sharing happens anonymously and implicitly. Findory's algorithms quietly do all the work behind the scenes so that everyone in the Findory community can recommend articles to each other.

Matt also has a quick warning about some of the issues with abuse and spam:

The flip side is that someone has to pay attention to potential abuse by bad actors. Maybe it's cynical of me, but any time people are involved, I tend to think about how someone could abuse the system. We've seen the whole tagging idea in Web 1.0 when they were called meta tags, and some people abused them so badly with deceptive words that to this day, most search engines give little or no scoring weight to keywords in meta tags.

See also my previous post, "Community, content, and the lessons of the Web", where I said, "We cannot expect the crowds to selflessly combine their skills and knowledge to deliver wisdom, not once the sites attract the mainstream. Profit motive combined with indifference will swamp the good under a pool of muck ... At scale, it is no longer about aggregating knowledge, it is about filtering crap."

See also my previous post, "Getting the crap out of user-generated content".

Sunday, September 24, 2006

Findory switches from Google to Amazon ads

Findory recently launched a new version of our personalized advertising engine. This new version is based on Amazon Associates rather than Google AdSense.

Like the old engine, the new engine targets based on the content of the page and each reader's clickstream history on Findory. Unlike the old engine, it shows a targeted selection of books from Amazon rather than text ads from Google AdSense.

Why did Findory switch?

AdSense is an intelligent, self-optimizing, ad targeting system. It is a stubborn beast, convinced it knows what is right.

When Findory layered its own intelligent ad targeting system on top of AdSense, the two fought like crazed monkeys.

Findory would tell AdSense, "This page has articles about Google, Yahoo, search, engines, and technology," and then ask Google to target ads. Given that description, what do you think would be reasonable? Probably ads for web search engines and things related to web search engines?

Instead, Google sometimes would respond with ads for aircraft or automobile engines, blindly fixating on the word "engine" and apparently ignoring the rest of the information. It is hard to work with that.

We even tried test cases where we sent them nothing but a single keyword. For example, we said, "This page is about 'Google'." AdSense sometimes responded with ads for get-rich-quick schemes and penny stocks. That may be amusingly ironic. It even may be lucrative. But, it is not relevant.

After a year of experiments and optimizations to improve our targeting on top of AdSense, it became clear that we were not going to be able to bend it to our will. AdSense wants to target by itself. Any attempt to push it in one direction or another seems doomed to failure.

We decided to switch to a system where we would have more control. When we advertise books, Findory is completely responsible for the targeting. We analyze the Findory page and a reader's history, then our ad system picks specific books at Amazon based on that data.

Going back to that "Google" keyword test case, if I tell Findory's new advertising engine to target to the keyword "Google" (and nothing else), Findory responds with ads for four books: "The AdSense Code", "The Google Story", O'Reilly's "Google Advertising Tools", and "Google Maps Hacks". Ah, much better.

The new book ads target to any page on Findory in real-time as the content changes. Check out the targeting for Wired Magazine, ScienceDaily, Gizmodo, and Google Blogoscoped. And, of course, don't miss the targeting on your personalized Findory front page and how the ads change as you click on new articles.

The performance of the new ad system is roughly the same as the old Google AdSense system. Clickthroughs are much higher than what they were before but, because Amazon Associates has a much lower effective payment per click (about $.05), the incoming revenue is a little lower.
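The arithmetic behind that tradeoff can be sketched as follows. Only the roughly $.05 per click figure comes from the numbers above; the impression count, clickthrough rates, and the old payment per click are invented purely to show how much higher clickthroughs can still net slightly less revenue.

```python
def revenue(impressions, clickthrough_rate, payment_per_click):
    """Expected ad revenue: clicks times the effective payment per click."""
    return impressions * clickthrough_rate * payment_per_click

# Only the $0.05 per click figure comes from the post; everything else
# below is a made-up number chosen to show the shape of the tradeoff.
impressions = 100_000
old = revenue(impressions, clickthrough_rate=0.005, payment_per_click=0.25)
new = revenue(impressions, clickthrough_rate=0.02, payment_per_click=0.05)
```

With these hypothetical inputs, clickthroughs quadruple but each click pays a fifth as much, so revenue comes out somewhat lower.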

I like the new system much better. The ads are relevant and useful. We have complete control. And, I like helping people discover new books.

Friday, September 22, 2006

R.I.P. Froogle?

Google apparently has decided to retire its metashopping search, Froogle. From an article by Ben Charny at Marketwatch:

Google intends to "de-emphasize" its own Froogle shopping search engine, a Web site featuring paid listings from eBay and other online retailers. Google intends for Froogle to no longer be a standalone Web site; instead its listings would be absorbed by other search features, [analyst Robert] Peck wrote.

That is sad. I have always liked Froogle and had high hopes for it ([1] [2] [3] [4]), but it has been woefully lacking in attention in the last year or two. Even so, I am surprised Google is deciding to shut Froogle down rather than improve it.

There are several other comparison shopping sites available, including Shopzilla, Shopping.com, PriceGrabber, Smarter.com, and mySimon.

[Found via Paul Kedrosky]

Update: John Battelle pings Google PR and gets the response that "Froogle is alive and well." This may be a denial that Froogle is going to be shut down, but it also could be semantic games with the present versus future tense. Hard to tell.

Winner takes all, relevancy, and personalized search

Eric Goldman has an interesting article in InformIT that talks about personalization and its coming impact on search. Some excerpts:

Currently, search engines principally use "one size fits all" ranking algorithms to deliver homogeneous search results to searchers with heterogeneous search objectives.

Personalized algorithms produce search results that are custom-tailored to each searcher's interests, so different searchers will see different results.

Personalized ranking algorithms represent the next major advance in search relevancy ... Improvements in one-size-fits-all algorithms will yield progressively smaller relevancy benefits. Personalized algorithms transcend those limits [by] optimizing relevancy for each searcher.

Personalized ranking algorithms also reduce the effects of search engine bias. Personalized algorithms mean that there are multiple "top" search results for a particular search term, instead of a single "winner," so web publishers won't compete against each other in a zero-sum game ... Also, personalized algorithms necessarily will diminish the weight given to popularity-based metrics (to give more weight for searcher-specific factors), reducing the structural biases due to popularity.

See also my March 2005 post, "The key challenge is personalization", where I said:

With only one generalized relevance rank, further improvements to search quality become increasingly difficult because people disagree on how relevant a particular page is to a particular search.

At some point, to get further improvements, relevance rank will have to be customized to each person's definition of relevance.

See also my July 2006 post, "Combating web spam with personalization", where I said:

Another way to reduce the value [of web spam] is to reduce the maximum payoff. If different people see different search results, spamming becomes much less attractive. The jackpot from getting to the top of the page disappears.

Personalized search shows different search results to different people based on their history and their interests. Not only does this increase the relevance of the search results, but also it makes the search results harder to spam.

See also my August 2006 post, "Web spam, AIRWeb, and SIGIR", where I said:

"Winner takes all" encourages spam. When spam succeeds in getting the top slot, everyone sees the spam. It is like winning the jackpot.

If different people saw different search results -- perhaps using personalization based on history to generate individualized relevance ranks -- this winner takes all effect should fade and the incentive to spam decline.
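As a toy illustration of why the jackpot fades, here is a minimal sketch of personalized re-ranking. The scoring formula, the `weight` parameter, the topics, and the example URLs are all hypothetical. This is not how any real search engine ranks results.

```python
def personalized_rank(results, interests, weight=0.5):
    """Blend a global relevance score with a per-user interest score."""
    def score(result):
        personal = interests.get(result["topic"], 0.0)
        return (1 - weight) * result["global_score"] + weight * personal
    return sorted(results, key=score, reverse=True)


# Hypothetical results for one ambiguous query; the spam page has
# gamed its way to the highest global score.
results = [
    {"url": "spam.example/jaguar", "topic": "spam", "global_score": 0.9},
    {"url": "cars.example/jaguar", "topic": "cars", "global_score": 0.8},
    {"url": "zoo.example/jaguar", "topic": "animals", "global_score": 0.7},
]

car_fan = {"cars": 1.0}        # interests inferred from history (made up)
animal_fan = {"animals": 1.0}

# weight=0 means no personalization: everyone sees the same ranking.
print("everyone:  ", personalized_rank(results, {}, weight=0.0)[0]["url"])
print("car fan:   ", personalized_rank(results, car_fan)[0]["url"])
print("animal fan:", personalized_rank(results, animal_fan)[0]["url"])
```

With one global ranking, the spam page wins the top slot for every searcher. Once history is blended in, each searcher has a different top result, so there is no single slot left to win.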

Social networks and phishing

This "Social Phishing" paper (PDF) that will appear in an upcoming issue of Communications of the ACM is frightening. It describes very successful phishing attacks using information pulled off social networking sites.

From the paper:

The question we ask here is how easily and how effectively a phisher can exploit social network data found on the Internet to increase the yield of a phishing attack. The answer, as it turns out, is: very easily and very effectively.

Our study suggests that Internet users may be over four times as likely to become victims if they are solicited by someone appearing to be a known acquaintance.

To mine information about relationships and common interests in a group or community, a phisher need only look at any one of a growing number of social network sites, such as Friendster (friendster.com), MySpace (myspace.com), Facebook (facebook.com), Orkut (orkut.com), and LinkedIn (linkedin.com). All these sites identify "circles of friends" which allow a phisher to harvest large amounts of reliable social network information.

The experiment spoofed an email message between two friends, whom we will refer to as Alice and Bob. The recipient, Bob, was redirected to a phishing site with a domain name clearly distinct from Indiana University; this site prompted him to enter his secure University credentials. In a control group, subjects received the same message from an unknown fictitious person with a University email address.

The 4.5-fold difference between the social network group and the control group is noteworthy. The social network group's success rate (72%) was much higher than we had anticipated.

When they received the e-mail to go to this non-University website, 349 of the 487 students targeted provided their University username and password. Remarkable and frightening.
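The numbers are easy to check. The short sketch below recomputes the success rate from the counts above and backs out the control group rate implied by the 4.5-fold figure. That implied rate is derived here from the quoted ratio, not a number quoted from the paper.

```python
victims = 349    # students who entered their credentials
targeted = 487   # students in the social network group

social_rate = victims / targeted
fold_difference = 4.5  # ratio quoted in the excerpt above

# Derived, not quoted: what the control group rate would have to be.
implied_control_rate = social_rate / fold_difference

print(f"social network group: {social_rate:.1%}")
print(f"implied control group: {implied_control_rate:.1%}")
```

That works out to roughly 72% for the spoofed messages from friends against something around 16% for messages from a stranger.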

The paper contains other interesting details such as differences in success rates according to field of study and gender of sender and receiver.

See also a Google Tech Talk on Google Video, "Badvertisements: Stealthy Click Fraud with Unwitting Accessories", by Markus Jakobsson, one of the authors of the paper, that discusses this phishing study and some of his other work on click fraud.

Update: If you liked this, don't miss Markus' demonstration of a crafty CSS/JavaScript hack that reveals parts of your browser history. To see it, click on the "View" link on the right side of his page.

Thursday, September 21, 2006

Boxwood from Microsoft Research

If you enjoyed the Google Bigtable, Chubby, and GFS papers, you might also enjoy a recent paper out of Microsoft Research, "Boxwood: Abstractions as the Foundation for Storage Infrastructure".

The basic idea is to create a distributed data store over a small cluster. It is similar in motivation to Bigtable and GFS, but lower-level. From the paper:

The overall goal of the Boxwood project is to experiment with data abstractions as the underlying basis for storage infrastructure ... [that includes] redundancy and backup schemes to tolerate failures, expansion mechanisms for load and capacity balancing, and consistency maintenance in the presence of failures.

The principal client-visible abstractions that Boxwood provides are a B-tree abstraction and a simple chunk store abstraction provided by the Chunk Manager.

It is worth noting right away that Boxwood is a research project, not a deployed system. The Boxwood prototype runs on a small cluster of eight machines. GFS and Bigtable run on tens of thousands of machines and provide the backend for many of Google's products.

It is also worth noting that they have different standards for failure tolerance. For one of several examples, the Boxwood paper says that "failures are assumed to be fail-stop". Contrast that with the experience of the folks at Google working on Bigtable:

One lesson we learned is that large distributed systems are vulnerable to many types of failures, not just the standard network partitions and fail-stop failures assumed in many distributed protocols.

For example, we have seen problems due to all of the following causes: memory and network corruption, large clock skew, hung machines, extended and asymmetric network partitions, bugs in other systems that we are using (Chubby for example), overflow of GFS quotas, and planned and unplanned hardware maintenance.

In any case, the Boxwood paper is an interesting read. This is work at Microsoft that may follow a similar path to GFS and Bigtable.

See also my previous post, "Yahoo building a Google FS clone?", that talks about Yahoo's involvement in Hadoop.

See also the Eclipse project at Microsoft Research.

Update: Mary Jo Foley mentions another Microsoft Research project called Dryad and quotes Bill Gates as saying, "[Google] did MapReduce; we have this thing called Dryad that's better." Unfortunately, there appears to be very little public information on Dryad; I can find no publications on the work.

Wednesday, September 20, 2006

The Daily You paper

Lawrence Kai Shih and David Karger at MIT wrote an interesting WWW2004 paper, "Using URLs and Table Layout for Web Classification Tasks", about a news recommender system they called "The Daily You".

The recommender system is unusual in that it uses proximity on a page and similarities in the URLs to find related articles. From the paper:

In recommendation systems ... typically, the Web is treated as a large text corpus: the numerous features used are the words in the documents, and standard machine learning algorithms such as Naive Bayes or support vector machines are applied.

The Web is more than just text, however: it contains rich, human-oriented structure suitable for learning. In this paper, we argue that two features particular to Web documents, URLs and the visual placement of links on a page, can be of great value in document classification. We show that machine-learning classifiers based on these features can be simultaneously more efficient and more accurate than those based on the document text.

Our motivating example for these classification problems is The Daily You, a tool providing personalized news recommendations from the Web. The Daily You uses URLs and table layout to solve two important classification problems: the blocking of Web advertisements and the page regions and outbound hyper-links predicted to be "interesting" to its user.

Shih and Karger are saying that human editors already identify related articles by putting them in close proximity, either by placing them close together on a web page or by giving them similar URLs on their website. The authors try to extract and exploit that structure to generate good news recommendations. It is a cute idea.
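To give a feel for the URL half of the idea, here is a minimal sketch that recommends articles whose URLs look like the URLs of articles a reader already clicked. It is not the algorithm from the paper. It swaps in a simple Jaccard similarity over URL tokens, ignores table layout entirely, and uses made-up example URLs.

```python
import re
from urllib.parse import urlparse


def url_features(url: str) -> set[str]:
    """Split a URL into a host token plus non-numeric path tokens."""
    parsed = urlparse(url)
    tokens = {parsed.netloc.lower()}
    for token in re.split(r"[/\-_.]+", parsed.path.lower()):
        if token and not token.isdigit():
            tokens.add(token)
    return tokens


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(clicked: list[str], candidates: list[str], k: int = 2) -> list[str]:
    """Rank candidates by their best URL similarity to any clicked article."""
    clicked_features = [url_features(u) for u in clicked]

    def score(url: str) -> float:
        features = url_features(url)
        return max((jaccard(features, c) for c in clicked_features), default=0.0)

    return sorted(candidates, key=score, reverse=True)[:k]


# Hypothetical reading history and candidate links.
clicked = ["http://news.example.com/sports/baseball/mariners-win.html"]
candidates = [
    "http://news.example.com/sports/baseball/yankees-lose.html",
    "http://news.example.com/business/markets/stocks-rally.html",
    "http://ads.example.net/banner/12345.gif",
]

for url in recommend(clicked, candidates, k=2):
    print(url)
```

The other baseball story ranks first because it shares the host and the /sports/baseball/ path with the clicked article, while the ad server URL shares nothing and falls to the bottom. The same signal hints at how URLs alone might help separate ads from articles.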

Tuesday, September 19, 2006

Tagging should be automated

Yahoo PM Matt McAlister posts some thoughts on the mainstream and the effort involved with tagging. Some excerpts:

My mother will never organize her web pages with tags.

What's missing from the tagging world is automatic learning. People shouldn't have to find the 'save' button, click it, fill in tags, and hit save. My browser history says a lot about what interests me. The time I spend on a page says a lot about what I value. Any social activities I initiate or receive can inform a machine what the world around me thinks about.

The influencer is clearly willing to work harder ... but everyone else will need something more personal to happen as a result of tagging to warrant the amount of effort to do it.

See also today's post on the Dead 2.0 blog, "Ask Skeptic's Mom: What's Tagging?"

See also my previous posts, "Manual vs. automated tagging" and "Social software is too much work".
