John Battelle's Searchblog: The Search Papers Archives

The Search Papers Archive

December 16, 2004

Search Paper Fun: Most Cited

I sent a query to Lee Giles, the guru at Penn State behind CiteSeer (with Steve Lawrence, who is now at Google) asking him which search-related papers are the most cited. I was struck by the near parity between Page and Brin's original paper on Google and Jon Kleinberg's paper on Hubs and Authorities. Giles did a bit of fiddling with Google Scholar and responded:

For web related work these are well cited in the Google Scholar using the query “web”:

[PDF] The Semantic Web
T Berners-Lee, J Hendler, O Lassila - View as HTML - Cited by 1347
... May 17, 2001. The Semantic Web. A new form of Web content that is meaningful to
computers will unleash a revolution of new possibilities. ... Web: A Research Agenda. ...
Scientific American, 2001 - www-personal.si.umich.edu

[PDF] The anatomy of a large-scale hypertextual Web search engine
S Brin, L Page - View as HTML - Cited by 1087
Abstract In this paper, we present Google, a prototype of a large-scale search
engine which makes heavy use of the structure present in hypertext. Google ...
Computer Networks and ISDN Systems, 1998 - kulturinformatik.uni-lueneburg.de - firstrate.co.nz - net.cs.pku.edu.cn - scalab.uc3m.es - all 69 versions

However, this one can’t be ignored:

[PDF] Authoritative sources in a hyperlinked environment
J Kleinberg… - Cited by 1059
Abstract. The network structure of a hyperlinked environment can be a rich
source of information about the content of the environment, provided we ...
Journal of the ACM, 1999 - portal.acm.org - nan.dhs.org - cs.cmu.edu - mathe.tu-freiberg.de - all 73 versions

This book is the first to discuss the web in any detail:

[PS] Modern Information Retrieval
R Baeza-Yates, B Ribeiro-Neto, R Baeza-Yates - View as HTML - Cited by 1198
Page 1. Modern Information Retrieval. Ricardo Baeza-Yates. Berthier Ribeiro-Neto.
ACM Press New York. ... 1.1.2 Information Retrieval at the Center of the Stage . . ...
Addison Wesley, 1999 - dcc.ufmg.br - sunsite.dcc.uchile.cl - sims.berkeley.edu - portal.acm.org - all 7 versions

All worthy reads!

Posted by John Battelle at 5:29 PM

November 18, 2004

Google Scholar Launches: A Hint of Things to Come?

Google has, for some time, had a few verticalized, niche search solutions hidden in its Advanced Search areas, notably its "topic specific" search around Linux, the Mac, government sites, and the like. Today the company launched another, more ambitious vertical search tool called Google Scholar. According to folks I spoke to last night at Google, the service was built by one engineer in his "20% time." Anurag Acharya, the engineer behind the service, tuned Google's crawler for academic papers and worked with universities to make those papers available to others on the web.

The service has the tagline "Stand on the shoulders of giants." It includes a cross-referenced citation link for each paper, which is very cool and, as we all know, the basis of PageRank (and the WWW) in the first place. Here's a search for vertical or domain specific search, for example.
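To make the citation-as-ranking idea concrete, here is a minimal sketch of a PageRank-style calculation over a toy citation graph. It is an illustration only, not Google's or Scholar's actual ranking: the paper names, the damping value of 0.85, and the fixed iteration count are all assumptions chosen for the example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Rank nodes in a citation graph.

    `links` maps each paper to the list of papers it cites.
    The paper names and parameter defaults are hypothetical.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}

    for _ in range(iterations):
        # Every paper starts each round with the same small baseline score.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # A paper splits its score evenly among the papers it cites.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A paper that cites nothing spreads its score over everyone.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Toy graph: two papers cite paper_c, which in turn cites paper_d.
citations = {
    "paper_a": ["paper_c"],
    "paper_b": ["paper_c"],
    "paper_c": ["paper_d"],
    "paper_d": [],
}
scores = pagerank(citations)
for paper, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{paper}: {score:.3f}")
```

In this toy graph the twice-cited paper_c ends up ranked above the uncited paper_a and paper_b, which is the intuition behind treating citations as votes.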

This move marks a trend toward making usually invisible (and useful) information more accessible, one that I could imagine spreading to other domains, perhaps ones more commercial in nature. (Scholar does not have ads in it, at least for now.) The special ranking algorithm and policies for dealing with the nature of a structured document universe such as this clearly scale to other opportunities, i.e., travel, automotive, business information and the like.

Here's Resourceshelf's take on this, and SEW's.

Cnet coverage.

Posted by John Battelle at 5:24 AM

March 25, 2004

Upcoming WWW Conference: Loads O Search

Resourceshelf has culled the upcoming WWW conference for selected references to search. There's also a whole track on the Semantic Web.

The complete list is a Who's Who of search stars and a telling map of who's doing interesting research in the area. Included: Intel, University of Washington, IBM, Yahoo (Understanding User Goals in Search), National University of Singapore, MIT, Microsoft. A9's Udi Manber (who I did meet with, but can't go into our talk quite yet) is giving a keynote.

OK, I think I have to go to this.

Posted by John Battelle at 9:00 AM

January 11, 2004

The Search Papers: Do Web Search Engines Suppress Controversy?

The First Monday peer-reviewed journal recently published "Do Web Search Engines Suppress Controversy?" by Susan Gerhart, a software engineering professor at Embry-Riddle Aeronautical University. Driving the paper is this sentiment:

"The dilemma of controversies is that the searcher beginning to explore a topic doesn’t know the search terms to investigate a controversy unless it is revealed with reasonable visibility, e.g. not item number 879 in search results, nor buried three links away from result number 30."

In other words, if you are just starting to research a topic, and have no idea if there are any controversies surrounding said topic, how will you ever know if the search engine has a bias toward not revealing those controversies?

This paper explores the hypothesis that, as Gerhart puts it: "A given, well–known specific controversy will not be revealed in the top search results." She then creates an experiment to test this hypothesis, by outlining both a broad topic, and a related controversial subtopic. An example is "Albert Einstein" as the broad topic, and "Did Einstein’s first wife, Mileva Maric, receive appropriate credit for scientific contributions to Einstein’s early work" as the subtopic. The question is, do search engines leave out the more controversial bits, the stuff that, taken as a whole, provides texture and context to any searcher's understanding of a topic?
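As a rough illustration of what testing that hypothesis could look like, the sketch below checks whether a controversy shows up within the top results for a broad query. It is not Gerhart's actual procedure: the function names, the keyword-matching rule, the cutoff of 10, and the sample result titles are all made up for the example.

```python
def first_rank_mentioning(results, terms):
    """Return the 1-based rank of the first result containing every term.

    `results` is an ordered list of result titles or snippets.
    Returns None when no result mentions all the terms.
    """
    wanted = [term.lower() for term in terms]
    for rank, text in enumerate(results, start=1):
        lowered = text.lower()
        if all(term in lowered for term in wanted):
            return rank
    return None


def revealed_in_top(results, terms, k=10):
    """True if the controversy appears somewhere in the first k results."""
    rank = first_rank_mentioning(results, terms)
    return rank is not None and rank <= k


# Invented results for the broad query "Albert Einstein".
results = [
    "Albert Einstein biography and Nobel Prize facts",
    "Einstein's theory of relativity explained",
    "Did Mileva Maric contribute to Einstein's early papers?",
]
controversy_terms = ["Mileva", "Maric"]

print(first_rank_mentioning(results, controversy_terms))
print(revealed_in_top(results, controversy_terms, k=2))
```

Simple keyword matching is the crudest possible test. A real study would need human judgment about whether a page actually presents the controversy.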

For the many examples she tested, Gerhart found evidence on both sides of the ledger, and the paper left me disappointed that she could not come to a more decisive conclusion. She did note that in fact most search engines were roughly equal in their performance in the experiments. And she has some interesting thoughts on how controversies are integrated (or not) into the web at large, and some suggestions as to how various actors on the web - site authors, researchers, search engines - might better organize themselves to portray a more relevant set of SERPs to any particular query.

All in all, I liked this paper, as it forced me to think about the politics and architecture of search engine results. She introduces the idea of "sunny" vs. "dark" search results, and concludes that "sunny" results, those that do not include controversies, tend to float toward the top. Her final conclusion:

"Web search engines do not conspire to suppress controversy, but their strategies do lead to organizationally dominated search results depriving searchers of a richer experience and, sometimes, of essential decision–making information. These experiments suggest that bias exists, in one form or another, on the Web and should, in turn, force thinking about content on the Web in a more controversial light."

The one thing Dr. Gerhart left out entirely is the effect of blogs. As most of us certainly know, when the blogosphere latches onto a controversy (or just a politically-driven meme), that aspect of a topic usually shoots to the top of the SERPs. As with most good papers, this one left me feeling like there is much work yet to be done.

Posted by John Battelle at 10:42 AM

December 8, 2003

The Search Papers: Bray on Search

Tim Bray has a series called On Search over at his Ongoing blog, and I find it worthy of a read'n'muse. He starts with this backgrounder on himself and search issues as he sees them, and has a ton of entries on any number of subjects, too numerous to go into here. Highlights: he writes on interface issues (warning, not for the faint of geek), how best to search XML (answer: we don't know yet, recall he was a co-author of same), and on result rankings, with a quick refresher on why PageRank works, and good advice on paying attention to your own logs. Also worthy: his primer on how search works, and his discussion of the technical search terms precision and recall (with an interesting note on the absence of top companies in the research community - see my post on this here), and lastly (whew), his mini-rant on intelligent search, and why it's a long way off. An excerpt:

"If we want better search (and we do), we’d better not count on AI voodoo or linguistic juju or semantic mojo. We need to work with good sound statistical techniques, and be clever about generating and using metadata, and we need to get our APIs right. All of these things are hard, and there is good work being done in all of them."
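For readers who want precision and recall pinned down, here is a minimal sketch using the standard definitions: precision is the fraction of returned results that are relevant, and recall is the fraction of relevant documents that were returned. The function name and the document IDs are hypothetical, and the code is not taken from Bray's posts.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    `retrieved` is what the engine returned; `relevant` is what it
    ideally should have returned. Both are collections of document IDs.
    """
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = retrieved_set & relevant_set

    # Guard against empty inputs so we never divide by zero.
    precision = len(hits) / len(retrieved_set) if retrieved_set else 0.0
    recall = len(hits) / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Made-up example: four results returned, three documents truly relevant.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc2", "doc4", "doc7"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.50 recall=0.67"
```

Two of the four returned documents are relevant, so precision is 0.50. Two of the three relevant documents were found, so recall is about 0.67.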

Posted by John Battelle at 12:59 PM

License

This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 2.5 License.