library.htm: the searchers' Library, which will fill the reading wants of any keen seeking-minded searcher

The searchers' library

You would be well advised to learn some evaluation lore: the fact that an essay is in pdf format and full of references to other "university" essays, or that its language appears more formal, more "professional", does not mean ANYTHING. As you will realize for yourself, many of our own essays beat these "pdf-library" essays hands down for importance, depth of reach and development potential. Yet some "established" papers are indeed useful and instructive.
As always, caveat emptor (and gaudet fur :-) Judge for yourself.
Pdf files may be cracked through some of these tricks. Our Library is in progress; your suggestions for inclusion are welcome.

A discussion about the utility of this library: "The major problem I see with these papers is that they are made to sound pompous in order to impress the other empty 'academic' heads. For a person that just wants to make the damn thing work they are tedious to read, hard to quickly evaluate and can probably be replaced with an hour or so serious thinking on the problem. Now, with the more avant-garde problems they may be the only source of reliable knowledge, which sadly means that one has to swallow the tons of crap to find the important bits in there, but hey - nobody said it should be easy :)"

The Library (pdf)

Olivier Chapelle: Large margin optimization of ranking measures (2007)
"Most ranking algorithms, such as pairwise ranking, are based on the optimization of standard loss functions, but the quality measure to test web page rankers is often different. We present an algorithm which aims at optimizing directly one of the popular measures, the Normalized Discounted Cumulative Gain. It is based on the framework of structured output learning, where in our case the input corresponds to a set of documents and the output is a ranking. The algorithm yields improved accuracies on several public and commercial ranking datasets."
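
The Normalized Discounted Cumulative Gain named in this abstract can be computed in a few lines. What follows is a minimal sketch, assuming the common formulation with gain 2^rel - 1 and a log2 position discount; the function names `dcg` and `ndcg` are ours. It only evaluates a given ranking and does not attempt the paper's large margin optimizer.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades given in ranked order."""
    # Position 0 is the top result; its discount is log2(2) = 1.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```

A ranking that already lists the most relevant documents first scores 1.0; pushing relevant documents down the list lowers the score.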

Jon M. Kleinberg: Authoritative Sources in a Hyperlinked Environment (1997-1998)
"We develop a set of algorithmic tools for extracting information from the link structures of such environments, and report on experiments that demonstrate their effectiveness in a variety of contexts on the World Wide Web. The central issue we address within our framework is the distillation of broad search topics, through the discovery of authoritative information sources on such topics."
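
The hub and authority iteration usually associated with this paper (known as HITS, as a later entry in this library calls it) can be sketched as follows. This is a toy illustration, assuming a small graph held in a dictionary; the function name `hits`, the fixed iteration count and the L2 normalization are our choices, and the paper's step of building a query-focused subgraph is not shown.

```python
def hits(links, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores pointing at it.
        auth = {p: 0.0 for p in pages}
        for src, targets in links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # A page's hub score is the sum of the authorities it points to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        # Normalize both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

Pages that many good hubs point to end up with high authority; pages that point to many good authorities end up as good hubs.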

L. Page; S. Brin; R. Motwani; T. Winograd: The PageRank Citation Ranking: Bringing Order to the Web (January 1998)
"The importance of a Web page is an inherently subjective matter, which depends on the readers interests, knowledge and attitudes. But there is still much that can be said objectively about the relative importance of Web pages. This paper describes PageRank, a method for rating Web pages objectively and mechanically, effectively measuring the human interest and attention devoted to them."
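
The "objective and mechanical" rating can be illustrated with a short power iteration. This is a minimal sketch, not the authors' implementation: the damping value 0.85 is the commonly cited one, and spreading the rank of pages without outgoing links uniformly over all pages is just one usual convention. The name `pagerank` and its parameters are ours.

```python
def pagerank(links, damping=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the same small "random jump" share.
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # Split the page's rank evenly among its outgoing links.
                share = damping * rank[p] / len(targets)
                for q in targets:
                    new[q] += share
            else:
                # Assumption: a page with no links spreads its rank evenly.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank
```

The scores sum to one, so each value can be read as the share of a random surfer's attention that the page receives.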

Steve Lawrence, C. Lee Giles: Searching the World Wide Web (1998)
"The coverage of any one engine is significantly limited: No single engine indexes more than about one-third of the “indexable Web,” the coverage of the six engines investigated varies by an order of magnitude, and combining the results of the six engines yields about 3.5 times as many documents on average as compared with the results from only one engine."

Min Fang, Narayanan Shivakumar, Hector Garcia-Molina, & VV.AA: Computing Iceberg Queries Efficiently (Aug 1998)
"In this paper we develop efficient execution strategies for an important class of queries that we call iceberg queries. An iceberg query performs an aggregate function over an attribute (or set of attributes) and then eliminates aggregate values that are below some specified threshold."
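
To make the definition concrete, here is what an iceberg query returns, written as a naive in-memory baseline. It is only a sketch of the query's meaning, with COUNT as the aggregate; the paper's point is precisely how to avoid this brute-force counting when the data is large, and none of its strategies are reproduced here. The name `iceberg_count` is ours.

```python
from collections import Counter


def iceberg_count(rows, key, threshold):
    """Count rows per key and keep only the keys that reach the threshold."""
    counts = Counter(key(row) for row in rows)
    return {k: c for k, c in counts.items() if c >= threshold}
```

For example, with rows of (word, document) pairs, `iceberg_count(rows, lambda r: r[0], 2)` keeps only the words that occur at least twice: the tip of the iceberg.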

Brian D. Davison et alia: DiscoWeb: Applying Link Analysis to Web Search (1998)
"How often does the search engine of your choice produce results that are less than satisfying, generating endless links to irrelevant pages even though those pages may contain the query keywords? How often are you given pages that tell you things you already know?"

VV.AA: AltaVista Search Intranet Developer's Kit (Reference manual, Ver. 2.6, April 1999)
"The AltaVista Search Developer's Kit Version lets you build your own search and retrieval application or add AltaVista Search-powered search capabilities to database applications and file repositories. Users can find what they need quickly and easily without special database training."

Steinowitz, Hidden Internet connections:
Part 1: Windows and Internet connections, Part 2: Intermezzo - The Internet (November 1999)
"It's often said that Microsoft writes certain hidden features in its software which automatically connects to Microsoft servers to transmit data about the user. All this incoming information is stored in huge databases which Microsoft subsequently uses for its merciless tactics in maintaining its world-dominating position.
How can we prove that Microsoft is doing this?"

G. W. Flake, S. Lawrence, C. Lee Giles, F. M. Coetzee: Self-Organization of the Web and Identification of Communities (1999)
"Despite the decentralized and unorganized nature of the web, we show that the web self-organizes such that communities of highly related pages can be efficiently identified based purely on connectivity. This discovery allows the identification of communities independent of, and unbiased by, the specific words used by authors."

B.J. Jansen, A. Spink, T. Saracevic: Real Life, Real Users, and Real Needs: A Study and Analysis of User Queries on the Web (2000)
"We analyzed transaction logs containing 51,473 queries posed by 18,113 users of Excite, a major Internet search service. We provide data on: (i) sessions - changes in queries during a session, number of pages viewed, and use of relevance feedback, (ii) queries - the number of search terms, and the use of logic and modifiers, and (iii) terms - their rank/frequency distribution and the most highly used search terms."

Gary William Flake, Steve Lawrence, C. Lee Giles: Efficient Identification of Web Communities (2000)
"We define a community on the web as a set of sites that have more links (in either direction) to members of the community than to non-members. Members of such a community can be efficiently identified in a maximum flow / minimum cut framework, where the source is composed of known members, and the sink consists of well-known non-members. "

Jason Zien et alia: Web Query Characteristics and their Implications on Search Engines (2000)

Stephen Dill et alia: Self-similarity in the Web (2001)
"Algorithmic tools for searching and mining the web are becoming increasingly sophisticated and vital. In this context, algorithms which use and exploit structural information about the web perform better than generic methods in both efficiency and reliability. We present an extensive characterization of the graph structure of the web, with a view to enabling high-performance applications that make use of this structure."

VV.AA: AltaVista Internet Search Services (Search Engine Interface Description, Ver 2.04, July 2000)
"AltaVista’s Internet Search Services (ISS) program is designed to provide global search capabilities for Internet portals. ISS allows portal operators to provide their end users with access to AltaVista’s collection of images, audio clips, video clips, and general text search results."

Juan M. Madrid, Susan Gauch: Incorporating Conceptual Matching in Search
"As the number of available Web pages grows, users experience increasing difficulty finding documents relevant to their interests. One of the underlying reasons for this is that most search engines find matches based on keywords, regardless of their meanings. To provide the user with more useful information, we need a system that includes information about the conceptual frame of the queries as well as its keywords."

A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S.Raghavan: Searching the Web (2001)
"We offer an overview of current Web search engine design. After introducing a generic search engine architecture, we examine each engine component in turn. We cover crawling, local Web page storage, indexing, and the use of link analysis for boosting search performance."

Steve Lawrence et alia: Persistence of Web References in Scientific Research (February 2001)
"Some argue that the lack of persistence of web resources means that they should not be cited in scientific research. We analyze references to web resources in computer science publications, finding that the number of web references has increased dramatically in the last few years, and that many of these references are now invalid."

Jon Kleinberg & Steve Lawrence: The Structure of the Web (November 2001)
"Because of the decentralized nature of its growth, the Web has been widely believed to lack structure and organization as a whole. Recent research, however, shows a great deal of self-organization."

Anukool Lakhina et alia: On the Geographic Location of Internet Resources (2001)
"One relatively unexplored question about the Internet’s physical structure concerns the geographical location of its components: routers, links and autonomous systems (ASes)."

Robert Steele: Techniques for Specialized Search Engines (2001)
"It is emerging that it is very difficult for the major search engines to provide a comprehensive and up-to-date search service of the Web. Even the largest search engines index only a small proportion of static Web pages and do not search the Web’s backend databases that are estimated to be 500 times larger than the static Web."

Steve Lawrence: Context in Web Search (2001)
"Web search engines generally treat search requests in isolation. The results for a given query are identical, independent of the user, or the context in which the user made the request. Next generation search engines will make increasing use of context information, either by using explicit or implicit context information from users, or by implementing additional functionality within restricted contexts. "

Jean-Pierre Eckmann, Elisha Moses: Curvature of Co-Links Uncovers Hidden Thematic Layers in the World Wide Web (30 November 2001)
"I read some papers about the subject and I think this one really captures the essence of web community and IMO gives an efficient (focused) way of spidering them", (Nemo)

Ding Choon Hoong, Rajkumar Buyya: Guided Google: A Meta Search Engine and its Implementation using the Google Distributed Web Services
"This paper proposes a guided meta-search engine, called “Guided Google”, which provides meta-search capability developed using the Google Web Services. It guides and allows the user to view the search results with different perspectives. This is achieved through simple manipulation and automation of the existing Google functions. Our meta-search engine supports search based on “combinatorial keywords” and “search by hosts”."

Wolfgang Barthel, Alexander K. Hartmann, Martin Weigt: Solving satisfiability problems by fluctuations: The dynamics of stochastic local search algorithms (2002)
"In order to find the exact shortest path, a broadcast method equivalent to a breadth first search (BFS) must be used. As mentioned in our discussion of Gnutella, broadcasting can overwhelm the bandwidth resources of the network."

David Wolpert, Kagan Tumer, Esfandiar Bandari: Improving Search Algorithms by Using Intelligent Coordinates (23 Jan 2003)
"We consider the problem of designing a set of computational agents so that as they all pursue their self-interests a global function G of the collective system is optimized. Three factors govern the quality of such design. The first relates to conventional exploration-exploitation search algorithms for finding the maxima of such a global function, e.g., simulated annealing (SA)."

Vladimir Pestov, Aleksandar Stojmirovic: Indexing schemes for similarity search: an illustrated paradigm (14 Nov 2002)
"What is needed, is a fully developed mathematical paradigm of indexability for similarity search that would incorporate the existing structures of database theory and possess a predictive power."

Lada A. Adamic, Rajan M. Lukose, Bernardo A. Huberman: Local Search in Unstructured Networks (4 Jun 2002)
"It has become clear that the simplest classical model of random networks, the Erdos-Renyi model, is inadequate for describing the topology of many naturally occurring networks. These diverse networks are more accurately described by power-law or scale-free link distributions. In these highly skewed distributions, the probability that a node has k links is approximately proportional to 1/k^τ."

Andrea Montanari, Riccardo Zecchina: Boosting search by rare events (19 Dec 2001)
"Randomized search algorithms for hard combinatorial problems exhibit a large variability of performances. We study the different types of rare events which occur in such out-of-equilibrium stochastic processes and we show how they cooperate in determining the final distribution of running times. As a byproduct of our analysis we show how search algorithms are optimized by random restarts."

Tair-Rong Sheu, Kathleen Carley: Monopoly Power on the Web - A Preliminary Investigation of Search Engines (27 Oct 2001)
"We focus on major search engines that provide general search services. We assume that the top 19 search engines in the June 2000 rating from Nielsen/NetRatings account for 100 % market share. We collected data on the hyperlinks connecting these search engine web sites over a five-month period from August 12th 2000 to Dec. 12th 2000. Each month’s network was stored as a binary matrix."

Hung-Yu Kao, Shian-Hua Lin, Jan-Ming Ho, Ming-Syan Chen: Mining Web Informative Structures and Contents Based on Entropy Analysis (2001)
"In this paper, we study the problem of mining the informative structure of a news Web site that consists of thousands of hyperlinked documents. We define the informative structure of a news Web site as a set of index pages (or referred to as TOC, i.e., table of contents, pages) and a set of article pages linked by these TOC pages. Based on the Hyperlink Induced Topics Search (HITS) algorithm, we propose an entropy-based analysis (LAMIS) mechanism for analyzing the entropy of anchor texts and links to eliminate the redundancy of the hyperlinked structure so that the complex structure of a Web site can be distilled."

Brian Amento, Loren Terveen, and Will Hill: Does "Authority" Mean Quality? Predicting Expert Quality Ratings of Web Documents (2000)
"Search engines like Google or AltaVista return tens of thousands of items, and even human-maintained directories like Yahoo or UltimateTV contain dozens to hundreds of items. However, these items vary widely in quality, ranging from large, well-maintained sites to smaller sites that contain specialized content to nearly content-free, completely worthless sites. No one has the time to wade through more than a handful of items."

Weiyi Meng et alia: A Highly Scalable and Effective Method for Metasearch (2001)
"A metasearch engine is a system that supports unified access to multiple local search engines. Database selection is one of the main challenges in building a large-scale metasearch engine. The problem is to efficiently and accurately determine a small number of potentially useful local search engines to invoke for each user query. In order to enable accurate selection, metadata that reflect the contents of each search engine need to be collected and used. This article proposes a highly scalable and accurate database selection method."

Jonathan D. Herbach: Improving Authoritative Sources in a Hyperlinked Environment via Similarity Weighting Or, "How to get better search results on the Web" (May 2001)
"Recent literature demonstrates that the network structure of a hyperlink environment can serve as an effective source for inferring the importance of content in documents. One such connectivity-analysis algorithm, HITS, determines document importance based upon the hyperlink structure of the Web. In order to compensate for the problems of a pure connectivity-analysis algorithm, we develop and test an algorithm based upon HITS that also considers document content."

Taher H. Haveliwala: Topic-Sensitive PageRank (2002)
"In the original PageRank algorithm for improving the ranking of search-query results, a single PageRank vector is computed, using the link structure of the Web, to capture the relative "importance" of Web pages, independent of any particular search query. To yield more accurate search results, we propose computing a set of PageRank vectors, biased using a set of representative topics, to capture more accurately the notion of importance with respect to a particular topic."

Amr Z. Kronfol: FASD: A Fault-tolerant, Adaptive, Scalable, Distributed Search Engine (May 2002)
"Search and retrieval under the Napster model; Search and retrieval under the Gnutella model; Retrieval under the Freenet model. This paper introduces FASD, a fault-tolerant, adaptive, scalable, and distributed search layer designed to augment existing peer-to-peer applications. Although completely decentralized, FASD’s approach is able to efficiently match the recall and precision of a centralized search engine."

Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, Gene H. Golub: Exploiting the Block Structure of the Web for Computing PageRank (2003)
"The web link graph has a nested block structure: the vast majority of hyperlinks link pages on a host to other pages on the same host, and many of those that do not link pages within the same domain. We show how to exploit this structure to speed up the computation of PageRank by a 3-stage algorithm whereby (1) the local PageRanks of pages for each host are computed independently using the link structure of that host, (2) these local PageRanks are then weighted by the "importance" of the corresponding host, and (3) the standard PageRank algorithm is then run using as its starting vector the weighted aggregate of the local PageRanks. Empirically, this algorithm speeds up the computation of PageRank by a factor of 2 in realistic scenarios. Further, we develop a variant of this algorithm that efficiently computes many different "personalized" PageRanks, and a variant that efficiently recomputes PageRank after node updates."

Tara Calishain and Rael Dornfest: Google Hacks (O'Reilly, 2003)

Google Hacks: 100 Industrial-Strength Tips and Tools is a book of tips about Google, currently the foremost Internet search engine, by Tara Calishain and Rael Dornfest (ISBN 0-596-00447-8). The book was published by O'Reilly & Associates in February 2003. Very useful: do find it on the web, or buy it.

Table of Contents
Credits Foreword Preface

Chapter 1. Searching Google
1. Setting Preferences 2. Language Tools 3. Anatomy of a Search Result 4. Specialized Vocabularies: Slang and Terminology 5. Getting Around the 10 Word Limit 6. Word Order Matters 7. Repetition Matters 8. Mixing Syntaxes 9. Hacking Google URLs 10. Hacking Google Search Forms 11. Date-Range Searching 12. Understanding and Using Julian Dates 13. Using Full-Word Wildcards 14. inurl: Versus site: 15. Checking Spelling 16. Consulting the Dictionary 17. Consulting the Phonebook 18. Tracking Stocks 19. Google Interface for Translators 20. Searching Article Archives 21. Finding Directories of Information 22. Finding Technical Definitions 23. Finding Weblog Commentary 24. The Google Toolbar 25. The Mozilla Google Toolbar 26. The Quick Search Toolbar 27. GAPIS 28. Googling with Bookmarklets

Chapter 2. Google Special Services and Collections
29. Google Directory 30. Google Groups 31. Google Images 32. Google News 33. Google Catalogs 34. Froogle 35. Google Labs

Chapter 3. Third-Party Google Services
36. XooMLe: The Google API in Plain Old XML 37. Google by Email 38. Simplifying Google Groups URLs 39. What Does Google Think Of... 40. GooglePeople

Chapter 4. Non-API Google Applications
41. Don't Try This at Home 42. Building a Custom Date-Range Search Form 43. Building Google Directory URLs 44. Scraping Google Results 45. Scraping Google AdWords 46. Scraping Google Groups 47. Scraping Google News 48. Scraping Google Catalogs 49. Scraping the Google Phonebook

Chapter 5. Introducing the Google Web API
50. Programming the Google Web API with Perl 51. Looping Around the 10-Result Limit 52. The SOAP::Lite Perl Module 53. Plain Old XML, a SOAP::Lite Alternative 54. NoXML, Another SOAP::Lite Alternative 55. Programming the Google Web API with PHP 56. Programming the Google Web API with Java 57. Programming the Google Web API with Python 58. Programming the Google Web API with C# and .NET 59. Programming the Google Web API with VB.NET

Chapter 6. Google Web API Applications
60. Date-Range Searching with a Client-Side Application 61. Adding a Little Google to Your Word 62. Permuting a Query 63. Tracking Result Counts over Time 64. Visualizing Google Results 65. Meandering Your Google Neighborhood 66. Running a Google Popularity Contest 67. Building a Google Box 68. Capturing a Moment in Time 69. Feeling Really Lucky 70. Gleaning Phonebook Stats 71. Performing Proximity Searches 72. Blending the Google and Amazon Web Services 73. Getting Random Results (On Purpose) 74. Restricting Searches to Top-Level Results 75. Searching for Special Characters 76. Digging Deeper into Sites 77. Summarizing Results by Domain 78. Scraping Yahoo! Buzz for a Google Search 79. Measuring Google Mindshare 80. Comparing Google Results with Those of Other Search Engines 81. SafeSearch Certifying URLs 82. Syndicating Google Search Results 83. Searching Google Topics 84. Finding the Largest Page 85. Instant Messaging Google

Chapter 7. Google Pranks and Games
86. The No-Result Search (Prank) 87. Google Whacking 88. GooPoetry 89. Creating Google Art 90. Google Bounce 91. Google Mirror 92. Finding Recipes

Chapter 8. The Webmaster Side of Google
93. A Webmaster's Introduction to Google 94. Generating Google AdWords 95. Inside the PageRank Algorithm 96. 26 Steps to 15K a Day 97. Being a Good Search Engine Citizen 98. Cleaning Up for a Google Visit 99. Getting the Most out of AdWords 100. Removing Your Materials from Google Index

Thomas Demuth: A Passive Attack on the Privacy of Web Users Using Standard Log Information
(it's always kinda suspect when they do not put any date. Around 2002)
Abstract: Several active attacks on user privacy in the World Wide Web using cookies or active elements (Java, Javascript, ActiveX) are known. One goal is to identify a user in consecutive Internet session to track and to profile him (such a profile can be extended by personal information if available). In this paper, a passive attack is presented that uses information of a different network layer in the first place. It is exposed how expressive the data of the HyperText Transfer Protocol (HTTP) can be with respect to identify computers (and therefore their users). An algorithm to reidentify computers using dynamically assigned IP addresses with a certain degree of assurance is introduced. Thereafter simple countermeasures are demonstrated. The motivation for this attack is to show the capability of passive privacy attacks using Web server log files and to propagate the use of anonymising techniques for Web users.
Keywords: privacy, anonymity, user tracking and profiling, World Wide Web, server logs

Caroline M. Eastman, Bernard J. Jansen: Coverage, Relevance, and Ranking: The Impact of Query Operators on Web Search Engine Results
(October 2003)
Research has reported that about 10% of Web searchers utilize advanced query operators, with the other 90% using extremely simple queries. It is often assumed that the use of query operators, such as Boolean operators and phrase searching, improves the effectiveness of Web searching. We test this assumption by examining the effects of query operators on the performance of three major Web search engines. We selected one hundred queries from the transaction log of a Web search service. Each of these original queries contained query operators such as AND, OR, MUST APPEAR (+), or PHRASE (“ ”). We then removed the operators from these one hundred advanced queries. We submitted both the original and modified queries to three major Web search engines; a total of 600 queries were submitted and 5,748 documents evaluated. We compared the results from the original queries with the operators to the results from the modified queries without the operators. We examined the results for changes in coverage, relative precision, and ranking of relevant documents. The use of most query operators had no significant effect on coverage, relative precision, or ranking, although the effect varied depending on the search engine. We discuss implications for the effectiveness of searching techniques as currently taught, for future information retrieval system design, and for future research.

Akamai Technologies (The akamai clowns are sworn enemies of privacy: the original pdf file requires -theoretically- a lengthy registration process and is streamed using a very crude and old streaming technology, not exactly a propaganda tool for akamai's dubious feats, if you ask me :-) Here is a (bad) copy automatically downloaded using an image capture script plus an automatic scanner (no human revision... you'll have to do it yourself).
Akamai Streaming - When Performance Matters (January 2004)
(October 31, 2005)
This paper offers a quantitative study of streaming (delivering audio & video over the Web) performance and demonstrates how Akamai ensures streams that "optimize return on streaming investments" (read: should avoid users copying the stream). Akamai's streaming capabilities are based on a unique combination of a global network of servers, innovative technology, and a wealth of first-hand experience. In this paper, Akamai presents their streaming approach, measurement methodology, and the data that demonstrates the "performance" of the Akamai Platform.

Zoltán Gyöngyi, Hector Garcia-Molina, Jan Pedersen: Combating Web Spam with TrustRank
(Proceedings of the 30th VLDB Conference, Toronto, Canada, 2004)
Web spam pages use various techniques to achieve higher-than-deserved rankings in a search engine’s results. While human experts can identify spam, it is too expensive to manually evaluate a large number of pages. Instead, we propose techniques to semi-automatically separate reputable, good pages from spam. We first select a small set of seed pages to be evaluated by an expert.
Once we manually identify the reputable seed pages, we use the link structure of the web to discover other pages that are likely to be good. In this paper we discuss possible ways to implement the seed selection and the discovery of good pages. We present results of experiments run on the World Wide Web indexed by AltaVista and evaluate the performance of our techniques. Our results show that we can effectively filter out spam from a significant fraction of the web, based on a good seed set of less than 200 sites.
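
The idea of letting trust flow from a hand-checked seed set along links can be sketched as a PageRank-style iteration whose "random jump" lands only on the seeds. This is a toy reading of the abstract under our own assumptions: the function name `trust_propagation`, the damping value and the iteration count are illustrative, and the seed selection step the paper discusses is not shown.

```python
def trust_propagation(links, good_seeds, damping=0.85, iterations=50):
    """links maps each page to the pages it links to; good_seeds is a set."""
    pages = (set(links)
             | {q for targets in links.values() for q in targets}
             | set(good_seeds))
    # All initial trust sits on the manually checked seed pages.
    seed_mass = {p: (1.0 / len(good_seeds) if p in good_seeds else 0.0)
                 for p in pages}
    trust = dict(seed_mass)
    for _ in range(iterations):
        new = {p: (1.0 - damping) * seed_mass[p] for p in pages}
        for p, targets in links.items():
            for q in targets:
                # Trust is split among outgoing links and fades with distance.
                new[q] += damping * trust[p] / len(targets)
        trust = new
    return trust
```

Pages that cannot be reached from any seed keep a trust of zero, which is what makes the score usable as a spam filter; pages without outgoing links simply let their trust leak away in this simplified version.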

Zoltán Gyöngyi, Hector Garcia-Molina: Web Spam Taxonomy
(SEARCH ENGINE SPAM WORKSHOP, May 09, 2005, Chiba, Japan)
Web spamming refers to actions intended to mislead search engines into ranking some pages higher than they deserve. Recently, the amount of web spam has increased dramatically, leading to a degradation of search results. This paper presents a comprehensive taxonomy of current spamming techniques, which we believe can help in developing appropriate countermeasures.

Albert Bifet, Carlos Castillo, Paul-Alexandru Chirita, Ingmar Weber: An Analysis of Factors Used in Search Engine Ranking
(SEARCH ENGINE SPAM WORKSHOP, May 09, 2005, Chiba, Japan)
This paper investigates the influence of different page features on the ranking of search engine results. We use Google (via its API) as our testbed and analyze the result rankings for several queries of different categories using statistical methods. We reformulate the problem of learning the underlying, hidden scores as a binary classification problem. To this problem we then apply both linear and non-linear methods. In all cases, we split the data into a training set and a test set to obtain a meaningful, unbiased estimator for the quality of our predictor. Although our results clearly show that the scoring function cannot be approximated well using only the observed features, we do obtain many interesting insights along the way and discuss ways of obtaining a better estimate and main limitations in trying to do so.

Panagiotis T. Metaxas, Joseph DeStefano: Web Spam, Propaganda and Trust
(SEARCH ENGINE SPAM WORKSHOP, May 09, 2005, Chiba, Japan)
In this paper, we first analyze the influence that web spam has on the evolution of the search engines and we identify the strong relationship of spamming methods to propagandistic techniques in society. Our analysis provides a foundation to understanding why spamming works and offers new insight on how to address it. In particular, it suggests that one could use anti-propagandistic techniques in the web to recognize spam. The second part of the paper demonstrates such a technique, called backwards propagation of distrust.

Ricardo Baeza-Yates, Carlos Castillo, Vicente Lopez: Pagerank Increase under Different Collusion Topologies
(SEARCH ENGINE SPAM WORKSHOP, May 09, 2005, Chiba, Japan)
We study the impact of collusion –nepotistic linking– in a Web graph in terms of Pagerank. We prove a bound on the Pagerank increase that depends both on the reset probability of the random walk ε and on the original Pagerank of the colluding set. In particular, due to the power law distribution of Pagerank, we show that highly-ranked Web sites do not benefit that much from collusion.

Baoning Wu, Brian D. Davison: Cloaking and Redirection: A Preliminary Study
(SEARCH ENGINE SPAM WORKSHOP, May 09, 2005, Chiba, Japan)
Cloaking and redirection are two possible search engine spamming techniques. In order to understand cloaking and redirection on the Web, we downloaded two sets of Web pages while mimicking a popular Web crawler and as a common Web browser. We estimate that 3% of the first data set and 9% of the second data set utilize cloaking of some kind. By checking manually a sample of the cloaking pages from the second data set, nearly one third of them appear to aim to manipulate search engine ranking.
We also examined redirection methods present in the first data set. We propose a method of detecting cloaking pages by calculating the difference of three copies of the same page. We examine the different types of cloaking that are found and the distribution of different types of redirection.

Sibel Adalı, Tina Liu, Malik Magdon-Ismail: Optimal Link Bombs are Uncoordinated
(SEARCH ENGINE SPAM WORKSHOP, May 09, 2005, Chiba, Japan)
We analyze the recent phenomenon termed a Link Bomb, and investigate the optimal attack pattern for a group of web pages attempting to link bomb a specific web page. The typical modus operandi of a link bomb is to associate a particular page with a search text and then boost that page’s pagerank. (The attacking pages can only control their own content and outgoing links.) Thus, when a search is initiated with the text, a high prominence will be given to the attacked page. We show that the best organization of links among the attacking group to maximize the increase in rank of the attacked node is the direct individual attack, where every attacker points directly to the victim and nowhere else. We also discuss optimal attack patterns for a group that wants to hide itself by not pointing directly to the victim. We quantify our results with experiments on a variety of random graph models.

Gilad Mishne, David Carmel, Ronny Lempel: Blocking Blog Spam with Language Model Disagreement
(SEARCH ENGINE SPAM WORKSHOP, May 09, 2005, Chiba, Japan)
We present an approach for detecting link spam common in blog comments by comparing the language models used in the blog post, the comment, and pages linked by the comments. In contrast to other link spam filtering approaches, our method requires no training, no hard-coded rule sets, and no knowledge of complete-web connectivity. Preliminary experiments with identification of typical blog spam show promising results.
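
The abstract does not spell out how the language models are compared. A minimal sketch, assuming smoothed unigram models and Kullback-Leibler divergence as the disagreement score (the smoothing constant, the direction of the divergence and all function names are assumptions, and any decision threshold would have to be chosen separately):

```python
import math
from collections import Counter


def unigram_model(text, vocab, alpha=0.1):
    """Add-alpha smoothed unigram distribution over a shared vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocab) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}


def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)


def disagreement(post, comment):
    """How far the comment's word distribution is from the post's."""
    vocab = set(post.lower().split()) | set(comment.lower().split())
    return kl_divergence(
        unigram_model(comment, vocab), unigram_model(post, vocab)
    )


post = "we review crawler design and crawl scheduling for an incremental web crawler"
on_topic = "nice review of crawler scheduling and incremental crawl design"
spam = "cheap pills casino poker buy cheap pills online now"

print(disagreement(post, on_topic), disagreement(post, spam))
```

The spam comment shares no vocabulary with the post, so its score comes out well above that of the on-topic comment. The same comparison could be repeated against the text of the page a comment links to.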

András A. Benczúr, Károly Csalogány, Tamás Sarlós, Máté Uher: SpamRank – Fully Automatic Link Spam Detection (work in progress)
(SEARCH ENGINE SPAM WORKSHOP, May 09, 2005, Chiba, Japan)
Spammers intend to increase the PageRank of certain spam pages by creating a large number of links pointing to them. We propose a novel method based on the concept of personalized PageRank that detects pages with an undeserved high PageRank value without the need of any kind of white or blacklists or other means of human intervention. We assume that spammed pages have a biased distribution of pages that contribute to the undeserved high PageRank value. We define SpamRank by penalizing pages that originate a suspicious PageRank share and personalizing PageRank on the penalties. Our method is tested on a 31 M page crawl of the .de domain with a manually classified 1000-page stratified random sample with bias towards large PageRank values.

Zoltán Gyöngyi, Pavel Berkhin, Hector Garcia-Molina, Jan Pedersen: Link Spam Detection Based on Mass Estimation
(October 31, 2005)
Link spamming intends to mislead search engines and trigger an artificially high link-based ranking of specific target web pages. This paper introduces the concept of spam mass, a measure of the impact of link spamming on a page’s ranking. We discuss how to estimate spam mass and how the estimates can help identifying pages that benefit significantly from link spamming. In our experiments on the host-level Yahoo! web graph we use spam mass estimates to successfully identify tens of thousands of instances of heavy-weight link spamming.

The Library (html)

Jenny Edwards, Kevin McCurley, John Tomlin: An Adaptive Model for Optimizing Performance of an Incremental Web Crawler (2001)
"This paper outlines the design of a web crawler implemented for IBM Almaden's WebFountain project and describes an optimization model for controlling the crawl strategy. This crawler is scalable and incremental. The model makes no assumptions about the statistical behaviour of web page changes, but rather uses an adaptive approach to maintain data on actual change rates which are in turn used as inputs for the optimization. Computational results with simulated but realistic data show that there is no "magic bullet" - different, but equally plausible, objectives lead to conflicting "optimal" strategies. However, we find that there are compromise objectives which lead to good strategies that are robust against a number of criteria."

Kevin S. McCurley: Geospatial Mapping and Navigation of the Web (2001)
"Web pages may be organized, indexed, searched, and navigated along several different feature dimensions. We investigate different approaches to discovering geographic context for web pages, and describe a navigational tool for browsing web resources by geographic proximity"

Ziv Bar-Yossef, Sridhar Rajagopalan: Template Detection via Data Mining and its Applications (2001)
"We formulate and propose the template detection problem, and suggest a practical solution for it based on counting frequent item sets. We show that the use of templates is pervasive on the web. We describe three principles, which characterize the assumptions made by hypertext information retrieval (IR) and data mining (DM) systems, and show that templates are a major source of violation of these principles. As a consequence, basic "pure" implementations of simple search algorithms coupled with template detection and elimination show surprising increases in precision at all levels of recall"

Sriram Raghavan, Hector Garcia-Molina: Crawling the Hidden Web (Extended Abstract) (2001)
"Current-day crawlers retrieve content from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high quality content "hidden" behind search forms, in large searchable electronic databases. Our work provides a framework for addressing the problem of extracting content from this hidden Web. At Stanford, we have built a task-specific hidden Web crawler called the Hidden Web Exposer (HiWE). In this poster, we describe the architecture of HiWE and outline some of the novel techniques that went into its design."

LATENT SEMANTIC INDEXING, Taking a Holistic View
(2002), part of "Patterns in Unstructured Data Discovery, Aggregation, and Visualization"
A Presentation to the Andrew W. Mellon Foundation by Clara Yu, John Cuadrado, Maciej Ceglowski, J. Scott Payne

"When you search an LSI-indexed database, the search engine looks at similarity values it has calculated for every content word, and returns the documents that it thinks best fit the query."
