An effective web crawling ordering from graph search techniques
Kunsoo Park (박근수)
School of Computer Science and Engineering, Seoul National University</a>
School of Computer Science and Engineering, Seoul National University</a>
2009/03/16 Mon, 2PM-3PM
A web crawler is a fundamental software which iteratively downloads web pages by following links of web pages starting from a small set of initial URLs. Previously several web crawling orderings have been studied. In this talk we consider various graph search techniques including lexicographic breadth-first search, lexicographic depth-first search and maximum cardinality search as well as well-known breadth-first search and depth-first search, and then find the best web crawling ordering among them. Furthermore, we treat internal links and external links in different manners for effective crawling. For maximum cardinality search and lexicographic breadth-first search, we present linear time algorithms based on the partition refinement method. Experimental results show that maximum cardinality search is the best graph search technique for web crawlers.