An efficient crawler needs three things: knowledge bases, a crawling algorithm, and an analysis of the crawler's learning ability. The crawler's knowledge bases are built incrementally from the logs of previous crawls. To make the next crawl more efficient, we present three knowledge bases: starting URLs, topic keywords, and URL prediction. Good starting URLs help the crawler collect as many relevant web pages as possible. Good topic keywords help the crawler recognize the keywords that match the required topic. Good URL prediction helps the crawler estimate the relevancy of the content behind unvisited URLs.

The crawling algorithm is split into two modes: crawling without knowledge bases and crawling with them. Crawling without knowledge bases is used in the first crawl, for open exploration of the Internet. The information gathered from that first crawl accumulates into the crawler's experience, i.e. its knowledge bases, over consecutive crawls. Crawling with knowledge bases should then be used in subsequent crawls, yielding better results and better network bandwidth utilization. A minimal sketch of this two-mode loop follows below. - Weblink Team
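Here is a minimal Python sketch of the two-mode loop described above. The post does not give an implementation, so everything here is an assumption rather than the Weblink Team's actual code: the KnowledgeBase layout, the pattern-matching predict heuristic, the neutral 0.5 score for unknown URLs, and the caller-supplied fetch, extract_links, and is_relevant helpers.

    # Hypothetical sketch of the two-mode crawler; names and thresholds
    # are assumptions, not the authors' implementation.
    import heapq
    from collections import deque
    from dataclasses import dataclass, field

    @dataclass
    class KnowledgeBase:
        starting_urls: list = field(default_factory=list)  # good seed URLs
        topic_keywords: set = field(default_factory=set)   # topic vocabulary
        url_relevance: dict = field(default_factory=dict)  # URL pattern -> observed relevance rate

        def is_empty(self):
            return not (self.starting_urls or self.topic_keywords or self.url_relevance)

        def predict(self, url):
            # Predict the relevancy of an unvisited URL from patterns seen in past crawls.
            scores = [r for pattern, r in self.url_relevance.items() if pattern in url]
            return sum(scores) / len(scores) if scores else 0.5  # unknown URL: neutral score

    def crawl(seeds, kb, fetch, extract_links, is_relevant, limit=100):
        if kb.is_empty():
            # First crawl: no experience yet, so explore breadth-first.
            frontier = deque(seeds)
            pop, push = frontier.popleft, frontier.append
        else:
            # Later crawls: best-first order, driven by predicted relevance.
            frontier = [(-kb.predict(u), u) for u in list(seeds) + kb.starting_urls]
            heapq.heapify(frontier)
            pop = lambda: heapq.heappop(frontier)[1]
            push = lambda u: heapq.heappush(frontier, (-kb.predict(u), u))
        seen, relevant = set(), []
        while frontier and len(seen) < limit:
            url = pop()
            if url in seen:
                continue
            seen.add(url)
            page = fetch(url)
            if is_relevant(page, kb.topic_keywords):
                relevant.append(url)
                kb.starting_urls.append(url)  # log experience for the next crawl
            for link in extract_links(page):
                push(link)
        return relevant

The design choice mirrors the post: the first crawl uses a plain FIFO frontier because there is nothing to predict with, while later crawls use a priority queue so the most promising URLs are fetched first, which is where the bandwidth savings come from.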
1 comment:
I think intelligent crawling will definitely help to reduce the hits on websites and cut traffic. I hate MSN bots; I observed the fewest hits from Inktomi. :-)