Browsing by Author "NGUYEN THI, Hoanh Oanh"
- Fast Webpage Classification Using URL Features (2005-08-25T03:36:30Z) KAN, Min-Yen; NGUYEN THI, Hoanh Oanh
  We demonstrate the usefulness of the uniform resource locator (URL) alone in performing web page classification. This approach is orders of magnitude faster than typical web page classification, as the pages themselves do not have to be fetched and analyzed. Our approach segments the URL into meaningful chunks and adds component, sequential and orthographic features to model salient patterns. The resulting binary features are used in supervised maximum entropy modeling. We analyze our approach's effectiveness in binary, multi-class and hierarchical classification. Our results show that, in certain scenarios, URL-based methods approach and sometimes exceed the performance of full-text and link-based methods. We also use these features to predict the prestige of a webpage (as modeled by PageRank), and show that it can be predicted with an average error of less than one point (on a ten-point scale) in a topical set of web pages.
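The feature extraction described in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's exact feature definitions, so everything here is an assumption made for illustration: the function names (`segment_url`, `orthographic_shape`, `url_features`), the tokenization rule (split on non-alphanumeric characters), the shape encoding, the use of token bigrams as the sequential feature, and the example URL are all hypothetical.

```python
import re
from urllib.parse import urlparse


def segment_url(url):
    """Split a URL into lowercase alphanumeric chunks, grouped by URL component.

    Assumption: chunks are maximal runs of letters and digits.
    """
    parts = urlparse(url)
    components = {
        "host": parts.netloc,
        "path": parts.path,
        "query": parts.query,
    }
    return {
        name: [tok.lower() for tok in re.split(r"[^A-Za-z0-9]+", text) if tok]
        for name, text in components.items()
    }


def orthographic_shape(token):
    """Map a lowercase token to a coarse shape: letter runs become 'a', digit runs '0'."""
    shape = re.sub(r"[a-z]+", "a", token)
    return re.sub(r"[0-9]+", "0", shape)


def url_features(url):
    """Return the set of binary feature names that fire for one URL."""
    features = set()
    for component, tokens in segment_url(url).items():
        for tok in tokens:
            features.add(f"tok={tok}")                   # plain token
            features.add(f"{component}:tok={tok}")       # component feature
            features.add(f"{component}:shape={orthographic_shape(tok)}")  # orthographic
        for left, right in zip(tokens, tokens[1:]):
            features.add(f"{component}:bigram={left}_{right}")  # sequential feature
    return features


if __name__ == "__main__":
    # Hypothetical URL, used only to show which features fire.
    for name in sorted(url_features("http://www.example.edu/courses/cs101/index.html")):
        print(name)
```

Each feature name in the returned set stands for a binary indicator that is 1 for this URL and 0 otherwise. A supervised maximum entropy classifier (equivalent to multinomial logistic regression) could then be trained on such indicator vectors; that training step is not shown here.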