Fast Webpage Classification Using URL Features

dc.contributor.author: KAN, Min-Yen
dc.contributor.author: NGUYEN THI, Hoanh Oanh
dc.date.accessioned: 2005-08-25T03:36:30Z
dc.date.accessioned: 2017-01-23T06:59:36Z
dc.date.available: 2005-08-25T03:36:30Z
dc.date.available: 2017-01-23T06:59:36Z
dc.date.issued: 2005-08-25T03:36:30Z
dc.description.abstract: We demonstrate the usefulness of the uniform resource locator (URL) alone in performing web page classification. This approach is orders of magnitude faster than typical web page classification, as the pages themselves do not have to be fetched and analyzed. Our approach segments the URL into meaningful chunks and adds component, sequential and orthographic features to model salient patterns. The resulting binary features are used in supervised maximum entropy modeling. We analyze our approach's effectiveness in binary, multi-class and hierarchical classification. Our results show that, in certain scenarios, URL-based methods approach and sometimes exceed the performance of full-text and link-based methods. We also use these features to predict the prestige of a webpage (as modeled by PageRank), and show that it can be predicted with an average error of less than one point (on a ten-point scale) in a topical set of web pages.
dc.format.extent: 761126 bytes
dc.format.mimetype: application/pdf
dc.identifier.uri: https://dl.comp.nus.edu.sg/xmlui/handle/1900.100/1851
dc.language.iso: en
dc.relation.ispartofseries: TRC8/05
dc.title: Fast Webpage Classification Using URL Features
dc.type: Technical Report
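
The abstract describes segmenting a URL into chunks and deriving component, sequential and orthographic binary features from them. The sketch below illustrates one way such features might be extracted. It is not the report's actual feature set, which this record does not specify: the function names (url_tokens, url_features), the feature naming scheme, the choice of bigrams as the sequential feature, and the example URL are all assumptions made for illustration. The resulting feature set is the kind of binary input a maximum entropy classifier could be trained on.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL into lowercase alphanumeric chunks, each tagged with the
    URL component (host, path or query) it came from. Hypothetical helper."""
    parts = urlsplit(url)
    tagged = []
    for component, text in (("host", parts.netloc),
                            ("path", parts.path),
                            ("query", parts.query)):
        for tok in re.split(r"[^A-Za-z0-9]+", text):
            if tok:
                tagged.append((component, tok.lower()))
    return tagged


def url_features(url):
    """Return a set of binary feature names for a URL (illustrative only)."""
    tagged = url_tokens(url)
    feats = set()
    for comp, tok in tagged:
        feats.add(f"tok={tok}")              # bare token
        feats.add(f"{comp}:tok={tok}")       # component feature: token + position
        # Orthographic features: assumed examples (digits, capped length).
        if tok.isdigit():
            feats.add(f"{comp}:numeric")
        elif any(c.isdigit() for c in tok):
            feats.add(f"{comp}:alnum_mix")
        feats.add(f"{comp}:len={min(len(tok), 10)}")
    # Sequential features: adjacent token pairs (bigrams), an assumed choice.
    toks = [t for _, t in tagged]
    for a, b in zip(toks, toks[1:]):
        feats.add(f"bigram={a}_{b}")
    return feats


if __name__ == "__main__":
    # Placeholder URL, not taken from the report.
    for name in sorted(url_features("http://www.example.edu/~student/papers/2005.html")):
        print(name)
```

Because only the URL string is parsed, no page is fetched, which is the source of the speed advantage the abstract claims.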
Files
Original bundle
Name: TRC8-05.pdf
Size: 743.29 KB
Format: Adobe Portable Document Format
License bundle
Name: license.txt
Size: 1.53 KB
Format: Plain Text