Metadata extraction and text categorization using Universal Resource Locator expansions

Min-Yen KAN

Files

report.pdf (261.25 KB)

Date

2003-10-01

Authors

Min-Yen KAN

Abstract

Uniform resource locators (URLs), which mark the address of a resource on the World Wide Web, are often human-readable and can indicate metadata about a resource. This paper explores mining URLs to yield categorical metadata about web resources via a three-phase pipeline of word segmentation, abbreviation expansion, and classification. I apply this approach to the problem of subject metadata generation and quantify its performance relative to title- and document-based methods, both of which require retrieving the source document.
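
The three phases named in the abstract can be illustrated with a toy sketch. This is not the report's implementation: the abstract gives no algorithmic detail, so the lexicon, the abbreviation table, the category keywords, the shortest-split segmentation rule, and the keyword-count classifier below are all hypothetical stand-ins chosen only to make the pipeline shape concrete.

```python
import re
from urllib.parse import urlparse

# Hypothetical toy resources (assumptions, not taken from the report).
LEXICON = {"computer", "science", "news", "sports", "football",
           "department", "home", "page"}
ABBREVIATIONS = {"cs": "computer science", "dept": "department"}
CATEGORY_KEYWORDS = {
    "education": {"computer", "science", "department"},
    "sports": {"sports", "football"},
}


def tokenize(url):
    """Split a URL's host and path into lowercase alphabetic chunks."""
    parts = urlparse(url)
    return re.findall(r"[a-z]+", (parts.netloc + parts.path).lower())


def segment(chunk):
    """Phase 1: split a chunk into lexicon words by dynamic programming.

    Prefers the split with the fewest words; returns [chunk] unchanged
    when no complete split into lexicon words exists.
    """
    best = [None] * (len(chunk) + 1)
    best[0] = []
    for end in range(1, len(chunk) + 1):
        for start in range(end):
            piece = chunk[start:end]
            if best[start] is not None and piece in LEXICON:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[-1] if best[-1] is not None else [chunk]


def expand(token):
    """Phase 2: replace a known abbreviation with its expansion words."""
    return ABBREVIATIONS.get(token, token).split()


def classify(url):
    """Phase 3: pick the category whose keywords match the most URL words."""
    words = [word
             for chunk in tokenize(url)
             for piece in segment(chunk)
             for word in expand(piece)]
    scores = {category: sum(word in keywords for word in words)
              for category, keywords in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Under these assumptions, `classify("http://www.example.edu/cs/dept/")` expands `cs` and `dept` and returns `"education"`, while `classify("http://example.com/footballnews.html")` segments `footballnews` and returns `"sports"`. Neither call fetches the page itself, which is the property the abstract contrasts with title- and document-based methods.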

URI

https://dl.comp.nus.edu.sg/xmlui/handle/1900.100/1436

Collections

Technical Reports
