USING CONCEPT RELATIONSHIPS TO IMPROVE DOCUMENT CATEGORIZATION

Authors

  • Péter Schönhofen
  • Hassan Charaf

Abstract

In the information age we depend heavily on our ability to find information hidden in mostly unstructured, textual documents. This article proposes a simple method that, as an addition to existing systems, improves categorization accuracy over traditional techniques without requiring any time-consuming or language-dependent computation. This result is achieved by exploiting properties observed across the entire document collection rather than in individual documents, which may also be regarded as constructing an approximate concept network (measuring semantic distances). These properties are simple enough to avoid massive computation, yet they attempt to capture the most fundamental relationships between words or concepts. Experiments performed on the Reuters-21578 news article collection were evaluated using a set of simple measurements estimating clustering efficiency, and additionally with Cluto, a widely used document clustering software package. Results show a 5-10% improvement in clustering quality over traditional tf (term frequency) or tf x idf (term frequency-inverse document frequency) based clustering.
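
For readers unfamiliar with the two baseline weightings named above, the following is a minimal sketch of tf and tf x idf document vectors. It is an illustration only, not the authors' implementation: the idf variant log(N / df), the function names, and the three-document toy corpus are all assumptions made for this example and are not taken from the article or from Reuters-21578.

```python
import math
from collections import Counter


def tf_vectors(docs):
    """Return one raw term-frequency mapping (term -> count) per tokenized document."""
    return [Counter(doc) for doc in docs]


def tf_idf_vectors(docs):
    """Scale each term count by log(N / df).

    N is the number of documents and df is the number of documents that
    contain the term. This idf variant is an assumption; other variants exist.
    """
    n_docs = len(docs)
    tfs = tf_vectors(docs)
    # Iterating a Counter yields each distinct term once, so this counts documents, not occurrences.
    df = Counter(term for tf in tfs for term in tf)
    return [
        {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}
        for tf in tfs
    ]


# Hypothetical toy corpus, already tokenized.
docs = [
    ["oil", "price", "rise"],
    ["oil", "export", "oil"],
    ["wheat", "price", "fall"],
]
weights = tf_idf_vectors(docs)
```

Under this weighting a term that occurs in every document receives weight zero, because it does not help to tell documents apart, while a term confined to a single document receives the largest idf factor.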

Keywords:

document categorization, statistical analysis, document clustering

How to Cite

Schönhofen, P., Charaf, H. “USING CONCEPT RELATIONSHIPS TO IMPROVE DOCUMENT CATEGORIZATION”, Periodica Polytechnica Electrical Engineering, 48(3-4), pp. 165–182, 2004.

Section

Articles