USING CONCEPT RELATIONSHIPS TO IMPROVE DOCUMENT CATEGORIZATION
Abstract
In the information age we depend heavily on our ability to find information hidden in largely unstructured textual documents. This article proposes a simple method that can be added to existing systems to improve categorization accuracy over traditional techniques without requiring any time-consuming or language-dependent computation. The improvement is achieved by exploiting properties observed in the entire document collection rather than in individual documents, which may also be regarded as constructing an approximate concept network that measures semantic distances. These properties are simple enough to avoid massive computation, yet they aim to capture the most fundamental relationships between words or concepts. Experiments on the Reuters-21578 news article collection were evaluated using a set of simple measures that estimate clustering efficiency and, in addition, using Cluto, a widely used document clustering package. The results show a 5-10% improvement in clustering quality over traditional clustering based on tf (term frequency) or tf x idf (term frequency-inverse document frequency).
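For reference, the tf and tf x idf baselines named above can be sketched as follows. This is a minimal illustration of one common formulation (raw term counts, idf = log(N / df), cosine similarity between documents) and is not necessarily the exact weighting used in the experiments. The function names and the three toy documents are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def tf_vectors(docs):
    """Raw term-frequency vector (a Counter) for each tokenized document."""
    return [Counter(tokens) for tokens in docs]


def idf_weights(docs):
    """idf(t) = log(N / df(t)), where df(t) is the number of documents containing t."""
    n_docs = len(docs)
    df = Counter(term for tokens in docs for term in set(tokens))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def tfidf_vectors(docs):
    """Weight each raw term count by the collection-wide idf of that term."""
    idf = idf_weights(docs)
    return [{t: f * idf[t] for t, f in tf.items()} for tf in tf_vectors(docs)]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy collection, already tokenized (illustration only).
docs = [
    "oil prices rise".split(),
    "oil exports fall".split(),
    "wheat prices fall".split(),
]
vectors = tfidf_vectors(docs)
```

Note that both baselines score a document pair only through the terms the two documents literally share. The collection-level word relationships described in the abstract are meant to supplement exactly this limitation.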