Enhancing Information Retrieval through
Statistical Natural Language Processing: A Study of
Collocation Indexing
Ofer Arazy and Carson Woo
Abstract
Although the management of information assets—specifically, of text
documents that make up 80 percent of these assets—can provide
organizations with a competitive advantage, the ability of
information retrieval (IR) systems to deliver relevant information
to users is severely hampered by the difficulty of disambiguating
natural language. The word ambiguity problem has been addressed with
moderate success in restricted settings, but it remains the main
challenge in general settings, which are characterized by large,
heterogeneous document collections.
In this paper, we provide preliminary evidence for the usefulness
of statistical natural language processing (NLP) techniques, and
specifically of collocation indexing, for IR in general settings. We
investigate the effect of three key parameters on collocation
indexing performance: directionality, distance, and weighting. We
build on previous work in IR to (1) advance our knowledge of key
design elements for collocation indexing, (2) demonstrate gains in
retrieval precision from the use of statistical NLP for
general-settings IR, and, finally, (3) provide practitioners with a
useful cost-benefit analysis of the methods under investigation.
Keywords: Document management, information retrieval (IR), word
ambiguity, natural language processing (NLP), collocation, distance,
directionality, weighting, general settings
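
To make two of the parameters named in the abstract concrete, the following is a minimal illustrative sketch of how directionality and distance might shape the extraction of collocations from a token stream. It is not the authors' implementation: the function name extract_collocations, its parameters max_distance and directional, and the whitespace tokenization are all assumptions made for illustration. Weighting is left out because the abstract does not specify a scheme.

```python
from collections import Counter


def extract_collocations(tokens, max_distance=2, directional=True):
    """Count word pairs that co-occur within max_distance tokens.

    Hypothetical helper for illustration only.
    - max_distance: largest allowed gap (in token positions) between
      the two words of a pair; 1 means adjacent words only.
    - directional: if True, (a, b) and (b, a) are distinct pairs;
      if False, word order is ignored.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Look ahead at most max_distance positions from token i.
        upper = min(i + 1 + max_distance, len(tokens))
        for j in range(i + 1, upper):
            right = tokens[j]
            if directional:
                pair = (left, right)
            else:
                # Canonical order so both directions share one key.
                pair = tuple(sorted((left, right)))
            counts[pair] += 1
    return counts


# Directional, adjacent-only pairs.
directed = extract_collocations(
    "the bank raised the interest rate".split(),
    max_distance=1,
    directional=True,
)

# Non-directional pairs: both word orders fold into one entry.
undirected = extract_collocations(
    "interest rate rate interest".split(),
    max_distance=1,
    directional=False,
)
```

In this sketch, widening max_distance admits more pairs per document (a larger index), while turning off directional merges the two orderings of a pair, so the two settings trade index size against how finely word usage is distinguished.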