Latent Semantic Indexing – What Is It?
Posted: February 15th, 2010 | Category: Writing and Speaking
Latent semantic indexing (LSI) is an information retrieval technique that applies a mathematical method, singular value decomposition, to identify the concepts contained in a body of text. It builds on the natural language processing technique known as latent semantic analysis (LSA). LSA examines the relationships between a set of documents and the words they contain, and then derives a set of concepts that describe those documents. With LSI, the documents returned for a query do not necessarily contain the exact words or phrases that the searcher typed in.
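To make the idea concrete, here is a minimal sketch of LSI in Python using NumPy. The toy documents, the choice of two concepts (`k = 2`), and the helper names `to_vector`, `fold_in`, and `cosine` are assumptions made for illustration only; a real system would use a much larger corpus and usually some form of term weighting.

```python
import numpy as np

# Toy corpus (hypothetical documents, chosen only for illustration).
docs = [
    "car engine repair",
    "automobile engine maintenance",
    "automobile car dealer",
    "banana fruit smoothie",
    "fruit salad recipe",
]

# Vocabulary and a term -> row index lookup.
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}

def to_vector(text):
    """Raw term-count vector over the vocabulary; unknown words are ignored."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v

# Term-document matrix: one row per term, one column per document.
A = np.column_stack([to_vector(d) for d in docs])

# Singular value decomposition, truncated to the k strongest concepts.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2  # assumed number of concepts for this toy corpus
U_k = U[:, :k]

def fold_in(text):
    """Project a piece of text into the k-dimensional concept space."""
    return to_vector(text) @ U_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Compare a query with every document in concept space.
doc_concepts = [fold_in(d) for d in docs]
query = "car"
q = fold_in(query)
scores = [cosine(q, d) for d in doc_concepts]
ranking = sorted(range(len(docs)), key=lambda j: -scores[j])

for j in ranking:
    print(f"{scores[j]:+.3f}  {docs[j]}")
```

In this sketch the document "automobile engine maintenance" does not contain the word "car", yet it should rank alongside the documents that do, because all three share the same underlying concept. The two fruit documents should score close to zero.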
LSI addresses two main problems with the common Boolean keyword search method. The first is that a single word can have more than one meaning (polysemy), and the second is that several different words can have the same meaning (synonymy). Polysemy is a common reason why irrelevant documents show up for a query, and synonymy is a common reason why relevant documents are left out.
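The two problems are easy to reproduce with a plain keyword match. The following is a small sketch, assuming a hypothetical four-document corpus and a hypothetical `boolean_and` function that returns only documents containing every query word.

```python
# Hypothetical mini corpus used only to illustrate the two failure modes.
docs = {
    "d1": "the car dealer sold a used car",
    "d2": "automobile prices rose this year",
    "d3": "the jaguar is a large cat native to the americas",
    "d4": "the jaguar dealer sold a luxury automobile",
}

def boolean_and(query, corpus):
    """Return the ids of documents that contain every word of the query."""
    terms = set(query.lower().split())
    return sorted(
        doc_id
        for doc_id, text in corpus.items()
        if terms <= set(text.lower().split())
    )

# Synonymy: d2 and d4 are about cars but use the word "automobile",
# so an exact-match search for "car" never finds them.
synonymy_hits = boolean_and("car", docs)

# Polysemy: "jaguar" matches both the animal (d3) and the car brand (d4),
# so one of the two is irrelevant whichever meaning the searcher intended.
polysemy_hits = boolean_and("jaguar", docs)

print(synonymy_hits)
print(polysemy_hits)
```

A concept-based method such as LSI tries to avoid both outcomes by matching on the ideas shared between documents instead of on the literal query words.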
LSI is also useful for automatically assigning documents to categories. It uses sample documents to establish the conceptual basis of each category. The concepts found in the example documents for each category are compared with the concepts extracted from the document to be classified, and the document is placed in the categories where the concepts match.
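The categorization step can be sketched along the same lines. The example below is a simplified illustration, not a definitive implementation: the labelled sample documents, the category names `autos` and `food`, the choice of two concepts, and the use of a per-category average (centroid) are all assumptions. It also assigns only the single best-matching category, whereas the approach described above can place a document in several.

```python
import numpy as np

# Hypothetical labelled sample documents for two categories.
samples = [
    ("autos", "car engine repair"),
    ("autos", "automobile engine maintenance"),
    ("autos", "automobile car dealer"),
    ("food", "banana fruit smoothie"),
    ("food", "fruit salad recipe"),
]

vocab = sorted({w for _, text in samples for w in text.split()})
index = {w: i for i, w in enumerate(vocab)}

def to_vector(text):
    """Raw term-count vector over the vocabulary; unknown words are ignored."""
    v = np.zeros(len(vocab))
    for w in text.split():
        if w in index:
            v[index[w]] += 1.0
    return v

# Build the concept space from the sample documents.
A = np.column_stack([to_vector(text) for _, text in samples])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :2]  # assumed: two concepts are enough for this toy data

def concepts(text):
    """Project text into the concept space."""
    return to_vector(text) @ U_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# One centroid per category: the mean concept vector of its samples.
centroids = {}
for label in sorted({label for label, _ in samples}):
    vecs = [concepts(text) for l, text in samples if l == label]
    centroids[label] = np.mean(vecs, axis=0)

def classify(text):
    """Return the category whose centroid is closest in concept space."""
    v = concepts(text)
    return max(centroids, key=lambda label: cosine(v, centroids[label]))

print(classify("engine dealer"))
print(classify("banana recipe"))
```

Neither test phrase appears as a whole in any sample document, so the assignment rests on the concepts the words share with each category's samples.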
Another advantage of LSI is that it can be applied to any language, because it relies entirely on mathematical analysis. It can therefore extract the semantic content of documents written in any language without consulting a thesaurus or dictionary. A query can also be made in one language while the documents being searched are in another.
LSI can also be applied to terms that are not words in the usual sense, such as the DNA sequences of genes. This makes it well suited to searching and categorizing biological and medical documents. For example, LSI can classify genes based on the biological information extracted from the titles and abstracts held in biological databases.
LSI also adjusts automatically to changing terminology, and it is hardly affected by unreadable characters, typographical mistakes, misspelled words, and other kinds of noise in documents. It can therefore be applied to text produced by speech-to-text conversion programs and to text extracted from images by optical character recognition software.