Automatic Extraction of a Sublanguage Grammar




In the last few years there has been a resurgence of interest in applying stochastic techniques to natural language processing systems (an approach that had been abandoned in the machine translation projects of the late 1950s). There are two reasons: first, technical advances such as hidden Markov models have made these techniques more practical (e.g. Church's and Hindle's work on deducing lexical ambiguity rules); second, there is growing disenchantment with purely knowledge-based systems, which lack robustness when facing extragrammatical input and whose mostly hand-coded knowledge does not scale up to other domains.

Statistical corpus analysis can be applied, for example, to sublanguages. A sublanguage has a grammar of its own that is not necessarily included in that of the standard language, and its close correspondence between word distribution and information-bearing properties has allowed automated knowledge extraction (e.g. pharmacology literature: Sager) and successful machine translation in restricted domains (e.g. weather reports: Lehrberger).

A hybrid prototype is being developed that attempts to automate the induction of the sublanguage grammar of a technical corpus (the Merck Veterinary Manual). The text is automatically tagged with part-of-speech labels (based on a manually tagged training chapter); these tags are then grouped (bracketed) on the basis of Fano's mutual information statistic. Next, a context-free grammar (CFG) is derived from these unlabeled bracketed sequences (cf. Anderson's LAS), and finally a chart parser is constructed whose agenda is controlled by context-sensitive conditional probabilities.
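The bracketing step can be illustrated with a minimal Python sketch. It is not the prototype's actual procedure, which is not specified here. It assumes that Fano's statistic is computed for adjacent tag pairs as log2(P(x,y) / (P(x)P(y))), and that a tag sequence is bracketed by recursively splitting it at the adjacent pair with the lowest mutual information. The function names, the `unseen` default and the toy tag sequences are all hypothetical.

```python
import math
from collections import Counter


def tag_bigram_mi(tag_sequences):
    """Mutual information log2(P(x,y) / (P(x) P(y))) for adjacent tag pairs.

    Probabilities are relative frequencies over all adjacent pairs, with
    P(x) taken over left positions and P(y) over right positions.
    """
    pairs, left, right = Counter(), Counter(), Counter()
    for seq in tag_sequences:
        for x, y in zip(seq, seq[1:]):
            pairs[(x, y)] += 1
            left[x] += 1
            right[y] += 1
    total = sum(pairs.values())
    return {
        (x, y): math.log2(n * total / (left[x] * right[y]))
        for (x, y), n in pairs.items()
    }


def bracket(tags, mi, unseen=-10.0):
    """Binary-bracket a non-empty tag list (assumed heuristic, see lead-in).

    The sequence is split at the adjacent pair with the lowest mutual
    information (leftmost on ties); pairs never observed get `unseen`.
    """
    if len(tags) == 1:
        return tags[0]
    scores = [mi.get(pair, unseen) for pair in zip(tags, tags[1:])]
    cut = scores.index(min(scores)) + 1
    return (bracket(tags[:cut], mi, unseen), bracket(tags[cut:], mi, unseen))


# Hypothetical tagged sentences standing in for the tagger's output.
toy_corpus = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB"],
    ["NOUN", "VERB", "DET", "ADJ", "NOUN"],
]
mi = tag_bigram_mi(toy_corpus)
print(bracket(["DET", "ADJ", "NOUN", "VERB"], mi))
```

The nested tuples returned by `bracket` are unlabeled brackets of the kind from which a CFG could subsequently be induced. On a corpus this small the brackets are not linguistically meaningful; the sketch only shows the mechanics.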