Peter A. Machonis
 

Phrasal verbs have recently received much attention in Natural Language Processing (Jackendoff 2002, Sag et al. 2002, Villavicencio 2005).  This paper attempts to parse English phrasal verbs, with and without insertion, in large corpora.  We used manually constructed lexicon-grammar tables (Gross 1996) of over 1,200 transitive and ergative phrasal verbs, along with NooJ, a linguistic development environment that parses texts using large-scale dictionaries and local grammars.  To improve recall, NooJ uses a Text Annotation Structure that holds all unsolved ambiguities (Silberztein 2007).
           
The lexicon-grammar tables of phrasal verbs were used to compile a NooJ phrasal verb dictionary, which was then applied together with a local grammar to identify phrasal verbs in the Henry James novel The Portrait of a Lady.  In a lexical and syntactic analysis of the entire text, NooJ identified 583 potential phrasal verbs.  Although the overall accuracy was only 84%, the concordances generated from our most exhaustive tables (the particles up and out) achieved 97% accuracy.
Furthermore, the NooJ grammar and dictionary correctly identified over 80 discontinuous instances of phrasal verbs, such as:
 

Shall I show the gentleman up, ma'am?
She has reasoned the matter well out
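
The matching of continuous and discontinuous instances can be illustrated with a minimal Python sketch. It is not NooJ and not the grammar used in this study: the two-entry lexicon, the lemma map, the function names, and the max_gap window are all hypothetical stand-ins for the dictionary and local grammar described above.

```python
import re

# Hypothetical toy lexicon of (verb lemma, particle) pairs.
# The lexicon-grammar tables described in the text hold over 1,200 entries.
PHRASAL_VERBS = {("show", "up"), ("reason", "out")}

# Hypothetical map from inflected forms to lemmas, standing in for a
# full morphological dictionary.
LEMMAS = {
    "show": "show", "shows": "show", "showed": "show",
    "reason": "reason", "reasoned": "reason",
}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (apostrophes allowed)."""
    return re.findall(r"[A-Za-z']+", text.lower())


def find_phrasal_verbs(text, max_gap=4):
    """Return (lemma, particle, gap) triples found in the text.

    gap is the number of tokens inserted between verb and particle;
    gap == 0 is a continuous instance. max_gap is an assumed window size.
    """
    tokens = tokenize(text)
    matches = []
    for i, token in enumerate(tokens):
        lemma = LEMMAS.get(token)
        if lemma is None:
            continue
        # Scan the tokens after the verb, allowing up to max_gap insertions.
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if (lemma, tokens[j]) in PHRASAL_VERBS:
                matches.append((lemma, tokens[j], j - i - 1))
                break
    return matches


print(find_phrasal_verbs("Shall I show the gentleman up, ma'am?"))
print(find_phrasal_verbs("She has reasoned the matter well out"))
```

A surface window of this kind has no syntactic knowledge, which is why a real system needs the selectional and structural information encoded in the lexicon-grammar tables.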

In fact, most of the problems encountered were with continuous examples.  Nouns were sometimes interpreted as part of phrasal verbs:

He carried his hands in his pockets

This happens because the phrasal verb dictionary contains the expression hand something in, meaning “return.”  False positives also highlighted the problem of distinguishing particles from prepositions:

Ralph asked while they moved along the platform
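
Both kinds of false positive can be reproduced with a small Python sketch of surface matching. Everything in it is an assumption made for illustration: the lexicon (the entry for move along in particular is hypothetical), the lemma map, and the determiner cue, which is a crude heuristic and not the method used in this study.

```python
# Hypothetical toy lexicon; "hand in" is mentioned in the text,
# "move along" is assumed here for illustration only.
LEXICON = {("hand", "in"), ("move", "along")}

# Hypothetical form-to-lemma map. "hands" is ambiguous between noun and verb.
LEMMAS = {"hand": "hand", "hands": "hand", "handed": "hand",
          "move": "move", "moved": "move"}

# Assumed cue words: a form that directly follows one of these is
# more likely a noun than a verb.
NOUN_CUES = {"a", "an", "the", "my", "your", "his", "her", "its", "our", "their"}


def naive_matches(sentence):
    """Flag every verb form directly followed by a listed particle."""
    tokens = sentence.lower().split()
    hits = []
    for i in range(len(tokens) - 1):
        lemma = LEMMAS.get(tokens[i])
        if lemma and (lemma, tokens[i + 1]) in LEXICON:
            hits.append((i, lemma, tokens[i + 1]))
    return hits


def filtered_matches(sentence):
    """Drop hits whose 'verb' follows a determiner or possessive."""
    tokens = sentence.lower().split()
    return [hit for hit in naive_matches(sentence)
            if hit[0] == 0 or tokens[hit[0] - 1] not in NOUN_CUES]


for s in ("He carried his hands in his pockets",
          "Ralph asked while they moved along the platform"):
    print(s, naive_matches(s), filtered_matches(s))
```

In this sketch the cue removes the noun reading of hands, but the second sentence is still flagged, since telling a particle from a preposition requires information about the complement rather than about the preceding word.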

Although the Text Annotation Structures also showed the correct interpretations of these sentences, expanding the selectional restrictions on complements in the lexicon-grammar of phrasal verbs to include semantic classes and domains (Le Pesant & Mathieu-Colas 1998) might help resolve ambiguities such as these.  Recall will also be improved in the future by adding intransitive phrasal verbs and phrasal prepositional verbs to the NooJ database.
           Nevertheless, using NooJ to identify phrasal verbs in large corpora has shown that it is indeed a powerful linguistic tool that can help solve a key problem in Natural Language Processing. 

 

Works Cited
         
           
Gross, Maurice. 1996. Lexicon grammar. In Concise Encyclopedia of Syntactic Theories.  Keith Brown and Jim Miller, eds. 244-258. New York: Elsevier. 
Jackendoff, Ray.  2002. English particle constructions, the lexicon, and the autonomy of syntax. In Verb-particle explorations. Nicole Dehe, Ray Jackendoff, Andrew McIntyre & Silke Urban, eds. 67-94. New York: Mouton de Gruyter.
           
Le Pesant, Denis & Michel Mathieu-Colas. 1998. Introduction aux classes d’objets. Langages 131: 6-33.
           
Sag, Ivan A., Timothy Baldwin, Francis Bond, Ann Copestake, and Dan Flickinger.  2002. Multiword expressions: a pain in the neck for NLP. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics.  1-15. Mexico City: CICLING.
           
Silberztein, Max.  2002-2006.  NooJ Manual.  http://www.nooj4nlp.net/
           
Silberztein, Max. 2007. An Alternative to tagging. Proceedings of NLDB 2007. Lecture Notes in Computer Science 1-11. Berlin: Springer Verlag.
Villavicencio, Aline. 2005. The availability of verb-particle constructions in lexical resources: How much is enough? Computer Speech and Language 19: 415-432.