Multi-Word Expression (MWE) processing.
Post-tokenization pipeline that collapses multi-token sequences into single semantic units (e.g., “fire engine” → FireEngine).
§How It Works
The MWE pipeline runs between lexing and parsing:
- Build a trie from known multi-word expressions
- Scan the token stream for matches using apply_mwe_pipeline
- Replace matched sequences with single tokens
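The three steps above can be sketched in a few lines of Rust. This is a minimal illustration, not the module's actual code: `TrieNode`, `build_trie`, and `collapse` are hypothetical names, tokens are plain strings, and the scan is assumed to prefer the longest match at each position. The real signatures of `build_mwe_trie` and `apply_mwe_pipeline` are not shown on this page.

```rust
use std::collections::HashMap;

/// Hypothetical trie node (assumption: not the crate's real type).
#[derive(Default)]
struct TrieNode {
    children: HashMap<String, TrieNode>,
    /// Name of the collapsed unit if a known expression ends at this node.
    collapsed: Option<String>,
}

/// Step 1: build a trie from (space-separated phrase, collapsed name) pairs.
fn build_trie(entries: &[(&str, &str)]) -> TrieNode {
    let mut root = TrieNode::default();
    for (phrase, name) in entries {
        let mut node = &mut root;
        for word in phrase.split_whitespace() {
            node = node.children.entry(word.to_string()).or_default();
        }
        node.collapsed = Some(name.to_string());
    }
    root
}

/// Steps 2 and 3: scan the token stream and replace each longest match
/// with a single token; unmatched tokens are copied through unchanged.
fn collapse(tokens: &[&str], trie: &TrieNode) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        // Walk the trie from position i, remembering the longest complete match.
        let mut node = trie;
        let mut best: Option<(usize, String)> = None;
        for (offset, tok) in tokens[i..].iter().enumerate() {
            match node.children.get(*tok) {
                Some(next) => {
                    node = next;
                    if let Some(name) = &node.collapsed {
                        best = Some((offset + 1, name.clone()));
                    }
                }
                None => break,
            }
        }
        match best {
            Some((len, name)) => {
                out.push(name);
                i += len;
            }
            None => {
                out.push(tokens[i].to_string());
                i += 1;
            }
        }
    }
    out
}
```

With a trie built from "fire engine" and "in order to", the stream `the fire engine arrived` collapses to three tokens, while an incomplete prefix such as `in order` is left untouched.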
§Supported MWE Types
- Compound nouns: “fire engine”, “ice cream”
- Phrasal verbs: “look up”, “give in”
- Fixed phrases: “in order to”, “as well as”
§Key Functions
- build_mwe_trie: Construct the MWE lookup trie
- apply_mwe_pipeline: Transform token stream by collapsing MWEs
Structs§
Functions§
- apply_mwe_pipeline - Apply MWE collapsing to a token stream. Matches on lemmas (not raw strings) to handle morphological variants.
- build_mwe_trie - Build the MWE trie from lexicon data.
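To illustrate why matching on lemmas matters, the sketch below compares lemma-based and raw-string matching for the phrasal verb "look up". The `Token` struct and both helper functions are hypothetical and exist only for this example; the crate's own token type is not documented on this page.

```rust
/// Hypothetical token carrying both the surface form and its lemma.
struct Token {
    raw: &'static str,
    lemma: &'static str,
}

/// Start index of the first window whose lemmas equal `pattern`, if any.
fn find_by_lemma(tokens: &[Token], pattern: &[&str]) -> Option<usize> {
    if pattern.is_empty() {
        return None; // `windows(0)` would panic
    }
    tokens
        .windows(pattern.len())
        .position(|w| w.iter().map(|t| t.lemma).eq(pattern.iter().copied()))
}

/// Same search, but comparing raw surface strings instead of lemmas.
fn find_by_raw(tokens: &[Token], pattern: &[&str]) -> Option<usize> {
    if pattern.is_empty() {
        return None;
    }
    tokens
        .windows(pattern.len())
        .position(|w| w.iter().map(|t| t.raw).eq(pattern.iter().copied()))
}
```

For the tokens `She looked up` with lemmas `she look up`, the lemma search finds "look up" at index 1, whereas the raw-string search finds nothing because "looked" differs from "look".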