@motioneffector/parser

Documentation

Tokenization API

Internal types for tokens produced by the parser's tokenizer. They are exposed for advanced use cases such as error position mapping.


Token

A single token from the input string.

interface Token {
  type: TokenType
  value: string
  original: string
  start: number
  end: number
}

Property  Type       Description
type      TokenType  The type of token
value     string     Normalized value (lowercase for words)
original  string     Original text before normalization
start     number     Start position in original input (0-indexed)
end       number     End position in original input (exclusive)

Example:

// Input: "Get THE lamp"
// Tokens:
[
  { type: 'WORD', value: 'get', original: 'Get', start: 0, end: 3 },
  { type: 'WORD', value: 'the', original: 'THE', start: 4, end: 7 },
  { type: 'WORD', value: 'lamp', original: 'lamp', start: 8, end: 12 }
]
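
Because start is inclusive and end is exclusive, the two offsets can be passed straight to String.prototype.slice. The following is a minimal sketch of that relationship, using the tokens from the example above. The sourceText helper is hypothetical and not part of the package, and the check has only been reasoned through for WORD tokens; whether the span of a QUOTED_STRING token includes its delimiters is not stated here.

```typescript
// Local copies of the documented shapes so the sketch is self-contained.
type TokenType = 'WORD' | 'QUOTED_STRING' | 'NUMBER' | 'PUNCTUATION' | 'WHITESPACE'

interface Token {
  type: TokenType
  value: string
  original: string
  start: number
  end: number
}

// Hypothetical helper (not part of the package): the raw text a token spans.
function sourceText(input: string, token: Token): string {
  // start is inclusive and end is exclusive, which matches slice().
  return input.slice(token.start, token.end)
}

const input = 'Get THE lamp'
const tokens: Token[] = [
  { type: 'WORD', value: 'get', original: 'Get', start: 0, end: 3 },
  { type: 'WORD', value: 'the', original: 'THE', start: 4, end: 7 },
  { type: 'WORD', value: 'lamp', original: 'lamp', start: 8, end: 12 }
]

// For these WORD tokens, the sliced text equals the token's `original` field.
const spans = tokens.map((t) => sourceText(input, t))
console.log(spans)
```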

TokenType

Type of token produced by the tokenizer.

type TokenType =
  | 'WORD'
  | 'QUOTED_STRING'
  | 'NUMBER'
  | 'PUNCTUATION'
  | 'WHITESPACE'

Type             Description                   Example
'WORD'           Alphanumeric word             get, lamp, café
'QUOTED_STRING'  Text in quotes                "hello world"
'NUMBER'         Numeric value                 123, 42
'PUNCTUATION'    Punctuation character         ., !, ,
'WHITESPACE'     Whitespace (usually skipped)  (space), \t, \n
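
Since TokenType is a closed union, code that branches on it can be checked for exhaustiveness by the compiler. The sketch below shows one way to do that. The isContentToken function is a hypothetical helper, and the split into "content" and "non-content" types is an assumption based on the descriptions above, not something the package defines.

```typescript
// Local copy of the documented union so the sketch is self-contained.
type TokenType = 'WORD' | 'QUOTED_STRING' | 'NUMBER' | 'PUNCTUATION' | 'WHITESPACE'

// Hypothetical helper: does this token type carry text a command would use?
// Which types count as "content" is an assumption made for illustration.
function isContentToken(type: TokenType): boolean {
  switch (type) {
    case 'WORD':
    case 'QUOTED_STRING':
    case 'NUMBER':
      return true
    case 'PUNCTUATION':
    case 'WHITESPACE':
      return false
    default: {
      // If a new member is added to TokenType, this assignment stops compiling.
      const unreachable: never = type
      throw new Error(`Unhandled token type: ${String(unreachable)}`)
    }
  }
}

const allTypes: TokenType[] = ['WORD', 'QUOTED_STRING', 'NUMBER', 'PUNCTUATION', 'WHITESPACE']
const contentTypes = allTypes.filter(isContentToken)
console.log(contentTypes)
```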

Position Mapping

Use token positions to map errors back to the input:

const input = 'get the unicorn from chest'
// `parser` is an already-configured parser instance
const result = parser.parse(input)

if (result.type === 'unknown_noun') {
  const position = result.position

  // Highlight the problem word
  const before = input.slice(0, position)
  const word = result.noun
  const after = input.slice(position + word.length)

  console.log(`${before}[${word}]${after}`)
  // "get the [unicorn] from chest"

  // Or show a caret
  console.log(input)
  console.log(' '.repeat(position) + '^'.repeat(word.length))
  // get the unicorn from chest
  //         ^^^^^^^
}
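
The caret display above can be factored into a small reusable function that takes a start and an exclusive end offset, such as the start and end of a Token. This is only a sketch: caretDiagram is a hypothetical name, not a function the package provides, and the clamping behavior is a design choice made here so that out-of-range offsets never cause repeat() to throw.

```typescript
// Hypothetical helper: render the input with a row of carets under [start, end).
function caretDiagram(input: string, start: number, end: number): string {
  // Clamp both offsets into the input so repeat() never receives a negative count.
  const s = Math.max(0, Math.min(start, input.length))
  const e = Math.max(s, Math.min(end, input.length))
  // Draw at least one caret so an empty span is still visible.
  const width = Math.max(1, e - s)
  return `${input}\n${' '.repeat(s)}${'^'.repeat(width)}`
}

const diagram = caretDiagram('get the unicorn from chest', 8, 15)
console.log(diagram)
// get the unicorn from chest
//         ^^^^^^^
```

Working from start and end rather than from the length of a normalized word avoids any mismatch if normalization ever changes the length of the text.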

Tokenization Behavior

Word normalization:

  • Converted to lowercase
  • Trailing punctuation stripped
  • Unicode characters preserved

Quoted strings:

  • Delimiters: " or '
  • Escape sequences: \", \', \\
  • Value excludes quotes

Whitespace:

  • Tabs, newlines, spaces all treated as separators
  • Multiple consecutive whitespace collapsed
  • Not included in token output

Punctuation:

  • Stripped from word ends
  • Standalone punctuation skipped
  • Commas between items don't create lists

Examples:

// Lowercase
"GET LAMP" → [{ value: 'get' }, { value: 'lamp' }]

// Trailing punctuation stripped
"look." → [{ value: 'look' }]

// Quoted strings
'say "Hello!"' → [{ value: 'say' }, { value: 'Hello!', type: 'QUOTED_STRING' }]

// Unicode preserved
"get café" → [{ value: 'get' }, { value: 'café' }]

// Whitespace variations
"get\tlamp" → [{ value: 'get' }, { value: 'lamp' }]
"get\nlamp" → [{ value: 'get' }, { value: 'lamp' }]

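The word rules above (lowercasing, stripping trailing punctuation, treating any run of whitespace as a separator, skipping standalone punctuation) can be approximated in a few lines. This is an illustrative sketch only, not the package's tokenizer: normalizeWord and roughValues are hypothetical names, the set of characters treated as trailing punctuation is an assumption, and quoted strings, numbers, and positions are deliberately not handled.

```typescript
// Assumed punctuation set; the characters the real tokenizer strips may differ.
const TRAILING_PUNCTUATION = /[.,!?;:]+$/

// Hypothetical helper: strip trailing punctuation, then lowercase.
// toLowerCase() leaves non-ASCII letters such as "é" intact.
function normalizeWord(raw: string): string {
  return raw.replace(TRAILING_PUNCTUATION, '').toLowerCase()
}

// Hypothetical helper: approximate the `value` fields for plain words.
function roughValues(input: string): string[] {
  return input
    .split(/\s+/) // spaces, tabs and newlines all separate; runs are collapsed
    .map(normalizeWord)
    .filter((v) => v.length > 0) // drops standalone punctuation and empty pieces
}

console.log(roughValues('GET LAMP'))     // [ 'get', 'lamp' ]
console.log(roughValues('look.'))        // [ 'look' ]
console.log(roughValues('get\tlamp'))    // [ 'get', 'lamp' ]
console.log(roughValues('get   café !')) // [ 'get', 'café' ]
```
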
Use Cases

Error display: Use position from UnknownNounResult or ParseErrorResult to highlight problems in the original input.

Custom preprocessing: If you need to normalize or transform input before parsing, token positions let you map results back to the original.
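
One way to keep that mapping is to record, for every character that survives preprocessing, the index it came from. The sketch below assumes that the positions the parser reports index into the string actually passed to parse(), so a preprocessing step needs its own offset table to get back to what the user typed. All names here (Preprocessed, dropChars, toOriginalPosition) are hypothetical, and the reported position is hard-coded rather than taken from a real parse result.

```typescript
// Result of a preprocessing pass that remembers where each kept character came from.
interface Preprocessed {
  text: string
  offsets: number[] // offsets[i] is the index in the original input of text[i]
  originalLength: number
}

// Hypothetical preprocessing step: drop every character for which shouldDrop is true.
function dropChars(original: string, shouldDrop: (ch: string) => boolean): Preprocessed {
  let text = ''
  const offsets: number[] = []
  for (let i = 0; i < original.length; i++) {
    const ch = original.charAt(i)
    if (shouldDrop(ch)) continue
    text += ch
    offsets.push(i)
  }
  return { text, offsets, originalLength: original.length }
}

// Translate a position in the preprocessed text back to the original input.
function toOriginalPosition(pre: Preprocessed, position: number): number {
  const mapped = pre.offsets[position]
  // Positions outside the preprocessed text map to the end of the original.
  return mapped === undefined ? pre.originalLength : mapped
}

const original = 'get the *unicorn* from chest'
const pre = dropChars(original, (ch) => ch === '*')

// Assumed for illustration: the parser reports position 8 for "unicorn" in pre.text.
const reported = 8
const mappedPosition = toOriginalPosition(pre, reported)
console.log(pre.text)       // get the unicorn from chest
console.log(mappedPosition) // 9
```

The same table works for any preprocessing that only deletes characters; steps that insert or replace text would need a richer mapping.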

Debugging: Understanding tokenization helps diagnose why certain inputs parse unexpectedly.