Tokenization API
These internal types describe the tokens produced by the parser's tokenizer. They are exposed for advanced use cases such as mapping errors back to positions in the input.
Token
A single token from the input string.
interface Token {
  type: TokenType
  value: string
  original: string
  start: number
  end: number
}
| Property | Type | Description |
|---|---|---|
| `type` | `TokenType` | The type of token |
| `value` | `string` | Normalized value (lowercase for words) |
| `original` | `string` | Original text before normalization |
| `start` | `number` | Start position in original input (0-indexed) |
| `end` | `number` | End position in original input (exclusive) |
Example:
// Input: "Get THE lamp"
// Tokens:
[
  { type: 'WORD', value: 'get', original: 'Get', start: 0, end: 3 },
  { type: 'WORD', value: 'the', original: 'THE', start: 4, end: 7 },
  { type: 'WORD', value: 'lamp', original: 'lamp', start: 8, end: 12 }
]
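The positions can be checked against the input: in the example above, slicing the input from `start` to `end` gives back `original`. The sketch below assumes that relationship holds for word tokens as shown (the source does not say whether it also holds for quoted strings), and `sliceToken` is a hypothetical helper, not part of the API.

```typescript
// Local copies of the documented shapes so the sketch is self-contained.
type TokenType = 'WORD' | 'QUOTED_STRING' | 'NUMBER' | 'PUNCTUATION' | 'WHITESPACE'

interface Token {
  type: TokenType
  value: string
  original: string
  start: number
  end: number
}

// Hypothetical helper: recover the raw text a token covers.
// `end` is exclusive, so it can be passed straight to slice().
function sliceToken(input: string, token: Token): string {
  return input.slice(token.start, token.end)
}

const input = 'Get THE lamp'
const tokens: Token[] = [
  { type: 'WORD', value: 'get', original: 'Get', start: 0, end: 3 },
  { type: 'WORD', value: 'the', original: 'THE', start: 4, end: 7 },
  { type: 'WORD', value: 'lamp', original: 'lamp', start: 8, end: 12 },
]

const slices = tokens.map((t) => sliceToken(input, t))
console.log(slices) // [ 'Get', 'THE', 'lamp' ]
```

Because `end` is exclusive, `end - start` is also the length of the covered text.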
TokenType
Type of token produced by the tokenizer.
type TokenType =
  | 'WORD'
  | 'QUOTED_STRING'
  | 'NUMBER'
  | 'PUNCTUATION'
  | 'WHITESPACE'
| Type | Description | Example |
|---|---|---|
| `'WORD'` | Alphanumeric word | `get`, `lamp`, `café` |
| `'QUOTED_STRING'` | Text in quotes | `"hello world"` |
| `'NUMBER'` | Numeric value | `123`, `42` |
| `'PUNCTUATION'` | Punctuation character | `.`, `!`, `,` |
| `'WHITESPACE'` | Whitespace (usually skipped) | space, `\t`, `\n` |
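Because `TokenType` is a union of string literals, tokens can be selected with ordinary comparisons. A minimal sketch of filtering a token list by type, assuming the consumer already holds `Token` objects shaped as documented; `ofType` and the hand-written sample tokens are illustrative and not part of the API.

```typescript
type TokenType = 'WORD' | 'QUOTED_STRING' | 'NUMBER' | 'PUNCTUATION' | 'WHITESPACE'

interface Token {
  type: TokenType
  value: string
  original: string
  start: number
  end: number
}

// Hypothetical helper: keep only tokens whose type is one of `types`.
function ofType(tokens: Token[], ...types: TokenType[]): Token[] {
  return tokens.filter((t) => types.indexOf(t.type) !== -1)
}

// Hand-written sample for the input "take 2 coins" (not real tokenizer output).
const sample: Token[] = [
  { type: 'WORD', value: 'take', original: 'take', start: 0, end: 4 },
  { type: 'NUMBER', value: '2', original: '2', start: 5, end: 6 },
  { type: 'WORD', value: 'coins', original: 'coins', start: 7, end: 12 },
]

const words = ofType(sample, 'WORD').map((t) => t.value)
// `value` is always a string, so numeric tokens need an explicit conversion.
const numbers = ofType(sample, 'NUMBER').map((t) => Number(t.value))

console.log(words)   // [ 'take', 'coins' ]
console.log(numbers) // [ 2 ]
```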
Position Mapping
Use token positions to map errors back to input:
const input = 'get the unicorn from chest'
const result = parser.parse(input)
if (result.type === 'unknown_noun') {
  const position = result.position
  // Highlight the problem word
  const before = input.slice(0, position)
  const word = result.noun
  const after = input.slice(position + word.length)
  console.log(`${before}[${word}]${after}`)
  // "get the [unicorn] from chest"
  // Or show a caret
  console.log(input)
  console.log(' '.repeat(position) + '^'.repeat(word.length))
  // get the unicorn from chest
  //         ^^^^^^^
}
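The highlighting logic can be factored into small reusable helpers. This is a sketch only: `bracketSpan` and `caretLine` are hypothetical names, and the position is computed with `indexOf` here as a stand-in for `result.position` so the example runs without a parser.

```typescript
// Wrap the span [position, position + length) in square brackets.
function bracketSpan(input: string, position: number, length: number): string {
  const before = input.slice(0, position)
  const target = input.slice(position, position + length)
  const after = input.slice(position + length)
  return `${before}[${target}]${after}`
}

// Build a line of carets that lines up under the span when printed below the input.
function caretLine(position: number, length: number): string {
  return ' '.repeat(position) + '^'.repeat(length)
}

const input = 'get the unicorn from chest'
const noun = 'unicorn'
const position = input.indexOf(noun) // 8; stands in for result.position

const bracketed = bracketSpan(input, position, noun.length)
const carets = caretLine(position, noun.length)

console.log(bracketed) // get the [unicorn] from chest
console.log(input)
console.log(carets)
```

The caret line only aligns in a monospaced display, and it assumes the input contains no tabs or wide characters before the highlighted span.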
Tokenization Behavior
Word normalization:
- Converted to lowercase
- Trailing punctuation stripped
- Unicode characters preserved
Quoted strings:
- Delimiters: `"` or `'`
- Escape sequences: `\"`, `\'`, `\\`
- Value excludes quotes
Whitespace:
- Tabs, newlines, spaces all treated as separators
- Runs of consecutive whitespace collapsed
- Not included in token output
Punctuation:
- Stripped from word ends
- Standalone punctuation skipped
- Commas between items don't create lists
Examples:
// Lowercase
"GET LAMP" → [{ value: 'get' }, { value: 'lamp' }]
// Trailing punctuation stripped
"look." → [{ value: 'look' }]
// Quoted strings
'say "Hello!"' → [{ value: 'say' }, { value: 'Hello!', type: 'QUOTED_STRING' }]
// Unicode preserved
"get café" → [{ value: 'get' }, { value: 'café' }]
// Whitespace variations
"get\tlamp" → [{ value: 'get' }, { value: 'lamp' }]
"get\nlamp" → [{ value: 'get' }, { value: 'lamp' }]
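To make the rules above concrete, here is a toy tokenizer that follows them. It is not the library's implementation: `toyTokenize` is a hypothetical name, and several details are assumptions, namely the set of trailing punctuation characters (`.,!?;:`), treating a backslash as escaping whatever character follows it, recognising quotes only at the start of a token, and classifying all-digit text as `NUMBER`.

```typescript
type TokenType = 'WORD' | 'QUOTED_STRING' | 'NUMBER' | 'PUNCTUATION' | 'WHITESPACE'

interface Token {
  type: TokenType
  value: string
  original: string
  start: number
  end: number
}

function toyTokenize(input: string): Token[] {
  const tokens: Token[] = []
  let i = 0
  while (i < input.length) {
    const ch = input.charAt(i)
    // Whitespace only separates tokens; nothing is emitted for it.
    if (/\s/.test(ch)) {
      i++
      continue
    }
    const start = i
    if (ch === '"' || ch === "'") {
      // Quoted string: read up to the matching delimiter.
      let value = ''
      i++
      while (i < input.length && input.charAt(i) !== ch) {
        // Assumption: a backslash escapes the next character, whatever it is.
        if (input.charAt(i) === '\\' && i + 1 < input.length) i++
        value += input.charAt(i)
        i++
      }
      if (i < input.length) i++ // consume the closing quote
      // Assumption: `original` and the positions include the quotes.
      tokens.push({ type: 'QUOTED_STRING', value, original: input.slice(start, i), start, end: i })
      continue
    }
    // Otherwise read a run of non-whitespace characters.
    while (i < input.length && !/\s/.test(input.charAt(i))) i++
    const raw = input.slice(start, i)
    // Strip trailing punctuation (assumed character set).
    const core = raw.replace(/[.,!?;:]+$/, '')
    if (core === '') continue // standalone punctuation is skipped
    const type: TokenType = /^\d+$/.test(core) ? 'NUMBER' : 'WORD'
    tokens.push({
      type,
      value: core.toLowerCase(), // Unicode letters survive lowercasing
      original: core,
      start,
      end: start + core.length,
    })
  }
  return tokens
}

const values = (s: string): string[] => toyTokenize(s).map((t) => t.value)

console.log(values('GET LAMP'))  // [ 'get', 'lamp' ]
console.log(values('look.'))     // [ 'look' ]
console.log(values('get\tlamp')) // [ 'get', 'lamp' ]
```

This sketch never emits `PUNCTUATION` or `WHITESPACE` tokens, which matches the behavior listed above but may differ from what the real tokenizer does internally.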
Use Cases
Error display: Use `position` from `UnknownNounResult` or `ParseErrorResult` to highlight problems in the original input.
Custom preprocessing: If you need to normalize or transform input before parsing, token positions let you map results back to the original.
Debugging: Understanding tokenization helps diagnose why certain inputs parse unexpectedly.
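For the custom preprocessing case, one way to keep positions meaningful is to record, for every character of the preprocessed text, where it came from in the original. The sketch below illustrates that bookkeeping with an offset table; it is an assumption about how a caller might do this, not something the library provides, and `collapseWhitespace` is a hypothetical name.

```typescript
interface Preprocessed {
  text: string
  // offsets[i] is the index in the original input of text.charAt(i)
  offsets: number[]
}

// Example preprocessing step: trim leading whitespace and collapse runs
// of whitespace to a single space, remembering original indices.
function collapseWhitespace(input: string): Preprocessed {
  let text = ''
  const offsets: number[] = []
  let lastWasSpace = true // true at the start so leading whitespace is dropped
  for (let i = 0; i < input.length; i++) {
    const ch = input.charAt(i)
    const isSpace = /\s/.test(ch)
    if (isSpace && lastWasSpace) continue
    text += isSpace ? ' ' : ch
    offsets.push(i)
    lastWasSpace = isSpace
  }
  return { text, offsets }
}

const original = '  get   the unicorn'
const pre = collapseWhitespace(original)

// Stand-in for a position reported against the preprocessed text.
const reported = pre.text.indexOf('unicorn')
// Translate it back to an index in the text the user actually typed.
const mapped = pre.offsets[reported]

console.log(pre.text) // get the unicorn
console.log(reported) // 8
console.log(mapped)   // 12
```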