path: root/data/core/tokenizer.lua
Age | Commit message | Author
2024-04-15Skip patterns matching nothing in `tokenizer` (#1743)Guldoman
These patterns cause infinite loops, so warn about them and skip them.
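A minimal sketch of why an empty match is a problem, assuming a scanner that tries each pattern at the current position and resumes after the end of the match. The function `first_match` and its return values are hypothetical and are not the tokenizer's actual code.

```lua
-- Hypothetical helper: try each pattern at byte position `i` of `text`.
-- Returns the first pattern that consumes at least one byte, the end
-- position of its match, and a list of patterns that matched nothing.
local function first_match(text, i, patterns)
  local warnings = {}
  for _, p in ipairs(patterns) do
    -- A leading "^" anchors the match at the `init` position in string.find.
    local s, e = text:find("^" .. p, i)
    if s then
      if e < s then
        -- Empty match: resuming at `e + 1` would land on `i` again,
        -- so the scanner would never advance. Record it and skip it.
        warnings[#warnings + 1] = p
      else
        return p, e, warnings
      end
    end
  end
  return nil, nil, warnings
end
```

With the input `"12ab"` and the patterns `{ "%a*", "%d+" }`, `"%a*"` matches the empty string at position 1 and is skipped, and `"%d+"` is returned with end position 2.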
2023-11-29Fix patterns starting with `^` in `tokenizer` (#1645)Guldoman
Previously the "dirty" version of the pattern was used, which could produce a pattern with multiple `^` that failed to match valid input.
2023-07-09Return state when tokenizing plaintext syntaxesGuldoman
2023-04-01Allow `tokenizer` to pause and resume in the middle of a line (#1444)Guldoman
2023-02-06Allow groups to be used in end delimiter patterns in tokenizer (#1317)Guldoman
* Allow empty groups as first match in tokenizer
* Avoid pushing tokens with empty strings
* Allow groups to be used in end delimiter in tokenizer
* Use the first entry of the type table for the middle part of a subsyntax
  This applies to delimited matches with a table for `type` and without a `syntax` field.
* Match only once if using `at_start` in tokenizer `find_text`
* Check if match is escaped in the "close" case too
  Also allow continuing matching if the match was escaped.
2022-12-27Fix popping subsyntaxes that end consecutively (#1246)xwii
2022-12-11Add `regex.find_offsets`, `regex.find`, improve `regex.match` (#1232)Guldoman
`regex.match` now behaves like `string.match`. This required changes in the `tokenizer` and in the `detectindent` plugin.
2022-11-15Set initial tokenizer state to a `NULL` byteGuldoman
2022-11-15Add `tokenizer.extract_subsyntaxes`Guldoman
2022-11-03tokenizer: remove the limit of 3 subsyntaxes depth (#1186)Jefferson González
* tokenizer: remove the limit of 3 subsyntaxes depth
  Make the state a string of bytes instead of a 32-bit integer to be able to have deeper subsyntax support. Fixes issues with syntax files like the one for PHP that was already hitting more than 3 subsyntaxes depth.
* remove unnecessary call to set_subsyntax_pattern_idx
* fixed wrong word on comments
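A minimal sketch of the idea behind a byte-string state, assuming one byte per nesting level that stores the index of the pattern which opened that level. The helper names below are hypothetical, and the real encoding in `tokenizer.lua` may differ.

```lua
-- Hypothetical helpers: `state` is a string with one byte per active
-- subsyntax, outermost first. Depth is limited only by string length,
-- not by the number of bits in an integer.
local function push_state(state, pattern_idx)
  assert(pattern_idx >= 0 and pattern_idx <= 255, "index must fit in one byte")
  return state .. string.char(pattern_idx)
end

local function pop_state(state)
  -- Drop the innermost level.
  return state:sub(1, -2)
end

local function current_idx(state)
  -- Pattern index of the innermost level, or nil if the state is empty.
  return state:byte(-1)
end
```

Pushing five levels in a row yields a state of length 5, which a three-level limit could not represent.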
2022-06-22Merge pull request #1040 from Guldoman/PR_tokenizer_errors_alertJefferson González
Add more tokenizer errors/warnings
2022-06-15Merge pull request #1034 from Guldoman/PR_escape_start_patternsJefferson González
Check if "open" pattern is escaped
2022-06-15Warn if token type is a table when not neededGuldoman
2022-06-15Add helper function to report bad patterns in tokenizerGuldoman
2022-06-15Fix malformed pattern check for group patterns in tokenizerGuldoman
If the token type was a simple string (and not a table), the size of the string was used instead of `1`.
2022-06-12Check if "open" pattern is escapedGuldoman
Previously this check was only done for "close" patterns.
2022-06-12Convert more byte offsets to utf-8 pos in regex tokenizerGuldoman
2022-05-31Show error if language plugin pattern has mismatching number of groupsGuldoman
The number of results from a pattern with groups must never be greater than the number of token types for that pattern. Also if a token type was undefined, it's now pushed as a `normal` one.
2022-05-31Fix UTF-8 matches in regex group `tokenizer`Guldoman
2022-05-28Allow using regex groups to split tokensGuldoman
Before, this was only supported by Lua patterns. This expects the regex to use the same syntax used for patterns. That is, the token should be split by empty groups.
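A minimal sketch of splitting a match by empty groups, shown with Lua patterns, where an empty capture `()` returns a position. The function `split_by_groups` and the token layout are assumptions for illustration, not the tokenizer's actual code.

```lua
-- Hypothetical helper: split the first match of `pattern` in `text` into
-- typed tokens, using the positions returned by empty captures `()`.
local function split_by_groups(text, pattern, types)
  local res = { text:find(pattern) }
  local s, e = res[1], res[2]
  if not s then return {} end
  local tokens, start = {}, s
  for i = 3, #res do
    local pos = res[i] -- position reported by the (i - 2)-th empty group
    if pos > start then -- avoid pushing tokens with empty strings
      tokens[#tokens + 1] = {
        type = types[i - 2] or "normal",
        text = text:sub(start, pos - 1),
      }
    end
    start = pos
  end
  if start <= e then
    -- Remainder of the match after the last empty group.
    tokens[#tokens + 1] = {
      type = types[#res - 1] or "normal",
      text = text:sub(start, e),
    }
  end
  return tokens
end
```

For `"local function foo"` with the pattern `"function()%s+()[%a_][%w_]*"` and the types `{ "keyword", "normal", "function" }`, this yields three tokens: `function`, a single space, and `foo`.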
2022-05-13tokenizer: fix next utf8 char retrieval bugjgmdev
2022-04-26Add utf8 support to tokenizer (#945)Jefferson González
* add utf8 support to tokenizer
* wrap utf8 functions in string table using a 'u' prefix
* document new utf8 functions
2022-03-04Force syntax patterns starting with `^` to match with the whole lineGuldoman
Before, syntax patterns/regexes that started with `^` didn't have the desired effect of matching with the start of the line. Now those patterns are used only when matching the whole line.
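A minimal sketch of the underlying issue, assuming the scanner calls `string.find` with an `init` offset: Lua anchors `^` at `init`, not at the start of the line, so a `^` pattern can match in the middle of a line unless it is restricted to the whole-line case. The function `match_at` is hypothetical.

```lua
-- Hypothetical helper: match `pattern` exactly at byte position `i`.
-- Patterns that start with "^" are only tried at the start of the line.
local function match_at(text, pattern, i)
  local anchored = pattern:sub(1, 1) == "^"
  if anchored and i > 1 then
    return nil
  end
  -- Add an anchor only if the pattern does not already have one,
  -- so the final pattern never starts with more than one "^".
  local p = anchored and pattern or ("^" .. pattern)
  return text:find(p, i)
end
```

Calling `("  #include"):find("^#%a+", 3)` directly returns a match at positions 3 to 10, while `match_at("  #include", "^#%a+", 3)` returns `nil`.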
2022-01-12Add bit32 polyfill globallyGuldoman
2021-12-31Migrate to Lua 5.4Jan200101
2021-12-11Consume unmatched character correctlyGuldoman
We must consume the whole UTF-8 character, not just a single byte.
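A minimal sketch of advancing by one whole UTF-8 character, assuming valid UTF-8 input and byte-based indexing. The function `next_char_start` is a hypothetical name, not one taken from `tokenizer.lua`.

```lua
-- Hypothetical helper: return the byte index at which the character
-- following the one that starts at byte `i` begins.
local function next_char_start(text, i)
  i = i + 1
  -- Continuation bytes have the form 10xxxxxx (0x80 to 0xBF); skip them.
  while i <= #text do
    local b = text:byte(i)
    if b < 0x80 or b > 0xBF then break end
    i = i + 1
  end
  return i
end
```

In `"a\xC3\xA9b"` the two-byte character starts at byte 2, so the next character starts at byte 4 rather than byte 3.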
2021-11-23Manual merge of into .Adam Harrison
2021-10-23Fix problem checking utf-8 cont at end of stringFrancesco Abbate
2021-10-11Correctly identify the start of the next character in `tokenizer`Guldoman
When moving to the next character, we have to consider that the current one might be multi-byte.
2021-08-29replace unpack() with table.unpack()takase1121
I have no idea why unpack() was still used or how it still worked.
2021-06-02Add PCRE to support regular expressionsAdam
Use regular expressions instead of Lua patterns for find and replace editor commands. Syntax files can now use regex or Lua patterns as before keeping backward compatibility for plugins.
2021-05-20Tokenizer cleanup (#198)Adam
* Cleaned up tokenizer to make subsyntax operations more clear.
* Explanatory comments.
* Made it so push_subsyntax could be safely called elsewhere.
* Unified terminology.
* Minor bug fix.
* State is an incredibly vaguely named variable. Changed convention to represent what it actually is.
* Also changed function name.
* Fixed bug.
2021-05-19support for multiple groups in one pattern (#196)liquidev
2021-05-18fixed mixed indentationlqdev
2021-05-01Nested Syntax Highlighting (#160)adamharrison
2020-05-14Made tokenizer skip parsing process on plain-text filesrxi
This, along with the earlier rencache changes should resolve #64
2020-05-07Moved highlighter code from `DocView` to `Doc`rxi
* Only one highlighter state is kept per-document as opposed to one per-docview
* Fixes a bug with retaining older highlighter state as a DocView wasn't able to detect lines changing above its viewport
* Renames `highlighter` module to more descriptive `tokenizer`