Jacob Carlborg
2018-10-17 10:36:23 UTC
I recently read this post [1], which shows how to write a regular expression (using the PCRE syntax) that looks very much like a proper grammar. A reduced example from the post looks like this:
/
(?(DEFINE)
(?<addr_spec> (?&local_part) @ (?&domain) )
(?<local_part> (?&dot_atom) | (?&quoted_string) | (?&obs_local_part) )
(?<domain> (?&dot_atom) | (?&domain_literal) | (?&obs_domain) )
)
^(?&addr_spec)$
/x
The three capture groups "addr_spec", "local_part" and "domain" are the grammar rules, and the (?&name) syntax refers to another named group. TextMate does not support that syntax, but it does support \g<name>, which the documentation calls a subexp call [2] and which seems to have the same semantics. (DEFINE) seems to be PCRE-specific: the patterns inside it are never matched where they are written, so it simply provides a place to define subpatterns. I didn't find anything corresponding in the TextMate regular expression syntax, but wrapping the definitions in an optional group can be used as a workaround.
Here's an example where I tried this technique to match a module declaration in the D language:
(?:
(?<module_declaration>(?<module>module)\s+\g<module_fully_qualified_name>\s*;)
(?<module_fully_qualified_name>\g<module_name>|\g<packages>\.\g<module_name>)
(?<module_name>\g<identifier>)
(?<packages>\g<package_name>|\g<package_name>\.\g<packages>)
(?<package_name>\g<identifier>)
(?<identifier>\w+)
)?
\g<module_declaration>
This follows the specified grammar [3] exactly and seems to work as expected. I'm not sure whether the optional group workaround has any performance implications.
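As a cross-check of the rules above, here is a minimal Python sketch. Python's built-in re module has no subexpression calls, so this is not the TextMate pattern itself: it is a hand-flattened equivalent, relying on the fact that "packages" is only right-recursive and therefore unrolls into a repetition. The names MODULE_DECLARATION and parse_module_declaration are made up for the illustration, and fullmatch anchors the pattern, which the original does not.

```python
import re

# Call-free equivalent of the rules above (an assumption of this sketch,
# not the TextMate pattern): `packages` unrolls into (package_name '.')*
# followed by module_name, and every name is an identifier.
IDENTIFIER = r"\w+"
MODULE_DECLARATION = re.compile(
    rf"(?P<module>module)\s+(?P<name>(?:{IDENTIFIER}\.)*{IDENTIFIER})\s*;"
)


def parse_module_declaration(text):
    """Return the fully qualified module name, or None if text does not match."""
    match = MODULE_DECLARATION.fullmatch(text)
    return match.group("name") if match else None


print(parse_module_declaration("module foo;"))           # foo
print(parse_module_declaration("module foo.bar.baz ;"))  # foo.bar.baz
print(parse_module_declaration("module foo.;"))          # None
```

The flattening only works because the recursion sits at the end of the rule. A rule that nests, such as balanced parentheses, would need real subexpression calls.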
This technique seems like it could be a viable alternative to supporting variables in the TextMate grammar, as has been discussed before. What's missing to make it really useful is something like PCRE's (DEFINE) and a place in the TextMate grammar to put generic patterns that are used in multiple rules, like a pattern for identifiers.
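To make the "generic patterns used in multiple rules" idea concrete, here is a rough Python sketch of shared fragments expanded into several rules. Everything in it is hypothetical: the {{name}} placeholder syntax, the FRAGMENTS and RULES tables and the expand helper are inventions for this example, not TextMate features, and the import rule is a deliberately simplified stand-in rather than D's real ImportDeclaration.

```python
import re

# Hypothetical shared fragments, playing the role of grammar-wide definitions.
FRAGMENTS = {
    "identifier": r"\w+",
    "qualified_name": r"(?:{{identifier}}\.)*{{identifier}}",
}

# Hypothetical rules that reuse the fragments. The import rule is simplified.
RULES = {
    "module_declaration": r"module\s+{{qualified_name}}\s*;",
    "import_declaration": r"import\s+{{qualified_name}}\s*;",
}

PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def expand(pattern, depth=0):
    """Replace every {{name}} with its fragment, wrapped in a non-capturing group."""
    if depth > 10:
        # Plain substitution cannot express recursive fragments.
        raise ValueError("fragment definitions are recursive")
    return PLACEHOLDER.sub(
        lambda m: "(?:" + expand(FRAGMENTS[m.group(1)], depth + 1) + ")",
        pattern,
    )


COMPILED = {name: re.compile(expand(rule)) for name, rule in RULES.items()}

print(bool(COMPILED["module_declaration"].fullmatch("module std.stdio;")))  # True
print(bool(COMPILED["import_declaration"].fullmatch("import std.stdio;")))  # True
print(bool(COMPILED["import_declaration"].fullmatch("import ;")))           # False
```

Unlike \g<name>, textual substitution like this cannot handle recursive definitions, which is why the sketch rejects them explicitly.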
[1] https://nikic.github.io/2012/06/15/The-true-power-of-regular-expressions.html
[2] https://macromates.com/manual/en/regular_expressions
[3] https://dlang.org/spec/grammar.html#ModuleDeclaration
--
/Jacob Carlborg