sharing programmer-to-programmer. please enjoy. pages are formatted for landscape tablet, laptop, or monitor.

yREGEX, regular expressions

greek, artemes-agrotera (the huntress)

in production

yREGEX is a specialized regular expression library for my everyday, low-to-moderate volume use. it is a successful experiment built using kernighan’s finite-automata algorithm rather than recursion.
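for readers who have not seen the non-recursive approach, here is a minimal sketch of the general idea: keep a set of active states and advance all of them one input character at a time. this is not yREGEX's code. the tiny pattern language (literals, `.`, and postfix `*`) and the names `compile_tokens`, `closure`, and `full_match` are assumptions made up for illustration.

```python
def compile_tokens(pattern):
    """Split a tiny pattern into (char, starred) tokens.

    Only literals, '.', and a postfix '*' are understood (hypothetical mini-grammar).
    """
    tokens = []
    i = 0
    while i < len(pattern):
        starred = i + 1 < len(pattern) and pattern[i + 1] == "*"
        tokens.append((pattern[i], starred))
        i += 2 if starred else 1
    return tokens


def closure(tokens, states):
    """Add every state reachable by skipping starred tokens (zero repetitions)."""
    stack = list(states)
    seen = set(states)
    while stack:
        s = stack.pop()
        if s < len(tokens) and tokens[s][1] and s + 1 not in seen:
            seen.add(s + 1)
            stack.append(s + 1)
    return seen


def full_match(pattern, text):
    """Return True if the whole text matches; uses a loop over state sets, no recursion."""
    tokens = compile_tokens(pattern)
    states = closure(tokens, {0})
    for ch in text:
        nxt = set()
        for s in states:
            if s == len(tokens):
                continue  # already past the last token, cannot consume more
            c, starred = tokens[s]
            if c == "." or c == ch:
                # a starred token may repeat, so it stays in place
                nxt.add(s if starred else s + 1)
        states = closure(tokens, nxt)
        if not states:
            return False
    return len(tokens) in states
```

because every live state is carried forward together, the trace at each character is just a small set of numbers, which is easy to print and check by eye.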

did i really need to build this? heck no. this was a moment of insanity that really taught me some great stuff. no regrets. it’s definitely slower than regcomp/regexec, but it’s plenty fast for my purposes.

this version maximizes traceability and human verification. computer science has lots of mathy ways to “prove” results, but i needed to supplement unit testing with human verification of “how” the results were generated.

i sacrificed speed, but it’s fast in my sweet spots. i can easily gain an order of magnitude in performance by NOT saving all finds, stopping after the first result, and converting execution to byte-code. eventually.
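the difference between saving every find and stopping after the first can be sketched with python's standard `re` module standing in for yREGEX. the helper `iter_finds` and the sample sentence are assumptions for illustration, and no timing claim is implied.

```python
import re


def iter_finds(pattern, text):
    """Lazily yield (beg, length, found) for each non-overlapping find."""
    for m in re.finditer(pattern, text):
        yield m.start(), len(m.group()), m.group()


text = "alice was beginning to get very tired of sitting by her sister"
pattern = r"\b\w*ing\b"

# saves every find, as a debugging report would need
all_finds = list(iter_finds(pattern, text))

# stops after the first find; the rest of the text is never scanned
first_find = next(iter_finds(pattern, text), None)
```

since the generator is lazy, the caller decides how much work is done: `list(...)` pays for everything, `next(...)` pays for one.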

source data

to start, i took an unmanicured text, alice in wonderland, for the input. it likely has spelling issues and whatnot, but that is not a concern. i just needed a large text with interesting word patterns. lewis carroll is nothing if not interesting.

i could have gone the dull route and parsed html or xml. but, but que diablo! who would do such a thing ;)

when reading the text, regex will group lines into paragraphs to be more realistic: longer, meaningful strings, as the author actually meant them.
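a minimal sketch of that kind of grouping, assuming paragraphs are separated by blank lines (the source format is not specified here, so that is a guess). the function name `paragraphs` is hypothetical. it keeps the beg and end line numbers so each paragraph can be traced back to the source.

```python
def paragraphs(lines):
    """Yield (beg_line, end_line, text) for each run of non-blank lines (1-based)."""
    beg = end = None
    buf = []
    for num, line in enumerate(lines, start=1):
        stripped = line.strip()
        if stripped:
            if beg is None:
                beg = num          # first line of a new paragraph
            end = num              # last non-blank line seen so far
            buf.append(stripped)
        elif buf:
            # a blank line closes the current paragraph
            yield beg, end, " ".join(buf)
            beg, buf = None, []
    if buf:
        yield beg, end, " ".join(buf)


sample = [
    "Alice was beginning to get very tired",
    "of sitting by her sister on the bank,",
    "",
    "So she was considering in her own mind",
]
grouped = list(paragraphs(sample))
```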

the example will start on the full text, but later runs will move to a single paragraph to decrease the verbosity.

obviously, the results are normally presented like other regex engines, but i will show debugging reports below.

first, a complicated search regex (explained later). the purple line near the top is the regex, as entered.

each green line is for a paragraph showing beg and end lines in source. cnt is the number of finds in that text.

brown lines introduce the find details: beg pos and len of each find, with the actual find to the right.

since i have used “shotgun” start for the finds, the shortest/laziest, regardless of pos, is always first (see len column)

the start method is configurable, but shotgun always has the longest/greediest last in the list. really helpful.
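one way to picture the ordering described above: collect every span that matches, from every start position, then sort by length so the shortest comes first and the longest last. this is only my reading of the description, not yREGEX's algorithm. it is a brute-force sketch using python's `re`, with the hypothetical name `all_finds_by_length`, and is only sensible on short text.

```python
import re


def all_finds_by_length(pattern, text):
    """Return every (beg, length, found) span that matches, shortest first."""
    rx = re.compile(pattern)
    finds = []
    for beg in range(len(text)):
        for end in range(beg + 1, len(text) + 1):
            # fullmatch with pos/endpos tests exactly the span text[beg:end]
            if rx.fullmatch(text, beg, end):
                finds.append((beg, end - beg, text[beg:end]))
    # sort on length first, then position, so pos never outranks len
    finds.sort(key=lambda f: (f[1], f[0]))
    return finds


finds = all_finds_by_length(r"a.*a", "banana")
```

with the sort key on length, the laziest find is always `finds[0]` and the greediest is always `finds[-1]`, wherever they sit in the text.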

now that the report layout is somewhat clearer, here’s a quick literal search on only the first paragraph.

next, an OR group and word beg and end anchors…
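for comparison, the same kind of search in python's `re` syntax, where `\b` is the word boundary and `|` is the OR. yREGEX's own anchor symbols may differ; the sample sentence is made up.

```python
import re

text = "alice saw the white rabbit, but not the rabbits or malice"

# whole words only: "rabbits" and "malice" must not match
or_finds = [(m.start(), m.group())
            for m in re.finditer(r"\b(alice|rabbit)\b", text)]
```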

then, a classic search for any length string beginning and ending with several possibilities.

to show the grouping capability, the first numbered group (#) is repeated with &1 at the end. obviously, this turns up many results. you can have 9 numbered groups 1-9.
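the same idea in python's `re`, where the back-reference is written `\1` rather than `&1`. this is only an analogy to show what repeating a numbered group means; the sample sentence is made up.

```python
import re

text = "the queen said off with her head and the king said nothing"

# group 1 captures the opening word, \1 requires the same word again at the end
m = re.search(r"\b(the|her)\b.*\b\1\b", text)
opening = m.group(1)
found = m.group()
```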

now, adding a rule at the end to make sure the string at the start and end are different. rules are always very useful and multiple rules can be combined.
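python's `re` has no rule syntax, so a rough equivalent is a filter applied to each find after matching. a sketch, assuming two captured groups and a "must differ" condition; the sample sentence is made up and yREGEX's actual rule notation is not shown here.

```python
import re

text = "the cat and the hat, then her book and the end"
pattern = r"\b(the|her)\b.*?\b(the|her)\b"

# keep only finds whose starting and ending words are different
kept = [m.group()
        for m in re.finditer(pattern, text)
        if m.group(1) != m.group(2)]
```

the first find, "the cat and the", starts and ends with the same word and is dropped by the filter.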

finally, i came differently at look-ahead and look-behind requirements. i created a special focus group that reports as group 0.

below, the focus area is shown in green font. this is a simple version of look-ahead style use.

the focus regex specification is shown in the middle of the purple regex line.
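to see why a focus group can stand in for look-around, compare the two styles in python's `re`. here an ordinary capturing group plays the part of the focus group; yREGEX reports its focus as group 0, while python numbers the capture as group 1. the sample sentence is made up.

```python
import re

text = "the queen said off with her head, and the king said nothing"

# look-behind and look-ahead: context is tested but not part of the find
look = [m.group() for m in re.finditer(r"(?<=the )\w+(?= said)", text)]

# focus-style: match the whole context, then report only the inner group
focus = [m.group(1) for m in re.finditer(r"the (\w+) said", text)]

# the greater match and the focused part are both available
first = re.search(r"the (\w+) said", text)
greater, inner = first.group(), first.group(1)
```

both styles report the same words, but the focus style also keeps the surrounding match on hand for verification.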

i much prefer using my extended characters for simplicity. but, it’s a lot all at once, so i used pure standard ascii.

current meta-character list

regular expressions are all about expressiveness. the more expressive the grammar, the more flexible the search.

i built around posix extended meta-characters. then, i added a few to avoid death by parentheses. and, a few more to simplify or abbreviate key concepts.

i eliminated look ahead and look back by creating a “focus” group that would carry the actual desired text within the greater match.

then, i added extended rules and six symbols for grouping multiple regular expressions.

this was never meant to be for others or high volume. this is a programmer designing something perfectly suited to themselves ;)

future: i am planning a mode where ascii 0-127 are NEVER meta to avoid needing escaped meta-characters. this is easily doable using existing extended characters in my shrike font.

source code is GPL3 licensed, https://github.com/heatherlyrobert/yREGEX
