<enebo[m]>
I have been pondering this enough to look at what is going on
<enebo[m]>
The benchmark itself is sort of weird because tme makes a special collection type which caches results of matches (along with the regexp) while the other engines are just a list
<enebo[m]>
So it is hardly surprising a microbench of the same string is much faster. In fact it is surprising they are only 50% faster since they are only doing a hash lookup
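The kind of result cache described above can be sketched in a few lines: a collection that memoizes match results keyed by (regexp, string), so re-matching the same string becomes a hash lookup. This is an illustrative sketch, not the benchmark's actual code; `CachedMatcher` is a hypothetical name.

```ruby
# Memoizes match results keyed by [regexp, string]; repeated queries
# for the same pair become a single Hash lookup instead of a regexp scan.
class CachedMatcher
  def initialize
    @cache = {}
  end

  def match?(regexp, str)
    @cache.fetch([regexp, str]) { |key| @cache[key] = regexp.match?(str) }
  end
end
```

A microbenchmark that feeds the same string in repeatedly will mostly measure this hash lookup, not the regexp engine.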
<enebo[m]>
MRI I think did implement result cache on regexp recently
<enebo[m]>
lopex: as to just preprocessing/parsing regexp it would be useful to just do those once in the Ruby parser
<headius>
sorry I meant string things from StringSupport should live in jcodings... and yeah there's lots of regex support code that could move into joni
<enebo[m]>
yeah I was more referring to lopex about the notion of caching
<enebo[m]>
headius: I agree though string support feels like it could be in jcodings other than having to rev it every time we change it
<enebo[m]>
many of those methods are done though
<headius>
anything without special chars should just be a simple match instruction in joni
<enebo[m]>
I have thought about that as well
<headius>
I'm going to make that debug flag configurable right now in fact
<headius>
a simple string is literally begin, match instruction, end
<enebo[m]>
but we do not want to use that engine at all if it is that case
<headius>
right
<enebo[m]>
yeah just some static methods
<enebo[m]>
there are some other considerations which I think could matter in these hot methods for strings
<enebo[m]>
I do not have a utf-8 hot version for split but if I did I would keep track of whether the substring created became 7bit
<enebo[m]>
which is something joni does not know anything about
<enebo[m]>
That getline method I fixed CR for in IO had me wondering why it does code range stuff in addition to line walking
<enebo[m]>
I know the answer is that it will look for raw \n then walk the bytes up to that index but if it is going to walk up to that using mbc length routines why not just do that
<enebo[m]>
(unless it is binary then it bails once it detects that)
<enebo[m]>
Maybe this is an artifact of how magic memchr is in C
<enebo[m]>
another fun fact of that fix yesterday is that broken binary data which for whatever reason was calling getline was endlessly trying to mbc walk it
<headius>
c == syntax.metaCharTable.esc && !syntax.op2IneffectiveEscape()
<headius>
so c == syntax.metaCharTable.esc is the esc char check
<headius>
other than that and the special chars it would be simple
<headius>
this does show that it depends on the syntax being used, so having it a joni util function makes more sense for that
<headius>
we only use one syntax though
<enebo[m]>
I guess a boolean could exist to mark special if any real special chars hit here
<enebo[m]>
those same chars above I guess are known to be escaped?
<headius>
if this were split into two subclasses, one that actually generates the tokens and the other that just scans for conditions, it would reuse most of this
<headius>
ah yeah so you might want things like foo\* to be a simple match for literally "foo*"
<headius>
so there's a bit more logic there to know it is an escape but it's escaping to be a simple char
<enebo[m]>
foo*
<enebo[m]>
hahah
<enebo[m]>
I see what happened
<enebo[m]>
foo\*
<headius>
oh yeah
<headius>
element stole my backslash
<headius>
foo\*
<headius>
/foo\*/
<headius>
which I had to type as \/foo\\*/
<enebo[m]>
but I think the '*' above is the escape and the one you highlighted is a kleene star
<enebo[m]>
because of the return type of the token
<headius>
right
<enebo[m]>
nope I guess not
<enebo[m]>
both are kleenes
<headius>
everything in the lower half are special characters
<enebo[m]>
upper half * is also a repeat
<headius>
ok yeah I am confused about the upper part then
<headius>
oh lexer probably does it
<enebo[m]>
this is the lexer but I guess the idea here is that for nearly anything here it would mark it as special
<enebo[m]>
I suppose people would get angry if we stripped out GNU emacs buffer handling
<headius>
haha
<headius>
anyway, properties are in and I'm back to 9.4.2 stuff
<lopex[m]>
then why not just use that exact info plus empty bytecode ?
<enebo[m]>
lopex it could
<lopex[m]>
afaik we talked about it
<headius>
lopex: we want to avoid having to compile it
<headius>
ideally zero alloc to determine whether it needs to be a regex
<enebo[m]>
It is likely still building up a tree either way
<lopex[m]>
then why not check processed string against original ?
<enebo[m]>
headius: if we use the tree I guess it would just be to see if there were a single exact of any kind + the finish and end
<enebo[m]>
jimtng: cool
<enebo[m]>
but lopex mentioned just comparing processed string to original which I don't comprehend
<enebo[m]>
The tree mechanism does sound simpler than lex and lex must be making tokens as it builds the string
<lopex[m]>
enebo: wouldn't the length change ?
<enebo[m]>
builds the tree
<enebo[m]>
or instr list
<enebo[m]>
lopex: I don't understand what you are saying
<lopex[m]>
if you just use RubyRegexp.quote
<lopex[m]>
and compare quoted with original
<enebo[m]>
if they are same length it means they are simple or not simple?
<enebo[m]>
system ~/work/jruby split_opts 3937% mri31 -e 'p /foo*/.to_s'
<enebo[m]>
"(?-mix:foo*)"
<enebo[m]>
system ~/work/jruby split_opts 3938% mri31 -e 'p /food/.to_s'
<enebo[m]>
"(?-mix:food)"
<enebo[m]>
oh but quote
<enebo[m]>
mri31 -e 'p Regexp.quote(%q{food})'
<enebo[m]>
"food"
<enebo[m]>
system ~/work/jruby split_opts 3941% mri31 -e 'p Regexp.quote(%q{foo*})'
<enebo[m]>
"foo\\*"
<lopex[m]>
quote already has all the required logic
<lopex[m]>
wrt metaFound label there
<enebo[m]>
so it will add an extra \ for special chars
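The check lopex is describing can be sketched directly: `Regexp.quote` escapes every regexp metacharacter, so a pattern source that `quote` leaves untouched contains no metacharacters and could be matched as a plain string. It is conservative the safe way: Ruby also escapes whitespace, and escaped metachars like `foo\*` change under quoting, so those give false negatives but never a false positive. `simple_pattern?` is a hypothetical helper name, not JRuby code.

```ruby
# A pattern source is "simple" (no metacharacters) iff quoting it is a no-op.
# Conservative: some simple-in-effect patterns (e.g. "foo\\*") are rejected,
# but nothing with real metacharacters is ever accepted.
def simple_pattern?(source)
  Regexp.quote(source) == source
end

simple_pattern?("food")    # => true  (plain substring search would suffice)
simple_pattern?("foo*")    # => false ("*" is a metacharacter)
simple_pattern?("foo\\*")  # => false (safe false negative)
```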
<lopex[m]>
it does another pass there dunno why though
<enebo[m]>
but this will mean to split I will have to generate the quoted string
<enebo[m]>
I suppose I could do that at parse time then set a flag on rubyregexp
<enebo[m]>
but will this work 100%
<lopex[m]>
or enumerate all the cases there just to check if it is indeed a simple string
<enebo[m]>
I don't mean will it find 100% of all simple strings but will it never give a false positive
<lopex[m]>
no idea how unicode char literals etc would play with this
<enebo[m]>
yeah
<enebo[m]>
Although they should stay same length and no special chars are mbc in regexp
<enebo[m]>
like we don't support greek unicode for '*' (if there is one)
<enebo[m]>
and I will cover my ass by saying perhaps * is greek historically
<enebo[m]>
lopex: This may be a working idea but I think the better solution would still be to add an isExact/Simple method in joni somewhere to allow future push to include optimizing these away from having to construct a regexp/engine
<lopex[m]>
before or after parse ?
<lopex[m]>
I mean joni parse
<enebo[m]>
yeah so joni parse makes a set of instrs directly or does it make an AST?
<lopex[m]>
it does make an AST
<lopex[m]>
the ast transformations are the most complex thing in joni
<enebo[m]>
I think AST would be the place to check and then what to do from that point forward would be a question in itself
<lopex[m]>
it does another AST rewrite pass
<enebo[m]>
but initial AST would be the simplest tree right
<lopex[m]>
yeah
<enebo[m]>
so it would mark it as simple then whatever does the passes would decide it can be simple search and not do other work (this is idea of putting this into joni)
<enebo[m]>
but if we do not boil the ocean we would then still do the rest of the work but pass along that info into the regexp being built up
<enebo[m]>
the downside here is if we want the former feature added then adding the state in the latter probably would not need to exist
<enebo[m]>
The big issue I see also is people get an object back from joni and store it
<enebo[m]>
So the simple search has to still get stuffed into something common
<enebo[m]>
Since I am talking about fleshing out different searches for joni supplied regexps I should point out the DFA thing TR did is another possible alternate mechanism which we would also maybe want
<enebo[m]>
That feature is interesting in removing some pathological processing
<enebo[m]>
yeah the other thought about this is the idea that regexp has like 30 fields.
<enebo[m]>
If regexp was not an object but a struct(or enum) then it could pass through that style of dispatch but potentially make less alloc-heavy structs
<enebo[m]>
Ok I am brainstorming and that perhaps isn't useful today but I will bring it around to being enabling but not costly
<enebo[m]>
if initial AST could just mark it as a simple string we could maybe append that as an options bit we could check
<enebo[m]>
that would not add space and very little time
<headius>
we're still cutting off the bulk of overhead with that
<enebo[m]>
public final boolean isSimple() {
<enebo[m]>
return (getType2Bit() & SIMPLE) != 0;
<enebo[m]>
}
<enebo[m]>
What is this? :)
<enebo[m]>
Ah I see
<enebo[m]>
I think in AST we just want a string node
<lopex[m]>
yeah
<headius>
I might have a ripper bug that's causing most of the failures in IRB suite
<lopex[m]>
but it still will go through similar logic as quote
<enebo[m]>
headius: I am not surprised. I excluded two last night
<enebo[m]>
they do not crash but they do not pass
<headius>
ah you did, yeah I was trying to figure them out
<headius>
they are new cases of the same problem we have lots of excludes for
<headius>
mostly around checking indentation for incomplete source
<enebo[m]>
fwiw one of them is very odd
<headius>
we seem to output different error tokens so they don't filter them right
<headius>
they = irb
<headius>
for something like class A\n
<enebo[m]>
yeah on_err vs on_compile
<enebo[m]>
That is something I have noticed but not spent time to figure out
<headius>
we output #<Ripper::Lexer::Elem: on_error1@1:8:BEG: "": syntax error, unexpected end-of-file
<enebo[m]>
If you can easily see indentation failing it raises it in my mind
<headius>
one of those is an issue with the partial code prompt (add star to prompt when in partial code) and the other is an error for unterminated code
<headius>
it's all the same thing
<enebo[m]>
The combined parser work did fix a lot of issues (and was super icky to do) but I know there are lots of issues and something strange where the wrong on_ is called for errors
<enebo[m]>
This is not strictly that but if this gets fixed likely those will be as well
<headius>
yeah that is similar
<headius>
last error seems to be falling into some generic "on_error" rather than something specific to the previous token
<headius>
compile_error or on_parse_error or whatever?
<enebo[m]>
I am well aware of this though I just have not spent time on it because I fix what people report on this
<enebo[m]>
yeah there are multiple errors and how you call them is wrong somehow
<headius>
this is likely affecting irb autoindentation of code right now
<enebo[m]>
What is fascinating is sexp, tokenize, lex, and parse do give identical results to my eyes
<headius>
I just got this bug because I wanted to see how far off we are from passing all IRB tests and most of the failures are this
<headius>
I can add to this bug with IRB info
<enebo[m]>
That was from the issue but I pass all but one thing there and it happens to then become the wrong error
<enebo[m]>
that line meant we match MRI exactly in all but one mode of ripper
<enebo[m]>
the fact ripper has 5 different modes is fun
<enebo[m]>
headius: in any case if I fix this issue all those others will be fixed since there are lots of excludes and they all seemingly are wrong handling of error reporting for bad syntax
<enebo[m]>
I will be into whatever this is and retesting all excludes
<headius>
yup that's the ticket
<headius>
not surprising that error conditions would be the last mile of compat with ripper
<enebo[m]>
There are at least 3 errors in your example too
<enebo[m]>
my original report has the same mismatched dispatch method name but yours also shows we do an extra nl token and we also have a different error state itself
<enebo[m]>
which could be calling wrong yyerror to not get the "expecting" part OR our parser is not behaving identically to bison
<headius>
yeah let's hope it's the former
<enebo[m]>
and not behaving identically has tended to generally never happen, but for error handling it very well could
<enebo[m]>
that could even explain the wrong name to dispatch but I am optimistic I just missed something
<headius>
could be simple! 😃
<enebo[m]>
but I at least know it is not trivial because I opened an issue on it
<enebo[m]>
If it had been trivial I would have fixed it
<enebo[m]>
lol I was planning on looking at the postgresql array in column thing today but I got lost in this simple regexp thing
<enebo[m]>
but I suspect postgresql is a lot more important than irb tests passing in error cases
<enebo[m]>
I will probably switch to that after lunch
<enebo[m]>
oh and I have to make rice shoot
<headius>
on_error1 and 2 use callee
<headius>
aha
<headius>
and the other on_* are generated
<headius>
so it needs __callee__ to produce the right id
<headius>
hah that's all it was
<headius>
for this case anyway
<enebo[m]>
headius: oh yeah I didn't think about this but I had to edit those files
<headius>
this fixes it
<headius>
I will have a patch in a moment but it basically does what you did
<headius>
I used eval but it could be define_method too
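The fix being discussed comes down to how `__callee__` behaves: it reports the name a method was invoked under, including through an alias, even inside a body shared by many generated methods. A captured name would always report the original definition. A minimal sketch with `define_method` (event names here are illustrative, not Ripper's real table):

```ruby
class Events
  # One shared body for every generated handler; __callee__ recovers
  # the name actually used at the call site.
  %i[on_parse_error on_compile_error].each do |event|
    define_method(event) do |msg|
      "#{__callee__}: #{msg}"
    end
  end

  # Aliased calls report the alias, not the original name.
  alias_method :on_error1, :on_parse_error
end

e = Events.new
e.on_parse_error("oops")  # => "on_parse_error: oops"
e.on_error1("oops")       # => "on_error1: oops"
```

Generating the methods with `eval` instead of `define_method` works the same way, since `__callee__` is resolved per call, not per definition.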
<enebo[m]>
but fwiw there are error handling issues also
<enebo[m]>
"Moving off of 9.2 since this has only been observed in one place and I worked around it (e.g. default ripper impl)."
<enebo[m]>
So I definitely have an issue to work through where we are using on_compile_error and something else but happily it was just the callee changes getting wiped out after updating
<headius>
Ugh I just realized a possible solution: encode the passed-in name such that we can tell whether it's a new and old name or just a single one
<headius>
Pretty sure it's only used for super, backtrace, and callee
<headius>
Moderate impact change would be to pass in some identifier object that can also hold the alias name, so basically encoding both names into a single parameter in a more object oriented way
<enebo[m]>
it would work with a small check but potentially be brittle if we consume it or start consuming it later in unexpected ways
<enebo[m]>
name\0alias and then to strlen
<headius>
yeah
<headius>
it would only happen if called through an alias, but in such cases we'd have to break the string into new strings for usage
<enebo[m]>
but if misused I wonder if \0 would shield us from mistakes
<headius>
\0name\0alias so we can just detect \0 quickly
<headius>
if charAt(0) == "\\0"; split out super or callee name
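The encoding scheme under discussion can be sketched in Ruby (the real code would be Java inside JRuby; these helper names are hypothetical): when a method is called through an alias, smuggle both names into one String by prefixing with `"\0"`, so the common non-aliased path pays only a one-character check.

```ruby
# Encode: plain names pass through untouched; aliased calls get
# "\0name\0alias" so the fast path is a single first-char test.
def encode_names(name, alias_name = nil)
  alias_name ? "\0#{name}\0#{alias_name}" : name
end

# Decode: only strings starting with "\0" carry an alias.
def decode_names(encoded)
  if encoded.start_with?("\0")
    _, name, alias_name = encoded.split("\0", 3)
    [name, alias_name]
  else
    [encoded, nil]
  end
end
```

Since `"\0"` cannot appear in a real method name, the marker never collides with a legitimate identifier, which is what shields callers that never look for the alias.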
<headius>
right
<enebo[m]>
then no check would exist unless you were looking to use it
<enebo[m]>
but I suppose String maybe is more clever than I am thinking
<headius>
anything that doesn't super and doesn't callee doesn't use the name
<headius>
so only aliases WITH super or callee
<headius>
and aliased calls that get a backtrace, which is only interpreter
<enebo[m]>
but I am trying to come up with something that avoids the check entirely unless you are looking it up for the alias
<headius>
the way MRI does this is by passing the called method through
<headius>
implicitly on the stack
<headius>
so you'd receive an AliasMethod in the method body and have both names
<enebo[m]>
"foo\0alias" as a Java string will it lookup "foo" in a method table?
<headius>
no it will just treat \0 as a normal char
<enebo[m]>
ok I figured but I could not remember
<enebo[m]>
class Double implements Charset 👀
<headius>
it would be the lowest-impact change
<enebo[m]>
or CharSequence
<headius>
stuffing extra crap into a String arg
<headius>
yeah if it were CharSequence we could just have DynamicMethod implement CharSequence and pass it in
<headius>
which would be weird
<headius>
but it's String so we're stuck
<enebo[m]>
if alias we put it into a subtyped charsequence at the cost of toString()
<headius>
my patch removes 17 excludes
<headius>
from TestRubyLex
<enebo[m]>
Too bad all String accepting things did not implicitly toString() like string construction
<headius>
looking into the last four failures
<headius>
yeah duck typed StringProducer
<enebo[m]>
It makes me happy it was this and not something in the parser
<headius>
yeah me too
<headius>
hopefully the rest is simple too
<headius>
there's no more __callee__ things to fix though
<headius>
at least not in this file
<enebo[m]>
I looked and I remember thinking I was doing the same thing as MRI at whatever I was looking at so I didn't get why it was another name
<enebo[m]>
and I obviously thought I handled naming on the ruby side already
<enebo[m]>
yeah adding support for that and aliases would just be to not get it reported
<headius>
three of the remaining four are related to incomplete heredocs
<enebo[m]>
not totally surprised
<headius>
there's a fifth in the IRB repo that is different
<headius>
Error tokens must be ignored if there is corresponding non-error token.
<headius>
<[:compile_error]> was expected to be empty.
<enebo[m]>
and ripper is emitting stuff in the middle of that happening
<enebo[m]>
and some of it is "delayed" which will get emitted at a later point
<enebo[m]>
yeah that may be a case of something meant to be delayed and it not happening
<headius>
updated with example code
<enebo[m]>
or not
<enebo[m]>
aha you also see one should be compile_error
<headius>
yeah
<enebo[m]>
so not all of that was from the aliases I guess
<headius>
so it should start the heredoc and then save the rest, show the compile error and then the rest of the tokens
<enebo[m]>
our behavior of how we process is not really special other than using two types for strings and heredocs
<enebo[m]>
so it is just usually a matter of tracing through execution until you figure out where it went off the rails
<enebo[m]>
MRI will do same things but use 2 c methods
<headius>
I pushed remaining changes to my PR
<headius>
I modified your other __callee__ fix to use __callee__ rather than capturing the name, so they should compile as normal methods now
<enebo[m]>
ok
<headius>
enebo: PR looks fine to me not reading every single line
<headius>
I assume any discoveries from today's discussion will get in there (if not already there)
<headius>
split opts PR
<enebo[m]>
yeah I can see first AST pass would be where to look and if just a String Node it would mark it as a string search in options for the regexp
<enebo[m]>
I do not know where that place is and whether it has options available to it at that point but that would replace all that code
<enebo[m]>
and be less conservative (well optimal) in what it accepts
<headius>
ok so decision time on this jffi thing
<headius>
it is blocking people stuck on CentOS 7, but the only way I have to solve it at the moment is to manually build the extension on a CentOS 7 VM
<headius>
I guess I'd be ok doing that for one release but I don't like going backwards on automation of these builds
<enebo[m]>
building on various types of infra seems like such a basic use case but it is difficult
<headius>
it sure is
<headius>
hmm
<headius>
those M1 runners seem to be hanging a lot lately
<headius>
enebo: the job where you tagged failures still has two failures
<headius>
test_colorize_code_complete_false and test_colorize
<headius>
that first one still fails on my branch, other one does not
<enebo[m]>
yeah that config one is interesting because it complains about a missing method
<headius>
and I regressed something in ripper tests, or removed too many excludes... investigating that
<headius>
ah I did remove too much
<headius>
I guess this only fails with the flags we run with, like some coverage flags or something
<headius>
ok should be green now
<headius>
the failing colorizing test is one that tests for incomplete code, so again same issues likely
<headius>
I have started punting aspirational items again but looking through a few that have made progress
<headius>
ok there's some ordering thing with these colorize tests
<headius>
now the true one fails again and test_colorize is failing again
<headius>
I'm going to exclude all three for now and merge