This does not support Unicode case folding, which is very much a
sorry-not-sorry situation because Unicode is a disgusting labyrinthine
chaotic hellformat. Actually, our Unicode support isn't very good in
general, since we don't do any form of normalization, so specifying
non-ASCII values for scalar comparisons is probably asking for
trouble.
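
To make the normalization point concrete, here is a small sketch in Python (not this project's Zig, purely for illustration). `ascii_eq_ignore_case` is a hypothetical helper, not this library's API. It folds case for ASCII letters only, and then shows two strings that render identically but compare unequal until they are normalized.

```python
import unicodedata


def ascii_eq_ignore_case(a: str, b: str) -> bool:
    """Compare two strings, folding case for ASCII letters only (hypothetical helper)."""
    def fold(s: str) -> str:
        # Only A-Z are lowered; every non-ASCII code point is left untouched.
        return "".join(chr(ord(c) + 32) if "A" <= c <= "Z" else c for c in s)
    return fold(a) == fold(b)


composed = "caf\u00e9"      # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent (U+0301)

print(ascii_eq_ignore_case("TRUE", "true"))      # True: ASCII letters fold
print(ascii_eq_ignore_case("\u00c9", "\u00e9"))  # False: 'É' vs 'é' is not folded
print(composed == decomposed)                    # False: same rendering, different code points
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))  # True once both are normalized
```

Without a normalization step like the last line, a non-ASCII scalar in a comparison can fail to match input that looks identical on screen.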
The way I implemented these changes ended up being directly coupled and
I am not interested in trying to decouple them, so instead here's a
single commit that makes changes to both the API and the format. Let's
go over these.
| now acts as a direct concatenation operator, rather than
concatenating with a space. This is because the format allows the
specification of a trailing space (by using | to fence the string just
before the newline). So it's now possible to spread a long string
without spaces over multiple lines, which couldn't be done before.
This does have the downside that the common pattern of concatenating
strings with a space now requires some extra trailing line noise. I
may introduce a THIRD type of concatenating string (thinking of
using + as the prefix) because I am a jerk. We will see.
The way multi-line strings are concatenated has changed. Partially this
simplifies the aforementioned implementation change. The parser forgets
the string type from the tokenizer, which worked before because there
would always be a trailing character that could be popped off. Since
one type now appends no character, the type would have to be tracked
through parsing to determine whether a character needs to be popped at
the end. But I was also not terribly satisfied with the semantics of
multi-line strings before. I wrote several words about this in
429734e6e813b225654aa71c283f4a8b4444609f, where I reached the opposite
conclusion from what is implemented in this commit.
Basically, when different types of string concatenation are mixed, the
results may be surprising. The previous approach would append the line
terminator at the end of the line specified. The new approach prepends
the line terminator at the beginning of the line specified. Since the
specifier character is at the beginning of the line, I feel like this
reads a little better simply due to the colocation of information. As
an example:
> first
| second
> third
This would previously have resulted in "first\nsecondthird", but it will
now result in "firstsecond\nthird". The only mildly baffling part about
this is that the string signifier on the first line has absolutely no
impact on the string. In the old design, it was the last line that had
no impact.
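
The two behaviors can be sketched as follows. This is a Python model rather than the actual Zig parser: lines are represented as hypothetical `(prefix, text)` pairs, `join_old` and `join_new` are made-up names, and only the `>` and `|` prefixes described in this message are handled.

```python
# Each line is modeled as (prefix, text); the real tokenizer is not shown.
# ">" joins with a newline, "|" joins with nothing.
TERMINATOR = {">": "\n", "|": ""}


def join_old(lines):
    """Old semantics: each line's prefix decides what is appended after it."""
    out = ""
    for i, (prefix, text) in enumerate(lines):
        out += text
        if i < len(lines) - 1:      # the last line's prefix has no effect
            out += TERMINATOR[prefix]
    return out


def join_new(lines):
    """New semantics: each line's prefix decides what is prepended before it."""
    out = ""
    for i, (prefix, text) in enumerate(lines):
        if i > 0:                   # the first line's prefix has no effect
            out += TERMINATOR[prefix]
        out += text
    return out


example = [(">", "first"), ("|", "second"), (">", "third")]
print(repr(join_old(example)))  # 'first\nsecondthird'
print(repr(join_new(example)))  # 'firstsecond\nthird'
```

The same model shows the direct-concatenation case: `join_new([("|", "abc"), ("|", "def")])` gives `"abcdef"`, a string spread over two lines with no space inserted.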
Finally, this commit also changes Value so that it uses []const u8
slices directly to store strings instead of ArrayLists. This is
because everything downstream of the value was just reaching into
string.items to access the slice directly, so this cuts out the
middleman. It was unintuitive to access a field named .string and get
an ArrayList rather than a slice, anyway.
There are still some untested codepaths here, but this does seem to
work for nontrivial objects, so, woohoo. It's worth noting that this
is a recursive implementation (which seems silly after I hand-rolled
the non-recursive main parser). The thinking is that if you have a
deeply-enough nested object that you run out of stack space here, you
probably shouldn't be converting it directly to an object.
I may revisit this, though I am still not 100% certain how
straightforward it would be to make this nonrecursive with all the
weird comptime objects. Basically the "parse stack" would have to be
created at comptime.
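
As a rough illustration of why the recursive version is the easy one, here is a type-driven conversion sketched in Python. It is an analogy only: `convert`, `Point`, and `Shape` are hypothetical names, dataclasses stand in for Zig structs, and runtime introspection stands in for comptime reflection.

```python
from dataclasses import dataclass, fields, is_dataclass


def convert(value, target):
    """Recursively convert a parsed tree (dicts and strings) into `target`."""
    if is_dataclass(target):
        # One recursive call per field, so call depth tracks nesting depth.
        kwargs = {f.name: convert(value[f.name], f.type) for f in fields(target)}
        return target(**kwargs)
    # Scalar leaf: let the target type (int, str, ...) parse the string.
    return target(value)


@dataclass
class Point:
    x: int
    y: int


@dataclass
class Shape:
    name: str
    origin: Point


parsed = {"name": "box", "origin": {"x": "1", "y": "2"}}
print(convert(parsed, Shape))  # Shape(name='box', origin=Point(x=1, y=2))
```

A non-recursive version would replace the call stack with an explicit stack of (value, target type) frames. In Python that is unremarkable because types are runtime values; in the Zig implementation described here, the target types exist only at comptime, which is why the stack itself would have to be built at comptime.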
I don't like big monolithic source files, so let's restructure a bit.
parser.zig is still bigger than I would like it to be, but there isn't
a good way to break up the two state machine parsers, which take up
most of the space. This is the last junk commit before I seriously
implement the "streaming" parser, which is the last change before
implementing deserialization to object. I am definitely not just
spinning my wheels here.