8 Commits

Author SHA1 Message Date
258cf2ae83
parser: reintroduce space strings and change token parsing strategy
I don't think I have the wherewithal to write this full commit message
right now, since it should be a long one.

Basically, `+ ` is now the space-concatenating string operator, because
that is a very common use case. It's essentially the soft-wrap
character.

Also, lines starting with -, +, >, and | will now try to tokenize as
map keys if that leading character is not followed by a space. The
motivation here is numeric map keys: + and - are numeric leaders.

To facilitate this change, own-line scalars are now prohibited. So, for
example:

    key: -1000

is still fine, but

    key:
        -1000

is no longer accepted.
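
A rough model of the leader rule described above, written in Python
rather than the project's Zig and not taken from the parser itself. The
name `classify_line` and the labels it returns are hypothetical, and
treating everything up to the first `:` as a map key is a simplifying
assumption. Bare leader characters with nothing after them are not
modeled.

```python
# Illustrative model only; function name and labels are hypothetical.
LEADERS = {
    "-": "list_item",
    "+": "space_string",
    "|": "concat_string",
    ">": "newline_string",
}


def classify_line(line: str) -> str:
    """Classify one already-dedented line by its leading character."""
    if len(line) >= 2 and line[0] in LEADERS and line[1] == " ":
        return LEADERS[line[0]]
    # Without the following space, a leader character is treated as the
    # start of a map key (assumption: a key ends at the first ':').
    if ":" in line:
        return "map_key"
    # An own-line scalar such as "-1000" is rejected in this model.
    return "invalid"
```

Under these assumptions, `- 1000` classifies as a list item, `-1000: x`
as a map key, and a lone `-1000` as invalid, which mirrors the own-line
scalar prohibition.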
2023-10-18 00:20:19 -07:00
25386ac87a
rename flow_(list|map) to inline_(list|map)
This is simply better word choice.
2023-10-18 00:07:12 -07:00
8dd5463683
parser: change string and | semantics and expose slices in Value
The way I implemented these changes ended up being directly coupled and
I am not interested in trying to decouple them, so instead here's a
single commit that makes changes to both the API and the format. Let's
go over these.

| now acts as a direct concatenation operator, rather than
concatenating with a space. This is because the format allows the
specification of a trailing space (by using | to fence the string just
before the newline). So it's now possible to spread a long string
without spaces over multiple lines, which couldn't be done before.
This does have the downside that the common pattern of concatenating
strings with a space now requires some extra trailing line noise. I
may introduce a THIRD type of concatenating string (thinking of
using + as the prefix) because I am a jerk. We will see.

The way multi-line strings are concatenated has changed. Partially this
has to do with increasing the simplicity of the aforementioned
implementation change: the parser forgets the string type from the
tokenizer. This worked before because there would always be a trailing
character that could be popped off, but since one type now appends no
character, the type would have to be tracked through the parsing to
determine whether a character needs to be popped at the end. But I
was also not terribly satisfied with the semantics of multiline
strings before. I wrote several words about this in
429734e6e813b225654aa71c283f4a8b4444609f, where I reached the opposite
conclusion from what is implemented in this commit.

Basically, when different types of string concatenation are mixed, the
results may be surprising. The previous approach would append the line
terminator at the end of the line specified. The new approach prepends
the line terminator at the beginning of the line specified. Since the
specifier character is at the beginning of the line, I feel like this
reads a little better simply due to the colocation of information. As
an example:

  > first
  | second
  > third

Would previously have resulted in "first\nsecondthird" but it will now
result in "firstsecond\nthird". The only mildly baffling part about
this is that the string signifier on the first line has absolutely no
impact on the string. In the old design, it was the last line that had
no impact.
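
To make the difference concrete, here is a small Python model of both
rules. It is a sketch, not the parser's Zig code: `join_old`,
`join_new`, and `split_leader` are hypothetical names, and it assumes
every line has the form leader, one space, text, with only the > and |
leaders.

```python
# Terminator contributed by each leader: '>' is a newline, '|' is nothing.
TERMINATORS = {">": "\n", "|": ""}


def split_leader(line: str) -> tuple[str, str]:
    """Split '> text' into the leader '>' and the text after '> '."""
    return line[0], line[2:]


def join_old(lines: list[str]) -> str:
    """Old rule: each line appends its own terminator, except the last."""
    parts = [split_leader(line) for line in lines]
    out = ""
    for index, (leader, text) in enumerate(parts):
        out += text
        if index < len(parts) - 1:
            out += TERMINATORS[leader]
    return out


def join_new(lines: list[str]) -> str:
    """New rule: each line after the first prepends its own terminator."""
    parts = [split_leader(line) for line in lines]
    out = ""
    for index, (leader, text) in enumerate(parts):
        if index > 0:
            out += TERMINATORS[leader]
        out += text
    return out
```

For the three-line example, `join_old` gives "first\nsecondthird" and
`join_new` gives "firstsecond\nthird".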

Finally, this commit also changes Value so that it uses []const u8
slices directly to store strings instead of ArrayLists. This is
because everything downstream of the value was just reaching into
string.items to access the slice directly, so cut out the middleman.
It was unintuitive to access a field named .string and get an
arraylist rather than a slice, anyway.
2023-10-08 16:57:52 -07:00
7db6094dd5
state/tokenizer: go completely the opposite direction re: whitespace
This commit makes both the parser and tokenizer a lot more willing to
accept whitespace in places where it would previously cause strange
behavior. Also, whitespace is ignored preceding and following all
values and keys in flow-style objects now (in regular objects,
trailing whitespace is an error, and it is also an error for non-flow
map keys to have whitespace before the colon). Tabs are no longer
allowed as whitespace in a line. They can still appear inside scalar
values, including map keys, and inside strings.

The primary motivation here is to apply the principle of least
astonishment. For example, the following

  -  [hello, there]

would previously have been parsed as the scalar " [hello, there]" due
to the presence of an additional space after the "-" list item
indicator. This obviously looks like a flow list, and the way it was
previously parsed was very visually confusing. (This change does mean
that scalars cannot start with [, but strings can, so this is not a
real limitation.) Note that strings still allow leading whitespace, so

  >  hello

will produce the string " hello" due to the additional space after the
string designator. For flow lists,

  [ a, b ]

would have been parsed as ["a", "b "], which was obviously confusing.
The previous commit fixed this by making whitespace rules more strict.
This commit fixes this by making whitespace rules more relaxed. In
particular, all whitespace preceding and following flow items is now
stripped. The main motivation for going in this direction is to allow
aligning list items over multiple lines, visually, which can make data
much easier to read for people, an explicit design goal. For example

  key:   [  1,  2,  3 ]
  other: [ 10, 20, 30 ]

is now allowed. The indentation rules do not allow right-aligning
"key" to "other", but I think that is acceptable (if we forced using
tabs for indentation, we could actually allow this, which I think is
worth consideration, at least). Flow maps are more generous:

  foo:  {  bar:  baz }
  fooq: { barq: bazq }

is allowed because flow maps do not use whitespace as a structural
designator. These changes do affect how some things can be
represented. Scalar values can no longer contain leading or trailing
whitespace (previously they could contain leading whitespace). Map keys
cannot contain trailing whitespace (they could before); this also means
that keys consisting only of whitespace cannot be represented at all.
Ultimately, given the other restrictions the format imposes on keys
and values, I find these to be acceptable and consistent with the goal
of the format.
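
A minimal Python sketch of the relaxed rule for flat flow lists. It is
not the parser's implementation: `parse_flow_list` is a hypothetical
name, nested lists and maps are not handled, and returning an empty
list for `[]` is an assumption. Only spaces are stripped, since tabs
are not treated as whitespace.

```python
def parse_flow_list(text: str) -> list[str]:
    """Parse a flat flow list such as '[ a, b ]' into its scalar items."""
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("not a flow list")
    body = text[1:-1]
    if body.strip(" ") == "":
        return []
    # Strip spaces before and after every item, so visually aligned
    # columns produce the same values as tightly packed ones.
    return [item.strip(" ") for item in body.split(",")]
```

With this model, `[ a, b ]` yields the items "a" and "b", and the two
aligned rows from the example parse to "1", "2", "3" and "10", "20",
"30" respectively.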
2023-10-04 22:54:53 -07:00
01f98f9aff
parser: start the arduous journey of hooking up diagnostics
The errors in the line buffer and tokenizer now have diagnostics. The
line number is trivial to keep track of due to the line buffer, but
the column index requires quite a bit of juggling, as we pass
successively trimmed down buffers to the internals of the parser.
There will probably be some column index counting problems in the
future. Also, handling the diagnostics is a bit awkward, since it's a
mandatory out-parameter of the parse functions now. The user must
provide a valid diagnostics object that survives for the life of the
parser.
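
The column juggling can be pictured with a short Python sketch. This is
an illustration under assumptions, not the project's API: `Diagnostics`
and `check_line` are hypothetical names, columns are 0-based here, and
a stray carriage return stands in for whatever error the real
tokenizer reports.

```python
from dataclasses import dataclass


@dataclass
class Diagnostics:
    """Caller-owned record that the check fills in on failure."""
    line: int = 0
    column: int = 0
    message: str = ""


def check_line(raw: str, line_number: int, diag: Diagnostics) -> bool:
    """Return True if the line is acceptable; otherwise fill in diag."""
    trimmed = raw.lstrip(" ")
    # The trimmed buffer no longer knows where it started, so the
    # offset that was cut off has to be carried alongside it.
    base_column = len(raw) - len(trimmed)
    bad_at = trimmed.find("\r")
    if bad_at == -1:
        return True
    diag.line = line_number
    diag.column = base_column + bad_at
    diag.message = "carriage return inside line"
    return False
```

The caller creates the `Diagnostics` object up front and keeps it alive
for as long as it wants to read the result, which is the same
out-parameter shape the message describes.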
2023-09-27 23:44:06 -07:00
3258e7fdb5
tokenizer: add finish function to check if there is trailing data
Since the tokenizer is decoupled from the parser, there's no other good
way to do this. Also, without attempting to parse the last line, it's
impossible to say whether it is junk data or simply missing a trailing
newline.
2023-09-27 23:35:24 -07:00
0e60719c85
linebuffer: add strictness options
When the buffer was separated from the tokenizer, we lost some
validation, including really aggressive carriage return detection.
This commit brings that back in full force and adds some additional
validation on top of it.
2023-09-26 00:06:39 -07:00
38e47b39dc
all: do some restructuring
I don't like big monolithic source files, so let's restructure a bit.
parser.zig is still bigger than I would like it to be, but there isn't
a good way to break up the two state machine parsers, which take up
most of the space. This is the last junk commit before I am seriously
going to implement the "streaming" parser, which is the last change
before implementing deserialization to object. I am definitely not
just spinning my wheels here.
2023-09-24 18:22:12 -07:00