12 Commits

Author SHA1 Message Date
96b950755b
config: migrate flow parser into the main parser object
I think I am actually going to make this a method of the ParserState
struct soon so lol check out my freaking code churn. But here we are.
2023-09-23 13:29:49 -07:00
9c866970f8
config: remove some duplication in the parser
There's still a fair amount lurking in here, but I believe this logic
is sound. Rather than duplicating the map/list logic under the
opposing key, we set the logic up to use the second loop around
(this is was how dedents worked, and now it also works for indents).

I'm not convinced this is as easy to follow, and it did lead me to add
some additional unreachables to the code, which should maybe be turned
into error returns instead. It does reduce the odds of a code change
missing a copied instance, which I think is a good thing.
2023-09-23 01:07:04 -07:00
47f4a1c479
config: dupe map keys
I didn't do an exhaustive search, but it seems that the managed
hashmaps only allocates space for the structure of the map itself, not
its keys or values. This mostly makes sense, but it also means that
this was only working due to the fact that I am currently not freeing
the input buffer until after iterating through the parse result.

Looking through this, I'm also reasonably surprised by how many times
this is assigned in the normal parsing vs the flow parsing. There is a
lot more repetition in the code of the normal parser, I think because
it does not have a granular state machine. It may be worth revisiting
the structure to see if a more detailed state machine, like the one
used for parsing the flow-style objects, would reduce the amount of
code repetition here. I suspect it certainly could be better than it
currently is, since it seems unlikely that there really are four
different scenarios where we need to be parsing a dictionary key.
Taking a quick glance at it, it looks like I could be taking better
advantage of the flipflop loop on indent as well as dedent. This might
be a bit less efficient due to essentially being less loop unrolling,
but it would also potentially make more maintainable code by having
less manual repetition.
2023-09-22 01:01:46 -07:00
e9cf908b61
config: use std.StringArrayHashMap for the map type
As I was thinking about this, I realized that data serialization is
much more of a bear than deserialization. Or, more accurately, trying
to make stable round trip serialization a goal puts heavier demands on
deserialization, including preserving input order.

I think there may be a mountain hiding under this molehill, though,
because the goals of having a format that is designed to be
handwritten and also machine written are at odds with each other.
Right now, the parser does not preserve comments at all. But even if
we did (they could easily become a special type of string), comment
indentation is ignored. Comments are not directly a child of any other
part of the document, they're awkward text that exists interspersed
throughout it.

With the current design, there are some essentially unsolvable
problems, like comments interspersed throughout multiline strings. The
string is processed into a single object in the output, so there can't
be weird magic data interleaved with it because it loses the concept
of being interleaved entirely (this is a bigger issue for space
strings, which don't even preserve a unique way to reserialize them.
Line strings at least contain a character (the newline) that can
appear nowhere else but at a break in the string). Obviously this isn't
technically impossible, but it would require a change to the way that
values are modeled.

And even if we did take the approach of associating a comment with,
say, the value that follows it (which I think is a reasonable thing to
do, ignoring the interleaved comment situation described above), if
software reads in data, changes it, and writes it back out, how do we
account for deleted items? Does the comment get deleted with the item?
Does it become a dangling comment that just gets shoved somewhere in
the document? How are comments that come after everything else in the
document handled?

From a pure data perspective, it's fairly obvious why JSON omits
comments: they're trivial to parse, but there's not a strategy for
emitting them that will always be correct, especially in a format that
doesn't give a hoot about linebreaks. It may be interesting to look at
fancy TOML (barf) parsers to see how they handle comments, though I
assume the general technique is to store their row position in the
original document and track when a line is added or removed.

Ultimately, I think the use case of a format to be written by humans
and read by computers is still useful. That's my intended use case for
this and why I started it, but its application as a configuration file
format is probably hamstrung muchly by software not being able to
write it back. On the other hand, there's a lot of successful software
I use where the config files are not written directly by the software
at all, so maybe it's entirely fine to declare this as being out of
scope and not worrying about it further. At the very least it's almost
certainly less of an issue than erroring on carriage returns. Also the
fact that certain keys are simply unrepresentable.

As a side note, I guess what they say about commit message length being
inversely proportional to the change length is true. Hope you enjoyed
the blog over this 5 character change.
2023-09-22 01:01:46 -07:00
6415571d01
config: refactor LineTokenizer to use an internal line buffer
The goal here is to support a streaming parser. However, I did decide
the leave the flow item parser state machine as fully buffered
(i.e. not streaming). This is not JSON and in general documents should
be many, shorter lines, so this buffering strategy should work
reasonably well. I have not actually tried the streaming
implementation of this, yet.
2023-09-22 01:00:17 -07:00
ab580fa80a
config: differentiate fields in Value
This makes handling Value very slightly more work, but it provides
useful metadata that can be used to perform better conversion and
serialization.

The motivation behind the "scalar" type is that in general, only
scalars can be coerced to other types. For example, a scalar `null`
and a string `> null` have the same in-memory representation. If they
are treated identically, this precludes unambiguously converting an
optional string whose contents are "null". With the two disambiguated,
we can choose to convert `null` to the null object and `> null` to a
string of contents "null". This ambiguity does not necessary exist for
the standard boolean values `true` and `false`, but it does allow the
conversion to be more strict, and it will theoretically result in
documents that read more naturally.

The motivation behind exposing flow_list and flow_map is that it will
allow preserving document formatting round trip (well, this isn't
strictly true: single line explicit strings neither remember whether
they were line strings or space strings, and they don't remember if
they were indented. However, that is much less information to lose).

The following formulations will parse to the same indistinguishable
value:

  key: > value
  key:
    > value
  key: | value
  key:
    | value

I think that's okay. It's a lot easier to chose a canonical form for
this case than it is for a map/list without any hints regarding its
origin.
2023-09-22 01:00:17 -07:00
b18326a07a
config: start doing some code cleanup
I was pretty sloppy with the code organization while writing out the
state machines because my focus was on thinking through the parsing
process and logic there. However, The code was not in good shape to
continue implementing code features (not document features). This is
the first of probably several commits that will work on cleaning up
some things.

Value has been promoted to the top level namespace, and Document has an
initializer function. Referencing Value.List and Value.Map are much
cleaner now. Type aliases are good.

For the flow parser, `popStack` does not have to access anything except
the current stack. This can be passed in as a parameter. This means
that `parse` is ready to be refactored to take a buffer and an
allocator.

The main next steps for code improvement are:

1. reentrant/streaming parser. I am planning to leave it as
   line-buffered, though I could go further. Line-buffered has two main
   benefits: the tokenizer doesn't need to be refactored significantly,
   and the flow parser doesn't need to be made reentrant. I may
   reevaluate this as I am implementing it, however, as those changes
   may be simpler than I think.

2. Actually implement the error diagnostics info. I have some skeleton
   structure in place for this, so it should just be doing the work of
   getting it hooked up.

3. Parse into object. Metaprogramming, let's go. It will be interesting
   to try to do this non-recursively, as well (curious to see if it
   results in code bloat).

4. Object to Document. This is probably going to be annoying, since
   there are a variety of edge cases that will have to be handled. And
   lots of objects that cannot be represented as documents.

5. Serialize Document. One thing the parser does not preserve is
   whether a Value was flow-style or not, so it will be impossible to
   do round-trip formatting preservation. That's currently a non-goal,
   and I haven't decided yet if flow-style output should be based on
   some heuristic (number/length of values in container) or just never
   emitted. Lack of round-trip preservation does make using this as a
   general purpose config format a lot more dubious, so I will have to
   think about this some more.

6. Document to JSON. Why not? I will hand roll this and it will suck.

And then everything will be perfect and never need to be touched again.
2023-09-22 00:53:26 -07:00
cd05097a78
config: add terminated strings
This was the final feature I wanted to add to the format. Also some
other things have been cleaned up a little bit (for example, the
inline parser does not need the dangling key to be attached to each
stack level just like the normal parser doesn't). There was also an
off-by-one error that bugged out detecting the pathological case of a
flow list consisting of only an empty string (`[ ]`, not to be
mistaken for the empty list `[]`).

Mixed multiline strings are a bit confusing but internally consistent.

    > what character does this string end with?
    |

ends with a newline character because that's the style of the
second-to-last line. However, seeing | last makes my brain think it
should end with a space. The reason it ends with a newline is because
our concatenation strategy consists of appending to the string early
(as soon as a line is added) rather than lazily. This is a tradeoff,
though.  while lazy appending would make this result more intuitive
(the string would end with a space) and it would allow us to remove
the self-proclaimed cheesy hack, it would make the opposite boundary
condition confusing:

    >
    | what character does this string start with?

With lazy appending, this string would start with a space
(despite > making it look like it should have a leading newline).
While both of these are likely to be uncommon edge cases, it doesn't
seem we can have it both ways. Of the two options, I think the current
logic is a little bit more clear.
2023-09-22 00:53:26 -07:00
8b5a0114ef
config: allow nested flow structures
This was kind of a pain in the butt to implement because it basically
required a second full state machine parser (though this one is a bit
simpler since there are less possible value types). It seems likely to
me that I will probably shove this directly into the main parser
struct at some point in the near future.
2023-09-17 19:47:18 -07:00
ec875ef1f7
config: fix several things
There was no actual check that lines weren't being indented too far.
Inline strings weren't having their trailing newline get chopped.
Printing is still janky, but it's better than it was.
2023-09-17 19:30:58 -07:00
c8375d6d3a
mostly functioning config parser
This hand-rolled wonder of switch statements is capable of parsing a 5
byte document in less than a gigasecond.

This was an interesting exercise in writing a non-recursive parser for
a nested structure format. There's a lot of very slightly different
repetition, which I'm not wild about, but it can handle deeply nested
documents. I tested with a 50 mb indented list tree document (10_000
lines of nesting) and a ReleaseFast build was able to parse it in
approximately 50 ms with a peak memory footprint of about 100 MB (of
which, half was the contents of the document itself, as the file is
read into a single allocated buffer that does not get freed until
program exit). I don't consider myself to be someone who writes high
performance software, but I think those results are quite acceptable,
and I doubt any recursive implementation would even be able to parse
that document at all (the python NestedText implementation smashes
directly into a RecursionError, unsurprisingly).

Anyway, let's call this a success. I will actually probably export this
to a separate project soon. The main problem is coming up with a name.
I also strongly suspect there are some lurking bugs still, and I think
I do want to add nested inline map/list support (and also parsing
directly into objects).
2023-09-14 23:42:31 -07:00
bff1fa95c9
ignominious dump of creation
it's beautiful. and broken
2023-09-13 00:11:45 -07:00