6 Commits

Author SHA1 Message Date
b18326a07a
config: start doing some code cleanup
I was pretty sloppy with the code organization while writing out the
state machines because my focus was on thinking through the parsing
process and logic there. However, The code was not in good shape to
continue implementing code features (not document features). This is
the first of probably several commits that will work on cleaning up
some things.

Value has been promoted to the top level namespace, and Document has an
initializer function. Referencing Value.List and Value.Map are much
cleaner now. Type aliases are good.

For the flow parser, `popStack` does not have to access anything except
the current stack. This can be passed in as a parameter. This means
that `parse` is ready to be refactored to take a buffer and an
allocator.

The main next steps for code improvement are:

1. reentrant/streaming parser. I am planning to leave it as
   line-buffered, though I could go further. Line-buffered has two main
   benefits: the tokenizer doesn't need to be refactored significantly,
   and the flow parser doesn't need to be made reentrant. I may
   reevaluate this as I am implementing it, however, as those changes
   may be simpler than I think.

2. Actually implement the error diagnostics info. I have some skeleton
   structure in place for this, so it should just be doing the work of
   getting it hooked up.

3. Parse into object. Metaprogramming, let's go. It will be interesting
   to try to do this non-recursively, as well (curious to see if it
   results in code bloat).

4. Object to Document. This is probably going to be annoying, since
   there are a variety of edge cases that will have to be handled. And
   lots of objects that cannot be represented as documents.

5. Serialize Document. One thing the parser does not preserve is
   whether a Value was flow-style or not, so it will be impossible to
   do round-trip formatting preservation. That's currently a non-goal,
   and I haven't decided yet if flow-style output should be based on
   some heuristic (number/length of values in container) or just never
   emitted. Lack of round-trip preservation does make using this as a
   general purpose config format a lot more dubious, so I will have to
   think about this some more.

6. Document to JSON. Why not? I will hand roll this and it will suck.

And then everything will be perfect and never need to be touched again.
2023-09-22 00:53:26 -07:00
cd05097a78
config: add terminated strings
This was the final feature I wanted to add to the format. Also some
other things have been cleaned up a little bit (for example, the
inline parser does not need the dangling key to be attached to each
stack level just like the normal parser doesn't). There was also an
off-by-one error that bugged out detecting the pathological case of a
flow list consisting of only an empty string (`[ ]`, not to be
mistaken for the empty list `[]`).

Mixed multiline strings are a bit confusing but internally consistent.

    > what character does this string end with?
    |

ends with a newline character because that's the style of the
second-to-last line. However, seeing | last makes my brain think it
should end with a space. The reason it ends with a newline is because
our concatenation strategy consists of appending to the string early
(as soon as a line is added) rather than lazily. This is a tradeoff,
though.  while lazy appending would make this result more intuitive
(the string would end with a space) and it would allow us to remove
the self-proclaimed cheesy hack, it would make the opposite boundary
condition confusing:

    >
    | what character does this string start with?

With lazy appending, this string would start with a space
(despite > making it look like it should have a leading newline).
While both of these are likely to be uncommon edge cases, it doesn't
seem we can have it both ways. Of the two options, I think the current
logic is a little bit more clear.
2023-09-22 00:53:26 -07:00
8b5a0114ef
config: allow nested flow structures
This was kind of a pain in the butt to implement because it basically
required a second full state machine parser (though this one is a bit
simpler since there are less possible value types). It seems likely to
me that I will probably shove this directly into the main parser
struct at some point in the near future.
2023-09-17 19:47:18 -07:00
ec875ef1f7
config: fix several things
There was no actual check that lines weren't being indented too far.
Inline strings weren't having their trailing newline get chopped.
Printing is still janky, but it's better than it was.
2023-09-17 19:30:58 -07:00
c8375d6d3a
mostly functioning config parser
This hand-rolled wonder of switch statements is capable of parsing a 5
byte document in less than a gigasecond.

This was an interesting exercise in writing a non-recursive parser for
a nested structure format. There's a lot of very slightly different
repetition, which I'm not wild about, but it can handle deeply nested
documents. I tested with a 50 mb indented list tree document (10_000
lines of nesting) and a ReleaseFast build was able to parse it in
approximately 50 ms with a peak memory footprint of about 100 MB (of
which, half was the contents of the document itself, as the file is
read into a single allocated buffer that does not get freed until
program exit). I don't consider myself to be someone who writes high
performance software, but I think those results are quite acceptable,
and I doubt any recursive implementation would even be able to parse
that document at all (the python NestedText implementation smashes
directly into a RecursionError, unsurprisingly).

Anyway, let's call this a success. I will actually probably export this
to a separate project soon. The main problem is coming up with a name.
I also strongly suspect there are some lurking bugs still, and I think
I do want to add nested inline map/list support (and also parsing
directly into objects).
2023-09-14 23:42:31 -07:00
bff1fa95c9
ignominious dump of creation
it's beautiful. and broken
2023-09-13 00:11:45 -07:00