47 Commits

33ab092a06
value: store strings/scalars as null-terminated
Since these were already always copied from the source data, this was a
very easy change to implement. This makes our output schema string
detection a bit stricter, and saves a copy in the case that the output
string needs to be null-terminated.

Unfortunately, we can't skip copies in the general slice case since
each child element needs to get converted to the appropriate type.
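
As a sketch of the idea (illustrative, not the library's actual code):
copying with the sentinel up front means the stored value already
carries its 0 terminator, and it still coerces to a plain slice for
free.

    const std = @import("std");

    test "sentinel-terminated storage" {
        const allocator = std.testing.allocator;
        // dupeZ copies the bytes and appends the 0 terminator
        const stored: [:0]const u8 = try allocator.dupeZ(u8, "hello");
        defer allocator.free(stored);
        // the sentinel slot lives just past the end of the slice
        try std.testing.expectEqual(@as(u8, 0), stored[stored.len]);
        // a null-terminated slice coerces to a plain slice at no cost
        const plain: []const u8 = stored;
        try std.testing.expectEqualStrings("hello", plain);
    }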
2023-11-23 17:52:38 -08:00
21a9753d46
parser: change omitted value behavior to work with all default values
Special-casing optional values was a little odd before. Now, the user
can supply a default value for any field that may be omitted from the
serialized data. This matches the behavior of the stdlib JSON parser.
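
For illustration, with a hypothetical target type like:

    const Config = struct {
        name: []const u8,
        // previously only optional fields could be omitted; now any
        // field with a default can be left out of the document
        retries: u32 = 3,
        verbose: ?bool = null,
    };

a document containing only `name: server` parses successfully, with
`retries` falling back to 3 and `verbose` to null.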
2023-11-23 17:47:21 -08:00
e8ddee5ab2
examples.reify: implement updated union/enum semantics 2023-11-06 20:45:04 -08:00
2f90ccba6f
parser: accept 0-size tagged union values as scalars
Given the type:

    union(enum) {
        none: void,
        any: []const u8,
    };

Previously your document would have had to be

    .none:

But now this can also be parsed as the simple scalar

    .none

This is much nicer if the tagged union is a member of a larger type,
like a struct, since the value can be specified in-line without
needing to create a map.

    my_union: .none

Whereas previously this would have had to be (this style is still
supported):

    my_union: { .none: }

or

    my_union:
        .none:
2023-11-06 20:45:04 -08:00
d6e1e85ea1
parser: make tagged union field names respect expect_enum_dot
It's possible that this change may get reverted in the future, but I
think it makes things more consistent and has some other minor
benefits, so it probably won't be.

Consistency: tagged union fields are enum members by definition in Zig,
so this makes them act like enumerations that accept values, which is
really how tagged unions work in Zig.

Other benefits: tagged unions do not behave like structs, and having
their key start with a leading . helps to distinguish them visually.
You could say that it makes communicating intent more precise.

Here's an example: by default, given the following type:

    union(enum) {
        any: []const u8,
        int: i32,
    };

A corresponding nice document would now look like:

    .int: 42069

Whereas it used to be:

    int: 42069

My only concern here is that this potentially makes the serialization
noisier. But if so that's true of the enum handling, too.
2023-11-06 20:43:21 -08:00
ed913ab3a3
state: properly update key order when preserving the last key
Since I decided that Nice would guarantee (for some definition of
guarantee) preserving the order of keys in a document, this has some
impact on the parsing modes that tolerate duplicate keys. In the case
that the last instance of a duplicate key is the one that is
preserved, its order should be reflected.

In general, however, it's recommended not to permit duplicate keys,
which is why that's the default behavior.
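
For example (as I understand the intended behavior), parsing this
document in a last-instance-wins mode:

    key:   first
    other: value
    key:   last

yields a map that iterates as `other`, then `key` (with value "last"),
since the surviving instance of `key` appeared after `other`.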
2023-11-06 20:15:02 -08:00
73575a43a7
readme: basic editing pass
This still needs a lot of TLC to be actually, y'know, decent, but at
least it can become infinitesimally less bad.
2023-11-06 20:13:06 -08:00
1c5d7af552
parser: don't leak on parseTo error
A good idea.
2023-10-22 16:49:12 -07:00
f371aa281c
parser: default expect enum values with leading .
I prefer this, personally. And this is all about personal preference.
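
For example, given a hypothetical `enum { debug, info }`, a value is
now expected to be written as

    level: .info

rather than the bare `info`.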
2023-10-22 16:48:45 -07:00
ce65dee71f
parser: ostensibly fix sentinel handling
I guess arrays don't need special handling because their memory is
explicitly accounted for, but it would probably be good to check that
a sentinel-terminated array initialized as `undefined` does get the
correct sentinel value.
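
For reference, the concern is roughly this (a standalone sketch, not
the parser's code): the sentinel slot of an `undefined`
sentinel-terminated array exists in memory, but nothing has stored the
sentinel value in it yet.

    const std = @import("std");

    test "undefined sentinel array must have its sentinel written" {
        var buf: [5:0]u8 = undefined;
        @memcpy(&buf, "hello");
        // the sentinel is not set automatically; write it explicitly
        buf[buf.len] = 0;
        try std.testing.expectEqualStrings("hello", &buf);
    }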
2023-10-22 16:38:41 -07:00
f371f16e2f
slam dunk that minimum viable product vibe 2023-10-22 16:16:57 -07:00
f381edfff3
nice: chuck outdated format description comment 2023-10-22 15:36:50 -07:00
6d2c08878d
examples: add parsing to an object example 2023-10-22 15:36:34 -07:00
cca7d61666
readme: produce excessive verbiage
And I'm not even done yet. Man.
2023-10-19 21:44:05 -07:00
4690f0b808
parser: add option for case-insensitive scalar comparison
This does not support unicode case folding, which is very much a
sorry-not-sorry situation because unicode is a disgusting labyrinthine
chaotic hellformat. Actually, our unicode support isn't very good from
the standpoint that we don't do any form of normalization, so
specifying non-ASCII values for scalar comparisons is probably asking
for trouble.
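
The comparison is in the spirit of std.ascii's helper (a sketch of the
semantics, not the actual call site):

    const std = @import("std");

    test "ASCII-only case-insensitive comparison" {
        // ASCII case folding works fine
        try std.testing.expect(std.ascii.eqlIgnoreCase("TRUE", "true"));
        // but there is no unicode case folding or normalization
        try std.testing.expect(!std.ascii.eqlIgnoreCase("straße", "STRASSE"));
    }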
2023-10-18 21:34:07 -07:00
1f75ff6b8a
readme: continue reading me
I'm going to have to come up with a license for this code also.
2023-10-18 21:29:58 -07:00
c83558de3e
start adding the readme 2023-10-18 21:15:29 -07:00
4c966ca9d0
parser: reintroduce space strings and change token parsing strategy
Once again I have entangled two conceptually distinct changes into a
single commit because demuxing them from the diff is too much work.
Alas. Let's break it down.

The simpler part of this change is to reintroduce "space strings" with
a slightly fresh coat of paint. There are now 3 different types of
string leaders that can be used together:

    | directly concatenates this line with the previous line
    > prepends an LF character before concatenation
    + (NEW) prepends a single space character before concatenation

The `+` leader enables more æsthetic soft line wrapping than `|`
because it doesn't require leading or trailing whitespace to separate
words, as long as lines are broken at word boundaries. Perhaps this is
not as common a use case as I am making it out to be,
but I do like to hard wrap paragraphs in documents, so if anything,
it's a feature for me.
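
For example, this hard-wrapped paragraph (the first line's leader
contributes no character, per the prepend semantics):

    + soft wrapping without
    + leading or trailing
    + whitespace tricks

parses to the single line "soft wrapping without leading or trailing
whitespace tricks".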

As I was considering what character to use for this leader, I realized
that I wanted to be able to support numeric map keys, a la:

    -1: negative one
    0:  zero
    +1: positive one

But previously this would not parse correctly, as the tokenizer would
find `-` and expect it to be followed by a space to indicate a list
item (and the additional string leader would cause the same problem
with `+`). I wanted to support this use case, so the parser was
changed to take a second pass on lines starting with the string
leaders (`|`, `+`, and `>`) and the list item leader (`-`) if the
leader has a non-space character following it. Note that this does not
apply to the comment leader (`#` not followed by a space or a newline
is a tokenization error) or to the inline list/map leaders (since those
do not respect internal whitespace, there is no way to treat them
unambiguously).

To reduce the likelihood of confusing documents, scalars are no longer
allowed to occupy their own line (the exception to this is if the
document consists only of a scalar value). Inline lists and maps can
still occupy their own line, though I am considering changing this as
well to force them to truly be inline. I think this change makes
sense, as scalars are generally intended to represent an unbroken,
single-item serialization of some non-string value. In other words,

    # these two lines used to parse the same way
    key: 9001
    # but now the following line is a parse error due to the scalar
    # occupying its own line
    key:
        9001
    # also, this still works, but it may be changed to be an error in
    # the future
    key:
        [ 9, 0, 0, 1 ]

Inline maps have also been changed so that their keys can start with the
now-unforbidden string leaders and list item leader characters.
2023-10-18 21:15:29 -07:00
25386ac87a
rename flow_(list|map) to inline_(list|map)
This is simply better word choice.
2023-10-18 00:07:12 -07:00
8dd5463683
parser: change string and | semantics and expose slices in Value
The way I implemented these changes ended up being directly coupled and
I am not interested in trying to decouple them, so instead here's a
single commit that makes changes to both the API and the format. Let's
go over these.

| now acts as a direct concatenation operator, rather than
concatenating with a space. This is because the format allows the
specification of a trailing space (by using | to fence the string just
before the newline). So it's now possible to spread a long string
without spaces over multiple lines, which couldn't be done before.
This does have the downside that the common pattern of concatenating
strings with a space now requires some extra trailing line noise. I
may introduce a THIRD type of concatenating string (thinking of
using + as the prefix) because I am a jerk. We will see.
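
In the meantime, a long token can now be broken across lines without
acquiring interior spaces (an illustrative example):

  key:
    | https://example.com/a/very/long
    | /path/without/interior/spaces

parses `key` to the unbroken string
"https://example.com/a/very/long/path/without/interior/spaces".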

The way multi-line strings are concatenated has changed. Partially this
has to do with increasing the simplicity of the aforementioned
implementation change (the parser forgets the string type from the
tokenizer. This worked before because there would always be a trailing
character that could be popped off. But since one type now appends no
character, this would have to be tracked through the parsing to
determine if a character would need to be popped at the end). But I
was also not terribly satisfied with the semantics of multiline
strings before. I wrote several words about this in
429734e6e813b225654aa71c283f4a8b4444609f, where I reached the opposite
conclusion from what is implemented in this commit.

Basically, when different types of string concatenation are mixed, the
results may be surprising. The previous approach would append the line
terminator at the end of the line specified. The new approach prepends
the line terminator at the beginning of the line specified. Since the
specifier character is at the beginning of the line, I feel like this
reads a little better simply due to the colocation of information. As
an example:

  > first
  | second
  > third

Would previously have resulted in "first\nsecondthird" but it will now
result in "firstsecond\nthird". The only mildly baffling part about
this is that the string signifier on the first line has absolutely no
impact on the string. In the old design, it was the last line that had
no impact.

Finally, this commit also changes Value so that it uses []const u8
slices directly to store strings instead of ArrayLists. This is
because everything downstream of the value was just reaching into
string.items to access the slice directly, so cut out the middleman.
It was unintuitive to access a field named .string and get an
arraylist rather than a slice, anyway.
2023-10-08 16:57:52 -07:00
7db6094dd5
state/tokenizer: go completely the opposite direction re: whitespace
This commit makes both the parser and tokenizer a lot more willing to
accept whitespace in places where it would previously cause strange
behavior. Also, whitespace is ignored preceding and following all
values and keys in flow-style objects now (in regular objects,
trailing whitespace is an error, and it is also an error for non-flow
map keys to have whitespace before the colon). Tabs are no longer
allowed as whitespace in the line. They can appear inside scalar values,
though, including map keys. Strings also allow tabs inside them.

The primary motivation here is to apply the principle of least
astonishment. For example, the following

  -  [hello, there]

would previously have been parsed as the scalar " [hello, there]" due
to the presence of an additional space after the "-" list item
indicator. This obviously looks like a flow list, and the way it was
previously parsed was very visually confusing (this change does mean
that scalars cannot start with [, but strings can, so this is not a
real limitation). Note that strings still allow leading whitespace, so

  >  hello

will produce the string " hello" due to the additional space after the
string designator. For flow lists,

  [ a, b ]

would have been parsed as ["a", "b "], which was obviously confusing.
The previous commit fixed this by making whitespace rules more strict.
This commit fixes this by making whitespace rules more relaxed. In
particular, all whitespace preceding and following flow items is now
stripped. The main motivation for going in this direction is to allow
aligning list items over multiple lines, visually, which can make data
much easier to read for people, an explicit design goal. For example

  key:   [  1,  2,  3 ]
  other: [ 10, 20, 30 ]

is now allowed. The indentation rules do not allow right-aligning
"key" to "other", but I think that is acceptable (if we forced using
tabs for indentation, we could actually allow this, which I think is
worth consideration, at least). Flow maps are more generous:

  foo:  {  bar:  baz }
  fooq: { barq: bazq }

is allowed because flow maps do not use whitespace as a structural
designator. These changes do affect how some things can be
represented. Scalar values can no longer contain leading or trailing
whitespace (previously they could contain leading whitespace). Map keys
cannot contain trailing whitespace (they could before; this also means
that keys consisting only of whitespace cannot be represented at all).
Ultimately, given the other restrictions the format imposes on keys
and values, I find these to be acceptable and consistent with the goal
of the format.
2023-10-04 22:54:53 -07:00
1683197bc0
state: parse whitespace in flow objects a bit differently
There were (and probably still are) some weird and ugly edge cases
here. For example, `[ 1 ]` would parse to a list containing the scalar
`1 `, trailing space included. This
implementation allows a single space to precede the closing ] and
errors out if there is more than one. Additionally, it rejects any
spaces before the item separator comma. This also applies to flow
maps, with the addition that they do not permit whitespace before `:`
now, either.

Leading spaces are still consumed with reckless abandon, so, for
example, `[   lopsided]` is valid. There is also some state sloppiness
flying around so `[   val,    ]` probably currently works as well.
Tightening up the handling of leading whitespace will be a bigger
restructuring that may involve state machine changes. I'll have to
think about it.
2023-10-03 23:25:58 -07:00
c5e8921eb2
state: use inferred error sets
As far as I can tell, the only reason ever not to use an inferred error
set is when you would get a dependency loop otherwise.
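
That is, the preference is now (a generic illustration, not code from
this repo):

    const std = @import("std");

    // with `!u8`, the compiler infers the error set from the function
    // body, so adding a new failure mode never requires updating a
    // hand-maintained error set declaration
    fn firstByte(buf: []const u8) !u8 {
        if (buf.len == 0) return error.EmptyBuffer;
        return buf[0];
    }

    test "inferred error set" {
        try std.testing.expectEqual(@as(u8, 'a'), try firstByte("abc"));
        try std.testing.expectError(error.EmptyBuffer, firstByte(""));
    }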
2023-10-03 23:19:01 -07:00
34ec58e0d2
value: implement parsing to objects
There are still some untested codepaths here, but this does seem to
work for nontrivial objects, so, woohoo. It's worth noting that this
is a recursive implementation (which seems silly after I hand-rolled
the non-recursive main parser). The thinking is that if you have a
deeply-enough nested object that you run out of stack space here, you
probably shouldn't be converting it directly to an object.

I may revisit this, though I am still not 100% certain how
straightforward it would be to make this nonrecursive with all the
weird comptime objects. Basically the "parse stack" would have to be
created at comptime.
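
The recursive shape is roughly this (a heavily simplified sketch with
a stand-in Value; the real implementation handles many more type
categories):

    const std = @import("std");

    // stand-in Value for illustration; the real one lives in this repo
    const Value = union(enum) {
        scalar: []const u8,
        map: std.StringArrayHashMap(Value),
    };

    fn convert(comptime T: type, value: Value) !T {
        return switch (@typeInfo(T)) {
            .Int => try std.fmt.parseInt(T, value.scalar, 0),
            .Struct => |info| blk: {
                var result: T = undefined;
                // comptime-unrolled loop; the recursion on each field
                // type is what consumes stack for deeply nested objects
                inline for (info.fields) |field| {
                    const child = value.map.get(field.name) orelse
                        return error.MissingField;
                    @field(result, field.name) = try convert(field.type, child);
                }
                break :blk result;
            },
            else => error.UnsupportedType,
        };
    }

    test "convert a small struct" {
        var map = std.StringArrayHashMap(Value).init(std.testing.allocator);
        defer map.deinit();
        try map.put("count", .{ .scalar = "42" });
        const result = try convert(struct { count: u32 }, .{ .map = map });
        try std.testing.expectEqual(@as(u32, 42), result.count);
    }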
2023-10-03 23:17:37 -07:00
0028092a4e
parser: in theory, hook up the rest of the diagnostics
In practice, there are probably still things I missed here, and I
should audit this to make sure there aren't any egregious copy paste
errors remaining. Also, it's pretty likely that the diagnostics
line_offset field isn't correct in most of these messages. More work
will need to be done to update that correctly.
2023-10-01 21:15:21 -07:00
01f98f9aff
parser: start the arduous journey of hooking up diagnostics
The errors in the line buffer and tokenizer now have diagnostics. The
line number is trivial to keep track of due to the line buffer, but
the column index requires quite a bit of juggling, as we pass
successively trimmed down buffers to the internals of the parser.
There will probably be some column index counting problems in the
future. Also, handling the diagnostics is a bit awkward, since it's a
mandatory out-parameter of the parse functions now. The user must
provide a valid diagnostics object that survives for the life of the
parser.
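
Usage is along these lines (stand-in names and shapes, not the real
API):

    const std = @import("std");

    // illustrative stand-ins only
    const Diagnostics = struct { line: usize = 0, message: []const u8 = "" };

    fn parse(source: []const u8, diagnostics: *Diagnostics) !void {
        // the parser fills the out-parameter before returning an error
        if (source.len == 0) {
            diagnostics.* = .{ .line = 1, .message = "empty document" };
            return error.ParseError;
        }
    }

    test "diagnostics must outlive the parse call" {
        var diag = Diagnostics{};
        parse("", &diag) catch {
            try std.testing.expectEqual(@as(usize, 1), diag.line);
        };
    }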
2023-09-27 23:44:06 -07:00
3258e7fdb5
tokenizer: add finish function to check if there is trailing data
Since the tokenizer is decoupled from the parser, there's no other good
way to do this. Also, without attempting to parse the last line, it's
impossible to say whether it is junk data or simply a missing trailing
newline.
2023-09-27 23:35:24 -07:00
0e60719c85
linebuffer: add strictness options
When the buffer was separated from the tokenizer, we lost some
validation, including really aggressive carriage return detection.
This brings this back in full force and adds some additional
validation on top of it.
2023-09-26 00:06:39 -07:00
7f82c24584
parser: implement streaming parser
With my pathological 50MiB 10_000 line nested list test, this is
definitely slower than the one-shot parser, but it has peak memory
usage of 5MiB compared to the 120MiB of the one-shot parsing. Not bad.
Obviously this result is largely dependent on the fact that this
particular benchmark is 99% whitespace, which does not get copied into
the resulting document. A (significantly) smaller improvement will be
observed in files that are mostly data with little indentation or
empty lines.

But a win is a win.
2023-09-25 01:18:09 -07:00
5037f69fbe
examples: add some sample documents to parse against 2023-09-24 22:25:22 -07:00
1d65b072ee
parser: stateful reentrancy
Finally, the flow parser has been "integrated" with the main parser in
that they now share a stack. The bigger change is that parsing has been
decoupled from tokenization, which will allow parsing documents without
loading them fully into memory first.

I've been calling this the streaming parser, but it's worth noting that
I am referring to streaming input, not streaming output. It would
certainly be possible to do streaming output, but I am not interested
in that at the moment (it would be the lowest-memory-overhead
approach, but it's a lot of work for little gain, and it is less
flexible for converting input to objects).
2023-09-24 22:24:33 -07:00
38e47b39dc
all: do some restructuring
I don't like big monolithic source files, so let's restructure a bit.
parser.zig is still bigger than I would like it to be, but there isn't
a good way to break up the two state machine parsers, which take up
most of the space. This is the last junk commit before I am seriously
going to implement the "streaming" parser. Which is the last change
before implementing deserialization to object. I am definitely not
just spinning my wheels here.
2023-09-24 18:22:12 -07:00
8684fab23c
build: add oneshot parsing example 2023-09-24 15:14:58 -07:00
54e4a14e38
config: item start does not need to be stored at every stack level
This is a simplification, but the main motivation is that the flow
parser stack can be integrated with the main parser stack because they
are not disparate types any more.
2023-09-24 14:58:31 -07:00
dcd33bdf27
config: catch some missing key copies
For inline key items, the key memory wasn't getting copied. Now it
is.
2023-09-24 14:58:31 -07:00
3131a9d5fd
config: migrate flow parser into the main parser object
I think I am actually going to make this a method of the ParserState
struct soon so lol check out my freaking code churn. But here we are.
2023-09-24 14:58:31 -07:00
465d21eaae
config: remove some duplication in the parser
There's still a fair amount lurking in here, but I believe this logic
is sound. Rather than duplicating the map/list logic under the
opposing key, we set the logic up to use the second trip around the loop
(this was how dedents worked, and now it also works for indents).

I'm not convinced this is as easy to follow, and it did lead me to add
some additional unreachables to the code, which should maybe be turned
into error returns instead. It does reduce the odds of a code change
missing a copied instance, which I think is a good thing.
2023-09-24 14:58:31 -07:00
4fe340ea9b
config: dupe map keys
I didn't do an exhaustive search, but it seems that the managed
hashmaps only allocate space for the structure of the map itself, not
its keys or values. This mostly makes sense, but it also means that
this was only working due to the fact that I am currently not freeing
the input buffer until after iterating through the parse result.

Looking through this, I'm also reasonably surprised by how many times
this is assigned in the normal parsing vs the flow parsing. There is a
lot more repetition in the code of the normal parser, I think because
it does not have a granular state machine. It may be worth revisiting
the structure to see if a more detailed state machine, like the one
used for parsing the flow-style objects, would reduce the amount of
code repetition here. I suspect it certainly could be better than it
currently is, since it seems unlikely that there really are four
different scenarios where we need to be parsing a dictionary key.
Taking a quick glance at it, it looks like I could be taking better
advantage of the flipflop loop on indent as well as dedent. This might
be a bit less efficient due to essentially being less loop unrolling,
but it would also potentially make more maintainable code by having
less manual repetition.
2023-09-24 14:58:31 -07:00
a9d179acc1
config: use std.StringArrayHashMap for the map type
As I was thinking about this, I realized that data serialization is
much more of a bear than deserialization. Or, more accurately, trying
to make stable round trip serialization a goal puts heavier demands on
deserialization, including preserving input order.
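
The relevant property (this part is just the standard library's
documented behavior):

    const std = @import("std");

    test "StringArrayHashMap preserves insertion order" {
        var map = std.StringArrayHashMap(u32).init(std.testing.allocator);
        defer map.deinit();
        try map.put("zebra", 1);
        try map.put("apple", 2);
        // unlike std.StringHashMap, iteration follows insertion order
        try std.testing.expectEqualStrings("zebra", map.keys()[0]);
        try std.testing.expectEqualStrings("apple", map.keys()[1]);
    }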

I think there may be a mountain hiding under this molehill, though,
because the goals of having a format that is designed to be
handwritten and also machine written are at odds with each other.
Right now, the parser does not preserve comments at all. But even if
we did (they could easily become a special type of string), comment
indentation is ignored. Comments are not directly a child of any other
part of the document; they're awkward text that exists interspersed
throughout it.

With the current design, there are some essentially unsolvable
problems, like comments interspersed throughout multiline strings. The
string is processed into a single object in the output, so there can't
be weird magic data interleaved with it because it loses the concept
of being interleaved entirely (this is a bigger issue for space
strings, which don't even preserve a unique way to reserialize them.
Line strings at least contain a character (the newline) that can
appear nowhere else but at a break in the string). Obviously this isn't
technically impossible, but it would require a change to the way that
values are modeled.

And even if we did take the approach of associating a comment with,
say, the value that follows it (which I think is a reasonable thing to
do, ignoring the interleaved comment situation described above), if
software reads in data, changes it, and writes it back out, how do we
account for deleted items? Does the comment get deleted with the item?
Does it become a dangling comment that just gets shoved somewhere in
the document? How are comments that come after everything else in the
document handled?

From a pure data perspective, it's fairly obvious why JSON omits
comments: they're trivial to parse, but there's not a strategy for
emitting them that will always be correct, especially in a format that
doesn't give a hoot about linebreaks. It may be interesting to look at
fancy TOML (barf) parsers to see how they handle comments, though I
assume the general technique is to store their row position in the
original document and track when a line is added or removed.

Ultimately, I think the use case of a format to be written by humans
and read by computers is still useful. That's my intended use case for
this and why I started it, but its application as a configuration file
format is probably hamstrung muchly by software not being able to
write it back. On the other hand, there's a lot of successful software
I use where the config files are not written directly by the software
at all, so maybe it's entirely fine to declare this as being out of
scope and not worry about it further. At the very least it's almost
certainly less of an issue than erroring on carriage returns, or than
the fact that certain keys are simply unrepresentable.

As a side note, I guess what they say about commit message length being
inversely proportional to the change length is true. Hope you enjoyed
the blog over this 5 character change.
2023-09-24 14:58:31 -07:00
a0107ab9fd
config: refactor LineTokenizer to use an internal line buffer
The goal here is to support a streaming parser. However, I did decide
to leave the flow item parser state machine as fully buffered
(i.e. not streaming). This is not JSON and in general documents should
be many, shorter lines, so this buffering strategy should work
reasonably well. I have not actually tried the streaming
implementation of this, yet.
2023-09-24 14:58:31 -07:00
b08d712616
config: differentiate fields in Value
This makes handling Value very slightly more work, but it provides
useful metadata that can be used to perform better conversion and
serialization.

The motivation behind the "scalar" type is that in general, only
scalars can be coerced to other types. For example, a scalar `null`
and a string `> null` have the same in-memory representation. If they
are treated identically, this precludes unambiguously converting an
optional string whose contents are "null". With the two disambiguated,
we can choose to convert `null` to the null object and `> null` to a
string of contents "null". This ambiguity does not necessarily exist for
the standard boolean values `true` and `false`, but it does allow the
conversion to be more strict, and it will theoretically result in
documents that read more naturally.

The motivation behind exposing flow_list and flow_map is that it will
allow preserving document formatting round trip (well, this isn't
strictly true: single-line explicit strings remember neither whether
they were line strings or space strings, nor whether they were
indented. However, that is much less information to lose).
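
Concretely, Value now distinguishes roughly these variants (an
illustrative shape, not the exact declaration):

    const std = @import("std");

    pub const Value = union(enum) {
        pub const List = std.ArrayList(Value);
        pub const Map = std.StringArrayHashMap(Value);

        scalar: []const u8,
        string: []const u8,
        list: List,
        flow_list: List,
        map: Map,
        flow_map: Map,
    };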

The following formulations will parse to the same indistinguishable
value:

  key: > value
  key:
    > value
  key: | value
  key:
    | value

I think that's okay. It's a lot easier to choose a canonical form for
this case than it is for a map/list without any hints regarding its
origin.
2023-09-24 14:58:31 -07:00
73f1d9b21b
config: start doing some code cleanup
I was pretty sloppy with the code organization while writing out the
state machines because my focus was on thinking through the parsing
process and logic there. However, The code was not in good shape to
continue implementing code features (not document features). This is
the first of probably several commits that will work on cleaning up
some things.

Value has been promoted to the top level namespace, and Document has an
initializer function. Referencing Value.List and Value.Map is much
cleaner now. Type aliases are good.

For the flow parser, `popStack` does not have to access anything except
the current stack. This can be passed in as a parameter. This means
that `parse` is ready to be refactored to take a buffer and an
allocator.

The main next steps for code improvement are:

1. reentrant/streaming parser. I am planning to leave it as
   line-buffered, though I could go further. Line-buffered has two main
   benefits: the tokenizer doesn't need to be refactored significantly,
   and the flow parser doesn't need to be made reentrant. I may
   reevaluate this as I am implementing it, however, as those changes
   may be simpler than I think.

2. Actually implement the error diagnostics info. I have some skeleton
   structure in place for this, so it should just be doing the work of
   getting it hooked up.

3. Parse into object. Metaprogramming, let's go. It will be interesting
   to try to do this non-recursively, as well (curious to see if it
   results in code bloat).

4. Object to Document. This is probably going to be annoying, since
   there are a variety of edge cases that will have to be handled. And
   lots of objects that cannot be represented as documents.

5. Serialize Document. One thing the parser does not preserve is
   whether a Value was flow-style or not, so it will be impossible to
   do round-trip formatting preservation. That's currently a non-goal,
   and I haven't decided yet if flow-style output should be based on
   some heuristic (number/length of values in container) or just never
   emitted. Lack of round-trip preservation does make using this as a
   general purpose config format a lot more dubious, so I will have to
   think about this some more.

6. Document to JSON. Why not? I will hand roll this and it will suck.

And then everything will be perfect and never need to be touched again.
2023-09-24 14:58:31 -07:00
429734e6e8
config: add terminated strings
This was the final feature I wanted to add to the format. Also some
other things have been cleaned up a little bit (for example, the
inline parser does not need the dangling key to be attached to each
stack level, just as the normal parser doesn't). There was also an
off-by-one error that bugged out detecting the pathological case of a
flow list consisting of only an empty string (`[ ]`, not to be
mistaken for the empty list `[]`).

Mixed multiline strings are a bit confusing but internally consistent.

    > what character does this string end with?
    |

ends with a newline character because that's the style of the
second-to-last line. However, seeing | last makes my brain think it
should end with a space. The reason it ends with a newline is because
our concatenation strategy consists of appending to the string early
(as soon as a line is added) rather than lazily. This is a tradeoff,
though. While lazy appending would make this result more intuitive
(the string would end with a space) and it would allow us to remove
the self-proclaimed cheesy hack, it would make the opposite boundary
condition confusing:

    >
    | what character does this string start with?

With lazy appending, this string would start with a space
(despite > making it look like it should have a leading newline).
While both of these are likely to be uncommon edge cases, it doesn't
seem we can have it both ways. Of the two options, I think the current
logic is a little bit more clear.
2023-09-24 14:58:31 -07:00
58c5d15fc3
config: allow nested flow structures
This was kind of a pain in the butt to implement because it basically
required a second full state machine parser (though this one is a bit
simpler since there are fewer possible value types). It seems likely to
me that I will probably shove this directly into the main parser
struct at some point in the near future.
2023-09-24 14:58:31 -07:00
a749d538fc
config: fix several things
There was no actual check that lines weren't being indented too far.
Inline strings weren't getting their trailing newline chopped.
Printing is still janky, but it's better than it was.
2023-09-24 14:58:31 -07:00
02e360f42d
mostly functioning config parser
This hand-rolled wonder of switch statements is capable of parsing a 5
byte document in less than a gigasecond.

This was an interesting exercise in writing a non-recursive parser for
a nested structure format. There's a lot of very slightly different
repetition, which I'm not wild about, but it can handle deeply nested
documents. I tested with a 50 MB indented list tree document (10_000
lines of nesting) and a ReleaseFast build was able to parse it in
approximately 50 ms with a peak memory footprint of about 100 MB (of
which, half was the contents of the document itself, as the file is
read into a single allocated buffer that does not get freed until
program exit). I don't consider myself to be someone who writes high
performance software, but I think those results are quite acceptable,
and I doubt any recursive implementation would even be able to parse
that document at all (the python NestedText implementation smashes
directly into a RecursionError, unsurprisingly).

Anyway, let's call this a success. I will actually probably export this
to a separate project soon. The main problem is coming up with a name.
I also strongly suspect there are some lurking bugs still, and I think
I do want to add nested inline map/list support (and also parsing
directly into objects).
2023-09-24 14:58:31 -07:00
3086022f8d
create something that doesn't work 2023-09-24 14:58:31 -07:00