Compare commits


2 Commits

Author SHA1 Message Date
314969ec92
config: dupe map keys
I didn't do an exhaustive search, but it seems that the managed
hashmaps only allocate space for the structure of the map itself, not
for its keys or values. This mostly makes sense, but it also means
that this was only working because I am currently not freeing the
input buffer until after iterating through the parse result.
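
Roughly, the footgun looks like this (a standalone sketch rather than
the parser code; the test name and key are invented):

    const std = @import("std");

    test "the map does not copy its keys" {
        const alloc = std.testing.allocator;

        var map = std.StringArrayHashMap(u32).init(alloc);
        defer map.deinit();

        // Pretend this is the input buffer that gets freed after parsing.
        const scratch = try alloc.dupe(u8, "some.key");

        // Putting `scratch` in directly and then freeing it would leave
        // the map holding a dangling slice. Duping the key first (into an
        // arena, in the parser's case) is what this change does.
        const owned = try alloc.dupe(u8, scratch);
        try map.put(owned, 1);
        alloc.free(scratch);

        try std.testing.expectEqual(@as(u32, 1), map.get("some.key").?);
        alloc.free(owned);
    }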

Looking through this, I'm also reasonably surprised by how many times
dangling_key gets assigned in the normal parser versus the flow
parser. There is a lot more repetition in the code of the normal
parser, I think because it does not have a granular state machine. It
may be worth revisiting the structure to see if a more detailed state
machine, like the one used for parsing the flow-style objects, would
reduce the amount of code repetition here. I suspect it could
certainly be better than it currently is, since it seems unlikely that
there really are four different scenarios where we need to be parsing
a dictionary key. Taking a quick glance at it, it also looks like I
could be taking better advantage of the flipflop loop on indent as
well as dedent. This might be a bit less efficient, since it
effectively does less loop unrolling, but it would also make for more
maintainable code with less manual repetition.
2023-09-22 00:48:17 -07:00
1113550b5f
config: use std.StringArrayHashMap for the map type
As I was thinking about this, I realized that data serialization is
much more of a bear than deserialization. Or, more accurately, trying
to make stable round trip serialization a goal puts heavier demands on
deserialization, including preserving input order.
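
That order preservation is the whole reason for the swap here; a
minimal standalone sketch (test name and keys invented for
illustration):

    const std = @import("std");

    test "StringArrayHashMap keeps insertion order" {
        var map = std.StringArrayHashMap(u32).init(std.testing.allocator);
        defer map.deinit();

        try map.put("first", 1);
        try map.put("second", 2);
        try map.put("third", 3);

        // keys() walks the entries in the order they were inserted, so a
        // serializer can emit the document in its original key order.
        // Plain StringHashMap makes no such guarantee.
        const keys = map.keys();
        try std.testing.expectEqualStrings("first", keys[0]);
        try std.testing.expectEqualStrings("second", keys[1]);
        try std.testing.expectEqualStrings("third", keys[2]);
    }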

I think there may be a mountain hiding under this molehill, though,
because the goals of having a format that is designed to be
handwritten and also machine written are at odds with each other.
Right now, the parser does not preserve comments at all. But even if
it did (they could easily become a special type of string), comment
indentation is ignored. Comments are not directly a child of any other
part of the document; they're awkward text that exists interspersed
throughout it.
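
For what it's worth, the "special type of string" idea could be as
small as one more tag on the Value union. This is a purely
hypothetical shape; the field names are invented and don't match the
real parser:

    const std = @import("std");

    // Hypothetical sketch only: a comment would just be one more
    // string-carrying tag alongside the existing value kinds.
    pub const Value = union(enum) {
        pub const String = std.ArrayList(u8);

        string: String,
        comment: String, // comment text, indentation already dropped
        list: std.ArrayList(Value),
        map: std.StringArrayHashMap(Value),
    };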

With the current design, there are some essentially unsolvable
problems, like comments interspersed throughout multiline strings. The
string is processed into a single object in the output, so there can't
be weird magic data interleaved with it, because the output loses the
concept of interleaving entirely (this is a bigger issue for space
strings, which don't even preserve a unique way to reserialize them;
line strings at least contain a character, the newline, that can
appear nowhere else but at a break in the string). Obviously this
isn't technically impossible, but it would require a change to the way
that values are modeled.

And even if we did take the approach of associating a comment with,
say, the value that follows it (which I think is a reasonable thing to
do, ignoring the interleaved comment situation described above), if
software reads in data, changes it, and writes it back out, how do we
account for deleted items? Does the comment get deleted with the item?
Does it become a dangling comment that just gets shoved somewhere in
the document? How are comments that come after everything else in the
document handled?

From a pure data perspective, it's fairly obvious why JSON omits
comments: they're trivial to parse, but there's no strategy for
emitting them that will always be correct, especially in a format that
doesn't give a hoot about linebreaks. It may be interesting to look at
fancy TOML (barf) parsers to see how they handle comments, though I
assume the general technique is to store their row position in the
original document and track when a line is added or removed.
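
If someone did go the row-tracking route, the bookkeeping might look
roughly like this (entirely hypothetical; not code from this repo or
from any actual TOML parser):

    // A comment remembers which line of the original document it came from.
    const Comment = struct {
        row: usize,
        text: []const u8,
    };

    // When an edit inserts (positive delta) or deletes (negative delta)
    // lines at `at_row`, every comment at or below that point slides
    // with the edit.
    fn shiftComments(comments: []Comment, at_row: usize, delta: isize) void {
        for (comments) |*c| {
            if (c.row < at_row) continue;
            const shifted = @as(isize, @intCast(c.row)) + delta;
            c.row = if (shifted > 0) @as(usize, @intCast(shifted)) else 0;
        }
    }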

Ultimately, I think the use case of a format written by humans and
read by computers is still useful. That's my intended use case for
this and why I started it, but its application as a configuration file
format is probably hamstrung quite a bit by software not being able to
write it back. On the other hand, there's a lot of successful software
I use where the config files are not written directly by the software
at all, so maybe it's entirely fine to declare this as out of scope
and not worry about it further. At the very least, it's almost
certainly less of an issue than erroring on carriage returns, or the
fact that certain keys are simply unrepresentable.

As a side note, I guess what they say about commit message length
being inversely proportional to the change length is true. Hope you
enjoyed the blog over this 5-character change.
2023-09-22 00:48:17 -07:00


@@ -430,7 +430,7 @@ pub fn LineTokenizer(comptime Buffer: type) type {
 pub const Value = union(enum) {
     pub const String = std.ArrayList(u8);
-    pub const Map = std.StringHashMap(Value);
+    pub const Map = std.StringArrayHashMap(Value);
     pub const List = std.ArrayList(Value);
     pub const TagType = @typeInfo(Value).Union.tag_type.?;
@@ -727,7 +727,7 @@ pub const Parser = struct {
             // key somewhere until we can consume the
             // value. More parser state to lug along.
-            dangling_key = pair.key;
+            dangling_key = try arena_alloc.dupe(u8, pair.key);
             state = .value;
         },
         .scalar => |str| {
@@ -897,7 +897,7 @@ pub const Parser = struct {
     switch (pair.val) {
         .empty => {
-            dangling_key = pair.key;
+            dangling_key = try arena_alloc.dupe(u8, pair.key);
             expect_shift = .indent;
         },
         .scalar => |str| try new_map.map.put(pair.key, try Value.fromScalar(arena_alloc, str)),
@@ -995,7 +995,7 @@ pub const Parser = struct {
     .none, .dedent => switch (pair.val) {
         .empty => {
             expect_shift = .indent;
-            dangling_key = pair.key;
+            dangling_key = try arena_alloc.dupe(u8, pair.key);
         },
         .scalar => |str| try putMap(map, pair.key, try Value.fromScalar(arena_alloc, str), self.dupe_behavior),
         .line_string, .space_string => |str| try putMap(map, pair.key, try Value.fromString(arena_alloc, str), self.dupe_behavior),
@@ -1013,7 +1013,7 @@ pub const Parser = struct {
     switch (pair.val) {
         .empty => {
             expect_shift = .indent;
-            dangling_key = pair.key;
+            dangling_key = try arena_alloc.dupe(u8, pair.key);
         },
         .scalar => |str| try new_map.map.put(pair.key, try Value.fromScalar(arena_alloc, str)),
         .line_string, .space_string => |str| try new_map.map.put(pair.key, try Value.fromString(arena_alloc, str)),
@@ -1334,7 +1334,7 @@ pub const FlowParser = struct {
     .consuming_map_key => switch (char) {
         ':' => {
             const tip = try getStackTip(self.stack);
-            dangling_key = self.buffer[tip.item_start..idx];
+            dangling_key = try self.alloc.dupe(u8, self.buffer[tip.item_start..idx]);
             self.state = .want_map_value;
         },