nice-data

Author	SHA1	Message	Date
torque	e562e30e5e	grammar, spelling	2024-06-18 18:33:57 -07:00
torque	8aaceba484	parser.value.convertTo: add field converter concept It is convenient to be able to have custom logic for a specific field on a given struct without having to write a function to manually reify the whole thing from scratch.	2024-06-18 18:32:22 -07:00
torque	c74d615131	parser.value.convertTo: simplify struct field usage This avoids having to clone the map while maintaining the same conversion strictness.	2024-06-18 18:32:22 -07:00
torque	8ccb2c3a66	build: update for zig-0.13	2024-06-18 18:24:19 -07:00
torque	ad73ea6508	build: update for 0.12.0-dev.2208+4debd4338 I am hoping that by starting to roll over to zig 0.12 now it will be easier to migrate when the release actually happens. Unfortunately, the build system API changed fairly significantly and supporting both 0.11 and 0.12-dev is not very interesting.	2024-01-15 22:10:15 -08:00
torque	875b1b6344	start adding tests	2023-12-01 22:35:18 -08:00
torque	ea52c99fee	parser.Options: split truthy/falsy scalars into separate fields This makes overriding the defaults of just one of truthy or falsy more ergonomic. Previously, when overriding the truthy scalars, the user would also have to specify all of the falsy scalars as well.	2023-12-01 22:33:14 -08:00
torque	dbf2762982	parser: empty document should be scalar, not string I think I originally set this up before I had fully decided on the semantics of scalars vs strings. This option makes much more sense to me because it mirrors the empty value behavior of map keys. Without an introducer sequence, it can't be a string.	2023-12-01 22:31:30 -08:00
torque	0f4a9fcaa7	misc: commit things at random	2023-11-23 18:38:03 -08:00
torque	bd079b42d9	compile with zig master I was actually anticipating a bit more stdlib breakage than this, so I ended up just shimming it. Well, it works and also still works with 0.11.0, which is cool.	2023-11-23 18:37:19 -08:00
torque	bd0d74ee6a	examples.reify: add default value field	2023-11-23 17:56:27 -08:00
torque	2208079355	parser.Options: embellish expect_enum_dot description This affects tagged union parsing, and that should be mentioned here. So now it is.	2023-11-23 17:55:47 -08:00
torque	98eac68929	value: simplify list conversion code There was really no reason to use ArrayLists here when the list length is known ahead of time. This slightly shortens the code and should be slightly more memory/stack efficient.	2023-11-23 17:54:14 -08:00
torque	39619e7d6b	value: fix use of parseFloat	2023-11-23 17:52:38 -08:00
torque	33ab092a06	value: store strings/scalars as null-terminated Since these were already always copied from the source data, this was a very easy change to implement. This makes our output schema string detection a bit stricter, and saves performing a copy in the case that the output string needs to be 0 terminated. Unfortunately, we can't skip copies in the general slice case since each child element needs to get converted to the appropriate type.	2023-11-23 17:52:38 -08:00
torque	21a9753d46	parser: change omitted value behavior to work with all default values Special casing optional values was a little odd before. Now, the user can supply a default value for any field that may be omitted from the serialized data. This behaves the same way as the stdlib JSON parser as well.	2023-11-23 17:47:21 -08:00
torque	e8ddee5ab2	examples.reify: implement updated union/enum semantics	2023-11-06 20:45:04 -08:00
torque	2f90ccba6f	parser: accept 0-size tagged union values as scalars Given the type: union(enum) { none: void, any: []const u8, }; Previously your document would have had to be .none: But now this can also be parsed as the simple scalar .none This is much nicer if the tagged union is a member of a larger type, like a struct, since the value can be specified in-line without needing to create a map. my_union: .none Whereas previously this would have had to have been (this style is still supported): my_union: { .none: } or my_union: .none:	2023-11-06 20:45:04 -08:00
torque	d6e1e85ea1	parser: make tagged union field names respect expect_enum_dot It's possible that this change may get reverted in the future, but I think it makes things more consistent and has some other minor benefits, so it probably won't be. Consistency: tagged union fields are enum members by definition in zig, so it makes these act like enumerations that accept values, which is really how tagged unions work in zig. Other benefits: tagged unions do not behave like structs, and having their key start with a leading . helps to distinguish them visually. You could say that it makes communicating intent more precise. Here's an example: by default, given the following type: union(enum) { any: []const u8, int: i32, }; A corresponding nice document would now look like: .int: 42069 Whereas it used to be: int: 42069 My only concern here is that this potentially makes the serialization noisier. But if so that's true of the enum handling, too.	2023-11-06 20:43:21 -08:00
torque	ed913ab3a3	state: properly update key order when preserving the last key Since I decided that Nice would guarantee (for some definition of guarantee) preserving the order of keys in a document, this has some impact on the parsing modes that tolerate duplicate keys. In the case that the last instance of a duplicate key is the one that is preserved, its order should be reflected. In general, however, it's recommended not to permit duplicate keys, which is why that's the default behavior.	2023-11-06 20:15:02 -08:00
torque	73575a43a7	readme: basic editing pass This still needs a lot of TLC to be actually, y'know, decent, but at least it can become infinitesimally less bad.	2023-11-06 20:13:06 -08:00
torque	1c5d7af552	parser: don't leak on parseTo error A good idea.	2023-10-22 16:49:12 -07:00
torque	f371aa281c	parser: default expect enum values with leading `.` I prefer this, personally. And this is all about personal preference.	2023-10-22 16:48:45 -07:00
torque	ce65dee71f	parser: ostensibly fix sentinel handling I guess arrays don't need special handling because their memory is explicitly accounted for, but it would probably be good to check that a sentinel-terminated array initialized as `undefined` does get the correct sentinel value.	2023-10-22 16:38:41 -07:00
torque	f371f16e2f	slam dunk that minimum viable product vibe	2023-10-22 16:16:57 -07:00
torque	f381edfff3	nice: chuck outdated format description comment	2023-10-22 15:36:50 -07:00
torque	6d2c08878d	examples: add parsing to an object example	2023-10-22 15:36:34 -07:00
torque	cca7d61666	readme: produce excessive verbiage And I'm not even done yet. Man.	2023-10-19 21:44:05 -07:00
torque	4690f0b808	parser: add option for case-insensitive scalar comparison This does not support unicode case folding, which is very much a sorry-not-sorry situation because unicode is a disgusting labyrinthine chaotic hellformat. Actually, our unicode support isn't very good from the standpoint that we don't do any form of normalization, so specifying non-ASCII values for scalar comparisons is probably asking for trouble.	2023-10-18 21:34:07 -07:00
torque	1f75ff6b8a	readme: continue reading me I'm going to have to come up with a license for this code also.	2023-10-18 21:29:58 -07:00
torque	c83558de3e	start adding the readme	2023-10-18 21:15:29 -07:00
torque	4c966ca9d0	parser: reintroduce space strings and change token parsing strategy Once again I have entangled two conceptually distinct changes into a single commit because demuxing them from the diff is too much work. Alas. Let's break it down. The simpler part of this change is to reintroduce "space strings" with a slightly fresh coat of paint. We now have 3 different types of string leaders that can be used together. So we now have: | directly concatenates this line with the previous line > prepends an LF character before concatenation + (NEW) prepends a single space character before concatenation The `+` leader enables more æsthetic soft line wrapping than `|` because it doesn't require the use of leading or trailing whitespace to separate words, as long as lines are broken at word boundaries. Perhaps this is not as common a usecase as I am making it, but I do like to hard wrap paragraphs in documents, so if anything, it's a feature for me. As I was considering what character to use for this leader, I realized that I wanted to be able to support numeric map keys, a la: -1: negative one 0: zero +1: positive one But previously this would not parse correctly, as the tokenizer would find `-` and expect it to be followed by a space to indicate a list item (and the additional string leader would cause the same problem with `+`). I wanted to support this use case, so the parser was changed to take a second pass on lines starting with the string leaders (`|`, `+`, and `>`) and the list item leader (`-`) if the leader has a non-space character following it. Note that this does not apply to the comment leader (`#` not followed by a space or a newline is a tokenization error) or to the inline list/map leaders (since those do not respect internal whitespace, there is no way to treat them unambiguously).
To reduce the likelihood of confusing documents, scalars are no longer allowed to occupy their own line (the exception to this is if the document consists only of a scalar value). Inline lists and maps can still occupy their own line, though I am considering changing this as well to force them to truly be inline. I think this change makes sense, as scalars are generally intended to represent an unbroken single item serialization of some non-string value. In other words, # these two lines used to parse the same way key: 9001 # but now the following line is a parse error due to the scalar # occupying its own line key: 9001 # also, this still works, but it may be changed to be an error in # the future key: [ 9, 0, 0, 1 ] Inline maps have also been changed so that their keys can start with the now-unforbidden string leaders and list item leader characters.	2023-10-18 21:15:29 -07:00
torque	25386ac87a	rename flow_(list|map) to inline_(list|map) This is simply better word choice.	2023-10-18 00:07:12 -07:00
torque	8dd5463683	parser: change string and | semantics and expose slices in Value The way I implemented these changes ended up being directly coupled and I am not interested in trying to decouple them, so instead here's a single commit that makes changes to both the API and the format. Let's go over these. | now acts as a direct concatenation operator, rather than concatenating with a space. This is because the format allows the specification of a trailing space (by using | to fence the string just before the newline). So it's now possible to spread a long string without spaces over multiple lines, which couldn't be done before. This does have the downside that the common pattern of concatenating strings with a space now requires some extra trailing line noise. I may introduce a THIRD type of concatenating string (thinking of using + as the prefix) because I am a jerk. We will see. The way multi-line strings are concatenated has changed. Partially this has to do with increasing the simplicity of the aforementioned implementation change (the parser forgets the string type from the tokenizer. This worked before because there would always be a trailing character that could be popped off. But since one type now appends no character, this would have to be tracked through the parsing to determine if a character would need to be popped at the end). But I was also not terribly satisfied with the semantics of multiline strings before. I wrote several words about this in `429734e6e8`, where I reached the opposite conclusion from what is implemented in this commit. Basically, when different types of string concatenation are mixed, the results may be surprising. The previous approach would append the line terminator at the end of the line specified. The new approach prepends the line terminator at the beginning of the line specified.
Since the specifier character is at the beginning of the line, I feel like this reads a little better simply due to the colocation of information. As an example: > first | second > third Would previously have resulted in "first\nsecondthird" but it will now result in "firstsecond\nthird". The only mildly baffling part about this is that the string signifier on the first line has absolutely no impact on the string. In the old design, it was the last line that had no impact. Finally, this commit also changes Value so that it uses []const u8 slices directly to store strings instead of ArrayLists. This is because everything downstream of the value was just reaching into string.items to access the slice directly, so cut out the middleman. It was unintuitive to access a field named .string and get an arraylist rather than a slice, anyway.	2023-10-08 16:57:52 -07:00
torque	7db6094dd5	state/tokenizer: go completely the opposite direction re: whitespace This commit makes both the parser and tokenizer a lot more willing to accept whitespace in places where it would previously cause strange behavior. Also, whitespace is ignored preceding and following all values and keys in flow-style objects now (in regular objects, trailing whitespace is an error, and it is also an error for non-flow map keys to have whitespace before the colon). Tabs are no longer allowed as whitespace in the line. They can be inside scalar values, though, including map keys. Also strings allow tabs inside of them. The primary motivation here is to apply the principle of least astonishment. For example, the following - [hello, there] would previously have been parsed as the scalar " [hello, there]" due to the presence of an additional space after the "-" list item indicator. This obviously looks like a flow list, and the way it was previously parsed was very visually confusing (this change does mean that scalars cannot start with [, but strings can, so this is not a real limitation). Note that strings still allow leading whitespace, so > hello will produce the string " hello" due to the additional space after the string designator. For flow lists, [ a, b ] would have been parsed as ["a", "b "], which was obviously confusing. The previous commit fixed this by making whitespace rules more strict. This commit fixes this by making whitespace rules more relaxed. In particular, all whitespace preceding and following flow items is now stripped. The main motivation for going in this direction is to allow aligning list items over multiple lines, visually, which can make data much easier to read for people, an explicit design goal. For example key: [ 1, 2, 3 ] other: [ 10, 20, 30 ] is now allowed.
The indentation rules do not allow right-aligning "key" to "other", but I think that is acceptable (if we forced using tabs for indentation, we could actually allow this, which I think is worth consideration, at least). Flow maps are more generous: foo: { bar: baz } fooq: { barq: bazq } is allowed because flow maps do not use whitespace as a structural designator. These changes do affect how some things can be represented. Scalar values can no longer contain leading or trailing whitespace (previously they could contain leading whitespace). Map keys cannot contain trailing whitespace (they could before. This also means that keys consisting of whitespace cannot be represented at all). Ultimately, given the other restrictions the format imposes on keys and values, I find these to be acceptable and consistent with the goal of the format.	2023-10-04 22:54:53 -07:00
torque	1683197bc0	state: parse whitespace in flow objects a bit differently There were (and probably still are) some weird and ugly edge cases here. For example, `[ 1 ]` would parse to a list of `1 `. This implementation allows a single space to precede the closing ] and errors out if there is more than one. Additionally, it rejects any spaces before the item separator comma. This also applies to flow maps, with the addition that they do not permit whitespace before `:` now, either. Leading spaces are still consumed with reckless abandon, so, for example, `[ lopsided]` is valid. There is also some state sloppiness flying around so `[ val, ]` probably currently works as well. Tightening up the handling of leading whitespace will be a bigger restructuring that may involve state machine changes. I'll have to think about it.	2023-10-03 23:25:58 -07:00
torque	c5e8921eb2	state: use inferred error sets As far as I can tell, the only reason ever not to use an inferred error set is when you would get a dependency loop otherwise.	2023-10-03 23:19:01 -07:00
torque	34ec58e0d2	value: implement parsing to objects There are still some untested codepaths here, but this does seem to work for nontrivial objects, so, woohoo. It's worth noting that this is a recursive implementation (which seems silly after I hand-rolled the non-recursive main parser). The thinking is that if you have a deeply-enough nested object that you run out of stack space here, you probably shouldn't be converting it directly to an object. I may revisit this, though I am still not 100% certain how straightforward it would be to make this nonrecursive with all the weird comptime objects. Basically the "parse stack" would have to be created at comptime.	2023-10-03 23:17:37 -07:00
torque	0028092a4e	parser: in theory, hook up the rest of the diagnostics In practice, there are probably still things I missed here, and I should audit this to make sure there aren't any egregious copy paste errors remaining. Also, it's pretty likely that the diagnostics line_offset field isn't correct in most of these messages. More work will need to be done to update that correctly.	2023-10-01 21:15:21 -07:00
torque	01f98f9aff	parser: start the arduous journey of hooking up diagnostics The errors in the line buffer and tokenizer now have diagnostics. The line number is trivial to keep track of due to the line buffer, but the column index requires quite a bit of juggling, as we pass successively trimmed down buffers to the internals of the parser. There will probably be some column index counting problems in the future. Also, handling the diagnostics is a bit awkward, since it's a mandatory out-parameter of the parse functions now. The user must provide a valid diagnostics object that survives for the life of the parser.	2023-09-27 23:44:06 -07:00
torque	3258e7fdb5	tokenizer: add finish function to check if there is trailing data Since the tokenizer is decoupled from the parser, there's no good way to do this. Also without attempting to parse the last line, it's impossible to say if it is junk data or simply a missing trailing new line.	2023-09-27 23:35:24 -07:00
torque	0e60719c85	linebuffer: add strictness options When the buffer was separated from the tokenizer, we lost some validation, including really aggressive carriage return detection. This brings this back in full force and adds some additional validation on top of it.	2023-09-26 00:06:39 -07:00
torque	7f82c24584	parser: implement streaming parser With my pathological 50MiB 10_000 line nested list test, this is definitely slower than the one shot parser, but it has peak memory usage of 5MiB compared to the 120MiB of the one-shot parsing. Not bad. Obviously this result is largely dependent on the fact that this particular benchmark is 99% whitespace, which does not get copied into the resulting document. A (significantly) smaller improvement will be observed in files that are mostly data with little indentation or empty lines. But a win is a win.	2023-09-25 01:18:09 -07:00
torque	5037f69fbe	examples: add some sample documents to parse against	2023-09-24 22:25:22 -07:00
torque	1d65b072ee	parser: stateful reentrancy finally the flow parser has been "integrated" with the main parser in that they now share a stack. The bigger thing is that the parsing has been decoupled from the tokenization, which will allow parsing documents without loading them fully into memory first. I've been calling this the streaming parser, but it's worth noting that I am referring to streaming input, not streaming output. It would certainly be possible to do streaming output, but I am not interested in that at the moment (it would be the lowest-memory-overhead approach, but it's a lot of work for little gain, and it is less flexible for converting input to objects).	2023-09-24 22:24:33 -07:00
torque	38e47b39dc	all: do some restructuring I don't like big monolithic source files, so let's restructure a bit. parser.zig is still bigger than I would like it to be, but there isn't a good way to break up the two state machine parsers, which take up most of the space. This is the last junk commit before I am seriously going to implement the "streaming" parser. Which is the last change before implementing deserialization to object. I am definitely not just spinning my wheels here.	2023-09-24 18:22:12 -07:00
torque	8684fab23c	build: add oneshot parsing example	2023-09-24 15:14:58 -07:00
torque	54e4a14e38	config: item start does not need to be stored at every stack level This is a simplification, but the main motivation is that the flow parser stack can be integrated with the main parser stack because they are not disparate types any more.	2023-09-24 14:58:31 -07:00
torque	dcd33bdf27	config: catch some missing key copies For inline key items, the key memory wasn't getting copied. Now it does.	2023-09-24 14:58:31 -07:00
torque	3131a9d5fd	config: migrate flow parser into the main parser object I think I am actually going to make this a method of the ParserState struct soon so lol check out my freaking code churn. But here we are.	2023-09-24 14:58:31 -07:00
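
As a rough illustration of the multi-line string semantics described in commits 8dd5463683 and 4c966ca9d0 (each leader prepends its separator, so the leader on the first line has no effect), here is a minimal Python sketch. It is not the library's code, which is written in Zig; the names `LEADERS` and `join_string_lines` are hypothetical, and the input is assumed to be already split into (leader, text) pairs rather than tokenized from a real document.

```python
# Hypothetical sketch of the leader semantics from the commit messages above.
# This is NOT the Nice parser; it only models how lines are joined.

# Separator that each string leader prepends before concatenation.
LEADERS = {
    "|": "",    # direct concatenation with the previous line
    ">": "\n",  # prepend an LF character
    "+": " ",   # prepend a single space character
}


def join_string_lines(lines):
    """Join (leader, text) pairs into one string.

    The separator belongs to the line that carries the leader, so the
    leader on the first line is ignored: there is nothing before it.
    """
    parts = []
    for index, (leader, text) in enumerate(lines):
        if leader not in LEADERS:
            raise ValueError(f"unknown string leader: {leader!r}")
        if index > 0:
            parts.append(LEADERS[leader])
        parts.append(text)
    return "".join(parts)


# The example from commit 8dd5463683: "> first", "| second", "> third".
example = join_string_lines([(">", "first"), ("|", "second"), (">", "third")])
```

With the input from the commit message, `example` is `"firstsecond\nthird"`, matching the result the message gives for the new design.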

61 Commits