Will 33137706c2 pattern matcher 4 months ago
..
README.org 33137706c2 pattern matcher 4 months ago
json.k 33137706c2 pattern matcher 4 months ago

README.org

JSON parser

ngn/k has a built in json parser which is many times faster than this. So why have one written in K?

First, K is interpretted and so it's fundamentally easier to modify. Also, since K objects don't match 1-1 with json objects (no symbols in json, no distinct boolean types in K, etc.) there are necessarily choices made in any implementation of a parser.

API

All functions are created under the json namespace. The primary API points are tree and prs. prs takes a string and returns a K "object". That is to say if the toplevel json structure is a list prs returns a list and if its an "object" prs returns a dict.

tree returns a list of three items: the nodes, the parent vector and the left vector. This amounts to an array representation of a tree, which I've taken to calling Apter trees though I haven't done any scholarship as to the actual origins. There's likely more information available in Aaron Hsu's thesis, but I haven't been through it.

Most people will probably make use of the prs interface, but having the tree available opens up possibilities for alternative traversals.

From the tree representation, the nodes of interest will be "{", "[" and ":", which serve as root nodes for objects, lists and pairs where an object is essentially just a list of pairs. From any object node, you can use the parent vector to find an array of pairs within it. You can use the left vector to discover the default ordering of such pairs.

Downside

As mentioned above it's quite a bit slower than the built-in json parser. When I last checked it was a couple of magnitudes slower for reasonable sized json files and quite a bit slower for much larger files. This last is because it doesn't handle memory terribly efficiently. If possible, you should break up very large json files in to smaller chunks for processing.

Actually, it should be possible to turn this into a version which operates on the file in chunks, but I haven't done any work in this direction. Another possibility is to generate the tree without doing any split at all.

References

The version of pr here was offered by mlochbaum as discussed on the J mailing list. It was quite a bit faster than the naive version I had originally. ngn also offered a fast version which has the potential to be more useful for a chunked version, since it uses a fold which can be resumed. From that thread you can see contributions by @ovs and @chrispsn to make other speed enhancements over the orginal version.