Hacker News new | comments | show | ask | jobs | submitlogin
Faster and simpler with the command line: deep-comparing JSON files with jq (genius.engineering)
206 points by wwarnerandrew 5 months ago | hide | past | web | 84 comments | favorite

I really enjoyed this article, and I think it shows how successfully jq fits into the Unix culture of sed/awk/grep/etc. It seems so rare to find new CLUI tools that feel as "classical" as jq. It has helped me do one-off tasks like this several times, but I've really only scratched the surface. Often with newer tools I'm reluctant to invest in going deeper into really learning the features, but with jq I have a lot of confidence that it would pay off for years to come. I don't see any books yet, but this is about the phase where I'd normally buy an O'Reilly volume (hint hint).

Btw I'm surprised you needed -M, since I thought jq would suppress colors if it saw it wasn't writing to a tty.

Chapter 5 of Data Science at the Command Line (O'Reilly, 2014) mentions `jq` briefly: https://www.datascienceatthecommandline.com/chapter-5-scrubb...

I had the opposite thought when I used it for the first time.

Even when reading the article I thought about it :)

Just a heads up to anyone using jq - I've previously spent a couple of hours debugging a problem because jq uses float64 to store integers (which might lead to rounding-errors/overflows). For example:

  echo 1152921504606846976 | jq                                                            

This is an artifact of JavaScript, which even as of ES6 uses IEEE 754 double-precision floats for all numeric values. jq likely uses the same implementation internally for compatibility reasons and to avoid surprises of a different kind.

See https://www.ecma-international.org/ecma-262/6.0/#sec-ecmascr...

BigInt in top browsers now, not under a flag. Just sayin'!

https://brendaneich.com/wp-content/uploads/2017/12/dotJS-201... et seq.

I think an 'error out if overflow/truncation'-mode available as a command line flag could be useful if you just don't have any JS involved in the JSON pipeline.

Twitter had to switch their tweet id representation in the API to handle this - it was numeric and switched to strings.

Note: the links currently redirect to https://imgur.com/32R3qLv (image of a testicle and derogatory comment on HN)

Copy-pasting the link bypass this. It's using the HN referrer (I think?) to redirect to imgur

Yikes, that's nasty.

It is. But it's a problem of JSON itself, not just jq.

TIL: JSON has no specified number implementation: http://www.ecma-international.org/publications/files/ECMA-ST...

>JSON is agnostic about the semantics of numbers ... JSON instead offers only the representation of numbers that humans use: a sequence of digits.

So... anything is valid, per the spec.

JSON != JavaScript

> echo 1152921504606846976 | python -c 'import sys, json; print(json.load(sys.stdin))'


Python's json package != JSON

JSON: https://tools.ietf.org/html/rfc8259#page-8

The link says that it's up to the implementation, which means it's valid for Python's JSON implementation to support larger numbers.

It's less "interoperable" but not strictly invalid, by my read.

It looks like JSON doesn't specifically define how numeric numbers should be stored. It just recommends expecting precision up to the double precision limits.

Still interesting to know it's not just a jq quirk.

Just because python has bigints and its stdlib json module supports bigints in json doesn't mean that is an interoperable thing to do.

The problem is made worse on the receiving end (the browser). I've ran into this issue when serialization libraries in Java send a 64-bit long value as a sequence of digits, then things over ~50 bits get silently truncated, you find out about it, then switch to quoted strings.

That's Python not adhering to the JSON standard as defined in the RFC.

Another option is to use a tool like 'gron' to convert the JSON into a shell-friendly line-oriented format. This makes the rest straightforward.


jq can do something like that too! Check out its --stream option.

Oh, this sounds handy, though i can’t imagine it’d be as performant for this particular case.

That's cool, you made a thing that verifies the two export jobs you wrote have the same data even though they have different output.

I can't help wondering, if you control the code that generates the JSON, why not output in a conservative, consistent format? I'm sure there are pros/cons, but this work would allow something like `diff` to work, and then you don't have to maintain a separate utility.

good question! the analysis that I was doing was really a one-off for switching between these processes. We have unit tests and sanity checks to ensure consistency going forward, but as a final check before flipping the switch we wanted to be as confident as possible that we hadn't introduced any regressions across the full data-set.

The new export process is much more reliable and a _lot_ faster, but as a side effect of doing things in a different way it generated the export file in a different format. Given that the order of objects in an export file and the order of keys/etc in the JSON objects didn't matter for anything except comparing the two processes, I figured it was simpler to put the normalization logic in the one-off tool vs baking it into our export process. But certainly if we were maintaining both exports in an ongoing fashion and validating them against each other, it would make a lot more sense to spend time making sure they generated objects and keys in the same order.

I'm surprised the layout of the JSON doesn't arrange music by artist, any reasoning why that wasn't done?

If you have json line formatted stuff (or csv) and an aws account, you can do some nice things with Athena and SQL. We have a few simple backoffice tools that I've implemented around simple sql queries on data dumped from various systems that we have in json format. Awesome, if you want to do some quick selects, joins, etc.

If you are going to process this amount of data, don't load it all into memory and process line by line. Also do that concurrently if you have more than one CPU core available. I've done this with ruby, python, Java, and misc shell tools like jq. Use what you are comfortable with and what gets results quickly.

One neat trick with jq is to use it to convert json objects to csv and to then pipe that into csvkit for some quick and dirty sql querying. Generally gets tedious beyond a few hundred MB. I recommend switching to Athena or something similar if that becomes a regular thing for you.

OP here— good point! We actually use Athena to query these exports in S3 to debug data drift of specific export objects over time. It's quite a useful tool, I was able to go knowing basically nothing about Athena to querying gzipped newline-delimited JSON files in S3 using SQL in about an hour.

It's worth mentioning that there are much faster JSON parsing libraries than the default in Ruby stdlib. I still don't think Ruby is the best choice for doing raw JSON parsing. Last time I had to care about JSON speed we were transforming billions of events and the Ruby JSON lib was becoming a bottleneck

> It's worth mentioning that there are much faster JSON parsing libraries than the default in Ruby stdlib.

I am on the edge of my seat now.

Would you mind listing which libraries are much (say, an order of magnitude) faster?

Don't think it's an order of magnitude faster, but oj is supposed to be the standard for Ruby.


What do you think is the best way to do deep JSON comparisons? We work with 2GB JSONs all day, and it is super annoying how long they take to process.

Not parse them into a tree, to start with.

Use a streaming JSON parser, and compare them token by token unless/until they diverge, at which point you take whatever actual suitable to identify the delta.

Parsing it into a tree may be necessary if you want to do more complex comparisons (such as sorting child objects etc.), but even then depending on your need you may well be better off storing offsets into the file depending on your requirements.

https://github.com/lloyd/yajl is an example of a streaming JSON parser (caveat: I've not benchmarked it at all), but JSON is simple enough to write one specifically to handle two streams.

I believe this comparison benchmark could be useful for you and you can expand further with more tests. Although I got downvoted for sharing a link.


That still parses into a tree.

Rust is absolutely wonderful for tasks like this. They don't hit any of the cases where Rust's ownership can make things tricky. And the serde library makes deserializing JSON a piece of cake.

You end up with code which looks pretty similar to the equivalent JavaScript or Python code, but performs much faster (10x, 100x or even 1000x faster).

There's also pikkr (https://github.com/pikkr/pikkr) if you need really really fast JSON parsing.

Start with a compiled language, I guess? I don't operate on anywhere near that scale, but json-rust reaches 400 MB/s for me.

It doesn't parallelize, and you'd need memory enough for the entire structure, but of course Rust doesn't have GC overhead. You could trivially parse both files in parallel, at least.

(1) Try a language with fast allocations (C, C++, Rust, maybe Go or Java) -- anything except Python or Ruby


(2) Try using streaming API (I don't know Ruby, but quick google found https://github.com/dgraham/json-stream ). Note that this method will require you to massively restructure your program -- you want to avoid having all of the data in memory at once.

The streaming API might work better with jq-based preprocessing -- for example, if you want to compare two unsorted sets, it may be faster to sort them using jq, then compare line-by-line using streaming API.

Python is fast at parsing JSON, Go had hard time to match parsing speed of it. Additionally you have PyPy to help.

Python is fast at doing anything that doesn't involve running Python.

That's an important caveat. Python's C JSON parser library is super-fast, but if you want to use the data for anything but a simple equality check afterwards, it'll be slow as molasses.

Or you'll write a C extension for it...

nodejs comes to mind!

> My first thought was to write a ruby script to parse and compare the two exports, but after spending a little time coding something up I had a program that was starting to get fairly complicated, didn't work correctly, and was too slow—my first cut took well over an hour. Then I thought: is this one of those situations where a simple series of shell commands can replace a complex purpose-built script?

Key takeaway: next time, start with the second thought first and save yourself well over an hour!

I had a similar problem diffing large API responses a few months ago and implemented an automation friendly JSON schema tool. It's a great way to make a summary of the data, especially when looking for forgotten fields for example.


I love jq.

I replaced a bunch of bespoke ETL code with shell scripts. grep, sed, jq, xsv, psql, etc. Fast, efficient, iterative, inspectable, portable.

Alas, most everyone else insists on python, nodejs, ruby, AWS Lambda, jenkins goo, misc mayfly tech stacks. So my "use the most simple tool that works" advocacy has never gained traction.

As part of an automated Jira upgrade script (well, Makefile) we needed to export changes to the listener port/scheme configuration which is unfortunately stored “in the code” so to speak (in WEB-INF/web.xml) which Atlassian doesn’t deign to whitespace normally (indentation is all over the place, as is formatting, character encoding, and more) —- and they mangle it differently somehow with each point release. So the Makefile calls xmllint to normalize formatting and whitespace of both the untouched source files from the old and new release as well as the locally modified (deployed) configuration, then calls diff/patch accordingly (in a three-way).

I have found jq immensely useful to process ugly large responses from REST APIs in enterprise systems. It's like an awk for JSON... And I've been awk fan for 30 years for any text processing.

This is somewhat related to a hack I threw together recently: https://github.com/ecordell/jf

It attempts to address a similar problem (comparing json or subsets of json), but I wanted the structure of what was being compared to be more readable (compared to jq), so I went with graphql syntax. Doubt it would do great on larger datasets though.

I'm not sure why you are comparing the data to the old export instead of against a source of truth... for example what is in the upstream data source. Also why not verify using unit tests? Who is to say that the original export is valid and not the second export.

In theory, I agree! I hope the new codebase has a set of tests to validate just that.

But, in practice, you have a downstream consumer of this data format (Apple in this case..).. Validating the old and new formats are functionally identical is just as important as validating the new format matches the upstream source of truth :)

I'm using jsonassert [1], a Java based JSON unit testing library, for something very similar.

Not sure how it'd handle comparing 5GB files though.

1. http://jsonassert.skyscreamer.org/

I'm curious how well could Crystal language handles that huge amount of JSON since most of the Ruby code could be ported over to Crystal.

It has a JSON pull parser to minimize memory usage which is useful for memory constraint environment but at the expense of less performant. If that could be split up with fork Crystal processes, I believe it's feasible.

There are stream parsers for JSON for Ruby too, including bindings for C libraries like YAJL - using the default JSON parser is an awful choice for doing comparisons like that given the massive overhead of the amount of objects it'll be creating for no good reason.

Agree, I believe the benchmark shown Yaji and jq in this repo is useful for you


If speed is essential, why not use protobuf/flatbuffer or one of their variants?

I wanted to like jq, but honestly, I can't figure out it's crazy syntax.

If you're into JavaScript (or LiveScript) or functional programming, you might find ramda-cli[1] more palatable. Disclaimer: I've created it.

[1]: https://github.com/raine/ramda-cli

it's a bit weird, but it's perfect for writing quick shell one liners with once you get used to it. no regular scripting language can match it for that

awk, sed?

The braces make that more difficult, awx and sed can work with YAML more easily, but awk works best on column oriented data, and YAML and JSON are more row oriented.

For extracting a value out of json where the keys can arbitrarily re-order and you have nested maps containing the same key names? No thanks.

I was providing them as examples of things "perfect for writing quick shell one liners", not good JSON parsers.

+1 for a great tool.

I wonder if a C/C++ programm would perform better?

jq is a C program.

In theory a truly specific program could work better. In practise, the broad scope of jq allows you to discover the operations you need and respond to changes in requirements without being locked into custom code, and any given programmer probably couldn't do the same job better.

jq is a C program, yes, but jq programs are interpreted. Because jq is a dynamically-typed language, it wouldn't be easy to compile it to object code that would run too much faster than the byte-interpreted version (though it would still run faster).

As you say, jq's power is that it is an expressive language, and it's much much easier to write jq programs that work than it is to write C/C++ programs as needed that do the same or similar work.

The interpreted language is just the setup phase for a pipeline of compiled-in data transformations.

I don't understand this statement. Keep in mind I'm a jq maintainer.

Yes. If well-written anyways.

> What’s the best way to compare these two 5GB files?

A much simpler way to do this is simply to hash the files, for example using sha256sum, which AFAIK ships with just about every Linux distro. Then just compare the hashes.

Having the same content is not the same as being identical verbatim.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact