This weekend project started by browsing the open-data repository of Paris’ public transport network, which contains various APIs to query real-time departures, current disruptions, etc. The data reuse section caught my eye, as it features external projects that use this open data. In particular, the RATP status website provides a really nice interface to visualize historical disruptions on metro, RER/train and tramway lines.
Gzip doesn’t reduce your data’s size by 2000x. Of course this could be done in other languages as well, but running gzip on your data doesn’t keep it accessible.
Even turning the data into a Parquet file would’ve been a massive improvement, while keeping it accessible, but it likely would not have been 2000x smaller. 10x, maybe.
edit: zip: about 10x; 7zip: about 166x (from ~10 GB to 60 MB) - still not 2000x
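For anyone who wants to check ratios like these themselves, here is a minimal sketch using Python’s standard library; gzip stands in for zip (both use DEFLATE) and lzma for 7zip’s default codec. The file name is a placeholder, and a real 10 GB file would be better processed in chunks than read whole:

    import gzip
    import lzma
    import os

    SRC = "export.json"  # placeholder: any large JSON dump

    with open(SRC, "rb") as f:
        raw = f.read()  # fine for a sketch; stream in chunks for a real 10 GB file
    raw_size = os.path.getsize(SRC)

    # gzip ~ zip (DEFLATE); lzma is the codec 7zip uses by default
    gz = len(gzip.compress(raw, compresslevel=9))
    xz = len(lzma.compress(raw, preset=9))

    print(f"original: {raw_size:,} bytes")
    print(f"gzip:     {gz:,} bytes ({raw_size / gz:.0f}x)")
    print(f"lzma:     {xz:,} bytes ({raw_size / xz:.0f}x)")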
In my experience, taking an inefficient format and copping out by saying “we can just compress it” is always rubbish. Compression tends to be slow, rules out sparse reads, is awkward to deal with remotely, and you generally end up with the inefficient decompressed data in the end anyway, whether in temporarily decompressed files or in memory.
I worked in a company where they went against my recommendation not to use JSON for a memory profiler output. We ended up with 10 GB JSON files, even compressed they were super annoying.
We switched to SQLite in the end which was far superior.
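A rough sketch of what that kind of switch looks like, assuming a made-up schema for the profiler records (the column names and queries here are illustrative, not the actual tool’s format):

    import sqlite3

    conn = sqlite3.connect("profile.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS allocations (ts REAL, callsite TEXT, bytes INTEGER)"
    )
    conn.executemany(
        "INSERT INTO allocations VALUES (?, ?, ?)",
        [(0.001, "parse_input", 4096), (0.002, "build_index", 1 << 20)],
    )
    conn.commit()

    # The win over a 10 GB JSON blob: indexed, sparse reads instead of parsing everything
    top = conn.execute(
        "SELECT callsite, SUM(bytes) FROM allocations "
        "GROUP BY callsite ORDER BY 2 DESC LIMIT 10"
    ).fetchall()
    print(top)
    conn.close()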
It all depends on the data’s entropy. Formats like JSON compress very well anyway. If the data is also very repetitive, then 2000x is very possible.
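That is easy to demonstrate with synthetic data; a self-contained sketch with Python’s standard library, using an artificially repetitive JSON payload:

    import gzip
    import json
    import lzma

    # One million copies of the same small record: near-zero entropy
    record = {"line": "RER A", "status": "normal", "message": ""}
    payload = json.dumps([record] * 1_000_000).encode()

    for name, compress in (("gzip", gzip.compress), ("lzma", lzma.compress)):
        out = compress(payload)
        print(f"{name}: {len(payload):,} -> {len(out):,} bytes "
              f"({len(payload) // len(out)}x)")

DEFLATE itself tops out a bit above 1000x by design, while LZMA’s much larger dictionary can go far beyond that, which is part of why the zip and 7zip numbers upthread differ so much.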
Of course compressing isn’t a good solution for this stuff. The point of the comment was to say how unremarkable the original claim was.
Yeah I agree.