
You never send a single log event per HTTP request; you always batch them up. Assuming some reasonable batch size per request (minimum ~1 MiB or so), there is rarely any meaningful difference in payload size between gzipped/zstd'd JSON bytes and any particular binary encoding format you might prefer.
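A quick stdlib-only sketch of the point above (the event shapes here are made up): a batch of repetitive JSON log events gzips down dramatically, because the keys and message structure repeat on every event.

```python
import gzip
import json

# Hypothetical log events; the field names and values are invented for
# illustration, but the repetitive shape is typical of real log batches.
events = [
    {"ts": 1_700_000_000 + i, "level": "INFO",
     "service": "checkout", "msg": f"handled request {i} in {i % 97} ms"}
    for i in range(10_000)
]

raw = json.dumps(events).encode("utf-8")      # the uncompressed JSON batch
compressed = gzip.compress(raw)               # what actually goes on the wire

print(f"raw: {len(raw)} bytes, gzipped: {len(compressed)} bytes")
```

Once the batch is this compressible, shaving a few percent more with a binary encoding rarely moves the needle on payload size.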


Most log collection systems do not compress logs as they send them, because, again, why would they? This would instantly turn their firehose of revenue down to a trickle. Any engineer suggesting such a feature would be disciplined at best, fired at worst. Even if their boss is naive about the business realities and approves the idea, it turns out to be weirdly difficult in HTTP to send compressed requests. See: https://medium.com/@abhinav.ittekot/why-http-request-compres...
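To illustrate the awkwardness: mechanically, sending a compressed request body is trivial (sketch below, with a placeholder endpoint). The problem is protocol-level: unlike responses, there is no standard way for a client to discover in advance whether the server accepts a compressed request, so a server that doesn't expect `Content-Encoding` on a request may reject or mis-parse the body.

```python
import gzip
import json
import urllib.request

batch = json.dumps([{"msg": "hello"}] * 100).encode("utf-8")

# Build (but don't send) a POST with a gzipped body. The endpoint is a
# placeholder; nothing in HTTP told us it will honor Content-Encoding here.
req = urllib.request.Request(
    "https://logs.example.com/ingest",
    data=gzip.compress(batch),
    headers={"Content-Type": "application/json",
             "Content-Encoding": "gzip"},
    method="POST",
)

print(req.get_header("Content-encoding"))
```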

HTTP/2 would also improve efficiency because of its built-in header compression feature, but again, I've not seen this used much.

The ideal would be to have some sort of "session" cookie associated with a bag of constants, slowly changing values, and the schema for the source tables. Send this once a day or so, and then send only the cookie followed by columnar data compressed with RLE and then zstd. Ideally in a format where the server doesn't have to apply any processing to store the data apart from some light verification and appending onto existing blobs. I.e.: make the whole thing compatible with Parquet, Avro, or something other than just sending uncompressed JSON like a savage.
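A minimal sketch of that scheme, stdlib-only (zlib standing in for zstd, and every name here is invented): register constants and schema once under a session cookie, then send only the cookie plus RLE'd columnar data.

```python
import json
import zlib
from itertools import groupby

def rle(values):
    """Run-length encode a column as [(value, run_length), ...]."""
    return [(v, len(list(g))) for v, g in groupby(values)]

# Sent once a day or so: constants and schema, keyed by a session cookie.
session = {"cookie": "abc123",                       # made-up identifier
           "constants": {"host": "web-7", "region": "us-east-1"},
           "schema": ["ts", "level", "status"]}

# 10k events laid out as columns rather than rows.
ts = list(range(1_700_000_000, 1_700_010_000))
columns = {
    # Delta-encode timestamps so the monotonic column RLEs to almost nothing.
    "ts": [ts[0]] + [b - a for a, b in zip(ts, ts[1:])],
    "level": ["INFO"] * 9_900 + ["ERROR"] * 100,
    "status": [200] * 9_500 + [500] * 500,
}

# Every subsequent payload: just the cookie plus RLE'd columns, compressed.
payload = {"cookie": session["cookie"],
           "columns": {k: rle(v) for k, v in columns.items()}}
wire = zlib.compress(json.dumps(payload).encode("utf-8"))

print(f"10,000 events in {len(wire)} bytes on the wire")
```

The server side then only needs to verify the cookie and append the columns onto existing blobs, which is the "no processing beyond light verification" property described above.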


Most systems _do_ compress request payloads on the wire, because the cost per byte in transit over those wires is nontrivial, and almost always externalized.

Weird perspective, yours.


They will compress over the wire, but then decompress on ingest and bill for the uncompressed size. After that, an interesting thing happens: they compress the data again, along with other interesting techniques, to minimize its size on their own premises. Can't blame them... they're just trying to cut costs. But the fact that they charge so much for something that is so easily compressible is just... not fair.


Part of the problem is that the ingestion path is not vector-compressed (i.e. columnar), so they're charging you for the CPU overhead of doing this data rearrangement on their side.

It would cut costs a lot if the source agents did this (pre)processing locally before sending it down the wire.


We should distinguish between compression in transit and compression at rest. Compressing a larger corpus should yield better results than compressing small chunks independently, because the compressor can reuse back-references and dictionaries across the whole corpus (zstd, for example, has explicit support for shared dictionaries).
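This is easy to demonstrate with the stdlib (zlib here; zstd behaves the same way, only better): compressing 1,000 small log lines one at a time, as a naive per-request scheme would, versus as one corpus.

```python
import json
import zlib

# 1,000 small, structurally similar log lines (contents invented).
lines = [json.dumps({"level": "INFO", "msg": f"user {i} logged in"})
         for i in range(1000)]

# Each line compressed on its own: the compressor never sees the shared
# structure, and pays per-stream overhead 1,000 times.
per_line = sum(len(zlib.compress(line.encode("utf-8"))) for line in lines)

# The same lines compressed as one corpus: back-references span all lines.
whole = len(zlib.compress("\n".join(lines).encode("utf-8")))

print(f"per-line total: {per_line} bytes, whole corpus: {whole} bytes")
```

The gap is exactly why batching before compressing (in transit) and compacting large blobs (at rest) both pay off.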



