
Show HN: Searchable compression for JSON (p50≈0.18 ms; 10-min demo)

Hi! I built SEE (Semantic Entropy Encoding) because the “data tax” (storage/egress) and the “CPU tax” (decompress/parse) keep rising together.

Tradeoff: it’s not always smaller than Zstd, but it stays searchable while compressed and minimizes I/O. Key numbers (demo): combined size ≈ 19.5% of raw, skip rate ≈ 99%, lookup p50 ≈ 0.18 ms (bloom density ≈ 0.30).

10-min reproduction (no marketing): 1) Download the Demo ZIP (Release). 2) Follow README_FIRST.md. 3) Run `python samples/quick_demo.py` → prints ratio/skip/bloom + p50/p95/p99.

ROI quick math: Savings/TB ≈ (1 − 0.195) × Price_per_GB × 1000 (e.g., $0.05/GB → ~$40/TB).

NDA/VDR (private, no confidential info in public): [https://docs.google.com/forms/d/e/1FAIpQLScV2Ti592K3Za2r_WLU...]
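For anyone who wants to plug in their own prices, the ROI quick math can be sketched as a few lines of Python. The function name `savings_per_tb` is made up for illustration and is not part of the demo; it assumes 1 TB = 1000 GB and uses the demo's combined ratio of 0.195.

```python
def savings_per_tb(combined_ratio: float, price_per_gb: float) -> float:
    """Estimated dollars saved per TB of raw data (1 TB taken as 1000 GB)."""
    # (1 - ratio) is the fraction of bytes you no longer store or transfer.
    return (1.0 - combined_ratio) * price_per_gb * 1000


# Demo figures from the post: combined size is 19.5% of raw, $0.05 per GB.
estimate = savings_per_tb(0.195, 0.05)
print(f"~${estimate:.2f} saved per TB")
```

Swap in your own storage or egress price per GB and the ratio you measure on your data.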

Happy to answer technical questions (schema-aware layout, delta strategy, bloom density, skip heuristics, failure modes).

“Why not just Zstd?” Short: Zstd-only can be smaller, but it isn’t searchable; you still pay I/O + CPU to decompress and parse JSON. SEE trades a bit of size for millisecond lookups and ~99% skipping, which often wins on TCO at scale.
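To make "skipping" concrete, here is a minimal sketch of the general idea: one Bloom filter per compressed block, so a lookup only decompresses blocks that might contain the key. This is not SEE's actual format or code. The `Bloom` class, `build_blocks`, `lookup`, the block size, the filter parameters, and the use of zlib are all assumptions picked for a self-contained example.

```python
import hashlib
import json
import zlib


class Bloom:
    """Tiny Bloom filter backed by a Python int (illustrative only)."""

    def __init__(self, bits: int = 1024, hashes: int = 3):
        self.bits = bits
        self.hashes = hashes
        self.field = 0

    def _positions(self, item: str):
        # Derive several bit positions by salting one hash function.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.field |= 1 << pos

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True means "maybe present".
        return all((self.field >> pos) & 1 for pos in self._positions(item))


def build_blocks(records, block_size=100):
    """Split records into blocks; keep a Bloom filter next to each compressed block."""
    blocks = []
    for start in range(0, len(records), block_size):
        chunk = records[start:start + block_size]
        bloom = Bloom()
        for rec in chunk:
            bloom.add(str(rec["user"]))
        payload = zlib.compress("\n".join(json.dumps(r) for r in chunk).encode())
        blocks.append((bloom, payload))
    return blocks


def lookup(blocks, user):
    """Return matching records and how many blocks had to be decompressed."""
    hits, opened = [], 0
    for bloom, payload in blocks:
        if not bloom.might_contain(str(user)):
            continue  # skipped: no decompression, no JSON parsing
        opened += 1
        for line in zlib.decompress(payload).decode().splitlines():
            rec = json.loads(line)
            if rec["user"] == user:
                hits.append(rec)
    return hits, opened


records = [{"user": i, "event": "login"} for i in range(5000)]
blocks = build_blocks(records)
hits, opened = lookup(blocks, 4242)
print(f"{len(hits)} hit(s); decompressed {opened} of {len(blocks)} blocks")
```

The filter can return false positives, so an occasional extra block gets opened, but it never causes a present key to be missed.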

“Will it hold on real data?” Short: Best on repetitive JSON/NDJSON (logs, events, telemetry). We provide a 10-minute demo so anyone can reproduce KPIs and stress it with their own patterns.

“Why not keep a separate index?” Short: Separate indexes add I/O/space and consistency overhead. SEE keeps searchability in the storage format, reducing random I/O and parse costs.

“Are the numbers cherry-picked?” Short: We publish p50/p95/p99, skip (present/absent), and bloom density. The demo script prints them all, along with raw and combined sizes.
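If you want to sanity-check latency percentiles independently of the demo script, a rough sketch like the one below works for any callable. It is not the code in `samples/quick_demo.py`; the helper `latency_percentiles` and the dict lookup used as a stand-in workload are hypothetical.

```python
import statistics
import time


def latency_percentiles(fn, runs=2000):
    """Time fn() repeatedly and return p50/p95/p99 in milliseconds."""
    samples_ms = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Stand-in workload: replace the lambda with your own lookup call.
table = {f"key{i}": i for i in range(100_000)}
stats = latency_percentiles(lambda: table.get("key4242"))
print({name: f"{value:.5f} ms" for name, value in stats.items()})
```

Reporting the tail (p95/p99) alongside p50 matters because a fast median can hide slow outliers.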

4 months ago by kodomonocch1