# TDD-001: Historical Ledger Storage & Reporting Strategy

- Status: Accepted
- Date: 2025-12-23
- Context: General Ledger Processing Worker Tasks
## 1. Context and Problem Statement
Our system processes financial general ledgers (CSV/Excel/Text) containing anywhere from hundreds to hundreds of thousands of records. Currently, when a user uploads a new ledger, the system runs a “cleanup” strategy that deletes all existing normalized records for that account and fiscal year to conserve database storage.
The Conflict:
- Storage Constraints: Storing millions of normalized rows for every historical upload in PostgreSQL is not scalable and leads to table bloat.
- Feature Requirement: Users need to be able to generate reports (IDC/CTED) based on previously uploaded ledgers, not just the most recent one.
- Current State: Because we delete the old rows, attempting to run a report on a historical ledger fails (no data found).
We need a solution that allows reporting on historical data without exponentially increasing database storage costs or degrading performance.
## 2. Decision: “Frozen Artifact” with Lazy Backfill
We will adopt a Hybrid Storage Strategy utilizing both PostgreSQL and Google Cloud Storage (GCS).
- Active Data (PostgreSQL): The most recently uploaded ledger’s records (for the account/fiscal year) remain in the database for fast, relational querying.
- Historical Data (GCS Artifacts): Upon processing, a compressed JSON artifact (`.json.gz`) of the normalized records is saved to GCS (a writer sketch follows this list).
- Lazy Backfill: For legacy files that currently have neither DB rows nor an artifact, we will re-process the original raw file on demand and generate the artifact at that point.
### The “Waterfall” Retrieval Logic
When `TemplaterHelper` requests normalized records, it will follow this priority order (a sketch follows the list):

1. Hot Storage (DB): Check PostgreSQL. If records exist, use them.
2. Cold Storage (Artifact): Check GCS for `ledger-artifacts/{ledgerFileId}.json.gz`. If found, download, unzip, and use.
3. Legacy Fallback (Re-process): If neither exists:
    - Download the original raw input file from GCS.
    - Run `LedgerNormalizer` in-memory.
    - Map the results to match the database schema (snake_case).
    - Save the result as a GCS artifact (so step 2 works next time).
    - Return the data.
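A minimal sketch of this waterfall, assuming the same Node/TypeScript stack; the DB query, raw-file download, and normalizer callbacks are injected as hypothetical placeholders because their real signatures are not defined here:

```typescript
import { Storage } from "@google-cloud/storage";
import { gunzipSync, gzipSync } from "node:zlib";

type NormalizedRecord = Record<string, unknown>;

const storage = new Storage();
const ARTIFACT_BUCKET = "ledger-artifacts-bucket"; // assumed bucket name

export async function getNormalizedRecords(
  ledgerFileId: string,
  deps: {
    // Hypothetical collaborators; the real query, download, and normalizer APIs differ.
    queryDbRecords: (id: string) => Promise<NormalizedRecord[]>;
    downloadRawInputFile: (id: string) => Promise<Buffer>;
    normalize: (raw: Buffer) => Promise<NormalizedRecord[]>; // LedgerNormalizer
    mapToSnakeCase: (records: NormalizedRecord[]) => NormalizedRecord[];
  },
): Promise<NormalizedRecord[]> {
  // 1. Hot storage: the active ledger lives in PostgreSQL.
  const dbRecords = await deps.queryDbRecords(ledgerFileId);
  if (dbRecords.length > 0) return dbRecords;

  // 2. Cold storage: frozen artifact in GCS.
  const artifact = storage
    .bucket(ARTIFACT_BUCKET)
    .file(`ledger-artifacts/${ledgerFileId}.json.gz`);
  const [exists] = await artifact.exists();
  if (exists) {
    const [compressed] = await artifact.download();
    return JSON.parse(gunzipSync(compressed).toString("utf8"));
  }

  // 3. Legacy fallback: re-process the raw file, backfill the artifact, return the data.
  const raw = await deps.downloadRawInputFile(ledgerFileId);
  const records = deps.mapToSnakeCase(await deps.normalize(raw));
  await artifact.save(gzipSync(Buffer.from(JSON.stringify(records))), {
    contentType: "application/gzip",
  });
  return records;
}
```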
## 3. Options Considered
| Option | Pros | Cons | Verdict |
|---|---|---|---|
| Keep All Rows in DB | Simplest implementation; fast queries. | Massive storage cost; table bloat; performance degradation over time. | Rejected |
| JIT Normalization (No Storage) | Zero storage cost; always uses latest logic. | High CPU/Memory usage per report; risk of “Logic Drift” (old reports changing numbers due to code updates). | Rejected |
| Frozen Artifact (Selected) | Cheap storage (GCS); immutable data (audit safe); low database load. | Slightly more complex implementation; first access to legacy files is slower. | Accepted |
## 4. Technical Implementation
### A. Update `process_ledger` Task
Modify the worker task to save the artifact immediately after normalization, before any destructive DB operations.
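A hedged sketch of the ordering change only; the task name, the injected helpers, and the ORM calls stand in for the existing code and are not the real APIs:

```typescript
type NormalizedRow = Record<string, unknown>;

// Collaborators are injected so the sketch stays self-contained; the real
// task calls LedgerNormalizer, the artifact writer, and the ORM directly.
export async function processLedger(
  ledgerFileId: string,
  rawFile: Buffer,
  deps: {
    normalize: (raw: Buffer) => Promise<NormalizedRow[]>;               // LedgerNormalizer
    saveArtifact: (id: string, rows: NormalizedRow[]) => Promise<void>; // GCS .json.gz writer (see Section 2)
    deleteExistingRows: (id: string) => Promise<void>;                  // current "cleanup" step
    insertRows: (id: string, rows: NormalizedRow[]) => Promise<void>;
  },
): Promise<void> {
  const rows = await deps.normalize(rawFile);

  // NEW: freeze the artifact *before* any destructive DB operation, so the
  // normalized data survives even if a later step fails.
  await deps.saveArtifact(ledgerFileId, rows);

  // Existing behaviour: replace the active ledger for this account/fiscal year.
  await deps.deleteExistingRows(ledgerFileId);
  await deps.insertRows(ledgerFileId, rows);
}
```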
### B. Update `TemplaterHelper`
Refactor `getNormalizedRecords` to implement the waterfall logic described in Section 2.
## 5. Consequences
- Positive:
    - Significant reduction in database row count (Postgres primarily stores metadata and the active ledger).
    - Historical reports are immutable once the artifact is created.
    - GCS costs are significantly lower than RDS/Postgres storage.
- Negative:
    - First-time report generation for old legacy files will be slower (due to re-processing).
    - We introduce a dependency on `zlib` and increased memory usage during artifact fetch (the JSON file must be buffered in memory).
## 6. Action Items
- Implement the artifact creation in `process_ledger`.
- Refactor `TemplaterHelper` with the waterfall logic above.
- Deploy to staging and verify that running a report on an old ledger triggers the “Reprocess and Backfill” logic successfully.