# TDD-001: Historical Ledger Storage & Reporting Strategy

- Status: Accepted
- Date: 2025-12-23
- Context: General Ledger Processing Worker Tasks
## 1. Context and Problem Statement
Our system processes financial general ledgers (CSV/Excel/Text) containing anywhere from hundreds to hundreds of thousands of records. Currently, when a user uploads a new ledger, the system runs a “cleanup” strategy that deletes all existing normalized records for that account and fiscal year to conserve database storage.
The Conflict:
- Storage Constraints: Storing millions of normalized rows for every historical upload in PostgreSQL is not scalable and leads to table bloat.
- Feature Requirement: Users need to be able to generate reports (IDC/CTED) based on previously uploaded ledgers, not just the most recent one.
- Current State: Because we delete the old rows, attempting to run a report on a historical ledger fails (no data found).
We need a solution that allows reporting on historical data without exponentially increasing database storage costs or degrading performance.
## 2. Decision: “Frozen Artifact” with Lazy Backfill
We will adopt a Hybrid Storage Strategy utilizing both PostgreSQL and Google Cloud Storage (GCS).
- Active Data (PostgreSQL): The most recently uploaded ledger’s records (for the account/fiscal year) remain in the database for fast, relational querying.
- Historical Data (GCS Artifacts): Upon processing, a compressed JSON artifact (`.json.gz`) of the normalized records is saved to GCS (a writer sketch follows this list).
- Lazy Backfill: For legacy files that currently have neither DB rows nor an artifact, we will re-process the original raw file on demand and generate the artifact at that point.
### The “Waterfall” Retrieval Logic
When `TemplaterHelper` requests normalized records, it will follow this priority order (a sketch follows the list):

1. Hot Storage (DB): Check PostgreSQL. If records exist, use them.
2. Cold Storage (Artifact): Check GCS for `ledger-artifacts/{ledgerFileId}.json.gz`. If found, download, unzip, and use.
3. Legacy Fallback (Re-process): If neither exists:
    - Download the original raw input file from GCS.
    - Run `LedgerNormalizer` in-memory.
    - Map the results to match the database schema (snake_case).
    - Save the result as a GCS artifact (so step 2 works next time).
    - Return the data.
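A minimal sketch of this waterfall, assuming the same Node/TypeScript stack; the DB query, raw-file download, and normalizer callbacks are injected as hypothetical placeholders because their real signatures are not defined here:

```typescript
import { Storage } from "@google-cloud/storage";
import { gunzipSync, gzipSync } from "node:zlib";

type NormalizedRecord = Record<string, unknown>;

const storage = new Storage();
const ARTIFACT_BUCKET = "ledger-artifacts-bucket"; // assumed bucket name

export async function getNormalizedRecords(
  ledgerFileId: string,
  deps: {
    // Hypothetical collaborators; the real query, download, and normalizer APIs differ.
    queryDbRecords: (id: string) => Promise<NormalizedRecord[]>;
    downloadRawInputFile: (id: string) => Promise<Buffer>;
    normalize: (raw: Buffer) => Promise<NormalizedRecord[]>; // LedgerNormalizer
    mapToSnakeCase: (records: NormalizedRecord[]) => NormalizedRecord[];
  },
): Promise<NormalizedRecord[]> {
  // 1. Hot storage: the active ledger lives in PostgreSQL.
  const dbRecords = await deps.queryDbRecords(ledgerFileId);
  if (dbRecords.length > 0) return dbRecords;

  // 2. Cold storage: frozen artifact in GCS.
  const artifact = storage
    .bucket(ARTIFACT_BUCKET)
    .file(`ledger-artifacts/${ledgerFileId}.json.gz`);
  const [exists] = await artifact.exists();
  if (exists) {
    const [compressed] = await artifact.download();
    return JSON.parse(gunzipSync(compressed).toString("utf8"));
  }

  // 3. Legacy fallback: re-process the raw file, backfill the artifact, return the data.
  const raw = await deps.downloadRawInputFile(ledgerFileId);
  const records = deps.mapToSnakeCase(await deps.normalize(raw));
  await artifact.save(gzipSync(Buffer.from(JSON.stringify(records))), {
    contentType: "application/gzip",
  });
  return records;
}
```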
## 3. Options Considered
| Option | Pros | Cons | Verdict |
|---|---|---|---|
| Keep All Rows in DB | Simplest implementation; fast queries. | Massive storage cost; table bloat; performance degradation over time. | Rejected |
| JIT Normalization (No Storage) | Zero storage cost; always uses latest logic. | High CPU/Memory usage per report; risk of “Logic Drift” (old reports changing numbers due to code updates). | Rejected |
| Frozen Artifact (Selected) | Cheap storage (GCS); immutable data (audit safe); low database load. | Slightly more complex implementation; first access to legacy files is slower. | Accepted |
## 4. Technical Implementation
### A. Update `process_ledger` Task
Modify the worker task to save the artifact immediately after normalization, before any destructive DB operations.
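A hedged sketch of the ordering change only; the task name, the injected helpers, and the ORM calls stand in for the existing code and are not the real APIs:

```typescript
type NormalizedRow = Record<string, unknown>;

// Collaborators are injected so the sketch stays self-contained; the real
// task calls LedgerNormalizer, the artifact writer, and the ORM directly.
export async function processLedger(
  ledgerFileId: string,
  rawFile: Buffer,
  deps: {
    normalize: (raw: Buffer) => Promise<NormalizedRow[]>;               // LedgerNormalizer
    saveArtifact: (id: string, rows: NormalizedRow[]) => Promise<void>; // GCS .json.gz writer (see Section 2)
    deleteExistingRows: (id: string) => Promise<void>;                  // current "cleanup" step
    insertRows: (id: string, rows: NormalizedRow[]) => Promise<void>;
  },
): Promise<void> {
  const rows = await deps.normalize(rawFile);

  // NEW: freeze the artifact *before* any destructive DB operation, so the
  // normalized data survives even if a later step fails.
  await deps.saveArtifact(ledgerFileId, rows);

  // Existing behaviour: replace the active ledger for this account/fiscal year.
  await deps.deleteExistingRows(ledgerFileId);
  await deps.insertRows(ledgerFileId, rows);
}
```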
### B. Update `TemplaterHelper`
Refactor `getNormalizedRecords` to implement the waterfall logic described in Section 2.
## 5. Consequences
- Positive:
    - Significant reduction in database row count (Postgres primarily stores metadata and the active ledger).
    - Historical reports are immutable once the artifact is created.
    - GCS costs are significantly lower than RDS/Postgres storage.
- Negative:
    - First-time report generation for old legacy files will be slower (due to re-processing).
    - We introduce a dependency on `zlib` and increased memory usage during artifact fetch (the JSON file must be buffered in memory).
## 6. Action Items
- Implement the artifact creation in `process_ledger`.
- Refactor `TemplaterHelper` with the waterfall logic above.
- Deploy to staging and verify that running a report on an old ledger triggers the “Reprocess and Backfill” logic successfully.