
TDD-001: Historical Ledger Storage & Reporting Strategy

Status: Accepted
Date: 2025-12-23
Context: General Ledger Processing Worker Tasks

1. Context and Problem Statement

Our system processes financial general ledgers (CSV/Excel/Text) containing anywhere from hundreds to hundreds of thousands of records. Currently, when a user uploads a new ledger, the system runs a “cleanup” strategy that deletes all existing normalized records for that account and fiscal year to conserve database storage.

The Conflict:

  • Storage Constraints: Storing millions of normalized rows for every historical upload in PostgreSQL is not scalable and leads to table bloat.
  • Feature Requirement: Users need to be able to generate reports (IDC/CTED) based on previously uploaded ledgers, not just the most recent one.
  • Current State: Because we delete the old rows, attempting to run a report on a historical ledger fails (no data found).

We need a solution that supports reporting on historical data without unbounded growth in database storage costs or degraded query performance.

2. Decision: “Frozen Artifact” with Lazy Backfill

We will adopt a Hybrid Storage Strategy utilizing both PostgreSQL and Google Cloud Storage (GCS).

  1. Active Data (PostgreSQL): The most recently uploaded ledger’s records (for the account/fiscal year) remain in the database for fast, relational querying.
  2. Historical Data (GCS Artifacts): Upon processing, a compressed JSON artifact (.json.gz) of the normalized records is saved to GCS.
  3. Lazy Backfill: For legacy files that currently have neither DB rows nor an artifact, we will re-process the original raw file on-demand and generate the artifact then.

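The artifact encode/decode path can be sketched with Node's built-in zlib; the record shape and field names below are illustrative placeholders, not the real schema:

```typescript
import { gzipSync, gunzipSync } from "zlib";

// Illustrative record shape; the real normalized schema uses snake_case columns.
interface NormalizedRecord {
  account_id: string;
  fiscal_year: number;
  amount: number;
}

// Serialize normalized records into the compressed .json.gz artifact body.
export function encodeArtifact(records: NormalizedRecord[]): Buffer {
  return gzipSync(Buffer.from(JSON.stringify(records), "utf8"));
}

// Inverse: decompress and parse an artifact downloaded from GCS.
export function decodeArtifact(body: Buffer): NormalizedRecord[] {
  return JSON.parse(gunzipSync(body).toString("utf8"));
}
```

Because gzip and JSON are both stable, widely supported formats, an artifact written today remains readable even if the normalizer's logic changes later — which is exactly the "frozen" property the decision relies on.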
The “Waterfall” Retrieval Logic

When TemplaterHelper requests normalized records, it will follow this priority order:

  1. Hot Storage (DB): Check PostgreSQL. If records exist, use them.
  2. Cold Storage (Artifact): Check GCS for ledger-artifacts/{ledgerFileId}.json.gz. If found, download, unzip, and use.
  3. Legacy Fallback (Re-process): If neither exists:
    • Download the original raw input file from GCS.
    • Run LedgerNormalizer in-memory.
    • Map the results to match the database schema (snake_case).
    • Save the result as a GCS artifact (so step 2 works next time).
    • Return the data.
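The waterfall above can be sketched as follows; the `Db` and `ObjectStore` interfaces and the `ledger-uploads/` path are hypothetical stand-ins for the real clients, not actual APIs:

```typescript
import { gzipSync, gunzipSync } from "zlib";

type NormalizedRecord = Record<string, unknown>;

// Placeholder interfaces for the real PostgreSQL and GCS clients.
interface Db {
  findRecords(ledgerFileId: string): Promise<NormalizedRecord[]>;
}
interface ObjectStore {
  download(path: string): Promise<Buffer | null>;
  upload(path: string, body: Buffer): Promise<void>;
}

export async function getNormalizedRecords(
  ledgerFileId: string,
  db: Db,
  gcs: ObjectStore,
  normalizeRawFile: (raw: Buffer) => NormalizedRecord[],
): Promise<NormalizedRecord[]> {
  // 1. Hot storage: the active ledger's rows live in PostgreSQL.
  const rows = await db.findRecords(ledgerFileId);
  if (rows.length > 0) return rows;

  // 2. Cold storage: the frozen .json.gz artifact in GCS.
  const artifactPath = `ledger-artifacts/${ledgerFileId}.json.gz`;
  const artifact = await gcs.download(artifactPath);
  if (artifact) return JSON.parse(gunzipSync(artifact).toString("utf8"));

  // 3. Legacy fallback: re-process the original raw upload in memory,
  //    then backfill the artifact so step 2 succeeds next time.
  const raw = await gcs.download(`ledger-uploads/${ledgerFileId}`); // path is illustrative
  if (!raw) throw new Error(`no data found for ledger ${ledgerFileId}`);
  const records = normalizeRawFile(raw); // expected to emit snake_case keys
  await gcs.upload(artifactPath, gzipSync(Buffer.from(JSON.stringify(records))));
  return records;
}
```

Note that the backfill in step 3 makes the fallback self-healing: each legacy file pays the re-processing cost at most once.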

3. Options Considered

| Option | Pros | Cons | Verdict |
| --- | --- | --- | --- |
| Keep All Rows in DB | Simplest implementation; fast queries. | Massive storage cost; table bloat; performance degradation over time. | Rejected |
| JIT Normalization (No Storage) | Zero storage cost; always uses the latest logic. | High CPU/memory usage per report; risk of "logic drift" (old reports changing numbers due to code updates). | Rejected |
| Frozen Artifact (Selected) | Cheap storage (GCS); immutable data (audit-safe); low database load. | Slightly more complex implementation; first access to legacy files is slower. | Accepted |

4. Technical Implementation

A. Update process_ledger Task

Modify the worker task to save the artifact immediately after normalization, before any destructive DB operations.
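A minimal sketch of the reordering, with the collaborators (`uploadArtifact`, `deleteOldRows`, `insertRows`) as hypothetical helpers injected for clarity:

```typescript
import { gzipSync } from "zlib";

type NormalizedRecord = Record<string, unknown>;

export async function processLedger(
  ledgerFileId: string,
  records: NormalizedRecord[],
  uploadArtifact: (path: string, body: Buffer) => Promise<void>,
  deleteOldRows: () => Promise<void>,
  insertRows: (rows: NormalizedRecord[]) => Promise<void>,
): Promise<void> {
  // Freeze the normalized output to GCS before anything destructive runs,
  // so a crash mid-cleanup never leaves a ledger with no recoverable data.
  const body = gzipSync(Buffer.from(JSON.stringify(records), "utf8"));
  await uploadArtifact(`ledger-artifacts/${ledgerFileId}.json.gz`, body);

  // Only now replace the active ledger rows in PostgreSQL.
  await deleteOldRows();
  await insertRows(records);
}
```

The ordering is the point: if the artifact upload fails, the task aborts before the old rows are deleted, so no state is lost.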

B. Update TemplaterHelper

Refactor getNormalizedRecords to implement the waterfall logic.

5. Consequences

  • Positive:
    • Significant reduction in database row count (Postgres primarily stores metadata and the active ledger).
    • Historical reports are immutable once the artifact is created.
    • GCS costs are significantly lower than RDS/Postgres storage.
  • Negative:
    • First-time report generation for legacy files will be slower (due to on-demand re-processing).
    • We introduce a dependency on zlib, and the artifact fetch increases memory usage (the compressed JSON must be buffered and decompressed).

6. Action Items

  1. Implement the artifact creation in process_ledger.
  2. Refactor TemplaterHelper with the logic above.
  3. Deploy to staging and verify that running a report on an old ledger triggers the “Reprocess and Backfill” logic successfully.