
Data Quality

Every number on this site traces back to on-chain transactions. The pipeline maintains accuracy through four validation layers: checks that run during processing, checks that run after processing, fixtures that catch regressions when code changes, and manual tools for deep-dive auditing.

In my professional experience, data quality is never perfect, so I expect issues to surface over time. That's also why I sense-check every number I see. Still, something might slip through. Feel free to drop me a comment if you spot a number that doesn't make sense to you!


1. In-Pipeline Validation

Checks that run automatically during data processing. Failures block the pipeline.

Revenue Reconciliation

Both protocols reconcile computed fees against independent reference totals during processing. Examples:

  • DefiTuna: Two independent methods must match exactly -- raw WSOL inflows to the treasury vs. per-type attribution. Over-attribution triggers an error.
  • Flash.Trade: Four-layer reconciliation engine compares fee components (RSU, compounding, vault-touch, migration) against on-chain protocol fee sweeps. Gaps above 1,000 atoms flag warnings.
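
The core of such a reconciliation check is a simple invariant: attributed fees may never exceed what actually arrived. A minimal sketch of the idea -- the function name, threshold, and units here are illustrative, not the pipeline's actual code:

```python
from decimal import Decimal

def reconcile_revenue(raw_inflows: Decimal, attributed_total: Decimal,
                      warn_threshold: Decimal = Decimal("0.000001")) -> None:
    """Compare raw treasury inflows against the per-type attribution sum.

    Hypothetical sketch: over-attribution is a hard error that blocks the
    pipeline; a small under-attribution gap only warns.
    """
    gap = attributed_total - raw_inflows
    if gap > 0:
        # We claimed more fees than actually arrived -- block the pipeline.
        raise ValueError(f"over-attribution of {gap} SOL")
    if -gap > warn_threshold:
        # Inflows exist that no fee type accounts for yet -- warn only.
        print(f"warning: {-gap} SOL of inflows not yet attributed")
```

Using Decimal rather than float keeps the comparison exact, which matters when the error threshold is a handful of atoms.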

Transaction Classification

Every transaction must be classified, attributed, and accounted for. Examples:

  • Log-based classification assigns a transaction type to each transaction
  • Two-pass system (fast pass + enrichment) ensures no transaction remains unclassified
  • Unknown types are auto-discovered and flagged for review

Ledger Persistence

Cross-day accuracy is maintained through automatic ledger carryforward. End-of-day ledger state is saved and loaded by the next day's processing, so tokens that accumulate over multiple days before swapping are attributed correctly.
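
The carryforward mechanism amounts to serializing end-of-day state and reloading it the next day. A sketch under assumed file naming (the real pipeline's storage format may differ):

```python
import json
from pathlib import Path

def save_ledger(ledger: dict, date: str, out_dir: Path) -> None:
    """Persist end-of-day token balances, keyed by date (hypothetical layout)."""
    (out_dir / f"ledger_{date}.json").write_text(json.dumps(ledger))

def load_ledger(prev_date: str, out_dir: Path) -> dict:
    """Load the previous day's closing state as today's opening state."""
    path = out_dir / f"ledger_{prev_date}.json"
    if path.exists():
        return json.loads(path.read_text())
    return {}  # first day ever processed: start from empty balances
```

Because today's run opens with yesterday's closing balances, a token that accumulates for three days and then swaps is attributed against the full accumulated amount rather than only that day's inflow.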


2. Post-Pipeline Validation

Checks that run after processing completes, before data reaches the frontend.

Workflow Output Validation

Before deployment, automated validation verifies:

  • All required JSON files exist and are structurally valid
  • Field counts and required keys are consistent across files
  • Target dates are present in all time-series data (no gaps)
  • Pipeline aborts if validation fails -- bad data never reaches production
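
A simplified version of such a deployment gate might look like the following -- the required field names and data shapes are hypothetical:

```python
import json

REQUIRED_KEYS = {"date", "total_sol", "tx_count"}  # illustrative schema

def validate_outputs(outputs: dict[str, str], target_dates: set[str]) -> None:
    """Gate deployment: outputs maps filename -> raw JSON text.

    Raises on malformed JSON, missing keys, or date gaps, so bad data
    never reaches production.
    """
    seen_dates = set()
    for name, text in outputs.items():
        rows = json.loads(text)  # raises ValueError on malformed JSON
        for row in rows:
            missing = REQUIRED_KEYS - row.keys()
            if missing:
                raise ValueError(f"{name}: missing keys {sorted(missing)}")
            seen_dates.add(row["date"])
    gaps = target_dates - seen_dates
    if gaps:
        raise ValueError(f"missing dates: {sorted(gaps)}")
```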

Cross-Validation Against External Sources

Computed results are compared against external data sources. Examples:

  • DefiTuna: Computed daily inflows are compared against the DefiTuna API, with a tolerance of 0.1 SOL per day
  • Flash.Trade: API snapshots (pool APY, TVL, fee rates) are collected daily for comparison; wallet usage metrics are benchmarked against official protocol reports (e.g., monthly active wallet counts match within 1 wallet)
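
A tolerance-based comparison like the DefiTuna one reduces to a per-day diff against the reference series. A sketch, assuming simple date-keyed dictionaries:

```python
def check_against_api(computed: dict[str, float], api: dict[str, float],
                      tolerance_sol: float = 0.1) -> list[str]:
    """Return days where computed inflows diverge from the reference API
    by more than the tolerance, or where the reference has no data."""
    mismatches = []
    for day, value in computed.items():
        ref = api.get(day)
        if ref is None or abs(value - ref) > tolerance_sol:
            mismatches.append(day)
    return sorted(mismatches)
```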

3. Regression Fixtures

Manually verified benchmark data that catches regressions when code changes. Examples:

  Fixture                                      Protocol      What it verifies
  -------------------------------------------  ------------  -----------------------------------------------
  Daily realized SOL totals                    DefiTuna      Per-day SOL amounts match ground truth
  Daily transaction counts                     DefiTuna      Transaction volumes within tolerance
  Reconciliation golden file (320+ intervals)  Flash.Trade   Per-interval fee gaps remain stable
  Official staked FAF totals                   Flash.Trade   Staker cache matches protocol-published figures

Fixtures are sacred -- if a test fails, the code is fixed, never the fixture.
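
A fixture check reduces to a strict diff between recomputed values and the frozen benchmark. A sketch -- the string-typed totals (to avoid float drift) and data shapes are illustrative:

```python
def check_fixture(actual: dict[str, str], fixture: dict[str, str]) -> list[str]:
    """Return days where recomputed totals drift from the frozen fixture.

    Any non-empty result means the code changed behavior: fix the code,
    never the fixture.
    """
    return sorted(day for day, expected in fixture.items()
                  if actual.get(day) != expected)
```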


4. Manual Verification Tools

On-demand scripts for auditing and debugging specific dates or anomalies.

DefiTuna

  • Comprehensive SOL check: validates total realized SOL against fixtures and official numbers across all dates
  • Daily attribution comparison: deep-dive into a specific date, comparing Simple Method vs. Realized Types Method
  • API inflow comparison: spot-checks computed inflows against the DefiTuna API

Flash.Trade

  • Interval verification: replays fee calculation for a specific reconciliation interval against on-chain data
  • Vault balance continuity: verifies stake pool token balances are continuous across days (no unexplained jumps)
  • Staker cache validation: checks staker positions, unstake queue totals, and penalty reserves for internal consistency
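
The continuity check boils down to one invariant: yesterday's balance plus today's recorded flows must equal today's balance. A sketch with invented data shapes (balances in token atoms, flows as net stake/unstake per day):

```python
def check_continuity(daily_balances: list[tuple[str, int]],
                     daily_flows: dict[str, int]) -> list[str]:
    """Flag days where the balance jump is not explained by recorded flows.

    daily_balances: chronological (date, end-of-day balance) pairs.
    daily_flows: date -> net recorded inflow/outflow for that day.
    """
    anomalies = []
    for (_, prev_bal), (day, bal) in zip(daily_balances, daily_balances[1:]):
        expected = prev_bal + daily_flows.get(day, 0)
        if bal != expected:
            anomalies.append(day)  # unexplained jump -- audit this day
    return anomalies
```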

Continuous Improvement

As protocols evolve, new transaction types are discovered, classification rules expand, and historical data is reprocessed with updated logic. This works because the system uses actual on-chain swap rates for conversions -- facts that don't change -- so classification improvements apply retroactively without affecting conversion accuracy.