Data Quality
Data quality isn't just about accuracy; it's about user confidence. When every number traces to its on-chain source and survives multiple validation layers, users can trust the analytics for informed decisions. The pipeline enforces these standards through automated validation during processing and manual verification tools for deeper auditing.
From my professional experience, data quality is rarely 100% perfect. So I expect to find issues going forward. This is also why I sense-check every number I see. But something might slip. Feel free to drop me a comment if you see a number that doesn't make sense to you!
Data quality measures are constantly improving; below is a brief summary of what is in place for DefiTuna.
Automated Quality Checks
These checks run automatically during data processing to ensure data quality:
Accuracy Checks
Attribution Reconciliation
The revenue attribution system automatically reconciles two independent methods:
- Raw WSOL tracking - Direct summation of all WSOL inflows to treasury
- Type attribution - SOL attributed back to originating transaction types
During processing, the system detects and logs any discrepancies. Over-attribution (negative variance) triggers an error message.
Result: 100% attribution accuracy.
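Conceptually, the reconciliation boils down to a check like the sketch below (the function and field names are illustrative, not the pipeline's actual API):

```python
import logging

logger = logging.getLogger("attribution")

def reconcile(raw_wsol_inflow_lamports: int, attributed_by_type: dict[str, int]) -> int:
    """Compare raw WSOL inflows with the sum of per-type attributions."""
    total_attributed = sum(attributed_by_type.values())
    # positive variance = unattributed inflow, negative variance = over-attribution
    variance = raw_wsol_inflow_lamports - total_attributed

    if variance < 0:
        # Over-attribution: more SOL attributed than actually arrived in the treasury.
        logger.error("Over-attribution of %d lamports detected", -variance)
    elif variance > 0:
        logger.warning("%d lamports of treasury inflow remain unattributed", variance)
    return variance

# Example: both methods agree, so the variance is zero.
assert reconcile(5_000_000, {"liquidation_fee": 3_000_000, "swap_fee": 2_000_000}) == 0
```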
Ledger Persistence
Cross-day attribution accuracy is maintained through automatic ledger carryforward:
- End-of-day ledger state saved automatically
- Next-day processing loads previous ledger
- Enables correct attribution when tokens accumulate over multiple days before swapping
Result: Prevents revenue loss from ATA token accumulations that span multiple days.
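A minimal sketch of the carryforward, assuming one JSON snapshot of pending token balances per day (the paths and structure are hypothetical):

```python
import json
from pathlib import Path

LEDGER_DIR = Path("ledgers")  # hypothetical location for per-day ledger snapshots

def save_ledger(day: str, pending_by_token: dict[str, int]) -> None:
    """Persist end-of-day pending token balances so the next run can pick them up."""
    LEDGER_DIR.mkdir(exist_ok=True)
    (LEDGER_DIR / f"{day}.json").write_text(json.dumps(pending_by_token, indent=2))

def load_previous_ledger(previous_day: str) -> dict[str, int]:
    """Load the prior day's ledger; start empty if none exists (e.g. the first run)."""
    path = LEDGER_DIR / f"{previous_day}.json"
    if not path.exists():
        return {}
    return json.loads(path.read_text())

# Tokens accumulated on day 1 stay pending until they are swapped on a later day.
save_ledger("2024-06-01", {"TUNA": 1_250_000})
carried_forward = load_previous_ledger("2024-06-01")
```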
Completeness Checks
Workflow Validation
Before data deployment, automated validation verifies:
- All exported JSON files exist and are valid
- Structure consistency across all data files
- Target date presence in all time-series data
If any check fails, the workflow aborts - bad data never reaches production.
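A rough sketch of what such a pre-deployment gate can look like (file names and JSON structure are assumptions, not the actual export format):

```python
import json
import sys
from pathlib import Path

def validate_exports(export_dir: Path, expected_files: list[str], target_date: str) -> list[str]:
    """Return a list of validation errors; an empty list means the export is good."""
    errors = []
    for name in expected_files:
        path = export_dir / name
        if not path.exists():
            errors.append(f"missing file: {name}")
            continue
        try:
            data = json.loads(path.read_text())
        except json.JSONDecodeError:
            errors.append(f"invalid JSON: {name}")
            continue
        # Time-series files must contain an entry for the date being deployed.
        if isinstance(data, dict) and "daily" in data and target_date not in data["daily"]:
            errors.append(f"{name} is missing target date {target_date}")
    return errors

if __name__ == "__main__":
    problems = validate_exports(Path("export"), ["revenue.json", "transactions.json"], "2024-06-01")
    if problems:
        print("\n".join(problems), file=sys.stderr)
        sys.exit(1)  # non-zero exit aborts the deployment workflow
```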
Transaction Coverage
Pipeline smoke tests verify:
- Summary files exist and are valid
- Transaction totals match across aggregation methods
- Daily transaction files exist for target dates
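At its core, the smoke test is just a cross-check of independent aggregations, as in this illustrative snippet:

```python
def smoke_test_totals(per_type_counts: dict[str, int], per_day_counts: dict[str, int]) -> None:
    """Totals computed by independent aggregations must agree exactly."""
    assert sum(per_type_counts.values()) == sum(per_day_counts.values()), \
        "aggregation methods disagree on total transaction count"

# Example: 200 transactions counted by type and by day.
smoke_test_totals({"swap": 120, "deposit": 80}, {"2024-06-01": 150, "2024-06-02": 50})
```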
Quality Standards
Accuracy Standards
Attribution Accuracy
- Total attributed SOL must equal raw WSOL inflows exactly
- System has consistently achieved 100% accuracy against verified benchmarks
- Critical safeguards prevent accidental modifications to attribution logic
Chronological Processing
- Transactions processed in timestamp order to maintain correct attribution
- Ensures Treasury token swaps are correctly attributed to earlier token accumulations
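To illustrate the ordering requirement, a minimal sketch (the transaction fields here are hypothetical):

```python
from operator import itemgetter

# Hypothetical transaction records; only the fields used here are shown.
transactions = [
    {"signature": "sig2", "block_time": 1717250000, "type": "treasury_swap"},
    {"signature": "sig1", "block_time": 1717240000, "type": "fee_accumulation"},
]

# Processing in timestamp order guarantees the fee accumulation is seen before
# the treasury swap that converts it, so the swap attributes to the right pending balance.
for tx in sorted(transactions, key=itemgetter("block_time")):
    print(tx["signature"], tx["type"])  # stands in for the attribution step
```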
Completeness Standards
Transaction Classification
Every transaction must be:
- Classified - Assigned a transaction type via log-based classification
- Attributed - Labeled across all dimensions (token, type, pool, wallet, time)
- Accounted - Either counted in realized revenue or tracked in pending ledger
Two-pass classification system (fast pass + enrichment for unknowns) ensures all transactions receive type assignments.
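A simplified sketch of the two-pass idea, with made-up log patterns standing in for the real configuration:

```python
# Hypothetical instruction-pattern map; the real patterns live in the pipeline's configuration.
LOG_PATTERNS = {
    "Instruction: OpenPositionWithLiquidity": "open_position",
    "Instruction: Liquidate": "liquidation",
}

def classify_fast(log_messages: list[str]) -> str:
    """Fast pass: match known log patterns; anything unmatched is 'unknown'."""
    for line in log_messages:
        for pattern, tx_type in LOG_PATTERNS.items():
            if pattern in line:
                return tx_type
    return "unknown"

def classify(log_messages: list[str]) -> str:
    """Two-pass classification: cheap pattern match first, enrichment only for unknowns."""
    tx_type = classify_fast(log_messages)
    if tx_type == "unknown":
        # Enrichment pass: the real pipeline might decode instructions or fetch
        # extra account data here; this sketch simply flags it for review.
        tx_type = "unknown_needs_review"
    return tx_type

assert classify(["Program log: Instruction: Liquidate"]) == "liquidation"
```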
Data Coverage
All time periods must have:
- Complete transaction enumeration from monitored accounts
- All expected data files generated and validated
- No gaps in time-series data
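Gap detection over a date range can be as simple as this sketch:

```python
from datetime import date, timedelta

def find_gaps(dates_present: set[str], start: date, end: date) -> list[str]:
    """Return any dates in [start, end] with no data file or time-series entry."""
    gaps = []
    current = start
    while current <= end:
        if current.isoformat() not in dates_present:
            gaps.append(current.isoformat())
        current += timedelta(days=1)
    return gaps

# Example: 2024-06-02 is missing from the series.
print(find_gaps({"2024-06-01", "2024-06-03"}, date(2024, 6, 1), date(2024, 6, 3)))
```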
Manual Verification Tools
These tools exist for deep auditing but require manual execution:
Verification Scripts
Comprehensive SOL Check
- Compares daily realized Treasury token amounts against test fixtures
- Validates total realized SOL against official numbers
- Identifies coverage gaps and missing date ranges
- Expected output: Confirmation of 100% accuracy with zero unattributed amounts
Daily Attribution Comparison
- Deep-dive verification for specific dates
- Compares Simple Method vs Realized Types Method
- Shows detailed breakdown when discrepancies found
- Run manually for date-specific debugging
Test Fixtures
Test fixtures contain manually verified benchmark data:
Daily Realized SOL - Contains manually verified daily totals serving as ground truth for accuracy testing.
Daily Transaction Counts - Expected transaction counts with acceptable tolerance for minor variations.
Sacred Rule: Never modify fixtures to make tests pass. Fixtures represent verified truth; code must be fixed to match fixtures.
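In test form, the rule translates to asserting against the fixture, never editing it. A pytest-style sketch (the file paths are assumptions):

```python
import json
from pathlib import Path

def test_daily_realized_sol_matches_fixture():
    """Output must match the manually verified fixture; the code, not the fixture, gets fixed."""
    fixture = json.loads(Path("tests/fixtures/daily_realized_sol.json").read_text())
    computed = json.loads(Path("export/daily_realized_sol.json").read_text())
    for day, expected_lamports in fixture.items():
        assert day in computed, f"{day} missing from pipeline output"
        assert computed[day] == expected_lamports, (
            f"{day}: got {computed[day]}, fixture says {expected_lamports}"
        )
```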
Cache Integrity Tests
Cache validation tools verify:
- Transaction counts against fixtures
- Acceptable tolerance for minor variations
- Pass/fail results for integration (not currently automated)
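The tolerance check itself is trivial; something along these lines (the tolerance value is illustrative):

```python
def counts_within_tolerance(actual: int, expected: int, tolerance: int = 2) -> bool:
    """Cache integrity: allow a small absolute deviation from the fixture count."""
    return abs(actual - expected) <= tolerance

# A one-transaction difference passes; a large drift fails.
assert counts_within_tolerance(101, 100)
assert not counts_within_tolerance(120, 100)
```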
Common Quality Issues
Accuracy Issues
Unattributed Revenue
- Symptom: Discrepancy between raw tracking and type attribution
- Causes: Treasury token swap attribution errors, ledger persistence issues, missing token conversions
- Resolution: Review ledger carryforward and Treasury token swap attribution logic
Over-Attribution
- Symptom: Type attribution exceeds raw tracking (triggers ERROR log)
- Causes: Double-counting Treasury token swap transactions, inclusion of non-revenue transactions, or incorrect attribution logic
- Resolution: Verify revenue filtering excludes non-revenue transaction types
Completeness Issues
Classification Gaps
- Symptom: Transactions classified as "unknown"
- Causes: New transaction types not in configuration, protocol updates changing log patterns, missing log messages
- Resolution: Add new instruction patterns to transaction type configuration, reprocess affected dates
Missing Data
- Symptom: Time-series gaps, missing files, incomplete date ranges
- Causes: RPC failures during enumeration, transaction fetching errors, processing interruptions
- Resolution: Re-run data collection for affected dates, verify account monitoring coverage
Continuous Improvement
As protocols evolve, the system improves through:
- New transaction discovery - Unknown transaction types prompt classification rule expansion
- Configuration updates - Add mappings to transaction type and pool label configurations
- Historical reprocessing - Apply updated rules uniformly across all dates
- Verification - Run manual scripts to confirm accuracy maintained after changes
Why reprocessing works: The system uses actual on-chain swap rates (facts that don't change) for conversions, so classification improvements can be applied retroactively without affecting conversion accuracy.
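In other words, conversion and classification are independent. A tiny sketch of why reclassification is safe (the numbers are made up):

```python
def realized_sol(token_amount: float, onchain_swap_rate_sol_per_token: float) -> float:
    """Conversion uses the rate recorded by the actual on-chain swap, which never changes."""
    return token_amount * onchain_swap_rate_sol_per_token

# Reclassifying the transaction later (e.g. from "unknown" to "liquidation_fee")
# changes which bucket this SOL lands in, but not the converted amount itself.
print(realized_sol(1_000.0, 0.0042))
```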