Dirty data is any record in your historical market database that is incomplete, inconsistent, inaccurate, duplicated, or out of place. For traders, risk analysts, quants, and reporting teams in India, dirty data can mean bad signals, wrong backtests, regulatory headaches, and wasted time. This article walks through simple, practical steps to find, fix, and prevent dirty data.
Common causes of dirty data
- Exchange or vendor feed errors: gaps, spikes or repeated ticks from NSE/BSE or third-party vendors.
- Corporate actions applied incorrectly or missed: splits, dividends, mergers and demergers not reflected in the price series.
- Timezone and timestamp issues: data stored in UTC when your models expect IST, or timestamps truncated to the wrong precision.
- Duplicates and replayed messages: same trade or quote recorded multiple times.
- Incorrect symbols or mapping problems: old ISINs, changed tickers, or multiple instruments using similar codes.
- Human errors and manual edits: bad uploads, inconsistent manual fixes without audit trails.
Why it matters in India
Even small errors can skew results. If you backtest a strategy on 10 years of historical price data that contains unadjusted dividends or wrong splits, the Sharpe ratio and drawdowns will be misleading. Regulatory reporting to SEBI or exchanges also demands accurate, auditable records. Costs can escalate if teams spend days chasing anomalies rather than improving models.
Quick ways to detect dirty data
Use automated checks early in your pipeline:
- Validate timestamps: ensure monotonic time and correct timezone (use IST for on-exchange Indian trades).
- Range checks: prices must be positive; volumes must be non-negative and within plausible exchange limits.
- Gap detection: find missing trading days or extended gaps in tick data.
- Spike detection: outliers many standard deviations from nearby values often indicate feed errors.
- Cross-check with reference sources: compare a random sample against vendor snapshots or exchange end-of-day files.
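The timestamp, range and spike checks above can be sketched in a few lines of Pandas. This is a minimal illustration, assuming a tick DataFrame with columns ts, price and volume; the column names, window size and 6-sigma threshold are placeholders, not a specific vendor schema:

```python
import numpy as np
import pandas as pd

def quality_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Flag suspect rows in a tick frame with columns ['ts', 'price', 'volume']."""
    out = df.copy()
    # Timestamps should be monotonic non-decreasing: flag rows that go backwards
    out["bad_ts"] = ~out["ts"].diff().fillna(pd.Timedelta(0)).ge(pd.Timedelta(0))
    # Range checks: prices positive, volumes non-negative
    out["bad_range"] = (out["price"] <= 0) | (out["volume"] < 0)
    # Spike detection: more than 6 deviations from a rolling median
    roll = out["price"].rolling(50, min_periods=10)
    z = (out["price"] - roll.median()) / roll.std().replace(0, np.nan)
    out["spike"] = z.abs() > 6
    return out
```

In a real pipeline you would route flagged rows to a quarantine table rather than dropping them, so the raw layer stays intact for reprocessing.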
Checklist to clean historical market data
- Normalize timestamps to a single timezone (IST) and consistent precision (milliseconds or nanoseconds as needed).
- De-duplicate trades and quotes using unique message IDs or a combination of timestamp+price+volume rules.
- Adjust for corporate actions using a reliable corporate-actions table so historical prices reflect splits and dividends.
- Fill or mark gaps intelligently: interpolation for small gaps, and explicit nulls or flags for long missing periods.
- Keep an audit trail: record whether a row was corrected, imputed, or left untouched.
Tip: Always keep an immutable raw layer. Store the original feed separately so you can reprocess data if rules change or bugs are found.
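Two of the checklist items, de-duplication and corporate-action adjustment, can be combined in a small helper. This is a sketch under assumed, illustrative schemas (ts, symbol, price, volume for trades; symbol, ex_date, ratio for the corporate-actions table) and back-adjusts only for splits:

```python
import pandas as pd

def dedupe_and_adjust(trades: pd.DataFrame, actions: pd.DataFrame) -> pd.DataFrame:
    """Drop replayed ticks and back-adjust prices for splits.

    trades:  columns ['ts', 'symbol', 'price', 'volume']
    actions: columns ['symbol', 'ex_date', 'ratio'], where ratio is the
             split factor, e.g. 2.0 for a 1:2 split.
    """
    # De-duplicate on timestamp+symbol+price+volume (no message IDs assumed)
    t = trades.drop_duplicates(subset=["ts", "symbol", "price", "volume"]).copy()
    # Divide prices before each ex-date so the series is continuous
    for _, a in actions.iterrows():
        mask = (t["symbol"] == a["symbol"]) & (t["ts"] < a["ex_date"])
        t.loc[mask, "price"] = t.loc[mask, "price"] / a["ratio"]
    return t
```

If your feed carries unique message IDs, de-duplicating on those is safer than the timestamp+price+volume heuristic, which can collapse two genuinely identical trades.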
Tools and storage choices that help
Use tools and formats that make validation and reprocessing easy. In India many teams use Python with Pandas for small to medium datasets, and Apache Spark or Dask for larger ones. Columnar formats like Parquet and table formats such as Delta Lake help with fast reads, partitioning, and schema evolution. If you run low-latency trading desks, consider kdb+/q or ClickHouse for high-performance historical queries. Keep costs and scale in mind: an exchange-grade tick feed subscription can run into lakhs of rupees per year, so protecting that investment with good quality controls is important.
Operational best practices
- Build quality checks into the ingestion pipeline so bad data is rejected or flagged immediately.
- Implement monitoring and alerts for unusual metrics: sudden rise in duplicates, missing partitions, or large volumes of nulls.
- Version your cleaned datasets and keep metadata about who changed what and why.
- Automate common fixes but require human review for ambiguous cases.
- Schedule regular reconciliation against exchange end-of-day files and an independent vendor snapshot.
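The reconciliation step above can start as simply as an outer join of your end-of-day closes against the exchange file. A sketch with an assumed two-column schema (symbol, close) and an illustrative 0.5% tolerance:

```python
import pandas as pd

def reconcile_eod(ours: pd.DataFrame, exchange: pd.DataFrame,
                  tol: float = 0.005) -> pd.DataFrame:
    """Compare our computed closes against the exchange end-of-day file.

    Both frames have columns ['symbol', 'close']. Returns rows where the
    relative difference exceeds `tol`, or where a symbol is missing on a side.
    """
    m = ours.merge(exchange, on="symbol", how="outer",
                   suffixes=("_ours", "_exch"), indicator=True)
    rel = (m["close_ours"] - m["close_exch"]).abs() / m["close_exch"]
    # Flag mismatched prices and symbols present on only one side
    return m[(m["_merge"] != "both") | (rel > tol)]
```

Feeding the returned rows into your alerting system turns this into the "sudden rise in duplicates or nulls" style of monitoring described above.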
People and process matter
Data quality is not only a tech problem. Create simple runbooks, assign owners for each dataset, and encourage a culture of documenting manual fixes. Small teams can benefit from weekly "data health" standups to discuss recurring issues and prevention.
Closing thought
Dirty historical market data is inevitable, but manageable. With clear checks, automated pipelines, trustworthy corporate-action processing, and an immutable raw layer, you can reduce surprises and trust your backtests and reports. Start with simple rules, monitor continuously, and evolve the system as your data volumes and use cases grow.