3.5: Data Quality Standards and Metrics
Tl;dr
Establishes data quality and sovereignty foundations for the new eMap platform, mandating compliance with Australian data residency requirements through Azure hosting in Melbourne (primary) and Sydney (secondary). Data classification determines storage controls, with sensitive datasets subject to stricter handling. Core spatial data quality dimensions – accuracy, completeness, consistency, timeliness and lineage – provide measurable criteria for assessing dataset integrity. Quantifiable metrics and minimum thresholds translate these dimensions into actionable KPIs, including adherence rates and remediation timelines. Automated validation rules enforce schema conformance, attribute validity and spatial topology, integrated into data workflows to prevent quality breaches. Continuous monitoring via Azure dashboards and issue-tracking systems supports transparent reporting and remediation prioritisation. These standards ensure geospatial data remains trustworthy, compliant and fit-for-purpose across its lifecycle while aligning with organisational policies on security and sovereignty.
This section details the critical standards and metrics for data quality and data sovereignty that will underpin the new eMap platform, outlining the framework for defining, measuring and maintaining high-quality spatial data in alignment with departmental policies.
3.5.1. Data Sovereignty and Residency Requirements
- Alignment with Organisational Policies: All data hosted on the new eMap platform must comply with applicable departmental policies regarding data storage, processing and access. This includes, but is not limited to, regulations concerning data privacy, security classifications and data flows.
- Data Classification Impact: The sensitivity level of each dataset, as determined during the inventory and classification process, will directly influence its specific residency and handling requirements. Sensitive data may necessitate stricter controls and potentially limit storage options.
- Impact on Storage Choices (Azure Regions and Replication):
- The primary Azure region for all environments (DEV, UAT, PROD) is Australia Southeast (Melbourne).
- The secondary Azure region for Production DR capabilities is Australia East (Sydney).
- All data, including backups and replicas, must reside within these Australian Azure regions to ensure compliance with data sovereignty mandates.
- The choice of Azure Storage replication strategies (LRS, ZRS, GRS, GZRS) for PROD PaaS resources will ensure data remains within Australia, with GRS/GZRS providing inter-region redundancy between Melbourne and Sydney. DEV and UAT environments will primarily use LRS or ZRS within the Melbourne region.
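As an illustration of how the residency and replication requirements above could be verified programmatically, the sketch below audits the storage accounts in a subscription against the approved Australian regions and replication SKUs. It assumes the azure-identity and azure-mgmt-storage Python packages and a subscription ID supplied via an environment variable; the allowed-value sets simply mirror the regions and SKUs listed above and are not an exhaustive policy definition.

```python
# Sketch: audit Azure Storage accounts for Australian residency and approved
# replication SKUs. Package names are real Azure SDK libraries; the allowed
# sets and environment variable are illustrative assumptions.
import os
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

ALLOWED_REGIONS = {"australiasoutheast", "australiaeast"}   # Melbourne (primary), Sydney (DR)
ALLOWED_SKUS = {"Standard_LRS", "Standard_ZRS", "Standard_GRS", "Standard_GZRS"}

def audit_storage_residency(subscription_id: str) -> list[str]:
    """Return findings for storage accounts breaching residency or replication rules."""
    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)
    findings = []
    for account in client.storage_accounts.list():
        if account.location.lower() not in ALLOWED_REGIONS:
            findings.append(f"{account.name}: non-compliant region '{account.location}'")
        if account.sku.name not in ALLOWED_SKUS:
            findings.append(f"{account.name}: unapproved replication SKU '{account.sku.name}'")
    return findings

if __name__ == "__main__":
    for finding in audit_storage_residency(os.environ["AZURE_SUBSCRIPTION_ID"]):
        print(finding)
```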
3.5.2. Defining Spatial Data Quality Dimensions
To ensure the utility and reliability of geospatial data within the new eMap platform, specific data quality dimensions must be defined and consistently applied. These dimensions provide a framework for assessing and improving dataset integrity:
- Accuracy:
- Thematic Accuracy: The correctness of attribute values associated with spatial features. This involves ensuring attributes accurately describe the real-world entity they represent (e.g., correct road classification, valid land use code).
- Temporal Accuracy: The degree to which data is current with respect to the phenomenon it represents and the accuracy of timestamps or date attributes.
- Completeness:
- Feature Completeness: The extent to which all required features are present in the dataset for a given area of interest and specification.
- Attribute Completeness: The extent to which all required attribute fields for features have valid, non-null values.
- Consistency:
- Logical Consistency: The adherence of data to defined logical rules and relationships (e.g., topological integrity – no overlapping polygons of a certain type, lines connecting at nodes). PostGIS functions can be leveraged for automated topology checks in user-managed Enterprise Geodatabases.
- Temporal Consistency: Ensuring that changes in data over time are logical and correctly recorded, without anachronisms.
- Format Consistency: Data adheres to specified file formats, data types and domain values, particularly when exchanged or integrated.
- Currency (Timeliness):
- The degree to which the data represents the state of the real world at a specific point in time, relative to user needs. This is closely linked to update frequency and data collection methodologies.
- Lineage:
- Provenance: Documenting the origin of the dataset, including source materials, collection methods and original scale.
- Transformation History: Recording the processes, transformations and algorithms applied to the data from its source to its current state. Automated capture of metadata related to modifications (user, timestamp, operation, source workflow) is crucial, especially for workflows interacting with user-managed Enterprise Geodatabases; a minimal capture sketch is provided below.
These dimensions are not mutually exclusive and often interrelate. They form the basis for establishing specific quality metrics and thresholds.
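To make the lineage dimension concrete, the following sketch shows one way the automated capture of modification metadata could look for edits made against a user-managed Enterprise Geodatabase. The emap_lineage table, its columns and the connection details are illustrative assumptions, not a mandated schema.

```python
# Sketch: record a lineage entry for an edit against an Enterprise Geodatabase.
# The emap_lineage table, column names and connection string are assumptions
# used purely for illustration.
from datetime import datetime, timezone
import psycopg2

def record_lineage(conn, dataset: str, operation: str, user: str, source_workflow: str) -> None:
    """Append a lineage record capturing who changed what, when and via which workflow."""
    with conn.cursor() as cur:
        cur.execute(
            """
            INSERT INTO emap_lineage (dataset, operation, edited_by, source_workflow, edited_at)
            VALUES (%s, %s, %s, %s, %s)
            """,
            (dataset, operation, user, source_workflow, datetime.now(timezone.utc)),
        )
    conn.commit()

# Example usage within an editing workflow:
# conn = psycopg2.connect("dbname=egdb user=emap_editor")
# record_lineage(conn, "roads", "UPDATE", "jsmith", "VertiGIS road-maintenance workflow")
```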
3.5.3. Establishing Quality Thresholds and Metrics
Defining measurable metrics and acceptable quality thresholds is essential for objectively assessing data quality and fitness for use.
- Importance of Measurable Metrics: Metrics transform abstract quality dimensions into quantifiable indicators. For example, completeness can be measured as the percentage of required features present, and thematic accuracy as the percentage of correctly classified features (a worked sketch follows this list).
- Setting Minimum Acceptable Quality Thresholds: For key enterprise datasets, Data Owners and Data Stewards should define minimum acceptable quality thresholds. These thresholds should be based on the intended use of the data and any operational requirements.
- Key Performance Indicators (KPIs) for Data Quality:
- Percentage of enterprise datasets meeting defined quality thresholds.
- Average time to remediate identified data quality issues.
- Number of data quality incidents reported per month/quarter.
- User satisfaction scores related to data quality.
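The sketch below illustrates how the dimensions and KPIs above might be expressed as quantifiable metrics and compared against agreed thresholds. The threshold values, dataset names and sample counts are illustrative assumptions rather than mandated figures.

```python
# Sketch: compute quality metrics and compare them against agreed thresholds.
# All figures below are illustrative, not mandated.
from dataclasses import dataclass

@dataclass
class QualityResult:
    dataset: str
    metric: str
    value: float       # observed value, as a percentage
    threshold: float   # minimum acceptable value, as a percentage

    @property
    def passed(self) -> bool:
        return self.value >= self.threshold

def completeness_pct(present_features: int, required_features: int) -> float:
    """Feature completeness as a percentage of required features present."""
    return 100.0 * present_features / required_features if required_features else 100.0

def thematic_accuracy_pct(correct: int, sampled: int) -> float:
    """Thematic accuracy as a percentage of sampled features correctly classified."""
    return 100.0 * correct / sampled if sampled else 100.0

results = [
    QualityResult("roads", "feature_completeness", completeness_pct(9_870, 10_000), 98.0),
    QualityResult("land_use", "thematic_accuracy", thematic_accuracy_pct(455, 480), 95.0),
]
kpi = 100.0 * sum(r.passed for r in results) / len(results)   # % of datasets meeting thresholds
for r in results:
    status = "PASS" if r.passed else "FAIL"
    print(f"{r.dataset}.{r.metric}: {r.value:.1f}% (threshold {r.threshold}%) -> {status}")
print(f"KPI (datasets meeting thresholds): {kpi:.0f}%")
```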
3.5.4. Data Quality Assessment and Remediation Processes
A systematic approach to assessing data quality and remediating issues is crucial for maintaining the integrity of the new eMap platform's data assets.
- Automated Validation Rules:
- Wherever feasible, automated validation rules should be implemented to assess data quality. This is particularly important for data within the Enterprise Geodatabases.
- Schema Conformance: Checks against defined table structures, field types and constraints (leveraging native PostgreSQL constraints where possible).
- Attribute Domain Validation: Verifying attribute values against predefined lists or ranges.
- Spatial Integrity and Topology Validation: Using PostGIS functions to check for valid geometries, spatial reference consistency and topological errors (e.g., overlaps, gaps, dangles) for relevant datasets.
- Business Rule Verification: Implementing checks to ensure data conforms to specific organisational or operational business rules (e.g., a road segment must be connected to other road segments).
- Workflows (e.g., VertiGIS Studio Workflows) that create or modify enterprise data should incorporate these validation gates before committing changes to ensure data integrity; a validation-gate sketch is provided at the end of this section.
- Reporting Quality Metrics:
- Data quality metrics should be regularly reported and made available to Data Owners, Data Stewards and other relevant stakeholders.
- This reporting should ideally be integrated with the Monitoring and Observability Framework, potentially through dashboards in Azure Monitor; a metric-ingestion sketch is provided at the end of this section.
- Remediation Workflows:
- A process for logging, tracking and resolving identified data quality issues should be established. This may involve using an issue-tracking system (e.g., Jira); an issue-creation sketch is provided at the end of this section.
- Remediation may involve manual data correction, updates to ETL processes, modification of data capture procedures, or enhancements to automated validation rules.
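The following sketch illustrates the automated validation gates described above, running SQL checks (including the PostGIS ST_IsValid function) from Python before a workflow commits its edits. Table names, column names and domain values are assumptions for illustration only; real rules would be derived from each dataset's specification.

```python
# Sketch: automated validation gates run before a workflow commits edits to an
# Enterprise Geodatabase. The roads table, its columns and the road_class
# domain are illustrative assumptions.
import psycopg2

CHECKS = {
    # Spatial integrity: features with invalid geometries (PostGIS ST_IsValid).
    "invalid_geometries":
        "SELECT COUNT(*) FROM roads WHERE NOT ST_IsValid(geom)",
    # Attribute domain validation: road_class must come from the agreed domain list.
    "invalid_road_class":
        "SELECT COUNT(*) FROM roads WHERE road_class NOT IN ('arterial', 'collector', 'local')",
    # Attribute completeness: the mandatory name field must not be null.
    "missing_names":
        "SELECT COUNT(*) FROM roads WHERE name IS NULL",
}

def run_validation_gate(conn) -> dict[str, int]:
    """Run each check and return the number of offending rows per rule."""
    failures = {}
    with conn.cursor() as cur:
        for rule, sql in CHECKS.items():
            cur.execute(sql)
            count = cur.fetchone()[0]
            if count:
                failures[rule] = count
    return failures

# A workflow would abort (or roll back its edit session) if any rule reports failures:
# conn = psycopg2.connect("dbname=egdb user=emap_editor")
# failures = run_validation_gate(conn)
# if failures:
#     raise RuntimeError(f"Validation gate failed: {failures}")
```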
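For reporting, quality metrics could be pushed into Azure Monitor so they surface on the same dashboards used by the Monitoring and Observability Framework. The sketch below assumes the azure-monitor-ingestion client library and a pre-provisioned data collection endpoint, data collection rule and custom table; the endpoint, rule ID and stream name shown are placeholders.

```python
# Sketch: publish data quality metrics to Azure Monitor via the Logs Ingestion API.
# Assumes the azure-identity and azure-monitor-ingestion packages; the endpoint,
# rule ID and stream name below are placeholders, not real identifiers.
from datetime import datetime, timezone
from azure.identity import DefaultAzureCredential
from azure.monitor.ingestion import LogsIngestionClient

DCE_ENDPOINT = "https://emap-dce.australiasoutheast-1.ingest.monitor.azure.com"  # placeholder
DCR_RULE_ID = "dcr-00000000000000000000000000000000"                            # placeholder
STREAM_NAME = "Custom-EmapDataQuality_CL"                                        # placeholder

def publish_quality_metrics(results: list[dict]) -> None:
    """Upload one record per dataset/metric so dashboards can chart pass rates over time."""
    client = LogsIngestionClient(endpoint=DCE_ENDPOINT, credential=DefaultAzureCredential())
    records = [
        {
            "TimeGenerated": datetime.now(timezone.utc).isoformat(),
            "Dataset": r["dataset"],
            "Metric": r["metric"],
            "Value": r["value"],
            "Threshold": r["threshold"],
            "Passed": r["value"] >= r["threshold"],
        }
        for r in results
    ]
    client.upload(rule_id=DCR_RULE_ID, stream_name=STREAM_NAME, logs=records)

# Example usage:
# publish_quality_metrics([{"dataset": "roads", "metric": "feature_completeness",
#                           "value": 98.7, "threshold": 98.0}])
```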
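Finally, where Jira is used as the issue-tracking system, remediation items could be raised automatically from validation results. The sketch below calls Jira's REST API via the requests package; the project key, issue type and credential handling are assumptions for illustration.

```python
# Sketch: log a data quality issue in Jira from an automated validation run.
# The EMAPDQ project key, Task issue type and environment variables are
# hypothetical placeholders.
import os
import requests

def raise_quality_issue(summary: str, description: str) -> str:
    """Create a Jira issue for a data quality finding and return its issue key."""
    response = requests.post(
        f"{os.environ['JIRA_BASE_URL']}/rest/api/2/issue",
        auth=(os.environ["JIRA_USER"], os.environ["JIRA_API_TOKEN"]),
        json={
            "fields": {
                "project": {"key": "EMAPDQ"},       # hypothetical project key
                "summary": summary,
                "description": description,
                "issuetype": {"name": "Task"},
            }
        },
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["key"]

# Example usage:
# raise_quality_issue("roads: 130 features with invalid geometry",
#                     "Detected by the automated validation gate; see attached rule output.")
```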