5.2: Data Migration and Validation
TL;DR
This section discusses the technical approach for migrating spatial and attribute data from the legacy Esri SDE geodatabase on SQL Server to Azure PostgreSQL, emphasising automated scripting to ensure accuracy and efficiency. Core tools include ArcPy for handling Esri-specific elements such as raster conversion to cloud-optimised CRF/MRF formats, GDAL/OGR for bulk vector data transfer and spatial type conversion, and GeoPandas for complex attribute transformations. The process involves schema mapping, coordinate system reprojection, and rigorous validation of schema conformance, attribute accuracy, spatial integrity and data completeness using PostGIS functions and automated scripts. Data Stewards and Owners validate results against source systems and approve datasets for production; credentials are managed securely via Azure Key Vault. While FME is available for exceptional cases, primary reliance is placed on code-based tools aligned with long-term platform strategy. The combined approach minimises manual intervention, ensures data quality and supports the eMap platform's transition to open standards and cloud-native storage.
This section focuses on the technical execution of data migration and discusses some of the tools, automated scripting approaches and critical validation processes required to ensure data integrity and fitness for purpose in the new eMap platform.
5.2.1. Automated Migration Scripts and Tools for SDE on SQL Server
The migration from a traditional Esri SDE geodatabase on SQL Server to a modern Azure PostgreSQL environment represents a significant architectural shift. To minimise manual intervention, reduce the risk of human error and ensure consistent application of transformation rules, this migration should be managed through automated scripts.
Script Development and Tooling Philosophy
The development and maintenance of migration scripts will be a collaborative effort between GIS Engineers, Data Engineers and DBAs. The selection of tools for specific migration tasks will be guided by:
- Source Data Characteristics: Complexity of Esri geodatabase objects (e.g., feature classes, tables, relationship classes, domains, subtypes), data volumes, and specific SQL Server configurations.
- Transformation Complexity: Requirements for schema mapping, attribute value changes, coordinate system reprojection and spatial type conversion.
- Target Azure PaaS Services: Optimised loading methods for Azure Database for PostgreSQL and Azure Data Lake Storage Gen2.
- Team Skillset and Maintainability: Preference for tools that are well-documented, widely supported, and align with the team's existing and developing skillsets.
Core Migration Tools and Their Roles
A combination of tools, primarily Python libraries, will provide the flexibility and power needed for this migration.
1. Esri ArcPy
ArcPy, Esri's Python package, is indispensable when working with data originating from or destined for Esri environments. Its deep integration with the geodatabase model makes it a primary choice for certain aspects of this migration.
- Strengths:
    - Rich Geodatabase Access: Native understanding of all Esri geodatabase constructs, including feature classes, tables, domains, subtypes, and relationship classes. It can read data directly from the SDE geodatabase on SQL Server.
    - Comprehensive Geoprocessing Suite: Access to a vast array of geoprocessing tools for data conversion, analysis, and transformation that will be needed in the transformation workflow.
    - Raster Data Handling: Excellent for converting legacy raster formats into cloud-optimised formats such as Cloud Raster Format (CRF) and Meta Raster Format (MRF), including control over tiling, compression, and pyramid generation.
    - Scripting and Automation: Fully scriptable with Python, so migration tasks can be automated end to end.
- Suitability for eMap Migration:
    - Extracting Complex SDE Objects: If specific Esri SDE features like domains, subtypes, or complex relationship class logic need to be introspected or translated into new structures/rules in PostgreSQL, ArcPy is the best tool to understand and extract this information.
    - Raster Data Conversion: ArcPy is the designated tool for converting all legacy raster datasets to CRF or MRF for storage in Azure Data Lake Storage Gen2.
    - Pre-processing Tasks: Performing any necessary clean-up or preparation of source datasets before migration.
    - Writing to Registered Enterprise Geodatabases: When data needs to be written to the Enterprise geodatabase, ArcPy should be used.
- Conceptual Example (Exporting a feature class to a shapefile as an intermediate step):
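A minimal sketch of this export, assuming an existing `.sde` connection file; the paths and dataset names are placeholders:

```python
import arcpy

# .sde connection file pointing at the legacy SQL Server geodatabase (placeholder path)
sde_connection = r"C:\connections\legacy_sde.sde"
staging_folder = r"C:\staging"

arcpy.env.workspace = sde_connection

# Export the fully qualified feature class to an intermediate shapefile;
# FeatureClassToFeatureClass lets us control the output filename.
arcpy.conversion.FeatureClassToFeatureClass(
    in_features="SDE_SCHEMA.LegacyDataset",
    out_path=staging_folder,
    out_name="legacy_dataset.shp",
)
```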
- `SDE_SCHEMA.LegacyDataset`: Refers to the fully qualified name of the feature class within the SDE geodatabase.
- Use `FeatureClassToFeatureClass` to specify the output filename.
2. GDAL/OGR (Geospatial Data Abstraction Library)
GDAL (for raster data) and OGR (for vector data) are powerful open-source libraries, often used via the `ogr2ogr` command-line utility.
- Strengths:
    - Broad Format Support: Can read from and write to an extensive list of GIS formats, including direct read access to Esri SDE geodatabases (via appropriate drivers and connection strings).
    - Efficient Translations: Highly optimised for bulk data conversion and loading.
    - Spatial Type and SRS Handling: Manages conversion between different spatial geometry types (e.g., SDE's internal geometry to PostGIS `geometry`) and performs on-the-fly coordinate system (SRS) transformations.
    - Command-Line Interface (CLI): `ogr2ogr` is excellent for scripting and automating batch migration jobs.
    - SQL Capabilities: Allows for attribute selection and simple transformations using SQL queries against the source data.
- Suitability for eMap Migration:
    - Bulk Vector Migration: Ideal for migrating numerous vector feature classes from the SDE on SQL Server directly into tables in Azure Database for PostgreSQL. It handles the critical ST_Geometry (or SDEBINARY) to PostGIS `geometry` conversion efficiently.
    - Direct SDE to PostGIS: Often the most direct and performant path for "lifting and shifting" vector feature classes where complex Esri-specific object logic (such as subtypes or relationship classes that need special handling beyond simple data transfer) is not the primary concern.
- Example (`ogr2ogr` CLI for SDE to PostGIS):
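A minimal sketch, assuming a GDAL build with SQL Server and PostgreSQL support; all connection values and object names are placeholders to adapt:

```bash
# Placeholder connection strings; exact formats depend on the GDAL version and drivers
SDE_CONN_STRING="MSSQL:server=legacy-sql;database=sde;trusted_connection=yes"
PG_CONN_STRING="PG:host=<server>.postgres.database.azure.com dbname=emap user=<user> password=<secret> sslmode=require"

SOURCE_SDE_FC="SDE_SCHEMA.LegacyDataset"
TARGET_PG_SCHEMA="staging"
TARGET_PG_TABLE="legacy_dataset"

ogr2ogr -f "PostgreSQL" "${PG_CONN_STRING}" "${SDE_CONN_STRING}" \
  -sql "SELECT * FROM ${SOURCE_SDE_FC}" \
  -nln "${TARGET_PG_SCHEMA}.${TARGET_PG_TABLE}" \
  -t_srs "EPSG:7844" \
  -lco GEOMETRY_NAME=geom \
  -progress
```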
- `SDE_CONN_STRING`: The connection string for `ogr2ogr` to access the SDE geodatabase. The exact format can depend on the GDAL version and available ODBC drivers. Ensure the necessary SQL Server ODBC driver is installed on the machine running `ogr2ogr`.
- `PG_CONN_STRING`: Standard PostgreSQL connection string. `sslmode=require` is crucial for Azure Database for PostgreSQL.
- `SOURCE_SDE_FC` and `TARGET_PG_TABLE`: Define the source and target object names.
- `TARGET_PG_SCHEMA`: Specifies the schema within the PostgreSQL database where the table will be created.
- `-sql "SELECT * FROM ${SOURCE_SDE_FC}"`: This syntax is one way to specify the source layer from an SDE connection. Alternatively, the SDE connection string itself can sometimes point to a specific layer, or the source layer name can be provided as a separate argument if not using `-sql`.
- `-t_srs "EPSG:7844"`: Specifies the target coordinate system. If the source data is in a different GDA datum (e.g. GDA94), ensure appropriate transformation paths are considered.
3. GeoPandas
GeoPandas is a Python library that extends the data types used by pandas to allow spatial operations on geometric types, making it easy to work with geospatial data in a Pythonic, tabular way.
- Strengths:
    - Pandas Integration: Leverages the powerful data manipulation capabilities of pandas for attribute data.
    - Simplified Spatial Operations: Provides an intuitive API for common spatial operations (e.g., buffer, intersect, dissolve) by integrating with `shapely`.
    - Format Reading/Writing: Can read various vector formats (including SDE layers via `fiona`, which in turn uses GDAL drivers) and write to PostGIS (using SQLAlchemy and `psycopg2` or `psycopg` under the hood).
    - Python Ecosystem: Fits naturally into Python-based ETL workflows.
- Suitability for eMap Migration:
    - Attribute-Intensive Transformations: When migration involves complex attribute cleaning, calculations, joining with non-spatial data, or restructuring of attribute tables before loading to PostGIS.
    - Custom Python Workflows: For scenarios where a more programmatic and flexible approach than CLI tools is needed for transformation logic.
- Conceptual Example (Reading SDE, transforming, writing to PostGIS):
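A minimal sketch, assuming the source layer has been staged where GDAL can read it (a File Geodatabase export stands in for a direct SDE read here); paths, names and credentials are placeholders:

```python
import geopandas as gpd
from sqlalchemy import create_engine

# Read the staged layer; GeoPandas uses GDAL drivers (via fiona/pyogrio) under the hood
gdf = gpd.read_file(r"C:\staging\legacy_export.gdb", layer="LegacyDataset")

# Example attribute transformation: standardise column names to lowercase
gdf.columns = [c.lower() for c in gdf.columns]

# Reproject to the platform standard SRS (GDA2020)
gdf = gdf.to_crs(epsg=7844)

# SQLAlchemy engine with psycopg2 as the driver; in practice the password
# would be retrieved from Azure Key Vault rather than embedded here
engine = create_engine(
    "postgresql+psycopg2://<user>:<password>@<server>.postgres.database.azure.com:5432/emap"
    "?sslmode=require"
)

# Write the GeoDataFrame to a new table in the target schema
gdf.to_postgis("legacy_dataset", engine, schema="staging", if_exists="replace")
```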
- SQLAlchemy Engine: Provides the connection to the Azure Database for PostgreSQL instance. `psycopg2` is specified as the driver.
- Writing to PostGIS: The `to_postgis` method allows writing the GeoDataFrame to a new table in the specified schema.
4. PostGIS Shapefile Loader (`shp2pgsql`)
The `shp2pgsql` utility is a command-line tool that converts Esri Shapefiles into SQL commands for loading into a PostGIS database.
- Strengths:
    - Native PostGIS Tool: Specifically designed for Shapefile to PostGIS conversion.
    - Efficient Loading: Generates SQL optimised for PostGIS.
    - Geometry Validation: Can include options for basic geometry validation during the SQL generation process.
- Suitability for eMap Migration:
    - Limited Role: Given the primary source is an Esri SDE geodatabase on SQL Server, `shp2pgsql` will not be a primary migration tool for core enterprise datasets.
    - Auxiliary Shapefiles: Its use would be limited to ingesting any standalone Shapefiles that might exist as part of the legacy system (e.g., user-generated data, third-party data deliveries in Shapefile format) that also need to be migrated.
- Example (Loading a Shapefile):
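A minimal sketch, piping the generated SQL straight into `psql`; the SRID, table name and connection details are placeholders:

```bash
# -s 7844 : declare the SRID (GDA2020)
# -I      : create a spatial (GiST) index on the geometry column
# -g geom : name the geometry column "geom"
shp2pgsql -s 7844 -I -g geom aux_data.shp staging.aux_data | \
  psql "host=<server>.postgres.database.azure.com dbname=emap user=<user> sslmode=require"
```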
Note on Esri ArcGIS Data Interoperability Extension (FME)
The Esri ArcGIS Data Interoperability extension, which embeds Safe Software's FME, offers a visual environment for ETL processes. While it supports a range of formats and transformations, its use in this project should be approached with caution.
Limited FME Use Recommended

IRD has adopted the strategic direction to build expertise and ongoing solutions primarily using code-first tools and native Azure capabilities. FME should be reserved for:

- Highly complex, one-off migration tasks where the development effort using scripted tools would be prohibitive.
- Situations where other recommended tools have proven insufficient for a specific, isolated dataset.

FME should not be used to develop workflows that will be part of the ongoing operational environment of the new eMap platform.
Key Script Functionalities for SDE Migration
Regardless of the specific tools chosen for a task, the automated migration scripts must address the following core functionalities:
Connection Management
Scripts require secure and reliable connections to both the source legacy SDE geodatabase on SQL Server and the target Azure PaaS services. Adhering to the Zero Trust Security Model, all credentials (database usernames/passwords, API keys) must be stored in Azure Key Vault. Scripts should retrieve these secrets at runtime, ideally using Managed Identities or service principals with appropriately scoped, least-privilege access.
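As a sketch of this pattern, using the `azure-identity` and `azure-keyvault-secrets` packages (the vault URL and secret name are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential resolves to a Managed Identity when running in Azure,
# falling back to developer credentials locally
credential = DefaultAzureCredential()
client = SecretClient(
    vault_url="https://<vault-name>.vault.azure.net",
    credential=credential,
)

# Retrieve the database password at runtime; nothing is hard-coded in the script
pg_password = client.get_secret("emap-pg-migration-password").value
```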
Data Extraction from SDE on SQL Server

Scripts must efficiently extract data from the source SDE geodatabase. This involves:

- Connecting to the SQL Server instance hosting the SDE.
- Accessing the specific feature classes and tables, often identified by their fully qualified names (e.g., `SDE_SCHEMA_OWNER.FEATURE_CLASS_NAME`).
- Handling potentially large data volumes. For very large tables, extraction might need to be batched or filtered (a batching sketch follows this list).
- Consideration for versioned data: For migration, the common practice is to migrate data from the `DEFAULT` version or a specific named version, effectively flattening the version history. The exact strategy for handling versioned data must be determined based on business requirements for historical data.
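One way to batch extraction from large tables, sketched with `pyodbc`; the connection details, table name and key column are placeholders:

```python
import pyodbc

# Placeholder connection; in practice credentials come from Azure Key Vault
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=legacy-sql;DATABASE=sde;Trusted_Connection=yes;Encrypt=yes"
)

BATCH = 50_000
offset = 0
with conn.cursor() as cur:
    while True:
        # OFFSET/FETCH paginates the fully qualified SDE table by OBJECTID
        cur.execute(
            "SELECT * FROM SDE_SCHEMA_OWNER.FEATURE_CLASS_NAME "
            "ORDER BY OBJECTID OFFSET ? ROWS FETCH NEXT ? ROWS ONLY",
            offset, BATCH,
        )
        rows = cur.fetchall()
        if not rows:
            break
        # ... hand the batch to the transformation step ...
        offset += BATCH
```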
Data Transformation
Key transformations include:
- Schema Mapping: Systematically translating source SDE schema (table names, field names, data types) to the target schemas defined for Azure Database for PostgreSQL. This might involve renaming fields to adhere to new naming conventions and PostgreSQL best practices (e.g., lowercase).
- Attribute Value Transformation: Cleansing, standardising and re-coding attribute values. This can include converting data types (e.g., SQL Server `datetime` to PostgreSQL `timestamp`), handling `NULL` values or applying domain mappings.
- Spatial Reference System (SRS) Transformation: Re-projecting datasets from their legacy coordinate system (e.g., GDA94, local grids) to the new eMap platform's standard SRS (e.g., GDA2020, identified by its EPSG code). Tools such as ArcPy and GDAL/OGR provide robust reprojection capabilities.
- Spatial Type Conversion (SQL Server SDE to PostGIS): This is a critical and fundamental transformation. The legacy SDE geodatabase on SQL Server stores geometries using Esri's proprietary spatial types (e.g., `ST_Geometry` or the older `SDEBINARY` format). Azure Database for PostgreSQL, with the PostGIS extension, uses PostGIS native geometry types (e.g., `geometry`, `geography`). Migration scripts must convert the source Esri spatial types into formats compatible with PostGIS. This can often be done using Well-Known Text (WKT) or Well-Known Binary (WKB) as an intermediate step, or by tools such as `ogr2ogr`.
    - Why this is important: PostGIS provides a rich set of open spatial functions and is the standard for geospatial data in PostgreSQL. The eMap platform will leverage these PostGIS capabilities.
- Raster Format Conversion (to CRF/MRF): As defined in the Raster Data Storage Strategy, legacy raster formats (e.g., GeoTIFF, IMG, Esri Grid) must be converted to cloud-optimised formats using ArcPy.
    - Cloud Raster Format (CRF): Primary target for analytical and new raster data, stored in ADLS Gen2. Scripts must ensure LERC compression, appropriate tiling (e.g., 512x512 pixels), and pyramid generation.
    - Meta Raster Format (MRF): For specific use cases such as pre-rendered basemap caches.
This conceptual Python script shows CRF conversion using ArcPy:
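A minimal sketch, assuming `CopyRaster` with ArcPy environment settings to control compression, tiling and pyramids; the paths and LERC tolerance are placeholders:

```python
import arcpy

# Environment settings shape the CRF output per the Raster Data Storage Strategy:
# LERC compression, 512x512 tiling, and pyramid generation.
arcpy.env.compression = "LERC 0.01"   # LERC with a small max-error tolerance
arcpy.env.tileSize = "512 512"
arcpy.env.pyramid = "PYRAMIDS -1 NEAREST"

# Placeholder paths; the target would be a staging location synced to ADLS Gen2
source_raster = r"C:\legacy\rasters\dem.tif"
target_crf = r"C:\staging\rasters\dem.crf"

# Writing to a .crf path produces Cloud Raster Format
arcpy.management.CopyRaster(source_raster, target_crf)
```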
Data Loading into Azure PaaS

Transformed data must be loaded efficiently into Azure Database for PostgreSQL and ADLS Gen2:

- PostgreSQL: Use bulk loading mechanisms where possible. For `ogr2ogr`, this is handled internally. For Python scripts using `psycopg2`, methods such as `execute_values` or the `COPY` command (by writing data to an in-memory buffer or temporary file) are highly efficient. For GeoPandas, `to_postgis` handles this. A sketch of the `COPY` approach follows this list.
- ADLS Gen2: For raster files (CRF/MRF), scripts should write directly to the designated ADLS Gen2 file system paths. For large numbers of files, parallel uploads can improve throughput.
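A minimal sketch of the `COPY` approach with `psycopg2`; the table, columns and connection are placeholders, and geometry is passed as EWKT, which PostGIS accepts as text input:

```python
import io
import psycopg2

# Placeholder connection; credentials would come from Azure Key Vault at runtime
conn = psycopg2.connect(
    "host=<server>.postgres.database.azure.com dbname=emap user=<user> "
    "password=<secret> sslmode=require"
)

# 'records' stands in for the transformed rows: (id, name, EWKT geometry)
records = [
    (1, "Sample", "SRID=7844;POINT(147.3 -42.9)"),
]

# Stream rows through an in-memory buffer and bulk-load with COPY
buf = io.StringIO()
for rec in records:
    buf.write("\t".join(map(str, rec)) + "\n")
buf.seek(0)

with conn.cursor() as cur:
    cur.copy_expert(
        "COPY staging.legacy_dataset (id, name, geom) FROM STDIN",
        buf,
    )
conn.commit()
```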
Logging and Error Handling
Migration scripts must implement comprehensive logging to track progress, record transformations, and capture any errors or warnings. Robust error handling is essential to manage issues gracefully, allow for potential resumption of failed jobs, and facilitate targeted re-runs.
Management and Execution of Migration Scripts
- Version Control: All migration scripts (Python, shell scripts, SQL, etc.) should be stored in the designated git repository. This facilitates version tracking, collaborative development, code reviews, and rollback capabilities.
- CI/CD Integration: For repeatable migration tasks (e.g., refreshing DEV/UAT environments) or for managing the deployment of migration scripts themselves, integration into CI/CD pipelines (e.g., GitHub Actions, Azure DevOps Pipelines) should be considered.
- Batch Processing: For migrating large numbers of feature classes or extensive raster collections, scripts should be designed for batch processing, allowing them to be run iteratively or in parallel where feasible.
5.2.2. Data Quality Validation
The migrated data must be validated to ensure its quality, accuracy, and integrity within the new eMap platform. Data Stewards and Data Owners play a pivotal role in this process.
Validation Areas and Methods
Migrated datasets should undergo the following checks before being certified for production use; illustrative SQL examples follow the list:
- Schema Conformance:
    - Verification: Confirm that the schema of migrated datasets in Azure Database for PostgreSQL (table structures, column names, data types, constraints such as `NOT NULL` or `UNIQUE`, primary keys, foreign keys, and indexes) matches the defined target schemas.
    - Tools: SQL queries against PostgreSQL system catalogs (e.g., `information_schema.columns`, `pg_indexes`), schema comparison tools, and reports from migration scripts.
- Attribute Accuracy:
    - Verification: Compare attribute values in migrated data against the source SDE data or predefined quality benchmarks. This includes checking for data loss, corruption, incorrect transformations, and adherence to domain constraints (which might be enforced by database constraints or application logic in the new system).
    - Tools: SQL aggregate queries (`COUNT(*)`, `SUM()`, `AVG()`, `MIN()`, `MAX()` on key numeric fields in both source and target), Python scripts for statistical sampling and value-by-value comparison for critical attributes, and manual review by Data Stewards.
- Spatial Integrity:
    - Verification:
        - Geometry Validity: Utilise PostGIS functions (e.g., `ST_IsValid(geom)`, `ST_IsValidReason(geom)`) in SQL queries to identify invalid geometries. This is critical, as data valid in SDE might have subtle issues that PostGIS identifies more strictly.
        - Spatial Reference System (SRS): Confirm all migrated spatial data uses the correct, standardised SRS (e.g., GDA2020), and check the SRID value stored in `geometry_columns` or `spatial_ref_sys` in PostGIS.
        - Positional Accuracy: For critical datasets, compare sample feature locations against source data or known control points.
        - Completeness of Geometries: Ensure no features have lost their geometries during migration.
    - Tools: PostGIS SQL functions, visual inspection in ArcGIS Pro or QGIS connected to Azure PostgreSQL, comparison with source data in a GIS environment.
- Completeness:
    - Verification: Ensure all expected records/features and entire datasets have been migrated. Compare record counts between source SDE tables and target PostgreSQL tables.
    - Tools: `SELECT COUNT(*)` queries on source and target, reports from migration scripts.
- Raster Data Validation:
    - Verification:
        - Pixel Value Integrity: Sample comparisons of pixel values.
        - SRS and Georeferencing: Confirm correct spatial positioning.
        - Format Adherence: Verify CRF/MRF properties (tiling, compression, pyramids).
        - Visual Checks: Inspect in GIS software for visual artefacts or rendering issues.
    - Tools: ArcPy raster properties, GDAL utilities (`gdalinfo`), visual inspection in GIS clients.
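Illustrative SQL for several of the checks above; the schema, table and column names (`staging.legacy_dataset`, `objectid`, `area_ha`) are placeholders:

```sql
-- Schema conformance: inspect migrated column definitions
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'staging' AND table_name = 'legacy_dataset'
ORDER BY ordinal_position;

-- Attribute accuracy: aggregates to compare with the same query run on the source
SELECT COUNT(*) AS row_count, MIN(area_ha) AS min_area, MAX(area_ha) AS max_area
FROM staging.legacy_dataset;

-- Spatial integrity: list invalid geometries and the reason PostGIS rejects them
SELECT objectid, ST_IsValidReason(geom) AS reason
FROM staging.legacy_dataset
WHERE NOT ST_IsValid(geom);

-- Spatial integrity: confirm the declared SRID (GDA2020 = EPSG:7844)
SELECT DISTINCT ST_SRID(geom) FROM staging.legacy_dataset;

-- Completeness: no features should have lost their geometry
SELECT COUNT(*) AS missing_geometries
FROM staging.legacy_dataset
WHERE geom IS NULL;
```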
Automated Validation
Where practical, validation checks should be automated using Python scripts (using `GeoPandas`, `psycopg2`, and potentially `ArcPy` for specific checks) or SQL scripts with PostGIS functions; a minimal harness is sketched below. Automation offers:

- Efficiency: Reduces manual effort, especially for large datasets.
- Consistency: Ensures checks are performed uniformly.
- Reporting: Generates detailed reports of discrepancies for review and remediation.
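A minimal sketch of such a check harness using `psycopg2`; the connection and check queries are placeholders:

```python
import psycopg2

# Placeholder connection; in practice retrieved via Azure Key Vault
conn = psycopg2.connect(
    "host=<server>.postgres.database.azure.com dbname=emap user=<user> "
    "password=<secret> sslmode=require"
)

# Each check returns a single number; non-zero counts flag discrepancies
checks = {
    "invalid_geometries": (
        "SELECT COUNT(*) FROM staging.legacy_dataset WHERE NOT ST_IsValid(geom)"
    ),
    "null_geometries": (
        "SELECT COUNT(*) FROM staging.legacy_dataset WHERE geom IS NULL"
    ),
}

with conn.cursor() as cur:
    for name, sql in checks.items():
        cur.execute(sql)
        count = cur.fetchone()[0]
        # One report line per check for Data Steward review
        print(f"{name}: {count}")
```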
Role of Data Stewards and Data Owners
- Review Validation Reports: Data Stewards should review automated validation outputs.
- Perform Supplementary Checks: Apply domain expertise for manual or visual checks on critical data.
- Sign-off: Formal approval by the Data Owner or delegated Steward confirming fitness for use.
- Issue Resolution: If issues are found, Stewards shall collaborate with the technical team to investigate, determine corrective actions, oversee re-migration/correction, and ensure re-validation.
By combining an automated migration strategy with thorough validation, both automated and steward-led, the new eMap platform should be populated with high-quality data.