5.2: Data Migration and Validation
TL;DR
This section discusses the technical approach for migrating spatial and attribute data from the legacy Esri SDE geodatabase on SQL Server to Azure PostgreSQL, emphasising automated scripting to ensure accuracy and efficiency. Core tools include ArcPy for handling Esri-specific elements such as raster conversion to cloud-optimised CRF/MRF formats, GDAL/OGR for bulk vector data transfer and spatial type conversion, and GeoPandas for complex attribute transformations. The process involves schema mapping, coordinate system reprojection, and rigorous validation of schema conformance, attribute accuracy, spatial integrity and data completeness using PostGIS functions and automated scripts. Data Stewards and Owners validate results against source systems and approve datasets for production; credentials are managed securely via Azure Key Vault. While FME is available for exceptional cases, primary reliance is placed on code-based tools aligned with long-term platform strategy. The combined approach minimises manual intervention, ensures data quality and supports the eMap platform's transition to open standards and cloud-native storage.
This section focuses on the technical execution of data migration and discusses some of the tools, automated scripting approaches and critical validation processes required to ensure data integrity and fitness for purpose in the new eMap platform.
5.2.1. Automated Migration Scripts and Tools for SDE on SQL Server
The migration from a traditional Esri SDE geodatabase on SQL Server to a modern Azure PostgreSQL environment represents a significant architectural shift. To minimise manual intervention, reduce the risk of human error and ensure consistent application of transformation rules, this migration should be managed through automated scripts.
Script Development and Tooling Philosophy
The development and maintenance of migration scripts will be a collaborative effort between GIS Engineers, Data Engineers and DBAs. The selection of tools for specific migration tasks will be guided by:
- Source Data Characteristics: Complexity of Esri geodatabase objects (e.g., feature classes, tables, relationship classes, domains, subtypes), data volumes, and specific SQL Server configurations.
- Transformation Complexity: Requirements for schema mapping, attribute value changes, coordinate system reprojection and spatial type conversion.
- Target Azure PaaS Services: Optimised loading methods for Azure Database for PostgreSQL and Azure Data Lake Storage Gen2.
- Team Skillset and Maintainability: Preference for tools that are well-documented, widely supported, and align with the team's existing and developing skillsets.
Core Migration Tools and Their Roles
A combination of tools, primarily Python libraries, will provide the flexibility and power needed for this migration.
1. Esri ArcPy
ArcPy, Esri's Python package, is indispensable when working with data originating from or destined for Esri environments. Its deep integration with the geodatabase model makes it a primary choice for certain aspects of this migration.
- Strengths:
    - Rich Geodatabase Access: Native understanding of all Esri geodatabase constructs, including feature classes, tables, domains, subtypes, and relationship classes. It can read data directly from the SDE geodatabase on SQL Server.
    - Comprehensive Geoprocessing Suite: Access to a vast array of geoprocessing tools for data conversion, analysis, and transformation that will be needed in the transformation workflow.
    - Raster Data Handling: Excellent for converting legacy raster formats into cloud-optimised formats such as Cloud Raster Format (CRF) and Meta Raster Format (MRF), including control over tiling, compression, and pyramid generation.
    - Scripting and Automation: Fully scriptable with Python, so migration tasks can be automated end to end.
- Suitability for eMap Migration:
    - Extracting Complex SDE Objects: If specific Esri SDE features like domains, subtypes, or complex relationship class logic need to be introspected or translated into new structures/rules in PostgreSQL, ArcPy is the best tool to understand and extract this information.
    - Raster Data Conversion: ArcPy is the designated tool for converting all legacy raster datasets to CRF or MRF for storage in Azure Data Lake Storage Gen2.
    - Pre-processing Tasks: Performing any necessary clean-up or preparation of source datasets before migration.
    - Writing to Registered Enterprise Geodatabases: When data needs to be written to the Enterprise geodatabase, ArcPy should be used.
- Conceptual Example (Exporting a feature class to a shapefile as an intermediate step):
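A minimal sketch of this export, assuming an existing `.sde` connection file; the paths and dataset names are placeholders:

```python
import arcpy

# .sde connection file pointing at the legacy SQL Server geodatabase (placeholder path)
sde_connection = r"C:\connections\legacy_sde.sde"
staging_folder = r"C:\staging"

arcpy.env.workspace = sde_connection

# Export the fully qualified feature class to an intermediate shapefile;
# FeatureClassToFeatureClass lets us control the output filename.
arcpy.conversion.FeatureClassToFeatureClass(
    in_features="SDE_SCHEMA.LegacyDataset",
    out_path=staging_folder,
    out_name="legacy_dataset.shp",
)
```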
- `SDE_SCHEMA.LegacyDataset`: Refers to the fully qualified name of the feature class within the SDE geodatabase.
- Use `FeatureClassToFeatureClass` to specify the output filename.
2. GDAL/OGR (Geospatial Data Abstraction Library)
GDAL (for raster data) and OGR (for vector data) are powerful open-source libraries, often used via the `ogr2ogr` command-line utility.
- Strengths:
    - Broad Format Support: Can read from and write to an extensive list of GIS formats, including direct read access to Esri SDE geodatabases (via appropriate drivers and connection strings).
    - Efficient Translations: Highly optimised for bulk data conversion and loading.
    - Spatial Type and SRS Handling: Manages conversion between different spatial geometry types (e.g., SDE's internal geometry to PostGIS `geometry`) and performs on-the-fly coordinate system (SRS) transformations.
    - Command-Line Interface (CLI): `ogr2ogr` is excellent for scripting and automating batch migration jobs.
    - SQL Capabilities: Allows for attribute selection and simple transformations using SQL queries against the source data.
- Suitability for eMap Migration:
    - Bulk Vector Migration: Ideal for migrating numerous vector feature classes from the SDE on SQL Server directly into tables in Azure Database for PostgreSQL. It handles the critical ST_Geometry (or SDEBINARY) to PostGIS `geometry` conversion efficiently.
    - Direct SDE to PostGIS: Often the most direct and performant path for "lifting and shifting" vector feature classes where complex Esri-specific object logic (such as subtypes or relationship classes that need special handling beyond simple data transfer) is not the primary concern.
- Example (`ogr2ogr` CLI for SDE to PostGIS):
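A minimal sketch, assuming a GDAL build with SQL Server and PostgreSQL support; all connection values and object names are placeholders to adapt:

```bash
# Placeholder connection strings; exact formats depend on the GDAL version and drivers
SDE_CONN_STRING="MSSQL:server=legacy-sql;database=sde;trusted_connection=yes"
PG_CONN_STRING="PG:host=<server>.postgres.database.azure.com dbname=emap user=<user> password=<secret> sslmode=require"

SOURCE_SDE_FC="SDE_SCHEMA.LegacyDataset"
TARGET_PG_SCHEMA="staging"
TARGET_PG_TABLE="legacy_dataset"

ogr2ogr -f "PostgreSQL" "${PG_CONN_STRING}" "${SDE_CONN_STRING}" \
  -sql "SELECT * FROM ${SOURCE_SDE_FC}" \
  -nln "${TARGET_PG_SCHEMA}.${TARGET_PG_TABLE}" \
  -t_srs "EPSG:7844" \
  -lco GEOMETRY_NAME=geom \
  -progress
```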
- `SDE_CONN_STRING`: The connection string for `ogr2ogr` to access the SDE geodatabase. The exact format can depend on the GDAL version and available ODBC drivers. Ensure the necessary SQL Server ODBC driver is installed on the machine running `ogr2ogr`.
- `PG_CONN_STRING`: Standard PostgreSQL connection string. `sslmode=require` is crucial for Azure Database for PostgreSQL.
- `SOURCE_SDE_FC` and `TARGET_PG_TABLE`: Define the source and target object names.
- `TARGET_PG_SCHEMA`: Specifies the schema within the PostgreSQL database where the table will be created.
- `-sql "SELECT * FROM ${SOURCE_SDE_FC}"`: This syntax is one way to specify the source layer from an SDE connection. Alternatively, the SDE connection string itself can sometimes point to a specific layer, or the source layer name can be provided as a separate argument if not using `-sql`.
- `-t_srs "EPSG:7844"`: Specifies the target coordinate system. If the source data is in a different GDA datum (e.g. GDA94), ensure appropriate transformation paths are considered.
3. GeoPandas
GeoPandas is a Python library that extends the data types used by pandas to allow spatial operations on geometric types, making it easy to work with geospatial data in a Pythonic, tabular way.
- Strengths:
    - Pandas Integration: Leverages the powerful data manipulation capabilities of pandas for attribute data.
    - Simplified Spatial Operations: Provides an intuitive API for common spatial operations (e.g., buffer, intersect, dissolve) by integrating with `shapely`.
    - Format Reading/Writing: Can read various vector formats (including SDE layers via `fiona`, which in turn uses GDAL drivers) and write to PostGIS (using SQLAlchemy and `psycopg2` or `psycopg` under the hood).
    - Python Ecosystem: Fits naturally into Python-based ETL workflows.
- Suitability for eMap Migration:
    - Attribute-Intensive Transformations: When migration involves complex attribute cleaning, calculations, joining with non-spatial data, or restructuring of attribute tables before loading to PostGIS.
    - Custom Python Workflows: For scenarios where a more programmatic and flexible approach than CLI tools is needed for transformation logic.
- Conceptual Example (Reading SDE, transforming, writing to PostGIS):
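A minimal sketch, assuming the source layer has been staged where GDAL can read it (a File Geodatabase export stands in for a direct SDE read here); paths, names and credentials are placeholders:

```python
import geopandas as gpd
from sqlalchemy import create_engine

# Read the staged layer; GeoPandas uses GDAL drivers (via fiona/pyogrio) under the hood
gdf = gpd.read_file(r"C:\staging\legacy_export.gdb", layer="LegacyDataset")

# Example attribute transformation: standardise column names to lowercase
gdf.columns = [c.lower() for c in gdf.columns]

# Reproject to the platform standard SRS (GDA2020)
gdf = gdf.to_crs(epsg=7844)

# SQLAlchemy engine with psycopg2 as the driver; in practice the password
# would be retrieved from Azure Key Vault rather than embedded here
engine = create_engine(
    "postgresql+psycopg2://<user>:<password>@<server>.postgres.database.azure.com:5432/emap"
    "?sslmode=require"
)

# Write the GeoDataFrame to a new table in the target schema
gdf.to_postgis("legacy_dataset", engine, schema="staging", if_exists="replace")
```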
- SQLAlchemy Engine: Provides the connection to the Azure Database for PostgreSQL instance. `psycopg2` is specified as the driver.
- Writing to PostGIS: The `to_postgis` method allows writing the GeoDataFrame to a new table in the specified schema.
4. PostGIS Shapefile Loader (`shp2pgsql`)
The `shp2pgsql` utility is a command-line tool that converts Esri Shapefiles into SQL commands for loading into a PostGIS database.
- Strengths:
    - Native PostGIS Tool: Specifically designed for Shapefile to PostGIS conversion.
    - Efficient Loading: Generates SQL optimised for PostGIS.
    - Geometry Validation: Can include options for basic geometry validation during the SQL generation process.
- Suitability for eMap Migration:
    - Limited Role: Given the primary source is an Esri SDE geodatabase on SQL Server, `shp2pgsql` will not be a primary migration tool for core enterprise datasets.
    - Auxiliary Shapefiles: Its use would be limited to ingesting any standalone Shapefiles that might exist as part of the legacy system (e.g., user-generated data, third-party data deliveries in Shapefile format) that also need to be migrated.
- Example (Loading a Shapefile):
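A minimal sketch, piping the generated SQL straight into `psql`; the SRID, table name and connection details are placeholders:

```bash
# -s 7844 : declare the SRID (GDA2020)
# -I      : create a spatial (GiST) index on the geometry column
# -g geom : name the geometry column "geom"
shp2pgsql -s 7844 -I -g geom aux_data.shp staging.aux_data | \
  psql "host=<server>.postgres.database.azure.com dbname=emap user=<user> sslmode=require"
```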
Note on Esri ArcGIS Data Interoperability Extension (FME)
The Esri ArcGIS Data Interoperability extension, which embeds Safe Software's FME, offers a visual environment for ETL processes. While it supports a range of formats and transformations, its use in this project should be approached with caution.
Limited FME Use Recommended

IRD has adopted the strategic direction to build expertise and ongoing solutions primarily using code-first tools and native Azure capabilities. FME should be reserved for:

- Highly complex, one-off migration tasks where the development effort using scripted tools would be prohibitive.
- Situations where other recommended tools have proven insufficient for a specific, isolated dataset.

FME should not be used to develop workflows that will be part of the ongoing operational environment of the new eMap platform.
Key Script Functionalities for SDE Migration
Regardless of the specific tools chosen for a task, the automated migration scripts must address the following core functionalities:
Connection Management
Scripts require secure and reliable connections to both the source legacy SDE geodatabase on SQL Server and the target Azure PaaS services. Adhering to the Zero Trust Security Model, all credentials (database usernames/passwords, API keys) must be stored in Azure Key Vault. Scripts should retrieve these secrets at runtime, ideally using Managed Identities or service principals with appropriately scoped, least-privilege access.
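As a sketch of this pattern, using the `azure-identity` and `azure-keyvault-secrets` packages (the vault URL and secret name are placeholders):

```python
from azure.identity import DefaultAzureCredential
from azure.keyvault.secrets import SecretClient

# DefaultAzureCredential resolves to a Managed Identity when running in Azure,
# falling back to developer credentials locally
credential = DefaultAzureCredential()
client = SecretClient(
    vault_url="https://<vault-name>.vault.azure.net",
    credential=credential,
)

# Retrieve the database password at runtime; nothing is hard-coded in the script
pg_password = client.get_secret("emap-pg-migration-password").value
```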
Data Extraction from SDE on SQL Server

Scripts must efficiently extract data from the source SDE geodatabase. This involves:

- Connecting to the SQL Server instance hosting the SDE.
- Accessing the specific feature classes and tables, often identified by their fully qualified names (e.g., `SDE_SCHEMA_OWNER.FEATURE_CLASS_NAME`).
- Handling potentially large data volumes. For very large tables, extraction might need to be batched or filtered (a batching sketch follows this list).
- Consideration for versioned data: For migration, the common practice is to migrate data from the `DEFAULT` version or a specific named version, effectively flattening the version history. The exact strategy for handling versioned data must be determined based on business requirements for historical data.
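One way to batch extraction from large tables, sketched with `pyodbc`; the connection details, table name and key column are placeholders:

```python
import pyodbc

# Placeholder connection; in practice credentials come from Azure Key Vault
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=legacy-sql;DATABASE=sde;Trusted_Connection=yes;Encrypt=yes"
)

BATCH = 50_000
offset = 0
with conn.cursor() as cur:
    while True:
        # OFFSET/FETCH paginates the fully qualified SDE table by OBJECTID
        cur.execute(
            "SELECT * FROM SDE_SCHEMA_OWNER.FEATURE_CLASS_NAME "
            "ORDER BY OBJECTID OFFSET ? ROWS FETCH NEXT ? ROWS ONLY",
            offset, BATCH,
        )
        rows = cur.fetchall()
        if not rows:
            break
        # ... hand the batch to the transformation step ...
        offset += BATCH
```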
Data Transformation
Key transformations include:
- Schema Mapping: Systematically translating source SDE schema (table names, field names, data types) to the target schemas defined for Azure Database for PostgreSQL. This might involve renaming fields to adhere to new naming conventions and PostgreSQL best practices (e.g., lowercase).
- Attribute Value Transformation: Cleansing, standardising and re-coding attribute values. This can include converting data types (e.g., SQL Server `datetime` to PostgreSQL `timestamp`), handling `NULL` values or applying domain mappings.
- Spatial Reference System (SRS) Transformation: Re-projecting datasets from their legacy coordinate system (e.g., GDA94, local grids) to the new eMap platform's standard SRS (e.g., GDA2020, identified by its EPSG code). Tools such as ArcPy and GDAL/OGR provide robust reprojection capabilities.
- Spatial Type Conversion (SQL Server SDE to PostGIS): This is a critical and fundamental transformation. The legacy SDE geodatabase on SQL Server stores geometries using Esri's proprietary spatial types (e.g., `ST_Geometry` or the older `SDEBINARY` format). Azure Database for PostgreSQL, with the PostGIS extension, uses PostGIS native geometry types (e.g., `geometry`, `geography`). Migration scripts must convert the source Esri spatial types into formats compatible with PostGIS. This can often be done using Well-Known Text (WKT) or Well-Known Binary (WKB) as an intermediate step, or by tools such as `ogr2ogr`.
    - Why this is important: PostGIS provides a rich set of open spatial functions and is the standard for geospatial data in PostgreSQL. The eMap platform will leverage these PostGIS capabilities.
- Raster Format Conversion (to CRF/MRF): As defined in the Raster Data Storage Strategy, legacy raster formats (e.g., GeoTIFF, IMG, Esri Grid) must be converted to cloud-optimised formats using ArcPy.
    - Cloud Raster Format (CRF): Primary target for analytical and new raster data, stored in ADLS Gen2. Scripts must ensure LERC compression, appropriate tiling (e.g., 512x512 pixels), and pyramid generation.
    - Meta Raster Format (MRF): For specific use cases such as pre-rendered basemap caches.
This conceptual Python script shows CRF conversion using ArcPy:
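A minimal sketch, assuming `CopyRaster` with ArcPy environment settings to control compression, tiling and pyramids; the paths and LERC tolerance are placeholders:

```python
import arcpy

# Environment settings shape the CRF output per the Raster Data Storage Strategy:
# LERC compression, 512x512 tiling, and pyramid generation.
arcpy.env.compression = "LERC 0.01"   # LERC with a small max-error tolerance
arcpy.env.tileSize = "512 512"
arcpy.env.pyramid = "PYRAMIDS -1 NEAREST"

# Placeholder paths; the target would be a staging location synced to ADLS Gen2
source_raster = r"C:\legacy\rasters\dem.tif"
target_crf = r"C:\staging\rasters\dem.crf"

# Writing to a .crf path produces Cloud Raster Format
arcpy.management.CopyRaster(source_raster, target_crf)
```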
Data Loading into Azure PaaS

Transformed data must be loaded efficiently into Azure Database for PostgreSQL and ADLS Gen2:

- PostgreSQL: Use bulk loading mechanisms where possible. For `ogr2ogr`, this is handled internally. For Python scripts using `psycopg2`, methods such as `execute_values` or the `COPY` command (by writing data to an in-memory buffer or temporary file) are highly efficient. For GeoPandas, `to_postgis` handles this. A sketch of the `COPY` approach follows this list.
- ADLS Gen2: For raster files (CRF/MRF), scripts should write directly to the designated ADLS Gen2 file system paths. For large numbers of files, parallel uploads can improve throughput.
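A minimal sketch of the `COPY` approach with `psycopg2`; the table, columns and connection are placeholders, and geometry is passed as EWKT, which PostGIS accepts as text input:

```python
import io
import psycopg2

# Placeholder connection; credentials would come from Azure Key Vault at runtime
conn = psycopg2.connect(
    "host=<server>.postgres.database.azure.com dbname=emap user=<user> "
    "password=<secret> sslmode=require"
)

# 'records' stands in for the transformed rows: (id, name, EWKT geometry)
records = [
    (1, "Sample", "SRID=7844;POINT(147.3 -42.9)"),
]

# Stream rows through an in-memory buffer and bulk-load with COPY
buf = io.StringIO()
for rec in records:
    buf.write("\t".join(map(str, rec)) + "\n")
buf.seek(0)

with conn.cursor() as cur:
    cur.copy_expert(
        "COPY staging.legacy_dataset (id, name, geom) FROM STDIN",
        buf,
    )
conn.commit()
```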
Logging and Error Handling
Migration scripts must implement comprehensive logging to track progress, record transformations, and capture any errors or warnings. Robust error handling is essential to manage issues gracefully, allow for potential resumption of failed jobs, and facilitate targeted re-runs.
Management and Execution of Migration Scripts
- Version Control: All migration scripts (Python, shell scripts, SQL, etc.) should be stored in the designated git repository. This facilitates version tracking, collaborative development, code reviews, and rollback capabilities.
- CI/CD Integration: For repeatable migration tasks (e.g., refreshing DEV/UAT environments) or for managing the deployment of migration scripts themselves, integration into CI/CD pipelines (e.g., GitHub Actions, Azure DevOps Pipelines) should be considered.
- Batch Processing: For migrating large numbers of feature classes or extensive raster collections, scripts should be designed for batch processing, allowing them to be run iteratively or in parallel where feasible.
5.2.2. Data Quality Validation
The migrated data must be validated to ensure its quality, accuracy, and integrity within the new eMap platform. Data Stewards and Data Owners play a pivotal role in this process.
Validation Areas and Methods
Migrated datasets should undergo the following checks before being certified for production use; illustrative SQL examples follow the list:
- Schema Conformance:
    - Verification: Confirm that the schema of migrated datasets in Azure Database for PostgreSQL (table structures, column names, data types, constraints such as `NOT NULL` or `UNIQUE`, primary keys, foreign keys, and indexes) matches the defined target schemas.
    - Tools: SQL queries against PostgreSQL system catalogs (e.g., `information_schema.columns`, `pg_indexes`), schema comparison tools, and reports from migration scripts.
- Attribute Accuracy:
    - Verification: Compare attribute values in migrated data against the source SDE data or predefined quality benchmarks. This includes checking for data loss, corruption, incorrect transformations, and adherence to domain constraints (which might be enforced by database constraints or application logic in the new system).
    - Tools: SQL aggregate queries (`COUNT(*)`, `SUM()`, `AVG()`, `MIN()`, `MAX()` on key numeric fields in both source and target), Python scripts for statistical sampling and value-by-value comparison for critical attributes, and manual review by Data Stewards.
- Spatial Integrity:
    - Verification:
        - Geometry Validity: Utilise PostGIS functions (e.g., `ST_IsValid(geom)`, `ST_IsValidReason(geom)`) in SQL queries to identify invalid geometries. This is critical, as data valid in SDE might have subtle issues that PostGIS identifies more strictly.
        - Spatial Reference System (SRS): Confirm all migrated spatial data uses the correct, standardised SRS (e.g., GDA2020), and check the SRID value stored in `geometry_columns` or `spatial_ref_sys` in PostGIS.
        - Positional Accuracy: For critical datasets, compare sample feature locations against source data or known control points.
        - Completeness of Geometries: Ensure no features have lost their geometries during migration.
    - Tools: PostGIS SQL functions, visual inspection in ArcGIS Pro or QGIS connected to Azure PostgreSQL, comparison with source data in a GIS environment.
- Completeness:
    - Verification: Ensure all expected records/features and entire datasets have been migrated. Compare record counts between source SDE tables and target PostgreSQL tables.
    - Tools: `SELECT COUNT(*)` queries on source and target, reports from migration scripts.
- Raster Data Validation:
    - Verification:
        - Pixel Value Integrity: Sample comparisons of pixel values.
        - SRS and Georeferencing: Confirm correct spatial positioning.
        - Format Adherence: Verify CRF/MRF properties (tiling, compression, pyramids).
        - Visual Checks: Inspect in GIS software for visual artefacts or rendering issues.
    - Tools: ArcPy raster properties, GDAL utilities (`gdalinfo`), visual inspection in GIS clients.
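Illustrative SQL for several of the checks above; the schema, table and column names (`staging.legacy_dataset`, `objectid`, `area_ha`) are placeholders:

```sql
-- Schema conformance: inspect migrated column definitions
SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'staging' AND table_name = 'legacy_dataset'
ORDER BY ordinal_position;

-- Attribute accuracy: aggregates to compare with the same query run on the source
SELECT COUNT(*) AS row_count, MIN(area_ha) AS min_area, MAX(area_ha) AS max_area
FROM staging.legacy_dataset;

-- Spatial integrity: list invalid geometries and the reason PostGIS rejects them
SELECT objectid, ST_IsValidReason(geom) AS reason
FROM staging.legacy_dataset
WHERE NOT ST_IsValid(geom);

-- Spatial integrity: confirm the declared SRID (GDA2020 = EPSG:7844)
SELECT DISTINCT ST_SRID(geom) FROM staging.legacy_dataset;

-- Completeness: no features should have lost their geometry
SELECT COUNT(*) AS missing_geometries
FROM staging.legacy_dataset
WHERE geom IS NULL;
```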
Automated Validation
Where practical, validation checks should be automated using Python scripts (using `GeoPandas`, `psycopg2`, and potentially `ArcPy` for specific checks) or SQL scripts with PostGIS functions; a minimal harness is sketched below. Automation offers:

- Efficiency: Reduces manual effort, especially for large datasets.
- Consistency: Ensures checks are performed uniformly.
- Reporting: Generates detailed reports of discrepancies for review and remediation.
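A minimal sketch of such a check harness using `psycopg2`; the connection and check queries are placeholders:

```python
import psycopg2

# Placeholder connection; in practice retrieved via Azure Key Vault
conn = psycopg2.connect(
    "host=<server>.postgres.database.azure.com dbname=emap user=<user> "
    "password=<secret> sslmode=require"
)

# Each check returns a single number; non-zero counts flag discrepancies
checks = {
    "invalid_geometries": (
        "SELECT COUNT(*) FROM staging.legacy_dataset WHERE NOT ST_IsValid(geom)"
    ),
    "null_geometries": (
        "SELECT COUNT(*) FROM staging.legacy_dataset WHERE geom IS NULL"
    ),
}

with conn.cursor() as cur:
    for name, sql in checks.items():
        cur.execute(sql)
        count = cur.fetchone()[0]
        # One report line per check for Data Steward review
        print(f"{name}: {count}")
```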
Role of Data Stewards and Data Owners
- Review Validation Reports: Data Stewards should review automated validation outputs.
- Perform Supplementary Checks: Apply domain expertise for manual or visual checks on critical data.
- Sign-off: Formal approval by the Data Owner or delegated Steward confirming fitness for use.
- Issue Resolution: If issues are found, Stewards shall collaborate with the technical team to investigate, determine corrective actions, oversee re-migration/correction, and ensure re-validation.
By combining an automated migration strategy with thorough validation, both automated and steward-led, the new eMap platform should be populated with high-quality data.