4.4: Data Tier Implementation
TL;DR
The Data Tier forms the core storage layer for the ArcGIS Enterprise deployment, using Azure PaaS services: Azure Database for PostgreSQL (hosting the Enterprise Geodatabase with PostGIS), Blob Storage (portal content, service caches), ADLS Gen2 (cloud-optimised raster store) and Azure Files (server configuration, disaster recovery staging). Implementation progresses through four maturity stages. The MVP phase establishes foundational components with cost optimised redundancy (ZRS/LRS). Bronze focuses on vertical scaling, performance monitoring and lifecycle management. Silver introduces high availability through same-zone PostgreSQL replication and validated zone-redundant storage. Gold enables fully automated cross-region disaster recovery using asynchronous PostgreSQL replicas, geo-redundant storage and orchestrated failover triggered by health-monitoring Azure Functions. Infrastructure-as-Code (OpenTofu) and configuration management tools ensure consistent provisioning and DR automation. Security integrates RBAC, managed identities and encrypted storage across services. The tier evolves from initial setup to a resilient architecture minimising downtime and data loss, supporting seamless regional failover while maintaining strict access controls and cost efficiency.
The Data Tier forms the foundational storage layer for the entire ArcGIS Enterprise deployment, responsible for the persistence, management and accessibility of all geospatial and configuration data. This tier leverages a combination of Azure's robust Platform-as-a-Service (PaaS) storage offerings. Key components of this tier include the user-managed Enterprise Geodatabase hosted on Azure Database for PostgreSQL (with PostGIS), Azure Blob Storage for Portal content, service caches and server job outputs, Azure Data Lake Storage Gen2 (ADLS Gen2) for the Raster Store and Azure Files for ArcGIS Server's shared configuration and temporary `webgisdr` staging. This section details their provisioning, configuration for optimal performance and resilience and integration with the Application Tier across stages.
graph LR
subgraph AppTier ["Application Tier (Simplified)"]
direction TB
P4A[("Portal for ArcGIS")]
AGS[("ArcGIS Server")]
ADS[("ArcGIS Data Store")]
end
subgraph DataTier ["Data Tier (Azure PaaS)"]
direction TB
DTEG[("User-Managed Enterprise Geodatabase<br>(Azure Database for PostgreSQL + PostGIS)")]
DTRaster[("User-Managed Raster Store<br>(Azure ADLS Gen2)")]
DTCacheOutput[("Portal Content, Caches, Outputs, Jobs, Backups<br>(Azure Blob Storage)")]
DTConfig[("Server Shared Configuration<br>(Azure Files for config-store, system)")]
DTWebGISDRStage[("webgisdr Staging<br>(Azure Files for SHARED_LOCATION)")]
end
AppTier -->|"Accesses & Registers"| DTEG
AGS -->|"Accesses & Registers Cloud Store"| DTRaster
P4A -->|"Content Directory"| DTCacheOutput
AGS -->|"Caches, Outputs, Jobs (Cloud Stores)"| DTCacheOutput
ADS -->|"Backup to Blob via webgisdr"| DTCacheOutput
AGS -->|"Config Store, System Dirs"| DTConfig
P4A -->|"webgisdr Staging for Export/Import"| DTWebGISDRStage
classDef apptier fill:#e6ffed,stroke:#198754,stroke-width:2px;
classDef datatier fill:#fff8e1,stroke:#f57c00,stroke-width:2px;
class P4A,AGS,ADS apptier;
class DTEG,DTRaster,DTCacheOutput,DTConfig,DTWebGISDRStage datatier;
Diagram: Conceptual overview of the Data Tier components and their primary interactions with the Application Tier.
ArcGIS Data Store: Data Tier vs. Application Tier Focus
This Section (Data Tier) focuses on the Enterprise Azure PaaS data stores. Deployment and setup of the ArcGIS Data Store is covered as part of the Application Tier. As the ArcGIS Data Store manages its own internal PostgreSQL database, its underlying infrastructure (VMs and their local/managed disks) is considered part of the Application Tier in this architecture.
Stage | Focus Area | PostgreSQL Configuration | Storage Configuration |
---|---|---|---|
MVP | Foundation Setup | - Flexible Server - Burstable/GP tiers - LRS/ZRS | - Content containers - Basic lifecycle policies |
Bronze | Performance Optimization | - Vertical scaling - Query optimization | - IOPS monitoring - Cost-tier optimization |
Silver | High Availability | - Same-Zone HA - Sync replication | - ZRS validation - Enhanced monitoring |
Gold | Disaster Recovery | - Cross-region replica - Auto-promotion | - GRS - Storage account failover automation |
Table: Data Tier Implementation Summary
graph LR
subgraph DT["Data Tier Implementation Stages"]
direction TB
MVP[MVP Stage<br>Foundation] --> Bronze[Bronze Stage<br>Performance]
Bronze --> Silver[Silver Stage<br>HA]
Silver --> Gold[Gold Stage<br>DR]
end
subgraph PG["PostgreSQL Evolution"]
direction LR
PG_MVP["MVP: Single Instance<br>ZRS Backups"] --> PG_Bronze["Bronze: Vertical Scaling<br>Query Optimization"]
PG_Bronze --> PG_Silver["Silver: Same-Zone HA Pair<br>Sync Replication"]
PG_Silver --> PG_Gold["Gold: Cross-Region Replica<br>Auto-Failover"]
end
subgraph ST["Storage Services HA/DR"]
direction TB
ST_Silver["Silver Stage (ZRS)"] -->|Intra-Region| Z1[Zone 1] & Z2[Zone 2] & Z3[Zone 3]
ST_Gold["Gold Stage (GRS)"] -->|Cross-Region| Primary[Melbourne] -->|Async Replication| Secondary[Sydney]
end
classDef stage fill:#f0f0f0,stroke:#666,stroke-width:2px;
classDef storage fill:#fff8e1,stroke:#f57c00,stroke-width:2px;
classDef db fill:#e6ffed,stroke:#198754,stroke-width:2px;
class DT,PG,ST stage;
class PG_Silver,PG_Gold db;
class ST_Silver,ST_Gold storage;
Diagram: Data Tier Evolution - Progression through implementation stages with storage maturity paths

4.4.1 Data Tier MVP Stage
The MVP stage for the Data Tier focuses on establishing the foundational Azure PaaS storage resources for all environments (DEV, UAT and PROD - Melbourne). This includes provisioning dedicated instances of Azure Database for PostgreSQL, Azure Blob Storage, Azure ADLS Gen2 and Azure Files, with configurations optimised for initial functionality and cost-effectiveness. All provisioning should be automated via OpenTofu. This stage lays the groundwork for data storage and access, enabling core ArcGIS Enterprise functionalities.
Key Activities and Configurations:
-
Azure Database for PostgreSQL (Enterprise Geodatabase): Azure Database for PostgreSQL Flexible Server will host the authoritative Enterprise Geodatabases. This fully managed PaaS offering relieves the team from many traditional database administration burdens like OS patching, hardware maintenance and basic backup orchestration, allowing a focus on data modelling, performance and governance.
- Provisioning (OpenTofu): Dedicated Azure Database for PostgreSQL Flexible Server instances will be provisioned for each environment.
- PROD (Melbourne): An appropriately sized General Purpose tier, based on initial capacity planning, will be used. Initial redundancy for PROD will be Zone-Redundant Storage (ZRS) for the database backups and transaction logs. This means backups are replicated across multiple physical locations, significantly increasing durability. The database server itself will be configured for high availability (HA) in the Silver stage.
- UAT: A smaller General Purpose tier or Burstable tier with Locally-Redundant Storage (LRS) backups to optimise costs. LRS provides three copies of data within a single data centre.
- DEV: A Burstable tier (e.g., `B_Standard_B1ms`) with LRS backups. Burstable tiers are ideal for development workloads that don't need full compute capacity continuously and offer significant cost savings. Auto-pause capabilities should be configured where appropriate, further reducing costs by automatically stopping the server during periods of inactivity.
- Understanding PostgreSQL Flexible Server Architecture: The "Flexible Server" deployment model for Azure Database for PostgreSQL offers enhanced control and flexibility. It separates compute (the database engine running in a container on a Linux VM) from storage (data files on Azure storage). This architecture facilitates features such as zone-redundant high availability and better cost optimisation.
- PostGIS Extension: The PostGIS extension is critical and must be enabled on all instances during provisioning. This extension adds support for geographic objects to the PostgreSQL database, allowing it to store and query spatial data. It is the foundation for creating an Enterprise Geodatabase recognisable by ArcGIS.
-
Configuration (CM Tool & Manual/Scripted for DB setup):
- Networking: Network security will be enforced through VNet rules and private endpoints, allowing connections only from designated Application Tier VMs (Portal, ArcGIS Server). Public access should be disabled to enhance security.
- Registration with ArcGIS Server: The PostgreSQL instance should be registered with ArcGIS Server as Enterprise Geodatabases. This is done using connection files (`.sde`) which store the connection parameters. Database user permissions (PostgreSQL roles and grants) should be configured following the principle of least privilege for the ArcGIS service accounts.
- Automated Backups: Azure automatically creates server backups and stores them (on LRS or ZRS depending on configuration). The default backup retention period is seven days, configurable up to 35 days. These backups allow for Point-In-Time Recovery (PITR). For PROD, initial retention will be set to 35 days; for UAT/DEV, 7 days will suffice. All backups are encrypted using AES 256-bit encryption.
- Supported PostgreSQL Versions: ArcGIS Enterprise 11.4 supports PostgreSQL versions 13, 14, 15 and 16. PostgreSQL 16 should be chosen as the starting point for the new eMap platform as it is considered stable and has full compatibility with PostGIS and ArcGIS Enterprise. Minor version updates are handled automatically by Azure during configured maintenance windows.
- Connection Security: Connections to Azure Database for PostgreSQL Flexible Server are enforced using Transport Layer Security (TLS) by default (TLS 1.2 and later). This encrypts data in transit between ArcGIS Server and the database.
Aspect | PROD (Melbourne) | UAT | DEV |
---|---|---|---|
Provisioning Tier | General Purpose (Zone-Redundant Storage) | General Purpose/Burstable (LRS) | Burstable (B_Standard_B1ms, LRS) |
HA Configuration | Silver Stage HA | Single Instance | Single Instance + Auto-Pause |
Backup Policy | ZRS, 35-day retention | LRS, 7-day retention | LRS, 7-day retention |
PostGIS Version | Enabled on PostgreSQL 16 | Enabled on PostgreSQL 16 | Enabled on PostgreSQL 16 |
Networking | VNet rules + Private endpoints | VNet rules | VNet rules |
Security | TLS 1.2+ enforced, Managed Identity auth | TLS 1.2+ enforced | TLS 1.2+ enforced |
Table: Azure Database for PostgreSQL MVP Configuration
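As an illustration of the provisioning described above, a minimal OpenTofu (azurerm provider) sketch for a DEV-sized Flexible Server with the PostGIS extension allow-listed might look like the following. Resource names, sizing and variables (e.g., `var.data_subnet_id`) are placeholders, not the project's actual module layout:

```hcl
# Illustrative sketch only - names, sizes and variables are assumptions.
resource "azurerm_postgresql_flexible_server" "egdb_dev" {
  name                   = "psql-emap-egdb-dev"          # hypothetical naming convention
  resource_group_name    = var.resource_group_name
  location               = "australiasoutheast"           # Melbourne
  version                = "16"
  sku_name               = "B_Standard_B1ms"              # Burstable tier for DEV
  storage_mb             = 131072
  backup_retention_days  = 7                              # 35 for PROD
  administrator_login    = var.pg_admin_login
  administrator_password = var.pg_admin_password          # sourced from Key Vault in practice

  # Private access (VNet integration); no public endpoint is exposed in this mode.
  delegated_subnet_id = var.data_subnet_id
  private_dns_zone_id = var.postgres_private_dns_zone_id
}

# Allow-list the PostGIS extension so it can be created in the geodatabase.
resource "azurerm_postgresql_flexible_server_configuration" "extensions" {
  name      = "azure.extensions"
  server_id = azurerm_postgresql_flexible_server.egdb_dev.id
  value     = "POSTGIS"
}
```

Allow-listing `POSTGIS` via `azure.extensions` only makes the extension available; it is still created per database (e.g., `CREATE EXTENSION postgis;`) as part of enabling the Enterprise Geodatabase.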
- Azure Blob Storage (Portal Content, Caches, Jobs, Outputs, `webgisdr` final backups): Azure Blob Storage is a highly scalable and cost-effective PaaS offering for storing large amounts of unstructured data, often referred to as "objects." It serves multiple critical roles in the new eMap platform, replacing traditional file shares or local disk storage for various ArcGIS Enterprise components.
- Provisioning (OpenTofu): Dedicated Azure Storage Accounts (Standard General Purpose v2 recommended for most scenarios, offering a balance of cost and performance) should be provisioned for each environment.
- PROD: Standard tier with Zone-Redundant Storage (ZRS). This ensures data is replicated synchronously across three Azure availability zones, offering high durability (99.9999999999% - 12 nines over a year) and availability against data centre-level failures. Geo-Redundant Storage (GRS) for cross-region DR is targeted for the Gold Stage.
- UAT/DEV: Standard tier with Locally-Redundant Storage (LRS). This provides cost-effectiveness for non-production environments by replicating data three times within a single data centre, offering 11 nines of durability.
- Container Setup (OpenTofu): Separate blob containers should be created within each storage account for:
- Portal for ArcGIS `content` directory: Stores item metadata, thumbnails and other files associated with Portal items.
- `arcgiscache`: For map and image service tile caches. The directory name `arcgiscache` within the container is specifically required by ArcGIS Server when registered as a Cloud Store.
- ArcGIS Server `jobs` and `output` directories: For asynchronous geoprocessing services and other server operations.
- `webgisdr` final backup files: Provides off-VM storage for disaster recovery artifacts.
- Configuration & Integration (CM Tool & Portal Admin API):
- The Portal `content` directory should be configured via the Portal Administrator API. Best practices for this container include enabling soft delete for blobs (e.g., 14-day retention) to recover from accidental deletions, blob versioning if granular recovery of content items is critical (note: this increases storage costs) and applying Azure Resource Locks (`CanNotDelete`) on the storage account to prevent accidental deletion of the entire account. Container soft delete should also be enabled for the Portal content container.
- Azure Blob Storage containers for `arcgiscache`, `jobs` and `output` should be registered as Cloud Stores with ArcGIS Server. Authentication should use the Azure Managed Identities of the ArcGIS Server VMs/VMSS, adhering to the principle of least privilege (RBAC role: `Storage Blob Data Contributor`). This avoids storing access keys in configuration files.
- Caches stored in Azure Blob Storage must be fully pre-generated using the `CompactV2` cache format, which is optimised for cloud storage. Cache-on-demand is not supported when using Azure Blob Storage as a Cloud Store for tile caches.
- Lifecycle Management Policies: Define policies to automatically transition data between storage tiers (Hot, Cool, Archive) or delete it after a defined period. For example, older `webgisdr` backups or unused caches could be moved to the Cool tier after 30 days and to the Archive tier after 90 days, significantly reducing storage costs.
- Security Considerations for Blob Storage:
- Secure Transfer (HTTPS): Enforce HTTPS for all requests to the storage account by enabling the "Secure transfer required" setting.
- Minimum TLS Version: Configure the storage account to require a minimum TLS version of 1.2.
- Anonymous Access: Disable anonymous public read access for containers and blobs by default. Access should be granted via Managed Identities or, in rare, specific cases, via Shared Access Signatures (SAS) with limited permissions and expiry.
- Shared Key Authorization: Consider disallowing Shared Key authorisation for the storage account if all access can be managed via Managed Identities and RBAC. This strengthens security by removing a potential attack vector.
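A hedged OpenTofu sketch of the Blob Storage pattern described above (ZRS account, a private `webgisdr` backup container and a Hot to Cool to Archive lifecycle rule). Names and retention values are illustrative, and some attribute names vary slightly between azurerm provider versions:

```hcl
resource "azurerm_storage_account" "blob_prod" {
  name                            = "stemapblobprod"     # hypothetical
  resource_group_name             = var.resource_group_name
  location                        = "australiasoutheast"
  account_kind                    = "StorageV2"
  account_tier                    = "Standard"
  account_replication_type        = "ZRS"                # LRS for UAT/DEV; GRS at Gold Stage
  min_tls_version                 = "TLS1_2"
  allow_nested_items_to_be_public = false                # no anonymous blob access
}

resource "azurerm_storage_container" "webgisdr_backups" {
  name                  = "webgisdr-backups"
  storage_account_name  = azurerm_storage_account.blob_prod.name
  container_access_type = "private"
}

# Hot -> Cool (30d) -> Archive (90d) -> Delete (365d) for webgisdr backups.
resource "azurerm_storage_management_policy" "lifecycle" {
  storage_account_id = azurerm_storage_account.blob_prod.id

  rule {
    name    = "webgisdr-tiering"
    enabled = true
    filters {
      prefix_match = ["webgisdr-backups/"]
      blob_types   = ["blockBlob"]
    }
    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than    = 30
        tier_to_archive_after_days_since_modification_greater_than = 90
        delete_after_days_since_modification_greater_than          = 365
      }
    }
  }
}
```

Secure transfer (HTTPS-only) is enabled by default in the azurerm provider, so only the TLS floor and public-access settings are called out explicitly here.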
-
Azure Data Lake Storage Gen2 (ADLS Gen2 - Raster Store): Azure Data Lake Storage Gen2 is Azure's optimised solution for big data analytics, built on Azure Blob Storage but with the addition of a Hierarchical Namespace (HNS). This HNS is key for managing large collections of raster data, making ADLS Gen2 the designated Raster Store for the new eMap platform.
- Understanding Hierarchical Namespace (HNS): Unlike traditional flat blob storage where directory structures are just part of the blob name, HNS provides a true file system-like directory structure. Directories become actual objects, enabling:
- Atomic directory operations: Renaming or deleting a directory containing thousands of raster tiles is a single, fast metadata operation, crucial for efficient management.
- Familiar semantics: Organising data (e.g., `/imagery/type/source/year/`) is intuitive.
- Performance: Analytics frameworks often perform better with HNS. Listing files in a directory is significantly faster than in flat object storage.
- Granular security: POSIX-like Access Control Lists (ACLs) can be set on directories and files, complementing Azure RBAC.
- Provisioning (OpenTofu): Dedicated Azure Storage Accounts (Standard General Purpose v2) with Hierarchical Namespace (HNS) enabled should be provisioned for each environment.
- PROD: Standard tier with Zone-Redundant Storage (ZRS) for high durability and availability. Geo-Redundant Storage (GRS) for cross-region DR is targeted for the Gold Stage.
- UAT/DEV: Standard tier with Locally-Redundant Storage (LRS) to manage costs.
- Configuration & Integration (CM Tool):
- ADLS Gen2 should be registered as a Cloud Store with ArcGIS Server. ArcGIS Server interacts with ADLS Gen2 using the Azure Blob File System (ABFS) driver, benefiting from HNS for efficient directory operations.
- A logical hierarchical folder structure (e.g., `/imagery/[collection_type]/[source_identifier]/[year_of_acquisition]/`, `/elevation/dem/[source_identifier]/`) should be implemented.
- Migration of legacy raster formats to cloud-optimised formats such as Cloud Raster Format (CRF) or Meta Raster Format (MRF) should be a critical planned project activity.
- CRF: Preferred for most analytical datasets and new acquisitions. Use LERC compression (quality ~75%), 512x512 tiling and build pyramids (bilinear resampling for continuous data, nearest neighbour for discrete).
- MRF: Efficient for creating pre-rendered basemaps or tile services (cached to Blob Storage).
- ArcGIS Server will authenticate to ADLS Gen2 using its Azure Managed Identity, granted appropriate RBAC roles (e.g., `Storage Blob Data Contributor`).
- Security Considerations for ADLS Gen2:
- Similar to Blob Storage: enforce HTTPS, minimum TLS 1.2.
- Utilise a combination of Azure RBAC and POSIX-like ACLs for fine-grained access control.
- Cost of HNS: Enabling HNS itself has no direct upgrade cost on a GPv2 account. Transaction costs can vary, but the efficiency gains from HNS often lead to lower costs for large-scale data management.
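The same provisioning pattern applies to the Raster Store; the key differences are the hierarchical namespace flag and the managed-identity role assignment. A minimal sketch, with the principal ID variable as an assumed input:

```hcl
resource "azurerm_storage_account" "raster_store" {
  name                     = "stemaprasterprod"           # hypothetical
  resource_group_name      = var.resource_group_name
  location                 = "australiasoutheast"
  account_kind             = "StorageV2"
  account_tier             = "Standard"
  account_replication_type = "ZRS"                        # LRS for UAT/DEV
  is_hns_enabled           = true                         # ADLS Gen2 hierarchical namespace
  min_tls_version          = "TLS1_2"
}

# Grant the ArcGIS Server managed identity data-plane access (no account keys).
resource "azurerm_role_assignment" "server_raster_access" {
  scope                = azurerm_storage_account.raster_store.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = var.arcgis_server_principal_id   # VM/VMSS system-assigned identity
}
```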
- Azure Files (for ArcGIS Server `config-store` and `system` directories): Azure Files provides fully managed file shares in the cloud that are accessible via the SMB protocol. This is crucial for ArcGIS Server's `config-store` and `system` directories, which need to be shared and accessible by all ArcGIS Server instances.
- Purpose:
- `config-store`: Contains essential configuration files for the ArcGIS Server site.
- `system`: Contains files related to server operations and state.
- Provisioning (OpenTofu): Dedicated Azure Storage Accounts (FileStorage kind for Premium tier, General Purpose v2 for Standard tier) should be provisioned for Azure Files shares for each environment.
- PROD: Premium tier (SSD-backed) with Zone-Redundant Storage (ZRS). This ensures high performance (low latency, high IOPS/throughput suitable for frequent access by `config-store`) and resilience for these critical directories.
- UAT/DEV: Standard tier (HDD-backed) with Locally-Redundant Storage (LRS). This optimises costs for non-production environments while maintaining structural parity. The performance of the Standard tier is generally sufficient for DEV/UAT workloads.
- Share Setup (OpenTofu): Separate file shares for `config-store` and `system` will be created within the respective storage accounts.
will be created within the respective storage accounts. - Mounting & Permissions (CM Tool):
- The `cifs-utils` package should be installed on all ArcGIS Server VMs (Ubuntu 24.04 LTS).
- Shares will be mounted persistently (e.g., via `/etc/fstab`) on ArcGIS Server VMs.
- The storage account key required for mounting will be retrieved securely from Azure Key Vault by the VMs using their Managed Identities. The credential file (e.g., `/etc/smbcredentials/arcgisserver.cred`) on the VM storing this key must be strictly permissioned (readable only by root).
- Recommended mount options: `vers=3.1.1,credentials=<path_to_cred_file>,uid=<arcgis_uid>,gid=<arcgis_gid>,dir_mode=0700,file_mode=0600,serverino,nosharesock,mfsymlinks,actimeo=30`.
- `uid`/`gid`: Ensures mounted files are owned by the `arcgis` service account.
- `dir_mode`/`file_mode`: Sets appropriate permissions.
- `actimeo=30`: Caches file and directory attributes for 30 seconds, which can improve performance for `config-store` operations by reducing metadata chattiness.
- Networking: Access to Azure Files shares should be restricted to the VNet. Secure transfer (SMB 3.1.1 with encryption) should be enforced.
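A minimal sketch of the Premium file share provisioning and the Key Vault hand-off for the storage account key described above (names, quotas and the Key Vault reference are assumptions; the share attribute names vary slightly across azurerm provider versions):

```hcl
resource "azurerm_storage_account" "server_files" {
  name                     = "stemapfilesprod"            # hypothetical
  resource_group_name      = var.resource_group_name
  location                 = "australiasoutheast"
  account_kind             = "FileStorage"                # required for Premium file shares
  account_tier             = "Premium"
  account_replication_type = "ZRS"                        # Standard LRS for UAT/DEV
  min_tls_version          = "TLS1_2"
}

resource "azurerm_storage_share" "config_store" {
  name                 = "config-store"
  storage_account_name = azurerm_storage_account.server_files.name
  quota                = 512                              # GiB; provisioned IOPS scale with quota
  enabled_protocol     = "SMB"
}

resource "azurerm_storage_share" "system" {
  name                 = "system"
  storage_account_name = azurerm_storage_account.server_files.name
  quota                = 256
  enabled_protocol     = "SMB"
}

# Store the account key in Key Vault so VMs can retrieve it at mount time via
# their managed identities, rather than embedding it in scripts or fstab.
resource "azurerm_key_vault_secret" "server_files_key" {
  name         = "server-files-storage-key"
  value        = azurerm_storage_account.server_files.primary_access_key
  key_vault_id = var.key_vault_id
}
```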
- Azure Files (for `webgisdr` `SHARED_LOCATION` staging): The `webgisdr` utility requires a temporary staging location (`SHARED_LOCATION`) on a file system accessible by the machine executing the utility (typically the active Portal for ArcGIS VM). Azure Files provides a resilient and accessible option for this.
- Provisioning (OpenTofu): Dedicated Azure Storage Accounts for Azure Files shares for each environment. A separate storage account from the `config-store` shares is recommended for PROD for clarity and independent scaling/redundancy if needed.
- PROD (Melbourne): Premium tier (SSD-backed) with Zone-Redundant Storage (ZRS).
- UAT/DEV: Standard tier (HDD-backed) with Locally-Redundant Storage (LRS).
- Share Setup (OpenTofu): A dedicated file share (e.g., `webgisdr-staging`) will be created.
- Mounting & Permissions (CM Tool):
- Mounted persistently on the Portal for ArcGIS VMs.
- Storage account key management and mount options similar to those for the `config-store` Azure Files share, ensuring the `arcgis` service account (or the account running `webgisdr`) has read/write access.
- Networking: Secure transfer (SMB 3.1.1 with encryption) enforced.
Service | PROD Configuration | UAT/DEV Configuration | Key Use Cases |
---|---|---|---|
Blob Storage | ZRS, Geo-Redundant (Gold Stage) | LRS | Portal content, caches, jobs, webgisdr backups |
ADLS Gen2 | ZRS + Hierarchical Namespace | LRS + Hierarchical Namespace | Raster store (CRF/MRF formats) |
Azure Files | Premium ZRS (config-store/system) | Standard LRS | Server shared config, webgisdr staging |
Lifecycle Policy | Hot → Cool (30d) → Archive (90d) → Delete | Hot → Cool (60d) → Delete | Automated tier management |
Auth Method | Managed Identities (RBAC) | Managed Identities | Least privilege access |
Table: Azure Storage Services MVP Configuration
-
Security and Access Management (Common Data Tier Considerations): A consistent and robust security posture across all Data Tier components is paramount.
- Azure Role-Based Access Control (RBAC): Applied to all PaaS data resources (PostgreSQL, Storage Accounts for Blob/ADLS Gen2/Files), adhering to the principle of least privilege. Custom roles should be defined if built-in roles (e.g., `Storage Blob Data Contributor`, `Storage File Data SMB Share Contributor`, `PostgreSQL Server Contributor`) are too permissive for specific operational tasks or service accounts.
- Managed Identities: System-assigned Managed Identities should be configured for Application Tier VMs (Portal, Server, Data Store), VMSS and Web Adaptor App Services. These identities should be granted the necessary RBAC roles to securely authenticate to:
- Azure Key Vault: For retrieving secrets like database passwords (for ArcGIS Server to connect to PostgreSQL), storage account keys (for mounting Azure Files shares).
- Azure Storage services (Blob, ADLS Gen2): Directly, where supported by the application (e.g., ArcGIS Server accessing Cloud Stores using the `Storage Blob Data Contributor` role assigned to its Managed Identity). This is the preferred method over using storage account keys for Blob/ADLS Gen2 access by applications.
- Storage Account Keys: For Azure Files mounting via SMB, storage account keys are currently the standard authentication mechanism from Linux VMs. These keys will be stored as secrets in Azure Key Vault. VMs will retrieve these keys at runtime using their Managed Identities specifically for the mount operation automated by the Configuration Management tool. Direct embedding of keys in scripts or fstab entries is strictly prohibited. Ideally, regular rotation of these storage account keys should be scheduled and automated.
- TLS Enforcement: All Azure Storage Accounts (Blob, ADLS Gen2, Files) should be configured to require a minimum TLS version of 1.2. Azure Database for PostgreSQL also enforces TLS 1.2+ by default.
- Azure Storage Firewall: Azure Storage firewalls should be configured to restrict access to selected virtual networks and IP addresses.
- Microsoft Defender for Storage: This Azure security service should be enabled for all Azure Storage accounts. It provides an additional layer of security intelligence by detecting unusual and potentially harmful attempts to access or exploit your storage accounts, including Blob, Files and ADLS Gen2.
-
Microsoft Defender for open-source relational databases: Should be enabled for Azure Database for PostgreSQL instances. It detects anomalous activities indicating unusual and potentially harmful attempts to access or exploit databases, providing security alerts.
Control | PostgreSQL | Blob Storage | ADLS Gen2 | Azure Files |
---|---|---|---|---|
Encryption | AES256 + TLS 1.2+ | SSE + HTTPS | SSE + HTTPS | SMB 3.1.1 Encryption |
Access Control | PostgreSQL RBAC | Azure RBAC + SAS | POSIX ACLs + RBAC | NTFS Permissions |
Monitoring | Query Performance | Storage Analytics | Data Lake Audit | File Access Logs |
Threat Detection | Defender for DB | Defender for Storage | Defender for Storage | Defender for Storage |
Backup | PITR 35-day | Soft Delete + Versioning | Versioning | ZRS Snapshots |
Table: Security Controls Matrix
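The Defender enablement described above is a subscription-scope setting and can also be captured in OpenTofu. A minimal sketch using the azurerm Security Center pricing resource (plan names follow the provider's `resource_type` values):

```hcl
# Enable Microsoft Defender plans for the Data Tier services at subscription scope.
resource "azurerm_security_center_subscription_pricing" "defender_storage" {
  tier          = "Standard"
  resource_type = "StorageAccounts"
}

resource "azurerm_security_center_subscription_pricing" "defender_oss_db" {
  tier          = "Standard"
  resource_type = "OpenSourceRelationalDatabases"
}
```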
4.4.2 Data Tier Bronze Stage
The Bronze Stage for the Data Tier primarily focuses on establishing strategies for scaling Azure Database for PostgreSQL in response to increased demand from the Application Tier. It also involves refining monitoring for all data services to ensure performance and capacity are well-managed. Configurations for Azure Blob Storage, ADLS Gen2 and Azure Files from the MVP stage are largely maintained, with an emphasis on ensuring they can handle potentially increased I/O.
Key Activities and Configurations:
-
Azure Database for PostgreSQL (Enterprise Geodatabase): With ArcGIS Server now capable of scaling out, the load on the Enterprise Geodatabase can increase. It's crucial to have a strategy to scale the database if it becomes a bottleneck. Azure Database for PostgreSQL Flexible Server offers several ways to manage performance and scale.
- Understanding Scaling Options:
- Vertical Scaling (Scale-Up/Down): This involves changing the compute tier (e.g., from General Purpose to Memory Optimized) or adjusting the vCores, RAM and IOPS allocated to the server. Increasing resources (scaling up) typically requires a server restart, which Azure manages. Near-zero downtime scaling aims to minimise this interruption (typically <30 seconds), but it has limitations (e.g., not for HA-enabled servers in some scenarios, logical replication slots not preserved without pg_failover_slots). For the Bronze stage, this is the primary scaling method.
- Storage Scaling: Storage size can only be increased. For Premium SSD, this is mostly an online operation unless crossing the 4TiB boundary. IOPS for Premium SSD scale with disk size or can be provisioned separately up to VM limits. For Premium SSD v2, IOPS and throughput can be tweaked independently of size.
- Scaling Strategy Documentation:
- Detailed operational runbooks for manual vertical scaling of the PROD Azure Database for PostgreSQL Flexible Server instance should be documented.
- Triggering Metrics: Identify key performance indicators from Azure Monitor (e.g., sustained CPU/memory utilisation > 80-85%, increased query latency, high disk queue depth, low `max_connections` headroom, IOPS/throughput nearing limits) that would initiate a scaling review. Alerts should be configured for these thresholds (see the alert sketch after this list).
- Procedure: The documented procedure will include:
- Impact assessment (potential downtime, even if minimal with near-zero downtime scaling).
- Change management approvals.
- Communication plan.
- Execution steps should be codified in OpenTofu and applied.
- Post-scaling validation (checking performance metrics, application functionality).
- Rollback considerations (e.g., restoring from backup if scaling causes issues, though rare).
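As referenced in the Triggering Metrics item above, an Azure Monitor metric alert on sustained CPU utilisation could be declared along these lines. The threshold, evaluation windows, server-ID variable and action group are illustrative assumptions, not agreed operational values:

```hcl
resource "azurerm_monitor_metric_alert" "pg_cpu_high" {
  name                = "alert-emap-egdb-cpu-high"        # hypothetical
  resource_group_name = var.resource_group_name
  scopes              = [var.egdb_prod_server_id]         # resource ID of the PROD Flexible Server
  description         = "Sustained CPU above 85% on the Enterprise Geodatabase - review vertical scaling."
  severity            = 2
  frequency           = "PT5M"                            # evaluate every 5 minutes
  window_size         = "PT30M"                           # over a 30-minute window

  criteria {
    metric_namespace = "Microsoft.DBforPostgreSQL/flexibleServers"
    metric_name      = "cpu_percent"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 85
  }

  action {
    action_group_id = var.ops_action_group_id             # operations notification group
  }
}
```

Comparable alerts for memory, storage IOPS and connection count can reuse the same pattern with different metric names.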
- Read Replica Strategy (Further Investigation):
- While read replicas will only be implemented in later stages, the Bronze stage should involve further investigation and documentation of how read replicas could be used to offload read-intensive workloads (e.g., heavily used map services, analytical queries) from the primary writable instance. This can significantly improve overall read performance and scalability.
- Performance Monitoring & Optimisation:
- Enhanced monitoring of database performance metrics using Azure Monitor and PostgreSQL native tools (e.g., querying `pg_stat_statements` for slow queries, `pg_stat_activity` for active connections and wait events) is crucial. Baselines established in MVP should be refined under actual load.
- Key Metrics: DTU/vCore utilisation, storage IOPS, active connections, query latency, index hit rates, replication lag (once replicas exist).
- Regular Reviews: Plan for regular reviews of slow queries (using `EXPLAIN ANALYZE` to understand query plans) and indexing strategies as part of operational duties. PostgreSQL's `VACUUM` and `ANALYZE` commands should be understood by the team; autovacuum settings are a good start, but manual intervention might be needed for heavily changing tables.
- Connection Pooling: While ArcGIS Server has its own pooling, if other applications connect directly, or if connection churn is high, the built-in `PgBouncer` offered by Azure Database for PostgreSQL Flexible Server should be considered. `PgBouncer` helps manage large numbers of connections efficiently.
-
Azure Blob Storage, ADLS Gen2, Azure Files: These services are generally highly scalable, but monitoring their performance and throughput is essential, especially with an auto-scaling Application Tier.
- Performance Monitoring & Throughput/IOPS Management:
- Continuously monitor storage capacity, transaction rates (IOPS), throughput (MiB/s) and latency (Average Success E2E Latency, Average Success Server Latency) for these services using Azure Monitor.
- Azure Files (Premium tier for PROD): Premium shares have provisioned IOPS/throughput based on share size. Monitor IOPS and throughput against these limits to ensure they can support the demands of a scaled-out ArcGIS Server VMSS (accessing `config-store`/`system`) and `webgisdr` operations (accessing the staging share). If limits are approached, the share quota (size) may need to be increased to get more IOPS/throughput.
- Azure Blob Storage & ADLS Gen2: Monitor for any throttling events (Azure Storage has scalability targets per storage account for capacity, transaction rate and bandwidth). Increased ArcGIS Server activity (cache generation, geoprocessing outputs, raster data access) could push these limits. If throttling occurs, strategies such as distributing data across multiple storage accounts or using Azure CDN for frequently accessed public blobs should be considered in future optimisations.
- Lifecycle Management Review & Optimisation:
- Review and refine initial lifecycle management policies based on early usage patterns. For example, `webgisdr` backups in Blob Storage might transition from Hot to Cool tier after 30 days, then to Archive after 90 days and be deleted after 1 year. Infrequently accessed raster datasets in ADLS Gen2 or large tile caches in Blob Storage could also benefit from tiering to Cool or Archive.
- Review and refine initial lifecycle management policies based on early usage patterns. For example,
- Blob Index Tags for Management:
- Utilise blob index tags for more granular filtering in lifecycle policies or for cost tracking/categorisation of data within Blob/ADLS Gen2. This is especially useful for large, diverse raster collections.
- Azure Storage Explorer Usage: Document best practices for using Azure Storage Explorer for data management, particularly for GIS Engineers who might need to interact with data in DEV/UAT environments (e.g., uploading test rasters, browsing ADLS Gen2 directory structures).
- No structural changes (e.g., redundancy level changes from ZRS) are planned for these storage services in the Bronze stage for PROD. The focus remains on ensuring the MVP Data Tier setup robustly handles increased load and refining operational procedures.
The primary goal of the Data Tier in the Bronze stage is to ensure it can reliably support the now dynamically scaling Application Tier. This involves diligent monitoring, having clear procedures for scaling the database if it becomes a performance choke point and ensuring that the various Azure Storage services are configured and monitored to handle increased I/O demands efficiently and cost-effectively.
4.4.3 Data Tier Silver Stage
The Silver Stage for the Data Tier significantly enhances the resilience of the Production (PROD) environment. This is achieved by implementing High Availability (HA) configurations for the Azure Database for PostgreSQL instance, leveraging Azure's native HA capabilities to protect against single points of failure and ensure service continuity. This aligns with the HA enhancements made to the Application Tier in Section 4.3.3. For Azure Storage services (Blob, ADLS Gen2, Files), the ZRS configuration established in MVP/Bronze is sufficient to meet the HA requirements of the Silver Stage.
Key Activities and Configurations (PROD - Melbourne):
-
Azure Database for PostgreSQL (Enterprise Geodatabase): To protect the Enterprise Geodatabase from infrastructure failures within the Melbourne region, High Availability should be enabled for the PROD Azure Database for PostgreSQL Flexible Server instance.
- High Availability Configuration:
- Given that the Azure Australia Southeast (Melbourne) region does not currently support Availability Zones for deploying HA pairs across zones for PostgreSQL Flexible Server, the HA configuration will be Same-Zone High Availability.
- Mechanism: With Same-Zone HA, Azure provisions and maintains a warm standby replica in the same Availability Zone as the primary server. Data is synchronously replicated from the primary to the standby replica. While this doesn't protect against a full AZ outage, it does protect against server-level hardware failures or other issues affecting the primary compute instance.
- Automatic Failover: Azure manages automatic failover to the standby replica in the event of an infrastructure failure affecting the primary instance. This process typically completes within 60-120 seconds, aiming to minimise downtime (RTO). The Recovery Point Objective (RPO) is near zero (no data loss) due to synchronous replication.
- Connection Strings: ArcGIS Server and other applications will continue to use the primary server's FQDN. Azure handles the DNS redirection during a failover, so application-level changes to connection strings are not required post-failover. Applications should, however, implement robust connection retry logic to handle transient errors during the failover window.
- Implementation (OpenTofu & CM Tool): OpenTofu scripts will be updated to configure the PostgreSQL Flexible Server for Same-Zone HA (e.g., setting `high_availability.mode` to `SameZone` or the equivalent parameter; see the sketch after this list). No specific changes should be needed for ArcGIS Server connection files (`.sde`) if they use the main server endpoint FQDN.
) if they use the main server endpoint FQDN. - Impact on Performance: Synchronous replication to a standby (even in the same zone) can introduce some write/commit latency compared to a standalone server. This impact is generally minimal for same-zone HA.
- Maintenance: During scheduled maintenance, Azure typically patches the standby server first, then fails over to it and then patches the former primary. This minimises downtime.
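A minimal OpenTofu sketch of the Same-Zone HA setting on the PROD server definition. Names, sizing and variables are placeholders; the remaining arguments mirror the MVP configuration:

```hcl
resource "azurerm_postgresql_flexible_server" "egdb_prod" {
  name                   = "psql-emap-egdb-prod"         # hypothetical
  resource_group_name    = var.resource_group_name
  location               = "australiasoutheast"
  version                = "16"
  sku_name               = "GP_Standard_D4ds_v4"         # example General Purpose size
  storage_mb             = 262144
  backup_retention_days  = 35
  administrator_login    = var.pg_admin_login
  administrator_password = var.pg_admin_password
  delegated_subnet_id    = var.data_subnet_id
  private_dns_zone_id    = var.postgres_private_dns_zone_id

  high_availability {
    mode = "SameZone"   # warm standby in the same AZ with synchronous replication
  }
}
```

Adding the `high_availability` block to an existing server should apply as an in-place update, but it is worth validating in UAT first, as the operation provisions and synchronises the standby before HA becomes active.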
-
Azure Storage (Blob, ADLS Gen2, Files):
- Zone-Redundant Storage (ZRS) Verification: As established in the MVP/Bronze stages, all Azure Storage accounts (Azure Files for `config-store`/`system` and `webgisdr-staging`; Azure Blob Storage for Portal `content`, `arcgiscache`, `jobs`/`outputs` and `webgisdr` final backups; and Azure ADLS Gen2 for the Raster Store) must already be configured with Zone-Redundant Storage (ZRS).
- Azure Files: Premium tier with ZRS for `config-store`, `system` and `webgisdr-staging`.
- Azure Blob Storage & ADLS Gen2: Standard GPv2 tier (or Premium Block Blobs if specific high-IOPS workloads are identified) with ZRS.
- Benefit of ZRS for HA: ZRS synchronously replicates data across three distinct physical locations (Availability Zones, where supported by the underlying storage infrastructure, or across fault domains within a single DC if AZs aren't fully utilised by ZRS). This provides:
- High Data Durability: Protecting against data loss even if an entire data centre (or fault domain) within the region experiences an outage.
- High Availability: Allowing continued read and write access to the data even if one of the locations becomes unavailable, as the storage service automatically fails over to another replica. ZRS is fundamental for achieving robust HA for all storage types used by the new eMap platform.
- Data Protection Features Review:
- Re-verify configurations for Blob soft delete (for blobs and containers) and versioning (if enabled for Blob/ADLS Gen2). Ensure retention periods are appropriate for business RPO/RTO needs in conjunction with `webgisdr` backups and PostgreSQL PITR capabilities.
- Point-in-time restore for block blobs (which also applies to ADLS Gen2 data if enabled) relies on soft delete, versioning and the change feed. While powerful, `webgisdr` remains the primary application-level disaster recovery tool for ArcGIS Enterprise state and PostgreSQL PITR for the database itself.
-
Validation of Data Tier HA: Thorough testing is crucial to validate the HA configurations and ensure they meet the platform's resilience objectives.
- Azure Database for PostgreSQL (Same-Zone HA):
- Simulate Failover: Initiate a user-triggered failover for the PostgreSQL Flexible Server instance via the Azure portal or CLI. This simulates an unexpected failure of the primary compute instance.
- Verify Promotion & Connectivity: Confirm the automatic promotion of the standby replica to primary. Ensure ArcGIS Server and other connected applications can successfully reconnect to the database after the failover event with minimal interruption. Monitor application logs for connection recovery behaviour and any errors.
- Measure RTO: Record the actual time taken for the database service to become fully available on the new primary (Recovery Time Objective). Compare this against the target RTO (e.g., < 120 seconds).
- Data Consistency: Verify data consistency post-failover by performing read/write operations and checking recently committed data.
- Azure Storage (ZRS):
- Directly simulating a full zonal outage for storage services by a user is generally not feasible.
- Architectural Review: Confirm that ZRS is correctly provisioned for all relevant storage accounts in PROD via Azure portal, CLI, or OpenTofu state files.
- Application Resilience: Where possible, ensure applications connecting to Azure Storage (e.g., ArcGIS Server accessing Cloud Stores) implement appropriate retry logic for transient storage errors that might occur during an underlying service failover within ZRS.
- Monitoring: Monitor Azure Service Health dashboard for any Azure-reported zonal issues that might affect storage services in Melbourne.
- Documentation: All HA test plans, execution steps, observed behaviours, measured recovery times and any identified issues must be meticulously documented. This documentation is vital for operational readiness, future reviews and refining DR procedures.
Component | HA Mechanism | Failover Time | Data Replication | Validation Tests |
---|---|---|---|---|
PostgreSQL | Same-Zone HA Pair | 60-120s | Synchronous | Manual failover, RTO/RPO measurement |
Storage Services | ZRS Native | Instant | 3x sync copies | AZ outage simulation |
Azure Files | Premium ZRS | Instant | Sync across zones | Mount persistence tests |
Validation Metrics | Connection recovery time | Data consistency checks | Service health probes | End-to-end service testing |
Table: Silver Stage HA Implementation
By implementing these Silver Stage HA measures, the Data Tier in the PROD Melbourne environment achieves a robust level of intra-region resilience. This safeguards critical enterprise data and ensures service continuity in the face of common infrastructure failures. DEV and UAT environments will continue with their LRS configurations to maintain cost-effectiveness.
4.4.4 Data Tier Gold Stage
The Gold Stage for the Data Tier elevates the platform's resilience by implementing fully automated inter-region Disaster Recovery (DR). This ensures that if the Melbourne data centre experiences a significant outage, services will fail over to Sydney with minimal RTO and data loss. The DR orchestration is initiated by a timer-triggered Azure Function deployed in Sydney, which continuously monitors the health of the Melbourne Web Tier. Upon detecting a sustained failure, this Azure Function kickstarts a sequence of automated actions involving the Configuration Management tool and leveraging infrastructure defined by OpenTofu, leading to the Global Server Load Balancer (GSLB) redirecting traffic to the activated Sydney environment.
graph LR
subgraph MEL["Primary Region (Melbourne)"]
direction TB
PG_Primary["PostgreSQL Primary"] -->|Sync Replication| PG_Standby["PostgreSQL Standby"]
Blob_ZRS["Blob Storage (ZRS)"] -->|Sync| Z1[(Zone 1)] & Z2[(Zone 2)] & Z3[(Zone 3)]
Files_ZRS["Azure Files (ZRS)"] -->|Sync| FZ1[(Zone 1)] & FZ2[(Zone 2)] & FZ3[(Zone 3)]
end
subgraph SYD["DR Region (Sydney)"]
PG_Replica["PostgreSQL Read Replica"] -.->|Async Replication| PG_Primary
Blob_GRS["Blob Storage (GRS)"] -.->|Async| Blob_ZRS
Files_GRS["Azure Files (GRS)"] -.->|Async| Files_ZRS
end
classDef primary fill:#fff8e1,stroke:#f57c00,stroke-width:2px;
classDef dr fill:#e6ffed,stroke:#198754,stroke-width:2px;
classDef storage fill:#e3f2fd,stroke:#0b5ed7,stroke-width:2px;
class MEL primary;
class SYD dr;
class Blob_ZRS,Files_ZRS,Blob_GRS,Files_GRS storage;
Diagram: Silver/Gold Stage Architecture - Illustrates intra-region HA (Silver) and cross-region DR (Gold) configurations

Core Principles for Fully Automated DR:
- Proactive Health Monitoring (Sydney Azure Function): A dedicated, timer-triggered Azure Function running in Sydney continuously probes the health of the Melbourne Web Adaptors. This function acts as the primary sentinel for DR initiation.
- Automated Orchestration (Configuration Management Tool): Once a DR event is declared by the Azure Function, a pre-defined DR script, executed by the Configuration Management tool orchestrates the sequence of failover operations for Data Tier and Application Tier components.
- Infrastructure as Code for DR Readiness (OpenTofu): OpenTofu defines all Data Tier resources in both Melbourne (primary) and Sydney (DR). This includes replication configurations (PostgreSQL read replicas, GRS storage) and ensures Sydney's infrastructure is a "hot standby," ready for automated activation and scaling.
- Seamless Traffic Redirection (GSLB): The Global Server Load Balancer automatically reroutes user traffic to the Sydney endpoints once they become healthy and active post-failover, minimising downtime.
graph TB
subgraph DR_Detection ["1: Monitoring & DR Declaration"]
direction TB
A_Func["Azure Function (Sydney)"] -- "Continuously Probes Health" --> B_MelbWeb["Web Tier Health Endpoints"]
B_MelbWeb -- "Sustained Failure Detected" --> A_Func
A_Func -- "Declares DR Event" --> C_TriggerCM["Trigger CM Tool"]
end
subgraph DR_Orchestration ["2: Automated DR"]
direction TB
C_TriggerCM --> D_Start((Start))
subgraph DataTierFailover ["Data Tier Failover"]
direction TB
D_Start --> E_StorageFailover["Storage Account Failover"]
E_StorageFailover --> F_PGSQLPromote["Promote PostgreSQL"]
end
subgraph AppTierActivation ["Application Tier Activation"]
direction TB
F_PGSQLPromote --> G_AppTierActivate["Activate Sydney App Tier"]
G_AppTierActivate --> H_WebGISDR["Restore webgisdr"]
end
H_WebGISDR --> I_ValidateServices["Service Validation"]
I_ValidateServices --> J_OrchComplete((Complete))
end
subgraph Traffic_Shift ["3: Traffic Redirection"]
direction TB
J_OrchComplete --> K_GSLB["GSLB Update"]
K_GSLB --> L_SydneyActive["Sydney Web Tier Active"]
L_SydneyActive --> M_Users["End Users"]
end
end
classDef detection fill:#f0f4f8,stroke:#4a90e2;
classDef process fill:#ffe8e8,stroke:#d32f2f;
classDef traffic fill:#e8f5e9,stroke:#2e7d32;
classDef event fill:#f3e5f5,stroke:#6a1b9a;
classDef storage fill:#e1f5fe,stroke:#0288d1;
classDef database fill:#e8f5e9,stroke:#1b5e20;
classDef startend fill:#f5f5f5,stroke:#616161;
class DR_Detection detection;
class DR_Orchestration process;
class Traffic_Shift traffic;
class A_Func,B_MelbWeb,C_TriggerCM event;
class E_StorageFailover storage;
class F_PGSQLPromote database;
class G_AppTierActivate,H_WebGISDR,I_ValidateServices process;
class K_GSLB,L_SydneyActive traffic;
class D_Start,J_OrchComplete startend;
Diagram: Gold Stage Automated Disaster Recovery Process Flow
Key Activities and Detailed Automated DR Process:
- DR Trigger: Sydney-Based Azure Function for Melbourne Health Monitoring:
- Deployment & Purpose: An Azure Function is deployed in Sydney. Its sole purpose is to monitor the health of the Melbourne ArcGIS Web Adaptor endpoints (`https://<adc-melbourne-vip>/portal/webadaptor/rest/info/health` and `/server/webadaptor/rest/info/health`).
- Timer Trigger: The function is configured with a timer trigger, executing every minute.
- Health Check & Retry Logic:
- The function makes HTTPS requests to the Melbourne Web Adaptor health endpoints.
- If a health check fails (e.g., timeout, non-200 HTTP status), it initiates a retry sequence: attempt `n` additional checks (e.g., `n=10`) with short intervals (e.g., 6 seconds apart).
- If all `n` retries within a cycle fail, the function increments a persistent failure counter (e.g., stored in an Azure Table Storage instance in Sydney).
- If the health check cycle is successful, the failure counter is reset.
- DR Event Declaration: If the persistent failure counter reaches a predefined threshold `z` (e.g., `z=5` consecutive 1-minute cycles of failed health checks, implying approximately 5-7 minutes of confirmed unresponsiveness after initial retries), the Azure Function declares a DR event.
- Security: The Azure Function uses a System-Assigned Managed Identity. This identity is granted the necessary permissions to:
- Make outbound HTTPS requests to the Melbourne endpoints.
- Write to its state/counter store (e.g., Azure Table Storage).
- Securely trigger the DR script managed by the Configuration Management tool (e.g., by calling a webhook or invoking a GitHub Actions pipeline). This is the primary action upon DR declaration.
-
DR Orchestration: Configuration Management Tool Takes Over: Upon being triggered by the Sydney Azure Function, the Configuration Management (CM) tool executes the DR script. This playbook automates the failover of the Data Tier and coordinates with the Application Tier activation.
-
Phase 1: Initial Notification & Data Tier Failover (CM Tool)
- Logging & Alerting: The CM playbook immediately logs the DR initiation and sends critical alerts to operations teams.
- Azure Storage Account Failover:
- The CM tool executes Azure CLI/API commands to initiate storage account failover for all critical PROD GRS Storage Accounts (Azure Files for `config-store`/`system`/`webgisdr-staging`; Azure Blob for Portal `content`/`arcgiscache`/`jobs`/`outputs`/`webgisdr` backups; ADLS Gen2 for the Raster Store).
- This makes the Sydney storage endpoints primary and writable. The CM tool should check the `Last Sync Time` of each storage account before failover and log this information for RPO assessment.
- Azure Database for PostgreSQL Replica Promotion:
- The CM tool executes Azure CLI/API commands to promote the Sydney PostgreSQL read replica to become a standalone, writable primary server. This breaks replication from Melbourne.
-
Phase 2: Application Tier Activation & Reconfiguration (CM Tool & OpenTofu)
- Activate/Scale Sydney Application Tier: The CM tool orchestrates the bring-up of the Application Tier in Sydney. This involves:
- Starting the VMs and/or scaling up VMSS instances to production capacity using Azure CLI/API calls. OpenTofu defines the target state and the CM tool ensures resources reach it.
- Apply CM Configurations: Once Sydney Application Tier VMs/VMSS are active, the CM tool runs its standard scripts on them to:
- Ensure all ArcGIS Enterprise software is correctly configured.
- Update ArcGIS Server connection files (`.sde`) and any other application configurations to point to the newly promoted Sydney PostgreSQL FQDN and the failed-over Sydney Storage Account endpoints.
- Phase 3: Automated `webgisdr` Restoration (CM Tool)
- With the Sydney Data Tier and Application Tier infrastructure active and reconfigured, the CM tool (orchestrating on the Sydney Portal VM) automates the `webgisdr --import` process.
- The latest `webgisdr` backup file is retrieved from the GRS Azure Blob Storage container (now primary in Sydney).
- The `SHARED_LOCATION` uses the Azure Files share for staging.
- A dynamically generated `webgisdr.properties` file (with DR-specific paths/credentials from Azure App Configuration/Key Vault) is used for the import.
-
Phase 4: Service Validation & GSLB Traffic Shift (CM Tool & GSLB)
- Automated Health Checks: The CM tool performs automated health checks on the key ArcGIS Enterprise services now running in Sydney.
- GSLB Redirection: The GSLB, continuously probing the health of regional endpoints (specifically the Web Tier in Sydney), will automatically detect that Sydney endpoints are healthy and Melbourne's are not. It then reroutes all user traffic to the Sydney Web Tier.
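If Azure Traffic Manager fulfils the GSLB role (the automation table at the end of this section lists the Traffic Manager API for the traffic shift), the priority routing and health probing could be declared roughly as follows. The profile name, endpoint targets and probe settings are assumptions, not the platform's confirmed GSLB product or values:

```hcl
resource "azurerm_traffic_manager_profile" "emap" {
  name                   = "tm-emap-prod"                 # hypothetical
  resource_group_name    = var.resource_group_name
  traffic_routing_method = "Priority"

  dns_config {
    relative_name = "emap-prod"
    ttl           = 60                                    # short TTL speeds up DR redirection
  }

  monitor_config {
    protocol                     = "HTTPS"
    port                         = 443
    path                         = "/portal/webadaptor/rest/info/health"
    interval_in_seconds          = 30
    timeout_in_seconds           = 10
    tolerated_number_of_failures = 3
  }
}

resource "azurerm_traffic_manager_external_endpoint" "melbourne" {
  name       = "melbourne-primary"
  profile_id = azurerm_traffic_manager_profile.emap.id
  target     = var.melbourne_web_fqdn
  priority   = 1
}

resource "azurerm_traffic_manager_external_endpoint" "sydney" {
  name       = "sydney-dr"
  profile_id = azurerm_traffic_manager_profile.emap.id
  target     = var.sydney_web_fqdn
  priority   = 2                                          # used only when Melbourne is unhealthy
}
```

With priority routing, Sydney receives no traffic while the Melbourne endpoint passes its probes; the short DNS TTL bounds how long clients keep resolving to Melbourne after failover.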
-
Role of OpenTofu in Fully Automated DR:
- Defines DR Infrastructure State: OpenTofu is responsible for defining the entire infrastructure in Sydney required for DR. This includes:
- The Azure Function App in Sydney and its dependent resources (e.g., storage for state).
- The PostgreSQL read replica configuration in Sydney, ready for promotion.
- All Azure Storage accounts configured with GRS for data replication.
- Application Tier resources in Sydney (VMs, VMSS, App Services for Web Adaptors), defined in a scaled-down "hot standby" state to reduce costs but allowing for rapid scaling/activation by the CM tool.
- Ensures DR Site Readiness: OpenTofu ensures the Sydney site is correctly provisioned before any DR event, allowing the automated failover processes to operate on a known, consistent infrastructure base. It is not ideal for the dynamic failover commands during the event (which are imperative actions better suited for scripting/CM tools), but rather to define the end-state infrastructure that the CM tool activates or scales.
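A hedged sketch of the DR-readiness pieces OpenTofu would define: the Sydney read replica and a geo-redundant storage account. Names and references are placeholders; the replica points at the PROD server resource sketched for the Silver stage:

```hcl
# Cross-region read replica in Sydney, kept ready for promotion during DR.
resource "azurerm_postgresql_flexible_server" "egdb_sydney_replica" {
  name                = "psql-emap-egdb-syd"              # hypothetical
  resource_group_name = var.dr_resource_group_name
  location            = "australiaeast"                   # Sydney
  create_mode         = "Replica"
  source_server_id    = azurerm_postgresql_flexible_server.egdb_prod.id
}

# Geo-redundant storage asynchronously replicates Melbourne data to the paired region.
resource "azurerm_storage_account" "blob_prod_gold" {
  name                     = "stemapblobprod"             # hypothetical
  resource_group_name      = var.resource_group_name
  location                 = "australiasoutheast"
  account_kind             = "StorageV2"
  account_tier             = "Standard"
  account_replication_type = "GRS"                        # or "GZRS" where zone redundancy is also required
  min_tls_version          = "TLS1_2"
}
```

Promoting the replica and failing over the storage accounts remain imperative CLI/API actions executed by the CM tool during the event, consistent with the division of responsibilities described above.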
-
Data Replication Mechanisms (Unchanged from previous plan, but critical for automation):
- Azure Database for PostgreSQL: Asynchronous physical streaming replication from Melbourne primary to the Sydney read replica. Geo-redundant backups provide a secondary DR data source.
- Azure Storage (Blob, ADLS Gen2, Files): GRS or GZRS for asynchronous replication of data from Melbourne to Sydney.
-
RPO/RTO Considerations with Fully Automated DR:
- RPO: Remains dependent on the asynchronous replication lag:
- PostgreSQL: `Replica Lag in Seconds`, typically less than a minute.
- Azure Storage GRS: `Last Sync Time` (typically <15 minutes, but variable).
- RTO: Significantly minimised due to full automation. The total time includes:
- Detection time by the Azure Function (e.g., 1-minute check interval * `z` failure cycles + `n` retries per cycle).
- Azure Function processing and CM tool pipeline trigger latency.
- Execution time for the CM tool script: storage account failovers, PostgreSQL promotion, Application Tier activation/scaling and `webgisdr` import.
- GSLB health probe interval and DNS update time for traffic redirection. The target RTO should be clearly defined and DR drills will validate it. The duration of the `webgisdr` import will be a significant factor. If `webgisdr` becomes a barrier to reaching the desired RTO, alternative backup strategies such as Azure Backup Service should be investigated.
-
Failback Strategy (Sydney to Melbourne): Failback is a highly complex undertaking and typically remains a more controlled, semi-automated process.
- Disable Sydney DR Trigger: The Sydney Azure Function must be disabled or its logic altered to prevent re-triggering DR back to Sydney during failback.
- Melbourne Restoration: OpenTofu ensures Melbourne infrastructure is fully restored or re-provisioned to a clean state.
- Data Resynchronization (Critical & Complex):
- PostgreSQL: Establish replication from Sydney (acting primary) back to a new or restored Melbourne instance (acting as a new replica). Once synced, a planned failover (with downtime) is performed.
- Azure Storage: Re-initiate geo-replication for GRS accounts from Sydney back to Melbourne. This "re-protect" operation synchronizes changes made in Sydney back to Melbourne.
- Application State (`webgisdr`): A fresh `webgisdr` export from Sydney would be taken and restored onto the Melbourne environment after data stores are synced and Melbourne is ready to become primary.
export from Sydney would be taken and restored onto the Melbourne environment after data stores are synced and Melbourne is ready to become primary. - GSLB Reconfiguration: Update GSLB to prioritize Melbourne again. This process requires planning and thorough testing due to the risks of data divergence.
-
Testing and Validation of Fully Automated DR:
- Mandatory Drills: Regular, comprehensive DR drills are non-negotiable. These drills must test the entire automated sequence:
- Simulate Melbourne Web Tier unavailability to trigger the Sydney Azure Function.
- Verify the Azure Function correctly declares a DR event and triggers the CM tool.
- Validate the CM tool's successful orchestration of Data Tier failovers (PostgreSQL promotion, Storage Account failovers).
- Confirm Application Tier activation and reconfiguration in Sydney.
- Verify successful `webgisdr` import.
- Confirm GSLB traffic redirection to Sydney.
- Measure RPO/RTO: Accurately measure data loss (RPO) against `Last Sync Times` and actual downtime (RTO) during drills.
- Iterative Refinement: Use learnings from DR drills to continuously refine the Azure Function logic, CM scripts, OpenTofu configurations and DR runbook documentation.
Phase | Components Involved | Automation Tools | Key Metrics |
---|---|---|---|
DR Declaration | Azure Function, Key Vault | Timer triggers + Retry logic | Health check failures |
Storage Failover | GRS Storage Accounts | Azure CLI/PowerShell | LastSyncTime validation |
PostgreSQL Promotion | Sydney Read Replica | Azure Database CLI | Replica lag <60s |
App Tier Activation | VMSS, Web Adaptors | OpenTofu + CM Tool | Instance health status |
webgisdr Restoration | Blob Storage, Azure Files | Python Automation | Backup file versioning |
Traffic Shift | GSLB, DNS | Traffic Manager API | Endpoint response times |
Table: Gold Stage DR Automation Process
By implementing this fully automated Gold Stage Data Tier DR strategy, the new eMap platform achieves a high degree of resilience. This minimises human intervention during a disaster, significantly reduces RTO and ensures business continuity for critical geospatial services.