4.4: Data Tier Implementation
TL;DR
The Data Tier forms the core storage layer for the ArcGIS Enterprise deployment, using Azure PaaS services: Azure Database for PostgreSQL (hosting the Enterprise Geodatabase with PostGIS), Blob Storage (portal content, service caches), ADLS Gen2 (cloud-optimised raster store) and Azure Files (server configuration, disaster recovery staging). Implementation progresses through four maturity stages. The MVP phase establishes foundational components with cost optimised redundancy (ZRS/LRS). Bronze focuses on vertical scaling, performance monitoring and lifecycle management. Silver introduces high availability through same-zone PostgreSQL replication and validated zone-redundant storage. Gold enables fully automated cross-region disaster recovery using asynchronous PostgreSQL replicas, geo-redundant storage and orchestrated failover triggered by health-monitoring Azure Functions. Infrastructure-as-Code (OpenTofu) and configuration management tools ensure consistent provisioning and DR automation. Security integrates RBAC, managed identities and encrypted storage across services. The tier evolves from initial setup to a resilient architecture minimising downtime and data loss, supporting seamless regional failover while maintaining strict access controls and cost efficiency.
The Data Tier forms the foundational storage layer for the entire ArcGIS Enterprise deployment, responsible for the persistence, management and accessibility of all geospatial and configuration data. This tier leverages a combination of Azure's robust Platform-as-a-Service (PaaS) storage offerings. Key components of this tier include the user-managed Enterprise Geodatabase hosted on Azure Database for PostgreSQL (with PostGIS), Azure Blob Storage for Portal content, service caches and server job outputs, Azure Data Lake Storage Gen2 (ADLS Gen2) for the Raster Store and Azure Files for ArcGIS Server's shared configuration and temporary `webgisdr` staging. This section details their provisioning, configuration for optimal performance and resilience and integration with the Application Tier across stages.
graph LR
subgraph AppTier ["Application Tier (Simplified)"]
direction TB
P4A[("Portal for ArcGIS")]
AGS[("ArcGIS Server")]
ADS[("ArcGIS Data Store")]
end
subgraph DataTier ["Data Tier (Azure PaaS)"]
direction TB
DTEG[("User-Managed Enterprise Geodatabase<br>(Azure Database for PostgreSQL + PostGIS)")]
DTRaster[("User-Managed Raster Store<br>(Azure ADLS Gen2)")]
DTCacheOutput[("Portal Content, Caches, Outputs, Jobs, Backups<br>(Azure Blob Storage)")]
DTConfig[("Server Shared Configuration<br>(Azure Files for config-store, system)")]
DTWebGISDRStage[("webgisdr Staging<br>(Azure Files for SHARED_LOCATION)")]
end
AppTier -->|"Accesses & Registers"| DTEG
AGS -->|"Accesses & Registers Cloud Store"| DTRaster
P4A -->|"Content Directory"| DTCacheOutput
AGS -->|"Caches, Outputs, Jobs (Cloud Stores)"| DTCacheOutput
ADS -->|"Backup to Blob via webgisdr"| DTCacheOutput
AGS -->|"Config Store, System Dirs"| DTConfig
P4A -->|"webgisdr Staging for Export/Import"| DTWebGISDRStage
classDef apptier fill:#e6ffed,stroke:#198754,stroke-width:2px;
classDef datatier fill:#fff8e1,stroke:#f57c00,stroke-width:2px;
class P4A,AGS,ADS apptier;
class DTEG,DTRaster,DTCacheOutput,DTConfig,DTWebGISDRStage datatier;
Diagram: Conceptual overview of the Data Tier components and their primary interactions with the Application Tier.
ArcGIS Data Store: Data Tier vs. Application Tier Focus
This Section (Data Tier) focuses on the Enterprise Azure PaaS data stores. Deployment and setup of the ArcGIS Data Store is covered as part of the Application Tier. As the ArcGIS Data Store manages its own internal PostgreSQL database, its underlying infrastructure (VMs and their local/managed disks) is considered part of the Application Tier in this architecture.
Stage | Focus Area | PostgreSQL Configuration | Storage Configuration |
---|---|---|---|
MVP | Foundation Setup | - Flexible Server - Burstable/GP tiers - LRS/ZRS | - Content containers - Basic lifecycle policies |
Bronze | Performance Optimization | - Vertical scaling - Query optimization | - IOPS monitoring - Cost-tier optimization |
Silver | High Availability | - Same-Zone HA - Sync replication | - ZRS validation - Enhanced monitoring |
Gold | Disaster Recovery | - Cross-region replica - Auto-promotion | - GRS - Storage account failover automation |
Table: Data Tier Implementation Summary
graph LR
subgraph DT["Data Tier Implementation Stages"]
direction TB
MVP[MVP Stage<br>Foundation] --> Bronze[Bronze Stage<br>Performance]
Bronze --> Silver[Silver Stage<br>HA]
Silver --> Gold[Gold Stage<br>DR]
end
subgraph PG["PostgreSQL Evolution"]
direction LR
PG_MVP["MVP: Single Instance<br>ZRS Backups"] --> PG_Bronze["Bronze: Vertical Scaling<br>Query Optimization"]
PG_Bronze --> PG_Silver["Silver: Same-Zone HA Pair<br>Sync Replication"]
PG_Silver --> PG_Gold["Gold: Cross-Region Replica<br>Auto-Failover"]
end
subgraph ST["Storage Services HA/DR"]
direction TB
ST_Silver["Silver Stage (ZRS)"] -->|Intra-Region| Z1[Zone 1] & Z2[Zone 2] & Z3[Zone 3]
ST_Gold["Gold Stage (GRS)"] -->|Cross-Region| Primary[Melbourne] -->|Async Replication| Secondary[Sydney]
end
classDef stage fill:#f0f0f0,stroke:#666,stroke-width:2px;
classDef storage fill:#fff8e1,stroke:#f57c00,stroke-width:2px;
classDef db fill:#e6ffed,stroke:#198754,stroke-width:2px;
class DT,PG,ST stage;
class PG_Silver,PG_Gold db;
class ST_Silver,ST_Gold storage;
Diagram: Data Tier Evolution - Progression through implementation stages with storage maturity paths

4.4.1 Data Tier MVP Stage
The MVP stage for the Data Tier focuses on establishing the foundational Azure PaaS storage resources for all environments (DEV, UAT and PROD - Melbourne). This includes provisioning dedicated instances of Azure Database for PostgreSQL, Azure Blob Storage, Azure ADLS Gen2 and Azure Files, with configurations optimised for initial functionality and cost-effectiveness. All provisioning should be automated via OpenTofu. This stage lays the groundwork for data storage and access, enabling core ArcGIS Enterprise functionalities.
Key Activities and Configurations:
-
Azure Database for PostgreSQL (Enterprise Geodatabase): Azure Database for PostgreSQL Flexible Server will host the authoritative Enterprise Geodatabases. This fully managed PaaS offering relieves the team from many traditional database administration burdens like OS patching, hardware maintenance and basic backup orchestration, allowing a focus on data modelling, performance and governance.
- Provisioning (OpenTofu): Dedicated Azure Database for PostgreSQL Flexible Server instances will be provisioned for each environment.
- PROD (Melbourne): An appropriately sized General Purpose tier, based on initial capacity planning, will be used. Initial redundancy for PROD will be Zone-Redundant Storage (ZRS) for the database backups and transaction logs. This means backups are replicated across multiple physical locations, significantly increasing durability. The database server itself will be configured for high availability (HA) in the Silver stage.
- UAT: A smaller General Purpose tier or Burstable tier with Locally-Redundant Storage (LRS) backups to optimise costs. LRS provides three copies of data within a single data centre.
- DEV: A Burstable tier (e.g., `B_Standard_B1ms`) with LRS backups. Burstable tiers are ideal for development workloads that don't need full compute capacity continuously and offer significant cost savings. Auto-pause capabilities should be configured where appropriate, further reducing costs by automatically stopping the server during periods of inactivity.
- Understanding PostgreSQL Flexible Server Architecture: The "Flexible Server" deployment model for Azure Database for PostgreSQL offers enhanced control and flexibility. It separates compute (the database engine running in a container on a Linux VM) from storage (data files on Azure storage). This architecture facilitates features such as zone-redundant high availability and better cost optimisation.
- PostGIS Extension: The PostGIS extension is critical and must be enabled on all instances during provisioning. This extension adds support for geographic objects to the PostgreSQL database, allowing it to store and query spatial data. It is the foundation for creating an Enterprise Geodatabase recognisable by ArcGIS.
-
Configuration (CM Tool & Manual/Scripted for DB setup):
- Networking: Network security will be enforced through VNet rules and private endpoints, allowing connections only from designated Application Tier VMs (Portal, ArcGIS Server). Public access should be disabled to enhance security.
- Registration with ArcGIS Server: The PostgreSQL instance should be registered with ArcGIS Server as Enterprise Geodatabases. This is done using connection files (`.sde`) which store the connection parameters. Database user permissions (PostgreSQL roles and grants) should be configured following the principle of least privilege for the ArcGIS service accounts.
- Automated Backups: Azure automatically creates server backups and stores them (on LRS or ZRS depending on configuration). The default backup retention period is seven days, configurable up to 35 days. These backups allow for Point-In-Time Recovery (PITR). For PROD, initial retention will be set to 35 days; for UAT/DEV, 7 days will suffice. All backups are encrypted using AES 256-bit encryption.
- Supported PostgreSQL Versions: ArcGIS Enterprise 11.4 supports PostgreSQL versions 13, 14, 15 and 16. PostgreSQL 16 should be chosen as the starting point for the new eMap platform as it is considered stable and has full compatibility with PostGIS and ArcGIS Enterprise. Minor version updates are handled automatically by Azure during configured maintenance windows.
- Connection Security: Connections to Azure Database for PostgreSQL Flexible Server are enforced using Transport Layer Security (TLS) by default (TLS 1.2 and later). This encrypts data in transit between ArcGIS Server and the database.
Aspect | PROD (Melbourne) | UAT | DEV |
---|---|---|---|
Provisioning Tier | General Purpose (Zone-Redundant Storage) | General Purpose/Burstable (LRS) | Burstable (B_Standard_B1ms, LRS) |
HA Configuration | Silver Stage HA | Single Instance | Single Instance + Auto-Pause |
Backup Policy | ZRS, 35-day retention | LRS, 7-day retention | LRS, 7-day retention |
PostGIS Version | Enabled on PostgreSQL 16 | Enabled on PostgreSQL 16 | Enabled on PostgreSQL 16 |
Networking | VNet rules + Private endpoints | VNet rules | VNet rules |
Security | TLS 1.2+ enforced, Managed Identity auth | TLS 1.2+ enforced | TLS 1.2+ enforced |
Table: Azure Database for PostgreSQL MVP Configuration
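As an illustration of the provisioning described above, a minimal OpenTofu (azurerm provider) sketch for a DEV-sized Flexible Server with the PostGIS extension allow-listed might look like the following. Resource names, sizing and variables (e.g., `var.data_subnet_id`) are placeholders, not the project's actual module layout:

```hcl
# Illustrative sketch only - names, sizes and variables are assumptions.
resource "azurerm_postgresql_flexible_server" "egdb_dev" {
  name                   = "psql-emap-egdb-dev"          # hypothetical naming convention
  resource_group_name    = var.resource_group_name
  location               = "australiasoutheast"           # Melbourne
  version                = "16"
  sku_name               = "B_Standard_B1ms"              # Burstable tier for DEV
  storage_mb             = 131072
  backup_retention_days  = 7                              # 35 for PROD
  administrator_login    = var.pg_admin_login
  administrator_password = var.pg_admin_password          # sourced from Key Vault in practice

  # Private access (VNet integration); no public endpoint is exposed in this mode.
  delegated_subnet_id = var.data_subnet_id
  private_dns_zone_id = var.postgres_private_dns_zone_id
}

# Allow-list the PostGIS extension so it can be created in the geodatabase.
resource "azurerm_postgresql_flexible_server_configuration" "extensions" {
  name      = "azure.extensions"
  server_id = azurerm_postgresql_flexible_server.egdb_dev.id
  value     = "POSTGIS"
}
```

Allow-listing `POSTGIS` via `azure.extensions` only makes the extension available; it is still created per database (e.g., `CREATE EXTENSION postgis;`) as part of enabling the Enterprise Geodatabase.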
- Azure Blob Storage (Portal Content, Caches, Jobs, Outputs, `webgisdr` final backups): Azure Blob Storage is a highly scalable and cost-effective PaaS offering for storing large amounts of unstructured data, often referred to as "objects." It serves multiple critical roles in the new eMap platform, replacing traditional file shares or local disk storage for various ArcGIS Enterprise components.
- Provisioning (OpenTofu): Dedicated Azure Storage Accounts (Standard General Purpose v2 recommended for most scenarios, offering a balance of cost and performance) should be provisioned for each environment.
- PROD: Standard tier with Zone-Redundant Storage (ZRS). This ensures data is replicated synchronously across three Azure availability zones, offering high durability (99.9999999999% - 12 nines over a year) and availability against data centre-level failures. Geo-Redundant Storage (GRS) for cross-region DR is targeted for the Gold Stage.
- UAT/DEV: Standard tier with Locally-Redundant Storage (LRS). This provides cost-effectiveness for non-production environments by replicating data three times within a single data centre, offering 11 nines of durability.
- Container Setup (OpenTofu): Separate blob containers should be created within each storage account for:
- Portal for ArcGIS `content` directory: Stores item metadata, thumbnails and other files associated with Portal items.
- `arcgiscache`: For map and image service tile caches. The directory name `arcgiscache` within the container is specifically required by ArcGIS Server when registered as a Cloud Store.
- ArcGIS Server `jobs` and `output` directories: For asynchronous geoprocessing services and other server operations.
- `webgisdr` final backup files: Provides off-VM storage for disaster recovery artifacts.
- Configuration & Integration (CM Tool & Portal Admin API):
- The Portal `content` directory should be configured via the Portal Administrator API. Best practices for this container include enabling soft delete for blobs (e.g., 14-day retention) to recover from accidental deletions, blob versioning if granular recovery of content items is critical (note: this increases storage costs) and applying Azure Resource Locks (`CanNotDelete`) on the storage account to prevent accidental deletion of the entire account. Container soft delete should also be enabled for the Portal content container.
- Azure Blob Storage containers for `arcgiscache`, `jobs` and `output` should be registered as Cloud Stores with ArcGIS Server. Authentication should use the Azure Managed Identities of the ArcGIS Server VMs/VMSS, adhering to the principle of least privilege (RBAC role: `Storage Blob Data Contributor`). This avoids storing access keys in configuration files.
- Caches stored in Azure Blob Storage must be fully pre-generated using the `CompactV2` cache format, which is optimised for cloud storage. Cache-on-demand is not supported when using Azure Blob Storage as a Cloud Store for tile caches.
- Lifecycle Management Policies: Define policies to automatically transition data between storage tiers (Hot, Cool, Archive) or delete it after a defined period. For example, older `webgisdr` backups or unused caches could be moved to the Cool tier after 30 days and to the Archive tier after 90 days, significantly reducing storage costs.
- Security Considerations for Blob Storage:
- Secure Transfer (HTTPS): Enforce HTTPS for all requests to the storage account by enabling the "Secure transfer required" setting.
- Minimum TLS Version: Configure the storage account to require a minimum TLS version of 1.2.
- Anonymous Access: Disable anonymous public read access for containers and blobs by default. Access should be granted via Managed Identities or, in rare, specific cases, via Shared Access Signatures (SAS) with limited permissions and expiry.
- Shared Key Authorization: Consider disallowing Shared Key authorisation for the storage account if all access can be managed via Managed Identities and RBAC. This strengthens security by removing a potential attack vector.
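A hedged OpenTofu sketch of the Blob Storage pattern described above (ZRS account, a private `webgisdr` backup container and a Hot to Cool to Archive lifecycle rule). Names and retention values are illustrative, and some attribute names vary slightly between azurerm provider versions:

```hcl
resource "azurerm_storage_account" "blob_prod" {
  name                            = "stemapblobprod"     # hypothetical
  resource_group_name             = var.resource_group_name
  location                        = "australiasoutheast"
  account_kind                    = "StorageV2"
  account_tier                    = "Standard"
  account_replication_type        = "ZRS"                # LRS for UAT/DEV; GRS at Gold Stage
  min_tls_version                 = "TLS1_2"
  allow_nested_items_to_be_public = false                # no anonymous blob access
}

resource "azurerm_storage_container" "webgisdr_backups" {
  name                  = "webgisdr-backups"
  storage_account_name  = azurerm_storage_account.blob_prod.name
  container_access_type = "private"
}

# Hot -> Cool (30d) -> Archive (90d) -> Delete (365d) for webgisdr backups.
resource "azurerm_storage_management_policy" "lifecycle" {
  storage_account_id = azurerm_storage_account.blob_prod.id

  rule {
    name    = "webgisdr-tiering"
    enabled = true
    filters {
      prefix_match = ["webgisdr-backups/"]
      blob_types   = ["blockBlob"]
    }
    actions {
      base_blob {
        tier_to_cool_after_days_since_modification_greater_than    = 30
        tier_to_archive_after_days_since_modification_greater_than = 90
        delete_after_days_since_modification_greater_than          = 365
      }
    }
  }
}
```

Secure transfer (HTTPS-only) is enabled by default in the azurerm provider, so only the TLS floor and public-access settings are called out explicitly here.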
-
Azure Data Lake Storage Gen2 (ADLS Gen2 - Raster Store): Azure Data Lake Storage Gen2 is Azure's optimised solution for big data analytics, built on Azure Blob Storage but with the addition of a Hierarchical Namespace (HNS). This HNS is key for managing large collections of raster data, making ADLS Gen2 the designated Raster Store for the new eMap platform.
- Understanding Hierarchical Namespace (HNS): Unlike traditional flat blob storage where directory structures are just part of the blob name, HNS provides a true file system-like directory structure. Directories become actual objects, enabling:
- Atomic directory operations: Renaming or deleting a directory containing thousands of raster tiles is a single, fast metadata operation, crucial for efficient management.
- Familiar semantics: Organising data (e.g., `/imagery/type/source/year/`) is intuitive.
- Performance: Analytics frameworks often perform better with HNS. Listing files in a directory is significantly faster than in flat object storage.
- Granular security: POSIX-like Access Control Lists (ACLs) can be set on directories and files, complementing Azure RBAC.
- Provisioning (OpenTofu): Dedicated Azure Storage Accounts (Standard General Purpose v2) with Hierarchical Namespace (HNS) enabled should be provisioned for each environment.
- PROD: Standard tier with Zone-Redundant Storage (ZRS) for high durability and availability. Geo-Redundant Storage (GRS) for cross-region DR is targeted for the Gold Stage.
- UAT/DEV: Standard tier with Locally-Redundant Storage (LRS) to manage costs.
- Configuration & Integration (CM Tool):
- ADLS Gen2 should be registered as a Cloud Store with ArcGIS Server. ArcGIS Server interacts with ADLS Gen2 using the Azure Blob File System (ABFS) driver, benefiting from HNS for efficient directory operations.
- A logical hierarchical folder structure (e.g., `/imagery/[collection_type]/[source_identifier]/[year_of_acquisition]/`, `/elevation/dem/[source_identifier]/`) should be implemented.
- Migration of legacy raster formats to cloud-optimised formats such as Cloud Raster Format (CRF) or Meta Raster Format (MRF) should be a critical planned project activity.
- CRF: Preferred for most analytical datasets and new acquisitions. Use LERC compression (quality ~75%), 512x512 tiling and build pyramids (bilinear resampling for continuous data, nearest neighbour for discrete).
- MRF: Efficient for creating pre-rendered basemaps or tile services (cached to Blob Storage).
- ArcGIS Server will authenticate to ADLS Gen2 using its Azure Managed Identity, granted appropriate RBAC roles (e.g., `Storage Blob Data Contributor`).
- Security Considerations for ADLS Gen2:
- Similar to Blob Storage: enforce HTTPS, minimum TLS 1.2.
- Utilise a combination of Azure RBAC and POSIX-like ACLs for fine-grained access control.
- Cost of HNS: Enabling HNS itself has no direct upgrade cost on a GPv2 account. Transaction costs can vary, but the efficiency gains from HNS often lead to lower costs for large-scale data management.
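The same provisioning pattern applies to the Raster Store; the key differences are the hierarchical namespace flag and the managed-identity role assignment. A minimal sketch, with the principal ID variable as an assumed input:

```hcl
resource "azurerm_storage_account" "raster_store" {
  name                     = "stemaprasterprod"           # hypothetical
  resource_group_name      = var.resource_group_name
  location                 = "australiasoutheast"
  account_kind             = "StorageV2"
  account_tier             = "Standard"
  account_replication_type = "ZRS"                        # LRS for UAT/DEV
  is_hns_enabled           = true                         # ADLS Gen2 hierarchical namespace
  min_tls_version          = "TLS1_2"
}

# Grant the ArcGIS Server managed identity data-plane access (no account keys).
resource "azurerm_role_assignment" "server_raster_access" {
  scope                = azurerm_storage_account.raster_store.id
  role_definition_name = "Storage Blob Data Contributor"
  principal_id         = var.arcgis_server_principal_id   # VM/VMSS system-assigned identity
}
```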
- Azure Files (for ArcGIS Server `config-store` and `system` directories): Azure Files provides fully managed file shares in the cloud that are accessible via the SMB protocol. This is crucial for ArcGIS Server's `config-store` and `system` directories, which need to be shared and accessible by all ArcGIS Server instances.
- Purpose:
- `config-store`: Contains essential configuration files for the ArcGIS Server site.
- `system`: Contains files related to server operations and state.
- Provisioning (OpenTofu): Dedicated Azure Storage Accounts (FileStorage kind for Premium tier, General Purpose v2 for Standard tier) should be provisioned for Azure Files shares for each environment.
- PROD: Premium tier (SSD-backed) with Zone-Redundant Storage (ZRS). This ensures high performance (low latency, high IOPS/throughput suitable for frequent access by `config-store`) and resilience for these critical directories.
- UAT/DEV: Standard tier (HDD-backed) with Locally-Redundant Storage (LRS). This optimises costs for non-production environments while maintaining structural parity. The performance of the Standard tier is generally sufficient for DEV/UAT workloads.
- Share Setup (OpenTofu): Separate file shares for `config-store` and `system` will be created within the respective storage accounts.
will be created within the respective storage accounts. - Mounting & Permissions (CM Tool):
- The `cifs-utils` package should be installed on all ArcGIS Server VMs (Ubuntu 24.04 LTS).
- Shares will be mounted persistently (e.g., via `/etc/fstab`) on ArcGIS Server VMs.
- The storage account key required for mounting will be retrieved securely from Azure Key Vault by the VMs using their Managed Identities. The credential file (e.g., `/etc/smbcredentials/arcgisserver.cred`) on the VM storing this key must be strictly permissioned (readable only by root).
- Recommended mount options: `vers=3.1.1,credentials=<path_to_cred_file>,uid=<arcgis_uid>,gid=<arcgis_gid>,dir_mode=0700,file_mode=0600,serverino,nosharesock,mfsymlinks,actimeo=30`.
- `uid`/`gid`: Ensures mounted files are owned by the `arcgis` service account.
- `dir_mode`/`file_mode`: Sets appropriate permissions.
- `actimeo=30`: Caches file and directory attributes for 30 seconds, which can improve performance for `config-store` operations by reducing metadata chattiness.
- Networking: Access to Azure Files shares should be restricted to the VNet. Secure transfer (SMB 3.1.1 with encryption) should be enforced.
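A minimal sketch of the Premium file share provisioning and the Key Vault hand-off for the storage account key described above (names, quotas and the Key Vault reference are assumptions; the share attribute names vary slightly across azurerm provider versions):

```hcl
resource "azurerm_storage_account" "server_files" {
  name                     = "stemapfilesprod"            # hypothetical
  resource_group_name      = var.resource_group_name
  location                 = "australiasoutheast"
  account_kind             = "FileStorage"                # required for Premium file shares
  account_tier             = "Premium"
  account_replication_type = "ZRS"                        # Standard LRS for UAT/DEV
  min_tls_version          = "TLS1_2"
}

resource "azurerm_storage_share" "config_store" {
  name                 = "config-store"
  storage_account_name = azurerm_storage_account.server_files.name
  quota                = 512                              # GiB; provisioned IOPS scale with quota
  enabled_protocol     = "SMB"
}

resource "azurerm_storage_share" "system" {
  name                 = "system"
  storage_account_name = azurerm_storage_account.server_files.name
  quota                = 256
  enabled_protocol     = "SMB"
}

# Store the account key in Key Vault so VMs can retrieve it at mount time via
# their managed identities, rather than embedding it in scripts or fstab.
resource "azurerm_key_vault_secret" "server_files_key" {
  name         = "server-files-storage-key"
  value        = azurerm_storage_account.server_files.primary_access_key
  key_vault_id = var.key_vault_id
}
```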
- Azure Files (for `webgisdr` `SHARED_LOCATION` staging): The `webgisdr` utility requires a temporary staging location (`SHARED_LOCATION`) on a file system accessible by the machine executing the utility (typically the active Portal for ArcGIS VM). Azure Files provides a resilient and accessible option for this.
- Provisioning (OpenTofu): Dedicated Azure Storage Accounts for Azure Files shares for each environment. A separate storage account from the `config-store` shares is recommended for PROD for clarity and independent scaling/redundancy if needed.
- PROD (Melbourne): Premium tier (SSD-backed) with Zone-Redundant Storage (ZRS).
- UAT/DEV: Standard tier (HDD-backed) with Locally-Redundant Storage (LRS).
- Share Setup (OpenTofu): A dedicated file share (e.g., `webgisdr-staging`) will be created.
- Mounting & Permissions (CM Tool):
- Mounted persistently on the Portal for ArcGIS VMs.
- Storage account key management and mount options similar to those for the `config-store` Azure Files share, ensuring the `arcgis` service account (or the account running `webgisdr`) has read/write access.
- Networking: Secure transfer (SMB 3.1.1 with encryption) enforced.
Service | PROD Configuration | UAT/DEV Configuration | Key Use Cases |
---|---|---|---|
Blob Storage | ZRS, Geo-Redundant (Gold Stage) | LRS | Portal content, caches, jobs, webgisdr backups |
ADLS Gen2 | ZRS + Hierarchical Namespace | LRS + Hierarchical Namespace | Raster store (CRF/MRF formats) |
Azure Files | Premium ZRS (config-store/system) | Standard LRS | Server shared config, webgisdr staging |
Lifecycle Policy | Hot → Cool (30d) → Archive (90d) → Delete | Hot → Cool (60d) → Delete | Automated tier management |
Auth Method | Managed Identities (RBAC) | Managed Identities | Least privilege access |
Table: Azure Storage Services MVP Configuration
-
Security and Access Management (Common Data Tier Considerations): A consistent and robust security posture across all Data Tier components is paramount.
- Azure Role-Based Access Control (RBAC): Applied to all PaaS data resources (PostgreSQL, Storage Accounts for Blob/ADLS Gen2/Files), adhering to the principle of least privilege. Custom roles should be defined if built-in roles (e.g., `Storage Blob Data Contributor`, `Storage File Data SMB Share Contributor`, `PostgreSQL Server Contributor`) are too permissive for specific operational tasks or service accounts.
- Managed Identities: System-assigned Managed Identities should be configured for Application Tier VMs (Portal, Server, Data Store), VMSS and Web Adaptor App Services. These identities should be granted the necessary RBAC roles to securely authenticate to:
- Azure Key Vault: For retrieving secrets like database passwords (for ArcGIS Server to connect to PostgreSQL), storage account keys (for mounting Azure Files shares).
- Azure Storage services (Blob, ADLS Gen2): Directly, where supported by the application (e.g., ArcGIS Server accessing Cloud Stores using the `Storage Blob Data Contributor` role assigned to its Managed Identity). This is the preferred method over using storage account keys for Blob/ADLS Gen2 access by applications.
- Storage Account Keys: For Azure Files mounting via SMB, storage account keys are currently the standard authentication mechanism from Linux VMs. These keys will be stored as secrets in Azure Key Vault. VMs will retrieve these keys at runtime using their Managed Identities specifically for the mount operation automated by the Configuration Management tool. Direct embedding of keys in scripts or fstab entries is strictly prohibited. Ideally, regular rotation of these storage account keys should be scheduled and automated.
- TLS Enforcement: All Azure Storage Accounts (Blob, ADLS Gen2, Files) should be configured to require a minimum TLS version of 1.2. Azure Database for PostgreSQL also enforces TLS 1.2+ by default.
- Azure Storage Firewall: Azure Storage firewalls should be configured to restrict access to selected virtual networks and IP addresses.
- Microsoft Defender for Storage: This Azure security service should be enabled for all Azure Storage accounts. It provides an additional layer of security intelligence by detecting unusual and potentially harmful attempts to access or exploit your storage accounts, including Blob, Files and ADLS Gen2.
-
Microsoft Defender for open-source relational databases: Should be enabled for Azure Database for PostgreSQL instances. It detects anomalous activities indicating unusual and potentially harmful attempts to access or exploit databases, providing security alerts.
Control | PostgreSQL | Blob Storage | ADLS Gen2 | Azure Files |
---|---|---|---|---|
Encryption | AES256 + TLS 1.2+ | SSE + HTTPS | SSE + HTTPS | SMB 3.1.1 Encryption |
Access Control | PostgreSQL RBAC | Azure RBAC + SAS | POSIX ACLs + RBAC | NTFS Permissions |
Monitoring | Query Performance | Storage Analytics | Data Lake Audit | File Access Logs |
Threat Detection | Defender for DB | Defender for Storage | Defender for Storage | Defender for Storage |
Backup | PITR 35-day | Soft Delete + Versioning | Versioning | ZRS Snapshots |
Table: Security Controls Matrix
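The Defender enablement described above is a subscription-scope setting and can also be captured in OpenTofu. A minimal sketch using the azurerm Security Center pricing resource (plan names follow the provider's `resource_type` values):

```hcl
# Enable Microsoft Defender plans for the Data Tier services at subscription scope.
resource "azurerm_security_center_subscription_pricing" "defender_storage" {
  tier          = "Standard"
  resource_type = "StorageAccounts"
}

resource "azurerm_security_center_subscription_pricing" "defender_oss_db" {
  tier          = "Standard"
  resource_type = "OpenSourceRelationalDatabases"
}
```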
4.4.2 Data Tier Bronze Stage
The Bronze Stage for the Data Tier primarily focuses on establishing strategies for scaling Azure Database for PostgreSQL in response to increased demand from the Application Tier. It also involves refining monitoring for all data services to ensure performance and capacity are well-managed. Configurations for Azure Blob Storage, ADLS Gen2 and Azure Files from the MVP stage are largely maintained, with an emphasis on ensuring they can handle potentially increased I/O.
Key Activities and Configurations:
-
Azure Database for PostgreSQL (Enterprise Geodatabase): With ArcGIS Server now capable of scaling out, the load on the Enterprise Geodatabase can increase. It's crucial to have a strategy to scale the database if it becomes a bottleneck. Azure Database for PostgreSQL Flexible Server offers several ways to manage performance and scale.
- Understanding Scaling Options:
- Vertical Scaling (Scale-Up/Down): This involves changing the compute tier (e.g., from General Purpose to Memory Optimized) or adjusting the vCores, RAM and IOPS allocated to the server. Increasing resources (scaling up) typically requires a server restart, which Azure manages. Near-zero downtime scaling aims to minimise this interruption (typically <30 seconds), but it has limitations (e.g., not for HA-enabled servers in some scenarios, logical replication slots not preserved without pg_failover_slots). For the Bronze stage, this is the primary scaling method.
- Storage Scaling: Storage size can only be increased. For Premium SSD, this is mostly an online operation unless crossing the 4TiB boundary. IOPS for Premium SSD scale with disk size or can be provisioned separately up to VM limits. For Premium SSD v2, IOPS and throughput can be tweaked independently of size.
- Scaling Strategy Documentation:
- Detailed operational runbooks for manual vertical scaling of the PROD Azure Database for PostgreSQL Flexible Server instance should be documented.
- Triggering Metrics: Identify key performance indicators from Azure Monitor (e.g., sustained CPU/memory utilisation > 80-85%, increased query latency, high disk queue depth, low `max_connections` headroom, IOPS/throughput nearing limits) that would initiate a scaling review. Alerts should be configured for these thresholds (see the alert sketch after this list).
- Procedure: The documented procedure will include:
- Impact assessment (potential downtime, even if minimal with near-zero downtime scaling).
- Change management approvals.
- Communication plan.
- Execution steps should be codified in OpenTofu and applied.
- Post-scaling validation (checking performance metrics, application functionality).
- Rollback considerations (e.g., restoring from backup if scaling causes issues, though rare).
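As referenced in the Triggering Metrics item above, an Azure Monitor metric alert on sustained CPU utilisation could be declared along these lines. The threshold, evaluation windows, server-ID variable and action group are illustrative assumptions, not agreed operational values:

```hcl
resource "azurerm_monitor_metric_alert" "pg_cpu_high" {
  name                = "alert-emap-egdb-cpu-high"        # hypothetical
  resource_group_name = var.resource_group_name
  scopes              = [var.egdb_prod_server_id]         # resource ID of the PROD Flexible Server
  description         = "Sustained CPU above 85% on the Enterprise Geodatabase - review vertical scaling."
  severity            = 2
  frequency           = "PT5M"                            # evaluate every 5 minutes
  window_size         = "PT30M"                           # over a 30-minute window

  criteria {
    metric_namespace = "Microsoft.DBforPostgreSQL/flexibleServers"
    metric_name      = "cpu_percent"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 85
  }

  action {
    action_group_id = var.ops_action_group_id             # operations notification group
  }
}
```

Comparable alerts for memory, storage IOPS and connection count can reuse the same pattern with different metric names.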
- Read Replica Strategy (Further Investigation):
- While read replicas will only be implemented in later stages, the Bronze stage should involve further investigation and documentation of how read replicas could be used to offload read-intensive workloads (e.g., heavily used map services, analytical queries) from the primary writable instance. This can significantly improve overall read performance and scalability.
- Performance Monitoring & Optimisation:
- Enhanced monitoring of database performance metrics using Azure Monitor and PostgreSQL native tools (e.g., querying `pg_stat_statements` for slow queries, `pg_stat_activity` for active connections and wait events) is crucial. Baselines established in MVP should be refined under actual load.
- Key Metrics: DTU/vCore utilisation, storage IOPS, active connections, query latency, index hit rates, replication lag (once replicas exist).
- Regular Reviews: Plan for regular reviews of slow queries (using `EXPLAIN ANALYZE` to understand query plans) and indexing strategies as part of operational duties. PostgreSQL's `VACUUM` and `ANALYZE` commands should be understood by the team; autovacuum settings are a good start, but manual intervention might be needed for heavily changing tables.
- Connection Pooling: While ArcGIS Server has its own pooling, if other applications connect directly, or if connection churn is high, the built-in `PgBouncer` offered by Azure Database for PostgreSQL Flexible Server should be considered. `PgBouncer` helps manage large numbers of connections efficiently.
-
Azure Blob Storage, ADLS Gen2, Azure Files: These services are generally highly scalable, but monitoring their performance and throughput is essential, especially with an auto-scaling Application Tier.
- Performance Monitoring & Throughput/IOPS Management:
- Continuously monitor storage capacity, transaction rates (IOPS), throughput (MiB/s) and latency (Average Success E2E Latency, Average Success Server Latency) for these services using Azure Monitor.
- Azure Files (Premium tier for PROD): Premium shares have provisioned IOPS/throughput based on share size. Monitor IOPS and throughput against these limits to ensure they can support the demands of a scaled-out ArcGIS Server VMSS (accessing `config-store`/`system`) and `webgisdr` operations (accessing the staging share). If limits are approached, the share quota (size) may need to be increased to get more IOPS/throughput.
- Azure Blob Storage & ADLS Gen2: Monitor for any throttling events (Azure Storage has scalability targets per storage account for capacity, transaction rate and bandwidth). Increased ArcGIS Server activity (cache generation, geoprocessing outputs, raster data access) could push these limits. If throttling occurs, strategies such as distributing data across multiple storage accounts or using Azure CDN for frequently accessed public blobs should be considered in future optimisations.
- Lifecycle Management Review & Optimisation:
- Review and refine initial lifecycle management policies based on early usage patterns. For example, `webgisdr` backups in Blob Storage might transition from Hot to Cool tier after 30 days, then to Archive after 90 days and be deleted after 1 year. Infrequently accessed raster datasets in ADLS Gen2 or large tile caches in Blob Storage could also benefit from tiering to Cool or Archive.
- Review and refine initial lifecycle management policies based on early usage patterns. For example,
- Blob Index Tags for Management:
- Utilise blob index tags for more granular filtering in lifecycle policies or for cost tracking/categorisation of data within Blob/ADLS Gen2. This is especially useful for large, diverse raster collections.
- Azure Storage Explorer Usage: Document best practices for using Azure Storage Explorer for data management, particularly for GIS Engineers who might need to interact with data in DEV/UAT environments (e.g., uploading test rasters, browsing ADLS Gen2 directory structures).
- No structural changes (e.g., redundancy level changes from ZRS) are planned for these storage services in the Bronze stage for PROD. The focus remains on ensuring the MVP Data Tier setup robustly handles increased load and refining operational procedures.
The primary goal of the Data Tier in the Bronze stage is to ensure it can reliably support the now dynamically scaling Application Tier. This involves diligent monitoring, having clear procedures for scaling the database if it becomes a performance choke point and ensuring that the various Azure Storage services are configured and monitored to handle increased I/O demands efficiently and cost-effectively.
4.4.3 Data Tier Silver Stage
The Silver Stage for the Data Tier significantly enhances the resilience of the Production (PROD) environment. This is achieved by implementing High Availability (HA) configurations for the Azure Database for PostgreSQL instance, leveraging Azure's native HA capabilities to protect against single points of failure and ensure service continuity. This aligns with the HA enhancements made to the Application Tier in Section 4.3.3. For Azure Storage services (Blob, ADLS Gen2, Files), the ZRS configuration established in MVP/Bronze is sufficient to meet the HA requirements of the Silver Stage.
Key Activities and Configurations (PROD - Melbourne):
-
Azure Database for PostgreSQL (Enterprise Geodatabase): To protect the Enterprise Geodatabase from infrastructure failures within the Melbourne region, High Availability should be enabled for the PROD Azure Database for PostgreSQL Flexible Server instance.
- High Availability Configuration:
- Given that the Azure Australia Southeast (Melbourne) region does not currently support Availability Zones for deploying HA pairs across zones for PostgreSQL Flexible Server, the HA configuration will be Same-Zone High Availability.
- Mechanism: With Same-Zone HA, Azure provisions and maintains a warm standby replica in the same Availability Zone as the primary server. Data is synchronously replicated from the primary to the standby replica. While this doesn't protect against a full AZ outage, it does protect against server-level hardware failures or other issues affecting the primary compute instance.
- Automatic Failover: Azure manages automatic failover to the standby replica in the event of an infrastructure failure affecting the primary instance. This process typically completes within 60-120 seconds, aiming to minimise downtime (RTO). The Recovery Point Objective (RPO) is near zero (no data loss) due to synchronous replication.
- Connection Strings: ArcGIS Server and other applications will continue to use the primary server's FQDN. Azure handles the DNS redirection during a failover, so application-level changes to connection strings are not required post-failover. Applications should, however, implement robust connection retry logic to handle transient errors during the failover window.
- Implementation (OpenTofu & CM Tool): OpenTofu scripts will be updated to configure the PostgreSQL Flexible Server for Same-Zone HA (e.g., setting `high_availability.mode` to `SameZone` or the equivalent parameter; see the sketch after this list). No specific changes should be needed for ArcGIS Server connection files (`.sde`) if they use the main server endpoint FQDN.
) if they use the main server endpoint FQDN. - Impact on Performance: Synchronous replication to a standby (even in the same zone) can introduce some write/commit latency compared to a standalone server. This impact is generally minimal for same-zone HA.
- Maintenance: During scheduled maintenance, Azure typically patches the standby server first, then fails over to it and then patches the former primary. This minimises downtime.
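A minimal OpenTofu sketch of the Same-Zone HA setting on the PROD server definition. Names, sizing and variables are placeholders; the remaining arguments mirror the MVP configuration:

```hcl
resource "azurerm_postgresql_flexible_server" "egdb_prod" {
  name                   = "psql-emap-egdb-prod"         # hypothetical
  resource_group_name    = var.resource_group_name
  location               = "australiasoutheast"
  version                = "16"
  sku_name               = "GP_Standard_D4ds_v4"         # example General Purpose size
  storage_mb             = 262144
  backup_retention_days  = 35
  administrator_login    = var.pg_admin_login
  administrator_password = var.pg_admin_password
  delegated_subnet_id    = var.data_subnet_id
  private_dns_zone_id    = var.postgres_private_dns_zone_id

  high_availability {
    mode = "SameZone"   # warm standby in the same AZ with synchronous replication
  }
}
```

Adding the `high_availability` block to an existing server should apply as an in-place update, but it is worth validating in UAT first, as the operation provisions and synchronises the standby before HA becomes active.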
-
Azure Storage (Blob, ADLS Gen2, Files):
- Zone-Redundant Storage (ZRS) Verification: As established in the MVP/Bronze stages, all Azure Storage accounts (Azure Files for `config-store`/`system` and `webgisdr-staging`; Azure Blob Storage for Portal `content`, `arcgiscache`, `jobs`/`outputs` and `webgisdr` final backups; and Azure ADLS Gen2 for the Raster Store) must already be configured with Zone-Redundant Storage (ZRS).
- Azure Files: Premium tier with ZRS for `config-store`, `system` and `webgisdr-staging`.
- Azure Blob Storage & ADLS Gen2: Standard GPv2 tier (or Premium Block Blobs if specific high-IOPS workloads are identified) with ZRS.
- Benefit of ZRS for HA: ZRS synchronously replicates data across three distinct physical locations (Availability Zones, where supported by the underlying storage infrastructure, or across fault domains within a single DC if AZs aren't fully utilised by ZRS). This provides:
- High Data Durability: Protecting against data loss even if an entire data centre (or fault domain) within the region experiences an outage.
- High Availability: Allowing continued read and write access to the data even if one of the locations becomes unavailable, as the storage service automatically fails over to another replica. ZRS is fundamental for achieving robust HA for all storage types used by the new eMap platform.
- Data Protection Features Review:
- Re-verify configurations for Blob soft delete (for blobs and containers) and versioning (if enabled for Blob/ADLS Gen2). Ensure retention periods are appropriate for business RPO/RTO needs in conjunction with `webgisdr` backups and PostgreSQL PITR capabilities.
- Point-in-time restore for block blobs (which also applies to ADLS Gen2 data if enabled) relies on soft delete, versioning and the change feed. While powerful, `webgisdr` remains the primary application-level disaster recovery tool for ArcGIS Enterprise state and PostgreSQL PITR for the database itself.
-
Validation of Data Tier HA: Thorough testing is crucial to validate the HA configurations and ensure they meet the platform's resilience objectives.
- Azure Database for PostgreSQL (Same-Zone HA):
- Simulate Failover: Initiate a user-triggered failover for the PostgreSQL Flexible Server instance via the Azure portal or CLI. This simulates an unexpected failure of the primary compute instance.
- Verify Promotion & Connectivity: Confirm the automatic promotion of the standby replica to primary. Ensure ArcGIS Server and other connected applications can successfully reconnect to the database after the failover event with minimal interruption. Monitor application logs for connection recovery behaviour and any errors.
- Measure RTO: Record the actual time taken for the database service to become fully available on the new primary (Recovery Time Objective). Compare this against the target RTO (e.g., < 120 seconds).
- Data Consistency: Verify data consistency post-failover by performing read/write operations and checking recently committed data.
- Azure Storage (ZRS):
- Directly simulating a full zonal outage for storage services by a user is generally not feasible.
- Architectural Review: Confirm that ZRS is correctly provisioned for all relevant storage accounts in PROD via Azure portal, CLI, or OpenTofu state files.
- Application Resilience: Where possible, ensure applications connecting to Azure Storage (e.g., ArcGIS Server accessing Cloud Stores) implement appropriate retry logic for transient storage errors that might occur during an underlying service failover within ZRS.
- Monitoring: Monitor Azure Service Health dashboard for any Azure-reported zonal issues that might affect storage services in Melbourne.
- Documentation: All HA test plans, execution steps, observed behaviours, measured recovery times and any identified issues must be meticulously documented. This documentation is vital for operational readiness, future reviews and refining DR procedures.
Component | HA Mechanism | Failover Time | Data Replication | Validation Tests |
---|---|---|---|---|
PostgreSQL | Same-Zone HA Pair | 60-120s | Synchronous | Manual failover, RTO/RPO measurement |
Storage Services | ZRS Native | Instant | 3x sync copies | AZ outage simulation |
Azure Files | Premium ZRS | Instant | Sync across zones | Mount persistence tests |
Validation Metrics | Connection recovery time | Data consistency checks | Service health probes | End-to-end service testing |
Table: Silver Stage HA Implementation
By implementing these Silver Stage HA measures, the Data Tier in the PROD Melbourne environment achieves a robust level of intra-region resilience. This safeguards critical enterprise data and ensures service continuity in the face of common infrastructure failures. DEV and UAT environments will continue with their LRS configurations to maintain cost-effectiveness.
4.4.4 Data Tier Gold Stage
The Gold Stage for the Data Tier elevates the platform's resilience by implementing fully automated inter-region Disaster Recovery (DR). This ensures that if the Melbourne data centre experiences a significant outage, services will fail over to Sydney with minimal RTO and data loss. The DR orchestration is initiated by a timer-triggered Azure Function deployed in Sydney, which continuously monitors the health of the Melbourne Web Tier. Upon detecting a sustained failure, this Azure Function kickstarts a sequence of automated actions involving the Configuration Management tool and leveraging infrastructure defined by OpenTofu, leading to the Global Server Load Balancer (GSLB) redirecting traffic to the activated Sydney environment.
graph LR
subgraph MEL["Primary Region (Melbourne)"]
direction TB
PG_Primary["PostgreSQL Primary"] -->|Sync Replication| PG_Standby["PostgreSQL Standby"]
Blob_ZRS["Blob Storage (ZRS)"] -->|Sync| Z1[(Zone 1)] & Z2[(Zone 2)] & Z3[(Zone 3)]
Files_ZRS["Azure Files (ZRS)"] -->|Sync| FZ1[(Zone 1)] & FZ2[(Zone 2)] & FZ3[(Zone 3)]
end
subgraph SYD["DR Region (Sydney)"]
PG_Replica["PostgreSQL Read Replica"] -.->|Async Replication| PG_Primary
Blob_GRS["Blob Storage (GRS)"] -.->|Async| Blob_ZRS
Files_GRS["Azure Files (GRS)"] -.->|Async| Files_ZRS
end
classDef primary fill:#fff8e1,stroke:#f57c00,stroke-width:2px;
classDef dr fill:#e6ffed,stroke:#198754,stroke-width:2px;
classDef storage fill:#e3f2fd,stroke:#0b5ed7,stroke-width:2px;
class MEL primary;
class SYD dr;
class Blob_ZRS,Files_ZRS,Blob_GRS,Files_GRS storage;
Diagram: Silver/Gold Stage Architecture - Illustrates intra-region HA (Silver) and cross-region DR (Gold) configurations

Core Principles for Fully Automated DR:
- Proactive Health Monitoring (Sydney Azure Function): A dedicated, timer-triggered Azure Function running in Sydney continuously probes the health of the Melbourne Web Adaptors. This function acts as the primary sentinel for DR initiation.
- Automated Orchestration (Configuration Management Tool): Once a DR event is declared by the Azure Function, a pre-defined DR script, executed by the Configuration Management tool orchestrates the sequence of failover operations for Data Tier and Application Tier components.
- Infrastructure as Code for DR Readiness (OpenTofu): OpenTofu defines all Data Tier resources in both Melbourne (primary) and Sydney (DR). This includes replication configurations (PostgreSQL read replicas, GRS storage) and ensures Sydney's infrastructure is a "hot standby," ready for automated activation and scaling.
- Seamless Traffic Redirection (GSLB): The Global Server Load Balancer automatically reroutes user traffic to the Sydney endpoints once they become healthy and active post-failover, minimising downtime.
graph TB
subgraph DR_Detection ["1: Monitoring & DR Declaration"]
direction TB
A_Func["Azure Function (Sydney)"] -- "Continuously Probes Health" --> B_MelbWeb["Web Tier Health Endpoints"]
B_MelbWeb -- "Sustained Failure Detected" --> A_Func
A_Func -- "Declares DR Event" --> C_TriggerCM["Trigger CM Tool"]
end
subgraph DR_Orchestration ["2: Automated DR"]
direction TB
C_TriggerCM --> D_Start((Start))
subgraph DataTierFailover ["Data Tier Failover"]
direction TB
D_Start --> E_StorageFailover["Storage Account Failover"]
E_StorageFailover --> F_PGSQLPromote["Promote PostgreSQL"]
end
subgraph AppTierActivation ["Application Tier Activation"]
direction TB
F_PGSQLPromote --> G_AppTierActivate["Activate Sydney App Tier"]
G_AppTierActivate --> H_WebGISDR["Restore webgisdr"]
end
H_WebGISDR --> I_ValidateServices["Service Validation"]
I_ValidateServices --> J_OrchComplete((Complete))
end
subgraph Traffic_Shift ["3: Traffic Redirection"]
direction TB
J_OrchComplete --> K_GSLB["GSLB Update"]
K_GSLB --> L_SydneyActive["Sydney Web Tier Active"]
L_SydneyActive --> M_Users["End Users"]
end
end
classDef detection fill:#f0f4f8,stroke:#4a90e2;
classDef process fill:#ffe8e8,stroke:#d32f2f;
classDef traffic fill:#e8f5e9,stroke:#2e7d32;
classDef event fill:#f3e5f5,stroke:#6a1b9a;
classDef storage fill:#e1f5fe,stroke:#0288d1;
classDef database fill:#e8f5e9,stroke:#1b5e20;
classDef startend fill:#f5f5f5,stroke:#616161;
class DR_Detection detection;
class DR_Orchestration process;
class Traffic_Shift traffic;
class A_Func,B_MelbWeb,C_TriggerCM event;
class E_StorageFailover storage;
class F_PGSQLPromote database;
class G_AppTierActivate,H_WebGISDR,I_ValidateServices process;
class K_GSLB,L_SydneyActive traffic;
class D_Start,J_OrchComplete startend;
Diagram: Gold Stage Automated Disaster Recovery Process Flow
Key Activities and Detailed Automated DR Process:
- DR Trigger: Sydney-Based Azure Function for Melbourne Health Monitoring:
- Deployment & Purpose: An Azure Function is deployed in Sydney. Its sole purpose is to monitor the health of the Melbourne ArcGIS Web Adaptor endpoints (`https://<adc-melbourne-vip>/portal/webadaptor/rest/info/health` and `/server/webadaptor/rest/info/health`).
- Timer Trigger: The function is configured with a timer trigger, executing every minute.
- Health Check & Retry Logic:
- The function makes HTTPS requests to the Melbourne Web Adaptor health endpoints.
- If a health check fails (e.g., timeout, non-200 HTTP status), it initiates a retry sequence: attempt `n` additional checks (e.g., `n=10`) with short intervals (e.g., 6 seconds apart).
- If all `n` retries within a cycle fail, the function increments a persistent failure counter (e.g., stored in an Azure Table Storage instance in Sydney).
- If the health check cycle is successful, the failure counter is reset.
- DR Event Declaration: If the persistent failure counter reaches a predefined threshold `z` (e.g., `z=5` consecutive 1-minute cycles of failed health checks, implying approximately 5-7 minutes of confirmed unresponsiveness after initial retries), the Azure Function declares a DR event.
- Security: The Azure Function uses a System-Assigned Managed Identity. This identity is granted the necessary permissions to:
- Make outbound HTTPS requests to the Melbourne endpoints.
- Write to its state/counter store (e.g., Azure Table Storage).
- Securely trigger the DR script managed by the Configuration Management tool (e.g., by calling a webhook or invoking a GitHub Actions pipeline). This is the primary action upon DR declaration.
-
DR Orchestration: Configuration Management Tool Takes Over: Upon being triggered by the Sydney Azure Function, the Configuration Management (CM) tool executes the DR script. This playbook automates the failover of the Data Tier and coordinates with the Application Tier activation.
-
Phase 1: Initial Notification & Data Tier Failover (CM Tool)
- Logging & Alerting: The CM playbook immediately logs the DR initiation and sends critical alerts to operations teams.
- Azure Storage Account Failover:
- The CM tool executes Azure CLI/API commands to initiate storage account failover for all critical PROD GRS Storage Accounts (Azure Files for `config-store`/`system`/`webgisdr-staging`; Azure Blob for Portal `content`/`arcgiscache`/`jobs`/`outputs`/`webgisdr` backups; ADLS Gen2 for the Raster Store).
- This makes the Sydney storage endpoints primary and writable. The CM tool should check the `Last Sync Time` of each storage account before failover and log this information for RPO assessment.
- Azure Database for PostgreSQL Replica Promotion:
- The CM tool executes Azure CLI/API commands to promote the Sydney PostgreSQL read replica to become a standalone, writable primary server. This breaks replication from Melbourne.
-
Phase 2: Application Tier Activation & Reconfiguration (CM Tool & OpenTofu)
- Activate/Scale Sydney Application Tier: The CM tool orchestrates the bring-up of the Application Tier in Sydney. This involves:
- Starting the VMs and/or scaling up VMSS instances to production capacity using Azure CLI/API calls. OpenTofu defines the target state and the CM tool ensures resources reach it.
- Apply CM Configurations: Once Sydney Application Tier VMs/VMSS are active, the CM tool runs its standard scripts on them to:
- Ensure all ArcGIS Enterprise software is correctly configured.
- Update ArcGIS Server connection files (`.sde`) and any other application configurations to point to the newly promoted Sydney PostgreSQL FQDN and the failed-over Sydney Storage Account endpoints.
- Phase 3: Automated `webgisdr` Restoration (CM Tool)
- With the Sydney Data Tier and Application Tier infrastructure active and reconfigured, the CM tool (orchestrating on the Sydney Portal VM) automates the `webgisdr --import` process.
- The latest `webgisdr` backup file is retrieved from the GRS Azure Blob Storage container (now primary in Sydney).
- The `SHARED_LOCATION` uses the Azure Files share for staging.
- A dynamically generated `webgisdr.properties` file (with DR-specific paths/credentials from Azure App Configuration/Key Vault) is used for the import.
-
Phase 4: Service Validation & GSLB Traffic Shift (CM Tool & GSLB)
- Automated Health Checks: The CM tool performs automated health checks on the key ArcGIS Enterprise services now running in Sydney.
- GSLB Redirection: The GSLB, continuously probing the health of regional endpoints (specifically the Web Tier in Sydney), will automatically detect that Sydney endpoints are healthy and Melbourne's are not. It then reroutes all user traffic to the Sydney Web Tier.
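If Azure Traffic Manager fulfils the GSLB role (the automation table at the end of this section lists the Traffic Manager API for the traffic shift), the priority routing and health probing could be declared roughly as follows. The profile name, endpoint targets and probe settings are assumptions, not the platform's confirmed GSLB product or values:

```hcl
resource "azurerm_traffic_manager_profile" "emap" {
  name                   = "tm-emap-prod"                 # hypothetical
  resource_group_name    = var.resource_group_name
  traffic_routing_method = "Priority"

  dns_config {
    relative_name = "emap-prod"
    ttl           = 60                                    # short TTL speeds up DR redirection
  }

  monitor_config {
    protocol                     = "HTTPS"
    port                         = 443
    path                         = "/portal/webadaptor/rest/info/health"
    interval_in_seconds          = 30
    timeout_in_seconds           = 10
    tolerated_number_of_failures = 3
  }
}

resource "azurerm_traffic_manager_external_endpoint" "melbourne" {
  name       = "melbourne-primary"
  profile_id = azurerm_traffic_manager_profile.emap.id
  target     = var.melbourne_web_fqdn
  priority   = 1
}

resource "azurerm_traffic_manager_external_endpoint" "sydney" {
  name       = "sydney-dr"
  profile_id = azurerm_traffic_manager_profile.emap.id
  target     = var.sydney_web_fqdn
  priority   = 2                                          # used only when Melbourne is unhealthy
}
```

With priority routing, Sydney receives no traffic while the Melbourne endpoint passes its probes; the short DNS TTL bounds how long clients keep resolving to Melbourne after failover.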
-
Role of OpenTofu in Fully Automated DR:
- Defines DR Infrastructure State: OpenTofu is responsible for defining the entire infrastructure in Sydney required for DR. This includes:
- The Azure Function App in Sydney and its dependent resources (e.g., storage for state).
- The PostgreSQL read replica configuration in Sydney, ready for promotion.
- All Azure Storage accounts configured with GRS for data replication.
- Application Tier resources in Sydney (VMs, VMSS, App Services for Web Adaptors), defined in a scaled-down "hot standby" state to reduce costs but allowing for rapid scaling/activation by the CM tool.
- Ensures DR Site Readiness: OpenTofu ensures the Sydney site is correctly provisioned before any DR event, allowing the automated failover processes to operate on a known, consistent infrastructure base. It is not ideal for the dynamic failover commands during the event (which are imperative actions better suited for scripting/CM tools), but rather to define the end-state infrastructure that the CM tool activates or scales.
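A hedged sketch of the DR-readiness pieces OpenTofu would define: the Sydney read replica and a geo-redundant storage account. Names and references are placeholders; the replica points at the PROD server resource sketched for the Silver stage:

```hcl
# Cross-region read replica in Sydney, kept ready for promotion during DR.
resource "azurerm_postgresql_flexible_server" "egdb_sydney_replica" {
  name                = "psql-emap-egdb-syd"              # hypothetical
  resource_group_name = var.dr_resource_group_name
  location            = "australiaeast"                   # Sydney
  create_mode         = "Replica"
  source_server_id    = azurerm_postgresql_flexible_server.egdb_prod.id
}

# Geo-redundant storage asynchronously replicates Melbourne data to the paired region.
resource "azurerm_storage_account" "blob_prod_gold" {
  name                     = "stemapblobprod"             # hypothetical
  resource_group_name      = var.resource_group_name
  location                 = "australiasoutheast"
  account_kind             = "StorageV2"
  account_tier             = "Standard"
  account_replication_type = "GRS"                        # or "GZRS" where zone redundancy is also required
  min_tls_version          = "TLS1_2"
}
```

Promoting the replica and failing over the storage accounts remain imperative CLI/API actions executed by the CM tool during the event, consistent with the division of responsibilities described above.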
-
Data Replication Mechanisms (Unchanged from previous plan, but critical for automation):
- Azure Database for PostgreSQL: Asynchronous physical streaming replication from Melbourne primary to the Sydney read replica. Geo-redundant backups provide a secondary DR data source.
- Azure Storage (Blob, ADLS Gen2, Files): GRS or GZRS for asynchronous replication of data from Melbourne to Sydney.
-
RPO/RTO Considerations with Fully Automated DR:
- RPO: Remains dependent on the asynchronous replication lag:
- PostgreSQL: `Replica Lag in Seconds`, typically less than a minute.
- Azure Storage GRS: `Last Sync Time` (typically <15 minutes, but variable).
- RTO: Significantly minimised due to full automation. The total time includes:
- Detection time by the Azure Function (e.g., 1-minute check interval * `z` failure cycles + `n` retries per cycle).
- Azure Function processing and CM tool pipeline trigger latency.
- Execution time for the CM tool script: storage account failovers, PostgreSQL promotion, Application Tier activation/scaling and `webgisdr` import.
- GSLB health probe interval and DNS update time for traffic redirection. The target RTO should be clearly defined and DR drills will validate it. The duration of the `webgisdr` import will be a significant factor. If `webgisdr` becomes a barrier to reaching the desired RTO, alternative backup strategies such as Azure Backup Service should be investigated.
-
Failback Strategy (Sydney to Melbourne): Failback is a highly complex undertaking and typically remains a more controlled, semi-automated process.
- Disable Sydney DR Trigger: The Sydney Azure Function must be disabled or its logic altered to prevent re-triggering DR back to Sydney during failback.
- Melbourne Restoration: OpenTofu ensures Melbourne infrastructure is fully restored or re-provisioned to a clean state.
- Data Resynchronization (Critical & Complex):
- PostgreSQL: Establish replication from Sydney (acting primary) back to a new or restored Melbourne instance (acting as a new replica). Once synced, a planned failover (with downtime) is performed.
- Azure Storage: Re-initiate geo-replication for GRS accounts from Sydney back to Melbourne. This "re-protect" operation synchronizes changes made in Sydney back to Melbourne.
- Application State (`webgisdr`): A fresh `webgisdr` export from Sydney would be taken and restored onto the Melbourne environment after data stores are synced and Melbourne is ready to become primary.
export from Sydney would be taken and restored onto the Melbourne environment after data stores are synced and Melbourne is ready to become primary. - GSLB Reconfiguration: Update GSLB to prioritize Melbourne again. This process requires planning and thorough testing due to the risks of data divergence.
-
Testing and Validation of Fully Automated DR:
- Mandatory Drills: Regular, comprehensive DR drills are non-negotiable. These drills must test the entire automated sequence:
- Simulate Melbourne Web Tier unavailability to trigger the Sydney Azure Function.
- Verify the Azure Function correctly declares a DR event and triggers the CM tool.
- Validate the CM tool's successful orchestration of Data Tier failovers (PostgreSQL promotion, Storage Account failovers).
- Confirm Application Tier activation and reconfiguration in Sydney.
- Verify successful `webgisdr` import.
- Confirm GSLB traffic redirection to Sydney.
- Measure RPO/RTO: Accurately measure data loss (RPO) against `Last Sync Times` and actual downtime (RTO) during drills.
- Iterative Refinement: Use learnings from DR drills to continuously refine the Azure Function logic, CM scripts, OpenTofu configurations and DR runbook documentation.
Phase | Components Involved | Automation Tools | Key Metrics |
---|---|---|---|
DR Declaration | Azure Function, Key Vault | Timer triggers + Retry logic | Health check failures |
Storage Failover | GRS Storage Accounts | Azure CLI/PowerShell | LastSyncTime validation |
PostgreSQL Promotion | Sydney Read Replica | Azure Database CLI | Replica lag <60s |
App Tier Activation | VMSS, Web Adaptors | OpenTofu + CM Tool | Instance health status |
webgisdr Restoration | Blob Storage, Azure Files | Python Automation | Backup file versioning |
Traffic Shift | GSLB, DNS | Traffic Manager API | Endpoint response times |
Table: Gold Stage DR Automation Process
By implementing this fully automated Gold Stage Data Tier DR strategy, the new eMap platform achieves a high degree of resilience. This minimises human intervention during a disaster, significantly reduces RTO and ensures business continuity for critical geospatial services.