4.2: Application State Replication and Disaster Recovery Strategy
✂️ Tl;dr 🥷
Discusses replicating and recovering the stateful components of ArcGIS Enterprise (Portal, Server, Data Store) within an automated DevOps framework. Esri’s `webgisdr` utility is central to capturing and restoring dynamic application state, including configurations, user data and service definitions. Infrastructure provisioning via OpenTofu and baseline software installation via configuration management tools are decoupled from state replication, enabling immutable infrastructure principles. Automated backups integrate with Azure services: credentials are securely managed via Key Vault, configuration parameters via App Configuration and backups stored in geo-redundant Azure Blob Storage. Disaster recovery orchestration rebuilds infrastructure in a secondary region using IaC, then restores the latest application state via `webgisdr`. The process emphasises end-to-end automation, zero-trust security practices and alignment with RPO/RTO targets through scheduled backups and rigorous testing. A conceptual Python script demonstrates secure credential handling, dynamic configuration generation and integration with Azure’s SDKs.
This section outlines the approach for replicating and recovering the core ArcGIS Enterprise application state. This encompasses the stateful components: Portal for ArcGIS, ArcGIS Server and the ArcGIS Data Store. A central element of this strategy is the utilisation of Esri's `webgisdr` utility. Given this architecture's adherence to a fully automated, DevOps and Infrastructure as Code (IaC) approach, it is crucial to detail how `webgisdr`, a tool that produces a single backup file representing the application's state, integrates with these modern principles.
4.2.1 `webgisdr` in an Automated DevOps Setup
The new eMap platform is centred on a cloud-native architecture in which infrastructure is immutable and defined by code, and system configurations are applied consistently via the designated Configuration Management (CM) tool.
- Infrastructure Provisioning (IaC - OpenTofu): OpenTofu is exclusively responsible for the provisioning of all Azure infrastructure resources. This includes Virtual Machines (VMs) as well as Azure PaaS services.
- Base Software Installation and Configuration (CM Tool): The designated Configuration Management tool will be responsible for installing and configuring the ArcGIS Enterprise software components (Portal for ArcGIS, ArcGIS Server, ArcGIS Data Store) to a baseline, operational state on the infrastructure provisioned by OpenTofu. This includes applying licences, defining initial primary site administrator (PSA) accounts and performing basic component registrations.
- Application State Management (`webgisdr`): As a stateful application, ArcGIS Enterprise doesn't support many of the modern cloud-native replication and DR patterns. To bridge this gap, Esri provides `webgisdr`, a utility for managing ArcGIS Enterprise's application state. It is designed to capture and restore the dynamic content, configurations and inter-component relationships that constitute the operational state of an ArcGIS Enterprise deployment. This state information includes:
    - Portal for ArcGIS items (e.g., maps, applications, layers), users, groups and organisational settings.
    - ArcGIS Server service definitions, site configurations and security settings.
    - ArcGIS Data Store content (e.g., hosted feature layers, specific tile caches managed by the Data Store).
    - Federation relationships between Portal for ArcGIS and ArcGIS Server sites, including the hosting server designation.
Treating the `webgisdr` utility as the designated tool for application state backup and recovery, distinct from infrastructure provisioning and base software installation, allows a clean and effective integration with automated workflows.
4.2.2 Automation of Backup Processes
For robust High Availability (Silver Stage) and Disaster Recovery (Gold Stage), regular and automated backups of the ArcGIS Enterprise application state are paramount. The `webgisdr` utility must be fully integrated into these automated processes, ensuring consistent and reliable capture of the application state.
The key components for automating `webgisdr` backups are:
- Trigger Mechanism:
    - Backup operations MUST be initiated automatically based on a predefined schedule. The recommended pattern for the new eMap platform involves scheduled triggers within GitHub Actions CI/CD pipelines. These triggers will invoke the Configuration Management tool to execute a script on the primary Portal for ArcGIS VM.
    - Alternative mechanisms, such as direct cron jobs or `systemd` timers on the Portal VM, or Azure Automation Runbooks, are possible but not recommended in this architecture.
- Secure Credential Management:
    - The `webgisdr` utility requires Portal for ArcGIS primary site administrator (PSA) credentials.
    - In accordance with security best practices and the Zero Trust Security Model, these credentials (username and password) MUST be stored securely as secrets within Azure Key Vault. Automation scripts, executed by the Configuration Management tool, MUST retrieve these credentials at runtime using the Managed Identity of the Portal VM. Hardcoding credentials is strictly prohibited.
- `webgisdr.properties` File Management:
    - The `webgisdr.properties` file content MUST be created dynamically at runtime using Azure App Configuration and Azure Key Vault, rather than relying on static template files on the VM.
    - Azure App Configuration: Non-sensitive parameters for `webgisdr` (e.g., `PORTAL_ADMIN_URL`, `SHARED_LOCATION`, `AZURE_STORAGE_ACCOUNT_NAME`, `AZURE_BLOB_CONTAINER_NAME`, boolean flags such as `RESTORE_RELATIONAL_DATA`) should be stored in an Azure App Configuration store. Environment-specific values should be managed using labels within App Configuration (e.g., for DEV, UAT, PROD-Melbourne).
    - Azure Key Vault Integration: Sensitive values, specifically `PORTAL_ADMIN_USERNAME` and `PORTAL_ADMIN_PASSWORD`, MUST be stored in Azure Key Vault. Azure App Configuration will store references to these Key Vault secrets. The Managed Identity of the Portal VM (acting on behalf of the CM tool) requires permissions to read from both App Configuration and the referenced Key Vault secrets.
    - Dynamic Generation: The automation script executed by the CM tool will:
        - Authenticate to Azure App Configuration using the VM's Managed Identity.
        - Fetch all necessary configuration parameters, resolving any Key Vault references against Key Vault to retrieve the actual secret values.
        - Construct the content of the `webgisdr.properties` file in memory.
        - Securely write this dynamically generated content to a temporary file in a restricted location on the VM (e.g., `/tmp/webgisdr_runtime.properties` on Linux, with permissions set to be readable only by the execution context).
        - This temporary properties file MUST be deleted immediately after the `webgisdr` command execution completes, regardless of success or failure (e.g., within a `finally` block in the script).

    This approach centralises configuration, enhances security by minimising the on-disk presence of sensitive information and aligns with immutable infrastructure principles.
- Backup Locations (`SHARED_LOCATION` and `BACKUP_STORE_PROVIDER`/`BACKUP_LOCATION`):
    - `SHARED_LOCATION`: As defined in App Configuration, this parameter defines a temporary staging area on a file system accessible by the machine executing the `webgisdr` utility (typically the active Portal for ArcGIS VM). An Azure Files share, mounted to the Portal VM, serves as an appropriate and resilient choice for this staging location.
        - For the PROD environment, this Azure Files share should be configured with Zone-Redundant Storage (ZRS). This ensures intra-region resilience for the staging area.
    - `BACKUP_STORE_PROVIDER`: should be set to `AzureBlob` (retrieved from App Configuration).
    - `AZURE_STORAGE_ACCOUNT_NAME`/`AZURE_BLOB_CONTAINER_NAME`: These specify the Azure Blob Storage container where the final `.webgissite` backup file will be stored, with values sourced from App Configuration. For Disaster Recovery purposes (Gold Stage), the Azure Storage Account hosting this container for the PROD environment should be configured with Geo-Redundant Storage (GRS). This ensures backups are asynchronously replicated from the Melbourne region to the designated DR region (Sydney). Container and storage account names should be parameterised per environment via App Configuration.
- Execution and Output:
    - The automation script will invoke the `webgisdr` command (e.g., `webgisdr --export --file /path/to/temporary_properties_file`).
    - The utility creates the backup archive in the `SHARED_LOCATION` (the Azure Files share) and subsequently uploads it automatically to the specified Azure Blob Storage container when `BACKUP_STORE_PROVIDER=AzureBlob` is configured.
- Post-Backup Operations (Logging and Cleanup):
    - The automation script MUST comprehensively log the success or failure of each backup operation and integrate with Azure Monitor.
    - The script should implement logic for cleaning up older backup files from the local `SHARED_LOCATION` (Azure Files share) to manage staging space.
    - The temporary `webgisdr.properties` file MUST be deleted.
    - Retention of backup files within Azure Blob Storage MUST be managed using Azure Storage lifecycle management policies.
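The dynamic-generation and cleanup requirements above can be sketched with the Python standard library alone. This is a minimal illustration, not the platform's actual script: the function names `render_properties` and `run_with_temp_properties` are hypothetical, values are written verbatim without any properties-file escaping, and the settings are assumed to have already been fetched from App Configuration and Key Vault.

```python
import os
import subprocess
import tempfile
from pathlib import Path


def render_properties(settings: dict[str, str]) -> str:
    """Render KEY=VALUE lines in the style of a webgisdr.properties file."""
    return "".join(f"{key}={value}\n" for key, value in settings.items())


def run_with_temp_properties(settings: dict[str, str], command: list[str]) -> int:
    """Write settings to a temporary file, run the command, always delete the file.

    The path of the temporary file is appended as the last argument of
    `command`, and the command's exit code is returned.
    """
    # mkstemp creates the file readable and writable only by the owner on POSIX.
    fd, path = tempfile.mkstemp(prefix="webgisdr_", suffix=".properties")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(render_properties(settings))
        completed = subprocess.run([*command, path], check=False)
        return completed.returncode
    finally:
        # Runs whether the command succeeded, failed or raised.
        Path(path).unlink(missing_ok=True)
```

A backup run would then look like `run_with_temp_properties(settings, ["webgisdr", "--export", "--file"])`, with the non-zero exit code handed to whatever logging and alerting the pipeline uses.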
4.2.3 Automated Disaster Recovery Orchestration
In a Disaster Recovery scenario, the primary objective is to restore service in the secondary Azure region (Sydney) with minimal downtime and data loss. The `webgisdr` utility is integral to restoring the ArcGIS Enterprise application state onto infrastructure that has been rebuilt by IaC and CM processes.
The DR restoration process is orchestrated in phases:
- Phase 1: Infrastructure and Base Software Provisioning (IaC and CM Driven)
    - Trigger: This phase is initiated either by a manual declaration of a disaster or by an automated trigger from the Monitoring & Observability framework (Section 3.7) detecting a severe and prolonged outage in the primary region (Melbourne).
    - OpenTofu Execution: IaC scripts (OpenTofu) are executed to provision all necessary Azure resources in the DR region (Sydney), establishing the "pilot light" infrastructure. This encompasses VMs for Portal for ArcGIS, ArcGIS Server and ArcGIS Data Store; Azure App Services for Web Adaptors; networking components; storage accounts (which would already contain replicated data via GRS/GZRS if used for `webgisdr` backups and other shared storage); and the Azure Database for PostgreSQL instance (which is failed over from the primary region's replica).
    - Configuration Management Tool Execution: The CM tool runs on the newly provisioned VMs in Sydney to:
        - Apply OS hardening configurations (Ubuntu 24.04 LTS).
        - Install the ArcGIS Enterprise software components (Portal, Server, Data Store) to a "clean" or "default site" state.
        - Configure the ArcGIS Web Adaptors on the App Service instances to point to these new, unconfigured backend components.
    - Outcome: At the end of this phase, a functional, but essentially empty and unconfigured, ArcGIS Enterprise deployment is operational in the DR region (Sydney).
- Phase 2: Application State Restore (Scripted `webgisdr` Import)
    - This phase is orchestrated by a dedicated DR automation script, managed within the CI/CD pipeline.
    - Retrieve DR Configuration: The script securely fetches all necessary configuration parameters (including PSA credentials for the newly created, clean Portal instance in the DR region, `SHARED_LOCATION` for DR, and the target `AZURE_STORAGE_ACCOUNT_NAME` and `AZURE_BLOB_CONTAINER_NAME` where replicated backups reside) from Azure App Configuration (using a DR-specific label, e.g., "PROD-Sydney") and Azure Key Vault, similar to the backup process.
    - Access Backup File:
        - The script identifies the latest valid `.webgissite` backup file from the geo-replicated Azure Blob Storage container (the PROD backup storage account configured with GRS/GZRS).
        - The chosen backup file is downloaded from Azure Blob Storage to the `SHARED_LOCATION` (e.g., a mounted ZRS Azure Files share) accessible by the Portal VM in the DR region.
    - Prepare DR `webgisdr.properties` File: Using the configuration retrieved from App Configuration and Key Vault, the script dynamically generates a temporary `webgisdr.properties` file tailored for the DR environment and the specific backup file. This temporary file is securely written to the DR Portal VM and deleted post-execution.
    - Execute `webgisdr` Import: The script invokes the `webgisdr --import --file /path/to/temporary_dr_webgisdr.properties` command on the Portal VM in the DR region.
    - Outcome: The complete ArcGIS Enterprise application state (including Portal items, users, groups, Server services, ArcGIS Data Store content and federation settings) is restored onto the newly provisioned DR infrastructure.
- Phase 3: Post-Restore Finalisation (Scripted and Manual Steps)
    - DNS/GSLB Update: As detailed in Section 4.1.4, automated scripts update DNS records or the Global Server Load Balancer (GSLB) configuration to redirect user traffic to the now-active DR environment in Sydney.
    - Validation: Automated smoke tests and validation scripts are executed to confirm the health and functionality of the restored services.
    - Notifications: Relevant stakeholders are alerted that the DR failover process is complete and services are operational from the Sydney region.
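The "identify the latest valid backup" step in Phase 2 can be illustrated as a small selection function. This is a sketch under stated assumptions: `BackupBlob` is a hypothetical stand-in for a blob listing entry (in practice the entries would come from the Azure Blob Storage SDK, whose listing results expose a name, a last-modified timestamp and a size), and "valid" is assumed here to mean only "has the `.webgissite` extension and is not smaller than a minimum size". A real check might be stricter.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class BackupBlob:
    """Hypothetical stand-in for one entry of a blob container listing."""
    name: str
    last_modified: datetime
    size: int


def latest_backup(blobs: list[BackupBlob], min_size: int = 1) -> BackupBlob:
    """Return the most recently modified .webgissite blob of at least min_size bytes."""
    candidates = [
        blob for blob in blobs
        if blob.name.endswith(".webgissite") and blob.size >= min_size
    ]
    if not candidates:
        # Failing loudly is safer than restoring from an unexpected file.
        raise LookupError("no usable .webgissite backup found")
    return max(candidates, key=lambda blob: blob.last_modified)
```

Raising when no candidate exists lets the DR pipeline stop before Phase 2 attempts an import with nothing to restore.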
The `webgisdr.properties` file for a DR import will have its parameters (such as DR Portal URL, DR PSA credentials and the specific `BACKUP_FILE_NAME` to restore) dynamically sourced from Azure App Configuration (with a DR label) and Azure Key Vault by the DR automation script.
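Because the credentials are stored in App Configuration only as Key Vault references, the script has to turn each reference into a secret lookup. A Key Vault reference value is a small JSON document holding a secret URI of the form `https://<vault>/secrets/<name>[/<version>]`. The sketch below parses that value with the standard library; `SecretRef`, `parse_key_vault_reference` and `resolve` are hypothetical names, and the actual Key Vault call is injected as a `fetch` callable so the example stays self-contained.

```python
import json
from typing import Callable, NamedTuple, Optional
from urllib.parse import urlparse


class SecretRef(NamedTuple):
    vault_url: str
    name: str
    version: Optional[str]


def parse_key_vault_reference(value: str) -> SecretRef:
    """Parse a Key Vault reference value such as {"uri": "https://.../secrets/NAME"}."""
    uri = json.loads(value)["uri"]
    parsed = urlparse(uri)
    parts = [part for part in parsed.path.split("/") if part]
    # Expected path: /secrets/SECRET_NAME[/VERSION]
    if parsed.scheme != "https" or len(parts) not in (2, 3) or parts[0] != "secrets":
        raise ValueError(f"not a Key Vault secret URI: {uri}")
    version = parts[2] if len(parts) == 3 else None
    return SecretRef(f"https://{parsed.netloc}", parts[1], version)


def resolve(value: str, fetch: Callable[[SecretRef], str]) -> str:
    """Resolve a reference by delegating the lookup to `fetch`."""
    return fetch(parse_key_vault_reference(value))
```

In a real run, `fetch` would presumably wrap the Key Vault secrets client authenticated with the VM's Managed Identity; that wiring is left out here because it depends on the Azure SDK being installed.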
- Recovery Point Objective (RPO) and Recovery Time Objective (RTO) Considerations:
    - The RPO for content managed by `webgisdr` (such as Portal items and ArcGIS Data Store content) is directly determined by the frequency of the automated backup operations. Frequent, scheduled backups to geo-replicated Azure Blob Storage help minimise potential data loss.
    - The RTO related to `webgisdr` restoration can be a significant component of the overall DR RTO. However, by fully automating the preceding IaC (OpenTofu) and CM steps for infrastructure rebuild and base software installation, and by automating the `webgisdr` import process itself, the overall DR RTO can be significantly reduced and made more predictable. The "pilot light" DR infrastructure strategy, where minimal resources are pre-provisioned in the DR region (Sydney) and scaled upon failover, further reduces the time spent on infrastructure provisioning during a DR event.
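The link between backup frequency and RPO can be made concrete with a little arithmetic. The sketch below assumes that a backup reflects the state at roughly the moment it starts and that only a backup which has finished and been replicated to the DR region is usable. Under those assumptions the worst-case data-loss window is the backup interval plus the backup duration plus the replication lag. The function names and all figures in the usage note are illustrative, not values from this design.

```python
from datetime import timedelta


def worst_case_rpo(
    interval: timedelta,
    backup_duration: timedelta,
    replication_lag: timedelta = timedelta(0),
) -> timedelta:
    """Worst-case data-loss window for periodic backups.

    If disaster strikes just before the newest backup becomes usable in
    the DR region, recovery falls back to the previous backup, which
    started one full interval earlier.
    """
    return interval + backup_duration + replication_lag


def max_interval_for_target(
    target_rpo: timedelta,
    backup_duration: timedelta,
    replication_lag: timedelta = timedelta(0),
) -> timedelta:
    """Longest backup interval that still meets the target RPO."""
    interval = target_rpo - backup_duration - replication_lag
    if interval <= timedelta(0):
        raise ValueError("target RPO cannot be met with these durations")
    return interval
```

For example, backups every 6 hours that take 45 minutes and replicate within 15 minutes give a worst case of 7 hours, while a 4-hour target under the same durations would call for a backup at least every 3 hours.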
Key to Success: Comprehensive Automation and Rigorous Testing
The successful integration of the `webgisdr` utility into a modern DevOps operational model is contingent upon comprehensive automation of both backup and restore processes, leveraging cloud-native configuration management as detailed. Equally crucial is regular, rigorous testing of Disaster Recovery procedures (Gold Stage). This includes end-to-end testing of the full orchestration, from infrastructure rebuild in the DR region to application state restoration and final service validation via the GSLB.
Defining these distinct roles and leveraging the appropriate tools and services for each aspect of the platform establishes a comprehensive, automated and efficient High Availability and Disaster Recovery strategy for the new eMap platform: IaC/CM for infrastructure and base software; `webgisdr` for ArcGIS application state (with configuration managed via Azure App Configuration and Key Vault, and backups to GRS/GZRS Blob Storage for PROD); and Azure-native replication for user-managed PaaS data stores.
4.2.4 Conceptual Backup Implementation
The following code is a conceptual implementation of the strategy discussed in this section, showing patterns to run `webgisdr` via a Python script.
The script shows integration with Azure App Configuration and Key Vault using the Azure SDK and Managed Identities, configuration validation using `Pydantic` and asynchronous operation.
Listing: `webgisdr_backup.py` (295 lines; listing body not included). The annotations attached to the listing are:
- Using `asyncio` for concurrent execution of I/O-bound tasks.
- Pydantic models (`ScriptArguments`, `WebGISDRProperties`) are used to define the expected structure, types and validation rules.
- Pydantic is used to define the expected schema, data types (e.g., `HttpUrl`, `FilePath`, `SecretStr`) and validation rules.
- Maps user-friendly internal configuration names to the specific key names used in the Azure App Configuration store.
- Using the `SecretStr` type from Pydantic for sensitive values.
- Configuration settings in Azure App Configuration are references to secrets stored in Azure Key Vault. The script resolves these references at runtime, fetching the actual secret value directly from Key Vault.
- `webgisdr` is a synchronous, potentially long-running process. To avoid blocking the asyncio event loop, `subprocess.run` is executed in a separate thread using `asyncio.to_thread`.
- Primary entry point; orchestrates the entire backup workflow: parsing arguments, fetching configuration, creating the properties file, running the `webgisdr` tool and cleaning up.
- Logic for failure notifications (e.g., sending alerts to Azure Monitor) can go here.
- Ensures that critical cleanup operations are performed regardless of success or failure.
- Add other webgisdr properties here.
- Avoid duplicate logs if root logger is also configured.
- /secrets/SECRET_NAME[/VERSION] -> SECRET_NAME.
- Use asyncio.TaskGroup for concurrent fetching.
- Validate and structure the configuration using Pydantic.
- In a real system, catch Pydantic ValidationError specifically.
- Access SecretStr value securely.
- Ensure executable.
- Add execute for user, group, other.
- Log stdout/stderr regardless of success for better diagnostics.
- webgisdr often outputs progress to stderr, so log as info unless error code.
- Raise an error to be caught by the main try/except.
- Already logged, just re-raise
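The annotation about running `webgisdr` off the event loop describes a pattern that can be shown in isolation. The following is a minimal sketch, not an excerpt from `webgisdr_backup.py`: the helper names `run_tool` and `run_tool_async` are hypothetical, and a short Python one-liner stands in for the real `webgisdr` command line.

```python
import asyncio
import subprocess
import sys


def run_tool(command: list[str]) -> subprocess.CompletedProcess:
    """Blocking call: run the command and capture its output as text."""
    return subprocess.run(command, capture_output=True, text=True, check=False)


async def run_tool_async(command: list[str]) -> subprocess.CompletedProcess:
    """Run the blocking call in a worker thread so the event loop stays free."""
    return await asyncio.to_thread(run_tool, command)


async def main() -> None:
    # Stand-in command; a real run would pass the webgisdr command line instead.
    command = [sys.executable, "-c", "print('export finished')"]
    result = await run_tool_async(command)
    # Log both streams before deciding on success, as the annotations suggest.
    print(result.stdout.strip())
    if result.returncode != 0:
        raise RuntimeError(f"tool failed with exit code {result.returncode}")


if __name__ == "__main__":
    asyncio.run(main())
```

Passing `check=False` and inspecting `returncode` afterwards keeps the captured output available for logging even when the tool fails, which would not be the case if the call raised immediately.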