4.1: Web Tier Implementation
TL;DR
The Web Tier acts as the secure entry point for user access to ArcGIS Enterprise on Azure, managing traffic routing, threat protection and SSL termination through Web Application Firewalls, Application Delivery Controllers and Web Adaptors. Implementation progresses through four stages: MVP establishes core components including Web Adaptors on Azure App Service, path-based routing and WAF policies. Bronze introduces auto-scaling for Web Adaptors and manual vertical scaling for the Portal VM. Silver enhances resilience with high availability pairs for critical components and shared storage. Gold extends disaster recovery to a secondary Azure region using a global load balancer and pilot-light standby infrastructure. Each stage incrementally improves security, scalability and fault tolerance, ensuring the platform adapts to demand, withstands regional failures and maintains service continuity through automated failover mechanisms. Detailed configurations and health monitoring underpin all tiers to align with zero-trust principles while balancing cost efficiency.
The Web Tier serves as the primary interface for user interaction and service consumption. It is responsible for securely routing client requests to the appropriate backend ArcGIS Enterprise components.
Guidelines Only
The information provided in this section offers high-level recommendations for deploying ArcGIS Enterprise on Azure. These guidelines are intended to be further refined and expanded by the project team during the detailed design and implementation phases.
The Web Tier is the outermost layer of the ArcGIS Enterprise deployment, acting as the secure gateway for all incoming client requests from the public internet. It encompasses the network and application services responsible for traffic management, security enforcement and request routing to the backend application components. This tier includes:
- Web Application Firewall (WAF): Threat protection.
- Application Delivery Controller (ADC): SSL/TLS termination and path-based routing.
- ArcGIS Web Adaptors: Translate incoming HTTP requests into a format understood by Portal for ArcGIS and ArcGIS Server.
- Content Delivery Network (CDN): Optional.
- Global Server Load Balancer (GSLB): Required for the Gold Stage.
```mermaid
graph TD
    subgraph "Internet Users"
        direction LR
        USR[("Users")]
    end
    subgraph "Azure Web Tier (Conceptual)"
        direction LR
        USR --> CDN["CDN<br>(Optional - PROD Gold)"]
        CDN --> GSLB["GSLB<br>(PROD Gold)"]
        GSLB --> WAF["WAF"]
        WAF --> ADC["ADC"]
        ADC -- "/portal/*" --> WA_P["Portal Web Adaptor<br>(App Service)"]
        ADC -- "/server/*" --> WA_S["Server Web Adaptor<br>(App Service)"]
    end
    subgraph "Application Tier (Backend)"
        direction LR
        WA_P --> P4A[("Portal for ArcGIS VM")]
        WA_S --> AGS[("ArcGIS Server VM/VMSS")]
    end
    classDef internet fill:#f9f9f9,stroke:#333,stroke-width:2px;
    classDef webtier fill:#e3f2fd,stroke:#0b5ed7,stroke-width:2px;
    classDef apptier fill:#e6ffed,stroke:#198754,stroke-width:2px;
    class USR internet;
    class CDN,GSLB,WAF,ADC,WA_P,WA_S webtier;
    class P4A,AGS apptier;
```
Diagram: Conceptual overview of the Web Tier components and their interaction.

Portal for ArcGIS: Web Tier vs. Application Tier Focus
Section 4.1 (Web Tier) details how Portal for ArcGIS is accessed and exposed through the web infrastructure. This includes:
- Configuration of the ArcGIS Web Adaptor for Portal on Azure App Service.
- ADC path-based routing rules (e.g., /portal/*) and WAF policies protecting Portal endpoints.
- Portal configurations directly related to web accessibility and content presentation, such as the Portal content directory on Azure Blob Storage and SAML/OpenID Connect for SSO.
Section 4.3 (Application Tier) will focus on the deployment, installation, core configuration and backend resilience of the Portal for ArcGIS software itself on its dedicated Virtual Machine(s), including federation, HA and DR aspects related to the Portal VM and its internal state.
4.1.1. Web Tier MVP Stage
The Minimum Viable Product (MVP) for the Web Tier establishes the foundational components necessary to expose Portal for ArcGIS and ArcGIS Server services. This stage focuses on deploying and configuring the Web Application Firewall (WAF), Application Delivery Controller (ADC) and ArcGIS Web Adaptors hosted on Azure App Service.
Key Activities and Configurations:
- ArcGIS Web Adaptors on Azure App Service:
  - Deployment Strategy: Two dedicated Azure App Service instances, utilising a Linux runtime with Tomcat, should be provisioned per environment (DEV, UAT and initially the Melbourne region for PROD MVP).
    - One App Service instance will host the Portal for ArcGIS Web Adaptor (.war file).
    - The second App Service instance will host the ArcGIS Server Web Adaptor (.war file).
  - App Service Plan Configuration:
    - PROD: An appropriately sized App Service Plan should be configured for manual scaling initially.
    - DEV/UAT: Cost-effective App Service Plans (e.g., Basic or Standard tier, typically single instance) should be utilised. These environments maintain structural parity with PROD by using App Services for Web Adaptors but are optimised for non-production workloads and costs.
  - Virtual Network (VNet) Integration: Each App Service instance should be configured with VNet Integration. This enables secure, private communication from the App Service to the backend Portal for ArcGIS and ArcGIS Server Virtual Machines (VMs) via their Fully Qualified Domain Names (FQDNs).
  - Application Settings Management: Critical Web Adaptor configurations should be managed securely via App Service Application Settings and injected by the CI/CD pipeline:
    - WEBADAPTOR_PORTAL_URL: Set to the internal FQDN of the Portal for ArcGIS VM (e.g., https://portalvm.gis.ffmvic.gov.au:7443).
    - WEBADAPTOR_SERVER_URL: Set to the internal FQDN of the ArcGIS Server site's load balancer or primary node (e.g., https://arcgisservervm.gis.ffmvic.gov.au:6443).
    - Relevant Java memory options (e.g., JAVA_OPTS) should also be configured for optimal performance.
  - Security Considerations:
    - Administrative access through the Server Web Adaptor (e.g., to ArcGIS Server Manager or the Administrator API) must be disabled via its configuration settings immediately after deployment. This is a critical security measure.
    - App Service Access Restrictions should be configured to permit inbound traffic exclusively from the Application Delivery Controller's (ADC) IP addresses or designated subnet.
    - Azure Managed Identities are not required for the App Services, as configuration is handled through CI/CD pipelines.
  - Deployment Automation: The deployment of ArcGIS Web Adaptor .war files to the respective App Service instances should be an automated step within the CI/CD pipeline (an illustrative OpenTofu sketch of the App Service configuration appears after this list).
- Application Delivery Controller (ADC) Configuration:
  - Core Function: The ADC (e.g., NetScaler) is responsible for managing all inbound HTTPS traffic from the WAF, performing SSL/TLS termination and executing path-based routing to the correct backend Web Adaptor App Service.
  - Environment-Specific Deployment:
    - PROD (Melbourne MVP): A regional ADC instance should be deployed. High Availability for the ADC itself (e.g., an HA pair) is a target for the Silver Stage.
    - DEV/UAT: A simplified ADC configuration should be employed, such as a single ADC instance, to balance functionality with cost-effectiveness.
  - Path-Based Routing Rules:
    - Traffic destined for /portal/* should be directed to the Portal Web Adaptor App Service instance.
    - Traffic destined for /server/* should be directed to the Server Web Adaptor App Service instance.
  - SSL/TLS Termination and Management: The ADC will terminate public SSL/TLS connections using certificates securely managed within Azure Key Vault. For enhanced security, re-encryption of traffic from the ADC to the backend App Services is recommended to ensure end-to-end encryption.
  - Health Probes: Specific health probes must be configured on the ADC to continuously monitor the health of the backend Web Adaptor App Services. These probes should target reliable health endpoints on the App Services, such as /portal/webadaptor/rest/info/health and /server/webadaptor/rest/info/health. This ensures that traffic is only routed to healthy and responsive instances.
  - Security Role: The ADC is a critical component of the "Zero Trust Security Model", acting as a controlled and intelligent ingress point for all application traffic.
- Web Application Firewall (WAF) Policies:
  - Primary Function: The WAF (e.g., Imperva) inspects all incoming HTTP/S traffic for threats before it reaches the ADC, providing a crucial layer of defence.
  - Policy Configuration and Rulesets:
    - Implement robust policies based on industry best practices, such as protection against the OWASP Top 10 common web vulnerabilities (e.g., SQL injection, cross-site scripting).
    - Define custom rules tailored to ArcGIS Enterprise traffic patterns. This includes explicitly allowing legitimate paths such as /portal/sharing/rest/*, /portal/home/* and /server/rest/services/*.
    - Crucially, block access to administrative endpoints (e.g., /portal/portaladmin/*, /server/manager/*, /server/admin/*) at the WAF layer. This complements the disabling of administrative access at the Web Adaptor level and aligns with the principle of defence-in-depth.
  - Operational Mode: Deploy the WAF in Detection mode initially to monitor and fine-tune rules, transitioning to Prevention mode after validation and testing to actively block malicious traffic.
- Portal for ArcGIS Configuration:
  - Content Directory on Azure Blob Storage: The Portal for ArcGIS content directory, which stores item metadata, thumbnails and uploaded files, should be configured to use a designated Azure Blob Storage container. This configuration should be performed using the Portal Administrator REST API.
  - Security Best Practices for Blob Storage: To safeguard Portal content, the designated Blob container must be configured with:
    - Soft Delete: Enabled with an appropriate retention period (e.g., 7-14 days) to allow recovery from accidental deletions.
    - Versioning: Enabled to preserve previous versions of items, facilitating rollback or historical tracking if needed.
    - Azure Resource Locks: A CanNotDelete lock applied to the storage account hosting the Portal content directory to prevent accidental deletion of the entire account.
  - Authentication and Single Sign-On (SSO): Portal for ArcGIS will be configured to use SAML 2.0 or OpenID Connect for federated authentication against the organisation's enterprise Identity Provider (IdP).
    - This configuration enables a seamless Single Sign-On experience for users.
    - Administrative user accounts should be configured using the REST API.
    - Enterprise IdP groups should be mapped to Portal roles (e.g., Viewer, User, Publisher, Administrator and any custom roles defined) to effectively manage user privileges and access control.
- SSL/TLS Certificate Management:
  - All public-facing endpoints, primarily the WAF and ADC, must utilise SSL/TLS certificates issued by a trusted public Certificate Authority (CA). The use of self-signed certificates in PROD is strictly prohibited.
  - These certificates should be securely stored and managed within Azure Key Vault, with automated renewal processes to prevent expiration and service disruption.
  - The ADC should be configured to enforce strong cipher suites and modern TLS versions (TLS 1.3 preferred if supported by both endpoints) for all client connections.
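To make the MVP Web Adaptor configuration concrete, the following is a minimal OpenTofu sketch for the Portal Web Adaptor App Service, covering VNet integration, pipeline-injected application settings and ADC-only access restrictions. The resource group, subnet references, names, runtime versions and setting values are illustrative assumptions rather than project values; the Server Web Adaptor App Service would follow the same pattern with WEBADAPTOR_SERVER_URL.

```hcl
# Illustrative only - names, subnets and versions are assumptions, not project values.
resource "azurerm_service_plan" "portal_wa" {
  name                = "asp-emap-portal-wa-prod-mel"
  resource_group_name = azurerm_resource_group.web_tier.name
  location            = "australiasoutheast"
  os_type             = "Linux"
  sku_name            = "P1v3"
}

resource "azurerm_linux_web_app" "portal_wa" {
  name                = "app-emap-portal-wa-prod-mel"
  resource_group_name = azurerm_resource_group.web_tier.name
  location            = azurerm_service_plan.portal_wa.location
  service_plan_id     = azurerm_service_plan.portal_wa.id
  https_only          = true

  # Regional VNet integration for private access to the backend Portal VM.
  virtual_network_subnet_id = azurerm_subnet.web_tier_integration.id

  site_config {
    application_stack {
      java_server         = "TOMCAT"
      java_server_version = "9.0"
      java_version        = "11"
    }

    # Only the ADC subnet may reach the Web Adaptor directly.
    ip_restriction {
      name                      = "allow-adc-subnet"
      virtual_network_subnet_id = azurerm_subnet.adc.id
      priority                  = 100
      action                    = "Allow"
    }
  }

  # Injected by the CI/CD pipeline in practice; shown inline for clarity.
  app_settings = {
    WEBADAPTOR_PORTAL_URL = "https://portalvm.gis.ffmvic.gov.au:7443"
    JAVA_OPTS             = "-Xms1g -Xmx2g"
  }
}
```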
4.1.2. Web Tier Bronze Stage
The Bronze Stage focuses on enhancing the platform's responsiveness and efficiency by implementing automatic scaling for the key components that support the Web Tier: the Azure App Service Plans hosting the ArcGIS Web Adaptors. It also outlines a strategy for manual vertical scaling of the Portal for ArcGIS VM.
Auto-Scaling for Web Adaptor App Service Plans
The ArcGIS Web Adaptors, hosted on Azure App Service instances, are critical entry points for user traffic to Portal for ArcGIS and ArcGIS Server. As user load fluctuates, the App Service Plans hosting these Web Adaptors must scale accordingly to maintain performance and availability, ensuring enough resources are available to support demand without over-provisioning and incurring unnecessary costs.
Rationale and Approach
In traditional on-premises deployments, scaling web server capacity often involved manual provisioning of new servers or VMs, a time-consuming and potentially disruptive process. Azure App Service, through Azure Autoscale, offers dynamic scaling, allowing the number of instances supporting the Web Adaptors to increase or decrease automatically based on demand or a predefined schedule. This ensures that sufficient resources are available during peak times without over-provisioning (and over-paying) during quieter periods.
Two primary Azure Autoscale approaches are available:
- Metric-Based Scaling: Adjusts the instance count based on real-time performance metrics such as CPU percentage, memory usage, or HTTP queue length. This is ideal for handling unpredictable load patterns.
- Schedule-Based Scaling: Adjusts the instance count at specific times, useful for predictable peak and off-peak periods (e.g., scaling up during business hours and down overnight).
For the new eMap platform, metric-based scaling (Azure Autoscale) is generally preferred for its responsiveness, though schedule-based scaling can complement it.
Alternative: Azure App Service 'Automatic scaling'
Azure App Service also offers a simpler, platform-managed "Automatic scaling" feature (distinct from Azure Autoscale) for Premium V2/V3 tiers. This feature, if enabled on the App Service Plan, automatically handles scaling decisions based on HTTP traffic without requiring explicit rule definition and can manage prewarmed instances to reduce cold starts. While this architecture recommends Azure Monitor Autoscale for granular control via metric-based rules, platform-managed "Automatic scaling" can be an alternative for scenarios where simpler, traffic-based scaling is preferred.
Configuration
Azure Autoscale rules for App Service Plans are configured within Azure using Azure Monitor Autoscale settings. These rules allow precise control over how and when the App Service Plan scales and can be defined declaratively using OpenTofu.
Key parameters for auto-scaling rules include:
- Minimum and Maximum Instances: Defines the lower and upper bounds for the number of instances; the maximum helps control costs.
- Scale-Out Rules: Conditions that trigger an increase in instance count (e.g., CPU average > 70% for 10 minutes).
- Scale-In Rules: Conditions that trigger a decrease in instance count (e.g., CPU average < 30% for 20 minutes).
- Cooldown Period: A duration after a scale event during which further scaling actions are paused. This allows metrics to stabilise and prevents rapid fluctuations (flapping), where the system might scale out and in repeatedly in short succession.
- Notifications: Autoscale settings can be configured to send email notifications or trigger webhooks when scaling events occur, informing administrators and operations teams.
The following hypothetical OpenTofu code snippet provides a conceptual example for configuring metric-based auto-scaling for an App Service Plan hosting a Web Adaptor. Separate, similar configurations would be applied to the App Service Plans for both the Portal Web Adaptor and the Server Web Adaptor.
Listing: app_service_autoscale_prod.tf
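The listing below is a conceptual sketch of such a configuration, covering the elements described in the notes that follow (the action group, the autoscale profile and capacities, the CPU and HTTP queue length triggers, cooldown periods and notifications). Resource names, thresholds and email addresses are illustrative assumptions; the App Service Plan reference reuses the Portal Web Adaptor plan from the earlier sketch.

```hcl
# Conceptual sketch - names, emails and thresholds are illustrative assumptions.
resource "azurerm_monitor_action_group" "webtier_scaling" {
  name                = "ag-emap-webtier-scaling-prod"
  resource_group_name = azurerm_resource_group.web_tier.name
  short_name          = "emapScale" # concise identifier used in SMS / mobile app alerts

  email_receiver {
    name          = "cloud-team"
    email_address = "cloud.team@example.org" # placeholder address
  }
}

resource "azurerm_monitor_autoscale_setting" "portal_wa" {
  name                = "autoscale-emap-portal-wa-prod-mel"
  resource_group_name = azurerm_resource_group.web_tier.name
  location            = "australiasoutheast"
  target_resource_id  = azurerm_service_plan.portal_wa.id # plan hosting the Portal Web Adaptor

  profile {
    name = "default"

    capacity {
      default = 2
      minimum = 2 # baseline for production availability
      maximum = 5 # upper bound to control cost
    }

    # Scale out: average CPU above 70% over a 10-minute window adds one instance.
    rule {
      metric_trigger {
        metric_name        = "CpuPercentage"
        metric_resource_id = azurerm_service_plan.portal_wa.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT10M"
        time_aggregation   = "Average"
        operator           = "GreaterThan"
        threshold          = 70
      }
      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT10M" # pause further scale-out while metrics stabilise
      }
    }

    # Secondary scale-out trigger on HTTP queue pressure.
    rule {
      metric_trigger {
        metric_name        = "HttpQueueLength"
        metric_resource_id = azurerm_service_plan.portal_wa.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT5M"
        time_aggregation   = "Average"
        operator           = "GreaterThan"
        threshold          = 25
      }
      scale_action {
        direction = "Increase"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT10M"
      }
    }

    # Scale in: average CPU below 30% over 20 minutes removes one instance.
    rule {
      metric_trigger {
        metric_name        = "CpuPercentage"
        metric_resource_id = azurerm_service_plan.portal_wa.id
        time_grain         = "PT1M"
        statistic          = "Average"
        time_window        = "PT20M"
        time_aggregation   = "Average"
        operator           = "LessThan"
        threshold          = 30
      }
      scale_action {
        direction = "Decrease"
        type      = "ChangeCount"
        value     = "1"
        cooldown  = "PT20M"
      }
    }
  }

  notification {
    email {
      custom_emails = ["cloud.team@example.org"] # the action group above can add further channels
    }
  }
}
```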
- Provides a concise identifier for the action group, often used in notifications like SMS or Azure mobile app alerts.
- Configures the Action Group to send an email notification when a scaling event triggers this action group.
- Configures the Action Group to send an email notification to the Cloud Team.
- A unique and descriptive name for the autoscale setting resource, incorporating environment and location, making it easily identifiable in Azure.
- This crucial attribute links the autoscale settings directly to the specific Azure Service Plan that hosts the Portal Web Adaptor. All metrics and scaling actions apply to this targeted resource.
- Defines a collection of scaling rules. An autoscale setting can have multiple profiles, for instance, to apply different scaling logic on a schedule (e.g., weekdays vs. weekends) or for different default capacities.
- Specifies the number of instances the Service Plan should have when no scaling rules are active or when the autoscale setting is first applied.
- Defines the absolute minimum number of instances that the Service Plan must maintain, ensuring a baseline level of availability and performance, typically two for production environments.
- Sets the upper limit on how many instances the Service Plan can scale out to. This helps control costs and prevents runaway scaling under unexpected load.
- The Azure Service Plan whose CpuPercentage metric will be monitored to trigger this specific scaling rule.
- Defines the duration over which the CpuPercentage metric is averaged. A scaling decision is made based on this average, smoothing out temporary spikes.
- Determines that one instance will be added to the Service Plan when the CPU scale-out rule is triggered.
- After a scale-out action, a 10-minute cooldown period begins. During this time, this specific scale-out rule will not trigger again, allowing the system and metrics to stabilise.
- This metric trigger monitors the number of HTTP requests currently queued and waiting to be processed by the App Service instances. A high queue length indicates that the application is struggling to keep up with incoming traffic.
- Configures how alerts and notifications are sent out when autoscale events (scale-out or scale-in actions) occur, ensuring operational teams are aware.
- Connects the autoscale setting's notifications to the previously defined azurerm_monitor_action_group. This centralises notification logic, allowing for easier management and diverse notification channels (email, SMS, ITSM tools, etc.) configured within the action group.
Implementing Azure Autoscale for Web Adaptor App Service Plans ensures the web tier remains responsive and cost-effective, automatically adapting to user demand through defined rules.
Manual Vertical Scaling for Portal for ArcGIS VM
Portal for ArcGIS runs on a single active VM (the Silver Stage adds a passive standby, but not additional capacity). If the Portal VM becomes a performance bottleneck, manual vertical scaling (increasing the VM's size, e.g., CPU, RAM, disk IOPS/throughput, by changing to a larger VM SKU) is the primary remediation strategy.
ArcGIS Limitations
An inherent limitation of the Esri ArcGIS platform is its lack of automatic scaling options for Portal for ArcGIS. Esri does not support an active-active configuration for Portal, so it cannot be horizontally scaled. While the VM could theoretically be scaled automatically using VMSS in Azure, Portal is inherently a "stateful" application. Although this architecture recommends storing all relevant directories and configurations in shared locations (e.g., Blob Storage, Azure Files), Portal for ArcGIS internally maintains runtime state (e.g., user sessions, tokens, service registrations) that is not stored externally, so using VMSS to automatically scale Portal for ArcGIS would disrupt users. This limitation can be overcome with Kubernetes or a container-based solution such as Azure Container Apps, but these options were deemed outside the scope of this project.
Practical Solution
Performance bottlenecks in an ArcGIS Enterprise deployment almost always lie in ArcGIS Server, the web server (ArcGIS Web Adaptor) or the underlying RDBMS, all of which this architecture scales automatically based on demand. With automatic scaling of these key components and proactive monitoring of Portal for ArcGIS health metrics, the platform should nearly always remain responsive and scale with demand.
Rationale
Unlike the stateless Web Adaptors or ArcGIS Server, which is designed for horizontal scaling in a VMSS, Portal for ArcGIS cannot easily be scaled out. Therefore, a well-documented manual vertical scaling procedure is essential for scenarios where the existing VM resources become insufficient.
Process for Documentation
The procedures to be documented should cover:
- Identifying the Need for Scaling:
- Key Metrics to Monitor:
- CPU Utilisation (sustained high levels, e.g., > 80%).
- Memory Utilisation (available memory consistently low, high swap usage).
- Disk I/O (high read/write latency, queue depths).
- Portal Application Response Times (slow loading of Portal Home, item pages, search results).
- User-Reported Slowness specific to Portal interactions.
- Thresholds for Review: Define specific metric thresholds that trigger a review for potential vertical scaling. For example, "Average CPU utilisation above 85% for more than 1 hour during peak load" or "Portal page load times exceeding X seconds for 95th percentile users."
- Key Metrics to Monitor:
- Planning for Vertical Scaling:
- Impact Assessment: Vertical scaling of a VM typically requires a reboot, leading to downtime for the Portal component. The duration of downtime must be estimated.
- Change Management: Adherence to organisational change management processes.
- Communication Plan: Notifying stakeholders and users of planned maintenance and expected downtime.
- Timing: Scheduling the scaling activity during off-peak hours or a planned maintenance window.
- Rollback Plan: Documenting steps to revert to the previous VM size if issues arise post-scaling.
- Execution Steps:
- Instructions for resizing the Azure VM (changing its SKU), ideally by updating the OpenTofu configuration (e.g., the size property in azurerm_linux_virtual_machine) and re-applying it; see the sketch after this list.
- Pre-scaling checks (e.g., ensuring no critical operations are in progress, taking a VM snapshot if feasible).
- Post-scaling checks (e.g., verifying Portal services are running, basic functionality testing).
- Validation:
- Confirming the VM reflects the new size/SKU in Azure.
- Monitoring key performance metrics post-scaling to ensure the issue is resolved and performance has improved.
- Conducting targeted functional tests to ensure all Portal operations are performing as expected.
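To support both the "identifying the need" and "execution" steps above, the hedged OpenTofu sketch below pairs a sustained-CPU metric alert (matching the 85%-for-one-hour threshold suggested earlier) with the size attribute on the Portal VM resource, which is the single value changed during a vertical scale. Names, SKUs, image details and the NIC reference are illustrative assumptions; the alert reuses the action group defined for the Bronze Stage autoscale notifications.

```hcl
# Illustrative only - VM name, SKUs and the alert threshold mirror the guidance above.
resource "azurerm_monitor_metric_alert" "portal_vm_cpu" {
  name                = "alert-emap-portal-vm-cpu-prod"
  resource_group_name = azurerm_resource_group.app_tier.name
  scopes              = [azurerm_linux_virtual_machine.portal.id]
  description         = "Average CPU above 85% for more than 1 hour - review Portal VM for vertical scaling."
  severity            = 2
  frequency           = "PT15M"
  window_size         = "PT1H"

  criteria {
    metric_namespace = "Microsoft.Compute/virtualMachines"
    metric_name      = "Percentage CPU"
    aggregation      = "Average"
    operator         = "GreaterThan"
    threshold        = 85
  }

  action {
    action_group_id = azurerm_monitor_action_group.webtier_scaling.id
  }
}

# Vertical scaling is then a one-line change to the VM SKU, applied in a maintenance window.
resource "azurerm_linux_virtual_machine" "portal" {
  name                  = "vm-emap-portal-prod-mel"
  resource_group_name   = azurerm_resource_group.app_tier.name
  location              = "australiasoutheast"
  size                  = "Standard_D8s_v5" # e.g. previously Standard_D4s_v5; resizing triggers a reboot
  admin_username        = "emapadmin"
  network_interface_ids = [azurerm_network_interface.portal.id]

  admin_ssh_key {
    username   = "emapadmin"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
  }

  source_image_reference {
    publisher = "Canonical"
    offer     = "0001-com-ubuntu-server-jammy"
    sku       = "22_04-lts-gen2"
    version   = "latest"
  }
}
```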
By implementing these auto-scaling mechanisms for the Web Adaptor App Service Plans and by documenting the manual scaling procedures for the Portal VM, the new eMap platform will be significantly more resilient, performant and cost-efficient.
4.1.3. Web Tier Silver Stage
The Silver Stage for the Web Tier focuses on enhancing the resilience of the Production (PROD) environment within the primary Azure region (Melbourne). This is achieved by introducing High Availability (HA) configurations for critical components: the Application Delivery Controller (ADC), Portal for ArcGIS and the Azure App Service Plans hosting the ArcGIS Web Adaptors. The primary objective is to mitigate single points of failure and ensure continuous service availability in the event of individual component or infrastructure issues within the region.
Application Delivery Controller (ADC) High Availability
The Application Delivery Controller (ADC), such as NetScaler, is a pivotal component in the web tier, responsible for managing all inbound HTTPS traffic from the Web Application Firewall (WAF), performing SSL/TLS termination and executing path-based routing to the correct Web Adaptor App Services.
No Availability Zones in Melbourne
For High Availability, resources should ideally be placed in distinct Availability Zones to protect against data-centre-level failures. However, as the Azure Australia Southeast (Melbourne) region currently lacks Availability Zones, High Availability for VM-based components should be achieved using Azure Availability Sets. Availability Sets provide protection against server rack and hardware failures within a data centre by distributing VMs across different fault domains and update domains.
Rationale and Approach: Deploying the ADC in a High Availability configuration ensures that if one ADC instance becomes unavailable due to hardware failure or maintenance, the other instance can seamlessly take over traffic management responsibilities, minimising service disruption.
Implementation Details:
- HA Pair Deployment in Availability Set:
- The ADC for the PROD environment in Melbourne should be deployed as an HA pair.
- These VMs will be configured within an Azure Availability Set to ensure they are distributed across different fault and update domains.
- State Synchronisation:
- The ADC solution (e.g., NetScaler) must be configured for state synchronisation between the instances in the HA pair. This typically includes synchronising session information (e.g., for user persistence, if configured), configuration settings and SSL/TLS session states. This ensures a smooth transition for users if a failover event occurs.
- Automatic Failover:
- The HA pair should be configured for automatic failover. If the active ADC instance fails its health checks or becomes unresponsive, the standby instance should automatically become active and assume responsibility for processing traffic.
- Health Probes:
- Each ADC instance must actively monitor the health of its peer in the HA pair (e.g., via heartbeat).
- The ADC instances will continue to use health probes, as configured in the MVP and Bronze stages, to monitor the health of the Web Adaptor App Service instances (e.g., targeting /portal/webadaptor/rest/info/health and /server/webadaptor/rest/info/health). This ensures traffic is only routed to healthy backends.
- Automation:
- OpenTofu scripts must be updated to support the deployment and configuration of the ADC in an HA pair, utilising Azure Availability Sets. This includes provisioning the necessary network resources (such as the ADC Virtual IP) and configuring the ADC instances for HA operation.
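As an indication of what those OpenTofu updates involve, the sketch below places two ADC VMs into an Azure Availability Set. The appliance image variable, NIC resources, names and sizes are assumptions; the NetScaler-level HA pairing, state synchronisation and VIP configuration are performed on the appliances themselves and are not shown.

```hcl
# Illustrative sketch - the ADC appliance image, NICs, sizes and names are placeholders.
resource "azurerm_availability_set" "adc" {
  name                         = "avset-emap-adc-prod-mel"
  location                     = "australiasoutheast"
  resource_group_name          = azurerm_resource_group.web_tier.name
  platform_fault_domain_count  = 2
  platform_update_domain_count = 5
  managed                      = true
}

resource "azurerm_linux_virtual_machine" "adc" {
  count                 = 2 # HA pair; appliance-level pairing and state sync are configured on the ADCs
  name                  = "vm-emap-adc-${count.index + 1}-prod-mel"
  resource_group_name   = azurerm_resource_group.web_tier.name
  location              = "australiasoutheast"
  size                  = "Standard_D4s_v5"
  availability_set_id   = azurerm_availability_set.adc.id
  admin_username        = "adcadmin"
  network_interface_ids = [azurerm_network_interface.adc[count.index].id]

  admin_ssh_key {
    username   = "adcadmin"
    public_key = file("~/.ssh/id_rsa.pub")
  }

  os_disk {
    caching              = "ReadWrite"
    storage_account_type = "Premium_LRS"
  }

  # Placeholder: the marketplace or custom image for the chosen ADC appliance.
  source_image_id = var.adc_image_id
}
```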
The following diagram illustrates the high-availability setup for the ADC using an Availability Set:
```mermaid
graph TD
    subgraph "Azure Region: Australia Southeast (Melbourne)"
        Users["Users / WAF"] --> ADC_VIP["ADC Virtual IP (VIP)"]
        subgraph "ADC Availability Set"
            ADC1["ADC Instance 1 (Active/Standby)<br>(Fault Domain A / Update Domain X)"]
            ADC1 -.->|Health Probe| Portal_WA_ASP["Portal WA<br>App Service Plan Instances"]
            ADC1 -.->|Health Probe| Server_WA_ASP["Server WA<br>App Service Plan Instances"]
            ADC2["ADC Instance 2 (Active/Standby)<br>(Fault Domain B / Update Domain Y)"]
            ADC2 -.->|Health Probe| Portal_WA_ASP
            ADC2 -.->|Health Probe| Server_WA_ASP
        end
        ADC_VIP --> ADC1
        ADC_VIP --> ADC2
        ADC1 <-->|State Sync & Failover Heartbeat| ADC2
    end
    classDef default fill:#fff,stroke:#333,stroke-width:2px;
    classDef network fill:#9cf,stroke:#333,stroke-width:2px;
    classDef paas fill:#ccf,stroke:#333,stroke-width:2px;
    class ADC_VIP,ADC1,ADC2 network;
    class Portal_WA_ASP,Server_WA_ASP paas;
```
Diagram: High Availability configuration for the Application Delivery Controller (ADC) using an Availability Set within the Melbourne region, with health probes to backend App Service Plans.

Portal for ArcGIS High Availability
Portal for ArcGIS should also be configured for High Availability. This involves deploying two Portal for ArcGIS VMs in an active-passive configuration, ensuring that if the primary Portal VM fails, the passive VM can take over with minimal disruption.
Rationale and Approach: An active-passive HA configuration for Portal for ArcGIS mitigates Portal as a single point of failure. This architecture relies on shared storage for content and internal data replication mechanisms provided by Esri.
Implementation Details:
- Active-Passive VM Pair in Availability Set:
- Two Portal for ArcGIS VMs will be deployed in the PROD environment within an Azure Availability Set. This ensures the VMs are distributed across different fault and update domains in the Melbourne region.
- These VMs should be configured for an active-passive high availability setup as per Esri's guidelines.
- Shared Content Directory:
- The Portal for ArcGIS content directory, which stores item metadata, thumbnails and uploaded files, must be shared between both Portal VMs.
- This shared directory will be configured to use a designated Azure Blob Storage container. For the PROD environment, this storage account should use Zone-Redundant Storage (ZRS) to ensure data durability within the Melbourne region.
- State Synchronisation and Replication:
- Portal for ArcGIS has built-in mechanisms for maintaining consistency between the active and passive nodes:
- Internal Database Replication: The portal's internal system database, which stores information about users, groups, items and security settings, is replicated from the active machine to the passive machine.
- Index Service Synchronisation: The search index is also synchronised between the two machines to ensure consistent search results after a failover.
- Automatic Failover:
- The Portal for ArcGIS HA configuration provides automatic failover. If the active Portal VM becomes unavailable, the passive VM is promoted to active status.
- Failover properties, such as monitoring intervals and frequency, can be configured in the portal-ha-config.properties file on each Portal VM.
- ADC Integration:
- The highly available ADC will route traffic destined for /portal/* to the Portal Web Adaptor App Service. The Web Adaptor, in turn, is configured with the URL of the Portal machines. The Web Adaptor can be configured to point to both Portal machines and rely on health checks to determine the active node; alternatively, it can point to a hostname that resolves to the active Portal machine's IP after failover (in which case DNS caching must be handled carefully). The ADC will rely on the health of the Web Adaptor App Service for its routing decisions.
- Portal for ArcGIS provides a health check endpoint (e.g., https://<portal.domain.com>:7443/arcgis/portaladmin/system/healthcheck) that can be used by monitoring systems or indirectly by load balancers assessing Web Adaptor health.
- Automation:
- OpenTofu scripts must be updated to deploy the two Portal VMs within an Availability Set.
- Configuration Management scripts will handle the installation and HA configuration of Portal for ArcGIS on both VMs, including setting up the shared content directory.
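The shared content directory's storage account can itself be declared in OpenTofu, combining the ZRS replication noted above with the soft delete, versioning and CanNotDelete lock recommended in the MVP stage. Names and retention periods in the sketch below are illustrative assumptions; pointing Portal at the container is still done through the Portal Administrator REST API.

```hcl
# Illustrative sketch - storage account and container names are placeholders.
resource "azurerm_storage_account" "portal_content" {
  name                     = "stemapportalcontentprod"
  resource_group_name      = azurerm_resource_group.app_tier.name
  location                 = "australiasoutheast"
  account_tier             = "Standard"
  account_replication_type = "ZRS" # zone-redundant durability for the PROD content directory

  blob_properties {
    versioning_enabled = true # preserve previous versions of Portal items

    delete_retention_policy {
      days = 14 # soft delete window for accidental deletions
    }
  }
}

resource "azurerm_storage_container" "portal_content" {
  name                  = "portal-content"
  storage_account_name  = azurerm_storage_account.portal_content.name
  container_access_type = "private"
}

# Guard against accidental deletion of the whole account.
resource "azurerm_management_lock" "portal_content" {
  name       = "lock-portal-content-cannotdelete"
  scope      = azurerm_storage_account.portal_content.id
  lock_level = "CanNotDelete"
  notes      = "Hosts the shared Portal for ArcGIS content directory."
}
```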
The following diagram illustrates the high-availability setup for all Web Tier components:
```mermaid
graph TD
    subgraph "Azure Region: Australia Southeast (Melbourne)"
        Users_WAF["Users / WAF"] --> ADC_VIP["ADC Virtual IP (VIP)"]
        subgraph ADC_Avail ["ADC Availability Set"]
            ADC1["ADC Instance 1 (Active)<br>(Fault Domain A / Update Domain X)"]
            ADC2["ADC Instance 2 (Standby)<br>(Fault Domain B / Update Domain Y)"]
        end
        ADC_VIP --> ADC1
        ADC_VIP --> ADC2
        ADC1 <-->|State Sync & Heartbeat| ADC2
        subgraph Portal_App_Ser ["Portal Web Adaptor App Service Plan"]
            WA1["WA Instance 1<br>(Fault Domain C / Update Domain Z)"]
            WA2["WA Instance 2<br>(Fault Domain D / Update Domain W)"]
        end
        ADC1 -->|Routes /portal/*| WA1
        ADC1 -->|Routes /portal/*| WA2
        ADC2 -->|Routes /portal/*| WA1
        ADC2 -->|Routes /portal/*| WA2
        WA1 -->|Requests| PortalVM1
        WA2 -->|Requests| PortalVM1
        WA1 & WA2 -.->|Health Checks| PortalVM1 & PortalVM2
        subgraph Portal_Avail ["Portal Availability Set"]
            PortalVM1["Portal VM 1 (Active)<br>(Fault Domain E / Update Domain V)"]
            PortalVM2["Portal VM 2 (Passive)<br>(Fault Domain F / Update Domain U)"]
        end
        PortalVM1 <-->|Internal DB & Index Sync| PortalVM2
        PortalVM1 -->|Shared Content| Blob_ContentDir["Azure Blob Storage<br>(Portal Content Directory - ZRS)"]
        PortalVM2 -->|Shared Content| Blob_ContentDir
    end
    style Portal_Avail fill:#e8f5e9,stroke:#38761d,stroke-width:2px
    style Portal_App_Ser fill:#e3f2fd,stroke:#0d47a1,stroke-width:2px
    style ADC_Avail fill:#17A2B8,stroke:#0d47a1,stroke-width:2px
    classDef default fill:#fff,stroke:#333,stroke-width:1px;
    classDef vm fill:#f9f,stroke:#333,stroke-width:1px;
    classDef paas fill:#ccf,stroke:#333,stroke-width:1px;
    classDef storage fill:#ff9,stroke:#333,stroke-width:1px;
    classDef network fill:#9cf,stroke:#333,stroke-width:1px;
    class PortalVM1,PortalVM2 vm;
    class WA1,WA2 paas;
    class Blob_ContentDir storage;
    class ADC1,ADC2,ADC_VIP network;
```
Diagram: High Availability configuration for all components of the Web Tier in the Silver Stage. Portal for ArcGIS uses an active-passive VM pair in an Availability Set, with a shared content directory on Azure Blob Storage.

App Service Plan High Availability (Intra-Region Redundancy)
The ArcGIS Web Adaptors, hosted on Azure App Service, also require resilience against infrastructure issues within the region. In the absence of Availability Zones in Melbourne, intra-region redundancy for App Service Plans relies on deploying multiple instances and leveraging Azure's platform capabilities.
Rationale and Approach: By deploying the App Service Plan with multiple instances, Azure distributes these instances across different underlying hardware (fault domains and update domains). This, combined with auto-scaling (configured in Bronze Stage) and Azure's built-in instance health management, provides a high degree of resilience against localised hardware failures.
Implementation Details:
- Deploy App Service Plan with Multiple Instances:
- The App Service Plans hosting the Portal Web Adaptor and the Server Web Adaptor in the PROD environment should be configured to run a minimum of two instances.
- The use of a Premium v3 SKU (or higher) for the App Service Plan is critical for production workloads, enabling features such as VNet integration, more robust performance and supporting the necessary instance counts for redundancy.
- Leverage Auto-Scaling for Instance Health:
- The auto-scaling rules configured in the Bronze Stage (based on metrics such as CPU, memory, or HTTP queue length) play a vital role in HA. If an instance becomes unhealthy, auto-scaling mechanisms can help ensure that the desired number of healthy instances is maintained by provisioning replacements.
- Azure App Service's Built-In Redundancy:
- Azure App Service inherently manages the underlying infrastructure. In regions without Availability Zones, the platform still works to distribute application instances across different physical hardware to protect against single points of hardware failure. If an instance or the hardware it runs on fails, Azure App Service attempts to replace the failed instance automatically to maintain the configured instance count.
PaaS Benefits for Intra-Region HA
Even without Availability Zones in Melbourne, leveraging Azure App Service with multiple instances provides a significant level of resilience. Azure manages the underlying infrastructure and instance health, reducing operational burden while offering protection against common hardware failure scenarios within the region.
Validation of Failover Capabilities and Recovery Times
Implementing HA configurations is only the first step; validating their effectiveness is crucial. The Silver Stage includes comprehensive testing to ensure that failover mechanisms function as expected and that Recovery Time Objectives (RTOs) for resilience are met.
Methodology:
- Component Failure Simulation:
- Systematic tests should be conducted to simulate failures of individual components within the Web Tier in the PROD Melbourne environment. This includes:
- Simulating the failure of one ADC instance in the HA pair (e.g., by stopping it).
- Simulating the failure of the active Portal for ArcGIS VM.
- Simulating the failure of individual App Service instances (e.g., by stopping an instance via Azure portal or CLI, or by introducing load that might cause instance unhealthiness).
- Systematic tests should be conducted to simulate failures of individual components within the Web Tier in the PROD Melbourne environment. This includes:
- Verification of Automatic Failover/Recovery:
- During these simulations, the behaviour of the system will be closely monitored to verify that:
- The ADC HA pair correctly fails over to the healthy instance.
- Portal for ArcGIS correctly fails over to the passive instance and it becomes active.
- Traffic to App Services continues to be served by the remaining healthy instances and that Azure App Service replaces any failed instances to maintain the configured count.
- There is minimal or no interruption to user-facing services.
- During these simulations, the behaviour of the system will be closely monitored to verify that:
- Recovery Time Objective (RTO) Measurement:
- The time taken for the system to detect a failure and fully recover service availability (the RTO) should be measured for each component.
- Testing Tools and Techniques:
- Testing may involve manual actions, scripted procedures, or the use of tools such as Azure Chaos Studio to inject faults in a controlled manner and observe the system's response. Azure Chaos Studio experiments cannot be created in Melbourne but they can be created in Sydney (Azure Region: Australia East) and target the deployment in Melbourne.
- Documentation:
- All HA test plans, execution steps, observed behaviours and measured recovery times should be meticulously documented. This documentation is vital for operational readiness and future reviews.
4.1.4. Web Tier Gold Stage: Inter-Region Disaster Recovery
The Gold Stage for the Web Tier elevates the platform's resilience by introducing inter-region Disaster Recovery (DR) capabilities for the Production (PROD) environment. Building upon the intra-region High Availability (HA) established in the Silver Stage, this stage focuses on ensuring service continuity in the event of a full regional outage in Melbourne. This is achieved through the implementation of a Global Server Load Balancer (GSLB) and the deployment of a "pilot light" Web Tier in the Sydney region.
Global Server Load Balancer (GSLB) Implementation
The GSLB is a critical component for enabling automated failover between the primary (Melbourne) and secondary (Sydney) Azure regions.
- Role and Benefits: The GSLB provides a single, globally resolvable Fully Qualified Domain Name (FQDN) (e.g., gis.ffmvic.vic.gov.au) as the primary entry point for all user traffic. It monitors the health of the web tier in each region and automatically redirects traffic to the healthy region, thereby minimising service disruption during a regional outage. This automated approach is significantly more reliable and faster than manual DNS record changes.
- Configuration (an illustrative Azure Traffic Manager sketch follows this list):
- Routing Mechanism: The GSLB (e.g., NetScaler GSLB or Azure Traffic Manager) should be configured for Priority-based routing. Melbourne should be designated as the primary (highest priority) endpoint and Sydney as the secondary (lower priority) endpoint. Traffic will be directed to Melbourne under normal operations.
- Health Probes: Robust health probes are essential for the GSLB to accurately determine regional health.
- Target: Probes should monitor the health of the ADC endpoints in both regions (e.g., the ADC_VIP established in the Silver Stage). Ensure WAF rules never interfere with these health checks.
- Method: HTTPS probes on port 443, targeting a reliable path that reflects the overall health of the regional web stack (e.g., /portal/home or a dedicated health status page).
- Frequency and Thresholds: Probes should be configured with an interval of approximately 30 seconds. A region might be considered unhealthy after a defined number of consecutive failures (e.g., 3 failures). This configuration aims to facilitate DNS-based failover within approximately 5 minutes of a confirmed regional outage, though actual failover time is also subject to DNS propagation delays across the internet.
- DNS Considerations: The primary public DNS record for the eMap platform should point to the GSLB. TTL (Time-To-Live) values for this DNS record should be carefully considered to balance responsiveness during failover against DNS caching behaviour.
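If Azure Traffic Manager is chosen as the GSLB, the priority routing and probe settings described above could be sketched in OpenTofu roughly as follows. The relative DNS name, resource group and ADC endpoint FQDNs are placeholders, and a NetScaler GSLB would implement the equivalent configuration on the appliances instead; the public gis.ffmvic.vic.gov.au record would then point (e.g., via CNAME) at the profile's FQDN.

```hcl
# Illustrative sketch - the relative DNS name and ADC endpoint FQDNs are placeholders.
resource "azurerm_traffic_manager_profile" "gslb" {
  name                   = "tm-emap-gslb-prod"
  resource_group_name    = azurerm_resource_group.global.name
  traffic_routing_method = "Priority" # Melbourne preferred; Sydney used only on failover

  dns_config {
    relative_name = "emap-gis-prod"
    ttl           = 60 # low TTL to limit failover-related DNS caching
  }

  monitor_config {
    protocol                     = "HTTPS"
    port                         = 443
    path                         = "/portal/home"
    interval_in_seconds          = 30
    timeout_in_seconds           = 10
    tolerated_number_of_failures = 3 # ~3 consecutive failures mark a region unhealthy
  }
}

resource "azurerm_traffic_manager_external_endpoint" "melbourne" {
  name       = "adc-melbourne-primary"
  profile_id = azurerm_traffic_manager_profile.gslb.id
  target     = "adc-mel.gis.ffmvic.gov.au" # public FQDN of the Melbourne ADC VIP (placeholder)
  priority   = 1
}

resource "azurerm_traffic_manager_external_endpoint" "sydney" {
  name       = "adc-sydney-dr"
  profile_id = azurerm_traffic_manager_profile.gslb.id
  target     = "adc-syd.gis.ffmvic.gov.au" # public FQDN of the Sydney ADC VIP (placeholder)
  priority   = 2
}
```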
Web Tier Deployment in Disaster Recovery (DR) Region (Sydney)
To support failover, a corresponding Web Tier infrastructure should be deployed in the Sydney DR region.
- "Pilot Light" Approach: To manage costs effectively, the components in Sydney (WAF, ADC, Web Adaptor App Services) should be deployed using a "pilot light" model. This means they are provisioned with minimal resources (e.g., lower-tier SKUs, minimum instance counts for App Service Plans and ADC) during normal operations when Sydney is passive.
- Components:
- A regional WAF instance.
- A regional ADC instance (or an HA pair mirroring Melbourne's HA ADC setup if deemed necessary for DR readiness; a single instance is acceptable initially).
- Two Azure App Service instances (one for the Portal Web Adaptor, one for the Server Web Adaptor). These should be set to the lowest Premium v3 SKU, which still supports automatic scaling when the DR region is activated.
- Configuration Alignment: The configuration of WAF policies, ADC routing rules (path-based routing for /portal/* and /server/*) and Web Adaptor App Service settings in Sydney must mirror those in Melbourne to ensure consistent application behaviour post-failover.
- Scaling Strategy upon Failover: Upon failover to Sydney, the App Service Plans and VMs should mirror Melbourne's Silver Stage configuration and be scaled up automatically to handle the full production load.
Automation for DR Web Tier
The Infrastructure as Code (IaC) and Configuration Management (CM) scripts developed in earlier stages should be updated to support the multi-region deployment of the Web Tier.
- OpenTofu Parameterisation: OpenTofu modules for WAF, ADC, App Services and associated networking should be further parameterised to accommodate differences between the primary and DR regions (e.g., resource names, IP addresses, initial scaling configurations); see the sketch after this list.
- Configuration Management Consistency: CM scripts should ensure that application-level configurations (e.g., Web Adaptor settings within App Service Application Settings) are consistently applied in both regions.
- Deployment Pipelines: CI/CD pipelines should be adapted to manage deployments to both Melbourne and Sydney environments, allowing for coordinated updates.
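The sketch below illustrates one way to express that parameterisation: a per-region settings map driving a reusable (hypothetical) web-tier module, so that Sydney runs at pilot-light capacity while sharing the same code as Melbourne. Region names, SKUs, instance counts and the module path are assumptions for illustration.

```hcl
# Illustrative sketch - a per-region settings map consumed by a hypothetical web-tier module.
variable "region_settings" {
  description = "Web tier sizing per region; Sydney stays at pilot-light capacity until failover."
  type = map(object({
    location        = string
    app_service_sku = string
    min_instances   = number
    max_instances   = number
  }))
  default = {
    melbourne = {
      location        = "australiasoutheast"
      app_service_sku = "P1v3"
      min_instances   = 2
      max_instances   = 5
    }
    sydney = {
      location        = "australiaeast"
      app_service_sku = "P1v3"
      min_instances   = 1 # pilot light
      max_instances   = 5 # headroom used when the DR region is activated
    }
  }
}

# One web tier module instance per region, differing only in the injected settings.
module "web_tier" {
  for_each = var.region_settings
  source   = "./modules/web-tier" # hypothetical module path

  location        = each.value.location
  app_service_sku = each.value.app_service_sku
  min_instances   = each.value.min_instances
  max_instances   = each.value.max_instances
}
```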
Testing and Validation of Web Tier DR
Thorough testing is crucial to validate the Web Tier DR capabilities.
- DR Drills: Regular DR drills (e.g., annually) will simulate a full Melbourne outage. These drills will test:
- The GSLB's ability to detect Melbourne's unavailability and correctly redirect traffic to Sydney.
- The activation and scaling of the Sydney Web Tier components.
- The accessibility and functionality of ArcGIS Enterprise services.
- Runbook Validation: DR drills also serve to validate and refine the documented DR runbooks, ensuring that procedures are accurate, efficient and well understood by the operations team.
- Documentation: All test results, failover times and any issues encountered should be documented and DR plans updated accordingly.
The following diagram illustrates the Web Tier architecture in the Gold Stage, with the GSLB managing traffic across the active Melbourne region and the passive Sydney DR region.
```mermaid
graph TD
    subgraph "Global Traffic Management"
        Users["Internet Users"] -->|HTTPS gis.ffmvic.vic.gov.au| GSLB["Global Server Load Balancer (GSLB)<br/>(Priority Routing: Melbourne > Sydney)"]
    end
    subgraph MEL_Region ["Azure Region: Australia Southeast (Melbourne - Active)"]
        direction TB
        GSLB -->|Primary Path| WAF_MEL["WAF Melbourne"]
        WAF_MEL --> ADC_VIP_MEL["ADC Virtual IP Melbourne<br/>(HA Pair in Availability Set)"]
        ADC_VIP_MEL -- "/portal/*" --> Portal_WA_ASP_MEL["Portal WA App Service Plan<br/>(P1v3: 2-5 Instances)"]
        ADC_VIP_MEL -- "/server/*" --> Server_WA_ASP_MEL["Server WA App Service Plan<br/>(P1v3: 2-5 Instances)"]
        GSLB -.->|HTTPS Health Probe<br/>/portal/home| ADC_VIP_MEL
    end
    subgraph SYD_Region ["Azure Region: Australia East (Sydney - Passive DR)"]
        direction TB
        GSLB -->|Failover Path| WAF_SYD["WAF Sydney"]
        WAF_SYD --> ADC_VIP_SYD["ADC Virtual IP Sydney<br/>(Pilot Light: B1ms Scalable)"]
        ADC_VIP_SYD -- "/portal/*" --> Portal_WA_ASP_SYD["Portal WA App Service Plan<br/>(P1v3: 1 Instance)"]
        ADC_VIP_SYD -- "/server/*" --> Server_WA_ASP_SYD["Server WA App Service Plan<br/>(P1v3: 1 Instance)"]
        GSLB -.->|HTTPS Health Probe<br/>/portal/home| ADC_VIP_SYD
    end
    style MEL_Region fill:#e8f5e9,stroke:#38761d,stroke-width:2px
    style SYD_Region fill:#e3f2fd,stroke:#0d47a1,stroke-width:2px
    classDef default fill:#fff,stroke:#333,stroke-width:1px;
    classDef network fill:#9cf,stroke:#333,stroke-width:1px;
    classDef paas fill:#ccf,stroke:#333,stroke-width:1px;
    classDef security fill:#f66,stroke:#333,stroke-width:1px;
    class GSLB,ADC_VIP_MEL,ADC_VIP_SYD network;
    class Portal_WA_ASP_MEL,Server_WA_ASP_MEL,Portal_WA_ASP_SYD,Server_WA_ASP_SYD paas;
    class WAF_MEL,WAF_SYD security;
```
Diagram: Gold Stage Web Tier architecture with GSLB for inter-region Disaster Recovery between Melbourne (Active) and Sydney (Passive).

By completing the Gold Stage for the Web Tier, the new eMap platform achieves a high level of resilience against regional outages.