4.6: Monitoring, Alerting & Lifecycle Service Management

✂️ Tl;dr 🥷

Explains the monitoring, alerting and lifecycle management strategy for the ArcGIS Enterprise platform on Azure, ensuring operational stability, security and performance. Centralised logging via Azure Monitor provides visibility across all components, collecting VM metrics, PaaS diagnostics and Application Insights telemetry. Automated alerts detect critical issues such as resource overutilisation or service unavailability, triggering notifications through defined channels. Lifecycle management enforces structured patching schedules for VMs, systematic ArcGIS software updates and vulnerability scanning across environments. SSL/TLS certificates are managed securely through Azure Key Vault with automated deployment and renewal. Component-specific monitoring tracks key performance indicators for ArcGIS Server, Portal and Data Store, alongside Azure PaaS resources. Advanced alerting incorporates dashboards, granular thresholds and escalation paths for incident response. Health probes and automated smoke tests validate platform functionality post-deployment and inform load-balancing decisions. These practices are phased across project stages, establishing foundational monitoring early and progressively integrating advanced features to maintain platform reliability throughout its lifecycle.

Following the implementation details for the Web, Application and Data Tiers, this section outlines the strategy for monitoring, alerting and lifecycle service management for the new eMap platform. The capabilities described here should be implemented progressively across the project stages (MVP, Bronze, Silver, Gold), with foundational elements established early and more advanced features layered in as the platform matures.

4.6.1 Initial Monitoring Setup and Centralised Logging

Effective monitoring begins with comprehensive data collection and centralised logging, enabling visibility into the health and performance of all platform components. Azure Monitor will serve as the core service for these activities.

Azure Monitor Agent Deployment: The Azure Monitor Agent (AMA) should be deployed to all Azure Virtual Machines (VMs) hosting ArcGIS Enterprise components (Portal for ArcGIS, ArcGIS Server, ArcGIS Data Store). This deployment should be automated using the Configuration Management tool and baked into the "golden image". The AMA is responsible for collecting guest operating system logs (e.g., syslog, security events from /var/log/auth.log or /var/log/syslog on Ubuntu) and performance metrics (CPU, memory, disk I/O, network statistics). This provides granular insight into the operational state of the VMs.
Log Analytics Workspaces: Environment-specific Azure Log Analytics workspaces (e.g., log-emap-dev-ause, log-emap-uat-ause, log-emap-prod-ause) should be established as the central repositories for all collected logs and performance data. This segregation ensures data isolation between environments, allows for tailored data retention policies according to each environment's needs (e.g., shorter retention for DEV, longer for PROD) and facilitates granular access control for different operational teams.
PaaS Service Diagnostics: Diagnostic settings for all Azure PaaS services used by the new eMap platform – including Azure Database for PostgreSQL, Azure App Service (hosting Web Adaptors) and Azure Storage accounts (Blob, Files, ADLS Gen2) – should be configured to stream their metrics and logs to the respective Log Analytics workspace for that environment. This configuration should be managed via OpenTofu.
Initial Alert Configuration: A baseline set of alerts should be configured in Azure Monitor during the MVP stage to provide early notification of critical issues that could impact service availability or performance. Examples include:
- VM unresponsive or sustained high CPU/memory utilisation (e.g., > 90% for 15 minutes).
- App Service Plan high CPU/memory utilisation or a spike in HTTP 5xx server error rates.
- Azure Database for PostgreSQL high CPU (vCore) or memory utilisation, or critically low storage space.
- Key service endpoints (e.g., Portal home page, Server REST endpoint via ADC health probes) becoming unavailable.
These initial alerts ensure that fundamental operational issues are detected and can be addressed promptly.
Basic Azure Dashboards (Optional): Dashboards can be created in Azure Monitor to provide an overview of key health indicators for each environment. These dashboards should ideally display critical metrics from VMs and App Services.
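
To make the "sustained" wording in the alert examples concrete (e.g., > 90% for 15 minutes), the following is a minimal Python sketch of how such a condition can be evaluated over a series of metric samples. Azure Monitor performs this evaluation itself; the `Sample` type and `is_sustained_breach` function are hypothetical names used only to illustrate the semantics, and the one-minute sample interval is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    """One metric reading, e.g. average CPU percent for a one-minute grain."""
    timestamp: datetime
    value: float


def is_sustained_breach(
    samples: Sequence[Sample], threshold: float, window: timedelta
) -> bool:
    """Return True if every sample in the trailing window exceeds the threshold.

    The window ends at the newest sample. If the available history does not
    cover the whole window, the condition is treated as not met.
    """
    if not samples:
        return False
    ordered = sorted(samples, key=lambda s: s.timestamp)
    window_start = ordered[-1].timestamp - window
    if ordered[0].timestamp > window_start:
        return False  # not enough history to cover the whole window
    recent = [s for s in ordered if s.timestamp >= window_start]
    return all(s.value > threshold for s in recent)
```

A single sample at or below the threshold inside the window resets the condition, which is what distinguishes a sustained breach from a short spike.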

The following diagram illustrates the conceptual flow of monitoring data:

flowchart TB
    subgraph AzureResources["📦 Azure Resources"]
        direction TB
        VMs["🖥️ ArcGIS VMs<br><small>Portal | Server | Data Store</small>"]
        PaaS_DB["🗄️ Azure PostgreSQL<br><small>Managed Database</small>"]
        PaaS_Storage["💾 Azure Storage<br><small>Blob | Files | ADLS2</small>"]
        AppServices["🌐 App Services<br><small>Web Adaptors</small>"]
    end

    subgraph DataCollectors["📡 Data Collection"]
        direction TB
        AMA["📎 Azure Monitor Agent<br><small>VM Telemetry</small>"]
        PaaS_Diag["⚙️ PaaS Diagnostics<br><small>Resource Metrics</small>"]
        AppInsights["🔍 App Insights<br><small>SDK Integration</small>"]
    end

    subgraph MonitorCore["📊 Azure Monitor"]
        direction LR
        LogAnalytics["📚 Log Analytics<br><small>Central Repository</small>"]
        Metrics["📈 Platform Metrics<br><small>Time Series Data</small>"]
        Alerts["🚨 Alert Rules<br><small>Condition Triggers</small>"]
    end

    subgraph Outputs["🖥️ Operational Outputs"]
        direction TB
        Dashboards["📊 Live Dashboards<br><small>Health Views</small>"]
        Notifications["📧 Alert Channels<br><small>Email | SMS | ITSM</small>"]
        KQL["🔎 Query Analytics<br><small>Kusto Exploration</small>"]
    end

    VMs -->|OS Metrics| AMA
    PaaS_DB -->|Diagnostics| PaaS_Diag
    PaaS_Storage -->|Storage Logs| PaaS_Diag
    AppServices -->|App Logs| PaaS_Diag
    AppServices -->|Telemetry| AppInsights

    AMA --> LogAnalytics
    PaaS_Diag --> LogAnalytics
    AppInsights --> LogAnalytics

    LogAnalytics --> Metrics
    LogAnalytics --> Alerts
    Metrics --> Alerts

    Alerts -->|Trigger| Notifications
    LogAnalytics -->|Visualise| Dashboards
    Metrics -->|Plot| Dashboards
    LogAnalytics -->|Analyse| KQL

    classDef box fill:#fff,stroke:#333,stroke-width:2px,color:#000
    classDef azure fill:#e6f3ff,stroke:#007fff,color:#000
    classDef collection fill:#e0f7fa,stroke:#00b7c3,color:#000
    classDef monitor fill:#fff3e0,stroke:#ff9d00,color:#000
    classDef output fill:#e8f5e9,stroke:#34b234,color:#000

    class AzureResources,DataCollectors,MonitorCore,Outputs box
    class AzureResources azure
    class DataCollectors collection
    class MonitorCore monitor
    class Outputs output

Diagram: Conceptual flow of monitoring data into Azure Monitor for the new eMap platform.

4.6.2 Lifecycle Service Management

Lifecycle service management encompasses patching, updates and vulnerability management, all critical for maintaining a secure and stable platform.

Operating System Patching (VMs): A regular schedule for applying operating system security patches to all Ubuntu 24.04 LTS VMs should be established. Patches will be tested in the DEV and UAT environments before being applied to the PROD environment. This process should be semi-automated using the Configuration Management tool and coordinated with planned maintenance windows.

Automatic Patching with Azure VM Image Builder?

Azure VM Image Builder is a managed Azure service built on HashiCorp Packer. It enables automated creation, customisation and distribution of standardised "golden" VM images using a code-first approach and can integrate patching and updates during image builds. While not explicitly implemented in the current architecture, its suitability for streamlining golden image creation, especially in tandem with OpenTofu and the Configuration Management tool, should be evaluated further.

ArcGIS Software and Web Adaptor Updates:

Updates and patches for ArcGIS Enterprise software components (Portal for ArcGIS, ArcGIS Server, ArcGIS Data Store) and ArcGIS Web Adaptor .war files should be managed systematically:

1. Monitoring Esri's patch notifications and release notes for relevant updates.
2. Planning and testing of updates in the DEV environment to identify any compatibility issues or unexpected behaviour.
3. Conducting User Acceptance Testing (UAT) validation of updates to ensure they meet business requirements and do not negatively impact user workflows.
4. Scheduling and deploying updates to the PROD environment during approved maintenance windows.

The Configuration Management tool will automate the application of software updates on VMs. The CI/CD pipeline will manage the deployment of updated ArcGIS Web Adaptor .war files to Azure App Service instances.
Azure App Service Runtime Patching: The underlying operating system and Java/Tomcat runtime for Azure App Service instances (hosting the Web Adaptors) are managed and patched by Microsoft Azure.
Vulnerability Scanning: Regular vulnerability scanning should be implemented for:
- "Golden" VM images used for deploying ArcGIS components.
- Running VM instances in all environments.
- Web applications hosted on Azure App Service.
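
The DEV, UAT, PROD promotion sequence that this section applies to both OS patches and ArcGIS updates can be expressed as a simple gate in a pipeline. The sketch below is illustrative only: `Env`, `PROMOTION_ORDER` and `can_deploy` are hypothetical names, and how validation results and maintenance windows are actually recorded is left to the CI/CD and Configuration Management tooling.

```python
from enum import Enum
from typing import Set


class Env(Enum):
    DEV = "dev"
    UAT = "uat"
    PROD = "prod"


# Promotion order taken from the text: DEV first, then UAT, then PROD.
PROMOTION_ORDER = [Env.DEV, Env.UAT, Env.PROD]


def can_deploy(
    target: Env, validated: Set[Env], in_maintenance_window: bool = True
) -> bool:
    """Allow a deployment only when every earlier environment has validated
    the update and, for PROD, an approved maintenance window is open."""
    earlier = PROMOTION_ORDER[: PROMOTION_ORDER.index(target)]
    if any(env not in validated for env in earlier):
        return False
    if target is Env.PROD and not in_maintenance_window:
        return False
    return True
```

For example, `can_deploy(Env.PROD, {Env.DEV})` is rejected because UAT validation is missing, regardless of the maintenance window.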

4.6.3 SSL/TLS Certificate Management

SSL/TLS certificates should be managed centrally to ensure trust and encryption for all endpoints.

Valid Certificates: The use of self-signed certificates for any externally accessible endpoint is strictly prohibited.
Storage: Certificates and their private keys should be stored in Azure Key Vault. Access to these secrets should be tightly controlled using RBAC and Managed Identities.
Deployment and Renewal:
- Deployment of certificates to the WAF and ADC should be an automated process, integrated with Azure Key Vault.
- Automated renewal processes for certificates should be implemented where feasible.
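
Where renewal cannot be fully automated, an expiry check can at least feed an alert. The following Python sketch computes the days remaining from a certificate's `notAfter` value, in the text format returned by the standard library's `ssl.SSLSocket.getpeercert()`. The function names and the 30-day renewal threshold are assumptions for illustration, not platform requirements.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and the certificate's
    notAfter timestamp, e.g. "Jun  1 12:00:00 2030 GMT"."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def needs_renewal(not_after: str, now: datetime, renew_before_days: int = 30) -> bool:
    """True when the certificate expires within the renewal threshold
    (assumed default: 30 days) or has already expired."""
    return days_until_expiry(not_after, now) <= renew_before_days
```

Passing `now` explicitly keeps the check deterministic in tests; a scheduled job would supply `datetime.now(timezone.utc)`.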

4.6.4 Monitoring Strategy

A monitoring strategy should be developed to provide insights into platform behaviour and performance.

Data Collection: Collect platform metrics, activity logs (control plane operations) and diagnostic logs from all Azure resources. Ensure that diagnostic settings are consistently applied via OpenTofu to all new resources and that data is streamed to the designated Log Analytics workspaces.
Kusto Query Language (KQL) Utilisation: Technical teams (Cloud Infrastructure, GIS Engineers, Security Specialists) can leverage KQL for advanced log analysis and querying within Azure Log Analytics:
- Troubleshooting complex issues by correlating events across multiple components and tiers.
- Creating custom queries for specific operational checks, security auditing, or performance investigations.
- Building custom visualisations and workbooks in Azure Monitor based on KQL queries to present tailored views of platform data.
- Developing queries to track user activity patterns, service consumption and data access trends for capacity planning and governance reporting.
Application Insights for Web Adaptors: Azure Application Insights will be integrated with the ArcGIS Web Adaptor Azure App Service instances. This provides crucial Application Performance Monitoring (APM) capabilities for the web tier, including:
- Real-time tracking of request rates, response times (average and percentiles) and failure rates for specific Web Adaptor endpoints (e.g., /portal/sharing/rest, /server/rest/services).
- Identification of performance bottlenecks, such as slow dependencies (e.g., if backend Portal or ArcGIS Server calls are taking too long).
- Visualisation of application topology and dependencies via the Application Map feature.
- Automated detection of performance anomalies.
- Detailed exception tracking and diagnostics, capturing stack traces and request context for application errors originating in the Web Adaptors.
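
As an illustration of what the request-level figures above represent (average and percentile response times, failure rates per endpoint), the sketch below summarises a set of request records in Python. Application Insights calculates these values itself; the `Request` record, the nearest-rank percentile method and the choice to count only 5xx responses as failures are assumptions made for this example.

```python
import math
from dataclasses import dataclass
from typing import Dict, Optional, Sequence


@dataclass(frozen=True)
class Request:
    path: str
    duration_ms: float
    status: int


def percentile(values: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p percent
    of the data at or below it."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarise(requests: Sequence[Request], prefix: str) -> Optional[Dict[str, float]]:
    """Summarise requests whose path starts with `prefix`, or return None
    when no request matches."""
    matching = [r for r in requests if r.path.startswith(prefix)]
    if not matching:
        return None
    durations = [r.duration_ms for r in matching]
    failures = sum(1 for r in matching if r.status >= 500)
    return {
        "count": len(matching),
        "avg_ms": sum(durations) / len(durations),
        "p95_ms": percentile(durations, 95),
        "failure_rate": failures / len(matching),
    }
```

Reporting a percentile alongside the average matters because a small number of slow requests can leave the average looking healthy while a noticeable share of users sees poor response times.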

4.6.5 Component-Specific Monitoring

Monitoring should be implemented for key ArcGIS components and underlying Azure PaaS services.

ArcGIS Server:
- Key Metrics: Service request latency (average and percentiles), request throughput, error rates per service, instance health within VMSS (CPU, memory, disk I/O), geoprocessing job queue length and execution times.
- Tools: Azure Monitor (VMSS metrics, guest OS metrics via AMA), ArcGIS Server logs (forwarded to Log Analytics).
Portal for ArcGIS:
- Key Metrics: User login times, concurrent user sessions, item access rates and response times, internal service request error rates, health check endpoint status.
- Tools: Azure Monitor (VM metrics via AMA), Portal for ArcGIS logs (forwarded to Log Analytics), Application Insights (for Web Adaptor interaction).
ArcGIS Data Store:
- Key Metrics: Health status reported by describedatastore utility, disk space utilisation on the Data Store VM (OS disk and data disk), CPU/memory utilisation of the VM, query performance characteristics for key hosted feature layers and replication status (for HA configurations in PROD Silver Stage onwards).
- Tools: Azure Monitor (VM metrics via AMA), ArcGIS Data Store logs and utility outputs (potentially scripted and ingested into Log Analytics).
Azure PaaS Resource Monitoring:
- Azure Database for PostgreSQL: CPU (vCore) and memory utilisation, storage utilisation percentage, active database connections, failed connections, query latency (average and specific slow queries identified via pg_stat_statements), replication lag for HA/DR configurations.
- Azure App Service (Web Adaptors): HTTP response times, HTTP error rates (4xx, 5xx), CPU/memory utilisation of the App Service Plan, HTTP queue length, instance health.
- Azure Storage (Blob, ADLS Gen2, Files): Availability percentage, end-to-end latency for read/write operations, throughput (IOPS/Bandwidth), storage capacity utilisation against quotas.
- Tools: Azure Monitor provides native platform metrics and diagnostic logs for all these PaaS services.

Baselines are Key

Establishing performance baselines for these KPIs and metrics in each environment (DEV, UAT and PROD) shortly after deployment and under normal load conditions is crucial. These baselines will serve as the reference point for identifying anomalies, performance degradation and future capacity planning needs.
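
One simple way to use a baseline is to record the mean and standard deviation of a metric under normal load and flag later readings that fall well outside that range. The Python sketch below shows the idea; the function names and the three-standard-deviation cut-off are assumptions, and real metrics with daily or weekly patterns would need a more careful model.

```python
import statistics
from typing import Sequence, Tuple


def build_baseline(samples: Sequence[float]) -> Tuple[float, float]:
    """Summarise normal-load samples as (mean, sample standard deviation)."""
    if len(samples) < 2:
        raise ValueError("at least two samples are needed for a baseline")
    return statistics.mean(samples), statistics.stdev(samples)


def is_anomalous(value: float, baseline: Tuple[float, float], sigmas: float = 3.0) -> bool:
    """True when `value` deviates from the baseline mean by more than
    `sigmas` standard deviations (assumed default: 3)."""
    mean, stdev = baseline
    return abs(value - mean) > sigmas * stdev
```

Keeping one baseline per environment, as the text recommends, avoids comparing PROD behaviour against DEV or UAT load profiles.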

4.6.6 Advanced Alerting Strategy

A more sophisticated alerting strategy can be implemented to ensure proactive issue detection and timely response.

Role-Specific Dashboards (Optional Enhancement): Where beneficial, custom Azure Monitor Dashboards can be designed and implemented to provide tailored views of platform health and KPIs for different operational teams or stakeholders (e.g., a dashboard for Cloud Infrastructure Engineers focusing on Azure resource health, another for GIS Administrators focusing on ArcGIS service performance).
Alert Rule Refinement and Expansion: Alert rules can be defined with greater granularity in Azure Monitor, based on established KPI thresholds and specific log events indicative of potential or actual issues. Examples include:
- Infrastructure Health: Sustained high CPU/memory/disk utilisation on VMs or App Service Plans (e.g., >80% for >10 minutes), critically low available disk space, VM becoming unresponsive, network latency between tiers exceeding defined thresholds.
- Application Performance: Specific critical ArcGIS service response times, high error rates for key services (e.g., >5% errors in 5 minutes), geoprocessing job queues exceeding a certain length for an extended period, Portal login success rate dropping below a defined percentage.
- Database Performance: Sustained high CPU (vCore) utilisation (e.g., >85% for >15 minutes), critically low database storage space, high number of failed database connections, specific slow queries appearing frequently, significant replication lag for PROD HA/DR configurations.
- Storage Issues: Storage account unavailability or approaching capacity limits, sustained high latency for storage operations, an increase in storage throttling events.
- Security Events: Detection of suspicious activities from WAF logs (e.g., repeated SQL injection attempts), critical security vulnerabilities, anomalous login patterns to administrative interfaces or sensitive data.
- Data Governance Compliance: ArcGIS Data Store exceeding predefined storage thresholds (e.g., 100 GB), failures in data tiering lifecycle policies for Azure Blob/ADLS Gen2.
Action Groups and Notification Channels: Azure Monitor Action Groups can be configured to define the precise recipients and notification methods for various types of alerts. Notification channels can include:
- Targeted email distribution lists for relevant teams (e.g., gis-ops@ffm.vic.gov.au, cloud-infra@ffm.vic.gov.au).
- SMS notifications for P1/Critical alerts requiring immediate attention.
- Integration with Jira via webhooks for automated incident ticket creation and tracking.
- Notifications to relevant channels in collaboration platforms (e.g., Microsoft Teams, Slack) for operational awareness and team response coordination.
Escalation Paths: Clear escalation paths should be established for different alert severities (e.g., Critical/P1, Error/P2, Warning/P3, Information/P4) and types. These paths should specify:
- The initial team/individual responsible for acknowledging and investigating the alert.
- Timeframes for acknowledgment and resolution.
- The sequence of individuals or teams to be notified (escalated to) if an alert is not acknowledged or resolved within the defined timeframe.
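
The relationship between severities, notification channels and escalation steps can be captured as data. The Python sketch below uses the severity labels and channel types named above; the acknowledgement timeframes, the role names in each chain and the `current_contact` helper are hypothetical placeholders that would be replaced by the agreed operational values.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class EscalationPolicy:
    channels: Tuple[str, ...]  # notification methods used when the alert fires
    ack_minutes: int           # time allowed per escalation step (assumed value)
    chain: Tuple[str, ...]     # who is notified, in order, while unacknowledged


# All timeframes and role names below are illustrative assumptions.
POLICIES: Dict[str, EscalationPolicy] = {
    "P1": EscalationPolicy(("sms", "email", "jira", "teams"), 15,
                           ("on-call engineer", "team lead", "service owner")),
    "P2": EscalationPolicy(("email", "jira", "teams"), 60,
                           ("on-call engineer", "team lead")),
    "P3": EscalationPolicy(("email", "teams"), 240, ("operations team",)),
    "P4": EscalationPolicy(("teams",), 1440, ("operations team",)),
}


def current_contact(severity: str, minutes_open: int, acknowledged: bool) -> str:
    """Return who should currently hold the alert.

    An acknowledged alert stays with the first responder. Otherwise the
    alert moves one step along the chain for every elapsed acknowledgement
    interval, stopping at the last entry.
    """
    policy = POLICIES[severity]
    if acknowledged:
        return policy.chain[0]
    step = min(minutes_open // policy.ack_minutes, len(policy.chain) - 1)
    return policy.chain[step]
```

For instance, an unacknowledged P1 alert that has been open for 20 minutes would, under these assumed values, have moved from the on-call engineer to the team lead.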

4.6.7 Validation and Health Probes

Ensuring components are genuinely healthy and operational requires robust validation mechanisms.

Post-Deployment Smoke Tests: Automated smoke tests should be integrated into CI/CD pipelines and run after every successful deployment to any environment. These tests should perform basic validation of key functionalities, such as checking the availability of Portal and Server home pages, verifying that core services respond with a successful HTTP status code and confirming basic login capabilities.
API Validation: Automated API tests should be developed to validate the responses, performance and data integrity of key ArcGIS REST API endpoints. These tests can simulate common user actions and ensure services are returning expected data in the correct format.
Health Probe Configuration for Load Balancing and Traffic Management: Intelligent health probes are essential for HA and DR mechanisms, enabling load balancers and traffic managers to route traffic to healthy and responsive instances or regions.
- Global Server Load Balancer (Gold Stage): Probes should target the primary public entry points of each region (e.g., the regional WAF/ADC VIPs). These probes should monitor a reliable, lightweight application endpoint (e.g., /portal/home or a dedicated health status page) to assess overall regional health. GSLB uses these probe results to determine when to redirect traffic during a DR event.
- Application Delivery Controller (ADC): ADCs should be configured with specific health probes for each backend pool (i.e., the Portal Web Adaptor App Service instances and the Server Web Adaptor App Service instances). These probes should target application-specific health endpoints reachable through the Web Adaptors, such as the Portal for ArcGIS health check (/portal/portaladmin/healthCheck) and the ArcGIS Server health check (/server/rest/info/healthcheck). Successful responses from these endpoints indicate the Web Adaptor and its backend component (Portal/Server) are operational.
- Azure Load Balancers (Internal for VMSS): Internal load balancers distributing traffic to ArcGIS Server VMSS instances should be configured with health probes targeting an appropriate port and health check endpoint on the ArcGIS Server instances themselves (e.g., port 6080/TCP).
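
A post-deployment smoke test of the kind described in this section can be as small as a loop over a list of paths that must return a successful status. The Python sketch below uses only the standard library. The function names are hypothetical, the base URL is supplied by the caller, and the example paths (/portal/home, /server/rest/services) are the ones mentioned elsewhere in this document. The `fetch` parameter exists so the logic can be exercised without network access.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Sequence


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code of a GET request, or 0 if the endpoint
    cannot be reached at all."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout and similar


def run_smoke_tests(
    base_url: str,
    paths: Sequence[str],
    fetch: Callable[[str], int] = http_status,
) -> Dict[str, bool]:
    """Map each path to True when it answers with a 2xx status."""
    results = {}
    for path in paths:
        status = fetch(base_url.rstrip("/") + path)
        results[path] = 200 <= status < 300
    return results
```

A pipeline step would typically fail the deployment when `all(results.values())` is false. Login checks and API response validation need authenticated requests and are outside the scope of this sketch.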

By implementing this monitoring, alerting and lifecycle management strategy, the new eMap platform will achieve a high degree of operational maturity, ensuring its stability, performance and security throughout its lifecycle.