3.7: Monitoring & Observability

✂️ Tl;dr 🥷

Defines the monitoring and observability strategy for the new eMap platform, using Azure-native services to ensure operational health and performance. KPIs are established across ArcGIS components and Azure resources, including user login times, service latency, database utilisation and storage metrics, with baselines set post-deployment for anomaly detection. Azure Monitor and Application Insights provide centralised logging, application performance monitoring and advanced analytics via Kusto queries, offering a unified view of platform health. Data governance monitoring tracks storage growth, lifecycle policy adherence and access patterns for sensitive datasets. A proactive alerting strategy uses threshold-based rules and dashboards in Azure Monitor, triggering notifications through email, SMS and integrated ticketing systems. This approach enables rapid issue resolution, reduces operational complexity and supports proactive management by correlating infrastructure, application and governance metrics within a single observability framework.

This section outlines the Monitoring and Observability strategy for the new eMap platform. This strategy leverages Azure-native services to provide a unified and in-depth view of the platform's operational status.

3.7.1 Key Performance Indicators (KPIs) and Health Metrics

To effectively manage the new eMap platform, a well-defined set of Key Performance Indicators (KPIs) and health metrics should be established. These metrics provide quantifiable measures of the platform's performance, availability and utilisation, enabling proactive management and informed decision-making. Baselines for these metrics should be established post-deployment in each environment to understand normal operational parameters.

The following metrics are recommended as a starting point:

ArcGIS Enterprise Components:
- Portal for ArcGIS:
  - User Login Times: Average and percentile times for user authentication and portal access.
  - Concurrent Users: Number of active user sessions.
  - Item Access Rates: Frequency and response times for accessing common portal items (maps, apps, layers).
  - Service Request Errors: Rate of failed requests to portal services.
- ArcGIS Server (including VMSS instances):
  - Service Request Latency: Average and percentile response times for map, feature and geoprocessing services.
  - Service Request Throughput: Number of requests processed per unit of time.
  - Service Instance Health: Availability and responsiveness of individual ArcGIS Server instances within the VM Scale Set.
  - Job Queue Length (Geoprocessing Services): Number of pending and executing asynchronous jobs.
  - Resource Utilisation per Instance: CPU, memory and disk I/O for individual server instances.
- ArcGIS Data Store (Relational):
  - Health Status: Availability and replication status (for HA configurations in Production).
  - Storage Utilisation: Disk space consumption on VMs.
  - Query Performance: Response times for operations involving hosted feature layers.
Underlying Azure PaaS Resources:
- Azure Database for PostgreSQL (Enterprise Geodatabase):
  - vCore (CPU) and Memory Utilisation: Percentage of database compute resources in use.
  - Storage Utilisation: Percentage of provisioned database storage consumed.
  - Connection Counts: Number of active and failed database connections.
  - Query Latency & Throughput: Performance of queries against the enterprise geodatabases.
  - Replication Lag (for HA/DR configurations in Production): Time delay in data replication to standby/replica instances.
- Azure App Service (Web Adaptors):
  - HTTP Response Times: Average and percentile latency for requests to Web Adaptors.
  - HTTP Error Rates (4xx, 5xx): Frequency of client-side and server-side errors.
  - CPU and Memory Utilisation (App Service Plan): Resource consumption of the underlying plan.
  - HTTP Queue Length: Number of requests waiting to be processed.
- Azure Storage (Blob, ADLS Gen2, Files):
  - Availability: Percentage of successful requests.
  - Latency: Average time for read/write operations.
  - Throughput (IOPS/Bandwidth): Data transfer rates.
  - Capacity Utilisation: Used capacity, tracked against account limits or provisioned quotas (e.g., Azure Files shares).
Data Governance Specific Metrics:
- ArcGIS Data Store Growth Rate: Rate of data volume increase in the relational ArcGIS Data Store.
- Storage Lifecycle Policy Adherence: Number of items successfully transitioned between storage tiers (e.g., Hot to Cool in Azure Blob/ADLS Gen2).
- Raster Processing Job Success Rates: Percentage of successful CRF/MRF conversion and other raster processing jobs.
- Access Audit Log Volume: Volume of access logs for sensitive datasets, as an indicator of activity.

Importance of Baselines

Establishing performance baselines for these KPIs and metrics in each environment (DEV, UAT and PROD) shortly after deployment and under normal load conditions is crucial. These baselines will serve as a reference point for identifying anomalies, performance degradation and capacity planning needs.
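As an illustration of how a baseline could be derived and used for anomaly detection, the following is a minimal Python sketch. The sample readings, the choice of the 95th percentile and the `tolerance` factor of 1.5 are assumptions made for this example and are not values defined by this design; in practice the samples would come from Azure Monitor or Log Analytics.

```python
from statistics import mean, quantiles


def build_baseline(samples):
    """Summarise samples collected under normal load into a baseline."""
    # quantiles(n=100) returns the 99 cut points p1..p99; index 94 is p95.
    cuts = quantiles(samples, n=100, method="inclusive")
    return {"mean": mean(samples), "p95": cuts[94]}


def is_anomalous(value, baseline, tolerance=1.5):
    """Flag a reading that exceeds the baseline p95 by the tolerance factor."""
    return value > baseline["p95"] * tolerance


# Hypothetical service latency readings (ms) from a baseline window.
normal_latency_ms = [120, 135, 128, 140, 150, 132, 138, 145, 129, 160]
baseline = build_baseline(normal_latency_ms)

print(is_anomalous(400, baseline))  # far above the baseline p95
print(is_anomalous(150, baseline))  # within the normal range
```

A separate baseline would be kept per metric and per environment (DEV, UAT and PROD), since normal load differs between them.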

These KPIs and health metrics will form the foundation for dashboards, alerts and performance reviews, ensuring the platform operates efficiently and reliably.

3.7.2 Integration Strategy with Azure Monitor and Application Insights

To provide a holistic and centralised view of the new eMap platform's health and performance, an integration strategy with Azure Monitor and Azure Application Insights will be followed.

Centralised Logging with Azure Log Analytics:
- All relevant logs from across the platform should be consolidated into environment-specific Azure Log Analytics workspaces. This includes:
  - Operating System logs (e.g., syslog, security logs) from all Ubuntu VMs (Portal, Server, Data Store).
  - ArcGIS Enterprise component logs (Portal for ArcGIS, ArcGIS Server, ArcGIS Data Store). Configuration will be required to forward these logs.
  - Azure PaaS service logs (e.g., Azure Database for PostgreSQL logs, Azure Storage diagnostic logs).
  - Azure App Service logs for Web Adaptors (e.g., AppServiceHTTPLogs, AppServiceConsoleLogs, AppServiceAppLogs for Tomcat output).
- This centralisation allows for powerful cross-correlation of events and simplified troubleshooting.
Platform Metrics and Diagnostics with Azure Monitor:
- Azure Monitor will be the primary tool for collecting platform metrics, activity logs (control plane operations) and diagnostic settings from all Azure resources.
- Azure Monitor Agent should be deployed to all VMs to collect guest OS metrics and logs.
- Diagnostic settings for all PaaS resources (Databases, Storage Accounts, App Services) should be configured to stream metrics and logs to the designated Log Analytics workspace.
Application Performance Monitoring (APM) with Azure Application Insights:
- Azure Application Insights should be integrated with the ArcGIS Web Adaptor Azure App Service instances.
- This will provide detailed APM capabilities, including:
  - Tracking request rates, response times and failure rates for Web Adaptor endpoints.
  - Identifying performance bottlenecks and dependencies.
  - End-to-end transaction tracing to understand the flow of requests through the web tier.
  - Exception tracking and diagnostics.
- This insight is critical for understanding the performance of the user-facing web tier and its interaction with backend ArcGIS components.
Advanced Analysis with Kusto Query Language (KQL):
- The rich datasets collected in Log Analytics workspaces will be queryable using KQL.
- This enables technical teams to perform advanced log analysis, create custom queries for specific troubleshooting scenarios, build custom visualisations and derive deeper insights into platform behaviour.
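To give a sense of the kind of KQL query that could be run against the workspace, the sketch below uses Python to assemble a query summarising Web Adaptor error rates and latency. It assumes the App Service diagnostic logs land in the `AppServiceHTTPLogs` table with `ScStatus` and `TimeTaken` columns; the function name and default time windows are hypothetical, and the schema should be verified against the actual workspace before use.

```python
def web_adaptor_error_rate_query(window="1h", bin_size="5m"):
    """Return a KQL query summarising 5xx rates from App Service HTTP logs."""
    return (
        "AppServiceHTTPLogs\n"
        f"| where TimeGenerated > ago({window})\n"
        "| summarize total = count(), "
        "errors = countif(ScStatus >= 500), "
        "p95_ms = percentile(TimeTaken, 95) "
        f"by bin(TimeGenerated, {bin_size})\n"
        "| extend error_rate = todouble(errors) / total\n"
        "| order by TimeGenerated asc"
    )


# Print the query text so it can be pasted into Log Analytics.
print(web_adaptor_error_rate_query())
```

Generating the query text from a function keeps the time window and bin size consistent when the same query is reused across dashboards and alert rules.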

ArcGIS Monitor

Esri provides ArcGIS Monitor, a separate, optional product that adds specialised monitoring capabilities for ArcGIS Enterprise components. Following the Azure PaaS First principle, ArcGIS Monitor will not be implemented as part of the new eMap platform. Instead, this architecture integrates ArcGIS Enterprise components natively with Azure Monitor. This approach ensures a unified observability framework, reduces toolchain complexity and leverages Azure-native capabilities for metric collection, alerting and dashboarding.

Benefits of Centralised Observability

This integrated approach offers significant advantages over traditional, siloed monitoring methods. It provides a single pane of glass for platform health, reduces mean time to resolution (MTTR) for incidents and enables data-driven capacity planning and performance optimisation.

The implementation of this integration strategy is a key activity during the MVP stage and will be refined as the platform evolves through subsequent stages.

3.7.3 Data Governance Specific Monitoring Requirements

Specific monitoring practices should be established to support and enforce the Data Governance and lifecycle management policies. This ensures that data assets are managed according to defined standards and helps mitigate risks associated with uncontrolled data growth.

ArcGIS Data Store Utilisation and Growth Monitoring:
- Objective: Prevent uncontrolled expansion of the ArcGIS Data Store, which is intended for Portal-hosted feature layers and transient analysis outputs, not primary enterprise data.
- Metrics:
  - Regular tracking of storage utilisation (disk space consumed by the internal PostgreSQL instance on the Data Store VM).
  - Monitoring the growth rate of data volume within the Data Store.
- Actions: Alerts should be configured for storage thresholds (e.g., 50GB, 100GB, or specific percentages of provisioned capacity). These alerts should trigger reviews to identify datasets that may need to be migrated to the user-managed Enterprise Geodatabase (Azure Database for PostgreSQL) or archived/deleted according to retention policies.
Adherence to Data Lifecycle and Retention Policies:
- Objective: Ensure data stored in Azure PaaS services (Blob, ADLS Gen2) and the ArcGIS Data Store complies with defined lifecycle and retention rules.
- Metrics:
  - Successful execution of Azure Storage lifecycle management policies (e.g., tracking the volume of data transitioned between Hot, Cool and Archive tiers in Azure Blob Storage and ADLS Gen2).
  - Identifying datasets within the ArcGIS Data Store that exceed their defined retention period (e.g., 90 days for transient data).
- Actions: Automated checks should verify policy adherence. Deviations should trigger reviews by Data Stewards.
Access Pattern Auditing for Sensitive Datasets:
- Objective: Ensure compliance with security policies and detect unauthorised or anomalous access to sensitive enterprise datasets stored in Enterprise Geodatabase or other controlled stores.
- Metrics: Volume and patterns of access logs related to designated sensitive datasets.
- Actions: Utilise Azure Database for PostgreSQL audit logging capabilities (for example, the pgAudit extension). KQL queries in Azure Monitor can be used to analyse access patterns, and alerts can be configured for suspicious activities.
Raster Data Store Monitoring (ADLS Gen2):
- Objective: Ensure the health, efficiency and compliance of the designated Raster Store.
- Metrics:
  - Growth trends of the raster data store.
  - Success rates of CRF/MRF conversion jobs.
  - Performance metrics related to raster service access.
  - Storage consumption by format and age.
- Actions: Regular review of raster data organisation, validation of cloud-optimised format compliance (LERC compression, pyramid generation for CRF) and pruning of temporary or obsolete processing outputs from ADLS Gen2.
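The growth monitoring described above can be sketched as a simple projection of when a storage threshold will be reached. This is a minimal Python illustration under stated assumptions: the weekly samples are invented, growth is treated as linear, and the 50 GB threshold is taken from the example thresholds in this section.

```python
def daily_growth_gb(samples):
    """Average daily growth from (day_index, used_gb) samples, oldest first."""
    (first_day, first_gb), (last_day, last_gb) = samples[0], samples[-1]
    return (last_gb - first_gb) / (last_day - first_day)


def days_until_threshold(current_gb, growth_gb_per_day, threshold_gb):
    """Estimate days until the threshold is reached; None if not growing."""
    if current_gb >= threshold_gb:
        return 0.0
    if growth_gb_per_day <= 0:
        return None
    return (threshold_gb - current_gb) / growth_gb_per_day


# Hypothetical weekly measurements of Data Store disk usage.
samples = [(0, 30.0), (7, 33.5), (14, 37.0)]
rate = daily_growth_gb(samples)
print(rate)
print(days_until_threshold(37.0, rate, 50.0))
```

A projection of this kind gives Data Stewards lead time to plan a migration or clean-up before the alert threshold is actually breached.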

These data governance specific monitoring activities should be integrated into the overall observability framework, providing Data Owners and Stewards with the necessary insights to manage their data domains effectively.
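One of the automated checks mentioned in this section, identifying transient datasets past their retention period, might look like the following Python sketch. The inventory entries and dataset names are hypothetical, and the 90-day default mirrors the example retention period given for transient data; how the inventory is obtained (for example, from Portal item metadata) is outside the scope of this sketch.

```python
from datetime import date, timedelta


def expired_datasets(datasets, today, retention_days=90):
    """Return names of datasets created before the retention cutoff."""
    cutoff = today - timedelta(days=retention_days)
    return [name for name, created in datasets if created < cutoff]


# Hypothetical inventory of transient items: (name, creation date).
inventory = [
    ("buffer_output_a", date(2024, 1, 10)),
    ("overlay_output_b", date(2024, 5, 20)),
]

print(expired_datasets(inventory, today=date(2024, 6, 1)))
```

The resulting list would be passed to Data Stewards for review rather than deleted automatically, in line with the review-driven actions described above.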

3.7.4 Alerting Strategy

A proactive alerting strategy is fundamental to maintaining the operational stability and performance of the new eMap platform. Alerts should be configured in Azure Monitor to notify relevant teams of critical conditions, potential issues, or deviations from expected behaviour, enabling timely intervention.

Role-Specific Dashboards in Azure Monitor:
- To provide tailored views of platform health and KPIs, role-specific dashboards should ideally be designed and implemented within Azure Monitor. These may include:
  - Operational Dashboards: Displaying real-time health status of critical services, active alerts and key performance indicators for immediate operational awareness (e.g., for Cloud Infrastructure Engineers).
  - Tactical Dashboards: Showing performance trends over time, capacity utilisation, error rates and identifying areas for optimisation (e.g., for GIS Engineers and application support teams).
  - Strategic Dashboards: Summarising overall platform availability, cost trends and data governance metrics (e.g., for Data Owners and management).
Alert Rule Definition:
- Alert rules should be defined in Azure Monitor based on thresholds for the KPIs and health metrics identified, as well as specific log events. Examples of conditions that should trigger alerts include:
  - Infrastructure Health: High CPU/memory/disk utilisation on VMs or App Service Plans, low available disk space, VM unresponsive, network connectivity issues.
  - Application Performance: High error rates for ArcGIS services or Web Adaptors, excessive service response times, long job queues for geoprocessing services, Portal login failures.
  - Database Performance: High vCore (CPU) or memory utilisation on Azure Database for PostgreSQL, low storage space, failed connections, excessive query times, replication issues (for PROD).
  - Storage Issues: Storage account unavailability, high latency, approaching capacity limits.
  - Security Events: Detection of suspicious activities from WAF logs, unauthorised access attempts, critical security vulnerabilities identified by Microsoft Defender for Cloud (formerly Azure Security Center).
  - Data Governance Compliance: ArcGIS Data Store exceeding storage thresholds, failure of data tiering lifecycle policies.
Action Groups and Notification Channels:
- Azure Monitor Action Groups should be configured to define the recipients and notification methods for alerts.
- Notifications should be disseminated through appropriate channels, including:
  - Email distribution lists for relevant teams.
  - SMS notifications for high-severity alerts requiring immediate attention.
  - Integration with Jira for automated ticket creation and incident tracking.
  - Integration with collaboration platforms such as Microsoft Teams or Slack for real-time notifications to operational channels.
Escalation Paths:
- Well-defined escalation paths should be established for different alert severities (e.g., Critical, Error, Warning, Information) and types.
- These paths should specify the sequence of individuals or teams to be notified if an alert is not acknowledged or resolved within a defined timeframe, ensuring appropriate response levels.
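The relationship between alert rules, severities, notification channels and escalation paths can be sketched in plain Python as follows. This models the concepts only and is not the Azure Monitor API; the rule names, metric keys, thresholds, channel routing and escalation timings are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    metric: str
    threshold: float
    severity: str  # "Critical", "Error", "Warning" or "Information"


# Hypothetical routing table: severity -> notification channels.
CHANNELS = {
    "Critical": ["sms", "email", "ticket"],
    "Error": ["email", "ticket"],
    "Warning": ["email"],
    "Information": [],
}

# Hypothetical escalation path: (minutes unacknowledged, contact).
ESCALATION_PATH = [(0, "on-call engineer"), (30, "team lead"), (60, "service owner")]


def evaluate(rules, metrics):
    """Return (rule name, channels) for each rule whose threshold is exceeded."""
    fired = []
    for rule in rules:
        value = metrics.get(rule.metric)
        if value is not None and value > rule.threshold:
            fired.append((rule.name, CHANNELS[rule.severity]))
    return fired


def escalation_target(path, minutes_unacknowledged):
    """Pick the furthest contact in the path whose wait time has elapsed."""
    target = path[0][1]
    for after_minutes, contact in path:
        if minutes_unacknowledged >= after_minutes:
            target = contact
    return target


rules = [
    AlertRule("High VM CPU", "vm_cpu_percent", 90.0, "Critical"),
    AlertRule("Web Adaptor 5xx rate", "http_5xx_rate", 0.05, "Error"),
]

print(evaluate(rules, {"vm_cpu_percent": 95.0, "http_5xx_rate": 0.01}))
print(escalation_target(ESCALATION_PATH, 45))
```

Keeping severity-to-channel routing in one table, separate from the rules themselves, reflects the role of Action Groups: many alert rules can share the same notification configuration.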

Proactive Management through Alerting

A well-implemented alerting strategy shifts operations from a reactive to a proactive stance. By receiving timely notifications of potential issues, technical teams can address problems before they impact end-users or escalate into major incidents, contributing significantly to platform stability and reliability.

The alerting strategy should be an iterative process, with initial alerts focusing on critical system components and high-impact scenarios. As the platform matures and operational experience is gained, alert rules and thresholds should be refined to reduce noise and improve the accuracy of notifications.
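One common way to reduce alert noise is to fire only when a threshold is breached for several consecutive evaluations, so that short spikes are ignored. The Python sketch below illustrates that idea under assumed values (a threshold of 90 and three consecutive breaches); it is not a description of how Azure Monitor evaluates rules internally.

```python
def should_fire(readings, threshold, consecutive=3):
    """Fire only when the last `consecutive` readings all exceed the threshold."""
    if len(readings) < consecutive:
        return False
    return all(value > threshold for value in readings[-consecutive:])


# Hypothetical CPU utilisation readings (%), oldest first.
print(should_fire([70, 95, 72, 96], threshold=90))  # isolated spikes
print(should_fire([70, 92, 95, 97], threshold=90))  # sustained breach
```

The trade-off is detection delay: a higher `consecutive` value suppresses more false positives but postpones notification of genuine incidents, so it should be tuned per metric as operational experience is gained.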