
4.7: Testing, User Training and Operational Handover

✂️ Tl;dr 🥷

This section outlines a phased approach to testing, training and operational handover for the eMap platform, aligned with its staged implementation. The MVP stage validates core functionality through user acceptance testing, initial training and backup verification. Bronze introduces load testing for auto-scaling validation and cost management practices. Silver focuses on high availability testing within the primary region, simulating failures to ensure resilience. Gold conducts annual disaster recovery drills, testing inter-region failover and application restoration. Each phase includes updating operational procedures, refining documentation and targeted training to equip users and support teams. Collectively, these activities ensure the platform meets technical, resilience and operational requirements while enabling effective user adoption and sustainable cloud operations.

This section details testing, user training and operational handover. These activities are linked to the phased implementation of the platform, ensuring that each new layer of functionality and resilience is verified and that stakeholders are equipped to leverage and support the evolving system.

```mermaid
flowchart LR
    MVP["🧪 4.7.1 MVP Stage<br><small>Focus: Core Functionality & UAT</small>"] --> Bronze;
    Bronze["📈 4.7.2 Bronze Stage<br><small>Focus: Scaling & Cost Validation</small>"] --> Silver;
    Silver["🛡️ 4.7.3 Silver Stage<br><small>Focus: High Availability (Intra-Region)</small>"] --> Gold;
    Gold["🌍 4.7.4 Gold Stage<br><small>Focus: Disaster Recovery (Inter-Region)</small>"];

    classDef default fill:#f9f9f9,stroke:#333,stroke-width:2px;
    classDef mvpStyle fill:#e6ffe6,stroke:#008000,stroke-width:2px;
    classDef bronzeStyle fill:#fff0e6,stroke:#ff8000,stroke-width:2px;
    classDef silverStyle fill:#e6f3ff,stroke:#0066cc,stroke-width:2px;
    classDef goldStyle fill:#ffe6e6,stroke:#cc0000,stroke-width:2px;

    class MVP mvpStyle;
    class Bronze bronzeStyle;
    class Silver silverStyle;
    class Gold goldStyle;
```

Diagram: Progression of testing focus across stages.

4.7.1 MVP Stage: Functional Validation, Initial Training and Operational Handover

The MVP stage culminates in verifying that the core functionalities of the new eMap platform are operational and meet initial business requirements. This stage also includes user enablement and formally adding the system to the operations and support rosters.

User Acceptance Testing (UAT)

Comprehensive User Acceptance Testing (UAT) should be conducted in the dedicated UAT environment. This testing, led by key stakeholders, validates that the platform's core functionalities perform as expected and meet operational requirements.

Key areas for UAT include:

  • Service Publishing and Consumption:
    • Verification of publishing services from data registered in the Enterprise Geodatabase. This includes map services, feature services and potentially initial geoprocessing services.
    • Validation of publishing hosted feature layers via Portal for ArcGIS (which utilises the ArcGIS Data Store).
    • Confirmation of accessing and utilising these services in Portal Map Viewer, ArcGIS Pro, VertiGIS Studio Workflow and other relevant client applications identified in the project scope. Testing should cover various operations such as map navigation, feature identification and attribute querying.
  • Portal Functionality:
    • Successful user authentication, including the robust testing of Single Sign-On (SSO) integration with the enterprise Identity Provider (IdP). This ensures a smooth and secure login experience.
    • Thorough testing of item creation (maps, apps, layers), sharing mechanisms (with individuals, groups and the organisation) and content discovery through search and browsing.
    • Validation of functionalities including layer addition, symbology configuration and basic map interaction tools.
  • Data Interaction:
    • Validating data query capabilities across various services and client applications.
    • Testing basic editing functionalities, ensuring data integrity is maintained and changes are correctly reflected.
    • Confirming accurate attribute display and table interactions.
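
To complement the data interaction checks above, the following is a minimal smoke-test sketch that exercises service metadata and attribute queries over the ArcGIS REST API. The service URLs, the use of the Python requests library and the assumption of anonymous access are illustrative only and would be replaced with the services and authentication approach agreed for UAT.

```python
"""Minimal UAT smoke test for published services (illustrative sketch)."""
import requests

# Hypothetical endpoints; replace with the services published during UAT.
SERVICES = [
    "https://emap.example.gov.au/arcgis/rest/services/Cadastre/MapServer",
    "https://emap.example.gov.au/arcgis/rest/services/Assets/FeatureServer/0",
]

def check_service(url: str) -> None:
    # Every ArcGIS REST endpoint returns JSON metadata when queried with f=json.
    meta = requests.get(url, params={"f": "json"}, timeout=30).json()
    assert "error" not in meta, f"{url} returned an error: {meta.get('error')}"

    # For feature layers, confirm attribute querying works by asking for a count.
    if "/FeatureServer/" in url:
        count = requests.get(
            f"{url}/query",
            params={"where": "1=1", "returnCountOnly": "true", "f": "json"},
            timeout=30,
        ).json()
        assert "count" in count, f"{url} query failed: {count}"

if __name__ == "__main__":
    for svc in SERVICES:
        check_service(svc)
        print(f"OK: {svc}")
```

Scripted checks of this kind supplement, rather than replace, stakeholder-led UAT in the client applications.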

Initial Training and Guidance

To ensure effective adoption and proficient use of the new eMap platform, initial training sessions should be targeted at key users who will regularly interact with the system and relevant support desk staff who will handle first-line user queries.

The training curriculum should cover:

  • Platform Overview: An introduction to the new eMap platform, its cloud-native architecture and key differences from the legacy eMap system. This helps set the context and manage expectations.
  • Publishing Workflows:
    • Detailed guidance on publishing various types of services (map, feature, geoprocessing). A strong emphasis should be placed on the distinction between publishing from authoritative enterprise data residing in Azure Database for PostgreSQL (the preferred method for robust, managed services) and publishing user-generated content as hosted layers (suitable for temporary or less formal datasets).
    • Best practices for service definition should be covered, including data sourcing strategies, optimising service performance (e.g., simplifying geometries, setting appropriate scale dependencies), configuring caching (especially for services using the new CompactV2 cache format stored in Azure Blob Storage) and managing sharing permissions effectively.
  • Data Governance Awareness: Reinforcement of the established data governance policies (as detailed in Chapter 3). This includes highlighting the importance of data classification and adherence to data lifecycle management principles.
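
As a companion to the publishing workflow guidance above, the sketch below illustrates the hosted-layer path using the ArcGIS API for Python. The portal URL, credentials and sample file are placeholders, and the enterprise-geodatabase path is only summarised in comments, as that workflow is driven from ArcGIS Pro.

```python
"""Sketch of the two publishing paths covered in training (illustrative only)."""
from arcgis.gis import GIS

# Placeholder portal URL and credentials.
gis = GIS("https://emap.example.gov.au/portal", "trainer_user", "****")

# Path 1: user-generated content published as a hosted feature layer
# (stored in the ArcGIS Data Store; suited to temporary or informal datasets).
csv_item = gis.content.add(
    {"title": "Site inspections (training demo)", "type": "CSV", "tags": "training"},
    data="site_inspections.csv",  # placeholder file
)
hosted_layer = csv_item.publish()
print("Hosted layer item id:", hosted_layer.id)

# Path 2: authoritative enterprise data in Azure Database for PostgreSQL is
# published from ArcGIS Pro (or via service definition files), referencing the
# registered enterprise geodatabase rather than copying data into the Data
# Store. That workflow is driven from Pro and is only summarised here.
```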

Operational Procedures and Handover

Transitioning the MVP system to operational support requires clear documentation and a handover process.

  • Documentation: Development of initial Standard Operating Procedures (SOPs) is critical. These SOPs should cover:
    • MVP-level monitoring practices using Azure Monitor, including guidance on interpreting key metrics and identifying common alert conditions (a metrics query sketch follows this list).
    • Basic troubleshooting steps for common user-reported issues or system anomalies.
    • Defined user support processes, including how to log issues and the escalation paths for unresolved problems.
    • Procedures for verifying the health and success of Azure PaaS backups (e.g., checking the Point-In-Time-Recovery (PITR) status for Azure Database for PostgreSQL and confirming Azure Storage snapshot creation and replication health for Blob, ADLS Gen2 and Files).
  • Handover: Handover sessions should be conducted with the operations and support teams. These sessions should include a review of all MVP documentation, a walkthrough of the system architecture (highlighting key components and dependencies) and a discussion of known operational aspects and potential issues.
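
To support the monitoring SOP above, the following is an illustrative sketch of pulling a recent metric for the ArcGIS Server tier with the Azure CLI from Python. The resource ID and the 80% threshold are assumed placeholders rather than values from this design.

```python
"""Illustrative SOP helper: recent CPU metric for the ArcGIS Server VMSS."""
import json
import subprocess

# Placeholder resource ID for the ArcGIS Server VMSS.
VMSS_ID = (
    "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
    "Microsoft.Compute/virtualMachineScaleSets/<vmss-name>"
)

result = subprocess.run(
    [
        "az", "monitor", "metrics", "list",
        "--resource", VMSS_ID,
        "--metric", "Percentage CPU",
        "--interval", "PT5M",
        "--aggregation", "Average",
        "--output", "json",
    ],
    capture_output=True, text=True, check=True,
)

metrics = json.loads(result.stdout)
# Flatten the returned time series and flag any sample above an example threshold.
for metric in metrics.get("value", []):
    for series in metric.get("timeseries", []):
        for point in series.get("data", []):
            avg = point.get("average")
            if avg is not None and avg > 80:  # assumed example threshold
                print(f"High CPU at {point['timeStamp']}: {avg:.1f}%")
```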

Backup Verification

Validating backup and restore mechanisms is crucial to ensure data protection and recoverability.

  • Azure PaaS Backups: Procedures should be established and tested to verify the successful completion of automated backups for Azure PaaS services. This includes:
    • Azure Database for PostgreSQL: Confirming that Point-In-Time-Recovery (PITR) capabilities are active and backups are being successfully created according to the defined schedule and retention policy. This involves checking the Azure portal for backup status and performing test restores to a temporary instance.
    • Azure Storage (Blob, ADLS Gen2, Files): Verifying snapshot creation (if configured as part of the backup strategy) and the replication status (LRS for DEV/UAT, ZRS for PROD MVP) of storage accounts.
  • webgisdr Restore Drills:
    • Restore drills using the webgisdr utility are essential for validating the recovery of the ArcGIS Enterprise application state. This process involves:
      1. Taking a full webgisdr backup from a representative source environment.
      2. Restoring the resulting .webgissite backup file into a temporary, isolated environment (ideally a PROD-like "staging" environment created for this purpose).
      3. Validating the integrity of the restored Portal items (maps, apps, layers), ArcGIS Server configurations (services, site settings) and ArcGIS Data Store content (hosted feature layers).
    • These drills serve to confirm the viability of webgisdr as a disaster recovery tool for the ArcGIS Enterprise application state, familiarise the operations team with the restoration process and identify any potential issues in the backup or restore procedures.
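
A drill of this kind can be scripted so that it is repeatable and auditable. The sketch below is a minimal wrapper around the webgisdr utility; the tool path, properties files and validation URL are placeholders, and the exact command-line options should be confirmed against the Esri documentation for the deployed ArcGIS Enterprise version.

```python
"""Sketch of scripting a webgisdr restore drill (paths and options illustrative)."""
import subprocess
import requests

WEBGISDR = r"C:\Program Files\ArcGIS\Portal\tools\webgisdr\webgisdr.bat"  # placeholder path
EXPORT_PROPERTIES = "webgisdr_export.properties"   # source (PROD-like) environment
IMPORT_PROPERTIES = "webgisdr_import.properties"   # isolated drill environment

def run(args: list[str]) -> None:
    # Fail fast if webgisdr reports an error so the drill log captures it.
    subprocess.run(args, check=True)

# 1. Take a full backup of the representative source environment.
run([WEBGISDR, "--export", "--file", EXPORT_PROPERTIES])

# 2. Restore the resulting .webgissite file into the isolated environment
#    (the import properties file points at the backup location).
run([WEBGISDR, "--import", "--file", IMPORT_PROPERTIES])

# 3. Spot-check the restored Portal, e.g. confirm the sharing API responds.
resp = requests.get(
    "https://drill-portal.example.gov.au/portal/sharing/rest?f=json", timeout=60
)
resp.raise_for_status()
print("Restored Portal responded, version:", resp.json().get("currentVersion"))
```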

4.7.2 Bronze Stage: Scaling Validation and Cost Management

With auto-scaling capabilities for the ArcGIS Server Virtual Machine Scale Set (VMSS) and Azure App Service Plans (hosting Web Adaptors) introduced in the Production (PROD) environment during the Bronze stage, testing efforts shift towards validating these dynamic scaling mechanisms. Establishing robust cost management practices becomes paramount to ensure efficient cloud resource utilisation.

Load Testing and Scaling Validation

The primary objective of this testing phase is to ensure that the ArcGIS Server VMSS and the Azure App Service Plans scale efficiently, reliably and automatically under varying load conditions, maintaining optimal performance and availability.

  • Tools: Industry-standard load testing tools such as Azure Load Testing, Apache JMeter, or k6 can be used to generate realistic workloads (a Python-based load profile sketch follows this list).
  • Methodology:
    • Simulate realistic user loads, encompassing a mix of concurrent users and diverse request patterns (e.g., map rendering, feature queries, geoprocessing tasks). Test scenarios should reflect anticipated peak usage.
    • Gradually increase the simulated load from baseline to peak levels and then reduce it, to observe both scale-out and scale-in behaviours.
  • Verification:
    • Confirm that the auto-scaling rules defined for the ArcGIS Server VMSS (e.g., based on CPU utilisation and potentially other metrics) and for the Web Adaptor App Service Plans (e.g., based on CPU, memory, or HTTP queue length) trigger correctly and in a timely manner.
    • Monitor key performance indicators (KPIs) such as service response times, error rates and resource utilisation (CPU, memory, network I/O) across all tiers (Web, Application, Data) during scaling events.
    • Ensure the platform remains stable and responsive throughout the scaling operations, with no service degradation or failures.
    • Validate that new ArcGIS Server VMSS instances deploy automatically, successfully join the existing ArcGIS Server site (via IaC and automated Configuration Management scripts) and begin processing requests distributed by the load balancer.
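
As a Python-based illustration of the methodology above (offered as an alternative to the tools already named, not a prescribed choice), the following Locust sketch defines a simple mix of map-rendering and feature-query traffic. The service paths and request ratios are placeholders; real scenarios should mirror the anticipated peak request mix.

```python
"""A minimal Python-based load profile using Locust (illustrative only)."""
from locust import HttpUser, task, between

class MapViewerUser(HttpUser):
    # Simulated think time between requests for each virtual user.
    wait_time = between(1, 5)

    @task(3)
    def render_map(self):
        # Map rendering: export a small map image from a placeholder map service.
        self.client.get(
            "/arcgis/rest/services/Cadastre/MapServer/export",
            params={"bbox": "144.9,-37.9,145.0,-37.8", "size": "800,600", "f": "image"},
            name="map export",
        )

    @task(1)
    def query_features(self):
        # Feature query: attribute query against a placeholder feature layer.
        self.client.get(
            "/arcgis/rest/services/Assets/FeatureServer/0/query",
            params={"where": "1=1", "outFields": "*", "resultRecordCount": 10, "f": "json"},
            name="feature query",
        )
```

Ramp-up and ramp-down can then be driven from the Locust command line (for example by varying the user count and spawn rate) while the scaling behaviours and KPIs described above are observed.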

Cost Management Validation

Effective cost management is crucial for sustainable cloud operations. The Bronze stage includes activities to validate and operationalise cost control mechanisms.

  • Azure Cost Management + Billing Tools: Utilise Azure's native tools to establish financial governance:
    • Consider establishing budgets for the PROD environment (and potentially UAT/DEV with lower thresholds) within Azure Cost Management.
    • Configure spending alerts to notify relevant stakeholders when costs approach or exceed budgeted amounts.
    • Regularly analyse cost trends and review spending against allocated budgets to identify anomalies or areas for optimisation.
  • Resource Tagging: Enforce a comprehensive and consistent resource tagging strategy for all Azure resources deployed. Tags such as Environment (e.g., PROD, UAT, DEV), ApplicationName (e.g., eMap), Owner (e.g., DAIS) and ChargeCode (if applicable) are essential for accurate cost allocation and filtering within cost analysis reports (a tag-audit sketch follows this list).
  • Resource Utilisation Review:
    • Regularly review resource utilisation reports from Azure Monitor and Azure Cost Management for VMs, App Service Plans, databases and storage.
    • Identify any over-provisioned resources (which lead to unnecessary costs) or underutilised resources that could be downsized.
    • Document procedures for rightsizing VMs and adjusting PaaS service tiers based on actual demand and performance data collected during load testing and ongoing monitoring. This iterative optimisation process ensures cost-efficiency without compromising performance and availability.
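
The tagging strategy above lends itself to automated auditing. The sketch below uses the Azure CLI from Python to flag resources missing the required tags; the resource group name is a placeholder and the required-tag set simply mirrors the examples given in this section.

```python
"""Illustrative tag-audit helper using the Azure CLI."""
import json
import subprocess

REQUIRED_TAGS = {"Environment", "ApplicationName", "Owner"}  # ChargeCode treated as optional
RESOURCE_GROUP = "rg-emap-prod"  # placeholder

result = subprocess.run(
    ["az", "resource", "list", "--resource-group", RESOURCE_GROUP, "--output", "json"],
    capture_output=True, text=True, check=True,
)

for resource in json.loads(result.stdout):
    tags = resource.get("tags") or {}
    missing = REQUIRED_TAGS - tags.keys()
    if missing:
        # Untagged resources cannot be attributed in cost analysis reports.
        print(f"{resource['name']} ({resource['type']}): missing {sorted(missing)}")
```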

4.7.3 Silver Stage: High Availability Testing

The Silver stage introduces High Availability (HA) configurations for critical components within the Production (PROD) environment in Melbourne. Testing in this stage should focus on validating the resilience of the platform against failures, ensuring service continuity and data integrity.

High Availability Testing Methodology

The primary objective is to verify that the HA configurations for Portal for ArcGIS (active-passive pair), ArcGIS Server VMSS (distributed instances), ArcGIS Data Store (primary-standby pair), Azure Database for PostgreSQL (Same-Zone HA) and the Application Delivery Controller (ADC HA pair) function as designed.

  • Simulating Failures:
    • Systematic and controlled failure simulations must be conducted within the PROD Melbourne environment. This involves targeting individual components to observe the system's response. Examples include:
      • Simulating the failure of the active Portal for ArcGIS VM in the HA pair (e.g., by stopping the VM or its services).
      • Simulating the failure of the primary ArcGIS Data Store VM in its HA pair.
      • Simulating the failure of one ADC instance in its HA configuration.
      • Simulating the failure of one or more ArcGIS Server VMSS instances.
      • Triggering a failover for the primary Azure Database for PostgreSQL instance to its Same-Zone HA standby replica.
    • Azure Chaos Studio: Consider using Azure Chaos Studio for more sophisticated fault injection experiments. While Availability Zones are not present in Melbourne, Chaos Studio can still be used to simulate various failure scenarios such as VM shutdown, CPU pressure, or network latency against specific resources, helping to validate the resilience of HA configurations and application behaviour under stress.
  • Verification:
    • Confirm that automatic failover mechanisms engage correctly and within the expected timeframe for each tested component.
    • Measure the actual Recovery Time Objective (RTO) for each component failover. This should be compared against the target RTOs defined in the non-functional requirements (e.g., a target of <3 minutes for failover). A timing sketch for capturing this during a drill follows the list.
    • Verify data consistency and service availability post-failover. For instance, after a Portal VM failover, ensure users can still access the Portal and items are consistent. After a database failover, ensure services can reconnect and data integrity is maintained.
    • Ensure that monitoring systems (Azure Monitor, Application Insights) accurately detect and report failover events, providing visibility to the operations team.
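
Observed RTO can be captured objectively by polling a health endpoint while a failover is triggered. The sketch below is one way to do this; the health URL is a placeholder, and the failover itself is initiated separately (for example a planned or forced failover of the Azure Database for PostgreSQL instance, or stopping the active Portal VM) while the probe runs.

```python
"""Sketch of measuring observed RTO during a controlled failover drill."""
import time
import requests

HEALTH_URL = "https://emap.example.gov.au/arcgis/rest/info/healthcheck?f=json"  # placeholder
POLL_INTERVAL_SECONDS = 5

outage_started = None
while True:
    try:
        healthy = requests.get(HEALTH_URL, timeout=5).status_code == 200
    except requests.RequestException:
        healthy = False

    now = time.monotonic()
    if not healthy and outage_started is None:
        outage_started = now
        print("Outage detected; failover in progress...")
    elif healthy and outage_started is not None:
        rto = now - outage_started
        print(f"Service restored. Observed RTO: {rto:.0f}s (target < 180s)")
        break
    time.sleep(POLL_INTERVAL_SECONDS)
```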

Standard Operating Procedure (SOP) Refinement

As HA capabilities are introduced, operational procedures must be updated and validated.

  • Manual Scaling SOPs: Refine and validate existing SOPs for manual vertical scaling. While the Silver stage focuses on HA, understanding manual scaling procedures for components remains important, especially if temporary capacity boosts are needed or if certain components require adjustments.
  • Auto-Scaling Tuning SOPs: Document procedures for monitoring and fine-tuning the auto-scaling rules for the ArcGIS Server VMSS and the Web Adaptor App Service Plans. This includes guidance on adjusting metric thresholds, instance counts (minimum/maximum) and cooldown periods based on observed performance trends, evolving load patterns and the behaviour of the HA configurations.
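
To make the tuning SOP concrete, the sketch below reads the current autoscale profile for the ArcGIS Server VMSS and applies a revised instance range using the Azure CLI. The setting name, resource group, instance counts and the JSON field names shown are assumptions to be confirmed against the deployed configuration.

```python
"""Illustrative helper for the auto-scaling tuning SOP (names are placeholders)."""
import json
import subprocess

RESOURCE_GROUP = "rg-emap-prod"           # placeholder
AUTOSCALE_NAME = "autoscale-arcgis-vmss"  # placeholder

def az(*args: str) -> str:
    return subprocess.run(
        ["az", *args, "--output", "json"],
        capture_output=True, text=True, check=True,
    ).stdout

# Review the current profile (instance limits, metric rules, cooldowns).
setting = json.loads(az("monitor", "autoscale", "show",
                        "--resource-group", RESOURCE_GROUP, "--name", AUTOSCALE_NAME))
profile = setting["profiles"][0]
print("Current capacity:", profile["capacity"])
for rule in profile["rules"]:
    trigger = rule["metricTrigger"]
    print(f"  {trigger['metricName']} {trigger['operator']} {trigger['threshold']}")

# Apply a revised instance range after reviewing observed load (example values only).
az("monitor", "autoscale", "update",
   "--resource-group", RESOURCE_GROUP, "--name", AUTOSCALE_NAME,
   "--min-count", "2", "--max-count", "8", "--count", "2")
```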

4.7.4 Gold Stage: Disaster Recovery Drills

The Gold stage implements inter-region Disaster Recovery (DR) capabilities for the Production (PROD) environment, enabling failover from the primary Azure region (Melbourne) to the secondary DR region (Sydney). Regular (e.g., annual) DR drills are essential to validate these capabilities and ensure organisational preparedness for a major outage.

Achieving Enterprise Resilience

Successfully completing these phased testing and validation activities, from MVP functional checks through to comprehensive Gold Stage DR drills, alongside targeted user training and formal operational handover, ensures that the new eMap platform is not only technically sound but also operationally robust, resilient and aligned with critical business requirements.

Disaster Recovery (DR) Drill Execution

DR drills are comprehensive exercises designed to simulate a significant outage in the primary region and test the end-to-end process of failing over to and operating from the DR region.

  • Frequency: DR drills should be conducted regularly: at least annually, and ideally twice a year.
  • Scope: The drill should simulate a full Melbourne regional outage affecting the PROD environment. This involves assuming all Melbourne-based resources are unavailable.
  • Key Activities Tested During a DR Drill:
    1. DR Declaration and Global Server Load Balancer (GSLB) Failover:
      • Confirm that the automated DR detection mechanism (e.g., the Sydney-based Azure Function monitoring Melbourne's Web Tier health) correctly declares a DR event based on predefined failure thresholds (the detection logic is sketched after this list).
      • Verify that the GSLB automatically redirects user traffic from Melbourne to the Sydney DR environment's WAF/ADC endpoints. This redirection should be based on health probe failures detected by the GSLB.
    2. Data Tier Failover in Sydney:
      • Confirm the successful promotion of the Azure Database for PostgreSQL read replica in Sydney to a standalone, writable primary instance. This process breaks the replication from Melbourne.
      • Validate the failover of Azure Storage accounts (those configured with GRS) to the Sydney region, ensuring data replicated from Melbourne becomes accessible and writable in the DR site.
    3. Application Tier Activation in Sydney:
      • Verify the automated activation (e.g., starting stopped VMs, scaling up VMSS/App Service Plans) of the "pilot light" resources in Sydney to full production capacity.
      • Confirm the successful restoration of the ArcGIS Enterprise application state using webgisdr backups retrieved from the geo-replicated Azure Blob Storage. This is a critical step and involves validating the integrity of Portal items, ArcGIS Server configurations and ArcGIS Data Store content in Sydney.
    4. Full Application Functionality Validation in Sydney:
      • Thoroughly test critical application functionalities and key user workflows in the now-active Sydney DR environment. This ensures that the platform is not just "up" but fully operational and capable of supporting business processes.
    5. Failback Procedures Testing:
      • Test and validate the documented procedures for failing back operations to Melbourne once it has been restored and stabilised. Failback requires careful planning and execution, involving data re-synchronisation from Sydney back to Melbourne and managed traffic redirection.
  • Validation and Reporting:
    • Measure the actual Recovery Time Objective (RTO) and Recovery Point Objective (RPO) achieved during the DR drill. These measurements should be compared against the defined business continuity targets (e.g., target RPO <45 minutes for full service restoration in DR).
    • Verify data integrity and consistency in the DR site post-failover. This includes checking for any data loss or corruption.
    • Document all steps taken, timings for each phase, any issues encountered during the drill and the resolutions applied.
    • Conduct a post-drill review to identify lessons learned, areas for improvement and necessary updates to DR runbooks, automation scripts and configurations.
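
For reference, the DR-detection behaviour described in step 1 can be expressed as a simple probe loop. The sketch below captures only the logic; the endpoint, thresholds and declaration action are placeholders, and in the design this logic runs as the Sydney-based Azure Function rather than a standalone script.

```python
"""Sketch of the DR-detection logic: declare DR after sustained probe failures."""
import time
import requests

PRIMARY_HEALTH_URL = "https://emap.example.gov.au/arcgis/rest/info/healthcheck?f=json"  # placeholder
FAILURE_THRESHOLD = 5        # consecutive failed probes before declaring DR (example)
PROBE_INTERVAL_SECONDS = 60  # example interval

def primary_is_healthy() -> bool:
    try:
        return requests.get(PRIMARY_HEALTH_URL, timeout=10).status_code == 200
    except requests.RequestException:
        return False

def declare_dr_event() -> None:
    # Placeholder for the real actions: notify operators, confirm GSLB redirection
    # and start the documented Sydney activation runbook.
    print("DR event declared: sustained Melbourne Web Tier failure.")

consecutive_failures = 0
while True:
    consecutive_failures = 0 if primary_is_healthy() else consecutive_failures + 1
    if consecutive_failures >= FAILURE_THRESHOLD:
        declare_dr_event()
        break
    time.sleep(PROBE_INTERVAL_SECONDS)
```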