3.4: Data Categorisation and Storage
✂️ Tl;dr 🥷
This section establishes a data categorisation framework and corresponding Azure storage solutions for the new eMap platform. Data is classified as enterprise data (authoritative, long-term datasets requiring governance) or user-generated/temporary data (ad-hoc, short-term content). Enterprise data resides in Azure Database for PostgreSQL, benefiting from versioning, complex data modelling and enterprise integration capabilities. User-generated data is managed via ArcGIS Data Store for Portal uploads, web app features and transient analysis outputs, with automated retention policies. Platform infrastructure components use Azure Blob Storage for tile caches and server outputs, ADLS Gen2 (with Cloud Raster Format) for raster files and Azure Files for server configuration. Storage allocation follows strict criteria: PostgreSQL handles performance-critical services and governed datasets, ArcGIS Data Store supports self-service workflows, while object storage services address scalability needs for raster data, backups and system directories. This structure ensures appropriate lifecycle management, cost optimisation and performance alignment across data types.
3.4. Data Categorisation and Storage
This section establishes a clear framework for data categorisation and defines the appropriate Azure storage solutions for each category within the new eMap platform.
graph LR
A["🌐 Data Categorisation"] --> B["🏛️ Enterprise Data"]
A --> C["📤 User-Generated/Temporary"]
A --> D["🖥️ Platform Infrastructure"]
subgraph B["Enterprise Data Storage"]
direction LR
B1["📦 Azure Database for PostgreSQL<br>(Enterprise Geodatabase)"]
B1 --> B1a["⚖️ Authoritative Datasets"]
B1 --> B1b["🛡️ Versioned Editing"]
B1 --> B1c["🧩 Complex Data Models"]
B1 --> B1d["🌐 Enterprise Integration"]
end
subgraph C["User-Generated"]
direction LR
C1["⏳ ArcGIS Data Store"]
C1 --> C1a["📤 Portal Uploads"]
C1 --> C1b["🖥️ Web App Features"]
C1 --> C1c["📊 Analysis Outputs"]
C1 --> C1d["🔀 Promotion Pipeline"]
end
subgraph D["Platform Infrastructure"]
direction LR
D1["🗄️ Azure Blob Storage"] --> D1a["📁 Portal Content"]
D1 --> D1b["🗺️ Tile Caches"]
D1 --> D1c["⚙️ Server Jobs/Outputs"]
D2["🌄 ADLS Gen2 (Raster Store)"] --> D2a["🛰️ Satellite Imagery"]
D2 --> D2b["⛰️ Elevation Models"]
D2 --> D2c["📐 Cloud Raster Format"]
D3["📂 Azure Files"] --> D3a["⚙️ Server Config"]
D3 --> D3b["🖥️ System Directories"]
end
style A fill:#f0f5ff,stroke:#1a73e8
classDef enterprise fill:#e8f5e9,stroke:#4caf50
classDef temp fill:#fff3e0,stroke:#ff9800
classDef infra fill:#f3e5f5,stroke:#9c27b0
class B1 enterprise
class C1 temp
class D1,D2,D3 infra
3.4.1. Data Categories
Data classification informs storage allocation, management practices, security controls and lifecycle policies. The two primary categories are "Enterprise Data" and "User-Generated/Temporary Data."
3.4.1.1. Enterprise Data
Enterprise Data is defined as authoritative, curated, high-quality and often long-lived datasets that serve as systems of record or are broadly shared and consumed across the organisation.
- Characteristics:
- Serves as an official source of truth.
- Undergoes quality assurance and validation processes.
- Has a defined lifecycle, often long-term, with established update and maintenance procedures.
- Can be integrated with other systems.
- Management: Managed by dedicated GIS professionals, Data Stewards and Database Administrators (DBAs) following established data governance policies.
- Primary Storage: Stored within user-managed Azure Database for PostgreSQL instances, configured as Enterprise Geodatabases.
3.4.1.2. User-Generated/Temporary Data
User-Generated/Temporary Data encompasses ad-hoc datasets, project-specific content, outputs from self-service analysis tools, data directly uploaded by users to Portal for ArcGIS, or content with a deliberately short-term lifecycle.
- Characteristics:
- Often created for specific, immediate tasks or exploratory analysis.
- May not undergo formal quality assurance processes.
- Has a shorter, less defined lifecycle.
- Management: Primarily managed within the Esri-managed ArcGIS Data Store.
3.4.2. Data Storage Allocation Framework
The classification of data into these categories directly influences its allocation to specific storage services. The decision framework guides the placement of data to ensure it is managed appropriately throughout its lifecycle.
Diagram: Decision framework for allocating data to primary storage types based on categorisation.
This framework ensures that "Enterprise Data" benefits from the robust management capabilities of Azure Database for PostgreSQL, while "User-Generated/Temporary Data" is handled by the ArcGIS Data Store, aligning with its design for supporting Portal's self-service functionalities.
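The allocation decision can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the `DatasetProfile` fields and the `allocate_storage` function are hypothetical names, and the flags are assumed to mirror the enterprise content criteria listed in section 3.4.3.1, where meeting any one criterion is enough to place a dataset in the enterprise geodatabase.

```python
from dataclasses import dataclass
from enum import Enum


class Storage(Enum):
    ENTERPRISE_GEODATABASE = "Azure Database for PostgreSQL (Enterprise Geodatabase)"
    ARCGIS_DATA_STORE = "ArcGIS Data Store"


@dataclass(frozen=True)
class DatasetProfile:
    """Hypothetical descriptor; field names are illustrative assumptions."""
    authoritative: bool = False
    needs_versioning: bool = False
    complex_data_model: bool = False
    shared_or_integrated: bool = False
    long_term_governed: bool = False
    performance_critical: bool = False


def allocate_storage(profile: DatasetProfile) -> Storage:
    """Return the primary storage type for a dataset.

    Any single enterprise criterion routes the dataset to the enterprise
    geodatabase; otherwise it is treated as user-generated/temporary data.
    """
    enterprise_criteria = (
        profile.authoritative,
        profile.needs_versioning,
        profile.complex_data_model,
        profile.shared_or_integrated,
        profile.long_term_governed,
        profile.performance_critical,
    )
    if any(enterprise_criteria):
        return Storage.ENTERPRISE_GEODATABASE
    return Storage.ARCGIS_DATA_STORE


# An ad-hoc analysis output meets no enterprise criterion.
print(allocate_storage(DatasetProfile()).value)
# A dataset edited by multiple users with versioning is enterprise data.
print(allocate_storage(DatasetProfile(needs_versioning=True)).value)
```

In practice the decision would likely involve a data steward's judgement rather than boolean flags alone; the sketch only makes the "one or more criteria" rule explicit.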
3.4.3. Storage Services and Specific Criteria
The new eMap platform will leverage several Azure storage services, each designated for specific types of data based on the categorisation and functional requirements of ArcGIS Enterprise.
3.4.3.1. Azure Database for PostgreSQL (Enterprise Geodatabase)
- Role: This PaaS offering, with the PostGIS extension enabled, serves as the primary repository for all "Enterprise Data". It functions as the user-managed Enterprise Geodatabase.
- Content Criteria: Datasets stored here meet one or more of the following criteria:
- Authoritative Enterprise Datasets: Systems of record, curated datasets forming the backbone of the organisation's geospatial information, managed by GIS Engineers and DBAs.
- Data Requiring Robust Versioning: Datasets used in multi-user editing environments that necessitate long-term versioning, conflict detection and historical tracking capabilities.
- Complex Data Models and Relationships: Datasets with intricate schemas, dependencies, relational integrity constraints, or those that could benefit from advanced database features such as triggers or stored procedures.
- Data for Broad Sharing and Integration: Datasets intended for consumption by multiple business units, integration with other enterprise systems (e.g., Business Intelligence platforms such as Power BI, data warehouses, Azure Databricks), or requiring direct SQL access for advanced queries and analysis.
- Long-Term Persistence and Governance: Data with a defined long-term lifecycle, subject to formal data governance policies, stewardship and archival strategies.
- Performance-Critical Services: Datasets that underpin high-demand map, feature, or geoprocessing services where database-level tuning, indexing and query optimisation are crucial for performance.
- Management & Considerations:
- Full lifecycle management (schema changes, backups, performance tuning, versioning) is performed using native PostgreSQL tools and Azure capabilities.
- ArcGIS Server registers these databases to publish services.
3.4.3.2. ArcGIS Data Store
- Role: In Esri's reference architecture, the ArcGIS Data Store (Relational type) is a mandatory component of the base ArcGIS Enterprise deployment, supporting the hosting server functionality of Portal for ArcGIS. It manages an internal PostgreSQL instance on its dedicated Virtual Machine and is primarily used for "User-Generated/Temporary Data".
- Content Criteria: The use of the ArcGIS Data Store should be limited to:
- Portal Analysis Tool Outputs: Results generated by the built-in spatial analysis tools within the Portal Map Viewer (e.g., buffers, overlays).
- Features Created in Web Applications: Content generated directly by users within the Portal Map Viewer or other web applications for temporary, project-specific, or exploratory use.
- Field Collection App Data: Data submitted by field collection applications, which may be staged in the ArcGIS Data Store before potential cleansing, validation and promotion to an enterprise dataset in Azure Database for PostgreSQL.
- Transient or Short-Lifecycle Data: Datasets with a defined short-term lifecycle. A default retention policy (e.g., removal of data older than 90 days) will apply; datasets that need long-term persistence must be formally promoted and migrated to an enterprise geodatabase.
- Management & Considerations:
- This data store is Esri-managed in terms of its internal database instance.
- It is not intended for authoritative, long-term enterprise data storage.
- Its performance and capacity are tied to the underlying VM resources.
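The retention rule above can be illustrated with a short Python sketch. It is a simplified example under stated assumptions: the 90-day window is the example value from the text, age is measured from the last modification time (the text does not specify creation versus modification), and the `(name, modified)` pairs are a hypothetical structure rather than objects returned by any ArcGIS API.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 90  # example value from the text; the actual policy may differ


def items_past_retention(items, now=None, retention_days=RETENTION_DAYS):
    """Return the names of items last modified before the retention cutoff.

    `items` is an iterable of (name, modified) pairs, where `modified` is a
    timezone-aware datetime (a hypothetical structure for illustration).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    return [name for name, modified in items if modified < cutoff]


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
hosted_items = [
    ("buffer_result", datetime(2025, 1, 15, tzinfo=timezone.utc)),
    ("field_survey_staging", datetime(2025, 5, 20, tzinfo=timezone.utc)),
]
print(items_past_retention(hosted_items, now=now))  # prints ['buffer_result']
```

Items flagged this way would be candidates for either removal or formal promotion to the enterprise geodatabase, depending on the outcome of review.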
3.4.3.3. Azure Blob Storage
- Role: Azure Blob Storage provides versatile and scalable object storage for various components and outputs of the ArcGIS Enterprise deployment.
- Content Criteria:
- Portal for ArcGIS content directory: Stores item metadata, thumbnails and other files associated with Portal items. This is configured via the Portal Administrator API. Best practices such as enabling soft delete, versioning and resource locks should be applied to this container.
- arcgiscache directory: Registered as a Cloud Store with ArcGIS Server, this directory within Azure Blob Storage will house map and image service tile caches. The CompactV2 cache format is recommended for optimal performance when using cloud storage.
- ArcGIS Server jobs and output directories: These directories, critical for asynchronous geoprocessing services and other server operations, will also be registered as Cloud Stores pointing to Azure Blob Storage.
- webgisdr utility backups: Storage for backup files created by the webgisdr utility, used for ArcGIS Enterprise disaster recovery.
- Management & Considerations:
- Lifecycle management policies should be implemented to transition data between Hot, Cool and Archive tiers to optimise costs.
- Managed Identities should be used for secure access by ArcGIS Server components.
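A tiering policy of the kind described above could look like the following sketch, which builds an Azure Storage lifecycle management policy document as a plain Python dictionary. The rule name, the `backups/webgisdr` prefix (container name followed by a blob prefix) and the day thresholds are all illustrative assumptions, not values defined for this platform; real thresholds should follow observed access patterns.

```python
import json


def build_lifecycle_policy(prefix, cool_after_days, archive_after_days):
    """Build a lifecycle policy that tiers block blobs to Cool, then Archive."""
    if not 0 <= cool_after_days < archive_after_days:
        raise ValueError("cool_after_days must be >= 0 and less than archive_after_days")
    return {
        "rules": [
            {
                "enabled": True,
                "name": "tierDownByAge",  # hypothetical rule name
                "type": "Lifecycle",
                "definition": {
                    "filters": {
                        "blobTypes": ["blockBlob"],
                        "prefixMatch": [prefix],
                    },
                    "actions": {
                        "baseBlob": {
                            "tierToCool": {
                                "daysAfterModificationGreaterThan": cool_after_days
                            },
                            "tierToArchive": {
                                "daysAfterModificationGreaterThan": archive_after_days
                            },
                        }
                    },
                },
            }
        ]
    }


# Assumed thresholds for disaster recovery backups: Cool after 30 days, Archive after 180.
policy = build_lifecycle_policy("backups/webgisdr", cool_after_days=30, archive_after_days=180)
print(json.dumps(policy, indent=2))
```

The resulting JSON can be supplied to whichever deployment tooling the project uses. Archive tiering is generally unsuitable for content that must be read on demand, such as active tile caches, so a rule like this is better scoped to backups and similar cold data.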
3.4.3.4. Azure Data Lake Storage Gen2 (ADLS Gen2)
- Role: Azure Data Lake Storage Gen2, with its hierarchical namespace, is designated as the Raster Store for the new eMap platform. It will be registered as a Cloud Store with ArcGIS Server.
- Content Criteria & Format Requirements:
- All types of raster data, including satellite imagery, aerial photography, elevation models and thematic raster datasets.
- Primary Format Recommendation: Cloud Raster Format (CRF) is the preferred format for most analytical datasets and all new raster acquisitions.
- Compression: LERC (Limited Error Raster Compression) should be utilised, with a quality setting of approximately 75% (this is adjustable based on data characteristics and specific storage versus quality trade-offs).
- Tiling: A tile configuration of 512x512 pixels is recommended for optimal read performance.
- Pyramids (Overviews): Pyramids must be built for CRF datasets to ensure efficient display at various scales. Bilinear resampling is generally recommended for continuous data, while Nearest Neighbour may be appropriate for discrete data.
- Secondary Format Recommendation: Meta Raster Format (MRF) is recommended as an efficient source format for creating pre-rendered basemaps or highly optimised tile services. When these MRF-sourced services are cached, the resulting tile caches (e.g., CompactV2 format) would typically reside in Azure Blob Storage.
- Logical Hierarchical Namespace Structure: A logical folder structure should be implemented within ADLS Gen2 to organise raster data effectively. Exact structures should be defined in the planning phase of the project. Example structure includes:
/imagery/[collection_type]/[source_identifier]/[year_of_acquisition]/
/elevation/dem/[source_identifier]/
/basemaps/[map_name]/
/mosaic_definitions/[definition_name]/
/temp_processing/[project_or_user]/
- Management & Considerations:
- ADLS Gen2's hierarchical namespace enables efficient management of large raster collections.
- Lifecycle management policies should be applied to manage storage costs for extensive raster archives.
- Migration of legacy raster formats to CRF or MRF should be a planned activity.
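To keep folder names consistent with the example namespace structure above, paths could be generated rather than typed by hand. The sketch below is one possible helper; the function names and the normalisation convention (trimmed, lowercase, spaces replaced by underscores) are assumptions for illustration and should be replaced by whatever convention is agreed in the planning phase.

```python
import posixpath


def _segment(value):
    """Normalise one folder name: trim, lowercase, spaces to underscores."""
    text = str(value).strip().lower().replace(" ", "_")
    if not text or "/" in text:
        raise ValueError(f"invalid path segment: {value!r}")
    return text


def raster_folder(root, *parts):
    """Join a top-level folder and its sub-folders into an absolute path ending in '/'."""
    return posixpath.join("/", _segment(root), *(_segment(p) for p in parts)) + "/"


# /imagery/[collection_type]/[source_identifier]/[year_of_acquisition]/
print(raster_folder("imagery", "aerial", "Vendor A", 2023))  # prints /imagery/aerial/vendor_a/2023/
# /elevation/dem/[source_identifier]/
print(raster_folder("elevation", "dem", "lidar_survey"))  # prints /elevation/dem/lidar_survey/
```

The values `aerial`, `Vendor A` and `lidar_survey` are placeholders, not real collections or suppliers.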
3.4.3.5. Azure Files
- Role: Azure Files provides SMB shares, offering shared file access primarily for ArcGIS Server components.
- Content Criteria:
- ArcGIS Server config-store directory: This directory contains essential configuration files for the ArcGIS Server site and must be accessible by all machines.
- ArcGIS Server system directory: Contains files related to server operations and state.
- While ArcGIS Server jobs and output directories can be hosted on Azure Files, the recommendation for this deployment is to use Azure Blob Storage (as Cloud Stores) for these, due to better scalability and cost-effectiveness for potentially large and numerous output files.
- Management & Considerations:
- The Provisioned v1 (SSD) tier of Azure Files is recommended for the Production (PROD) environment to ensure adequate performance.
- The Provisioned v2 (HDD) tier of Azure Files should be used for Development (DEV) and User Acceptance Testing (UAT) environments to optimise costs.
- Appropriate SMB mount options should be configured to increase throughput and the number of connections.
3.4.3.6. ArcGIS Server Directory Storage
Server Directory | Azure Storage Service | Configuration Requirements |
---|---|---|
config-store | Azure Files | Mounted via SMB to all Server instances. PROD: Provisioned v1 (SSD), DEV/UAT: Provisioned v2 (HDD) |
system | Azure Files | Shared across all Server machines. Requires persistent storage for server state management |
jobs | Azure Blob Storage | Registered as Cloud Store with Managed Identity auth. Lifecycle policies for auto-cleanup |
output | Azure Blob Storage | Configured as Cloud Store. Default retention: 10 minutes via server cleanup process |
arcgiscache | Azure Blob Storage | Requires arcgiscache subfolder. Use CompactV2 cache format for cloud storage |
Raster Store | ADLS Gen2 | Registered as Cloud Store with hierarchical namespace. CRF/MRF formats with LERC compression |
Notes:
- Managed Identity Auth: Server VMSS instances use Azure Managed Identities with Storage Blob Data Contributor RBAC roles.
- Path Requirements: arcgiscache requires the exact subfolder name in the Blob container for tile cache recognition.
- Lifecycle Management: Blob Storage requires tiering policies (Hot → Cool → Archive) aligned with data access patterns.
This structured approach to data categorisation and storage allocation ensures that each type of data is managed using the most appropriate Azure service.