3.3: Inventory of Existing Datasets

✂️ Tl;dr 🥷

This section outlines the process of conducting an inventory of existing spatial datasets before migrating to the new eMap platform. The primary goal is to create a detailed catalogue of all current data assets, capturing their scope, volume and dependencies. This inventory is vital for migration planning, defining project scope, enabling robust data governance, identifying potential risks and prioritising datasets based on business criticality and usage. The methodology involves identifying all data sources, engaging with Data Owners and Data Stewards, collecting key metadata for each dataset, then assessing and prioritising the datasets. The key output is a comprehensive Dataset and Asset Register, a living document that will inform data categorisation, quality improvements and the overall transition strategy, ensuring data is well managed and used in the new cloud environment.

3.3.1. Purpose and Importance of the Dataset Inventory

The primary purpose of the dataset inventory is to create a detailed and accurate catalogue of all current spatial datasets within the existing eMap environment and other relevant data sources. This inventory serves several vital functions:

Migration Planning: Provides the foundational information required to plan the migration of data to the new eMap platform, including identifying data volumes, formats and dependencies.
Scope Definition: Helps to clearly define the scope of data to be managed, migrated and potentially transformed.
Data Governance Enablement: Supports the data categorisation process, data quality assessments and the application of lifecycle management policies.
Risk Identification: Uncovers potential challenges early, such as datasets with poor quality, unclear ownership, complex dependencies, or formats requiring significant transformation.
Prioritisation: Enables the prioritisation of datasets for migration and integration into the platform based on business criticality, usage and strategic importance.
Resource Allocation: Informs resource planning for data migration, quality improvement efforts and ongoing data management.

3.3.2. Methodology for Dataset Inventory

The inventory process should be systematic and collaborative, involving Data Owners, Data Stewards, GIS Engineers and other relevant technical staff. The following methodology is recommended:

flowchart TB
    subgraph Process["📥 Inventory Process"]
        direction TB
        A[("🚀 Start: Inventory Mandate")] --> B[/"🗂️ Identify Data Sources<br>(legacy eMap & Other Systems)"/]
        B --> C[/"👥 Engage Data Owners & Stewards<br>(Workshops & Interviews)"/]
        C --> D[/"📝 Collect Key Metadata<br>(Datasets, Services, Workflows)"/]
        D --> E[/"🔍 Assess Datasets<br>(Criticality, Usage, Quality)"/]
        E --> F[/"⭐ Prioritise for Migration<br>(New eMap Integration)"/]
    end

    F --> G[["📄 Output: Comprehensive<br>Dataset & Asset Register"]]

    subgraph NextSteps["⚙️ Downstream Processes"]
        G --> H1["📋 Migration Planning"]
        G --> H2["🏷️ Data Categorisation"]
        G --> H3["✅ Quality Improvement"]
    end

    style Process fill:#f5f5f5,stroke:#666,stroke-width:1px
    style NextSteps fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px
    style A fill:#fff9c4,stroke:#f57f17
    style G fill:#e8f5e9,stroke:#2e7d32

Diagram: High-level workflow for the dataset inventory process.

Identification of Data Sources:
- Compile a list of all potential locations where spatial data resides. This primarily includes the existing legacy eMap system (databases, file shares, service definitions) but must also extend to other departmental systems, shared drives, or individual repositories that may contain relevant geospatial datasets.
Stakeholder Engagement:
- Identify and work closely with designated Data Owners and Data Stewards for each identified data domain. Their knowledge is invaluable for identifying datasets, understanding their context and assessing their importance.
- Conduct workshops or interviews to gather information and validate findings.
Data Collection (Metadata Gathering):
- For each identified dataset, systematically collect a defined set of metadata attributes.
- This may involve a combination of:
  - Automated tools or scripts (e.g., to query database catalogues, scan file systems for GIS data types).
  - Manual inspection of legacy eMap configurations and documentation.
  - Information provided by Data Owners and Stewards.
- The scope of this collection extends beyond just datasets to include related services, applications and user workflows to understand dependencies and usage context.
Assessment and Prioritisation:
- Once metadata is collected, each dataset will be assessed against criteria such as business criticality, usage patterns, data quality and technical complexity.
- This assessment will inform its prioritisation for migration to the new eMap platform.
Documentation and Output (Dataset Register):
- All collected information will be compiled into a central, comprehensive Dataset and Asset Register. This register will serve as the single source of truth for the existing geospatial assets and their characteristics.
- This register should be a living document, maintained and updated as the project progresses.
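
The automated scanning mentioned under Data Collection could be sketched roughly as follows. This is a minimal illustration only, not a prescribed tool: the extension lists, the function names (`scan_for_gis_data`, `dir_size_bytes`) and the record keys are all assumptions made for this example, and a real audit would likely need a GIS library to enumerate the feature classes inside a geodatabase.

```python
import os
from pathlib import Path

# Assumed mapping of file extensions to format labels; extend as needed.
GIS_FILE_EXTENSIONS = {
    ".shp": "Shapefile",
    ".gpkg": "GeoPackage",
    ".geojson": "GeoJSON",
    ".kml": "KML",
    ".tif": "GeoTIFF",
}
# Esri file geodatabases are folders whose name ends in .gdb.
GIS_DIR_EXTENSIONS = {".gdb": "Esri File Geodatabase"}


def dir_size_bytes(path):
    """Total size of all files beneath a directory."""
    return sum(f.stat().st_size for f in Path(path).rglob("*") if f.is_file())


def scan_for_gis_data(root):
    """Walk `root` and return one candidate record per GIS file or .gdb folder."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Record each *.gdb folder as a single entry and do not descend into it.
        for d in list(dirnames):
            fmt = GIS_DIR_EXTENSIONS.get(Path(d).suffix.lower())
            if fmt:
                full = Path(dirpath) / d
                found.append({
                    "path": str(full),
                    "format": fmt,
                    "size_bytes": dir_size_bytes(full),
                })
                dirnames.remove(d)
        for name in filenames:
            fmt = GIS_FILE_EXTENSIONS.get(Path(name).suffix.lower())
            if fmt:
                full = Path(dirpath) / name
                # Size of the matched file only; shapefile sidecar files
                # (.dbf, .shx and so on) are not included here.
                found.append({
                    "path": str(full),
                    "format": fmt,
                    "size_bytes": full.stat().st_size,
                })
    return found
```

Calling `scan_for_gis_data` on a shared drive root would return a list of dictionaries that can seed the Source System, Current Format and Approximate Size fields of the register. The results still need validation with Data Owners and Stewards, since a file scan cannot establish ownership or business context.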

3.3.3. Key Metadata to Document

For each dataset and, where applicable, associated assets (services, applications, workflows), the following key metadata attributes must be documented. This information will form the core of the Dataset and Asset Register.

Metadata Field	Description	Example	Data Source Indication
Dataset Name	Official or common name of the dataset.	`Grasslands`	Legacy eMap, DB Schema, File Name
Description	Brief summary of the dataset's content, purpose and scope.	`Authoritative property boundaries for Victoria`	Steward/Owner Input, Existing Docs
Data Owner	The business unit or individual ultimately accountable for the dataset.	`Land Use Victoria`	Stakeholder Engagement
Data Steward	The individual or team responsible for the day-to-day management and quality of the dataset.	`Jane Doe (LUV)`, `DAIS Team A`	Stakeholder Engagement
Source System	The system or location where the dataset currently resides or originates.	`Legacy eMap SQL Server`, `SharedDrive/Projects/Environment`	Technical Audit, Steward Input
Current Format	The existing data format or type.	`Esri File Geodatabase`, `Shapefile`, `SQL Server`	Technical Audit
Approximate Size / Data Volume	Estimated storage size or number of features/records.	`1.5 GB`, `~2.5 million polygons`	File System, DB Query, Technical Audit
Update Frequency	How often the dataset is updated.	`Daily`, `Monthly`, `As required`, `Static`	Steward/Owner Input, Existing Docs
Data Sensitivity	Classification of the dataset's sensitivity.	`Official`, `Sensitive`, etc.	Security Policy, Steward Input
Dependencies (Consumed By)	List of other datasets, services, applications, or workflows that rely on this dataset.	`Tarnook Map X`, `VertiGIS Service Y`, `PowerBI Report Z`	Technical Audit, Steward Input
Dependencies (Relies On)	List of other datasets or systems this dataset depends on for its creation or updates.	`Source System A`, `Dataset B for geometry updates`	Technical Audit, Steward Input
Complexity	Assessment of the dataset's structural or semantic complexity (e.g., intricate data model, many attributes).	`High (complex relationships)`, `Low (simple feature class)`	Technical Review, Steward Input
Known Quality Issues	Documented problems with accuracy, completeness, consistency, currency, or lineage.	`Incomplete attribution for records pre-2018`, `Topology errors`	Steward Input, User Feedback, QA Docs
Business Criticality	The importance of the dataset to business operations.	`High (Essential for emergency response)`, `Medium`, `Low`	Owner/Steward Input, Usage Analysis
Usage Patterns	How, by whom and how frequently the dataset is used.	`High (Daily use by 100+ users)`, `Ad-hoc analysis by GIS team`	Usage Logs, Steward/Owner Input
Associated Services	Any map, feature, or geoprocessing services published from this dataset in legacy eMap.	`/arcgis/rest/services/Basemaps/MapServer`	Legacy eMap Audit
Associated Applications	Custom applications or integrations that consume this dataset or its services.	`Internal Asset Viewer App`, `Mobile Field Collector`	Technical Audit, User Feedback
Associated User Workflows	Key business processes or user workflows that depend on this dataset.	`Strategic Fuel Breaks`	Stakeholder Engagement
Migration Priority Notes	Initial assessment of priority for migration to the new eMap.	`High (Core dataset, migrate early)`, `Low (Archive candidate)`	Preliminary Assessment
New eMap Target Storage	Preliminary indication of target storage in new eMap (e.g., Azure PostgreSQL, ADLS Gen2).	`Azure PostgreSQL`, `ADLS Gen2 (CRF)`	Based on Data Categorisation
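
To make the table above more concrete, a register entry could be modelled as a simple record type. The sketch below is illustrative only: the class name `RegisterEntry`, the field names, the `missing_fields` and `write_register` helpers, and the CSV layout are assumptions for this example, and only a subset of the metadata fields is shown.

```python
import csv
from dataclasses import dataclass, field, asdict, fields


@dataclass
class RegisterEntry:
    """One row of the Dataset and Asset Register (subset of fields)."""
    dataset_name: str
    description: str = ""
    data_owner: str = ""
    data_steward: str = ""
    source_system: str = ""
    current_format: str = ""
    approximate_size: str = ""
    update_frequency: str = ""
    data_sensitivity: str = ""
    consumed_by: list = field(default_factory=list)  # Dependencies (Consumed By)
    relies_on: list = field(default_factory=list)    # Dependencies (Relies On)
    business_criticality: str = ""
    known_quality_issues: str = ""


def missing_fields(entry):
    """Names of fields still empty, useful for tracking inventory completeness."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]


def write_register(entries, stream):
    """Write entries as CSV; list-valued fields are joined with '; '."""
    writer = csv.DictWriter(stream, fieldnames=[f.name for f in fields(RegisterEntry)])
    writer.writeheader()
    for entry in entries:
        row = {
            key: "; ".join(value) if isinstance(value, list) else value
            for key, value in asdict(entry).items()
        }
        writer.writerow(row)
```

A fixed schema like this makes gaps visible early: `missing_fields` can be run across the whole register before each stakeholder workshop to show which attributes still need input from Data Owners or Stewards.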

3.3.4. Assessment Criteria for Prioritisation

Once the metadata is collected, each dataset will be assessed to inform its migration priority. Key assessment criteria include:

Business Criticality:
- Datasets essential for core organisational functions, legal mandates, or public safety will receive higher priority.
- Input from Data Owners is crucial for this assessment.
Usage Patterns:
- Frequently used datasets supporting a large user base or critical applications will generally be prioritised.
- Understanding who uses the data and for what purpose helps gauge impact.
Data Quality:
- Datasets with known significant quality issues may require remediation before or during migration, potentially affecting their timeline.
- High-quality, authoritative datasets might be prioritised to establish a solid foundation in the new eMap platform.
Technical Complexity & Dependencies:
- Datasets with complex structures, transformations, or numerous dependencies (both upstream and downstream) may require more planning and effort, influencing their position in the migration sequence.
- Highly interconnected datasets might be grouped for migration.
Strategic Alignment:
- Datasets supporting key strategic initiatives or new capabilities planned for the new eMap may be prioritised.
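
One way to turn these criteria into a comparable ranking is a simple weighted score. The sketch below is hypothetical: the weights, the Low/Medium/High scale and the names `WEIGHTS`, `RATING`, `priority_score` and `rank_datasets` are assumptions chosen for illustration, not values defined by this document. Technical complexity is deliberately left out of the score, because it influences sequencing and grouping rather than simply raising or lowering priority.

```python
# Hypothetical integer weights; agree the real values with Data Owners.
WEIGHTS = {
    "business_criticality": 4,
    "usage": 3,
    "strategic_alignment": 2,
    "data_quality": 1,  # "High" here means good quality
}
RATING = {"Low": 1, "Medium": 2, "High": 3}


def priority_score(ratings):
    """Weighted sum of Low/Medium/High ratings for one dataset."""
    return sum(WEIGHTS[criterion] * RATING[ratings[criterion]] for criterion in WEIGHTS)


def rank_datasets(assessments):
    """Dataset names ordered from highest to lowest priority score."""
    return sorted(
        assessments,
        key=lambda name: priority_score(assessments[name]),
        reverse=True,
    )
```

A score of this kind is best treated as a starting point for discussion. Legal mandates or public safety obligations may justify overriding the computed order.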

3.3.5. Expected Output: The Dataset and Asset Register

The primary output of this inventory process will be a comprehensive Dataset and Asset Register. This register, likely maintained in a shared spreadsheet, database, or dedicated cataloguing tool, will provide a detailed and structured view of all existing geospatial data assets.

This register will be a living document, serving as a vital input for:

Data Categorisation and Storage: Classifying data for appropriate storage in the new eMap platform, ensuring compliance with data sovereignty requirements.
Data Quality Standards and Metrics: Establishing baselines for quality improvement.
Transition & Legacy Decommissioning: Guiding the detailed migration plan.
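
As an illustration of how the register might feed the migration plan, the "Dependencies (Relies On)" field can be used to group interconnected datasets and to order them so that upstream datasets migrate first. The sketch below assumes the dependencies have been exported as a dictionary mapping each dataset to the set of datasets it relies on. The function names `migration_order` and `migration_groups` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def migration_order(relies_on):
    """Order datasets so that each one comes after everything it relies on."""
    return list(TopologicalSorter(relies_on).static_order())


def migration_groups(relies_on):
    """Group datasets linked by any dependency (connected components)."""
    # Build an undirected adjacency map from the directed dependencies.
    neighbours = {}
    for dataset, deps in relies_on.items():
        neighbours.setdefault(dataset, set())
        for dep in deps:
            neighbours[dataset].add(dep)
            neighbours.setdefault(dep, set()).add(dataset)

    groups, seen = [], set()
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups
```

If the recorded dependencies contain a cycle, `static_order()` raises `graphlib.CycleError`. This usually signals either a data entry error in the register or a set of datasets that must be migrated together.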

Systematically inventorying and assessing existing datasets lays a robust foundation for a successful transition to the new eMap platform, ensuring that data assets are effectively managed, leveraged and governed in the new cloud environment.