Using the Midden Dataset Editor
Metadata is the heart of Midden and the tools to create metadata are the Dataset Editor and the Project Editor. The Dataset Editor is covered here.
NOTE: The Dataset Editor can be accessed through the browser by going to: {your midden address}/editor/dataset. For example, if you deployed using GitHub Pages under an organization called "TheRads", the address will be: https://therads.github.io/Midden/editor/dataset
Midden is designed to allow the creation of metadata early in the data lifecycle, so there are few required fields. Similarly, although Midden is designed to foster good data management practices, it prefers flexibility and agility over rigid convention. This places responsibility on the metadata creators: the quality of the resulting data catalog depends heavily on the thought that is put into the metadata.
Midden is not opinionated, but it should be used in an opinionated way.
The whole purpose of the Dataset Editor is to create a file that describes a dataset. Basically, a person uses the Dataset Editor to describe a dataset, downloads a .midden file, and places the file alongside the dataset.
- Create a new metadata file by clicking the "New" button, or edit existing metadata by clicking "Upload" or by clicking "Edit" while viewing a dataset
- Edit the metadata fields (see "Metadata fields" below for details)
- Review the information by clicking "Preview"
- Download the .midden file by clicking "Download"
- Move the downloaded .midden file to the location where the associated dataset is saved
This section contains essential information pertaining to the described dataset. Metadata is considered adequate if these fields are completed.
The data zone that the dataset belongs to.
NOTE: Items in the dropdown menu are populated by the zones array in the app-config.json file.
The name of the dataset.
NOTE: The "Name" here also determines the name of the .midden file that is created by the Editor. For example, a "Name" value of "MyDataset_v1" would result in a file called "MyDataset_v1.midden".
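The naming rule in the note above can be sketched in a few lines. This is only an illustration of the stated behavior; the function name midden_filename is hypothetical and is not part of Midden.

```python
def midden_filename(name: str) -> str:
    """Return the file name described above for a given "Name" value.

    Hypothetical helper: it only mirrors the rule stated in the note
    ("Name" plus the .midden extension), not the Editor's actual code.
    """
    return f"{name}.midden"


print(midden_filename("MyDataset_v1"))  # MyDataset_v1.midden
```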
The name of the project that the dataset belongs to.
A description of the dataset. Can be short, but should include enough information for a data user to understand the basic origin and purpose of the data.
BEST PRACTICE: If the metadata is updated, a timestamp and description of the update should be included here. For example: "2021-03-03: Added variable definitions"
Contact information for contributors to the dataset. Because Midden protects the data itself by not providing download links, the contact information here is important for potential data-users to start a conversation about access.
The name of the contact person.
The email address of the contact person.
The role of the contact person.
NOTE: Items in the dropdown menu are populated by the roles array in the app-config.json file. The default values for roles are from the ISO 19115 metadata standard, as described here: https://wiki.esipfed.org/ISO_19115-3_Codelists#CI_RoleCode
Tags (i.e. "labels", "hashtags") are used to make the dataset more discoverable. The "Catalog" supports browsing datasets by tags so users can find similar datasets that have the same tag.
It is recommended that a dataset should contain at least a few tags. The value of the tags should be as consistent as possible.
NOTE: Items in the dropdown menu are populated by the tags array in the app-config.json file. The default values for tags are from the ISO 19115 metadata standards and the EPA metadata standards.
BEST PRACTICE: Consistency in tag names is important for data discovery. It is recommended that organizations have custom tags with an [Org] prefix defined in the app-config.json file, where "Org" can be any short term that represents the organization.
This represents the variables within the dataset; i.e. the data dictionary.
The name of the variable (e.g. a column header in a csv file or a field name in a shapefile/geojson).
BEST PRACTICE: Some datasets, such as Excel Workbooks, have nested data structures where variable names may be spread across different Worksheets. To specify such names, a forward slash "/" can be used as a separator: e.g. worksheet1/myVariable.
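A consumer of the metadata could split such a name back into its parts. The sketch below assumes the text after the last "/" is the variable and everything before it is the container (for example a worksheet); the function name split_variable_name is hypothetical.

```python
from typing import Optional, Tuple


def split_variable_name(name: str) -> Tuple[Optional[str], str]:
    """Split "container/variable" into (container, variable).

    Assumption: the last "/" separates the container from the variable.
    Names without a "/" have no container.
    """
    container, separator, variable = name.rpartition("/")
    return (container if separator else None, variable)


print(split_variable_name("worksheet1/myVariable"))  # ('worksheet1', 'myVariable')
print(split_variable_name("myVariable"))             # (None, 'myVariable')
```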
The description of the variable; should include any coded values, expected types, ranges, etc. Also describes formats (e.g. ISO 8601 dates) and meaning of qualitative values (if units do not apply).
The units of the variable, if applicable
A list of method details that are specific to the variable; sensors, analytic equipment, etc.
A list of quality control checks that have been applied to the variable. These are intended to be general categories of checks that can be used to filter variables when searching for certain quality. Specific details of the quality control checks can be described in the methods field.
NOTE: Items in the dropdown menu are populated by the qualityControlTags array in the app-config.json file. The default values are those used by the Cook Agronomy Farm LTAR site, as described here: https://docs.google.com/document/d/1ufsDxVAh0E_PTHp-uGKmPzok3adPvPEYbxuFp5A8Uds
Indication of the origin of the value. Similar to the Quality Control tag, this is intended to be a general category used to filter variables when searching for a certain level of processing (e.g. raw data vs modeled data). Details of the processing should be defined in the methods field.
NOTE: Items in the dropdown menu are populated by the processingLevels array in the app-config.json file. The default values are those used by the Cook Agronomy Farm LTAR site, as described here: https://docs.google.com/document/d/1ufsDxVAh0E_PTHp-uGKmPzok3adPvPEYbxuFp5A8Uds
This provides additional filtering ability and further context to the variable. Examples based on statistical fields are "discrete", "continuous", "nominal", etc.
NOTE: Items in the dropdown menu are populated by the variableType array in the app-config.json file. The default values are those used by the Cook Agronomy Farm LTAR site and are loosely based on definitions in dimensional modeling. A "dimension" describes the "who, what, where, when, why, and how". A "metric" is a measurement (quantitative or, stretching the formal definition, nominal/ordinal).
A list of tags specific to each variable.
BEST PRACTICE: Any controlled terms that can be used as an analog to the variable name can be specified here.
The height of the measurement relative to the ground; positive indicates above ground, negative indicates below ground.
NOTE: This field is deprecated and will likely not be used in future versions of Midden.
These variables are used to specify the spatial and temporal coverage of the dataset.
The number of locations of repeated measurements that are represented in the dataset. For example, a dataset that contains soil temperature measurements from sensors at five different locations, each buried at five different depths, would have a spatial repeats value of 25.
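The arithmetic behind the example is simply the number of locations multiplied by the number of depths measured at each location. The variable names below are illustrative only.

```python
# Worked example from the text: five locations, five depths at each location.
locations = 5
depths_per_location = 5

# Each location/depth combination is one place of repeated measurement.
spatial_repeats = locations * depths_per_location
print(spatial_repeats)  # 25
```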
The area where the data were collected or that the data represent. This should be valid GeoJSON: a point, line, or polygon.
NOTE: Items in the dropdown menu are populated by the geometries array in the app-config.json file.
BEST PRACTICE: Although any polygon is valid, it is recommended that a bounding box be used instead of a complex polygon. The reason for this is to reduce the file-size of the generated metadata.
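One way to follow this practice is to reduce a detailed polygon to its bounding box before pasting it into the editor. The sketch below assumes a GeoJSON Polygon whose first ring is the exterior ring, with positions in longitude, latitude order; the function name and sample coordinates are hypothetical.

```python
def bounding_box_polygon(geometry: dict) -> dict:
    """Return a rectangular GeoJSON Polygon that encloses the input Polygon."""
    exterior_ring = geometry["coordinates"][0]
    lons = [position[0] for position in exterior_ring]
    lats = [position[1] for position in exterior_ring]
    west, east = min(lons), max(lons)
    south, north = min(lats), max(lats)
    # Counter-clockwise ring that closes on its first position.
    return {
        "type": "Polygon",
        "coordinates": [[
            [west, south], [east, south], [east, north],
            [west, north], [west, south],
        ]],
    }


# Hypothetical triangle, used only to show the reduction.
triangle = {
    "type": "Polygon",
    "coordinates": [[[-117.2, 46.7], [-117.0, 46.9], [-117.1, 46.6], [-117.2, 46.7]]],
}
print(bounding_box_polygon(triangle)["coordinates"][0])
```

A bounding box always has five positions, so the stored geometry stays small regardless of how detailed the original outline was.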
NOTE: Until Midden has an embedded map tool, consider using the online tool https://geojson.io to obtain valid GeoJSON. Copy the geometry object, starting with the opening curly brace "{" and including everything through the closing curly brace "}". E.g.: {"type":"Polygon","coordinates":[[[...]]]}
The frequency at which the variables of the dataset were measured. Air temperature measured every 15 minutes may have the value of 15 min. A dataset that contains plant community survey data taken annually may have a value of 1 year or annually.
BEST PRACTICE: Be consistent with how temporal resolution is defined to make it more machine readable: e.g. choose between using 1 year or annually and do not mix them.
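If a catalog already contains mixed spellings, a small alias table can bring them to one form. This is a sketch only: the alias table and the function name normalize_resolution are hypothetical, and Midden itself does not prescribe a canonical form.

```python
# Hypothetical alias table: map each alternative spelling to the form the
# organization has chosen as canonical.
RESOLUTION_ALIASES = {
    "annually": "1 year",
    "yearly": "1 year",
}


def normalize_resolution(value: str) -> str:
    """Return the canonical spelling of a temporal resolution value."""
    cleaned = value.strip().lower()
    return RESOLUTION_ALIASES.get(cleaned, cleaned)


print(normalize_resolution("Annually"))  # 1 year
print(normalize_resolution("15 min"))    # 15 min
```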
The starting and ending dates of the period during which the data were collected.
BEST PRACTICE: Consider using the ISO 8601 format for time-intervals: e.g. 1997-07-16/1997-07-17 corresponds to a time-period starting on July 16, 1997, and ending on July 17, 1997.
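An interval written this way is straightforward to read back in a script. The sketch below handles only the date/date form shown in the example, not the full range of ISO 8601 interval notations; the function name parse_interval is hypothetical.

```python
from datetime import date
from typing import Tuple


def parse_interval(interval: str) -> Tuple[date, date]:
    """Parse a "start/end" interval of ISO 8601 calendar dates."""
    start_text, end_text = interval.split("/")
    start = date.fromisoformat(start_text)
    end = date.fromisoformat(end_text)
    if end < start:
        raise ValueError(f"interval ends before it starts: {interval}")
    return start, end


start, end = parse_interval("1997-07-16/1997-07-17")
print(start, end)  # 1997-07-16 1997-07-17
```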
Use these fields to specify the structure of the dataset to aid in machine-readability. Ideally, a consumer of the metadata should have enough information to read the dataset without any further exploration (e.g. a person can write a script to download the data).
The format that the data are stored in. This could be a file extension (e.g. .json, .txt, .jpg), general category (e.g. tabular, image, time-series), or some standard (e.g. MIME types: text/csv, image/gif, application/java-archive).
A description of the directory and file structure within the dataset folder, if applicable. For example, this can be used to describe a dataset comprised of time-series files generated every hour and separated into monthly folders: {YYYY-MM}/{DD}T{hh}:{mm}_{VariableName}.csv
A description of the file path template in which each placeholder is explained. E.g. "{YYYY-MM} is the four digit year (YYYY) and two digit month (MM) that data were collected...".
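With the template and its description in hand, a script can generate the expected path for any file in the dataset. The sketch below expands the example template {YYYY-MM}/{DD}T{hh}:{mm}_{VariableName}.csv. It assumes that {DD}, {hh}, and {mm} mean two digit day, hour, and minute, which the example description above does not spell out; the function name and the variable name AirTemp are hypothetical.

```python
from datetime import datetime


def expand_template(timestamp: datetime, variable_name: str) -> str:
    """Build the relative path for one file from the example template.

    Assumed placeholder meanings: YYYY-MM = year and month, DD = day,
    hh = hour (24 hour clock), mm = minute.
    """
    return (
        f"{timestamp:%Y-%m}/"
        f"{timestamp:%d}T{timestamp:%H}:{timestamp:%M}_{variable_name}.csv"
    )


print(expand_template(datetime(2021, 3, 3, 14, 0), "AirTemp"))
# 2021-03/03T14:00_AirTemp.csv
```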
A category tag that broadly indicates how the data are structured.
NOTE: Items in the dropdown menu are populated by the datasetStructures array in the app-config.json file. The default values, Single and Multiple, represent, respectively, a dataset containing multiple files of different versions of the dataset and a dataset comprised of multiple files that can be aggregated together.
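Since several dropdown menus in the editor depend on arrays in app-config.json, it can help to check that a customized config still defines all of them. The sketch below uses the array names mentioned on this page and assumes they sit at the top level of the file, which this page does not confirm; the function name and the sample values are hypothetical.

```python
import json

# Array names mentioned on this page. Their placement at the top level of
# app-config.json is an assumption.
EXPECTED_ARRAYS = [
    "zones", "roles", "tags", "qualityControlTags",
    "processingLevels", "variableType", "geometries", "datasetStructures",
]


def missing_arrays(config: dict) -> list:
    """Return the expected keys that are absent or are not JSON arrays."""
    return [key for key in EXPECTED_ARRAYS if not isinstance(config.get(key), list)]


# Hypothetical, incomplete config used only to demonstrate the check.
example = json.loads('{"zones": ["Raw"], "roles": ["Author"]}')
print(missing_arrays(example))
```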
These fields are used to describe how the dataset was created and any associated products.
The methods used to generate the dataset. Depending upon scope, this could include field methods, data processing, data pipelines, and so on.
This is used to specify datasets that this dataset was derived from. Values are expected to be linked resources (URL/DOI) but a citation or reference is fine. This field is important for documenting data lineage.
NOTE: Listing the full URL of metadata in your Midden catalog is encouraged and may be formally supported in the future (perhaps as a visualization of the dependency graph).
This is used to indicate related products that use the dataset; published papers, presentations, decision support tools, etc.