[#5472] improvement(docs): Add example to use cloud storage fileset and polish hadoop-catalog document. (#6230)

### What changes were proposed in this pull request?

1. Add a full example of how to use cloud storage filesets such as S3, GCS, OSS, and ADLS.
2. Polish how-to-use-gvfs.md and hadoop-catalog.md.
3. Add documentation on how filesets use credentials.

### Why are the changes needed?

For better user experience.

Fix: #5472


### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

N/A

Co-authored-by: Qi Yu <[email protected]>
github-actions[bot] and yuqi1129 authored Jan 14, 2025
1 parent 6c9a0d0 commit dec7ea0
Showing 9 changed files with 2,157 additions and 293 deletions.
4 changes: 2 additions & 2 deletions clients/client-python/gravitino/filesystem/gvfs_config.py
@@ -42,8 +42,8 @@ class GVFSConfig:
GVFS_FILESYSTEM_OSS_SECRET_KEY = "oss_secret_access_key"
GVFS_FILESYSTEM_OSS_ENDPOINT = "oss_endpoint"

- GVFS_FILESYSTEM_AZURE_ACCOUNT_NAME = "abs_account_name"
- GVFS_FILESYSTEM_AZURE_ACCOUNT_KEY = "abs_account_key"
+ GVFS_FILESYSTEM_AZURE_ACCOUNT_NAME = "azure_storage_account_name"
+ GVFS_FILESYSTEM_AZURE_ACCOUNT_KEY = "azure_storage_account_key"

# This configuration marks the expired time of the credential. For instance, if the credential
# fetched from Gravitino server has expired time of 3600 seconds, and the credential_expired_time_ration is 0.5
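The renamed keys above are the option names a GVFS client passes for Azure Blob Storage credentials. Below is a minimal sketch (not part of this commit) of how they might be supplied to the Python GVFS client; the `GVFSConfig` constants come from the file above, while the server URI, metalake name, account values, fileset path, and the `GravitinoVirtualFileSystem` constructor arguments are assumptions based on the client's documented usage.

```python
# Sketch only: the option keys come from GVFSConfig above; everything else is a placeholder.
from gravitino.filesystem import gvfs
from gravitino.filesystem.gvfs_config import GVFSConfig

options = {
    GVFSConfig.GVFS_FILESYSTEM_AZURE_ACCOUNT_NAME: "my_storage_account",
    GVFSConfig.GVFS_FILESYSTEM_AZURE_ACCOUNT_KEY: "my_storage_account_key",
}

# The fsspec-style virtual file system resolves gvfs:// paths through the Gravitino server.
fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="metalake",
    options=options,
)
print(fs.ls("gvfs://fileset/adls_catalog/adls_schema/example_fileset"))
```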
26 changes: 26 additions & 0 deletions docs/hadoop-catalog-index.md
@@ -0,0 +1,26 @@
---
title: "Hadoop catalog index"
slug: /hadoop-catalog-index
date: 2025-01-13
keyword: Hadoop catalog index S3 GCS ADLS OSS
license: "This software is licensed under the Apache License version 2."
---

### Hadoop catalog overview

The Gravitino Hadoop catalog index includes the following chapters:

- [Hadoop catalog overview and features](./hadoop-catalog.md): This chapter provides an overview of the Hadoop catalog, its features, capabilities and related configurations.
- [Manage Hadoop catalog with Gravitino API](./manage-fileset-metadata-using-gravitino.md): This chapter explains how to manage fileset metadata using Gravitino API and provides detailed examples.
- [Using Hadoop catalog with Gravitino virtual file system](./how-to-use-gvfs.md): This chapter explains how to use the Hadoop catalog with the Gravitino virtual file system and provides detailed examples.

### Hadoop catalog with cloud storage

Apart from the above, you can also refer to the following topics to manage and access cloud storage like S3, GCS, ADLS, and OSS:

- [Using Hadoop catalog to manage S3](./hadoop-catalog-with-s3.md).
- [Using Hadoop catalog to manage GCS](./hadoop-catalog-with-gcs.md).
- [Using Hadoop catalog to manage ADLS](./hadoop-catalog-with-adls.md).
- [Using Hadoop catalog to manage OSS](./hadoop-catalog-with-oss.md).

More storage options will be added soon. Stay tuned!
522 changes: 522 additions & 0 deletions docs/hadoop-catalog-with-adls.md

Large diffs are not rendered by default.

500 changes: 500 additions & 0 deletions docs/hadoop-catalog-with-gcs.md

Large diffs are not rendered by default.

538 changes: 538 additions & 0 deletions docs/hadoop-catalog-with-oss.md

Large diffs are not rendered by default.

541 changes: 541 additions & 0 deletions docs/hadoop-catalog-with-s3.md

Large diffs are not rendered by default.

87 changes: 18 additions & 69 deletions docs/hadoop-catalog.md

Large diffs are not rendered by default.

173 changes: 7 additions & 166 deletions docs/how-to-use-gvfs.md

Large diffs are not rendered by default.

59 changes: 3 additions & 56 deletions docs/manage-fileset-metadata-using-gravitino.md
@@ -15,7 +15,9 @@ filesets to manage non-tabular data like training datasets and other raw data.

Typically, a fileset is mapped to a directory on a file system like HDFS, S3, ADLS, GCS, etc.
With the fileset managed by Gravitino, the non-tabular data can be managed as assets together with
- tabular data in Gravitino in a unified way.
+ tabular data in Gravitino in a unified way. The following operations use HDFS as an example; for other
+ HCFS such as S3, OSS, GCS, and ADLS, refer to the corresponding guides: [hadoop-with-s3](./hadoop-catalog-with-s3.md), [hadoop-with-oss](./hadoop-catalog-with-oss.md), [hadoop-with-gcs](./hadoop-catalog-with-gcs.md), and
+ [hadoop-with-adls](./hadoop-catalog-with-adls.md).

After a fileset is created, users can easily access, manage the files/directories through
the fileset's identifier, without needing to know the physical path of the managed dataset. Also, with
@@ -53,24 +55,6 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
}
}' http://localhost:8090/api/metalakes/metalake/catalogs

- # create a S3 catalog
- curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
- -H "Content-Type: application/json" -d '{
- "name": "catalog",
- "type": "FILESET",
- "comment": "comment",
- "provider": "hadoop",
- "properties": {
- "location": "s3a://bucket/root",
- "s3-access-key-id": "access_key",
- "s3-secret-access-key": "secret_key",
- "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
- "filesystem-providers": "s3"
- }
- }' http://localhost:8090/api/metalakes/metalake/catalogs

- # For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to
- # The following link about the catalog properties.
```

</TabItem>
@@ -93,25 +77,8 @@ Catalog catalog = gravitinoClient.createCatalog("catalog",
"hadoop", // provider, Gravitino only supports "hadoop" for now.
"This is a Hadoop fileset catalog",
properties);

- // create a S3 catalog
- s3Properties = ImmutableMap.<String, String>builder()
- .put("location", "s3a://bucket/root")
- .put("s3-access-key-id", "access_key")
- .put("s3-secret-access-key", "secret_key")
- .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
- .put("filesystem-providers", "s3")
- .build();

- Catalog s3Catalog = gravitinoClient.createCatalog("catalog",
- Type.FILESET,
- "hadoop", // provider, Gravitino only supports "hadoop" for now.
- "This is a S3 fileset catalog",
- s3Properties);
// ...

- // For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to
- // The following link about the catalog properties.
```

</TabItem>
@@ -124,23 +91,6 @@ catalog = gravitino_client.create_catalog(name="catalog",
provider="hadoop",
comment="This is a Hadoop fileset catalog",
properties={"location": "/tmp/test1"})

- # create a S3 catalog
- s3_properties = {
- "location": "s3a://bucket/root",
- "s3-access-key-id": "access_key",
- "s3-secret-access-key": "secret_key",
- "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com"
- }

- s3_catalog = gravitino_client.create_catalog(name="catalog",
- type=Catalog.Type.FILESET,
- provider="hadoop",
- comment="This is a S3 fileset catalog",
- properties=s3_properties)

- # For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to
- # The following link about the catalog properties.
```

</TabItem>
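The inline S3 snippets above are removed in favor of the dedicated cloud-storage guides. For reference, here is a hedged Python sketch of the equivalent S3 catalog creation, mirroring the removed snippet and adding the `filesystem-providers` property that the curl and Java variants set; the top-level imports, client construction, bucket, keys, and endpoint are assumptions or placeholders rather than verified values.

```python
# Sketch only: mirrors the removed S3 example; credentials and endpoint are placeholders.
from gravitino import Catalog, GravitinoClient

gravitino_client = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")

s3_properties = {
    "location": "s3a://bucket/root",
    "s3-access-key-id": "access_key",
    "s3-secret-access-key": "secret_key",
    "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
    "filesystem-providers": "s3",
}

s3_catalog = gravitino_client.create_catalog(name="s3_catalog",
                                             type=Catalog.Type.FILESET,
                                             provider="hadoop",  # Gravitino only supports "hadoop" for fileset catalogs
                                             comment="This is a S3 fileset catalog",
                                             properties=s3_properties)
```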
@@ -371,11 +321,8 @@ The `storageLocation` is the physical location of the fileset. Users can specify
when creating a fileset, or follow the rules of the catalog/schema location if not specified.

The value of `storageLocation` depends on the configuration settings of the catalog:
- - If this is a S3 fileset catalog, the `storageLocation` should be in the format of `s3a://bucket-name/path/to/fileset`.
- - If this is an OSS fileset catalog, the `storageLocation` should be in the format of `oss://bucket-name/path/to/fileset`.
- If this is a local fileset catalog, the `storageLocation` should be in the format of `file:///path/to/fileset`.
- If this is a HDFS fileset catalog, the `storageLocation` should be in the format of `hdfs://namenode:port/path/to/fileset`.
- - If this is a GCS fileset catalog, the `storageLocation` should be in the format of `gs://bucket-name/path/to/fileset`.

For a `MANAGED` fileset, the storage location is:

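To make the `storageLocation` rules above concrete, here is a hedged Python sketch (not part of this diff) of creating a fileset at an explicit HDFS location. It assumes a `gravitino_client` like the one in the earlier examples; the `as_fileset_catalog()` call, keyword names, and the namenode path follow the pattern of the project's Python examples and are assumptions rather than verified signatures.

```python
# Sketch only: imports and keyword names are assumed; the HDFS path is a placeholder.
from gravitino import Fileset, NameIdentifier

catalog = gravitino_client.load_catalog(name="catalog")
fileset = catalog.as_fileset_catalog().create_fileset(
    ident=NameIdentifier.of("schema", "example_fileset"),
    fileset_type=Fileset.Type.EXTERNAL,
    comment="An EXTERNAL fileset whose storageLocation is set explicitly",
    storage_location="hdfs://namenode:8020/user/hadoop/example_fileset",
    properties={},
)
```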
