[#5472] improvement(docs): Add example to use cloud storage fileset and polish hadoop-catalog document. (#6230)

### What changes were proposed in this pull request?

1. Add a full example of how to use cloud storage filesets such as S3, GCS, OSS, and ADLS.
2. Polish how-to-use-gvfs.md and hadoop-catalog.md.
3. Add documentation on how filesets use credentials.

### Why are the changes needed?

For better user experience.

Fix: #5472


### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

N/A

Co-authored-by: Qi Yu <[email protected]>
github-actions[bot] and yuqi1129 authored Jan 14, 2025
1 parent 6c9a0d0 commit dec7ea0
Showing 9 changed files with 2,157 additions and 293 deletions.
4 changes: 2 additions & 2 deletions clients/client-python/gravitino/filesystem/gvfs_config.py
@@ -42,8 +42,8 @@ class GVFSConfig:
GVFS_FILESYSTEM_OSS_SECRET_KEY = "oss_secret_access_key"
GVFS_FILESYSTEM_OSS_ENDPOINT = "oss_endpoint"

- GVFS_FILESYSTEM_AZURE_ACCOUNT_NAME = "abs_account_name"
- GVFS_FILESYSTEM_AZURE_ACCOUNT_KEY = "abs_account_key"
+ GVFS_FILESYSTEM_AZURE_ACCOUNT_NAME = "azure_storage_account_name"
+ GVFS_FILESYSTEM_AZURE_ACCOUNT_KEY = "azure_storage_account_key"

# This configuration marks the expired time of the credential. For instance, if the credential
# fetched from Gravitino server has expired time of 3600 seconds, and the credential_expired_time_ration is 0.5
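The renamed keys above are the option names a GVFS client passes for Azure Blob Storage credentials. Below is a minimal sketch (not part of this commit) of how they might be supplied to the Python GVFS client; the `GVFSConfig` constants come from the file above, while the server URI, metalake name, account values, fileset path, and the `GravitinoVirtualFileSystem` constructor arguments are assumptions based on the client's documented usage.

```python
# Sketch only: the option keys come from GVFSConfig above; everything else is a placeholder.
from gravitino.filesystem import gvfs
from gravitino.filesystem.gvfs_config import GVFSConfig

options = {
    GVFSConfig.GVFS_FILESYSTEM_AZURE_ACCOUNT_NAME: "my_storage_account",
    GVFSConfig.GVFS_FILESYSTEM_AZURE_ACCOUNT_KEY: "my_storage_account_key",
}

# The fsspec-style virtual file system resolves gvfs:// paths through the Gravitino server.
fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="metalake",
    options=options,
)
print(fs.ls("gvfs://fileset/adls_catalog/adls_schema/example_fileset"))
```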
26 changes: 26 additions & 0 deletions docs/hadoop-catalog-index.md
@@ -0,0 +1,26 @@
---
title: "Hadoop catalog index"
slug: /hadoop-catalog-index
date: 2025-01-13
keyword: Hadoop catalog index S3 GCS ADLS OSS
license: "This software is licensed under the Apache License version 2."
---

### Hadoop catalog overview

The Gravitino Hadoop catalog index includes the following chapters:

- [Hadoop catalog overview and features](./hadoop-catalog.md): This chapter provides an overview of the Hadoop catalog, its features, capabilities and related configurations.
- [Manage Hadoop catalog with Gravitino API](./manage-fileset-metadata-using-gravitino.md): This chapter explains how to manage fileset metadata using Gravitino API and provides detailed examples.
- [Using Hadoop catalog with Gravitino virtual file system](./how-to-use-gvfs.md): This chapter explains how to use the Hadoop catalog with the Gravitino virtual file system and provides detailed examples.

### Hadoop catalog with cloud storage

Apart from the above, you can also refer to the following topics to manage and access cloud storage like S3, GCS, ADLS, and OSS:

- [Using Hadoop catalog to manage S3](./hadoop-catalog-with-s3.md).
- [Using Hadoop catalog to manage GCS](./hadoop-catalog-with-gcs.md).
- [Using Hadoop catalog to manage ADLS](./hadoop-catalog-with-adls.md).
- [Using Hadoop catalog to manage OSS](./hadoop-catalog-with-oss.md).

More storage options will be added soon. Stay tuned!
522 changes: 522 additions & 0 deletions docs/hadoop-catalog-with-adls.md

Large diffs are not rendered by default.

500 changes: 500 additions & 0 deletions docs/hadoop-catalog-with-gcs.md

Large diffs are not rendered by default.

538 changes: 538 additions & 0 deletions docs/hadoop-catalog-with-oss.md

Large diffs are not rendered by default.

541 changes: 541 additions & 0 deletions docs/hadoop-catalog-with-s3.md

Large diffs are not rendered by default.

87 changes: 18 additions & 69 deletions docs/hadoop-catalog.md

Large diffs are not rendered by default.

173 changes: 7 additions & 166 deletions docs/how-to-use-gvfs.md

Large diffs are not rendered by default.

59 changes: 3 additions & 56 deletions docs/manage-fileset-metadata-using-gravitino.md
@@ -15,7 +15,9 @@ filesets to manage non-tabular data like training datasets and other raw data.

Typically, a fileset is mapped to a directory on a file system like HDFS, S3, ADLS, GCS, etc.
With the fileset managed by Gravitino, the non-tabular data can be managed as assets together with
- tabular data in Gravitino in a unified way.
+ tabular data in Gravitino in a unified way. The following operations use HDFS as an example; for other
+ HCFS such as S3, OSS, GCS, and ADLS, refer to the corresponding guides: [hadoop-with-s3](./hadoop-catalog-with-s3.md), [hadoop-with-oss](./hadoop-catalog-with-oss.md), [hadoop-with-gcs](./hadoop-catalog-with-gcs.md), and
+ [hadoop-with-adls](./hadoop-catalog-with-adls.md).

After a fileset is created, users can easily access, manage the files/directories through
the fileset's identifier, without needing to know the physical path of the managed dataset. Also, with
@@ -53,24 +55,6 @@ curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
}
}' http://localhost:8090/api/metalakes/metalake/catalogs

- # create a S3 catalog
- curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
- -H "Content-Type: application/json" -d '{
- "name": "catalog",
- "type": "FILESET",
- "comment": "comment",
- "provider": "hadoop",
- "properties": {
- "location": "s3a://bucket/root",
- "s3-access-key-id": "access_key",
- "s3-secret-access-key": "secret_key",
- "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
- "filesystem-providers": "s3"
- }
- }' http://localhost:8090/api/metalakes/metalake/catalogs

- # For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to
- # The following link about the catalog properties.
```

</TabItem>
@@ -93,25 +77,8 @@ Catalog catalog = gravitinoClient.createCatalog("catalog",
"hadoop", // provider, Gravitino only supports "hadoop" for now.
"This is a Hadoop fileset catalog",
properties);

- // create a S3 catalog
- s3Properties = ImmutableMap.<String, String>builder()
- .put("location", "s3a://bucket/root")
- .put("s3-access-key-id", "access_key")
- .put("s3-secret-access-key", "secret_key")
- .put("s3-endpoint", "http://s3.ap-northeast-1.amazonaws.com")
- .put("filesystem-providers", "s3")
- .build();

- Catalog s3Catalog = gravitinoClient.createCatalog("catalog",
- Type.FILESET,
- "hadoop", // provider, Gravitino only supports "hadoop" for now.
- "This is a S3 fileset catalog",
- s3Properties);
// ...

- // For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to
- // The following link about the catalog properties.
```

</TabItem>
@@ -124,23 +91,6 @@ catalog = gravitino_client.create_catalog(name="catalog",
provider="hadoop",
comment="This is a Hadoop fileset catalog",
properties={"location": "/tmp/test1"})

- # create a S3 catalog
- s3_properties = {
- "location": "s3a://bucket/root",
- "s3-access-key-id": "access_key",
- "s3-secret-access-key": "secret_key",
- "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com"
- }

- s3_catalog = gravitino_client.create_catalog(name="catalog",
- type=Catalog.Type.FILESET,
- provider="hadoop",
- comment="This is a S3 fileset catalog",
- properties=s3_properties)

- # For others HCFS like GCS, OSS, etc., the properties should be set accordingly. please refer to
- # The following link about the catalog properties.
```

</TabItem>
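The inline S3 snippets above are removed in favor of the dedicated cloud-storage guides. For reference, here is a hedged Python sketch of the equivalent S3 catalog creation, mirroring the removed snippet and adding the `filesystem-providers` property that the curl and Java variants set; the top-level imports, client construction, bucket, keys, and endpoint are assumptions or placeholders rather than verified values.

```python
# Sketch only: mirrors the removed S3 example; credentials and endpoint are placeholders.
from gravitino import Catalog, GravitinoClient

gravitino_client = GravitinoClient(uri="http://localhost:8090", metalake_name="metalake")

s3_properties = {
    "location": "s3a://bucket/root",
    "s3-access-key-id": "access_key",
    "s3-secret-access-key": "secret_key",
    "s3-endpoint": "http://s3.ap-northeast-1.amazonaws.com",
    "filesystem-providers": "s3",
}

s3_catalog = gravitino_client.create_catalog(name="s3_catalog",
                                             type=Catalog.Type.FILESET,
                                             provider="hadoop",  # Gravitino only supports "hadoop" for fileset catalogs
                                             comment="This is a S3 fileset catalog",
                                             properties=s3_properties)
```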
@@ -371,11 +321,8 @@ The `storageLocation` is the physical location of the fileset. Users can specify
when creating a fileset, or follow the rules of the catalog/schema location if not specified.

The value of `storageLocation` depends on the configuration settings of the catalog:
- - If this is a S3 fileset catalog, the `storageLocation` should be in the format of `s3a://bucket-name/path/to/fileset`.
- - If this is an OSS fileset catalog, the `storageLocation` should be in the format of `oss://bucket-name/path/to/fileset`.
- If this is a local fileset catalog, the `storageLocation` should be in the format of `file:///path/to/fileset`.
- If this is a HDFS fileset catalog, the `storageLocation` should be in the format of `hdfs://namenode:port/path/to/fileset`.
- - If this is a GCS fileset catalog, the `storageLocation` should be in the format of `gs://bucket-name/path/to/fileset`.

For a `MANAGED` fileset, the storage location is:

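To make the `storageLocation` rules above concrete, here is a hedged Python sketch (not part of this diff) of creating a fileset at an explicit HDFS location. It assumes a `gravitino_client` like the one in the earlier examples; the `as_fileset_catalog()` call, keyword names, and the namenode path follow the pattern of the project's Python examples and are assumptions rather than verified signatures.

```python
# Sketch only: imports and keyword names are assumed; the HDFS path is a placeholder.
from gravitino import Fileset, NameIdentifier

catalog = gravitino_client.load_catalog(name="catalog")
fileset = catalog.as_fileset_catalog().create_fileset(
    ident=NameIdentifier.of("schema", "example_fileset"),
    fileset_type=Fileset.Type.EXTERNAL,
    comment="An EXTERNAL fileset whose storageLocation is set explicitly",
    storage_location="hdfs://namenode:8020/user/hadoop/example_fileset",
    properties={},
)
```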
