From 8408a7b7643eeb569ea6de66cdcc0461fdd0f761 Mon Sep 17 00:00:00 2001
From: Walter Teng
Date: Tue, 19 Nov 2024 20:36:10 +0800
Subject: [PATCH] Bug fix arg connected comp (#382)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* update obsolete flag

Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* build: Improve caching (#352)

Signed-off-by: Oliver Koenig
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Run on main (#354)

* ci: Run gpuci on main

* fix checkout

Signed-off-by: Oliver Koenig
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Run on merge commit (#355)

Signed-off-by: Oliver Koenig
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* build: Add conda env to `$PATH` (#357)

* build: Add conda env to `$PATH`

Signed-off-by: Oliver Koenig

* test

Signed-off-by: Oliver Koenig

* add newline

Signed-off-by: Oliver Koenig

* run cleanup always

Signed-off-by: Oliver Koenig

---------

Signed-off-by: Oliver Koenig
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Add `build-test-publish-wheel` CI file (#356)

* Create build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Create package_info.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* run black

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update __init__.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update package_info.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update .github/workflows/build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* remove extra version string

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update __init__.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* add `__all__`

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Fix version

Signed-off-by: oliver könig

* Update .github/workflows/build-test-publish-wheel.yml

Signed-off-by: oliver könig

* Ko3n1g/sarahyurick/ci/build test publish wheel (#358)

* fix

* fix

Signed-off-by: Oliver Koenig

* fix

Signed-off-by: Oliver Koenig

* fix

Signed-off-by: Oliver Koenig

* fix

Signed-off-by: Oliver Koenig

* fix

Signed-off-by: Oliver Koenig

* fix

* fix

Signed-off-by: Oliver Koenig

* fix

* fix

---------

Signed-off-by: Oliver Koenig

* run black

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* run isort

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update __init__.py

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update pyproject.toml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

---------

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: oliver könig
Signed-off-by: Oliver Koenig
Co-authored-by: oliver könig
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Fix broken TestPyPi builder (#362)

* Update build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update Dockerfile

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

* Update build-test-publish-wheel.yml

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>

---------

Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* chore: Add `CHANGELOG.md` file (#359)

* chore: Add `CHANGELOG.md` file

* fix

* add end of line

Signed-off-by: Oliver Koenig

---------

Signed-off-by: Oliver Koenig
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Release workflow (#360)

* add file

Signed-off-by: Sarah Yurick

* trailing whitespace

Signed-off-by: Sarah Yurick

---------

Signed-off-by: Sarah Yurick
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Bump release workflow to allow of `devN` semver (#366)

* ci: Bump release workflow for `devN`

Signed-off-by: Oliver Koenig

* fix

Signed-off-by: Oliver Koenig

* fix

Signed-off-by: Oliver Koenig

* fix

Signed-off-by: Oliver Koenig

---------

Signed-off-by: Oliver Koenig
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Add code-freeze workflow (#367)

Signed-off-by: Oliver Koenig
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Add cherry pick workflow (#368)

* ci: Add cherry pick workflow

Signed-off-by: Oliver Koenig

* fix

Signed-off-by: Oliver Koenig

---------

Signed-off-by: Oliver Koenig
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Fix broken NeMo dependencies (#372)

* add packaging

Signed-off-by: Sarah Yurick

* move to requires

Signed-off-by: Sarah Yurick

* move to github ci file

Signed-off-by: Sarah Yurick

* add pin

Signed-off-by: Sarah Yurick

* add torch

Signed-off-by: Sarah Yurick

* add suggestion from mamba readme

Signed-off-by: Sarah Yurick

* try github install

Signed-off-by: Sarah Yurick

* add comma

Signed-off-by: Sarah Yurick

* another attempt

Signed-off-by: Sarah Yurick

* remove nemo toolkit

Signed-off-by: Sarah Yurick

* add datasets

Signed-off-by: Sarah Yurick

* try removing cython

Signed-off-by: Sarah Yurick

* remove cython

Signed-off-by: Sarah Yurick

* sentencepiece

Signed-off-by: Sarah Yurick

* run black

Signed-off-by: Sarah Yurick

* apply ryan's suggestion

Signed-off-by: Sarah Yurick

---------

Signed-off-by: Sarah Yurick
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* ci: Bump release workflow (#373)

Signed-off-by: Oliver Koenig
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* Skip reading files with incorrect extension (#318)

* filter_files_by_extension function

Signed-off-by: Sarah Yurick

* add type checking

Signed-off-by: Sarah Yurick

* add filter_by param to get_all_files_paths_under

Signed-off-by: Sarah Yurick

* isort

Signed-off-by: Sarah Yurick

* address ayush's comments

Signed-off-by: Sarah Yurick

* run black

Signed-off-by: Sarah Yurick

* trailing whitespace

Signed-off-by: Sarah Yurick

* more whitespace

Signed-off-by: Sarah Yurick

* address praateek's review

Signed-off-by: Sarah Yurick

* praateek's review

Signed-off-by: Sarah Yurick

---------

Signed-off-by: Sarah Yurick
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

* remove deprecated convert_str_ids args from ConnectedComponents

Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>

---------

Signed-off-by: Walter Teng <16046667+davzoku@users.noreply.github.com>
Signed-off-by: Oliver Koenig
Signed-off-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
Signed-off-by: oliver könig
Signed-off-by: Sarah Yurick
Co-authored-by: oliver könig
Co-authored-by: Sarah Yurick <53962159+sarahyurick@users.noreply.github.com>
---
 .../red-pajama-v2-curation-tutorial.ipynb                  | 6 ++----
 tutorials/single_node_tutorial/single_gpu_tutorial.ipynb   | 1 -
 .../zyda2-tutorial/1_fuzzy_dedup/3_connected_components.py | 1 -
 3 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/tutorials/pretraining-data-curation/red-pajama-v2-curation-tutorial.ipynb b/tutorials/pretraining-data-curation/red-pajama-v2-curation-tutorial.ipynb
index 42c92bfab..0d1e23a85 100644
--- a/tutorials/pretraining-data-curation/red-pajama-v2-curation-tutorial.ipynb
+++ b/tutorials/pretraining-data-curation/red-pajama-v2-curation-tutorial.ipynb
@@ -2692,7 +2692,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 5,
+   "execution_count": null,
    "id": "6bee85f3-5477-4b9c-b606-7bbbefbe6cfc",
    "metadata": {
     "tags": []
@@ -2749,7 +2749,6 @@
     "    cache_dir=cache_dir,\n",
     "    jaccard_pairs_path=jaccard_pairs_path,\n",
     "    id_column=id_field,\n",
-    "    convert_str_ids=True,\n",
     "    jaccard_threshold=jaccard_threshold,\n",
     ")\n",
     "components_stage.cc_workflow(output_path=output_path)\n",
@@ -4416,7 +4415,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 6,
+   "execution_count": null,
    "id": "89398db9-d4e6-48ec-bad8-1d5ac553cadd",
    "metadata": {
     "tags": []
@@ -4454,7 +4453,6 @@
     "    cache_dir=cache_dir,\n",
     "    jaccard_pairs_path=jaccard_pairs_path,\n",
     "    id_column=id_field,\n",
-    "    convert_str_ids=True,\n",
     "    jaccard_threshold=jaccard_threshold,\n",
     ")\n",
     "components_stage.cc_workflow(output_path=output_path)\n",
diff --git a/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb b/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
index de585e08d..3170b3502 100644
--- a/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
+++ b/tutorials/single_node_tutorial/single_gpu_tutorial.ipynb
@@ -1749,7 +1749,6 @@
     "    cache_dir=connected_component_cache_dir,\n",
     "    jaccard_pairs_path=jaccard_pairs_path,\n",
     "    id_column=input_id_field,\n",
-    "    convert_str_ids=True,\n",
     "    jaccard_threshold=jaccard_threshold,\n",
     ")\n",
     "\n",
diff --git a/tutorials/zyda2-tutorial/1_fuzzy_dedup/3_connected_components.py b/tutorials/zyda2-tutorial/1_fuzzy_dedup/3_connected_components.py
index 67796ec45..467e3c4e2 100644
--- a/tutorials/zyda2-tutorial/1_fuzzy_dedup/3_connected_components.py
+++ b/tutorials/zyda2-tutorial/1_fuzzy_dedup/3_connected_components.py
@@ -41,7 +41,6 @@
     cache_dir=connected_component_cache_dir,
     jaccard_pairs_path=buckets_to_edges_out,
     id_column=input_id_field,
-    convert_str_ids=True,
 )
 
 # Load and run connected components