mlrun · aviaIguazio · Dec 26, 2023 · Dec 25, 2023 · Dec 26, 2023
diff --git a/translate/item.yaml b/translate/item.yaml
@@ -2,6 +2,8 @@ apiVersion: v1
 categories:
 - data-preparation
 - machine-learning
+- deep-learning
+- NLP
 description: Translate text files from one language to another
 doc: ''
 example: translate.ipynb
@@ -26,5 +28,5 @@ spec:
     - torch
     - tqdm
 url: ''
-version: 0.0.1
+version: 0.0.2
 test_valid: True
diff --git a/translate/translate.ipynb b/translate/translate.ipynb
@@ -8,13 +8,124 @@
     "# Translate tutorial"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "afe4a3ee-f886-461c-9830-0fd9a5b625c3",
+   "metadata": {},
+   "source": [
+    "## Short description and explanation"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "313ed5c3-7416-4bbb-a7fb-aa37ab1f8445",
    "metadata": {},
    "source": [
-    "Imagine a translation function that's as smart as it is easy to use – that's exactly what translate brings to the table.<br>\n",
-    "Simply tell it where your file is and the languages you're working with (the one you're translating from and the one you want),<br> and this function takes care of the rest. It cleverly picks the right pre-trained model for your language pair, ensuring top-notch translations.<br>No need to worry about finding the perfect model or dealing with complex setup – it's all handled behind the scenes.<br> With this function, language translation becomes a breeze, making your documents accessible in any language without breaking a sweat."
+    "Machine translation has made huge strides in recent years thanks to advances in deep learning. Our `translate` function makes it even easier to use. <br>\n",
+    "Simply tell it where your file is and the languages you're working with (the one you're translating from and the one you want),<br>\n",
+    "and this function takes care of the rest. It picks the pre-trained model that matches your language pair.<br>\n",
+    "\n",
+    "No need to worry about finding the perfect model or dealing with complex setup – it's all handled behind the scenes.<br>\n",
+    "\n",
+    "With this function, language translation becomes a breeze, making your documents accessible in any language without breaking a sweat."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9352f799-fe99-4ace-9b44-ca0e28bb1fb4",
+   "metadata": {},
+   "source": [
+    "## Background"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6026a8bd-e2e7-454a-b325-9550561a587e",
+   "metadata": {},
+   "source": [
+    "The function takes two main inputs: a model name (or a source and target language pair), and a path to one or more text files to translate.\n",
+    "\n",
+    "It first checks if a model name was passed. If so, it loads that Helsinki-NLP model.<br>\n",
+    "If not, it looks at the source and target languages and loads the appropriate Helsinki-NLP translation model.\n",
+    "\n",
+    "It then reads in the text files and translates them using the loaded model.\n",
+    "\n",
+    "Finally, it writes the translated text out to new files and returns the file name or directory name. <br>\n",
+    "\n",
+    "This allows the user to easily translate a text file to another language using Helsinki-NLP's pre-trained models by just passing the model name or language pair and source text file.<br>\n",
+    "\n",
+    "This function's automatic model selection is based on the translation models published by Helsinki-NLP. Check them out at https://huggingface.co/Helsinki-NLP"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "42ec9bc3-2b90-40f1-b10b-5493d9e2b75e",
+   "metadata": {},
+   "source": [
+    "## Requirements"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "6b756726-e750-4da4-b032-bf5385f85311",
+   "metadata": {},
+   "source": [
+    "`transformers` <br>\n",
+    "`tqdm` <br>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "212b8161-3e75-459e-98f3-a5b7c5a15efe",
+   "metadata": {},
+   "source": [
+    "## Documentation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9b5fe561-4fbb-4471-91bb-532fa55559f9",
+   "metadata": {},
+   "source": [
+    "`data_path`:          A directory of text files, a single text file, or a list of files to translate.\n",
+    "\n",
+    "`output_directory`:   Directory where the translated files will be saved.\n",
+    "\n",
+    "`model_name`:         The name of a model to load. If None, the model name is constructed using the source and<br>\n",
+    "                           target language parameters from the \"Helsinki-NLP\" group.\n",
+    "                           \n",
+    "`source_language`:    The source language code (e.g., 'en' for English).\n",
+    "\n",
+    "`target_language`:    The target language code (e.g., 'en' for English).\n",
+    "\n",
+    "`model_kwargs`:       Keyword arguments to pass regarding the loading of the model in HuggingFace's \"pipeline\"\n",
+    "                           function.\n",
+    "                           \n",
+    "`device`:             The device index for transformers. Default will prefer cuda if available.\n",
+    "\n",
+    "`batch_size`:         The number of sentences per batch to use in translation. The files are translated one by one, but the\n",
+    "                           sentences can be batched.\n",
+    "                           \n",
+    "`translation_kwargs`: Additional keyword arguments to pass to a \"transformers.TranslationPipeline\" when doing<br>\n",
+    "                               the translation inference. Note that the batch size is added automatically.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2e6f44a6-d6ac-48ed-a7d1-936d25e7426c",
+   "metadata": {},
+   "source": [
+    "## Demo"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2b231e4c-0224-41a2-87cf-400a4680e2b9",
+   "metadata": {},
+   "source": [
+    "The following demo shows an example of translating a text file written in Turkish to English using the _translate_ function. <br>\n",
+    "\n",
+    "### (1.) Import the function (import mlrun, set project and import function)"
    ]
   },
   {
@@ -32,8 +143,7 @@
    "id": "1ff51127-dc54-44d2-bd13-0b81165b2033",
    "metadata": {},
    "source": [
-    "## Writing a data file to translate\n",
-    "We want to translate the following turkish sentence into english."
+    "We want to translate the following Turkish sentence into English, so we will write it to a text file."
    ]
   },
   {
@@ -52,15 +162,15 @@
    ],
    "source": [
     "%%writefile data.txt\n",
-    "Ali her gece bir kitap okur."
+    "Ali her gece bir kitap okur. # which means: \"Ali reads a book every night.\""
    ]
   },
   {
    "cell_type": "markdown",
    "id": "c24d71a7-9400-475a-9472-424658801914",
    "metadata": {},
    "source": [
-    "## Setting a project and importing the translate function"
+    "Setting a project and importing the translate function"
    ]
   },
   {
@@ -82,13 +192,22 @@
     "translate_fn = project.set_function(\"hub://translate\", \"translate\")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "558260ce-e453-4e05-a6a7-b2df39cff1b9",
+   "metadata": {},
+   "source": [
+    "## Usage"
+   ]
+  },
   {
    "cell_type": "markdown",
    "id": "5a1781ee-a210-4dc1-82de-0f4f5d191173",
    "metadata": {},
    "source": [
-    "## Translating\n",
-    "Here we run our function that we've imported from the MLRun Function Hub."
+    "### (2.1.) Manual model selection\n",
+    "Here we run our function that we've imported from the MLRun Function Hub. <br>\n",
+    "We select a specific model, give the function a path to the file and an output directory, and choose to run on the CPU."
    ]
   },
   {
@@ -365,6 +484,72 @@
     ")"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "id": "8b2fcf2b-3893-4dda-85e2-4a2b9ed0d963",
+   "metadata": {},
+   "source": [
+    "### (2.2.) Auto model detection"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8c3d24ca-8df7-4204-8b0d-e7a08d53d8c9",
+   "metadata": {},
+   "source": [
+    "Here we run our function that we've imported from the MLRun Function Hub. <br>\n",
+    "We select the languages used to choose the model, give the function a path to the file and an output directory, and choose to run on the CPU."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "dbe10afd-5ede-4475-abc2-bb07dfdf33aa",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "translate_run = translate_fn.run(\n",
+    "    handler=\"translate\",\n",
+    "    inputs={\"data_path\": \"data.txt\"},\n",
+    "    params={\n",
+    "        \"target_language\": \"en\",\n",
+    "        \"source_language\": \"tr\",\n",
+    "        \"device\": \"cpu\",\n",
+    "        \"output_directory\": \"./\",\n",
+    "    },\n",
+    "    local=True,\n",
+    "    returns=[\n",
+    "        \"files: path\",\n",
+    "        \"text_files_dataframe: dataset\",\n",
+    "        \"errors: dict\",\n",
+    "    ],\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "40e4a666-9680-40d6-93ee-9466d31a9efc",
+   "metadata": {},
+   "source": [
+    "We can take a look at the file created."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "89a1952c-f3c3-4a7b-bad4-b59c701a5af6",
+   "metadata": {},
+   "source": [
+    "### (3.) Review results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9d583cf9-7e81-4d0d-982f-aba345d4cf9c",
+   "metadata": {},
+   "source": [
+    "We can look at the artifact returned."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": 9,
@@ -424,7 +609,7 @@
    "id": "580a20a2-4877-48b4-8f83-59cbfc2f3b83",
    "metadata": {},
    "source": [
-    "Checking that translation is correct"
+    "To check that the translation is correct, we print the text file created by the function and see that the sentence is as expected."
    ]
   },
   {