[Translate] updated notebook #759

Merged
merged 2 commits into from
Dec 26, 2023
4 changes: 3 additions & 1 deletion translate/item.yaml
Expand Up @@ -2,6 +2,8 @@ apiVersion: v1
categories:
- data-preparation
- machine-learning
- deep-learning
- NLP
description: Translate text files from one language to another
doc: ''
example: translate.ipynb
Expand All @@ -26,5 +28,5 @@ spec:
- torch
- tqdm
url: ''
version: 0.0.1
version: 0.0.2
test_valid: True
203 changes: 194 additions & 9 deletions translate/translate.ipynb
Expand Up @@ -8,13 +8,124 @@
"# Translate tutorial"
]
},
{
"cell_type": "markdown",
"id": "afe4a3ee-f886-461c-9830-0fd9a5b625c3",
"metadata": {},
"source": [
"## Short description and explenation"
]
},
{
"cell_type": "markdown",
"id": "313ed5c3-7416-4bbb-a7fb-aa37ab1f8445",
"metadata": {},
"source": [
"Imagine a translation function that's as smart as it is easy to use – that's exactly what translate brings to the table.<br>\n",
"Simply tell it where your file is and the languages you're working with (the one you're translating from and the one you want),<br> and this function takes care of the rest. It cleverly picks the right pre-trained model for your language pair, ensuring top-notch translations.<br>No need to worry about finding the perfect model or dealing with complex setup – it's all handled behind the scenes.<br> With this function, language translation becomes a breeze, making your documents accessible in any language without breaking a sweat."
"Machine translation has made huge strides in recent years thanks to advances in deep learning, our translte function makes it even easier to use. <br>\n",
"Simply tell it where your file is and the languages you're working with (the one you're translating from and the one you want),<br>\n",
"and this function takes care of the rest. It cleverly picks the right pre-trained model for your language pair, ensuring top-notch translations.<br>\n",
"\n",
"No need to worry about finding the perfect model or dealing with complex setup – it's all handled behind the scenes.<br>\n",
"\n",
"With this function, language translation becomes a breeze, making your documents accessible in any language without breaking a sweat."
]
},
{
"cell_type": "markdown",
"id": "9352f799-fe99-4ace-9b44-ca0e28bb1fb4",
"metadata": {},
"source": [
"## Background"
]
},
{
"cell_type": "markdown",
"id": "6026a8bd-e2e7-454a-b325-9550561a587e",
"metadata": {},
"source": [
"The function takes two parameters: a model name or the source and target languages, and a path to one or more text files to translate.\n",
"\n",
"It first checks if a model name was passed. If so, it loads that Helsinki-NLP model.<br>\n",
"If not, it looks at the source and target languages and loads the appropriate Helsinki-NLP translation model.\n",
"\n",
"It then reads in the text files and translates them using the loaded model.\n",
"\n",
"Finally, it writes the translated text out to new files and returns the filename or dir name. <br>\n",
"\n",
"This allows the user to easily translate a text file to another language using Helsinki-NLP's pre-trained models by just passing the model name or language pair and source text file.<br>\n",
"\n",
"This function auto-model selection is based on the great translation models offered by Helsinki. Check them out https://huggingface.co/Helsinki-NLP"
]
},
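As a rough illustration of the model selection described in the Background cell, the sketch below resolves a model name from either an explicit name or a language pair. The helper `resolve_model_name` is hypothetical and is not taken from the hub function's source. It assumes the `Helsinki-NLP/opus-mt-<source>-<target>` naming pattern used by the models on the Helsinki-NLP Hugging Face page.

```python
from typing import Optional


def resolve_model_name(
    model_name: Optional[str] = None,
    source_language: Optional[str] = None,
    target_language: Optional[str] = None,
) -> str:
    """Hypothetical helper: pick an explicit model or build one from a language pair."""
    # An explicitly passed model name always wins.
    if model_name:
        return model_name
    # Without a model name, both language codes are needed to build one.
    if not source_language or not target_language:
        raise ValueError(
            "Pass either model_name or both source_language and target_language."
        )
    # Assumed naming pattern of the Helsinki-NLP models on Hugging Face.
    return f"Helsinki-NLP/opus-mt-{source_language}-{target_language}"


print(resolve_model_name(source_language="tr", target_language="en"))
```

For the Turkish to English pair used in the demo, this sketch yields `Helsinki-NLP/opus-mt-tr-en`. Whether a model exists for a given pair still has to be checked on the Hugging Face page.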
{
"cell_type": "markdown",
"id": "42ec9bc3-2b90-40f1-b10b-5493d9e2b75e",
"metadata": {},
"source": [
"## Requirements"
]
},
{
"cell_type": "markdown",
"id": "6b756726-e750-4da4-b032-bf5385f85311",
"metadata": {},
"source": [
"`transformers` <br>\n",
"`tqdm` <br>"
]
},
{
"cell_type": "markdown",
"id": "212b8161-3e75-459e-98f3-a5b7c5a15efe",
"metadata": {},
"source": [
"## Documentation"
]
},
{
"cell_type": "markdown",
"id": "9b5fe561-4fbb-4471-91bb-532fa55559f9",
"metadata": {},
"source": [
"`data_path`: A directory of text files or a single text file or a list of files to translate.\n",
"\n",
"`output_directory`: Directory where the translated files will be saved.\n",
"\n",
"`model_name`: The name of a model to load. If None, the model name is constructed using the source and<br>\n",
" target languages parameters from the \"Helsinki-NLP\" group.\n",
" \n",
"`source_language`: The source language code (e.g., 'en' for English).\n",
"\n",
"`target_language`: The target language code (e.g., 'en' for English).\n",
"\n",
"`model_kwargs`: Keyword arguments to pass regarding the loading of the model in HuggingFace's \"pipeline\"\n",
" function.\n",
" \n",
"`device`: The device index for transformers. Default will prefer cuda if available.\n",
"\n",
"`batch_size`: The number of batches to use in translation. The files are translated one by one, but the\n",
" sentences can be batched.\n",
" \n",
"`translation_kwargs`: Additional keyword arguments to pass to a \"transformers.TranslationPipeline\" when doing<br>\n",
" the translation inference. Notice the batch size here is being added automatically.\n"
]
},
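The `batch_size` description says that files are translated one by one while sentences can be batched. A minimal sketch of that idea follows. The helper `batch_sentences` is hypothetical, and it assumes `batch_size` means the number of sentences sent to the model per batch, which is how the value is normally interpreted by `transformers` pipelines. The hub function's actual batching code is not shown in this diff.

```python
from typing import Iterator, List


def batch_sentences(sentences: List[str], batch_size: int) -> Iterator[List[str]]:
    """Hypothetical helper: split one file's sentences into consecutive batches."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Step through the sentence list in strides of batch_size.
    for start in range(0, len(sentences), batch_size):
        yield sentences[start:start + batch_size]


sentences = ["First sentence.", "Second sentence.", "Third sentence."]
for batch in batch_sentences(sentences, batch_size=2):
    print(batch)
```

With three sentences and a batch size of 2, the last batch holds the single remaining sentence.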
{
"cell_type": "markdown",
"id": "2e6f44a6-d6ac-48ed-a7d1-936d25e7426c",
"metadata": {},
"source": [
"## Demo "
]
},
{
"cell_type": "markdown",
"id": "2b231e4c-0224-41a2-87cf-400a4680e2b9",
"metadata": {},
"source": [
"The following demo will show an example of translating a text file written in turkish to eanglish using the _tranlate_ function. <br>\n",
"\n",
"### (1.) Import the function (import mlrun, set project and import function)"
]
},
{
Expand All @@ -32,8 +143,7 @@
"id": "1ff51127-dc54-44d2-bd13-0b81165b2033",
"metadata": {},
"source": [
"## Writing a data file to translate\n",
"We want to translate the following turkish sentence into english."
"We want to translate the following turkish sentence into english, so we will write it to a text file."
]
},
{
Expand All @@ -52,15 +162,15 @@
],
"source": [
"%%writefile data.txt\n",
"Ali her gece bir kitap okur."
"Ali her gece bir kitap okur. # which means: \"Ali reads a book every night.\""
]
},
{
"cell_type": "markdown",
"id": "c24d71a7-9400-475a-9472-424658801914",
"metadata": {},
"source": [
"## Setting a project and importing the translate function"
"Setting a project and importing the translate function"
]
},
{
Expand All @@ -82,13 +192,22 @@
"translate_fn = project.set_function(\"hub://translate\", \"translate\")"
]
},
{
"cell_type": "markdown",
"id": "558260ce-e453-4e05-a6a7-b2df39cff1b9",
"metadata": {},
"source": [
"## Usage"
]
},
{
"cell_type": "markdown",
"id": "5a1781ee-a210-4dc1-82de-0f4f5d191173",
"metadata": {},
"source": [
"## Translating\n",
"Here we run our function that we've imported from the MLRun Function Hub."
"### (2.1.) Manual model selection\n",
"Here we run our function that we've imported from the MLRun Function Hub. <br>\n",
"We select the specific model, give the function a path to to the file and output directory and choose to run on the cpu."
]
},
{
Expand Down Expand Up @@ -365,6 +484,72 @@
")"
]
},
{
"cell_type": "markdown",
"id": "8b2fcf2b-3893-4dda-85e2-4a2b9ed0d963",
"metadata": {},
"source": [
"### (2.1.) Auto model detectyion"
]
},
{
"cell_type": "markdown",
"id": "8c3d24ca-8df7-4204-8b0d-e7a08d53d8c9",
"metadata": {},
"source": [
"Here we run our function that we've imported from the MLRun Function Hub. <br>\n",
"We select the languages to use for choosing the model, give the function a path to to the file and output directory and choose to run on the cpu."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "dbe10afd-5ede-4475-abc2-bb07dfdf33aa",
"metadata": {},
"outputs": [],
"source": [
"translate_run = translate_fn.run(\n",
" handler=\"translate\",\n",
" inputs={\"data_path\": \"data.txt\"},\n",
" params={\n",
" \"target_language\": \"en\",\n",
" \"source_language\": \"tr\",\n",
" \"device\": \"cpu\",\n",
" \"output_directory\": \"./\",\n",
" },\n",
" local=True,\n",
" returns=[\n",
" \"files: path\",\n",
" \"text_files_dataframe: dataset\",\n",
" \"errors: dict\",\n",
" ],\n",
")"
]
},
{
"cell_type": "markdown",
"id": "40e4a666-9680-40d6-93ee-9466d31a9efc",
"metadata": {},
"source": [
"We can take alook at the file created"
]
},
{
"cell_type": "markdown",
"id": "89a1952c-f3c3-4a7b-bad4-b59c701a5af6",
"metadata": {},
"source": [
"### (3.) Review results"
]
},
{
"cell_type": "markdown",
"id": "9d583cf9-7e81-4d0d-982f-aba345d4cf9c",
"metadata": {},
"source": [
"We can look at the articat returned, the import "
]
},
{
"cell_type": "code",
"execution_count": 9,
Expand Down Expand Up @@ -424,7 +609,7 @@
"id": "580a20a2-4877-48b4-8f83-59cbfc2f3b83",
"metadata": {},
"source": [
"Checking that translation is correct"
"Checking that translation is correct, we print the text file created by function, and can see the sentence is as expected."
]
},
{