From 1c0316e03ddefd152befdc9bb0b0469b9e79712a Mon Sep 17 00:00:00 2001 From: Harshil Agrawal Date: Fri, 11 Oct 2024 14:20:28 +0200 Subject: [PATCH] add new r2 event notification tutorial --- .../docs/r2/tutorials/summarize-pdf.mdx | 459 ++++++++++++++++++ 1 file changed, 459 insertions(+) create mode 100644 src/content/docs/r2/tutorials/summarize-pdf.mdx diff --git a/src/content/docs/r2/tutorials/summarize-pdf.mdx b/src/content/docs/r2/tutorials/summarize-pdf.mdx new file mode 100644 index 000000000000000..ee4f0dcf34a3839 --- /dev/null +++ b/src/content/docs/r2/tutorials/summarize-pdf.mdx @@ -0,0 +1,459 @@ +--- +title: Use event notification to summarize PDF files on upload +pcx_content_type: tutorial +content_type: 📝 Tutorial +products: + - Queues + - Workers +difficulty: Intermediate +updated: 2024-10-10 +languages: + - TypeScript +--- + +import { Render, PackageManagers, Details } from "~/components"; + +In this tutorial, you will learn how to use [event notifications](/r2/buckets/event-notifications/) to process a PDF file when it is uploaded to an R2 bucket. You will use [Workers AI](/workers-ai/) to summarize the PDF and store the summary as a text file in the same bucket. + +## Prerequisites + +To continue, you will need: + +1. A [Cloudflare account](https://dash.cloudflare.com/sign-up/workers-and-pages) with access to R2. +2. Have an existing R2 bucket. Refer to [Get started tutorial for R2](/r2/get-started/#2-create-a-bucket). +3. Install [`Node.js`](https://docs.npmjs.com/downloading-and-installing-node-js-and-npm). + +
+ Use a Node version manager like [Volta](https://volta.sh/) or + [nvm](https://github.com/nvm-sh/nvm) to avoid permission issues and change + Node.js versions. [Wrangler](/workers/wrangler/install-and-update/), discussed + later in this guide, requires a Node version of `16.17.0` or later. +
+ +## 1. Create a new project + +You will create a new Worker project that will use [Static Assets](/workers/static-assets/) to serve the front-end of your application. A user can upload a PDF file using this front-end, which will then be processed by your Worker. + +Create a new Worker project by running the following commands: + + + + + +Navigate to the `pdf-summarizer` directory: + +```sh frame="none" +cd pdf-summarizer +``` + +## 2. Create the front-end + +Using Static Assets, you can serve the front-end of your application from your Worker. To use Static Assets, you need to add the required bindings to your `wrangler.toml` file. + +```toml +[assets] +directory = "public" +``` + +Next, create a `public` directory and add an `index.html` file. The `index.html` file should contain the following HTML code: + +
+ +Select to view the HTML code + +```html + + + + + + PDF Summarizer + + + +
+
+

Upload PDF File

+
+ + +
+
+
+ + + + + + + + +``` + +
+ +To view the front-end of your application, run the following command and navigate to the URL displayed in the terminal: + +```sh +npm run dev +```` + +```output + ⛅️ wrangler 3.80.2 +------------------- + +⎔ Starting local server... +[wrangler:inf] Ready on http://localhost:8787 +╭───────────────────────────╮ +│ [b] open a browser │ +│ [d] open devtools │ +│ [l] turn off local mode │ +│ [c] clear console │ +│ [x] to exit │ +╰───────────────────────────╯ +``` + +When you open the URL in your browser, you will see that there is a file upload form. If you try uploading a file, you will notice that the file is not uploaded to the server. This is because the front-end is not connected to the back-end. In the next step, you will update your Worker that will handle the file upload. + +## 3. Handle file upload + +To handle the file upload, you will first need to add the R2 binding. In the `wrangler.toml` file, add the following code: + +```toml +[[r2_buckets]] +binding = "MY_BUCKET" +bucket_name = "" +``` + +Replace `` with the name of your R2 bucket. + +Next, update the `src/index.ts` file. The `src/index.ts` file should contain the following code: + +```ts title="src/index.ts" +export default { + async fetch(request, env, ctx): Promise { + // Get the pathname from the request + const pathname = new URL(request.url).pathname; + + if (pathname === "/api/upload" && request.method === "POST") { + // Get the file from the request + const formData = await request.formData(); + const file = formData.get("pdfFile") as File; + + // Upload the file to Cloudflare R2 + const upload = await env.MY_BUCKET.put(file.name, file); + return new Response("File uploaded successfully", { status: 200 }); + } + + return new Response("incorrect route", { status: 404 }); + }, +} satisfies ExportedHandler; +``` + +The above code does the following: + +- Check if the request is a POST request to the `/api/upload` endpoint. If it is, it gets the file from the request and uploads it to Cloudflare R2 using the [Workers API](/r2/api/workers/). +- If the request is not a POST request to the `/api/upload` endpoint, it returns a 404 response. + +Since the Worker code is written in TypeScript, you should run the following command to add the necessary type definitions. While this is not required, it will help you avoid errors. + +```sh +npm run cf-typegen +``` + +You can restart the developer server to test the changes: + +```sh +npm run dev +``` + +## 4. Create a queue + +:::note + +You will need a [Workers Paid plan](/workers/platform/pricing/) to create and use [Queues](/queues/) and Cloudflare Workers to consume event notifications. + +::: + +Event notifications capture changes to data in your R2 bucket. You will need to create a new queue `pdf-summarize` to receive notifications: + +```sh +npx wrangler queues create pdf-summarizer +``` + +Add the binding to the `wrangler.toml` file: + +```toml title="wrangler.toml" +[[queues.consumers]] +queue = "pdf-summarizer" +``` + +## 5. Handle event notifications + +Now that you have a queue to receive event notifications, you need to update the Worker to handle the event notifications. You will need to add a Queue handler that will extract the textual content from the PDF, use Workers AI to summarize the content, and then save it in the R2 bucket. + +Update the `src/index.ts` file to add the Queue handler: + +```ts title="src/index.ts" +export default { + async fetch(request, env, ctx): Promise { + // No changes in the fetch handler + }, + async queue(batch, env) { + for (let message of batch.messages) { + console.log(`Processing the file: ${message.body.object.key}`); + } + }, +} satisfies ExportedHandler; +``` + +The above code does the following: + +- The `queue` handler is called when a new message is added to the queue. It loops through the messages in the batch and logs the name of the file. + +For now the `queue` handler is not doing anything. In the next steps, you will update the `queue` handler to extract the textual content from the PDF, use Workers AI to summarize the content, and then add it to the bucket. + +## 6. Extract the textual content from the PDF + +To extract the textual content from the PDF, the Worker will use the [unpdf](https://github.com/unjs/unpdf) library. The `unpdf` library provides utilities to work with PDF files. + +Install the `unpdf` library by running the following command: + +```sh +npm install unpdf +``` + +Update the `src/index.ts` file to import the required modules from the `unpdf` library: + +```ts title="src/index.ts" ins={1} +import { extractText, getDocumentProxy } from "unpdf"; +``` + +Next, update the `queue` handler to extract the textual content from the PDF: + +```ts title="src/index.ts" ins={4-15} +async queue(batch, env) { + for(let message of batch.messages) { + console.log(`Processing file: ${message.body.object.key}`); + // Get the file from the R2 bucket + const file = await env.MY_BUCKET.get(message.body.object.key); + if (!file) { + console.error(`File not found: ${message.body.object.key}`); + continue; + } + // Extract the textual content from the PDF + const buffer = await file.arrayBuffer(); + const document = await getDocumentProxy(new Uint8Array(buffer)); + + const {text} = await extractText(document, {mergePages: true}); + console.log(`Extracted text: ${text.substring(0, 100)}...`); + } +} +``` + +The above code does the following: + +- The `queue` handler gets the file from the R2 bucket. +- The `queue` handler extracts the textual content from the PDF using the `unpdf` library. +- The `queue` handler logs the textual content. + +## 7. Use Workers AI to summarize the content + +To use Workers AI, you will need to add the Workers AI binding to the `wrangler.toml` file. The `wrangler.toml` file should contain the following code: + +```toml title="wrangler.toml" +[ai] +binding = "AI" +``` + +Execute the following command to add the AI type definition: + +```sh +npm run cf-typegen +``` + +Update the `src/index.ts` file to use Workers AI to summarize the content: + +```ts title="src/index.ts" ins={7-15} +async queue(batch, env) { + for(let message of batch.messages) { + // Extract the textual content from the PDF + const {text} = await extractText(document, {mergePages: true}); + console.log(`Extracted text: ${text.substring(0, 100)}...`); + + // Use Workers AI to summarize the content + const result: AiSummarizationOutput = await env.AI.run( + "@cf/facebook/bart-large-cnn", + { + input_text: text, + } + ); + const summary = result.summary; + console.log(`Summary: ${summary.substring(0, 100)}...`); + } +} +``` + +The `queue` handler now uses Workers AI to summarize the content. + +## 8. Add the summary to the R2 bucket + +Now that you have the summary, you need to add it to the R2 bucket. Update the `src/index.ts` file to add the summary to the R2 bucket: + +```ts title="src/index.ts" ins={8-14} +async queue(batch, env) { + for(let message of batch.messages) { + // Extract the textual content from the PDF + // ... + // Use Workers AI to summarize the content + // ... + + // Add the summary to the R2 bucket + const upload = await env.MY_BUCKET.put(`${message.body.object.key}-summary.txt`, summary, { + httpMetadata: { + contentType: 'text/plain', + }, + }); + console.log(`Summary added to the R2 bucket: ${upload.key}`); + } +} +``` + +The queue handler now adds the summary to the R2 bucket as a text file. + +## 9. Enable event notifications + +Your `queue` handler is ready to handle incoming event notification messages. You need to enable event notifications with the [`wrangler r2 bucket notification create` command](/workers/wrangler/commands/#notification-create) for your bucket. The following command creates an event notification for the `object-create` event type for the `pdf` suffix: + +```sh +npx wrangler r2 bucket notification create --event-type object-create --queue pdf-summarizer --suffix "pdf" +``` + +Replace `` with the name of your R2 bucket. + +An event notification is created for the `pdf` suffix. When a new file with the `pdf` suffix is uploaded to the R2 bucket, the `pdf-summarizer` queue is triggered. + +## 10. Deploy your Worker + +To deploy your Worker, run the [`wrangler deploy`](/workers/wrangler/commands/#deploy) command: + +```sh +npx wrangler deploy +``` + +In the output of the `wrangler deploy` command, copy the URL. This is the URL of your deployed application. + +## 11. Test + +To test the application, navigate to the URL of your deployed application and upload a PDF file. Alternatively, you can use the [Cloudflare dashboard](https://dash.cloudflare.com/) to upload a PDF file. + +To view the logs, you can use the [`wrangler tail`](/workers/wrangler/commands/#tail) command. + +```sh +npx wrangler tail +``` + +You will see the logs in your terminal. You can also navigate to the Cloudflare dashboard and view the logs in the Workers Logs section. + +If you check your R2 bucket, you will see the summary file. + +## Conclusion + +In this tutorial, you learned how to use R2 event notifications to process an object on upload. You created an application to upload a PDF file, and created a consumer Worker that creates a summary of the PDF file. You also learned how to use Workers AI to summarize the content of the PDF file, and upload the summary to the R2 bucket. + +You can use the same approach to process other types of files, such as images, videos, and audio files. You can also use the same approach to process other types of events, such as object deletion, and object update. + +If you want to view the code for this tutorial, you can find it on [GitHub](https://github.com/harshil1712/pdf-summarizer-r2-event-notification).