
Disable accumulating files in memory while using attachFieldsToBody #385

Open
MolotovCherry opened this issue Aug 30, 2022 · 33 comments
Labels
feature request New feature to be added semver-minor Issue or PR that should land as semver minor

Comments

@MolotovCherry

MolotovCherry commented Aug 30, 2022

Prerequisites

  • I have written a descriptive issue title
  • I have searched existing issues to ensure the feature has not already been requested

🚀 Feature Proposal

Allow a flag in options to disable accumulation of files in memory, and let the caller handle the files themselves

Currently, if you want to use attachFieldsToBody, you're forced to either let the library accumulate files in memory or provide an onFile function, which also forces you to process the file (whether by streaming or accumulating).

I'd like to use the attachFieldsToBody feature, but I don't want any of my files accumulated or processed before the API handler, because I need to handle them separately later by streaming.

Motivation

attachFieldsToBody is a useful feature, and I'd rather not have to re-implement it just to disable the accumulation.

Example

fastify.register(require('@fastify/multipart'), { attachFieldsToBody: 'keyValues', disableAccumulation: true })

@mcollina
Member

How would you implement this feature?

@MolotovCherry
Author

MolotovCherry commented Aug 30, 2022

How would you implement this feature?

I think something like this is simple enough.
fastify.register(require('@fastify/multipart'), { attachFieldsToBody: 'keyValues', disableAccumulation: true })

@mcollina
Member

Thanks for reporting! Would you like to send a Pull Request to address this issue? Remember to add unit tests.

@mcollina mcollina added the semver-minor Issue or PR that should land as semver minor label Aug 30, 2022
@Eomm
Member

Eomm commented Sep 4, 2022

Since this usage does not accumulate:

fastify.register(require('@fastify/multipart'), { attachFieldsToBody: true, onFile })

I would support this new usage instead of adding a new option:

fastify.register(require('@fastify/multipart'), { attachFieldsToBody: 'keyValues', onFile })

@MolotovCherry
Author

MolotovCherry commented Sep 4, 2022

Since this usage does not accumulate:

fastify.register(require('@fastify/multipart'), { attachFieldsToBody: true, onFile })

I would support this new usage instead of adding a new option:

fastify.register(require('@fastify/multipart'), { attachFieldsToBody: 'keyValues', onFile })

We have a need to decide on how to process files by stream inside the API method itself. From what I understand, this usage requires the files be processed before it even reaches the API method.

Isn't this the case? Did I miss something, and it's possible to let the API method handle the file by stream even though an onFile callback is specified?

@Uzlopak
Contributor

Uzlopak commented Sep 4, 2022

Well, you have to store the data somewhere. There is no rewind for these file streams, so you have to store them somewhere if you want to still be able to process them.

@MolotovCherry
Author

MolotovCherry commented Sep 4, 2022

Well, you have to store the data somewhere. There is no rewind for these file streams, so you have to store them somewhere if you want to still be able to process them.

This was my motivation for creating the issue: the case where you can't use both attachFieldsToBody and streams in the API handler at the same time. (This is why I went with a disableAccumulation option.)

Currently the only way to use streams like this is with attachFieldsToBody off.

@Uzlopak
Contributor

Uzlopak commented Sep 4, 2022

I am really confused. What do you want to achieve? You want to attach the fields to the body, but on the other hand you want to stream the file content. OK. So maybe you don't want the files attached to the body as strings? Is that what you want?

@MolotovCherry
Author

MolotovCherry commented Sep 4, 2022

I am really confused. What do you want to achieve? You want to attach the fields to the body, but on the other hand you want to stream the file content. OK. So maybe you don't want the files attached to the body as strings? Is that what you want?

Yeah, that's right. I'd like to have the fields attached to the body, but I don't want the files handled at all. I want to stream the files myself directly in the API handler (not before the request reaches it).

Because I can't do both, I've had to re-implement attachFieldsToBody, but it would be much cleaner to let the library do this (after all, the option exists in the first place). That way I can use fastify schema validation on the route, rather than having to validate manually inside the handler itself (which results in messier code).

@Uzlopak
Contributor

Uzlopak commented Sep 4, 2022

Well, imho there is no way to do this. You cannot "stop" the stream; you have to store it somewhere.
Imagine you have a form with three fields, one of them the file. Now you upload it: first you get the file, and after it you get the field. As long as you don't handle the file stream, you will not get the field. So you have to process the file stream.

The only solutions I know are to store the file in memory, or to store it in redis.

redis/ioredis#1376

With RedisWritable, you could store the uploaded file into redis from onFile, and then in the handler you could read it again with RedisReadable.

@MolotovCherry
Author

Imagine you have a form with three fields, one of them the file. Now you upload it: first you get the file, and after it you get the field. As long as you don't handle the file stream, you will not get the field. So you have to process the file stream.

Yes, in that case you would have to handle the file first, but as long as you order the fields first and make the file field the very last one, this should be fine.

@Uzlopak
Contributor

Uzlopak commented Sep 4, 2022

Well, I think it will not be possible to implement this.

@v4dkou

v4dkou commented Sep 6, 2022

For the time being I disabled attachFieldsToBody and attach fields in a loop like this:

const pump = util.promisify(pipeline) // near the imports, as in other examples from the docs

const body = {}
for await (const part of request.parts()) {
    if (part.file) {
        await pump(part.file, fs.createWriteStream(part.filename))
    } else {
        body[part.fieldname] = part // TODO: Array handling
    }
}

What might go wrong if the async generator function .parts() assigned fields like this internally?

Of course, request.body would only be properly available after request.parts() has been iterated over, but:
a) that seems intuitive to me
b) that could be mentioned in the docs alongside the other quirks, like attachFieldsToBody: 'keyValues' converting all files with buffer.toString() by default

@liudichen

Maybe we can customize the onFile option: process the file stream, store the files on the local disk, and then attach the local paths of the files, instead of the files themselves, to the body.

@Uzlopak
Contributor

Uzlopak commented Mar 17, 2023

To be honest, it is the implementer's duty to handle the streams, and there are so many ready-to-use stream implementations... An implementer just needs to correctly pipe the file streams.
I am strongly against storing data on the hard disk by default. I can already see people complaining that their hard disk is full and that we did not implement logic to clean up the files after they were processed. Then the next person asks why we don't stream into redis, or mongo GridFS, or, or, or...

@liudichen

liudichen commented Mar 19, 2023

To be honest, it is the implementer's duty to handle the streams, and there are so many ready-to-use stream implementations... An implementer just needs to correctly pipe the file streams. I am strongly against storing data on the hard disk by default. I can already see people complaining that their hard disk is full and that we did not implement logic to clean up the files after they were processed. Then the next person asks why we don't stream into redis, or mongo GridFS, or, or, or...

Yes, we can do this just by passing a custom onFile that saves to the local disk:

// assumes: const path = require('path'); const os = require('os'); const fs = require('fs')
// toID() is an app-specific unique-id helper
const onFile = (field, file, filename, encoding, mimetype, body) => {
  const fileData = []
  const lastFile = body[field][body[field].length - 1]
  file.on('data', data => { if (!lastFile.limit) { fileData.push(data) } })
  file.on('limit', () => { lastFile.limit = true })
  file.on('end', () => {
    if (!lastFile.limit) {
      const buffer = Buffer.concat(fileData)
      lastFile.size = buffer.length
      const dir = path.join(os.tmpdir(), 'fastify-multipart')
      if (!fs.existsSync(dir)) { fs.mkdirSync(dir) }
      const filepath = path.join(dir, toID() + path.extname(filename))
      try {
        fs.writeFileSync(filepath, buffer)
        lastFile.filepath = filepath
      } catch (error) {
        lastFile.error = 'error when writing to disk'
      }
      delete lastFile.data
    } else {
      delete lastFile.data
    }
  })
}

However, I'd suggest that the onFile configuration option accept a few more parameters, such as the registration options, or an array parameter collecting all the files (called requestFiles, for example), through which we could easily access or clean up the files (or the cleanup logic could be added to the cleanRequestFiles function), to enhance the customizability of onFile.

@felicienfrancois

I have the exact same issue.

Here is what I would like to achieve:

fastify.post("/publish", function (request, reply) {
  const body = request.body;
  body.mainImageFile.file
    .pipe(imageResizer(1000, 1000))
    .pipe(cloudStorage("some/path/" + body.mainImageFile.filename));
  body.otherImageFile.file
    .pipe(imageResizer(500, 500))
    .pipe(cloudStorage("some/path/" + body.otherImageFile.filename));

  // use other fields as strings
  body.title        // title as string
  body.description  // description as string
  // ...
})

I even think it should be the default behavior. Fields are named in the request form data, so I expect to retrieve all fields (file and non-file) by their name in the body.

The attachFieldsToBody flag has probably been created for this purpose.
It processes non-file fields as expected.
JSON files are loaded and parsed, which is OK behavior in my opinion, as that is the most common usage of such input.
Binary files are loaded into memory (in a buffer) and set as the value (losing the contentType and filename metadata).

I'm not sure this fits the usual need. I think binary files should be usable as streams, and their metadata should be available in the value.

So I tried to do it with an onFile function:

// does not work
onFile: function(part) {
  part.value = part;
}

But this does not work, as onFile must consume the stream in order to reach the next field or the end of the request. The API handler is only called once all files have been consumed.
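The constraint described here (busboy only emits the next part once the current file stream has been drained) is why any working onFile has to consume the stream one way or another. A stdlib sketch of the consuming variant, which is essentially what the built-in accumulation amounts to; consumeToBuffer is a made-up name:

```javascript
// Drain a part's file stream into a single Buffer. This is the minimal
// "consume" that lets the multipart parser move on to the next part.
async function consumeToBuffer (stream) {
  const chunks = []
  for await (const chunk of stream) chunks.push(chunk)
  return Buffer.concat(chunks)
}
```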

@maximelafarie

maximelafarie commented Nov 16, 2023

I'm in the same situation as you, @felicienfrancois.
That said, I don't think adding them to the body changes anything about the general need: retrieving only the fields (other than the "file" type) in JSON format, in a more convenient way.

I think a "helper" function would be a good thing, allowing us to do some variable destructuring.
E.g.:

const { fields, filename, mimetype } = await request.file(); // Actual behavior
const { myNonFileField, myOtherField } = fields.getFields(); // would be nice (but may return files)

// const { myNonFileField, myOtherField } = fields.getFields({ includeFiles: false }); // would be even nicer

Since the actual fields value (from await request.file()) looks like:

fields: <ref *2> {
    file: [Circular *1],
    myExampleField: {
      type: 'field',
      fieldname: 'myExampleField',
      mimetype: 'text/plain',
      encoding: '7bit',
      value: 'myValue',
      fieldnameTruncated: false,
      valueTruncated: false,
      fields: [Circular *2]
    }
  }

The getFields() helper could just filter the fields by their mimetype, or even by whether the key is different from file.

I don't know what @mcollina and @Eomm think about that.

@gurgunday gurgunday added the feature request New feature to be added label Jan 28, 2024
@arthurfiorette
Contributor

arthurfiorette commented Jul 22, 2024

So I tried to do it with an onFile function:

This technically reads the stream while keeping a "readable" interface to be consumed inside req.body.

await app.register(fastifyMultipart, {
  attachFieldsToBody: true,
  onFile(part) {
    part.file = part.file.pipe(new PassThrough())
  },
})

However I'm pretty sure this also accumulates the file in memory, is that right @gurgunday?

@climba03003
Member

However I'm pretty sure this also accumulates the file in memory, is that right @gurgunday?

Yes, you only add an extra layer of stream.
It may work when the file size is smaller than the high-water mark,
but the whole stream will still pause when the file size is larger than the high-water mark.

@mcollina
Member

The solution to this problem is to save the files on disk instead of in memory, then load them from disk when they are read.

This is possible, it needs a volunteer.

@gurgunday
Member

Other than what @mcollina suggested, I don't know how to achieve this.

But the idea looks really solid: saving files to disk and reading them as streams when needed.

Most platforms allow this:

https://docs.digitalocean.com/products/app-platform/how-to/store-data/#local-filesystem-ephemeral-files-only

@climba03003
Member

https://github.com/kaka-ng/fastify-plugins/tree/main/packages/multipart

For reference, I like to provide users a choice.
I provide Buffering and File by default; personally, I use S3 and MongoDB GridFS.

@felicienfrancois

As far as I understand, this issue is not about where to store / accumulate; it is about stopping the accumulation / storage when it is not necessary.
Storing on disk may solve a different issue, when memory is limited and disk is available and less limited. It would still require some expiration and temp-file management to avoid filling the disk indefinitely.
BUT it would not solve this issue.
Here is a concrete example so you can understand.
I'm sending 3 form text fields alongside 10 image files of 20 MB each, which take 10 minutes to upload on my slow connection.
With the current implementation, all 10 files are accumulated in memory AND the handler is called only at the end (10 minutes after the start of the request); it then resizes all the images and does its database work, which takes 20 seconds after the end of the upload before the server answers, putting 20 seconds of full sequential load on the server.
This is inefficient, primarily because the treatment is sequential.
The correct way of handling "large" file uploads (large, numerous, or treatment-intensive) is to allow processing while receiving.
Then, as soon as this is possible, those who have a memory issue can easily store files on disk to handle them later if they want.
BUT storing on disk does not solve the inefficiency caused by the impossibility of processing while receiving.

@climba03003
Member

climba03003 commented Jul 23, 2024

As far as I understand, this issue is not about where to store / accumulate; it is about stopping the accumulation / storage when it is not necessary.

Then shouldn't the new API solve the problem?
You can use request.formData and the stream will be parsed at that point.

I don't know how it would be possible to disable accumulation with attachFieldsToBody.
There is no API for providing the onFile function.
If you need to handle the stream later at some point, you shouldn't be using attachFieldsToBody.
If you need the body to determine the action on a file, you need to handle the file together with the body.

@climba03003
Member

From my understanding, the request here is to retrieve all the fields first and handle the files later.
Technically speaking, that is possible only when your client sends the fields before the files.

In the general case, you never know the order of the fields and files sent by the client.
If we provided an API that handled the fields first and paused on the first file received,
you would face a situation of either reading an incomplete body or the request hanging.

This is also why we propose to accumulate it somewhere else first.
Then you can use it later.

@Uzlopak
Contributor

Uzlopak commented Jul 23, 2024

redis/ioredis#1376

Or redis

@arthurfiorette
Contributor

If you need to handle the stream later at some point, you shouldn't be using attachFieldsToBody.

Most of my problem today is how I can NOT accumulate the file in memory while also being able to use other fastify features, such as schema validation, other fields on the body, and swagger/openapi, none of which work without attachFieldsToBody.

@felicienfrancois

The purpose of streams is not to use them later; it's the opposite: using them while receiving, not at the end. So the solution is not to accumulate somewhere else.
Also, I think this is not an edge case. Most file-processing libs support streaming (excel or csv parsing, image resizing, copying to disk, ...).
Without streaming:
Upload ----->
Validation -->
Processing ---->
Saving --->
With streaming
Upload ----->
Validation -->
Processing ----->
Saving ----->

TL;DR: streaming would benefit most real-world use cases. It would reduce response time (and memory footprint), which is basically why we chose fastify over express and others.


Then the question is how we can benefit at the same time from:

  • streaming
  • the easy object-based API (an event-based API is a pain) provided by attachFieldsToBody
  • fastify features (such as schema validation, other fields on the body, and swagger/openapi), which require attachFieldsToBody

The biggest issue here is that fields can be appended in random order in the body (not truly random: they appear in the order the frontend appended them to the formData), and there is no "header" listing the fields that will be encountered in the body.
So, for example, if text fields are appended after the file in the body, we have to read/receive the full file before getting access to the text values.


Two ideas:
1st idea: read the body until we find the text fields, then call the handler with BufferStreams
(full buffers for the files appended before the text fields, and empty buffers for the others, which will be filled after the handler starts).
Basically, if the text fields are appended first and then 2 files, the execution scheme would be
Upload -t-|----f1----|----f2---->
Req handler ---------------------->
Validation -->
Processing --------------------->
Saving ------>

But if they are appended last, it would still accumulate and the handler would be called at the end, though with (full) streams instead of buffers, which eases usage in streaming libs
Upload ----f1----|----f2----|-t->
Req handler -------------->
Validation -->
Processing ------------->
Saving ------------->

If the text fields are appended in the middle, it would still benefit performance by calling the handler in the middle.

This solution would require knowing in advance which text fields are expected, either via a Fastify schema or via additional options.
Also, we could have a performance-hint log (maybe shown by default in dev environments) when text fields are not appended first, so developers can improve their frontend code.

2nd idea: async getters for everything
We could have an option that makes every key in the body an async getter function instead of the value.
Example:

const first_name = await req.body.first_name();
const last_name = await req.body.last_name();
const fileStream = await req.body.file();

This solution is more flexible and does not require a schema definition (if implemented with an Object Proxy), but it is more complicated to understand how to use in an optimized way (properties should be accessed in the order they are appended in the form).


These are just ideas for now; they may not be easy to implement given fastify and busboy constraints.
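The 2nd idea can be prototyped with a Proxy over a map of pending promises, independent of busboy; lazyBody and deliver are made-up names for illustration:

```javascript
// Every body key becomes an async getter that resolves once the parser
// has delivered that field; out-of-order access simply awaits longer.
function lazyBody () {
  const pending = new Map()
  const entry = (name) => {
    if (!pending.has(name)) {
      let resolve
      const promise = new Promise((r) => { resolve = r })
      pending.set(name, { promise, resolve })
    }
    return pending.get(name)
  }
  const body = new Proxy({}, {
    get: (_target, name) => () => entry(name).promise // await body.first_name()
  })
  // the multipart parser would call this as each part is decoded
  const deliver = (name, value) => entry(name).resolve(value)
  return { body, deliver }
}
```

The ordering caveat still applies: awaiting a field that arrives after a file part would hang unless the file stream is consumed concurrently.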

@climba03003
Member

climba03003 commented Jul 25, 2024

I don't see how either idea is possible without buffering somewhere (memory or file).

Idea 1:
We can know the fields in advance from the schema, but the caller can be anyone and may supply fields that don't belong to the schema.
From the point of view of validation, you are required to have received the full body, or at least all the specified fields.
That also means that for the file fields, you MUST buffer them somewhere.

If you don't rely on fastify's built-in validation, this is already possible through request.parts.

Idea 2:
An async getter won't be hard to implement, but that ignores the fact that you must submit the form fields in the expected order.
Otherwise, your request will stall as soon as a non-specified file field arrives, since you have not specified any handler for it.
That also means you MUST buffer it somewhere for a smooth DX.

All of this assumes the order of the fields is guaranteed, but from my point of view,
the design must support a random field order.

@climba03003
Member

climba03003 commented Jul 25, 2024

A short summary:

  1. Smooth DX with fastify's built-in schema validation.

Personal advice: use my plugin @kakang/fastify-multipart

import FastifyMultipart from '@kakang/fastify-multipart'
import { BusboyAdapter } from '@kakang/fastify-multipart/lib/adapter/busboy'
import { FileStorage } from '@kakang/fastify-multipart/lib/storage/file'

fastify.register(FastifyMultipart, {
  addContentTypeParser: true,
  adapter: BusboyAdapter,
  storage: FileStorage,
  storageOption: {
    uploadDir: '/'
  }
})
  2. Direct processing of formdata.

You can already achieve this with the async generator.
There is no perfect world: you can't use it together with built-in validation.

fastify.post('/upload', async function (request, reply) {
  for await (const part of request.parts()) {
    // you can determine the action by field name
    switch(part.fieldname) {

    }
  }
  return ''
})

If you use my plugin, it is similar:

fastify.post('/upload', async function (request, reply) {
  for await (const { type, name, value, info } of request.multipart()) {
    // you can determine the action by field name
    switch (name) {
    }
  }
})

Another API I can think of is providing an object that has all the handlers.
It would be easier for us to just discard the unwanted fields / files.

request.multipart({
  foo: 'string', // short alias to validate the type
  bar (value) {
    // direct handling
  }
})

Or, more simply, just provide the file handlers in hooks:

fastify.get('/', {
  config: {
    multipart: {
      hello(stream) {
        // you must return a stream here
        return stream
      }
    }
  }
}, async function (request, reply) {
  await request.parseMultipart({
    // maybe here too
    hello (stream) {
      // you must return a stream here
      return stream
    }
  })

  // access in body or files
  request.body.other
  request.files.hello // stream
})

@mcollina
Member

The biggest issue here is that fields can be appended in random order in the body (not truly random: they appear in the order the frontend appended them to the formData), and there is no "header" listing the fields that will be encountered in the body.
So, for example, if text fields are appended after the file in the body, we have to read/receive the full file before getting access to the text values.

This is the fundamental reason why attachFieldsToBody buffers everything in memory. The solution to this problem is to stream all files to disk, and once the request has been fully uploaded, run validation and finally the user handler. A PR implementing this would be highly welcomed.

@matteorigotti

matteorigotti commented Jan 16, 2025

Hi all, is there a way to specify a dedicated behavior for a specific route?

If I could specify a different onFile function for a specific route, I think the issue would be solved, because I could parse the files in a specific way and let the other fields go into the body.
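Regarding per-route behavior: fastify plugins are encapsulated, so one approach worth trying is registering @fastify/multipart inside a scoped plugin, so its options (including onFile) apply only to the routes declared in that scope. A configuration sketch, untested; routeSpecificOnFile is a hypothetical handler:

```javascript
fastify.register(async function (scope) {
  // these multipart options are confined to routes defined in this scope
  scope.register(require('@fastify/multipart'), {
    attachFieldsToBody: true,
    onFile: routeSpecificOnFile // hypothetical per-route handler
  })

  scope.post('/upload', async function (request, reply) {
    // request.body holds the fields; files went through routeSpecificOnFile
  })
})
```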
