Option to avoid parsing entire Matroska file? #2135

hvianna · 2024-03-02T20:58:44Z

Hello!

I'm having some issues when trying to retrieve the metadata of a large (15 GB) video file with parseBlob() - disk usage skyrockets and it takes about 1 minute and 20 seconds to resolve with the metadata, so it looks like it's parsing the entire file.

Sometimes the browser just crashes or I get an out-of-memory error (having the dev tools open seems to make things worse / slower).

I tried using skipPostHeaders: true and duration: false, but it seems parseBlob() doesn't take an options object.

I'd appreciate any advice.

Kind regards.


hvianna · 2024-03-02T23:07:26Z

Update:

fetchFromUrl( url, { skipPostHeaders: true } ) also doesn't seem to prevent it from reading the entire file until it returns the metadata. At least for this particular file, which is an .mkv with an AVC video track and two audio tracks (DTS and PCM).

Borewit · 2024-07-10T18:45:18Z

Does music-metadata v9.0.0 solve your issue?

The implementation of reading from Blobs has been changed from buffering to streaming.

hvianna · 2024-07-11T21:07:00Z

I'm not sure yet, music-metadata 9.0.0 gives me this error when trying to parse mkv and webm files:

Also, do I still need a buffer polyfill for the browser? If I remove it, I can only retrieve metadata from flac files, everything else gives me the error below:

I'm testing with the following code:

// for web files (URLs)
const response = await fetch( uri );
const metadata = await parseWebStream( response.body, response.headers.get('content-type'), { skipPostHeaders: true } );

// for FileSystem API files
const file = await handle.getFile();
const metadata = await parseBlob( file );

Thanks.

pcbowers · 2024-07-12T04:14:37Z

@Borewit Unless I'm missing something, it looks like parseWebStream is not being exported and thus cannot be used: https://github.com/Borewit/music-metadata/blob/v9.0.0/lib/index.ts#L11.

Furthermore, on use of this code:

const response = await fetch(`https://my/mp3/file`);
const metadata = await parseWebStream(response.body!, response.headers.get('content-type')!, {
  skipPostHeaders: true,
  includeChapters: true,
  skipCovers: true
});

I get this error:

TypeError [ERR_INVALID_ARG_VALUE]: The argument 'stream' must be a byte stream. Received ReadableStream { locked: false, state: 'readable', supportsBYOB: false }
    at new NodeError (node:internal/errors:405:5)
    at setupReadableStreamBYOBReader (node:internal/webstreams/readablestream:2155:11)
    at new ReadableStreamBYOBReader (node:internal/webstreams/readablestream:916:5)
    at ReadableStream.getReader (node:internal/webstreams/readablestream:352:12)
    at new WebStreamReader (file:///home/pcbowers/projects/hono/node_modules/.pnpm/[email protected]/node_modules/peek-readable/lib/WebStreamReader.js:12:30)
    at Module.fromWebStream (file:///home/pcbowers/projects/hono/node_modules/.pnpm/[email protected]/node_modules/strtok3/lib/core.js:25:36)
    at Module.parseWebStream (file:///home/pcbowers/projects/hono/node_modules/.pnpm/[email protected]/node_modules/music-metadata/lib/core.js:29:39)
    at Array.eval (/home/pcbowers/projects/hono/src/index.ts:12:48)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async getRequestListener.overrideGlobalObjects (file:///home/pcbowers/projects/hono/node_modules/.pnpm/@[email protected][email protected]/node_modules/@hono/vite-dev-server/dist/dev-server.js:69:32) {
  code: 'ERR_INVALID_ARG_VALUE'

I wish I knew more about it or else I would have debugged further! Leaving this here instead of on a new issue since I think fixing this would solve "avoid parsing entire file"

Borewit · 2024-07-12T05:34:10Z

~~Please put #2135 (comment) as a new issue @pcbowers , it is unrelated.~~

Moved #2135 (comment) to issue #2143

Borewit · 2024-07-12T10:07:21Z

> I'm not sure yet, music-metadata 9.0.0 gives me this error when trying to parse mkv and webm files:

That was bad. Do you mind giving it a try with v9.0.1, @hvianna?

hvianna · 2024-07-12T17:45:40Z

> > I'm not sure yet, music-metadata 9.0.0 gives me this error when trying to parse mkv and webm files:
>
> That was bad, do you mind giving it a try with v9.0.1 @hvianna ?

It works fine for flac and mp3, no more Buffer-related errors.

I'm still getting errors for webm and mkv, though.

using parseWebStream():

TypeError: Cannot read properties of undefined (reading 'docType')
    at MatroskaParser.parse (MatroskaParser.js:50:68)
    at async parse (ParserFactory.js:57:5)
    at async retrieveMetadata (index.js:3172:17)

using parseBlob():

Error: End-Of-Stream
    at ReadStreamTokenizer.readBuffer (ReadStreamTokenizer.js:44:19)
    at async MatroskaParser.readBuffer (MatroskaParser.js:221:9)
    at async MatroskaParser.parseContainer (MatroskaParser.js:151:39)
    at async MatroskaParser.parseContainer (MatroskaParser.js:139:33)
    at async MatroskaParser.parseContainer (MatroskaParser.js:139:33)
    at async MatroskaParser.parse (MatroskaParser.js:49:26)
    at async parse (ParserFactory.js:57:5)
    at async retrieveMetadata (index.js:3175:17)

Borewit · 2024-07-12T17:58:27Z

parseBlob() calls parseWebStream() internally, so it is odd that you get inconsistent results.

music-metadata/lib/core.ts

Lines 23 to 29 in d6c2755

export async function parseBlob(blob: Blob, options: IOptions = {}): Promise<IAudioMetadata> {
  const fileInfo: strtok3.IFileInfo = {mimeType: blob.type, size: blob.size};
  if (blob instanceof File) {
    fileInfo.path = (blob as File).name;
  }
  return parseWebStream(blob.stream() as any, fileInfo, options);
}

Do you experience the same issues here?: https://audio-tag-analyzer.netlify.app/

hvianna · 2024-07-12T18:04:53Z

> Do you experience the same issues here?: https://audio-tag-analyzer.netlify.app/

Yes, same error. I tried with a few video formats (webm, mkv, mp4).

File info for one of them:

General
Complete name                            : W:\DIY - Tips & Tricks - Tips in life.mp4
Format                                   : MPEG-4
Format profile                           : Base Media
Codec ID                                 : isom (isom/iso2/avc1/mp41)
File size                                : 24.9 MiB
Duration                                 : 4 min 11 s
Overall bit rate                         : 828 kb/s
Frame rate                               : 30.000 FPS
Writing application                      : Lavf58.29.100

Video
ID                                       : 1
Format                                   : AVC
Format/Info                              : Advanced Video Codec
Format profile                           : [email protected]
Format settings                          : CABAC / 5 Ref Frames
Format settings, CABAC                   : Yes
Format settings, Reference frames        : 5 frames
Codec ID                                 : avc1
Codec ID/Info                            : Advanced Video Coding
Duration                                 : 4 min 11 s
Bit rate                                 : 692 kb/s
Width                                    : 576 pixels
Height                                   : 1 024 pixels
Display aspect ratio                     : 0.562
Frame rate mode                          : Constant
Frame rate                               : 30.000 FPS
Color space                              : YUV
Chroma subsampling                       : 4:2:0
Bit depth                                : 8 bits
Scan type                                : Progressive
Bits/(Pixel*Frame)                       : 0.039
Stream size                              : 20.8 MiB (84%)
Title                                    : Twitter-vork muxer
Writing library                          : x264 core 164 r3095 baee400
Encoding settings                        : cabac=1 / ref=5 / deblock=1:0:0 / analyse=0x3:0x113 / me=hex / subme=2 / psy=0 / mixed_ref=1 / me_range=16 / chroma_me=1 / trellis=1 / 8x8dct=1 / cqm=0 / deadzone=21,11 / fast_pskip=1 / chroma_qp_offset=0 / threads=4 / lookahead_threads=1 / sliced_threads=0 / nr=0 / decimate=1 / interlaced=0 / bluray_compat=0 / stitchable=1 / constrained_intra=0 / bframes=3 / b_pyramid=2 / b_adapt=1 / b_bias=0 / direct=1 / weightb=1 / open_gop=0 / weightp=2 / keyint=infinite / keyint_min=30 / scenecut=40 / intra_refresh=0 / rc_lookahead=40 / rc=crf / mbtree=1 / crf=28.0 / qcomp=0.60 / qpmin=10 / qpmax=69 / qpstep=4 / vbv_maxrate=2048 / vbv_bufsize=2048 / crf_max=0.0 / nal_hrd=none / filler=0 / ip_ratio=1.40 / aq=2:1.00
Codec configuration box                  : avcC

Audio
ID                                       : 2
Format                                   : AAC LC
Format/Info                              : Advanced Audio Codec Low Complexity
Codec ID                                 : mp4a-40-2
Duration                                 : 4 min 11 s
Bit rate mode                            : Constant
Bit rate                                 : 128 kb/s
Channel(s)                               : 2 channels
Channel layout                           : L R
Sampling rate                            : 44.1 kHz
Frame rate                               : 43.066 FPS (1024 SPF)
Compression mode                         : Lossy
Stream size                              : 3.84 MiB (15%)
Title                                    : Twitter-vork muxer
Default                                  : Yes
Alternate group                          : 1

Borewit · 2024-07-12T18:15:43Z

I managed to get an end-of-stream exception as well, parsing an MP4 file.

Issue may be caused by https://github.com/Borewit/peek-readable/blob/master/lib/WebStreamReader.ts

Not something I can resolve quickly.

hvianna · 2024-07-12T18:29:56Z

No problem, thanks for investigating this.

In the meantime, I'll keep testing it with more audio files. I love the fact that my bundle size has decreased by around 100 kB with the new music-metadata, compared to the latest music-metadata-browser. Awesome job!

hvianna · 2024-07-20T15:10:36Z

I did some testing with music-metadata v9.0.3 and this is what I got:

| file size | container | audio streams | time to resolve |
| --- | --- | --- | --- |
| 2.3 GB | mp4 | aac | 12 s |
| 4.3 GB | mkv | ac3 + dts | 24 s |
| 15 GB | mkv | dts + pcm | 80 s |
| 17 GB | mkv | pcm | 99 s |

It still reads the entire file, even with { skipPostHeaders: true } in the options, or if I set fileInfo.size to a small value.

I'm not sure if this can be avoided at all, since I don't think you can skip to a random position in the stream (without reading all the data up to that point sequentially).
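
As a rough check on those numbers, here is a small TypeScript sketch that converts each measurement into an effective read rate, assuming 1 GB = 1000 MB. It is not part of music-metadata; the `Measurement` type and the `throughputMBps` helper are made up for illustration. A roughly constant rate across very different file sizes is what one would expect if the whole file is read sequentially.

```typescript
// One row of the measurements reported above.
interface Measurement {
  sizeGB: number;
  container: string;
  seconds: number;
}

const measurements: Measurement[] = [
  { sizeGB: 2.3, container: 'mp4', seconds: 12 },
  { sizeGB: 4.3, container: 'mkv', seconds: 24 },
  { sizeGB: 15, container: 'mkv', seconds: 80 },
  { sizeGB: 17, container: 'mkv', seconds: 99 },
];

// Effective read rate in MB/s, assuming 1 GB = 1000 MB.
function throughputMBps(m: Measurement): number {
  return (m.sizeGB * 1000) / m.seconds;
}

for (const m of measurements) {
  console.log(`${m.sizeGB} GB ${m.container}: ~${Math.round(throughputMBps(m))} MB/s`);
}
```

All four files land between about 170 and 195 MB/s, so the time to resolve grows in step with file size.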

Borewit · 2024-07-21T08:37:23Z

The atom-based format parsers, MP4Parser and MatroskaParser, do not change their behavior based on any of these flags.

Changing the file size will affect how the container format is read. Whether that has an impact depends on the structure of the file: the length of the nested atoms will usually override the parent atom / container size.

There are a few approaches possible to get your metadata result faster:

1: Read only a portion of the stream
Currently not implemented, but we could add an option to the parser to read as little as possible. The challenge with atom-based formats is that metadata atoms are not guaranteed to appear first. Nor is it straightforward to determine at which point in the stream (at which atom) we have all or most of the metadata.

> I don't think you can skip to a random position

No, that is not directly possible. But the underlying token architecture (see dependencies) is designed so that, if the underlying file access does support skipping to a random position, that can be utilized, which brings us to the next option:

2: Utilize the tokenizer
You cannot skip in a stream, but it is possible to read your file in smaller sub-streams using @tokenizer/http. This requires your web back-end to support HTTP(S) range requests (RFC 7233). The file format being read, plus the network delay, determine whether this method is more efficient, or even slower, than reading the file as a normal stream.
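
To illustrate what a range-based reader has to do on the wire, here is a minimal TypeScript sketch of the two header formats from RFC 7233. This is not the code of @tokenizer/http; the helper names `rangeHeader` and `parseContentRange` and the `ByteRange` type are hypothetical.

```typescript
// Parsed form of a Content-Range response header.
interface ByteRange {
  start: number;
  end: number;          // inclusive
  size: number | null;  // total resource size, null when the server sends "*"
}

// Build the value of a Range request header for `length` bytes at `offset`.
function rangeHeader(offset: number, length: number): string {
  if (!Number.isInteger(offset) || !Number.isInteger(length) || offset < 0 || length <= 0) {
    throw new RangeError('offset must be >= 0 and length must be > 0');
  }
  // Byte ranges are inclusive on both ends.
  return `bytes=${offset}-${offset + length - 1}`;
}

// Parse a Content-Range value such as "bytes 0-1023/146515".
function parseContentRange(header: string): ByteRange {
  const match = /^bytes (\d+)-(\d+)\/(\d+|\*)$/.exec(header.trim());
  if (!match) {
    throw new Error(`Unsupported Content-Range: ${header}`);
  }
  return {
    start: Number(match[1]),
    end: Number(match[2]),
    size: match[3] === '*' ? null : Number(match[3]),
  };
}
```

A client would send the first string as the `Range` request header and, on a `206 Partial Content` response, read the total size from `Content-Range`, so that later reads can jump to any offset.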

3: Get early access to the metadata
With the observer option in the options object, you can receive a notification whenever the metadata is updated. Strictly speaking, this makes parsing the file slightly slower, but it allows you to have results as soon as the metadata has been read.
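
A minimal sketch of how option 3 could be used, assuming the observer is a callback that receives events shaped like `{ tag: { type, id, value } }`. The `TagEvent` type is a simplified local stand-in, and `createEarlyObserver` is a hypothetical helper, not part of music-metadata.

```typescript
// Simplified stand-in for the event passed to the observer (assumed shape).
interface TagEvent {
  tag: { type: string; id: string; value: unknown };
}

type TagMap = Record<string, unknown>;

// Returns an observer that calls `onReady` once, as soon as every tag id in
// `wanted` has been reported, instead of waiting for parsing to finish.
function createEarlyObserver(
  wanted: string[],
  onReady: (tags: TagMap) => void
): (event: TagEvent) => void {
  const found: TagMap = {};
  let fired = false;
  return (event: TagEvent): void => {
    const { id, value } = event.tag;
    // Ignore tags we did not ask for, and everything after the callback fired.
    if (fired || wanted.indexOf(id) === -1) return;
    found[id] = value;
    if (wanted.every((key) => key in found)) {
      fired = true;
      onReady({ ...found });
    }
  };
}
```

Passed as `{ observer: createEarlyObserver(['title', 'artist'], showTags) }` in the options, the callback would fire as soon as both tags have been seen, while parsing continues in the background. `showTags` is a placeholder for your own handler.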

Borewit · 2024-08-14T16:31:08Z

In PR #2213 I am working towards asynchronous parsing of Matroska, instead of extracting metadata from the full tree. I hope to be able to parse fewer elements, to speed up the overall process.

hvianna · 2024-08-14T17:42:53Z

@Borewit Thanks for the update, much appreciated!

Borewit · 2024-08-14T19:21:17Z

It is very tricky; it looks like not all metadata is necessarily at the beginning of the file.

For a 1 GB remote video file (on AWS S3 cloud), I could bring the parsing time down from 45 seconds to 500 ms by quitting after receiving the first segment/cluster element.

With that hack, other Matroska files fail, as they have metadata further on in the file.

With partial read support, there are possible optimizations to be made. There are certainly elements I parsed which are not even used. I flagged a bunch of them to be ignored, but it does not do magic. The elements I am interested in are sometimes on the same level as (many) elements I am not interested in. So it is hard to efficiently seek in the file.

Borewit · 2024-08-15T16:24:24Z

I managed to skip multiple segment/cluster elements at once, using the SeekHead index (ref). Implementation in: #2219

I was able to parse a 1 GB remote video file (on AWS S3 cloud) in 600 ms. I do not expect any improvement on a flat stream; you need a seekable medium.
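
Why a seekable medium matters can be sketched with the EBML element header that Matroska is built on: each element starts with a variable-length ID followed by a variable-length data size, so a reader that can seek is able to jump over a Cluster payload without reading it. The following TypeScript is a simplified illustration, not the MatroskaParser implementation; `readElementHeader` and its return shape are hypothetical, and the reserved "unknown size" encoding and sizes above 2^53 are not handled.

```typescript
interface ElementHeader {
  id: number;           // element ID, length-marker bits included
  dataSize: number;     // payload size in bytes
  headerLength: number; // bytes used by ID + size fields
  nextOffset: number;   // offset of the element that follows this one
}

// Width of an EBML variable-length integer: the number of leading zero bits
// in the first byte, plus one.
function vintLength(firstByte: number): number {
  if (firstByte === 0) throw new Error('Invalid VINT: no length marker');
  return Math.clz32(firstByte) - 23;
}

// Read a VINT at `offset`. IDs keep the marker bit; sizes drop it.
function readVint(bytes: Uint8Array, offset: number, keepMarker: boolean) {
  const length = vintLength(bytes[offset]);
  if (offset + length > bytes.length) throw new Error('Truncated VINT');
  let value = keepMarker ? bytes[offset] : bytes[offset] & (0xff >> length);
  for (let i = 1; i < length; i++) {
    // Multiply instead of shifting to stay clear of 32-bit overflow.
    value = value * 256 + bytes[offset + i];
  }
  return { value, length };
}

function readElementHeader(bytes: Uint8Array, offset: number): ElementHeader {
  const id = readVint(bytes, offset, true);
  const size = readVint(bytes, offset + id.length, false);
  const headerLength = id.length + size.length;
  return {
    id: id.value,
    dataSize: size.value,
    headerLength,
    nextOffset: offset + headerLength + size.value,
  };
}

// A Cluster header (ID 0x1F43B675) announcing a 512-byte payload.
const header = readElementHeader(new Uint8Array([0x1f, 0x43, 0xb6, 0x75, 0x42, 0x00]), 0);
console.log(header.id.toString(16), header.dataSize, header.nextOffset);
```

On a flat stream the same `nextOffset` can only be reached by reading and discarding every byte in between, whereas a seekable source can jump there directly.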

hvianna changed the title ~~Avoid parsing entire file when using parseBlob()?~~ Option to avoid parsing entire file? Mar 2, 2024

Borewit transferred this issue from Borewit/music-metadata-browser Jul 10, 2024

Borewit mentioned this issue Jul 12, 2024

music-metadata 9.0.0 has still Buffer dependencies #2141

Closed

Borewit mentioned this issue Jul 12, 2024

parseWebStream not exported in Node.js entry point #2143

Closed

Borewit mentioned this issue Jul 13, 2024

Issue parsing Matroska files using browser Web Streams #2145

Closed

Borewit added the improvement Improvement of existing functionality label Jul 13, 2024

Borewit mentioned this issue Aug 15, 2024

Add option.mkvUseIndex to use index to skip Matroska cluster elements #2219

Merged

Borewit changed the title ~~Option to avoid parsing entire file?~~ Option to avoid parsing entire Matroska file? Aug 15, 2024

Borewit closed this as completed in #2219 Aug 15, 2024

Borewit mentioned this issue Aug 15, 2024

Add unit tests reading from AWS S3 cloud with music-metadata Borewit/tokenizer-s3#1286

Merged
