Limit the lengths of parsed CSV cells #70

Open
aengelberg opened this issue Feb 10, 2018 · 2 comments

@aengelberg

I regularly work with dirty CSVs that have overly long cells between separators for some reason:

a,b,c
d,e,fffffffffff...

or have a misplaced double quote that mistakenly implies an especially large cell:

a,b,c
d,e,"f
g,h,i
j,k,l
... (the rest of the file is one cell?)

Both of these cases trigger an OutOfMemoryError if I use the Jackson parser. I would like to set a hard limit of, say, 1MB per cell, so that Jackson will halt before trying (and failing) to buffer large amounts of text into memory.

@aengelberg
Author

Are there any known workarounds that would let me effectively achieve this behavior with the current Jackson API? For example, plug in some kind of faux "parser" that throws away data if it goes over a certain threshold?

@cowtowncoder
Member

Unfortunately, I don't think this would be an easy thing to do right now.

In theory such things are doable: for example XML parsers often support this. Woodstox, for example:

https://medium.com/@cowtowncoder/configuring-woodstox-xml-parser-woodstox-specific-properties-1ce5030a5173

has a large set of maximum size limits.

What is generally needed is support from the low-level parser (JsonParser and its subtypes) so that it can enforce limits, usually when reading a new buffer-full of data (so every 4k bytes or characters). But that has to be done for each format backend separately. Limits applied above the parser level come a little too late to enforce.

Another possibility, which could be a bit more general, would be to allow settings in the buffering class (TextBuffer, I think); this would be more generic, if coarser.

But I fear that tackling this problem would be best done with Jackson 3.0 (under development), the reason being that it allows much better configurability of format backends.

So... I don't really have a good solution at this point. What could perhaps work, from your end, is writing a custom Reader subtype that wraps the read() method. If it could interact with higher-level code, it could throw an exception if a length maximum was violated. I'm not completely sure how it should interact (perhaps the caller would need to effectively reset state between rows or tokens), but that would be one approach.
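
A minimal sketch of that Reader-wrapping idea, using only java.io. The class name `LimitingReader` and its `resetCount()` method are hypothetical, not part of Jackson. It counts characters handed to the consumer since the last reset and throws once the count passes a maximum, so the caller is assumed to reset it between rows:

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;

// Hypothetical helper (not a Jackson class): fails fast when too many
// characters are read between two calls to resetCount().
class LimitingReader extends FilterReader {
    private final long maxChars;
    private long count;

    LimitingReader(Reader in, long maxChars) {
        super(in);
        this.maxChars = maxChars;
    }

    // Assumed to be called by higher-level code, e.g. after each parsed row.
    void resetCount() {
        count = 0;
    }

    @Override
    public int read() throws IOException {
        int c = super.read();
        if (c >= 0) {
            add(1);
        }
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        int n = super.read(buf, off, len);
        if (n > 0) {
            add(n);
        }
        return n;
    }

    // Accumulate characters read since the last reset and enforce the limit.
    private void add(int n) throws IOException {
        count += n;
        if (count > maxChars) {
            throw new IOException(
                "Read " + count + " chars since last reset; limit is " + maxChars);
        }
    }
}
```

Because the parser pulls input a buffer-full at a time, the count runs ahead of the row actually being parsed, so this is a coarse guard rather than an exact per-cell limit. The maximum should be set comfortably above the parser's read buffer size, and the exception will surface from whatever call triggered the read.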
