- fix: pdfminer-six dependencies
- fix: add `password` for PDF
- feat: add back `source` to `TextRegions` and `LayoutElements` for backward compatibility
- fix: remove `pdfplumber` but include `pdfminer-six==20240706` to update `pdfminer`
- feat: add `text_as_html` and `table_as_cells` to `LayoutElements` class as new attributes
- feat: replace the single-valued `source` attribute from `TextRegions` and `LayoutElements` with an array attribute `sources`
- fix: removed `layoutelement.from_lp_textblock()` and related tests as it's not used
- fix: update requirements to drop `layoutparser` lib
- fix: update `README.md` to remove layoutparser model zoo support note
- fix: fix bug where passing an empty list into `TextRegions.from_list` triggers `IndexError`
- fix: fix bug where the class id mapping is not properly updated when concatenating a list of `LayoutElements`
- fix: fix list index out of range error caused by calling LayoutElements.from_list() with empty list
- fix: fix missing source after cleaning layout elements
- BREAKING Remove chipper model
- fix: fix incorrect type casting with higher versions of `numpy` when subtracting a `float` from an `int` array
- fix: fix a bug where class id 0 becomes class type `None` when calling `LayoutElements.as_list()`
- fix: store probabilities with `float` data type instead of `int`
- fix: Correctly assign mutable default value to variable in `LayoutElements` class
- fix: Correctly assign mutable default value to variable in `TextRegions` class
- refactor: remove layout analysis related code
- enhancement: Hide warning about table transformer weights not being loaded
- fix(layout): Use TemporaryDirectory instead of NamedTemporaryFile for Windows support
- refactor: use `numpy` array to store layout elements' information in one single `LayoutElements` object instead of using a list of `LayoutElement`
- fix: add input parameter validation to fill_cells() when converting cells to html
- Fix syntax for generated HTML tables
- Reduce excessive logging
- BREAKING CHANGE: removes legacy detectron2 model
- deps: remove layoutparser optional dependencies
- refactor: remove all code related to filling inferred elements text from embedded text (pdfminer).
- bug: set the Chipper max_length variable
- refactor: remove all `cid` related code that was originally added to filter out invalid `pdfminer` text
- enhancement: Wrapped hf_hub_download with a function that checks for local file before checking HF
- fix: table transformer doesn't return multiple cells with same coordinates
- fix: table transformer predictions are now removed if confidence is below threshold
- feat: allow table transformer agent to return table prediction in unparsed format
- fix: remove pin from `onnxruntime` dependency.
- feat: add a set of new `ElementType`s to extend future element types recognition
- feat: allow registering of new models for inference using `unstructured_inference.models.base.register_new_model` function
- fix: replace `Rectangle.is_in()` with `Rectangle.is_almost_subregion_of()` when filling in an inferred element with embedded text
- bug: check for None in Chipper bounding box reduction
- chore: removes `install-detectron2` from the `Makefile`
- fix: convert label_map keys read from os.environment `UNSTRUCTURED_DEFAULT_MODEL_INITIALIZE_PARAMS_JSON_PATH` to int type
- feat: removes supergradients references
- fix: assign value to `text_as_html` element attribute only if `text` attribute contains HTML tags.
- fix: added handling in `UnstructuredTableTransformerModel` for the case where `recognize` returns an empty list in `run_prediction`.
- fix: add logic to handle computation of intersections between 2 `Rectangle`s when a `Rectangle` has `None` value in its coordinates
- fix: fix a bug where a `PageLayout` object based on chipper, or on any element extraction model, lacks `image_metadata` and other attributes that are required for downstream processing; this fix also reduces the memory overhead of using chipper model
- chipper-v3: improved table prediction
- refactor: remove all OCR related code
- refactor: remove all image extraction related code
- refactor: remove all `pdfminer` related code
- enhancement: improved Chipper bounding boxes
- bug: Allow supplied ONNX models to use label_map dictionary from json file
- enhancement: Enable env variables for model definition
- enhancement: Remove Super-Gradients Dependency and Allow General Onnx Models Instead
- refactor: add a class `ElementType` for the element type constants and use the constants to replace element type strings
- enhancement: support extracting elements with types `Picture` and `Figure`
- fix: update logger in table initialization where the logger info was not showing
- chore: suppress UserWarning about specified model providers
- change the default model to yolox, as table output appears to be better and speed is similar to `yolox_quantized`
- chore: remove logger info for chipper since it's private
- fix: update broken slack invite link in chipper logger info
- enhancement: Improve error message when # images extracted doesn't match # page layouts.
- fix: use automatic mixed precision on GPU for Chipper
- fix: chipper Table elements now match other layout models' Table element format: html representation is stored in `text_as_html` attribute and `text` attribute stores text without html tags
- Handle kwargs explicitly when needed, suppress otherwise
- fix: Reduce Chipper memory consumption on x86_64 cpus
- fix: Skips ordering elements coming from Chipper
- fix: After refactoring to introduce Chipper, annotate() wasn't able to show text with extra info from elements, this is fixed now.
- feat: add table cell and dataframe output formats to table transformer's `run_prediction` call
- breaking change: function `unstructured_inference.models.tables.recognize` no longer takes `out_html` parameter and it now only returns table cell data format (lists of dictionaries)
- Allow table model to accept optional OCR tokens
- Fix: include onnx as base dependency.
- Fix a memory leak in DonutProcessor when using large images in numpy format
- Set the right settings for beam search size > 1
- Fix a bug that in very rare cases made the last element predicted by Chipper to have a bbox = None
- fix a bug where an invalid zoom factor led to exceptions; now invalid zoom factors result in no scaling of the image
- Improved packaging
- Dynamic beam search size has been implemented for Chipper, the decoding process starts with a size = 1 and changes to size = 3 if repetitions appear.
- Fixed bug when PDFMiner predicts that an image text occupies the full page and removes annotations by Chipper.
- Added random seed to Chipper text generation to avoid differences between calls to Chipper.
- Allows user to use super-gradients model if they have a callback predict function, a yaml file with names field corresponding to classes and a path to the model weights
- Integration of Chipperv2 and additional Chipper functionality, which includes automatic detection of GPU, bounding box prediction and hierarchical representation.
- Remove control characters from the text of all layout elements
- Sort elements extracted by `pdfminer` to get consistent result from `aggregate_by_block()`
- Download yolox_quantized from HF
- Remove all OCR related code except the table OCR code
- Stop passing ocr_languages parameter into paddle to avoid invalid paddle language code error; this stays in place until we have the mapping from standard language code to paddle language code.
- Add functionality to keep extracted image elements while merging inferred layout with extracted layout
- Fix `source` property for elements generated by pdfminer.
- Add 'OCR-tesseract' and 'OCR-paddle' as sources for elements generated by OCR.
- add a function to automatically scale table crop images based on text height so the text height is optimum for `tesseract` OCR task
- add the new image auto scaling parameters to `config.py`
- fix a bug where padded table structure bounding boxes are not shifted back into the original image coordinates correctly
- move the confidence threshold for table transformer to config
- YoloX_quantized is now the default model. This model detects more diverse types and detects tables better than the previous model.
- Since detection models tend to nest elements inside others (specifically in Tables), an algorithm has been added for reducing this behavior. Now all the elements produced by detection models are disjoint and they don't produce overlapping regions, which helps reduce duplicated content.
- Add `source` property to our elements, so you can know where the information was generated (OCR or detection model)
- add a config class to handle parameter configurations for inference tasks; parameters in the config class can be set via environment variables
- update behavior of `pad_image_with_background_color` so that input `pad` is applied to all sides
- Add functionality to extract and save images from the page
- Add functionality to get only "true" embedded images when extracting elements from PDF pages
- Update the layout visualization script to be able to show only image elements if needed
- add an evaluation metric for table comparison based on token similarity
- fix paddle unit tests where `make test` fails since paddle doesn't work on M1/M2 chip locally
- add env variable `ENTIRE_PAGE_OCR` to specify using paddle or tesseract on entire page OCR
- table structure detection now pads the input image by 25 pixels in all 4 directions to improve its recall
- support paddle with both cpu and gpu and assume it is pre-installed
- fix a bug where `cells_to_html` doesn't handle cells spanning multiple rows properly
- remove `cv2` preprocessing step before OCR step in table transformer
- Add functionality to bring back embedded images in PDF
- Add object-detection classification probabilities to LayoutElement for all currently implemented object detection models
- adds `safe_division` to replace 0 with machine epsilon for `float` to avoid division by 0
- apply `safe_division` to area overlap calculations in `unstructured_inference/inference/elements.py`
- Adds YoloX quantized model
- Add functionality to supplement detected layout with elements from the full page OCR
- Add functionality to annotate any layout (extracted, inferred, OCR) on a page
- Fix for incorrect type assignation at ingest test
- Use `OMP_THREAD_LIMIT` to improve tesseract performance
- Fix to no longer create a directory for storing processed images
- Hot-load images for annotation
- Handle an uncaught TesseractError
- Add TIFF test file and TIFF filetype to `test_from_image_file` in `test_layout`
- Fix extracted image elements being included in layout merge
- Add multipage TIFF extraction support
- Fix a pdfminer error when using `process_data_with_model`
- Add warning when chipper is used with < 300 DPI
- Use None default for dpi so defaults can be properly handled upstream
- Implement full-page OCR
- Handle exceptions from Tesseract
- Add alternative architecture for detectron2 (but default is unchanged)
- Updates:

| Library | From | To |
|---|---|---|
| transformers | 4.29.2 | 4.30.2 |
| opencv-python | 4.7.0.72 | 4.8.0.74 |
| ipython | 8.12.2 | 8.14.0 |

- Cache named models that have been loaded
- hotfix to handle issue storing images in a new dir when the pdf has no file extension
- Update the `annotate` and `_get_image_array` methods of `PageLayout` to get the image from the `image_path` property if the `image` property is `None`.
- Add functionality to store pdf images for later use.
- Add `image_metadata` property to `PageLayout` & set `page.image` to None to reduce memory usage.
- Update `DocumentLayout.from_file` to open only one image.
- Update `load_pdf` to return either Image objects or Image paths.
- Warns users that Chipper is a beta model.
- Exposed control over dpi when converting PDF to an image.
- Updated detectron2 version to avoid errors related to deprecated PIL reference
- Rename large model to chipper
- Added functionality to write images to computer storage temporarily instead of keeping them in memory for `pdf2image.convert_from_path`
- Added functionality to convert a PDF in small chunks of pages at a time for `pdf2image.convert_from_path`
- Table processing check for the area of the package to fix division by zero bug
- Added CUDA and TensorRT execution providers for yolox and detectron2onnx model.
- Warning for onnx version of detectron2 for empty pages suppressed.
- Tweak to element ordering to make it more deterministic
- Refactor for large model
- Combine inferred elements with extracted elements
- Add ruff to keep code consistent with unstructured
- Configure fallback for OCR token if paddleocr doesn't work to use tesseract
- Add annotation for pages
- Store page numbers when processing PDFs
- Hotfix to handle inference of blank pages using ONNX detectron2
- Revert ordering change to investigate examples of misordering
- Preserve image format in PIL.Image.Image when loading
- Added ONNX version of Detectron2 and make default model
- Remove API code, we don't serve this as a standalone API any more
- Update ordering logic to account for multicolumn documents.
- Fixed patches not being a package.
- Patch pdfminer.six to fix parsing bug
- Output of table extraction is now stored in `text_as_html` property rather than `text` property
- Added the ability to pass `ocr_languages` to the OCR agent for users who need non-English language packs.
- Added logic to partition granular elements (words, characters) by proximity
- Text extraction is now delegated to text regions rather than being handled centrally
- Fixed embedded image coordinates being interpreted differently than embedded text coordinates
- Update to how dependencies are being handled
- Update detectron2 version
- Allow extracting tables from higher level functions
- Pin protobuf version to avoid errors
- Make paddleocr an extra again
- Fix for text block detection
- Add paddleocr dependency to setup for x86_64 machines
- Suppressed processing progress bars
- Add table processing
- Change OCR logic to be aware of PDF image elements
- Fix for processing RGBA images
- Fixed some cases where image elements were not being OCR'd
- Removed control characters from tesseract output
- Removed multithreading from OCR (DocumentLayout.get_elements_from_layout)
- Refactored YoloX inference code to integrate better with framework
- Improved testing time
- Fixed duplicated load_pdf call
- Add donut model script for image prediction
- Add sample receipt and test for donut prediction
- Add YoloX model for images and PDFs
- Add generic model interface
- Download default model from huggingface
- Clarify error when trying to open file that doesn't exist as an image
- Pins the version of `opencv-python` for linux compatibility
- Add capability to process image files
- Add logic to use OCR when layout text is full of unknown characters
- Refactor to facilitate local inference
- Removes BasicConfig from logger configuration
- Implement auto model downloading
- Initial release of unstructured-inference
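
The `safe_division` entries above describe replacing a zero denominator with machine epsilon before dividing, then using that in area overlap calculations. A minimal sketch of the idea follows; it is not the library's actual code. The function body, the `FLOAT_EPSILON` constant, and the `overlap_ratio` helper are assumptions made for illustration.

```python
import sys

# Machine epsilon for Python floats (assumed constant name, for illustration).
FLOAT_EPSILON = sys.float_info.epsilon


def safe_division(numerator: float, denominator: float) -> float:
    """Divide, substituting machine epsilon when the denominator is exactly 0."""
    if denominator == 0:
        denominator = FLOAT_EPSILON
    return numerator / denominator


def overlap_ratio(intersection_area: float, region_area: float) -> float:
    """Hypothetical helper: share of a region covered by an intersection.

    A degenerate region with zero area no longer raises ZeroDivisionError.
    """
    return safe_division(intersection_area, region_area)


if __name__ == "__main__":
    print(overlap_ratio(25.0, 100.0))  # ordinary case
    print(overlap_ratio(0.0, 0.0))     # zero-area region, no exception
```

With this approach a zero-area region with no intersection yields a ratio of 0, while a nonzero numerator over a zero denominator yields a very large finite number rather than an exception, so callers that compare against a threshold keep working.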