Releases: conjuncts/gmft
v0.4.0
Features
3 new table structure recognition options!
- Added `TabledFormatter`, with support of the fantastic new Tabled library from VikParuchuri. Check out the demo notebook for a quick example.
- Added `HistogramFormatter`, a super-fast and decently accurate algorithmic option for table structure recognition. The algorithm uses word bboxes to detect separating lines between text. Check out the demo notebook for a quick example.
- Added `DITRFormatter`. This formatter is a blend between TATRFormatter and HistogramFormatter, being trained to recognize table separating lines rather than cells. It fine-tunes `microsoft/table-transformer-structure-recognition-v1.1-all` on PubTables-1M for 15 epochs. Its main draw is mixing and matching deep and algorithmic separating line detection. Check out the demo notebook for a quick example.
These formatters can all be used in combination with any detector (like TATRDetector).
[Figure: a visual to explain `HistogramFormatter`]
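The idea of detecting separating lines from word bboxes can be illustrated with a small, self-contained sketch. This is not gmft's implementation; the function name `separator_positions` and the sample coordinates are hypothetical. It projects word boxes onto one axis and places a separator in the middle of every gap that no word covers.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1), an assumed layout


def separator_positions(intervals: List[Tuple[float, float]]) -> List[float]:
    """Return the midpoint of every gap between covered stretches on one axis.

    Hypothetical helper for illustration only, not part of gmft.
    """
    positions = []
    covered_until = None  # right edge of the text seen so far
    for start, end in sorted(intervals):
        if covered_until is not None and start > covered_until:
            # No word overlaps (covered_until, start): a separator fits here.
            positions.append((covered_until + start) / 2)
        covered_until = end if covered_until is None else max(covered_until, end)
    return positions


# Four words laid out as a 2x2 table (made-up coordinates).
words: List[BBox] = [
    (0, 0, 30, 10),    # "Name"
    (50, 0, 70, 10),   # "Age"
    (0, 20, 35, 30),   # "Alice"
    (50, 20, 60, 30),  # "31"
]

column_separators = separator_positions([(x0, x1) for x0, _, x1, _ in words])
row_separators = separator_positions([(y0, y1) for _, y0, _, y1 in words])
```

Here the sketch finds one vertical separator between the two columns and one horizontal separator between the two rows. A real formatter would also need to handle noise such as words that straddle a column boundary.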
Bugfixes
- Tweaked spanning cell merging
- Fixed bug where it would overwrite data
- Give warning when importing from `gmft` directly (use `gmft.auto` instead)
- Merged PR #32, thanks!
v0.4.0.rc1
v0.4.0rc1
Exciting upcoming changes:
- Added `TabledFormatter`, with support of the fantastic new Tabled library. Check out the demo notebook for a quick example.
- Added `IntervalicFormatter`, a super-fast and fairly accurate algorithmic option for table structure recognition. Check out the demo notebook for a quick example.
- These formatters can all be used in combination with any detector (like TATRDetector).
v0.3.x
v0.3.2
Changes:
- Raised the default thresholds of the heuristic that rejects tables on high overlap. This makes ValueErrors rarer.
  - (total_overlap_reject_threshold) ValueError thrown on overlap > 90%, up from 20%
  - (total_overlap_warn_threshold) warning issued on overlap > 10%, up from 5%
- Python 3.9 compatibility.
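The two overlap thresholds can be pictured as a simple warn-or-reject check. The sketch below is an illustration only, assuming overlap is expressed as a fraction between 0 and 1; the function `check_total_overlap` is hypothetical and is not gmft's code. Only the threshold values (0.9 and 0.1) come from the notes above.

```python
import warnings


def check_total_overlap(
    total_overlap: float,
    reject_threshold: float = 0.9,  # was 0.2 before v0.3.2
    warn_threshold: float = 0.1,    # was 0.05 before v0.3.2
) -> str:
    """Hypothetical check: raise, warn, or pass depending on the overlap."""
    if total_overlap > reject_threshold:
        raise ValueError(f"total overlap {total_overlap:.0%} is too high")
    if total_overlap > warn_threshold:
        warnings.warn(f"total overlap {total_overlap:.0%} is high")
        return "warned"
    return "ok"
```

Under these defaults, a table with 50% overlap now produces a warning, whereas with the earlier 20% reject threshold it would have raised a ValueError.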
v0.3.1
Bugfix:
- divide by 0 when taking median of empty list in row height estimate
- Fix broken build in v0.3.0 (missing formatters)
Changes:
- Added `Img2TableDetector`.
- Refactor of code into organizational modules, `detectors` and `formatters`
- Importing from `gmft` is no longer encouraged. Please import from `gmft.auto` instead.
- Tentative rich_text module and FormattedPage for direct RAG embedding usage
- Configs are now dataclasses. However, a possibly breaking change is that passing `config_overrides` will now completely replace the config, rather than updating it.
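The difference between replacing and updating a dataclass config can be shown with a standalone sketch. The class `ExampleFormatterConfig` and its fields are hypothetical stand-ins, not gmft's actual config classes; only the replace-versus-update behavior is taken from the note above.

```python
from dataclasses import dataclass, replace


@dataclass
class ExampleFormatterConfig:
    """Hypothetical config, for illustration only."""
    verbosity: int = 1
    remove_null_rows: bool = True


# A config that was customized earlier.
customized = ExampleFormatterConfig(verbosity=3)

# "Replace" semantics: the override is a whole new config, so any field
# not set on it falls back to the class default, not to `customized`.
override = ExampleFormatterConfig(remove_null_rows=False)

# "Update" semantics, done explicitly: copy `customized` and change one field.
merged = replace(customized, remove_null_rows=False)
```

In this sketch `override.verbosity` is back at the default, while `merged` keeps the customized verbosity. Code that relied on the old update behavior would need something like the explicit `dataclasses.replace` call to carry earlier settings forward.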
v0.2.2
Changes
- `is_projecting_row` is removed, with the information now available under `FormattedTable._projecting_indices`
- Formally removed `timm` as a dependency
- Slight tweak to captions with the aim to better reflect paragraph word height, still WIP. See #8 and be93159
- Fix: return result so image can be used outside of notebook by @brycedrennan in #15
Full Changelog: v0.2.1...v0.2.2
v0.2.1
- GPU support, thank you @MathiasToftas!
Full Changelog: v0.2.0...v0.2.1
v0.2.0
Features:
- Multiple headers; multi-index tables (6225043)
- Spanning cells on both the top and left (bbbbd7c)
- Captions for tables (ca18bcc)
- "Margin" parameter allows text outside of table bbox to be included (ab81f22)
- Return visualized images as PIL image; allow padding or margin around visualized (ab81f22)
Several tweaks to formatting algorithm that may result in different outputs compared to prior versions.
- Automatically drop rows whose only non-null value is the "is_projecting_row" column
- Fill in gaps between table rows, to reduce skipped text
- Non-maxima suppression, as seen in inference.py (ab81f22)
- "total overlap" metric has become less useful in favor of "rows removed by NMS"
- Widen out the rows to same length
- Several tweaks to conditions, parameters, heuristics
- superscripts/subscripts now more likely to be merged to their parent rows
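For readers unfamiliar with non-maxima suppression, here is a minimal greedy sketch over candidate row boxes. It is not the code from inference.py or gmft; the function names are hypothetical, and the overlap measure (intersection over union) and the 0.5 threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple

BBox = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def area(box: BBox) -> float:
    """Area of a box; degenerate (inverted) boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: BBox, b: BBox) -> float:
    """Intersection over union of two boxes."""
    inter = area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def non_maxima_suppression(
    candidates: List[Tuple[float, BBox]], threshold: float = 0.5
) -> List[Tuple[float, BBox]]:
    """Keep the highest-scoring boxes, dropping any that overlap a kept box too much."""
    kept: List[Tuple[float, BBox]] = []
    for score, box in sorted(candidates, key=lambda c: c[0], reverse=True):
        if all(iou(box, kept_box) <= threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept


# Two near-duplicate rows and one distinct row (made-up values).
rows = [
    (0.9, (0, 0, 100, 10)),
    (0.6, (0, 1, 100, 11)),   # overlaps the first row heavily
    (0.8, (0, 20, 100, 30)),
]
kept_rows = non_maxima_suppression(rows)
```

In this example the lower-scoring near-duplicate row is suppressed and the other two rows are kept.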
Many possibly breaking changes to config.
- `TableDetectorConfig.confidence_score_threshold` has been renamed to `TableDetectorConfig.detector_base_threshold`
- `TableFormatter.deduplication_iob_threshold` has been removed in favor of `nms_iob_threshold`
- `spanning_cell_minimum_width`, `corner_clip_outlier_threshold`, and `aggregate_spanning_cells` have been removed
- Tweaks to default settings may yield different results
- `no_timm` is now the default, which fixes #1.
  - this might cause slightly different bboxes
v0.1.1
- Created AutoTableFormatter and AutoTableDetector for future flexibility (v0.1.1, a840488)
- Renamed is_spanning_row to is_projecting_row (v0.1.1, a840488)
Older:
- Even better accuracy for large tables (v0.1.0, 8c537ed)
Full Changelog: v0.1.0...v0.1.1