"Sentences must be at least two words long, unless a linebreak or end-of-text." #4

NinoSkopac · 2017-07-26T15:06:17Z

Hey, why this logic?

What if someone writes:

See it. Report it. Sorted.

Will "Sorted." get ignored?

vanderlee · 2017-07-26T15:27:12Z

Because it does not know that "it" isn't an abbreviation.

Without the ability to actually understand the words (which would be the holy grail of artificial intelligence, so slightly outside scope), the algorithm might as well be parsing "Bla bo. Bubble bi. Booboo." There is no hint at all as to whether a period marks an abbreviation or the end of a sentence.

It might as well be parsing "See eg. Boston mr. Garcia", which has all the same character counts and the same positions of caps and periods, yet is only one sentence (albeit a bad one, but still something that should be detected as a single sentence).
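To make the comparison concrete, here is a minimal Python sketch that reduces both examples to their surface shape. It is only an illustration, not code from PHP-Sentence; `shape` is a hypothetical helper, and a trailing period is added to the second example so the two strings line up.

```python
def shape(text):
    """Reduce text to its surface shape: 'A' for uppercase, 'a' for lowercase,
    every other character (spaces, periods) left unchanged."""
    out = []
    for ch in text:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)

three_sentences = "See it. Report it. Sorted."
one_sentence = "See eg. Boston mr. Garcia."  # trailing period added for symmetry (assumption)

print(shape(three_sentences))                          # Aaa aa. Aaaaaa aa. Aaaaaa.
print(shape(three_sentences) == shape(one_sentence))   # True
```

Any rule that looks only at capitals, periods and word lengths has to treat both strings the same way, so it gets at least one of them wrong.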

As such, PHP-Sentence (like basically all other sentence boundary disambiguation algorithms) makes a best-effort attempt at detecting whether a period is the end of a sentence, part of an abbreviation, or just... an ellipsis.

NinoSkopac · 2017-07-26T15:32:00Z

I feel you, but how about loading a list of abbreviations in English (and DE and NL, if you wanna maintain those languages), and then checking whether the single word is an abbreviation?

I think that simply ignoring. all. single. worded. sentences. isn't. very. good.
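A rough sketch of that idea, in Python for brevity rather than PHP. This is not how PHP-Sentence or any other library implements it; `ABBREVIATIONS` and `split_sentences` are hypothetical names, and the list is a tiny placeholder for a real per-language one.

```python
import re

# Hypothetical, deliberately tiny list; a real one would be much longer
# and maintained separately per language.
ABBREVIATIONS = {"eg", "mr", "dr", "etc"}

def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split after '.', '!' or '?' followed by whitespace, unless the word
    before a single period is a known abbreviation."""
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+\s+", text):
        words = text[start:match.start()].split()
        last_word = words[-1].lower() if words else ""
        if match.group().strip() == "." and last_word in abbreviations:
            continue  # the period belongs to an abbreviation, keep scanning
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("See it. Report it. Sorted."))
# ['See it.', 'Report it.', 'Sorted.']
print(split_sentences("See eg. Boston mr. Garcia."))
# ['See eg. Boston mr. Garcia.']
```

A list like this still cannot tell when an abbreviation happens to end a sentence, or when a listed abbreviation is also an ordinary word, so it narrows the problem rather than solving it.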

NinoSkopac · 2017-07-26T23:42:03Z

This is a really good lib for this: https://github.com/bigwhoop/sentence-breaker

vanderlee · 2017-07-28T09:38:49Z

I think a list of abbreviations could be added without too much effort. I'll look into it, though no promises as to a date.

I know about sentence-breaker, but I don't know about its quality. The test cases don't seem particularly challenging, and at first sight it seems to ignore things like colons, question marks, exclamation marks and imperfect use of punctuation. It'll probably perform better on professional, well-written texts and worse on real-world texts. For instance, they seem to deal with "... word", but not with "...word", "..word" or "....word". Their rule system is easily extensible, but it does require an individual rule for each exceptional case.
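As an illustration of why the ellipsis variants matter, here is a small Python sketch that treats any run of two or more periods as one ellipsis, whatever surrounds it. It does not reflect either library's actual rules; `ELLIPSIS` and `mask_ellipses` are hypothetical names.

```python
import re

# Two or more consecutive periods, regardless of surrounding spaces.
ELLIPSIS = re.compile(r"\.{2,}")

def mask_ellipses(text, placeholder="\u2026"):
    """Replace each run of 2+ periods with a single placeholder character,
    so a later splitting step does not mistake them for sentence ends."""
    return ELLIPSIS.sub(placeholder, text)

for sample in ["wait... word", "wait...word", "wait..word", "wait....word"]:
    print(mask_ellipses(sample))
```

Note that the sketch does not decide whether "...." is an ellipsis followed by a sentence-ending period; that would still need a separate rule.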

As far as I know, there is no scientific corpus or set of tests to compare these types of algorithms.

vanderlee closed this as completed Jul 26, 2017

vanderlee reopened this Jul 28, 2017
