"Sentences must be at least two words long, unless a linebreak or end-of-text." #4

Open
NinoSkopac opened this issue Jul 26, 2017 · 4 comments

Comments

@NinoSkopac

Hey, why this logic?

What if someone writes:

See it. Report it. Sorted.

Will "Sorted." get ignored?

@vanderlee
Owner

vanderlee commented Jul 26, 2017

Because it does not know that "it" isn't an abbreviation.

Without the ability to actually understand the words (which would be the holy grail of artificial intelligence, so slightly outside scope), the algorithm might as well be parsing "Bla bo. Bubble bi. Booboo." There is no hint at all whether a period marks an abbreviation or the end of a sentence.

It might as well be parsing "See eg. Boston mr. Garcia", which has the same character counts and the same positions of capitals and periods, yet is only one sentence (a bad one, but still something that should be detected as a single sentence).

As such, PHP-Sentence (like basically all other sentence boundary disambiguation algorithms) makes a best-effort attempt at detecting whether a period is the end of a sentence, part of an abbreviation, or just... an ellipsis.
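To make the ambiguity concrete, here is a minimal Python sketch (not PHP-Sentence code; the `shape` helper is a hypothetical name introduced only for illustration). It reduces each text to its pattern of capitals, lowercase letters, spaces and periods, and shows that the two examples from this thread are indistinguishable at that level once the final period is set aside.

```python
def shape(text: str) -> str:
    """Replace uppercase letters with 'A' and lowercase letters with 'a'.

    Everything else (spaces, periods) is kept, so the result describes
    only what a purely character-based splitter can see.
    """
    out = []
    for ch in text:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        else:
            out.append(ch)
    return "".join(out)


three_sentences = "See it. Report it. Sorted."
one_sentence = "See eg. Boston mr. Garcia"

# Ignore the trailing period so only the interior periods are compared.
same_shape = shape(three_sentences.rstrip(".")) == shape(one_sentence)
print(same_shape)
```

Any rule that looks only at lengths, capitals and periods has to treat both inputs the same way, which is why some extra knowledge about the words themselves is needed.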

@NinoSkopac
Author

I feel you, but how about loading a list of abbreviations in English (and DE and NL, if you wanna maintain those languages), and then checking whether the single word is an abbreviation?
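A rough sketch of that idea in Python (an assumption-laden illustration, not how PHP-Sentence works: the `ABBREVIATIONS` set and the `split_sentences` function are hypothetical, and the list is far from complete):

```python
# Hypothetical, deliberately tiny abbreviation list; a real one would be
# loaded per language (EN, DE, NL, ...).
ABBREVIATIONS = {"eg", "mr", "mrs", "dr", "etc"}


def split_sentences(text, abbreviations=ABBREVIATIONS):
    """Split on '.', '!' and '?' unless the word before a period is a known abbreviation."""
    sentences = []
    current = []
    for token in text.split():
        current.append(token)
        if token.endswith((".", "!", "?")):
            word = token.rstrip(".!?").lower()
            if token.endswith(".") and word in abbreviations:
                # Known abbreviation: keep collecting the same sentence.
                continue
            sentences.append(" ".join(current))
            current = []
    if current:
        # Trailing text without closing punctuation is still a sentence.
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("See it. Report it. Sorted."))
print(split_sentences("See eg. Boston mr. Garcia"))
```

With this lookup, single-word sentences like "Sorted." are kept, while "eg." and "mr." no longer end a sentence. The trade-off is that an abbreviation that really does end a sentence (for example a closing "etc.") would be missed, so the list alone does not solve the problem.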

I think that simply ignoring. all. single. worded. sentences. isn't. very. good.

@NinoSkopac
Author

This is a really good lib for this: https://github.com/bigwhoop/sentence-breaker

@vanderlee
Owner

vanderlee commented Jul 28, 2017

I think a list of abbreviations could be added without too much effort. I'll look into it, though no promises as to a date.

I know about sentence-breaker, but I don't know about its quality. The test cases don't seem particularly challenging, and at first sight it seems to ignore things like colons, question marks, exclamation marks and imperfect use of punctuation. It'll probably perform better on professional, well-written texts and worse on real-world texts. For instance, they seem to deal with "... word", but not with "...word", "..word" or "....word". Their rule system is easily extendable, but it does require an individual rule for each exceptional case.
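As an illustration of the ellipsis variants mentioned above, a single tolerant pattern can cover all four spellings. This is only a Python sketch under my own assumptions (the `ELLIPSIS_BEFORE_WORD` pattern and the helper name are hypothetical, and neither library is claimed to work this way):

```python
import re

# Two or more periods, optional whitespace, then the start of a word.
ELLIPSIS_BEFORE_WORD = re.compile(r"\.{2,}\s*(?=\w)")


def has_ellipsis_before_word(text: str) -> bool:
    """Return True if the text contains a run of dots leading into a word."""
    return ELLIPSIS_BEFORE_WORD.search(text) is not None


variants = ["... word", "...word", "..word", "....word"]
results = [has_ellipsis_before_word(v) for v in variants]
print(results)
```

A single period followed by a word ("One. word") does not match, so ordinary sentence ends are left for the regular splitting logic.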

As far as I know, there is no scientific corpus or set of tests to compare these types of algorithms.

@vanderlee vanderlee reopened this Jul 28, 2017