"Sentences must be at least two words long, unless a linebreak or end-of-text." #4
Comments
Because it does not know that "it" isn't an abbreviation. Without the ability to actually understand the words (which would be the holy grail of artificial intelligence, so slightly outside scope), the algorithm might as well try to parse It might as well parse As such, PHP-Sentence (and basically all other sentence boundary disambiguation algorithms) makes a best-effort attempt at detecting whether a period is the end of a sentence, part of an abbreviation or just... an ellipsis.
I feel you, but how about loading a list of abbreviations in English (and DE and NL, if you wanna maintain those languages), and then checking whether the single word is an abbreviation? I think that simply ignoring. all. single. worded. sentences. isn't. very. good.
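To make the suggestion above concrete, here is a minimal sketch of an abbreviation-list check. It is written in Python for brevity and is not PHP-Sentence's actual code or API: the `ABBREVIATIONS` set, `is_abbreviation` and `split_sentences` are all hypothetical names, and the list is deliberately tiny. The idea is to split on terminal punctuation first, then merge a chunk into the next one only when its last word is a known abbreviation, so a genuine one-word sentence survives.

```python
import re

# Hypothetical, deliberately tiny list; a real one would be loaded per language (EN, DE, NL, ...).
ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc", "e.g", "i.e", "vs"}


def is_abbreviation(token):
    """Return True if the token, minus its trailing period(s), is a known abbreviation."""
    return token.rstrip(".").lower() in ABBREVIATIONS


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace,
    then merge a chunk into the next one only if it ends in a known abbreviation."""
    text = text.strip()
    if not text:
        return []
    chunks = re.split(r"(?<=[.!?])\s+", text)
    sentences = []
    buffer = ""
    for chunk in chunks:
        buffer = f"{buffer} {chunk}".strip()
        last_word = buffer.split()[-1]
        if last_word.endswith(".") and is_abbreviation(last_word):
            continue  # the period belongs to an abbreviation; keep accumulating
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)  # text ended on an abbreviation
    return sentences


print(split_sentences("It is done. Sorted. Ask Dr. Smith about it."))
```

With this approach "Sorted." is kept as its own sentence, while "Dr." is glued to what follows. The trade-off is that any abbreviation missing from the list causes a false split, so the quality of the result depends entirely on the list.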
This is a really good lib for this: https://github.com/bigwhoop/sentence-breaker
I think a list of abbreviations could be added without too much effort. I'll look into it, though no promises as to a date. I know about sentence-breaker, but I don't know about its quality. The test cases don't seem particularly challenging, and at first sight it seems to ignore things like colons, question marks, exclamation marks and imperfect use of punctuation. It'll probably perform better on professional, well-written texts and worse on real-world texts. For instance, they seem to deal with "... word", but not with "...word", "..word" or "....word". Their rule system is easily extendable, but does require individual rules for each exceptional case. As far as I know, there is no scientific corpus or set of tests to compare these types of algorithms.
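As an illustration of the ellipsis variants listed above, the following Python sketch shows one possible pre-processing pass that maps all four spellings onto the single form "... word". It is an assumption for illustration only, not something either library is known to do; `ELLIPSIS` and `normalize_ellipsis` are hypothetical names. It also treats "...." as a plain ellipsis, which would be wrong where the fourth period ends a sentence.

```python
import re

# Two or more consecutive periods, plus any whitespace that follows them.
ELLIPSIS = re.compile(r"\.{2,}\s*")


def normalize_ellipsis(text):
    """Rewrite every run of two or more periods as '... ' (three periods and one space)."""
    return ELLIPSIS.sub("... ", text)


for case in ["... word", "...word", "..word", "....word"]:
    print(repr(case), "->", repr(normalize_ellipsis(case)))
```

A single rule then covers all four inputs, at the cost of discarding the distinction between sloppy and intentional punctuation.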
Hey, why this logic?
What if someone writes:
Will "Sorted." get ignored?