encoding issue: not a blocker but something to be aware of for fellow users #24

Open
victusfate opened this issue Jan 19, 2023 · 0 comments

victusfate commented Jan 19, 2023

The sentences returned by the splitter are transformed versions of the original text, so finding the start/end of each sentence in the original requires comparing against a similarly transformed copy of the text.

For example, for the text

Super Tortas is Sun Valley is outstanding! I haven’t had as many Tortas I would like.

I failed to find a start offset for the second sentence, since it comes back as

I haven't had as many Tortas I would like.

which isn't found in the original: the curly apostrophe (’) in "haven’t" has been replaced with a straight one (').

Code (not working: $iStart is false for the second sentence):
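The mismatch can be reproduced without the library. The sketch below is only an illustration: the `str_replace` call is a stand-in I am assuming for this one character, not the library's actual cleanup logic, and `$sReturned` is the sentence copied from the output above.

```php
<?php
// Original text, with U+2019 (right single quotation mark) in "haven’t".
$sText = "Super Tortas is Sun Valley is outstanding! I haven\u{2019}t had as many Tortas I would like.";
// Second sentence as returned by the splitter, with an ASCII apostrophe (U+0027).
$sReturned = "I haven't had as many Tortas I would like.";

// Searching the untouched original fails because the apostrophes differ.
$bFound = mb_strpos($sText, $sReturned, 0, 'UTF-8');

// Assumed stand-in for the cleanup step: replace the curly apostrophe one-for-one.
$sNormalized = str_replace("\u{2019}", "'", $sText);
// Searching the normalized copy succeeds.
$iStart = mb_strpos($sNormalized, $sReturned, 0, 'UTF-8');

var_dump($bFound, $iStart);
```

Because the stand-in replaces one character with one character, the offset found in the normalized copy is also valid in the original.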

    $oSentenceSplitter = new Sentence;
    $aRawSentences = $oSentenceSplitter->split($sText, Sentence::SPLIT_TRIM);
    $iOffset = 0;
    foreach ($aRawSentences as $sRawSentence) {
        // false for the second sentence: the returned text differs from $sText
        $iStart = mb_strpos($sText, $sRawSentence, $iOffset);
        $iLength = mb_strlen($sRawSentence);
        $iOffset += $iLength;
    }

Corrected code:

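    $sCleanedText = Multibyte::cleanUnicode($sText);
    $oSentenceSplitter = new Sentence;
    $aRawSentences = $oSentenceSplitter->split($sCleanedText, Sentence::SPLIT_TRIM);
    $iOffset = 0;
    foreach ($aRawSentences as $sRawSentence) {
        // Locate the sentence in the cleaned text, which matches what split() returns.
        $iStart = mb_strpos($sCleanedText, $sRawSentence, $iOffset);
        if ($iStart === false) {
            continue; // not found even after cleaning
        }
        $iLength = mb_strlen($sRawSentence);
        // Cut the same span out of the original text.
        $sSentenceOut = mb_substr($sText, $iStart, $iLength);
        // Resume the search right after this sentence.
        $iOffset = $iStart + $iLength;
    }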

Note that the above only works as long as the transformations preserve offsets, i.e. the mb_strlen of each transformed sentence is the same as the mb_strlen of the corresponding sentence in the original.
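One way to guard against that caveat is to compare lengths before trusting the offsets. This is a minimal sketch, not part of the library: `mapSentenceSpans` is a hypothetical helper, and the `str_replace` call again stands in for whatever cleanup is applied. Comparing total lengths is only a coarse check, since two changes could in principle cancel each other out.

```php
<?php
/**
 * Hypothetical helper: locates each sentence in the cleaned text and cuts
 * the same span out of the original. Returns null when the two texts differ
 * in length or a sentence cannot be found, since offsets are then unreliable.
 */
function mapSentenceSpans(string $sText, string $sCleanedText, array $aSentences): ?array
{
    if (mb_strlen($sText, 'UTF-8') !== mb_strlen($sCleanedText, 'UTF-8')) {
        return null;
    }
    $aSpans = [];
    $iOffset = 0;
    foreach ($aSentences as $sSentence) {
        $iStart = mb_strpos($sCleanedText, $sSentence, $iOffset, 'UTF-8');
        if ($iStart === false) {
            return null;
        }
        $iLength = mb_strlen($sSentence, 'UTF-8');
        // Each span holds: start offset, length, and the original-text slice.
        $aSpans[] = [$iStart, $iLength, mb_substr($sText, $iStart, $iLength, 'UTF-8')];
        $iOffset = $iStart + $iLength;
    }
    return $aSpans;
}

$sText = "Super Tortas is Sun Valley is outstanding! I haven\u{2019}t had as many Tortas I would like.";
$sCleanedText = str_replace("\u{2019}", "'", $sText); // assumed stand-in for the cleanup
$aSentences = [
    "Super Tortas is Sun Valley is outstanding!",
    "I haven't had as many Tortas I would like.",
];
$aSpans = mapSentenceSpans($sText, $sCleanedText, $aSentences);

// A cleanup that changes the length (ellipsis character to three dots) is rejected.
$aRejected = mapSentenceSpans("a\u{2026}b", "a...b", ["a...b"]);
```

Returning null instead of guessing keeps a length-changing transformation from silently producing spans that point at the wrong characters.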

victusfate changed the title from "not a blocker but something to be aware of for fellow users" to "encoding issue: not a blocker but something to be aware of for fellow users" on Jan 19, 2023