Standard-compliant wc implementation #97
I've merged some intermediate patches by @ghazariann, but some parts have to be reimplemented. Like this:

```python
if args.max_line_length:
    max_line_length = max(len(line) for line in str(mapped_bytes).split("\n"))
    counts["max_line_length"] = max_line_length
```

It is expensive to convert the whole buffer to a Python string like this.
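One way to avoid the full `str` conversion is to scan the raw bytes for newline offsets. This is a minimal sketch in plain Python (not the StringZilla API), and it deliberately ignores the tab-stop and display-width rules discussed below:

```python
def max_line_length(data: bytes) -> int:
    # Track the longest line by scanning newline offsets,
    # avoiding a full bytes -> str conversion of the buffer.
    longest, start = 0, 0
    while True:
        nl = data.find(b"\n", start)
        if nl == -1:
            # The trailing partial line (if any) still counts
            # toward the maximum line length.
            return max(longest, len(data) - start)
        longest = max(longest, nl - start)
        start = nl + 1
```

A real implementation would replace `bytes.find` with the library's accelerated search, but the loop structure stays the same.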
We're missing tests and don't handle locale. Some thoughts on tests:

- Stdin Redirection: note that `--files0-from` needs to pull a NUL-delimited list of filenames.
- Word Count: we only count spaces, so add tests for adjacent and other whitespace.
- Line Count: if a file ends in a non-newline character, its trailing partial line is not counted.
- Max Line Length: tabs are set at every 8th column. Display widths of wide characters are considered. Non-printable characters are given 0 width.
- Locale:
  - `-m --chars` prints only the character counts, as per the current locale (UTF-8 and UTF-16 support needed). Encoding errors are not counted. See `locale.getencoding` / setencoding.
  - `-w --words` uses locale-specific whitespace.

References: https://www.gnu.org/software/coreutils/manual/html_node/wc-invocation.html#wc-invocation
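A few of the points above can be pinned down as tiny reference functions to test against. This is a sketch assuming byte input and the C locale; the names `wc_lines` and `wc_words` are hypothetical, not the project's API:

```python
def wc_lines(data: bytes) -> int:
    # GNU wc counts newline characters, so a trailing
    # partial line (no final "\n") is not counted.
    return data.count(b"\n")

def wc_words(data: bytes) -> int:
    # bytes.split() with no argument splits on runs of any
    # ASCII whitespace (space, tab, newline, etc.), matching
    # wc's word definition in the C locale.
    return len(data.split())
```

Tests comparing the optimized implementation against these oracles would cover the adjacent-whitespace and trailing-partial-line cases directly.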
Can we detect those locale-based settings in the Python implementation of wc, without changing the core C implementation and the Python binding?
For counting characters we can use `locale.getencoding()` in Python; then a naive approach would be `len(bytes.decode('utf-8'))`, which would not be performant. Ultimately we'd want to scan for Unicode lead bytes (`& 0x80`) and consume them, since a character can span 2-4 bytes. If the library does not have a way to find bytes with the top bit set (`& 0x80`), we'd have to add it. For counting words, I believe we want a `find_charset` function that we can use with the whitespace character set.
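As an illustration of the `& 0x80` idea: in UTF-8, every continuation byte has the form `10xxxxxx`, so counting the bytes that do *not* match that pattern yields the code-point count without decoding. A naive per-byte sketch (a fast path would vectorize this scan):

```python
def count_utf8_chars(data: bytes) -> int:
    # Continuation bytes satisfy (b & 0xC0) == 0x80.
    # Every other byte starts a new character, so counting
    # non-continuation bytes counts code points.
    return sum((b & 0xC0) != 0x80 for b in data)
```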
For the first part, we can temporarily compensate by performing several passes over the data, one for each multi-byte rune width.
UTF-8 looks like this: you can count the leading set bits to get the character size once you see the leftmost bit set. Text in languages like Chinese will be almost entirely multi-byte characters. I speak Chinese, so I optimized this in mrjson. I'll set up tests next.
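Counting the leading set bits of the first byte gives the sequence length directly. A small sketch of that rule (the helper name is hypothetical):

```python
def utf8_seq_len(lead: int) -> int:
    # The run of leading 1 bits in the first byte encodes the
    # sequence length: 0xxxxxxx = 1 byte, 110xxxxx = 2 bytes,
    # 1110xxxx = 3 bytes, 11110xxx = 4 bytes.
    if lead < 0x80:
        return 1
    if lead >= 0xF0:
        return 4
    if lead >= 0xE0:
        return 3
    return 2
```

Common Chinese characters fall in the Basic Multilingual Plane and encode as 3-byte sequences, so a scanner using this rule skips ahead three bytes per character in such text.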
The 4c738ea commit introduces a prototype for a StringZilla-based command-line toolkit, including the `wc` utility replacement. The original prototype suggests a 3x performance-improvement opportunity, but it can't currently handle multiple inputs and flags. Those should be easy to add in `cli/wc.py`, purely in the Python layer. We are aiming to match the GNU coreutils wc specification referenced above.