Standard-compliant ws and split implementation (Issue 97 98) #99

Merged: 2 commits, Feb 21, 2024


ghazariann
Contributor

I used argparse to handle the arguments and flags, mirroring those of the system's split and wc commands to keep the user experience consistent. The wc functionality is replicated to match the system (GNU) command. To keep usage simple, the split implementation omits some flags (such as -a, -b, -C, -d) and focuses on the essentials: -t (separator), -n (chunk size), -l (line size), and standard input handling. Are there any suggestions for further improvements?
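For readers following along, here is a minimal sketch of what an argparse front end for such a split command could look like. It is not the code from this PR: the function names `build_parser` and `open_input`, the long option names (borrowed from GNU split), the defaults, and the choice to make -l and -n mutually exclusive are all assumptions made for illustration.

```python
import argparse
import sys


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a split-like CLI (hypothetical layout)."""
    parser = argparse.ArgumentParser(
        prog="split", description="Split a file into pieces."
    )
    # Positional arguments: input file ('-' means stdin) and output prefix.
    parser.add_argument("file", nargs="?", default="-",
                        help="input file, or '-' for standard input")
    parser.add_argument("prefix", nargs="?", default="x",
                        help="prefix for the output file names")
    # Assumption: the two sizing modes cannot be combined.
    mode = parser.add_mutually_exclusive_group()
    mode.add_argument("-l", "--lines", type=int,
                      help="number of lines per output file")
    mode.add_argument("-n", "--number", type=int,
                      help="chunk setting, as described in the PR")
    parser.add_argument("-t", "--separator", default="\n",
                        help="record separator used instead of newline")
    return parser


def open_input(path: str):
    """Return a binary stream for the given path, or stdin for '-'."""
    return sys.stdin.buffer if path == "-" else open(path, "rb")


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

Running it with `-l 5 data.txt` would give `lines=5`, `file='data.txt'`, and leave `number` unset; with no arguments at all, `file` falls back to `'-'` so the tool reads standard input.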

@ashvardanian
Owner

Thanks for the patch! Looks good at first glance; I will look deeper in a few hours. Can your wc variant handle directories and log stats for the many files in them?

@ashvardanian
Owner

Also, as the functionality is maturing, it would be great to add tests. Any chance you can start the scripts/test_cli_wc.py and scripts/test_cli_split.py, @ghazariann? Thanks again!

@ashvardanian ashvardanian merged commit c878caf into ashvardanian:main-dev Feb 21, 2024
27 checks passed
@ghazariann
Contributor Author

Sounds good, @ashvardanian. Currently I handle multiple input files with a simple for loop. Do we need a parallel approach for directories? (There might be a lot of files inside.) Maybe threading?
Could you also clarify what kind of log stats we are talking about?
Agreed that we need tests. Will work on it!
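To make the threading question concrete, here is a rough sketch of counting every file under a directory with a thread pool. It uses only the Python standard library, not StringZilla, and the names `count_file`, `count_directory`, and `total` are hypothetical. Whether threads actually help depends on whether the work is I/O-bound or releases the GIL, so treat this as a starting point for measurement rather than a recommendation.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def count_file(path: Path) -> tuple[int, int, int]:
    """Return (lines, words, bytes) for one file, wc-style."""
    data = path.read_bytes()
    # Lines are newline bytes; words are runs separated by ASCII whitespace.
    return data.count(b"\n"), len(data.split()), len(data)


def count_directory(root: Path, max_workers: int = 8) -> dict[Path, tuple[int, int, int]]:
    """Count all regular files under root, recursively, using a thread pool."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with files.
        results = list(pool.map(count_file, files))
    return dict(zip(files, results))


def total(counts: dict[Path, tuple[int, int, int]]) -> tuple[int, int, int]:
    """Sum per-file counts into a single (lines, words, bytes) triple."""
    lines = sum(c[0] for c in counts.values())
    words = sum(c[1] for c in counts.values())
    size = sum(c[2] for c in counts.values())
    return lines, words, size
```

Replacing `ThreadPoolExecutor` with a plain loop over `files` gives the sequential baseline to compare against.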

@ashvardanian
Owner

ashvardanian commented Feb 21, 2024

@ghazariann, I was also thinking about parallelism, but I'm not sure how to implement it. Let's start by patching what I've described in #97, plus the tests.

The best test on Linux would be to compare the output of the default CLI tools against the StringZilla variants. I suspect we may have to use the locale metadata to determine what is considered whitespace or a newline in each region to get the same results.
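As an illustration of that comparison idea, the sketch below runs the system `wc` on a file and parses its counts so they can be checked against another implementation. The helper names are hypothetical, `reference_counts` is a plain-Python stand-in rather than the StringZilla CLI, and pinning `LC_ALL=C` is an assumption made to sidestep the locale question, not a resolution of it.

```python
import os
import shutil
import subprocess
from typing import Optional


def reference_counts(data: bytes) -> tuple[int, int, int]:
    """Stand-in counter: (lines, words, bytes) using ASCII whitespace rules."""
    return data.count(b"\n"), len(data.split()), len(data)


def parse_wc_output(text: str) -> tuple[int, int, int]:
    """Parse 'lines words bytes [name]' as printed by wc -l -w -c."""
    lines, words, size = text.split()[:3]
    return int(lines), int(words), int(size)


def system_wc(path: str) -> Optional[tuple[int, int, int]]:
    """Run the system wc on path; return None if wc is not installed."""
    if shutil.which("wc") is None:
        return None
    # Assumption: force the C locale so whitespace rules are predictable.
    env = dict(os.environ, LC_ALL="C")
    result = subprocess.run(
        ["wc", "-l", "-w", "-c", path],
        capture_output=True, text=True, check=True, env=env,
    )
    return parse_wc_output(result.stdout)
```

A test in `scripts/test_cli_wc.py` could then assert that `system_wc(path)` equals the counts reported by the tool under test for each sample file, skipping when `system_wc` returns `None`.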

@ashvardanian
Owner

🎉 This PR is included in version 3.3.0 🎉

The release is available on GitHub release

Your semantic-release bot 📦🚀
