Commit
New experimental Python interfaces simplify big-data processing from the command line. The new Python CLI prototypes include:
- over 3x faster `wc` word- and line-counting utility
- over 4x faster `split` dataset-sharding utility
1 parent 92e4bc6, commit 4c738ea
Showing 4 changed files with 271 additions and 3 deletions.
@@ -0,0 +1,48 @@
# SIMD-accelerated CLI utilities based on StringZilla

## `wc`: Word Count

The `wc` utility on Linux can be used to count the number of lines, words, and bytes in a file.
With StringZilla's SIMD-accelerated character and character-set search, it can be noticeably faster, even with slow SSDs.
The prototype counts space characters rather than whitespace-delimited words, so its word total differs from the one `wc` reports.

```bash
$ time wc enwik9.txt
13147025 129348346 1000000000 enwik9.txt

real 0m3.562s
user 0m3.470s
sys 0m0.092s

$ time cli/wc.py enwik9.txt
13147025 139132610 1000000000 enwik9.txt

real 0m1.165s
user 0m1.121s
sys 0m0.044s
```

## `split`: Split File into Smaller Ones

The `split` utility on Linux can be used to split a file into smaller ones.
The current prototype only splits by line counts.

```bash
$ time split -l 100000 enwik9.txt ...

real 0m6.424s
user 0m0.179s
sys 0m0.663s

$ time cli/split.py 100000 enwik9.txt ...

real 0m1.482s
user 0m1.020s
sys 0m0.460s
```

---

What other interfaces should be added?

- Levenshtein distances?
- Fuzzy search?
@@ -0,0 +1,72 @@
#!/usr/bin/env python3

import sys

from stringzilla import File, Str


def split_file(file_path, lines_per_file, output_prefix):
    try:
        # 1. Memory-map the large file
        file_mapped = File(file_path)
        file_contents = Str(file_mapped)

        # Track the start of the current part, the part number,
        # and the position of the last newline found
        current_position = 0
        file_part = 0
        newline_position = -1  # Start before the file begins to find the first newline correctly

        # Loop until the end of the file
        while current_position < len(file_contents):
            # 2. Skip `lines_per_file` lines
            for _ in range(lines_per_file):
                newline_position = file_contents.find("\n", newline_position + 1)
                if newline_position == -1:  # No more newlines
                    break

            # 3. The part ends just past the last newline found, or at the end
            # of the file when the final line has no trailing newline
            if newline_position == -1:
                section_end = len(file_contents)
            else:
                section_end = newline_position + 1

            # 4. Save the current slice to file
            current_slice = file_contents[current_position:section_end]
            output_path = f"{output_prefix}{file_part}"
            current_slice.write_to(output_path)

            # Prepare for the next slice
            file_part += 1
            current_position = section_end

    except FileNotFoundError:
        print(f"No such file: {file_path}")
    except Exception as e:
        print(f"An error occurred: {e}")


def main():
    if len(sys.argv) < 4:
        print("Usage: split.py <lines_per_file> <input_file> <output_prefix>")
        sys.exit(1)

    lines_per_file = int(sys.argv[1])
    file_path = sys.argv[2]
    output_prefix = sys.argv[3]

    split_file(file_path, lines_per_file, output_prefix)


if __name__ == "__main__":
    main()
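One way to sanity-check the splitter is to confirm that the parts, read back in order, reproduce the input byte for byte, and that every part except possibly the last holds exactly `lines_per_file` lines. The sketch below does this with the standard library only. `verify_split` is a hypothetical helper, not part of this commit, and it assumes the parts are named `<output_prefix>0`, `<output_prefix>1`, and so on, as in `split_file` above.

```python
import os


def count_lines(chunk):
    """Count lines in a bytes chunk; a final line without a trailing newline still counts."""
    return chunk.count(b"\n") + (1 if chunk and not chunk.endswith(b"\n") else 0)


def verify_split(file_path, lines_per_file, output_prefix):
    """Return True if the numbered parts concatenate back to the original file."""
    with open(file_path, "rb") as f:
        original = f.read()  # fine for a test fixture; not meant for huge inputs

    rebuilt = bytearray()
    line_counts = []
    part = 0
    # Assumption: parts are named <output_prefix>0, <output_prefix>1, ...
    while os.path.exists(f"{output_prefix}{part}"):
        with open(f"{output_prefix}{part}", "rb") as f:
            chunk = f.read()
        line_counts.append(count_lines(chunk))
        rebuilt += chunk
        part += 1

    # Every part but the last must hold exactly `lines_per_file` lines;
    # the last one must be non-empty and hold at most that many.
    full_parts_ok = all(n == lines_per_file for n in line_counts[:-1])
    last_part_ok = not line_counts or 0 < line_counts[-1] <= lines_per_file
    return bytes(rebuilt) == original and full_parts_ok and last_part_ok
```

Running it on a small fixture that includes blank lines and a final line without a trailing newline covers the edge cases most likely to trip up a newline-scanning loop.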
@@ -0,0 +1,37 @@
#!/usr/bin/env python3

import sys

from stringzilla import File, Str


def wc(file_path):
    try:
        # Memory-map the file and view it as a StringZilla string
        mapped_file = File(file_path)
        mapped_bytes = Str(mapped_file)
        line_count = mapped_bytes.count("\n")
        # Approximation: counts space characters rather than
        # whitespace-delimited words, so the total can differ from `wc`
        word_count = mapped_bytes.count(" ")
        char_count = len(mapped_bytes)

        return line_count, word_count, char_count
    except FileNotFoundError:
        return f"No such file: {file_path}"


def main():
    if len(sys.argv) < 2:
        print("Usage: python wc.py <file>")
        sys.exit(1)

    file_path = sys.argv[1]
    counts = wc(file_path)

    if isinstance(counts, tuple):
        line_count, word_count, char_count = counts
        print(f"{line_count} {word_count} {char_count} {file_path}")
    else:
        print(counts)


if __name__ == "__main__":
    main()
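The `word_count` above tallies space characters, which is fast but not the same as the whitespace-delimited word count that `wc` reports. For comparison, here is a minimal standard-library sketch (no StringZilla involved) that streams a file in chunks and counts lines, words, and bytes using whitespace-delimited words. The name `reference_wc` and the `chunk_size` parameter are hypothetical and not part of this commit.

```python
def reference_wc(file_path, chunk_size=1 << 20):
    """Return (lines, words, bytes), counting whitespace-delimited words."""
    line_count = word_count = byte_count = 0
    in_word = False  # True if the previous chunk ended in the middle of a word

    with open(file_path, "rb") as f:
        while chunk := f.read(chunk_size):
            byte_count += len(chunk)
            line_count += chunk.count(b"\n")
            # bytes.split() with no arguments splits on runs of ASCII whitespace
            word_count += len(chunk.split())
            # A word that straddles the chunk boundary was already counted
            # with the previous chunk, so do not count it twice
            if in_word and not chunk[:1].isspace():
                word_count -= 1
            in_word = not chunk[-1:].isspace()

    return line_count, word_count, byte_count
```

For a file containing `hello  world\nfoo bar\n`, this returns 4 words, whereas counting spaces yields 3. The gap goes the other way on inputs with long runs of spaces, such as indented markup.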