Merge pull request #3 from dapper91/dev
- documentation fixed.
- examples added.
dapper91 authored Jan 9, 2022
2 parents 8324121 + c5243d4 commit 41960de
Showing 10 changed files with 392 additions and 67 deletions.
12 changes: 10 additions & 2 deletions Cargo.toml
@@ -1,6 +1,6 @@
[package]
name = "ext-sort"
-version = "0.1.0"
+version = "0.1.1"
edition = "2021"
license = "Unlicense"
description = "rust external sort algorithm implementation"
@@ -17,7 +17,7 @@ keywords = ["algorithms", "sort", "sorting", "external-sort", "external"]
bytesize = { version = "^1.1", optional = true }
deepsize = { version = "^0.2", optional = true }
env_logger = { version = "^0.9", optional = true}
-log = "0.4"
+log = "^0.4"
rayon = "^1.5"
rmp-serde = "^0.15"
serde = { version = "^1.0", features = ["derive"] }
@@ -33,3 +33,11 @@ memory-limit = ["deepsize"]
[[example]]
name = "quickstart"
required-features = ["bytesize", "env_logger"]

+[[example]]
+name = "custom_serializer"
+required-features = ["env_logger"]
+
+[[example]]
+name = "custom_type"
+required-features = ["env_logger"]
107 changes: 66 additions & 41 deletions README.md
@@ -1,57 +1,82 @@
[![Crates.io][crates-badge]][crates-url]
[![License][licence-badge]][licence-url]
[![Test Status][test-badge]][test-url]
[![Documentation][doc-badge]][doc-url]

[crates-badge]: https://img.shields.io/crates/v/ext-sort.svg
[crates-url]: https://crates.io/crates/ext-sort
[licence-badge]: https://img.shields.io/badge/license-Unlicense-blue.svg
[licence-url]: https://github.com/dapper91/ext-sort-rs/blob/master/LICENSE
[test-badge]: https://github.com/dapper91/ext-sort-rs/actions/workflows/test.yml/badge.svg?branch=master
[test-url]: https://github.com/dapper91/ext-sort-rs/actions/workflows/test.yml
[doc-badge]: https://docs.rs/ext-sort/badge.svg
[doc-url]: https://docs.rs/ext-sort


# Rust external sort

`ext-sort` is a rust external sort algorithm implementation.

External sort algorithm implementation. External sorting is a class of sorting algorithms
that can handle massive amounts of data. External sorting is required when the data being
sorted do not fit into the main memory (RAM) of a computer and instead must be resided in
slower external memory, usually a hard disk drive. Sorting is achieved in two passes.
During the first pass it sorts chunks of data that each fit in RAM, during the second pass
it merges the sorted chunks together.
For more information see https://en.wikipedia.org/wiki/External_sorting.
External sorting is a class of sorting algorithms that can handle massive amounts of data. External sorting
is required when the data being sorted do not fit into the main memory (RAM) of a computer and instead must be
resided in slower external memory, usually a hard disk drive. Sorting is achieved in two passes. During the
first pass it sorts chunks of data that each fit in RAM, during the second pass it merges the sorted chunks together.
For more information see [External Sorting](https://en.wikipedia.org/wiki/External_sorting).
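
The two passes described above can be sketched with the standard library alone. This is a minimal illustration, not how `ext-sort` is implemented: the function names `sort_chunks` and `merge_chunks` are hypothetical, and the sorted chunks are kept in memory where a real external sort would write each one to a temporary file.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Pass 1: split the input into chunks of at most `chunk_size` items and sort each one.
/// A real external sort would spill every sorted chunk to disk at this point.
fn sort_chunks<T: Ord>(input: impl IntoIterator<Item = T>, chunk_size: usize) -> Vec<Vec<T>> {
    assert!(chunk_size > 0, "chunk_size must be positive");
    let mut chunks = Vec::new();
    let mut current = Vec::with_capacity(chunk_size);
    for item in input {
        current.push(item);
        if current.len() == chunk_size {
            current.sort();
            // Hand the full chunk over and start a fresh one.
            chunks.push(std::mem::replace(&mut current, Vec::with_capacity(chunk_size)));
        }
    }
    if !current.is_empty() {
        current.sort();
        chunks.push(current);
    }
    chunks
}

/// Pass 2: k-way merge of the sorted chunks.
/// The heap holds the current head of every chunk; `Reverse` turns the max-heap into a min-heap.
fn merge_chunks<T: Ord>(chunks: Vec<Vec<T>>) -> Vec<T> {
    let mut iters: Vec<_> = chunks.into_iter().map(|chunk| chunk.into_iter()).collect();
    let mut heap = BinaryHeap::new();
    for (idx, iter) in iters.iter_mut().enumerate() {
        if let Some(item) = iter.next() {
            heap.push(Reverse((item, idx)));
        }
    }
    let mut merged = Vec::new();
    while let Some(Reverse((item, idx))) = heap.pop() {
        merged.push(item);
        // Refill the heap from the chunk that just produced the smallest item.
        if let Some(next) = iters[idx].next() {
            heap.push(Reverse((next, idx)));
        }
    }
    merged
}
```

Calling `merge_chunks(sort_chunks(data, n))` yields the fully sorted sequence while only ever sorting `n` items at a time; the merge needs just one item per chunk in memory, which is what makes the approach workable for data larger than RAM.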

## Overview

## Features
`ext-sort` supports the following features:

* **Data agnostic:**
`ext-sort` support all data types that that implement `serde` serialization/deserialization.
it supports all data types that implement `serde` serialization/deserialization by default,
otherwise you can implement your own serialization/deserialization mechanism.
* **Serialization format agnostic:**
`ext-sort` use `MessagePack` serialization format by default, but it can be easily substituted by your custom one
if `MessagePack` serialization/deserialization performance is not sufficient for your task.
the library uses `MessagePack` serialization format by default, but it can be easily substituted by your custom one
if `MessagePack` serialization/deserialization performance is not sufficient for your task.
* **Multithreading support:**
`ext-sort` support multithreading, which means data is sorted in multiple threads utilizing maximum CPU resources
multi-threaded sorting is supported, which means data is sorted in multiple threads utilizing maximum CPU resources
and reducing sorting time.
* **Memory limit support:**
memory limited sorting is supported. It allows you to limit sorting memory consumption
(`memory-limit` feature required).

# Basic example

Activate `memory-limit` feature of the ext-sort crate on Cargo.toml:

```toml
[dependencies]
ext-sort = { version = "^0.1.1", features = ["memory-limit"] }
```

``` rust
use std::fs;
use std::io::{self, prelude::*};
use std::path;

use bytesize::MB;
use env_logger;
use log;

use ext_sort::buffer::mem::MemoryLimitedBufferBuilder;
use ext_sort::{ExternalSorter, ExternalSorterBuilder};

fn main() {
env_logger::Builder::new().filter_level(log::LevelFilter::Debug).init();

let input_reader = io::BufReader::new(fs::File::open("input.txt").unwrap());
let mut output_writer = io::BufWriter::new(fs::File::create("output.txt").unwrap());

let sorter: ExternalSorter<String, io::Error, MemoryLimitedBufferBuilder> = ExternalSorterBuilder::new()
.with_tmp_dir(path::Path::new("tmp"))
.with_buffer(MemoryLimitedBufferBuilder::new(50 * MB))
.build()
.unwrap();

let sorted = sorter.sort(input_reader.lines()).unwrap();

for item in sorted.map(Result::unwrap) {
output_writer.write_all(format!("{}\n", item).as_bytes()).unwrap();
}
output_writer.flush().unwrap();
use std::fs;
use std::io::{self, prelude::*};
use std::path;

use bytesize::MB;
use env_logger;
use log;

use ext_sort::{buffer::mem::MemoryLimitedBufferBuilder, ExternalSorter, ExternalSorterBuilder};

fn main() {
env_logger::Builder::new().filter_level(log::LevelFilter::Debug).init();

let input_reader = io::BufReader::new(fs::File::open("input.txt").unwrap());
let mut output_writer = io::BufWriter::new(fs::File::create("output.txt").unwrap());

let sorter: ExternalSorter<String, io::Error, MemoryLimitedBufferBuilder> = ExternalSorterBuilder::new()
.with_tmp_dir(path::Path::new("./"))
.with_buffer(MemoryLimitedBufferBuilder::new(50 * MB))
.build()
.unwrap();

let sorted = sorter.sort(input_reader.lines()).unwrap();

for item in sorted.map(Result::unwrap) {
output_writer.write_all(format!("{}\n", item).as_bytes()).unwrap();
}
output_writer.flush().unwrap();
}
```
77 changes: 77 additions & 0 deletions examples/custom_serializer.rs
@@ -0,0 +1,77 @@
use std::fs;
use std::fs::File;
use std::io::{self, prelude::*, BufReader, BufWriter, Take};
use std::path;

use env_logger;
use log;

use ext_sort::{ExternalChunk, ExternalSorter, ExternalSorterBuilder, LimitedBufferBuilder};

struct CustomExternalChunk {
reader: io::Take<io::BufReader<fs::File>>,
}

impl ExternalChunk<u32> for CustomExternalChunk {
type SerializationError = io::Error;
type DeserializationError = io::Error;

fn new(reader: Take<BufReader<File>>) -> Self {
CustomExternalChunk { reader }
}

fn dump(
chunk_writer: &mut BufWriter<File>,
items: impl IntoIterator<Item = u32>,
) -> Result<(), Self::SerializationError> {
for item in items {
chunk_writer.write_all(&item.to_le_bytes())?;
}

return Ok(());
}
}

impl Iterator for CustomExternalChunk {
type Item = Result<u32, io::Error>;

fn next(&mut self) -> Option<Self::Item> {
if self.reader.limit() == 0 {
None
} else {
let mut buf: [u8; 4] = [0; 4];
match self.reader.read_exact(&mut buf.as_mut_slice()) {
Ok(_) => Some(Ok(u32::from_le_bytes(buf))),
Err(err) => Some(Err(err)),
}
}
}
}

fn main() {
env_logger::Builder::new().filter_level(log::LevelFilter::Debug).init();

let input_reader = io::BufReader::new(fs::File::open("input.txt").unwrap());
let mut output_writer = io::BufWriter::new(fs::File::create("output.txt").unwrap());

let sorter: ExternalSorter<u32, io::Error, LimitedBufferBuilder, CustomExternalChunk> =
ExternalSorterBuilder::new()
.with_tmp_dir(path::Path::new("./"))
.with_buffer(LimitedBufferBuilder::new(1_000_000, true))
.build()
.unwrap();

let sorted = sorter
.sort(input_reader.lines().map(|line| {
let line = line.unwrap();
let number = line.parse().unwrap();

return Ok(number);
}))
.unwrap();

for item in sorted.map(Result::unwrap) {
output_writer.write_all(format!("{}\n", item).as_bytes()).unwrap();
}
output_writer.flush().unwrap();
}
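
The chunk format used by `CustomExternalChunk` above is a flat sequence of 4-byte little-endian integers, read back until the `Take` limit reaches zero. The round trip can be tried in isolation with the sketch below; `encode` and `decode` are hypothetical helper names that are not part of `ext-sort`, and an in-memory byte buffer stands in for the chunk file.

```rust
use std::io::{self, Read, Write};

/// Write each number as 4 little-endian bytes, the same fixed-width layout `dump` uses.
fn encode(items: &[u32], writer: &mut impl Write) -> io::Result<()> {
    for item in items {
        writer.write_all(&item.to_le_bytes())?;
    }
    Ok(())
}

/// Read numbers back until the `Take` limit is exhausted,
/// mirroring the `self.reader.limit() == 0` check in `Iterator::next`.
fn decode(reader: impl Read, byte_len: u64) -> io::Result<Vec<u32>> {
    let mut reader = reader.take(byte_len);
    let mut items = Vec::new();
    while reader.limit() > 0 {
        let mut buf = [0u8; 4];
        // Fails with `UnexpectedEof` if fewer than 4 bytes remain.
        reader.read_exact(&mut buf)?;
        items.push(u32::from_le_bytes(buf));
    }
    Ok(items)
}
```

Because every item has the same width, no length prefix or delimiter is needed; the trade-off is that a byte length that is not a multiple of 4 surfaces as a read error rather than a short final item.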
100 changes: 100 additions & 0 deletions examples/custom_type.rs
@@ -0,0 +1,100 @@
use std::cmp::Ordering;
use std::error::Error;
use std::fmt::{Display, Formatter};
use std::fs;
use std::io::{self, prelude::*};
use std::path;

use env_logger;
use log;
use serde;

use ext_sort::{ExternalSorter, ExternalSorterBuilder, LimitedBufferBuilder};

#[derive(Debug)]
enum CsvParseError {
RowError(String),
ColumnError(String),
}

impl Display for CsvParseError {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
match self {
CsvParseError::ColumnError(err) => write!(f, "column format error: {}", err),
CsvParseError::RowError(err) => write!(f, "row format error: {}", err),
}
}
}

impl Error for CsvParseError {}

#[derive(PartialEq, Eq, serde::Serialize, serde::Deserialize)]
struct Person {
name: String,
surname: String,
age: u8,
}

impl Person {
fn as_csv(&self) -> String {
format!("{},{},{}", self.name, self.surname, self.age)
}

fn from_str(s: &str) -> Result<Self, CsvParseError> {
let parts: Vec<&str> = s.split(',').collect();
if parts.len() != 3 {
Err(CsvParseError::RowError("wrong columns number".to_string()))
} else {
Ok(Person {
name: parts[0].to_string(),
surname: parts[1].to_string(),
age: parts[2]
.parse()
.map_err(|err| CsvParseError::ColumnError(format!("age field format error: {}", err)))?,
})
}
}
}

impl PartialOrd for Person {
fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
Some(self.cmp(&other))
}
}

impl Ord for Person {
fn cmp(&self, other: &Self) -> Ordering {
self.surname
.cmp(&other.surname)
.then(self.name.cmp(&other.name))
.then(self.age.cmp(&other.age))
}
}

fn main() {
env_logger::Builder::new().filter_level(log::LevelFilter::Debug).init();

let input_reader = io::BufReader::new(fs::File::open("input.csv").unwrap());
let mut output_writer = io::BufWriter::new(fs::File::create("output.csv").unwrap());

let sorter: ExternalSorter<Person, io::Error, LimitedBufferBuilder> = ExternalSorterBuilder::new()
.with_tmp_dir(path::Path::new("./"))
.with_buffer(LimitedBufferBuilder::new(1_000_000, true))
.build()
.unwrap();

let sorted = sorter
.sort(
input_reader
.lines()
.map(|line| line.map(|line| Person::from_str(&line).unwrap())),
)
.unwrap();

for item in sorted.map(Result::unwrap) {
output_writer
.write_all(format!("{}\n", item.as_csv()).as_bytes())
.unwrap();
}
output_writer.flush().unwrap();
}
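
The `Ord` implementation above decides the output order: surname first, then name, then age. The sketch below reproduces that ordering on a small in-memory list so it can be checked without input files. It is an illustration only: the `person` constructor is a hypothetical helper, the serde derives are omitted, and `then_with` (the lazy counterpart of the `then` calls in the example) is swapped in so later comparisons run only on ties.

```rust
use std::cmp::Ordering;

#[derive(Debug, Clone, PartialEq, Eq)]
struct Person {
    name: String,
    surname: String,
    age: u8,
}

impl Ord for Person {
    fn cmp(&self, other: &Self) -> Ordering {
        // Compare by surname; fall through to name and age only when the earlier keys are equal.
        self.surname
            .cmp(&other.surname)
            .then_with(|| self.name.cmp(&other.name))
            .then_with(|| self.age.cmp(&other.age))
    }
}

impl PartialOrd for Person {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

/// Hypothetical convenience constructor used only for this illustration.
fn person(name: &str, surname: &str, age: u8) -> Person {
    Person {
        name: name.to_string(),
        surname: surname.to_string(),
        age,
    }
}
```

Sorting a `Vec<Person>` with `sort()` then groups records by surname and breaks ties by name and age, which is the order the sorter is expected to produce for the CSV rows.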
5 changes: 2 additions & 3 deletions examples/quickstart.rs
@@ -6,8 +6,7 @@ use bytesize::MB;
use env_logger;
use log;

-use ext_sort::buffer::mem::MemoryLimitedBufferBuilder;
-use ext_sort::{ExternalSorter, ExternalSorterBuilder};
+use ext_sort::{buffer::mem::MemoryLimitedBufferBuilder, ExternalSorter, ExternalSorterBuilder};

fn main() {
env_logger::Builder::new().filter_level(log::LevelFilter::Debug).init();
@@ -16,7 +15,7 @@ fn main() {
let mut output_writer = io::BufWriter::new(fs::File::create("output.txt").unwrap());

let sorter: ExternalSorter<String, io::Error, MemoryLimitedBufferBuilder> = ExternalSorterBuilder::new()
-.with_tmp_dir(path::Path::new("tmp"))
+.with_tmp_dir(path::Path::new("./"))
.with_buffer(MemoryLimitedBufferBuilder::new(50 * MB))
.build()
.unwrap();