Data Munging

At one point I was working with tools that couldn't handle more than 1000 data points, even though I was studying the effects of unusually large data sets in those applications. These are some Perl scripts I wrote to "trim" the data sets. They all operate on tab- or whitespace-separated data files in which the first column is the independent variable.

Each tool carries fuller documentation in POD format; running it with the '-m' switch displays that documentation formatted like a manpage.
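For instance (assuming the scripts are invoked directly through perl; adjust to however they are installed on your system):

    perl avgdups -m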

avgdups

If duplicate values of the independent variable are present, avgdups averages the numeric data in the other columns, leaving one row per distinct value. Given multiple input files, it concatenates them first.
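The core idea can be sketched in a few lines of Perl. This is not the script itself, just an illustration that assumes whitespace-separated columns, the independent variable first, purely numeric remaining columns, and the same number of columns on every row:

    #!/usr/bin/perl
    # Sketch of the avgdups idea: group rows by the first column and average
    # the remaining numeric columns, emitting one row per distinct value.
    # Reading with <> concatenates all files named on the command line.
    use strict;
    use warnings;

    my (%sum, %count, @order);
    while (<>) {
        my ($x, @rest) = split;
        next unless defined $x;
        push @order, $x unless exists $count{$x};   # remember first-seen order
        for my $i (0 .. $#rest) {
            $sum{$x}[$i] = ($sum{$x}[$i] // 0) + $rest[$i];
        }
        $count{$x}++;
    }
    for my $x (@order) {
        print join("\t", $x, map { $_ / $count{$x} } @{ $sum{$x} }), "\n";
    }

Output order here follows first appearance of each value; the real script may order its output differently.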

pickln

pickln will select n lines spaced logarithmically along the full range of the independent variable. It first finds n logarithmically spaced target points, then outputs the nearest data point to each one, which means it can output the same line more than once.
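In rough Perl terms, the selection works something like the sketch below. It makes the same column assumptions as above, takes n as the first command-line argument purely for illustration, and assumes n is at least 2 and the independent variable is positive so the logarithms are defined:

    #!/usr/bin/perl
    # Sketch of the pickln idea: build n logarithmically spaced targets between
    # the extremes of the first column, then print the input line closest to
    # each target. The same line can be printed more than once.
    use strict;
    use warnings;

    my $n = shift or die "usage: $0 n file ...\n";
    my @rows = map { [ (split)[0], $_ ] } <>;         # [value, original line]
    @rows = sort { $a->[0] <=> $b->[0] } @rows;
    my ($lo, $hi) = (log $rows[0][0], log $rows[-1][0]);
    for my $i (0 .. $n - 1) {
        my $t = exp($lo + ($hi - $lo) * $i / ($n - 1));   # target value
        my ($best) = sort { abs($a->[0] - $t) <=> abs($b->[0] - $t) } @rows;
        print $best->[1];
    }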

pickn

pickn will select n lines spaced evenly along the full range of the independent variable. Unlike pickln, it just sorts the file and picks out n lines in an "off-center" fashion, skipping as many lines at the beginning as at the end. It can also select randomly.
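One way to picture the "off-center" spacing (the real script's arithmetic may differ in detail): place one pick at the midpoint of each of n equal-sized blocks of the sorted file, so the number of skipped lines before the first pick matches the number after the last. A minimal Perl sketch under the same assumptions as above, with the random mode omitted:

    #!/usr/bin/perl
    # Sketch of the pickn idea: sort on the first column, then take one line
    # from the middle of each of n equal-sized blocks, so the same number of
    # lines is skipped before the first pick as after the last.
    use strict;
    use warnings;

    my $n = shift or die "usage: $0 n file ...\n";
    my @rows = sort { (split ' ', $a)[0] <=> (split ' ', $b)[0] } <>;
    for my $i (0 .. $n - 1) {
        print $rows[ int(($i + 0.5) * @rows / $n) ];
    }

For example, with 10 sorted lines and n = 2 this picks the 3rd and 8th lines, leaving two untouched lines at each end.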
