How to find duplicates in a column

You can search for duplicate values in a column using cat, csvcols, sort and csvfind. Here's the basic algorithm, which works from the command line or in a Bash script.

  • for each line of your CSV file
    • extract the value in the column
    • sort for unique values
    • for each unique value use csvfind to output matching rows

Here's an example Bash script that looks for duplicates in the second column of dups.csv (columns are counted from 1 rather than 0):

    CSV_FILE="dups.csv"
    CSV_COL_NO="2"

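    # Extract the column, reduce it to unique values, then look up each value.
    # read -r keeps backslashes in a cell from being treated as escapes.
    csvcols -i "$CSV_FILE" -col "$CSV_COL_NO" | sort -u | while read -r CELL; do
        # Skip empty cells
        if [ -n "$CELL" ]; then
            csvfind -i "$CSV_FILE" -trim-spaces -col "$CSV_COL_NO" "$CELL"
        fi
    done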

This writes CSV rows to standard output, with rows that share a value in the column grouped together. Redirect the output to a file to save it as a new CSV file.
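Because `sort -u` passes along every distinct value, rows whose value occurs only once are looked up as well. If only the repeated values are of interest, `sort | uniq -d` is one way to narrow the list. The following is a minimal sketch of that variation, not part of the original recipe: it swaps csvcols and csvfind for the standard tools cut, sed, sort, uniq and awk so it runs anywhere, and it assumes a simple CSV with no quoted fields or embedded commas. The helper names `duplicate_values` and `duplicate_rows` and the sample data are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes a simple CSV (no quoted fields, no embedded commas).

# Hypothetical helper: print each non-empty value that occurs more than
# once in column $2 (counted from 1) of the CSV file $1.
duplicate_values() {
    local csv_file="$1" col_no="$2"
    cut -d ',' -f "$col_no" "$csv_file" | sed '/^$/d' | sort | uniq -d
}

# Hypothetical helper: print the rows whose column $2 holds a duplicated
# value, grouped by that value.
duplicate_rows() {
    local csv_file="$1" col_no="$2" cell
    duplicate_values "$csv_file" "$col_no" | while read -r cell; do
        # Appending "" forces a string comparison rather than a numeric one.
        awk -F ',' -v col="$col_no" -v val="$cell" \
            '($col "") == (val "")' "$csv_file"
    done
}

# Made-up sample data for illustration.
SAMPLE_CSV="$(mktemp)"
printf '%s\n' \
    'id,email' \
    '1,a@example.org' \
    '2,b@example.org' \
    '3,a@example.org' \
    '4,c@example.org' \
    '5,b@example.org' > "$SAMPLE_CSV"

duplicate_rows "$SAMPLE_CSV" 2
rm -f "$SAMPLE_CSV"
```

With the sample data, rows 1 and 3 are printed first, then rows 2 and 5. Row 4 and the header are left out because their values occur only once. For CSV files with quoted fields, a CSV-aware tool such as csvcols is the safer way to extract the column.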