-
Notifications
You must be signed in to change notification settings - Fork 10
/
Copy pathfind-duplicates-in-a-column.html
65 lines (62 loc) · 2.37 KB
/
find-duplicates-in-a-column.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
<!DOCTYPE html>
<html>
<head>
<title>Caltech Library's Digital Library Development Sandbox</title>
<link href='https://fonts.googleapis.com/css?family=Open+Sans' rel='stylesheet' type='text/css'>
<link rel="stylesheet" href="/css/site.css">
</head>
<body>
<header>
<a href="http://library.caltech.edu"><img src="/assets/liblogo.gif" alt="Caltech Library logo"></a>
</header>
<nav>
<ul>
<li><a href="/">Home</a></li>
<li><a href="../">README</a></li>
<li><a href="../LICENSE">LICENSE</a></li>
<li><a href="../INSTALL.html">INSTALL</a></li>
<li><a href="../user-manual.html">User Manual</a></li>
<li><a href="./">Tutorials</a></li>
<li><a href="../search.html">Search Docs</a></li>
<li><a href="../about.html">About</a></li>
<li><a href="https://github.com/caltechlibrary/datatools">GitHub</a></li>
</ul>
</nav>
<section>
<h1 id="how-to-find-duplicates-in-a-column">How to find duplicates in a
column</h1>
<p>Searching for duplicate values in a column can be done using
<em>cat</em>, <em>csvcols</em>, <em>sort</em> and <em>csvfind</em>.
Here’s the basic algorithm from the command line or Bash script.</p>
<ul>
<li>for each line of your CSV file
<ul>
<li>extract the value in the colum</li>
<li>sort for unique values</li>
<li>for each unique value use <em>csvfind</em> to output matching
rows</li>
</ul></li>
</ul>
<p>Here’s an example Bash script looking for duplicates in
<em>dups.csv</em> in column 2, second column (columns are counted from 1
rather than zero)</p>
<pre class="shell"><code> CSV_FILE="dups.csv"
CSV_COL_NO="2"
csvcols -i "$CSV_FILE" -col "$CSV_COL_NO" | sort -u | while read CELL; do
if [ "$CELL" != "" ]; then
csvfind -i "$CSV_FILE" -trim-spaces -col "$CSV_COL_NO" "${CELL}"
fi
done</code></pre>
<p>This would result a new CSV file with duplicates grouped
together.</p>
</section>
<footer>
<span><h1><A href="http://caltech.edu">Caltech</a></h1></span>
<span>© 2023 <a href="https://www.library.caltech.edu/copyright">Caltech library</a></span>
<address>1200 E California Blvd, Mail Code 1-32, Pasadena, CA 91125-3200</address>
<span>Phone: <a href="tel:+1-626-395-3405">(626)395-3405</a></span>
<span><a href="mailto:[email protected]">Email Us</a></span>
<a class="cl-hide" href="sitemap.xml">Site Map</a>
</footer>
</body>
</html>