Commit efc1294

fixes for cluster
1 parent 06d0383 commit efc1294

8 files changed

Lines changed: 10 additions & 10 deletions

demo-cluster.qmd

Lines changed: 4 additions & 4 deletions
@@ -66,21 +66,22 @@ This helps us some, combining 300 different variations to only 162 choices, but
 OpenRefine has a concept called **Cluster** that will use algorithms to find similarly-constructed or even similar sounding words. We'll use a series of these help us clean these city names.
 
 1. In the text facet box for *City_clean*, click on the **Cluster** button at the top-right. This brings up the **Cluster and edit column** tool.
+1. Click on the **Cluster** button in the middle so we can take a little tour of the options.
 
 ![Cluster tour](img/ppp-cluster-tour.png)
 
 The idea here is to work through all the results methodically:
 
 - Look through all the values for a particular **Keying function**.
 - If you want to merge **all** the values in the cluster, check the **Merge** box and set the **New Cell Value** to the desired result.
-- If even one of the values in the cluster does not belong together, then DON'T MERGE IT. You'll have to deal with them independently later. Take notes and edit from the text facet, perhaps.
+- If one of the values in the cluster does belong to the new value, then uncheck the box next to that value so it won't be included.
 - Once you've reviewed all the clusters, choose **Merge Selected & Re-Cluster**.
 - After a quick double-check, change the **Keying Function** to the next algorithm.
 - Rinse and repeat for all the keying functions.
 - Then change the **Method** from "key collision" to "nearest neighbor" and follow all the above steps again.
 - With **nearest neighbor** and **levenshtein** it might be worth reducing the value in **Block Chars** to see if there are more matches that help you.
 
-Following is a gif of me going through a couple of keying functions, merges and new algorithms. I'm not fixing all the values, just showing enough of the process to give you an idea of how it works.
+Below is a gif of me going through a couple of keying functions, merges and new algorithms. I'm not fixing all the values, just showing enough of the process to give you an idea of how it works.
 
 ![Clustering](img/ppp-cluster.gif)
 
@@ -89,7 +90,6 @@ Following is a gif of me going through a couple of keying functions, merges and
 As you cluster and clean data like this, you'll likely have to do some research and make style decisions (N PROVIDENCE vs NORTH PROVIDENCE? Is it PEACE DALE or PEACEDALE?)
 
 1. Go through all the algorithms and clean up the city names.
-1. Remember: Don't merge unless all values in a cluster should be the same.
 1. Once through all the algorithms, double-check through the facet list to see if there are values the algorithms missed. It is quite possible.
 
 You would typically use text facets on all the text-based columns to check for other inconsistencies.
@@ -124,4 +124,4 @@ Once you've done all your cleaning, use the Export dropdown button at the top-ri
 
 ---
 
-We're done with this lesson. Perhaps head back to the [Overivew](index.qmd) to read about some case studies.
+We're done with this lesson. Perhaps head back to the [Overivew](index.qmd#case-studies) to read about some case studies.
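
As a rough illustration of what the "key collision" method in the lesson above does, here is a minimal Python sketch of fingerprint-style keying. It only approximates the idea behind OpenRefine's fingerprint keying function and is not OpenRefine's actual code. The function names `fingerprint` and `cluster_by_key` are assumptions made for this example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a string to a normalized key (hypothetical helper)."""
    text = value.strip().lower()
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Split on whitespace, de-duplicate and sort the tokens.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster_by_key(values):
    """Group distinct values that share a key (hypothetical helper)."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    # Only groups with more than one distinct spelling are clusters.
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Values whose keys collide land in the same cluster, which is why reordered words or stray punctuation get grouped. Spellings such as PEACE DALE and PEACEDALE produce different keys here, so this kind of keying alone would not group them.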

docs/demo-cluster.html

Lines changed: 5 additions & 5 deletions
@@ -290,7 +290,8 @@ <h2 class="anchored" data-anchor-id="change-to-uppercase">Change to uppercase</h
 <h2 class="anchored" data-anchor-id="cluster">Cluster</h2>
 <p>OpenRefine has a concept called <strong>Cluster</strong> that will use algorithms to find similarly-constructed or even similar sounding words. We’ll use a series of these help us clean these city names.</p>
 <ol type="1">
-<li><p>In the text facet box for <em>City_clean</em>, click on the <strong>Cluster</strong> button at the top-right. This brings up the <strong>Cluster and edit column</strong> tool.</p>
+<li><p>In the text facet box for <em>City_clean</em>, click on the <strong>Cluster</strong> button at the top-right. This brings up the <strong>Cluster and edit column</strong> tool.</p></li>
+<li><p>Click on the <strong>Cluster</strong> button in the middle so we can take a little tour of the options.</p>
 <div class="quarto-figure quarto-figure-center">
 <figure class="figure">
 <p><img src="img/ppp-cluster-tour.png" class="img-fluid figure-img"></p>
@@ -303,7 +304,7 @@ <h2 class="anchored" data-anchor-id="cluster">Cluster</h2>
 <li>Look through all the values for a particular <strong>Keying function</strong>.</li>
 <li>If you want to merge <strong>all</strong> the values in the cluster, check the <strong>Merge</strong> box and set the <strong>New Cell Value</strong> to the desired result.
 <ul>
-<li>If even one of the values in the cluster does not belong together, then DON’T MERGE IT. You’ll have to deal with them independently later. Take notes and edit from the text facet, perhaps.</li>
+<li>If one of the values in the cluster does belong to the new value, then uncheck the box next to that value so it won’t be included.</li>
 </ul></li>
 <li>Once you’ve reviewed all the clusters, choose <strong>Merge Selected &amp; Re-Cluster</strong>.</li>
 <li>After a quick double-check, change the <strong>Keying Function</strong> to the next algorithm.</li>
@@ -313,7 +314,7 @@ <h2 class="anchored" data-anchor-id="cluster">Cluster</h2>
 <li>With <strong>nearest neighbor</strong> and <strong>levenshtein</strong> it might be worth reducing the value in <strong>Block Chars</strong> to see if there are more matches that help you.</li>
 </ul></li>
 </ul>
-<p>Following is a gif of me going through a couple of keying functions, merges and new algorithms. I’m not fixing all the values, just showing enough of the process to give you an idea of how it works.</p>
+<p>Below is a gif of me going through a couple of keying functions, merges and new algorithms. I’m not fixing all the values, just showing enough of the process to give you an idea of how it works.</p>
 <div class="quarto-figure quarto-figure-center">
 <figure class="figure">
 <p><img src="img/ppp-cluster.gif" class="img-fluid figure-img"></p>
@@ -325,7 +326,6 @@ <h3 class="anchored" data-anchor-id="practice-cleaning-up-city_clean">Practice c
 <p>As you cluster and clean data like this, you’ll likely have to do some research and make style decisions (N PROVIDENCE vs NORTH PROVIDENCE? Is it PEACE DALE or PEACEDALE?)</p>
 <ol type="1">
 <li>Go through all the algorithms and clean up the city names.</li>
-<li>Remember: Don’t merge unless all values in a cluster should be the same.</li>
 <li>Once through all the algorithms, double-check through the facet list to see if there are values the algorithms missed. It is quite possible.</li>
 </ol>
 <p>You would typically use text facets on all the text-based columns to check for other inconsistencies.</p>
@@ -362,7 +362,7 @@ <h2 class="anchored" data-anchor-id="timeline-facets">Timeline facets</h2>
 <h2 class="anchored" data-anchor-id="export">Export</h2>
 <p>Once you’ve done all your cleaning, use the Export dropdown button at the top-right of the app to export the data in your filetype of choice.</p>
 <hr>
-<p>We’re done with this lesson. Perhaps head back to the <a href="./index.html">Overivew</a> to read about some case studies.</p>
+<p>We’re done with this lesson. Perhaps head back to the <a href="./index.html#case-studies">Overivew</a> to read about some case studies.</p>
 
 
 </section>

docs/img/ppp-cluster-tour.png

71.4 KB

docs/img/ppp-cluster.gif

875 KB

docs/search.json

Lines changed: 1 addition & 1 deletion
@@ -120,7 +120,7 @@
 "href": "demo-cluster.html#cluster",
 "title": "Clustering",
 "section": "Cluster",
-"text": "Cluster\nOpenRefine has a concept called Cluster that will use algorithms to find similarly-constructed or even similar sounding words. We’ll use a series of these help us clean these city names.\n\nIn the text facet box for City_clean, click on the Cluster button at the top-right. This brings up the Cluster and edit column tool.\n\n\n\nCluster tour\n\n\n\nThe idea here is to work through all the results methodically:\n\nLook through all the values for a particular Keying function.\nIf you want to merge all the values in the cluster, check the Merge box and set the New Cell Value to the desired result.\n\nIf even one of the values in the cluster does not belong together, then DON’T MERGE IT. You’ll have to deal with them independently later. Take notes and edit from the text facet, perhaps.\n\nOnce you’ve reviewed all the clusters, choose Merge Selected & Re-Cluster.\nAfter a quick double-check, change the Keying Function to the next algorithm.\nRinse and repeat for all the keying functions.\nThen change the Method from “key collision” to “nearest neighbor” and follow all the above steps again.\n\nWith nearest neighbor and levenshtein it might be worth reducing the value in Block Chars to see if there are more matches that help you.\n\n\nFollowing is a gif of me going through a couple of keying functions, merges and new algorithms. I’m not fixing all the values, just showing enough of the process to give you an idea of how it works.\n\n\n\nClustering\n\n\n\nPractice cleaning up City_clean\nAs you cluster and clean data like this, you’ll likely have to do some research and make style decisions (N PROVIDENCE vs NORTH PROVIDENCE? Is it PEACE DALE or PEACEDALE?)\n\nGo through all the algorithms and clean up the city names.\nRemember: Don’t merge unless all values in a cluster should be the same.\nOnce through all the algorithms, double-check through the facet list to see if there are values the algorithms missed. It is quite possible.\n\nYou would typically use text facets on all the text-based columns to check for other inconsistencies.",
+"text": "Cluster\nOpenRefine has a concept called Cluster that will use algorithms to find similarly-constructed or even similar sounding words. We’ll use a series of these help us clean these city names.\n\nIn the text facet box for City_clean, click on the Cluster button at the top-right. This brings up the Cluster and edit column tool.\nClick on the Cluster button in the middle so we can take a little tour of the options.\n\n\n\nCluster tour\n\n\n\nThe idea here is to work through all the results methodically:\n\nLook through all the values for a particular Keying function.\nIf you want to merge all the values in the cluster, check the Merge box and set the New Cell Value to the desired result.\n\nIf one of the values in the cluster does belong to the new value, then uncheck the box next to that value so it won’t be included.\n\nOnce you’ve reviewed all the clusters, choose Merge Selected & Re-Cluster.\nAfter a quick double-check, change the Keying Function to the next algorithm.\nRinse and repeat for all the keying functions.\nThen change the Method from “key collision” to “nearest neighbor” and follow all the above steps again.\n\nWith nearest neighbor and levenshtein it might be worth reducing the value in Block Chars to see if there are more matches that help you.\n\n\nBelow is a gif of me going through a couple of keying functions, merges and new algorithms. I’m not fixing all the values, just showing enough of the process to give you an idea of how it works.\n\n\n\nClustering\n\n\n\nPractice cleaning up City_clean\nAs you cluster and clean data like this, you’ll likely have to do some research and make style decisions (N PROVIDENCE vs NORTH PROVIDENCE? Is it PEACE DALE or PEACEDALE?)\n\nGo through all the algorithms and clean up the city names.\nOnce through all the algorithms, double-check through the facet list to see if there are values the algorithms missed. It is quite possible.\n\nYou would typically use text facets on all the text-based columns to check for other inconsistencies.",
 "crumbs": [
 "Demos",
 "Clustering"

img/ppp-cluster-start.png

69 KB

img/ppp-cluster-tour.png

71.4 KB

img/ppp-cluster.gif

875 KB

0 commit comments
