demo-cluster.qmd: 4 additions & 4 deletions
@@ -66,21 +66,22 @@ This helps us some, combining 300 different variations to only 162 choices, but
  OpenRefine has a concept called **Cluster** that will use algorithms to find similarly-constructed or even similar sounding words. We'll use a series of these help us clean these city names.

  1. In the text facet box for *City_clean*, click on the **Cluster** button at the top-right. This brings up the **Cluster and edit column** tool.
+ 1. Click on the **Cluster** button in the middle so we can take a little tour of the options.

  The idea here is to work through all the results methodically:

  - Look through all the values for a particular **Keying function**.
  - If you want to merge **all** the values in the cluster, check the **Merge** box and set the **New Cell Value** to the desired result.
- - If even one of the values in the cluster does not belong together, then DON'T MERGE IT. You'll have to deal with them independently later. Take notes and edit from the text facet, perhaps.
+ - If one of the values in the cluster does belong to the new value, then uncheck the box next to that value so it won't be included.
  - Once you've reviewed all the clusters, choose **Merge Selected & Re-Cluster**.
  - After a quick double-check, change the **Keying Function** to the next algorithm.
  - Rinse and repeat for all the keying functions.
  - Then change the **Method** from "key collision" to "nearest neighbor" and follow all the above steps again.
  - With **nearest neighbor** and **levenshtein** it might be worth reducing the value in **Block Chars** to see if there are more matches that help you.

- Following is a gif of me going through a couple of keying functions, merges and new algorithms. I'm not fixing all the values, just showing enough of the process to give you an idea of how it works.
+ Below is a gif of me going through a couple of keying functions, merges and new algorithms. I'm not fixing all the values, just showing enough of the process to give you an idea of how it works.
@@ -89,7 +90,6 @@ Following is a gif of me going through a couple of keying functions, merges and
  As you cluster and clean data like this, you'll likely have to do some research and make style decisions (N PROVIDENCE vs NORTH PROVIDENCE? Is it PEACE DALE or PEACEDALE?)

  1. Go through all the algorithms and clean up the city names.
- 1. Remember: Don't merge unless all values in a cluster should be the same.
  1. Once through all the algorithms, double-check through the facet list to see if there are values the algorithms missed. It is quite possible.

  You would typically use text facets on all the text-based columns to check for other inconsistencies.
@@ -124,4 +124,4 @@ Once you've done all your cleaning, use the Export dropdown button at the top-ri

  ---

- We're done with this lesson. Perhaps head back to the [Overivew](index.qmd) to read about some case studies.
+ We're done with this lesson. Perhaps head back to the [Overivew](index.qmd#case-studies) to read about some case studies.
<p>OpenRefine has a concept called <strong>Cluster</strong> that will use algorithms to find similarly-constructed or even similar sounding words. We’ll use a series of these help us clean these city names.</p>
  <ol type="1">
- <li><p>In the text facet box for <em>City_clean</em>, click on the <strong>Cluster</strong> button at the top-right. This brings up the <strong>Cluster and edit column</strong> tool.</p>
+ <li><p>In the text facet box for <em>City_clean</em>, click on the <strong>Cluster</strong> button at the top-right. This brings up the <strong>Cluster and edit column</strong> tool.</p></li>
+ <li><p>Click on the <strong>Cluster</strong> button in the middle so we can take a little tour of the options.</p>
  <li>Look through all the values for a particular <strong>Keying function</strong>.</li>
  <li>If you want to merge <strong>all</strong> the values in the cluster, check the <strong>Merge</strong> box and set the <strong>New Cell Value</strong> to the desired result.
  <ul>
- <li>If even one of the values in the cluster does not belong together, then DON’T MERGE IT. You’ll have to deal with them independently later. Take notes and edit from the text facet, perhaps.</li>
+ <li>If one of the values in the cluster does belong to the new value, then uncheck the box next to that value so it won’t be included.</li>
  </ul></li>
  <li>Once you’ve reviewed all the clusters, choose <strong>Merge Selected & Re-Cluster</strong>.</li>
  <li>After a quick double-check, change the <strong>Keying Function</strong> to the next algorithm.</li>
  <li>With <strong>nearest neighbor</strong> and <strong>levenshtein</strong> it might be worth reducing the value in <strong>Block Chars</strong> to see if there are more matches that help you.</li>
  </ul></li>
  </ul>
- <p>Following is a gif of me going through a couple of keying functions, merges and new algorithms. I’m not fixing all the values, just showing enough of the process to give you an idea of how it works.</p>
+ <p>Below is a gif of me going through a couple of keying functions, merges and new algorithms. I’m not fixing all the values, just showing enough of the process to give you an idea of how it works.</p>
@@ -325,7 +326,6 @@ <h3 class="anchored" data-anchor-id="practice-cleaning-up-city_clean">Practice c
  <p>As you cluster and clean data like this, you’ll likely have to do some research and make style decisions (N PROVIDENCE vs NORTH PROVIDENCE? Is it PEACE DALE or PEACEDALE?)</p>
  <ol type="1">
  <li>Go through all the algorithms and clean up the city names.</li>
- <li>Remember: Don’t merge unless all values in a cluster should be the same.</li>
  <li>Once through all the algorithms, double-check through the facet list to see if there are values the algorithms missed. It is quite possible.</li>
  </ol>
  <p>You would typically use text facets on all the text-based columns to check for other inconsistencies.</p>
docs/search.json: 1 addition & 1 deletion
@@ -120,7 +120,7 @@
  "href": "demo-cluster.html#cluster",
  "title": "Clustering",
  "section": "Cluster",
- "text": "Cluster\nOpenRefine has a concept called Cluster that will use algorithms to find similarly-constructed or even similar sounding words. We’ll use a series of these help us clean these city names.\n\nIn the text facet box for City_clean, click on the Cluster button at the top-right. This brings up the Cluster and edit column tool.\n\n\n\nCluster tour\n\n\n\nThe idea here is to work through all the results methodically:\n\nLook through all the values for a particular Keying function.\nIf you want to merge all the values in the cluster, check the Merge box and set the New Cell Value to the desired result.\n\nIf even one of the values in the cluster does not belong together, then DON’T MERGE IT. You’ll have to deal with them independently later. Take notes and edit from the text facet, perhaps.\n\nOnce you’ve reviewed all the clusters, choose Merge Selected & Re-Cluster.\nAfter a quick double-check, change the Keying Function to the next algorithm.\nRinse and repeat for all the keying functions.\nThen change the Method from “key collision” to “nearest neighbor” and follow all the above steps again.\n\nWith nearest neighbor and levenshtein it might be worth reducing the value in Block Chars to see if there are more matches that help you.\n\n\nFollowing is a gif of me going through a couple of keying functions, merges and new algorithms. I’m not fixing all the values, just showing enough of the process to give you an idea of how it works.\n\n\n\nClustering\n\n\n\nPractice cleaning up City_clean\nAs you cluster and clean data like this, you’ll likely have to do some research and make style decisions (N PROVIDENCE vs NORTH PROVIDENCE? Is it PEACE DALE or PEACEDALE?)\n\nGo through all the algorithms and clean up the city names.\nRemember: Don’t merge unless all values in a cluster should be the same.\nOnce through all the algorithms, double-check through the facet list to see if there are values the algorithms missed. It is quite possible.\n\nYou would typically use text facets on all the text-based columns to check for other inconsistencies.",
+ "text": "Cluster\nOpenRefine has a concept called Cluster that will use algorithms to find similarly-constructed or even similar sounding words. We’ll use a series of these help us clean these city names.\n\nIn the text facet box for City_clean, click on the Cluster button at the top-right. This brings up the Cluster and edit column tool.\nClick on the Cluster button in the middle so we can take a little tour of the options.\n\n\n\nCluster tour\n\n\n\nThe idea here is to work through all the results methodically:\n\nLook through all the values for a particular Keying function.\nIf you want to merge all the values in the cluster, check the Merge box and set the New Cell Value to the desired result.\n\nIf one of the values in the cluster does belong to the new value, then uncheck the box next to that value so it won’t be included.\n\nOnce you’ve reviewed all the clusters, choose Merge Selected & Re-Cluster.\nAfter a quick double-check, change the Keying Function to the next algorithm.\nRinse and repeat for all the keying functions.\nThen change the Method from “key collision” to “nearest neighbor” and follow all the above steps again.\n\nWith nearest neighbor and levenshtein it might be worth reducing the value in Block Chars to see if there are more matches that help you.\n\n\nBelow is a gif of me going through a couple of keying functions, merges and new algorithms. I’m not fixing all the values, just showing enough of the process to give you an idea of how it works.\n\n\n\nClustering\n\n\n\nPractice cleaning up City_clean\nAs you cluster and clean data like this, you’ll likely have to do some research and make style decisions (N PROVIDENCE vs NORTH PROVIDENCE? Is it PEACE DALE or PEACEDALE?)\n\nGo through all the algorithms and clean up the city names.\nOnce through all the algorithms, double-check through the facet list to see if there are values the algorithms missed. It is quite possible.\n\nYou would typically use text facets on all the text-based columns to check for other inconsistencies.",
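
Note on the changed text: the two **Method** settings it mentions, "key collision" and "nearest neighbor", can be illustrated with a minimal Python sketch. This is not OpenRefine's code. The `fingerprint` function is a simplified stand-in for a keying function, `levenshtein` is the standard edit-distance calculation, and the value `Peace Dale.` is made up for illustration; the other city names come from the lesson text.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Simplified keying function (an assumption, not OpenRefine's implementation):
    lowercase, drop accents and punctuation, then sort the unique tokens."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(sorted(set(text.split())))


def key_collision_clusters(values):
    """Group values whose keys collide; keep groups with more than one distinct value."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]


def levenshtein(a: str, b: str) -> int:
    """Edit distance via the usual row-by-row dynamic programming table."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# "Peace Dale." is a hypothetical variant added for this example.
cities = ["PEACE DALE", "Peace Dale.", "PEACEDALE", "N PROVIDENCE", "NORTH PROVIDENCE"]
clusters = key_collision_clusters(cities)
print(clusters)
print(levenshtein("PEACE DALE", "PEACEDALE"))
```

In this sketch `PEACE DALE` and `Peace Dale.` collide on the same key, while `PEACEDALE` only appears as a near match by edit distance, which is one reason the lesson asks you to work through both methods.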