Google Refine Data Cleaning Software
January 5, 2011 2:11 AM Subscribe
Google Refine is a free data-cleaning tool provided by Google. You install it on your own computer and run it through your browser. It’s very similar to PivotTables in Excel, but there are some advanced features such as data matching. On their blog, ProPublica provides a practical example of how they used the software to clean medical billing data for one of their investigative reports.
I have fiddled with this, and it seems a whole lot better than bloody PivotTables, with respect to washing dodgy data sets.
That said: shhhhhhh, I'm trying to pretend that tidying up these datasets is all my work.
posted by pompomtom at 3:26 AM on January 5, 2011 [1 favorite]
That said: shhhhhhh, I'm trying to pretend that tidying up these datasets is all my work.
posted by pompomtom at 3:26 AM on January 5, 2011 [1 favorite]
I saw this demo'd at OpenTech in London last year and was quite impressed. The data.gov.uk people who were using it to make e.g. Excel sheets showing local government purchasing into usable web-friendly data. This was when it was still Freebase Gridworks.
posted by crocomancer at 4:06 AM on January 5, 2011
posted by crocomancer at 4:06 AM on January 5, 2011
Very cool, wish I had this in college!
posted by r_nebblesworthII at 4:12 AM on January 5, 2011
posted by r_nebblesworthII at 4:12 AM on January 5, 2011
Awesome, I had a conversation just before Christmas with a friend who manages a very large ebook collection that came from various sources, and is held under various ownership and lease agreements. It's shockingly cataloged. A lot of the problems come down to misspellings and typos of author's names, publisher details, exact dates of acquisition, dates the leases end, dates leases need to be paid for, etc. And excel is a pig at cleaning that stuff up. This looks better. So I've sent it on. Thank you.
posted by Ahab at 5:06 AM on January 5, 2011
posted by Ahab at 5:06 AM on January 5, 2011
I have been using Refine (and Freebase) for about a week now and I think it's the greatest thing since sliced bread. Yet another indispensable tool from Google. (I sure hope they don't turn out to be evil.)
posted by GrammarMoses at 5:32 AM on January 5, 2011
posted by GrammarMoses at 5:32 AM on January 5, 2011
Some colleagues and I tried this a few months ago, and while we were impressed, there doesn't seem to be any way to script it or record what you do or create any kind of audit trail, if you like... if I use this to reshape some funky data, I would like to be able to quickly repeat the steps on the next data I apply to it, or at least obtain some kind of record of how the data was transformed. Still...good work.
posted by Jimbob at 5:47 AM on January 5, 2011 [1 favorite]
posted by Jimbob at 5:47 AM on January 5, 2011 [1 favorite]
Can someone give me a use case for this? I"m a little confused by what it does.
posted by empath at 6:44 AM on January 5, 2011
posted by empath at 6:44 AM on January 5, 2011
Much like a plumber's snake, if you don't know what it's for, you can count yourself lucky.
posted by kersplunk at 6:50 AM on January 5, 2011 [23 favorites]
posted by kersplunk at 6:50 AM on January 5, 2011 [23 favorites]
empath, it is really really handy when you have to clean up data that people have entered in text boxes.
You can very easily correctly group/replace synonyms, spelling mistakes, abbreviations, etc and the beauty is that you can immediately see the impact on your data.
Really if you do anything with user entered data this tool is an incredible timesaver.
For me, I get a chance to try and figure out what the hell Birmingham City Council is trying to hide with their obstructionist data dumps.
posted by srboisvert at 6:54 AM on January 5, 2011
You can very easily correctly group/replace synonyms, spelling mistakes, abbreviations, etc and the beauty is that you can immediately see the impact on your data.
Really if you do anything with user entered data this tool is an incredible timesaver.
For me, I get a chance to try and figure out what the hell Birmingham City Council is trying to hide with their obstructionist data dumps.
posted by srboisvert at 6:54 AM on January 5, 2011
I know I need this but I'm not sure why I need this.
posted by KevinSkomsvold at 7:17 AM on January 5, 2011 [1 favorite]
posted by KevinSkomsvold at 7:17 AM on January 5, 2011 [1 favorite]
Does it generate discreet little text ads based on the data it's refined for you?
posted by notyou at 7:20 AM on January 5, 2011 [2 favorites]
posted by notyou at 7:20 AM on January 5, 2011 [2 favorites]
Jimbob, you can repeat steps on another set of data using the undo/redo history tab. Every step is saved in JSON format. Here's a video explaining it.
Other useful tutorials to get a good sense of what Refine can do:
Cleaning up data
Transforming data in to another format
Linking with external data sources (Freebase).
posted by kakarott999 at 7:25 AM on January 5, 2011 [5 favorites]
Other useful tutorials to get a good sense of what Refine can do:
Cleaning up data
Transforming data in to another format
Linking with external data sources (Freebase).
posted by kakarott999 at 7:25 AM on January 5, 2011 [5 favorites]
The "practical example" link in the post was for me the most revealing of what this can do, which in my opinion, is awesome.
For people talking about cleaning data using Pivot Tables, does this really happen? How so? When cleaning data in excel i have suite of formulas and tricks that i use to scrub data, but I have never used a pivot table for this. Am i missing something obvious here?
Having said that, this tool seems by far more powerful than the few tricks I use in excel to clean data, so I hope to delve into this in detail in the next few days - thanks for the post!
posted by kev23f at 7:27 AM on January 5, 2011
For people talking about cleaning data using Pivot Tables, does this really happen? How so? When cleaning data in excel i have suite of formulas and tricks that i use to scrub data, but I have never used a pivot table for this. Am i missing something obvious here?
Having said that, this tool seems by far more powerful than the few tricks I use in excel to clean data, so I hope to delve into this in detail in the next few days - thanks for the post!
posted by kev23f at 7:27 AM on January 5, 2011
empath: "Can someone give me a use case for this? I"m a little confused by what it does."
Lets say I found the salary study for the local community college, and was curious whether "the administration" was truly to blame. Unfortunately it's only published in a binder in their library. So you have to scan and OCR it. But ten percent of the time it mistook $ for 8, and 5 for S and so on. Or it had a hard time distinguishing between "Admin. Asst. II" with capital I's and " Admin. Asst. ll" with lowercase L's. So you use this to look at a massive dataset and clean up the data, so you can get more accurate statistics of say, the average wage of admin assistants by I,II,III. Or average wage of professors by dept. Or the Managers / Mgr / Man. / Mngr. and Directors / Dir. And so on.
Also, it would have been helpful if the post had pointed out this was actually Gridworks, because for a moment I thought this was a competing product.
posted by pwnguin at 7:43 AM on January 5, 2011 [2 favorites]
Lets say I found the salary study for the local community college, and was curious whether "the administration" was truly to blame. Unfortunately it's only published in a binder in their library. So you have to scan and OCR it. But ten percent of the time it mistook $ for 8, and 5 for S and so on. Or it had a hard time distinguishing between "Admin. Asst. II" with capital I's and " Admin. Asst. ll" with lowercase L's. So you use this to look at a massive dataset and clean up the data, so you can get more accurate statistics of say, the average wage of admin assistants by I,II,III. Or average wage of professors by dept. Or the Managers / Mgr / Man. / Mngr. and Directors / Dir. And so on.
Also, it would have been helpful if the post had pointed out this was actually Gridworks, because for a moment I thought this was a competing product.
posted by pwnguin at 7:43 AM on January 5, 2011 [2 favorites]
This is really freaking sweet, thanks for posting this!
I usually am having to resolve text string issues, not so much numeric value issues. Being a more, well, right brained dude, I found myself wanting to clutch a blanky as the demonstration confidently moved toward talking about logs of values and what they represented. But... whew... text clean up... neat!
posted by cavalier at 7:55 AM on January 5, 2011
I usually am having to resolve text string issues, not so much numeric value issues. Being a more, well, right brained dude, I found myself wanting to clutch a blanky as the demonstration confidently moved toward talking about logs of values and what they represented. But... whew... text clean up... neat!
posted by cavalier at 7:55 AM on January 5, 2011
Man, I wish I had had this when I was setting up distributed entries for our insurance packages earlier this year.
Definitely going to link this around the office.
posted by traversionischaracter at 8:07 AM on January 5, 2011
Definitely going to link this around the office.
posted by traversionischaracter at 8:07 AM on January 5, 2011
I'm tempted to see if this can help clean up my iTunes data.
posted by ardgedee at 9:00 AM on January 5, 2011
posted by ardgedee at 9:00 AM on January 5, 2011
Oooo I have been putting off transmogrifying some OpenOffice spreadsheet data...maybe I can use this...thanks!
posted by Xoebe at 10:27 AM on January 5, 2011
posted by Xoebe at 10:27 AM on January 5, 2011
Holy FSM, I could have used this in my old job. Normalizing online journal titles across multiple systems with multiple standards (The [name] versus [name], The, etc.) is a nightmare, but it needs to happen in order to look at usage statistics across collections in a meaningful way. Refine sounds like it might make that process a lot less headdesk-inducing. Thanks for posting this!
posted by ashirys at 11:19 AM on January 5, 2011
posted by ashirys at 11:19 AM on January 5, 2011
A decent-sized chunk of my programming career has involved figuring how to take data from disparate sources (CSV, multi-sheet Excel, Access dbs, Word docs -ecch! -), and rearranging it, cleaning it, "de-duplicating", and porting it to another format, layout, or database.
For one-offs I usually do this task manually, using a combination of UltraEdit (with column-editing and killer search/replace), Excel (or OpenOffice Calc), and a MySQL db. For recurring conversions, I'll write a script, a commandline app in C, or even a full desktop app to run the conversion.
So, this tool is COOL, but I'll have to keep my employers from finding it.
posted by Artful Codger at 11:21 AM on January 5, 2011
For one-offs I usually do this task manually, using a combination of UltraEdit (with column-editing and killer search/replace), Excel (or OpenOffice Calc), and a MySQL db. For recurring conversions, I'll write a script, a commandline app in C, or even a full desktop app to run the conversion.
So, this tool is COOL, but I'll have to keep my employers from finding it.
posted by Artful Codger at 11:21 AM on January 5, 2011
Jasper Friendly Bear, I could kiss you RIGHT ON THE MOUTH! I downloaded this earlier today and spent a couple of happy "ooh and ahh" hours cleaning up the data in some spreadsheets that I have been loathe to review. It was so EASY! And intuitive! And fun even! Thank you thank you thank you!
(I'm a little excited right now.)
posted by jeanmari at 7:55 PM on January 5, 2011 [1 favorite]
(I'm a little excited right now.)
posted by jeanmari at 7:55 PM on January 5, 2011 [1 favorite]
empath: "Can someone give me a use case for this? I"m a little confused by what it does."
In regards to something I used it for today. I had two huge spreadsheets of data collected over a six week period from a group of research subjects. Each subject had to fill out a daily Google form answering questions, and that data was dumped into a timestamped row on a spreadsheet. Unfortunately, some of the subjects either mis-entered their ID number now and then, or forgot to use their ID number and instead entered their Login ID but only sometimes. This means that I had some data polluting my spreadsheet which hadn't been matched with the rest of the data from that respondent. With Google Refine, I was able to quickly cluster the ID data (versus just sort it) and spot problems, edit those problems, and apply those edits to all other instances of the problem with just 1 click. The interface makes it very easy to spot problems, unlike having to sort in Excel or something similar. If someone mistyped their ID number, I quickly saw something like this:
9866535 (27)
9866635 (1)
I was expecting each respondent to have entered 28 answers. And since the IDs of each respondent were so different, there isn't a respondent labeled "9866635". The respondent typed the number incorrectly.
Or I saw things like this:
5789021 (4)
mol432 (24)
I was able to sleuth around and check the timestamps to match the respondents who entered their login IDs for some of their answers, and reunite all of the data together for one respondent.
That was a very little thing, but saved me so much time and strained eyes. I did a lot more with this, and the videos on the ProPublica website were extremely helpful.
posted by jeanmari at 8:06 PM on January 5, 2011
In regards to something I used it for today. I had two huge spreadsheets of data collected over a six week period from a group of research subjects. Each subject had to fill out a daily Google form answering questions, and that data was dumped into a timestamped row on a spreadsheet. Unfortunately, some of the subjects either mis-entered their ID number now and then, or forgot to use their ID number and instead entered their Login ID but only sometimes. This means that I had some data polluting my spreadsheet which hadn't been matched with the rest of the data from that respondent. With Google Refine, I was able to quickly cluster the ID data (versus just sort it) and spot problems, edit those problems, and apply those edits to all other instances of the problem with just 1 click. The interface makes it very easy to spot problems, unlike having to sort in Excel or something similar. If someone mistyped their ID number, I quickly saw something like this:
9866535 (27)
9866635 (1)
I was expecting each respondent to have entered 28 answers. And since the IDs of each respondent were so different, there isn't a respondent labeled "9866635". The respondent typed the number incorrectly.
Or I saw things like this:
5789021 (4)
mol432 (24)
I was able to sleuth around and check the timestamps to match the respondents who entered their login IDs for some of their answers, and reunite all of the data together for one respondent.
That was a very little thing, but saved me so much time and strained eyes. I did a lot more with this, and the videos on the ProPublica website were extremely helpful.
posted by jeanmari at 8:06 PM on January 5, 2011
I'm tempted to see if this can help clean up my iTunes data.
Freebase has a load of MusicBrainz artist and title info, so you could. But ...
It probably would be a easier in terms of how often you'd need to step in manually to use a tool designed to fix music data, like MusicBrainz's Picard, or a commercial tool for iTunes like TuneUp Media's.
disclaimer: I helped found Metaweb, some of my cow-orkers left to join TuneUp, and I'm also a friend of (and fan of) Musicbrainz. So no links above.
posted by zippy at 11:28 PM on January 5, 2011
Freebase has a load of MusicBrainz artist and title info, so you could. But ...
It probably would be a easier in terms of how often you'd need to step in manually to use a tool designed to fix music data, like MusicBrainz's Picard, or a commercial tool for iTunes like TuneUp Media's.
disclaimer: I helped found Metaweb, some of my cow-orkers left to join TuneUp, and I'm also a friend of (and fan of) Musicbrainz. So no links above.
posted by zippy at 11:28 PM on January 5, 2011
ardgedee: "I'm tempted to see if this can help clean up my iTunes data."
It's not quite what you're looking for, but Banshee has this concept as the Bulk Metadata Fixup extension. Banshee is open source and slated to be the default music player for Nokia'a Meego smartphone OS.
It works about the same way, except the facets are predefined for you.
posted by pwnguin at 7:39 AM on January 6, 2011
It's not quite what you're looking for, but Banshee has this concept as the Bulk Metadata Fixup extension. Banshee is open source and slated to be the default music player for Nokia'a Meego smartphone OS.
It works about the same way, except the facets are predefined for you.
posted by pwnguin at 7:39 AM on January 6, 2011
Has anyone looked at this from a privacy perspective? The thought of taking work-related data and handing it over to Google is nausea-inducing. The thought of thousands of people doing so without even thinking about it is terrifying.
I don't see any privacy statement that tells me that this tool doesn't send everything to Google. The source code is all there, but I'm not excited enough about this to go hunting for the "send it to Google" call.
posted by gurple at 2:20 PM on January 7, 2011
I don't see any privacy statement that tells me that this tool doesn't send everything to Google. The source code is all there, but I'm not excited enough about this to go hunting for the "send it to Google" call.
posted by gurple at 2:20 PM on January 7, 2011
Has anyone looked at this from a privacy perspective?
That's the first thing I had to look at. In my job it's no use to me if it sends any of my data anywhere. Google took pains in the demo vids to explain that this is not a web service and everything happens, and stays, local to the user. I reckon I'd probably just yank the ethernet cable while running it on any actual client data.
posted by pompomtom at 4:14 PM on January 7, 2011
That's the first thing I had to look at. In my job it's no use to me if it sends any of my data anywhere. Google took pains in the demo vids to explain that this is not a web service and everything happens, and stays, local to the user. I reckon I'd probably just yank the ethernet cable while running it on any actual client data.
posted by pompomtom at 4:14 PM on January 7, 2011
pompomtom: "oogle took pains in the demo vids to explain that this is not a web service"
Technically, it is. The implementation is basically a small webserver that runs on your computer, that you then point a browser at. If you're worried about it phoning your data home you can always sandbox it and firewall it and wireshark it.
posted by pwnguin at 3:02 PM on January 8, 2011
Technically, it is. The implementation is basically a small webserver that runs on your computer, that you then point a browser at. If you're worried about it phoning your data home you can always sandbox it and firewall it and wireshark it.
posted by pwnguin at 3:02 PM on January 8, 2011
« Older Story of the King James Bible | Dr. Katz, Professional Therapist Newer »
This thread has been archived and is closed to new comments
posted by Catblack at 2:26 AM on January 5, 2011