♫Do you want to scrape a website?♫
March 17, 2015 7:46 AM
Did you ever just want a bunch of web data as painlessly as possible but not know a thing about command-line webscrapers (curl, wget) or parsing libraries (BeautifulSoup, JSoup, pandas)?
import.io will try to auto-magically hash any website you give it into structured data. (Here's MetaFilter.)
Need a bit more control over those results?
Kimono gives you a point-and-click environment for choosing page elements and pagination indicators. (Requires a Chrome add-on or browser bookmarklet.)
posted by Going To Maine at 8:01 AM on March 17, 2015
And if you *do* know a bit of programming, but don't want to bother with setting up a database to hold information you've already scraped, there's morph.io (free) and ScraperWiki. I used ScraperWiki on my booksfordc project (though I think they're now targeting corporate users and charging a ton—I snuck in when they were giving out free 'community' accounts).
posted by waninggibbon at 8:19 AM on March 17, 2015 [8 favorites]
Selector Gadget is pretty sweet gadget for figuring out good selectors.
posted by ethansr at 8:22 AM on March 17, 2015 [6 favorites]
I used to fantasize about being able to do sql joins on tabular content served by separate websites.
SELECT st.title, st.times, rt.tomatometer
FROM http://mylocaltheater.com/showtimes AS st
JOIN http://rottentomatoes.com/browse/in-theaters AS rt ON st.title = rt.title
WHERE rt.tomatometer > 70
posted by a snickering nuthatch at 8:24 AM on March 17, 2015 [6 favorites]
Jpfed, you could do something like that using YQL.
posted by waninggibbon at 8:32 AM on March 17, 2015 [6 favorites]
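For the record, a hedged sketch of what that might have looked like: the YQL html table and the public query endpoint are as Yahoo documented them around this time, but the xpath is made up and the theater URL is Jpfed's hypothetical one.
import requests

YQL_ENDPOINT = "https://query.yahooapis.com/v1/public/yql"  # Yahoo's public YQL endpoint circa 2015

# SQL-ish query over a live page; the xpath here is illustrative only
query = """
select * from html
where url = "http://mylocaltheater.com/showtimes"
and xpath = "//table[@id='showtimes']//tr"
"""

resp = requests.get(YQL_ENDPOINT, params={"q": query, "format": "json"})
resp.raise_for_status()
print(resp.json()["query"]["results"])
The full cross-site JOIN from the comment above would most likely mean running a second query against the Rotten Tomatoes page and matching up titles yourself.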
Is this bandwidth intensive for the sites it's scraping?
No more so than if you were to scrape them yourselves. While Kimono is very easy, in my experience, it's also slow as a dog and sometimes seems to die but shhhh, so it might be trying to play nice with rate limits.
posted by Going To Maine at 8:55 AM on March 17, 2015 [1 favorite]
For the more command-line-shell inspired among us, I have to say that piping curl stdout into html2 makes scraping trivial with grep and awk and suchlike.
The xml2/html2 tools take structured xmlish stuff and turn the stream into output not entirely unlike find(1) output. The result is that you get all the nested hierarchy /in/path/like/layout on the left of the =, and all kinds of attribute and content data on the right. The big drawback is that it assumes perfectly-formed XML-like structure and never really got updated for html5 (by the by, the name was a coincidence), but when it works it is heavenly.
You can also use it to generate html:
$ echo '/html/body/p/ol/li=Metafilter: pretty sweet gadget for figuring out good selectors' | 2html
posted by rum-soaked space hobo at 9:33 AM on March 17, 2015 [4 favorites]
I like this thread because the comments are all improving my FPP.
posted by Going To Maine at 9:45 AM on March 17, 2015 [3 favorites]
I used scraperwiki back in the day. It was a pain in the ass to get working, but man when it started flying, watching that glorious data pour in for my job was so worth it.
posted by msbutah at 10:10 AM on March 17, 2015
Don't worry guys. This will all be unnecessarily once we're all using XML for everything.
posted by schmod at 10:39 AM on March 17, 2015 [18 favorites]
If you've never experienced the agony of hand-scraping the html you wrote by hand many years ago in order to extract 70 rows from a table because you no longer have the plain text ... I don't recommend it. BeautifulSoup was a revelation when I discovered it.
And thanks for the earworm.
posted by RedOrGreen at 11:19 AM on March 17, 2015
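For anyone who hasn't tried it, a minimal Beautiful Soup sketch of exactly that kind of rescue job; the filename and the assumption of a single table are made up for illustration.
from bs4 import BeautifulSoup

# Pull the rows back out of an old hand-written page (filename is hypothetical)
with open("old_page.html", encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

for tr in soup.find("table").find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all(["td", "th"])]
    if cells:
        print("\t".join(cells))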
♫Do you want to scrape a website?♫
Correct me if I'm wrong but I assume this goes to the same tune of "Are you ready for some football?"
posted by Ratio at 11:47 AM on March 17, 2015
Looks like someone doesn't have kids
posted by RustyBrooks at 11:51 AM on March 17, 2015 [4 favorites]
Looks like someone doesn't have kids
Are you talkin' to me? I have 2 kids.
posted by Ratio at 12:00 PM on March 17, 2015
Correct me if I'm wrong
I'll do so even if you are right! It's a quote of Do You Want To Know A Secret? by The Beatles, I thought
posted by thelonius at 12:01 PM on March 17, 2015 [1 favorite]
I'll do so even if you are right! It's a quote of Do You Want To Know A Secret? by The Beatles, I thought
Well, no, but it seems like it's all in the ear of the beholder.
posted by Going To Maine at 12:04 PM on March 17, 2015
Are you talkin' to me? I have 2 kids.
I guess they don't like to build snowmen?
posted by RustyBrooks at 12:10 PM on March 17, 2015
I guess they don't like to build snowmen?
They do... is there some new snowman song that all the kids are singing these days? Is this a Disney thing?
posted by Ratio at 12:22 PM on March 17, 2015
It's a song from Frozen, "Do you want to build a snowman"
posted by RustyBrooks at 12:25 PM on March 17, 2015
It's a song from Frozen, "Do you want to build a snowman"
Oh.
Well, now I want to hear it sung by Hank Williams Jr.
Come on, Internet, don't fail me.
posted by Ratio at 12:30 PM on March 17, 2015
Ugh. Sorry for the typo above. I guess I'm conditioned to use the word "unnecessarily" a lot when talking about various aspects of XML...
posted by schmod at 1:00 PM on March 17, 2015
Whoa, YQL looks awesome! Thanks for sharing that.
posted by Doleful Creature at 1:03 PM on March 17, 2015
Some general thoughts for DIY scraping (a rough sketch pulling several of these together follows below):
Rate limit, rate limit, rate limit. Sites are often poorly engineered and run on very limited hardware. Be cool, and include some sort of back-pressure mechanism so you don't inadvertently DoS them, even from a laptop on wifi. Database access and CPU are often tighter limitations than bandwidth.
Be conscious of robots.txt
Acquiring raw HTML / JSON / Whatever is a separate concern from extracting data. You should limit your impact on sites by gathering local copies of files to process.
Include sanity checks as you are spidering your way around the site. If you are getting 404s, 500s, or other errors from the site, deal with them as part of the spidering process. Don't let them sneak through, because handling spidering errors in your processing code will create unnecessary complexity.
Be especially careful with links that can lead to infinite loops. Pagination is often very poorly implemented, even on well-regarded sites. Don't be surprised if page 10 loops back to page 1 while claiming it's page 11. Robots are often the only users who traverse these links beyond a certain point, so it isn't surprising they aren't well tested.
Try to run your processing routines on local copies; this will make recovering from failures much easier. Accessing from disk will be orders of magnitude faster, and if you do a bad job processing the pages you can just fix your code and rerun the whole batch. This also avoids the moral hazard with rate limits.
If possible, you should generally use a parser on the raw format and use XPath or CSS to extract information from specific elements. Regex will cause you pain, but can be useful to pare down the text within an element.
XPath is your friend. It is a bit more verbose than CSS, but it makes some types of selectors much more comprehensible, notably when you want to use a small element to find one of its parents. Many libraries will let you use both, and they are visually distinct so I would suggest writing the query in the form that will make it easiest to decipher later.
JavaScript is becoming much more important for sites, and often you'll find important information for the page bootstrapped in a script tag somewhere. This complicates life because it then requires you to parse the JavaScript. Most of the time you can probably get by with a regex, although tools like Esprima make me wonder whether static analysis isn't a better route.
posted by ethansr at 1:24 PM on March 17, 2015 [9 favorites]
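Here's the rough sketch mentioned in ethansr's list, pulling together the rate-limiting, local-caching, error-checking, and pagination-loop tips in Python. Everything concrete in it (the URL, the cache directory, the XPaths) is a placeholder rather than anything from a real site.
import hashlib
import os
import time
from urllib.parse import urljoin

import requests
from lxml import html  # tolerant of real-world tag soup; supports XPath and CSS

START = "http://example.com/listings"   # placeholder starting page
CACHE_DIR = "raw_pages"                 # keep raw copies; parse from disk later
DELAY = 2.0                             # seconds between requests -- be cool

os.makedirs(CACHE_DIR, exist_ok=True)

def fetch(url):
    """Politely fetch a page once, caching the raw HTML to disk."""
    path = os.path.join(CACHE_DIR, hashlib.md5(url.encode()).hexdigest() + ".html")
    if os.path.exists(path):                 # already have a local copy
        with open(path, encoding="utf-8") as f:
            return f.read()
    time.sleep(DELAY)                        # crude rate limit / back-pressure
    resp = requests.get(url, headers={"User-Agent": "polite-research-bot"})
    resp.raise_for_status()                  # surface 404s/500s while spidering,
                                             # not later in the processing stage
    with open(path, "w", encoding="utf-8") as f:
        f.write(resp.text)
    return resp.text

url, seen = START, set()
while url and url not in seen:               # guard against pagination loops
    seen.add(url)
    doc = html.fromstring(fetch(url))
    for row in doc.xpath("//table[@id='results']//tr"):   # placeholder XPath
        print([cell.text_content().strip() for cell in row.xpath("./td")])
    nxt = doc.xpath("//a[@rel='next']/@href")              # placeholder selector
    url = urljoin(url, nxt[0]) if nxt else None
With the raw pages cached on disk, the extraction step can be fixed and rerun as many times as needed without touching the site again.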
schmod: MY SEMANTIC WEB: LET ME SHOW YOU IT
posted by rum-soaked space hobo at 2:34 PM on March 17, 2015 [3 favorites]
Also yes: a hearty +1 for xpath. In many ways it shares inspiration with the html2 tools I linked above. It's also not without its warts, but as mentioned earlier your scraper will break.
posted by rum-soaked space hobo at 2:35 PM on March 17, 2015
I still haven't had the time to fix the scraper that powers Secret Metafilter, which was broken by the redesign. Pull requests welcome.
posted by jjwiseman at 2:47 PM on March 17, 2015
I've been playing with import.io for about 6 months, I think (and it's pretty cool), but I've been hesitant to actually build anything on top of it for people to use, because I've kinda felt like it was only going to be a matter of time until it went away. Hope I'm wrong.
posted by stavrosthewonderchicken at 3:36 PM on March 17, 2015
Man, they really want you to download that app when you enter various URLs in.
posted by Seekerofsplendor at 7:32 PM on March 17, 2015
YQL is also neat because it lets you do cross-domain requests with jQuery without running into CORS issues. I used it for the GET request in a Tampermonkey script.
posted by waninggibbon at 10:55 AM on March 18, 2015
And, oh gee, this is a really great feature that morph.io announced yesterday: PhantomJS support, built in. So you can scrape annoying javascript-y sites.
posted by waninggibbon at 2:19 PM on March 18, 2015 [2 favorites]
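If you'd rather stay local than use morph.io, the same idea can be sketched by driving PhantomJS from Python via Selenium. This is only a sketch under the assumption that the phantomjs binary is installed and on your PATH; the URL is a placeholder.
from selenium import webdriver
from bs4 import BeautifulSoup

# Render a JavaScript-heavy page before parsing it (URL is a placeholder)
driver = webdriver.PhantomJS()          # assumes phantomjs is on your PATH
driver.get("http://example.com/javascript-heavy-page")
soup = BeautifulSoup(driver.page_source, "html.parser")  # the DOM now includes script-generated content
print(soup.title.string)
driver.quit()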
My spouse wrote Beautiful Soup and I am so glad whenever people can use it to save time. Yay! :)
I wish Aaron Swartz were alive to see these new tools.
posted by brainwane at 8:35 AM on March 23, 2015 [6 favorites]
I use BeautifulSoup all the time. It is the best.
posted by Going To Maine at 9:22 AM on March 23, 2015
I want this to be a permanent thread where people post scraping news and tools.
posted by waninggibbon at 6:46 PM on March 23, 2015 [2 favorites]
brainwane, please thank your spouse from me. Scraping websites was one of the first "real programming" things I learned how to do, and Beautiful Soup made (and continues to make) it much easier.
posted by daisyk at 5:46 AM on March 25, 2015
This thread has been archived and is closed to new comments