I want to scrape groupon.com. My problem is that sites like this ask you to join their email service when you load them for the first time, but when you reload the page they show you the page content directly. How do I get past this? I am using PHP for my scripting.
Also, if anyone could suggest a PHP framework or library that makes scraping easy, that would be great.
I would investigate the cURL library for grabbing website content. I'm not sure exactly what information you want to scrape, or whether the refresh will cause an issue, but hopefully this gets you started.
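A minimal sketch of the cURL approach, assuming the "reload" behaviour depends on a cookie set by the first response (the thread does not confirm this). The helper names cookie_curl_options and fetch_after_reload, the user-agent string, and the cookie file path are all made up for illustration; the CURLOPT_* constants are standard PHP cURL options.

```php
<?php
// Hypothetical helper: cURL options that keep cookies between requests.
function cookie_curl_options(string $url, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,        // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true,        // follow redirects
        CURLOPT_COOKIEFILE     => $cookieFile, // read stored cookies and enable the cookie engine
        CURLOPT_COOKIEJAR      => $cookieFile, // save cookies to this file when the handle is closed
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (compatible; ExampleScraper/1.0)', // assumed value
        CURLOPT_TIMEOUT        => 30,
    ];
}

// Hypothetical helper: request the page twice on the same handle, so any
// cookie set by the first response is sent along with the second request.
function fetch_after_reload(string $url, string $cookieFile): string
{
    $ch = curl_init();
    curl_setopt_array($ch, cookie_curl_options($url, $cookieFile));

    curl_exec($ch);          // first visit: may return the email sign-up page
    $html = curl_exec($ch);  // "reload": same URL, now with the stored cookies

    if ($html === false) {
        $error = curl_error($ch);
        curl_close($ch);
        throw new RuntimeException('cURL request failed: ' . $error);
    }

    curl_close($ch);
    return $html;
}

// Example call (performs real network requests, so it is left commented out):
// echo fetch_after_reload('http://www.groupon.com/', sys_get_temp_dir() . '/cookies.txt');
```

If the second response still shows the sign-up prompt, the site is probably deciding on something other than cookies, and a browser-based tool may be the easier route.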
We use iMacros. PRO: works in the browser, works with any website. CON: not as fast as cURL. Of course, nothing stops you from using both.
Must you stick with PHP for the scraping? TestPlan makes this type of testing easy. You can either access the page again, or simply use TestPlan to sign up for their email list to gain extended access to their site.
Here's a rough example that takes you to the main page and closes the little popup:
GotoURL http://www.groupon.com/
Click id:step_one
SubmitForm with
%Params:subscription[email_address]% somewhere@test.domain.xx
end
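If you do stay with PHP, the same form submission could be roughly sketched with cURL. Only the field name subscription[email_address] comes from the script above; the form's action URL is not given in this thread, so it is passed in as a parameter, and the helper name signup_post_options and the cookie file path are made up for illustration.

```php
<?php
// Hypothetical helper: cURL options for POSTing the sign-up form.
// $actionUrl must be taken from the form's "action" attribute on the real page;
// it is not known from this thread.
function signup_post_options(string $actionUrl, string $email, string $cookieFile): array
{
    // Builds "subscription%5Bemail_address%5D=...", which is the URL-encoded
    // form of the field name subscription[email_address].
    $body = http_build_query(['subscription' => ['email_address' => $email]]);

    return [
        CURLOPT_URL            => $actionUrl,
        CURLOPT_POST           => true,
        CURLOPT_POSTFIELDS     => $body,
        CURLOPT_RETURNTRANSFER => true,        // return the response body as a string
        CURLOPT_FOLLOWLOCATION => true,        // follow any redirect after the POST
        CURLOPT_COOKIEFILE     => $cookieFile, // send cookies from earlier requests
        CURLOPT_COOKIEJAR      => $cookieFile, // keep cookies set by the sign-up response
    ];
}

// Example call (performs a real network request, so it is left commented out):
// $ch = curl_init();
// curl_setopt_array($ch, signup_post_options($actionUrl, 'somewhere@test.domain.xx', '/tmp/cookies.txt'));
// $html = curl_exec($ch);
// curl_close($ch);
```

The real form may also carry hidden fields (a CSRF token, for example) that would have to be read from the page and sent along; this sketch does not handle that.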
They have an API, if that helps: http://www.groupon.com/pages/api
Source: http://stackoverflow.com/questions/3843733/web-scraping-groupon