How to scrape data from a website
Posted: 10/03/2014
Sometimes, websites host tons of useful data that you would like to collect and collate in a spreadsheet. We all know how to copy and paste, but that becomes a tedious task when the data is spread across many different pages of the website. There is a useful skill that solves this: scraping websites.
Some software can help us do this. I have chosen OutWit, a web collection engine. You can apply it to the information displayed on any website by telling it which fields you want to extract, normally by pasting in the HTML tag that defines each field. However, some websites might be more difficult to scrape if their HTML is badly formatted. You also have to make sure that the information is not hidden behind a paywall or an authentication system designed to prevent automatic access, such as CAPTCHA codes. The main steps to use it are:
- Install the OutWit extension for Mozilla Firefox. You can scrape one hundred rows of data for free.
- Open a website with it.
- Create a new scraper by clicking on “scrapers”. This will collect the data from the fields you tell it to.
- Insert the data field you want to scrape by double-clicking. Copy the beginning of the code of the field you want to scrape and paste it into “Marker Before”, then paste the ending of that piece of code into “Marker After”. For instance, if you want to scrape the title of each page of that website, you can look for the tag “<h2>” in the code and paste the following piece into the “Marker Before” column of your scraper: <div class="details"><h2> and then paste </h2> into “Marker After”.
- Finally, press “Execute” to see the result.
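To make the “Marker Before” / “Marker After” idea more concrete, here is a minimal Python sketch of the same principle: keep whatever text sits between two known pieces of HTML. This is not how OutWit itself is implemented, just an illustration. The function name extract_between and the sample page are made up for this example, and it assumes the class attribute is written with straight quotes.

```python
import re


def extract_between(html, marker_before, marker_after):
    """Return every piece of text found between the two markers."""
    # Escape the markers so characters such as "<" or "." are matched literally.
    pattern = re.escape(marker_before) + r"(.*?)" + re.escape(marker_after)
    # DOTALL lets a field span several lines; ".*?" stops at the first closing marker.
    return [match.strip() for match in re.findall(pattern, html, flags=re.DOTALL)]


# Hypothetical page source, invented for this example.
sample_html = """
<div class="details"><h2>First article</h2><p>Some text.</p></div>
<div class="details"><h2>Second article</h2><p>More text.</p></div>
"""

titles = extract_between(sample_html, '<div class="details"><h2>', "</h2>")
print(titles)  # ['First article', 'Second article']
```

Matching on literal markers is simple, but it breaks as soon as the page changes its markup slightly, which is one reason badly formatted HTML is harder to scrape.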
Other useful tools to help you scrape data are:
- Readability: to extract text from a page.
- DownThemAll: to download many files at once.
- Firebug: an extension for Firefox that helps you inspect how a website is structured.
- ScraperWiki: a wiki page that helps you understand scraper code in many different programming languages, such as Python, Ruby and PHP, so you can then apply that knowledge when using OutWit.