How to Scrape a Search Engine Results Page for Your SEO Project

Posted June 29th, 2009 by Dale Stokdyk

Have you ever wanted to capture search results and import them into Excel (or your favorite database) for a quick-and-dirty SEO analysis?

I’ll give you a couple of examples from my own experience. Sometimes I want to capture:

  • search engine results pages: how well does a site rank for keywords? who are the competitors? how appealing are the page titles?
  • a web site’s indexed pages: which pages are indexed? how many? any duplicate pages? what keywords are being used?

If I’m interested in only a few keywords or I’m examining a fairly small web site, I can usually jot down some observations and I’m good to go.

But that approach rapidly becomes unwieldy when I’m interested in, say, a dozen or more keywords. Or when I’m examining a site with more than 50 pages. That’s when it would be nice to download the (Google/Yahoo/Bing) search results and import them into a spreadsheet so I can sort, count, and arrange things to my heart’s content!

Harvesting the Web with OutWit Hub

My (some might say anal retentive) desire to capture everything in an Excel spreadsheet led me to OutWit Hub, a free extension for use with the Firefox web browser. Here’s how they describe it:

With OutWit Hub you can find, grab and organize all kinds of data and media from online sources. Automatically explore series of Web pages or search engine results to extract contacts, links, images, data, news, etc.

With a little tinkering, I found that I can use OutWit Hub to grab search results I’m interested in and save them to Excel (the two other export options are CSV and SQL). Let’s look at a few examples so you can judge for yourself how this tool might be useful.

Using OutWit Hub with “Site:” Search

With over 100 pages, a new e-commerce site I examined for my SEO audit of Gemstone Designs was a great opportunity to use OutWit Hub in combination with the site: command to inspect the pages indexed by the major search engines.

To follow along, visit OutWit Technologies to download and then install OutWit Hub for Firefox. After restarting Firefox, you’ll see a new menu button for OutWit Hub in your browser menu:

[Image: new OutWit Hub button in the Firefox browser menu]

We’ll start on Google’s home page and use the “Advanced Search” option to select “100” for “Results per page.” Then we’ll use the “site:” command to find all the indexed pages for a domain by searching for “site:your-domain-name-here.com.” For Gemstone Designs, Google displays 100 results on the first page and an additional 15 on the second, for a total of 115 indexed pages.
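If you would rather skip the Advanced Search form, the same query can be written straight into a URL. Here is a minimal Python sketch of that idea. The function name is made up for illustration, and the `num` parameter is an assumption on my part: it is not part of the steps above, and Google may ignore or cap it.

```python
from urllib.parse import urlencode


def site_search_url(domain, results_per_page=100):
    """Build a Google search URL for a "site:" query.

    Assumption: the `num` query parameter requests more results per page.
    Google may ignore or cap this value.
    """
    params = {"q": f"site:{domain}", "num": results_per_page}
    return "https://www.google.com/search?" + urlencode(params)


# Replace the placeholder domain with the site you are auditing.
print(site_search_url("your-domain-name-here.com"))
```

Paste the printed URL into Firefox and then launch OutWit Hub as described in the next step.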

With the first 100 search results displayed, launch OutWit Hub by clicking on the button in your Firefox menu bar. I’ve already created a scraper for Google searches (more on this in a minute). Here’s how my Google scraper displays the Source URL, Page Title, Page Description, and Page Link (showing the first 11 of 100 lines):

[Image: OutWit Hub scraper window]

We can select the desired rows and use the “Catch” button to capture the first 100 items. Navigate to the next search results page using the “next in series” button to pick up the remaining 15 search results, and then export these results using File > Export Catch As > Excel.
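OutWit Hub does the export for you, but it can help to see what the CSV option amounts to. The sketch below is plain Python, not anything from OutWit Hub itself: it writes rows with the same four columns my Google scraper displays. The sample row is invented for illustration.

```python
import csv
import io

# Column names taken from the scraper window described above.
FIELDS = ["Source URL", "Page Title", "Page Description", "Page Link"]


def rows_to_csv(rows):
    """Serialize scraped rows (dicts keyed by FIELDS) to CSV text."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical row; real rows would come from your scraped results.
sample_rows = [
    {
        "Source URL": "https://www.google.com/search?q=site%3Aexample.com",
        "Page Title": "Rings, Necklaces and More",
        "Page Description": "Handmade gemstone jewelry.",
        "Page Link": "http://example.com/rings",
    }
]

print(rows_to_csv(sample_rows))
```

The csv module quotes any field that contains a comma, so page titles with commas still land in a single spreadsheet cell.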

Investigate Site Rankings with OutWit Hub

Next, let’s use OutWit Hub with Microsoft’s Bing.com search engine to scrape search results for a keyword we’re interested in – let’s say “shoelaces” – and save them in Excel.

Once again, let’s collect more than just the top 10 results. In the upper right-hand corner of the Bing home page select Extras > Preferences > Web settings (middle of page) > Results setting. Let’s select “50” and then save this new preference.

After clicking on the OutWit Hub button, the scraper I’ve created for Bing does a good job of collecting the Page Title, Description and Link, but it also captured 5 paid ads with the organic search results. Not a problem! I’ll just exclude the last 5 lines from my selection when using “Catch” or when exporting directly to Excel.

[Image: scraper for Bing.com search results]
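If you export everything and clean up afterwards, dropping the trailing ad rows is a one-line slice. Here is a small Python sketch, assuming (as in my Bing example) that the ads are the last rows in the list. The function name is hypothetical.

```python
def drop_trailing_rows(rows, count):
    """Return a copy of rows without the last `count` entries.

    The guard matters: rows[:-0] is an empty list in Python,
    so a count of zero has to be handled separately.
    """
    if count <= 0:
        return list(rows)
    return list(rows[:-count])


# 50 organic results followed by 5 paid ads, as in the Bing example.
scraped = [f"organic {i}" for i in range(1, 51)] + [f"ad {i}" for i in range(1, 6)]
organic_only = drop_trailing_rows(scraped, 5)
print(len(organic_only))
```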

Create Your Own Scraper

[Image: guess and source buttons]

Scrapers are specific to the page displaying the information, so you’ll need to create separate OutWit Hub scrapers for each search engine — one each for Google search results, Yahoo! Maps searches, and image searches on Bing, for example.

Fortunately, that’s not difficult. Sometimes using OutWit Hub’s “guess,” “list” or “table” data extractors will provide exactly what you’re looking for.

If not, select “source” to view the source code for a search engine’s results page and begin building your own custom scraper.

You do this by identifying unique markers in the source code that appear before and after the data you want.
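The marker idea is easy to see in a few lines of code. This Python sketch is not how OutWit Hub is implemented. It only illustrates the technique of grabbing whatever sits between a "before" marker and an "after" marker. The HTML sample and the markers are invented for illustration, and a real results page will use different markup.

```python
def extract_between(source, before, after):
    """Return every piece of text found between the two markers."""
    results = []
    position = 0
    while True:
        start = source.find(before, position)
        if start == -1:
            break
        start += len(before)
        end = source.find(after, start)
        if end == -1:
            break
        results.append(source[start:end])
        position = end + len(after)
    return results


# Hypothetical markup, much simpler than a real results page.
sample = (
    '<h3><a href="http://example.com/a">First page</a></h3>'
    '<h3><a href="http://example.com/b">Second page</a></h3>'
)

links = extract_between(sample, '<a href="', '">')
titles = extract_between(sample, '">', '</a>')
print(list(zip(titles, links)))
```

Notice that the markers only work because they are unique in the sample. If `">` also appeared elsewhere in the source (inside a class attribute, say), the title extraction would pick up the wrong text, which is why choosing distinctive markers is most of the work.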

Use the Editor, in the window to the right of the source code, to assemble your scraper. Here’s a picture of my Google scraper:

[Image: Editor window for creating custom scrapers]

For more information on how to create a custom scraper, see OutWit Technologies’ Create Your First Scraper.

Basic Scrapers for Google, Yahoo and Bing

OutWit Hub makes it easy to download and share scrapers as xml files. But for these basic scrapers, I’m providing snapshots of the Editor window so you can quickly see the snippets I’ve used to create these scrapers.

Google Scraper

[Image: snippets to create a Google scraper with OutWit Hub]

Yahoo! Scraper

[Image: snippets to create a Yahoo! scraper with OutWit Hub]

Bing Scraper

[Image: snippets to create a Bing scraper with OutWit Hub]

A Caveat or Two to Wrap This Up

Because search engines sometimes format certain results differently (for example, Wikipedia.org and YouTube entries in Google search results), these scrapers will not provide perfect results in every instance.

The Google scraper also stumbles on search results pages with ads. But there are work-arounds, including copying Google’s source code (View > Source Code) into a text file and deleting the ads before using OutWit Hub. Alternatively, use the Scroogle Scraper to get ad-free search results (but you will also need to copy the source code for these results into a text file before scraping).

But for not-a-lot-of-money (did I mention that OutWit Hub is free?!?) and just a little effort, you too can harvest the web for SEO projects that you want to catch in a spreadsheet!

  • mrique

    yes you could use Dapper.net, yahoo pipes or yahoo yql console to

  • chris

    spent bloody ages looking for an easy scraper, thanks!

  • http://en.slowakei-netz.de Mike Slowakei

    I do adore your style of writing, and the thing I love most of all is the tips that you give, especially about scrapers!
    Thanks for the really useful material of high quality published!

  • Anonymous

    Thanks for great information. So far only managed to find a few sources with information concerning optimization tips for Chinese search market. Seems like the biggest concern for most of the people will be relevant and readable by humans Chinese content…

  • http://profiles.yahoo.com/u/G5P6CBOPFYQYUUG3GJURWN4JYU Ajay

    The programs and instructions that run a computer, as opposed to the actual physical machinery and devices that compose the hardware.

  • http://pulse.yahoo.com/_PHNMPXRJVQ5KXV3ZTA5HA7ZCYY rickey gupta

    Thanks for sharing this application here. I needed it for my project. The reason i liked it because it is very simple and easy as you told above. I admire you for making very simple and useful application.

  • http://www.e-optimator.dk/ Optimator

    Great tools, thx a lot!

  • tomas

    Thanks a lot for the tutorial, could you please make a tutorial explaining the macros and jobs parts, because it doesn’t work. Thank you very much again

  • Anonymous

    Thank You SOOOO MUCH!

  • Joe

    NICE WORK! This tip works perfectly!!!

  • http://www.charmsjewelryuk.com/charms Pandora Charms

    Good job, thanks for that. This is a good article. I really appreciate you sharing this great post. Keep up your work.

  • Pandora bracelets

    I really enjoyed the quality information you offer to your visitors on this blog. I will bookmark your blog and have my children check up here often. This blog is valuable for me.

  • Anonymous

    This blog is sharing basic information on scrapers for search engines. I really like this blog and I read it. This blog is sharing good information.

  • http://leadremarks.com Jay

    Thanks this was a really easy scraper to use.

  • Henry

    hi Dale,

    I use your example as a walk through tutorial, using the same site and exact steps per your video. Somehow I am not able to get the same output in the scraped result. Instead of showing me the title/desc/link, what I get is a one row with 17 columns of link.

    This tool is great but there’s not enough video tutorials to show us how to use it.

  • http://www.marketing2oh.com/ Dale Stokdyk

    Hi Henry. The post was written over 2 years ago, and they’ve continued to update the product. It sounds like you’ve found their tutorials (http://www.outwit.com/support/help/tutorials/) but still are having problems; have you tried using the built-in help feature, or contacting them directly?

  • Soma56

    Free and downloads to CSV: http://downloadgooglesearchresults.com/

  • refencement google

    I’m not that much of an internet reader to be honest but your blog’s really nice, keep it up!
    I’ll go ahead and bookmark your site to come back later on. Cheers

  • http://www.balitripreview.com Cherri Huffine

    Amazing blog! Do you’ve got any suggestions for aspiring writers? I’m hoping to start my own blog soon but I’m just a little lost on everything. Would you suggest starting with a free of charge platform like WordPress or go for a paid option? There are so numerous choices out there that I’m completely overwhelmed .. Any concepts? Appreciate it!

  • http://www.mathewporter.co.uk/ Matt Porter

    This is a great little tool, is it possible to use it to export a list of indexed pages from google for my site (site:mydomain.com)?

  • Runy

    Hi, would this work with LinkedIn? Say I search for managers on the LinkedIn site with Google, and I want to get all the managers’ profiles into the Excel sheet. What adjustments do I need to make?

  • http://www.marketing2oh.com Dale Stokdyk

    Cherri, thanks for the kind words. I’m a fan of WordPress, and use it for this blog. It’s widely used, with a huge variety of plug-ins that you can use to customize your blog, so I highly recommend it. If all the options cause your eyes to glaze over, then I’d suggest finding a local web developer who should be able to get you set up and show you how to manage your blog and website (you can manage both from within WordPress).

  • http://www.marketing2oh.com/ Dale Stokdyk

    Mathew, you can use the Google site:yourdomain search, and the current version of Outwit Hub should make it easy to download the results. BUT a better way to check on which pages have been indexed by Google — if that’s what you’re ultimately trying to achieve — is to create a Google Webmaster Tools account and see what Google says (they don’t necessarily show all indexed pages with a site: search).

  • http://www.marketing2oh.com/ Dale Stokdyk

    Hi Runy. I haven’t kept abreast of updates to OutWit Hub, but with a little knowledge of HTML, it’s likely you can use OutWit Hub to scrape LinkedIn search results by identifying the HTML code in front of and after the info you’re interested in.

  • http://www.mathewporter.co.uk/ Mathew Porter

    site:yourdomain will bring back all of the indexed pages including subdomains, WMT doesn’t always track as many, and you can use Google Docs to generate the list in a CSV automatically… ish. Anyway, great tool to make life easier.


