Website Crawler API Documentation

What is Website Crawler API?

Crawling is a resource-intensive and difficult task, and Website Crawler handles it with ease. Our API lets developers harness the capabilities of WebsiteCrawler.org through 3 simple endpoints, discussed below. Before your program starts making requests, you'll have to generate an API key.

Getting started

Register a new account or sign in to WebsiteCrawler with your Google account. Navigate to the settings page, find the Generate API key button, and click it. That's it: your API key will be generated instantly.

The 3 endpoints

WC provides 3 endpoints with which you can start the crawler, get the last URL processed by WC, and retrieve a JSON array of the scraped data. The base URL is https://www.websitecrawler.org/api. To run the crawler, simply call the endpoints described below.

Endpoint 1: GET /crawl/start

Here are the required parameters:

  • URL: A non-redirecting main URL (domain) of the website. For example https://www.websitecrawler.org
  • Limit: Number of links WC should visit and crawl
  • Key: The API key you have generated

Sample request


https://www.websitecrawler.org/api/crawl/start?url=MY_URL&limit=MY_LIMIT&key=YOUR_API_KEY    //replace MY_URL,YOUR_API_KEY,MY_LIMIT
                    

Sample response of the above request


                        
{
  "status": "Crawling"
}
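As a minimal sketch (the helper name and the use of Python are ours, not part of the API), the start request can be assembled with the standard library:

```python
from urllib.parse import urlencode

# Base URL taken from the documentation above.
BASE = "https://www.websitecrawler.org/api"

def build_start_url(url, limit, key):
    """Assemble the /crawl/start request from its three required parameters."""
    query = urlencode({"url": url, "limit": limit, "key": key})
    return f"{BASE}/crawl/start?{query}"

# The url value is percent-encoded automatically.
print(build_start_url("https://www.websitecrawler.org", 50, "YOUR_API_KEY"))
```

Fetching this URL with any HTTP client starts the crawl; the response is the small JSON status object shown above.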
                    

Endpoint 2: GET /crawl/currentURL

These are the required parameters:

  • URL: A non-redirecting main URL (domain) of the website. For example https://www.websitecrawler.org
  • Key: Your API key
Sample request


https://www.websitecrawler.org/api/crawl/currentURL?url=MY_URL&key=YOUR_API_KEY    //replace MY_URL,YOUR_API_KEY

Sample response of the above request


{
  "currentURL": "https://example.com"
}
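A short sketch of consuming this response body (the function name is ours; `json` is the Python standard library):

```python
import json

def parse_current_url(body):
    """Pull the last URL processed by the crawler out of the JSON response body."""
    return json.loads(body)["currentURL"]

# Body mirrors the sample response above.
print(parse_current_url('{"currentURL": "https://example.com"}'))
# https://example.com
```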
                        

Endpoint 3: GET /crawl/cwdata

These parameters are required:

  • URL: A non-redirecting main URL (domain) of the website. For example https://www.websitecrawler.org
  • Key: The API key you have generated

Sample request


https://www.websitecrawler.org/api/crawl/cwdata?url=MY_URL&key=YOUR_API_KEY    //replace MY_URL,YOUR_API_KEY

Sample response of the above request


[
  {
    "tt": "",
    "np": "12",
    "h1": "Pramod",
    "nw": "540",
    "h2": "",
    "ul___li___a___span": "",
    "h4": "",
    "h5": "",
    "atb": "",
    "sc": "200",
    "ELSC": "",
    "md": ""
  }
]
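The array can be consumed with any JSON parser. A sketch (the helper name is ours; the field abbreviations such as "h1" and "sc" come from the sample above, and their exact meanings are not documented here):

```python
import json

# Trimmed body mirroring the sample response above.
body = '[{"h1": "Pramod", "nw": "540", "sc": "200"}]'

def field_values(body, field):
    """Collect one field (e.g. "h1") from every crawled-page entry."""
    return [entry.get(field, "") for entry in json.loads(body)]

print(field_values(body, "h1"))  # ['Pramod']
```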
    
                        

Rate limiting

Our crawler is multi-threaded and asynchronous. However, to make sure that the API is not misused, the WC free plan processes one request every 10 seconds, and the number of links it processes per day depends on your plan. Make sure your application makes at most one request every 15 to 20 seconds.
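A sketch of a client-side pacing loop under these limits (the helper, the injectable sleep, and the assumption that the URL parameter is `url` per the parameter lists above are our own, not part of the API):

```python
import time
from urllib.parse import urlencode

BASE = "https://www.websitecrawler.org/api"
POLL_INTERVAL = 20  # seconds; stays safely above the free plan's 10-second minimum

def paced_requests(url, key, attempts, sleep=time.sleep):
    """Yield successive /crawl/currentURL request URLs, pausing between them."""
    query = urlencode({"url": url, "key": key})
    for i in range(attempts):
        if i:  # no pause before the first request
            sleep(POLL_INTERVAL)
        yield f"{BASE}/crawl/currentURL?{query}"

# A no-op sleep is passed here only to demonstrate the loop without waiting.
for request in paced_requests("https://www.websitecrawler.org", "YOUR_API_KEY",
                              attempts=2, sleep=lambda s: None):
    print(request)
```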

Integration example

XML Sitemap Generator is a Swing-based Java application powered by the WebsiteCrawler.org API. It has an option to enter the API key, and the main interface becomes visible only after the user enters one. On the main interface there is a text field to enter the URL, a field for the number of links to crawl, and a text area where the visited links are displayed. The application also features a button to generate a sitemap and a status section that displays important messages, including any errors WC encounters while crawling the site.