Website Crawler API Documentation

What is Website Crawler API?

Crawling is a resource-intensive and difficult task. Website Crawler handles it with ease. Our API enables developers to harness the capabilities of WebsiteCrawler.org through the 5 simple endpoints discussed below. Before your program, project, or platform starts making requests, you'll have to generate an API key.

Getting the API key

Register a new account or sign in to WebsiteCrawler with your Google account, then navigate to the settings page. Find the Generate API key button and click it. That's it! Your API key will be generated instantly. You can either integrate the following endpoints into your project, software, or platform, or use our Python SDK or Java library JAR file, which provide functions you can call programmatically.

The 5 endpoints

Website Crawler API offers 5 endpoints with which you can run the crawler, get the last URL processed by WC, and retrieve a JSON array of the scraped data. To run our crawler, simply call the following endpoints.

Note: The base URL is https://www.websitecrawler.org/api.

Endpoint 1: GET /crawl/start

This endpoint submits a URL to the crawler and also returns the live crawling status. The status can be NULL (for the first request), "Crawling", or "Completed!". The first call to this endpoint submits the URL; subsequent calls return the status.

Here are the required parameters:

  • URL: A non-redirecting main URL (domain) of the website. For example https://www.websitecrawler.org
  • Limit: Number of links WC should visit and crawl
  • Key: The API key you have generated

Sample request


https://www.websitecrawler.org/api/crawl/start?url=MY_URL&limit=MY_LIMIT&key=YOUR_API_KEY    //replace MY_URL, MY_LIMIT, YOUR_API_KEY

Sample response of the above request

{
  "status": "Crawling"
}
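
If you're calling the API from your own code rather than through the Python SDK or Java library, the flow might look like the following minimal Python sketch. It assumes the third-party requests package is installed; the example URL, the limit of 25, and the 10-second polling interval (taken from the free-plan pacing described under Rate limiting below) are placeholders.

import time
import requests

BASE = "https://www.websitecrawler.org/api"

def start_crawl(url, limit, key):
    """First call submits the URL; later calls return the crawl status."""
    resp = requests.get(f"{BASE}/crawl/start",
                        params={"url": url, "limit": limit, "key": key})
    resp.raise_for_status()
    return resp.json().get("status")

status = start_crawl("https://www.websitecrawler.org", 25, "YOUR_API_KEY")
while status != "Completed!":
    time.sleep(10)  # free-plan pacing; see Rate limiting below
    status = start_crawl("https://www.websitecrawler.org", 25, "YOUR_API_KEY")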

Endpoint 2: GET /crawl/currentURL

With this endpoint, you can retrieve the URL WebsiteCrawler is currently processing.

These are the required parameters:

  • URL: A non-redirecting main URL (domain) of the website. For example https://www.websitecrawler.org
  • Key: Your API key

Sample request

https://www.websitecrawler.org/api/crawl/currentURL?url=MY_URL&key=YOUR_API_KEY    //replace MY_URL, YOUR_API_KEY

Sample response of the above request

{
  "currentURL": "https://example.com"
}
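
While a crawl is in progress, you can poll this endpoint to show progress to your users. A minimal sketch under the same assumptions as the previous example (the requests package, placeholder URL and key):

import requests

BASE = "https://www.websitecrawler.org/api"

def current_url(url, key):
    """Return the URL the crawler is working on right now."""
    resp = requests.get(f"{BASE}/crawl/currentURL",
                        params={"url": url, "key": key})
    resp.raise_for_status()
    return resp.json().get("currentURL")

print(current_url("https://www.websitecrawler.org", "YOUR_API_KEY"))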

Endpoint 3: GET /crawl/cwdata

This endpoint lets users retrieve LLM-ready structured JSON data of the crawled website. The data will be returned only if the crawling status is "Completed!".

These parameters are required:

  • URL: A non-redirecting main URL (domain) of the website. For example https://www.websitecrawler.org
  • Key: The API key you have generated

Sample request

https://www.websitecrawler.org/api/crawl/cwdata?url=MY_URL&key=YOUR_API_KEY    //replace MY_URL, YOUR_API_KEY

Sample response of the above request

[
  {
    "tt": "",
    "np": "12",
    "h1": "Pramod",
    "nw": "540",
    "h2": "",
    "ul___li___a___span": "",
    "h4": "",
    "h5": "",
    "atb": "",
    "sc": "200",
    "ELSC": "",
    "md": ""
  }
]
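
Once endpoint 1 reports "Completed!", the scraped data can be fetched and handed to the rest of your pipeline. A minimal Python sketch, again assuming the requests package; the field names h1 and sc are taken from the sample response above:

import requests

BASE = "https://www.websitecrawler.org/api"

def crawl_data(url, key):
    """Fetch the JSON array of scraped data for a finished crawl."""
    resp = requests.get(f"{BASE}/crawl/cwdata",
                        params={"url": url, "key": key})
    resp.raise_for_status()
    return resp.json()  # a list of objects, one per crawled page

for page in crawl_data("https://www.websitecrawler.org", "YOUR_API_KEY"):
    print(page.get("h1"), page.get("sc"))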

Endpoint 4: GET /crawl/clear

Once the crawl job is over and endpoint 1 returns the status "Completed!", you can use this endpoint to clear the crawl job status.

These are the required parameters:

  • URL: A non-redirecting main URL (domain) of the website. For example https://www.websitecrawler.org
  • Key: Your API key

Sample request

https://www.websitecrawler.org/api/crawl/clear?url=MY_URL&key=YOUR_API_KEY    //replace MY_URL, YOUR_API_KEY

Sample response of the above request

{
  "clearStatus": "Job cleared"
}
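
Clearing the finished job frees the crawler for your next submission. A minimal sketch under the same assumptions as the earlier examples:

import requests

BASE = "https://www.websitecrawler.org/api"

def clear_job(url, key):
    """Clear a finished crawl job so a new URL can be submitted."""
    resp = requests.get(f"{BASE}/crawl/clear",
                        params={"url": url, "key": key})
    resp.raise_for_status()
    return resp.json().get("clearStatus")

print(clear_job("https://www.websitecrawler.org", "YOUR_API_KEY"))  # "Job cleared"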

Endpoint 5: GET /crawl/waitTime

With this endpoint, you can retrieve the waitTime (the time you have to wait before making your next request to receive the latest response). The Python SDK and the Java library have been configured to receive the waitTime in real time.

These are the required parameters:

  • Key: Your API key

Sample request

https://www.websitecrawler.org/api/crawl/waitTime?key=YOUR_API_KEY    //replace YOUR_API_KEY

Sample response of the above request

{
  "waitTime": "10"
}
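
If you're not using the SDKs, you can apply the waitTime yourself between calls. A sketch assuming the requests package; the waitTime field name and numeric value are inferred from the endpoint name and the free-plan pacing, so check your actual responses:

import time
import requests

BASE = "https://www.websitecrawler.org/api"

def wait_time(key):
    """Ask the API how long to wait before the next request."""
    resp = requests.get(f"{BASE}/crawl/waitTime", params={"key": key})
    resp.raise_for_status()
    return float(resp.json().get("waitTime", 10))  # fall back to free-plan pacing

time.sleep(wait_time("YOUR_API_KEY"))  # then make the next API call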

Rate limiting

Our crawler is multi-threaded and asynchronous. However, to make sure the API is not misused, the WC free plan processes one request every 10 seconds, and the number of links it processes per day depends on your plan. Make sure your application makes one request every 3 seconds if you're using a paid plan, or one request every 15 to 20 seconds if you're using the free plan.
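
To stay within these limits without scattering sleeps through your code, you can centralize pacing in one helper. A sketch assuming the requests package and the free-plan interval; drop MIN_INTERVAL to 3 on a paid plan:

import time
import requests

BASE = "https://www.websitecrawler.org/api"
MIN_INTERVAL = 15  # seconds between requests on the free plan
_last_call = 0.0

def throttled_get(path, **params):
    """GET an API endpoint, never faster than one call per MIN_INTERVAL."""
    global _last_call
    wait = MIN_INTERVAL - (time.time() - _last_call)
    if wait > 0:
        time.sleep(wait)
    _last_call = time.time()
    resp = requests.get(f"{BASE}/{path}", params=params)
    resp.raise_for_status()
    return resp.json()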

Integration example

XML Sitemap Generator is a Swing-based Java application powered by the WebsiteCrawler.org API. It has an option to enter the API key, and the main interface becomes visible only after the user enters one. On the main interface, there's a text field for the URL, a field for the number of links to crawl, and a text area where the visited links are displayed. The application also features a button to generate a sitemap and a status section that displays important messages, including any errors encountered by WC while crawling the site.