Website Crawler API Documentation

What is Website Crawler API?

Crawling is a resource-intensive and difficult task. Website Crawler handles it with ease. Our API enables developers to harness the capabilities of WebsiteCrawler.org through six simple endpoints, discussed below. Before your program, project, or platform starts making requests, you'll have to generate an API key.

Getting the API key

Register a new account or sign in to WebsiteCrawler with your Google account. Navigate to the settings page, find the Generate API key button, and click it. That's it! Your API key will be generated instantly. You can either integrate the following endpoints into your project, software, or platform, or use our Python SDK or Java library (JAR file), which provide functions you can call programmatically.

The 6 endpoints

Website Crawler API offers 6 endpoints with which you can run the crawler, get the last URL processed by WC, and retrieve a JSON array of the scraped data. To run our crawler, simply pass the required data to the following endpoints.

Note: The base URL is https://www.websitecrawler.org/api.

Endpoint 1: POST /crawl/authenticate

Before your application starts making any requests, it has to retrieve a token. The token is generated only if the API key is valid. You must send this token (replacing api_generated_token in the samples below) in every POST request.

Sample request

curl -X POST https://www.websitecrawler.org/api/crawl/authenticate \
     -H "Content-Type: application/json" \
     -d '{"apiKey": "your_api_key"}'

Sample response of the above request

{
  "token": "some_api_generated_token"
}
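
For example, the following Python sketch (using the requests library directly rather than our SDK; the variable names are illustrative) retrieves a token for later requests:

import requests

BASE_URL = "https://www.websitecrawler.org/api"

def get_token(api_key):
    # Exchange the API key for a token (see the sample request above).
    response = requests.post(
        f"{BASE_URL}/crawl/authenticate",
        json={"apiKey": api_key},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["token"]

token = get_token("your_api_key")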
                     
                        

Endpoint 2: POST /crawl/start

This endpoint submits a URL to the crawler and also returns the live crawling status. The status can be NULL (for the first request), "Crawling", or "Completed!". The first call to this endpoint submits the URL; subsequent calls return the status.

Here are the keys required in the JSON data payload:

  • url: A non-redirecting main URL (domain) of the website, for example https://www.websitecrawler.org
  • limit: The number of links WC should visit and crawl

Sample request


curl -X POST https://www.websitecrawler.org/api/crawl/start \
     -H "Authorization: Bearer api_generated_token" \
     -H "Content-Type: application/json" \
     -d '{"url": "your_url", "limit": "your_limit"}'
                    

Sample response of the above request


                        
{
  "status": "Crawling"
}
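
As a rough Python sketch (using the requests library; the 10-second polling interval is just an assumption based on the rate limits described later), you could submit a URL and then poll the same endpoint until the status becomes "Completed!":

import time
import requests

BASE_URL = "https://www.websitecrawler.org/api"
token = "api_generated_token"  # obtained from /crawl/authenticate (Endpoint 1)

def crawl_site(url, limit, poll_interval=10):
    # The first call submits the URL; subsequent calls return the status.
    headers = {"Authorization": f"Bearer {token}"}
    payload = {"url": url, "limit": str(limit)}
    while True:
        response = requests.post(f"{BASE_URL}/crawl/start",
                                 headers=headers, json=payload, timeout=30)
        response.raise_for_status()
        status = response.json().get("status")
        if status == "Completed!":
            return
        time.sleep(poll_interval)  # assumed interval; see "Rate limiting"

crawl_site("https://www.websitecrawler.org", 12)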
                    

Endpoint 3: POST /crawl/currentURL

With this endpoint, you can retrieve the URL that WebsiteCrawler is currently processing/analyzing.

This is the required JSON key:

  • url: A non-redirecting main URL (domain) of the website, for example https://www.websitecrawler.org

Sample request

curl -X POST https://www.websitecrawler.org/api/crawl/currentURL \
     -H "Authorization: Bearer api_generated_token" \
     -H "Content-Type: application/json" \
     -d '{"url": "your_url"}'

Sample response of the above request

{
  "currentURL": "https://example.com"
}
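
If you want to show progress while a crawl is running, a small Python sketch like the following (requests-based; variable names are illustrative) fetches the URL currently being processed:

import requests

BASE_URL = "https://www.websitecrawler.org/api"
token = "api_generated_token"  # from Endpoint 1

response = requests.post(
    f"{BASE_URL}/crawl/currentURL",
    headers={"Authorization": f"Bearer {token}"},
    json={"url": "https://www.websitecrawler.org"},
    timeout=30,
)
print("Currently crawling:", response.json().get("currentURL"))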
                        

Endpoint 4: POST /crawl/cwdata

This endpoint lets you retrieve LLM-ready structured JSON data for the crawled website. The data is returned only if the crawling status is "Completed!".

Your JSON payload must have the following key:

  • url: A non-redirecting main URL (domain) of the website, for example https://www.websitecrawler.org

Sample request

curl -X POST https://www.websitecrawler.org/api/crawl/cwdata \
     -H "Authorization: Bearer api_generated_token" \
     -H "Content-Type: application/json" \
     -d '{"url": "your_url"}'

Sample response of the above request

[
  {
    "tt": "",
    "np": "12",
    "h1": "Pramod",
    "nw": "540",
    "h2": "",
    "ul___li___a___span": "",
    "h4": "",
    "h5": "",
    "atb": "",
    "sc": "200",
    "ELSC": "",
    "md": ""
  }
]
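
Once the status is "Completed!", a minimal Python sketch like the one below (requests-based; the output file name crawl_data.json is arbitrary) fetches the JSON array and writes it to disk:

import json
import requests

BASE_URL = "https://www.websitecrawler.org/api"
token = "api_generated_token"  # from Endpoint 1

response = requests.post(
    f"{BASE_URL}/crawl/cwdata",
    headers={"Authorization": f"Bearer {token}"},
    json={"url": "https://www.websitecrawler.org"},
    timeout=60,
)
response.raise_for_status()
pages = response.json()  # list of per-page records, as shown above

with open("crawl_data.json", "w", encoding="utf-8") as f:
    json.dump(pages, f, indent=2)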
    
                        

Endpoint 5: POST /crawl/clear

Once the crawl job is over and the API returns the status "Completed!" via endpoint 2, you can use this endpoint to clear the crawl job status.

Key required in the JSON payload:

  • url: A non-redirecting main URL (domain) of the website, for example https://www.websitecrawler.org

Sample request

curl -X POST https://www.websitecrawler.org/api/crawl/clear \
     -H "Authorization: Bearer api_generated_token" \
     -H "Content-Type: application/json" \
     -d '{"url": "your_url"}'

Sample response of the above request

{
  "clearStatus": "Job cleared"
}
                        

Endpoint 6: POST /crawl/waitTime

With this endpoint, you can retrieve the waitTime (the time you have to wait to receive the latest response). The Python SDK and Java library are configured to receive the waitTime in real time.

This is the only required key in the JSON payload:

  • url: A non-redirecting main URL (domain) of the website, for example https://www.websitecrawler.org

Sample request

curl -X POST https://www.websitecrawler.org/api/crawl/waitTime \
     -H "Authorization: Bearer api_generated_token" \
     -H "Content-Type: application/json" \
     -d '{"url": "your_url"}'

Sample response of the above request

    
                            
{
  "waitTime": "10"
}
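
As a Python sketch (requests-based; the 10-second fallback is an assumption taken from the free-plan limit described below), you could read the waitTime and sleep for that long before making your next request:

import time
import requests

BASE_URL = "https://www.websitecrawler.org/api"
token = "api_generated_token"  # from Endpoint 1

response = requests.post(
    f"{BASE_URL}/crawl/waitTime",
    headers={"Authorization": f"Bearer {token}"},
    json={"url": "https://www.websitecrawler.org"},
    timeout=30,
)
wait_time = int(response.json().get("waitTime", 10))  # fall back to 10 s
time.sleep(wait_time)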
                        

Rate limiting

Our crawler is multi-threaded and asynchronous. However, to make sure the API is not misused, the WC free plan processes one request every 10 seconds, and the number of links it processes per day depends on your plan. Make sure your application makes no more than one request every 3 seconds if you've bought the paid plan, or one every 15 to 20 seconds if you're using our free plan.
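
A simple way to respect these limits on the client side is to enforce a minimum interval between requests. The sketch below (plain Python; the 15-second default mirrors the free-plan recommendation above) is one way to do that:

import time

MIN_INTERVAL = 15  # seconds; use 3 on the paid plan, 15-20 on the free plan
_last_request = 0.0

def throttle():
    # Sleep long enough that at least MIN_INTERVAL seconds pass between calls.
    global _last_request
    elapsed = time.monotonic() - _last_request
    if elapsed < MIN_INTERVAL:
        time.sleep(MIN_INTERVAL - elapsed)
    _last_request = time.monotonic()

Call throttle() immediately before each POST to the API.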

Integration example

XML Sitemap Generator is a Swing-based Java application powered by the WebsiteCrawler.org API. It has an option to enter the API key, and the main interface is shown only after the user enters an API key. On the main interface, there is a text field to enter the URL, a field for the number of links to crawl, and a text area where the visited links are displayed. The application also features a button to generate a sitemap and a status section that displays important messages, including any errors encountered by WC while crawling the site.