Crawling is a resource-intensive and difficult task, and Website Crawler handles it with ease. You may first want to familiarize yourself with what WebsiteCrawler.org can do. Our API lets developers harness the capability of WebsiteCrawler.org through three simple endpoints, discussed below. Before your program starts making requests, you'll have to generate an API key.
Register a new account or sign in to WebsiteCrawler with your Google account. Navigate to the settings page, find the Generate API key button, and click it. That's it: your API key will be generated instantly.
WC provides three endpoints: one to run the crawler, one to get the last URL processed by WC, and one to retrieve a JSON array of the scraped data. The base URL is https://www.websitecrawler.org/api. To run the crawler, call the following endpoint. The required parameters are url, limit, and key:
https://www.websitecrawler.org/api/crawl/start?url=MY_URL&limit=MY_LIMIT&key=YOUR_API_KEY // replace MY_URL, MY_LIMIT, and YOUR_API_KEY
The API responds with the crawl status:
{
  "status": "Crawling"
}
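For reference, here is a minimal sketch of calling this endpoint from Java 11+ using the built-in HttpClient. The site URL, limit, and API key below are placeholders; substitute your own values:

import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

public class StartCrawl {
    public static void main(String[] args) throws Exception {
        // Placeholder values; substitute your own site URL, crawl limit, and API key.
        String site = URLEncoder.encode("https://example.com", StandardCharsets.UTF_8);
        String endpoint = "https://www.websitecrawler.org/api/crawl/start"
                + "?url=" + site + "&limit=50&key=YOUR_API_KEY";

        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(endpoint)).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // On success the body should look like: {"status": "Crawling"}
        System.out.println(response.body());
    }
}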
To get the last URL processed by WC, call the following endpoint. The required parameters are currentURL and key:
https://www.websitecrawler.org/api/crawl/start?currentURL=MY_URL&key=YOUR_API_KEY // replace MY_URL and YOUR_API_KEY
The response contains the last URL processed:
{
  "currentURL": "https://example.com"
}
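A similar sketch for this endpoint, using the path and parameter names exactly as shown above. The org.json library is an assumption here, used only to pull the currentURL field out of the response:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import org.json.JSONObject;

public class CrawlProgress {
    public static void main(String[] args) throws Exception {
        // MY_URL and YOUR_API_KEY are placeholders, as in the endpoint above.
        String endpoint = "https://www.websitecrawler.org/api/crawl/start"
                + "?currentURL=MY_URL&key=YOUR_API_KEY";

        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response = client.send(
                HttpRequest.newBuilder(URI.create(endpoint)).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // Extract the field from a body like {"currentURL": "https://example.com"}
        String current = new JSONObject(response.body()).getString("currentURL");
        System.out.println("Last URL processed: " + current);
    }
}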
To retrieve the scraped data, call the following endpoint. The required parameters are url and key:
https://www.websitecrawler.org/api/crawl/start?url=MY_URL&key=YOUR_API_KEY // replace MY_URL and YOUR_API_KEY
The response is a JSON array of the scraped data, for example:
[
  {
    "tt": "",
    "np": "12",
    "h1": "Pramod",
    "nw": "540",
    "h2": "",
    "ul___li___a___span": "",
    "h4": "",
    "h5": "",
    "atb": "",
    "sc": "200",
    "ELSC": "",
    "md": ""
  }
]
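A short sketch of walking the returned array, again assuming the org.json library. The sample body here is a trimmed copy of the response shown above; in a real program it would come from the endpoint:

import org.json.JSONArray;
import org.json.JSONObject;

public class ScrapedData {
    public static void main(String[] args) {
        // Sample body in the shape shown above (normally fetched from the endpoint).
        String body = "[{\"tt\":\"\",\"np\":\"12\",\"h1\":\"Pramod\",\"nw\":\"540\",\"sc\":\"200\"}]";

        JSONArray rows = new JSONArray(body);
        for (int i = 0; i < rows.length(); i++) {
            JSONObject row = rows.getJSONObject(i);
            // optString returns "" instead of throwing when a key is missing or empty.
            System.out.println("h1=" + row.optString("h1") + " sc=" + row.optString("sc"));
        }
    }
}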
Our crawler is multi-threaded and asynchronous. However, to prevent misuse of the API, the WC free plan processes one request every 10 seconds, and the number of links processed per day depends on your plan. Make sure your application makes no more than one request every 15 to 20 seconds.
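One way to respect that limit is to gate every request through a shared throttle. A minimal sketch, using the 15-second interval recommended above:

public class Throttle {
    private static final long MIN_INTERVAL_MS = 15_000; // one request per 15 seconds
    private long lastRequest = 0;

    // Call before every API request; blocks until enough time has passed
    // since the previous request made through this throttle.
    public synchronized void awaitTurn() throws InterruptedException {
        long wait = lastRequest + MIN_INTERVAL_MS - System.currentTimeMillis();
        if (wait > 0) {
            Thread.sleep(wait);
        }
        lastRequest = System.currentTimeMillis();
    }
}

Call awaitTurn() immediately before each HttpClient.send(...) so that every thread in your application shares the same request budget.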
XML Sitemap Generator is a Swing-based Java application powered by the WebsiteCrawler.org API. It has an option to enter the API key, and the main interface is visible only after the user enters one. On the main interface, there is a text field for the URL, a field for the number of links to crawl, and a text area where the visited links are displayed. The application also features a button to generate a sitemap and a status section that displays important messages, including any errors WC encounters while crawling the site.
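As a rough, hypothetical sketch of what such an interface could look like in Swing (this is not the application's actual source code):

import javax.swing.*;
import java.awt.BorderLayout;

public class SitemapGeneratorSketch {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            // Ask for the API key first; the main window opens only if one is entered.
            String apiKey = JOptionPane.showInputDialog(null, "Enter your WebsiteCrawler.org API key:");
            if (apiKey == null || apiKey.isBlank()) {
                return;
            }

            JFrame frame = new JFrame("XML Sitemap Generator (sketch)");
            JTextField urlField = new JTextField(30);
            JTextField limitField = new JTextField(5);
            JTextArea visitedLinks = new JTextArea(12, 40);   // visited links appear here
            JButton generateButton = new JButton("Generate sitemap");
            JLabel status = new JLabel("Ready");              // status / error messages

            JPanel top = new JPanel();
            top.add(new JLabel("URL:"));
            top.add(urlField);
            top.add(new JLabel("Links to crawl:"));
            top.add(limitField);
            top.add(generateButton);

            frame.setLayout(new BorderLayout());
            frame.add(top, BorderLayout.NORTH);
            frame.add(new JScrollPane(visitedLinks), BorderLayout.CENTER);
            frame.add(status, BorderLayout.SOUTH);
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.pack();
            frame.setVisible(true);
        });
    }
}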