Website Crawler

Crawl and analyze websites - Extract data

Website Crawler is a SaaS (Software as a Service) that you can use to crawl and analyze up to 100 pages of a website for free in real time. You can run the crawler as many times as you want, up to the set daily limit. Website Crawler is robust and fast. It can generate JSON or CSV files from the extracted data.

Features

Broken Links: WebsiteCrawler makes you aware of unreachable internal and external links on your site. It checks the HTTP status code of every URL on the pages it has analyzed and reports the ones that are broken.
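
The underlying idea is simple HTTP status checking. Here is a minimal sketch of that idea in Python using the requests library; it is an illustration only, not Website Crawler's actual implementation:

```python
import requests  # pip install requests

# A status code of 400 or above (or no response at all) marks a link as broken.
def is_broken(url: str) -> bool:
    try:
        resp = requests.head(url, allow_redirects=True, timeout=10)
        return resp.status_code >= 400
    except requests.RequestException:
        return True  # unreachable counts as broken

print(is_broken("https://example.com/"))
```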

Page speed: This SaaS detects and displays the loading time of the pages on your site. You can filter the list of pages by loading time, so you can find slow and fast pages in no time.

Duplicate titles, meta tags: Multiple title or meta description tags can confuse search bots, especially those indexing your pages for ranking in the search engines. With Website Crawler, you can easily find the pages of a site that have duplicate title or meta tags.

Missing Alt Tags: Search bots index the images displayed on HTML pages and show them in their image search tools. If an image does not have an alt tag, it won't rank for search keywords. This SaaS has a missing alt tag report that you can use to find pages containing images without alt tags.

XML Sitemap: This SaaS can generate an XML sitemap for your site with a click of a button. You can exclude URLs from the sitemap, add a priority, or specify a "changefreq" for the URLs [New feature]. If you're using a CMS or a custom-built site that does not have a sitemap, use this feature.
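
For reference, "priority" and "changefreq" are standard elements of the sitemaps.org protocol, which is what this feature writes for you. A minimal sketch of one sitemap entry, built with Python's standard library (the URL and values are illustrative):

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
url = ET.SubElement(urlset, "url")
ET.SubElement(url, "loc").text = "https://example.com/about"  # page URL (illustrative)
ET.SubElement(url, "changefreq").text = "weekly"              # expected change frequency
ET.SubElement(url, "priority").text = "0.8"                   # relative priority, 0.0 to 1.0

ET.ElementTree(urlset).write("sitemap.xml", encoding="utf-8", xml_declaration=True)
```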

Export data: You can export/download the data displayed in the reports section to a PDF, CSV, or spreadsheet file with a few clicks of a button. There's also an option to export the entire website's data to a file. Website Crawler can also generate LLM-ready structured data, i.e. a JSON file, from the scraped data in just one click.

JavaScript Crawling: This SaaS can execute the JavaScript on web pages and render JS-heavy sites.

Canonical Link issues: One of the major reasons sites suffer from algorithm penalties is improper canonical links. Website Crawler finds invalid canonical links on the pages of your site and displays them.

Pages with/without heading tags: Want to know which pages on your site lack heading tags h1 to h5? Want to find the pages on your site that have short headings or headings containing a specific word? With Website Crawler, it is easy to analyze the h1 to h5 HTML tags used on the pages of a website. You can filter heading tags containing certain words, letters, etc.

The number of internal/external links: This platform can display the number of internal and external links each page on a website has. You can filter the list by URL count with just one click of a button.

Thin pages: A website's ranking can tank after an algorithm update if it has a lot of pages with thin content. Finding thin content on a site is a breeze with this SaaS.

Fast: WebsiteCrawler.org is fast. It can crawl thousands of pages within a few minutes, and it can execute scraping/crawling tasks in the background while you work on other things.

Custom data: You can configure this platform to extract/scrape specific data from the pages of a site. You can see in real time whether the tag whose data you want is fetching anything.

Log files: You can see useful data from your access log files with our log file analyzer [beta].

Bulk check spelling mistakes: WebsiteCrawler can check hundreds of articles for spelling mistakes with one click of a button. After identifying the mistakes, it brings them to your attention.

Who should use Website Crawler?

This SaaS has been designed and built for:

  • Extracting data from websites (eCommerce portals; sites built with WordPress, Blogger, Drupal, Joomla, or any other content management system; sites built from scratch or with site builders; JS-heavy portals; etc.).
  • Analyzing the extracted data to find errors and discover areas for optimization.
  • Exporting the structured data to a file.
  • Integrating with third-party services that require clean structured JSON data.

If you own a website that uses a CMS, use this application: it can help you get rid of plugins and reduce the load on your server, since the SEO analysis is done in the cloud. If you have built your site with a site builder tool or from scratch, you can discover on-page SEO issues with this SaaS. And if you're a researcher, student, etc. working on an AI project and want to train your model on a website's dataset, you can use WebsiteCrawler.

FAQs

What is WebsiteCrawler?

WebsiteCrawler is a SaaS (Software as a Service) that crawls every link it finds on the domain you enter. It does not overwhelm any server, but it does the job like a pro.

How do I use it?

Enter a non-redirecting, reachable website domain (include https, www, http, etc.) and the number of URLs you want this SaaS to analyze, then click the submit button. Once the crawler gets into action, you will see the list of URLs that have been analyzed. This list is updated every 10 to 15 seconds. Once the number of links in the list equals the limit you entered, you will see a form with options to log in with your Google account or register a new account. Proceed with the option of your choice to see the dashboard.

Which websites does it support?

WebsiteCrawler can render JS-heavy sites, so it supports every publicly reachable website. It does not automatically fill in and submit forms; it works only with publicly available information on HTML pages.

Is there a daily limit?

We have set a daily limit of 100 URLs for free plan users. For registered users, this limit is increased to 2000+. How does this work? WebsiteCrawler keeps a record of the total number of links it has crawled for a domain. Once the daily threshold has been reached, entering the domain and limit in the form above and clicking submit will show an error.

Can it crawl an XML sitemap?

Although this SaaS supports JS, some pages of a site may be poorly linked. This is where sitemap crawling comes in handy. To make WebsiteCrawler crawl a sitemap, enter the URL of the sitemap in the "XML sitemap" text box available above. WebsiteCrawler.org will extract each URL from the sitemap file and analyze as many pages as you specify.

Can it scrape custom data?

Yes. On the settings page of WebsiteCrawler.org, there's a "custom tags" section where you select a project, enter a URL, and enter the tags you want this software to scrape (you must enter a CSS selector, e.g. div > p). Fill out this form and click the submit button. If the selector is valid and you see matched results below the form, it will be added to the list of tags that will be crawled.
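
To see what a selector like div > p matches, here is a small illustration in Python using BeautifulSoup. It only demonstrates CSS selector semantics; it is not how WebsiteCrawler works internally:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

html = """
<div><p>Matched: a p that is a direct child of a div</p></div>
<section><p>Not matched: the parent here is a section</p></section>
"""

soup = BeautifulSoup(html, "html.parser")
# "div > p" selects <p> elements that are direct children of a <div>.
for node in soup.select("div > p"):
    print(node.get_text())
```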

What formats can I download the data in?

WebsiteCrawler lets users download a website's data as a comma-separated values (CSV) or JSON file. The generated JSON file contains a JSON array with one or more JSON objects. The time taken to download the file depends on the amount of data and your internet connection speed.
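
Because the export is a plain JSON array, it is easy to consume programmatically. A minimal sketch in Python; the file name and field names below are hypothetical, since the actual keys depend on the report you export:

```python
import json

# "website_data.json" and the keys below are hypothetical examples.
with open("website_data.json", encoding="utf-8") as f:
    pages = json.load(f)  # the JSON array becomes a Python list of dicts

for page in pages:
    print(page.get("url"), page.get("title"))
```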

Does it have an API?

Yes, this platform provides an API through which you can get data in an LLM-ready format instantly once the website's data is in its database. You have to create an API key to use this feature. A few lines of code can integrate WebsiteCrawler with any LLM of your choice, provided it supports JSON data.
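
Those few lines could look something like the following Python sketch. The endpoint URL, parameter, and header names here are assumptions for illustration; only the general pattern (an API key plus a JSON response) comes from the description above:

```python
import requests  # pip install requests

API_KEY = "your-api-key"  # created in your account, per the answer above

resp = requests.get(
    "https://websitecrawler.org/api/v1/data",        # hypothetical endpoint
    params={"domain": "example.com"},                # hypothetical parameter name
    headers={"Authorization": f"Bearer {API_KEY}"},  # hypothetical auth scheme
    timeout=30,
)
resp.raise_for_status()
records = resp.json()  # the LLM-ready JSON described above
print(len(records), "records fetched")
```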

What if the crawl progress does not appear?

The crawl progress should appear within 15 to 20 seconds of clicking the button. If it does not, use the sitemap crawl function, i.e. enter the sitemap URL instead of the non-redirecting domain and try again.