Author: pramod

  • How to see the crawl history with WebsiteCrawler?

    WebsiteCrawler treats every crawl as a separate run. To let users monitor audits and track data changes over time, it provides a feature called “Crawl History”. Why is this feature useful? If the site’s content is updated frequently, a new crawl gets you the latest data. If the content doesn’t change much, the data will be nearly identical from one crawl to the next. Keeping a record of every crawl makes it easier to track changes in the reports and data.

    How is this feature different from “Crawl compare”? The “Compare” report lets users see the content changes between two timestamps. The “History” feature lists every crawl of each project you’ve run on WebsiteCrawler.org. For each crawl, it displays the crawl date and the project URL, accompanied by a “Delete” button.

    If you click the project URL, WebsiteCrawler.org will open the reports for the date displayed next to the project. You can open multiple reports by clicking the respective project URLs and see how the site’s reports have improved or deteriorated over time. The data is available for each report. You can download it by clicking the “Data” report menu on the left sidebar, selecting the fields you want in the report, and clicking the “Download CSV” or “Download JSON” button.
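
    Once you have downloaded the CSV data for two crawl dates, the comparison can also be done offline. The following is a minimal sketch, not part of WebsiteCrawler itself. The column names `url` and `title` and the sample rows are assumptions made for illustration; use whichever fields you selected before downloading.

```python
import csv
import io


def load_export(csv_text, key="url"):
    """Index the rows of a CSV export by the value in the `key` column."""
    return {row[key]: row for row in csv.DictReader(io.StringIO(csv_text))}


def diff_exports(old_csv, new_csv, key="url"):
    """Return URLs that were added, removed, or changed between two exports."""
    old, new = load_export(old_csv, key), load_export(new_csv, key)
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    changed = sorted(u for u in set(old) & set(new) if old[u] != new[u])
    return {"added": added, "removed": removed, "changed": changed}


# Made-up sample data standing in for two downloaded CSV files.
OLD = "url,title\nhttps://example.com/,Home\nhttps://example.com/a,Page A\n"
NEW = "url,title\nhttps://example.com/,Home page\nhttps://example.com/b,Page B\n"

print(diff_exports(OLD, NEW))
```

    With real files, read each download with `open(path, newline="").read()` and pass the two strings to `diff_exports`.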

    There’s a delete button too on the “Crawl History” page. If you want to remove the site data for a particular date, simply click this button. If your website is huge, this process can take anywhere from a few seconds up to a minute to complete. Deleting old history that you no longer refer to keeps the database table clean. Whether to delete old records is entirely up to you.

  • What is our Enterprise plan?

    Websites can be very large, with tens of thousands of pages, and a user may own dozens of large websites. We have introduced the Enterprise plan for such websites and users.

    What features does the Enterprise plan unlock?

    Up to a million pages: WebsiteCrawler can crawl an entire site, however large, within the plan’s page limit. Just enter the site link and the number of URLs to crawl, and wait for the crawling task to finish. The Enterprise plan of WebsiteCrawler.org supports up to 1,000,000 pages in total, i.e., if you have created 50 projects, up to a million pages across those 50 websites (combined) will be crawled.

    Faster crawling speed: Enterprise plan users can configure WebsiteCrawler to crawl pages faster than on our other plans. The target website’s infrastructure should be robust, as dozens of threads will send hundreds of HTTP requests every minute to the target website. If speed matters, you can switch to the HTML crawler, which is very fast. If you switch to the HTML crawler, reports such as Render Blocking Resources, Screenshots, and Code Coverage won’t be available.

    1 minute uptime checks: WebsiteCrawler will check every website each minute and will send an email alert if any of your sites is not reachable. Our Starter and Growth plans support 5-minute and 3-minute uptime monitoring, respectively.

    Unlimited users: The Enterprise plan gives you the freedom to add unlimited users to your account. Each user can manage a different project and can have permission to change miscellaneous settings, change crawl settings, run crawls, etc.

  • How to schedule crawls with WebsiteCrawler?

    Scheduling a crawl with WebsiteCrawler.org is one of the easiest things to do. Visit the settings page and find the option “Schedule crawls” on the left sidebar. Select a project/domain name from the list and enter the crawl limit. Now select the time at which you want WebsiteCrawler.org to spider your site. Click the green “Create a Schedule” button. Our platform will add your site to the scheduled tasks and will automatically crawl it at the time you’ve set. You can create multiple schedules for the same project. For example, you can make our platform spider the same site multiple times a day.

    schedule crawls
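
    To make the idea of several daily schedules for one project concrete, here is a small sketch of how the next run time of a daily job can be worked out. It only illustrates the concept and says nothing about how WebsiteCrawler implements scheduling. The function names are hypothetical, and the times are naive datetimes with no time zone handling.

```python
from datetime import datetime, time, timedelta


def next_run(now, run_at):
    """Return the next datetime at which a daily job set for `run_at` fires."""
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed, so the job fires tomorrow.
        candidate += timedelta(days=1)
    return candidate


def next_runs(now, schedules):
    """Next firing time for each of several daily schedules, soonest first."""
    return sorted(next_run(now, t) for t in schedules)


# Two schedules for the same project: one at 09:00 and one at 18:30.
now = datetime(2024, 1, 1, 10, 0)
for run in next_runs(now, [time(9, 0), time(18, 30)]):
    print(run.isoformat())
```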

    Benefits of scheduling crawls:

    Detect issues automatically: If you schedule a crawl, you don’t have to log in to your account and invoke the crawler manually. WebsiteCrawler will automatically spider your site at the time of your choice, in the same way it does when a crawl is run manually, and will send an email once the job is done. It will do this every day until you remove the scheduled task, which you can do by clicking the “Delete” button. Before scheduling a task, you should run a sample crawl with a limit of 5 to 10 URLs to ensure that the site is crawlable.

    The major benefit is that issues in fresh content/pages are detected without manual intervention. You just have to log in to your account and open the current issues report. This report lets users see previous issues or compare issues by date. If you have added team members and assigned a project to them, you can easily find out whether the issues were fixed before a new crawl was completed.

    We don’t support monthly/weekly crawls yet because we believe issues should be fixed as soon as they are discovered to prevent SEO problems.

  • How to create your first WebsiteCrawler project?

    For easy identification, projects in WebsiteCrawler have the same name as the domain you entered for crawling. Projects cannot be renamed, but they can be grouped.

    How to create a project? If you run a crawl and then sign in to WebsiteCrawler, either after registering an account or by signing in with Google, our platform will automatically create a new project for you. While your site is being crawled, you can register a new account by clicking the “Get started” button on the menu bar and following the instructions displayed on the page. Because the account is created during an active crawl task, WebsiteCrawler creates the project automatically, and you can monitor the crawl progress from the dashboard. If you’ve logged in first and want to create a new project and run a crawl, this guide is for you.

    Click the green “Create a new project” button displayed on the dashboard page. You’ll now see a form where you must enter the target website URL. Enter the URL and click the “Add website” button. WebsiteCrawler will add your new project to the list.

    new projects

    To run a crawl, visit the homepage and enter the website URL you entered in the previous step. Now enter the crawl limit and click the “Crawl My Site Now” button. WebsiteCrawler will start crawling your site. You can close this page and visit the dashboard to monitor the crawl progress.

    WebsiteCrawler lets users delete a project. Deleting a project does not affect your crawl credits in any way. For example, if you made our platform crawl 1000 pages and then remove the project, those 1000 crawl credits will not be added back to your account.

    group projects

    In the first paragraph, we mentioned another feature of WebsiteCrawler.org, i.e., Group Projects. Along with creating a project, you can group projects from the dashboard page. To do this, click the “Group Projects” button. WebsiteCrawler.org will now show checkboxes beside each project. Select the checkboxes of your choice and enter the name for the group in the new textbox WebsiteCrawler displays.

  • How to extract custom data using XPath or CSS selectors with WebsiteCrawler?

    WebsiteCrawler allows users to extract specific data from web pages using CSS selectors and XPath selectors. Both approaches are powerful, but XPath-based extraction is more versatile.

    Extracting data using CSS selectors

    custom data settings

    In the custom data settings section, you must first select a project. After selecting a project, choose the extraction method, which can be CSS or XPath. Now enter the URL of the page that has the data you want WebsiteCrawler to extract, and enter the comma-separated CSS or XPath selectors in the textbox below it.

    Suppose you want to extract the text of a hyperlink element (a href) wrapped inside a span HTML tag for the project formsbook.com. Here’s the approach you should follow.

    data extraction

    Select the project formsbook.com and select the extraction method “CSS”. In the “URL to check” field, enter the URL https://www.formsbook.com/demos/information-request-form/ and, in the text box below it, enter the CSS selector span > a, which selects a elements that are direct children of a span HTML element. Now click the “Check and Add” button. If WebsiteCrawler detects matching elements, it will display the number of elements found on the page. This number can be 1 or more. Unless a match is found, the selector won’t be added to the crawling list. Once the selector is added to the crawling list, run a new crawl. The custom data (span > a) will be extracted and will be available in the “Custom data” report or the data section on the reports page.

    The screenshot above shows our log file, with the crawler working on the span > a custom data.

    custom data report

    The custom tags data report will be available under the content section of the reports section’s left sidebar only if the custom data settings were configured and a crawl was run.

    custom data report tag selection

    This is the custom tags data report. Select a tag whose data you want to see.

    final custom data report

    This is the final report. You can download the data in CSV or JSON format from the data section of the report.
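
    To illustrate what the span > a selector from the example matches, here is a minimal sketch that uses only Python’s standard library. It is not how WebsiteCrawler evaluates selectors. It hand-codes the single rule “an a element whose parent is a span”, and the sample markup is made up.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class SpanChildLinks(HTMLParser):
    """Collect the text of every <a> whose parent element is a <span>."""

    def __init__(self):
        super().__init__()
        self.stack = []            # currently open elements, outermost first
        self.capture_depth = None  # stack depth at which the captured <a> opened
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        parent_is_span = bool(self.stack) and self.stack[-1] == "span"
        if tag == "a" and parent_is_span and self.capture_depth is None:
            self.capture_depth = len(self.stack)
            self.results.append("")
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed elements).
            while self.stack.pop() != tag:
                pass
        if self.capture_depth is not None and len(self.stack) <= self.capture_depth:
            self.capture_depth = None

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.results[-1] += data


def extract_span_links(html):
    parser = SpanChildLinks()
    parser.feed(html)
    parser.close()
    return [text.strip() for text in parser.results]


SAMPLE = """
<div>
  <span><a href="/one">First link</a></span>
  <span><em><a href="/two">Nested deeper</a></em></span>
  <p><a href="/three">Outside a span</a></p>
</div>
"""
print(extract_span_links(SAMPLE))
```

    Only the first link is a direct child of a span. The second sits inside an em element and the third inside a p element, so neither matches the child combinator (>).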

    Extracting data using XPath

    Here’s another example of extracting custom data. This time, we will use the XPath extraction method. Suppose we want to extract the number of users that the Growth subscription plan of WebsiteCrawler.org supports. Here are the XPath settings for the same:

    In the above screenshot, you can see that clicking the “Check and Add” button returned a result, so the XPath selector was added to the crawling list. After running a crawl, this report was generated.

    xpath data extraction report

    As only one page has the pricing table on WebsiteCrawler.org, only a single result is displayed in the custom tags data report and no data is reported for the other crawled pages.

    Note: A result should be returned for the XPath selector as well. Unless a matching result is displayed, the selector won’t be added to the crawling list. WebsiteCrawler verifies the selectors you enter, using the Chrome browser to do so.
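
    The “a selector is only added if it returns a result” rule can be sketched with Python’s built-in ElementTree module, which supports a limited subset of XPath. This is an illustration under several assumptions: the table markup is a simplified, hypothetical stand-in for a pricing page, the user counts are placeholders rather than real plan limits, and `check_and_add` is a made-up helper name. WebsiteCrawler itself validates selectors in Chrome, which supports far more of XPath than ElementTree does.

```python
import xml.etree.ElementTree as ET

# Hypothetical pricing table. The numbers are placeholders, not real plan limits.
PAGE = """
<html><body>
  <table id="pricing">
    <tr><th>Plan</th><th>Users</th></tr>
    <tr><td>Starter</td><td>3</td></tr>
    <tr><td>Growth</td><td>5</td></tr>
  </table>
</body></html>
"""


def xpath_text(markup, expression):
    """Return the text of every element matched by `expression`."""
    root = ET.fromstring(markup.strip())
    return [(node.text or "").strip() for node in root.findall(expression)]


def check_and_add(crawl_list, markup, expression):
    """Keep a selector only if it matches at least one element."""
    found = xpath_text(markup, expression)
    if found:
        crawl_list.append(expression)
    return len(found)


selectors = []
growth_users = ".//table[@id='pricing']/tr[td='Growth']/td[2]"
print(check_and_add(selectors, PAGE, growth_users))  # number of matches
print(check_and_add(selectors, PAGE, ".//ul/li"))    # no match, so not added
print(selectors)
```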

    internal links

    By default, WebsiteCrawler.org skips links to downloadable files such as PDFs, videos, and MS Excel/Word/PowerPoint documents found on web pages. To change this behavior, visit the settings page and click the Miscellaneous settings menu on the left sidebar. Once the settings appear, select a project from the drop-down list and choose the option “Consider links to downloadable files hosted on example.com as internal links” from the “Select a condition for internal links” drop-down list.
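
    The effect of this setting can be pictured with a short sketch. The extension list and the function names below are assumptions made for illustration; the article only names PDFs, videos, and MS Office documents, and does not describe how WebsiteCrawler detects them.

```python
import posixpath
from urllib.parse import urlparse

# Assumed extension list; adjust it to the file types you care about.
DOWNLOADABLE = {".pdf", ".mp4", ".xls", ".xlsx", ".doc", ".docx", ".ppt", ".pptx"}


def is_downloadable(url):
    """True if the URL path ends in one of the assumed file extensions."""
    extension = posixpath.splitext(urlparse(url).path)[1].lower()
    return extension in DOWNLOADABLE


def internal_links(links, site_host, include_downloads=False):
    """Links on `site_host`, skipping downloadable files unless asked to keep them."""
    kept = []
    for link in links:
        if urlparse(link).hostname != site_host:
            continue  # external link
        if is_downloadable(link) and not include_downloads:
            continue  # default behavior: skip file downloads
        kept.append(link)
    return kept


LINKS = [
    "https://example.com/about",
    "https://example.com/files/report.PDF",
    "https://other.org/brochure.pdf",
]
print(internal_links(LINKS, "example.com"))
print(internal_links(LINKS, "example.com", include_downloads=True))
```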

    Tip: In case you can’t write the XPath or CSS selector, use Google Gemini, Claude or ChatGPT. Ask these tools to generate a selector for the page and verify the same using our Custom Data feature.

  • How do I find duplicate content on an entire website with WebsiteCrawler?

    Whether it’s a near duplicate or an exact copy of the content, WebsiteCrawler identifies both across 1000s of pages in a few seconds. Our platform looks at repeated phrases and words to identify duplicates. By default, WebsiteCrawler considers only non-stop words when identifying duplicate content, which helps avoid false positives. What are stop words? WebsiteCrawler maintains a list of 100+ commonly used words. These words are ignored while generating the report.

    WebsiteCrawler identifies up to 10 matching pages for each URL. It displays the URLs and the percentage similarity, which can be up to 100. Ideally, the score should be less than 70. If a page’s similarity score is higher than that, WebsiteCrawler.org flags the page and considers it a duplicate.
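
    The article does not say which similarity measure the platform uses, so the sketch below substitutes a common stand-in: Jaccard similarity over the sets of non-stop words, expressed as a percentage. The stop-word list is a tiny made-up sample, the page texts are invented, and only the 70 threshold and the limit of 10 matches come from the text above.

```python
import re

# Tiny illustrative stop-word list; the platform's own list has 100+ words.
STOP_WORDS = {"a", "an", "and", "are", "is", "in", "of", "on", "the", "to"}
THRESHOLD = 70  # percent, taken from the text


def content_words(text):
    """Lower-cased words of `text` with stop words removed."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {w for w in words if w not in STOP_WORDS}


def similarity(text_a, text_b):
    """Jaccard similarity of the two word sets, as a percentage from 0 to 100."""
    a, b = content_words(text_a), content_words(text_b)
    if not a and not b:
        return 0.0
    return 100.0 * len(a & b) / len(a | b)


def top_matches(url, pages, limit=10):
    """Up to `limit` most similar other pages, flagged when above THRESHOLD."""
    scores = [(other, similarity(pages[url], body))
              for other, body in pages.items() if other != url]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return [(other, round(score, 1), score > THRESHOLD)
            for other, score in scores[:limit]]


PAGES = {
    "/a": "The red fox jumps over the lazy dog",
    "/b": "A red fox jumps over the lazy dog",
    "/c": "Pricing and plans",
}
print(top_matches("/a", PAGES))
```

    Because “the” and “a” are stop words, pages /a and /b reduce to the same word set and score 100, while /c shares no content words with /a.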

    WebsiteCrawler can be configured to consider only indexable URLs/pages (pages without a meta robots noindex directive) while identifying duplicate content on the site. If you don’t want our platform to consider URLs which will not be indexed by search engines, visit the settings page and click the “Miscellaneous” menu on the left sidebar. Now select the project from the drop-down list. Find the setting “Select a condition for duplicate content” and select the “Consider indexable URLs” item from the list. Your miscellaneous setting for the selected project will be saved automatically. Now visit the duplicate content report to see the latest results.

    The duplicate content report is generated once a day because generating it requires heavy processing on the backend. Furthermore, if a site is big, it is rarely possible to fix content issues on 100s of pages in a day. If you have fixed the content, run a crawl and open the report again the next day to see the latest changes.
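
    As a rough illustration of what “indexable” means here, the sketch below checks a page’s HTML for a meta robots tag containing noindex. It uses only the standard library and is not WebsiteCrawler’s implementation. It also ignores other signals a crawler might consider, such as the X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class RobotsMeta(HTMLParser):
    """Record whether the page carries <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        directives = {d.strip().lower() for d in attr.get("content", "").split(",")}
        if "noindex" in directives:
            self.noindex = True


def is_indexable(html):
    """True unless the page has a meta robots noindex directive."""
    parser = RobotsMeta()
    parser.feed(html)
    parser.close()
    return not parser.noindex


blocked = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
allowed = '<html><head><meta name="robots" content="index, follow"></head></html>'
print(is_indexable(blocked), is_indexable(allowed))
```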

  • Getting started with WebsiteCrawler

    WebsiteCrawler is a multipurpose platform which lets you monitor, audit, and crawl a site, and extract data for LLM training. It doesn’t require installation or special setup unless you want to do something more advanced, such as extracting custom data.

    How to run a crawl?

    Our crawler can be invoked from the homepage (visit websitecrawler.org) or via the schedule feature. If you schedule a crawl, crawling will start at the time you specify. Homepage-based crawls can be started at any time rather than at a fixed time. You have to submit a URL and a limit to initiate a crawl. The limit is the number of links you want WebsiteCrawler to process.

    The URL can be a simple link to your site’s homepage or to the sitemap file. Once WebsiteCrawler starts spidering a site, you’ll see a list of URLs that have been processed on the homepage if the crawler was invoked from the homepage. In the case of a scheduled task, you can see the status in the dashboard by clicking the “status” button.

    Types of URLs you can submit to WebsiteCrawler

    From the homepage, you can submit the link to the target website or to its sitemap.xml file. In the second case, each URL in the sitemap will be extracted and added to the list of URLs to be processed.
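
    Extracting the URLs from a sitemap.xml file can be sketched as follows. This is an illustration with a made-up sitemap, not WebsiteCrawler’s code. It assumes a plain urlset sitemap in the standard sitemaps.org namespace and does not follow sitemap index files. The `limit` parameter is a hypothetical stand-in for the crawl limit.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text, limit=None):
    """Return the <loc> values of a urlset sitemap, optionally capped at `limit`."""
    root = ET.fromstring(xml_text)
    urls = [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
            if loc.text]
    return urls if limit is None else urls[:limit]


SAMPLE = """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/contact</loc></url>
</urlset>"""

print(sitemap_urls(SAMPLE, limit=2))
```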

    What is crawling activity?

    When you submit a URL, WebsiteCrawler will extract the links on the page along with its content. It will then process the linked URLs and their content in the same way. This process continues page by page until no new link is found or the crawl limit you entered is reached.
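
    The loop described above can be sketched as a breadth-first traversal with a limit. The order in which WebsiteCrawler actually visits pages is not stated in this article, so breadth-first is an assumption. `get_links` is a hypothetical callback, and the link graph below is made up so that the example runs without network access.

```python
from collections import deque


def crawl(start_url, get_links, limit):
    """Process up to `limit` URLs reachable from `start_url`, breadth-first."""
    queue = deque([start_url])
    seen = {start_url}   # URLs already queued, so no page is processed twice
    processed = []
    while queue and len(processed) < limit:
        url = queue.popleft()
        processed.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return processed


# Made-up link graph standing in for real pages.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/", "/c"],
    "/b": ["/c"],
    "/c": [],
}

print(crawl("/", lambda url: SITE.get(url, []), limit=10))
print(crawl("/", lambda url: SITE.get(url, []), limit=2))
```

    The first call stops because no new links remain, and the second stops because the limit of 2 is reached.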

    What are projects?

    Projects keep the data of different websites separate. For example, if you’ve created 3 projects, each project’s site will have its own audit, extracted, and monitoring data. You can create a new project by clicking the green “create a project” button.

    What are reports?

    The reports section gives you access to the extracted data and the audit reports, which comprise plain data and charts. Reports can be downloaded as PDF files with WebsiteCrawler’s branding or custom branding. Our platform gives you access to 35+ audit reports. Each report is accompanied by filters.

    What is the settings page?

    This page allows users to see the past crawl history, schedule crawls, manage users, manage PDF report branding, select the crawler type (HTML or Chrome), configure email alerts, set crawl delays, manage the Google Search Console integration, etc. It lets you change your password and email address as well. It also has a custom data section that lets you configure the CSS/XPath rules for a site.

  • Get WebsiteCrawler alerts to your Slack channel

    The WebsiteCrawler integration for Slack lets users receive messages in the Slack channel of their choice, making use of Slack’s channels feature.

    Why use the integration for Slack? Although you can group emails, web applications usually don’t send different kinds of messages from different email addresses. For example, we send most emails through our help@websitecrawler.org address. We don’t send SSL expiry alerts through a separate email alias. A user may get emails from many different sources, so the emails can get lost in the crowd. Slack allows you to create dedicated channels. The WebsiteCrawler integration for Slack lets users receive important messages in these dedicated channels. Thus, if you create a channel dedicated to our alerts, you can easily find all important messages sent by WebsiteCrawler in one place.

    Types of alerts WebsiteCrawler will send:

    Downtime alerts: Our platform monitors the uptime of sites. Depending on your subscription plan, our platform runs uptime checks every 5 minutes, 3 minutes, or 1 minute. If your site is unreachable during a check, WebsiteCrawler will immediately send an email to your registered email address and a message to your Slack channel.

    SSL Expiry alerts: If the site’s SSL certificate is due to expire in less than a week, WebsiteCrawler will send a Slack message to the channel along with the email.

    Crawl completion alerts: If you run a crawl, you don’t have to keep your eyes glued to the list of processed URLs displayed by WebsiteCrawler. Once the crawl job is complete, you will immediately get an alert message in your chosen Slack channel.
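
    The “less than a week” rule behind the SSL expiry alert can be sketched with Python’s ssl module. This is an illustration of the rule and not WebsiteCrawler’s code. The certificate date below is made up, and the function names are hypothetical. The date string uses the same format as the notAfter field returned by `ssl.SSLSocket.getpeercert()`.

```python
import ssl
from datetime import datetime, timezone

ALERT_WINDOW_DAYS = 7  # "less than a week", from the text


def days_until_expiry(not_after, now):
    """Days between `now` and a certificate's notAfter timestamp."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def should_alert(not_after, now):
    """True when the certificate expires in less than ALERT_WINDOW_DAYS."""
    return days_until_expiry(not_after, now) < ALERT_WINDOW_DAYS


# Made-up expiry date in the notAfter format.
NOT_AFTER = "Jan 15 12:00:00 2030 GMT"
checked_at = datetime(2030, 1, 10, 12, 0, tzinfo=timezone.utc)
print(days_until_expiry(NOT_AFTER, checked_at), should_alert(NOT_AFTER, checked_at))
```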

    Integrating Slack with WebsiteCrawler

    Our platform lets you register a new account and sign in with your login credentials, or sign in with Google. Log in to your WebsiteCrawler account and visit the settings page. Click the “API Integration” menu and find the button “Connect to Slack”.

    Click this button and complete the OAuth authentication. Once the OAuth authentication is successful, you’ll see a button “See channels”. Click this button. WebsiteCrawler will now show a list of channels available in your workspace.

    Select the channel where you’d like to receive our messages and click the “Save” button. The message “Connected to Slack” will change to “Connected to Slack and sending messages to channel with id XXXXXX” where XXXXXX is your chosen Slack channel’s id.

    How will the messages look?

    The sent messages are easy to interpret. Slack will log every message sent by our platform in your channel.

    To know what data we collect and how long we retain it, please refer to our Privacy Policy. You can also visit our Terms of Service page to read the TOS.

    If you’re facing any difficulties while using this feature or want to know more about it, you can write an email to us at help@websitecrawler.org.

  • Analyze Log Files online

    Log files generated by web servers such as Nginx and Apache contain information that widely used analytics tools, e.g., Google Analytics, won’t display. For example, Google Analytics won’t display the IP address of the bot or the user.

    Log files contain a timestamp, HTTP protocol, request type, URL, status code, IP address, etc. Analyzing the log file data manually can be time-consuming. This is where the Website Crawler tool comes into the picture.

    With the log file analyzer tool of WC, you can see what URLs search bots are crawling or the IP addresses of the users that have visited the website. You can also see the URLs that people/bots are visiting the most.

    The Website Crawler’s Log File Analyzer tool displays the following important information:

    • Links with HTTP status code 200, 404, etc.
    • The number of times bots have visited your site.
    • Number of URLs present in the log file, and more.
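
    To show where these figures come from, here is a minimal sketch that parses a few access log lines and counts status codes, bot visits, and URLs. It is not the Log File Analyzer’s code. It assumes the common “combined” log format, treats any user agent containing “bot”, “crawler”, or “spider” as a bot (a crude heuristic), and uses made-up log lines.

```python
import re
from collections import Counter

# Pattern for the "combined" access log format (an assumption about your logs).
LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize(lines, bot_markers=("bot", "crawler", "spider")):
    """Count status codes, bot hits, and URLs in the parsable log lines."""
    statuses, urls, bot_hits = Counter(), Counter(), 0
    for line in lines:
        match = LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        statuses[match["status"]] += 1
        urls[match["url"]] += 1
        if any(marker in match["agent"].lower() for marker in bot_markers):
            bot_hits += 1
    return {"statuses": dict(statuses), "unique_urls": len(urls),
            "bot_hits": bot_hits, "top_urls": urls.most_common(3)}


# Made-up sample lines.
LOG = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /blog/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Oct/2023:13:56:01 +0000] "GET /missing HTTP/1.1" 404 312 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
    '203.0.113.7 - - [10/Oct/2023:13:57:12 +0000] "GET /blog/ HTTP/1.1" 200 5123 "https://example.com/" "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]
print(summarize(LOG))
```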

    How to use the log file analyzer tool?

    Click the “Choose File” option, and select the access log file on your PC. If you don’t have the file, get it from the server. Once you choose the file, click the “Process” button. You’ll now see some vital details, a table containing the log file data, and a textbox.

    Log File Analyzer’s filter

    You can filter the data by entering a word in the textbox you’ll find below the file selector/upload option. For example, if you want to see the list of URLs crawled by Googlebot, enter “Googlebot” in the textbox. Once you enter the word, you’ll see the filtered data.
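
    The same kind of filter can be approximated offline with a few lines of code. This is a sketch, not the tool’s implementation. It assumes the request is the first double-quoted field of each line, as in the common and combined log formats, and it matches the word case-insensitively, which may differ from how the tool’s textbox behaves. The log lines are made up.

```python
def urls_matching(lines, term):
    """URLs requested by the log lines that contain `term` (case-insensitive)."""
    needle = term.lower()
    urls = []
    for line in lines:
        if needle not in line.lower():
            continue
        # The request line, e.g. GET /path HTTP/1.1, is the first quoted field.
        parts = line.split('"')
        if len(parts) > 1:
            request = parts[1].split()
            if len(request) >= 2:
                urls.append(request[1])
    return urls


# Made-up sample lines.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /blog/ HTTP/1.1" 200 5123 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '203.0.113.7 - - [10/Oct/2023:13:56:01 +0000] "GET /pricing HTTP/1.1" 200 812 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
print(urls_matching(SAMPLE_LOG, "Googlebot"))
```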

    Although WC is capable of reading large files, it will process only 100 lines of log files uploaded by free account holders/unregistered users. If you’ve got a silver account, WC will process 2500 URLs.

    Note: Website Crawler won’t save the file data or the entire file on the server. The file will be discarded once you’ve exited the page/closed the tab in which you’ve opened the tool.