Category: Tutorials

  • Analyze Log Files online

    Log files generated by web servers such as Nginx and Apache contain information that widely used analytics tools such as Google Analytics won’t display. For example, Google Analytics won’t show you the IP address of a bot or a user.

    Each log entry contains a timestamp, the HTTP protocol, the request type, the URL, the status code, the IP address, and more. Analyzing this data manually can be time-consuming. This is where the Website Crawler (WC) tool comes into the picture.
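
    To give a feel for what each entry holds, here’s a minimal sketch of extracting those fields from one line in the common Nginx/Apache “combined” log format. It is illustrative only, not WC’s parser, and the sample line is made up.

    ```python
    # A minimal sketch (not WC's parser), assuming the common Nginx/Apache
    # "combined" log format; it extracts the fields listed above from one line.
    import re

    LINE = ('66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] '
            '"GET /about HTTP/1.1" 200 5316 "-" "Googlebot/2.1"')

    PATTERN = re.compile(
        r'(?P<ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
        r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
        r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
    )

    match = PATTERN.match(LINE)
    if match:
        # e.g. {'ip': '66.249.66.1', 'timestamp': '10/Oct/2023:13:55:36 +0000', ...}
        print(match.groupdict())
    ```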

    With WC’s log file analyzer, you can see which URLs search bots are crawling and the IP addresses of the users who have visited your website. You can also see which URLs people and bots visit the most.

    The Website Crawler’s Log File Analyzer tool displays the following important information:

    • Links with HTTP status codes 200, 404, etc.
    • The number of times bots have visited your site.
    • The number of URLs present in the log file, and more (see the sketch after this list).
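
    For the curious, stats like these can be tallied from the raw file in a few lines. This is a minimal sketch, not WC’s implementation; the access.log filename and the substring-based bot check are assumptions.

    ```python
    # A minimal sketch of tallying status codes, bot visits, and line counts
    # from a raw access log. "access.log" and the "bot" substring heuristic
    # are assumptions for illustration.
    from collections import Counter

    status_counts = Counter()
    bot_hits = 0
    total_lines = 0

    with open("access.log", encoding="utf-8") as log:
        for line in log:
            total_lines += 1
            parts = line.split('"')  # combined format: status follows the quoted request
            if len(parts) >= 3 and parts[2].split():
                status_counts[parts[2].split()[0]] += 1
            if "bot" in line.lower():  # crude heuristic for bot visits
                bot_hits += 1

    print("Lines processed:", total_lines)
    print("Bot visits:", bot_hits)
    print("Status codes:", dict(status_counts))
    ```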

    How to use the log file analyzer tool?

    Click the “Choose File” option and select the access log file on your PC. If you don’t have the file, download it from your server. Once you’ve chosen the file, click the “Process” button. You’ll now see some vital details, a table containing the log file data, and a textbox.

    Log File Analyzer’s filter

    You can filter the data by entering a word in the textbox you’ll find below the file upload option. For example, if you want to see the list of URLs crawled by Googlebot, enter “Googlebot” in that textbox. Once you enter the word, you’ll see the filtered data.
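
    The same filtering idea is easy to reproduce offline. A minimal sketch, assuming the log sits in a local access.log file; note this simple version is case-sensitive.

    ```python
    # A minimal sketch of filtering log lines by a keyword, outside the browser.
    # "access.log" is an assumed filename.
    needle = "Googlebot"

    with open("access.log", encoding="utf-8") as log:
        for line in log:
            if needle in line:
                print(line, end="")
    ```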

    Although WC is capable of reading large files, it will process only 100 lines of a log file uploaded by a free account holder or unregistered user. If you’ve got a Silver account, WC will process 2500 lines.

    Note: Website Crawler doesn’t save the file or its data on the server. The file is discarded once you’ve exited the page or closed the tab in which you opened the tool.

  • How to create XML Sitemap for any website?

    As you might know, Website Crawler – On Page SEO Checker is capable of extracting URLs from a sitemap and processing them. You can now also create XML sitemaps for any website with WC. For those who don’t know, a sitemap is a file that lists the links search bots should crawl and index. If your site is well structured, search bots will be able to find its pages. If your site is poorly structured and doesn’t have a sitemap, search bots may not crawl and index some of its pages.

    To make sure that the important pages of your site are indexed, you should submit the sitemap’s URL to the search engines (for example, through Google Search Console or a Sitemap: line in robots.txt). Where should you save the sitemap.xml file? Well, you can save it to any folder of your website’s directory.

    WebsiteCrawler sitemap generator

    Once Website Crawler crawls your website, head over to the projects section, and click the “XML Sitemap” option.

    Now, you’ll see a form with several fields. In the first textbox, enter the number of URLs you want the sitemap to have. If you don’t want URLs containing specific words or characters to appear in the XML file, enter those words in the second textbox.

    [Screenshot: Website Crawler sitemap generator]

    URLs in a sitemap can have a priority, and a URL with a higher priority might be crawled/indexed first. With WC, you can create multiple sitemaps, and each sitemap can assign different priorities to its URLs.

    Once the sitemap is generated, you’ll see a URL. Click this link to open the XML file in your browser, then right-click on the file’s data and select the “Save As” option.

    Sometimes, you may update a page/post on your site. To make sure that bots learn about the change and index the newer version of the page, you should add either the modified date or a changefreq value to the sitemap. As of now, WC doesn’t support dates. However, it can create sitemaps with the changefreq option. If you set a URL’s changefreq to “weekly”, bots may crawl the URL on a weekly basis. Likewise, setting changefreq to “always” suggests that search bots should visit the URL often.
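
    For reference, here’s roughly what such a file boils down to. A minimal sketch that writes loc, priority, and changefreq entries with Python’s standard library; the URLs and values are made up, and this is not WC’s generator.

    ```python
    # A minimal sketch of generating a sitemap with per-URL priority and
    # changefreq, using only the standard library. The entries are illustrative.
    import xml.etree.ElementTree as ET

    entries = [
        ("https://example.com/",      "1.0", "weekly"),
        ("https://example.com/about", "0.5", "monthly"),
    ]

    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for loc, priority, changefreq in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "priority").text = priority
        ET.SubElement(url, "changefreq").text = changefreq

    ET.ElementTree(urlset).write("sitemap.xml",
                                 encoding="utf-8", xml_declaration=True)
    ```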

  • Nofollow Links Checker

    “rel” is one of the attributes supported by hyperlinks. Before Google introduced the “sponsored” value for the “rel” attribute this month, “nofollow” was the commonly used value. Links carrying rel=”nofollow” are widely known as nofollow URLs, while ordinary links without the attribute are informally called dofollow URLs.

    Dofollow (DF) links pass PageRank to the linked documents, and search bots follow these URLs. Nofollow (NF) hyperlinks don’t pass link juice, but they may reduce the PageRank flowing to other URLs. The attribute tells Googlebot that the link should be ignored, i.e. not considered when ranking pages. Google recommends that webmasters use the rel=”nofollow” attribute only for URLs that make you money (affiliate URLs, etc.).

    DF links are used for internal links and for links to pages on websites you don’t own. External dofollow URLs should point to relevant pages only. If you don’t follow this rule, Googlebot may put your website on the list of sites that sell backlinks. Sites engaged in the business of selling or buying links suffer under the Google Penguin algorithm.

    Finding nofollow links on your entire website

    One of the special features of WebsiteCrawler is its ability to find nofollow URLs. WC makes you aware of the pages on your site that have the most links with the rel=”nofollow” attribute.
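
    If you’re curious how such a check works, here’s a minimal sketch that counts rel=”nofollow” links in a single page’s HTML using Python’s standard library. It isn’t WC’s crawler, and the URL is a placeholder.

    ```python
    # A minimal sketch: count rel="nofollow" links in one page's HTML.
    # Not WC's crawler; the URL below is a placeholder.
    from html.parser import HTMLParser
    from urllib.request import urlopen

    class NofollowCounter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.nofollow = []

        def handle_starttag(self, tag, attrs):
            a = dict(attrs)
            # rel can hold several space-separated values, e.g. "nofollow sponsored"
            if tag == "a" and "nofollow" in (a.get("rel") or "").split():
                self.nofollow.append(a.get("href"))

    html = urlopen("https://example.com/").read().decode("utf-8", "replace")
    counter = NofollowCounter()
    counter.feed(html)
    print(len(counter.nofollow), "nofollow links:", counter.nofollow)
    ```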

    To access this report, create your WebsiteCrawler.org account and follow the instructions displayed on the screen. Once your site has been crawled, navigate to your account’s dashboard, click the project name, and then open the “Nofollow Links” report.

    [Screenshot: Nofollow Links checker]

    This report displays two columns: the link count and the page where the links were found. The first column displays the number of nofollow URLs WC has found, and the second column displays the URL of the page on your website where they were found.

    To see the links with the rel=”nofollow” attribute, copy the page URL from the second column, open the “Link Data” report page, and paste the copied URL into the textbox. Then choose the “Nofollow Links” option from the dropdown list and click the “Submit” button.

  • How to find low quality content on a website for free?

    Thin content is one of the major reasons why websites suffer from a Google Panda algorithm penalty. For those who are not aware, Google’s Panda algorithm is designed to lower the rankings of websites that have On-Page SEO issues, and thin content is one of several such issues.

    The Panda algorithm is now baked into Google’s core algorithm. The best way to avoid this penalty is to find thin posts/pages and either rewrite the content entirely, update it, add new paragraphs to it, or noindex it.

    Although the above-mentioned activities take some time, the results can be fruitful.

    Benefits of finding and improving thin content

    If your website was hit by a Google Panda penalty, you should see some improvement after the next major algorithm update. If your website wasn’t affected by an algorithm update and you’ve published high-quality content, your site will start getting more visitors.

    Finding thin content on a website

    Frankly speaking, if a website has hundreds of thousands of posts, finding thin pages takes a lot of time. If you’re using a CMS such as WordPress, installing a plugin will help, but your website’s performance will take a hit, because a plugin that counts the words in every post/page will run several database queries. A do-it-yourself alternative is sketched below.
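
    A minimal sketch of that do-it-yourself route, assuming your posts are exported as local HTML files in a posts folder; the 300-word cutoff is an arbitrary assumption.

    ```python
    # A minimal sketch: flag local HTML files below a word-count threshold.
    # The "posts" folder and the 300-word cutoff are assumptions.
    import glob
    import re

    THIN_THRESHOLD = 300  # assumed cutoff; tune it for your niche

    for path in glob.glob("posts/*.html"):
        with open(path, encoding="utf-8") as f:
            html = f.read()
        text = re.sub(r"<[^>]+>", " ", html)  # crude tag stripping
        words = len(text.split())
        if words < THIN_THRESHOLD:
            print(f"{path}: {words} words (possibly thin)")
    ```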

    The best way to find thin content on a website is to use Website Crawler.

    How to find thin pages on a website with WC?

    [Screenshot: Website Crawler thin content report]

    Step 1: Open Website Crawler – the On Page SEO Checker. Enter your website’s URL in the first textbox and the number of pages you want to analyze in the second.

    Step 2: Click the Submit button. WC will now show a button called Status. Click this button to see the current crawling status.

    Step 3: Once WC has crawled your website, enter your email address and click the Submit button. Then enter the verification code in the new textbox displayed on the screen. When you enter a correct verification code, WC will create an account for you and display a success message.

    Step 4: Log in to your account and click the project name. Scroll down until you find the “Content Report” option. That’s it. Website Crawler will now display a list of posts along with the number of paragraphs and words each one has.

    Conclusion: If you want your website to stay alive for several years, make sure that you post high-quality content, and noindex or update outdated and thin content.

  • Crawl sitemap links using Website Crawler

    A sitemap, as you may already know, is one of the most important parts of a website. It contains a list of a website’s links and helps search engines crawl and index pages they might not otherwise find. Today, I have good news for the users of Website Crawler: WC can now crawl the links it finds in the sitemap file. This feature was on our checklist, and we’ve rolled it out today.

    In case you’re wondering how to use the new XML crawl feature of Website Crawler, here’s a tutorial you can follow.

    How to make Website Crawler crawl sitemap links?

    Enter the direct link to your website’s XML sitemap file in the large textbox you see on the Website Crawler homepage. Don’t worry about the file’s size; WC can analyze sitemaps containing 25000+ URLs.
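
    Under the hood, crawling a sitemap starts with pulling the URLs out of it. Here’s a minimal sketch of that step using Python’s standard library; it is not WC’s crawler, and the sitemap URL is a placeholder.

    ```python
    # A minimal sketch: extract every <loc> URL from a sitemap file.
    # The sitemap URL below is a placeholder.
    import xml.etree.ElementTree as ET
    from urllib.request import urlopen

    NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

    with urlopen("https://example.com/sitemap.xml") as resp:
        tree = ET.parse(resp)

    urls = [loc.text for loc in tree.findall(".//sm:loc", NS)]
    print(len(urls), "URLs found in the sitemap")
    ```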

    [Screenshot: sitemap crawl]

    Once you’ve entered the URL, enter the number of URLs you want Website Crawler to crawl in the “Depth” textbox. Then, click the Submit button.

    Note: A free WC account supports 550 URLs, i.e. no matter how big the sitemap file is, only up to 550 URLs will be processed. Our Silver (paid) plan supports 2500 URLs.

    When you hit this button, WC will start crawling your website’s URLs. To see the status, i.e. the URL currently being crawled and the list of processed URLs, click the “Status” button.

    Once WC finishes processing the sitemap links, you’ll see a new form that asks you to enter your email address. Enter your email ID and click the Submit button. WC will now send a verification email to your inbox. Enter the 3-digit code in the textbox and hit the “Submit” button. WC will then create your websitecrawler.org account, through which you can see your website’s On-Page SEO reports.

    Conclusion: If you want Website Crawler to analyze a fixed set of URLs or only the links in a sitemap file, enter the link to the sitemap file instead of a website URL in the large textbox on WC’s home page.