Website Crawler Blog

Author: pramod

Analyze Log Files online
Log files generated by web servers such as Nginx and Apache contain information that widely used analytics tools such as Google Analytics won’t display. For example, Google Analytics won’t display the IP address of the bot or the user.

Log files contain a timestamp, HTTP protocol, request type, URL, status code, IP address, etc. Analyzing the log file data manually can be time-consuming. This is where the Website Crawler tool comes into the picture.

With the log file analyzer tool of Website Crawler (WC), you can see which URLs search bots are crawling and the IP addresses of the users that have visited the website. You can also see the URLs that people/bots are visiting the most.

The Website Crawler’s Log File Analyzer tool displays the following important information:
- Links with HTTP status code 200, 404, etc.
- The number of times bots have visited your site.
- Number of URLs present in the log file, and more.
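
To give an idea of what such an analysis involves, here is a minimal Python sketch. It is not the code behind WC's tool. It assumes the log uses the common "combined" format that Nginx and Apache write by default, and the names `LOG_PATTERN`, `parse_line`, `summarize` and the sample lines are made up for this illustration.

```python
import re
from collections import Counter

# Assumption: the log uses the "combined" access log format.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it doesn't match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def summarize(lines, bot_keyword="bot"):
    """Count status codes, distinct URLs, and hits from bot-like user agents."""
    statuses = Counter()
    urls = Counter()
    bot_hits = 0
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue  # skip lines written in some other format
        statuses[entry["status"]] += 1
        urls[entry["url"]] += 1
        if bot_keyword.lower() in entry["agent"].lower():
            bot_hits += 1
    return {"statuses": statuses, "urls": urls, "bot_hits": bot_hits}

# Made-up sample lines (documentation IP addresses), for illustration only.
SAMPLE = [
    '192.0.2.10 - - [27/Aug/2020:10:00:00 +0000] "GET /about HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '203.0.113.7 - - [27/Aug/2020:10:00:05 +0000] "GET /missing HTTP/1.1" 404 128 "-" "Mozilla/5.0 (Windows NT 10.0)"',
    '192.0.2.10 - - [27/Aug/2020:10:00:09 +0000] "GET /about HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
]

report = summarize(SAMPLE)
print(report["statuses"], len(report["urls"]), report["bot_hits"])
```

Matching on the word "bot" in the user agent is a rough heuristic. Passing `bot_keyword="Googlebot"` narrows the count to a single crawler, much like typing a word into a filter box.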

How to use the log file analyzer tool?

Click the “Choose File” option, and select the access log file on your PC. If you don’t have the file, get it from the server. Once you choose the file, click the “Process” button. You’ll now see some vital details, a table containing the log file data, and a textbox.

Log File Analyzer’s filter

You can filter the data by entering a word in the textbox that you’ll find below the file upload option. For example, if you want to see the list of URLs crawled by Googlebot, enter “Googlebot” in this textbox. Once you enter the word, you’ll see the filtered data.

Although WC is capable of reading large files, it will process only 100 lines of log files uploaded by free account holders/unregistered users. If you’ve got a silver account, WC will process 2500 URLs.

Note: Website Crawler won’t save the file data or the entire file on the server. The file will be discarded once you exit the page or close the tab in which you opened the tool.

August 27, 2020

How to create an XML sitemap for any website?

As you might know, Website Crawler – On Page SEO Checker is capable of extracting URLs from a sitemap and processing them. You can now also create XML sitemaps for any website with WC. For those who don’t know, a sitemap is a file that lists the links the search bots should crawl and index. If your site is well structured, search bots will be able to find the pages. If your site is poorly structured and doesn’t have a sitemap, search bots may not crawl and index some of the pages on your website.

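To make sure that the important pages of your site are indexed, you should submit the URL of the sitemap file to the search engines, for example through Google Search Console. Where should you save the sitemap.xml file? You can save the file to any folder of your website’s directory, but note that under the sitemaps.org protocol a sitemap only covers URLs at or below its own folder, so the root folder is the usual choice.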

WebsiteCrawler sitemap generator

Once Website Crawler crawls your website, head over to the projects section, and click the “XML Sitemap” option.

Now, you’ll see a form with several fields. In the first textbox, enter the number of URLs you want the sitemap to have. If you don’t want URLs containing specific words or characters to appear in the XML file, enter the word in the second textbox.

URLs in sitemaps can have a priority. A URL with a higher priority might be crawled/indexed first. With WC, you can create multiple sitemaps, and each sitemap can have URLs with different priorities.

Once the sitemap is generated, you’ll see a URL. Click this link to see the XML file. Once the browser opens the file, right-click on the file’s data and select the “Save As” option.

Sometimes, you may update a page/post on your site. To help bots learn about the change and index the newer version of the page, you should add either the last modified date (lastmod) or changefreq to the sitemap. As of now, WC doesn’t support dates. However, it can create sitemaps with the changefreq option. If you set the URL changefreq to “weekly”, you are telling bots that the URL is likely to change about once a week. Likewise, setting changefreq to “always” tells bots that the URL changes every time it is accessed. Search engines treat changefreq as a hint rather than a command, so it does not guarantee a crawl schedule.
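
For readers who want to see what such a file contains, here is a rough Python sketch that builds a sitemap with the changefreq and priority fields described above. It is not WC's generator. The function name `build_sitemap` and its parameters (`max_urls`, `exclude_word`) are hypothetical and only mirror the form fields mentioned earlier, and the example.com URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemap(urls, changefreq="weekly", priority="0.5",
                  max_urls=None, exclude_word=None):
    """Return sitemap XML (as a string) for the given list of URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    count = 0
    for url in urls:
        if exclude_word and exclude_word in url:
            continue  # skip URLs containing the unwanted word
        if max_urls is not None and count >= max_urls:
            break     # stop once the requested number of URLs is reached
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
        ET.SubElement(entry, "changefreq").text = changefreq
        ET.SubElement(entry, "priority").text = priority
        count += 1
    return ET.tostring(urlset, encoding="unicode")

# Placeholder URLs, for illustration only.
xml_text = build_sitemap(
    ["https://example.com/", "https://example.com/blog/", "https://example.com/tag/seo/"],
    changefreq="weekly", priority="0.8", max_urls=2, exclude_word="/tag/",
)
print(xml_text)
```

Calling the function twice with different `priority` values is one way to produce several sitemaps whose URLs carry different priorities.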

August 25, 2020

Nofollow Links Checker

“rel” is one of the attributes supported by hyperlinks. Before Google introduced the “sponsored” value for the “rel” attribute this month, nofollow was the most commonly used value. Links that carry rel="nofollow" are widely known as nofollow URLs, and links without it are known as dofollow URLs (“dofollow” is an informal label, not an actual value of the attribute).

Dofollow (DF) links pass page rank to linked documents, and search bots follow these URLs. Nofollow (NF) hyperlinks don’t pass the link juice, but they may reduce the page rank flowing to other URLs. The nofollow value tells Googlebot that the link should be ignored or not considered for ranking the page of your website. Google recommends that webmasters use the rel="nofollow" attribute only for URLs that make you money (affiliate URLs, etc).

DF links are used for internal links and for links to pages of websites that you don’t own. External dofollow URLs should point to relevant pages only. If they don’t, Googlebot may add your website to the list of sites that are selling backlinks. Sites engaged in the business of selling or buying links suffer from the Google Penguin algorithm.

Finding nofollow links on your entire website

One of the special features of WebsiteCrawler is its ability to find nofollow URLs. WC makes you aware of the pages on your site that have the highest number of links with the rel="nofollow" attribute.

To access this report, create your WebsiteCrawler.org account and follow the instructions displayed on the screen. Once your site has been crawled, you should navigate to your account’s dashboard, click on the project name and then click the “Nofollow Links” report.

This report displays two columns – Link and the Page Where the link was found. The first column displays a count of nofollow URLs WC has found and the other column displays the URL of the page on your website where the link was found.

To see the links with the rel="nofollow" attribute, copy the page URL from column 2. Then, open the “Link Data” report page and paste the URL you’ve copied in the textbox. Finally, choose the “Nofollow Links” option from the dropdown list and click the “Submit” button.
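
If you are curious how nofollow links can be detected on a single page, the following Python sketch shows one possible approach using only the standard library. It is an illustration, not the way WC builds its report. The class name `NofollowFinder` and the sample HTML are made up.

```python
from html.parser import HTMLParser

class NofollowFinder(HTMLParser):
    """Collect href values of <a> tags, split by whether rel includes nofollow."""

    def __init__(self):
        super().__init__()
        self.nofollow_links = []
        self.other_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return  # anchors without a target are not links
        # rel may hold several space-separated values, e.g. "sponsored nofollow".
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_values:
            self.nofollow_links.append(href)
        else:
            self.other_links.append(href)

# Made-up page fragment, for illustration only.
PAGE = """
<p><a href="/about">About</a>
<a rel="nofollow" href="https://example.com/affiliate">Deal</a>
<a rel="sponsored nofollow" href="https://example.org/ad">Ad</a></p>
"""

finder = NofollowFinder()
finder.feed(PAGE)
print(len(finder.nofollow_links), finder.nofollow_links)
```

Splitting the rel value on whitespace matters because a link can carry more than one value at once.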

September 17, 2019

How to find low-quality content on a website for free?

Thin content is one of the major reasons why websites suffer from a Google Panda algorithm penalty. For those who are not aware, Google’s Panda algorithm has been designed to lower the rankings of websites that have On-Page SEO issues. Thin content is one of several On-Page search engine optimization issues.

The Panda algorithm is now baked into the core algorithm. The best way to avoid this penalty is to find thin posts/pages and either rewrite the entire content, update it, add new paragraphs to it or noindex it.

Although the above-mentioned activities take some time, the results can be fruitful.

Benefits of finding and improving thin content

If your website was hit by the Google Panda algorithm penalty, you should see some improvement after a major algorithm update. If your website wasn’t affected by an algorithm update and you have published high-quality content on your website, your website will start getting more visitors.

Finding thin content on a website

Frankly speaking, if a website has hundreds of thousands of posts, it will take a lot of time to find thin pages. If you’re using a CMS such as WordPress, installing a plugin will help you, but your website’s performance will take a hit. This is because a plugin that counts the number of words in posts/pages will run several queries.

The best way to find thin content on a website is to use Website Crawler.

How to find thin pages on a website with WC?

Step 1: Open Website Crawler – The On Page SEO checker. Enter your website’s URL in the first textbox and the number of pages you want to analyze in the second textbox.

Step 2: Click the Submit button. WC will now show a button called Status. Click this button to see the current crawling status.

Step 3: Once WC crawls your website, enter your email address and click the submit button. Now, enter the verification code in the new textbox displayed on the screen. When you enter a correct verification code, WC will create an account for you and it will display a success message.

Step 4: Log in to your account, and click the project name. Now, scroll down till you find the “Content Report” option. That’s it. Website Crawler will now display a list of posts along with the number of paragraphs and words each one has.
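
To make the idea of a paragraph and word count concrete, here is a small Python sketch that counts both for one page. It only approximates what a content report might measure and is not WC's implementation. The names `ContentCounter`, `content_stats` and `is_thin`, as well as the 300-word threshold, are assumptions made for this example.

```python
from html.parser import HTMLParser

class ContentCounter(HTMLParser):
    """Count <p> elements and the words that appear inside them."""

    def __init__(self):
        super().__init__()
        self.in_paragraph = False
        self.paragraphs = 0
        self.words = 0

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_paragraph = True
            self.paragraphs += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Text inside inline tags such as <b> still belongs to the paragraph.
        if self.in_paragraph:
            self.words += len(data.split())

def content_stats(html):
    """Return (paragraph_count, word_count) for the given HTML."""
    counter = ContentCounter()
    counter.feed(html)
    counter.close()
    return counter.paragraphs, counter.words

def is_thin(html, min_words=300):
    """min_words is an arbitrary threshold chosen for this example."""
    return content_stats(html)[1] < min_words

# Made-up page fragment, for illustration only.
SAMPLE = "<h1>Title</h1><p>Just a <b>short</b> post</p><p>Second paragraph here</p>"
print(content_stats(SAMPLE), is_thin(SAMPLE))
```

Because the counting happens on the fetched HTML, nothing has to run inside the CMS, which avoids the extra database queries a plugin would make.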

Conclusion: If you want your website to be alive for several years, either make sure that you post high-quality content or noindex/update the outdated or thin content.

August 20, 2019

Broken Links Checker: Find 404 links on a website

When a page of a website is unreachable, the visitor will see an error message in the browser’s tab. These messages are reported for the following reasons:

The database server isn’t working: All nonstatic websites save data to a database. If the database server isn’t working, the page won’t be able to get data from the DB table, and either the page will be blank or the web server will report a non-200 HTTP status code.

Rate Limiting: A web server may be configured to limit the number of continuous requests a client/visitor can make to the page. If several requests are made to a page in a short period of time, the web server will throw an error.

The page has been removed: If the webmaster, user or a developer has removed the page, the webserver will report HTTP Status 404. The problem with the 404 status code is that the search bots will make several attempts to crawl the page in the future. To reduce the number of these attempts, you can configure the web-server to throw HTTP Status 410 error instead of 404 for the broken links.

Other reasons that may make your web server respond with status codes other than HTTP 200 are as follows:
- DNS issue.
- Network issue.
- User is blocked by the firewall, etc.

Using Website Crawler as a broken links checker

Website Crawler not only enables you to find broken links on your site but also makes you aware of unresponsive pages on your website. Follow the below steps to find broken URLs on your website:

Step 1: Enter the URL of your website in the first textbox and the number of URLs you want to check in the second textbox displayed on the homepage, and click the submit button.

Step 2: Click the Status button to see the “Crawl Status”. Once Website Crawler finishes crawling your site, enter your email address and then the verification code sent to your inbox. Now, log in to your account.

Step 3: Once you log in, you’ll see your project name, website URL and the last crawl date. Click the project name to see the reports.

Step 4: Click the “HTTP Status” link on the left sidebar of the reports interface and click the first drop-down list. Now choose the HTTP status code from the list of options displayed on the screen. Once you do so, click the second drop-down list and choose one of the following two options:
- Internal Links.
- External Links.

Click the “Filter” button. Website Crawler will now display the list of URLs that responded with the HTTP status code you selected in the first drop-down list. To see the page where the link was detected by Website Crawler, click the “Source” button and scroll down till you find the “Pages where the link ______ was found” section.
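
The status codes discussed in this post can also be checked with a few lines of Python. The sketch below is only an illustration and not how WC crawls. The names `classify_status`, `fetch_status` and `check_links` and the labels they return are made up, and treating 429 as the rate limiting response is an assumption about how the server is configured.

```python
import urllib.error
import urllib.request

def classify_status(code):
    """Map an HTTP status code to a coarse label for a broken-link report."""
    if code is None:
        return "unreachable"   # DNS, network, or firewall problem
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if code == 404:
        return "broken"
    if code == 410:
        return "gone"
    if code == 429:
        return "rate limited"  # assumption: the server answers 429 when throttling
    if code >= 500:
        return "server error"
    return "client error"

def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        # urlopen follows redirects, so the final status is what gets returned.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code      # 4xx and 5xx responses land here
    except (urllib.error.URLError, OSError):
        return None

def check_links(urls, fetch=fetch_status):
    """Return a dict mapping each URL to its label."""
    return {url: classify_status(fetch(url)) for url in urls}

# Offline demo: a made-up lookup table stands in for real network requests.
fake_responses = {
    "https://example.com/": 200,
    "https://example.com/old": 404,
    "https://example.com/removed": 410,
}
result = check_links(fake_responses, fetch=fake_responses.get)
print(result)
```

To check real pages, call `check_links` with a list of URLs and leave the `fetch` argument at its default.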

Conclusion: With Website Crawler, you can not only find broken URLs on your website but also discover pages that respond with a non-200 HTTP status code.

August 11, 2019

Crawl sitemap links using Website Crawler

A sitemap, as you may already know, is one of the most important parts of a website. It contains a list of the links of a website and helps search engines crawl/index pages that they may not otherwise find. Today, I have good news for the users of Website Crawler. WC can now crawl the links it finds in the sitemap file. Yes, that’s right. This feature was on our checklist and we have rolled it out today.

In case you’re wondering how to use the new XML crawl feature of Website Crawler, here’s a tutorial that you can follow.

How to make Websitecrawler crawl sitemap links?

Enter the direct link to the XML format sitemap file of your website in the large text box you see on the Website Crawler’s homepage. Don’t worry about the file’s size. WC can analyze sitemaps containing 25000+ URLs.

Once you enter the URL, enter the number of URLs you want the website crawler to crawl in the “Depth” textbox. Now, click the submit button.

Note: The free account of WC supports 550 URLs, i.e., no matter how big the sitemap file is, only up to 550 URLs will be processed. Our Silver (paid) plan supports 2500 URLs.

When you hit the submit button, WC will start crawling your website URLs. To see the status, i.e., the current URL being crawled and the list of processed URLs, click the “Status” button.

Once WC finishes processing the sitemap links, you’ll see a new form that asks you to enter your email address. Enter your email ID and click the submit button. WC will now send a verification email to your inbox. Enter the 3-digit code in the textbox and hit the “Submit” button. WC will now create your websitecrawler.org account through which you can see the On-Page SEO reports of your website.

Conclusion: If you want Website Crawler to analyze a fixed set of URLs or the links of the sitemap files only, enter the link to the sitemap file instead of a website URL in the large text box that you’ll find on WC’s home page.
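
For the curious, reading the links out of an XML sitemap takes only a few lines of Python. The sketch below is an illustration and not WC's crawler. The function name `sitemap_urls` and its `limit` parameter are hypothetical (the limit plays the role of a cap such as the 550 URLs mentioned above), and the sample sitemap is made up.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def sitemap_urls(xml_text, limit=None):
    """Return the <loc> values of a sitemap, capped at limit when given."""
    root = ET.fromstring(xml_text)
    urls = [loc.text.strip()
            for loc in root.findall("sm:url/sm:loc", NS)
            if loc.text]
    return urls if limit is None else urls[:limit]

# Made-up sitemap, for illustration only.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/</loc></url>
  <url><loc>https://example.com/contact/</loc></url>
</urlset>"""

print(sitemap_urls(SAMPLE, limit=2))
```

Each returned URL can then be fetched and analyzed one by one, which is why the crawl stays limited to the links listed in the file.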

August 5, 2019

Google deindexing your site? Learn how to find the root cause of this issue and fix it

Google dropping 1000s of pages from its index is a nightmare for bloggers, webmasters, developers, and online business owners. One of my sites has around 28000 pages. Google had indexed around 12000 pages of this site but in the last few months, it started dropping the pages from its index.

If you’re following SEO news closely, you might know that the Google deindexing bug has been the talk of the town of late. This bug has affected several large websites. I ignored the issue of “deindexing pages on my website”, thinking that the Google deindexing bug may be responsible for it. This was a dreadful mistake.

Google kept on dropping pages of my site from its index. A few weeks after spotting the issue, I re-checked the coverage report of Google Search Console, hoping that Google may have fixed the deindexing bug. I was shocked to find that the count of indexed pages had dropped from 12000 to 5670.

[Screenshot: Sitemap]

Did the Google deindexing bug affect my site?

No, it was a technical issue.

How I found and fixed the deindexing issue

I ran Website Crawler on my affected site. Then, I logged into my account. The first report I checked was the “Meta Robots” tag report. I suspected that the pages were being deindexed because one of my website’s functions was injecting a meta robots noindex tag in the website’s header, but I was wrong. This report was clean. Then, I opened the “HTTP Status” report to see whether all the pages on the site were working or not. Every page on the site had the HTTP status “200”. The next report I checked was the “Canonical Links” report. When I opened the report, I was shocked to find that several thousand pages of the affected website had an invalid canonical tag.

A few days after I fixed the issue, Google started indexing the deindexed pages.

Tip: If Website Crawler’s Canonical Links report interface displays false instead of true in the 3rd column, there’s a canonical link issue on the page that is displayed in the same row. See the below screenshot:

[Screenshot: What does the report look like?]

The issue on my site

The valid syntax for canonical links is as follows:

<link rel="canonical" href="" />

I mistakenly used “value” instead of “href” i.e. the canonical tag on my site looked like this:

<link rel="canonical" value="" />

The “value” attribute didn’t make sense there, and it confused Googlebot, Bingbot and other search bots. I fixed this issue and re-submitted the sitemap. Google started re-including the dropped pages (see the 3rd screenshot from the top).
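
A mistake like this is easy to catch with a small script. Here is a hedged Python sketch that checks whether a page has exactly one canonical tag with a non-empty href. It does not claim to reproduce the check behind WC's Canonical Links report. The names `CanonicalChecker` and `has_valid_canonical` and the two sample pages are made up.

```python
from html.parser import HTMLParser

class CanonicalChecker(HTMLParser):
    """Record every <link rel="canonical"> tag and any href it carries."""

    def __init__(self):
        super().__init__()
        self.canonical_tags = 0
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <link ... /> also reach this method.
        if tag != "link":
            return
        attr_map = dict(attrs)
        rel_values = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_values:
            self.canonical_tags += 1
            href = (attr_map.get("href") or "").strip()
            if href:
                self.hrefs.append(href)

def has_valid_canonical(html):
    """True when exactly one canonical tag exists and it has a non-empty href."""
    checker = CanonicalChecker()
    checker.feed(html)
    return checker.canonical_tags == 1 and len(checker.hrefs) == 1

# Made-up pages: the second one repeats the "value" mistake described above.
GOOD = '<head><link rel="canonical" href="https://example.com/page" /></head>'
BAD = '<head><link rel="canonical" value="https://example.com/page" /></head>'

print(has_valid_canonical(GOOD), has_valid_canonical(BAD))
```

Running such a check on a handful of templates before deploying a theme change would have flagged the missing href right away.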

Conclusion: If Google is deindexing 100s or 1000s of pages of your website, you should check the canonical and robots meta tags of the pages of your website. The issue may not be on Google’s side. A technical error like the one I’ve mentioned above may be responsible for it.

August 3, 2019