How to Find All Existing and Archived URLs on a Website

There are many reasons you might need to find all of the URLs on a website, but your exact goal will determine what you're looking for. For example, you might want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.

In this post, I'll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your website's size.

Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
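If you do turn up an old sitemap, extracting the URLs from it takes only a few lines. Here's a minimal Python sketch, assuming a standard XML sitemap saved locally (the filename old-sitemap.xml is just an example):

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace used by most generators
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("old-sitemap.xml")  # example filename for a saved sitemap
urls = [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", NS)]

with open("sitemap-urls.txt", "w") as f:
    f.write("\n".join(urls))

print(f"Extracted {len(urls)} URLs")
```

If the file is a sitemap index rather than a regular sitemap, the <loc> entries it collects will be child sitemaps rather than pages, so you'd repeat the step for each of those files.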

Archive.org
Archive.org is a valuable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

That said, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To work around the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org saw it, there's a good chance Google did, too.
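If the interface limits get in the way, the Wayback Machine also exposes its index through the CDX API, which you can query directly. Here's a minimal sketch; the parameter choices and the 50,000-row cap are assumptions to check against the CDX documentation for your use case:

```python
import requests

# Query the Wayback Machine CDX index for every captured URL on a domain
params = {
    "url": "example.com/*",      # your domain here
    "output": "json",
    "fl": "original",            # return only the original-URL column
    "collapse": "urlkey",        # collapse repeat captures of the same URL
    "filter": "statuscode:200",  # optional: successful captures only
    "limit": 50000,              # assumed cap; adjust or paginate as needed
}
resp = requests.get("https://web.archive.org/cdx/search/cdx", params=params, timeout=60)
resp.raise_for_status()
rows = resp.json()

# The first row is the header; remaining rows each hold one URL
urls = {row[0] for row in rows[1:]}
print(f"Found {len(urls)} archived URLs")
```

You'll still want to filter out resource files (images, scripts) afterwards, since the quality caveat above applies to API results as well.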

Moz Pro
While you'd typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets.

It's important to note that Moz Pro doesn't verify whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
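If you do reach for the API, the request looks roughly like the sketch below. This is based on the Moz Links API v2; the endpoint, authentication scheme, parameter names, and response shape are assumptions here, so confirm them against Moz's current documentation:

```python
import requests

ACCESS_ID = "your-access-id"    # credentials from your Moz account
SECRET_KEY = "your-secret-key"

# Assumed Moz Links API v2 endpoint -- verify against Moz's docs
endpoint = "https://lsapi.seomoz.com/v2/links"

payload = {
    "target": "example.com/",       # your site
    "target_scope": "root_domain",  # assumed parameter name
    "limit": 50,
}

resp = requests.post(endpoint, json=payload, auth=(ACCESS_ID, SECRET_KEY), timeout=60)
resp.raise_for_status()

# Inspect the response and pull out the target-page URLs; the exact field
# names depend on the API version you're using
print(resp.json())
```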

Google Search Console
Google Search Console provides a number of valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you may need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
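For reference, here's a minimal sketch of pulling page URLs through the Search Console API with the official Python client. It assumes a service account that has been granted access to the property; the key file, property URL, and dates are placeholders:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholder key file; any authorized credentials object works here
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

request = {
    "startDate": "2024-01-01",   # placeholder date range
    "endDate": "2024-03-31",
    "dimensions": ["page"],
    "rowLimit": 25000,           # page through results for larger sites
}

response = service.searchanalytics().query(
    siteUrl="https://www.example.com/", body=request
).execute()

pages = [row["keys"][0] for row in response.get("rows", [])]
print(f"Retrieved {len(pages)} pages with impressions")
```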

Indexing → Pages report:


This section provides exports filtered by issue type, though these are also limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create distinct URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to your report.

Step 2: Click "Create a new segment."

Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they still provide valuable insights.
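If you'd rather skip the UI, the GA4 Data API can return page paths directly, including a filter that mirrors the /blog/ segment above. Here's a minimal sketch with the official Python client; the property ID and dates are placeholders:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Filter, FilterExpression, Metric, RunReportRequest,
)

# Uses application-default credentials (GOOGLE_APPLICATION_CREDENTIALS)
client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
    # Optional filter: only paths containing /blog/, like the segment above
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                value="/blog/",
                match_type=Filter.StringFilter.MatchType.CONTAINS,
            ),
        )
    ),
    limit=100000,
)

paths = [row.dimension_values[0].value for row in client.run_report(request).rows]
print(f"Retrieved {len(paths)} page paths")
```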

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Things to consider:

Data size: Log files can be huge, so many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process (a minimal parsing sketch follows below).
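As a starting point, here's a minimal sketch that pulls unique request paths out of a common/combined-format access log; the filename and the regex are assumptions to adapt to whatever your server or CDN actually writes:

```python
import re
from urllib.parse import urlparse

# Matches the request line in common/combined log format, e.g. "GET /page HTTP/1.1"
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as f:  # example filename
    for line in f:
        match = REQUEST_RE.search(line)
        if match:
            # Drop query strings so /page?utm=x and /page count as one URL
            paths.add(urlparse(match.group(1)).path)

with open("log-paths.txt", "w") as out:
    out.write("\n".join(sorted(paths)))

print(f"Found {len(paths)} unique paths")
```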
Combine, and good luck
Once you've collected URLs from all of these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Make sure all URLs are consistently formatted, then deduplicate the list.
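If you take the Jupyter Notebook route, a few lines of pandas handle the merge, normalization, and deduplication. The filenames below are just examples standing in for the exports collected earlier:

```python
import pandas as pd

# Example filenames -- one single-column URL export per source
files = ["archive-org.csv", "gsc-pages.csv", "ga4-paths.csv", "log-paths.txt"]

urls = pd.concat(
    [pd.read_csv(f, header=None, names=["url"]) for f in files],
    ignore_index=True,
)["url"].dropna().astype(str)

# Normalize trivially different forms so they don't survive deduplication
urls = (
    urls.str.strip()
        .str.replace(r"^http://", "https://", regex=True)
        .str.rstrip("/")
)

unique_urls = urls.drop_duplicates().sort_values()
unique_urls.to_csv("all-urls.csv", index=False, header=["url"])
print(f"{len(unique_urls)} unique URLs")
```

Whether forcing http to https or stripping trailing slashes is the right normalization depends on how your site actually resolves those variants, so adjust those lines to match your setup.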

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
