How to Find All Existing and Archived URLs on a Website

There are several reasons you might need to find all the URLs on a website, and your exact goal will determine what you're searching for. For instance, you may want to:

Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won't give you everything you need. Unfortunately, Google Search Console isn't exhaustive, and a "site:example.com" search is limited and hard to extract data from.

In this post, I'll walk you through some tools to build your URL list, then show how to deduplicate the data using a spreadsheet or Jupyter Notebook, depending on your website's size.

Old sitemaps and crawl exports
If you're looking for URLs that disappeared from the live site recently, there's a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven't already, check for these files; they can often provide what you need. But if you're reading this, you probably didn't get so lucky.
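If you do turn up an old sitemap, extracting its URLs takes only a few lines. Here's a minimal Python sketch, assuming a standard sitemaps.org-format file saved as old-sitemap.xml (a placeholder name):

```python
import xml.etree.ElementTree as ET

# Sitemaps use the sitemaps.org namespace; <loc> elements hold the URLs.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

tree = ET.parse("old-sitemap.xml")  # placeholder path to your saved sitemap
urls = [loc.text.strip() for loc in tree.getroot().findall(".//sm:loc", NS)]

print(f"{len(urls)} URLs recovered")
```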

Archive.org
Archive.org is a valuable tool for SEO tasks, funded by donations. If you search for a domain and select the "URLs" option, you can access up to 10,000 listed URLs.

However, there are a few limitations:

URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn't a built-in way to export the list.
To bypass the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limitations mean Archive.org may not provide a complete solution for larger sites. Also, Archive.org doesn't indicate whether Google indexed a URL, but if Archive.org found it, there's a good chance Google did, too.
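Alternatively, the Wayback Machine offers a CDX API that returns captured URLs as plain text, which sidesteps the missing export button. A minimal sketch, with example.com standing in for your domain:

```python
import requests

# Query the Wayback Machine CDX API for captured URLs across the domain.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com",        # placeholder domain
        "matchType": "domain",       # include subdomains
        "fl": "original",            # return only the original URL field
        "collapse": "urlkey",        # one row per unique URL
        "filter": "statuscode:200",  # drop redirects and errors
    },
    timeout=120,
)
resp.raise_for_status()
archived_urls = resp.text.splitlines()
print(f"{len(archived_urls)} archived URLs found")
```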

Moz Pro
While you might typically use a link index to find external sites linking to you, these tools also discover URLs on your own site in the process.


How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you're dealing with a massive website, consider using the Moz API to export data beyond what's manageable in Excel or Google Sheets (a rough sketch follows below).

It's important to note that Moz Pro doesn't confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz's bots as they do to Google's, this method generally works well as a proxy for Googlebot's discoverability.
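For scale, an API pull might look roughly like the sketch below. Treat it strictly as a sketch: the endpoint, parameter names, and response shape are assumptions about Moz's v2 Links API, so verify everything against Moz's current documentation before relying on it.

```python
import requests

# NOTE: the endpoint, body fields, and response keys below are assumptions
# about the Moz Links API v2; check them against the current docs.
resp = requests.post(
    "https://lsapi.seomoz.com/v2/links",         # assumed v2 endpoint
    auth=("YOUR_ACCESS_ID", "YOUR_SECRET_KEY"),  # placeholder credentials
    json={
        "target": "example.com/",       # placeholder site
        "target_scope": "root_domain",  # assumed parameter name
        "limit": 50,
    },
    timeout=60,
)
resp.raise_for_status()
# "results" and "target" are assumed response keys.
target_urls = {link["target"] for link in resp.json().get("results", [])}
```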

Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.

Links reports:


Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don't apply to the export, you might need to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.

Performance → Search Results:


This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets. There are also free Google Sheets plugins that simplify pulling more extensive data.
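For illustration, here's roughly what pulling every page with impressions looks like via the API using Google's google-api-python-client; the key file, property URL, and dates are placeholders:

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Assumes a service-account key with access to the Search Console property.
creds = service_account.Credentials.from_service_account_file(
    "service-account.json",  # placeholder key file
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = set(), 0
while True:
    response = service.searchanalytics().query(
        siteUrl="https://example.com/",  # placeholder property
        body={
            "startDate": "2024-01-01",   # placeholder date range
            "endDate": "2024-12-31",
            "dimensions": ["page"],
            "rowLimit": 25000,           # API maximum per request
            "startRow": start_row,
        },
    ).execute()
    rows = response.get("rows", [])
    pages.update(row["keys"][0] for row in rows)
    if len(rows) < 25000:  # last page of results
        break
    start_row += 25000

print(f"{len(pages)} pages with impressions")
```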

Indexing → Pages report:


This section provides exports filtered by issue type, though these too are limited in scope.

Google Analytics
The Engagement → Pages and Screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.


Even better, you can apply filters to create distinct URL lists, effectively surpassing the 100k limit. For example, if you want to export only blog URLs, follow these steps:

Step 1: Add a segment to the report

Step 2: Click "Create a new segment."


Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/


Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they offer valuable insights.
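If you'd rather pull this report programmatically, the GA4 Data API exposes the same page-path data. A minimal sketch using Google's google-analytics-data Python client, assuming a service-account key is configured via GOOGLE_APPLICATION_CREDENTIALS (the property ID and dates are placeholders):

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange, Dimension, Metric, RunReportRequest,
)

# Assumes GOOGLE_APPLICATION_CREDENTIALS points at a service-account key.
client = BetaAnalyticsDataClient()

request = RunReportRequest(
    property="properties/123456789",  # placeholder GA4 property ID
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-12-31")],
    limit=100000,  # mirrors the UI report's row cap
)
response = client.run_report(request)
paths = [row.dimension_values[0].value for row in response.rows]
print(f"{len(paths)} page paths collected")
```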

Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.

Considerations:

Data size: Log files can be massive, so many sites only retain the last two weeks of data.
Complexity: Analyzing log files can be challenging, but various tools are available to simplify the process (or see the sketch below for a do-it-yourself starting point).
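If you just want the unique URL paths, a short Python script is a workable starting point for logs in the common Apache/Nginx format (the filename is a placeholder):

```python
import re

# Matches the request line in common/combined log format: "GET /path HTTP/1.1"
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+"')

paths = set()
with open("access.log", encoding="utf-8", errors="replace") as f:  # placeholder file
    for line in f:
        match = REQUEST_RE.search(line)
        if match:
            paths.add(match.group(1).split("?")[0])  # drop query strings

print(f"{len(paths)} unique paths logged")
```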
Combine, and good luck
Once you've gathered URLs from these sources, it's time to combine them. If your site is small enough, use Excel; for larger datasets, use tools like Google Sheets or a Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
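In a notebook, a short pandas sketch handles the normalization and deduplication; the input filenames are placeholders for whatever exports you collected above:

```python
from urllib.parse import urlsplit, urlunsplit
import pandas as pd

def normalize(url: str) -> str:
    # Lowercase scheme and host, keep path case, drop fragments and trailing slash.
    parts = urlsplit(url.strip())
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))

# Placeholder filenames: one headerless CSV of URLs per source.
frames = [pd.read_csv(name, header=None, names=["url"]) for name in
          ("archive_org.csv", "gsc.csv", "ga4.csv", "logs.csv")]
urls = pd.concat(frames, ignore_index=True)["url"].dropna().map(normalize)

unique_urls = urls.drop_duplicates().sort_values()
unique_urls.to_csv("all_urls.csv", index=False, header=False)
print(f"{len(unique_urls)} unique URLs")
```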

And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!
