
Using Archive.org for OSINT Investigations

The Internet Archive's web archive, commonly known as the Wayback Machine, allows users to visit archived versions of websites. The Internet Archive has been archiving sites since 1996 and holds 514 billion archived web pages!

If you are wondering how you can use the Internet Archive in your OSINT research, you’ve come to the right place. There are many ways to extract important information from the Wayback Machine to further your OSINT investigations. If you need to see historical versions of a website because the site was deleted or replaced with new content, the Wayback Machine can help. You may need to verify that a target previously worked at a company when the current version of the company's site no longer lists the target. Sometimes a target may intentionally remove information from their present website, and looking at older captures of the site may reveal it. Sometimes you can gather relevant data like names, phone numbers, email addresses, and even metadata from older versions of a website. Let’s explore search methods…

Quick Search Methods:

If the site has been archived, a calendar view will appear with colour-coded dots, each colour having a different meaning. The blue dots are what you’ll want to click on, as they indicate a capture of the web page. Green dots indicate a redirect, orange dots indicate that the crawler received a client error, and red dots indicate a server error. Navigating the timeline will display the dates on which the site was archived.
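The colour scheme above can be written down as a small lookup. This is only a sketch: it assumes the colours line up with the standard HTTP status classes (2xx, 3xx, 4xx, 5xx), and the function name `dot_colour` is made up for illustration.

```python
def dot_colour(status_code: int) -> str:
    """Map an HTTP status code to the calendar dot colour described above.

    Assumption: blue = 2xx, green = 3xx, orange = 4xx, red = 5xx.
    """
    if 200 <= status_code < 300:
        return "blue"    # a capture of the page, usually the one to click
    if 300 <= status_code < 400:
        return "green"   # redirect
    if 400 <= status_code < 500:
        return "orange"  # client error
    if 500 <= status_code < 600:
        return "red"     # server error
    return "unknown"     # anything outside the ranges above


for code in (200, 301, 404, 503):
    print(code, dot_colour(code))
```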

Example of the timeline
Example of all the URLs archived from Osinttechniques.com

Other Search Methods:

Example: search www.myspace.com to see how the site has changed over time.

Blue dots are the most interesting to take a look at

Example: search for “osama bin laden” to see what results are revealed, or search for social media users such as the Facebook profile of Mark Zuckerberg: https://web.archive.org/web/*/www.facebook.com/zuck

  • Use the steps below to find the email address associated with uploaded files. For OSINT research, an email address you identify is another pivot point: you can search for it in other places such as search engines or social media sites.

Example: https://archive.org/details/FlintstonesWinstonCigaretteCommericals

  1. Scroll below to find “download options”
  2. Click on “show all” to display all files. 
  3. Click on the file that ends with “meta.xml”
  4. Press Ctrl+F and search for the word “uploader” to see the email address: donkeykongland2@yahoo.com
Click on the button ‘Show All’ displayed in the light grey box on the right
Click on the …meta.xml-file in the results.
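Once the meta.xml file has been downloaded, the Ctrl+F step can also be done in code. The sketch below assumes the file contains an `uploader` element somewhere in its XML tree. The sample XML and the address in it are hypothetical placeholders, not real data, and `find_uploader` is an illustrative name.

```python
import xml.etree.ElementTree as ET
from typing import Optional

# Hypothetical sample shaped like an item's meta.xml (placeholder values only).
SAMPLE_META_XML = """<metadata>
  <identifier>example-item</identifier>
  <title>Example upload</title>
  <uploader>someone@example.com</uploader>
</metadata>"""


def find_uploader(meta_xml: str) -> Optional[str]:
    """Return the text of the first <uploader> element, or None if absent."""
    root = ET.fromstring(meta_xml)
    node = next(root.iter("uploader"), None)  # search the whole tree
    if node is None or node.text is None:
        return None
    return node.text.strip()


print(find_uploader(SAMPLE_META_XML))
```

If the function returns None, the file simply has no uploader field and there is nothing to pivot on.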

Use Collections and Changes (beta):

  • Collections are a way to learn why a URL has been archived into the Wayback Machine. 

Example: https://web.archive.org/web/collections/2020*/osinttechniques.com

  • Changes allows users to select two different versions of a URL and compare them side by side.

Example: https://web.archive.org/web/changes/osinttechniques.com

Learn more about Collections and Changes here: https://blog.archive.org/2019/10/18/the-wayback-machine-fighting-digital-extinction-in-new-ways
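The example links in this post all follow simple URL patterns, so they are easy to generate for any target. The helpers below are a minimal sketch that mirrors the patterns visible in the examples above. The function names are illustrative, and the patterns are inferred from those example links rather than from official documentation.

```python
WAYBACK = "https://web.archive.org/web"


def calendar_url(target: str) -> str:
    """Calendar view of all captures for a URL (the /web/*/ pattern)."""
    return f"{WAYBACK}/*/{target}"


def collections_url(target: str, year: int) -> str:
    """Collections view for a URL, limited to one year (e.g. 2020*)."""
    return f"{WAYBACK}/collections/{year}*/{target}"


def changes_url(target: str) -> str:
    """Changes view for comparing two captures of a URL."""
    return f"{WAYBACK}/changes/{target}"


print(calendar_url("www.facebook.com/zuck"))
print(collections_url("osinttechniques.com", 2020))
print(changes_url("osinttechniques.com"))
```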

Saving Pages:

  • Use https://archive.org/web/ to request that a page be archived. The save button is visible at the bottom right of the screen, or you can go directly to https://web.archive.org/save. This “Save Page Now” option captures only that particular page, not the entire website, and only works for sites that allow crawlers. The screenshot below shows an article from OSINT Curious being saved to the archive.
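A save request can also be scripted. The sketch below assumes that appending the page address to https://web.archive.org/save/ and fetching the result triggers a capture. That behaviour, the User-Agent string, and the function names are assumptions for illustration, and the service may rate-limit or refuse such requests.

```python
import urllib.request

SAVE_ENDPOINT = "https://web.archive.org/save/"


def save_url(page_url: str) -> str:
    """Build the Save Page Now address for a single page (assumed pattern)."""
    return SAVE_ENDPOINT + page_url


def request_capture(page_url: str, timeout: int = 60) -> str:
    """Ask the archive to capture page_url and return the URL the response ends on.

    This makes a live network request, so call it sparingly.
    """
    req = urllib.request.Request(
        save_url(page_url),
        headers={"User-Agent": "osint-notes-example"},  # placeholder value
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.geturl()


print(save_url("https://example.com/page"))
```

Only `save_url` runs here. `request_capture` is left uncalled so the snippet has no side effects until you call it yourself.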

For sourcing purposes it may be important to understand when something was saved by the Internet Archive. Let’s look at the link below: 

https://web.archive.org/web/20180214034336/http://www.osinttechniques.com

The number in the middle is formatted as yyyymmddhhmmss, so this capture was crawled on February 14, 2018 at 03:43:36 (Wayback Machine timestamps are in UTC).
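When you are documenting many sources, it can help to pull the capture time out of each link automatically. The sketch below extracts the 14-digit timestamp from a Wayback Machine URL and parses it. The function name `snapshot_time` is illustrative, and the timestamp is treated as UTC.

```python
import re
from datetime import datetime, timezone

# Matches the 14-digit yyyymmddhhmmss block that follows /web/ in a capture URL.
SNAPSHOT_RE = re.compile(r"/web/(\d{14})")


def snapshot_time(wayback_url: str) -> datetime:
    """Return the capture time encoded in a Wayback Machine URL."""
    match = SNAPSHOT_RE.search(wayback_url)
    if match is None:
        raise ValueError("no 14-digit timestamp found in URL")
    parsed = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    return parsed.replace(tzinfo=timezone.utc)  # treated as UTC


url = "https://web.archive.org/web/20180214034336/http://www.osinttechniques.com"
print(snapshot_time(url).isoformat())  # 2018-02-14T03:43:36+00:00
```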

What if the site you are investigating isn’t on the Internet Archive? Some sites will not be on Archive.org because of their robots.txt files or because the website owner has requested that the site not be archived.

However, you have other options, such as searching for cached content as described in this blog post: https://osintcurio.us/2019/02/12/osint-on-deleted-content. You can also check other online archives such as archive.today.
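Before moving on to other archives, you can check in code whether the Wayback Machine holds any capture of a URL. The sketch below uses the Wayback availability endpoint at https://archive.org/wayback/available. The function names are illustrative, and the response shape (an `archived_snapshots` object with a `closest` entry) is an assumption worth confirming against a live response.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional


def availability_url(target: str) -> str:
    """Build the availability query for a target URL."""
    query = urllib.parse.urlencode({"url": target})
    return "https://archive.org/wayback/available?" + query


def closest_snapshot(payload: dict) -> Optional[str]:
    """Return the closest snapshot URL from a response, or None if there is none."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def check(target: str, timeout: int = 30) -> Optional[str]:
    """Query the endpoint over the network (not called in this snippet)."""
    with urllib.request.urlopen(availability_url(target), timeout=timeout) as resp:
        return closest_snapshot(json.load(resp))


print(availability_url("osinttechniques.com"))
```

A None result from `check` means the endpoint reported no capture, which is the cue to try cached copies or another archive.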