Find Company Data With A Programmable Search Engine

Being able to find sensitive information about an organisation is a key skill for OSINT practitioners. Whether you’re doing recon for a phishing engagement or you’re an investigative journalist looking for documents, being able to filter out the noise and find useful information relating to companies and institutions is essential.

Using Google dorks is a useful technique for filtering searches. For example, a dork to find PDF files from the UK government might look something like this:

site:gov.uk filetype:pdf

Dorks are powerful, but it can get repetitive to enter the same dorks for the same sites over and over again.

If you regularly perform a repetitive OSINT task, it’s worth looking at ways to make the process more efficient by automating as much as you can. In the rest of this post we’ll look at how you can harness the power of Google to create your own programmable search engine and make finding interesting data easier. A search engine like the one in this walkthrough can help you find files like PDFs or Office docs, emails, phone numbers, or even sensitive business information.

Programmable Search Engines

Google lets you create Programmable Search Engines: highly customised search tools that suit your specific needs. To create one you’ll need to sign into your Google account and head to https://programmablesearchengine.google.com. Once you’ve signed in, click on “Create a new search engine”. You’ll see a menu like this:

Give the search engine a name. We’ll call it “Leaked Info Search”.

Next we need to tell Google where exactly we want to search. This is the most important part of configuring the tool so it’s worth spending a little time to get right.

In the Google dork example at the top of this article, we used the site: filter to tell Google to search within only one particular domain. This is useful – but what if we want to search within the same 10-20 sites every time? Writing out a search query for this each time would be slow and tedious. Instead, with a programmable search engine we only have to tell Google which sites to search once, and then we can reuse the same search engine every time.
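To see why the manual approach gets unwieldy, here is a minimal Python sketch that builds a single dork covering several sites by joining site: filters with OR. The function name and the example domains are illustrative assumptions, not part of any Google tooling.

```python
def build_multi_site_dork(sites, extra_terms=""):
    """Join several site: filters with OR and append any extra search terms."""
    site_clause = " OR ".join(f"site:{s}" for s in sites)
    # Parentheses group the OR'd sites so the extra terms apply to all of them.
    query = f"({site_clause})"
    if extra_terms:
        query = f"{query} {extra_terms}"
    return query


# Example domains only; swap in whichever sites you search regularly.
sites = ["pastebin.com", "slideshare.net", "scribd.com"]
print(build_multi_site_dork(sites, "filetype:pdf"))
```

With 10-20 domains the resulting query becomes long and awkward to maintain by hand, which is the problem a programmable search engine removes.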

Adding Sites To Search

Add the domains you want to search through.

To add a site, select the option “Search specific sites or pages”. Since I’m going to be looking for leaked information, I want to include Pastebin content in all my searches.

To search the entire pastebin.com domain, enter *.pastebin.com/* and then click on Add. Adding the wildcard * before and after the domain ensures that Google will include data from the entire domain (provided that Google has indexed that part of the site).

The domains that you add to the search engine will depend on exactly the kind of data you want to find. Here are a few that you might want to consider:

slideshare.net – companies often post presentations here in order to share them with third parties. You can find information about sales projections, tech stacks, email addresses and company personnel here.

Adding slideshare.net to the search engine

GitHub.com – developers leave sensitive code, private keys, email addresses and other useful snippets here.

StackOverflow.com – the world’s biggest site for code troubleshooting. Poor opsec means that developers sometimes post unredacted company data here when asking others for help.

Scribd.com – users upload and share PDFs and documents of all kinds.

Trello.com – Trello is a collaboration and project management platform. Find employee names, company project information, contact details, calendars and (yes) sometimes even passwords and sensitive documents.

s3.amazonaws.com – adding this domain will surface open Amazon S3 buckets containing a wide range of documents and files.

Chegg.com – this is a flashcard learning site where users upload and share facts that they’re trying to learn. Bellingcat discovered that US military personnel were uploading sensitive information about nuclear weapons here while studying for a test.

These are just a few examples. There’s no limit to the number of domains you can add, and you can always add or remove domains as your needs change.
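If you keep your domain list in a script or text file, a small helper can turn each entry into the *.domain/* form described earlier, ready to add in the control panel. This is a sketch only: the function name is an assumption, and the clean-up rules (dropping a scheme, a path and a leading "www.") are just one reasonable way to normalise input.

```python
def to_site_pattern(domain):
    """Turn a domain or URL into the *.domain/* wildcard form."""
    domain = domain.strip().lower()
    # Drop a scheme if the entry was pasted as a full URL.
    for prefix in ("https://", "http://"):
        if domain.startswith(prefix):
            domain = domain[len(prefix):]
    # Keep only the host part, discarding any path.
    domain = domain.split("/", 1)[0]
    # The leading wildcard already covers subdomains such as www.
    if domain.startswith("www."):
        domain = domain[len("www."):]
    return f"*.{domain}/*"


# The domains suggested in this post.
domains = [
    "pastebin.com", "slideshare.net", "Github.com", "StackOverflow.com",
    "Scribd.com", "Trello.com", "s3.amazonaws.com", "Chegg.com",
]
for d in domains:
    print(to_site_pattern(d))
```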

Once you’re ready, complete the captcha and click “Create”.

Using Your Search Engine

Once your search engine has been created, it’ll have its own URL which will look something like https://cse.google.com/cse?cx=xxxxxxxxxxxxx. You’ll be able to access your search engine by visiting this URL directly. Google also provides the option to embed the search engine in your own webpage.
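The cx value in that URL is the search engine’s ID, and the same ID can be used to query the engine from a script. The sketch below builds a request URL for Google’s Custom Search JSON API; it assumes you have created an API key separately (not covered in this walkthrough), and the key and ID shown are placeholders. It only constructs the URL and does not send any request.

```python
from urllib.parse import urlencode

# Endpoint of Google's Custom Search JSON API.
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_api_url(api_key, engine_id, query, start=1):
    """Build a Custom Search JSON API request URL for the given engine."""
    params = {
        "key": api_key,   # your API key (placeholder below)
        "cx": engine_id,  # the cx value from your search engine's URL
        "q": query,       # the search terms, dorks included
        "start": start,   # index of the first result to return
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


# Placeholder values; replace with your own key and engine ID.
print(build_api_url("YOUR_API_KEY", "xxxxxxxxxxxxx", "volkswagen.com filetype:pdf"))
```

Fetching that URL returns results as JSON, which may be handy if you want to log or post-process hits rather than read them in the browser. Check Google’s current quota and pricing before automating requests.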

Here’s a quick example. Let’s say I want to gather information about Volkswagen, the car manufacturer. Instead of using the regular Google search engine, I can use my bespoke search engine, since I know these particular sites are where I’m most likely to find the data I want.

A straightforward search for “volkswagen.com” brings up software projects, marketing slides and other material on the first page:

There’s plenty of other information that might be of interest too. Here are a few examples:

VW financial data left in a third party S3 bucket.
Company training documents shared on Scribd.com.
Employee email addresses in an internship application left on a file sharing site.
“This document shall be treated as confidential” – but it won’t be if it gets left in an open S3 bucket and indexed by Google…

By using the Programmable Search Engine there’s no need to run the same dorks over the same sites repeatedly. Restricting the search to a handful of domains means finding focused, useful results is much quicker and easier. Even though we’re searching through a more limited number of domains, it’s still possible to use regular Google filters like filetype:, intext: or - (negation) to fine-tune the results.
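As a rough illustration of combining those filters, the following Python sketch assembles a query string to paste into the custom search engine. The helper name and its parameters are hypothetical; the operators themselves are the filetype:, intext: and - filters mentioned above.

```python
def refine_query(term, filetype=None, intext=None, exclude=()):
    """Combine a search term with optional filetype:, intext: and - filters."""
    parts = [f'"{term}"']  # quote the term so it is matched as a phrase
    if filetype:
        parts.append(f"filetype:{filetype}")
    if intext:
        parts.append(f'intext:"{intext}"')
    # Each excluded word is prefixed with - to remove matching results.
    parts.extend(f"-{word}" for word in exclude)
    return " ".join(parts)


print(refine_query("volkswagen.com", filetype="pdf",
                   intext="confidential", exclude=["careers"]))
```

Because the site restrictions live in the search engine itself, the query only needs to carry the term and the filters.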

A custom search engine can be a very effective tool, and with regular tweaking and refinement it makes a very useful addition to your OSINT toolbox.