We leak so much information about ourselves.
I was listening to a podcast recently in which one of the hosts described giving ChatGPT the history of the questions he had asked it and then asking it to build a profile on him based on that.
It was interesting, of course, but it got me to thinking… why not use internet searches as the input? There’s much more information available in our search histories, since we use search for more general-purpose things.
I was not disappointed, a little surprised, and more than a little impressed.
Using AI to build a profile from search history
I used AI tools to build a personal profile of me based on my Google search history. Claude AI used 16,000 searches over five years, but ChatGPT managed with just 126 over six months. Both provided surprisingly accurate profiles. The upshot is that benign data can be used to construct detailed personal profiles. We all need to be aware of online data leakage and what we expose when we use the internet.
My search history
My Google search history goes back to 2006 and has 48,392 entries. The first entry was “signed up for” (I have no idea why), and the most recent search was “download search history” (for obvious reasons). The whole list was over 900 megabytes in size.
That’s a lot of data to chew on.
In fact, it was too much.
I started with Claude because it can accept much more data in a query than ChatGPT can. Unfortunately, 48,000 search terms were still over the limit. In fact, I had to cut it back to 16,000 searches representing the most recent five years.
I wrote this query:
Provide a profile for the person whose search history is as follows:
followed by 16,000 searches.
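For anyone wanting to repeat the experiment, here is a minimal Python sketch of how a trimmed query like this might be assembled. It assumes the search history has already been loaded as a list of (timestamp, search text) pairs. The function name `build_profile_prompt`, its parameters, and the sample data are hypothetical; they are not taken from any export format or AI tool.

```python
from datetime import datetime, timedelta

PROMPT_HEADER = "Provide a profile for the person whose search history is as follows:"


def build_profile_prompt(history, now, years=5, max_searches=16000):
    """Keep the most recent searches and prepend the prompt text.

    history: list of (datetime, str) pairs -- an assumed in-memory format.
    now: reference time for the cutoff (passed in so results are repeatable).
    """
    cutoff = now - timedelta(days=365 * years)  # approximate: ignores leap days
    recent = [(when, text) for when, text in history if when >= cutoff]
    recent.sort(key=lambda item: item[0], reverse=True)  # newest first
    kept = recent[:max_searches]  # cap the count to stay under the input limit
    lines = [PROMPT_HEADER] + [text for _, text in kept]
    return "\n".join(lines)


# Made-up sample data for illustration only.
sample_history = [
    (datetime(2006, 3, 1), "signed up for"),
    (datetime(2022, 7, 15), "corgi training tips"),
    (datetime(2023, 11, 20), "download search history"),
]

prompt = build_profile_prompt(sample_history, now=datetime(2023, 12, 1))
print(prompt)
```

With the sample data, the 2006 entry falls outside the five-year window and is dropped, while the two recent searches are listed newest first under the prompt text.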
Here’s what Claude had to say:
Based on the search history provided, here is a possible profile for this person:
– Male, likely in his 40s or 50s
– Lives in the Seattle area
– Works in tech, possibly as a software developer or IT professional
– Interested in computers, technology, online security, Linux/Unix, web development (WordPress, PHP, etc)
– Owns a Nikon digital SLR camera
– Dog owner, possibly of a Welsh Corgi
– Enjoys watching TV shows like Game of Thrones, Westworld, Vikings
– Travels occasionally, including some international trips
– Financially savvy; invests and tracks finances online
– Health-conscious; watches diet and nutrition
– Politically left-leaning
Some key details that point to this profile:
– Technical topics like Linux command line, security tips, and web development suggest a tech professional
– Interest in Nikon cameras and Photoshop indicates a photography hobby
– Corgi-related searches imply a Corgi owner
– TV interests point to sci-fi/fantasy and historical dramas
– Travel searches about things like passport renewal and international banking
– Finance-related searches on stocks, Bitcoin, and bank accounts
– Nutrition/diet searches suggest health-consciousness
– Political searches on topics like ACLU donations suggest left-leaning views
The combination of tech-related, finance, photography and pet hobbies indicates a male in his 40s or 50s who likely lives in the Seattle area and works in a tech-related job. Some details may be guesses, but the search topics provide a reasonably detailed profile.
It’s not far off. (Though I chuckled that it thinks I’m in my 40s or 50s.) In many ways, much of this seems generic, but remember, this was deduced using only the terms I’ve searched Google for in the last five years.
I was impressed.
ChatGPT’s limits are such that I could give it only about 126 searches, covering the last six months.
Provide a profile for the person whose search history is as follows:
followed by 126 searches.
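A tighter input limit can be handled by adding the most recent searches one at a time until a size budget is used up. The Python sketch below shows one way that might look. The function `fit_to_budget`, the character budget, and the sample searches are all hypothetical. Real limits are generally measured in tokens rather than characters, so counting characters is only a rough stand-in.

```python
def fit_to_budget(searches, header, max_chars):
    """Return as many of the leading searches as fit within max_chars.

    searches: list of str, assumed to be ordered newest first.
    max_chars: a hypothetical character budget for the whole prompt.
    """
    kept = []
    used = len(header)
    for text in searches:
        cost = len(text) + 1  # +1 for the newline that joins each line
        if used + cost > max_chars:
            break  # stop at the first search that would exceed the budget
        kept.append(text)
        used += cost
    return kept


header = "Provide a profile for the person whose search history is as follows:"
# Made-up, newest-first sample data for illustration only.
searches = ["download search history", "python regex", "macos cron alternative"]

kept = fit_to_budget(searches, header, max_chars=len(header) + 40)
prompt = "\n".join([header] + kept)
print(len(kept), "searches kept")
```

Stopping at the first search that does not fit keeps the result a contiguous run of the most recent searches, rather than a patchwork of whichever short ones happen to squeeze in.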
Here’s what ChatGPT had to say:
Based on this search history, the profile of this person appears to be quite diverse, indicating a range of interests and knowledge areas:
- Tech-Savvy and Problem Solver: Searches related to downloading search history, using regex in Python, and troubleshooting various tech issues (like macOS cron alternatives and dealing with ‘operation not permitted’ errors) suggest a strong familiarity with technology and an inclination towards solving technical problems.
- Interest in Geography and Travel: Queries like distances between countries, converting coordinates to decimal degrees, and looking up country codes indicate an interest in geography or travel.
- Automotive Knowledge: Searches about car values, vacuum line diameters in automobiles, and other car-related inquiries suggest a knowledge of, or interest in, automotive topics.
- Cultural Interests: The person seems to enjoy entertainment and culture, evident from searches about movies and TV shows (like MCU movies, ‘Good Omens’, and ‘Aliens’), indicating an interest in popular culture.
- Healthcare Awareness: Searches about end-of-life breathing patterns suggest either a professional or personal interest in healthcare or medical conditions.
- Food and Local Activities: Looking up local markets, restaurants, and specific menu items suggests an interest in local activities and food.
- Academic Inclinations: The search for the meaning of ‘adjunct professor’ and various educational content indicates an interest in academia or education.
- Digital Content Creation and Management: Queries about converting and editing images, managing digital content through platforms like Google Drive, Dropbox, and OneDrive, and using software like DaVinci and Photoshop point towards a proficiency in digital content creation and management.
- General Curiosity and Diverse Interests: The wide array of topics, from combinatorics to ham radio antenna connectors, suggests a personality that is curious and enjoys learning about a wide range of subjects.
This person appears to be well-rounded with diverse interests spanning technology, entertainment, geography, automotive, health, and more, indicating a continuous pursuit of knowledge and understanding in various domains.
Once again, it feels generic, but based on only 126 queries, the profile is pretty amazing from my perspective.
“An inclination towards solving technical problems” — imagine that.
Yes, there were blatant errors. My search for vacuum line diameters in automobiles had nothing to do with cars and everything to do with covering the perches on our bird feeders. I guess the confusion is understandable. And I have no idea what the context was for “end-of-life breathing patterns” — while I remember searching for it, I have no memory of why. I suspect it was a passing interest rather than something contributing to “Healthcare Awareness”.
Why does this matter?
I know many people are concerned about the amount of data being collected about them. I think this is a good example of how an arbitrary collection of what could be seen as completely benign data can be used to build a profile on someone. That profile could be used by businesses — or hackers — to better target their efforts at you directly, for example.
I still believe that by and large, you and I just aren’t that interesting and don’t merit the CPU cycles to build such a profile. For example, I’m not concerned that Google has 17 years of my search history. It’s not that exciting, trust me.
But I know some people should be concerned because of the circumstances of their lives, and I know some people will be concerned no matter how uninteresting they might be.
And this isn’t really about search history. While that’s one way we expose information about ourselves, it’s certainly not the only one. Consider the posts you make on social media, the websites you visit without searching, and more. While I absolutely believe that you and I are simply not that interesting, it’s worth noting the different ways that information might be getting collected and aggregated, in case you happen to be one of the few who really are interesting.
It would be interesting (albeit more difficult) to run a similar analysis based on email or social media posts. That would tell a LOT about people.
The bottom line here is simply to remain aware of the amount of information you’re leaking just by doing whatever you do online. You can, of course, delete your search history if you’re so inclined, but in general it’s just not worth the effort.