Accidental Datasets

We’ve been using a lot of datasets in class. And I mean a lot. There’s the penguins dataset, the iris data, the Titanic manifest, housing prices in Sacramento, in Ames, datasets about cancer, about diamonds, about professional sports compensation. I’m getting certified in Data Science, so obviously we’re going to be using a lot of datasets, but so far, they’ve all come to us pre-packaged as educational examples. So I wanted to dig a little bit deeper into what it’s like to find data when you’re actually working—what I’m calling “Accidental Datasets.” What I mean by that is that these are datasets that weren’t necessarily collected to be a machine learning corpus. 

CORONA Satellite Photography

Photo credit: Wikipedia

The CORONA program was part of the first generation of Cold War-era spy satellites, operating from 1959 to 1972, long before digital photography was practical. These satellites actually carried film, which they would expose over the Soviet Union, China, and the Middle East. Once all the film had been exposed, the satellite would jettison a recovery pod, which would be caught in mid-air by recovery planes. The recovered film could then be developed and analyzed. All of this imagery remained classified until 1995, when Bill Clinton declassified it.

In my previous post, I talked a bit about my experience studying archaeology, and this is how I first encountered the CORONA photography. The photographs were no longer relevant for espionage, being long outdated, but they serve as a great record of archaeological sites that have since been destroyed by urban development. A team at my undergrad, the University of Arkansas, has used these images to identify 833 archaeological sites.

Usage Analytics

At one of my previous jobs, I was developing internal productivity tools for the customer operations team. I led a team that took all the disparate tools that already existed and bundled them into a Chrome extension for easy access. This also gave us the opportunity to collect usage analytics on these tools for the first time, which we accomplished by incorporating Google Analytics into the toolset.

One of our most popular tools was one that I’m really proud of. We used the Zendesk API to let advisors search the knowledge base directly from the Chrome extension. This was a huge time saver, because the previous workflow was to open a tab to the knowledge base, search for the guide, wait for the results to load, click the guide to open it, and copy the URL to send to the customer. We were able to save ~10 seconds per use, which really adds up over repeated uses. We even had two options for each result: clicking on the title would open the guide in a new tab, while clicking on a button to the side would copy the link directly to the user’s clipboard.

One day I realized that the choice between those two options could be used as a stand-in for the advisor’s familiarity with the subject material. If you’re familiar with the content of the guide, then you don’t need to open it; you’re just going to send the link. If you’re unfamiliar, you’ll open it so you can review the information you’re sending to the customer. We used custom tags to report these events, and passed along the ID for the individual article as well. Over time, this gave us a fairly large dataset of which parts of the platform an advisor was familiar with, versus which parts they needed to review. We were able to analyze this data and send it to the training team, who used it to identify areas where further training was required. Pretty cool, huh?
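To make the idea concrete, here is a minimal sketch of how events like these could be boiled down to a per-article "open rate." This is not the code we actually ran. The record layout, the field names (`advisor`, `article_id`, `action`), and the sample data are all made up for illustration, and it assumes each click is logged as either an "open" or a "copy."

```python
from collections import Counter

# Hypothetical event records: each one is a click on a search result.
# "open" = guide opened in a new tab, "copy" = link copied to the clipboard.
events = [
    {"advisor": "a1", "article_id": 101, "action": "copy"},
    {"advisor": "a1", "article_id": 101, "action": "copy"},
    {"advisor": "a1", "article_id": 202, "action": "open"},
    {"advisor": "a2", "article_id": 202, "action": "open"},
    {"advisor": "a2", "article_id": 202, "action": "copy"},
    {"advisor": "a2", "article_id": 101, "action": "copy"},
]


def open_rate_by_article(events):
    """Return {article_id: share of clicks that opened the guide}."""
    opens = Counter()
    totals = Counter()
    for event in events:
        totals[event["article_id"]] += 1
        if event["action"] == "open":
            opens[event["article_id"]] += 1
    return {article: opens[article] / totals[article] for article in totals}


rates = open_rate_by_article(events)

# Articles with the highest open rate are candidates for more training.
for article, rate in sorted(rates.items(), key=lambda item: item[1], reverse=True):
    print(f"article {article}: opened {rate:.0%} of the time")
```

Grouping by advisor as well as article would give the per-person view, and a real version would want a minimum number of clicks per article before trusting the ratio.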

Public Webcams: Bears!

I really enjoyed this story from the New York Times: a pair of developers, searching for a project to practice their machine learning skills, realized that one of their favorite public webcams, a stream of bears in Alaska, would make a great machine learning project. Facial recognition works on humans, so why not try it for other mammals, too? And who among us hasn’t had a favorite public webcam at one point or another, whether it features bears or hatchlings or other animals?

They teamed up with a biology postdoc who was researching bears (a bearologist?), and who had been manually identifying bears in photographs taken in the field. Their project, BearID, produced a model capable of identifying bears with an 84% accuracy rate. This is a great use of supervised learning: the dataset was already collected and labeled for research purposes, and the model could automate much of the painstaking classification the scientists were previously doing by hand. Think of how applicable the model is to other research projects that use remote photography to study animals in the wild.

Public Webcams: Bryant Park

This is the flip side of the feel-good story about the bears. We’ve all read stories about the creepy implications of facial recognition. Two years ago, the New York Times published the results of a project that demonstrates just how deep the implications of this technology can go. Using public webcams around Bryant Park, in combination with the public headshots of employees that were posted on the websites of businesses with offices in the Bryant Park area, they were able to identify a large number of people as they were walking through the park.

To be clear, these webcams weren’t being used as security cameras, and definitely weren’t intended to track people’s location using facial recognition. They were, in fact, seen as a public good: a quick way to see if the Bryant Park lawn has open space for sunbathing in the summer, or to check if the ice skating rink is too busy in the winter. But it just goes to show you how powerful a technology face recognition has become. And that’s just a few public webcams. How many thousands of security cameras are there in New York City and other cities around the world that can be used to violate people’s privacy like this?