Machine Unlearning

Machine learning systems are everywhere. They predict the weather, forecast earthquakes, provide recommendations based on the books and movies we like, and even apply the brakes on our cars when we're not paying attention.

To do this, software in these systems uses advanced algorithms to derive predictive relationships from massive amounts of "training data." The data is then used to construct the models and features that enable a system to pick the latest best-seller you might wish to read or to predict the likelihood of rain next week.

This intricate process means that a piece of raw data often goes through a series of computations in a system. The computations and information derived by the system from that data together form a complex propagation network called the data's "lineage." The term was coined by Yinzhi Cao, assistant professor of computer science and engineering, and his colleague, Junfeng Yang of Columbia University, who are pioneering a novel approach to make learning systems forget.

Widely used learning systems such as Google Search are, for the most part, only able to forget a user's raw data—and not the data's lineage—upon request. This is problematic for users who wish to ensure that any trace of unwanted data is removed completely, and it is also a challenge for service providers who have strong incentives to fulfill data removal requests and retain customer trust.

Given the growing importance of security and privacy protection, Cao and Yang believe that forgetting systems that are easy to adopt will be increasingly in demand. The two researchers have developed a way to make learning systems forget more quickly and effectively than current methods allow.

Their concept, called "machine unlearning," is so promising that Cao and Yang have been awarded a four-year, $1.2 million National Science Foundation grant to develop the approach.

"Effective forgetting systems must be able to let users specify the data to forget with different levels of granularity," says Cao, a principal investigator on the project. "These systems must remove the data and undo its effects so that all future operations run as if the data never existed."

Enhancing Security

Cao and Yang's "machine unlearning" method builds on work that they presented at the 2015 Institute of Electrical and Electronics Engineers (IEEE) Symposium on Security and Privacy. It rests on the fact that most learning systems can be converted into a form that can be updated incrementally, without costly retraining from scratch.

Their approach inserts a layer made up of a small number of summations between the learning algorithm and the training data, so that the two no longer depend on each other directly. The learning algorithm depends only on the summations, not on individual pieces of data. With this method, unlearning a piece of data and its lineage no longer requires rebuilding the models and features that capture predictive relationships in the data. Simply recomputing a small number of summations removes the data and its lineage completely, and much more quickly than retraining the system from scratch.
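
To make the summation idea concrete, here is a minimal sketch in Python. It is not Cao and Yang's implementation and is not taken from any of the systems they evaluated. It assumes a toy naive Bayes-style text classifier, and the class name SummationNaiveBayes, its methods, and the sample data are all hypothetical. The point it illustrates is that the model reads only two tables of running sums, so forgetting one training sample means subtracting that sample's contribution from the sums rather than retraining on everything that remains.

```python
import math
from collections import defaultdict


class SummationNaiveBayes:
    """Toy classifier kept in summation form (hypothetical, for illustration).

    Predictions read only the two count tables below, never the raw samples.
    """

    def __init__(self):
        self.label_counts = defaultdict(int)  # number of samples per label
        self.word_counts = defaultdict(int)   # number of samples per (word, label)

    def _update(self, words, label, sign):
        # Add (sign=+1) or subtract (sign=-1) one sample's contribution.
        self.label_counts[label] += sign
        for word in set(words):
            self.word_counts[(word, label)] += sign

    def learn(self, words, label):
        self._update(words, label, +1)

    def unlearn(self, words, label):
        # Assumes the sample was learned earlier; only the sums are touched.
        self._update(words, label, -1)

    def summations(self):
        # Drop zero entries so two models can be compared directly.
        labels = {k: v for k, v in self.label_counts.items() if v}
        words = {k: v for k, v in self.word_counts.items() if v}
        return labels, words

    def predict(self, words):
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            if n <= 0:
                continue
            score = math.log(n / total)
            for word in set(words):
                seen = self.word_counts.get((word, label), 0)
                score += math.log((seen + 1) / (n + 2))  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data; the third sample is the one to forget.
samples = [
    (["cheap", "pills"], "spam"),
    (["meeting", "today"], "ham"),
    (["cheap", "meeting"], "spam"),
    (["lunch", "today"], "ham"),
]
unwanted = samples[2]

# Train on everything, then unlearn the unwanted sample incrementally.
model = SummationNaiveBayes()
for words, label in samples:
    model.learn(words, label)
model.unlearn(*unwanted)

# Reference model: retrain from scratch without the unwanted sample.
retrained = SummationNaiveBayes()
for words, label in samples:
    if (words, label) != unwanted:
        retrained.learn(words, label)
```

In this sketch the unlearned model and the retrained model end up with identical summations, so every later prediction behaves as if the unwanted sample had never been seen. The cost of unlearning depends on the size of the forgotten sample, not on the size of the training set. Real systems are more involved, since not every learning algorithm reduces to simple counts like these.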


Cao believes he and Yang are the first to establish the connection between unlearning and the summation form.

And it works. Cao and Yang tested their unlearning approach on four diverse, real-world systems: LensKit, an open-source recommendation system; Zozzle, a closed-source JavaScript malware detector; an open-source online social network (OSN) spam filter; and PJScan, an open-source PDF malware detector.

The success of these initial evaluations sets the stage for the next phases of the project, which include adapting the technique to other systems and creating verifiable machine unlearning to statistically test whether unlearning has indeed repaired a system or completely wiped out unwanted data.

In their paper's introduction, Cao and Yang write that "machine unlearning" could play a key role in enhancing security and privacy and in our economic future:

"We foresee easy adoption of forgetting systems because they benefit both users and service providers. With the flexibility to request that systems forget data, users have more control over their data, so they are more willing to share data with the systems. More data also benefit the service providers, because they have more profit opportunities and fewer legal risks.

"We envision forgetting systems playing a crucial role in emerging data markets where users trade data for money, services, or other data because the mechanism of forgetting enables a user to cleanly cancel a data transaction or rent out the use rights of her data without giving up the ownership."