At our most recent LibDevOps, one of the participants mentioned the new European GDPR (General Data Protection Regulation), which was adopted in 2016 and becomes enforceable in May of this year, and how much it was going to affect them. I knew next to nothing about it, so I went to look it up. Article 17 mandates a right to erasure, more commonly called the right to be forgotten. This is something I can get behind.
From the reading I did, a right to be forgotten seems tricky. It’s not immediately clear what it means or to what extent it’s even possible; that’s for lawyers and courts to work out, was my take, and I won’t dare comment on the legal interpretation of the regulation. I saw laws that require companies to delete information and laws that require them to retain information, and no doubt there are debates on how to reconcile these.
What interested me, however, are the logical and possibly statistical issues that these sorts of privacy laws bring up. I am not talking about the GDPR specifically.
Suppose you or I had participated in some survey (database), and a report showed that there were 500 drinkers in my region. I realize this affects my insurance, so I ask to be removed from the database, and a revised report now shows that there are 499 drinkers in the region. What does that tell you about me? I mean in the case of this database. (Stay focused ;-)) My assumption here is that even a big aggregate query like this can violate privacy.
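This is the classic differencing attack. A toy sketch (the names and data here are made up for illustration, not from any real survey) shows how two perfectly aggregate-looking counts combine to expose one person:

```python
# Hypothetical toy database: (name, is_drinker) pairs.
region = [("alice", True), ("bob", False), ("carol", True)]

def count_drinkers(db):
    """Aggregate query: total number of drinkers in the database."""
    return sum(1 for _, drinks in db if drinks)

before = count_drinkers(region)  # the originally published count
# carol exercises her right to erasure; the report is revised.
after = count_drinkers([row for row in region if row[0] != "carol"])

# Neither count names anyone, yet their difference reveals carol's record:
carol_is_drinker = (before - after == 1)
print(before, after, carol_is_drinker)  # 2 1 True
```

The point is that the individual queries are harmless in isolation; it is the pair of releases, plus the knowledge of who was removed between them, that leaks.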
You might object that removing me from the database doesn’t really cause me to be forgotten if we retain the fact that I was removed from the database. We have to forget that I was forgotten, maybe like George Bailey in "It’s a Wonderful Life." While the fictional George Bailey was able to see what life would have been like if he had never been born, I am denied that option. Truly forgetting anything in a world of ubiquitous databases may not be possible. It may require a cascade of changes that are illegal, impractical, or impossible.
One possible solution I can think of to address the challenges of erasure is through differential privacy. A report constructed via a differentially private system would not have released the exact number of drinkers in my region but rather some randomized version of the exact result. Ideally the amount of noise added to the true result is large enough to protect me, but small enough that the result is useful to the person who wants to know the approximate number of drinkers in my region.
Differential privacy is both powerful and subtle. It gives a theoretically grounded way to quantify the privacy implications of someone’s participation or lack of participation in a database. But it comes with restrictions that may be hard to live with. For example, we cannot let someone ask the same question many times, or if we do, we must give the same answer each time. If the answers were generated afresh each time, one could average the results to remove the effect of the random noise.
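The averaging attack, and the fixed-answer defense, can both be sketched in a few lines. This is an illustration under the same toy assumptions as above; `noisy_count_memoized` and its cache are hypothetical names, not a real API:

```python
import math
import random

def noisy_count(true_count, scale):
    """Fresh Laplace noise on every call -- vulnerable to averaging."""
    u = random.random() - 0.5
    return true_count - scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

# Averaging attack: ask the same question many times and average the
# answers; the independent noise terms cancel and the true count emerges.
estimate = sum(noisy_count(500, scale=10) for _ in range(100_000)) / 100_000
# estimate is now within a fraction of a unit of 500.

# Defense: compute the noisy answer once per query and cache it, so a
# repeated question gets the identical answer and averaging gains nothing.
_answers = {}
def noisy_count_memoized(query_key, true_count, scale):
    if query_key not in _answers:
        _answers[query_key] = noisy_count(true_count, scale)
    return _answers[query_key]
```

Caching the answer is the simplest way to "give the same answer each time"; real systems instead track a privacy budget that is spent down as distinct queries are answered.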
This raises the question: what happens if the report has to be rerun after I am erased? How does one plan for that a priori?
I still applaud the EU for attempting to tackle this problem. I am certain people cleverer than I have already thought about the problems I came up with after reading up on this. Meanwhile, I feel bad for my friend who has to deal with this already.