A conversation with Dr. Parinaz Sobhani on reverse engineering data and on integrating differential privacy into machine learning models

Dr. Parinaz Sobhani is the Director of Machine Learning on the Georgian Impact team and is responsible for leading the development of cutting-edge machine learning solutions. Dr. Sobhani has more than 10 years of experience developing and designing new models and algorithms for various artificial intelligence tasks. Prior to joining Georgian Partners, she worked at Microsoft Research, where she developed end-to-end neural machine translation models. Before that, she worked for the National Research Council in Canada, where she designed and developed deep neural network models for natural language understanding and sentiment analysis. She holds a Ph.D. in machine learning and natural language processing from the University of Ottawa with a research focus on opinion mining in social media.

Patricia: Thank you very much for agreeing to be interviewed.

Dr. Sobhani: You’re very welcome.

Patricia: How about we start with the basic question of why we should be interested in privacy-preserving natural language processing in the first place?

Dr. Sobhani: That’s a very good question. For the very same reasons that we should be interested in privacy-preserving AI in general. One of the biggest concerns of using machine learning for applications that make use of sensitive data is privacy. It isn’t a surprise to us that machine learning models are very good at memorizing training data. And in NLP, because we often make use of generative models, the risk of being able to predict the training data from the model’s outputs is quite high. Any data used to train language models and machine translation models can be reverse engineered, basically giving users of those models access to a part of the training data. If we are using a machine learning model for healthcare diagnoses, for example, we don’t want the model to memorize any of the individual patients’ information that it was built with. Our ultimate goal is to extract general patterns rather than specific facts about individuals. But machine learning models have the proven ability to memorize individual data points that can be simply reverse engineered by adversaries using malicious inputs.

Patricia: How does differential privacy fit into making data processing more secure?

Dr. Sobhani: Differential privacy can be used to make machine learning models private. One of the most popular approaches in the literature is introduced in “Deep Learning with Differential Privacy.” It is applicable to any learning approach that is based on stochastic gradient descent. Some of the other approaches might have more constraints. For example, in one of our papers (“Boosting Model Performance through Differentially Private Model Aggregation”) we used the method from “Bolt-on Differential Privacy for Scalable Stochastic Gradient Descent-based Analytics,” which is only applicable to convex optimization. Because we are using these machine learning algorithms for NLP applications, we can apply the same kind of approaches to make them private. My hope is that we will see more research in this area, because all of the differential privacy papers I’ve seen so far have been based on processing structured data or image data. Like text and speech, image data are also unstructured. However, the models used to process them are different and the privacy protection challenges are different.
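
As a rough illustration of the stochastic-gradient-descent approach mentioned above, the following NumPy sketch shows the core step usually associated with “Deep Learning with Differential Privacy”: clip each example's gradient, add Gaussian noise to the sum, then average. The function name `dp_sgd_step` and the parameters `clip_norm` and `noise_multiplier` are illustrative assumptions, not taken from the interview or from any particular library. The sketch does not track the cumulative privacy loss (ε) across steps, which a real implementation would need to do.

```python
import numpy as np


def dp_sgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One noisy SGD step (illustrative sketch, not a full DP training loop).

    per_example_grads has shape (batch_size, num_params): one gradient per example.
    """
    batch_size = per_example_grads.shape[0]

    # L2 norm of each example's gradient, kept as a column for broadcasting.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)

    # Scale down any gradient whose norm exceeds clip_norm; leave the rest unchanged.
    scale = np.minimum(1.0, clip_norm / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale

    # Gaussian noise calibrated to the clipping bound (assumed parameterization).
    noise = rng.normal(0.0, noise_multiplier * clip_norm, size=weights.shape)

    # Average the noisy sum and take a gradient step.
    noisy_mean = (clipped.sum(axis=0) + noise) / batch_size
    return weights - lr * noisy_mean


# Hypothetical usage with made-up gradients for a 3-parameter model.
rng = np.random.default_rng(0)
weights = np.zeros(3)
grads = np.array([[3.0, 4.0, 0.0], [0.1, 0.0, 0.0]])
updated = dp_sgd_step(weights, grads, lr=0.1, clip_norm=1.0,
                      noise_multiplier=1.1, rng=rng)
```

Clipping bounds how much any single example can move the weights, which is what lets the added noise mask the presence or absence of that example.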

Patricia: What other approaches can be used to make data processing more secure?

Dr. Sobhani: I am encouraged to see so many other technologies that have been developed to make privacy-preserving data mining a possibility. One such approach is de-identification. Khaled El Emam and his research group have been working on this problem for several years. But still, we don’t know all the limitations of de-identification methods. Normally, if you want to use de-identification methods for NLP tasks, multiple steps need to be addressed. You need to extract identifiers from text, which is not easy. For example, you will want to remove all addresses, names, and so on. The first step for this is to use a named entity recognizer. This in itself is a difficult task. The best named entity recognizers are around 90% accurate. So you have a 10% error rate and we are normally evaluating these named entity recognizers by comparing them with human annotated text. We don’t have 100% inter-annotator agreement for this sort of text annotation. There are all of these sources of noise when it comes to evaluating named entity recognition systems and, depending on the nature of the text, recognizing named entities might be really difficult. For example, in the case of clinical trial reports or diagnosis reports, the grammatical structure might vary, there might be typos, word repetitions, or other problems with the text.
After you have extracted the named entities from a text, you have more structured data to deal with. Then you can apply de-identification algorithms. At the end of the day, when you are using de-identification algorithms, you are relying on certain assumptions, including assumptions about the availability of external datasets. It is so hard to maintain any initial assumptions. Everything is changing and there is so much information out there that we don’t know how to leverage yet and how to connect with the data made accessible to malicious parties through data leaks. Verizon, for example, recently published a report analyzing medical data breaches in 27 countries, which revealed that there have been more than 500 data breach incidents leaking medical records. Leaked information can be cross-referenced with public datasets.
De-identification or anonymization that masks what we consider to be personally identifiable information might not be sufficient and does not provide any privacy guarantees without certain assumptions about the availability of external and auxiliary sources of information. Whereas, techniques such as homomorphic encryption and differential privacy provide more formal guarantees with strong mathematical foundations. Privacy guarantees garner the trust of individuals leading them to more willingly share data and, in turn, to benefit from the algorithmic improvements that personalization provides.

Patricia: So, in order to be able to use differential privacy you have to know what task you want to do beforehand; whereas, theoretically, data de-identification would allow for datasets of unstructured text to be made public without knowing which tasks the data will be used for. Do you think that the best approach would be for researchers and developers to propose differentially private algorithms to the owners of the datasets they’d like to train their models on and hope the owners agree? Or is there some more practical way to go about training models on data that aren’t publicly available?

Dr. Sobhani: That’s a very good question. The main benefit of differential privacy is that it gives you mathematical guarantees about privacy. But of course it has its own limitations as well. It is a probabilistic framework to measure the level of privacy of a mechanism (or an algorithm/function) that uses data to provide an answer based on some computation on that data. The main principle of preserving privacy is to introduce randomness to the function so that the final answer does not depend on any individual data points. In the case of preserving privacy for machine learning models, any learned parameters (weights) of the final system should not have a huge dependency on the presence or absence of any individual user’s data that was leveraged to train the model. We can choose to add noise to the data, to the model, or to the output of the model. Which one we choose to add noise to, how we add the noise, and how much noise we add all depend on the type of function that we are interested in using. There has been some research on adding noise to the input data. In that case, the magnitude of the noise has to be higher, because your assumption is going to be that the data can be fed into any function. What I have seen in the literature so far is that that kills the utility of the data. An interesting alternative was presented in a paper called “Generating Differentially Private Datasets Using GANs,” where the authors trained a generative model on their input data, then made that generative model differentially private, generated data using that model and then made that generated data publicly available. I’m a bit skeptical about this approach because I believe we’d need lots and lots of data to be able to train the generative model to produce useful output data. Also, choosing the right magnitude for the noise is not an insignificant task. The quality of the output data is certainly not going to be the same as the quality of the original dataset.
If we were to try to use this method for healthcare applications, we might not have enough input data made available to us in order to be able to build a reliable generative model. But on paper, it is at least theoretically possible to use differential privacy to create a useful privacy-preserving public dataset.
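
To make the idea of adding noise to the output of a function concrete, here is a minimal sketch of the Laplace mechanism applied to a counting query. It is one standard example of output perturbation, not the method from any paper named in the interview. The function name `laplace_count` and the sample data are hypothetical; the sensitivity of 1 assumes that adding or removing one person changes the count by at most 1.

```python
import numpy as np


def laplace_count(values, predicate, epsilon, rng):
    """Return a noisy count of the items in `values` that satisfy `predicate`.

    Illustrative sketch: noise is drawn from a Laplace distribution whose
    scale is sensitivity / epsilon, so a smaller epsilon means more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")

    true_count = sum(1 for v in values if predicate(v))

    # Assumption: one individual contributes at most one item to `values`,
    # so the count changes by at most 1 when that individual is removed.
    sensitivity = 1.0

    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise


# Hypothetical usage: a noisy count of patients older than 60.
rng = np.random.default_rng(0)
ages = [34, 71, 65, 48, 80, 59]
noisy = laplace_count(ages, lambda age: age > 60, epsilon=0.5, rng=rng)
```

The answer is randomized, so repeated calls return different values; each call also spends privacy budget, which this sketch does not track.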

Patricia: So they are basically creating a synthetic dataset.

Dr. Sobhani: Exactly. Based on applying differential privacy to a generative model.

Patricia: Okay. And they didn’t provide evaluations of the quality of the output data?

Dr. Sobhani: Well, in the paper, they apply their method to image data and then used their own metrics to measure the quality of the synthesized images.

Patricia: Riveting! If a researcher or developer wants to start using differential privacy, where should they start?

Dr. Sobhani: I’d recommend taking a look at a paper published in SIGMOD 2017 called “Differential Privacy in the Wild: A Tutorial on Current Practices & Open Challenges.” There’s also a blog post by a couple of the best researchers in the area (Nicolas Papernot and Ian Goodfellow) titled “Privacy and machine learning: two unexpected allies?” And I’d definitely recommend “The Promise of Differential Privacy. A Tutorial on Algorithmic Techniques” by Cynthia Dwork, who is famously known for being the co-inventor of differential privacy. With regard to coding, I am so encouraged to see many open source packages available for differential privacy. Researchers working in this area have started to publish their code along with their papers. One very good package has been published by Google for TensorFlow. It contains a comprehensive set of mechanisms that can be applied to various deep learning methods and other machine learning methods. When we started working in this area, there weren’t as many open source packages. We identified one paper whose method we wanted to use, we contacted the authors, and they told us that they couldn’t make the code publicly available. We had to write our own code and it wasn’t easy. When we talk about ε-differential privacy, ε is actually the measure of the privacy level of the algorithm. You can think of it similarly to accuracy or F1-score, which are metrics used to measure the performance of a machine learning model. It’s much easier to test how well a machine learning model performs than to check the privacy level of the model. What I mean is that there isn’t a list of approaches that you can use to verify the privacy level of your model and get a stamp of approval. Bugs in your code can also make your model non-private in practice, even when all the math points to it being privacy-preserving.
Or there might be a number of requirements that need to be met in order to have all of the required privacy guarantees and it’s so easy to miss one of them. One of our struggles was understanding all of the requirements. It took us a while to determine whether the privacy was provably guaranteed. That is why I am so encouraged by the new open source packages, all of the new papers coming out, and even the workshops that are focused solely on differential privacy. Even if a newcomer doesn’t manage to publish their paper in one of the main conferences, they still have an opportunity to connect with the community through one of these workshops and discuss their work.

Patricia: Do the packages that are available give you an understanding of what the privacy guarantee is based on the amount of noise that you insert at different points in your model, or do you have to calculate that yourself?

Dr. Sobhani: The way we calculate the privacy level is based on ε and ε is the input to your algorithm. There are some standards for selecting ε. For example, people often make ε equal to 2. You can use this value as an input in these packages to get a model that has ε-differential privacy with ε = 2.
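
For reference, the standard definition behind the ε discussed here can be written as follows. The symbols M (the randomized mechanism), D and D′ (two datasets that differ in one individual's record), and S (any set of possible outputs) are notation introduced for this illustration and do not appear in the interview.

```latex
% A mechanism M is \varepsilon-differentially private if, for all datasets
% D and D' differing in a single record and for every set of outputs S:
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \, \Pr[\,M(D') \in S\,]

% With \varepsilon = 2, the two probabilities may differ by a factor of
% at most e^{2} \approx 7.39.
```

A smaller ε forces the two probabilities closer together, which means a stronger privacy guarantee and, typically, more noise.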

Patricia: Which pitfalls would you recommend keeping an eye out for when you are using differential privacy for machine learning?

Dr. Sobhani: One of the biggest challenges I’ve seen people have was actually due to poor communication. Because of the mathematical foundations of differential privacy and because the privacy guarantees are probabilistic, it is harder to communicate the applications and functionality to people with non-technical backgrounds. I’ve often been in discussions with developers who were having a difficult time conveying how this technology works to their company’s customers. It isn’t easy to understand. There should be more work done on simplifying the language and on making the concepts comprehensible to professionals with very different backgrounds.

Patricia: What would be a solution to this communication problem?

Dr. Sobhani: Educating product managers would be key. Product managers are the ones who are usually responsible for communicating with customers. Scientists who work on differential privacy could describe its mathematical foundations, its privacy guarantees, and their own research to product managers without getting into too many details about the math. One way of doing this is by using concrete examples. Also, as we see more examples of differentially private approaches in production, like the ones Google, Apple, and Amazon have in production right now, it will help increase the levels of trust in these technologies. Because at the beginning of any technological adoption, people are normally skeptical about whether it works and they aren’t too sure about what it could be used for. On top of making it easier to understand the concepts, it is our responsibility as scientists to develop testing frameworks for differential privacy. That’s another challenge, because it’s hard to actually measure the privacy guarantees after you’ve shipped your model.

Patricia: Are there any papers that discuss privacy frameworks?

Dr. Sobhani: I know that a couple of researchers at the University of Toronto have started working in this research area, including Aleksander Nikolov. They know it’s a gap that might limit the applications of differential privacy and that we have to find ways to fill it as soon as possible. There is also a recent publication addressing this issue called “Detecting Violations of Differential Privacy” by researchers from Pennsylvania State University.

Patricia: Given what you know about the pitfalls and the challenges that you’ve faced and given the tasks that you are aware of that could use differential privacy for privacy preservation, what wish list do you have for the research community working on applied differential privacy?

Dr. Sobhani: It would be great if there were more of a focus on the NLP side of things. Right now, most of the papers have used MNIST for their experiments. My expectation is that the community will start to use more diverse datasets and more real-world datasets where they have to deal with the challenges faced by industry. For example, more imbalanced datasets, sparse datasets, or unstructured data that is more complex than MNIST. I expect to see more realistic experiments.

Patricia: And which venues would you recommend looking at if you’re interested in reading papers about differential privacy applied to machine learning or if you’re interested in publishing on the topic of differential privacy applied to natural language processing?

Dr. Sobhani: There’s the Privacy Preserving Machine Learning workshop at NeurIPS, the IEEE Symposium on Security and Privacy, and the AAAI Symposium on Privacy-Enhancing Artificial Intelligence and Language Technologies, to name a few.

Patricia: Thank you so much for your time, Parinaz.

Dr. Sobhani: My pleasure.