Storing PII/proprietary data and ML model embeddings

One of the AI4Good considerations is to avoid storing or publishing PII, proprietary or private data, or in general any data whose sensitivity and provenance you cannot explicitly rule out.

Now, I am wondering what happens with vector encodings such as text or visual embeddings. Imagine you train a model, then delete the data but keep the model. The model has, to some degree, memorized that data, but not in its original form, so it might not be identifiable. Can we store that model, for example as a pre-trained base model that we will later fine-tune on a different dataset? Publicly sharing the model would not be an option, by common sense, but is storing it OK? What if we use a cloud provider to train and save the checkpoints? I think there are many aspects to check if we need to follow the AI4Good default rules.
Could somebody with experience share some thoughts? Much appreciated.


Hi @jyebes
Even if the original data is not stored explicitly, the model itself may still contain information or patterns learned from that data. If the model has been trained on sensitive or proprietary data, some of that information can sometimes be inferred or extracted from the model (for example via membership inference or data extraction attacks). That said, I also think that merely storing a model that has learned from sensitive data may not violate AI4Good principles by default.
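To make the leakage risk concrete, here is a minimal sketch of a loss-threshold membership inference attack against a deliberately overfit classifier. Everything here is an illustrative assumption (the synthetic dataset, the `LogisticRegression` setup, the median threshold), not any official AI4Good procedure: the point is only that a stored model, with its training data deleted, can still reveal which records it was trained on.

```python
# Sketch: loss-threshold membership inference on an overfit model.
# All data is synthetic; the setup is illustrative, not prescriptive.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "sensitive" records: half are training members, half are not.
n, d = 200, 100
X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(n) < 0.25          # label noise forces memorization
y = np.where(flip, 1 - y, y)

X_mem, y_mem = X[:100], y[:100]      # members (used for training)
X_non, y_non = X[100:], y[100:]      # non-members (held out)

# Weak regularization + high dimension => the model memorizes members.
model = LogisticRegression(C=1e4, max_iter=5000).fit(X_mem, y_mem)

def per_example_loss(m, X, y):
    """Cross-entropy of the true label for each individual record."""
    p = m.predict_proba(X)
    return -np.log(np.clip(p[np.arange(len(y)), y], 1e-12, None))

loss_mem = per_example_loss(model, X_mem, y_mem)
loss_non = per_example_loss(model, X_non, y_non)

# Attack: guess "was a training member" when the loss is below a threshold.
all_losses = np.concatenate([loss_mem, loss_non])
threshold = np.median(all_losses)
guesses = all_losses < threshold
truth = np.concatenate([np.ones(100), np.zeros(100)]).astype(bool)
attack_acc = (guesses == truth).mean()

print(f"member mean loss:     {loss_mem.mean():.4f}")
print(f"non-member mean loss: {loss_non.mean():.4f}")
print(f"attack accuracy:      {attack_acc:.2f}")  # > 0.5 indicates leakage
```

An attack accuracy well above 0.5 means the model file alone leaks membership information about deleted records, which is exactly why "we removed the data" does not automatically make a stored checkpoint privacy-neutral.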


Google recently launched a machine unlearning challenge.
