Storing PII/proprietary data and ML model embeddings

One of the AI4Good considerations is to avoid storing or publishing PII, proprietary or private data, or in general any data whose sensitivity and provenance you cannot explicitly rule out.

Now, I am wondering what happens with vector encodings such as text or visual embeddings. Imagine you train a model, then delete the data but keep the model. The model has, to some degree, memorized that data, but not in its original form, so it might not be identifiable. Can we store that model, for example as a pre-trained base model that we will later fine-tune on a different dataset? Publicly sharing the model would not be an option, by common sense, but is storing it OK? What if we use a cloud provider to train and save the checkpoints? I think there are many aspects to check if we need to follow the AI4Good default rules.
Could somebody with experience share some thoughts? Much appreciated.


Hi @jyebes
Even if the original data is not stored explicitly, the model itself may still contain information or patterns learned from that data. If the model has been trained on sensitive or proprietary data, some of that information can sometimes be inferred or extracted from the model (for example via membership inference or data extraction attacks). That said, I also think that merely storing a model that has learned from sensitive data may not violate AI4Good principles by default.
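To make the leakage risk concrete, here is a minimal sketch of a loss-threshold membership inference attack against a deliberately overfit classifier. Everything here is an illustrative assumption (the synthetic dataset, the `LogisticRegression` setup, the median threshold), not any official AI4Good procedure: the point is only that a stored model, with its training data deleted, can still reveal which records it was trained on.

```python
# Sketch: loss-threshold membership inference on an overfit model.
# All data is synthetic; the setup is illustrative, not prescriptive.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic "sensitive" records: half are training members, half are not.
n, d = 200, 100
X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(n) < 0.25          # label noise forces memorization
y = np.where(flip, 1 - y, y)

X_mem, y_mem = X[:100], y[:100]      # members (used for training)
X_non, y_non = X[100:], y[100:]      # non-members (held out)

# Weak regularization + high dimension => the model memorizes members.
model = LogisticRegression(C=1e4, max_iter=5000).fit(X_mem, y_mem)

def per_example_loss(m, X, y):
    """Cross-entropy of the true label for each individual record."""
    p = m.predict_proba(X)
    return -np.log(np.clip(p[np.arange(len(y)), y], 1e-12, None))

loss_mem = per_example_loss(model, X_mem, y_mem)
loss_non = per_example_loss(model, X_non, y_non)

# Attack: guess "was a training member" when the loss is below a threshold.
all_losses = np.concatenate([loss_mem, loss_non])
threshold = np.median(all_losses)
guesses = all_losses < threshold
truth = np.concatenate([np.ones(100), np.zeros(100)]).astype(bool)
attack_acc = (guesses == truth).mean()

print(f"member mean loss:     {loss_mem.mean():.4f}")
print(f"non-member mean loss: {loss_non.mean():.4f}")
print(f"attack accuracy:      {attack_acc:.2f}")  # > 0.5 indicates leakage
```

An attack accuracy well above 0.5 means the model file alone leaks membership information about deleted records, which is exactly why "we removed the data" does not automatically make a stored checkpoint privacy-neutral.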


Google recently launched a machine unlearning challenge.
