I would like to understand options for working with the PaLM API in an industry setting where there may be security and privacy concerns. For example, I don't feel very comfortable copying and pasting large chunks of my company's code base into the PaLM API (or any other connected LLM).
I would also like to better understand how to work with LLMs on complex code bases where a function may depend on many external classes or functions, or on external data / databases, etc.
Code generation is basically Q&A over language (code as language). If certain code was never collected as input, generating it will be difficult. It comes down to your company's policy to balance the benefit against what you are not comfortable sharing.
Generating complex code that uses many external functions works just like text generation grounded in large amounts of text from around the world. If an external function is very rare, it will not have appeared in the model's training data.
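One practical workaround for that is to paste the signatures/docstrings of the rare internal functions into the prompt itself, so the model can use them in-context even though it never saw them in training. Here is a minimal sketch, assuming the `google.generativeai` Python client for the PaLM API; `fetch_orders` and the API key placeholder are hypothetical, shown only for illustration:

```python
import google.generativeai as palm

palm.configure(api_key="YOUR_API_KEY")  # placeholder; use your own key

# Signatures/docstrings of internal dependencies the model has never seen.
# `fetch_orders` is a hypothetical internal helper.
context = '''
def fetch_orders(customer_id: str, since: datetime) -> list[Order]:
    """Return all orders for a customer placed after `since`."""
'''

prompt = (
    "Using only the functions declared below, write a function that "
    "returns the total value of a customer's orders from the last 30 days.\n"
    + context
)

response = palm.generate_text(
    model="models/text-bison-001",
    prompt=prompt,
    temperature=0.2,
    max_output_tokens=256,
)
print(response.result)
```

This also partially addresses the privacy concern: you only send the signatures and docstrings you choose to expose, not large chunks of the code base.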
I guess you mean that:
if some company-internal code (i.e., dataset) is not included in the training set, the LLM will actually have a harder time generating useful/correct code. Is this understanding correct?
This makes sense. The internal code base may not be that “generic”, and parts of it can be quite unique compared with public code such as what's on GitHub.
It depends on how rare the code you would like to generate is outside your organization. I feel that text varies more across organizations than code does. But you can explore this for your own case.
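If you want to run that exploration, a minimal sketch is to compare completions with and without your internal context in the prompt (again assuming the `google.generativeai` client; the SKU convention below is a hypothetical example of org-specific knowledge):

```python
import google.generativeai as palm

palm.configure(api_key="YOUR_API_KEY")  # placeholder; use your own key

TASK = "Write a function that validates a SKU string for our inventory system."
# A hypothetical internal convention the model could not know from public code.
INTERNAL_CONTEXT = "Our SKUs look like 'WH-12345-EU': warehouse code, item id, region."

for label, prompt in [
    ("without context", TASK),
    ("with context", INTERNAL_CONTEXT + "\n" + TASK),
]:
    response = palm.generate_text(model="models/text-bison-001", prompt=prompt)
    print(f"--- {label} ---\n{response.result}\n")
```

Comparing the two outputs on a handful of tasks should give you a feel for how much your code base actually diverges from what the model learned from public code.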