Large language models are predicated on the availability of large datasets. There is a strong case for opening the data, model weights and source code used to train LLMs: it enables transparency and reproducibility of the work. But opening data, much of which is personal or personally identifiable, brings a unique set of challenges. In the context of Tattle's ongoing work on abuse detection, opening data risks increasing the capabilities of bad actors looking to evade detection. There is also an overarching concern about the growing power disparity between corporations developing large language models and the entities creating open datasets. There is a looming threat that large language models will undermine the sustainability of initiatives that produce open knowledge and open data.
Beyond the binary of open and closed data, there is a need to conceptualise newer forms of data contracts that maximise the advantages of open source software while also addressing concerns around recognition, privacy and misuse of data.
This talk will describe the challenges of opening the data that drives AI products and propose alternative data governance models that can balance these risks while honoring the values of open source.
The challenges of opening data for open source AI
Emerging ideas on data governance and strategies for developing your own data governance policy
Knowledge of avenues for engaging in conversations on data governance