Talk
Intermediate

Rethinking Data Contracts for Open Source AI

Large language models are predicated on the availability of large datasets. There is a strong case to be made for opening the data, model weights and source code used for training LLMs: it enables transparency and reproducibility of the work. But opening data, much of which is personal or personally identifiable, brings a unique set of challenges. In the context of Tattle's ongoing work on abuse detection, opening data introduces the risk of increasing the capabilities of bad actors looking to evade detection. There is also an overarching concern about the growing power disparity between corporations developing large language models and entities creating open datasets. There is a looming threat that large language models will undermine the sustainability of initiatives that produce open knowledge and open data.

Beyond the binary of open and closed data, there is a need to conceptualise newer forms of data contracts that maximise the advantages of open source software while also addressing concerns around recognition, privacy and misuse of data.

This talk will describe the challenges of opening the data that drives AI products and propose alternate data governance models that can balance these risks while honoring the values of open source. Attendees will take away:

  • The challenges of opening data for open source AI

  • Emerging ideas on data governance and strategies for developing your own data governance policy

  • Knowledge of avenues for engaging in conversations on data governance

Topics: Technology / FOSS licenses, policy
Track: Geopolitics and Policy in FOSS Devroom
