## Description

LeanEdits is a dataset of fine-grained theorem-proving data contributed by Lean experts. Its purpose is both to better understand how experts use Lean on a day-to-day basis, and to provide a rich source of data for machine-learning-based approaches to proof automation.

## LeanEdits Extension

This repository is home to the LeanEdits VS Code extension.
This extension collects edits made to

## Participation

Participation is easy! It only requires two steps:
## Usage

Using the LeanEdits extension is simple. Once you have completed the consent form and installed the extension, it will prompt you to enter the same name you entered on the consent form. This lets us ensure that we only use data from people who have provided consent.
After installing the extension and entering your name, the extension will begin collecting the edits you make on

For your convenience, the extension will prompt you to create a global

## Enabling/Disabling LeanEdits

You can easily enable or disable the extension by clicking the "Lean Vacuum: ON/OFF" status bar item in the lower left-hand corner of your editor.
## What does LeanEdits Collect?

LeanEdits logs

## Where is the Data Stored?

The data is stored locally in a

## Who will have Access to the Data?

Before using the data for research, we will ensure that it comes from a public GitHub repository with a permissive license. We will also run Gitleaks and Microsoft Presidio to check that the data contains no potentially sensitive information. For data we are permitted to use for research, only researchers with approval from their IRB, or institutional equivalent, will be granted access to LeanEdits.

## Risks

**Inclusion of Sensitive Information:** If you type sensitive information into a source code file while programming, and that information isn't flagged by our secret scanners (namely Gitleaks and Microsoft Presidio), it could end up in our dataset.

**Lack of Anonymity:** Because the study seeks to relate edits to the GitHub commits they are part of, and because we wish to be able to delete your data at any time, we will store information about the GitHub user that made each edit and the repository to which the edit belongs. This means that your data will not be stored anonymously.

**Leakage of Sensitive Information through Derived Artifacts:** Your data will be used to develop tools, likely including language models. These tools can sometimes leak information from their training data, which will include your edits. It is therefore possible, though unlikely, that you will type sensitive information into a source file, the information will go undetected by our secret scanners, be used to train a model, and the model will memorize the sensitive data and leak it to an unauthorized party.

**Leakage of Sensitive Information through Security Breaches:** Though unlikely, there is always a risk that your data will be leaked while in transit, at rest, or while in use by researchers. We will ensure that your data is encrypted both in transit and at rest.
Your data will be stored in an AWS S3 bucket to which only authorized researchers will have access. Researchers are strictly prohibited from sharing your data, and they may only make copies of the dataset for active research. Once research activities have terminated, they must delete any copies of the dataset so that the access-controlled version in the cloud is the only persistent version of the dataset.

**Possible Unknown Risks:** In addition, there might be risks that we cannot predict at this time. These unknown risks may be temporary, mild, and last only while you are actively participating in the research, or they may be serious, long-lasting, and may even cause death. You will be informed of any new findings that might affect your health or welfare, or might affect your willingness to continue in the research.

## Reporting Problems

If you encounter a problem with the extension, please do not hesitate to reach out to Kyle Thompson at


