A collection of diff datasets. It contains:
- GitHub Java is a Java dataset containing 1000 commits from 10 popular projects.
- GitHub Python is a Python dataset containing 1000 commits from 10 popular projects.
- Defects4J is a Java dataset of bug fixes used in the program repair community.
- BugsInPy is a Python dataset of bug fixes used in the program repair community.
The layout of these datasets is the following: the `before` folders contain the files before modification, and the `after` folders contain the files after. Inside the `before` and `after` folders, there is one folder per project that contains one folder per commit. Note that the commit names are the same in the `before` and `after` folders.
The `unparsable` folder contains the commits from the previous datasets for which we could not parse one of the files.
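
The layout described above can be walked with a few lines of Python. The following is a minimal sketch, not one of the provided scripts: the helper `iter_commit_pairs`, the `CommitPair` type, and the idea of a single dataset root directory that directly holds the `before` and `after` folders are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterator, NamedTuple


class CommitPair(NamedTuple):
    """One commit as seen in both folders (hypothetical helper type)."""
    project: str
    commit: str
    before_dir: Path
    after_dir: Path


def iter_commit_pairs(dataset_root: Path) -> Iterator[CommitPair]:
    """Yield matching before/after commit folders under dataset_root.

    Assumes dataset_root directly contains the `before` and `after`
    folders, each organized as <project>/<commit>/.
    """
    before_root = dataset_root / "before"
    after_root = dataset_root / "after"
    for project_dir in sorted(p for p in before_root.iterdir() if p.is_dir()):
        for commit_dir in sorted(c for c in project_dir.iterdir() if c.is_dir()):
            # Commit names are the same in both folders, so the
            # counterpart is looked up by project and commit name.
            after_dir = after_root / project_dir.name / commit_dir.name
            if after_dir.is_dir():  # skip commits with no counterpart
                yield CommitPair(
                    project_dir.name, commit_dir.name, commit_dir, after_dir
                )
```

Calling `list(iter_commit_pairs(Path("path/to/dataset")))` on a local copy would return one entry per commit present in both folders; the path here is a placeholder to be replaced with the actual location of a dataset.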
The Python scripts used to produce the datasets are also provided.