SANER2021 ERA Dataset

About

On this page, we publish the dataset used in our paper “Onboarding to Open Source Projects with Good First Issues: A Preliminary Analysis” (Hyuga Horiguchi, Itsuki Omori, and Masao Ohira), which has been accepted for inclusion in the Early Research Achievements (ERA) track of the 28th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER ’21).

File list

  1. prs_nums_before_resolving_issue.csv
  2. resolved_issues_percenrage.csv
  3. prs_nums_after_resolving_issues.csv

Description

The first file was used for the analysis of RQ1.
The violin plot in Fig. 1 shows the distribution of prs_num (4th column) for each issue_type (3rd column).
The 1st column, dev_id, is an ID that identifies the developer. It replaces the developer's GitHub account information in order to anonymize it.
The 2nd column, issue_url, is the URL of the issue resolved by the developer.
The 3rd column, issue_type, shows whether the issue is a Regular Issue or a Good First Issue.
The 4th column, prs_num, is the number of PRs that the developer identified by dev_id posted on GitHub before resolving the issue at issue_url.
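
As a rough illustration of how the first file could be prepared for a plot like Fig. 1, the sketch below groups prs_num by issue_type. It assumes the CSV has a header row with the column names given above and that issue_type holds the literal strings "Regular Issue" and "Good First Issue"; neither assumption is confirmed by this page. The rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io
import statistics
from collections import defaultdict


def prs_by_issue_type(csv_file):
    """Group prs_num values (4th column) by issue_type (3rd column)."""
    groups = defaultdict(list)
    for row in csv.DictReader(csv_file):
        groups[row["issue_type"]].append(int(row["prs_num"]))
    return dict(groups)


# Made-up rows for illustration only; header names are an assumption.
sample = io.StringIO(
    "dev_id,issue_url,issue_type,prs_num\n"
    "1,https://github.com/example/alpha/issues/1,Good First Issue,0\n"
    "2,https://github.com/example/alpha/issues/2,Good First Issue,2\n"
    "3,https://github.com/example/alpha/issues/3,Good First Issue,1\n"
    "4,https://github.com/example/alpha/issues/4,Regular Issue,10\n"
    "5,https://github.com/example/alpha/issues/5,Regular Issue,4\n"
)

groups = prs_by_issue_type(sample)
# One summary value per issue type; a violin plot would use the full lists.
medians = {issue_type: statistics.median(v) for issue_type, v in groups.items()}
```

To run this against the real file, pass open("prs_nums_before_resolving_issue.csv", newline="") instead of the in-memory sample.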

The second file was used for the analysis of RQ2.
Table II shows the 1st, 4th, and 7th columns of the file. The columns are as follows.
The 1st column, repo_url, is the URL of the repository.
The 2nd column, issues_num, is the number of Regular Issues that the repository has.
The 3rd column, resolved_issues_num, is the number of resolved Regular Issues.
The 4th column, resolved_issues_percentage, is resolved_issues_num (3rd column) divided by issues_num (2nd column).
The 5th column, good_first_issues_num, is the number of Good First Issues that the repository has.
The 6th column, resolved_good_first_issues_num, is the number of resolved Good First Issues.
The 7th column, resolved_good_first_issues_percentage, is resolved_good_first_issues_num (6th column) divided by good_first_issues_num (5th column).
The 8th column, resolved_ratio, is resolved_good_first_issues_percentage (7th column) divided by resolved_issues_percentage (4th column).
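
The relationship between the derived columns can be sketched as follows. The function name and the input numbers are hypothetical. Whether the file stores the percentages as fractions (0.5) or scaled by 100 (50) is not stated on this page; the sketch uses fractions, and resolved_ratio is the same either way.

```python
def derived_columns(issues_num, resolved_issues_num,
                    good_first_issues_num, resolved_good_first_issues_num):
    """Recompute the 4th, 7th, and 8th columns from the raw counts."""
    # 4th column: share of Regular Issues that were resolved.
    resolved_issues_percentage = resolved_issues_num / issues_num
    # 7th column: share of Good First Issues that were resolved.
    resolved_good_first_issues_percentage = (
        resolved_good_first_issues_num / good_first_issues_num
    )
    # 8th column: a value above 1 means Good First Issues are resolved
    # at a higher rate than Regular Issues in that repository.
    resolved_ratio = (
        resolved_good_first_issues_percentage / resolved_issues_percentage
    )
    return (resolved_issues_percentage,
            resolved_good_first_issues_percentage,
            resolved_ratio)


# Made-up counts for illustration only: 100 of 200 Regular Issues and
# 15 of 20 Good First Issues resolved.
regular_pct, gfi_pct, ratio = derived_columns(200, 100, 20, 15)
```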

The third file was used for the analysis of RQ3.
Table III shows, for each repository, the percentage of developers who resolved a Good First Issue (issue_type, 3rd column) and whose prs_num (4th column) is 1 or higher.
The 1st column, dev_id, is the ID that identifies the developer.
The 2nd column, issue_url, is the URL of the issue resolved by the developer.
The 3rd column, issue_type, shows whether the issue is a Regular Issue or a Good First Issue.
The 4th column, prs_num, is the number of PRs that the developer identified by dev_id posted to the same repository as issue_url after resolving the issue at issue_url.
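
A minimal sketch of the per-repository percentage described for Table III is given below. It rests on several assumptions not confirmed by this page: the CSV has a header row with the column names above, issue_type contains the literal string "Good First Issue", issue_url has the form https://github.com/<owner>/<repo>/issues/<number>, and developers are counted once per repository by dev_id. The helper names and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def repo_of(issue_url):
    """Extract '<owner>/<repo>' from an issue URL (assumed GitHub layout)."""
    parts = issue_url.split("/")
    return "/".join(parts[3:5])


def returning_percentage(csv_file, issue_type="Good First Issue"):
    """Per repository, percentage of developers with prs_num >= 1."""
    devs = defaultdict(set)       # all developers who resolved such an issue
    returning = defaultdict(set)  # those who posted at least one PR afterwards
    for row in csv.DictReader(csv_file):
        if row["issue_type"] != issue_type:
            continue
        repo = repo_of(row["issue_url"])
        devs[repo].add(row["dev_id"])
        if int(row["prs_num"]) >= 1:
            returning[repo].add(row["dev_id"])
    return {repo: 100 * len(returning[repo]) / len(devs[repo]) for repo in devs}


# Made-up rows for illustration only; not taken from the dataset.
sample = io.StringIO(
    "dev_id,issue_url,issue_type,prs_num\n"
    "1,https://github.com/example/alpha/issues/1,Good First Issue,0\n"
    "2,https://github.com/example/alpha/issues/2,Good First Issue,3\n"
    "3,https://github.com/example/alpha/issues/3,Regular Issue,5\n"
    "4,https://github.com/example/beta/issues/7,Good First Issue,1\n"
)

percentages = returning_percentage(sample)
```

Counting rows instead of distinct dev_id values would be an equally plausible reading; the two differ only when a developer resolved more than one Good First Issue in the same repository.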

Contact

Hyuga Horiguchi (hhyuga201515xx@xxgmail.com)
Masao Ohira (masaoxx@xxwakayama-u.ac.jp)