SANER2021 ERA Dataset

About

On this page, we publish the dataset used in our paper “Onboarding to Open Source Projects with Good First Issues: A Preliminary Analysis” (Hyuga Horiguchi, Itsuki Omori, and Masao Ohira), which has been accepted for inclusion in the Early Research Achievements (ERA) track of the 28th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER ’21).

File list

  1. prs_nums_before_resolving_issue.csv
  2. resolved_issues_percenrage.csv
  3. prs_nums_after_resolving_issues.csv

Description

The first file was used for the analysis of RQ1.
The violin plot in Fig. 1 shows the distribution of prs_num (4th column) for each issue_type (3rd column).
The 1st column, dev_id, is an ID that identifies the developer. It is used to anonymize the developer's GitHub account information.
The 2nd column, issue_url, is the URL of the issue resolved by the developer.
The 3rd column, issue_type, shows whether the issue is a Regular Issue or a Good First Issue.
The 4th column, prs_num, is the number of PRs that the developer with the given dev_id had posted on GitHub before resolving the issue at issue_url.
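
As a minimal sketch of how the first file could be grouped for a plot like Fig. 1, the following Python snippet collects prs_num values per issue_type. It assumes the CSV has a header row with the column names listed above and that issue_type contains the literal labels "Good First Issue" and "Regular Issue"; both are assumptions, and the sample rows are hypothetical, not taken from the dataset.

```python
import csv
import io
from collections import defaultdict
from statistics import median

# Hypothetical sample rows; the real data is in prs_nums_before_resolving_issue.csv.
# The header names and issue_type labels are assumptions based on the description.
SAMPLE = """dev_id,issue_url,issue_type,prs_num
1,https://github.com/example/repo/issues/1,Good First Issue,0
2,https://github.com/example/repo/issues/2,Good First Issue,2
3,https://github.com/example/repo/issues/3,Regular Issue,10
4,https://github.com/example/repo/issues/4,Regular Issue,30
"""


def prs_by_issue_type(handle):
    """Group the prs_num values (4th column) by issue_type (3rd column)."""
    groups = defaultdict(list)
    for row in csv.DictReader(handle):
        groups[row["issue_type"]].append(int(row["prs_num"]))
    return dict(groups)


groups = prs_by_issue_type(io.StringIO(SAMPLE))
# A simple per-group summary; a violin plot would draw the full distributions.
medians = {issue_type: median(values) for issue_type, values in groups.items()}
```

To run this on the published file, replace io.StringIO(SAMPLE) with an open file handle, for example open("prs_nums_before_resolving_issue.csv", newline="").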

The second file was used for the analysis of RQ2.
Table II shows the 1st, 4th, and 7th columns of the file. The columns are as follows.
The 1st column, repo_url, is the URL of the repository.
The 2nd column, issues_num, is the number of Regular Issues that the repository has.
The 3rd column, resolved_issues_num, is the number of resolved Regular Issues.
The 4th column, resolved_issues_percentage, is resolved_issues_num (3rd column) divided by issues_num (2nd column).
The 5th column, good_first_issues_num, is the number of Good First Issues that the repository has.
The 6th column, resolved_good_first_issues_num, is the number of resolved Good First Issues.
The 7th column, resolved_good_first_issues_percentage, is resolved_good_first_issues_num (6th column) divided by good_first_issues_num (5th column).
The 8th column, resolved_ratio, is resolved_good_first_issues_percentage (7th column) divided by resolved_issues_percentage (4th column).
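
The relationship between the derived columns (4th, 7th, and 8th) and the raw counts can be sketched in Python as follows. The function name derived_columns and the example counts are hypothetical. Whether the file stores the percentages as fractions or multiplied by 100 is not stated here, so the sketch uses plain fractions; resolved_ratio is the same under either convention. The sketch also assumes that issues_num, good_first_issues_num, and resolved_issues_num are nonzero.

```python
def derived_columns(issues_num, resolved_issues_num,
                    good_first_issues_num, resolved_good_first_issues_num):
    """Recompute the 4th, 7th, and 8th columns from the raw counts."""
    # 4th column: share of Regular Issues that were resolved.
    resolved_issues_percentage = resolved_issues_num / issues_num
    # 7th column: share of Good First Issues that were resolved.
    resolved_good_first_issues_percentage = (
        resolved_good_first_issues_num / good_first_issues_num
    )
    # 8th column: values above 1 mean Good First Issues are resolved
    # at a higher rate than Regular Issues in that repository.
    resolved_ratio = (
        resolved_good_first_issues_percentage / resolved_issues_percentage
    )
    return {
        "resolved_issues_percentage": resolved_issues_percentage,
        "resolved_good_first_issues_percentage": resolved_good_first_issues_percentage,
        "resolved_ratio": resolved_ratio,
    }


# Hypothetical counts for one repository, not taken from the dataset.
example = derived_columns(200, 100, 20, 15)
```

With these hypothetical counts, 100 of 200 Regular Issues and 15 of 20 Good First Issues are resolved, so the ratio is 0.75 / 0.5 = 1.5.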

The third file was used for the analysis of RQ3.
Table III shows, for each repository, the percentage of developers whose prs_num (4th column) is 1 or higher among those whose issue_type (3rd column) is Good First Issue.
The 1st column, dev_id, is an ID that identifies the developer.
The 2nd column, issue_url, is the URL of the issue resolved by the developer.
The 3rd column, issue_type, shows whether the issue is a Regular Issue or a Good First Issue.
The 4th column, prs_num, is the number of PRs that the developer with the given dev_id posted to the same repository as issue_url after resolving the issue at issue_url.
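
One possible way to compute a per-repository percentage like the one in Table III is sketched below. It is not necessarily the exact procedure used in the paper. The sketch assumes a header row with the column names above, the literal label "Good First Issue", issue URLs of the form https://github.com/<owner>/<repo>/issues/<number>, and counting by distinct dev_id; the helper names and sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical sample rows; the real data is in prs_nums_after_resolving_issues.csv.
SAMPLE = """dev_id,issue_url,issue_type,prs_num
1,https://github.com/example/alpha/issues/1,Good First Issue,0
2,https://github.com/example/alpha/issues/2,Good First Issue,3
3,https://github.com/example/alpha/issues/3,Regular Issue,5
4,https://github.com/example/beta/issues/1,Good First Issue,1
"""


def repo_of(issue_url):
    """Extract "<owner>/<repo>" from an issue URL (assumed GitHub URL layout)."""
    owner, repo = urlparse(issue_url).path.strip("/").split("/")[:2]
    return f"{owner}/{repo}"


def returning_share(handle, issue_type="Good First Issue"):
    """Per repository, the percentage of developers with prs_num >= 1."""
    all_devs = defaultdict(set)
    returning_devs = defaultdict(set)
    for row in csv.DictReader(handle):
        if row["issue_type"] != issue_type:
            continue
        repo = repo_of(row["issue_url"])
        all_devs[repo].add(row["dev_id"])
        if int(row["prs_num"]) >= 1:
            returning_devs[repo].add(row["dev_id"])
    return {
        repo: 100 * len(returning_devs[repo]) / len(devs)
        for repo, devs in all_devs.items()
    }


shares = returning_share(io.StringIO(SAMPLE))
```

Passing issue_type="Regular Issue" gives the corresponding figure for Regular Issues, assuming that label matches the values in the file.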

Contact

Hyuga Horiguchi (hhyuga201515xx@xxgmail.com)
Masao Ohira (masaoxx@xxwakayama-u.ac.jp)

[Invited Talk] Research Trends in Mining Software Repositories and the Use of AI Techniques in Software Engineering

SIGSS 2018

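We presented our research at SIGSS 2018 in Obihiro, organized by the Technical Committee on Software Science!

An Analysis of How Changes over Time in Developer Activity Affect the Prediction of Committer Candidates (山崎 大輝, 大平 雅雄, 伊原 彰紀, 柏 祐太郎, 宮崎 智己)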

Open Campus

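At the Wakayama University Open Campus, we introduced the Network Informatics major and our laboratory!
We hope this sparked your interest.
To everyone who came, we look forward to seeing you at the university in the spring!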

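193rd Software Engineering Research Meeting

At the 193rd Software Engineering Research Meeting, organized by the IPSJ Special Interest Group on Software Engineering (held jointly with the IEICE Technical Committee on Software Science and the IEICE Technical Committee on Knowledge-Based Software Engineering), we presented our research on “A Prototype Tracking Tool for Investigating Human Influences in the Creation and Use of Code Clones” (大平雅雄, 久木田雄亮)!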