Retrieval and analysis of GitHub users and their repositories.
users.csv
: Data of 301 GitHub users based in the city of London with over 500 followers.
repositories.csv
: Data of 37,922 public repositories of these users, top 500 each by most recently pushed.
fetch.py
: The Python script used to collect this data.
fetch.json
: A JSON file containing fetch statistics.
queries
: This directory contains GraphQL queries used to fetch data from GitHub API.
analysis.ipynb
: A Jupyter notebook containing the analysis of the data.
answers.ipynb
: Contains answers to the project assignment questions.
.gitignore
: A file to exclude certain files from being tracked by Git.
README.md
: This file. A summary of the project and its findings.
- The data for this project was collected using GitHub's GraphQL API, which gives fine-grained control over the fields to fetch.
- Python scripts fetched the data from the API and processed it to extract user profiles and repository details.
- The information was saved in CSV files for easy retrieval and subsequent analysis.
- The cleaning step included standardizing company names: removing extra spaces, removing symbols like `@`, and converting everything to uppercase for consistency.
- One key challenge was handling rate limits from the GitHub API. This was mitigated by using the GraphQL API, which retrieves the required fields in far fewer requests.
- An interesting observation was that the search query, intended to fetch user data, occasionally returned empty `{}` responses. Further investigation showed that these entries corresponded to organizations rather than individual users.
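The company clean-up and the handling of empty `{}` entries described above can be sketched as follows. This is a minimal illustration and not the code in fetch.py: the function names `clean_company` and `drop_empty_nodes` are hypothetical, and the sketch assumes that only leading `@` symbols are stripped.

```python
import re


def clean_company(raw):
    """Normalize a company string for grouping and counting."""
    if not raw:
        return ""  # missing company values become an empty string
    name = raw.strip()
    name = name.lstrip("@")  # assumption: only leading '@' symbols are removed
    name = re.sub(r"\s+", " ", name).strip()  # collapse repeated whitespace
    return name.upper()


def drop_empty_nodes(nodes):
    """Discard empty {} entries, which corresponded to organizations."""
    return [node for node in nodes if node]


print(clean_company("  @facebook  "))  # FACEBOOK
```

Filtering the empty entries before writing the CSV keeps organizations out of users.csv without needing a separate lookup per entry.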
- Popularity and Repository Count: Developers who create more public repositories usually have more followers. Each additional repository is associated with, on average, around 1.5 more followers.
- Company Affiliation: Among these London-based users, Facebook is the most common company affiliation.
- Programming Language Trends: JavaScript is the most popular language among these users.
- Open Source Engagement: Many developers don't use a license for their projects. Among those who do, the MIT License is the most common, which suggests they are open to sharing their work with few restrictions.
- Features for Collaboration: There is a positive correlation between enabling "projects" and enabling "wikis" on repositories. Turning on these features can help get more people involved and make collaboration easier.
- Surname Trends: The most common surnames among developers in London include "Appleton" and "Li", which may suggest that many contributors have English or Asian backgrounds.
- Email Sharing Behavior: Hireable developers are more likely to share their email addresses than those who aren't, which suggests they are more open to being contacted.
- Leader Strength: Defined as followers divided by `1 + following`. Top users are those with more followers than people they follow, showing they are key figures who provide valuable work.
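Two of the quantities above can be made concrete with a short sketch. It assumes that the figure of roughly 1.5 followers per repository comes from a simple least-squares fit of followers against public repository count; this README does not state the exact method. The function names and the sample numbers are hypothetical and are not taken from users.csv.

```python
def leader_strength(followers, following):
    """Followers divided by (1 + following); the +1 avoids division by zero."""
    return followers / (1 + following)


def ols_slope(xs, ys):
    """Slope of the least-squares line y = a + b*x (assumed method)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Made-up numbers for illustration only.
print(leader_strength(1000, 0))    # 1000.0
print(leader_strength(500, 499))   # 1.0

public_repos = [10, 20, 30]
followers = [515, 530, 545]
print(ols_slope(public_repos, followers))  # 1.5
```

A slope of this kind describes an association across users; it does not by itself show that creating one more repository causes a gain in followers.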
- To grow their following, developers should be more active by creating and sharing public repositories. Contributing to more projects shows skills and attracts attention from others.
- Turning on collaboration features like "projects" and "wikis" can boost engagement by making it easier for others to contribute and understand the work.
- Sharing contact details, like an email address, can make a developer more visible to collaborators or employers.
- Developers looking for new opportunities should mark themselves as "hireable" and share their contact information to increase their chances of being approached.
- Companies should encourage developers to be visible on GitHub to build their reputation and support a culture of sharing and learning.