GitHub Cloud Connector

This document covers all information related to our GitHub connector.

Written by Dan Iacono
Updated over a week ago

Introduction

The GitHub connector for Glean allows Glean to fetch and index content from GitHub, ensuring that users can search and access documents for which they have authorized permissions.

  • Authentication: Glean requires the GitHub admin to authenticate to Glean during the setup of the Glean crawler app in the GitHub marketplace.

  • Data Storage: All data is stored in the cloud project within the customer's cloud account, ensuring no data leaves the customer's environment.

API Usage

  • Standard API: Glean uses GitHub’s standard REST API to ingest all data

Integration Features

  • Content Captured: Glean captures GitHub repos, commits, issues, pulls, and pushes.

  • Permissions Enforcement: Glean respects all user access permissions, ensuring users only see search results for documents they can access. When a user clicks a search result, they are taken to the GitHub web application, which enforces the permissions.
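The permission enforcement above can be illustrated with a minimal sketch: results are filtered against the set of repositories a user is authorized to see before they are returned. All names and structures here are hypothetical, not Glean's implementation:

```python
def filter_results(results, accessible_repos):
    """Return only the search results whose repository the user may access.

    `results` is a list of dicts with a "repo" key; `accessible_repos` is
    the set of repository names the user is authorized to see.
    (Illustrative structures only.)
    """
    return [r for r in results if r["repo"] in accessible_repos]


results = [
    {"title": "Fix login bug", "repo": "acme/web"},
    {"title": "Rotate secrets", "repo": "acme/infra-private"},
]
# A user with access only to acme/web never sees the private result.
print(filter_results(results, {"acme/web"}))
```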

Versions Supported

There are no specific version limitations for the GitHub connector; however, this document covers GitHub Cloud only, not the on-premise GitHub Enterprise Server.

Objects Supported

The GitHub connector supports the following objects:

  • Contents

  • Issues

  • Metadata

  • Pull requests

  • Commit statuses

  • GitHub Pages

Authentication Mechanism

The GitHub organizational administrator will install a GitHub Marketplace app with Admin read-access scope. Navigate to https://github.com/apps/glean-github-app, or within the Glean UI go to Admin console → Data sources → GitHub and click Install the Glean GitHub App.

The GitHub Connector app is used for indexing content and delivers webhooks to the customer's Glean instance. Users cannot access private repositories until the OAuth flow is completed. On first use of GitHub search in Glean, users are prompted to authenticate via GitHub OAuth to view private repositories and to help sync user aliases. Once the OAuth flow is completed, Glean detects the change in the entity crawl and syncs the private repositories and aliases.

OAuth Flow for Individual Users and Private Repository Access

Individual users must authorize Glean in the UI by clicking their profile picture (bottom left corner) → Your settings → Data sources → GitHub.
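The per-user authorization uses GitHub's standard OAuth web flow with the user:email scope (see the API endpoints section). As a rough sketch, the authorization URL a user is redirected to can be built as below; the client_id and redirect_uri values are placeholders, not Glean's actual app credentials:

```python
from urllib.parse import urlencode


def authorize_url(client_id: str, redirect_uri: str) -> str:
    """Build a GitHub OAuth authorization URL requesting the user:email scope."""
    params = {
        "client_id": client_id,        # placeholder OAuth app client ID
        "redirect_uri": redirect_uri,  # placeholder callback URL
        "scope": "user:email",         # scope named in the API endpoints section
    }
    return "https://github.com/login/oauth/authorize?" + urlencode(params)


print(authorize_url("Iv1.example", "https://example.glean.com/oauth/callback"))
```

After the user approves, GitHub redirects back with a `code`, which is exchanged for an access token used to call /user and fetch email addresses.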

Connector credentials requirements

The GitHub connector for Glean requires specific permissions to function correctly.

  • Organizational Admin for the data source and GitHub app install

  • Admin read-only access for ongoing operation of the GitHub app

  • Individual user authorization for private repositories

Connection instructions

Disclaimer: The instructions below are updated periodically by Glean on our customer-facing documentation. For the latest instructions, refer to the Glean Admin UI.

Install the Glean App for GitHub

  1. Click Install or Configure

  2. Click Organization where the app is to be installed

  3. Select All Repositories and Click Install & Authorize.

Setup in Glean

  1. Input the data source name in the Name text box

  2. Select an icon

  3. Input GitHub organizational name in GitHub organizational name text box

  4. Click Save
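The four setup fields amount to a small configuration record. Purely as an illustration (this is not an actual Glean API payload, and the field names are invented):

```python
# Hypothetical representation of the Glean setup form; field names are
# illustrative only and not part of any documented Glean API.
datasource_config = {
    "name": "GitHub",               # data source name shown in Glean
    "icon": "github.svg",           # selected icon
    "github_organization": "acme",  # GitHub organizational name
}

print(datasource_config["github_organization"])
```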

Items crawled

Content Indexed

  • Repository permissions

    • Administration

    • Contents

    • Issues

    • Metadata

    • Pull requests

    • Commit statuses

  • Organization permissions

    • Members

  • Content

  • Commits

  • Commit comment

  • Issues

  • Issue comment

  • Pull request

  • Pull request review

  • Pull request review comment

  • Push

  • Repository

  • GitHub Pages (HTML and Markdown)

  • GitHub Wikis

Identity

  • Users: Information about users within the GitHub organization.

  • Groups: Details about groups within GitHub at the global and repository level.

The identity crawl operates with the following configurations:

  • Full Identity Crawls: These are conducted periodically to ensure all identity data is up-to-date.

Webhook Events

  • Commit comment

  • Issues

  • Issue comment

  • Member

  • Organization

  • Pull request

  • Pull request review

  • Pull request review comment

  • Push

  • Repository

  • Team

  • Team add

The Glean App for GitHub helps Glean provide highly personalized search results for users. By sending webhook events to the customer’s Glean instance each time an event occurs, the app enables the Glean instance to gather valuable information crucial to delivering an outstanding search experience. The webhook information is stored securely in the customer’s dedicated cloud project or account, ensuring complete privacy and protection of the data.
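The document does not describe how webhook deliveries are validated, but GitHub webhooks are conventionally authenticated by an HMAC-SHA256 signature of the payload, sent in the X-Hub-Signature-256 header. A minimal verification sketch, assuming a shared webhook secret:

```python
import hashlib
import hmac


def verify_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Check a GitHub-style webhook delivery.

    GitHub signs each payload with HMAC-SHA256 using the webhook secret and
    sends the result in the X-Hub-Signature-256 header as "sha256=<hexdigest>".
    """
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information.
    return hmac.compare_digest(expected, signature_header)


payload = b'{"action": "opened", "issue": {"number": 1}}'
secret = b"webhook-secret"
header = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
print(verify_signature(secret, payload, header))  # True
```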

Rate Limits

  • Queries per Second (QPS): The default rate limit is set to 4 queries per second.

  • For more information regarding GitHub's rate limiting, refer to GitHub's rate limits documentation
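A client staying under the 4 QPS default could pace its requests with a token bucket. This is a generic sketch of the technique, not Glean's actual throttling code:

```python
import time


class TokenBucket:
    """Client-side limiter that caps the request rate (e.g. 4 QPS)."""

    def __init__(self, rate: float):
        self.rate = rate           # tokens (requests) added per second
        self.tokens = rate         # start with a full bucket
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until a request token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)


bucket = TokenBucket(rate=4.0)
start = time.monotonic()
for _ in range(8):
    bucket.acquire()  # each GitHub API call would go here
elapsed = time.monotonic() - start
print(f"8 requests paced over {elapsed:.2f}s")
```

With a full bucket of 4 tokens, the first 4 requests pass immediately and the remaining 4 are spaced 0.25 s apart, keeping the long-run rate at 4 QPS.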

Update frequency

Content updates for the GitHub connector in Glean can happen quite rapidly, depending on the type of update and the configuration settings. Here are the key areas:

  • Webhooks: Any events such as adds, updates, and permissions changes are crawled with best effort as received. This means that any new files, or modifications to existing files are detected and processed quickly.

  • People / Identity Crawls: Changes to group memberships are picked up by the identity crawl, which runs every 10 mins. This ensures that updates to user groups and their permissions are reflected promptly.

  • Incremental Crawls: These occur every 10 minutes to provide additional reliability beyond the near-real-time webhooks.

  • Full Crawls: The frequency of full crawls can be configured, but they are generally less frequent than incremental crawls, typically every 28 days.

Changes in data must be crawled, processed, and indexed before they are reflected in the UI. Actual time may vary depending on the number of changes and corpus size. For the most up-to-date crawler refresh information, please refer to [External] Glean crawling strategy

How the crawl works

The GitHub crawler follows the traditional crawler strategy, using the GitHub API and the following mechanisms to fetch and update data:

  • Identity Crawl: updates and adds People data, including users, groups, and other information

  • Webhooks: messages sent by the application to notify Glean of changes in real time; Glean then either initiates a crawl or picks up the change on the next crawl

  • Content Crawls: full crawls cover the entire defined scope of the application, whereas incremental crawls capture only the changes since the previous full or incremental crawl
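The full-versus-incremental distinction can be illustrated with a toy delta computation; the document structures here are illustrative, not Glean's internal model:

```python
from datetime import datetime, timezone


def incremental_changes(documents, last_crawl):
    """Return only the documents modified since the previous crawl.

    A full crawl would return every document in scope; an incremental
    crawl captures just the delta since `last_crawl`.
    """
    return [d for d in documents if d["updated_at"] > last_crawl]


docs = [
    {"id": "issue-1", "updated_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": "pr-7", "updated_at": datetime(2024, 5, 3, tzinfo=timezone.utc)},
]
last = datetime(2024, 5, 2, tzinfo=timezone.utc)
print([d["id"] for d in incremental_changes(docs, last)])  # ['pr-7']
```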

Known Limitations in Crawl

  • Private repositories must be authorized before they appear in Glean search results. To authorize a private repository for search and chat results, navigate to your user settings in the Glean UI, then select Data sources and click the GitHub tile to begin the authorization process.

  • Due to the limitations of GitHub’s API, only Pages sites with legacy build types are supported. Glean indexes the content files in the gh-pages branch of Pages repositories.

  • GitHub Wiki pages are not crawled by default. To enable crawling of GitHub Wiki pages, please contact your Glean representative.

  • For Glean to index compiled Pages site files, the workflow or app must write the compiled files back to the repository. If not, Glean cannot access those files directly via its git crawler. Glean has other approaches to crawl the data:

    • Crawling the repository: Glean cannot automatically resolve the mapping from source files to compiled files to determine how to combine and index the files.

    • Using the web crawler: This works well for public sites, since the web crawler can render JavaScript and index the content users see in the browser. For private sites, GitHub handles authentication internally, and there are no credentials that Glean can use to access the data.

Unsupported items

  • Crawling custom GitHub Action workflows concerning GitHub pages

API endpoints

The Glean GitHub app calls GitHub Cloud endpoints under the following Connect app scopes (all READ):

  • List installations for the authenticated app

  • Create an installation access token for an app

  • Repository permissions for “Metadata”

  • Organization permissions for “Members”

  • Repository permissions for “Issues”

  • Repository permissions for “Pull requests”

  • Repository permissions for “Pages”

  • Per-user OAuth with scope user:email, used to obtain a `code` that is exchanged for an OAuth access token

  • /user – User permissions for “Email Addresses”, called with the per-user OAuth access token to retrieve each user’s email addresses

Git protocol endpoints (prefixed by gitDomain)

GET /<repository-name>.git/info/refs?service=git-upload-pack

POST /<repository-name>.git/git-upload-pack
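For a given gitDomain and repository name, the two Git smart-HTTP endpoints above can be assembled as follows; `git_fetch_urls` is a hypothetical helper name, not part of any Glean or GitHub API:

```python
def git_fetch_urls(git_domain: str, repository: str) -> tuple:
    """Build the two smart-HTTP endpoints used to fetch repository content.

    The GET request (info/refs) discovers the refs a server advertises;
    the POST to git-upload-pack transfers the requested objects.
    """
    base = f"{git_domain}/{repository}.git"
    return (
        f"{base}/info/refs?service=git-upload-pack",  # ref discovery
        f"{base}/git-upload-pack",                    # object transfer
    )


print(git_fetch_urls("https://github.com", "acme/web"))
```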

Content Configuration

Note: If Inclusion (Green-Listing) options are enabled, only content matching the inclusion rules will be indexed. If Exclusion (Red-Listing) options are enabled, all content matching the exclusion rules will be removed. If both rules apply to the same content, the content will NOT be indexed, as the exclusion rule takes priority.

The rules below should be used MINIMALLY to preserve the enterprise search experience, as most end-users expect to find all content. Most customers do not apply any rules or apply red-listing rules sparingly for sensitive items.
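The precedence in the note above — exclusion always wins, and a non-empty inclusion list restricts crawling to the listed repositories — can be sketched as follows; `should_index` is a hypothetical helper, not Glean's configuration engine:

```python
def should_index(repo: str, inclusions: set, exclusions: set) -> bool:
    """Decide whether a repository is indexed under green/red-listing rules.

    Exclusion (red-listing) takes priority; if any inclusion (green-listing)
    rules exist, only listed repositories qualify; with no rules, everything
    is indexed.
    """
    if repo in exclusions:
        return False              # exclusion rule always wins
    if inclusions:
        return repo in inclusions # inclusion list restricts the crawl
    return True                   # no rules: index everything


# A repository matched by both rules is NOT indexed.
print(should_index("acme/web", inclusions={"acme/web"}, exclusions={"acme/web"}))  # False
```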

Exclusion (Red-Listing) Options

Repositories whose names are entered here will not be crawled.

To exclude GitHub Pages sites, please contact your Glean representative to configure this in your Glean instance.

Inclusion (Green-Listing) Options

Repositories whose names are entered here will be the only repositories crawled; repositories not listed will not be crawled.
