GitHub Cloud Connector

This document covers all information related to our GitHub connector.

Written by Dan Iacono
Updated over a week ago

Introduction

The GitHub connector for Glean allows Glean to fetch and index content from GitHub, ensuring that users can search and access documents for which they have authorized permissions.

  • Authentication: Glean requires the GitHub admin to authenticate to Glean during the setup of the Glean crawler app in the GitHub marketplace.

  • Data Storage: All data is stored in the cloud project within the customer's cloud account, ensuring no data leaves the customer's environment.

API Usage

  • Standard API: Glean uses GitHub’s standard REST API to ingest all data

Integration Features

  • Content Captured: Glean captures GitHub repos, commits, issues, pulls, and pushes.

  • Permissions Enforcement: Glean respects all user access permissions, ensuring users only see search results for documents they can access. When a user clicks a search result, they are taken to the GitHub web application, which enforces the permissions.
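The permission enforcement above can be illustrated with a minimal sketch: results are filtered against the set of repositories a user is authorized to see before they are returned. All names and structures here are hypothetical, not Glean's implementation:

```python
def filter_results(results, accessible_repos):
    """Return only the search results whose repository the user may access.

    `results` is a list of dicts with a "repo" key; `accessible_repos` is
    the set of repository names the user is authorized to see.
    (Illustrative structures only.)
    """
    return [r for r in results if r["repo"] in accessible_repos]


results = [
    {"title": "Fix login bug", "repo": "acme/web"},
    {"title": "Rotate secrets", "repo": "acme/infra-private"},
]
# A user with access only to acme/web never sees the private result.
print(filter_results(results, {"acme/web"}))
```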

Versions Supported

There are no specific version limitations for the GitHub connector; however, this document covers GitHub Cloud only, not the on-premise GitHub Enterprise Server.

Objects Supported

The GitHub connector supports the following objects:

  • Contents

  • Issues

  • Metadata

  • Pull requests

  • Commit statuses

  • GitHub Pages

Authentication Mechanism

The GitHub organizational administrator will install a GitHub Marketplace app with Admin read-access scope. Navigate to https://github.com/apps/glean-github-app, or within the Glean UI go to Admin console → Data sources → GitHub and click Install the Glean GitHub App.

The GitHub Connector app is used for indexing content and delivers webhooks to the customer's Glean instance. Users cannot access private repositories until the OAuth flow is completed. On first use of GitHub search in Glean, users are prompted to authenticate via GitHub OAuth to view private repositories and to help sync user aliases. Once the OAuth flow is completed, Glean detects the change in the entity crawl and syncs the private repositories and aliases.

OAuth Flow for Individual Users and Private Repository Access

Individual users must authorize Glean in the UI by clicking their profile picture (bottom left corner) → Your settings → Data sources → GitHub.
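The per-user authorization uses GitHub's standard OAuth web flow with the user:email scope (see the API endpoints section). As a rough sketch, the authorization URL a user is redirected to can be built as below; the client_id and redirect_uri values are placeholders, not Glean's actual app credentials:

```python
from urllib.parse import urlencode


def authorize_url(client_id: str, redirect_uri: str) -> str:
    """Build a GitHub OAuth authorization URL requesting the user:email scope."""
    params = {
        "client_id": client_id,        # placeholder OAuth app client ID
        "redirect_uri": redirect_uri,  # placeholder callback URL
        "scope": "user:email",         # scope named in the API endpoints section
    }
    return "https://github.com/login/oauth/authorize?" + urlencode(params)


print(authorize_url("Iv1.example", "https://example.glean.com/oauth/callback"))
```

After the user approves, GitHub redirects back with a `code`, which is exchanged for an access token used to call /user and fetch email addresses.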

Connector credentials requirements

The GitHub connector for Glean requires specific permissions to function correctly.

  • Organizational Admin for the data source and GitHub app install

  • Admin read-only access for ongoing operation of the GitHub app

  • Individual user authorization for private repositories

Connection instructions

Disclaimer: The instructions below are updated periodically by Glean on our customer-facing documentation. For the latest instructions, refer to the Glean Admin UI.

Install the Glean App for GitHub

  1. Click Install or Configure

  2. Click Organization where the app is to be installed

  3. Select All Repositories and Click Install & Authorize.

Setup in Glean

  1. Input the data source name in the Name text box

  2. Select an icon

  3. Input GitHub organizational name in GitHub organizational name text box

  4. Click Save
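The four setup fields amount to a small configuration record. Purely as an illustration (this is not an actual Glean API payload, and the field names are invented):

```python
# Hypothetical representation of the Glean setup form; field names are
# illustrative only and not part of any documented Glean API.
datasource_config = {
    "name": "GitHub",               # data source name shown in Glean
    "icon": "github.svg",           # selected icon
    "github_organization": "acme",  # GitHub organizational name
}

print(datasource_config["github_organization"])
```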

Items crawled

Content Indexed

  • Repository permissions

    • Administration

    • Contents

    • Issues

    • Metadata

    • Pull requests

    • Commit statuses

  • Organization permissions

    • Members

  • Content

  • Commits

  • Commit comment

  • Issues

  • Issue comment

  • Pull request

  • Pull request review

  • Pull request review comment

  • Push

  • Repository

  • GitHub Pages (HTML and Markdown)

  • GitHub Wikis

Identity

  • Users: Information about users within the GitHub organization.

  • Groups: Details about groups within GitHub at the global and repository level.

The identity crawl operates with the following configurations:

  • Full Identity Crawls: These are conducted periodically to ensure all identity data is up-to-date.

Webhook Events

  • Commit comment

  • Issues

  • Issue comment

  • Member

  • Organization

  • Pull request

  • Pull request review

  • Pull request review comment

  • Push

  • Repository

  • Team

  • Team add

The Glean App for GitHub helps Glean provide highly personalized search results for users. By sending webhook events to the customer’s Glean instance each time an event occurs, the app enables the Glean instance to gather valuable information crucial to delivering an outstanding search experience. The webhook information is stored securely in the customer’s dedicated cloud project or account, ensuring complete privacy and protection of the data.
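The document does not describe how webhook deliveries are validated, but GitHub webhooks are conventionally authenticated by an HMAC-SHA256 signature of the payload, sent in the X-Hub-Signature-256 header. A minimal verification sketch, assuming a shared webhook secret:

```python
import hashlib
import hmac


def verify_signature(secret: bytes, payload: bytes, signature_header: str) -> bool:
    """Check a GitHub-style webhook delivery.

    GitHub signs each payload with HMAC-SHA256 using the webhook secret and
    sends the result in the X-Hub-Signature-256 header as "sha256=<hexdigest>".
    """
    expected = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking timing information.
    return hmac.compare_digest(expected, signature_header)


payload = b'{"action": "opened", "issue": {"number": 1}}'
secret = b"webhook-secret"
header = "sha256=" + hmac.new(secret, payload, hashlib.sha256).hexdigest()
print(verify_signature(secret, payload, header))  # True
```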

Rate Limits

  • Queries per Second (QPS): The default rate limit is set to 4 queries per second.

  • For more information regarding GitHub's rate limiting, refer to GitHub's rate limits documentation
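A client staying under the 4 QPS default could pace its requests with a token bucket. This is a generic sketch of the technique, not Glean's actual throttling code:

```python
import time


class TokenBucket:
    """Client-side limiter that caps the request rate (e.g. 4 QPS)."""

    def __init__(self, rate: float):
        self.rate = rate           # tokens (requests) added per second
        self.tokens = rate         # start with a full bucket
        self.last = time.monotonic()

    def acquire(self) -> None:
        """Block until a request token is available, then consume it."""
        while True:
            now = time.monotonic()
            self.tokens = min(self.rate, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1:
                self.tokens -= 1
                return
            time.sleep((1 - self.tokens) / self.rate)


bucket = TokenBucket(rate=4.0)
start = time.monotonic()
for _ in range(8):
    bucket.acquire()  # each GitHub API call would go here
elapsed = time.monotonic() - start
print(f"8 requests paced over {elapsed:.2f}s")
```

With a full bucket of 4 tokens, the first 4 requests pass immediately and the remaining 4 are spaced 0.25 s apart, keeping the long-run rate at 4 QPS.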

Update frequency

Content updates for the GitHub connector in Glean can happen quite rapidly, depending on the type of update and the configuration settings. Here are the key areas:

  • Webhooks: Any events such as adds, updates, and permissions changes are crawled with best effort as received. This means that any new files, or modifications to existing files are detected and processed quickly.

  • People / Identity Crawls: Changes to group memberships are picked up by the identity crawl, which runs every 10 mins. This ensures that updates to user groups and their permissions are reflected promptly.

  • Incremental Crawls: These occur every 10 minutes to provide additional reliability beyond the near-real-time webhooks.

  • Full Crawls: The frequency of full crawls can be configured, but they are generally less frequent than incremental crawls, typically every 28 days.

Changes in data must be crawled, processed, and indexed before they are reflected in the UI. Actual time may vary depending on the number of changes and corpus size. For the most up-to-date crawler refresh information, please refer to [External] Glean crawling strategy

How the crawl works

The GitHub crawler follows the traditional crawler strategy, using the GitHub API and the following mechanisms to fetch and update data:

  • Identity Crawl: updates and adds People data, including users, groups, and other information

  • Webhooks: messages sent by the application to notify Glean of changes in real time; Glean then either initiates a crawl or picks up the change on the next crawl

  • Content Crawls: full crawls cover the entire defined scope of the application, whereas incremental crawls capture only the changes since the previous full or incremental crawl
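The full-versus-incremental distinction can be illustrated with a toy delta computation; the document structures here are illustrative, not Glean's internal model:

```python
from datetime import datetime, timezone


def incremental_changes(documents, last_crawl):
    """Return only the documents modified since the previous crawl.

    A full crawl would return every document in scope; an incremental
    crawl captures just the delta since `last_crawl`.
    """
    return [d for d in documents if d["updated_at"] > last_crawl]


docs = [
    {"id": "issue-1", "updated_at": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    {"id": "pr-7", "updated_at": datetime(2024, 5, 3, tzinfo=timezone.utc)},
]
last = datetime(2024, 5, 2, tzinfo=timezone.utc)
print([d["id"] for d in incremental_changes(docs, last)])  # ['pr-7']
```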

Known Limitations in Crawl

  • Private repositories must be authorized before they appear in Glean search results. To authorize a private repository for search and chat results, navigate to your user settings in the Glean UI, then select Data sources and click the GitHub tile to begin the authorization process.

  • Due to the limitations of GitHub’s API, only Pages sites with legacy build types are supported. Glean indexes the content files in the gh-pages branch of Pages repositories.

  • GitHub Wiki pages are not crawled by default. To enable crawling of GitHub Wiki pages, please contact your Glean representative.

  • For Glean to index compiled Pages site files, the workflow or app must write the compiled files back to the repository. If not, Glean cannot access those files directly via its git crawler. Glean has other approaches to crawl the data:

    • Crawling the repository: Glean cannot automatically resolve the mapping from source files to compiled files to determine how to combine and index the files.

    • Using the web crawler: This works well for public sites, since the web crawler can render JavaScript and index the content users see in the browser. For private sites, GitHub handles authentication internally, and there are no credentials that Glean can use to access the data.

Unsupported items

  • Crawling custom GitHub Action workflows concerning GitHub pages

API endpoints

The Glean GitHub app calls GitHub Cloud endpoints under the following Connect app scopes (all READ):

  • List installations for the authenticated app

  • Create an installation access token for an app

  • Repository permissions for “Metadata”

  • Organization permissions for “Members”

  • Repository permissions for “Issues”

  • Repository permissions for “Pull requests”

  • Repository permissions for “Pages”

  • Per-user OAuth with scope user:email, used to obtain a `code` that is exchanged for an OAuth access token

  • /user – User permissions for “Email Addresses”, called with the per-user OAuth access token to retrieve each user’s email addresses

Git protocol endpoints (prefixed by gitDomain)

GET /<repository-name>.git/info/refs?service=git-upload-pack

POST /<repository-name>.git/git-upload-pack
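For a given gitDomain and repository name, the two Git smart-HTTP endpoints above can be assembled as follows; `git_fetch_urls` is a hypothetical helper name, not part of any Glean or GitHub API:

```python
def git_fetch_urls(git_domain: str, repository: str) -> tuple:
    """Build the two smart-HTTP endpoints used to fetch repository content.

    The GET request (info/refs) discovers the refs a server advertises;
    the POST to git-upload-pack transfers the requested objects.
    """
    base = f"{git_domain}/{repository}.git"
    return (
        f"{base}/info/refs?service=git-upload-pack",  # ref discovery
        f"{base}/git-upload-pack",                    # object transfer
    )


print(git_fetch_urls("https://github.com", "acme/web"))
```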

Content Configuration

Note: If Inclusion (Green-Listing) options are enabled, only content matching the inclusion rules will be indexed. If Exclusion (Red-Listing) options are enabled, all content matching the exclusion rules will be removed. If both rules apply to the same content, the content will NOT be indexed, as the exclusion rule takes priority.

The rules below should be used MINIMALLY to preserve the enterprise search experience, as most end-users expect to find all content. Most customers do not apply any rules or apply red-listing rules sparingly for sensitive items.
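The precedence in the note above — exclusion always wins, and a non-empty inclusion list restricts crawling to the listed repositories — can be sketched as follows; `should_index` is a hypothetical helper, not Glean's configuration engine:

```python
def should_index(repo: str, inclusions: set, exclusions: set) -> bool:
    """Decide whether a repository is indexed under green/red-listing rules.

    Exclusion (red-listing) takes priority; if any inclusion (green-listing)
    rules exist, only listed repositories qualify; with no rules, everything
    is indexed.
    """
    if repo in exclusions:
        return False              # exclusion rule always wins
    if inclusions:
        return repo in inclusions # inclusion list restricts the crawl
    return True                   # no rules: index everything


# A repository matched by both rules is NOT indexed.
print(should_index("acme/web", inclusions={"acme/web"}, exclusions={"acme/web"}))  # False
```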

Exclusion (Red-Listing) Options

Repositories whose names are entered here will not be crawled.

To exclude GitHub Pages sites, please contact your Glean representative to configure this in your Glean instance.

Inclusion (Green-Listing) Options

Repositories whose names are entered here will be the only repositories crawled; repositories not listed will not be crawled.
