Introduction
The GitHub connector for Glean allows Glean to fetch and index content from GitHub, ensuring that users can search and access documents for which they have authorized permissions.
Authentication: Glean requires the GitHub admin to authenticate to Glean during the setup of the Glean crawler app in the GitHub marketplace.
Data Storage: All data is stored in the cloud project within the customer's cloud account, ensuring no data leaves the customer's environment.
API Usage
Standard API: Glean uses GitHub’s standard REST API to ingest all data.
Integration Features
Content Captured: Glean captures GitHub repos, commits, issues, pulls, and pushes.
Permissions Enforcement: Glean respects all user access permissions, ensuring users only see search results for documents they can access. When a user clicks on a search result, they are taken to the GitHub web application, which enforces the permissions again.
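The permission trimming described above can be pictured with a minimal sketch. The SearchResult type, its allowed_users and is_public fields, and the visible_results function are hypothetical names used only for illustration; this is not Glean's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class SearchResult:
    """Hypothetical indexed document with the users allowed to read it."""
    url: str
    allowed_users: set = field(default_factory=set)
    is_public: bool = False


def visible_results(results, user):
    """Keep only the results the given user is permitted to see."""
    return [r for r in results if r.is_public or user in r.allowed_users]


# Illustrative data: one public and one private repository.
results = [
    SearchResult("https://github.com/acme/public-repo", is_public=True),
    SearchResult("https://github.com/acme/private-repo", allowed_users={"alice"}),
]
print([r.url for r in visible_results(results, "bob")])
# prints ['https://github.com/acme/public-repo']
```

In this sketch the filter runs at query time, and GitHub itself still checks access when the user opens the result.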
Versions Supported
The GitHub connector has no specific version limitations. This document, however, applies only to GitHub Cloud, not to on-premise GitHub Server and Enterprise Server deployments.
Objects Supported
The GitHub connector supports the following objects:
Contents
Issues
Metadata
Pull requests
Commit statuses
GitHub Pages
Authentication Mechanism
The GitHub organizational administrator installs a GitHub Marketplace app with Admin read-access scope. To install it, navigate to https://github.com/apps/glean-github-app, or in the Glean UI go to Admin console → Data sources → GitHub and click Install the Glean GitHub App.
The GitHub Connector app is used to index content and to deliver webhooks to the customer’s Glean instance. Users cannot access private repositories until the OAuth flow is completed. On their first use of GitHub search in Glean, users are prompted to authenticate with GitHub OAuth so that they can view private repositories and so that Glean can sync user aliases. Once the OAuth flow is completed, Glean detects the change in the entity crawl and syncs the private repositories and aliases.
OAuth Flow for Individual Users and Private Repository Access
Individual users must authorize Glean in the UI by clicking their profile picture (bottom left corner) → Your settings → Data sources → GitHub.
Connector credentials requirements
The GitHub connector for Glean requires specific permissions to function correctly.
Organizational Admin for the data source and GitHub app install
Admin read-only for ongoing running and operating of the GitHub app
Individual user authorization for private repositories
Connection instructions
Disclaimer: The instructions below are updated periodically by Glean on our customer-facing documentation. For the latest instructions, refer to the Glean Admin UI.
Install the Glean App for GitHub
Click Install or Configure
Click Organization where the app is to be installed
Select All Repositories and click Install & Authorize.
Setup in Glean
Input the data source name in the Name text box
Select an icon
Input the GitHub organizational name in the GitHub organizational name text box
Click Save
Items crawled
Content Indexed
Repository permissions
Administration
Contents
Issues
Metadata
Pull requests
Commit statuses
Organization permissions
Members
Content
Commits
Commit comment
Issues
Issue comment
Pull request
Pull request review
Pull request review comment
Push
Repository
GitHub Pages (HTML and Markdown)
GitHub Wikis
Identity
Users: Information about users within GitHub.
Groups: Details about groups within GitHub at the global and repository level.
The identity crawl operates with the following configurations:
Full Identity Crawls: These are conducted periodically to ensure all identity data is up-to-date.
Webhook Events
Commit comment
Issues
Issue comment
Member
Organization
Pull request
Pull request review
Pull request review comment
Push
Repository
Team
Team add
The Glean App for GitHub helps Glean provide highly personalized search results for users. By sending webhook events to the customer’s Glean instance each time an event occurs, the app enables the Glean instance to gather valuable information crucial to delivering an outstanding search experience. The webhook information is stored securely in the customer’s dedicated cloud project or account, ensuring complete privacy and protection of the data.
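As an illustration of how a receiver might recognize the events listed above, here is a minimal sketch. It assumes the event type arrives in GitHub's X-GitHub-Event delivery header and maps the list above to GitHub's snake_case event names. The should_process function and SUBSCRIBED_EVENTS constant are hypothetical and do not describe Glean's internal handling.

```python
# Event names as GitHub sends them in the X-GitHub-Event header,
# mapped from the webhook event list above.
SUBSCRIBED_EVENTS = {
    "commit_comment", "issues", "issue_comment", "member", "organization",
    "pull_request", "pull_request_review", "pull_request_review_comment",
    "push", "repository", "team", "team_add",
}


def should_process(headers):
    """Return True if the delivery's event type is one of the subscribed events."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {key.lower(): value for key, value in headers.items()}
    return normalized.get("x-github-event") in SUBSCRIBED_EVENTS


print(should_process({"X-GitHub-Event": "push"}))  # prints True
print(should_process({"X-GitHub-Event": "star"}))  # prints False
```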
Rate Limits
Queries per Second (QPS): The default rate limit is set to 4 queries per second.
For more information about GitHub’s rate limiting, see GitHub’s rate limit documentation.
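To make the 4 QPS default concrete, the sketch below spaces requests at least 0.25 seconds apart. The Throttle class is a hypothetical client-side helper written for illustration; how Glean actually enforces the limit is not described in this document.

```python
import time


class Throttle:
    """Space out calls so that at most `qps` calls start per second."""

    def __init__(self, qps=4.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / qps  # 0.25 s at the default 4 QPS
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve the following slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.min_interval
```

A caller would create one Throttle per connection and call wait() before each API request. The clock and sleep parameters exist only so the behavior can be checked without real delays.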
Update frequency
Content updates for the GitHub connector in Glean can happen quite rapidly, depending on the type of update and the configuration settings. Here are the key areas:
Webhooks: Any events such as adds, updates, and permissions changes are crawled with best effort as received. This means that any new files, or modifications to existing files are detected and processed quickly.
People / Identity Crawls: Changes to group memberships are picked up by the identity crawl, which runs every 10 minutes. This ensures that updates to user groups and their permissions are reflected promptly.
Incremental Crawls: These occur every 10 minutes to provide additional reliability beyond the minute-by-minute webhooks.
Full Crawls: The frequency of full crawls can be configured, but they run far less often than incremental crawls, generally every 28 days.
Changes in data must be crawled, processed, and indexed before the data is reflected in the UI. Actual time may vary depending on the number of changes and the corpus size. For the most up-to-date crawler refresh information, please refer to [External] Glean crawling strategy.
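The schedule above can be summarized in a small sketch that decides which crawls are due at a given moment. The CRAWL_INTERVALS table and due_crawls function are hypothetical; the intervals are the values stated above and may differ in a configured deployment.

```python
from datetime import datetime, timedelta

# Intervals taken from the list above; actual values may be configured differently.
CRAWL_INTERVALS = {
    "identity": timedelta(minutes=10),
    "incremental": timedelta(minutes=10),
    "full": timedelta(days=28),
}


def due_crawls(last_run, now):
    """Return the crawl types whose interval has elapsed since their last run."""
    return [
        name
        for name, interval in CRAWL_INTERVALS.items()
        if now - last_run[name] >= interval
    ]


# Illustrative timestamps only.
now = datetime(2024, 1, 29, 12, 0)
last_run = {
    "identity": now - timedelta(minutes=12),
    "incremental": now - timedelta(minutes=3),
    "full": now - timedelta(days=28),
}
print(due_crawls(last_run, now))  # prints ['identity', 'full']
```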
How the crawl works
The GitHub crawler follows the traditional crawler strategy. It uses the GitHub API together with the following mechanisms to fetch and update data:
Identity Crawl: Adds and updates People data, including users, groups, and other information.
Webhooks: Messages sent by the application to notify Glean of changes in real time. Glean then either initiates a crawl or picks up the change on the next crawl.
Content Crawls: Full crawls cover the entire defined scope of the application, whereas incremental crawls capture only the changes since the previous full or incremental crawl.
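The difference between full and incremental content crawls can be sketched as follows. The select_documents function and the document map are hypothetical, and plain integer timestamps stand in for modification times; this only illustrates the selection logic, not Glean's crawler.

```python
def select_documents(documents, mode, last_crawl_time=None):
    """Pick the documents a crawl would fetch.

    documents: dict mapping a document id to its last-modified timestamp.
    mode: "full" fetches everything in scope; "incremental" fetches only
    documents modified after the previous crawl.
    """
    if mode not in ("full", "incremental"):
        raise ValueError(f"unknown crawl mode: {mode}")
    # With no previous crawl to compare against, fall back to a full crawl.
    if mode == "full" or last_crawl_time is None:
        return sorted(documents)
    return sorted(
        doc for doc, modified in documents.items() if modified > last_crawl_time
    )


docs = {"README.md": 100, "src/app.py": 250, "docs/guide.md": 300}
print(select_documents(docs, "full"))
print(select_documents(docs, "incremental", last_crawl_time=200))
```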
Known Limitations in Crawl
Private repositories must be authorized before they appear in Glean search results. To authorize a private repository to appear in search and chat results, navigate to your user settings in the Glean UI. Then select Data sources and click the GitHub tile to begin the authorization process.
Due to the limitations of GitHub’s API, only Pages sites with legacy build types are supported. Glean indexes the content files in the gh-pages branch of Pages repositories.
GitHub Wiki pages are not crawled by default. To enable GitHub Wiki pages, please contact your Glean representative.
For Glean to index compiled Pages site files, the workflow or app must write the compiled files back to the repository. Otherwise, Glean can’t access those files directly via its git crawler. Glean has other approaches to crawling this data, each with limitations:
Crawling the repository: Glean cannot automatically resolve the mapping from source files to compiled files to determine how to combine and index the files.
Using the web crawler: Works well for public sites since the web crawler can render JavaScript and index the content users see via the browser. GitHub handles authentication internally for private sites, and there are no credentials that Glean can use to access the data.
Unsupported items
Crawling custom GitHub Actions workflows that build GitHub Pages sites
API endpoints
Purpose | Cloud Endpoint | Connect app scope required |
List installations for the authenticated app | READ | |
Create an installation access token for an app | | |
Repository permissions for "Metadata" | READ | |
Organization Permissions for “Members” | READ | |
Repository permissions for "Metadata" | READ | |
Repository Permissions for “Issues” | READ | |
Repository Permissions for “Issues” | READ | |
Repository Permissions for “Pull requests” | READ | |
Repository Permissions for “Pull requests” | READ | |
Repository Permissions for “Pull requests” | READ | |
Repository Permissions for “Pull requests” | READ | |
Repository Permissions for “Pages” | READ | |
Repository Permissions for “Metadata” | READ | |
Per-User OAuth with scope user:email to get `code` to be exchanged for OAuth access token | READ | |
Per-User OAuth exchange `code` from above to get access_token to retrieve user emails for below | | |
Per-User OAuth – scope user:email to get emails per-user | /user | |
User Permissions for “Email Addresses” | READ | |
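The first step of the per-user OAuth rows above can be sketched as building an authorization URL that requests the user:email scope. The endpoint column of the table is empty in this document, so the URL below is an assumption taken from GitHub's public OAuth web flow documentation. The build_authorize_url function and its example arguments are hypothetical.

```python
from urllib.parse import urlencode

# Assumption: GitHub's public OAuth web-flow authorization endpoint.
# It is not taken from the table above, whose endpoint column is empty.
AUTHORIZE_URL = "https://github.com/login/oauth/authorize"


def build_authorize_url(client_id, redirect_uri, state):
    """Build the URL a user visits to grant the user:email scope."""
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": "user:email",
        "state": state,  # opaque value echoed back to guard against CSRF
    }
    return f"{AUTHORIZE_URL}?{urlencode(params)}"


# Placeholder values for illustration only.
url = build_authorize_url("abc123", "https://example.com/callback", "xyz")
print(url)
```

After the user approves, GitHub redirects back with a `code`, which is then exchanged for an access token as described in the table rows above.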
Git Protocol endpoints (prefixed by gitDomain)
GET /<repository-name>.git/info/refs?service=git-upload-pack
POST /<repository-name>.git/git-upload-pack
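As a small illustration, the two Git protocol requests above can be assembled from a git domain and a repository name. The upload_pack_requests function and the example domain are hypothetical; the paths are the ones listed above.

```python
def upload_pack_requests(git_domain, repository_name):
    """Return the (method, URL) pairs for the two git-upload-pack requests."""
    base = f"{git_domain.rstrip('/')}/{repository_name}.git"
    return [
        # Reference discovery: lists the refs the server can offer.
        ("GET", f"{base}/info/refs?service=git-upload-pack"),
        # Pack negotiation and transfer of the requested objects.
        ("POST", f"{base}/git-upload-pack"),
    ]


# "https://github.com/acme" is a placeholder for gitDomain.
for method, url in upload_pack_requests("https://github.com/acme", "widgets"):
    print(method, url)
```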
Content Configuration
Note: If Inclusion (Green-Listing) options are enabled, only content matching the inclusion rules is indexed. If Exclusion (Red-Listing) options are enabled, all content matching the exclusion rules is removed. If both rules apply to the same content, the content is NOT indexed, because the exclusion rule takes priority.
The rules below should be used MINIMALLY to preserve the enterprise search experience, as most end-users expect to find all content. Most customers do not apply any rules or apply red-listing rules sparingly for sensitive items.
Exclusion (Red-Listing) Options
By entering specific repository names, the inputted repositories will not be crawled.
To exclude GitHub Pages sites, please contact your Glean representative to configure this in your Glean instance.
Inclusion (Green-Listing) Options
By entering specific repository names, the inputted repositories will be the only repositories crawled. Repositories that are not listed will not be crawled.
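The precedence between inclusion and exclusion rules described in this section can be sketched as a single decision function. The is_crawled function and the repository names are hypothetical and only illustrate the stated rule that exclusion takes priority.

```python
def is_crawled(repo, include=None, exclude=None):
    """Decide whether a repository is crawled under the rules above.

    An empty or missing include list means no green-listing is applied.
    Exclusion always wins when a repository matches both lists.
    """
    if exclude and repo in exclude:
        return False
    if include:
        return repo in include
    return True


print(is_crawled("payments", include={"payments"}, exclude={"payments"}))  # prints False
print(is_crawled("docs", include={"payments"}))                            # prints False
print(is_crawled("docs"))                                                  # prints True
```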