Introduction

The GitLab connector for Glean allows Glean to fetch and index content from GitLab, ensuring that users can search and access documents for which they have authorized permissions.

Authentication: During data source setup, the GitLab admin authenticates to Glean with a personal access token and a webhook secret token.
Data Storage: All data is stored in the cloud project within the customer's cloud account, so no data leaves the customer's environment.

API Usage

Standard API: Glean uses GitLab’s standard REST API to ingest all data.

Integration Features

Content Captured: Glean captures GitLab repos, commits, issues, merge requests, and pushes.
Permissions Enforcement: Glean respects all user access permissions, so users only see search results for documents they can access. When a user clicks a search result, they are taken to the GitLab web application, which enforces the permission.

Versions Supported

The GitLab connector has no specific version limitations. Note that this document covers GitLab Cloud only, not on-premises GitLab Server, which Glean supports through a separate connector.

Objects Supported

The GitLab connector supports the following objects:

Merge request descriptions
Merge request conversations/comments
Commit messages for the main branch
Wikis
Issues

Glean will capture the following from the latest commit on the main branch:

Directory/file names
Full content of documentation files only (.md and .txt)

Authentication Mechanism

Glean requires a personal access token from a GitLab user account to authorize Glean API calls. This account must have access to all projects that Glean should crawl. If the token is granted the api scope (recommended), Glean can create webhooks programmatically during setup. If the token is restricted to read-only access, webhooks must be created manually for every project that you want crawled.

Create a Personal Access Token (PAT)

Sign in to your GitLab user account.
Click the user icon in the upper right-hand corner and select "Preferences".
Select "Access Tokens" in the left side menu.
Add a personal access token.
1. Name: Glean Token
2. Scopes:
  1. If granting write privileges: api (recommended)
  2. If granting read-only privileges:
    read_user
    read_api
    read_repository
Leave "Expires at" empty.
Copy the personal access token (PAT) into the corresponding field of the Glean GitLab data source during setup.
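
Before pasting the token into Glean, it can be useful to confirm which scopes it actually carries. The following is a minimal Python sketch and is not part of Glean's setup flow. It assumes GitLab.com as the host and relies on GitLab's personal_access_tokens/self REST endpoint; the function names classify_scopes and fetch_token_scopes are illustrative.

```python
import json
import urllib.request

# Assumption: the token belongs to a GitLab.com (GitLab Cloud) account.
GITLAB_API = "https://gitlab.com/api/v4"

# The three scopes listed above for a read-only token.
READ_ONLY_SCOPES = {"read_user", "read_api", "read_repository"}


def classify_scopes(scopes):
    """Return 'write', 'read-only', or 'insufficient' for a list of PAT scopes."""
    granted = set(scopes)
    if "api" in granted:
        return "write"
    if READ_ONLY_SCOPES <= granted:
        return "read-only"
    return "insufficient"


def fetch_token_scopes(token):
    """Ask GitLab which scopes the token carries (performs a network call)."""
    request = urllib.request.Request(
        f"{GITLAB_API}/personal_access_tokens/self",
        headers={"PRIVATE-TOKEN": token},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["scopes"]
```

Calling classify_scopes(fetch_token_scopes(token)) indicates whether the write-privileges box in Glean applies ("write") or whether webhooks have to be created manually ("read-only").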

Create Webhooks Manually (if PAT is read-only)

Log in to GitLab with an account that has owner privileges to create webhooks manually in a project. For each project, perform the following steps:

Navigate to the project page within GitLab.
In the left side menu, navigate to Settings → Webhooks.
Create a webhook with the following properties:
- URL: (generated dynamically from Glean GitLab data source UI setup → Show setup instructions)
- Secret token: (generated dynamically from Glean GitLab data source UI setup → Show setup instructions)
- Trigger:
  - Push events
  - Comments
  - Issue events
  - Merge request events
  - Wiki page events
Enter the same secret token in the corresponding "Webhook secret token" field in Glean.
After creating all project webhooks in GitLab, click Save.
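
Creating the same webhook by hand in many projects is repetitive. As an alternative sketch (an assumption, not a documented Glean procedure), the request below mirrors the properties listed above using GitLab's project hooks REST endpoint. It needs a separate token with the api scope from an account with sufficient privileges in each project. The URL and secret token values still come from the Glean setup instructions, and build_hook_request is a hypothetical helper name.

```python
import json
import urllib.request

# Assumption: projects live on GitLab.com (GitLab Cloud).
GITLAB_API = "https://gitlab.com/api/v4"


def build_hook_request(project_id, hook_url, secret_token, owner_token):
    """Build (but do not send) the POST request that creates one project webhook."""
    payload = {
        "url": hook_url,                 # URL shown in the Glean setup instructions
        "token": secret_token,           # secret token shown in the Glean setup instructions
        "push_events": True,             # Push events
        "note_events": True,             # Comments
        "issues_events": True,           # Issue events
        "merge_requests_events": True,   # Merge request events
        "wiki_page_events": True,        # Wiki page events
    }
    return urllib.request.Request(
        f"{GITLAB_API}/projects/{project_id}/hooks",
        data=json.dumps(payload).encode("utf-8"),
        headers={"PRIVATE-TOKEN": owner_token, "Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it. Keeping construction separate from sending makes it easy to inspect the payload for each project before anything is changed in GitLab.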

User Mapping

GitLab Cloud does not disclose a user’s email address through the API unless the user has explicitly consented to the public display of their email. For Glean to accurately retrieve permissions within GitLab, it is essential for Glean to associate each user ID with the corresponding company email. Please generate a CSV file comprising two columns: GitLab user ID and email.

Column headers are not required.
Columns must be in the order (user ID, email).
The user ID is NOT the username. It contains numbers only and corresponds to the id field in the example response of the /members API.
Example of a correct row: 12345,user1@glean.com

To enumerate all GitLab user IDs, use the GitLab API. Company email addresses can be retrieved internally from an identity management system such as Okta or GSuite, or from any other source.
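
The mapping file can be assembled with a short script. The sketch below is illustrative only: it assumes member records shaped like the /members API response (with id and username fields) and a hypothetical email_by_username dictionary exported from your identity system. How GitLab accounts are matched to company emails is site-specific.

```python
import csv
import io


def build_mapping_rows(members, email_by_username):
    """Pair each member's numeric GitLab ID with a company email.

    Returns (rows, unmatched): rows are (user_id, email) tuples, and
    unmatched lists usernames with no known email for manual follow-up.
    """
    rows, unmatched = [], []
    for member in members:
        email = email_by_username.get(member["username"])
        if email is None:
            unmatched.append(member["username"])
        else:
            rows.append((member["id"], email))
    return rows, unmatched


def render_csv(rows):
    """Render rows as headerless CSV text in (user ID, email) column order."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    # A user can belong to several projects, so drop duplicate rows.
    for user_id, email in sorted(set(rows)):
        if not str(user_id).isdigit():
            raise ValueError(f"user ID must be numeric, got {user_id!r}")
        writer.writerow([user_id, email])
    return buffer.getvalue()
```

For a member with id 12345 and a matching email of user1@glean.com, render_csv produces the row 12345,user1@glean.com, which matches the example row above.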

Connection instructions

Disclaimer: The instructions below are updated periodically by Glean on our customer-facing documentation. For the latest instructions, refer to the Glean Admin UI.

Note: Please complete the Personal Access Token (PAT) and Webhook setup before proceeding.

Setup in Glean

Enter the data source name in the Name text box and select an icon.
Complete any outstanding steps in Setup Instructions. Note: Have your Personal Access Token (PAT) and Webhook secret token ready.
Complete the following:
1. Enter the GitLab PAT in the "Personal access token" text box.
2. Check the box if the PAT has write privileges.
3. Enter the GitLab webhook token in the "Webhook secret token" text box.
Upload the CSV that maps GitLab user IDs to email addresses.
Click Save.

OAuth Flow for Individual Users

Individual users must authorize Glean in the UI by clicking their profile picture (bottom left corner) → Your settings → Data sources → GitLab.

Items crawled

Content Indexed

Merge request descriptions
Merge request conversations/comments
Commit messages for the main branch
Wikis
Issues

Glean will capture the following from the latest commit on the main branch:

Directory/file names
Full content of documentation files only (.md and .txt)

Identity

Users: Information about users within GitLab.
Groups: Details about groups within GitLab at the global and repository level.

The identity crawl operates with the following configurations:

Incremental Identity Crawls: These are performed to capture changes since the last crawl.
Full Identity Crawls: These are conducted periodically to ensure all identity data is up-to-date.

Webhook Events

The Glean application for GitLab helps Glean deliver highly customized search results. It sends a webhook event to the customer’s Glean instance each time a relevant event occurs, which gives the instance the information it needs to keep search results current. For instance, webhooks are triggered when a merge request, issue, or comment is added or modified, typically prompting a content crawl within ten minutes.

The webhook information is stored securely in the customer’s dedicated cloud project or account, ensuring complete data privacy and protection.
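
GitLab delivers the configured secret token with each webhook in the X-Gitlab-Token request header and names the event type in X-Gitlab-Event. The sketch below shows how a generic receiver might validate an incoming delivery. It is not Glean's implementation, and the function name and the plain-dictionary headers argument are assumptions made for illustration.

```python
import hmac

# Event names GitLab sends in X-Gitlab-Event for the triggers configured above.
EXPECTED_EVENTS = {
    "Push Hook",
    "Note Hook",           # comments
    "Issue Hook",
    "Merge Request Hook",
    "Wiki Page Hook",
}


def is_trusted_webhook(headers, expected_secret):
    """Return True when the delivery carries the right secret and a known event.

    `headers` is assumed to be a plain dict keyed by the exact header names;
    real web frameworks usually offer case-insensitive header lookups.
    """
    received = headers.get("X-Gitlab-Token", "")
    # compare_digest avoids leaking how many leading characters matched.
    secret_ok = hmac.compare_digest(
        received.encode("utf-8"), expected_secret.encode("utf-8")
    )
    return secret_ok and headers.get("X-Gitlab-Event") in EXPECTED_EVENTS
```

Deliveries that fail this check can be rejected before any payload parsing takes place.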

Rate Limits

Queries per Second (QPS): The default rate limit is set to 4 queries per second per user.
For more information regarding GitLab’s rate limiting, see GitLab’s rate limit documentation.
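
To illustrate what a limit of 4 queries per second means for a client, the sketch below spaces calls at least 0.25 seconds apart. It is a generic minimum-interval limiter written for illustration, not a description of how Glean throttles its crawler; the class name and the injectable clock and sleep parameters are assumptions that make the behavior easy to test.

```python
import time


class MinIntervalLimiter:
    """Block callers so that successive calls are at least 1/max_per_second apart."""

    def __init__(self, max_per_second=4.0, clock=time.monotonic, sleep=time.sleep):
        self._interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may proceed

    def wait(self):
        """Sleep if needed, then reserve the next slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self._interval
```

A client would call wait() immediately before each API request. The first call returns at once, and later calls pause only when they arrive sooner than the configured spacing.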

Update frequency

Content updates for the GitLab connector in Glean can happen quite rapidly, depending on the type of update and the configuration settings. Here are the key areas:

Webhooks: Events such as adds, updates, and permission changes are crawled on a best-effort basis as they are received. This means that new files, modifications to existing files, and changes in sharing permissions are detected and processed quickly.
People / Identity Crawls: Changes to group memberships are picked up by the identity crawl, which runs every hour. This ensures that updates to user groups and their permissions are reflected promptly.
Incremental Crawls: These occur every 10 minutes to provide additional reliability beyond the near-real-time webhooks.
Full Crawls: The frequency of full crawls can be configured, but they run far less often than incremental crawls, typically every 28 days.

Changes in data must be crawled, processed, and indexed before the data is reflected in the UI. Actual time may vary depending on the number of changes and the corpus size. For the most up-to-date crawler refresh information, please refer to [External] Glean crawling strategy.

How the crawl works

The GitLab crawler follows the traditional crawler strategy. It uses the GitLab API and the following mechanisms to fetch and update data:

Identity Crawl: Adds and updates People data, including users, groups, and other information.
Webhooks: Messages sent by the application to notify Glean of changes in real time. Glean then either initiates a crawl or picks up the change on the next crawl.
Content Crawls: A full crawl covers the entire defined scope of the application, whereas an incremental crawl captures only the changes since the previous full or incremental crawl.

Known Limitations in Crawl

A Glean representative must turn on code search functionality.

Unsupported items

None

API endpoints

Purpose	Cloud Endpoint
Retrieves all groups of a domain or a user given a userKey (paginated).	/users
Lists all projects where the admin user is an explicit member.	/projects
Retrieves a list of members for a given project	/projects/<project_id>/members/all
Retrieves a list of wiki pages within a given project	/projects/<project_id>/wikis
Retrieves a list of issues within a given project	/projects/<project_id>/issues
Retrieves a list of merge requests within a given project	/projects/<project_id>/merge_requests
Retrieves a list of comments within a given merge request	/projects/<project_id>/merge_requests/<merge_request_id>/notes
Retrieves a list of diffs within a given merge request	/projects/<project_id>/merge_requests/<merge_request_id>/diffs
Retrieves a project	/projects/<project_id>
Retrieves a commit	/projects/<project_id>/repository/commits/<commit_hash>
Retrieves the latest HEAD ref	GET /<repository-name>.git/info/refs?service=git-upload-pack
Retrieves git objects	POST /<repository-name>.git/git-upload-pack
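
Most of the REST endpoints above return lists in pages. The sketch below shows one way to walk such a list using GitLab's offset pagination (the page and per_page query parameters and the X-Next-Page response header). It assumes GitLab.com as the host; page_url and iter_pages are hypothetical helper names, and nothing here is meant to describe Glean's actual crawler code.

```python
import json
import urllib.parse
import urllib.request

# Assumption: GitLab.com (GitLab Cloud) REST API root.
GITLAB_API = "https://gitlab.com/api/v4"


def page_url(path, page, per_page=100, **params):
    """Build the URL for one page of a list endpoint such as /projects/7/issues."""
    query = urllib.parse.urlencode({**params, "page": page, "per_page": per_page})
    return f"{GITLAB_API}{path}?{query}"


def iter_pages(path, token, **params):
    """Yield every item of a paginated list endpoint (performs network calls)."""
    page = "1"
    while page:
        request = urllib.request.Request(
            page_url(path, page, **params),
            headers={"PRIVATE-TOKEN": token},
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            yield from json.load(response)
            # GitLab leaves X-Next-Page empty on the last page.
            page = response.headers.get("X-Next-Page", "")
```

For example, iter_pages("/projects", token, membership="true") would iterate over the projects the token's user is a member of, one page of up to 100 items at a time.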

Content Configuration

Inclusion (green-listing) and exclusion (red-listing) rules are not available for the GitLab connector.

Content Configuration

Note: If Inclusion (Green-Listing) options are enabled, only content matching the inclusion rules will be indexed. If Exclusion (Red-Listing) options are enabled, all content matching the exclusion rules will be removed. If both rules apply to the same content, the content will NOT be indexed, because the exclusion rule takes priority.

The rules below should be used MINIMALLY to preserve the enterprise search experience, as most end-users expect to find all content. Most customers do not apply any rules or apply red-listing rules sparingly for sensitive items.

Glean permits inclusion and exclusion at the repository level. Please contact Glean for more information and configuration.