GitLab Cloud Connector

This document covers all information related to our GitLab connector

Written by Dan Iacono
Updated over a week ago

Introduction

The GitLab connector for Glean allows Glean to fetch and index content from GitLab, ensuring that users can search and access documents for which they have authorized permissions.

  • Authentication: During setup of the data source, the GitLab admin authenticates to Glean with a personal access token and a webhook secret token.

  • Data Storage: All data is stored in the cloud project within the customer's cloud account, ensuring no data leaves the customer's environment

API Usage

  • Standard API: Glean uses GitLab’s standard REST API to ingest all data

Integration Features

  • Content Captured: Glean captures GitLab repos, commits, issues, merge requests, and pushes.

  • Permissions Enforcement: Glean respects all user access permissions, ensuring users only see search results for documents they can access. When a user clicks a search result, they are taken to the GitLab web application, which enforces the permission.

Versions Supported

There are no specific version limitations for the GitLab connector. This document covers GitLab Cloud only; on-premise GitLab Server is supported by Glean through a separate connector.

Objects Supported

The GitLab connector supports the following objects:

  • Merge request descriptions

  • Merge request conversations/comments

  • Commit messages for the main branch

  • Wikis

  • Issues

Glean will capture the following from the latest commit on the main branch:

  • Directory/file names

  • Full content of documentation files only (.md and .txt)
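The file rule above can be sketched as a small filter. This is only an illustration of the stated rule, not Glean's implementation; the function names are hypothetical, and treating extensions case-insensitively is an assumption the article does not confirm.

```python
from pathlib import PurePosixPath

# Extensions the article lists as fully indexed documentation files.
DOC_EXTENSIONS = {".md", ".txt"}


def is_documentation_file(path: str) -> bool:
    """Return True if the file's content (not just its name) would be indexed.

    Assumption: the extension match is case-insensitive.
    """
    return PurePosixPath(path).suffix.lower() in DOC_EXTENSIONS


def split_tree(paths):
    """Split repository paths into (content indexed, name only)."""
    full = [p for p in paths if is_documentation_file(p)]
    name_only = [p for p in paths if not is_documentation_file(p)]
    return full, name_only
```

For example, split_tree(["README.md", "src/main.py"]) puts README.md in the first list and src/main.py in the second.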

Authentication Mechanism

Glean requires a personal access token from a GitLab user account to authorize its API calls. This account must have access to all projects in scope for Glean to crawl. If the token is granted the api scope (recommended), Glean can create webhooks programmatically during setup. If the token is restricted to read-only access, webhooks must be created manually for every project that you want crawled.

Create a Personal Access Token (PAT)

  1. Sign into your GitLab user account.

  2. Navigate to the upper right-hand corner (user icon) and click "Preferences".

  3. Select "Access Tokens" on the left side menu.

  4. Add a personal access token.

    1. Name: Glean Token

    2. Scopes:

      1. If granting write privileges: api (recommended)

      2. If granting read-only privileges:

        • read_user

        • read_api

        • read_repository

  5. Leave Expires at empty.

  6. Copy the Personal Access Token (PAT) into the corresponding field in the Glean GitLab data source in the setup step.
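Before pasting the token into Glean, it can help to check which webhook path its scopes imply. The sketch below encodes only the scope lists from the steps above; the function name and the "automatic"/"manual" labels are hypothetical, and the scope list is assumed to come from wherever you recorded it when creating the token.

```python
# Scope that lets Glean create webhooks itself (the recommended option).
WRITE_SCOPE = "api"
# Scopes the article lists for a read-only token.
READ_ONLY_SCOPES = {"read_user", "read_api", "read_repository"}


def webhook_mode(scopes):
    """Return "automatic" if Glean can create webhooks, "manual" otherwise.

    Raises ValueError if a read-only token lacks one of the listed scopes.
    """
    granted = set(scopes)
    if WRITE_SCOPE in granted:
        return "automatic"
    missing = READ_ONLY_SCOPES - granted
    if missing:
        raise ValueError(f"token is missing scopes: {sorted(missing)}")
    return "manual"
```

A result of "manual" means the webhooks for each project have to be created by hand.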

Create Webhooks Manually (if PAT is read-only)

Log into GitLab with an account with owner privileges to manually create webhooks in a project. For each project, perform the following steps:

  1. Navigate to the project page within GitLab.

  2. On the left-side menu, navigate to Settings → Webhooks.

  3. Create a webhook with the following properties:

    • URL: (generated dynamically from Glean GitLab data source UI setup → Show setup instructions)

    • Secret token: (generated dynamically from Glean GitLab data source UI setup → Show setup instructions)

    • Trigger:

      • Push events

      • Comments

      • Issue events

      • Merge request events

      • Wiki page events

  4. Enter the same secret token in the corresponding "Webhook secret token" field in Glean.

  5. After creating all project webhooks on GitLab, click Save.
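With many projects, the manual steps above could in principle be scripted against GitLab's project hooks endpoint (POST /projects/:id/hooks). The sketch below only builds the request and does not send it. It assumes GitLab.com as the host and a separate owner-level token with the api scope that you keep on your side; the function name and the mapping of each trigger to a GitLab field are our reading of the list above, and the URL and secret token must still come from the Glean setup instructions.

```python
import json
import urllib.request

GITLAB_API = "https://gitlab.com/api/v4"  # assumption: GitLab.com (Cloud)


def build_hook_request(project_id, hook_url, secret_token, owner_token):
    """Build (but do not send) a request that creates one project webhook."""
    payload = {
        "url": hook_url,                # from Glean's setup instructions
        "token": secret_token,          # webhook secret token from Glean
        "push_events": True,            # Push events
        "note_events": True,            # Comments
        "issues_events": True,          # Issue events
        "merge_requests_events": True,  # Merge request events
        "wiki_page_events": True,       # Wiki page events
    }
    return urllib.request.Request(
        f"{GITLAB_API}/projects/{project_id}/hooks",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "PRIVATE-TOKEN": owner_token,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to urllib.request.urlopen would send it; review the payload against the Glean setup instructions first.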

User Mapping

GitLab Cloud does not disclose a user’s email address through the API unless the user has explicitly consented to the public display of their email. For Glean to accurately retrieve permissions within GitLab, it is essential for Glean to associate each user ID with the corresponding company email. Please generate a CSV file comprising two columns: GitLab user ID and email.

  • Column headers are not required.

  • Columns must be in the order (user ID, email).

  • The user ID is NOT the username. The user ID is numeric only and corresponds to the ID in the example response of the /members API.

  • Example of a correct row: 12345,user1@glean.com

To enumerate all GitLab user IDs, use the GitLab API. Company email addresses can be retrieved internally from an identity management system such as Okta or G Suite, or from any other source.
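A minimal sketch of assembling the mapping file follows. It assumes you already have a list of member records shaped like the /members API response (each with a numeric id and a username) and a username-to-email lookup exported from your identity system; both inputs and the function name are hypothetical.

```python
import csv
import io


def build_mapping_csv(members, email_by_username):
    """Return CSV text with one "user ID,email" row per known member.

    No header row is written, and columns are in (user ID, email) order.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for member in members:
        email = email_by_username.get(member["username"])
        if email is None:
            continue  # no company email known; resolve these separately
        # int() rejects a non-numeric value such as a username.
        writer.writerow([int(member["id"]), email])
    return buf.getvalue()
```

With a single member {"id": 12345, "username": "user1"} and the lookup {"user1": "user1@glean.com"}, the output is the row 12345,user1@glean.com.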

Connection instructions

Disclaimer: The instructions below are updated periodically by Glean on our customer-facing documentation. For the latest instructions, refer to the Glean Admin UI.

Note: Please complete the Personal Access Token (PAT) and Webhook setup before proceeding.

Setup in Glean

  1. Input the data source name in the Name text box and select an icon

  2. Complete any outstanding steps in Setup Instructions. Note: Please have your Personal Access Token (PAT) and Webhook secret token ready.

  3. Complete the following

    1. Input GitLab PAT in Personal access token name text box

    2. Check the box if the PAT has write privileges

    3. Input GitLab Webhook token in Webhook secret token name text box

  4. Upload a CSV of the GitLab user ID to the email address mapping

  5. Click Save

OAuth Flow for Individual Users

Individual users must authorize Glean in the UI by clicking their profile picture (bottom left corner) → Your settings → Data sources → GitLab.

Items crawled

Content Indexed

  • Merge request descriptions

  • Merge request conversations/comments

  • Commit messages for the main branch

  • Wikis

  • Issues

Glean will capture the following from the latest commit on the main branch:

  • Directory/file names

  • Full content of documentation files only (.md and .txt)

Identity

  • Users: Information about users within GitLab

  • Groups: Details about groups within GitLab at the global and repository level.

The identity crawl operates with the following configurations:

  • Incremental Identity Crawls: These are performed to capture changes since the last crawl.

  • Full Identity Crawls: These are conducted periodically to ensure all identity data is up-to-date.

Webhook Events

The Glean Application for GitLab transmits a webhook event to the customer's Glean instance as each event occurs, which lets the instance keep search results current. For instance, webhooks are triggered when a merge request, issue, or comment is added or modified, typically prompting a content crawl within ten minutes.

The webhook information is stored securely in the customer’s dedicated cloud project or account, ensuring complete data privacy and protection.
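GitLab includes the configured secret token with each webhook delivery in the X-Gitlab-Token request header, which is how a receiver can tell genuine events from forged ones. The sketch below shows how such a check is typically written. It is not Glean's implementation, the function name is hypothetical, and the headers are assumed to arrive as a plain dict keyed by the exact header name.

```python
import hmac


def is_trusted_webhook(headers, expected_secret):
    """Return True if the delivery carries the expected secret token."""
    received = headers.get("X-Gitlab-Token", "")
    # compare_digest avoids leaking the match position through timing.
    return hmac.compare_digest(
        received.encode("utf-8"), expected_secret.encode("utf-8")
    )
```

A delivery with a missing or mismatched token would be rejected before any crawl is scheduled.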

Rate Limits

  • Queries per Second (QPS): The default rate limit is set to 4 queries per second per user.

  • For more information regarding GitLab's rate limiting, see GitLab's rate limit documentation.
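To make the 4 QPS figure concrete, here is a minimal client-side throttle that spaces calls at least 1/4 second apart. It illustrates the limit only and is not how Glean enforces it; the class name is hypothetical, and the clock and sleep functions are injectable so the behavior can be checked without real waiting.

```python
import time


class Throttle:
    """Space successive calls at least 1/qps seconds apart."""

    def __init__(self, qps=4.0, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / qps
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next call is allowed, then reserve a slot."""
        now = self.clock()
        if self.next_allowed is not None and now < self.next_allowed:
            self.sleep(self.next_allowed - now)
            now = self.next_allowed
        self.next_allowed = now + self.interval
```

Calling wait() before each API request keeps a single caller at or below the configured rate.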

Update frequency

Content updates for the GitLab connector in Glean can happen quite rapidly, depending on the type of update and the configuration settings. Here are the key areas:

  • Webhooks: Events such as adds, updates, and permission changes are crawled on a best-effort basis as they are received. This means that new files, modifications to existing files, and changes in sharing permissions are detected and processed quickly.

  • People / Identity Crawls: Changes to group memberships are picked up by the identity crawl, which runs every hour. This ensures that updates to user groups and their permissions are reflected promptly.

  • Incremental Crawls: These occur every 10 minutes to provide additional reliability beyond the minute-by-minute webhooks.

  • Full Crawls: The frequency of full crawls can be configured, but they run far less often than incremental crawls, generally every 28 days.

Changes in data must be crawled, processed, and indexed before the data is reflected in the UI. Actual time may vary depending on the number of changes and corpus size. For the most up-to-date crawler refresh information, please refer to [External] Glean crawling strategy.

How the crawl works

The GitLab crawler follows the traditional crawler strategy: it uses the GitLab API and the following mechanisms to fetch and update data:

  • Identity Crawl: Adds and updates People data, including users, groups, and other information

  • Webhooks: Messages sent by the application to notify Glean of changes in real time. Glean then either initiates a crawl or picks up the change on the next crawl

  • Content Crawls: Full crawls cover the entire defined scope of the application, whereas incremental crawls capture only the changes since the previous full or incremental crawl
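The difference between full and incremental content crawls can be illustrated with GitLab's list endpoints for issues and merge requests, which accept an updated_after filter. The sketch below only builds request URLs. Glean's actual crawler logic is not documented here, so the function name, the GitLab.com host, and the page size are all assumptions.

```python
from datetime import datetime, timezone
from urllib.parse import urlencode

GITLAB_API = "https://gitlab.com/api/v4"  # assumption: GitLab.com (Cloud)


def crawl_url(project_id, resource, last_crawl=None):
    """Build a list URL; pass last_crawl to fetch only newer changes.

    last_crawl must be a timezone-aware datetime, or None for a full crawl.
    """
    if resource not in ("issues", "merge_requests"):
        raise ValueError(f"unsupported resource: {resource}")
    params = {"per_page": 100}
    if last_crawl is not None:
        stamp = last_crawl.astimezone(timezone.utc)
        params["updated_after"] = stamp.strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"{GITLAB_API}/projects/{project_id}/{resource}?{urlencode(params)}"
```

Omitting last_crawl yields the full listing, while supplying the previous crawl time narrows the result to items changed since then.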

Known Limitations in Crawl

  • A Glean representative must turn on code search functionality

Unsupported items

  • None

API endpoints

Purpose

Cloud Endpoint

Retrieves all groups of a domain or a user given a userKey (paginated).

Lists all projects where the admin user is an explicit member.

Retrieves a list of members for a given project

Retrieves a list of wiki pages within a given project

Retrieves a list of issues within a given project

Retrieves a list of merge requests within a given project

Retrieves a list of comments within a given merge request

Retrieves a list of diffs within a given merge request

Retrieves a project

Retrieves a commit

Retrieves the latest HEAD ref

Retrieves git objects

Content Configuration

Inclusion (green-listing) and exclusion (red-listing) rules are not available for the GitLab connector.

Note: If Inclusion (Green-Listing) options are enabled, only content matching the inclusion rules will be indexed. If Exclusion (Red-Listing) options are enabled, all content matching the exclusion rules will be removed. If both rules apply to the same content, the content will NOT be indexed, because the exclusion rule takes priority.

The rules below should be used MINIMALLY to preserve the enterprise search experience, as most end-users expect to find all content. Most customers do not apply any rules or apply red-listing rules sparingly for sensitive items.

Glean permits inclusion and exclusion at the repository level. Please contact Glean for more information and configuration.
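The precedence described in the note can be written as a short decision function. This is a sketch of the stated rule only; the function name and the representation of rules as sets of repository names are hypothetical, and the real configuration is handled by Glean.

```python
def is_indexed(repo, inclusions=None, exclusions=None):
    """Return True if a repository would be indexed under the given rules.

    Exclusion wins over inclusion. With no inclusion rules, everything
    that is not excluded is indexed.
    """
    if exclusions and repo in exclusions:
        return False  # exclusion takes priority, even if also included
    if inclusions:
        return repo in inclusions  # only listed repositories are indexed
    return True
```

A repository that appears in both sets is therefore not indexed, matching the note above.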
