Introduction
The GitLab connector for Glean allows Glean to fetch and index content from GitLab, ensuring that users can search and access documents for which they have authorized permissions.
Authentication: During data source setup, the GitLab admin authenticates to Glean with a personal access token and a webhook secret token.
Data Storage: All data is stored in the cloud project within the customer's cloud account, so no data leaves the customer's environment.
API Usage
Standard API: Glean uses GitLab's standard REST API to ingest all data.
Integration Features
Content Captured: Glean captures GitLab repositories, commits, issues, merge requests, and pushes.
Permissions Enforcement: Glean respects all user access permissions, so users only see search results for documents they can access. When a user clicks a search result, they are taken to the GitLab web application, which enforces the permission again.
Versions Supported
The GitLab connector has no specific version limitations. This document covers GitLab Cloud only; on-premises GitLab Server is supported by Glean through a separate connector.
Objects Supported
The GitLab connector supports the following objects:
Merge request descriptions
Merge request conversations/comments
Commit messages for the main branch
Wikis
Issues
Glean will capture the following from the latest commit on the main branch:
Directory/file names
Full content of documentation files only (.md and .txt)
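As a rough illustration of the rule above, the sketch below decides what would be captured for a single file path. It is not Glean's implementation: the function name captured_fields and the case-insensitive extension match are assumptions made for this example.

```python
from pathlib import PurePosixPath

# Extensions the text lists as documentation files.
DOC_EXTENSIONS = {".md", ".txt"}


def captured_fields(path: str) -> dict:
    """Describe what would be captured for one file (illustrative only).

    The file name is always captured; full content only for doc files.
    Matching the extension case-insensitively is an assumption.
    """
    is_doc = PurePosixPath(path).suffix.lower() in DOC_EXTENSIONS
    return {"path": path, "index_full_content": is_doc}
```

For example, docs/README.md would have its full content indexed, while src/main.py would contribute only its path.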
Authentication Mechanism
Glean requires a personal access token (PAT) from a GitLab user account to authorize its API calls. This account must have access to all projects that Glean should crawl. If the token is granted the api scope (recommended), Glean can create webhooks programmatically during setup. If the token is restricted to read-only access, webhooks must be created manually for every project that you want crawled.
Create a Personal Access Token (PAT)
Sign into your GitLab user account.
Click the user icon in the upper right-hand corner and select "Preferences".
Select "Access Tokens" on the left side menu.
Add a personal access token.
Name: Glean Token
Scopes:
If granting write privileges: api (recommended)
If granting read-only privileges:
read_user
read_api
read_repository
Leave Expires at empty.
Copy the Personal Access Token (PAT) into the corresponding field in the Glean GitLab data source in the setup step.
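Before pasting the token into Glean, it can help to confirm that it carries the scopes listed above. The sketch below compares a set of granted scopes against those requirements. How the granted scopes are obtained is left out, and the function name missing_scopes is hypothetical.

```python
from typing import Iterable, Set

# Scopes named in the steps above.
WRITE_SCOPES = {"api"}
READ_ONLY_SCOPES = {"read_user", "read_api", "read_repository"}


def missing_scopes(granted: Iterable[str], write_access: bool) -> Set[str]:
    """Return the scopes still missing for the chosen setup mode.

    write_access=True checks for the recommended api scope;
    write_access=False checks for the three read-only scopes.
    """
    required = WRITE_SCOPES if write_access else READ_ONLY_SCOPES
    return required - set(granted)
```

An empty result means the token matches the scope list in this document for the chosen mode.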
Create Webhooks Manually (if PAT is read-only)
Log into GitLab with an account with owner privileges to manually create webhooks in a project. For each project, perform the following steps:
Navigate to the project page within GitLab.
On the left-side menu, navigate to Settings → Webhooks
Create a webhook with the following properties:
URL: (generated dynamically from Glean GitLab data source UI setup → Show setup instructions)
Secret token: (generated dynamically from Glean GitLab data source UI setup → Show setup instructions)
Trigger:
Push events
Comments
Issue events
Merge request events
Wiki page events
Enter the same secret token in the corresponding "Webhook secret token" field in Glean.
After creating all project webhooks on GitLab, click Save.
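When many projects need webhooks, the same settings could be scripted against GitLab's project hooks endpoint (POST /projects/:id/hooks) using a separate token that is allowed to manage hooks. The sketch below only builds the request body and sends nothing. The function name is hypothetical, and the URL and secret token must be the values shown in the Glean setup instructions.

```python
def build_webhook_payload(url: str, secret_token: str) -> dict:
    """Build the body for creating one project webhook (illustrative only).

    The event flags mirror the triggers listed in the steps above.
    """
    return {
        "url": url,                      # from the Glean setup instructions
        "token": secret_token,           # from the Glean setup instructions
        "push_events": True,             # Push events
        "note_events": True,             # Comments
        "issues_events": True,           # Issue events
        "merge_requests_events": True,   # Merge request events
        "wiki_page_events": True,        # Wiki page events
    }
```

The same body would be reused for every project, with only the project ID in the request path changing.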
User Mapping
GitLab Cloud does not disclose a user’s email address through the API unless the user has explicitly consented to the public display of their email. For Glean to accurately retrieve permissions within GitLab, it is essential for Glean to associate each user ID with the corresponding company email. Please generate a CSV file comprising two columns: GitLab user ID and email.
Column headers are not required.
Columns must be in the order (user ID, email).
The user ID is NOT the username. The user ID is numeric and corresponds to the ID in the example response of the /members API.
Example of a correct row: 12345,user1@glean.com
To list all GitLab user IDs, use the GitLab API. Company email addresses can be retrieved internally from an identity management system such as Okta or G Suite, or from any other source.
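A quick local check of the CSV before upload can catch the most common mistakes, such as usernames in place of numeric IDs. The sketch below is a minimal validator under the rules listed above. The function name validate_mapping and the deliberately loose email check are assumptions, and a header row (which is not required) would be reported as a non-numeric ID.

```python
import csv
import io
from typing import List


def validate_mapping(csv_text: str) -> List[str]:
    """Return a list of problems; an empty list means every row looks valid."""
    problems = []
    for line_no, row in enumerate(csv.reader(io.StringIO(csv_text)), start=1):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            problems.append(f"line {line_no}: expected 2 columns, found {len(row)}")
            continue
        user_id, email = (cell.strip() for cell in row)
        # The user ID must be digits only (it is not the username).
        if not (user_id.isascii() and user_id.isdigit()):
            problems.append(f"line {line_no}: user ID {user_id!r} is not numeric")
        # Loose sanity check only; not a full email validation.
        if email.count("@") != 1 or email.startswith("@") or email.endswith("@"):
            problems.append(f"line {line_no}: {email!r} does not look like an email")
    return problems
```

Running it on the example row 12345,user1@glean.com reports no problems.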
Connection instructions
Disclaimer: The instructions below are updated periodically by Glean on our customer-facing documentation. For the latest instructions, refer to the Glean Admin UI.
Note: Please complete the Personal Access Token (PAT) and webhook setup before proceeding.
Setup in Glean
Enter the data source name in the Name text box and select an icon.
Complete any outstanding steps in Setup Instructions. Note: Please have your Personal Access Token (PAT) and webhook secret token ready.
Complete the following:
Enter the GitLab PAT in the "Personal access token" text box.
Check the box if the PAT has write privileges.
Enter the GitLab webhook token in the "Webhook secret token" text box.
Upload the CSV that maps GitLab user IDs to email addresses.
Click Save.
OAuth Flow for Individual Users
Individual users must authorize Glean in the UI by clicking their profile picture (bottom left corner) → Your settings → Data sources → GitLab.
Items crawled
Content Indexed
Merge request descriptions
Merge request conversations/comments
Commit messages for the main branch
Wikis
Issues
Glean will capture the following from the latest commit on the main branch:
Directory/file names
Full content of documentation files only (.md and .txt)
Identity
Users: Information about users within GitLab.
Groups: Details about groups within GitLab at the global and repository level.
The identity crawl operates with the following configurations:
Incremental Identity Crawls: These are performed to capture changes since the last crawl.
Full Identity Crawls: These are conducted periodically to ensure all identity data is up-to-date.
Webhook Events
The Glean application for GitLab sends a webhook event to the customer's Glean instance each time a relevant event occurs, giving the instance the information it needs to keep search results current and relevant. For instance, webhooks are triggered when a merge request, issue, or comment is added or modified, typically prompting a content crawl within ten minutes.
The webhook information is stored securely in the customer’s dedicated cloud project or account, ensuring complete data privacy and protection.
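GitLab includes the configured secret token with each webhook delivery in the X-Gitlab-Token request header, which lets the receiver reject requests that did not come from the configured webhook. The sketch below shows how a receiver could perform that check in general. It is not Glean's implementation, and the function name is hypothetical.

```python
import hmac
from typing import Mapping


def is_trusted_webhook(headers: Mapping[str, str], expected_token: str) -> bool:
    """Check the secret token sent with a webhook delivery (illustrative only)."""
    # Header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    received = normalized.get("x-gitlab-token", "")
    # Constant-time comparison avoids leaking the token through timing.
    return hmac.compare_digest(received.encode(), expected_token.encode())
```

A delivery with a missing or different token is rejected, which is why the token entered in Glean must match the one configured on each project webhook.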
Rate Limits
Queries per Second (QPS): The default rate limit is set to 4 queries per second per user.
For more information regarding GitLab's rate limiting, see GitLab's rate limit documentation.
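To make the 4 QPS figure concrete, the sketch below spaces requests at least 0.25 seconds apart. It is a generic client-side throttle written for illustration, not a description of how Glean enforces the limit. The class name Throttle and its injectable clock and sleep parameters are assumptions.

```python
import time


class Throttle:
    """Space calls so that at most `qps` requests start per second."""

    def __init__(self, qps: float = 4.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / qps  # 0.25 s at the default of 4 QPS
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None      # no request has been made yet

    def wait(self) -> float:
        """Block until the next request is allowed; return seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            delay = self._next_allowed - now
            self._sleep(delay)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval
        return delay
```

A caller would invoke wait() immediately before each API request. The clock and sleep arguments exist only so the behavior can be tested without real delays.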
Update frequency
Content updates for the GitLab connector in Glean can happen quite rapidly, depending on the type of update and the configuration settings. Here are the key areas:
Webhooks: Events such as additions, updates, and permission changes are crawled on a best-effort basis as they are received. This means that new content, modifications to existing content, and permission changes are detected and processed quickly.
People / Identity Crawls: Changes to group memberships are picked up by the identity crawl, which runs every hour. This ensures that updates to user groups and their permissions are reflected promptly.
Incremental Crawls: These occur every 10 minutes to provide additional reliability beyond the webhooks.
Full Crawls: The frequency of full crawls can be configured, but they are far less frequent than incremental crawls, running every 28 days by default.
Changes in data must be crawled, processed, and indexed before the data is reflected in the UI. Actual time may vary depending on the number of changes and the corpus size. For the most up-to-date crawler refresh information, please refer to [External] Glean crawling strategy.
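The intervals above can be summarized as a small schedule. The sketch below reports which crawl types are due given when each last ran. It only illustrates the cadence described in this section, using the 10-minute, hourly, and 28-day values stated there. The names INTERVALS and due_crawls are hypothetical, and webhooks are omitted because they are event-driven.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Cadence taken from the list above.
INTERVALS = {
    "incremental": timedelta(minutes=10),
    "identity": timedelta(hours=1),
    "full": timedelta(days=28),
}


def due_crawls(last_run: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the crawl types whose interval has elapsed (or that never ran)."""
    return [
        name
        for name, interval in INTERVALS.items()
        if name not in last_run or now - last_run[name] >= interval
    ]
```

For example, if the incremental crawl ran five minutes ago but the identity crawl ran two hours ago, only the identity crawl is reported as due.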
How the crawl works
The GitLab crawler follows the traditional crawler strategy. It uses the GitLab API and the following mechanisms to fetch and update data:
Identity Crawl: Adds and updates People data, including users, groups, and other information.
Webhooks: Messages sent by the application to notify Glean of changes in real time. Glean then either initiates a crawl or picks up the change on the next crawl.
Content Crawls: Full crawls cover the entire defined scope of the application, whereas incremental crawls capture only the changes since the previous full or incremental crawl.
Known Limitations in Crawl
A Glean representative must turn on code search functionality
Unsupported items
None
API endpoints
Purpose | Cloud Endpoint |
Retrieves all groups of a domain or a user given a userKey (paginated). | |
Lists all projects where the admin user is an explicit member. | |
Retrieves a list of members for a given project | |
Retrieves a list of wiki pages within a given project | |
Retrieves a list of issues within a given project | |
Retrieves a list of merge requests within a given project | |
Retrieves a list of comments within a given merge request | |
Retrieves a list of diffs within a given merge request | |
Retrieves a project | |
Retrieves a commit | |
Retrieves the latest HEAD ref | |
Retrieves git objects | |
Content Configuration
Inclusion (green-listing) and exclusion (red-listing) rules are not available for the GitLab connector.
Note: If inclusion (green-listing) options are enabled, only the included content is indexed. If exclusion (red-listing) options are enabled, all excluded content is removed. If both rules apply to the same content, the content is NOT indexed because the exclusion rule takes priority.
The rules below should be used MINIMALLY to preserve the enterprise search experience, as most end-users expect to find all content. Most customers do not apply any rules or apply red-listing rules sparingly for sensitive items.
Glean permits inclusion and exclusion at the repository level. Please contact Glean for more information and configuration.
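The precedence described in the note above can be expressed as a short decision function. This is a sketch of the stated rule only, not Glean's configuration format: the function name should_index and the plain-string repository identifiers are assumptions.

```python
from typing import AbstractSet


def should_index(repo: str, include: AbstractSet[str], exclude: AbstractSet[str]) -> bool:
    """Apply inclusion and exclusion rules to one repository (illustrative only)."""
    # Exclusion takes priority, even if the repository is also included.
    if repo in exclude:
        return False
    # When any inclusion rule exists, only included repositories are indexed.
    if include:
        return repo in include
    # With no rules at all, everything is indexed.
    return True
```

With no rules configured, every repository is indexed, which matches the recommendation to apply rules minimally.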