Introduction
The Box connector for Glean allows Glean to fetch and index content from Box, ensuring that users can search and access documents for which they have authorized permissions.
Authentication: Glean requires the Box admin to authenticate Glean via OAuth2 during the setup of the Glean crawler
Data Storage: All data is stored in the cloud project within the customer's cloud account, ensuring no data leaves the customer's environment
API Usage
Standard API: Glean uses Box’s standard API to ingest all data.
Integration Features
Content Captured: Glean captures Box folders, files, Box Notes, and comments on files.
Permissions Enforcement: Glean respects all user access permissions, ensuring users only see search results for documents they can access. When a user clicks on a search result, they are taken to the Box web application, which enforces the permission.
Versions Supported
There are no specific version limitations of the Box connector
Objects Supported
The Box connector for Glean supports the following objects:
Folders
Files (including slides, word documents, etc.)
Box Notes
Comments on the files
Authentication Mechanism
During Glean configuration, an OAuth2 authorization flow is completed with Box, through which Glean is granted an access token and a refresh token. The access token, which has a limited validity period, is used to make API calls to the customer’s Box account. Glean periodically uses the refresh token obtained during the initial authorization to acquire a new access token.
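The refresh step described above can be sketched as follows. This is a minimal illustration, not Glean's implementation: it builds (without sending) a standard OAuth2 refresh request against Box's token endpoint. The function names and the 300-second safety margin are assumptions made for the example.

```python
import urllib.parse
import urllib.request

# Box's OAuth 2.0 token endpoint.
BOX_TOKEN_URL = "https://api.box.com/oauth2/token"


def build_refresh_request(client_id: str, client_secret: str,
                          refresh_token: str) -> urllib.request.Request:
    """Build (but do not send) the request that trades a refresh token
    for a new access token. Function name is hypothetical."""
    form = {
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    }
    body = urllib.parse.urlencode(form).encode("ascii")
    return urllib.request.Request(
        BOX_TOKEN_URL,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def needs_refresh(issued_at: float, expires_in: float, now: float,
                  margin: float = 300.0) -> bool:
    """Return True once the access token is within `margin` seconds of
    expiring. The margin value is an assumption for illustration."""
    return now >= issued_at + expires_in - margin
```

Sending the request with `urllib.request.urlopen` would return a JSON body containing the new tokens; the sketch stops short of the network call so that it can be read and run without credentials.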
Connector credentials requirements
The user setting up this data source must be the Box Admin.
Note: A co-admin account does not work. Co-admins cannot access other co-admins’ items, so Glean would not be able to crawl all of the expected information.
Connection instructions
Optional: Recommended Notification Suppression
Glean uses the download files endpoint as part of the crawl logic. If the Box instance is set up to send email notifications for suspicious download behavior, Glean recommends contacting Box support to suppress notifications for the client ID used for the integration.
Failure to suppress notifications may result in download notifications across the entire Box organization.
Setup in Glean
Input the data source name in the Name text box and select an icon
Complete any outstanding setup in Show setup instructions
Click Authorize and follow the instructions
Authentication scope requirements
Scope | Purpose |
Read all files/folders in Box | List all files from user drives |
Read and write all files and folders stored in Box | Required to download file content into Glean (despite saying write). |
Manage users | List users and associated group memberships |
Manage groups | List all groups |
Manage enterprise properties | Crawl recent enterprise logs activity to ingest newly created/modified data |
Admin can make calls on behalf of Users | Use the As-User header to distribute rate limits across the different owners of files. |
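As a rough illustration of the last scope, the sketch below shows how a request made on behalf of a file owner could carry Box's As-User header. The helper names are hypothetical and the example only constructs headers and a URL; it does not claim to mirror Glean's crawler code.

```python
from typing import Dict, Optional
from urllib.parse import urlencode

BOX_API_BASE = "https://api.box.com/2.0"


def build_headers(access_token: str,
                  as_user_id: Optional[str] = None) -> Dict[str, str]:
    """Return request headers; add As-User when acting for a given user."""
    headers = {"Authorization": f"Bearer {access_token}"}
    if as_user_id is not None:
        # Box applies the request to this user's view of the content.
        headers["As-User"] = as_user_id
    return headers


def folder_items_url(folder_id: str, limit: int = 1000,
                     offset: int = 0) -> str:
    """URL for Box's "List items in folder" endpoint (folder "0" is the root)."""
    query = urlencode({"limit": limit, "offset": offset})
    return f"{BOX_API_BASE}/folders/{folder_id}/items?{query}"
```

For example, `build_headers(token, as_user_id="12345")` paired with `folder_items_url("0")` would list the root folder as seen by user 12345 (a made-up ID).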
Items crawled
Content
For Box, Glean indexes the following content and associated permissions:
Folders
Files (e.g. slides, word documents, etc.)
Box Notes
Comments on the files
Identity
Users: Information about users within Box
Groups: Details about groups within Box
Activity
Adds: New files or folders added to Box.
Updates: Modifications made to existing files or folders.
Permissions Changes: Changes in file or folder sharing permissions.
Deletions: Files or folders that have been deleted.
View Activity: Events indicating when a file or folder has been viewed via Glean.
The activity crawl operates with the following configurations:
Incremental Activity Crawls: These are performed every minute to capture recent changes.
Full Activity Crawls: These are conducted periodically to ensure all activity data is up-to-date.
Rate Limits
Glean is restricted to a maximum of 16 QPS per individual user. Glean distributes all users across 10 distinct queues for an initial maximum of 160 QPS.
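The arithmetic behind the 160 QPS figure, together with one possible way of spreading users across queues, can be sketched as below. The hash-based assignment is purely an assumption for illustration; the source does not say how Glean maps users to queues.

```python
import zlib

PER_USER_QPS = 16   # per-user limit stated above
NUM_QUEUES = 10     # number of queues stated above


def max_aggregate_qps(per_user_qps: int = PER_USER_QPS,
                      num_queues: int = NUM_QUEUES) -> int:
    """Upper bound when every queue runs at the per-user limit."""
    return per_user_qps * num_queues


def queue_for_user(user_id: str, num_queues: int = NUM_QUEUES) -> int:
    """Hypothetical assignment: a stable hash of the user ID modulo the
    queue count, so the same user always lands in the same queue."""
    return zlib.crc32(user_id.encode("utf-8")) % num_queues
```

Here `max_aggregate_qps()` evaluates to 16 × 10 = 160, matching the stated initial maximum.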
Update frequency
Content updates for the Box connector in Glean can happen quite rapidly, depending on the type of update and the configuration settings. Here are the key areas:
Activity Reports: Adds, updates, and permissions changes are crawled every minute. This means that any new files, modifications to existing files, or changes in sharing permissions are detected and processed quickly.
Identity Crawls for User Group Memberships: Modifications to group memberships are detected by the identity crawl, which operates hourly. This mechanism ensures that updates concerning user groups and their corresponding permissions are promptly reflected.
Incremental Crawls: These occur every 10 minutes to provide additional reliability beyond the minute-by-minute activity reports.
Full Crawls: The frequency of full crawls is configurable. They run far less often than incremental crawls, typically every 28 days.
Changes in data must be crawled, processed, and indexed before the data is reflected in the UI. Actual time may vary depending on the number of changes and corpus size. For the most up-to-date crawler refresh information, please refer to [External] Glean crawling strategy
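The cadences listed in this section can be summarized in a small table of intervals. The sketch below is only a reading aid: the dictionary keys and the `crawls_due` helper are hypothetical, and a real scheduler would not necessarily align crawls to a shared clock.

```python
from typing import List

# Intervals in minutes, taken from the cadences described above.
CRAWL_INTERVALS_MINUTES = {
    "activity": 1,              # adds, updates, permission changes
    "incremental": 10,          # incremental content crawl
    "identity": 60,             # group membership changes
    "full": 28 * 24 * 60,       # full crawl, 28 days
}


def crawls_due(minute: int) -> List[str]:
    """Return the crawl types whose interval evenly divides `minute`,
    counting minutes from an arbitrary shared start time."""
    return [name for name, interval in CRAWL_INTERVALS_MINUTES.items()
            if minute % interval == 0]
```

Under these assumptions, minute 7 triggers only the activity crawl, while minute 60 triggers the activity, incremental, and identity crawls together.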
How the crawl works
The crawler follows the traditional crawl strategy, using the Box API in the following ways to fetch and update data:
Identity Crawl: Adds and updates People data, including users, groups, and other information
Activity Crawl: Adds, updates, and permissions changes to content
Webhooks: The system uses the API to identify new, modified, and deleted documents
Content Crawls: Full crawls cover the entire defined scope of the application, whereas incremental crawls capture only the changes since the previous full or incremental crawl.
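The difference between full and incremental content crawls can be illustrated with a toy model. Everything here is hypothetical (the `Doc` type, the in-memory corpus, the use of a modification timestamp as the change signal); it shows the idea only, not how Glean detects changes.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Doc:
    doc_id: str
    modified_at: int  # epoch seconds of the last change


def full_crawl(corpus: Iterable[Doc]) -> List[Doc]:
    """A full crawl visits every document in scope."""
    return list(corpus)


def incremental_crawl(corpus: Iterable[Doc], last_crawl_at: int) -> List[Doc]:
    """An incremental crawl visits only documents changed since the
    previous (full or incremental) crawl."""
    return [d for d in corpus if d.modified_at > last_crawl_at]
```

With a corpus of three documents modified at times 100, 200, and 300, a full crawl returns all three, whereas an incremental crawl with `last_crawl_at=200` returns only the last one.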
Known Limitations in Crawl
Box enforces a per-user limit on the API requests that Glean uses for crawling. Glean runs into this limit when a customer has a large number of documents owned by a single service account, which can happen after a large migration from on-premises storage to the cloud. Box itself recommends using a single service account.
The user setting up the connector must be a Box Admin. Co-admins do not have the necessary access permissions: they cannot access other co-admins’ items, which leads to incomplete crawls.
Glean does not currently index:
Box Web Links
Custom metadata set on folders/files
Favorites Collections
API endpoints
Glean crawls and indexes content through the Box API endpoints listed below. The Glean application is available through the Box App Center.
Authentication Endpoints
Endpoint | Use Case | Documentation |
Refresh access token | Refresh an Access Token using its client ID, secret, and refresh token. |
Identity Endpoints
Endpoint | Use Case | Documentation |
List enterprise users | Determine which users (and associated content) need to be indexed. | |
List groups for enterprise | Fetch all groups within a tenant (for permissions). | |
List Enterprise Users | Determine which users are members of which group (for permissions). |
Content Endpoints
Endpoint | Use Case | Documentation |
List items in folder | List all items and content within a folder for indexing. | |
Get file information | Retrieve metadata for each specific item for indexing. | |
List file collaborations | Retrieve a list of all users with access to an item (for permissions). | |
Activity Endpoints
Endpoint | Use Case | Documentation |
List user and enterprise events | Fetch activity data for each user for ranking signals (12 month limit). |
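As an illustration of the activity endpoint, the sketch below builds a query against Box's events endpoint for enterprise events within a lookback window. Treating the 12 month limit as 365 days, choosing the `admin_logs` stream, and the helper names are all assumptions made for this example; the source does not state which parameters Glean sends.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict
from urllib.parse import urlencode

BOX_EVENTS_URL = "https://api.box.com/2.0/events"


def events_query(now: datetime, lookback_days: int = 365,
                 limit: int = 500) -> Dict[str, str]:
    """Query parameters for enterprise events newer than the lookback window.
    365 days stands in for the 12 month limit (an assumption)."""
    created_after = (now - timedelta(days=lookback_days)).isoformat()
    return {
        "stream_type": "admin_logs",   # enterprise-wide event stream
        "created_after": created_after,
        "limit": str(limit),
    }


def enterprise_events_url(now: datetime) -> str:
    """Full request URL for the query above."""
    return f"{BOX_EVENTS_URL}?{urlencode(events_query(now))}"
```

Calling `enterprise_events_url(datetime.now(timezone.utc))` yields a URL that could be requested with an admin's access token; the sketch itself makes no network call.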
Content Configuration
Note: If Inclusion (Green-Listing) options are enabled, only content from the Inclusion category will be indexed. If Exclusion (Red-Listing) options are enabled, all content in the exclusion category will be removed. If both rules are applied to the same content, then the content will NOT be indexed, as exclusion rules take priority.
The rules below should be used MINIMALLY to preserve the enterprise search experience, as most end-users expect to find all content. Most customers do not apply any rules or apply exclusion rules sparingly for sensitive folders.
Exclusion (Red-Listing) Options
Glean provides several options for excluding content from the data crawl, which excludes data from search and chat results.
Users: Exclude content belonging to specific users from being crawled. How to find the User ID
Folders: Exclude content belonging to specific folders from being crawled.
Inclusion (Green-Listing) Options
Glean provides several options for including content in the data crawl, which limits search and chat results to the included content.
Files: Include specific file content
Folders: Include only content belonging to specific folders in the crawl.
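The precedence between inclusion and exclusion rules described in this section can be sketched as a small decision function. The `Rules` and `Item` types are hypothetical, and matching a folder by direct parent only is a simplification; nested folders would need an ancestry check.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Rules:
    include_files: Set[str] = field(default_factory=set)
    include_folders: Set[str] = field(default_factory=set)
    exclude_users: Set[str] = field(default_factory=set)
    exclude_folders: Set[str] = field(default_factory=set)


@dataclass(frozen=True)
class Item:
    file_id: str
    folder_id: str   # direct parent folder only (simplification)
    owner_id: str


def should_index(item: Item, rules: Rules) -> bool:
    """Exclusion wins; if any inclusion rule exists, only matches are indexed."""
    if item.owner_id in rules.exclude_users:
        return False
    if item.folder_id in rules.exclude_folders:
        return False
    if rules.include_files or rules.include_folders:
        return (item.file_id in rules.include_files
                or item.folder_id in rules.include_folders)
    return True  # no rules apply: index everything
```

For instance, a file in a folder that is both included and excluded is not indexed, because the exclusion check runs first.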