Box Connector

This document covers all information related to the Box connector.

Written by Dan Iacono
Updated over 2 weeks ago

Introduction

The Box connector for Glean allows Glean to fetch and index content from Box, ensuring that users can search and access documents for which they have authorized permissions.

  • Authentication: Glean requires the Box admin to authenticate Glean via OAuth2 during the setup of the Glean crawler

  • Data Storage: All data is stored in the cloud project within the customer's cloud account, ensuring no data leaves the customer's environment

API Usage

  • Standard API: Glean uses Box's standard API to ingest all data.

Integration Features

  • Content Captured: Glean captures Box folders, files, Box Notes, and comments on files.

  • Permissions Enforcement: Glean respects all user access permissions, ensuring users only see search results for documents they can access. When a user clicks on a search result, they are taken to the Box web application, which enforces the permission.

Versions Supported

There are no specific version limitations for the Box connector.

Objects Supported

The Box connector for Glean supports the following objects:

  1. Folders

  2. Files (including slides, word documents, etc.)

  3. Box Notes

  4. Comments on the files

Authentication Mechanism

During Glean configuration, an OAuth2 authorization flow is completed with Box, during which Glean is granted both an access token and a refresh token. The access token, which has a limited validity period, is used to make API calls to the customer's Box account. Glean periodically uses the refresh token obtained from the initial authorization to acquire a new access token.
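The refresh step described above can be sketched as follows. This is a minimal illustration, not Glean's implementation: it only builds the form-encoded request for Box's OAuth 2.0 token endpoint and decides when a refresh is due. The function names and the 300-second safety margin are assumptions made for this example.

```python
import time
from typing import Optional, Tuple
from urllib.parse import urlencode

# Box's OAuth 2.0 token endpoint.
TOKEN_URL = "https://api.box.com/oauth2/token"


def build_refresh_request(client_id: str, client_secret: str,
                          refresh_token: str) -> Tuple[str, str]:
    """Return the URL and form-encoded body for a refresh-token grant."""
    body = urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    })
    return TOKEN_URL, body


def needs_refresh(issued_at: float, expires_in: float,
                  now: Optional[float] = None, margin: float = 300.0) -> bool:
    """True once the access token is within `margin` seconds of expiring.

    `margin` is a hypothetical safety buffer, not a documented Glean setting.
    """
    now = time.time() if now is None else now
    return now >= issued_at + expires_in - margin
```

The body would be sent as an HTTP POST with the `application/x-www-form-urlencoded` content type; the JSON response carries the new access token and its lifetime in seconds.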

Connector credentials requirements

  • The user setting up this data source must be the Box Admin.

  • Note: A Co-admin account does not work. Co-admins cannot access other co-admins' items, so Glean would not be able to crawl all of the expected information.

Connection instructions

Optional: Recommended Notification Suppression

  • Glean uses the download file endpoint as part of the crawl logic. If the Box instance is set up to send email notifications for suspicious download behavior, Glean recommends contacting Box support to suppress notifications for the client ID used for the integration.

  • Failure to suppress notifications may result in download notifications across the entire Box organization.

Setup in Glean

  1. Input the data source name in the Name text box and select an icon

  2. Complete any outstanding setup in Show setup instructions

  3. Click Authorize and follow the instructions

Authentication scope requirements

Each scope is listed with its purpose:

  • Read all files/folders in Box: List all files from user drives.

  • Read and write all files and folders stored in Box: Required to download file content into Glean (despite saying write).

  • Manage users: List users and associated group memberships.

  • Manage groups: List all groups.

  • Manage enterprise properties: Crawl recent enterprise logs activity to ingest newly created/modified data.

  • Admin can make calls on behalf of Users: Use the As-User header to distribute rate limits between different owners of files.
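To illustrate the last scope, the sketch below builds request headers that carry the As-User header alongside the bearer token. The function name is hypothetical, and how Glean chooses which user to act as is not described in this document.

```python
from typing import Dict, Optional


def build_headers(access_token: str, as_user_id: Optional[str] = None) -> Dict[str, str]:
    """Build Box API request headers, optionally acting on behalf of a user."""
    headers = {"Authorization": f"Bearer {access_token}"}
    if as_user_id is not None:
        # The request is then made on behalf of this Box user ID, so it
        # counts against that user's rate limit rather than the admin's.
        headers["As-User"] = as_user_id
    return headers
```

For example, `build_headers("TOKEN", "12345")` adds `As-User: 12345` to the request, while omitting the second argument sends the call as the authorizing admin.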

Items crawled

Content

For Box, Glean indexes the following content and associated permissions:

  • Folders

  • Files (e.g. slides, word documents, etc.)

  • Box Notes

  • Comments on the files

Identity

  • Users: Information about users within Box

  • Groups: Details about groups within Box

Activity

  • Adds: New files or folders added to Box.

  • Updates: Modifications made to existing files or folders.

  • Permissions Changes: Changes in file or folder sharing permissions.

  • Deletions: Files or folders that have been deleted.

  • View Activity: Events indicating when a file or folder has been viewed via Glean.

The activity crawl operates with the following configurations:

  • Incremental Activity Crawls: These are performed every minute to capture recent changes.

  • Full Activity Crawls: These are conducted periodically to ensure all activity data is up-to-date.

Rate Limits

Glean is restricted to a maximum of 16 QPS per individual user. Glean distributes all users across 10 distinct queues for an initial maximum of 160 QPS.
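The arithmetic above (16 QPS per user across 10 queues) can be sketched as follows. The document does not say how users are assigned to queues, so the stable-hash assignment here is an assumption, as are the function names.

```python
import zlib

PER_USER_QPS = 16   # per-user limit stated above
NUM_QUEUES = 10     # number of queues stated above


def queue_for_user(user_id: str) -> int:
    """Assign a user to a queue with a stable hash (hypothetical scheme)."""
    # zlib.crc32 is used instead of hash() because hash() of a string
    # changes between Python processes.
    return zlib.crc32(user_id.encode("utf-8")) % NUM_QUEUES


def max_aggregate_qps(active_queues: int = NUM_QUEUES) -> int:
    """Upper bound on total QPS when each active queue serves one user at a time."""
    return PER_USER_QPS * active_queues
```

With all 10 queues active, `max_aggregate_qps()` gives the 160 QPS ceiling; with a single active queue the ceiling drops to 16 QPS.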

Update frequency

Content updates for the Box connector in Glean can happen quite rapidly, depending on the type of update and the configuration settings. Here are the key areas:

  • Activity Reports: Adds, updates, and permissions changes are crawled every minute. This means that any new files, modifications to existing files, or changes in sharing permissions are detected and processed quickly.

  • Identity Crawls for User Group Memberships: Modifications to group memberships are detected by the identity crawl, which operates hourly. This mechanism ensures that updates concerning user groups and their corresponding permissions are promptly reflected.

  • Incremental Crawls: These occur every 10 minutes to provide additional reliability beyond the minute-by-minute activity reports.

  • Full Crawls: The frequency of full crawls can be configured, but they run far less often than incremental crawls, generally every 28 days.

Changes in data must be crawled, processed, and indexed before the data is reflected in the UI. Actual time may vary depending on the number of changes and corpus size. For the most up-to-date crawler refresh information, please refer to [External] Glean crawling strategy
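As a rough illustration of the cadence above, the sketch below records the stated intervals and reports which crawl types are due at a given time. The dictionary keys and the function are hypothetical, and the 28-day full-crawl value is only the typical setting mentioned above.

```python
from datetime import datetime, timedelta
from typing import Dict, List

# Intervals taken from the list above; the key names are made up for this sketch.
CRAWL_INTERVALS = {
    "activity": timedelta(minutes=1),
    "identity": timedelta(hours=1),
    "incremental": timedelta(minutes=10),
    "full": timedelta(days=28),
}


def due_crawls(last_run: Dict[str, datetime], now: datetime) -> List[str]:
    """Return the crawl types whose interval has elapsed since their last run."""
    return sorted(
        name for name, interval in CRAWL_INTERVALS.items()
        if now - last_run[name] >= interval
    )
```

If every crawl last ran 15 minutes ago, only the activity and incremental crawls are due; the identity and full crawls are not.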

How the crawl works

The crawler follows the traditional crawl strategy, using the Box API in the following ways to fetch and update data:

  • Identity Crawl: Adds and updates people data, including users, groups, and other information

  • Activity Crawl: Adds, updates, and permissions changes to content

  • Webhooks: The system uses the API to identify new, modified, and deleted documents

  • Content Crawls: Full crawls cover the entire defined scope of the application, whereas incremental crawls capture only the changes since the previous full or incremental crawl.
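The difference between full and incremental content crawls can be sketched as a simple selection step. This is an illustration only: the `modified_at` field and the function name are assumptions, and the real crawler identifies changes through the Box API rather than an in-memory list.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Optional


def select_for_crawl(items: Iterable[Dict], last_crawl: Optional[datetime]) -> List[Dict]:
    """Pick the items a crawl should process.

    With no previous crawl, this behaves like a full crawl and returns
    everything in scope. Otherwise it behaves like an incremental crawl
    and returns only items modified since the previous crawl.
    """
    if last_crawl is None:
        return list(items)
    return [item for item in items if item["modified_at"] > last_crawl]
```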

Known Limitations in Crawl

  • Box has a per-user limit on the API requests that Glean uses for crawling. Glean runs into issues with this limit when a large number of documents are owned by a single service account, which can happen after a large migration from on-premises to cloud. Box itself recommends using a single service account.

  • The user setting up the connector must be a Box Admin. Co-admins do not have the necessary access permissions, which means they cannot access other co-admins' items, leading to incomplete crawls.
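A back-of-the-envelope estimate shows why single-owner content is slow to crawl. The sketch assumes the figures from the Rate Limits section (16 QPS per user, 10 queues) and one API request per unit of work; the function name and the simple throughput model are assumptions for illustration.

```python
def estimated_crawl_hours(requests: int, owners: int,
                          per_user_qps: int = 16, queues: int = 10) -> float:
    """Estimate crawl time when requests are spread evenly over `owners` users."""
    # At most `queues` users can be worked on in parallel, and content owned
    # by a single account cannot exceed that one account's per-user limit.
    parallel_users = min(owners, queues)
    return requests / (per_user_qps * parallel_users) / 3600
```

Under these assumptions, 1,152,000 requests against content owned by one service account take about 20 hours, compared with about 2 hours when the same content is spread across 10 or more owners.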

Glean does not currently index:

  • Box Web Links

  • Custom metadata set on folders/files

  • Favorites Collections

API endpoints

Glean crawls and indexes content through the Box API endpoints listed below. The Glean application is available through the Box App Center.

Authentication Endpoints

  • Refresh access token: Refresh an Access Token using its client ID, secret, and refresh token.

Identity Endpoints

  • List enterprise users: Determine which users (and associated content) need to be indexed.

  • List groups for enterprise: Fetch all groups within a tenant (for permissions).

  • List Enterprise Users: Determine which users are members of which group (for permissions).

Content Endpoints

  • List items in folder: List all items and content within a folder for indexing.

  • Get file information: Retrieve metadata for each specific item for indexing.

  • List file collaborations: Retrieve a list of all users with access to an item (for permissions).
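One way to picture how folder listing drives a content crawl is a walk over the folder tree. In this sketch, `list_items` is a hypothetical callable standing in for a call to the List items in folder endpoint; it is passed in so the walk can run against in-memory data. The traversal order is an assumption and not a description of Glean's crawler.

```python
from typing import Callable, Dict, Iterator, List

ROOT_FOLDER_ID = "0"  # Box uses "0" as the ID of a user's root folder.


def walk(list_items: Callable[[str], List[Dict]],
         folder_id: str = ROOT_FOLDER_ID) -> Iterator[Dict]:
    """Yield every item reachable from `folder_id`, descending into subfolders."""
    pending = [folder_id]
    while pending:
        current = pending.pop()
        for item in list_items(current):
            yield item
            # Folders are yielded like any other item and then queued so
            # their contents are listed as well.
            if item["type"] == "folder":
                pending.append(item["id"])
```

Each yielded file could then be passed to the metadata and collaboration lookups listed above.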

Activity Endpoints

  • List user and enterprise events: Fetch activity data for each user for ranking signals (12 month limit).
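Box's events endpoint is read as a stream: each response includes a `next_stream_position` value that is sent back as `stream_position` on the next request. The sketch below shows that loop with a hypothetical `fetch_page` callable in place of the HTTP call; stopping on the first empty page is a simplification chosen for this example.

```python
from typing import Callable, Dict, List, Tuple


def drain_events(fetch_page: Callable[[str], Dict],
                 start_position: str = "0") -> Tuple[List[Dict], str]:
    """Collect events until an empty page is returned.

    Returns the events plus the stream position to resume from next time.
    """
    position = start_position
    events: List[Dict] = []
    while True:
        page = fetch_page(position)
        entries = page["entries"]
        if not entries:
            return events, position
        events.extend(entries)
        position = str(page["next_stream_position"])
```

Persisting the returned position between runs is what lets a frequent activity crawl pick up only events that arrived since the previous run.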

Content Configuration

Note: If Inclusion (Green-Listing) options are enabled, only content from the Inclusion category will be indexed. If Exclusion (Red-Listing) options are enabled, all content in the exclusion category will be removed. If both rules are applied to the same content, then the content will NOT be indexed, as exclusion rules take priority.

The rules below should be used MINIMALLY to preserve the enterprise search experience, as most end-users expect to find all content. Most customers do not apply any rules or apply exclusion rules sparingly for sensitive folders.

Exclusion (Red-Listing) Options

Glean provides several options for excluding content from the data crawl, which excludes data from search and chat results.

  • Users: Exclude content belonging to specific users from being crawled (see How to find the User ID).

  • Folders: Exclude content belonging to specific folders from being crawled.

Inclusion (Green-Listing) Options

Glean provides several options for including content in the data crawl, which limits search and chat results to that content.

  • Files: Include specific file content

  • Folders: Include only content belonging to specific folders in the crawl.
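The interaction between inclusion and exclusion rules described in this section can be sketched as a single decision function. The item fields (`id`, `owner_id`, `ancestor_folder_ids`) and the function itself are hypothetical; only the rule that exclusion takes priority comes from this document.

```python
from typing import Dict, FrozenSet


def should_index(
    item: Dict,
    include_files: FrozenSet[str] = frozenset(),
    include_folders: FrozenSet[str] = frozenset(),
    exclude_users: FrozenSet[str] = frozenset(),
    exclude_folders: FrozenSet[str] = frozenset(),
) -> bool:
    """Decide whether an item is indexed under the configured rules."""
    ancestors = set(item["ancestor_folder_ids"])
    # Exclusion rules are checked first because they take priority.
    if item["owner_id"] in exclude_users or ancestors & exclude_folders:
        return False
    # With no inclusion rules configured, everything not excluded is indexed.
    if not include_files and not include_folders:
        return True
    # Otherwise only explicitly included files or folders are indexed.
    return item["id"] in include_files or bool(ancestors & include_folders)
```

For instance, a file inside a folder that appears in both the inclusion and exclusion lists is not indexed.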
