Microsoft OneDrive/Sharepoint Connector
Written by Cindy Chang
Updated over a week ago

Introduction

The SharePoint connector for Glean allows Glean to fetch and index content from SharePoint sites, ensuring that users can search and access documents where they have authorized permissions.

  • Authentication: Done by creating and registering an app for each deployment (see https://docs.microsoft.com/en-us/graph/auth-v2-service).

  • API Usage:

    • Glean will use the Graph API to ingest all data and permissions, using the current Microsoft Graph API SDK v5.30.0.

    • Glean will ingest all data using the standard Graph API and SharePoint REST API.

    • Glean uses application permissions with admin-granted access.

  • Permissions Enforcement: Glean respects all user access permissions, ensuring users only see search results for documents they have access to. When a user clicks on a search result, they are taken to the Office 365 web application, which enforces the permissions.

  • Data Storage: All data is stored in the customer’s project within the customer's cloud account, ensuring no data leaves the customer's environment.

Content Captured:

For OneDrive, Glean will capture the following content:

  • Folders

  • Documents (all document types, e.g. Word, Excel, PowerPoint)

  • OneNote (limited support, indexing Notebooks + Sections)

For SharePoint, Glean will capture the following content:

  • Site Pages (web part or wiki page libraries)

  • Site Drives (document libraries)

  • Basic List and Calendar List items (optional configuration, not enabled by default)

SharePoint Permissions

Glean has extensive experience collecting and analyzing customer data from SharePoint (and the Office 365 ecosystem). In order to meet customer needs and provide optimal search and chat experience, Glean will require the following permissions set by the Office 365 tenant administrator:

For Identities in Azure:

  • User.Read.All

  • Group.Read.All

  • GroupMember.Read.All

For OneDrive/SharePoint:

  • Directory.Read.All

  • Files.Read.All

  • Files.ReadWrite.All (for webhooks setup)

  • Reports.Read.All (for ranking signals)

  • Sites.FullControl.All (previously Sites.Read.All)

  • SharePoint REST API requires full control to properly crawl all Site Collections, SharePoint site content, and permissions

Glean follows Microsoft's recommended best practices to crawl and record incremental changes for all documents. Ideally, Glean would use the Microsoft Graph API for all operations. In Glean’s experience with SharePoint, however, both the Graph and SharePoint REST APIs are required to meet customer needs. This article details how Glean uses Microsoft’s APIs.

Note: If the SharePoint tenant is new, Glean has observed that DisableCustomAppAuthentication is set to True. It needs to be set to False for the registered app to be authenticated. The command to run is “Set-PnPTenant -DisableCustomAppAuthentication $false”.

SharePoint REST & Graph API Full Control Discussion

As of 07/06/2024, the Microsoft SharePoint REST API does not provide granular access for admins (see SharePoint admin APIs authentication and authorization).

Therefore, Glean requires FullControl to properly retrieve data and permissions such as Role Assignment, Collections, etc. on the SharePoint site pages. As Microsoft evolves its SharePoint REST API, Glean will evaluate the changes and implement what is best for our customers.

Sites.FullControl.All Discussion

By default, Glean is optimized to automatically collect and analyze as much customer data as possible to provide maximum value to our customers. With Sites.FullControl.All, Glean can discover all of the customer’s current SharePoint sites automatically and add new sites automatically as they are created. Glean has also worked with customers who prefer to limit the scope of Glean’s data crawling.

Sites Selected Discussion

Customers can alternatively use Sites.Selected to explicitly indicate which sites the application can crawl. The Sites.Selected option carries a few trade-offs, and the customer will have to determine which option is best for their environment. First, the customer will have to notify Glean when new SharePoint sites are created so Glean can update the crawl list. Second, crawling a new SharePoint site can take a substantial amount of time, which may exceed 24 hours depending on the site's size, so end users should not expect new sites to appear instantaneously. Finally, Glean can’t gather activity data for a specific list of SharePoint sites, because Microsoft's Sites.Read.All permission is not granular enough. Without activity data, ranking and personalization will be affected.

Files.ReadWrite.All for Webhooks Discussion

Webhooks allow Glean to be aware of and sync changes to content in the customer’s environment as they occur instead of waiting for incremental crawls to complete. For example, if a document is deleted or its access permissions change, Microsoft will notify Glean of the change (through a webhook), and Glean will process the changes.

To set up and maintain webhooks, Microsoft requires the Files.ReadWrite.All permission. Glean subscribes to the driveItem webhook, and Files.ReadWrite.All is the least-privileged permission that can set up and reauthorize notifications for this resource. For more information, please refer to Microsoft's “subscription: reauthorize” documentation.

In Glean’s experience, disabling webhooks provides a suboptimal user experience. Glean’s scans for OneDrive and SharePoint are optimized to retrieve changes based on Microsoft notifications from webhooks Glean sets up. If webhooks are disabled then new changes within OneDrive/SharePoint could take longer than 24 hours to be processed (via our incremental scans) versus generally within an hour. This would include updates to permissions, changed files and sites.
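To make the webhook setup more concrete, here is a minimal sketch of the request body for a Microsoft Graph change-notification subscription on a drive. It only builds the JSON body and sends nothing. The function name, the one-day lifetime, the notification URL, and the client state value are illustrative assumptions, not Glean's actual configuration.

```python
from datetime import datetime, timedelta, timezone

# Microsoft Graph endpoint that a body like this would be POSTed to.
GRAPH_SUBSCRIPTIONS_URL = "https://graph.microsoft.com/v1.0/subscriptions"


def build_drive_subscription(drive_id: str, notification_url: str,
                             client_state: str, now: datetime,
                             lifetime: timedelta = timedelta(days=1)) -> dict:
    """Build the JSON body for a driveItem change-notification subscription.

    The lifetime default is an assumption for illustration only.
    """
    expires = (now + lifetime).astimezone(timezone.utc)
    return {
        # Drive subscriptions report every change as "updated".
        "changeType": "updated",
        "notificationUrl": notification_url,
        # Subscribe at the root of the drive so any item change is reported.
        "resource": f"/drives/{drive_id}/root",
        "expirationDateTime": expires.strftime("%Y-%m-%dT%H:%M:%SZ"),
        # Opaque value echoed back in notifications so the receiver can verify them.
        "clientState": client_state,
    }
```

Because subscriptions expire, a receiver built this way would also need to renew or reauthorize them, which is the step that requires Files.ReadWrite.All as described above.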

Versions Supported

There are no specific version limitations of the SharePoint connector.

Objects Supported

The SharePoint connector supports the following objects:

  • Folders: Glean captures and indexes folders within OneDrive & SharePoint.

  • Documents: This includes various types of documents stored in OneDrive & SharePoint.

  • Native File Types: Office file types, including Word, Excel, PowerPoint, etc.

  • Content from Personal and Shared Drives: Glean supports content from both personal drives and shared drives within OneDrive.

Authentication Mechanism

Connector credentials requirements

The SharePoint connector for Glean requires specific permissions to function correctly.

  • Glean requires authentication by setting up a registered app in Azure to SharePoint in order to fetch relevant information.

  • Glean understands all user access permissions and strictly enforces them at the time of the query, ensuring that users cannot see results to which they do not have access.

  • It’s important to note that all data is stored in the customer’s project in the customer's cloud account and no data leaves the customer's environment

  • Glean only requires READ-level permissions. However, application vendors may not provide read-only granularity in their permission schemes, as Glean has observed with Microsoft for webhooks and permission scanning.

| Scope | Purpose | Notes / Workarounds (if needed) |
| --- | --- | --- |
| User.Read.All | List all the users within the directory (used for permissions) | |
| Sites.FullControl.All | Retrieve sites, metadata, and associated content from the item for the index. FullControl is required to scan permission hierarchies. | Customers can consider Sites.Selected. This allows customers to manually provision certain SharePoint sites to have Graph API + REST API access. |
| Files.Read.All | Retrieve items, metadata, and associated content from the item for the index. | If Sites.Selected is used, this should not be needed. |
| GroupMember.Read.All | Get the members of a group (used for permissions) | |
| Reports.Read.All | Used for logging site usage metrics to validate that the crawler is gathering all documents, and for scaling infra to accommodate total document counts. | |
| Files.ReadWrite.All | Used to create and manage a webhook to subscribe to change notifications. | If Sites.Selected is used, this should not be needed. |

SharePoint Sites Selected Setup and Permissions

Please contact your Glean representative before installing the SharePoint data source connector with Sites.Selected to ensure the Glean environment is ready for the configuration.

Glean requires the following application permissions. Glean must be granted admin consent for the following permissions.

| Permission | Reason |
| --- | --- |
| User.Read.All | List users in the tenant. This is used to assign permissions. |
| GroupMember.Read.All | List members of groups in the tenant. This is used to assign permissions. |
| Sites.Selected | This is required to grant the permissions below per site. See documentation. |
| Reports.Read.All | This is used to get usage data to estimate crawl times. |

SharePoint Permissions per Site Setup

These instructions use a limited Graph API permission scope, Sites.Selected, to explicitly grant access only to a particular SharePoint site collection.

Required permissions for setup

  • The user setting up this data source must be the Global Admin.

Register a new app

  1. Sign in to the Azure portal. Select Azure Active Directory, then App registrations > New registration.

  2. On the Register an application page, register an app with the following:

| Field | Value |
| --- | --- |
| Name | Glean |
| Supported account types | Accounts in this organizational directory only (Single tenant) |
| Redirect URI | (Leave this field blank) |

  3. Click Register.

Configure permissions

  1. On the left side navigation on the overview page, click on Manage > API Permissions.

  2. Click Add a permission and select Microsoft Graph. Choose Application permissions and add the following:

  • User.Read.All

  • GroupMember.Read.All

  • Sites.Selected

  • Reports.Read.All

  • Members.Read.Hidden

Grant admin consent

  1. Please sign into Azure as a Global, Application or Cloud Application Administrator.

  2. Use the search box to navigate to Enterprise applications. Select the Glean app just created from the list of applications.

  3. Click on Permissions under Security. Review the permissions shown, and then click Grant admin consent.

Generate Certificate and PrivateKey

  1. Run the following commands one line at a time:

    openssl genrsa -out privatekey.key 2048
    openssl req -new -key privatekey.key -out request.csr
    openssl x509 -req -days 365 -in request.csr -signkey privatekey.key -out certificate.crt
  2. Verify that both certificate.crt and privatekey.key exist.

    1. certificate.crt should begin with -----BEGIN CERTIFICATE----- and end with -----END CERTIFICATE-----

    2. privatekey.key should begin with -----BEGIN PRIVATE KEY----- and end with -----END PRIVATE KEY-----

  3. Upload certificate.crt to the Client Certificate field in Glean.

  4. Upload privatekey.key to the Private Key field in Glean.
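The verification in step 2 can be sketched as a small script. This is a minimal illustration that checks only the first and last lines of each file against the markers listed above. It does not validate the certificate or key themselves, and the helper names are hypothetical.

```python
def looks_like_pem(text: str, label: str) -> bool:
    """Return True if text starts and ends with the PEM markers for label."""
    lines = [line.strip() for line in text.strip().splitlines()]
    return (
        len(lines) >= 3  # BEGIN line, at least one body line, END line
        and lines[0] == f"-----BEGIN {label}-----"
        and lines[-1] == f"-----END {label}-----"
    )


def check_files(cert_path: str = "certificate.crt",
                key_path: str = "privatekey.key") -> dict:
    """Check both files produced by the openssl commands above."""
    with open(cert_path, encoding="ascii") as f:
        cert_ok = looks_like_pem(f.read(), "CERTIFICATE")
    with open(key_path, encoding="ascii") as f:
        key_ok = looks_like_pem(f.read(), "PRIVATE KEY")
    return {cert_path: cert_ok, key_path: key_ok}
```

Running `check_files()` in the directory where the openssl commands were executed returns a dictionary of file name to pass/fail, so a mismatch in either marker is visible before uploading.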

Upload Certificate

  1. Navigate back to Home > App registrations and click on the app created earlier. Then click on Manage > Certificates & secrets in the left sidebar.

  2. Click the Certificates Section and Upload the certificate.

  3. Upload the certificate.crt file that was just generated.

Fill out keys

  1. Scroll to the top of the left sidebar and click Overview.

  2. Copy the following content from the center Essentials panel and enter it in Glean:

    1. Application (client) ID

    2. Directory (tenant) ID

  3. Enter the SharePoint domain in Glean. The SharePoint domain should end with "sharepoint.com"

  4. Glean recommends 5 additional applications with the same permission settings as the initial app created to maximize crawl speeds. Repeat the setup steps from "Register a new app" until this step, saving the client ID and client secret in the process. Paste the client ID and client secret into the Glean web app.

  5. Please go through the next steps to set up SharePoint REST API permissions, or clicking Save will not succeed.

Add Sites.Selected Permission

  1. Navigate back to Home > App registrations and click on the app created earlier. Then click on Manage > API permissions in the left sidebar.

  2. Click Add a permission and select SharePoint. Choose Application permissions and add Sites.Selected

The below step must be performed for every individual SharePoint site collection to be indexed.

Grant Graph API permissions to an individual site

Ensure that SharePoint PowerShell is installed. If any of the following commands do not work, install the module first, then run the commands again in PowerShell.

  1. Grant consent for PnP management in the Azure tenant for the specific site collection via site collection URL:

    Use either of the following two options, whichever works:

    Connect-PnPOnline -Url $SITE_COLLECTION_URL -Interactive -ClientId <clientId>

    (See section Interactive Connection Troubleshoot if not working) [Recommended]

    Connect-PnPOnline -Url $SITE_COLLECTION_URL -DeviceLogin -ClientId <clientId> -Tenant <tenantId>

    (See section DeviceLogin Troubleshooting if not working)

  2. With the application client ID and site collection URL, grant Full Control for the site collection:

    Grant-PnPAzureADAppSitePermission -AppId $CLIENT_ID -Site $SITE_COLLECTION_URL -Permissions FullControl

Provide the list of all sites to be crawled

Glean cannot automatically determine the sites with Sites.Selected permissions applied ahead of time. This requires configuration via the Manage Data tab.

  1. Navigate to the Manage Data > Inclusion Rules tab. Provide the list of URLs for the explicit sites to be crawled (these can be just the subsites of the site collections with permissions). If a site collection and all of its subsites should be crawled, provide all of their URLs explicitly in the greenlist.

Interactive Connection Troubleshoot

  1. In the Azure portal, find the app just created. In the menu, look for Manage and click on Authentication.

  2. Under Platform configurations on the page, click on Add a platform.

  3. In the panel that shows up on the right, click on Mobile and desktop applications.

  4. Leave the three boxes shown in the panel on the right unchecked and in the Custom redirect URIs field, enter: http://localhost. Note that this must be http, not https.

  5. Click on Configure at the bottom

  6. Retry the command

    Connect-PnPOnline -Url $SITE_COLLECTION_URL -Interactive -ClientId <clientId>

Delegated vs. Application Permissions

With delegated permissions, either users authenticate themselves or a service account that has been explicitly added to certain sites/user drives is used for Glean's crawl. In Glean’s experience, this does not provide the optimal experience:

  • Item insights (for activity) would only be assigned to the individual user. Glean could miss activity for other users who use Glean but are not associated with the service account (Item insights in Microsoft Graph - Microsoft Graph).

  • Microsoft imposes a tenant-wide rate limit. If individual users authorize, Glean would potentially crawl the same content with multiple user tokens. Due to the restrictive Graph API rate limits, Glean’s crawl speed would be significantly slower.

  • Glean cannot list all the site collections with delegated permissions (List sites - Microsoft Graph v1.0). Admins will still need to configure which SharePoint sites to include.

Connection instructions

Required permissions for setup

The user setting up this data source must be the Global Admin.

Register a new app

  1. Sign in to the Azure portal. Select Azure Active Directory, then App registrations > New registration.

  2. On the Register an application page, register an app with the values in the table below.

  3. Click Register.

| Field | Value |
| --- | --- |
| Name | Glean |
| Supported account types | Accounts in this organizational directory only (Single tenant) |
| Redirect URI | (Leave this field blank) |

Configure permissions

  1. On the left side navigation on the overview page, click on Manage > API Permissions

  2. Click Add a Permission and select Microsoft Graph. Choose Application permissions and add the following:

    1. User.Read.All

    2. GroupMember.Read.All

    3. Files.Read.All

    4. Files.ReadWrite.All (for webhooks)

    5. Reports.Read.All

    6. Sites.Read.All

    7. Members.Read.Hidden

    8. Directory.ReadWrite.All (Note: this is needed to grant REST API permissions and will be removed at the end of the installation)

  3. Click Add a Permission and select Microsoft Graph. Choose Application permissions and add the following:

    1. Sites.FullControl.All

Grant admin consent

  1. Please sign in to Azure as a Global, Application or Cloud Application Administrator.

  2. Use the search box to navigate to Enterprise applications. Select the Glean app just created from the list of applications.

  3. Click on Permissions under Security. Review the permissions shown, and then click Grant admin consent.

Generate Certificate and PrivateKey

  1. Run the following commands one line at a time:

    openssl genrsa -out privatekey.key 2048
    openssl req -new -key privatekey.key -out request.csr
    openssl x509 -req -days 365 -in request.csr -signkey privatekey.key -out certificate.crt
  2. Verify that both certificate.crt and privatekey.key exist.

    1. certificate.crt should begin with -----BEGIN CERTIFICATE----- and end with -----END CERTIFICATE-----

    2. privatekey.key should begin with -----BEGIN PRIVATE KEY----- and end with -----END PRIVATE KEY-----

  3. Upload certificate.crt to the Client Certificate field in Glean.

  4. Upload privatekey.key to the Private Key field in Glean.

Upload Certificate

  1. Navigate back to Home > App registrations and click on the app created earlier. Then click on Manage > Certificates & secrets in the left sidebar.

  2. Click the Certificates Section and Upload the certificate.

  3. Upload the certificate.crt file that was just generated

Generate secret

  1. Navigate back to Home > App registrations and click on the app created earlier. Then click on Manage > Certificates & secrets in the left sidebar.

  2. Click on New client secret. Enter a description and select 24 months for expiry time, then click Add.

  3. Under Client secrets, copy the Value (not the Secret ID) generated and enter it in Glean as the Client secret. The Value will only be shown once.

Fill out keys

  1. Scroll to the top of the left sidebar and click Overview.

  2. Copy the following content from the center Essentials panel and enter it in Glean:

    • Application (client) ID

    • Directory (tenant) ID

  3. Enter the SharePoint domain in Glean. The SharePoint domain should end with "sharepoint.com"

  4. (Strongly Recommended) To increase the full crawl indexing speeds, Glean recommends between 1 and 10 additional applications with the same permission settings as the initial app created. Repeat the setup steps from "Register a new app" until this step, saving the client ID and client secret in the process. Paste the client ID and client secret into the Glean web app.

  5. Follow the next step to set up SharePoint REST API permissions; otherwise, clicking Save will not succeed.

Grant REST API permissions to individual apps

Please complete this for all the apps registered

  1. Upload the same certificate that was generated in the instructions above to all of the apps.

    Ensure that SharePoint PowerShell is installed. If any of the following commands do not work, install the module first, then run the commands again in PowerShell.

  2. Establish a connection to the app:

    Use either of the following two options, whichever works:

    Connect-PnPOnline -Url "https://<domain>.sharepoint.com" -Interactive -ClientId <clientId>

    (See section Interactive Connection Troubleshoot if not working) [Recommended]

    Connect-PnPOnline -Url "https://<domain>.sharepoint.com" -DeviceLogin -ClientId <clientId> -Tenant <tenantId>

    (See section DeviceLogin Troubleshooting if not working)

  3. Grant Sites.FullControl.All to the app:

    Grant-PnPTenantServicePrincipalPermission -Scope "Sites.FullControl.All"

Remove Directory.ReadWrite.All permissions

This step should only be done after Glean verifies the crawler is working as expected with no permission issues.

  1. Go back to the App registrations page in Azure. On the left side navigation on the overview page, click on Manage > API Permissions.

  2. In the Configured permissions area, select Directory.ReadWrite.All, and click Remove permissions.

  3. In the Other permissions granted section, click Revoke Admin Consent to make sure the permission is fully removed.

Interactive Connection Troubleshoot

  1. In the Azure portal, find the app just created. In the menu, look for Manage and click on Authentication.

  2. Under Platform configurations on the page, click on Add a platform.

  3. In the panel that shows up on the right, click on Mobile and desktop applications.

  4. Leave the three boxes shown in the panel on the right unchecked and in the Custom redirect URIs field, enter: http://localhost. Note that this must be http, not https.

  5. Click on Configure at the bottom

  6. Retry the command

    Connect-PnPOnline -Url "https://<domain>.sharepoint.com" -Interactive -ClientId <clientId>

DeviceLogin Troubleshooting

If you connect to the app in PowerShell using the DeviceLogin option, you may see an error stating that the request body must contain the following parameter: 'client_assertion' or 'client_secret'. In that case, temporarily change Allow public client flows to "Yes" and retry. Once the setup is completed, please toggle it back.

Authentication and Endpoint scope requirements

Authentication Endpoints

| Endpoint | Use Case | Documentation Link | Product |
| --- | --- | --- | --- |
| Token request (Graph API) | Obtain and refresh an access token to interact with the Graph API using OAuth 2.0. | | All |
| Token request (SharePoint REST API) | Obtain and refresh an access token to interact with the SharePoint REST API using OAuth 2.0. | | SharePoint |
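As an illustration of the token requests in the table above, here is a minimal sketch that builds an OAuth 2.0 client-credentials request for the Microsoft identity platform without sending it. It assumes the client secret variant. Certificate-based requests use a signed client assertion instead of `client_secret`, which this sketch does not cover. The function name and parameters are hypothetical.

```python
from urllib.parse import urlencode


def build_token_request(tenant_id: str, client_id: str, client_secret: str,
                        resource: str = "https://graph.microsoft.com") -> tuple[str, str]:
    """Return (url, form_body) for an OAuth 2.0 client-credentials token request.

    The body is form-encoded and would be sent as an HTTP POST with
    Content-Type: application/x-www-form-urlencoded.
    """
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # ".default" requests the application permissions already consented to.
        "scope": f"{resource}/.default",
    })
    return url, body
```

For the SharePoint REST API the same shape would apply with `resource` set to the tenant's SharePoint domain (for example `https://<domain>.sharepoint.com`), under the assumptions stated above.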

Identity Endpoints

| Endpoint | Permissions | Use Case | Documentation Link | Product |
| --- | --- | --- | --- | --- |
| List users | User.Read.All | List all the users within the tenant | | All |
| List groups | GroupMember.Read.All | List all the groups within the tenant | | All |
| List group members | GroupMember.Read.All (or | Get the members of a group (to understand permissions). | | All |
| Get profilePhoto | User.Read.All | Get the profile photo of a given user (for Azure people data crawl) | | Azure AD / Entra ID |
| Get site groups: https://<site_domain>.sharepoint.com/sites/<subsite_url>/_api/web/SiteGroups?$expand=Users | SharePoint REST permissions (FullControl) | Get the default site groups and associated user memberships for a site. | | SharePoint |

Content Endpoints

| Endpoint | Permissions | Use Case | Documentation Link | Product |
| --- | --- | --- | --- | --- |
| List sites | Sites.FullControl.All (Sites.Read.All) | List all site collections within the tenant. Per Microsoft guidance, delta currently only returns site collections from the main geo-location in a multi-geo tenant. | | SharePoint, OneDrive |
| List subsites | Sites.FullControl.All (Sites.Read.All) | List all the subsites within a site. Glean scans recursively until there are no more subsites. | | SharePoint, OneDrive |
| List lists | Sites.FullControl.All (Sites.Read.All) | List all the lists within the site | | SharePoint, OneDrive |
| List columns | Sites.FullControl.All (Sites.Read.All) | List all columns within the site (attributes of site) | | SharePoint, OneDrive |
| List items delta | Sites.FullControl.All | List all items from the delta endpoint (returns some metadata the REST API does not include, along with inheritance). To scan permission hierarchies properly, Sites.FullControl.All is required. | | SharePoint, OneDrive |
| Get site list items: https://<site_domain>.sharepoint.com/sites/<subsite_url>/_api/web/lists('<list_id>')/items | SharePoint REST permissions (FullControl) | Get the items within a list for a site. Glean uses the REST API because some content for classic sites is only available via REST APIs. | | SharePoint, OneDrive |
| Get site item permissions | SharePoint REST permissions (FullControl) | Get the permissions for an item on the site. The SharePoint REST API is required for site pages / web parts, as the Graph API only exposes permissions for Document Library items. | | SharePoint, OneDrive |
| Get page content: https://<site_domain>.sharepoint.com/sites/<subsite_url>/_api/web/GetFileById('<id>')/GetLimitedWebPartManager(scope=1)/ExportWebPart | SharePoint REST permissions (FullControl) | Get the web parts on a particular page (i.e. blocks of content within text boxes, titles, etc.) | | SharePoint, OneDrive |
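The recursive subsite scan described for the List subsites endpoint can be sketched as a simple traversal. This is an illustrative sketch, not Glean's crawler: `list_subsites` is a hypothetical callable standing in for the Graph "List subsites" call, and the site identifiers are plain strings.

```python
from typing import Callable, Iterable


def collect_sites(root_site: str,
                  list_subsites: Callable[[str], Iterable[str]]) -> list[str]:
    """Return root_site plus every subsite reachable from it, depth first."""
    seen: set[str] = set()
    ordered: list[str] = []
    stack = [root_site]
    while stack:
        site = stack.pop()
        if site in seen:
            continue  # guard against a site being reported more than once
        seen.add(site)
        ordered.append(site)
        # Reverse so that subsites are visited in the order they were listed.
        stack.extend(reversed(list(list_subsites(site))))
    return ordered
```

The traversal stops when no site returns further subsites, which matches the "until there are no more subsites" behavior in the table. An explicit stack is used instead of recursion so that deeply nested sites do not hit Python's recursion limit.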

Drives and Document Libraries on SharePoint Sites

| Endpoint | Permissions | Use Case | Documentation Link | Product |
| --- | --- | --- | --- | --- |
| List drives | Files.Read.All | List all the drives within a given site | | SharePoint, OneDrive |
| Get driveItem | Files.Read.All | List all the items within a drive (change-based as per Microsoft’s scanning guidance) | | SharePoint, OneDrive |
| Get driveItem resource | Files.Read.All | Retrieve metadata for a driveItem. | | SharePoint, OneDrive |
| Download file | Files.Read.All | Download content for a driveItem to index bodies | | SharePoint, OneDrive |
| Get permissions | Files.Read.All | Get the permissions of a given item within a drive | | SharePoint, OneDrive |

Activity Endpoints

| Endpoint | Permissions | Use Case | Documentation Link | Product |
| --- | --- | --- | --- | --- |
| | Sites.FullControl.All (Sites.Read.All) | Lists recent activities performed by the user on specific items (Drive items). Follows TTL policies. Usually up to 6 months in the past. | | SharePoint, OneDrive |
| | Sites.FullControl.All (Sites.Read.All) | Lists recent sharing activity performed by users. Follows TTL policies. Usually up to 6 months in the past. | | SharePoint, OneDrive |

Reports

| Endpoint | Permissions | Use Case | Documentation Link | Product |
| --- | --- | --- | --- | --- |
| Get OneDrive Usage: File Count | Reports.Read.All | Get the total number of files across all sites and how many have been created, modified, and shared within the time period. | | SharePoint, OneDrive |
| Get SharePoint Usage: Site Count | Reports.Read.All | Get the total number of active sites within the time period. | | SharePoint, OneDrive |
| Get SharePoint Usage: User Count | Reports.Read.All | Get the total number of active SharePoint users within the time period. | | SharePoint, OneDrive |
| Get SharePoint Usage: Pages | Reports.Read.All | Get the number of pages viewed across all sites within the time period. | | SharePoint, OneDrive |

Webhooks

| Endpoint | Permissions | Use Case | Documentation Link | Product |
| --- | --- | --- | --- | --- |
| Create a webhook subscription | Files.ReadWrite.All. Glean subscribes to the driveItem resource, which requires (as least privilege) the Files.ReadWrite.All permission to create the subscription. | Create a change notification subscription to a given drive (see driveItem section in the documentation). | | SharePoint, OneDrive |
| Reauthorize a webhook subscription | Files.ReadWrite.All. Glean subscribes to the driveItem resource, which requires (as least privilege) the Files.ReadWrite.All permission for reauthorization. | Reauthorize a subscription after timeout when a reauthorizationRequired challenge is received. | | SharePoint, OneDrive |

Items crawled

Content

By default, an object must be present within a SharePoint List or User Drive to be crawled.

Note: All metadata is crawled by default.

Content is indexed by Glean if the following criteria are met:

  • The document is less than 16MB

  • The document has indexable content, such as Documents, PowerPoints, spreadsheets, PDFs, and text files, which can be downloaded.

    • By default, OCR is not enabled

    • Drawings, images, videos, compressed folders, and code (including JSON) are not downloaded by default and will only have metadata indexed

  • OneNote has limited support

    • OneNote is structured as a “Notebook,” composed of “Sections,” which are in turn composed of “Pages” that have content

    • Glean indexes “Notebook” as a folder and “Section” as standalone content

    • While Glean attempts to parse the downloaded “Section” content, which includes Pages, there are bugs in the content parsing library that prevent all page content from surfacing in the index

  • By default, the connector crawls ALL personal folders for employees in the company. It is possible to limit these crawls to a specific list of groups in the SSO Directory (Azure AD, etc…).
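The content criteria above can be read as a simple decision between indexing a document's body and indexing its metadata only. The sketch below is one such reading under stated assumptions: "16MB" is interpreted as 16 MiB, and the extension list is hypothetical because the article names document categories rather than specific extensions.

```python
# Assumption: "16MB" is read as 16 * 1024 * 1024 bytes.
MAX_CONTENT_BYTES = 16 * 1024 * 1024

# Hypothetical examples of "documents, PowerPoints, spreadsheets, PDFs, and text files".
INDEXABLE_EXTENSIONS = {".docx", ".pptx", ".xlsx", ".pdf", ".txt"}


def index_mode(filename: str, size_bytes: int) -> str:
    """Return "content" if the body would be indexed, else "metadata"."""
    dot = filename.rfind(".")
    ext = filename[dot:].lower() if dot != -1 else ""
    if ext in INDEXABLE_EXTENSIONS and size_bytes < MAX_CONTENT_BYTES:
        return "content"
    # Images, videos, archives, code, and oversized files keep metadata only,
    # since all metadata is crawled by default.
    return "metadata"
```

For example, `index_mode("plan.docx", 1024)` returns `"content"`, while `index_mode("diagram.png", 1024)` returns `"metadata"`.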

Identity

  • Glean crawls all Azure users and groups in the tenant using the Graph API.

  • With the SharePoint REST API, Glean also crawls all default site groups (e.g., visitors, members, and owners). These are common groups configured either across a full site collection or within subsites.

Note: Site groups and memberships are not available via Graph API.

SharePoint Lists

  • Glean only crawls Site Page (webPageLibrary) List types by default, which does not include a page representing the List library (e.g. url with “...AllItems.aspx”).

    • Web parts will be downloaded from each web page. Only “rich text” web parts will have content parsed and indexed.

    • Limitation: Pages can be in “draft mode.” This occurs when any user checks out a page to edit. Because the API does not distinguish between content available on a published page vs. a draft page, Glean will only show the title of “checked out” or “draft mode” pages by default.

      • If o365sharepoint.crawl.useTitleForDrafts = false, the draft page will not show up at all in search.

  • Glean does not support additional list types, e.g. “xmlPowerForm”
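The draft-page behavior described above can be summarized in a few lines. This is a minimal sketch of the documented outcome only: the function and its parameters are hypothetical, and `use_title_for_drafts` stands in for the `o365sharepoint.crawl.useTitleForDrafts` setting.

```python
from typing import Optional


def indexed_fields(title: str, body: str, is_draft: bool,
                   use_title_for_drafts: bool = True) -> Optional[dict]:
    """Return the fields that would be searchable for a site page, or None."""
    if not is_draft:
        return {"title": title, "body": body}
    if use_title_for_drafts:
        # Draft or checked-out page: only the title is shown by default.
        return {"title": title}
    # Setting disabled: the draft page does not appear in search at all.
    return None
```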

Rate Limits

SharePoint rate limits are shared between all endpoints (see Microsoft's “Avoid getting throttled or blocked in SharePoint Online”).

  • Rate limits depend on enterprise size. For the largest enterprises, Glean can typically make around 16 document retrievals per second (100 tokens per second, 1 token for content download, and 5 tokens for permission API calls). Glean optimizes API calls by understanding when content inherits permissions from parents. This helps Glean crawl upwards of 8M documents per day for large enterprises during a full crawl.

  • Rate limits are per application and also enforced tenant-wide. From custom testing, Glean has found success with increasing crawler applications to up to 5 additional applications.
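The figures above can be checked with some quick arithmetic. The sketch below uses only the numbers quoted in this section. Reading the 8M documents per day figure as the case where permission calls are skipped because items inherit permissions from their parents is an interpretation, not something the article states explicitly.

```python
TOKENS_PER_SECOND = 100
CONTENT_COST = 1      # tokens per content download
PERMISSION_COST = 5   # tokens per permission API call
SECONDS_PER_DAY = 24 * 60 * 60


def docs_per_second(needs_permission_call: bool) -> int:
    """Whole documents per second that fit in the token budget."""
    cost = CONTENT_COST + (PERMISSION_COST if needs_permission_call else 0)
    return TOKENS_PER_SECOND // cost


with_permissions = docs_per_second(True)    # 100 // 6 = 16
inherited_only = docs_per_second(False)     # 100 // 1 = 100

print(with_permissions * SECONDS_PER_DAY)   # 1382400 documents per day
print(inherited_only * SECONDS_PER_DAY)     # 8640000 documents per day
```

Under these assumptions, a crawl that needs a permission call for every document lands at roughly 1.4M documents per day, while one that can rely on inherited permissions approaches 8.6M, which is consistent with the "upwards of 8M" figure.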

Update frequency

Content updates for the SharePoint connector in Glean can happen quite rapidly, depending on the type of update and the configuration settings. Here are the key areas:

  • User Insights provides information when docs are modified or viewed, including when permissions change. A crawl that runs every 10 minutes reprocesses these permissions.

  • Glean also subscribes to webhooks on drives to understand which drives need to be crawled incrementally more to ingest up-to-date content.

For the most up-to-date crawler refresh information, please refer to [External] Glean crawling strategy

Crawling Strategy/Performance

The system uses the User Insights API to identify new, modified, and deleted docs, and then does an incremental fetch to catch up on the changes.

How the Crawl Works

Crawl Behavior

  • Most content, including OneDrive user drives or SharePoint document libraries, will be indexed via a OneDrive incremental crawl. This follows Microsoft’s recommended best practices for working with Microsoft Graph APIs.

    • Microsoft sends webhook notifications through a subscription on “drives” (OneDrive user drives or SharePoint document libraries). Notifications cover a change to any item within the drive, including a permission change to a single item.

    • Glean schedules an incremental crawl over all drives with a webhook seen recently, or if the drive has not been crawled for a period of time (by default at least weekly).

    • The change endpoint provides a list of all changed content (both content and permission changes), along with their full folder hierarchy.

    • Glean queues two sets of crawls after discovering changes: a crawl for any changed content and propagation over the full drive for permission changes. The permission propagation must occur over the full drive, as permission changes are not known beforehand in the change query. If folders change permissions, the children must also have the propagated change.

  • SharePoint list crawls are done via a “SitePage” crawl, which is independent of the OneDrive incremental crawls.

  • Finally, some document libraries or user OneDrive content will be present in user insights. Glean has a fast path to ingest any newly created content detected in user insights into the index.
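
The scheduling rule described above (crawl a drive if a webhook was seen recently, or if it has not been crawled for about a week) can be sketched as follows. This is a minimal illustration, not Glean's implementation: the `DriveState` type and `drives_due_for_crawl` function are hypothetical, and "seen recently" is interpreted here as "a webhook arrived after the last crawl".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, List, Optional


@dataclass
class DriveState:
    drive_id: str
    last_crawled: Optional[datetime]  # None if the drive was never crawled
    last_webhook: Optional[datetime]  # None if no notification was seen


def drives_due_for_crawl(
    drives: Iterable[DriveState],
    now: datetime,
    max_age: timedelta = timedelta(days=7),  # "at least weekly" default
) -> List[str]:
    """Return IDs of drives that should get an incremental crawl."""
    due = []
    for drive in drives:
        never_crawled = drive.last_crawled is None
        # Assumption: a webhook counts as recent if it is newer than the last crawl.
        webhook_pending = drive.last_webhook is not None and (
            never_crawled or drive.last_webhook > drive.last_crawled
        )
        stale = never_crawled or now - drive.last_crawled >= max_age
        if webhook_pending or stale:
            due.append(drive.drive_id)
    return due
```

A drive with no pending webhook is still picked up once its last crawl is older than `max_age`, which mirrors the weekly fallback described above.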

Known Limitations in Crawl

  • SharePoint Pages can be in “draft mode.” This occurs when any user checks out a page to edit. Because the API does not distinguish between content available on a published page vs. a draft page, Glean will only show the title of “checked out” or “draft mode” pages by default.

  • For OneNote, while Glean attempts to parse the downloaded “Section” content, which includes Pages, there are bugs in the content parsing library that prevent all page content from surfacing in the index.

  • When adding more than 3 additional apps to the SharePoint Online connector, an administrator may encounter the error “Failed to save changes. Please try again. If this issue persists, contact Glean for support.”

  • Glean cannot crawl SharePoint site URLs that contain special sharing designations. Remove these designations before entering URLs into inclusions or exclusions:

    • “:f” means Folder sharing

    • “:w” means Word document sharing

    • “:x” means Excel document sharing

    • “:p” means PowerPoint document sharing

    • “:b” means PDF document sharing
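
A small helper can flag or clean such URLs before they are submitted. The sketch below is illustrative only: it assumes the designation appears as its own path segment (for example `/:f:/`), optionally followed by an `r` segment, and the function names are hypothetical. Other sharing-link shapes may need manual cleanup.

```python
import re
from typing import List
from urllib.parse import urlsplit, urlunsplit

# Path segments such as ":f:", ":w:", ":x:", ":p:" or ":b:" (trailing colon optional).
_DESIGNATION = re.compile(r"^:[fwxpb]:?$")


def find_designations(url: str) -> List[str]:
    """Return the sharing designations found in the URL path."""
    return [seg for seg in urlsplit(url).path.split("/") if _DESIGNATION.match(seg)]


def strip_designations(url: str) -> str:
    """Drop designation segments (and an 'r' segment right after one).

    The query string and fragment are also dropped, on the assumption
    that they only carry sharing parameters.
    """
    parts = urlsplit(url)
    cleaned = []
    after_designation = False
    for seg in parts.path.split("/"):
        if _DESIGNATION.match(seg):
            after_designation = True
            continue
        if after_designation and seg == "r":
            after_designation = False
            continue
        after_designation = False
        cleaned.append(seg)
    return urlunsplit((parts.scheme, parts.netloc, "/".join(cleaned), "", ""))
```

For example, `strip_designations("https://contoso.sharepoint.com/:f:/r/sites/Team?csf=1")` yields `https://contoso.sharepoint.com/sites/Team` (`contoso` is a placeholder domain).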

Unsupported Features

  • Page-level OneNote support.

  • Additional SharePoint List type support (xmlForm, powerApps).

  • SharePoint basic list search results as an aggregate of all list items.

  • Supporting Purview + Sensitive Labels.

  • Event receivers and SPFx client-side activity exports

  • Teams transcripts

  • Custom facets as customizable search metadata for SharePoint Lists

  • Additional list types, e.g. “xmlPowerForm”

Content Configuration

For OneDrive, the customer can configure specific users whose content Glean can exclude from indexing. For SharePoint, the customer can either configure specific sites to be excluded from Glean's indexing or configure an explicit list of sites to be indexed by Glean. A customer can choose to index both or only SharePoint content.

A subset of a customer’s employees can be selected by creating an O365 group specifically with those members. Content will only contain OneDrive files from these users, and SharePoint files viewed/modified specifically by these users.

Item Insights are used for search personalization but can be turned off.

Inclusions (Green-Listing) Options

Glean can green-list sites and drives, so that only the green-listed sites and drives are crawled and indexed. Red-listing and green-listing can be used simultaneously.

Inclusion (Green-listing) SharePoint sites by site URL

Green-listing SharePoint sites can be achieved by providing a list of site URLs to the Glean team. Glean supports:

  • Green-listing a list of site URLs

  • Green-listing a file containing a single line of comma-separated URLs (if there are many sites)

URLs should be provided in the following format:

https://<domain>.sharepoint.com/sites/<siteName>
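
When there are many sites, the single-line, comma-separated file can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: the helper name is hypothetical, and the pattern only accepts URLs in the exact format shown above.

```python
import re
from typing import Iterable

# Matches https://<domain>.sharepoint.com/sites/<siteName> with no extra path.
SITE_URL = re.compile(r"^https://[A-Za-z0-9-]+\.sharepoint\.com/sites/[^/\s,]+$")


def to_single_line(urls: Iterable[str]) -> str:
    """Validate site URLs and join them into one comma-separated line."""
    cleaned = []
    for raw in urls:
        url = raw.strip().rstrip("/")
        if not url:
            continue  # skip blank entries
        if not SITE_URL.match(url):
            raise ValueError(f"not in the expected site URL format: {raw!r}")
        if url not in cleaned:  # drop duplicates, keep first-seen order
            cleaned.append(url)
    return ",".join(cleaned)
```

The returned string can be written to a file as its only line. Rejecting malformed entries up front avoids submitting sharing links or sub-page URLs by mistake.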

Green-listing OneDrive User Drives

Glean can green-list user drives by email. To enable this, provide a single-line file containing the comma-separated email addresses of the users whose drives should be allowed.

Exclusions (Red-listing) Options

Glean can red-list items in OneDrive and SharePoint, which prevents crawling them and removes them from search.

  • Red-listing a SharePoint site removes all site pages and files and folders in the site drives from Glean.

  • Red-listing OneDrive user drives by email removes all files and folders in the user’s personal drive (this also applies to OneDrive for Business drives). It does not affect documents created by the user in other drives or SharePoint.

Red-listing SharePoint sites via SharePoint settings

Glean respects the search preferences for sites in SharePoint. Sites can be removed from search by turning off the setting at Site Contents > Site Settings > Search and offline availability > Indexing Site content. An example can be found at this link.

Red-listing SharePoint sites by site URL

Red-listing can also be achieved by providing a list of site URLs to the Glean team. Glean supports:

  • Red-listing a list of site URLs

  • Red-listing a file containing a single line of comma-separated URLs (if there are many sites)

URLs should be provided in the following format:

https://<domain>.sharepoint.com/sites/<siteName>

Red-listing OneDrive user drives

Glean can red-list user drives by email. To enable this, provide a single-line file with comma-separated user email addresses that should be disallowed.

Auditing Permissions

Glean does not require any permission beyond read in the SharePoint API; however, because of the limited granularity of SharePoint API permissions, the least-privileged scope that works is Full Control. To demonstrate that Glean only uses read-level permissions, the steps below show how to audit Glean’s usage in Microsoft Purview.

Steps to Monitor Glean in Purview

Ensure that unified audit logging is enabled in the Microsoft 365 environment.

Search for Application-Specific Activity in Purview:

  • Navigate to Audit > Audit search.

  • In the Users filter, enter the Application ID or the name of the Glean application (as it appears in Azure AD).

  • Focus on the following activities:

    • File accessed: Logs when files or pages are read.

    • List accessed: Logs list-level operations, such as reading items.

    • Permissions viewed or changed: Identifies whether Glean is accessing or modifying permissions.

Monitor Specific Endpoints:

Since the Glean app uses specific SharePoint REST API endpoints, focus on:

  • Get site list items (/web/lists('<list_id>')/items)

  • Get site item permissions (/web/lists('<list_id>')/items('<item_id>')/roleassignments)

  • Get page content (/web/GetFileById('<id>')/GetLimitedWebPartManager)

  • Cross-reference the logs for activities related to these endpoints and verify they are read-only.

Set Up Alerts for Write Activity:

  • In the Audit section of Purview, create alert policies for write actions by Glean:

    • File modified or deleted: Flag if the application writes to or deletes files.

    • Permission changes: Monitor for unauthorized modifications to roles or permissions.

  • Use these alerts to trigger immediate reviews.
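
As a complement to alert policies, exported audit records can also be checked offline. The sketch below is illustrative only: it assumes each record has already been parsed into a dictionary with `UserId` and `Operation` keys, and the names in `WRITE_OPERATIONS` are example values that should be checked against the activities shown in the tenant's own Purview audit search. The function name is hypothetical.

```python
from typing import Dict, Iterable, List

# Assumed example operation names; confirm against your own audit export.
WRITE_OPERATIONS = {"FileModified", "FileDeleted", "FileUploaded"}


def find_write_activity(
    records: Iterable[Dict[str, str]], app_identifier: str
) -> List[Dict[str, str]]:
    """Return records where the given application performed a write operation."""
    wanted = app_identifier.lower()
    flagged = []
    for record in records:
        # Assumption: the application is identified in the "UserId" field.
        if record.get("UserId", "").lower() != wanted:
            continue
        if record.get("Operation") in WRITE_OPERATIONS:
            flagged.append(record)
    return flagged
```

An empty result for the Glean application over the audited period supports the read-only claim; any flagged record warrants a manual review.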

Troubleshooting

Microsoft Tenants Created in 2020 and After

If the tenant was created in 2020 or later, the following error may still be received even after completing the REST API setup in the previous section:

“Unable to fetch O365 SharePoint site groups. Please check that the sharepoint/content/tenant and sharepoint/content/sitecollection scopes are enabled with FullControl for SharePoint REST API.”

To resolve this, custom app authentication must be enabled for the SharePoint tenant by setting DisableCustomAppAuthentication to false (reference).

  1. Install PowerShell (if it is not already installed).

  2. In PowerShell, install the PnP and SharePoint Online modules by running

    Install-Module -Name PnP.PowerShell
    Install-Module -Name Microsoft.Online.SharePoint.PowerShell

  3. Run

    Connect-PnPOnline -Url https://<sharepointdomain>-admin.sharepoint.com
  4. Run

    Set-PnPTenant -DisableCustomAppAuthentication $false
  5. Reattempt to Save.

Workaround for Configuring Additional SharePoint Apps

Version: 1.0.0

Date: November 10, 2023

Background

When adding more than 3 additional apps to the SharePoint Online connector, an administrator may encounter the error “Failed to save changes. Please try again. If this issue persists, contact Glean for support.” which prevents them from continuing.

As a workaround, these additional apps can still be added through the Advanced app setup interface, which allows the secrets for each additional app to be set manually.

Procedure

  1. Create the client/secret IDs with the proper permissions as per the original SharePoint instructions. Verify all of the permissions are correct (there will be 8x Application Permissions), and that the SharePoint REST API privileges have been granted.

  2. Navigate to https://app.glean.com/admin/setup/apps?advanced. A modal should pop up similar to below.

    ⚠️Caution! Do not use this interface outside the instructions in this document or without guidance by a Glean engineer. Doing so may result in errors within the Glean instance.

  3. For each additional SharePoint app added, repeat the following steps:

    1. Set the Client ID:

      1. Open the Advanced modal above, and select Secret.

      2. For Key name, enter O365_CLIENT_ID_<X-1>, where <X> is the number of the additional app being configured. The key index starts at 0.

        1. For example, adding Additional App #3, the Key name will be O365_CLIENT_ID_2. For Additional App #6, the Key name will be O365_CLIENT_ID_5. Etc.

      3. For Key value, paste the Application/Client ID of the additional SharePoint app configured.

      4. Click Submit.

      Example of configuring the Application/Client ID for Additional App #3 (O365_CLIENT_ID_2)

    2. Set the Client Secret:

      1. Open the Advanced modal again, and select Secret.

      2. For Key name, enter O365_CLIENT_SECRET_<X-1>, where <X> is the number of the additional app being configured. The key index starts at 0.

        1. For example, if adding Additional App #3, the Key name will be O365_CLIENT_SECRET_2. For Additional App #6, the Key name will be O365_CLIENT_SECRET_5. Etc.

      3. For Key value, paste the Secret of the additional SharePoint app configured.

      4. Click Submit.

    Example of configuring the Client Secret value for Additional App #3 (O365_CLIENT_SECRET_2)

    Figure: mapping of each Additional App field in the standard SharePoint setup workflow to the Client ID and Client Secret keys that will be used.

  4. Advise your Glean engineer of the total number of additional apps that have been onboarded. They will need to enable the usage of the Secrets that have been set in order for them to take effect.

  5. The procedure is now complete. DO NOT modify anything in the SharePoint Setup page as doing so will overwrite the values entered for the secrets.
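
The key-naming rule from step 3 (Additional App #X uses index X-1) can be sketched as a small helper. The function name is hypothetical and shown only to make the off-by-one mapping explicit.

```python
from typing import Tuple


def secret_key_names(additional_app_number: int) -> Tuple[str, str]:
    """Return the (client ID key, client secret key) for Additional App #N.

    Keys are indexed from 0, so Additional App #1 maps to index 0.
    """
    if additional_app_number < 1:
        raise ValueError("additional app numbers start at 1")
    index = additional_app_number - 1
    return (f"O365_CLIENT_ID_{index}", f"O365_CLIENT_SECRET_{index}")
```

For Additional App #3, this returns `O365_CLIENT_ID_2` and `O365_CLIENT_SECRET_2`, matching the examples in the procedure.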
