Introduction
The SharePoint connector for Glean allows Glean to fetch and index content from SharePoint sites, ensuring that users can search and access documents where they have authorized permissions.
Authentication: Glean authenticates through an app that is created and registered for each deployment - https://docs.microsoft.com/en-us/graph/auth-v2-service.
API Usage:
Glean will use the Graph API to ingest all data and permissions, using the current Microsoft Graph API SDK v5.30.0.
Glean will ingest all data using the standard Graph API and SharePoint REST API.
Glean uses application permissions with admin-granted access.
Permissions Enforcement: Glean respects all user access permissions, ensuring users only see search results for documents they have access to. When a user clicks on a search result, they are taken to the Office 365 web application, which enforces the permissions.
Data Storage: All data is stored in the customer’s project within the customer's cloud account, ensuring no data leaves the customer's environment.
Content Captured:
For OneDrive, Glean will capture the following content:
Folders
Documents (all document types, e.g. Word, Excel, PowerPoint)
OneNote (limited support, indexing Notebooks + Sections)
For SharePoint, Glean will capture the following content:
Site Pages (web part or wiki page libraries)
Site Drives (document libraries)
Basic List and Calendar List items (optional configuration; not enabled by default)
SharePoint Permissions
Glean has extensive experience collecting and analyzing customer data from SharePoint (and the Office 365 ecosystem). In order to meet customer needs and provide optimal search and chat experience, Glean will require the following permissions set by the Office 365 tenant administrator:
For Identities in Azure:
User.Read.All
Group.Read.All
GroupMember.Read.All
For OneDrive/SharePoint:
Directory.Read.All
Files.Read.All
Files.ReadWrite.All (for webhooks setup)
Reports.Read.All (for ranking signals)
Sites.FullControl.All (previously Sites.Read.All)
SharePoint REST API requires full control to properly crawl all Site Collections, SharePoint site content, and permissions
Glean follows Microsoft's recommended best practices to crawl and record incremental changes for all documents. Ideally, Glean would use the Microsoft Graph API for all operations. In Glean’s experience with SharePoint, however, both the Graph and SharePoint REST APIs are needed to meet customer needs. This article details how Glean uses Microsoft’s APIs.
Note: For new SharePoint tenants, Glean has observed that DisableCustomAppAuthentication is set to True. It must be set to False for the registered app to be authenticated. The command to run is “Set-PnPTenant -DisableCustomAppAuthentication $false”
SharePoint REST & Graph API Full Control Discussion
As of 07/06/2024, the Microsoft SharePoint REST API does not provide granular access for admins (see SharePoint admin APIs authentication and authorization & driveItem: delta).
Therefore, Glean requires FullControl to properly retrieve data and permissions such as Role Assignment, Collections, etc. on the SharePoint site pages. As Microsoft evolves its SharePoint REST API, Glean will evaluate the changes and implement what is best for our customers.
Sites.FullControl.All Discussion
By default, Glean automatically collects and analyzes as much customer data as possible to provide maximum value to customers. With Sites.FullControl.All, Glean can discover all of the customer’s current SharePoint sites automatically and add new sites as they are created. Glean has also worked with customers who prefer to limit the scope of Glean’s data crawling.
Glean previously used Sites.Read.All, which is no longer sufficient. See [External] Sharepoint Connector Update for details.
Sites Selected Discussion
Customers can alternatively use Sites.Selected to explicitly indicate which sites the application can crawl. There are a few trade-offs with the Sites.Selected option, and the customer will have to determine which is best for their environment. First, the customer must notify Glean when new SharePoint sites are created so that Glean can update the crawl list. Second, crawling a new SharePoint site can take 24 hours or longer depending on its size, which can misalign end-user expectations (i.e., new sites are not searchable instantaneously). Finally, Glean can’t gather activity data for a specific list of SharePoint sites, which is a limitation in the granularity of Microsoft's Sites.Read.All permission. Without activity data, ranking and personalization will be affected.
Files.ReadWrite.All for Webhooks Discussion
Webhooks allow Glean to be aware of and sync changes to content in the customer’s environment as they occur instead of waiting for incremental crawls to complete. For example, if a document is deleted or its access permissions change, Microsoft will notify Glean of the change (through a webhook), and Glean will process the changes.
To set up and maintain webhooks, Microsoft requires the Files.ReadWrite.All permission, because Glean subscribes to the driveItem webhook. Files.ReadWrite.All is the least-privileged permission that can set up and reauthorize notifications for this resource. For more information, please refer to subscription: reauthorize from Microsoft.
In Glean’s experience, disabling webhooks provides a suboptimal user experience. Glean’s scans for OneDrive and SharePoint are optimized to retrieve changes based on Microsoft notifications from webhooks Glean sets up. If webhooks are disabled then new changes within OneDrive/SharePoint could take longer than 24 hours to be processed (via our incremental scans) versus generally within an hour. This would include updates to permissions, changed files and sites.
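As an illustration of the webhook setup described above, the following sketch builds the JSON body that Microsoft Graph expects when creating a change-notification subscription on a drive. It is not Glean's implementation. The function name, the three-day lifetime, and the example notification URL are assumptions made for the example, and the sketch only constructs the payload without sending it.

```python
from datetime import datetime, timedelta, timezone

# Microsoft Graph endpoint that accepts the subscription (HTTP POST).
GRAPH_SUBSCRIPTIONS_URL = "https://graph.microsoft.com/v1.0/subscriptions"


def build_drive_subscription(drive_id, notification_url, client_state,
                             lifetime_days=3, now=None):
    """Build the request body for a driveItem change-notification subscription.

    lifetime_days is an illustrative default, not a value taken from Glean.
    """
    now = now or datetime.now(timezone.utc)
    expires = now + timedelta(days=lifetime_days)
    return {
        # driveItem subscriptions report changes through the "updated" change type.
        "changeType": "updated",
        "notificationUrl": notification_url,
        # Subscribing to the drive root covers changes to items within the drive.
        "resource": f"/drives/{drive_id}/root",
        "expirationDateTime": expires.strftime("%Y-%m-%dT%H:%M:%SZ"),
        # Opaque value echoed back in each notification so the receiver can verify it.
        "clientState": client_state,
    }
```

The returned dictionary would be sent as JSON to GRAPH_SUBSCRIPTIONS_URL with an app token that carries Files.ReadWrite.All. Subscriptions expire, so a real integration also has to renew or reauthorize them.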
Versions Supported
There are no specific version limitations of the SharePoint connector.
Objects Supported
The SharePoint connector supports the following objects:
Folders: Glean captures and indexes folders within OneDrive & SharePoint.
Documents: This includes various types of documents stored in OneDrive & SharePoint.
Native File Types: Office including Word, Excel, PowerPoint, etc
Content from Personal and Shared Drives: Glean supports content from both personal drives and shared drives within OneDrive.
Authentication Mechanism
Connector credentials requirements
The SharePoint connector for Glean requires specific permissions to function correctly.
Glean requires authentication by setting up a registered app in Azure to SharePoint in order to fetch relevant information.
Glean understands all user access permissions and strictly enforces them at the time of the query, ensuring that users cannot see results to which they do not have access.
It’s important to note that all data is stored in the customer’s project in the customer's cloud account and no data leaves the customer's environment
Glean only requires READ-level permissions. However, application vendors may not provide read-only granularity in their permission schemes, as observed with Microsoft for webhooks and permission scanning.
Scope | Purpose | Notes / Workarounds (if needed) |
User.Read.All | List all the users within the directory (used for permissions) | |
Sites.FullControl.All | Retrieve sites, metadata, and associated content from the item for the index. FullControl is required to scan permission hierarchies. | Customers can consider Sites.Selected. This allows customers to manually provision certain SharePoint sites to have Graph API + REST API access. |
Files.Read.All | Retrieve items, metadata, and associated content from the item for the index. | If Sites.Selected is used, this should not be needed. |
GroupMember.Read.All | Get the members of a group (used for permissions) | |
Reports.Read.All | Used for logging site usage metrics for validating crawler is gathering all documents, and scaling infra to accommodate total document counts. | |
Files.ReadWrite.All | Used to create and manage a webhook to subscribe to change notifications. | If Sites.Selected is used, this should not be needed. |
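A small sketch of how an administrator might check that an app registration has been granted every scope in the table above. The function name is hypothetical. The list of granted scopes is assumed to come from wherever the administrator can read it, for example the roles claim of an app-only access token.

```python
# Scopes listed in the table above for the default (non-Sites.Selected) setup.
REQUIRED_SCOPES = {
    "User.Read.All",
    "Sites.FullControl.All",
    "Files.Read.All",
    "GroupMember.Read.All",
    "Reports.Read.All",
    "Files.ReadWrite.All",
}


def missing_scopes(granted, required=REQUIRED_SCOPES):
    """Return the required scopes that are absent from `granted`, sorted by name."""
    return sorted(set(required) - set(granted))
```

An empty result means every scope in the table is present. Any names returned still need admin consent.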
SharePoint Sites Selected Setup and Permissions
Please contact your Glean representative before installing the SharePoint data source connector with Sites.Selected to ensure the Glean environment is ready for the configuration.
Glean requires the following application permissions. Glean must be granted admin consent for the following permissions.
Permission | Reason |
User.Read.All | List users in the tenant. This is used to assign permissions. |
GroupMember.Read.All | List members of groups in the tenant. This is used to assign permissions. |
Sites.Selected | |
Reports.Read.All | This is used to get usage data to estimate crawl times. |
SharePoint Permissions per Site Setup
These instructions leverage a limited Graph API permission scope via Sites.Selected, to explicitly grant access only to a particular SharePoint site collection
Required permissions for setup
The user setting up this data source must be the Global Admin.
Register a new app
Sign in to the Azure portal. Select Azure Active Directory, then App registrations > New registration.
On the Register an application page, register an app with the following:
Field | Value |
Name | Glean |
Supported account types | Accounts in this organizational directory only (Single tenant) |
Redirect URI | (Leave this field blank) |
Click Register
Configure permissions
On the left side navigation on the overview page, click on Manage > API Permissions.
Click Add a permission and select Microsoft Graph. Choose Application permissions and add the following:
User.Read.All
GroupMember.Read.All
Sites.Selected
Reports.Read.All
Member.Read.Hidden
Grant admin consent
Please sign into Azure as a Global, Application or Cloud Application Administrator.
Use the search box to navigate to Enterprise applications. Select the Glean app just created from the list of applications.
Click on Permissions under Security. Review the permissions shown, and then click Grant admin consent.
Generate Certificate and PrivateKey
Run the following commands line by line:
openssl genrsa -out privatekey.key 2048
openssl req -new -key privatekey.key -out request.csr
openssl x509 -req -days 365 -in request.csr -signkey privatekey.key -out certificate.crt
Verify that both certificate.crt and privatekey.key exist.
certificate.crt should begin with -----BEGIN CERTIFICATE----- and end with -----END CERTIFICATE-----
privatekey.key should begin with -----BEGIN PRIVATE KEY----- and end with -----END PRIVATE KEY-----
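The header and footer check described above can be scripted. The sketch below reads a PEM file and returns the label found between the BEGIN and END markers. The function name is hypothetical, and the check only inspects the first and last non-empty lines. It does not validate the key or certificate itself.

```python
from pathlib import Path


def pem_label(path):
    """Return the PEM label (e.g. "CERTIFICATE") if the file's first and last
    non-empty lines form a matching BEGIN/END pair; raise ValueError otherwise."""
    lines = [ln.strip() for ln in Path(path).read_text().splitlines() if ln.strip()]
    if not lines:
        raise ValueError(f"{path} is empty")
    first, last = lines[0], lines[-1]
    begin, dashes = "-----BEGIN ", "-----"
    if not (first.startswith(begin) and first.endswith(dashes)):
        raise ValueError(f"{path} does not start with a PEM BEGIN line")
    label = first[len(begin):-len(dashes)]
    if last != f"-----END {label}-----":
        raise ValueError(f"{path} does not end with -----END {label}-----")
    return label
```

For the files generated above, pem_label("certificate.crt") is expected to return "CERTIFICATE" and pem_label("privatekey.key") is expected to return "PRIVATE KEY". Some older OpenSSL releases may write the key with an "RSA PRIVATE KEY" label instead, which this check would surface.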
Upload the certificate.crt in Glean:
Client Certificate
Upload the privatekey.key in Glean:
Private Key
Upload Certificate
Navigate back to Home > App registrations and click on the app created earlier. Then click on Manage > Certificates & secrets in the left sidebar.
Click the Certificates Section and Upload the certificate.
Upload the certificate.crt file that was just generated
Fill out keys
Scroll to the top of the left sidebar and click Overview.
Copy the following content from the center Essentials panel and enter it in Glean:
Application (client) ID
Directory (tenant) ID
Enter the SharePoint domain in Glean. The SharePoint domain should end with "sharepoint.com"
Glean recommends 5 additional applications with the same permission settings as the initial app created to maximize crawl speeds. Repeat the setup steps from "Register a new app" until this step, saving the client ID and client secret in the process. Paste the client ID and client secret into the Glean web app.
Please go through the next steps to set up SharePoint REST API permissions, or clicking Save will not succeed.
Add Sites.Selected Permission
Navigate back to Home > App registrations and click on the app created earlier. Then click on Manage > API permissions in the left sidebar.
Click Add a permission and select SharePoint. Choose Application permissions and add Sites.Selected
The below step must be performed for every individual SharePoint site collection to be indexed.
Grant Graph API permissions to an individual site
Ensure that SharePoint PowerShell is installed. If any of the following commands do not work, install the module first before running the commands again within PowerShell.
Grant consent for PnP management in the Azure tenant for the specific site collection via site collection URL:
Choose either of the two options to see which one works:
Connect-PnPOnline -Url $SITE_COLLECTION_URL -Interactive -ClientId <clientId>
(See section Interactive Connection Troubleshoot if not working)[Recommended]
Connect-PnPOnline -Url $SITE_COLLECTION_URL -DeviceLogin -ClientId <clientId> -Tenant <tenantId>
(See section DeviceLogin Troubleshoot if not working)
With the application client ID and site collection url, grant Full Control for the site collection:
Grant-PnPAzureADAppSitePermission -AppId $CLIENT_ID -Site $SITE_COLLECTION_URL -Permissions FullControl
Provide the list of all sites to be crawled
Glean cannot automatically determine the sites with Sites.Selected permissions applied ahead of time. This requires configuration via the Manage Data tab.
Navigate to the Manage Data > Inclusion Rules tab. Provide the list of URLs (these can be just the subsites of the site collections with permissions) for the explicit sites to be crawled. If a site collection and all associated subsites should be crawled, provide all the URLs explicitly in the greenlist.
Interactive Connection Troubleshoot
On the Azure portal, find the app just created. In the menu, look for Manage and click on Authentication
Under Platform configurations on the page, click on Add a platform
In the panel that shows up on the right, click on Mobile and desktop applications
Leave the three boxes shown in the panel on the right unchecked and in the Custom redirect URIs field, enter: http://localhost. Note that this must be http, not https
Click on Configure at the bottom
Retry the command
Connect-PnPOnline -Url $SITE_COLLECTION_URL -Interactive -ClientId <clientId>
Delegated vs. Application Permissions
With delegated permissions, users authenticate themselves, or a service account is explicitly added to certain sites/user drives for Glean to crawl. In Glean’s experience, this does not provide the optimal experience:
Item insights (for activity) would only be attributed to the individual user. Glean could miss activity for other users who use Glean but are not associated with the service account (see Item insights in Microsoft Graph).
Microsoft imposes a tenant-wide rate limit. If individual users authorize, Glean would potentially crawl the same content with multiple user tokens. Due to the restrictive Graph API rate limits, Glean’s crawl speed would be significantly slower.
Glean cannot list all the site collections with delegated permissions (see List sites, Microsoft Graph v1.0). Admins would still need to configure the SharePoint sites to include.
Connection instructions
Required permissions for setup
The user setting up this data source must be the Global Admin.
Register a new app
Sign into the Azure portal. Select Azure Active Directory, then App registrations > New registration
On the Register an application page, register an app with the following:
Field | Value |
Name | Glean |
Supported account types | Accounts in this organizational directory only (Single tenant) |
Redirect URI | (Leave this field blank) |
Click Register
Configure permissions
On the left side navigation on the overview page, click on Manage > API Permissions
Click Add a Permission and select Microsoft Graph. Choose Application permissions and add the following:
Sites.FullControl.All
Grant admin consent
Please sign in to Azure as a Global, Application or Cloud Application Administrator.
Use the search box to navigate to Enterprise applications. Select the Glean app just created from the list of applications.
Click on Permissions under Security. Review the permissions shown, and then click Grant admin consent.
Generate Certificate and PrivateKey
Run the following commands line by line:
openssl genrsa -out privatekey.key 2048
openssl req -new -key privatekey.key -out request.csr
openssl x509 -req -days 365 -in request.csr -signkey privatekey.key -out certificate.crt
Verify that both certificate.crt and privatekey.key exist.
certificate.crt should begin with -----BEGIN CERTIFICATE----- and end with -----END CERTIFICATE-----
privatekey.key should begin with -----BEGIN PRIVATE KEY----- and end with -----END PRIVATE KEY-----
Upload the certificate.crt in Glean:
Client Certificate
Upload the privatekey.key in Glean:
Private Key
Upload Certificate
Navigate back to Home > App registrations and click on the app created earlier. Then click on Manage > Certificates & secrets in the left sidebar.
Click the Certificates Section and Upload the certificate.
Upload the certificate.crt file that was just generated
Generate secret
Navigate back to Home > App registrations and click on the app created earlier. Then click on Manage > Certificates & secrets in the left sidebar.
Click on New client secret. Enter a description and select 24 months for expiry time, then click Add.
Under Client secrets, copy the Value (not the Secret ID) generated and enter it in Glean as the Client secret. The Value will only be shown once.
Fill out keys
Scroll to the top of the left sidebar and click Overview.
Copy the following content from the center Essentials panel and enter it in Glean:
Application (client) ID
Directory (tenant) ID
Enter the SharePoint domain in Glean. The SharePoint domain should end with "sharepoint.com"
(Strongly Recommended) To increase the full crawl indexing speeds, Glean recommends between 1 and 10 additional applications with the same permission settings as the initial app created. Repeat the setup steps from "Register a new app" until this step, saving the client ID and client secret in the process. Paste the client ID and client secret into the Glean web app.
Follow the next step to set up SharePoint REST API permissions; otherwise, clicking Save will not succeed.
Grant REST API permissions to individual apps
Please complete this for all the apps registered
Upload the same certificate that was generated in the instructions above to all of the apps
Ensure that SharePoint PowerShell is installed. If any of the following commands do not work, install the module first before running the commands again within PowerShell.
Establish a connection to the app:
Choose either of the two options to see which one works:
Connect-PnPOnline -Url "https://<domain>.sharepoint.com" -Interactive -ClientId <clientId>
(See section Interactive Connection Troubleshoot if not working)[Recommended]
Connect-PnPOnline -Url "https://<domain>.sharepoint.com" -DeviceLogin -ClientId <clientId> -Tenant <tenantId>
(See section DeviceLogin Troubleshoot if not working)
Grant Sites.FullControl.All to the app:
Grant-PnPTenantServicePrincipalPermission -Scope "Sites.FullControl.All"
Remove Directory.ReadWrite.All permissions
This step should only be done after Glean verifies the crawler is working as expected with no permission issues.
Go back to the App registrations page in Azure. On the left side navigation on the overview page, click on Manage > API Permissions.
In the Configured permissions area, select Directory.ReadWrite.All, and click Remove permissions
In the Other permissions granted section, click Revoke Admin Consent to make sure the permission is fully removed
Interactive Connection Troubleshoot
On the Azure portal, find the app just created. In the menu, look for Manage and click on Authentication
Under Platform configurations on the page, click on Add a platform
In the panel that shows up on the right, click on Mobile and desktop applications
Leave the three boxes shown in the panel on the right unchecked and in the Custom redirect URIs field, enter: http://localhost. Note that this must be http, not https
Click on Configure at the bottom
Retry the command
Connect-PnPOnline -Url "https://<domain>.sharepoint.com" -Interactive -ClientId <clientId>
DeviceLogin Troubleshooting
When connecting to the app in PowerShell with the DeviceLogin option, the following error may appear: the request body must contain the following parameter: 'client_assertion' or 'client_secret'. In that case, follow the solutions here to temporarily change Allow public client flows to "Yes" and retry. Once the setup is completed, please toggle it back.
Authentication and Endpoint scope requirements
Authentication Endpoints
Endpoint | Use Case | Documentation Link | Product |
Token request (Graph API) | Obtain and refresh an access token to interact with the Graph API using OAuth 2.0. | | All |
Token request (SharePoint REST API) | Obtain and refresh an access token to interact with the SharePoint REST API using OAuth 2.0. | | SharePoint |
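For orientation, the Graph API token request in the table above is an OAuth 2.0 client-credentials call against the Microsoft identity platform. The sketch below only assembles the URL and form body for that call. It is not Glean's code, the function name is hypothetical, and it assumes a client secret is used. A certificate-based request or a SharePoint REST token would differ in the credential fields and scope.

```python
from urllib.parse import urlencode


def graph_token_request(tenant_id, client_id, client_secret):
    """Return (url, form_body) for an app-only Microsoft Graph token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    form = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # ".default" requests the application permissions already consented to.
        "scope": "https://graph.microsoft.com/.default",
    }
    return url, urlencode(form)
```

The body would be sent as an HTTP POST with the Content-Type application/x-www-form-urlencoded. The JSON response carries the access_token used in the Authorization header of later Graph calls.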
Identity Endpoints
Endpoint | Permissions | Use Case | Documentation Link | Product |
List users | User.Read.All | List all the users within the tenant | | All |
List groups | GroupMember.Read.All | List all the groups within the tenant | | All |
List group members | GroupMember.Read.All (or | Get the members of a group (to understand permissions). | | All |
Get profilePhoto | User.Read.All | Get the profile photo of a given user (for Azure people data crawl) | | Azure AD / Entra ID |
Get site groups (https://<site_domain>.sharepoint.com/sites/<subsite_url>/_api/web/SiteGroups?$expand=Users) | SharePoint REST permissions (FullControl) | Get the default site groups and associated user memberships for a site. | | SharePoint |
Content Endpoints
Endpoint | Permissions | Use Case | Documentation Link | Product |
List sites | Sites.FullControl.All (Sites.Read.All) | List all site collections within the tenant. Per Microsoft guidance, in a multi-geo tenant delta currently returns only site collections from the main geo-location. | | SharePoint, OneDrive |
List subsites | Sites.FullControl.All (Sites.Read.All) | List all the subsites within a site. Glean scans recursively until there are no more subsites. | | SharePoint, OneDrive |
List lists | Sites.FullControl.All (Sites.Read.All) | List all the lists within the site | | SharePoint, OneDrive |
List columns | Sites.FullControl.All (Sites.Read.All) | List all columns within the site (attributes of site) | | SharePoint, OneDrive |
List items delta (https://graph.microsoft.com/v1.0/sites/<id>/sites/<id>/lists/<id>/items/delta) | Sites.FullControl.All | List all items from the delta endpoint (returns some metadata the REST API does not include, along with inheritance). Sites.FullControl.All is required to scan permission hierarchies properly. | | SharePoint, OneDrive |
Get site list items (https://<site_domain>.sharepoint.com/sites/<subsite_url>/_api/web/lists('<list_id>')/items) | SharePoint REST permissions (FullControl) | Get the items within a list for a site. Glean uses the REST API because some content for classic sites is only available via REST APIs. | | SharePoint, OneDrive |
Get site item permissions | SharePoint REST permissions (FullControl) | Get the permissions for an item on the site. The SharePoint REST API is required for site pages / web parts, as the Graph API only exposes permissions for Document Library items. | | SharePoint, OneDrive |
Get page content (https://<site_domain>.sharepoint.com/sites/<subsite_url>/_api/web/GetFileById('<id>')/GetLimitedWebPartManager(scope=1)/ExportWebPart) | SharePoint REST permissions (FullControl) | Get the web parts on a particular page (i.e., blocks of content within text boxes, titles, etc.) | | SharePoint, OneDrive |
Drives and Document Libraries on SharePoint Sites
Endpoint | Permissions | Use Case | Documentation Link | Product |
List drives | Files.Read.All | List all the drives within a given site | | SharePoint, OneDrive |
Get driveItem | Files.Read.All | List all the items within a drive (change-based as per Microsoft’s scanning guidance) | | SharePoint, OneDrive |
Get driveItem resource | Files.Read.All | Retrieve metadata for a driveItem. | | SharePoint, OneDrive |
Download file | Files.Read.All | Download content for a driveItem to index bodies | | SharePoint, OneDrive |
Get permissions | Files.Read.All | Get the permissions of a given item within a drive | | SharePoint, OneDrive |
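The change-based listing in the table above relies on Microsoft Graph delta queries, which return results in pages. Each page carries an @odata.nextLink until the final page, which carries an @odata.deltaLink to use for the next incremental pass. The following sketch shows that paging loop. The fetch_page callable is a hypothetical stand-in for an authenticated HTTP GET that returns parsed JSON, and the code is an illustration, not Glean's crawler.

```python
def collect_delta(fetch_page, start_url):
    """Follow a Graph delta query to completion.

    fetch_page: callable taking a URL and returning the parsed JSON page (a dict).
    Returns (items, delta_link); delta_link is the URL for the next incremental run.
    """
    items = []
    url = start_url
    while url:
        page = fetch_page(url)
        items.extend(page.get("value", []))
        # The last page of a delta round carries the deltaLink instead of a nextLink.
        if "@odata.deltaLink" in page:
            return items, page["@odata.deltaLink"]
        url = page.get("@odata.nextLink")
    raise RuntimeError("delta response ended without an @odata.deltaLink")
```

A first call might start from a URL such as https://graph.microsoft.com/v1.0/drives/<drive_id>/root/delta, and later calls would start from the stored delta link so that only changes since the previous pass are returned.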
Activity Endpoints
Endpoint | Permissions | Use Case | Documentation Link | Product |
| Sites.FullControl.All (Sites.Read.All) | Lists recent activities performed by the user on specific items (Drive items). Follows TTL policies. Usually up to 6 months in the past. | | SharePoint, OneDrive |
| Sites.FullControl.All (Sites.Read.All) | Lists recent sharing activity performed by users. Follows TTL policies. Usually up to 6 months in the past. | | SharePoint, OneDrive |
Reports
Endpoint | Permissions | Use Case | Documentation Link | Product |
Get OneDrive Usage: File Count | Reports.Read.All | Get the total number of files across all sites and how many have been created, modified, and shared within the time period. | | SharePoint, OneDrive |
Get SharePoint Usage: Site Count | Reports.Read.All | Get the total number of active sites within the time period. | | SharePoint, OneDrive |
Get SharePoint Usage: User Count | Reports.Read.All | Get the total number of active SharePoint users within the time period. | | SharePoint, OneDrive |
Get SharePoint Usage: Pages | Reports.Read.All | Get the number of pages viewed across all sites within the time period. | | SharePoint, OneDrive |
Webhooks
Endpoint | Permissions | Use Case | Documentation Link | Product |
Create a webhook subscription (HTTP POST https://graph.microsoft.com/v1.0/subscriptions) | Files.ReadWrite.All. Glean subscribes to the driveItem resource, which requires (as least privilege) the Files.ReadWrite.All permission to create the subscription. | Create a change notification subscription to a given drive (see driveItem section in the documentation). | | SharePoint, OneDrive |
Reauthorize a webhook subscription | Files.ReadWrite.All. Glean subscribes to the driveItem resource, which requires (as least privilege) the Files.ReadWrite.All permission for reauthorization. | Reauthorize a subscription after timeout when a reauthorizationRequired challenge is received. | | SharePoint, OneDrive |
Items crawled
Content
By default, an object must be present within a SharePoint List or User Drive to be crawled.
Note: All metadata is crawled by default.
Content is indexed by Glean if the following criteria are met:
The document is less than 16MB
The document has indexable content, such as Documents, PowerPoints, spreadsheets, PDFs, and text files, which can be downloaded.
By default, OCR is not enabled
Drawings, images, videos, compressed folders, and code (including JSON) are not downloaded by default and will only have metadata indexed
OneNote has limited support
OneNote is structured as a “Notebook,” composed of “Sections,” which are in turn composed of “Pages” that have content
Glean indexes “Notebook” as a folder and “Section” as standalone content
While Glean attempts to parse the downloaded “Section” content, which includes Pages, there are bugs in the content parsing library that prevent all page content from surfacing in the index
By default, the connector crawls ALL personal folders for employees in the company. It is possible to limit these crawls to a specific list of groups in the SSO Directory (Azure AD, etc…).
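The default content criteria above can be summarized as a small decision function. This is a sketch under stated assumptions, not Glean's logic: the function name is hypothetical, the extension list is an illustrative subset of "indexable" types, and 16MB is interpreted here as 16 * 1024 * 1024 bytes.

```python
import os

# Assumption: "16MB" is read as 16 MiB; the source does not specify the unit base.
MAX_BODY_BYTES = 16 * 1024 * 1024

# Hypothetical, illustrative subset of types with indexable content.
INDEXABLE_EXTENSIONS = {".docx", ".pptx", ".xlsx", ".pdf", ".txt"}


def index_mode(filename, size_bytes):
    """Return "content+metadata" when the body qualifies for indexing,
    otherwise "metadata-only" (metadata is crawled in every case)."""
    extension = os.path.splitext(filename)[1].lower()
    if size_bytes < MAX_BODY_BYTES and extension in INDEXABLE_EXTENSIONS:
        return "content+metadata"
    return "metadata-only"
```

Under these assumptions a 2 MB PDF yields "content+metadata", while an image, a JSON file, or a 20 MB slide deck yields "metadata-only".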
Identity
Glean crawls all Azure users and groups in the tenant using the Graph API.
With the SharePoint REST API, Glean also crawls all default site groups (e.g., visitors, members, and owners). These are common groups configured either across a full site collection or within subsites.
Note: Site groups and memberships are not available via Graph API.
SharePoint Lists
Glean only crawls Site Page (webPageLibrary) List types by default, which does not include a page representing the List library (e.g. url with “...AllItems.aspx”).
Web parts will be downloaded from each web page. Only “rich text” web parts will have content parsed and indexed.
Limitation: Pages can be in “draft mode.” This occurs when any user checks out a page to edit. Because the API does not distinguish between content available on a published page vs. a draft page, Glean will only show the title of “checked out” or “draft mode” pages by default.
If o365sharepoint.crawl.useTitleForDrafts = false, the draft page will not show up at all in search.
Glean does not support additional list types, e.g. “xmlPowerForm”
Rate Limits
SharePoint rate limits are shared between all endpoints: Avoid getting throttled or blocked in SharePoint Online
Rate limits depend on enterprise size. For the largest enterprises, Glean typically can make around 16 document retrievals per second (100 tokens per second, 1 token for content download, and 5 tokens for permission API calls). Glean optimizes API calls by understanding when content inherits permissions from parents. This helps Glean crawl upwards of 8M documents per day for large enterprises during a full crawl.
Rate limits are per application and also enforced tenant-wide. From custom testing, Glean has found success with increasing crawler applications to up to 5 additional applications.
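The figures above can be reproduced with simple arithmetic. The sketch below uses the token costs stated in this section. Treating the "upwards of 8M documents per day" figure as the case where permission calls are skipped because permissions are inherited is an interpretation made for this example, not a statement from Glean.

```python
TOKENS_PER_SECOND = 100   # budget quoted for the largest enterprises
CONTENT_TOKENS = 1        # cost of one content download
PERMISSION_TOKENS = 5     # cost of the permission API calls for one document
SECONDS_PER_DAY = 24 * 60 * 60


def docs_per_second(inherits_permissions):
    """Documents retrievable per second for one token budget."""
    cost = CONTENT_TOKENS + (0 if inherits_permissions else PERMISSION_TOKENS)
    return TOKENS_PER_SECOND / cost


def docs_per_day(inherits_permissions):
    """Documents retrievable per day at a sustained rate."""
    return docs_per_second(inherits_permissions) * SECONDS_PER_DAY
```

With a separate permission lookup per document the cost is 6 tokens, or about 16.7 documents per second and roughly 1.44M per day. When a document inherits its parent's permissions and only the 1-token download is needed, the same budget allows 100 documents per second, or 8.64M per day.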
Update frequency
Content updates for the SharePoint connector in Glean can happen quite rapidly, depending on the type of update and the configuration settings. Here are the key areas:
User Insights provides information when docs are modified or viewed, including when permissions change. A crawl that runs every 10 minutes reprocesses these permissions.
Glean also subscribes to webhooks on drives to learn which drives need an incremental crawl to ingest up-to-date content.
For the most up-to-date crawler refresh information, please refer to [External] Glean crawling strategy
Crawling Strategy/Performance
The system uses the User Insights API to identify new/modified/deleted docs. The system does an incremental fetch to catch up on the changes
How the Crawl Works
Crawl Behavior
Most content, including OneDrive user drives or SharePoint document libraries, will be indexed via a OneDrive incremental crawl. This follows Microsoft’s recommended best practices for working with Microsoft Graph APIs.
Microsoft sends a webhook notification via a subscription on “drives” (OneDrive user drives or SharePoint document libraries). Subscriptions cover a change to any item within the drive, including a permission change to a single item.
Glean schedules an incremental crawl over all drives with a webhook seen recently, or if the drive has not been crawled for a period of time (by default at least weekly).
The change endpoint provides a list of all changed content (both content and permission changes), along with their full folder hierarchy.
Glean queues two sets of crawls after discovering changes: a crawl for any changed content and propagation over the full drive for permission changes. The permission propagation must occur over the full drive, as permission changes are not known beforehand in the change query. If folders change permissions, the children must also have the propagated change.
SharePoint list crawls are done via a “SitePage” crawl, which is independent of the OneDrive incremental crawls
Finally, some document libraries or user OneDrive content will be present in user insights. Glean has a fast path to ingest any newly created content detected in user insights into the index.
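To illustrate why permission changes have to be propagated through the folder hierarchy, the sketch below resolves effective permissions for items that either carry their own permission set or inherit from a parent. The data shape and function name are hypothetical and greatly simplified compared with real SharePoint role assignments.

```python
def effective_permissions(items):
    """Resolve effective permissions for every item in a drive.

    items: dict mapping item id -> {"parent": parent id or None,
                                    "permissions": set of principals, or None to inherit}
    Returns a dict mapping item id -> set of principals.
    """
    resolved = {}

    def resolve(item_id):
        if item_id in resolved:
            return resolved[item_id]
        item = items[item_id]
        if item["permissions"] is not None:
            perms = set(item["permissions"])      # explicit (unique) permissions
        elif item["parent"] is None:
            perms = set()                         # root with nothing assigned
        else:
            perms = set(resolve(item["parent"]))  # inherited from the parent
        resolved[item_id] = perms
        return perms

    for item_id in items:
        resolve(item_id)
    return resolved
```

Changing the permissions on one folder changes the result for every descendant that inherits from it, which is why a change reported for a single folder still requires propagation across its children.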
Known Limitations in Crawl
SharePoint Pages can be in “draft mode.” This occurs when any user checks out a page to edit. Because the API does not distinguish between content available on a published page vs. a draft page, Glean will only show the title of “checked out” or “draft mode” pages by default.
For OneNote, while Glean attempts to parse the downloaded “Section” content, which includes Pages, there are bugs in the content parsing library that prevent all page content from surfacing in the index
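When adding more than 3 additional apps to the SharePoint Online connector, an administrator may encounter the error “Failed to save changes. Please try again. If this issue persists, contact Glean for support.” See Workaround for Configuring Additional SharePoint Apps.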
Glean cannot crawl SharePoint site URLs that contain special sharing designations. Remove these designations before entering the URLs into inclusions or exclusions:
“:f” means Folder sharing
“:w” means Word document sharing
“:x” means Excel document sharing
“:p” means PowerPoint document sharing
“:b” means PDF document sharing
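As a rough sketch of how such URLs might be cleaned before they are submitted, the helper below removes the designation segment from the URL path. The function name strip_sharing_designation is hypothetical. It also assumes that the designation appears as its own path segment (for example /:f:/), that it may be followed by an /r/ segment which is dropped as well, and that query strings are not needed; verify the result against the real site URL before using it.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Designations listed above: folder, Word, Excel, PowerPoint, PDF.
_DESIGNATION = re.compile(r"^:[fwxpb]:?$")


def strip_sharing_designation(url: str) -> str:
    """Remove a sharing designation segment (and a following 'r' segment) from a URL."""
    parts = urlsplit(url)
    cleaned = []
    after_designation = False
    for segment in parts.path.split("/"):
        if _DESIGNATION.match(segment):
            after_designation = True
            continue
        # Assumption: an 'r' segment right after the designation is part of the
        # sharing link form and not part of the site path.
        if after_designation and segment == "r":
            after_designation = False
            continue
        after_designation = False
        cleaned.append(segment)
    # Query string and fragment are dropped (assumption: not needed for site URLs).
    return urlunsplit((parts.scheme, parts.netloc, "/".join(cleaned), "", ""))


shared = "https://contoso.sharepoint.com/:f:/r/sites/Team?csf=1"
print(strip_sharing_designation(shared))  # https://contoso.sharepoint.com/sites/Team
```

URLs without a designation pass through with their path unchanged, so the helper can be applied to a whole list.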
Unsupported Features
Page-level OneNote support.
Additional SharePoint List type support (xmlForm, powerApps).
SharePoint basic list search results as an aggregate of all list items.
Supporting Purview + Sensitive Labels.
Event receivers and SPFx client-side activity exports
Teams transcripts
Custom facets as customizable search metadata for SharePoint Lists
Additional list types, e.g. “xmlPowerForm”
Content Configuration
For OneDrive, the customer can configure specific users whose content Glean excludes from indexing. For SharePoint, the customer can either configure specific sites to be excluded from Glean's indexing or configure an explicit list of sites to be indexed by Glean. A customer can choose to index both OneDrive and SharePoint content, or only SharePoint content.
Indexing can be limited to a subset of a customer’s employees by creating an O365 group containing only those members. The index will then contain only OneDrive files from these users and SharePoint files viewed or modified by these users.
Item Insights are used for search personalization but can be turned off.
Inclusions (Green-Listing) Options
Glean can green-list sites and drives, so that only the green-listed sites and drives are crawled and indexed. Red-listing and green-listing can be used simultaneously.
Inclusion (Green-listing) SharePoint sites by site URL
Green-listing SharePoint sites can be achieved by providing a list of site URLs to the Glean team. Glean supports:
Green-listing a list of site URLs
Green-listing a file containing a single line of comma-separated URLs (if there are many sites)
URLs should be provided in the following format:
https://<domain>.sharepoint.com/sites/<siteName>
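For customers with many sites, the single-line comma-separated file can be generated instead of written by hand. The sketch below is one way to do that; the function name build_site_list_line is hypothetical, and the trimming, trailing-slash removal, and de-duplication are assumptions about what makes a clean file and not documented Glean requirements. The same URL format is used for exclusions.

```python
import re
from typing import Iterable

# Matches https://<domain>.sharepoint.com/sites/<siteName>
_SITE_URL = re.compile(r"^https://[A-Za-z0-9-]+\.sharepoint\.com/sites/[^/,\s]+$")


def build_site_list_line(urls: Iterable[str]) -> str:
    """Validate site URLs and join them into one comma-separated line."""
    cleaned = []
    for url in urls:
        candidate = url.strip().rstrip("/")
        if not _SITE_URL.match(candidate):
            raise ValueError(f"Not in the expected site URL format: {url!r}")
        if candidate not in cleaned:  # keep first occurrence, preserve order
            cleaned.append(candidate)
    return ",".join(cleaned)


line = build_site_list_line([
    "https://contoso.sharepoint.com/sites/Engineering",
    " https://contoso.sharepoint.com/sites/HR/ ",
    "https://contoso.sharepoint.com/sites/Engineering",
])
print(line)
# https://contoso.sharepoint.com/sites/Engineering,https://contoso.sharepoint.com/sites/HR
```

Rejecting malformed entries up front avoids submitting a list that silently matches nothing.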
Green-listing OneDrive User Drives
Glean can green-list user drives by email. To enable this, provide a single-line file containing comma-separated user email addresses that should be allowed.
Exclusions (Red-listing) Options
Glean can red-list items in OneDrive and SharePoint, which prevents crawling them and removes them from search.
Red-listing a SharePoint site removes all site pages and files and folders in the site drives from Glean.
Red-listing OneDrive user drives by email removes all files and folders in the user’s personal drive (this also applies to OneDrive for Business drives). It does not affect documents created by the user in other drives or SharePoint.
Red-listing SharePoint sites via SharePoint settings
Glean respects the search preferences for sites in SharePoint. Sites can be removed from search by turning off the setting at Site Contents > Site Settings > Search and offline availability > Indexing Site content. An example can be found at this link.
Red-listing SharePoint sites by site URL
Red-listing can also be achieved by providing a list of site URLs to the Glean team. Glean supports:
Red-listing a list of site URLs
Red-listing a file containing a single line of comma-separated URLs (if there are many sites)
URLs should be provided in the following format:
https://<domain>.sharepoint.com/sites/<siteName>
Red-listing OneDrive user drives
Glean can red-list user drives by email. To enable this, provide a single-line file with comma-separated user email addresses that should be disallowed.
Auditing Permissions
Glean does not require any permission beyond read in the SharePoint API; however, due to the limited granularity of the SharePoint API, the least-privileged permission level that provides the required access is Full Control. To demonstrate that Glean only uses read-level permissions, the following steps describe how to audit Glean’s usage in Microsoft Purview.
Steps to Monitor Glean in Purview
Ensure that unified audit logging is enabled in the Microsoft 365 environment.
Search for Application-Specific Activity in Purview:
Open the Microsoft Purview Compliance Portal.
Navigate to Audit > Audit search.
In the Users filter, enter the Application ID or the name of the Glean application (as it appears in Azure AD).
Focus on the following activities:
File accessed: Logs when files or pages are read.
List accessed: Logs list-level operations, such as reading items.
Permissions viewed or changed: Identifies whether Glean is accessing or modifying permissions
Monitor Specific Endpoints:
Since the Glean app uses specific SharePoint REST API endpoints, focus on:
Get site list items (/web/lists('<list_id>')/items)
Get site item permissions (/web/lists('<list_id>')/items('<item_id>')/roleassignments)
Get page content (/web/GetFileById('<id>')/GetLimitedWebPartManager)
Cross-reference the logs for activities related to these endpoints and verify they are read-only
Set Up Alerts for Write Activity:
In the Audit section of Purview, create alert policies for write actions by Glean:
File modified or deleted: Flag if the application writes to or deletes files.
Permission changes: Monitor for unauthorized modifications to roles or permissions.
Use these alerts to trigger immediate reviews.
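In addition to alert policies, exported audit search results can be scanned offline for write activity. The sketch below assumes the export has already been parsed into dictionaries with Operation and UserId fields, and that the application can be matched on a single field; both are assumptions to check against your own export. The function name find_write_events and the identifier "glean-app" are hypothetical, and the set of write operations should be adjusted to the operations that appear in your tenant.

```python
from typing import Dict, FrozenSet, Iterable, List

# Operations treated as writes in this sketch (assumption: extend as needed).
WRITE_OPERATIONS = frozenset(
    {"FileModified", "FileDeleted", "FileUploaded", "FileMoved", "FileRenamed"}
)


def find_write_events(
    records: Iterable[Dict[str, str]],
    app_identifier: str,
    app_field: str = "UserId",  # assumption: field that identifies the app
    write_operations: FrozenSet[str] = WRITE_OPERATIONS,
) -> List[Dict[str, str]]:
    """Return audit records where the given application performed a write."""
    flagged = []
    for record in records:
        if record.get(app_field) != app_identifier:
            continue  # activity by another user or application
        if record.get("Operation") in write_operations:
            flagged.append(record)
    return flagged


records = [
    {"UserId": "glean-app", "Operation": "FileAccessed", "ObjectId": "doc.docx"},
    {"UserId": "glean-app", "Operation": "FileModified", "ObjectId": "plan.xlsx"},
    {"UserId": "alice@contoso.com", "Operation": "FileDeleted", "ObjectId": "old.pdf"},
]
flagged = find_write_events(records, "glean-app")
print([r["Operation"] for r in flagged])  # ['FileModified']
```

An empty result for the Glean application over the audited period is consistent with read-only usage.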
Troubleshooting
Microsoft Tenants Created in 2020 or Later
If the tenant was created in 2020 or later, the following error will be received even after completing the REST API setup in the previous section:
“Unable to fetch O365 SharePoint site groups. Please check that the sharepoint/content/tenant and sharepoint/content/sitecollection scopes are enabled with FullControl for SharePoint REST API.”
To resolve this, custom app authentication must be enabled for the SharePoint tenant by setting DisableCustomAppAuthentication to $false (reference).
Install PowerShell (if it is not already installed).
In PowerShell, install PnP by running
Install-Module -Name PnP.PowerShell
Install-Module -Name Microsoft.Online.SharePoint.PowerShell
Run
Connect-PnPOnline -Url https://<sharepointdomain>-admin.sharepoint.com
Run
Set-PnPTenant -DisableCustomAppAuthentication $false
Reattempt to Save.
Workaround for Configuring Additional SharePoint Apps
Version: 1.0.0
Date: November 10, 2023
Background
When adding more than 3 additional apps to the SharePoint Online connector, an administrator may encounter the error “Failed to save changes. Please try again. If this issue persists, contact Glean for support.” which prevents them from continuing.
As a workaround, these additional apps can still be added using the Advanced app setup interface, which allows the secrets for each additional app to be set manually.
Procedure
Create the client IDs and secrets with the proper permissions as per the original SharePoint instructions. Verify that all of the permissions are correct (there will be 8 Application Permissions) and that the SharePoint REST API privileges have been granted.
Navigate to https://app.glean.com/admin/setup/apps?advanced. A modal should pop up similar to below.
⚠️Caution! Do not use this interface outside the instructions in this document or without guidance by a Glean engineer. Doing so may result in errors within the Glean instance.
For each additional SharePoint app added, repeat the following steps:
Set the Client ID:
Open the Advanced modal above, and select Secret.
For Key name, enter O365_CLIENT_ID_<X-1>, where <X> is the number of the additional app being configured. The key suffix is zero-indexed.
For example, when adding Additional App #3, the Key name will be O365_CLIENT_ID_2. For Additional App #6, the Key name will be O365_CLIENT_ID_5. Etc.
For Key value, paste the Application/Client ID of the additional SharePoint app configured.
Click Submit.
Example of configuring the Application/Client ID for Additional App #3 (O365_CLIENT_ID_2)
Set the Client Secret:
Open the Advanced modal again, and select Secret.
For Key name, enter O365_CLIENT_SECRET_<X-1>, where <X> is the number of the additional app being configured. The key suffix is zero-indexed.
For example, if adding Additional App #3, the Key name will be O365_CLIENT_SECRET_2. For Additional App #6, the Key name will be O365_CLIENT_SECRET_5. Etc.
For Key value, paste the Secret of the additional SharePoint app configured.
Click Submit.
Example of configuring the Client Secret value for Additional App #3 (O365_CLIENT_SECRET_2)
The mapping of the Secret names to the fields from the standard SharePoint setup page is illustrated below.
This image illustrates the mapping of each Additional App field in the standard SharePoint setup workflow to the Client ID and Client Secret keys that will be used.
Advise your Glean engineer of the total number of additional apps that have been onboarded. They will need to enable the usage of the Secrets that have been set in order for them to take effect.
The procedure is now complete. DO NOT modify anything in the SharePoint Setup page as doing so will overwrite the values entered for the secrets.
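Because the key suffix is one less than the additional app number, it is easy to be off by one when entering Key names. The small helper below derives both Key names from the app number, following the naming rule in the procedure above. The function name secret_key_names is hypothetical and exists only for illustration.

```python
from typing import Tuple


def secret_key_names(additional_app_number: int) -> Tuple[str, str]:
    """Return the (client ID key, client secret key) for Additional App #N."""
    if additional_app_number < 1:
        raise ValueError("Additional app numbers start at 1")
    index = additional_app_number - 1  # key suffix is zero-indexed
    return (f"O365_CLIENT_ID_{index}", f"O365_CLIENT_SECRET_{index}")


for number in (3, 6):
    print(number, secret_key_names(number))
# 3 ('O365_CLIENT_ID_2', 'O365_CLIENT_SECRET_2')
# 6 ('O365_CLIENT_ID_5', 'O365_CLIENT_SECRET_5')
```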