Integration Features
Glean currently indexes all content under a single given tenant.
Glean captures the following content from Microsoft OneDrive created, updated, or viewed by users:
Folders
Documents (all document types, e.g. Word, Excel, PowerPoint)
For SharePoint, Glean will capture the following content created, updated, or viewed by users:
Site Pages
Site Drives
API Usage
Glean uses the standard Microsoft Graph API v1.0 (currently through Microsoft Graph API SDK v5.30.0) together with the SharePoint REST API to ingest all data and permissions. We use application permissions with admin-granted access. The Glean app, set up by the tenant administrator, requires the following permissions:
For Identities:
User.Read.All
Group.Read.All
GroupMember.Read.All
For OneDrive/SharePoint:
Directory.Read.All
Files.Read.All
Files.ReadWrite.All (for webhooks)
Reports.Read.All (for ranking signals)
Sites.Read.All
SharePoint permissions as listed in the setup instructions below. These require full control over site collections so that Glean can crawl all SharePoint site content and permissions via REST.
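As a quick sanity check before granting admin consent, the Graph permission lists above can be compared against what an app has actually been granted. The following is a minimal sketch, not part of Glean: the function name missing_permissions and the sample granted list are hypothetical, and the required set is simply copied from the lists above.

```python
# Required Microsoft Graph application permissions, copied from the lists above.
REQUIRED_GRAPH_PERMISSIONS = {
    "User.Read.All",
    "Group.Read.All",
    "GroupMember.Read.All",
    "Directory.Read.All",
    "Files.Read.All",
    "Files.ReadWrite.All",  # for webhooks
    "Reports.Read.All",     # for ranking signals
    "Sites.Read.All",
}


def missing_permissions(granted):
    """Return the required permissions that are absent from `granted`, sorted."""
    return sorted(REQUIRED_GRAPH_PERMISSIONS - set(granted))


if __name__ == "__main__":
    # Hypothetical example: an app that has only been granted three permissions.
    granted = ["User.Read.All", "Files.Read.All", "Sites.Read.All"]
    for name in missing_permissions(granted):
        print("missing:", name)
```

An empty result means every Graph permission listed above is present; the SharePoint REST permissions are granted separately and are not covered by this check.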
Glean follows Microsoft's recommended best practices both for crawling documents and for recording incremental changes to them.
Required Permissions
A tenant administrator (with global admin privileges for both the Azure portal and SharePoint admin) is required to set up several dedicated service applications with the privileges listed above.
Specifically, the tenant administrator must grant:
Admin consent for the Graph API application permissions, and
SharePoint REST API FullControl permissions
Permissions enforcement
All permissions are automatically respected by Glean. Users will only see search results for content they have access to. When a user clicks on a search result, they are taken to the OneDrive/SharePoint web application, which enforces permissions just as it would if the user were to go to OneDrive/SharePoint directly.
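The idea of permission-aware results can be illustrated with a small sketch. This is not Glean's implementation: the Doc class, its fields, and the visible_results function are hypothetical names, and the access model (a document is visible if the user or one of the user's groups is on its access list) is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    # Hypothetical record: a title plus the users and groups allowed to read it.
    title: str
    allowed_users: set = field(default_factory=set)
    allowed_groups: set = field(default_factory=set)


def visible_results(results, user_email, user_groups):
    """Keep only the documents the user can read directly or through a group."""
    groups = set(user_groups)
    return [
        doc for doc in results
        if user_email in doc.allowed_users or groups & doc.allowed_groups
    ]


if __name__ == "__main__":
    docs = [
        Doc("Payroll.xlsx", allowed_groups={"finance"}),
        Doc("Roadmap.docx", allowed_users={"ana@example.com"}),
    ]
    print([d.title for d in visible_results(docs, "ana@example.com", ["eng"])])
```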
Crawling Strategy/Performance
The system uses the User Insights API to identify new, modified, and deleted documents, then performs an incremental fetch to catch up on those changes.
Content/ACL Synchronization
User Insights reports when documents are modified or viewed, including when their permissions change. An incremental crawl runs every 10 minutes and reprocesses these permissions.
Glean also subscribes to webhooks on drives to determine which drives need more frequent incremental crawls to keep their content up to date.
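The incremental approach described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Glean's crawler: the change-record shape (kind, doc_id, payload), the in-memory dictionary standing in for the index, and both function names are hypothetical.

```python
def apply_changes(index, changes):
    """Apply change records in order; each record is (kind, doc_id, payload)."""
    for kind, doc_id, payload in changes:
        if kind == "deleted":
            # Remove the document if present; ignore unknown IDs.
            index.pop(doc_id, None)
        else:
            # "created" or "modified": upsert the latest content and permissions.
            index[doc_id] = payload
    return index


def order_drives(all_drives, notified):
    """Put drives with pending webhook notifications first, keeping relative order."""
    flagged = [d for d in all_drives if d in notified]
    rest = [d for d in all_drives if d not in notified]
    return flagged + rest


if __name__ == "__main__":
    index = {"a": {"acl": ["eng"]}}
    changes = [("modified", "a", {"acl": ["eng", "ops"]}), ("created", "b", {"acl": []})]
    print(apply_changes(index, changes))
    print(order_drives(["drive1", "drive2", "drive3"], {"drive3"}))
```

Processing records in order means a later delete overrides an earlier modification of the same document within one crawl cycle.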
Controls to Redlist/Greenlist Content
For OneDrive, the customer can configure specific users whose content is excluded from indexing by Glean. For SharePoint, the customer can either configure specific sites to be excluded from indexing or configure an explicit list of sites to be indexed. A customer can choose to index both OneDrive and SharePoint content, or only SharePoint content.
A subset of a customer’s employees can be selected by creating an O365 group containing specifically those members. Indexed content will then include only OneDrive files from these users and SharePoint files viewed or modified by these users.
Item Insights are used for search personalization, but can be turned off.
Redlisting
Glean can redlist items in OneDrive and SharePoint, which prevents them from being crawled and removes them from search.
Redlisting a SharePoint site removes all site pages as well as any files and folders in the site drives from Glean.
Redlisting OneDrive user drives by email removes all files and folders in the user’s personal drive (applies to OneDrive for Business drives).
This does not affect documents created by the user in other drives or in SharePoint.
Redlisting SharePoint sites via SharePoint settings
Glean respects the search preferences for sites in SharePoint. Sites can be removed from search by turning off the setting at Site Contents > Site Settings > Search and offline availability > Indexing Site content. An example can be found at this link.
Redlisting SharePoint sites by site URL
Redlisting can also be achieved by providing a list of site URLs to the Glean team. We support:
A list of site URLs
A file containing a single line of comma-separated URLs (if there are many sites)
URLs should be provided in the following format:
https://<domain>.sharepoint.com/sites/<siteName>
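When many sites are involved, it can help to check each URL against the format above before assembling the single-line, comma-separated file. The sketch below is one way to do that. The regular expression, the function name to_single_line, and the decision to strip whitespace and a trailing slash are assumptions for illustration, not Glean requirements.

```python
import re

# Assumed pattern for https://<domain>.sharepoint.com/sites/<siteName>.
SITE_URL = re.compile(r"^https://[A-Za-z0-9-]+\.sharepoint\.com/sites/[^/\s,]+$")


def to_single_line(urls):
    """Validate site URLs and join them into one comma-separated line."""
    cleaned = [u.strip().rstrip("/") for u in urls]
    bad = [u for u in cleaned if not SITE_URL.match(u)]
    if bad:
        raise ValueError(f"Unexpected site URL format: {bad}")
    return ",".join(cleaned)


if __name__ == "__main__":
    sites = [
        "https://contoso.sharepoint.com/sites/HR",
        "https://contoso.sharepoint.com/sites/Engineering/",
    ]
    print(to_single_line(sites))
```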
Redlisting OneDrive user drives
Glean can redlist user drives by email. This can be enabled by providing a single-line file of comma-separated user email addresses which should be disallowed.
Greenlisting
Glean has similar capabilities for greenlisting sites and drives. This allows only the greenlisted sites and drives to be crawled and indexed. Redlisting and greenlisting features can be used simultaneously.
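How the two lists interact can be sketched as a single decision function. The text above says only that both features can be used together; the precedence shown here (a redlisted site is always excluded, and an empty greenlist means no greenlist restriction) is an assumption for illustration and should be confirmed with the Glean team. The function name should_crawl is hypothetical.

```python
def _normalize(url):
    # Compare URLs case-insensitively and without a trailing slash (assumption).
    return url.rstrip("/").lower()


def should_crawl(site_url, greenlist=None, redlist=None):
    """Decide whether a site is crawled, assuming the redlist takes precedence."""
    url = _normalize(site_url)
    green = {_normalize(u) for u in (greenlist or [])}
    red = {_normalize(u) for u in (redlist or [])}
    if url in red:
        return False
    # With no greenlist configured, every site that is not redlisted is crawled.
    return not green or url in green


if __name__ == "__main__":
    green = ["https://contoso.sharepoint.com/sites/HR"]
    print(should_crawl("https://contoso.sharepoint.com/sites/HR", greenlist=green))
    print(should_crawl("https://contoso.sharepoint.com/sites/Legal", greenlist=green))
```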
Greenlisting SharePoint sites by site URL
Greenlisting SharePoint sites can be achieved by providing a list of site URLs to the Glean team. We support:
A list of site URLs
A file containing a single line of comma-separated URLs (if there are many sites)
URLs should be provided in the following format:
https://<domain>.sharepoint.com/sites/<siteName>
Greenlisting OneDrive user drives
Glean can greenlist user drives by email. This can be enabled by providing a single-line file of comma-separated user email addresses which should be allowed.
Setup
Required permissions for setup
The user setting up this data source must be the Global Admin.
Register a new app
Sign into the Azure portal. Select Azure Active Directory, then App registrations > New registration.
On the Register an application page, register an app with the following:
Field | Value |
Name | Glean |
Supported account types | Accounts in this organizational directory only (Single tenant) |
Redirect URI | (Leave this field blank) |
Click Register.
Configure permissions
On the left side navigation on the overview page, click on Manage > API Permissions.
Click Add a permission and select Microsoft Graph. Choose Application permissions and add the following:
User.Read.All
GroupMember.Read.All
Files.Read.All
Files.ReadWrite.All (for webhooks)
Reports.Read.All
Sites.Read.All
Grant admin consent
Ensure you are signed into Azure as a Global, Application or Cloud Application Administrator.
Use the search box to navigate to Enterprise applications. Select the Glean app you just created from the list of applications.
Click on Permissions under Security. Review the permissions shown, and then click Grant admin consent.
Generate secret
Navigate back to Home > App registrations and select the Glean app. Then click on Manage > Certificates & secrets in the left sidebar.
Click on New client secret. Enter a description and select 24 months for expiry time, then click Add.
Under Client secrets, copy the Value (not the Secret ID) you generated and enter it in Glean as the Client secret. The Value will only be shown once.
Fill out keys
Scroll to the top of the left sidebar and click Overview.
Copy the following content from the center Essentials panel and enter it in Glean:
Application (client) ID
Directory (tenant) ID
Enter your SharePoint domain in Glean. Your SharePoint domain should end with "sharepoint.com".
(Strongly Recommended) To increase full crawl indexing speed, Glean recommends creating between 1 and 10 additional applications with the same permission settings as the initial app. Repeat the setup steps from "Register a new app" until this step, saving the client ID and client secret each time. Paste each client ID and client secret into the Glean web app.
Ensure you go through the next step to set up SharePoint REST API permissions, or clicking Save will not succeed.
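For context on how the three values collected above (tenant ID, client ID, client secret) fit together, the sketch below builds a standard OAuth 2.0 client-credentials token request for the Microsoft identity platform. It only constructs the URL and form body and sends nothing. It illustrates the general flow for any app registration and is not a description of Glean's internals; the function name build_token_request and the placeholder values are hypothetical.

```python
from urllib.parse import urlencode


def build_token_request(tenant_id, client_id, client_secret):
    """Return (url, form_body) for a client-credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # ".default" requests the application permissions already consented to.
        "scope": "https://graph.microsoft.com/.default",
    })
    return url, body


if __name__ == "__main__":
    # Placeholder values; substitute the IDs copied from the Essentials panel.
    url, body = build_token_request("<tenant-id>", "<client-id>", "<client-secret>")
    print(url)
```

The body would be sent as application/x-www-form-urlencoded in a POST request. If the request is rejected, the client secret Value (not the Secret ID) is the first thing to recheck.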
SharePoint permissions
Since the Graph API does not support many of our SharePoint use cases (e.g. site page permissions), we need to use the SharePoint REST API. The following steps must be completed for every app created in the previous step.
Navigate to <sharepoint-domain>-admin.sharepoint.com/_layouts/15/appinv.aspx. For example, if you access SharePoint at glean.sharepoint.com, the sharepoint-domain is "glean".
Look up the app using the Client ID from the last step. You can set the App Domain to glean.com and the Redirect URL to https://glean.com.
For Permission Request XML, paste the following:
<AppPermissionRequests AllowAppOnlyPolicy="true">
  <AppPermissionRequest Scope="http://sharepoint/content/tenant" Right="FullControl" />
  <AppPermissionRequest Scope="http://sharepoint/content/sitecollection" Right="FullControl" />
  <AppPermissionRequest Scope="http://sharepoint/content/sitecollection/web" Right="FullControl" />
</AppPermissionRequests>
Repeat for each additional app created from the previous steps.
Click Save in Glean to save the app credentials. You’re all set for the initial application setup.
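When repeating these steps for several apps, two small checks can reduce copy-paste mistakes: deriving the admin page URL from the SharePoint domain, and confirming that the pasted permission XML is well-formed and requests FullControl on the expected scopes. The sketch below uses only the Python standard library. The function names are hypothetical, and the https:// prefix on the admin URL is an assumption.

```python
import xml.etree.ElementTree as ET


def appinv_url(sharepoint_host):
    """Derive the appinv.aspx admin URL from a host such as glean.sharepoint.com."""
    suffix = ".sharepoint.com"
    if not sharepoint_host.endswith(suffix):
        raise ValueError("SharePoint domain should end with sharepoint.com")
    prefix = sharepoint_host[: -len(suffix)]
    return f"https://{prefix}-admin.sharepoint.com/_layouts/15/appinv.aspx"


def full_control_scopes(xml_text):
    """Parse the permission request XML and list the scopes granted FullControl."""
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is malformed
    return [
        req.get("Scope")
        for req in root.findall("AppPermissionRequest")
        if req.get("Right") == "FullControl"
    ]


if __name__ == "__main__":
    print(appinv_url("glean.sharepoint.com"))
```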
Troubleshooting
If the tenant was created in 2020 or later, you will receive the following error even after completing the REST API setup in the previous section:
Unable to fetch O365 Sharepoint site groups. Please check that the sharepoint/content/tenant and sharepoint/content/sitecollection scopes are enabled with FullControl for Sharepoint REST API.
To resolve this, you must enable custom app authentication for your SharePoint tenant by setting DisableCustomAppAuthentication to false (reference).
Install PowerShell (if it is not already installed).
In PowerShell, install the PnP and SharePoint Online modules by running
Install-Module -Name PnP.PowerShell
Install-Module -Name Microsoft.Online.SharePoint.PowerShell
Run
Connect-PnPOnline -Url https://<sharepointdomain>-admin.sharepoint.com
Run
Set-PnPTenant -DisableCustomAppAuthentication $false
Reattempt to Save.
For any questions or issues with this setup, please reach out to support@glean.com.