Glean currently indexes all content under a single given tenant.
Glean captures the following content from Microsoft OneDrive created, updated, or viewed by users:
Documents (all document types, e.g. Word, Excel, PowerPoint)
For SharePoint, Glean will capture the following content created, updated, or viewed by users:
Glean currently uses the Graph API v1.0 to ingest all data and permissions, using the current Microsoft Graph API SDK v5.30.0.
Glean uses the standard Graph API v1.0 and the SharePoint REST API to ingest all data. We use application permissions with admin-granted access. The Glean app, set up by the tenant administrator for OneDrive, requires the following permissions:
Files.ReadWrite.All (for webhooks)
Reports.Read.All (for ranking signals)
SharePoint permissions as listed in the setup instructions below. These require full control over site collections so that Glean can crawl all SharePoint site content and permissions via REST.
Glean follows the best practices recommended by Microsoft both to crawl documents and to record incremental changes to them.
A tenant administrator (with global admin privileges for both the Azure portal and SharePoint admin) is required to set up several dedicated service applications that are granted the privileges listed above.
A tenant administrator for both the Azure portal and SharePoint is necessary to grant:
Admin Consent permissions for the Graph API application permissions, and
SharePoint REST API FullControl permissions
All permissions are automatically respected by Glean. Users only see search results for content they have access to. When a user clicks a search result, they are taken to the OneDrive/SharePoint web application, which enforces permissions exactly as it would if the user had gone to OneDrive/SharePoint directly.
The system uses the User Insights API to identify new, modified, and deleted documents, and performs an incremental fetch to catch up on the changes.
User Insights reports when documents are modified or viewed, including when their permissions change. A crawl that runs every 10 minutes reprocesses these permissions.
Glean also subscribes to webhooks on drives to learn which drives need additional incremental crawls to keep their content up to date.
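As an illustration of what a drive webhook subscription looks like, the sketch below builds the JSON body for a Microsoft Graph subscription on a drive's root. It is not Glean's implementation: the function name, the two-day lifetime, and the example notification URL are assumptions chosen for the example. The payload fields (changeType, notificationUrl, resource, expirationDateTime, clientState) are the ones Microsoft Graph defines for subscriptions, and the body would be sent with a POST to https://graph.microsoft.com/v1.0/subscriptions.

```python
from datetime import datetime, timedelta, timezone


def build_drive_subscription(drive_id, notification_url, client_state,
                             now=None, lifetime=timedelta(days=2)):
    """Build the JSON body for a Graph subscription on a drive's root.

    `now` must be timezone-aware; it defaults to the current UTC time.
    The two-day lifetime is an arbitrary example value.
    """
    now = now or datetime.now(timezone.utc)
    expires = (now + lifetime).astimezone(timezone.utc)
    return {
        # Drive root subscriptions report changes as "updated".
        "changeType": "updated",
        # Hypothetical HTTPS endpoint that receives the notifications.
        "notificationUrl": notification_url,
        "resource": f"/drives/{drive_id}/root",
        "expirationDateTime": expires.strftime("%Y-%m-%dT%H:%M:%SZ"),
        # Opaque value echoed back in each notification for validation.
        "clientState": client_state,
    }


if __name__ == "__main__":
    import json
    body = build_drive_subscription(
        "example-drive-id", "https://example.com/webhook", "example-state")
    print(json.dumps(body, indent=2))
```

A notification only signals that something in the drive changed; the receiver still has to run an incremental crawl of that drive to find out what.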
Controls to Redlist/Greenlist Content
For OneDrive, the customer can configure specific users whose content is excluded from being indexed by Glean. For SharePoint, the customer can either configure specific sites to be excluded from indexing or configure an explicit list of sites to be indexed. A customer can choose to index both OneDrive and SharePoint content, or only SharePoint content.
A subset of a customer’s employees can be selected by creating an O365 group containing exactly those members. The index will then contain only OneDrive files from these users and SharePoint files viewed or modified by these users.
Item Insights are used for search personalization, but can be turned off.
Glean can redlist items in OneDrive and SharePoint, which prevents them from being crawled and removes them from search.
Redlisting a SharePoint site removes all site pages, as well as any files and folders in the site drives, from Glean.
Redlisting OneDrive user drives by email removes all files and folders in the user’s personal drive (applies to OneDrive for Business drives).
This does not affect documents created by the user in other drives or in SharePoint.
Redlisting SharePoint sites via SharePoint settings
Glean respects the search preferences for sites in SharePoint. Sites can be removed from search by turning off the setting at Site Contents > Site Settings > Search and offline availability > Indexing Site content. An example can be found at this link.
Redlisting SharePoint sites by site URL
Redlisting can also be achieved by providing a list of site URLs to the Glean team. We support:
Redlisting a list of site URLs
Redlisting a file containing a single line of comma-separated URLs (if there are many sites)
URLs should be provided in the following format:
Redlisting OneDrive user drives
Glean can redlist user drives by email. This can be enabled by providing a single-line file of comma-separated user email addresses which should be disallowed.
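The single-line, comma-separated file described above can be produced and checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and trimming whitespace and dropping case-insensitive duplicates are assumptions made for the example, not documented Glean requirements.

```python
def to_single_line(values):
    """Join values into one comma-separated line.

    Blank entries are skipped and duplicates (compared case-insensitively)
    are dropped, keeping the first spelling seen.
    """
    cleaned = []
    seen = set()
    for value in values:
        item = value.strip()
        if not item:
            continue
        if "," in item or "\n" in item:
            raise ValueError(f"entry cannot contain a comma or newline: {item!r}")
        key = item.lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(item)
    return ",".join(cleaned)


def parse_single_line(text):
    """Split a single comma-separated line back into a list of entries."""
    return [part.strip() for part in text.strip().split(",") if part.strip()]


if __name__ == "__main__":
    # Placeholder addresses for illustration only.
    line = to_single_line(["alice@example.com", " bob@example.com ", ""])
    print(line)
    print(parse_single_line(line))
```

The same helpers work for a list of site URLs, since both files share the single-line, comma-separated layout.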
Glean has similar capabilities for greenlisting sites and drives. This allows only the greenlisted sites and drives to be crawled and indexed. Redlisting and greenlisting features can be used simultaneously.
Greenlisting SharePoint sites by site URL
Greenlisting SharePoint sites can be achieved by providing a list of site URLs to the Glean team. We support:
Greenlisting a list of site URLs
Greenlisting a file containing a single line of comma-separated URLs (if there are many sites)
URLs should be provided in the following format:
Greenlisting OneDrive user drives
Glean can greenlist user drives by email. This can be enabled by providing a single-line file of comma-separated user email addresses which should be allowed.
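Because redlists and greenlists can be used at the same time, it can help to reason about how the two interact. The sketch below is one plausible interpretation, not Glean's documented behavior: it assumes a redlist entry always wins, that an empty greenlist means no restriction, and that URLs are compared after trimming a trailing slash and lowercasing. The function names and the example URLs are hypothetical.

```python
def normalize_site_url(url):
    """Normalize a site URL for comparison (assumed rule: trim, drop trailing slash, lowercase)."""
    return url.strip().rstrip("/").lower()


def should_index(site_url, redlist=(), greenlist=()):
    """Decide whether a site would be indexed under the assumed precedence rules."""
    site = normalize_site_url(site_url)
    red = {normalize_site_url(u) for u in redlist}
    green = {normalize_site_url(u) for u in greenlist}
    if site in red:
        # Assumption: an explicit redlist entry always excludes the site.
        return False
    if green:
        # Assumption: a non-empty greenlist restricts indexing to its entries.
        return site in green
    return True


if __name__ == "__main__":
    green = ["https://contoso.sharepoint.com/sites/Engineering"]
    red = ["https://contoso.sharepoint.com/sites/HR"]
    for url in green + red + ["https://contoso.sharepoint.com/sites/Sales"]:
        print(url, should_index(url, redlist=red, greenlist=green))
```

Confirm the actual precedence with the Glean team before relying on a combined configuration.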
Register a new app
Sign into the Azure portal. Select Azure Active Directory, then App registrations > New registration.
On the Register an application page, register an app with the following:
Supported account types
Accounts in this organizational directory only (Single tenant)
(Leave this field blank)
On the left side navigation on the overview page, click on Manage > API Permissions.
Click Add a permission and select Microsoft Graph. Choose Application permissions and add the following:
(for ranking signals)
Grant admin consent
Ensure you are signed into Azure as a Global, Application or Cloud Application Administrator.
Use the search box to navigate to Enterprise applications. Select the Glean app you just created from the list of applications.
Click on Permissions under Security. Review the permissions shown, and then click Grant admin consent.
Navigate back to Home > App Registration. Click on Certificates & secrets in the left sidebar.
Click on New client secret. Enter a description and select 24 months for expiry time, then click Add.
Under Client secrets, copy the Value (not the Secret ID) you generated and enter it in Glean. The Value will only be shown once.
Scroll to the top of the left sidebar and click Overview.
Copy the following content from the center Essentials panel and enter it in Glean:
Application (client) ID
Directory (tenant) ID
Click Save in Glean. You’re all set for the initial application setup.
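To sanity-check the three values collected above (tenant ID, client ID, and client secret value), one option is an OAuth 2.0 client credentials request against the Microsoft identity platform token endpoint. The sketch below only builds the request; the function name is hypothetical and the IDs shown are placeholders. Sending it with urllib.request.urlopen should return a JSON body containing an access_token if the app registration and secret are valid.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_token_request(tenant_id, client_id, client_secret):
    """Build (but do not send) a client credentials token request for Microsoft Graph."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        # ".default" requests the application permissions already consented to.
        "scope": "https://graph.microsoft.com/.default",
    }).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder values; substitute the Directory (tenant) ID,
    # Application (client) ID, and client secret Value from the portal.
    request = build_token_request("your-tenant-id", "your-client-id", "your-secret")
    print(request.get_method(), request.full_url)
```

Treat the client secret like a password: avoid committing it to source control or pasting it into shared logs.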
For the full crawl to run properly, Glean recommends creating between 1 and 10 additional applications with the same permission settings as the initial app. Repeat the setup steps from "Register a new app" up to this step for each one, saving the client ID and client secret each time. Paste each client ID and client secret into the Glean web app.
Since the Graph API does not support many of our SharePoint use cases (e.g. site page permissions), we need to use the SharePoint REST API. The following steps must be completed for every app from the previous step.
Navigate to <sharepoint-domain>-admin.sharepoint.com/_layouts/15/appinv.aspx. For example, if you access SharePoint at glean.sharepoint.com, the sharepoint-domain is "glean".
Look up the app using the Client ID from the last step. You can set the App Domain and Redirect URL to glean.com and https://glean.com, respectively.
For Permission Request XML, paste the following:
<AppPermissionRequests AllowAppOnlyPolicy="true">
  <AppPermissionRequest Scope="http://sharepoint/content/tenant" Right="FullControl" />
  <AppPermissionRequest Scope="http://sharepoint/content/sitecollection" Right="FullControl" />
  <AppPermissionRequest Scope="http://sharepoint/content/sitecollection/web" Right="FullControl" />
</AppPermissionRequests>
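Since this step is repeated for every app, a small helper can reduce copy-and-paste mistakes. The sketch below derives the appinv.aspx address from the host you use to access SharePoint and confirms that the permission XML is well formed before you paste it. The function names are hypothetical, and prefixing the address with https:// is an assumption.

```python
import xml.etree.ElementTree as ET

PERMISSION_XML = """\
<AppPermissionRequests AllowAppOnlyPolicy="true">
  <AppPermissionRequest Scope="http://sharepoint/content/tenant" Right="FullControl" />
  <AppPermissionRequest Scope="http://sharepoint/content/sitecollection" Right="FullControl" />
  <AppPermissionRequest Scope="http://sharepoint/content/sitecollection/web" Right="FullControl" />
</AppPermissionRequests>"""


def appinv_url(sharepoint_host):
    """Turn a host such as 'glean.sharepoint.com' into its admin appinv.aspx URL."""
    host = sharepoint_host.strip().lower()
    suffix = ".sharepoint.com"
    if not host.endswith(suffix):
        raise ValueError(f"expected a host ending in {suffix}: {sharepoint_host!r}")
    domain = host[: -len(suffix)]
    if not domain or "." in domain:
        raise ValueError(f"could not determine the SharePoint domain: {sharepoint_host!r}")
    return f"https://{domain}-admin.sharepoint.com/_layouts/15/appinv.aspx"


def requested_scopes(xml_text):
    """Parse the permission XML and return (Scope, Right) pairs; raises if malformed."""
    root = ET.fromstring(xml_text)
    return [(req.get("Scope"), req.get("Right"))
            for req in root.findall("AppPermissionRequest")]


if __name__ == "__main__":
    print(appinv_url("glean.sharepoint.com"))
    for scope, right in requested_scopes(PERMISSION_XML):
        print(scope, right)
```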
For any questions or issues with this setup, please reach out to email@example.com.