Introduction
The Confluence connector for Glean allows Glean to fetch and index content from Confluence, ensuring that users can search and access the documents they are authorized to view.
Authentication: Glean requires the Confluence admin to authenticate to Glean when setting up the Glean crawler app and Forge App.
Data Storage: All data is stored in the cloud project within the customer's cloud account (Glean or customer hosted), ensuring no data leaves the customer's environment.
API Usage
Standard API: Glean uses Atlassian’s standard REST API for Confluence to ingest all data.
Integration Features
Content Captured: Glean captures Confluence pages, blog posts, metadata, attachments, comments, and more.
Permissions Enforcement: Glean respects all user access permissions, ensuring users only see search results for documents they can access. When a user clicks on a search result, they are taken to the Confluence web application, which enforces the permission.
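As an illustration of the permissions enforcement described above, the following minimal Python sketch filters search results against a user's identity and group memberships at query time. The IndexedDoc type, its fields, and the visible_results function are hypothetical names chosen for this example; they are not Glean's actual data model or API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexedDoc:
    """Hypothetical indexed document with the principals allowed to view it."""
    title: str
    allowed_users: frozenset = frozenset()
    allowed_groups: frozenset = frozenset()


def visible_results(results, user, user_groups):
    """Keep only documents the user may view, directly or through a group."""
    groups = frozenset(user_groups)
    return [
        doc for doc in results
        if user in doc.allowed_users or groups & doc.allowed_groups
    ]


docs = [
    IndexedDoc("Roadmap", allowed_groups=frozenset({"product"})),
    IndexedDoc("Payroll", allowed_users=frozenset({"dana"})),
]
# "sam" belongs to the "product" group, so only the Roadmap page is returned.
titles = [d.title for d in visible_results(docs, "sam", ["product"])]
```

In this sketch, filtering happens on every query, so a permission change takes effect as soon as the stored access lists are updated.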
Versions Supported
The Confluence Cloud connector, which targets Atlassian’s SaaS offering of Confluence in the cloud, has no specific version limitations.
Glean also supports Confluence Data Center, a customer-managed deployment (not SaaS), through a different Glean connector with separate documentation.
Objects Supported
The Confluence connector for Glean supports the following objects:
Pages: This includes all the pages within Confluence that are indexed and searchable.
Blog Posts: Blog posts created within Confluence are also captured.
Attachments: Any attachments associated with pages and blog posts are indexed.
Comments: Comments on both pages and blog posts are included in the indexing.
Restricted Pages: Additional user setup is required.
The connector ensures comprehensive data coverage, including metadata, identity data, permissions data, and activity data. It synchronizes in near real time, so updates and permission changes are reflected in search results shortly after they occur.
Authentication Mechanism
Glean requires authentication to the Atlassian instance to fetch relevant information from Confluence.
For Confluence, the Atlassian admin needs to install Glean’s Forge App to the instance.
Glean understands all user access permissions and strictly enforces permissions at the time of the query, ensuring users cannot see results to which access is not granted.
It’s important to note that all data is stored in the customer’s Glean project inside the customer’s cloud account, and no data leaves the customer's environment.
Connector credentials requirements
Installation and Setup Permissions:
For Confluence Cloud, the Atlassian admin needs to install Glean’s Forge App on the instance. The Admin scope is required to fetch permissions associated with Confluence objects, which is necessary for correctly enforcing permissions in the search experience.
Why Read-Only Permissions are Insufficient
While “read” permissions give the connector access to all content, the write:space.permission:confluence scope is needed so that the Glean crawler can be added automatically to all Confluence spaces during the initial installation. Without this scope, the administrator must manually add the Glean crawler to every space.
Connection instructions
Required permissions for setup
The user setting up this data source must have administrator permissions.
Set up the basics
Sign in to Confluence as an admin. Copy your Atlassian domain from the URL bar and paste it into Glean: https://YourAtlassianDomain.atlassian.net
Select the organization matching your Atlassian domain from the previous step.
Find the row for Confluence, click on the three dots, and select Manage Product Access.
Enter the default groups (there might be only one) as a comma-separated list in Glean. Only users in the provided product access groups will be able to see results in Glean.
Click Create Forge Crawler App in Glean. This should create an installation link for the Glean crawler app.
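The two values entered in the steps above (the Atlassian domain and the comma-separated list of default groups) can be checked before pasting them into Glean. The sketch below is a minimal example of such a check; the function names and the validation rules are assumptions made for illustration and are not part of the Glean setup UI.

```python
from urllib.parse import urlparse


def atlassian_base_url(url):
    """Reduce a URL copied from the browser to https://<site>.atlassian.net."""
    parsed = urlparse(url if "://" in url else "https://" + url)
    host = parsed.hostname or ""
    if not host.endswith(".atlassian.net"):
        raise ValueError(f"not an Atlassian Cloud domain: {url!r}")
    return f"https://{host}"


def parse_default_groups(raw):
    """Split a comma-separated group list, dropping blanks and extra spaces."""
    return [name.strip() for name in raw.split(",") if name.strip()]


base = atlassian_base_url("https://example.atlassian.net/wiki/spaces/ENG")
groups = parse_default_groups("confluence-users, site-admins ,")
```

Here example.atlassian.net and the group names are placeholders; use the values shown in your own Atlassian admin console.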
Connect the Forge Crawler app
As a Confluence admin, open the Forge Crawler app installation link from the Glean setup page.
Click on Get app and install the app in the correct Confluence instance.
After the app installation is successful, click Save in Glean. You’re all set!
Setup in Glean
Input the data source name in the Name text box and select an icon.
Complete any outstanding setup in Show setup instructions.
Input the following information:
Input the Confluence domain name in the Your Atlassian domain name text box.
Input the default access group in the Default access group name text box.
The Webhook URL is used for webhook setup when granting Glean access and setting up the Glean app from Atlassian’s Marketplace.
Click Save
Optional: Configuring Glean search for Confluence to crawl restricted pages
By default, the Glean connector for Confluence is configured to access all Confluence spaces and pages except Restricted Pages. Atlassian admins cannot view restricted pages unless they are given explicit access.
Users may want content from Confluence Restricted Pages to be included in search results. Glean can crawl and index restricted pages in a permissions-enforced way. This involves giving the “Scio Search for Confluence” app view access to the pages to be indexed. The following is an overview of the procedure:
Users must have:
Edit access to a set of Restricted Pages in Confluence
“Add/Delete Restrictions” permissions for the space
Users must then:
Create an API token through the Atlassian settings workflow
Upload that API token into their Glean application settings, where the token is stored securely
Glean then securely reads the token and adds the “Glean Search Crawler for Confluence” application, as view-only, to the restricted pages that the user has edit access to.
Glean can then crawl and index those pages with the “Glean Search Crawler for Confluence” app.
Users with edit or view access to the Restricted Pages can view those pages in Glean search results.
Multiple users can upload their API tokens. For each such user, Glean will add the Glean Search Crawler For Confluence app to the view restrictions of restricted pages that the user can edit.
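The procedure above can be sketched as a small Python model: for every restricted page that a token-uploading user can edit, the crawler app is added to the page's view restrictions. The in-memory page dictionaries and the grant_crawler_view_access function are hypothetical and only illustrate the logic; they do not call Confluence or Glean.

```python
CRAWLER_APP = "Glean Search Crawler for Confluence"


def grant_crawler_view_access(pages, token_owners):
    """Add the crawler app as a viewer on pages a token owner can edit.

    Returns the ids of the pages whose view restrictions were changed.
    """
    owners = set(token_owners)
    updated = []
    for page in pages:
        has_editing_owner = bool(owners & page["editors"])
        if has_editing_owner and CRAWLER_APP not in page["viewers"]:
            page["viewers"].add(CRAWLER_APP)
            updated.append(page["id"])
    return updated


pages = [
    {"id": "p1", "editors": {"ana"}, "viewers": {"ana", "ben"}},
    {"id": "p2", "editors": {"cy"}, "viewers": {"cy"}},
]
# Only "ana" has uploaded a token, so only page p1 becomes crawlable.
updated = grant_crawler_view_access(pages, ["ana"])
```

Because page p2 has no editor who uploaded a token, it stays invisible to the crawler until such a user uploads one.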
Create Confluence API Token
Any user who has edit access to Restricted Pages and needs those pages crawled and indexed in Glean must do the following:
Log in to their Atlassian account
Select [Create API Token]
Enter a name for the token, for example “Glean Search Crawler”
OAuth Flow for Individual Users
Individual users must authorize Glean in the UI by clicking their profile picture (bottom left corner) → Your settings → Data sources → Confluence Cloud.
Items crawled
Content
For Confluence, Glean crawls the following content:
Spaces
Pages
Blogs
Comments - from both Pages and Blogs
Attachments metadata
Identity
Users: Information about users
Groups: Details about groups within the domain.
Memberships: Information about group memberships, indicating which users belong to which groups.
The identity crawl operates with the following configurations:
Incremental Identity Crawls: These are performed to capture changes since the last crawl.
Full Identity Crawls: These are conducted periodically to ensure all identity data is up-to-date.
Activity
Glean crawls the following activities on content (spaces, pages, blogs, and so on) to keep the index current:
Adds: New content, spaces, pages, blogs, files, or folders added.
Updates: Modifications to existing content, spaces, pages, blogs, files, or folders.
Permissions Changes: Changes in the sharing permissions of content, spaces, pages, blogs, files, or folders.
Deletions: Content, spaces, pages, blogs, files, or folders that have been deleted.
View Activity: Events indicating when content, spaces, pages, blogs, files, or folders have been viewed.
The Glean Activity plugin for Confluence helps Glean provide personalized search results for users. By sending webhook events to Glean each time a user views a page, blog, or other piece of content, the plugin enables your Glean instance to gather the view activity it uses to rank results. This information is stored securely in your dedicated cloud project (Glean or customer-hosted), which keeps your data private and protected.
Rate Limits
Queries per Second (QPS): The default rate limit is set to 20 queries per second per user.
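A limit of this kind is commonly enforced with a token bucket. The sketch below shows one way a client could stay under 20 queries per second; it is a generic illustration, not Glean's implementation, and the TokenBucket class and its parameters are assumptions. The clock value is passed in explicitly so the behavior is easy to follow.

```python
class TokenBucket:
    """Generic token bucket: refills at `rate` tokens per second up to `capacity`."""

    def __init__(self, rate=20.0, capacity=20.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = 0.0

    def allow(self, now):
        """Return True if a request at time `now` (seconds) may proceed."""
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket()
# 25 requests arrive in the same instant; only the first 20 are allowed.
allowed_at_start = sum(bucket.allow(0.0) for _ in range(25))
# Half a second later, 10 tokens have been refilled.
allowed_later = sum(bucket.allow(0.5) for _ in range(12))
```

To apply the limit per user, one bucket could be kept per user id, for example in a dictionary.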
Update frequency
Content updates for the Confluence connector in Glean can happen quite rapidly, depending on the type of update and the configuration settings. Here are the key areas:
People / Identity Crawls: Changes to group memberships are picked up by the identity crawl, which runs every 10 minutes. This ensures that updates to user groups and their permissions are reflected promptly.
Incremental Crawls: These occur every hour to provide additional reliability beyond the minute-by-minute activity reports.
Full Crawls: The frequency of full crawls can be configured, but they are generally less frequent than incremental crawls, typically every 7 days.
Changes in data must be crawled, processed, and indexed before the data is reflected in the UI. Actual time may vary depending on the number of changes and corpus size. For the most up-to-date crawler refresh information, please refer to [External] Glean crawling strategy.
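The intervals listed above can be summarized in a small scheduling sketch. The CRAWL_INTERVALS table uses the values stated in this section (10 minutes, 1 hour, 7 days); the due_crawls function and the dictionary layout are hypothetical and do not reflect how Glean's scheduler is actually configured.

```python
from datetime import datetime, timedelta

# Intervals taken from the text above; the structure itself is illustrative.
CRAWL_INTERVALS = {
    "identity": timedelta(minutes=10),
    "incremental": timedelta(hours=1),
    "full": timedelta(days=7),
}


def due_crawls(last_run, now):
    """Return the crawl types whose interval has elapsed since their last run."""
    return [
        name for name, interval in CRAWL_INTERVALS.items()
        if now - last_run[name] >= interval
    ]


now = datetime(2024, 1, 8, 12, 0)
last_run = {
    "identity": now - timedelta(minutes=15),
    "incremental": now - timedelta(minutes=30),
    "full": now - timedelta(days=7),
}
due = due_crawls(last_run, now)
```

In this example the identity and full crawls are due, while the incremental crawl ran only 30 minutes ago and is skipped.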
How the crawl works
The crawler follows the traditional crawler strategy, using the API and the following mechanisms to get and update data:
Identity Crawl: Adds and updates People data, including users, groups, and other information.
Webhooks: Messages sent by the application to notify Glean of changes in real time. Glean then either initiates a crawl or picks up the change on the next crawl.
Content Crawls: Full crawls cover the entire defined scope of the application, whereas incremental crawls capture only the changes since the previous full or incremental crawl.
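The difference between full and incremental content crawls can be shown with a toy example. The sketch below assumes each item carries a last-modified timestamp; the item ids, the integer timestamps, and both function names are invented for illustration.

```python
def full_crawl(items):
    """A full crawl visits every item in scope."""
    return sorted(items)


def incremental_crawl(items, last_crawl_time):
    """An incremental crawl visits only items modified since the last crawl."""
    return sorted(
        item_id for item_id, modified in items.items()
        if modified > last_crawl_time
    )


# Hypothetical item ids mapped to last-modified timestamps.
items = {"page-1": 100, "page-2": 250, "blog-1": 300}
everything = full_crawl(items)
changed = incremental_crawl(items, last_crawl_time=200)
```

With the previous crawl at time 200, the incremental crawl picks up only blog-1 and page-2, while the full crawl revisits all three items.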
Known Limitations in Crawl
The Confluence connector for Glean has the following known limitations in its crawling process:
The Glean app can read all unrestricted pages in the Confluence spaces. However, Glean can only read restricted pages if the admin grants access to the app for them.
Blogpost Hierarchy: In Confluence Cloud and Confluence Server, blog posts do not have a hierarchical structure, so Glean fetches them with a normal list-all-content-ids REST API call. Additionally, Glean does not support databases, whiteboards, smart links, or other custom content.
Unsupported objects include:
Archived Pages
These limitations highlight the constraints and ongoing improvements for the Confluence connector, ensuring better performance and user experience.
API endpoints
Purpose | Cloud Endpoint | Cloud Permission | OAuth 2.0 scopes required & recommended | Cloud Scope |
List users | | Exempt from app access rules | read:content-details:confluence | READ |
List groups | | Permission to access the Confluence site ('Can use' global permission). | read:confluence-groups | READ |
List group members | | Permission to access the Confluence site ('Can use' global permission). | read:confluence-space.summary | READ |
List groups of user | | Permission to access the Confluence site ('Can use' global permission). | read:confluence-user | READ |
Get current user | | Permission to access the Confluence site ('Can use' global permission). | read:confluence-user | READ |
Get email of users | | Permission to access the Confluence site ('Can use' global permission). | read:email-address:confluence | ADMIN |
List spaces | | Permission to access the Confluence site ('Can use' global permission). Note, the returned list will only contain spaces that the current user has permission to view. | read:confluence-space.summary | SPACE_ADMIN |
CQL-based list spaces | | Permission to view the entities. Note, only entities that the user has permission to view will be returned. | search:confluence | READ |
List pages in space | | 'View' permission for the space. Note, the returned list will only contain content that the current user has permission to view. | read:confluence-content.summary | READ |
List blogposts in space | | 'View' permission for the space. Note, the returned list will only contain content that the current user has permission to view. | read:confluence-content.summary | READ |
Get space permissions | | 'View' permission for the space. | read:confluence-space.summary | READ |
List content | | Permission to access the Confluence site ('Can use' global permission). Only content that the user has permission to view will be returned. | read:confluence-content.summary | READ |
Get content | | Permission to access the Confluence site ('Can use' global permission). Only content that the user has permission to view will be returned. | read:confluence-content.summary | READ |
CQL-based list content | | Permission to access the Confluence site ('Can use' global permission). Only content that the user has permission to view will be returned. | search:confluence | READ |
List children of page | | 'View' permission for the space, and permission to view the content if it is a page. | read:confluence-content.summary | READ |
Get content restrictions | | Permission to view the content. | read:confluence-content.all | READ |
Update content restriction | | Permission to edit the content. | write:confluence-content | READ |
Fetch applinks | N/A | | | |
Create webhook | N/A | | | |
Configure plugin | N/A | | | |
Get installed plugin version | N/A | | | |
Get space permissions via plugin | N/A | | | |
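As an example of the CQL-based rows above, the sketch below builds a request URL for Atlassian's Confluence Cloud content search endpoint (/wiki/rest/api/content/search), which accepts a cql query parameter and requires the search:confluence scope. The table does not state which endpoint paths Glean calls, so the use of this path, the helper name, and the sample site and space key are assumptions. The code only constructs the URL and sends no request.

```python
from urllib.parse import urlencode


def cql_search_url(base_url, cql, limit=25):
    """Build a Confluence Cloud content-search URL for a CQL query."""
    query = urlencode({"cql": cql, "limit": limit})
    return f"{base_url.rstrip('/')}/wiki/rest/api/content/search?{query}"


# Placeholder site and space key; replace with your own values.
url = cql_search_url(
    "https://example.atlassian.net",
    'type = page AND space = "ENG"',
)
```

An authenticated GET to such a URL returns only the content that the calling user or app is permitted to view, which matches the permission notes in the table.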
Content Configuration
Note: If Inclusion (Green-Listing) options are enabled, only content from the Inclusion category will be indexed. If Exclusion (Red-Listing) options are enabled, all content in the exclusion category will be removed. If both rules are applied to the same content, then the content will NOT be indexed, as exclusion rules take priority.
The rules below should be used MINIMALLY to preserve the enterprise search experience, as most end-users expect to find all content. Most customers do not apply any rules or apply exclusion rules sparingly for sensitive folders. Exclusion rules are applied automatically after the next full crawl, which can vary by corpus size. If a recrawl is needed, please reach out to your Glean representative.
Exclusion (Red-Listing) Options
Glean provides several options for excluding content from the data crawl, which excludes data from search and chat results.
Space: Exclude certain Confluence spaces from being crawled by Glean by specifying space keys.
Pages with specific labels: Exclude pages and blog posts with specific labels from being crawled by Glean.
Pages with content matching specific regex: Exclude pages and blog posts with content matching specific regex from being crawled by Glean.
Creators: Exclude content created by certain creators from being crawled by Glean.
Inclusion (Green-Listing) Options
Glean provides several options for including content in the data crawl, which makes that content available in search and chat results.
Spaces: Only allow Glean to crawl certain Confluence spaces. Glean will crawl all spaces except those in the Exclusion rules if no spaces are specified.
Note: Only content specified for inclusion will appear in search results, chat, or any other Glean applications. Unspecified content will not appear in any of them.
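The interaction between inclusion and exclusion rules described in this section can be sketched as a single decision function. The CrawlRules fields mirror the options listed above, but the class, the document dictionary layout, and should_index are hypothetical names for this example and are not Glean configuration keys.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CrawlRules:
    """Illustrative rule set; empty include_spaces means 'all spaces'."""
    include_spaces: set = field(default_factory=set)
    exclude_spaces: set = field(default_factory=set)
    exclude_labels: set = field(default_factory=set)
    exclude_creators: set = field(default_factory=set)
    exclude_content_regex: Optional[str] = None


def should_index(doc, rules):
    """Exclusion rules win; otherwise apply the inclusion list if one is set."""
    excluded = (
        doc["space"] in rules.exclude_spaces
        or bool(rules.exclude_labels & set(doc["labels"]))
        or doc["creator"] in rules.exclude_creators
        or (
            rules.exclude_content_regex is not None
            and re.search(rules.exclude_content_regex, doc["body"]) is not None
        )
    )
    if excluded:
        return False
    return not rules.include_spaces or doc["space"] in rules.include_spaces


rules = CrawlRules(include_spaces={"ENG", "HR"}, exclude_labels={"confidential"})
doc = {"space": "HR", "labels": ["confidential"], "creator": "ana", "body": "Salary bands"}
# The page is in an included space but carries an excluded label, so it is skipped.
indexed = should_index(doc, rules)
```

The example shows the priority rule from the note above: a page matched by both an inclusion rule and an exclusion rule is not indexed.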