Introduction

The Confluence connector for Glean allows Glean to fetch and index content from Confluence, ensuring that users can search and access documents for which they have authorized permissions.

Authentication: Glean requires the Confluence admin to authenticate to Glean when setting up the Glean crawler app and Forge App.
Data Storage: All data is stored in the cloud project within the customer's cloud account (Glean or customer hosted), ensuring no data leaves the customer's environment.

API Usage

Standard API: Glean uses Atlassian’s standard REST API for Confluence to ingest all data.

Integration Features

Content Captured: Glean captures Confluence pages, blog posts, metadata attachments, comments, and more.
Permissions Enforcement: Glean respects all user access permissions, ensuring users only see search results for documents they can access. When a user clicks on a search result, they are taken to the Confluence web application, which enforces the permission.

Versions Supported

The Confluence Cloud connector supports Confluence Cloud, Atlassian’s SaaS offering of Confluence, and has no specific version limitations.

Glean also supports Confluence Data Center, a customer-managed deployment (not SaaS), through a different Glean connector that has separate documentation.

Objects Supported

The Confluence connector for Glean supports the following objects:

Pages: This includes all the pages within Confluence that are indexed and searchable.
Blog Posts: Blog posts created within Confluence are also captured.
Attachments: Any attachments associated with pages and blog posts are indexed.
Comments: Comments on both pages and blog posts are included in the indexing.
Restricted Pages: Additional user setup is required.

The connector ensures comprehensive data coverage, including metadata, identity data, permissions data, and activity data. Webhooks and frequent crawls keep the index closely synchronized, so updates and permission changes are reflected in search results promptly.

Authentication Mechanism

Glean requires authentication to the Atlassian instance to fetch relevant information from Confluence.
For Confluence, the Atlassian admin needs to install Glean’s Forge App to the instance.
Glean understands all user access permissions and strictly enforces permissions at the time of the query, ensuring users cannot see results to which access is not granted.
It’s important to note that all data is stored in the customer’s Glean project inside the customer’s cloud account, and no data leaves the customer's environment.
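As a rough illustration of what enforcing permissions at query time means, the sketch below filters search hits against a user's identity and group memberships. The IndexedDoc type, its fields, and the visible_results function are hypothetical names chosen for this example; the source does not describe Glean's internal data model.

```python
from dataclasses import dataclass, field


@dataclass
class IndexedDoc:
    """Hypothetical index entry that carries the permissions crawled with it."""
    title: str
    allowed_users: set[str] = field(default_factory=set)
    allowed_groups: set[str] = field(default_factory=set)


def visible_results(hits: list[IndexedDoc], user: str, user_groups: set[str]) -> list[IndexedDoc]:
    """Keep only the hits the user may see, directly or through a group."""
    return [
        doc for doc in hits
        if user in doc.allowed_users or doc.allowed_groups & user_groups
    ]


hits = [
    IndexedDoc("Roadmap", allowed_groups={"engineering"}),
    IndexedDoc("Salary bands", allowed_users={"hr-lead@example.com"}),
]
titles = [d.title for d in visible_results(hits, "dev@example.com", {"engineering"})]
```

Because the filter runs on every query, a permission change picked up by a crawl takes effect on the next search without re-indexing the document body.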

Connector credentials requirements

Installation and Setup Permissions:

For Confluence Cloud, the Atlassian admin needs to install Glean’s Forge App on the instance. The Admin scope is required to fetch permissions associated with Confluence objects, which is necessary for correctly enforcing permissions in the search experience.

Why Read-Only Permissions are Insufficient

While “read” permissions give the connector access to all content, the write:space.permission:confluence scope is needed so that the Glean crawler can be added to all Confluence spaces automatically during the initial installation. Without this scope, the administrator must manually add the Glean crawler to every space.

Connection instructions

Required permissions for setup

The user setting up this data source must have administrator permissions.

Set up the basics

Sign in to Confluence as an admin. Copy your Atlassian domain from the URL bar and paste it into Glean: https://YourAtlassianDomain.atlassian.net
Go to https://admin.atlassian.com
Select the organization matching your Atlassian domain from the previous step.
Find the row for Confluence, click on the three dots, and select Manage Product Access.
Enter the default groups (there might be only one) as a comma-separated list in Glean. Only users in the provided product access groups will be able to see results in Glean.
Click Create Forge Crawler App in Glean. This creates an installation link for the Glean crawler app.
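The two values entered in these steps (the Atlassian domain and the comma-separated group list) can be sanity-checked before they are pasted into Glean. The following is a minimal sketch; the function names and the sample group names are hypothetical, and Glean's own validation may differ.

```python
from urllib.parse import urlparse


def atlassian_base_url(value: str) -> str:
    """Return https://<site>.atlassian.net for a pasted URL or bare host name."""
    candidate = value.strip()
    if "://" not in candidate:
        candidate = "https://" + candidate
    host = (urlparse(candidate).hostname or "").lower()
    if not host.endswith(".atlassian.net") or host == ".atlassian.net":
        raise ValueError(f"not an Atlassian Cloud domain: {value!r}")
    return f"https://{host}"


def parse_access_groups(value: str) -> list[str]:
    """Split a comma-separated group list, dropping blanks and duplicates."""
    groups: list[str] = []
    for name in value.split(","):
        name = name.strip()
        if name and name not in groups:
            groups.append(name)
    return groups


base = atlassian_base_url("https://YourAtlassianDomain.atlassian.net/wiki/spaces/ENG")
groups = parse_access_groups("confluence-users, engineering ,,confluence-users")
```

Host names are case-insensitive, so the sketch lowercases the domain and discards any path copied from the URL bar.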

Connect the Forge Crawler app

As a Confluence admin, open the Forge Crawler app installation link from the Glean setup page.
Click on Get app and install the app in the correct Confluence instance.
After the app installation is successful, click Save in Glean. You’re all set!

Setup in Glean

Input the data source name in the Name text box and select an icon
Complete any outstanding setup in Show setup instructions
Input the following information:
- Input Confluence domain name in Your Atlassian domain name text box
- Input the default access group in the Default access group name text box
The Webhook URL is used for webhook setup when granting Glean access and setting up the Glean app from Atlassian’s marketplace.
Click Save

Optional: Configuring Glean search for Confluence to crawl restricted pages

By default, the Glean connector for Confluence is configured to access all Confluence spaces and pages except Restricted Pages. Atlassian admins cannot view restricted pages unless the admin user is given explicit access.

Users may want their Confluence Restricted Pages to be included in search results. Glean can crawl and index restricted pages in a permissions-enforced way. This involves giving the “Scio Search for Confluence” app view access to the pages to be indexed. The following is an overview of the procedure:

Users must:
- Have edit access to a set of Restricted Pages in Confluence
- Have “Add/Delete Restrictions” permissions for the space
- Create an API token through the Atlassian settings workflow
- Upload that API token into their Glean application settings, where the token is stored securely
Glean will securely read the token and add the “Glean Search Crawler for Confluence” application, as view-only, to the restricted pages that the user can edit.
Glean will be able to crawl and index the pages with the “Glean Search Crawler for Confluence”
Users with edit or view access to the Restricted Pages can view those pages in Glean search results.
Multiple users can upload their API tokens. For each such user, Glean will add the Glean Search Crawler For Confluence app to the view restrictions of restricted pages that the user can edit.

Create Confluence API Token

Any user who has edit access to Restricted Pages and needs those pages crawled and indexed by Glean must do the following:

Log in to their Atlassian account
Go to https://id.atlassian.com/manage-profile/security/api-tokens
Select [Create API Token]
Enter a name for the token, for example “Glean Search Crawler”

OAuth Flow for Individual Users

Individual users must authorize Glean in the UI by clicking their profile picture (bottom left corner) → Your settings → Data sources → Confluence Cloud.

Items crawled

Content

For Confluence, Glean crawls the following content:

Spaces
Pages
Blogs
Comments - from both Pages and Blogs
Attachments metadata

Identity

Users: Information about users
Groups: Details about groups within the domain.
Memberships: Information about group memberships, indicating which users belong to which groups.

The identity crawl operates with the following configurations:

Incremental Identity Crawls: These are performed to capture changes since the last crawl.
Full Identity Crawls: These are conducted periodically to ensure all identity data is up-to-date.

Activity

Glean crawls the following activities on content (spaces, pages, blogs, and so on) to keep the index current:

Adds: New content, spaces, pages, blogs, files, or folders that have been added.
Updates: Modifications to existing content, spaces, pages, blogs, files, or folders.
Permissions Changes: Changes to the sharing permissions of content, spaces, pages, blogs, files, or folders.
Deletions: Content, spaces, pages, blogs, files, or folders that have been deleted.
View Activity: Events indicating when content, spaces, pages, blogs, files, or folders have been viewed.

The Glean Activity plugin for Confluence helps Glean provide highly personalized search results for users. By sending a webhook event to Glean each time a user views a page, blog, or other piece of content, the plugin lets your Glean instance gather the activity information it uses to rank and personalize results. This information is stored securely in your dedicated cloud project (Glean or customer-hosted), so it stays within your environment.

Rate Limits

Queries per Second (QPS): The default rate limit is set to 20 queries per second per user.
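To make the 20 QPS figure concrete, the sketch below paces calls so that no more than 20 are issued per second. It is a generic client-side pacing limiter written for illustration, not Glean's implementation; the QpsLimiter class and its injectable clock and sleep parameters are hypothetical.

```python
import time


class QpsLimiter:
    """Spaces calls at least 1/qps seconds apart (a simple pacing limiter)."""

    def __init__(self, qps: float = 20.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / qps
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = clock()

    def acquire(self) -> float:
        """Block until the next request may be sent; return the seconds waited."""
        now = self._clock()
        wait = max(0.0, self._next_allowed - now)
        if wait:
            self._sleep(wait)
        # Schedule the following slot one interval after this one.
        self._next_allowed = max(now, self._next_allowed) + self.min_interval
        return wait
```

At 20 QPS the minimum spacing between requests is 50 ms. One limiter instance per user would match the "per user" wording above.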

Update frequency

Content updates for the Confluence connector in Glean can happen quite rapidly, depending on the type of update and the configuration settings. Here are the key areas:

People / Identity Crawls: Changes to group memberships are picked up by the identity crawl, which runs every 10 minutes. This ensures that updates to user groups and their permissions are reflected promptly.
Incremental Crawls: These occur every hour to provide additional reliability beyond the minute-by-minute activity reports.
Full Crawls: The frequency of full crawls can be configured, but they run less often than incremental crawls, typically every 7 days.

Changes in data must be crawled, processed, and indexed before the data is reflected in the UI. Actual time may vary depending on the number of changes and corpus size. For the most up-to-date crawler refresh information, please refer to [External] Glean crawling strategy.
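The intervals above can be written down as a small schedule table. The sketch below computes when each crawl type is next due from its last run time. The names are hypothetical, and the 7-day full-crawl interval is treated as an assumed default because the text says it is configurable.

```python
from datetime import datetime, timedelta

# Intervals taken from the text above; the full-crawl value is configurable.
CRAWL_INTERVALS = {
    "identity": timedelta(minutes=10),
    "incremental": timedelta(hours=1),
    "full": timedelta(days=7),
}


def next_due(last_run: dict[str, datetime]) -> dict[str, datetime]:
    """Return the next scheduled start time for each crawl type."""
    return {kind: last_run[kind] + interval for kind, interval in CRAWL_INTERVALS.items()}


start = datetime(2024, 1, 1, 9, 0)
due = next_due({"identity": start, "incremental": start, "full": start})
```

These are scheduling times only. As noted above, a change still has to be processed and indexed before it appears in the UI.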

How the crawl works

The crawler follows a traditional crawling strategy: it uses the API together with the following mechanisms to fetch and update data:

Identity Crawl: Adds and updates People data, including users, groups, and related information.
Webhooks: Messages sent by the application to notify Glean of changes in real time. Glean then either initiates a crawl or picks up the change on the next crawl.
Content Crawls: Full crawls cover the entire defined scope of the application, whereas incremental crawls capture only the changes since the previous full or incremental crawl.
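The difference between the two content crawl types can be sketched with an in-memory corpus. The Doc type, the integer timestamps, and the function names below are hypothetical; the sketch only illustrates the "everything" versus "changes since the last checkpoint" distinction and does not call the Confluence API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    modified_at: int  # e.g. epoch seconds of the last change


def full_crawl(corpus: list[Doc]) -> list[Doc]:
    """Return every document in scope, regardless of when it changed."""
    return list(corpus)


def incremental_crawl(corpus: list[Doc], since: int) -> list[Doc]:
    """Return only documents changed after the previous crawl's checkpoint."""
    return [doc for doc in corpus if doc.modified_at > since]


corpus = [Doc("page-1", 100), Doc("page-2", 250), Doc("blog-1", 300)]
changed = [d.doc_id for d in incremental_crawl(corpus, since=200)]
```

An incremental pass is cheaper, but it cannot notice items whose change it never observed, which is one reason a periodic full crawl is still useful.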

Known Limitations in Crawl

The Confluence connector for Glean has the following known limitations in its crawling process:

The Glean app can read all unrestricted pages in the Confluence spaces. However, Glean can only read restricted pages if the admin grants access to the app for them.
Blogpost Hierarchy: In Confluence Cloud and Confluence Server, blog posts do not have a hierarchical structure, so they are retrieved with a normal list-all-content-ids REST API call. Additionally, Glean does not support databases, whiteboards, smart links, or other custom content.

Unsupported objects include:

Archived Pages

These limitations highlight the constraints and ongoing improvements for the Confluence connector, ensuring better performance and user experience.

API endpoints

Purpose	Cloud Endpoint	Cloud Permission	OAuth 2.0 scopes required & recommended	Cloud Scope
List users	search/user	Exempt from app access rules	read:content-details:confluence	READ
List groups	group	Permission to access the Confluence site ('Can use' global permission).	read:confluence-groups	READ
List group members	group/member	Permission to access the Confluence site ('Can use' global permission).	read:confluence-space.summary	READ
List groups of user	user/memberof	Permission to access the Confluence site ('Can use' global permission).	read:confluence-user	READ
Get current user	user/current	Permission to access the Confluence site ('Can use' global permission).	read:confluence-user	READ
Get email of users	user/email/bulk	Permission to access the Confluence site ('Can use' global permission).	read:email-address:confluence	ADMIN
List spaces	space	Permission to access the Confluence site ('Can use' global permission). Note, the returned list will only contain spaces that the current user has permission to view.	read:confluence-space.summary	SPACE_ADMIN
CQL-based list spaces	search	Permission to view the entities. Note, only entities that the user has permission to view will be returned.	search:confluence	READ
List pages in space	space/%s/content/page	'View' permission for the space. Note, the returned list will only contain content that the current user has permission to view.	read:confluence-content.summary	READ
List blogposts in space	space/%s/content/blogpost	'View' permission for the space. Note, the returned list will only contain content that the current user has permission to view.	read:confluence-content.summary	READ
Get space permissions	space/%s	'View' permission for the space.	read:confluence-space.summary	READ
List content	content	Permission to access the Confluence site ('Can use' global permission). Only content that the user has permission to view will be returned.	read:confluence-content.summary	READ
Get content	content/%s	Permission to access the Confluence site ('Can use' global permission). Only content that the user has permission to view will be returned.	read:confluence-content.summary	READ
CQL based list content	content/search	Permission to access the Confluence site ('Can use' global permission). Only content that the user has permission to view will be returned.	search:confluence	READ
List children of page	pages/%s/children	'View' permission for the space, and permission to view the content if it is a page.	read:confluence-content.summary	READ
Get content restrictions	content/%s/restriction/byOperation/read	Permission to view the content.	read:confluence-content.all	READ
Update content restriction	content/%s/restriction	Permission to edit the content.	write:confluence-content	READ
Fetch applinks	N/A
Create webhook	N/A
Configure plugin	N/A
Get installed plugin version	N/A
Get space permissions via plugin	N/A
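The %s placeholders in the endpoint column stand for a space key or content ID. The following sketch fills a template and joins it to an API base URL. The endpoint_url helper is hypothetical, and the base path used in the example (/wiki/rest/api on the site domain) is an assumption for illustration; some endpoints in the table may be served under a different base path.

```python
from urllib.parse import quote


def endpoint_url(api_base: str, template: str, *ids: str) -> str:
    """Fill each %s in an endpoint template with a URL-encoded identifier."""
    path = template % tuple(quote(value, safe="") for value in ids)
    return f"{api_base.rstrip('/')}/{path}"


API_BASE = "https://example.atlassian.net/wiki/rest/api"  # assumed base path
pages_url = endpoint_url(API_BASE, "space/%s/content/page", "ENG")
restrictions_url = endpoint_url(API_BASE, "content/%s/restriction/byOperation/read", "12345")
```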

Content Configuration

Note: If Inclusion (Green-Listing) options are enabled, only content from the Inclusion category will be indexed. If Exclusion (Red-Listing) options are enabled, all content in the exclusion category will be removed. If both rules are applied to the same content, then the content will NOT be indexed, as exclusion rules take priority.

The rules below should be used MINIMALLY to preserve the enterprise search experience, as most end-users expect to find all content. Most customers do not apply any rules or apply exclusion rules sparingly for sensitive folders. Exclusion rules are applied automatically after the next full crawl, which can vary by corpus size. If a recrawl is needed, please reach out to your Glean representative.

Exclusion (Red-Listing) Options

Glean provides several options for excluding content from the data crawl. Excluded content does not appear in search or chat results.

Space: Exclude certain Confluence spaces from being crawled by Glean by specifying space keys
Pages with specific labels: Exclude pages and blog posts with specific labels from being crawled by Glean
Pages with content matching specific regex: Exclude pages and blog posts with content matching specific regex from being crawled by Glean
Creators: Exclude content created by certain creators from being crawled by Glean.

Inclusion (Green-Listing) Options

Glean provides options for limiting the data crawl to specific content. When these options are set, only the included content appears in search and chat results.

Spaces: Only allow Glean to crawl certain Confluence spaces. Glean will crawl all spaces except those in the Exclusion rules if no spaces are specified.

Note: Only content covered by the inclusion rules will appear in search results, chat, or any other Glean application. Content that is not covered will not appear in any of them.
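How the inclusion and exclusion rules combine can be sketched as a single decision function. The Page and CrawlRules types and the should_index function are hypothetical names for this example and are not Glean's configuration format. The sketch encodes two behaviors stated above: exclusion rules take priority, and an empty inclusion list means all spaces are crawled.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Page:
    space_key: str
    creator: str
    body: str
    labels: set[str] = field(default_factory=set)


@dataclass
class CrawlRules:
    include_spaces: set[str] = field(default_factory=set)  # empty means "all spaces"
    exclude_spaces: set[str] = field(default_factory=set)
    exclude_labels: set[str] = field(default_factory=set)
    exclude_creators: set[str] = field(default_factory=set)
    exclude_body_patterns: list[str] = field(default_factory=list)


def should_index(page: Page, rules: CrawlRules) -> bool:
    # Exclusion rules are checked first because they take priority.
    if page.space_key in rules.exclude_spaces:
        return False
    if page.labels & rules.exclude_labels:
        return False
    if page.creator in rules.exclude_creators:
        return False
    if any(re.search(pattern, page.body) for pattern in rules.exclude_body_patterns):
        return False
    # With no inclusion list, every non-excluded space is crawled.
    return not rules.include_spaces or page.space_key in rules.include_spaces


rules = CrawlRules(
    include_spaces={"ENG", "HR"},
    exclude_spaces={"HR"},
    exclude_labels={"confidential"},
    exclude_body_patterns=[r"\bSSN\b"],
)
```

In this example the HR space is both included and excluded, so its pages are not indexed.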
