Pre-Requisites
In the early stages of deploying Glean in your AWS self-hosted environment, please review the following:
Review the Shared Responsibility document between Glean and the customer:
Shared Responsibility for Managing Glean.
Review the Cloud Services of Glean that will be deployed:
Review the Architecture of Glean:
High-level architecture of Glean in AWS.
Low-level architecture of Glean in AWS.
Review the AWS Account Access and Deployment Model. This document describes the proposed access model between Glean and a customer’s account as well as how Glean will deploy software within an account.
glean-deployer: This role is used by Glean's central infrastructure to perform deployment operations such as software setup and upgrades. It has permissions to invoke specific lambdas, manage load balancers and SSL certificates, and manipulate secrets in AWS Secrets Manager.
glean-viewer: This role is used by Glean engineers for viewing resources during debugging. It provides read access to a subset of account resources, including EC2 instances, load balancers, CloudWatch metrics, VPC components, Lambda functions, and EKS clusters.
cron-helper-invoker: This role is assumed by a Glean Central project service account to orchestrate workflows like machine learning jobs by invoking the cron_helper lambda function.
Glean does not support customer-managed keys (CMK). Please verify that CMK is not enabled. Note: CMK can block setup and deployment.
Getting Started…
Discuss which AWS Region (e.g., `us-east-1`) to deploy the Glean architecture in, given your choice of LLM.
As of December 2024, here is the list of approved regions:
United States: us-east-1 (Northern Virginia, USA), us-east-2 (Ohio, USA), us-west-1 (Northern California, USA), and us-west-2 (Oregon, USA).
If using AWS Bedrock with Anthropic models, us-east-1 or us-east-2 is recommended.
Europe: eu-central-1 (Frankfurt, Germany), eu-central-2 (Zurich, Switzerland), eu-north-1 (Stockholm, Sweden), eu-south-1 (Milan, Italy), eu-south-2 (Spain), eu-west-1 (Ireland), eu-west-2 (London, England), eu-west-3 (Paris, France).
Additional Regions: ap-south-1 (Mumbai, India), ap-south-2 (Hyderabad, India), ap-southeast-1 (Singapore), ap-southeast-2 (Sydney, Australia), ap-southeast-3 (Jakarta, Indonesia), ap-southeast-4 (Melbourne, Australia), ap-southeast-5 (Malaysia), ap-northeast-1 (Tokyo, Japan), ap-northeast-2 (Seoul, South Korea), ap-northeast-3 (Osaka, Japan), ap-east-1 (Hong Kong), af-south-1 (Cape Town, South Africa), ca-central-1 (Montreal, Canada), ca-west-1 (Calgary, Canada), il-central-1 (Tel Aviv, Israel), me-central-1 (United Arab Emirates), me-south-1 (Bahrain), sa-east-1 (São Paulo, Brazil).
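As a rough illustration of the region discussion above, the following sketch checks a candidate region against the approved list and flags the Bedrock recommendation. The function name `check_region` and its parameters are hypothetical and are not part of any Glean tooling. The region codes are copied from the list above as of December 2024.

```python
# Approved regions copied from the list above (as of December 2024).
APPROVED_REGIONS = {
    "us-east-1", "us-east-2", "us-west-1", "us-west-2",
    "eu-central-1", "eu-central-2", "eu-north-1", "eu-south-1",
    "eu-south-2", "eu-west-1", "eu-west-2", "eu-west-3",
    "ap-south-1", "ap-south-2", "ap-southeast-1", "ap-southeast-2",
    "ap-southeast-3", "ap-southeast-4", "ap-southeast-5",
    "ap-northeast-1", "ap-northeast-2", "ap-northeast-3", "ap-east-1",
    "af-south-1", "ca-central-1", "ca-west-1", "il-central-1",
    "me-central-1", "me-south-1", "sa-east-1",
}

# Regions the text recommends when using AWS Bedrock with Anthropic models.
BEDROCK_ANTHROPIC_RECOMMENDED = {"us-east-1", "us-east-2"}


def check_region(region: str, uses_bedrock_anthropic: bool = False) -> list[str]:
    """Return human-readable warnings; an empty list means no concerns."""
    warnings = []
    region = region.strip().lower()
    if region not in APPROVED_REGIONS:
        warnings.append(f"{region} is not in the approved region list")
    elif uses_bedrock_anthropic and region not in BEDROCK_ANTHROPIC_RECOMMENDED:
        warnings.append(
            f"{region} is approved, but us-east-1 or us-east-2 is recommended "
            "for AWS Bedrock with Anthropic models"
        )
    return warnings
```

A non-empty result is only a prompt to raise the question with your Glean contact. The approved list may change after the date given above.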
Create a new & empty AWS Account for Glean in the region of interest.
Optional but highly recommended: For customers that want to host the Glean architecture within their AWS environment, Glean provides a read-only script that checks whether any SCPs (Service Control Policies) on the newly created AWS account could conflict with Glean’s required IAM permissions.
Run the Glean Service Control Policies (SCP) Checker script in your organization’s AWS root account to check if you have any AWS SCPs that can conflict with the AWS Account that will be hosting Glean.
Notify Glean if you have any issues or conflicts after the script has run.
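The Glean SCP Checker script itself is not reproduced here. To illustrate the kind of conflict it looks for, the sketch below scans one SCP policy document for Deny statements whose Action patterns match a list of actions. The function `find_denied_actions`, the example policy, and the action list `HYPOTHETICAL_REQUIRED` are all assumptions made for illustration. They do not reflect Glean's actual permission requirements or the script's logic.

```python
import fnmatch
import json


def _as_list(value):
    """IAM policy fields may be a single value or a list; normalize to a list."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def find_denied_actions(policy_json: str, required_actions: list[str]) -> dict[str, list[str]]:
    """Map each required action to the Sids of Deny statements that match it.

    Simplification: NotAction, Resource, and Condition are ignored, so a
    conditional Deny is reported the same way as an unconditional one.
    """
    policy = json.loads(policy_json)
    conflicts: dict[str, list[str]] = {}
    for statement in _as_list(policy.get("Statement")):
        if statement.get("Effect") != "Deny":
            continue
        patterns = [p.lower() for p in _as_list(statement.get("Action"))]
        sid = statement.get("Sid", "<no Sid>")
        for action in required_actions:
            # Action names are case-insensitive and patterns may contain wildcards.
            if any(fnmatch.fnmatchcase(action.lower(), p) for p in patterns):
                conflicts.setdefault(action, []).append(sid)
    return conflicts


# Example policy and action list, both invented for this sketch.
EXAMPLE_SCP = json.dumps({
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "DenyIamRoleChanges", "Effect": "Deny",
         "Action": ["iam:CreateRole", "iam:DeleteRole"], "Resource": "*"},
        {"Sid": "DenyAllEks", "Effect": "Deny", "Action": "eks:*", "Resource": "*"},
        {"Sid": "AllowEverything", "Effect": "Allow", "Action": "*", "Resource": "*"},
    ],
})
HYPOTHETICAL_REQUIRED = ["iam:CreateRole", "eks:CreateCluster", "s3:CreateBucket"]

print(find_denied_actions(EXAMPLE_SCP, HYPOTHETICAL_REQUIRED))
```

Any action reported by a check like this is worth mentioning when you notify Glean of conflicts.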
Notify Glean when the initial account setup is complete and you have run the SCP Checker script, and provide the following details:
AWS Account ID (e.g., #182306642168)
AWS Account Name (e.g., aws-glean-customer)
AWS Region (e.g., us-west-1)
Your organization’s email domain(s) (e.g. @acme.com)
Your SaaS admin email address(es) (e.g. johnsmith@acme.com, Admin of Slack)
After setup, the admin(s) will receive a “magic link” to access the Glean platform/interface to aid in setup.
Note: After the above information has been provided to the Glean team, please stand by for further instructions.
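Before sending these details, it can help to sanity-check their format. The sketch below is one possible way to do so. The `SetupDetails` class and `validate` function are hypothetical names, and the patterns check only the general shape (for example, that an AWS account ID is 12 digits), not whether the account or region actually exists.

```python
import re
from dataclasses import dataclass

ACCOUNT_ID_RE = re.compile(r"^[0-9]{12}$")           # AWS account IDs are 12 digits
REGION_RE = re.compile(r"^[a-z]{2}(-[a-z]+)+-[0-9]$")  # shape only, e.g. us-west-1
DOMAIN_RE = re.compile(r"^@?[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


@dataclass
class SetupDetails:
    account_id: str
    account_name: str
    region: str
    email_domains: list[str]
    admin_emails: list[str]


def validate(details: SetupDetails) -> list[str]:
    """Return a list of formatting problems; an empty list means none were found."""
    problems = []
    # The example above writes the ID with a leading "#", so tolerate it.
    account_id = details.account_id.strip().lstrip("#")
    if not ACCOUNT_ID_RE.match(account_id):
        problems.append("account ID must be exactly 12 digits")
    if not details.account_name.strip():
        problems.append("account name is empty")
    if not REGION_RE.match(details.region.strip()):
        problems.append(f"region does not look like an AWS region code: {details.region}")
    problems += [f"bad email domain: {d}" for d in details.email_domains
                 if not DOMAIN_RE.match(d.strip())]
    problems += [f"bad admin email: {e}" for e in details.admin_emails
                 if not EMAIL_RE.match(e.strip())]
    return problems


example = SetupDetails(
    account_id="#182306642168",
    account_name="aws-glean-customer",
    region="us-west-1",
    email_domains=["@acme.com"],
    admin_emails=["johnsmith@acme.com"],
)
print(validate(example))
```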
Once instructed by Glean, prepare to run the AWS CloudFormation Template (CFT) by following the steps outlined:
Login to the new AWS Account as an Administrator.
Navigate the console to your preferred (Glean-approved) region.
Download the Glean .yaml file
Navigate to the CloudFormation service > Create a new Stack > Upload the Template via the link below… > You can skip all other fields > When you are ready, click Submit to deploy the stack.
Note: After executing the AWS CloudFormation template in Step 5, the deployment process will take over an hour to complete. Once finished, an "External ID" will be generated. Glean will be notified of the External ID and will, in turn, inform your team, as this information is required for Step 6.
Please create the Glean Admin role via a different CloudFormation Template with your external ID. The Glean Team will provide an external ID to be used as a parameter when running the CloudFormation template:
STOP: Please wait for further instructions from the Glean team.
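For teams that script stack creation instead of using the console, the sketch below assembles the arguments for a CloudFormation `create_stack` call that passes the external ID as a template parameter. The parameter key `ExternalId`, the stack name, and the use of `CAPABILITY_NAMED_IAM` are assumptions. Check them against the template that Glean provides. The helper `build_create_stack_request` is a hypothetical name.

```python
def build_create_stack_request(stack_name: str, template_body: str, external_id: str) -> dict:
    """Assemble keyword arguments in the shape the CloudFormation CreateStack API expects."""
    if not external_id:
        raise ValueError("external_id is required; wait for Glean to provide it")
    return {
        "StackName": stack_name,
        "TemplateBody": template_body,
        "Parameters": [
            # "ExternalId" is an assumed parameter name; use the one in the template.
            {"ParameterKey": "ExternalId", "ParameterValue": external_id},
        ],
        # Assumed: CloudFormation requires this acknowledgement for templates
        # that create IAM resources with custom names.
        "Capabilities": ["CAPABILITY_NAMED_IAM"],
    }


request = build_create_stack_request(
    stack_name="glean-admin-role",             # hypothetical stack name
    template_body="<contents of the .yaml>",   # placeholder, not a real template
    external_id="example-external-id",         # placeholder value
)
print(request["Parameters"])
```

With boto3 installed and administrator credentials configured, the resulting dictionary could be passed as `boto3.client("cloudformation", region_name=...).create_stack(**request)`. The console steps described above remain the documented path.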
Important Notes:
The Glean Admin role will only be used in urgent and severe situations where Glean’s on-call support engineers need admin-level access to root-cause and address issues. Access is short-lived in all cases and requires direct approval from the Glean leadership team.
If you utilize AWS Config, please review the cost reduction documentation, as the default AWS Config settings can lead to unexpectedly high costs. Glean provides recommended settings for AWS Config that maintain its value while keeping costs low. If you're not actively using AWS Config, you can disable it completely to reduce costs.
Please review AWS Hosting Cost & Reduction Recommendations and make any changes that can reduce cost.
FAQs
Do the setup processes differ between AWS and GCP?
Yes! The AWS setup and deployment process is detailed in Glean AWS Account Access and Deployment Model [External].
What are all the managed services used?
What should I let the Glean team know ahead of time before setup?
We’d like to understand more about your AWS environment.
What are the compliance policies enforced in your AWS environment? For example, some environments enforce resource tagging (see Implementing and enforcing tagging - Best Practices for Tagging AWS Resources).
What are the firewall rules?
What are additional processes within the AWS environment that Glean should be aware of?
What is the preferred method of connecting on-prem data sources to the Glean VPC? At Glean we prefer using Site-to-Site VPN but support Transit Gateway Peering, Shared Transit Gateway, and PrivateLink.
Can we deploy only via Terraform?
Glean deployment uses a combination of Terraform modules, Helm charts, and Python scripts, so we require using custom deploy images.
Does the customer have to set up the connector GCP project?
No, the Glean setup and deploy pipelines will automatically set up and configure the connector GCP project.
What are the estimated timelines for implementation?
1 week to set up the initial infrastructure (does not include connector setup)
1-2 weeks to get the first full run of ML pipelines after the connector setup has been completed, depending on the corpus size.
Regions
Is AWS region _____ supported?
Generally yes. We now support any region that offers all of our required services.
Certain regions have limitations. For example, as of early April 2024, VPC endpoints for specific managed services may be available only in us-east-1 and us-west-2. You can look through https://www.aws-services.info/ to see which services are supported in your desired region.
We do not currently support GovCloud regions, and we have no immediate time frame for supporting them.
Security
What access to the AWS account is required from Glean?
Glean requires access from:
The central Glean project, which orchestrates setup and release deployments.
The Glean AWS account which hosts the images.
More details are available in Glean AWS Account Access and Deployment Model [External] and [EXTERNAL] Architecture: Glean on AWS.
Why does Glean request the customer to create an admin role?
There are situations where the Glean on-call engineer needs admin-level access to remediate or mitigate escalations. They must get approval from Glean leadership to access the Glean-side internal admin service account, which can then be used for federated access to the AWS-side IAM admin role.
Will Glean manage NAF and WAF?
Yes.
Which WAF are you using?
We’re using https://aws.amazon.com/waf/.
Does WAF log to CloudWatch?
Yes, this is enabled by default for all logs except for deny requests.
Do you apply data protection filters on CloudWatch logs?
Currently, we do not apply this masking to our logs. We’re discussing internally if it makes sense to apply, but we are wary of rendering logs unusable in important support and debugging situations.
What’s the path of incoming webhooks?
First goes through the WAF (you can add rules like IP restrictions)
Then the application load balancer
Then the k8s cluster
The authentication scheme depends on the specific data source.
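The IP restriction mentioned for the WAF step amounts to an allowlist check on the source address. The sketch below shows that logic in plain Python. It is not AWS WAF configuration, and the function name `is_allowed` and the CIDR ranges (taken from the reserved documentation ranges) are placeholders.

```python
import ipaddress


def is_allowed(source_ip: str, allowed_cidrs: list[str]) -> bool:
    """Return True if source_ip falls inside any of the allowed CIDR ranges."""
    address = ipaddress.ip_address(source_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)


# Placeholder ranges; a real allowlist would hold the data source's webhook egress IPs.
ALLOWED = ["203.0.113.0/24", "2001:db8::/32"]

print(is_allowed("203.0.113.10", ALLOWED))
print(is_allowed("198.51.100.7", ALLOWED))
```

A request that fails such a check would be blocked at the WAF and would never reach the load balancer or the cluster.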
Can we attach custom security groups to one of the managed services?
Please provide the details to our support team who can further discuss this.
Does Glean provide any Intrusion Detection capabilities on AWS?
Glean recommends that customers use Amazon GuardDuty for IDS capabilities on AWS. See this doc for more information.
LLM Provider
Networking
What are the network requirements?
Glean will set up and deploy all infrastructure, including VPC components, within an empty AWS account the customer owns, so there is nothing that the customer needs to do proactively with respect to networking.
Compute
What OS are the EC2 instances running on and where do the AMIs come from?
Generally Amazon Linux 2 on EKS nodes. We use the default AWS-provided AMIs here.
For some standalone EC2 instances, we run a Glean AMI image built on top of Ubuntu 20.04 LTS (Focal).
Use of custom AMIs is not currently supported.
Will Glean patch the OS, or is that the customer's responsibility?
Glean will handle the patching and maintenance of all compute instances.
Cost and Resourcing
How do we appropriately size our Glean instance?
Glean will handle dynamically sizing all of the infrastructure based on many different factors relevant to the customer-specific corpus.
Can you give me an estimate of the cost of the AWS resources? Can you give me an estimate of (1) how much data is transferred out of the AWS account per day (2) number of instances and their sizes across all services (e.g. EC2, RDS, EKS, S3, SageMaker)?
All of this can vary depending on the characteristics of your Glean deployment. To answer this question, please reach out to your Glean contact with the following information:
Number of employees in your organization
Number of documents in your corpus
The data sources to be connected, and ideally the number of docs per data source
While these are some high-level factors, many more nuances go into figuring out how much data needs to be stored and processed. We can provide some estimates based on comparable deployments.
What GPU instance types are typically needed?
Our SageMaker training jobs require ml.g4dn.* instance types (primarily ml.g4dn.xlarge). We run about 1-4 training jobs a day, with varying runtimes from 30 minutes to a few hours.
However, none of the instances we explicitly create, e.g. on the EKS cluster, require GPUs.
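The job counts and runtimes above give a rough bound on GPU usage. The sketch below works through that arithmetic under stated assumptions: each training job uses a single instance, and "a few hours" is taken as 3 hours. The hourly rate is a parameter you supply from current AWS pricing for your region; no real price is assumed here.

```python
def daily_instance_hours(jobs_per_day: int, hours_per_job: float) -> float:
    """Instance-hours per day, assuming each job runs on a single instance."""
    return jobs_per_day * hours_per_job


def monthly_cost(instance_hours_per_day: float, hourly_rate_usd: float, days: int = 30) -> float:
    """Approximate monthly cost for a given daily usage and hourly rate."""
    return instance_hours_per_day * days * hourly_rate_usd


# Lower bound: 1 job per day at 30 minutes.
LOW = daily_instance_hours(jobs_per_day=1, hours_per_job=0.5)
# Upper bound: 4 jobs per day at an assumed 3 hours each.
HIGH = daily_instance_hours(jobs_per_day=4, hours_per_job=3.0)

print(f"{LOW} to {HIGH} instance-hours per day")
```

Under these assumptions, usage falls between 0.5 and 12 instance-hours per day, or 15 to 360 instance-hours over a 30-day month. Multiply by the applicable hourly rate with `monthly_cost` to get a cost range.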
How do we minimize egress costs?
Most Glean-relevant traffic is ingress (incoming data). AWS generally does not charge for ingress.
Storage
RDS
Which database are you using?
We’re using Amazon RDS for MySQL: https://aws.amazon.com/rds/mysql/
How often are SQL backups taken?
Once a day.
S3
Do buckets have Inventory enabled?
No, we don’t enable Inventory.
Are S3 buckets accessible publicly or from Glean Central?
No.
Is S3 configured for cross-region replication?
No, we don’t configure cross-region replication and in practice have not had a strong reason to.
Disaster Recovery
How does Glean handle Disaster Recovery?
Please reference [External] Business Continuity and Disaster Recovery.
Vanity URL
Does Glean on AWS support vanity URLs?
Yes.
Lambdas
The EKS cluster is separate from the private lambdas. What is the purpose of these lambdas, which provide serverless functions?
These lambdas are used for:
Setup & deployment (Bootstrap configuration template)
Maintenance operations and cron jobs, e.g. restarting or upgrading node pools
Are the lambdas configured to be publicly accessible?
None of them are publicly accessible.
Do you add layers to lambdas, and if so, are they accessible from outside the organization?
No, Glean doesn’t add layers to lambdas.
Do you use lambda function URLs?
No, they are disabled.