Data Analysis: Technical Details

Overview

This document describes the architecture and request flow of the Data Analysis feature. For background, see also the Data Analysis overview and the Security Whitepaper.

Architecture

All data analysis requests reach the Query Endpoint (QE) as regular /chat requests. The data analysis flow is triggered when the user query is an analytical question about an uploaded or tagged spreadsheet (.xlsx, .xls, or .csv file). Data analysis is a Glean Action that repeatedly generates and executes Python code until it arrives at an answer. The action uses LLMs to generate the code and a dedicated Python sandbox to execute it.
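The generate-and-execute loop described above can be sketched as follows. This is a minimal illustration, not the actual Glean Action: the names `run_data_analysis`, `generate_code`, `execute_code`, `is_answer`, and the `max_steps` cap are all assumptions introduced for the example, and the LLM and the sandbox are replaced by stub functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class StepResult:
    code: str      # code produced in this step
    output: str    # what the sandbox returned for that code


def run_data_analysis(
    question: str,
    generate_code: Callable[[str, List[StepResult]], str],  # hypothetical LLM call
    execute_code: Callable[[str], str],                      # hypothetical sandbox call
    is_answer: Callable[[str], bool],                        # hypothetical stop check
    max_steps: int = 5,                                      # assumed safety cap
) -> Optional[str]:
    """Alternate between code generation and execution until an answer appears."""
    history: List[StepResult] = []
    for _ in range(max_steps):
        code = generate_code(question, history)  # sees earlier code and outputs
        output = execute_code(code)
        history.append(StepResult(code, output))
        if is_answer(output):
            return output
    return None  # no answer within the step budget


# Stubs standing in for the LLM and the sandbox.
def fake_generate(question: str, history: List[StepResult]) -> str:
    # First inspect the data, then compute the answer.
    return "inspect" if not history else "compute"


def fake_execute(code: str) -> str:
    return "columns: region, revenue" if code == "inspect" else "ANSWER: 42"


answer = run_data_analysis(
    "What is the total revenue?",
    fake_generate,
    fake_execute,
    is_answer=lambda out: out.startswith("ANSWER:"),
)
print(answer)
```

Passing the history of earlier steps back into the generator is what makes the loop iterative: each round of code can build on what the previous execution revealed about the file.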

Components

Cloud SQL: stores uploaded files. See [EXTERNAL] File Upload Feature in Assistant - Technical Details for more details.
Query Endpoint: Glean k8s service that handles /chat requests. Data analysis is triggered when the conversation contains uploaded or tagged spreadsheets and the user query is an analytical question that needs to be answered via data analysis.
Sandbox: a dedicated environment that executes the Python code generated for data analysis. Each chat session running data analysis uses its own sandbox. The uploaded file is copied into the sandbox, and the code is executed against it.
Sandbox orchestrator: provisions sandboxes and manages their lifecycle.
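To illustrate the Sandbox component's two responsibilities (receiving a copy of the file, then running code against it), here is a rough local stand-in. It is only a sketch under stated assumptions: the class `SessionSandbox`, its method names, and the timeout are hypothetical, and a subprocess in a temporary directory provides none of the isolation of the real sandbox pod.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


class SessionSandbox:
    """Hypothetical stand-in for one sandbox: a private working directory plus a code runner."""

    def __init__(self) -> None:
        self._tmp = tempfile.TemporaryDirectory()
        self.workdir = Path(self._tmp.name)

    def upload(self, filename: str, content: bytes) -> Path:
        # Keep only the final path component so the file stays inside the workdir.
        target = self.workdir / Path(filename).name
        target.write_bytes(content)
        return target

    def execute(self, code: str, timeout_s: float = 10.0) -> dict:
        # Run the code in a separate interpreter whose cwd is the session's workdir.
        proc = subprocess.run(
            [sys.executable, "-c", code],
            cwd=self.workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        return {"stdout": proc.stdout, "stderr": proc.stderr, "exit_code": proc.returncode}

    def close(self) -> None:
        # Deleting the directory discards everything the session wrote.
        self._tmp.cleanup()


sandbox = SessionSandbox()
sandbox.upload("sales.csv", b"region,revenue\neast,10\nwest,32\n")
result = sandbox.execute(
    "import csv\n"
    "rows = list(csv.DictReader(open('sales.csv')))\n"
    "print(sum(int(r['revenue']) for r in rows))\n"
)
print(result["stdout"].strip())
sandbox.close()
```

Because each `SessionSandbox` owns its own directory and is discarded on `close()`, files from one session are never visible to another, which mirrors the one-sandbox-per-session design.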

Sandboxes for Data Analysis

The sandbox orchestrator and the sandboxes are deployed as k8s pods in the Glean cluster, in a dedicated namespace. Today, all the pods run on a single node.

A sandbox is a simple Flask server that exposes APIs for uploading files and executing Python code. Each sandbox has a local file system that the Python code can use, which allows the code to read and work with the uploaded files. Each data analysis session uses a dedicated sandbox, so there is no data leakage between sessions. We also apply several restrictions and security measures to the sandbox environment, such as:

Limited CPU (500m, i.e. half a core) and memory (500 MiB)
No network egress, so no access to the internet or to other internal Glean services
Network ingress allowed only from QE pods
Runs with non-root permissions
gVisor to prevent side-channel attacks. This prevents one sandbox from reading data that belongs to another sandbox.
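The restrictions above map naturally onto standard Kubernetes objects. The sketch below builds a Pod manifest and a NetworkPolicy as plain Python dictionaries. Everything not stated in the list is an assumption made for illustration: the labels, the image reference, the QE namespace name, the UID, and the existence of a RuntimeClass named `gvisor`. The actual manifests may differ.

```python
import json

# Hypothetical labels and namespace names, used only for this illustration.
SANDBOX_LABELS = {"app": "data-analysis-sandbox"}
QE_LABELS = {"app": "query-endpoint"}
QE_NAMESPACE = "qe-namespace"


def sandbox_pod_manifest(name: str, namespace: str) -> dict:
    """Pod with fixed CPU/memory limits, non-root user, and a gVisor runtime class."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": namespace, "labels": SANDBOX_LABELS},
        "spec": {
            "runtimeClassName": "gvisor",  # assumes a RuntimeClass with this name exists
            "securityContext": {"runAsNonRoot": True, "runAsUser": 1000},  # assumed UID
            "containers": [
                {
                    "name": "sandbox",
                    "image": "example.invalid/sandbox:latest",  # placeholder image
                    "resources": {"limits": {"cpu": "500m", "memory": "500Mi"}},
                    "securityContext": {"allowPrivilegeEscalation": False},
                }
            ],
        },
    }


def sandbox_network_policy(namespace: str) -> dict:
    """Deny all egress; allow ingress only from QE pods in the QE namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "sandbox-isolation", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": SANDBOX_LABELS},
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [
                {
                    "from": [
                        {
                            # Both selectors in one entry: the pod must match both.
                            "namespaceSelector": {
                                "matchLabels": {"kubernetes.io/metadata.name": QE_NAMESPACE}
                            },
                            "podSelector": {"matchLabels": QE_LABELS},
                        }
                    ]
                }
            ],
            "egress": [],  # Egress is listed in policyTypes with no rules, so all egress is denied
        },
    }


pod = sandbox_pod_manifest("sandbox-0", "sandboxes")
policy = sandbox_network_policy("sandboxes")
print(json.dumps(pod["spec"]["containers"][0]["resources"]))
```

Listing `Egress` under `policyTypes` while leaving the `egress` rules empty is what produces the deny-all behavior; omitting `Egress` from `policyTypes` would leave outbound traffic unrestricted.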

The sandbox orchestrator is another Flask server that manages the lifecycle of sandboxes. It exposes APIs for requesting sandboxes, initializes the pod pool to fit the node, and destroys stale sandboxes. It performs the following:

Initialization: The orchestrator assigns a unique sandbox instance to each chat session. Each user is limited to one sandbox. If a user starts a new session, the old sandbox is reset and re-provisioned.
Scaling: The orchestrator scales up sandboxes to handle concurrent data analysis executions. It prevents misuse by limiting the number of sandboxes per user and the total number of concurrent sandboxes in a deployment.
Cleanup: The orchestrator periodically cleans up stale sandbox instances that have been inactive for a specified duration (e.g., 10 minutes).
Resource Management: The orchestrator ensures that each sandbox pod has fixed memory and CPU limits. It also blocks all network egress and only allows ingress from QE pods.
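The lifecycle rules above (one sandbox per user, re-provisioning on a new session, a cap on concurrent sandboxes, and idle cleanup) can be sketched with an in-memory model. This is an illustration under assumptions, not the orchestrator's code: the class and method names, the `max_sandboxes` parameter, and the injected clock are hypothetical, and the 600-second default reflects only the "e.g., 10 minutes" example given above.

```python
import time
import uuid
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Sandbox:
    user_id: str
    session_id: str
    last_active: float
    sandbox_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class SandboxOrchestrator:
    """Hypothetical in-memory model of the orchestrator's bookkeeping."""

    def __init__(
        self,
        max_sandboxes: int,
        idle_timeout_s: float = 600.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_sandboxes = max_sandboxes
        self.idle_timeout_s = idle_timeout_s
        self._clock = clock
        self._by_user: Dict[str, Sandbox] = {}  # at most one sandbox per user

    @property
    def active_count(self) -> int:
        return len(self._by_user)

    def acquire(self, user_id: str, session_id: str) -> Sandbox:
        now = self._clock()
        current = self._by_user.get(user_id)
        if current is not None and current.session_id == session_id:
            current.last_active = now  # same session: reuse and refresh
            return current
        # New session: discard the user's old sandbox before provisioning a fresh one.
        self._by_user.pop(user_id, None)
        if len(self._by_user) >= self.max_sandboxes:
            raise RuntimeError("sandbox capacity reached")
        fresh = Sandbox(user_id, session_id, now)
        self._by_user[user_id] = fresh
        return fresh

    def cleanup(self) -> List[str]:
        """Remove sandboxes idle for at least idle_timeout_s; return the affected users."""
        now = self._clock()
        stale = [
            user for user, sb in self._by_user.items()
            if now - sb.last_active >= self.idle_timeout_s
        ]
        for user in stale:
            del self._by_user[user]
        return stale


# Usage with a fake clock so the idle timeout can be exercised deterministically.
now = [0.0]
orch = SandboxOrchestrator(max_sandboxes=2, clock=lambda: now[0])
first = orch.acquire("alice", "session-1")
same = orch.acquire("alice", "session-1")    # same session reuses the sandbox
second = orch.acquire("alice", "session-2")  # new session gets a fresh sandbox
now[0] = 601.0
removed = orch.cleanup()
print(removed, orch.active_count)
```

Keying the table by user rather than by session is what enforces the one-sandbox-per-user limit: a new session can only replace the user's existing entry, never add a second one.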

Data analysis flow