Run an Analysis

This guide walks you through the complete process of running a federated analysis using the Five Safes TES weave. It covers setting up the environment, configuring connections to TREs (Trusted Research Environments), submitting analysis jobs, and retrieving aggregated results.

It demonstrates running a basic statistical analysis (a mean calculation) on measurement values for a single OMOP concept across multiple TREs.

Requirements

Software Prerequisites

  • Python (3.10 or later)
  • Poetry (optional, 1.8.0 recommended)

Infrastructure Prerequisites

  • A complete deployment of the Five Safes TES stack, including the Submission Layer and at least one TRE

Information Prerequisites

  • Submission Layer endpoint and API token
  • Submission Layer MinIO endpoint
  • Database host and credentials for each TRE

It is assumed that the complete Five Safes TES stack is deployed and that you have all of the required information.

Setup

Clone the repository

Clone the repository from here: https://github.com/Health-Informatics-UoN/Five-Safes-TES-Analytics

git clone https://github.com/Health-Informatics-UoN/Five-Safes-TES-Analytics.git

Install dependencies

Using Poetry is recommended. Run the command:

poetry install

Alternatively, you can use pip with the requirements file:

pip install -r requirements.txt

Edit the env.example

Edit the env.example file to set the environment variables. The example file contains placeholders for all of the relevant details, and every variable is required.

# TRE-FX Analytics Environment Configuration
# Copy this file to .env and update with your actual values
# ALL VARIABLES BELOW ARE REQUIRED - the application will fail to start without them
 
# Authentication
TRE_FX_TOKEN=your_jwt_token_here
TRE_FX_PROJECT=your_project_name
 
# TES (Task Execution Service) Configuration
TES_BASE_URL=http://your-tes-endpoint:5034/  # Host and Port of the Submission Layer API
TES_DOCKER_IMAGE=harbor.ukserp.ac.uk/dare-trefx/control-tre-sqlpg@sha256:18a8d3b056fd573ec199523fc333c691cd4e7e90ff1c43e59be8314066e1313c
 
# Database Configuration
DB_HOST=your-database-host
DB_PORT=5432
DB_USERNAME=your-database-username
DB_PASSWORD=your-database-password
DB_NAME=your-database-name
 
# MinIO Configuration
MINIO_STS_ENDPOINT=http://your-minio-endpoint:9000/sts
MINIO_ENDPOINT=your-minio-endpoint:9000
MINIO_OUTPUT_BUCKET=your-output-bucket-name

This demo runs the analysis inside an SQL Docker container; the required container image is already set in the env.example file via the TES_DOCKER_IMAGE variable.

⚠️

There is a known issue running this image with Docker on certain machines/configurations (e.g. ARM64). We are currently working on a fix.

Rename env.example to .env

Add the access token

Paste the access token into the .env file under TRE_FX_TOKEN and save the file.

Run an analysis

This runs the basic default demo, which calculates the mean of measurement values for a particular OMOP concept: “Airway resistance” (21490742).

Run analysis_engine.py.

Using poetry, the command is:

poetry run python analysis_engine.py

Review submission details.

The Python script prints submission status updates to the terminal. The Submission GUI gives more detail under the Submissions tab.

Wait for processing to complete.

Check the status

When processing is complete, the status in the Submission Layer will change to “waiting for egress”. This means the analysis has run and the results must be approved before they can leave the TREs.

Approve/deny egress requests

Acting as the TREs, access the egress control(s) and approve (or deny) the egress requests.

The default behaviour is to complete the analysis with whatever results are returned, even if one or more TREs don’t provide results. Once the requests have been approved, the status in both the Submission Layer GUI and the terminal will be updated on the next poll.

Fetch partial results

The partial results from each TRE will be fetched automatically.

Aggregate results

The partial results are aggregated and the final result is returned to the terminal.
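For the mean, aggregation can be understood as a count-weighted average of the per-TRE partial results. The sketch below illustrates the idea; the dictionary layout, field names, and figures are illustrative assumptions, not the engine's actual internal format.

```python
# Hypothetical partial results: each TRE returns only its local count and
# mean, never row-level data. The numbers here are made up for illustration.
partials = [
    {"tre": "Nottingham", "n": 120, "mean": 2.4},
    {"tre": "Nottingham 2", "n": 80, "mean": 3.1},
]

# Combine into an overall mean weighted by each TRE's count
total_n = sum(p["n"] for p in partials)
overall_mean = sum(p["n"] * p["mean"] for p in partials) / total_n

print(f"Overall mean across {total_n} measurements: {overall_mean}")
```

Because only counts and means cross the egress boundary, the aggregated result is identical to computing the mean over the pooled raw data, without any raw values leaving a TRE.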

Next steps

The next step is to run a different analysis on a different subset of data.

The tool is generally intended to be used from a Python environment rather than run from the terminal.

Data selection is done with an SQL query, which simply selects the subset of data to run the analysis on. Change the user query to select the data you want to analyse.

Supported analysis types are currently mean, variance, PMCC, chi_squared_scipy and chi_squared_manual.

Once the analysis is completed, the aggregated data is stored in the engine, and the analysis, or related analyses, can be repeated without further queries to the TREs.

import analysis_engine
 
engine = analysis_engine.AnalysisEngine()
 
# Example: select non-null measurement values for a single OMOP concept
user_query = """SELECT value_as_number FROM public.measurement
WHERE measurement_concept_id = 3037532
AND value_as_number IS NOT NULL"""
 
print("Running mean analysis...")
 
# Submit the analysis to the named TREs and wait for the aggregated result
mean_result = engine.run_analysis(
    analysis_type="mean",
    task_name="DEMO: mean analysis test",
    user_query=user_query,
    tres=["Nottingham", "Nottingham 2"]
)
 
print(f"Mean analysis result: {mean_result['result']}")

# The aggregated data is stored on the engine, so related analyses
# can be repeated without further queries to the TREs
print(f"Stored aggregated data: {engine.aggregated_data}")