SDE Storage Options

The Secure Data Enclave (SDE) provides several secure storage options to help you manage your research data throughout its lifecycle, from upload and analysis to long-term retention. Each option serves a different purpose depending on how you work with your data.

Data Sharing Policies

All storage options in the SDE are private to your SDE environment and monitored for compliance. Data sharing or transfer outside the SDE must follow approved egress procedures.

Types of Storage

  • Google Cloud Storage Buckets: Use storage buckets to store, organize, and securely share research data among team members within your project. Buckets are ideal for long-term storage, collaboration, and managing large datasets that can be copied to a specific virtual machine when needed.

  • Attached VM Storage: Each virtual machine (VM) in your enclave includes its own attached disk storage. This storage behaves like a local hard drive: it is fast, but it lasts only as long as the VM does. It is best for running analyses or processing active data. Data stored here is accessible only from within your VM and is deleted if you delete the VM.

  • BigQuery: Provides a managed database service for working with large, structured datasets. It allows you to query, combine, and analyze data efficiently without managing physical storage or compute resources.

Storage Buckets

Google Cloud Storage Buckets provide stable storage that can be accessed from VMs and used to ingress and egress data in compliance with data transfer procedures. Storage buckets are available in most, but not all, projects.

| Project Name | Notes |
| --- | --- |
| Data Ingress | Buckets are added to this project if needed to facilitate Globus transfers to ingress data. |
| Data Lake | This project comes with several buckets with permissions assigned. |
| Data Ops | This project includes a “Preprocessed Data” bucket for use by Data Engineers, and additional buckets can be created by the Data Engineer as needed. |
| Data Egress | The Data Engineers can create buckets within this project, manage permissions, and grant access to Researcher/Data Analysts to download approved data. |

Data Engineer Bucket Access

Data Engineers can perform the following actions on all storage buckets:

| Action | Description |
| --- | --- |
| Create buckets | Create new buckets within the different projects. |
| Delete buckets | Remove buckets that are no longer needed (this also deletes all contents). |
| Upload data | Add files or datasets from a VM to a bucket. |
| Download data | Retrieve files from a bucket to your endpoint or VM. Data downloads to an endpoint must only be done in a manner consistent with data egress procedures. |
| Move or copy data | Transfer data between buckets or projects. |
| Rename files/folders | Update file names or reorganize folder structures within a bucket. |
| Set permissions | Manage who can view, edit, or manage data by assigning access roles. |

Researcher/Data Analyst Bucket Access

Researcher/Data Analysts have limited access to storage buckets.

  • Read and write files to “Researcher Workspace” and “Egress Dataprep” storage buckets in the Data Lake Project from VMs in the Workspace Project. These permissions allow Researcher/Data Analysts to move data between the storage buckets in the Data Lake Project and VMs in the Workspace Projects.

  • Read files in egress storage buckets in the Data Egress Project. Data Engineers create specific storage buckets for each data egress task and assign permissions to Researcher/Data Analysts as needed to download files from these storage buckets to their managed endpoint in a manner consistent with data egress procedures.

Bucket Visibility

Some buckets are provisioned by Northwestern IT as part of your SDE environment setup, while others are created by a Data Engineer. If you do not see a bucket you expect:

Working with Buckets

Files in storage buckets can be viewed from the Google Cloud Console or the Google Cloud command-line interface (CLI) from managed endpoints. These tools can also be used to transfer files, delete files, and manage file permissions. To open and work with files, you must first transfer them to local storage on a VM in the SDE environment. Storage buckets cannot be mounted like an external drive or otherwise synced to VMs.

See the SDE User Guide for detailed instructions on working with storage buckets.
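As an illustration of transferring files between a bucket and VM storage, the following is a minimal Python sketch that wraps the `gcloud storage cp` command. It assumes the Google Cloud CLI is installed and already authenticated on the machine where it runs, and that your role permits the copy. The helper names `build_copy_command` and `copy` are hypothetical and are not part of the SDE tooling; the SDE User Guide remains the authoritative source for the supported procedure.

```python
import subprocess


def build_copy_command(source, destination, recursive=False):
    """Return the argument list for a `gcloud storage cp` invocation."""
    command = ["gcloud", "storage", "cp"]
    if recursive:
        # Needed when the source is a folder rather than a single file.
        command.append("--recursive")
    command.extend([source, destination])
    return command


def copy(source, destination, recursive=False):
    """Run the copy; raises CalledProcessError if gcloud reports a failure."""
    subprocess.run(build_copy_command(source, destination, recursive), check=True)
```

For example, `copy("gs://example-researcher-workspace/cohort.csv", "/home/analyst/data/")` would copy one file from a bucket to the VM, and `copy("/home/analyst/results", "gs://example-researcher-workspace/results", recursive=True)` would copy a folder back. The bucket name and paths here are placeholders, not real SDE resources.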

VM Attached Storage

Each virtual machine (VM) in the SDE includes 200 GB of local storage that is directly attached to it.
This storage behaves like a built-in hard drive. It is fast, secure, and accessible only from the VM.
This storage is useful for saving active research files, temporary datasets, or analysis outputs generated during your session. To work with files, transfer them between storage buckets and this VM storage. Local VM storage is deleted if a VM is deleted, so datasets and other files that need to be used again must be stored in storage buckets instead.

  • Data stored on a VM’s local disk stays within the SDE environment. It cannot be accessed directly from the Internet.

  • Storage size and performance can be adjusted. Contact Northwestern IT if different resources are needed.
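Because the attached disk has a fixed size, it can help to check free space before copying a large dataset from a bucket. The sketch below uses only the Python standard library. The function names and the 5 GiB safety reserve are illustrative assumptions, not SDE requirements.

```python
import shutil

GIB = 1024 ** 3


def disk_report(path="."):
    """Summarize capacity and usage of the disk that holds `path`."""
    usage = shutil.disk_usage(path)
    return {
        "total_gib": usage.total / GIB,
        "used_gib": usage.used / GIB,
        "free_gib": usage.free / GIB,
        "percent_used": 100 * usage.used / usage.total,
    }


def fits_on_disk(size_bytes, path=".", reserve_bytes=5 * GIB):
    """Return True if `size_bytes` fits while leaving `reserve_bytes` free."""
    # The reserve is an assumed safety margin for temporary and output files.
    return shutil.disk_usage(path).free - reserve_bytes >= size_bytes
```

Calling `disk_report("/")` on a Linux VM reports on the root file system; pass a different path if your data lives on a separately mounted disk.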

Data Loss due to VM Deletion

Attached storage is persistent across VM restarts (“starting” and “stopping” the same VM), but it is not automatically backed up. Important data should be copied to a storage bucket.

If a VM is deleted, whether intentionally or accidentally, its attached storage is deleted with it. Keep a copy of your files in storage buckets at all times.

BigQuery

BigQuery allows researchers to store, manage, and analyze large, structured datasets efficiently. It combines data storage with computational capabilities. Data can be accessed using SQL commands. This service is especially useful for projects that involve large-scale data analysis or collaborative data access.

  • BigQuery is optimized for structured and tabular data. It’s best for analysis, not for raw file storage.

  • Access to BigQuery is managed by a Data Engineer, who controls who can view, query, or edit datasets.

  • Data Engineers can create datasets and tables and import data to BigQuery. Researcher/Data Analysts can query tables they are given access to.

  • Query results can be exported to a storage bucket for analysis if needed.

  • Python and R packages are available for interacting with data stored in BigQuery.
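As one way the points above might look in practice, here is a minimal Python sketch using the `google-cloud-bigquery` client library. It assumes the package is installed on your VM, that default credentials are available, and that a Data Engineer has granted you access to the table. The helper names and any project, dataset, table, or bucket names are hypothetical. Note that `extract_table` exports a table, so query results would first need to be saved to a table before they can be exported this way.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z0-9_-]+$")


def table_id(project, dataset, table):
    """Build a fully qualified table ID, rejecting unexpected characters."""
    for part in (project, dataset, table):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"unexpected characters in identifier: {part!r}")
    return f"{project}.{dataset}.{table}"


def preview_sql(project, dataset, table, limit=10):
    """Return SQL that selects the first `limit` rows of a table."""
    return f"SELECT * FROM `{table_id(project, dataset, table)}` LIMIT {int(limit)}"


def run_query(sql):
    """Run a query and return its rows as a list of dictionaries."""
    from google.cloud import bigquery  # requires google-cloud-bigquery

    client = bigquery.Client()  # uses the environment's default credentials
    return [dict(row.items()) for row in client.query(sql).result()]


def export_table(source_table_id, destination_uri):
    """Export a table to a bucket as CSV (the library's default format)."""
    from google.cloud import bigquery

    client = bigquery.Client()
    # destination_uri looks like "gs://<bucket>/<name>.csv" (placeholder).
    client.extract_table(source_table_id, destination_uri).result()
```

For example, `run_query(preview_sql("example-project", "cohort", "visits", limit=5))` would return up to five rows from a table you have been granted access to.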