Computational Resources (VMs)#

Virtual Machines (VMs) in the Secure Data Enclave (SDE) provide the processing power needed to run analyses, build models, and work with data securely. VMs run Ubuntu Linux and are preconfigured with standard research software. Each VM has attached storage and, if configured, access to storage buckets and BigQuery in the SDE environment.

VMs are part of the Google Cloud Compute Engine service.

See the SDE User Guide for details on connecting to and using VMs.

VM Availability#

VMs are available in the following projects:

| Project | Available to | Purpose | Compute Availability |
|---|---|---|---|
| Data Ingress | Data Engineer | For controlled data transfer into the SDE environment. | Single VM for all data ingress tasks. |
| Data Ops | Data Engineer | Data cleaning, curation, and management. Data analysis by Data Engineers. | 1 VM by default; additional VMs can be requested. |
| Workspace | Researcher/Data Analyst | General research tasks and data analysis. | 1 VM per workspace project; additional VMs can be requested. |

Using VMs#

  • VMs have no direct internet access, so users cannot install R, Python, or other software packages themselves. See Available Software for the preinstalled software and the process for requesting additional packages.

  • VMs come with persistent local storage, but it is not backed up automatically. Important files should be saved to a storage bucket.

  • A Data Engineer manages VM configuration and can request additional CPU, memory, or storage if needed.

  • Access and permissions are centrally managed to maintain compliance with NIST SP 800-171 and institutional requirements.

  • VMs can have access to authorized and configured GitHub repositories to share code between SDE users and transfer code in and out of the SDE.

VM usage is billed by the hour while a VM is running, including idle time. SDE users should start a VM when they are ready to use it and stop it when their work is finished. Attached storage persists while a VM is stopped, so files remain available across sessions. When a VM is deleted, all files and data on its attached storage are deleted as well.
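Because a running VM accrues charges even when idle, it can help to compare usage patterns with a little arithmetic. The following is a minimal sketch; the hourly rate is a made-up placeholder, not an actual SDE or Google Cloud price, and the function name is hypothetical.

```python
def estimate_vm_cost(hours_running: float, hourly_rate_usd: float) -> float:
    """Estimate the cost of a VM that is billed for every hour it runs, idle or not."""
    if hours_running < 0 or hourly_rate_usd < 0:
        raise ValueError("hours and rate must be non-negative")
    return round(hours_running * hourly_rate_usd, 2)


# Placeholder rate for illustration only; check current pricing for real figures.
HYPOTHETICAL_RATE_USD = 0.30

# A VM left running for a 30-day month versus one run only 8 hours on 22 workdays.
always_on = estimate_vm_cost(30 * 24, HYPOTHETICAL_RATE_USD)
workdays_only = estimate_vm_cost(22 * 8, HYPOTHETICAL_RATE_USD)

print(f"Always on:     ${always_on:.2f}")
print(f"Workdays only: ${workdays_only:.2f}")
```

Under these assumed numbers, stopping the VM outside working hours reduces the billed hours from 720 to 176.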

Users can start and stop VMs through the Google Cloud Console or the Google Cloud Command Line Interface. Users can connect to VMs using SSH-in-browser, a terminal program on their managed laptop or workstation, or remote desktop or screen sharing applications.
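For users who prefer scripting, starting and stopping a VM from the command line might look like the sketch below. It wraps the `gcloud compute instances start` and `gcloud compute instances stop` commands and assumes the gcloud CLI is installed and authenticated on the managed laptop or workstation. The instance, zone, and project names are hypothetical placeholders.

```python
import subprocess


def vm_command(action: str, instance: str, zone: str, project: str) -> list[str]:
    """Build the gcloud argument list that starts or stops a Compute Engine VM."""
    if action not in ("start", "stop"):
        raise ValueError("action must be 'start' or 'stop'")
    return [
        "gcloud", "compute", "instances", action, instance,
        f"--zone={zone}",
        f"--project={project}",
    ]


def run_vm_command(action: str, instance: str, zone: str, project: str) -> None:
    """Run the command; raises CalledProcessError if gcloud reports a failure."""
    subprocess.run(vm_command(action, instance, zone, project), check=True)


# Hypothetical names; substitute the values for your own SDE project.
print(" ".join(vm_command("stop", "my-workspace-vm", "us-central1-a", "my-sde-workspace")))
```

Calling `run_vm_command("stop", ...)` at the end of a work session is one way to avoid paying for idle hours.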

Best Practices

Learn more about using VMs in the SDE User Guide.

Available VM Types#

By default, the SDE offers a standard VM configuration (E2-Standard-8) suitable for most research and data analysis tasks. Additional VM types can be requested for projects requiring more computational power, memory, or GPU capabilities. VMs with greater computational resources have higher per-hour costs.

| VM Type | vCPUs | Memory (GB) | Typical Use Case | Availability |
|---|---|---|---|---|
| E2-Standard-8 | 8 | 32 | General data analysis, R/Python workloads, Jupyter notebooks | Default |
| E2-Standard-4 | 4 | 16 | Lightweight alternative for processing, scripting, or testing | Optional |
| N2-Highmem-16 | 16 | 128 | Memory-intensive computations, large datasets | By Request |
| N2-Highcpu-64 | 64 | 64 | CPU-heavy workloads, simulation, and parallel processing | By Request |
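When deciding which VM type to request, one simple approach is to pick the smallest listed configuration that meets a workload's vCPU and memory needs. The sketch below encodes the table above. The selection rule (fewest vCPUs, then least memory) is an assumption used for illustration, since per-hour prices are not listed here.

```python
from typing import NamedTuple, Optional


class VMType(NamedTuple):
    name: str
    vcpus: int
    memory_gb: int
    availability: str


# Values copied from the VM type table above.
VM_TYPES = [
    VMType("E2-Standard-4", 4, 16, "Optional"),
    VMType("E2-Standard-8", 8, 32, "Default"),
    VMType("N2-Highmem-16", 16, 128, "By Request"),
    VMType("N2-Highcpu-64", 64, 64, "By Request"),
]


def smallest_fit(min_vcpus: int, min_memory_gb: int) -> Optional[VMType]:
    """Return the smallest listed VM type meeting both requirements, or None."""
    candidates = [
        vm for vm in VM_TYPES
        if vm.vcpus >= min_vcpus and vm.memory_gb >= min_memory_gb
    ]
    # "Smallest" is an assumed heuristic: fewest vCPUs first, then least memory.
    return min(candidates, key=lambda vm: (vm.vcpus, vm.memory_gb), default=None)


# A workload that needs 8 vCPUs but 64 GB of memory exceeds the default type.
print(smallest_fit(8, 64))
```

A result of `None` means no listed type fits, which would be a reason to discuss other options with the Data Engineer.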

Available Software#

Connections between the SDE environment and the internet are tightly controlled and blocked for most SDE resources. VMs therefore include preinstalled software and libraries for data analysis, statistics, development, and document creation.

Additional software, packages, and libraries can be added to the VMs by request to Northwestern IT. Adding new software or packages requires recreating the VMs in the SDE environment, which deletes all files stored on the attached VM storage. Save data and files to storage buckets in the SDE environment before any software update.
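Before a VM is recreated, a working directory can be bundled into a single archive and copied to a storage bucket. The following is a rough sketch using Python's standard `tarfile` module. The final copy step assumes the gcloud CLI is available on the VM, and the directory and bucket names are hypothetical.

```python
import tarfile
from pathlib import Path


def archive_directory(source_dir: str, archive_path: str) -> Path:
    """Pack source_dir into a gzip-compressed tar archive and return its path."""
    source = Path(source_dir)
    if not source.is_dir():
        raise NotADirectoryError(source_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        # Store paths relative to the directory name rather than absolute paths.
        tar.add(source, arcname=source.name)
    return Path(archive_path)


def bucket_copy_command(local_path: Path, bucket_uri: str) -> list[str]:
    """Build a `gcloud storage cp` command that copies a file to a bucket."""
    return ["gcloud", "storage", "cp", str(local_path), bucket_uri]


# Hypothetical bucket URI; use the bucket configured for your SDE project.
print(" ".join(bucket_copy_command(Path("analysis.tar.gz"), "gs://my-sde-bucket/backups/")))
```

The printed command can be run in a terminal on the VM after `archive_directory` has created the archive.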

Web Browser#

Google Chrome

Office & Productivity#

LibreOffice Suite

  • Writer (Documents)

  • Calc (Spreadsheets)

  • Impress (Presentations)

  • Draw (Diagrams/Graphics)

  • Base (Databases)

  • Math (Formulas)

Technical & Statistical Computing#

MATLAB (licensed, available for installation when needed)

R/RStudio, with a variety of packages including:

  • Data analysis (tidyverse, data.table)

  • Statistics & modeling (survival, lme4, mgcv)

  • Machine learning (randomForest, xgboost, glmnet)

  • Bayesian analysis (rstan, brms)

  • Geospatial (sf, terra, spatstat)

  • Visualization (ggplot2, plotly, rgl)

  • Web apps & dashboards (shiny ecosystem)

  • Reporting (bookdown, rmarkdown, flextable, gtsummary)

Additional packages can be installed via request to Northwestern IT.

Python#

Python 3, with popular scientific computing and machine learning packages, including the following libraries and their dependencies:

  • Data: pandas, polars, numpy, pyarrow

  • Statistics: scipy, statsmodels

  • Machine Learning: scikit-learn, xgboost, lightgbm

  • Deep Learning: tensorflow, theano

  • NLP: nltk, spacy

  • Visualization: matplotlib, plotly, bokeh

  • Images & OCR: Pillow, scikit-image, pytesseract

  • Geospatial: geopandas, shapely

  • Networks: networkx, additional graph analysis tools

  • HTML processing: beautifulsoup4

  • Notebooks: JupyterLab

Additional libraries can be installed via request to Northwestern IT.
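Before requesting a library, it may be worth confirming that it is not already installed. A minimal sketch using the standard `importlib.metadata` module is shown below. It looks packages up by distribution name (for example `scikit-learn`, not the import name `sklearn`), and the package list is only an example.

```python
from importlib import metadata
from typing import Optional


def installed_versions(distributions: list[str]) -> dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: dict[str, Optional[str]] = {}
    for name in distributions:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Example list; replace with the packages your analysis needs.
for name, version in installed_versions(["pandas", "scikit-learn", "geopandas"]).items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as not installed is a candidate for a request to Northwestern IT.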

Development Tools#

  • GCC and Fortran compilers

  • CMake, Make

  • git

  • pkg-config

  • autotools