The Bourne Again Shell… but confidentially the content here is one big cheat sheet for setting up a data science VM on the cloud running a Jupyter notebook server appearing in one’s (say laptop) browser.
Click on this link to skip down to a writeup on “What’s the basic idea here?”
Tools that appear on this page: jupyter, docker, ssh, sftp, git, and a utility for AWS.
A keypair file CloudKeypair.pem with permissions set to 0400 using chmod
A terminal on the VM via ssh
We continue this narrative with two intermezzo sections: Resource links and notes on using Linux from a Windows PC. The narrative then jumps into the bootstrapping recipe for the cloud VM.
It is convenient to have a local (“laptop”) instance of Linux. This is facilitated on PCs running Windows via a feature called the Windows Subsystem for Linux (WSL). This allows us to run Linux without needing a VM to run on the PC.
When setting up a hosted Linux environment in this context: Activate and use the Windows Subsystem for Linux version 2 (WSL-2). See which version is active by opening a Command Prompt window as Administrator and issuing wsl -l -v. To change from WSL 1 to WSL 2: wsl --set-version Ubuntu 2
One useful feature of mirroring the cloud VM environment on one’s local laptop is coding locally, i.e. while not connected to the cloud VM. This is managed via git push and git pull.
Some of the GitHub synchronization can be done by means of shell scripts. For example, for repository ant one could set up a script containing:
echo ant
cd ~/ant; git pull
cd ~
The script is run using source, and it “pays for itself” in workflow time once the number of repos to sync exceeds 1.
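The same idea extends to several repositories with a loop. The following is a minimal sketch, not the author's script: the function name sync_repos and the repository names bee and cat are placeholders invented for illustration, and it assumes each repository was cloned directly under one base directory.

```bash
# sync_repos: run "git pull" in each named repository under a base directory.
# Usage: sync_repos BASE_DIR REPO [REPO ...]
sync_repos() {
    local base="$1"
    local repo
    shift
    for repo in "$@"; do
        if [ -d "$base/$repo/.git" ]; then
            echo "$repo"
            # The subshell keeps the cd from changing the caller's directory.
            ( cd "$base/$repo" && git pull )
        else
            echo "skipping $repo: no git repository at $base/$repo"
        fi
    done
}

# Example call, left commented out (ant is from the text; bee and cat are placeholders):
# sync_repos ~ ant bee cat
```

Because each pull runs inside a subshell, the trailing cd ~ of the one-repo script is no longer needed.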
If using Docker to build containers on the PC: Check that docker runs properly in the WSL-2 Linux shell. If docker could not be found, one workaround is to start the Docker app, which seems to activate docker integration with WSL-2.
The notes that follow on this page for configuring Linux on a cloud VM apply in some measure to a laptop environment as well.
More elaborate heading: Bootstrapping an Ubuntu VM to run jupyter with a GitHub repo: Via an ssh tunnel
The Bourne Again Shell (bash) together with ssh is our first resource in managing cloud Virtual Machine use as a research environment. The goal is to configure a cloud VM to have a GitHub repo and some data science libraries, and then finally to start and use a Jupyter notebook server.
These notes were developed on AWS and should be validated on other clouds.
Log in to the VM using ssh from the laptop command prompt.
ssh -i ~/.keypairs/CloudKeypair.pem ubuntu@
PRO TIP Getting a publickey error when trying to ssh to your VM?
- Make sure you ran chmod 400 CloudKeypair.pem
- Make sure you are using the correct username: Is it ubuntu or …?
- Make sure the home directory in VSCode is the same as that of your local Ubuntu shell
- Sometimes multiple editions of Linux get installed accidentally
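The first item on the checklist can be tested mechanically. Here is a small sketch, assuming GNU stat (the version found on Ubuntu and in WSL); the function name check_key_mode is made up for this page. It accepts mode 600 as well as 400, since both keep the key private to its owner.

```bash
# check_key_mode: report whether a private key file has a mode that ssh accepts.
check_key_mode() {
    local key="$1"
    local mode
    if [ ! -f "$key" ]; then
        echo "missing: $key"
        return 1
    fi
    mode=$(stat -c '%a' "$key")   # GNU stat prints the octal mode, e.g. 400
    if [ "$mode" = "400" ] || [ "$mode" = "600" ]; then
        echo "ok: $key has mode $mode"
    else
        echo "too open: $key has mode $mode; run chmod 400 $key"
        return 1
    fi
}

# Example call, left commented out:
# check_key_mode ~/.keypairs/CloudKeypair.pem
```

If the permissions are fine and the error persists, adding -v to the ssh command prints which key files and usernames are being offered.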
Now on the VM: In ~ the .ssh directory includes a file authorized_keys. This file should be pre-loaded with the public key of the keypair selected or generated during VM spin-up, the keypair whose .pem file we refer to here as CloudKeypair.pem. The authorized_keys file resides on the VM to validate incoming ssh connections.
In what follows commands are given without indicating a bash prompt.
Verify the version of Ubuntu using lsb_release -a
This block installs miniconda:
cd ~
which python3
git clone
mkdir -p ~/miniconda3
wget -O ~/miniconda3/
bash ~/miniconda3/ -b -u -p ~/miniconda3
rm ~/miniconda3/
To ensure access to miniconda from the command line, place the following line at the very end of ~/.bashrc:
export PATH=~/miniconda3/bin:$PATH
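Pasting that line by hand works fine. For a repeatable bootstrap, one option is to append it only when it is not already there, so that re-running the recipe does not pile up duplicates. The helper below is a sketch with an invented name (add_line_once), not part of conda or bash.

```bash
# add_line_once: append a line to a file only if that exact line is not already present.
add_line_once() {
    local line="$1"
    local file="$2"
    touch "$file"
    # -q quiet, -x match the whole line, -F treat the pattern as a fixed string
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Intended use, left commented out so that nothing is modified by accident:
# add_line_once 'export PATH=~/miniconda3/bin:$PATH' ~/.bashrc
```

The single quotes matter: they keep $PATH from being expanded at the moment the line is written, so the expansion happens each time ~/.bashrc runs.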
Next: Run ~/.bashrc, confirm the conda package manager is available, and have conda go through its initialization process.
source ~/.bashrc
which conda
conda init
source ~/.bashrc
After conda init, the shell configuration must be reloaded for conda environment use. It should suffice to run source ~/.bashrc. It should also suffice to log out and log back in to the VM (exit, then ssh again).
Incidentally, in addition to the package managers conda and pip there is also available in Linux the Advanced Package Tool or apt. This is specific to Debian-based forms of Linux including Ubuntu. One can update the package index and then upgrade installed packages using apt, as follows. I claim this is a worthwhile perfunctory action as part of this bootstrap process.
sudo apt update -y
sudo apt upgrade -y
pip and venv can also be installed. The functionality we need is covered by conda, however, so this recipe page does not delve into pip/venv.
Continuing with conda
conda create --name testenv
ls -al ~/miniconda3/envs
conda env list
conda activate testenv
conda deactivate
conda activate testenv
The Linux command prompt should now look like: (testenv) ubuntu@ip10.0.12.240:~$
Install data science tools including the Jupyter notebook package.
conda install jupyter
which jupyter
conda install pandas -y
conda install numpy -y
conda install matplotlib -y
Suppose we step away and come back a few days later… what got installed?
conda list
ssh tunnel
We can now set up an ssh tunnel to a jupyter notebook server running on the VM. This is described in more detail on the nexus tunnels page. Here is the final command we run on the VM, to start the Jupyter notebook server process as a background task:
(jupyter lab --no-browser --port=8889) &
There will follow from this a lot of text output on the screen. From this text: Find and copy the Jupyter access token, which is the string that follows token= in the printed URLs.
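Picking the token out of a screenful of text by eye is error-prone, so here is a sketch that does it with sed. The sample line and its token value are made up for illustration, and the function name extract_token is not a Jupyter command; it only assumes the token appears as a token= parameter made of letters and digits.

```bash
# extract_token: read Jupyter startup text on stdin and print the first token found.
extract_token() {
    sed -n 's/.*[?&]token=\([0-9A-Za-z]*\).*/\1/p' | head -n 1
}

# Made-up example line in the style of the server's startup output:
sample='    http://localhost:8889/lab?token=0123abcd4567ef89'
printf '%s\n' "$sample" | extract_token
```

In practice the input would be saved server output rather than a sample string. Depending on the installed Jupyter version, a command such as jupyter server list may also print the running server's URL with its token; I have not verified that against every version.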
From the laptop create the ssh tunnel. It will serve as a secure connection between the VM’s Jupyter notebook server and a browser running on the laptop.
ssh -N -f -i ~/.keypairs/CloudKeypair.pem -L localhost:7005:localhost:8889 ubuntu@
Note: This does not reconnect you to a shell on the cloud VM. Rather it creates a persistent tunnel: -N tells ssh not to run a remote command and -f sends ssh to the background.
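To make the pieces of the -L argument explicit, here is a sketch that assembles the tunnel command from its parts and prints it without running it. The function name tunnel_command and the placeholder VM_ADDRESS are invented here; substitute the address of your own VM.

```bash
# tunnel_command: print (not run) an ssh command that forwards a laptop port to a VM port.
# Usage: tunnel_command KEY_FILE LOCAL_PORT REMOTE_PORT USER@HOST
tunnel_command() {
    local key="$1"
    local local_port="$2"    # port the browser on the laptop connects to
    local remote_port="$3"   # port the Jupyter server listens on, on the VM
    local target="$4"
    printf 'ssh -N -f -i %s -L localhost:%s:localhost:%s %s\n' \
        "$key" "$local_port" "$remote_port" "$target"
}

tunnel_command '~/.keypairs/CloudKeypair.pem' 7005 8889 ubuntu@VM_ADDRESS
```

The first localhost:7005 is the laptop end; the second localhost:8889 is resolved on the VM, which is why it is also written as localhost. Since -f leaves ssh running in the background, the tunnel stays up until that ssh process is stopped.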
To access the VM Jupyter notebook server via the tunnel: In a browser address bar enter the text localhost:7005. Another option is to include the token copied above, again placed in the browser address bar, appended to the address as a ?token= query parameter.
This address tells your browser to connect to port 7005 on the laptop. That port was wired to port 8889 on the cloud VM (previous ssh command), which in turn connects to the jupyter lab notebook server running on the cloud VM.
bash is an abbreviation for the Bourne Again Shell, an interface to the UNIX operating system, or more constructively to its descendant operating system Linux.
du -h -d1
Edit a text file with Linux commands, save it as go.script. Issue source go.script
The source command will attempt to execute the individual commands in sequence, in the current shell.
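The "current shell" detail is what separates source from running a script as its own process. A short sketch, using a temporary file as a stand-in for go.script; the variable name GREETING is arbitrary. The dot command used below is the portable spelling of source, and in bash the two are interchangeable.

```bash
# Write a two-line script to a temporary file (stand-in for go.script).
script=$(mktemp)
printf '%s\n' 'GREETING=hello' 'echo "script ran"' > "$script"

# Run it as a separate process: the variable it sets does not survive in this shell.
sh "$script"
echo "after sh: GREETING=${GREETING:-unset}"

# Run it with source (spelled "." here): the variable persists in this shell.
. "$script"
echo "after source: GREETING=${GREETING:-unset}"

rm -f "$script"
```

The first report shows GREETING as unset and the second shows hello. This is also why changes to ~/.bashrc take effect with source ~/.bashrc but not with bash ~/.bashrc.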
Basic bash command sequence for file system navigation. This should become second nature.
mkdir child
cd child
cd .
cd ..
ls -al
A filename that begins with a period is treated as a hidden file, typically a configuration file. It is not listed by a basic ls command, but it is visible from ls -a
We can edit, customize and run .bashrc by typing source .bashrc. It also runs automatically when a new interactive shell starts, for example on login.
Do not keep cloud credentials in repository folders: a git push that exposes them regularly incurs tens of thousands of dollars in unintended cloud spend.
Each file and directory has an associated access permission string, visible via ls -al. There are three fields each consisting of three values. There is also a leading character that tells you if a file is a directory.
Output of ls -al
-rwxrwxrwx 1 kilroy kilroy 2668 Apr 19 15:50 terraform
drwxr-xr-x 1 kilroy kilroy 4096 May 16 19:16 miniconda3
The terraform file permission field starts with - meaning it is an ordinary file. d in this field for miniconda3 means it is a directory.
The rwx fields that follow are bitwise permission fields (read, write, execute), in sequence left to right for: the User, the User’s Group, and Other Users on this computer. In the case of miniconda3: The Group and Other permissions (r-x) prevent anyone other than the User from writing in that folder.
We can modify the permission string for a file using the change mode command, chmod. We access documentation for this command by issuing the manual command: man chmod
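The numbers given to chmod (400 for the keypair file, for instance) are the same permission string written in octal: read counts 4, write counts 2, execute counts 1, summed per field. The sketch below translates a three-digit mode into the rwx notation; the function names are invented for this page and a leading fourth digit, as in 0400, is not handled.

```bash
# digit_to_rwx: translate one octal digit (0-7) into its rwx triple.
digit_to_rwx() {
    case "$1" in
        0) echo "---" ;;
        1) echo "--x" ;;
        2) echo "-w-" ;;
        3) echo "-wx" ;;
        4) echo "r--" ;;
        5) echo "r-x" ;;
        6) echo "rw-" ;;
        7) echo "rwx" ;;
        *) echo "???" ;;
    esac
}

# octal_to_rwx: translate a three-digit mode such as 755 into a nine-character string.
octal_to_rwx() {
    local user group other
    user=$(printf '%s' "$1" | cut -c1)
    group=$(printf '%s' "$1" | cut -c2)
    other=$(printf '%s' "$1" | cut -c3)
    printf '%s%s%s\n' "$(digit_to_rwx "$user")" "$(digit_to_rwx "$group")" "$(digit_to_rwx "$other")"
}

octal_to_rwx 755   # the miniconda3 directory in the listing above, minus the leading d
octal_to_rwx 400   # the keypair file
```

The two calls print rwxr-xr-x and r-------- respectively.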
nano, zile (etcetera) are popular editors in the same basic format. A popular alternative is the vi, vim (etcetera) family of editors.
- rm file: deletes file
- rmdir dir: deletes a (necessarily empty) directory
- less file: views the contents of file interactively
- more file: an older version of less
- cp a b: copies file a to a new file called b
- mv a b: renames file a to file b
- cat file1 >> file2: copies the contents of file1 onto the end of file2
- grep mohawk file1.txt: searches for occurrences of the string mohawk in file1.txt
- df . : prints the size and free space of the file system holding the current directory
- du -h -d1: prints the disk usage of each directory in the current directory
- history: lists your recent commands in chronological order, conveniently numbered
- !54: re-issues command number 54 from your history
- !!: re-runs the last command you gave
- !-3: re-runs the command 3 commands back in your history

What’s the basic idea here?
An open question for any given research group is whether using a commercial cloud platform such as GCP or AWS or Azure is worth the time investment to learn how. This question has no automatic or simple answer; hence the idea is to talk through aspects of a cloud-for-research framework to develop an appreciation of the requisite learning curve. From my perspective this framework is closely connected with the practice of open science. The first-order elements of the framework are compute power and storage, closely followed by cost management and security.
Many of us are accustomed to experiencing computers as physical boxes with a cable running to a keyboard and another cable running to a monitor; or as an integrated laptop. The transition to cloud computing is a potentially puzzling process; so this narrative begins with a view of cloud virtual machines (VMs).
As users of the ‘computer as physical object’ we experience our local environment through applications that feature very smooth, elaborate graphics that operate on some local data files that reside on a local storage drive. (This experience includes the server or cluster model where a number of machines and distributed data storage are made available to a research team.)
Furthermore we experience a vast network beyond our local computer as “a view of the Internet through the windows of browser tabs”. Now suppose we begin to explore cloud computing because of the availability of a vast pool of computing resources, which we can use and pay for by the hour. So far so good. The conceptual shift to make this work, which is what we write about here, is a hybridizing of the “local computer view” and the “browser tab Internet view”.
The first step in this hybrid concept of computing is to rent a cloud Virtual Machine that is exclusively for our use: We are the root user. As such we want to install and run applications that operate just as an application works on our local laptop. This can be disconcerting because the cloud VM obviously does not have direct access to data files on our laptop. But we charge ahead for the moment: This nexus website emphasizes two such applications from the outset, to run on a cloud VM. The first is a Jupyter notebook server supporting Interactive Python (IPython). The second is an Integrated Development Environment (IDE) that we use to build additional machinery on the cloud. This IDE is called VSCode and it is widely regarded as very useful.
What these two applications have in common is that they present a smooth, elaborate interface to a working environment that appears on our local computer (which I will tend to refer to as our laptop). We have yet to address the data files but that follows below. The main idea here is that cloud computing comes to us on our local computer, in some sense, as a new version of our working research environment.
Let’s take this narrative from the top once more to add an important detail. We begin with us researchers interested in using the cloud. We rent a cloud Virtual Machine to use as a potentially very powerful working environment and we set up two applications to run on the cloud VM. These present an interface on our laptop by means of a secure connection called a tunnel, specifically an ssh tunnel. This tunnel is fast and encrypted, meaning that traffic intercepted in transit cannot be read in any meaningful way.
Now to the question of data files. As with the jump to the hybrid picture of cloud VMs, the data situation is also a conceptual jump, arguably bigger. The bottom line is: The cloud has unlimited low cost storage that will not impact a typical research budget until the data volume approaches 200 TB. This is primarily due to multiple modes of data storage available on all cloud platforms, the most fundamental being object storage that features fast access at a rate of about $0.023 per GB per month.
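To put the quoted rate in concrete terms, here is a back-of-the-envelope sketch. It assumes 1 TB = 1000 GB and applies the $0.023 per GB per month figure from the paragraph above uniformly; real pricing varies by provider, region and storage tier, and charges for moving data out of the cloud are not included. The function name monthly_cost is made up.

```bash
# monthly_cost: object storage cost in dollars per month for a volume given in TB.
# Assumes 1 TB = 1000 GB and the $0.023 per GB per month rate quoted in the text.
monthly_cost() {
    awk -v tb="$1" -v rate="0.023" 'BEGIN { printf "%.2f\n", tb * 1000 * rate }'
}

for tb in 1 10 200; do
    echo "$tb TB: \$$(monthly_cost "$tb") per month"
done
```

At this rate 1 TB comes to $23.00 per month, 10 TB to $230.00 and 200 TB to $4600.00, which gives a sense of where the 200 TB threshold above comes from.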
As noted above this description is intended to sketch a framework in view of the time commitment necessary to learn to use the cloud effectively. Further elaboration is essential, having left off at this point shy of…