Merged
5 changes: 4 additions & 1 deletion config.yaml
@@ -65,7 +65,7 @@ contact: 'team@carpentries.org'

# Order of episodes in your lesson
episodes:
- 10-hpc-intro.md
- 10-hpc-intro.Rmd
- 11-connecting.Rmd
- 12-cluster.Rmd
- 13-scheduler.Rmd
@@ -78,12 +78,15 @@ episodes:

# Information for Learners
learners:
- setup.md

# Information for Instructors
instructors:
- instructor-notes.Rmd

# Learner Profiles
profiles:
- learner-profiles.md

# Customisation ---------------------------------------------
#
59 changes: 30 additions & 29 deletions episodes/10-hpc-intro.md → episodes/10-hpc-intro.Rmd
@@ -4,6 +4,11 @@ teaching: 15
exercises: 5
---

```{r, echo=FALSE}
# Source the external configuration script
source("load_config.R")
```

::::::::::::::::::::::::::::::::::::::: objectives

- Describe what an HPC system is
@@ -22,15 +27,15 @@ Frequently, research problems that use computing can outgrow the capabilities
of the desktop or laptop computer where they started:

- A statistics student wants to cross-validate a model. This involves running
the model 1000 times -- but each run takes an hour. Running the model on
the model 1000 times but each run takes an hour. Running the model on
a laptop will take over a month! In this research problem, final results are
calculated after all 1000 models have run, but typically only one model is
run at a time (in **serial**) on the laptop. Since each of the 1000 runs is
independent of all others, and given enough computers, it's theoretically
possible to run them all at once (in **parallel**).
- A genomics researcher has been using small datasets of sequence data, but
soon will be receiving a new type of sequencing data that is 10 times as
large. It's already challenging to open the datasets on a computer --
large. It's already challenging to open the datasets on a computer
analyzing these larger datasets will probably crash it. In this research
problem, the calculations required might be impossible to parallelize, but a
computer with **more memory** would be required to analyze the much larger
@@ -54,7 +59,7 @@ problems in parallel**.

## Jargon Busting Presentation

Open the [HPC Jargon Buster](../files/jargon#p1)
Open the [HPC Jargon Buster](files/jargon.html#p1)
in a new tab. To present the content, press `C` to open a **c**lone in a
separate window, then press `P` to toggle **p**resentation mode.

@@ -71,48 +76,44 @@ results.
## Some Ideas

- Checking email: your computer (possibly in your pocket) contacts a remote
machine, authenticates, and downloads a list of new messages; it also
uploads changes to message status, such as whether you read, marked as
junk, or deleted the message. Since yours is not the only account, the
mail server is probably one of many in a data center.
- Searching for a phrase online involves comparing your search term against
a massive database of all known sites, looking for matches. This "query"
machine, authenticates, and downloads a list of new messages; it also uploads
changes to message status, such as whether you read, marked as junk, or
deleted the message. Since yours is not the only account, the mail server is
probably one of many in a data center.
- Searching for a phrase online involves comparing your search term against a
massive database of all known sites, looking for matches. This "query"
operation can be straightforward, but building that database is a
[monumental task][mapreduce]! Servers are involved at every step.
- Searching for directions on a mapping website involves connecting your
(A) starting and (B) end points by [traversing a graph][dijkstra] in
search of the "shortest" path by distance, time, expense, or another
metric. Converting a map into the right form is relatively simple, but
calculating all the possible routes between A and B is expensive.
- Searching for directions on a mapping website involves connecting your (A)
starting and (B) end points by [traversing a graph][dijkstra] in search of
the "shortest" path by distance, time, expense, or another metric. Converting
a map into the right form is relatively simple, but calculating all the
possible routes between A and B is expensive.

Checking email could be serial: your machine connects to one server and
exchanges data. Searching by querying the database for your search term (or
endpoints) could also be serial, in that one machine receives your query
and returns the result. However, assembling and storing the full database
is far beyond the capability of any one machine. Therefore, these functions
are served in parallel by a large, ["hyperscale"][hyperscale] collection of
servers working together.


endpoints) could also be serial, in that one machine receives your query and
returns the result. However, assembling and storing the full database is far
beyond the capability of any one machine. Therefore, these functions are served
in parallel by a large, ["hyperscale"][hyperscale] collection of servers
working together.

:::::::::::::::::::::::::

::::::::::::::::::::::::::::::::::::::::::::::::::



[mapreduce]: https://en.wikipedia.org/wiki/MapReduce
[dijkstra]: https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm
[hyperscale]: https://en.wikipedia.org/wiki/Hyperscale_computing


:::::::::::::::::::::::::::::::::::::::: keypoints

- High Performance Computing (HPC) typically involves connecting to very large computing systems elsewhere in the world.
- These other systems can be used to do work that would either be impossible or much slower on smaller systems.
- High Performance Computing (HPC) typically involves connecting to very large
computing systems elsewhere in the world.
- These other systems can be used to do work that would either be impossible or
much slower on smaller systems.
- HPC resources are shared by multiple users.
- The standard method of interacting with such systems is via a command line interface.
- The standard method of interacting with such systems is via a command line
interface.

::::::::::::::::::::::::::::::::::::::::::::::::::
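
As a rough illustration of the serial versus parallel arithmetic in the cross-validation example from this episode (1000 independent runs at one hour each), the sketch below estimates wall time. The function names `serial_hours` and `parallel_hours` and the worker counts are hypothetical, chosen only for illustration. It assumes the runs are spread evenly across identical machines and ignores queueing, startup, and data-transfer overhead.

```bash
#!/usr/bin/env bash
# Illustrative wall-time arithmetic for independent runs (not part of the lesson).

# serial_hours RUNS HOURS_PER_RUN
# One run at a time: total time is simply runs * hours per run.
serial_hours() {
  echo $(( $1 * $2 ))
}

# parallel_hours RUNS HOURS_PER_RUN WORKERS
# Runs are spread evenly over WORKERS machines; the busiest machine
# handles ceil(RUNS / WORKERS) runs, which sets the wall time.
parallel_hours() {
  local batches=$(( ($1 + $3 - 1) / $3 ))
  echo $(( batches * $2 ))
}

echo "serial:       $(serial_hours 1000 1) hours"
echo "100 workers:  $(parallel_hours 1000 1 100) hours"
echo "1000 workers: $(parallel_hours 1000 1 1000) hours"
```

Under these assumptions the serial case takes 1000 hours, which is a little under 42 days and matches the "over a month" estimate in the text, while 1000 workers would finish in about the time of a single run.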


30 changes: 16 additions & 14 deletions episodes/11-connecting.Rmd
@@ -4,6 +4,11 @@ teaching: 25
exercises: 10
---

```{r, echo=FALSE}
# Source the external configuration script
source("load_config.R")
```

::::::::::::::::::::::::::::::::::::::: objectives

- Configure secure access to a remote HPC system.
@@ -17,11 +22,6 @@ exercises: 10

::::::::::::::::::::::::::::::::::::::::::::::::::

```{r, echo=FALSE}
# Source the external configuration script
source("load_config.R")
```

## Secure Connections

The first step in using a cluster is to establish a connection from our laptop
@@ -38,15 +38,17 @@ results.
If you have ever opened the Windows Command Prompt or macOS Terminal, you have
seen a CLI. If you have already taken The Carpentries' courses on the UNIX
Shell or Version Control, you have used the CLI on your *local machine*
extensively. The only leap to be made here is to open a CLI on a *remote machine*,
while taking some precautions so that other folks on the network can't see (or
change) the commands you're running or the results the remote machine sends
back. We will use the Secure SHell protocol (or SSH) to open an encrypted
network connection between two machines, allowing you to send \& receive text
and data without having to worry about prying eyes.

![](/fig/connect-to-remote.svg){max-width="50%" alt="Connect to cluster"}

extensively. The only leap to be made here is to open a CLI on a *remote
machine*, while taking some precautions so that other folks on the network
can't see (or change) the commands you're running or the results the remote
machine sends back. We will use the Secure SHell protocol (or SSH) to open an
encrypted network connection between two machines, allowing you to send \&
receive text and data without having to worry about prying eyes.

![connect-to-remote.svg](fig/connect-to-remote.svg){
max-width="50%"
alt="Connect to cluster. "
}

SSH clients are usually command-line tools, where you provide the remote
machine address as the only required argument. If your username on the remote
8 changes: 0 additions & 8 deletions episodes/13-hpcc-scheduler/hpcc/section2.rmd

This file was deleted.

12 changes: 6 additions & 6 deletions episodes/13-scheduler.Rmd
@@ -57,7 +57,7 @@ In this case, the job we want to run is a shell script -- essentially a
text file containing a list of UNIX commands to be executed in a sequential
manner. Our shell script will have three parts:

- On the very first line, add ``r config$remote$bash_shebang``. The `#!`
- On the very first line, add ``r config$remote$shebang``. The `#!`
(pronounced "hash-bang" or "shebang") tells the computer what program is
meant to process the contents of this file. In this case, we are telling it
that the commands that follow are written for the command-line shell (what
@@ -75,7 +75,7 @@ manner. Our shell script will have three parts:
```

```bash
`r config$remote$bash_shebang`
`r config$remote$shebang`

echo -n "This script is running on "
hostname
@@ -163,7 +163,7 @@ resources we must customize our job script.
Comments in UNIX shell scripts (denoted by `#`) are typically ignored, but
there are exceptions. For instance the special `#!` comment at the beginning of
scripts specifies what program should be used to run it (you'll typically see
``r config$local$bash_shebang``). Schedulers like `r config$sched$name` also
``r config$local$shebang``). Schedulers like `r config$sched$name` also
have a special comment used to denote special scheduler-specific options.
Though these comments differ from scheduler to scheduler,
`r config$sched$name`'s special comment is ``r config$sched$comment``. Anything
@@ -179,7 +179,7 @@ name of a job. Add an option to the script:
```

```bash
`r config$remote$bash_shebang`
`r config$remote$shebang`
`r config$sched$comment` `r config$sched$flag$name` hello-world

echo -n "This script is running on "
@@ -253,7 +253,7 @@ for it on the cluster.
```

```bash
`r config$remote$bash_shebang`
`r config$remote$shebang`
`r config$sched$comment` `r config$sched$flag$time` 00:01 # timeout in HH:MM

echo -n "This script is running on "
@@ -282,7 +282,7 @@ wall time, and attempt to run a job for two minutes.
```

```bash
`r config$remote$bash_shebang`
`r config$remote$shebang`
`r config$sched$comment` `r config$sched$flag$name` long_job
`r config$sched$comment` `r config$sched$flag$time` 00:01 # timeout in HH:MM

4 changes: 2 additions & 2 deletions episodes/14-environment-variables.Rmd
@@ -212,7 +212,7 @@ job was submitted.
```

```output
`r config$remote$bash_shebang`
`r config$remote$shebang`
`r config$sched$comment` `r config$sched$flag$time` 00:00:30

echo -n "This script is running on "
@@ -279,7 +279,7 @@ unless we type in the full path to the program,
since the directory `/users/vlad` isn't in `PATH`.

This means that I can have executables in lots of different places as long as
I remember that I need to to update my `PATH` so that my shell can find them.
I remember that I need to update my `PATH` so that my shell can find them.

What if I want to run two different versions of the same program?
Since they share the same name, if I add them both to my `PATH` the first one