ARC Storage (Migration)

Introduction

Over the last year the ARC team has been working on replacing part of the storage infrastructure behind the ARC and HTC clusters. This is necessary because the current storage infrastructure has reached end of life, and to meet increasing space and performance demands.

As a reminder: $DATA is the shared permanent storage available to projects for storing research data required for (or generated by) use of the ARC clusters. $SCRATCH is temporary storage created at the start of a job, which programs can use to store temporary data shared between processes during the run; the $SCRATCH space is deleted once the job finishes. A third area, $HOME, is also available to users and (mostly) holds data required for the login process.
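As a quick check, you can see where these areas live from a login or interactive session, assuming the variables are defined in your environment (note that $SCRATCH is normally only set inside a running job):

# show where your storage areas point ($SCRATCH is only set inside a job)
echo "HOME: $HOME"
echo "DATA: $DATA"
df -h "$DATA"    # filesystem behind $DATA and its free space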

The new storage replaces the $DATA and $SCRATCH data areas. Your $HOME area is unaffected.

More information on the different types of storage can be found in the ARC user guide at https://arc-user-guide.readthedocs.io/en/latest/arc-storage.html

Data Migration Timeline

All project data areas were switched to the ‘new’ storage on 4th June 2024. If you had not migrated your data before then, your $DATA area will be mounted on the new storage system and will appear empty. Your old data will be available read-only under /migration/$PROJECT/$USER.
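Before copying, it can be useful to check what is in your old area and how much data needs to be transferred; replace <projectname> with the name of your project:

# list your read-only copy on the old storage
ls -l /migration/<projectname>/$USER
# report the total size of the data to be transferred
du -sh /migration/<projectname>/$USER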

Who is responsible for migrating my data?

Each user is responsible for transferring their data.

How long will my data be available on the old storage after migration?

The old storage system will be decommissioned on 1st August 2024. After this date, the old data will no longer be available, and we will not be able to retrieve it.

How to transfer data

There are many ways to copy files from one folder to another, but not all are appropriate in this case. cp is not advisable as it is generally very slow. The best option is probably rsync, as it is interruptible and resumable and can verify data integrity; however, it may not be the fastest solution for very large directory trees, where a tar | tar pipeline is likely to be faster.

rsync commands:

# start from your old data area on the read-only migration mount
cd /migration/<projectname>/$USER
# -a preserves permissions and timestamps, -v is verbose, -h prints human-readable sizes,
# -P shows progress and keeps partial files so an interrupted copy can be resumed
rsync -avhP . /data/<projectname>/$USER/
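After the copy has finished, an optional way to confirm that nothing was missed is to repeat the rsync as a checksum dry run; only files that still differ will be listed. This check can take a long time on large trees:

cd /migration/<projectname>/$USER
# -n = dry run (nothing is copied), -c = compare file checksums
rsync -avhnc . /data/<projectname>/$USER/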

tar commands:

# start from your old data area on the read-only migration mount
cd /migration/<projectname>/$USER
# stream the directory through tar and unpack it directly into the new location
tar cvf - . | tar xf - -C /data/<projectname>/$USER/

Note

The above commands can be run from an NX GUI session, an interactive session, or submitted as a job on the cluster. Please DO NOT run them from the login nodes, as they will create unnecessary load on these systems and, as a result, take much longer to complete.
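For example, an interactive session on a compute node can be requested with something like the following (the partition name here is only an example; check the ARC user guide for the interactive partitions available on your cluster):

# request an interactive shell on a compute node (partition name is an example)
srun -p interactive --pty /bin/bash
# then run the rsync or tar commands above from within that shell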

An example submission script would look something like:

#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=2
#SBATCH --partition=short
#SBATCH --job-name=Data_migration

# rsync needs no additional software modules; start from a clean environment
module purge

# change the value of `MYPROJECT` to the project you want to migrate
export MYPROJECT="engs-example"

cd /migration/$MYPROJECT/$USER
rsync -avhP . /data/$MYPROJECT/$USER/
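Save the script under a name of your choice (the filename below is just an example) and submit it with sbatch; you can then follow its progress with squeue:

# submit the migration job
sbatch migrate_data.sh
# check that the job is queued or running
squeue -u $USER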

Warning

Be careful when using a cluster job, and especially when copying in an interactive session: the partition time limit might interrupt your transfer before it is complete. If that happens, re-running the same rsync command will resume the copy rather than starting from scratch.
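If you are unsure how long you have, SLURM can report the time limit of a partition; you can also request a specific wall time within that limit via the standard #SBATCH --time directive in your submission script. A minimal check might look like:

# show the maximum wall time of the short partition (%P = partition, %l = time limit)
sinfo -p short -o "%P %l"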

It is of course also possible to transfer only certain subdirectories, or (especially when using rsync) to exclude certain subdirectories from the copy process; a sketch is shown below. Please refer to the ‘rsync’ or ‘tar’ man pages for details, or ask the ARC team for assistance.
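For example (the subdirectory names here are purely illustrative):

cd /migration/<projectname>/$USER
# copy a single subdirectory only
rsync -avhP results/ /data/<projectname>/$USER/results/
# copy everything except a subdirectory you no longer need
rsync -avhP --exclude='old_runs/' . /data/<projectname>/$USER/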