SciLo: Long Term Data Archiving
SciLo is a long-term archiving service at ACCRE based on the Spectra Logic BlackPearl converged storage solution. With SciLo you can archive data at very low cost and with minimal system administrator intervention. Moving data from ACCRE (or anywhere) to SciLo and back is accomplished with the command line client (`ds3_java_cli`) or, if you are using the portal, an available GUI (`dsb-gui`). Other options include Cyberduck and an API with a number of SDKs, depending on your expertise. All of this is based on Spectra S3, which uses the standard HTTP S3 command set plus expanded commands designed to optimize moving data objects to and from tape.
ACCRE provides two scripts to help users run upload and download jobs. The scripts wrap the `ds3_java_cli` command, so users don't need to work out the details of feeding the correct file path information to `ds3_java_cli`.
Getting Started
Initial sign-up for SciLo requires creation of an account on our BlackPearl and issuance of an ID and key. Open a helpdesk ticket with ACCRE requesting access. You will receive confirmation of account setup along with your ID and secret key.
Recommended Server
Archiving can take some time, so running on a standard gateway will not work. If your group has a custom gateway connected to the cluster, you can use that (we recommend using `screen` or `tmux` so that you can log out and back in later).
We do have a gateway dedicated to archiving for cases where you don't have a custom gateway to use. When your access is created, your login credentials will work on that gateway, and your ticket will be updated with that server's login information.
Environment
Once you have access you will want to add this information to your environment. The environment variables `$DS3_ACCESS_KEY`, `$DS3_SECRET_KEY`, and `$DS3_ENDPOINT` are all special variables that `ds3_java_cli` uses by default. They can be overridden with options (see `-a`, `-k`, and `-e` below) if that fits better into your workflow.
~ $ export DS3_ACCESS_KEY=<Assigned s3 id>
~ $ export DS3_SECRET_KEY=<Assigned secret s3 key>
~ $ export DS3_ENDPOINT=archive1.accre.vanderbilt.edu
~ $ export s3bucket=<Assigned bucket>
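Once the variables are exported, a quick sanity check (assuming the `scilo-cli` module described below has been loaded) is to query the appliance; both `system_information` and `get_service` are listed in the command table further down:

```shell
# Confirm credentials and connectivity: prints software version, build, and serial number
ds3_java_cli -c system_information

# List the buckets visible to your account
ds3_java_cli -c get_service
```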
User Scripts
The scripts provide the following features to help the user run upload/download jobs:
- Checksum verification. During an upload job, a checksum is computed from each original file and stored in the same bucket for later verification. During a download job, the script uses this checksum data to verify that the downloaded file is not corrupted.
- Parallel execution across multiple cores. Both upload and download jobs can be parallelized: each core handles one job, and the files to transfer are distributed among the jobs. The jobs are launched independently, so one job's failure does not affect the others.
- A log folder for reruns. Upload/download jobs can break for a variety of reasons, so the script keeps a log folder recording which jobs finished successfully. With this log information, a rerun can pick up the unfinished jobs and avoid repeating completed work.
- Local disk as scratch space to avoid GPFS fluctuation. For both uploads and downloads, you specify a local directory as a scratch directory (`/tmp`, for example). The job copies files to the scratch directory and performs the transfer from there, making the process independent of GPFS fluctuation and more robust.
- Multiple local disks as scratch directories. Both scripts can use multiple local disks as scratch directories, which increases efficiency and stability for multi-core parallel jobs.
- Transfers work for files on any ACCRE-supported storage, including local storage (for example, local folders on custom gateways), `/home`, `/data`, `/scratch`, `/dors`, NFS, LStore, etc. For files on LStore, the upload job recursively scans all files within the given directory. For other storage, such as GPFS, the upload job uploads only the files directly in the given directory (sub-directories are currently omitted).
For upload and download jobs, one important note is to make sure there is enough scratch space. For example, if the largest file to upload is about 1 TB and the job will use 6 cores, the best option is 6 local disks, each larger than 1 TB, so that the I/O burden is distributed evenly across the disks. Otherwise, make sure the single local scratch directory is larger than 6 TB, so that when all 6 upload processes run together the scratch space can hold all of the data.
At ACCRE, the best option for SciLo jobs is a custom gateway: the jobs can use its large local disks, and there is no time limit. Make sure the scratch directory is not on shared storage, such as GPFS or NFS folders on the node. If you prefer to run SciLo tasks in a Slurm job, allocate enough time for the job and use the local `/tmp` folder as the scratch directory.
For both upload and download jobs the script needs a folder for log-related data. Because the script uses the `parallel` command to execute jobs concurrently, after the job is done you can check the *.par files in the log folder; these are the log files for the individual upload or download jobs.
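The exact contents of the *.par files depend on the job, but a simple way to spot failed transfers is to grep the per-job logs. The directory and messages below are mock data for illustration only:

```shell
# Build a mock log directory to illustrate the pattern (paths are illustrative)
mkdir -p /tmp/scilo_logs_demo
printf 'upload finished ok\n'     > /tmp/scilo_logs_demo/job1.par
printf 'ERROR: transfer failed\n' > /tmp/scilo_logs_demo/job2.par

# List the per-job logs that mention an error (case-insensitive filename listing)
grep -il error /tmp/scilo_logs_demo/*.par
```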
Uploading Files to SciLo
The script `scilo_put.sh` performs the upload to a SciLo bucket. It's available in the path `/accre/common/bin`. Before uploading, make sure SciLo is set up correctly in your home directory: ACCRE places a `.s3keys` file there containing the access key and other information needed to reach SciLo. Before the upload starts, the script tests for the existence of the `.s3keys` file in your home directory and verifies that it can access the given bucket and retrieve information.
The upload script takes the following parameters:

- `-p` (`--path`): the file path to archive
- `-b` (`--bucket`): SciLo bucket name for the archive
- `-i` (`--islstore`): whether the input path is an LStore path
- `-s` (`--scratch`): scratch directory (or directories) for the work
- `-n` (`--ncores`): how many cores to use. You need to specify a value; each core handles one upload job.
- `-l` (`--log`): log directory for progress information and checksum files. Please reuse the same log directory so that a rerun can pick up the unfinished jobs and avoid repeating completed uploads.
Here is an example of running an upload job:
scilo_put.sh -p (path for upload) -b (bucket name) -s (scratch path) -n (ncores) -l (log dir)
If the upload is for LStore files, be sure to include the `-i` option, and give the LStore path without the `/lio/lfs` prefix (for example, if the LStore path is `/lio/lfs/testing/test`, the path for the `-p` option should be `/testing/test`).
Here is an example of a job using multiple local disks as scratch directories:
scilo_put.sh -p /testing/test -i -b test -s /mnt/d1,/mnt/d2,/mnt/d3 -n 3 -l $HOME/test
This example uploads the data files in the LStore folder `/lio/lfs/testing/test` to the "test" bucket using 3 cores and 3 folders (`/mnt/d1`, `/mnt/d2`, and `/mnt/d3`) on local disks. The log folder is `$HOME/test`.
This example shows an upload job for data files on GPFS:
scilo_put.sh -p /home/xyz/test -b test -s /tmp -n 3 -l $HOME/scilo_logs
This example uploads files from the `/home` folder of user `xyz` to the `test` bucket using 3 cores and the single local folder `/tmp`. The log folder is `$HOME/scilo_logs`.
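Because the log directory records finished jobs, recovering from an interruption is just a matter of repeating the same command with the same `-l` directory. For the GPFS example above, that would be:

```shell
# Rerun after an interruption: files already logged as uploaded are skipped
scilo_put.sh -p /home/xyz/test -b test -s /tmp -n 3 -l $HOME/scilo_logs
```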
Downloading Files from SciLo
Unlike the upload job, the download job is divided into two steps. In step 1 the script generates the download file list from the bucket. Because files from different folders are stored together in the same bucket, and you may only want to download some of them, the script pulls a file list from the bucket; you can edit this list and use it for the actual download.
The download script is `scilo_get.sh`, located in the same folder, `/accre/common/bin`. The available options for a download job from `scilo_get.sh` are as follows:
- `-f` (`--file`): the full path of the file holding the list of files to download from SciLo
- `-g` (`--generate`): generate the download file list without downloading
- `-b` (`--bucket`): SciLo bucket name for the archive
- `-s` (`--scratch`): scratch directory for downloading files from SciLo
- `-n` (`--ncores`): how many cores to use; you need to specify a value
- `-o` (`--output`): output directory for the downloaded files
- `-l` (`--log`): log directory for progress information. Please reuse the same log directory so that a rerun can pick up the unfinished jobs and avoid repeating completed downloads.
To generate the download file list, run the script as:
scilo_get.sh -f (file list name) -b (bucket name) -o (the downloaded file path) -g
In this step, the script pulls all of the available data files on the bucket and writes the information into the given file list. One important note: the script also needs to know the final download path. Although this information is not used in this step, supplying it here keeps all of the download job information together.
For example,
scilo_get.sh -f files.txt -b test -o /home/xyz/test -g
will generate a file `files.txt` in the current directory containing the data files from bucket `test`. An example of this file list is as follows:
# we use NONE to show that the data file does not have corresponding checksum file, please do not delete it!!
# file_name_from_tape corresponding_check_sum_information output_dir_for_downloaded_file
home/xyz/qos.txt NONE /home/xyz/test
ssss/testing_archive/testing/test/pindel.TWAM-App282Req70_00194.chr1.tar.gz ssss/testing_archive/testing/test/pindel.TWAM-App282Req70_00194.chr1.tar.gz.chksum=md5sum=620a27e00ef50374bdf38de8625d5013 /home/xyz/test
yyyy/testing_archive/testing/test/pindel.TWAM-App282Req70_00194.chr11.tar.gz yyyy/testing_archive/testing/test/pindel.TWAM-App282Req70_00194.chr11.tar.gz.chksum=md5sum=ce13593823e4800e4e5c26a1fc007aba /home/xyz/test
In the example above, the data is divided into three columns. The first and second columns are the original data file path and the corresponding checksum information generated by the upload job. If a data file has no checksum information, the script uses `NONE` as a label, and the download process skips checksum verification for that file. Make sure the first and second columns are NOT changed; otherwise the download process will report an error.
The third column is the download directory. In this example all of the files are downloaded to `/home/xyz/test`, but different files can go to different folders: you can modify the output directory for each data file in the third column.
If the bucket holds a large amount of archived data, the generated file list can be very large too, since it contains all of the available files in the bucket. Delete the lines for files you do not want to download; `scilo_get.sh` will only download the files in the given list.
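Because the list is plain text with three whitespace-separated columns, standard tools can trim it. This sketch builds an illustrative sample list and keeps only the entries under a given archive prefix, preserving the comment lines:

```shell
# Build a small sample list in the three-column format shown above (paths are illustrative)
cat > /tmp/files_demo.txt <<'EOF'
# file_name_from_tape corresponding_check_sum_information output_dir_for_downloaded_file
home/xyz/qos.txt NONE /home/xyz/test
ssss/testing_archive/testing/test/a.tar.gz NONE /home/xyz/test
yyyy/other_project/b.tar.gz NONE /home/xyz/test
EOF

# Keep comment lines plus entries whose archive path contains testing_archive/
awk '/^#/ || $1 ~ /testing_archive\//' /tmp/files_demo.txt > /tmp/files_keep.txt
cat /tmp/files_keep.txt
```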
After the file list is ready, the second download step is as follows:
scilo_get.sh -f (file list path) -b (bucket name) -s (scratch path) -n (ncores) -l (log dir)
This step runs the same way as the upload process: you can provide multiple scratch directories and multiple cores to parallelize the download.
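Mirroring the upload examples above, an illustrative step-2 invocation (the bucket name, paths, and core count are placeholders) could look like:

```shell
# Download the files listed in files.txt from bucket "test",
# using 2 cores and 2 local scratch disks, reusing the same log folder for reruns
scilo_get.sh -f $HOME/files.txt -b test -s /mnt/d1,/mnt/d2 -n 2 -l $HOME/scilo_logs
```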
Load the Software
The command line client (as well as the GUI, if you are on the portal) is in our Lmod setup. Execute the following to get it into your environment:
~$ module load GCC
~$ module load scilo-cli #for the command line ds3_java_cli
~$ module load scilo-gui # loads the gui for portal or X11 forwarding dsp-gui
Command Line Parameters
The command line interface is `ds3_java_cli`:
~$ ds3_java_cli --help #displays a general help listing
~$ ds3_java_cli -c get_service #get a list of available buckets
+-------------------------------------------------------+--------------------------+
| Bucket Name | Creation Date |
+-------------------------------------------------------+--------------------------+
| my_bucket | 2019-03-07T00:08:24.000Z |
+-------------------------------------------------------+--------------------------+
~$ ds3_java_cli --http -c put_bulk -b mybucket -p /home/myusername/ -d /home/myusername/archivedirectory/ --sync -nt 6 --checksum
usage: ds3_java_cli
Option | Option Help |
---|---|
-a | Access Key ID, or have "DS3_ACCESS_KEY" set as an environment variable |
-bs | Set the buffer size in bytes. The default is 1MB |
-c | The command to execute. For possible values, use '--help list_commands'. |
--debug | Debug (more verbose) output to console. |
-e | The ds3 endpoint to connect to or have "DS3_ENDPOINT" set as an environment variable. |
-h | Help Menu |
--help | Command Help (provide command name from -c) |
--http | Send all requests over standard HTTP |
--insecure | Ignore ssl certificate verification |
-k | Secret access key or have "DS3_SECRET_KEY" set as an environment variable |
--log-debug | Debug (more verbose) output to log file. |
--log-trace | Trace (most verbose) output to log file. |
--log-verbose | Log output to log file. |
--output-format | Configure how the output should be displayed. Possible values: [cli, json] |
-r | Specifies how many times puts and gets will be attempted before failing the request. The default is 5 |
--trace | Trace (most verbose) output to console. |
--verbose | Log output to console. |
--version | Print version information |
-x | The URL of the PROXY server to use or have "http_proxy" set as an environment variable |
Generally the ds3_java_cli
follows this example:
ds3_java_cli -e <endpoint> -a <access key> -k <secret key> --http -c <command> -o <object, if used by command> -b <bucket, if used by command>
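As a concrete instance of this pattern, a bulk restore (the bucket name and paths are illustrative; the options used are those documented for get_bulk in the command table) could look like:

```shell
# Restore everything under myprefix from mybucket into a local directory,
# using 6 threads and fetching only files that are newer or missing locally
ds3_java_cli --http -c get_bulk -b mybucket -p myprefix -d /home/myusername/restore/ --sync -nt 6
```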
Available Commands
Command | Command Help |
---|---|
delete_bucket | Deletes an empty bucket. Requires the '-b' parameter to specify bucket (by name or UUID). Use the '--force' flag to delete a bucket and all its contents. Use the get_service command to retrieve a list of buckets |
delete_folder | Deletes a folder and all its contents. Requires the '-b' parameter to specify bucket (name or UUID). Requires the '-d' parameter to specify folder name |
delete_job | Terminates and removes a current job. Requires the '-i' parameter with the UUID of the job. Use the '--force' flag to remove objects already loaded into cache. Use the get_jobs command to retrieve a list of jobs |
delete_object | Permanently deletes an object. Requires the '-b' parameter to specify bucket name. Requires the '-i' parameter to specify object name (UUID or name). Use the get_service command to retrieve a list of buckets. Use the get_bucket command to retrieve a list of objects |
delete_tape | Deletes the specified tape which has been permanently lost from the BlackPearl database. Any data lost as a result is marked degraded to trigger a rebuild. Requires the '-i' parameter to specify tape ID (UUID or barcode). Use the get_tapes command to retrieve a list of tapes |
delete_tape_drive | Deletes the specified offline tape drive. This request is useful when a tape drive is permanently removed from a partition. Requires the '-i' parameter to specify tape drive ID. Use the get_tape_drives command to retrieve a list of tapes |
delete_tape_failure | Deletes a tape failure from the failure list. Requires the '-i' parameter to specify tape failure ID (UUID). Use the get_tape_failure command to retrieve a list of IDs |
delete_tape_partition | Deletes the specified offline tape partition from the BlackPearl gateway configuration. Any tapes in the partition that have data on them are disassociated from the partition. Any tapes without data on them, and all tape drives associated with the partition, are deleted from the BlackPearl gateway configuration. This request is useful if the partition should never have been associated with the BlackPearl gateway or if the partition was deleted from the library. Requires the '-i' parameter to specify tape partition |
get_bucket | Returns bucket details plus a list of objects contained. Requires the '-b' parameter to specify bucket name or UUID. Use the get_service command to retrieve a list of buckets |
get_bulk | Retrieve multiple objects from a bucket. Requires the '-b' parameter to specify bucket (name or UUID). Optional '-d' parameter to specify restore directory (default '.'). Optional '-p' parameter to specify prefix or directory name; separate multiple values with spaces, e.g., -p prefix1 prefix2. Optional '--sync' flag to retrieve only newer or non-extant files. Optional '--file-metadata' flag restores file metadata to the values extant when archived. Optional '-nt' parameter to specify number of threads |
system_information | Retrieves basic system information: software version, build, and system serial number. Useful to test communication |
get_config_summary | Runs multiple commands to capture configuration information |
get_data_policy | Returns information about the specified data policy. Requires the '-i' parameter to specify data policy (UUID or name). Use the get_data_policies command to retrieve a list of policies |
get_data_policies | Returns information about all data policies on the device. Use the get_data_policy command with the '-i' parameter to inspect a single policy |
get_job | Retrieves information about a current job. Requires the '-i' parameter with the UUID of the job. Use the get_jobs command to retrieve a list of jobs |
get_jobs | Retrieves a list of all current jobs |
get_object | Retrieves a single object from a bucket. Requires the '-b' parameter to specify bucket (name or UUID). Requires the '-o' parameter to specify object (name or UUID). Optional '-d' parameter to specify restore directory (default '.'). Optional '--sync' flag to retrieve only newer or non-extant files. Optional '--file-metadata' flag restores file metadata to the values extant when archived. Optional '-nt' parameter to specify number of threads. Use the get_service command to retrieve a list of buckets. Use the get_bucket command to retrieve a list of objects |
get_objects_on_tape | Returns a list of the contents of a single tape. Requires the '-i' parameter to specify tape (barcode or UUID). Use the get_tapes command to retrieve a list of tapes |
get_physical_placement | Returns the location of a single object on tape. Requires the '-b' parameter to specify bucket (name or UUID). Requires the '-o' parameter to specify object (name or UUID). Use the get_service command to retrieve a list of buckets. Use the get_bucket command to retrieve a list of objects |
get_service | Returns a list of buckets on the device |
get_tape_failure | Returns a list of tape failures |
get_tapes | Returns a list of all tapes |
get_user | Returns information about an individual user. Requires the '-i' parameter to specify user (name or UUID). Use the get_users command to retrieve a list of users |
get_users | Returns a list of all users |
head_object | Returns metadata but does not retrieve an object from a bucket. Requires the '-b' parameter to specify bucket (name or UUID). Requires the '-o' parameter to specify object (name or UUID). Useful to determine if an object exists and you have permission to access it |
modify_data_policy | Alter parameters for the specified data policy. Requires the '-i' parameter to specify data policy (UUID or name). Requires the '--modify-params' parameter to be set; use key:value pairs key:value,key2:value2,... Legal values: name, checksum_type, default_blob_size, default_get_job_priority, default_put_job_priority, default_verify_job_priority, rebuild_priority, end_to_end_crc_required, versioning (see API documentation for possible values). Use the get_data_policies command to retrieve a list of policies and current values |
modify_user | Alters information about an individual user. Requires the '-i' parameter to specify user (name or UUID). Requires the '--modify-params' parameter to be set; use key:value pairs key:value,key2:value2,... Legal values: default_data_policy_id. Use the get_users command to retrieve a list of users |
performance | For internal testing. Generates mock file streams for put, and a discard (/dev/null) stream for get. Useful for testing network and system performance. Requires the '-b' parameter with a unique bucket name to be used for the test. Requires the '-n' parameter with the number of files to be used for the test. Requires the '-s' parameter with the size of each file in MB for the test. Optional '-bs' parameter with the buffer size in bytes (default 1MB). Optional '-nt' parameter with the number of threads |
put_bucket | Create a new empty bucket. Requires the '-b' parameter to specify bucket name |
put_bulk | Put multiple objects from a directory or pipe into a bucket. Requires the '-b' parameter to specify bucket (name or UUID). Requires the '-d' parameter (unless \") to specify source directory. Optional '-p' parameter (unless \" ) to specify prefix or directory name. Optional '--sync' flag to put only newer or non-extant files. Optional '--file-metadata' flag archives file metadata with files. Optional '-nt' parameter to specify number of threads. Optional '--ignore-errors' flag to continue on errors. Optional '--follow-symlinks' flag to follow symlink (default is disregard) |
reclaim_cache | Forces a full reclaim of all caches, and waits until the reclaim completes. Cache contents that need to be retained because they are part of an active job are retained. Any cache contents that can be reclaimed will be. This operation may take a very long time to complete, depending on how much of the cache can be reclaimed and how many blobs the cache is managing |
verify_bulk_job | A verify job reads data from the permanent data store and verifies that the CRC of the data read matches the expected CRC. Verify jobs ALWAYS read from the data store, even if the data currently resides in cache. Requires the '-b' parameter to specify bucket (name or UUID). Requires the '-o' parameter to specify object (name or UUID). Optional '-p' parameter to specify prefix or directory name |
get_data_path_backend | Gets configuration information about the data path backend |
get_cache_state | Gets the utilization and state information for all cache filesystems |
get_system_failure | |
get_capacity_summary | Get a summary of the BlackPearl Deep Storage Gateway system-wide capacity |
verify_system_health | Verifies that the system appears to be online and functioning normally and that there is adequate free space for the database file system |
verify_all_tapes | Verify the integrity of all the tapes in the BlackPearl |
verify_tape | |
get_suspect_objects | |
get_suspect_blob_tapes | |
modify_data_path | |
verify_pool | |
verify_all_pools | |
get_detailed_objects | Filter an object list by size or creation date. Returns one line for each object. Optional '-b' bucket_name. Optional '--filter-params' to filter results; use key:value pairs key:value,key2:value2,... Legal values: newerthan, olderthan (relative date from now in format d1.h2.m3.s4; zero values can be omitted, separate with '.'); before, after (absolute UTC date in format Y2016.M11.D9.h12.ZPDT; zero values or the UTC time zone can be omitted, separate with '.'); owner (owner name); contains (string to match in object name); largerthan, smallerthan (object size in bytes). Note: bucket will restrict values returned; filter-params will transfer a (potentially large) object list and filter client-side |
get_detailed_objects_physical | Get a list of objects on tape, filtered by size or creation date. Returns one line for each instance on tape. Takes the same optional '-b' and '--filter-params' parameters and legal values as get_detailed_objects |
eject_storage_domain | Ejects all eligible tapes within the specified storage domain. Tapes are not eligible for ejection if mediaEjectionAllowed=FALSE for the storage domain. If a tape is being used for a job, it is ejected once it is no longer in use. Use the get_storage_domains command to retrieve a list of storage domains |
get_storage_domains | Get information about all storage domains. Optional '-i' (UUID or name) restricts output to one storage domain. Optional '--writeOptimization' (capacity or performance) filters results to those matching write optimization |
get_tape | Returns information on a single tape. If the tape has been ejected, the ejection information will also be displayed. Required '-i' tape barcode or ID |
get_bucket_details | Returns bucket details by either UUID or bucket name. Requires the '-b' parameter to specify bucket name or UUID. Useful to get name by ID or ID by name. Use the get_service command to retrieve a list of buckets |
eject_tape | Ejects the tape uniquely identified by ID. Tapes are not eligible for ejection if mediaEjectionAllowed=FALSE for the storage domain. If a tape is being used for a job, it is ejected once it is no longer in use. Use the get_tapes command or get_detailed_objects_physical to find tape id |
modify_job | |
recover_put_bulk | Recovers a put_bulk job. Requires the '-b' parameter to specify bucket (name or UUID). Requires the '-d' parameter (unless piped) to specify source directory. Requires the '-i' parameter with the UUID for the interrupted or failed job. Optional '--file-metadata' flag archives file metadata with files. Other parameters should match the original put_bulk |
recover_get_bulk | Recovers a get_bulk job. Requires the '-b' parameter to specify bucket (name or UUID). Requires the '-i' parameter with the UUID for the interrupted or failed job. Optional '--file-metadata' flag restores file metadata to the values extant when archived. Other parameters should match the original get_bulk |
cancel_verify_all_tapes | Cancel a previous request to verify all the tapes in the DS3 appliance |
cancel_verify_tape | Cancel a previous request to verify a tape in the DS3 appliance. Required '-i' tape id (barcode, name, or UUID) |
get_pools | Returns all pools matching option filter criteria |
get_pool | Returns information on a single pool. Required '-i' pool name or ID |
cancel_verify_pool | Cancel a previous request to verify a pool in the DS3 appliance. Required '-i' pool id (name or UUID) |
cancel_verify_all_pools | Cancel a previous request to verify all the pools in the DS3 appliance |
recover | Recover a failed or interrupted put_bulk or get_bulk job using recover files. Recover files are written to temp space on put_bulk and get_bulk |