Deleting All Data from Amazon Glacier

Introduction

Glacier is a good service. However, deleting its data is extremely tedious. Archives cannot be deleted from the AWS Management Console; you have to send deletion requests through the CLI or API. Furthermore, a vault (roughly the equivalent of a bucket) cannot be deleted while any data remains in it. Deleting data from Glacier is truly a hassle.

Deleting the archives by running CLI commands by hand would have been possible. However, the target vault this time contained 2.5 TB of NAS backup data in more than 300,000 archives, so I wrote and used a Python script instead. This article explains how to use that script. The details of the script itself will be covered in a separate article.

Data Deletion Flow

The steps for deleting data are shown below.

In outline, the process is to request a list of the archives in the vault from AWS and then delete the archives on that list. Note, however, that the list only becomes available for download several hours after it is requested.

Data Deletion Work

In this chapter, I will explain the prerequisites for the data (archive) deletion work, the actual procedure, and the Python script used in this task.

Prerequisites

It is assumed that the following are installed and functioning correctly:

  • AWS CLI (with credentials already configured)
  • An environment that can run Python 3

Retrieving the Archive List

In this task, we will use the AWS CLI to retrieve the archive list. To retrieve the archive list using the AWS CLI, the following information is required:

  • Vault name
  • AWS account ID

Many people may not know their AWS account ID, since it is a value we rarely need to look at. (In fact, I didn't really know mine either.) If that is the case, please refer to the following documentation:

https://docs.aws.amazon.com/ja_jp/accounts/latest/reference/manage-acct-identifiers.html
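
If you would rather look the account ID up programmatically, a minimal sketch is shown below. It assumes boto3 is installed and that `sts_client` is created with `boto3.client("sts")`; the function name `current_account_id` is made up for this example and is not part of the script introduced later.

```python
def current_account_id(sts_client):
    """Return the account ID that owns the credentials behind sts_client.

    sts_client is assumed to be a boto3 STS client, e.g. boto3.client("sts").
    """
    # GetCallerIdentity reports which account, user, and ARN the current
    # credentials belong to; only the account ID is needed here.
    identity = sts_client.get_caller_identity()
    return identity["Account"]
```

With boto3 available, `current_account_id(boto3.client("sts"))` should return the 12-digit ID as a string. Glacier also accepts a single hyphen (`-`) as the account ID, in which case it uses the account associated with the credentials that signed the request.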

After gathering the necessary information, you can request the archive list by executing the following command:

aws glacier initiate-job --vault-name {vault-name} --account-id {aws-account-id} --job-parameters="{\"Type\":\"inventory-retrieval\"}"

After execution, a JSON object containing the job ID will be displayed, so make a note of the job ID. You will need it in the following steps.
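
The same request can also be sent from Python. The following is only a sketch, assuming boto3 is installed and `glacier_client` comes from `boto3.client("glacier")`; the helper name `request_inventory` is hypothetical and is not taken from the script used in this article.

```python
def request_inventory(glacier_client, vault_name, account_id="-"):
    """Start an inventory-retrieval job and return its job ID.

    glacier_client is assumed to be boto3.client("glacier").
    account_id="-" tells Glacier to use the account that owns the
    credentials used to sign the request.
    """
    response = glacier_client.initiate_job(
        accountId=account_id,
        vaultName=vault_name,
        jobParameters={"Type": "inventory-retrieval"},
    )
    # The response carries the ID needed later to poll the job
    # and to download its output.
    return response["jobId"]
```

Returning the job ID directly avoids having to copy it out of the terminal by hand.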

As mentioned earlier, retrieving the archive list takes a long time, so it's convenient to set up notifications. For more details, please see the following:

https://docs.aws.amazon.com/ja_jp/amazonglacier/latest/dev/configuring-notifications.html
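
As a rough illustration of what that configuration involves, the sketch below attaches an SNS topic to the vault so that a message is published when an inventory job completes. It assumes boto3 is installed, that `glacier_client` comes from `boto3.client("glacier")`, and that an SNS topic which Glacier is allowed to publish to already exists; the helper name and its arguments are hypothetical.

```python
def notify_on_inventory_ready(glacier_client, vault_name, sns_topic_arn,
                              account_id="-"):
    """Ask Glacier to publish to an SNS topic when an inventory job completes.

    sns_topic_arn is assumed to be the ARN of an existing topic whose
    access policy lets Glacier publish to it.
    """
    config = {
        "SNSTopic": sns_topic_arn,
        # Only inventory jobs matter for this workflow.
        "Events": ["InventoryRetrievalCompleted"],
    }
    glacier_client.set_vault_notifications(
        accountId=account_id,
        vaultName=vault_name,
        vaultNotificationConfig=config,
    )
    return config
```

Subscribing an email address to the topic is then enough to be told when the archive list is ready.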

How to Check the Completion of the Archive List Retrieval Process

The archive list request command returns immediately after submitting the request; it does not wait for the job to finish. To check the job's current status, execute the following command:

aws glacier describe-job --vault-name {vault-name} --account-id {aws-account-id} --job-id {JobId}

The output looks roughly like the following, although the exact format may have changed since then.

{
    "InventoryRetrievalParameters": {
        "Format": "JSON"
    }, 
    "VaultARN": "*** vault arn ***", 
    "Completed": false, 
    "JobId": "*** jobid ***", 
    "Action": "InventoryRetrieval", 
    "CreationDate": "*** job creation date ***", 
    "StatusCode": "InProgress"
}

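Check StatusCode to determine whether the job has finished. While the job is running it shows InProgress; once the job completes successfully, StatusCode becomes Succeeded and Completed becomes true. If the job fails, StatusCode becomes Failed.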
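
If you do not want to rerun the command by hand, the check can be wrapped in a polling loop. This is a minimal sketch, assuming boto3 is installed and `glacier_client` comes from `boto3.client("glacier")`; the function name, the 15-minute default interval, and the injectable `sleep` parameter are choices made for this example only.

```python
import time


def wait_for_job(glacier_client, vault_name, job_id, account_id="-",
                 interval_seconds=900, sleep=time.sleep):
    """Poll describe_job until the job finishes, then return its description.

    Raises RuntimeError if the job finishes with a status other than
    "Succeeded".
    """
    while True:
        job = glacier_client.describe_job(
            accountId=account_id,
            vaultName=vault_name,
            jobId=job_id,
        )
        if job["Completed"]:
            if job["StatusCode"] != "Succeeded":
                message = job.get("StatusMessage", "no status message")
                raise RuntimeError(f"job {job_id} did not succeed: {message}")
            return job
        # Inventory jobs take hours, so there is no point in polling often.
        sleep(interval_seconds)
```

Because the job takes several hours, a long interval keeps the number of API calls small. The `sleep` parameter exists only so the loop can be exercised without actually waiting.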

Downloading the Archive List

The Python script used in this process performs deletion requests based on the archive list JSON. Therefore, you need to download the archive list.

Use the following command to download the archive list:

aws glacier get-job-output --vault-name {vault-name} --account-id {aws-account-id} --job-id {JobId} output.json

After executing the command, output.json should be saved in your current directory. If the number of archives is large (e.g., 300,000 or more), do not mistakenly try to open it with vi or similar editors. It will definitely freeze. (Learned it the hard way.)
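
To get a feel for the contents without opening the file in an editor, a few lines of Python are enough. The sketch below assumes the inventory is in the JSON format, where the archives sit under the `ArchiveList` key and each entry has a `Size` in bytes; the function name `summarize_inventory` is made up for this example.

```python
import json


def summarize_inventory(path):
    """Return (archive_count, total_size_in_bytes) for an inventory file."""
    with open(path, encoding="utf-8") as f:
        inventory = json.load(f)
    archives = inventory["ArchiveList"]
    # Each entry describes one archive; Size is its size in bytes.
    total_bytes = sum(archive["Size"] for archive in archives)
    return len(archives), total_bytes
```

Calling `summarize_inventory("output.json")` loads the whole file into memory, which should still be manageable for a few hundred thousand entries, and returns just two numbers instead of rendering the entire file.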

Archive Deletion

This section explains how to delete archives from Glacier using the Python script I used for this task.

Cloning the Repository

Clone the following repository to your local machine:

https://github.com/ksatoshi/glacier-deleter

git clone https://github.com/ksatoshi/glacier-deleter.git

Running the Script

First, here are the steps for using an already installed version of Python 3.

When using an already installed version of Python 3

# Initial setup
cd glacier-deleter
pip install boto3
python ./main.py

# After running the script, prompts will appear asking for the vault name, target region, and the path where the archive list is saved.
vault name>> {vault-name}
AWS region (default: ap-northeast-1)>> {target-region}
file path>> {path-to-archive-list}

When using mise and uv

# Initial setup
cd glacier-deleter
mise trust
mise install
uv sync
uv run main.py

# After running the script, prompts will appear asking for the vault name, target region, and the path where the archive list is saved.
vault name>> {vault-name}
AWS region (default: ap-northeast-1)>> {target-region}
file path>> {path-to-archive-list}

While the script is running, the progress will be displayed as follows. (Example output):

# AWS response headers (omitted as there is no record of the actual output)
1/3000
# AWS response headers (omitted as there is no record of the actual output)
2/3000
# ----omitted----
# AWS response headers (omitted as there is no record of the actual output)
3000/3000
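
For readers who want to see the general idea before looking at the repository, the core of such a deletion loop can be sketched as follows. This is not the code from the repository, only an illustration written for this article; it assumes boto3 is installed, `glacier_client` comes from `boto3.client("glacier", region_name=...)`, and the archive list is in the JSON inventory format with the archives under `ArchiveList`.

```python
import json


def delete_archives(glacier_client, vault_name, inventory_path, account_id="-"):
    """Delete every archive listed in the inventory file; return the count."""
    with open(inventory_path, encoding="utf-8") as f:
        archives = json.load(f)["ArchiveList"]
    total = len(archives)
    for index, archive in enumerate(archives, start=1):
        # One DeleteArchive request per archive, issued sequentially.
        glacier_client.delete_archive(
            accountId=account_id,
            vaultName=vault_name,
            archiveId=archive["ArchiveId"],
        )
        print(f"{index}/{total}")
    return total
```

Each archive needs its own request, which is why a vault with hundreds of thousands of archives takes so long to empty.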

Glacier refreshes each vault's inventory roughly once a day, so the archive count shown in the AWS Management Console does not drop immediately. If you check the console the day after the script finishes, the number of archives should be 0 or - (hyphen). At that point you can finally delete the vault and stop paying to store unnecessary data.
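
The final check and the vault deletion can also be done from Python. The following is a sketch under the same assumptions as before (boto3 installed, `glacier_client` from `boto3.client("glacier")`); the helper name `delete_vault_if_empty` is hypothetical.

```python
def delete_vault_if_empty(glacier_client, vault_name, account_id="-"):
    """Delete the vault if its last inventory shows no archives.

    Returns True if a delete request was sent, False otherwise.
    """
    vault = glacier_client.describe_vault(
        accountId=account_id,
        vaultName=vault_name,
    )
    # NumberOfArchives reflects the most recent inventory, so it can lag
    # behind the actual deletions; it may also be missing or None when no
    # inventory has been taken yet.
    if vault.get("NumberOfArchives") != 0:
        return False
    glacier_client.delete_vault(accountId=account_id, vaultName=vault_name)
    return True
```

If the function returns False right after the deletions finish, waiting for the next inventory update and trying again is usually all that is needed.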

Conclusion

In this article, I introduced a method for deleting all Glacier archives using a custom script. Honestly, I don't think this is the absolute best method, but I am satisfied because it worked out in the end.

By the way, deleting the more than 300,000 archives took over 24 hours. The entire deletion process was run on an EC2 instance, and I take my hat off to the reliability of EC2 for keeping an SSH connection alive for more than 24 hours.

Finally, I found a screenshot of the archive count before deletion, so I'll conclude with that.
