What to Upload on AWS Storage Gateway
For AWS Storage Day 2020, we published a blog discussing how customers use AWS Storage Gateway (specifically, File Gateway) to upload individual files to Amazon S3. For some customers, these files constitute a larger logical set of data that they need to group for downstream processing. As mentioned in that blog, before the release of file upload notifications, customers had been unable to reliably initiate this processing based on individual file upload events. To demonstrate the implementation of this feature in the real world, we have created an AWS Cloud Development Kit (AWS CDK) application based on the file notification event processing solution described in our earlier blog.
In this blog post, we discuss how the AWS CDK application, available in this GitHub repository, enables you to leverage individual file upload events to group together uploaded datasets for downstream processing. The repository contains a comprehensive workshop on how to deploy and test the solution for common data vaulting use cases. By conducting the workshop, you can gain hands-on experience in implementing file upload notifications as part of a larger AWS application stack. You can use this knowledge, along with the code provided, to create your own data processing pipelines for use cases like backup and recovery.
Before proceeding, we recommend you read our previous blog to familiarize yourself with File Gateway file upload notifications. That blog post also covers the reference architecture that is the basis of the event processing flow implemented by this AWS CDK application.
AWS CDK application architecture
The following diagram illustrates the architecture for the application. It shows a data pipeline processing workflow that provides for the backup and recovery of critical business assets, for instance, moving data into a secure location on AWS.
AWS CDK application principles
For the example data vaulting use case, the AWS CDK application components work with the following principles:
- Logical datasets: A grouping of files and directories stored in a uniquely named folder on a File Gateway file share. These files represent a single logical dataset vaulted by the File Gateway to Amazon S3 and are treated as a single entity for the purposes of downstream processing. The files are copied from a source location that mounts the File Gateway file share using NFS or SMB.
- Logical dataset IDs: A unique string that identifies a specific logical dataset. This is part of the name of the root directory containing a single logical dataset created on a File Gateway file share. The dataset ID allows the event processing flow to distinguish between different vaulted datasets and reconcile each of them appropriately.
- Data files: All files that constitute a logical dataset. These are contained within a root logical dataset folder on a File Gateway file share. File upload notification events generated for data files are written, by the processing flow, to Amazon DynamoDB. Directories are treated as file objects for the purposes of uploads to Amazon S3 via File Gateway.
- Manifest files: A file, one per logical dataset, which contains a manifest listing all data files that constitute that specific logical dataset. The file copy process generates the manifest files as part of the data vaulting operation for a logical dataset. The processing flow uses it to compare against data file upload events written to a DynamoDB table. Once both of these data sources are identical, it signifies the File Gateway has completed uploading all files to Amazon S3 that constitute that logical dataset and the data vaulting operation has completed.
The processing flow implemented by this AWS CDK application contains the following mandatory, but configurable, parameters. These can be modified via AWS CDK context keys used by the application (described in detail in the workshop walkthrough). These parameters enable you to customize the directory name clients can use when vaulting data. They also enable you to customize how long to allow for the reconciliation of File Gateway file upload notifications as part of the vaulting process to Amazon S3 (a sketch of reading these context keys follows the list):
- Vault folder directory suffix name: The directory suffix name of the root folder containing a logical dataset copied to File Gateway. The processing flow uses this to identify which directories created on a File Gateway should be processed. Directories that do not end in this suffix are ignored by the processing flow.
- Manifest file suffix name: The suffix name for the logical dataset manifest file. The processing flow uses this to identify which file should be read to find out the list of files constituting the logical dataset. It is also used to reconcile against the file upload notification events received.
- Number of iterations in state machine: The number of attempts the file upload reconciliation state machine makes to reconcile the contents of the logical dataset manifest file with the file upload notification events received. Due to the asynchronous nature in which File Gateway uploads files to Amazon S3, a manifest file may be uploaded before all data files in that logical dataset, particularly for large datasets. Hence, iterating as part of the file upload reconciliation process is required.
- Wait time in state machine: The time, in seconds, to wait between each iteration of the file upload reconciliation state machine. The total time the state machine continues to attempt file upload reconciliation is a product of this parameter and the total number of iterations configured for the state machine (the preceding parameter).
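As an illustration of how such context keys can be consumed, the following Python CDK sketch reads the four parameters inside a stack. The key names and default values shown here are assumptions for illustration only; the actual names are defined by the application and documented in the workshop.

```python
from aws_cdk import Stack
from constructs import Construct

class EventProcessingStack(Stack):
    """Sketch only: reading the configurable parameters from CDK context."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Directory suffix marking a root vault folder (for example "-vaultjob").
        vault_suffix = self.node.try_get_context("vault-folder-suffix") or "-vaultjob"
        # Suffix used to recognize the logical dataset manifest file.
        manifest_suffix = self.node.try_get_context("manifest-file-suffix") or ".manifest"
        # Maximum reconciliation attempts, and wait (seconds) between attempts.
        max_iterations = int(self.node.try_get_context("state-machine-iterations") or 30)
        wait_seconds = int(self.node.try_get_context("state-machine-wait-time") or 60)
        # ... the rest of the stack would pass these values to its Lambda
        # functions and Step Functions state machine.
```

At deploy time, any such context key can then be overridden on the command line, for example with `cdk deploy -c <key>=<value>`.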
The following is an example logical dataset directory structure that a client would create on a File Gateway file share when vaulting a dataset:
```
[LOGICAL DATASET ID]-vaultjob                                   (root logical dataset directory)
[LOGICAL DATASET ID]-vaultjob/[DATA FILE][..]                   (data files at top level)
[LOGICAL DATASET ID]-vaultjob/[DIRECTORY][..]/[DATA FILE][..]   (data files at n levels)
[LOGICAL DATASET ID]-vaultjob/[LOGICAL DATASET ID].manifest     (recursive list of all files and directories)
```
The CDK application workshop provides scripts used during the walkthrough that will automatically create sample data and perform a data vaulting operation. The file copy process generates a logical dataset ID – the following is a portion of an example directory structure where the randomly generated dataset ID is DhoTdbmBHm3DfBWL:
```
DhoTdbmBHm3DfBWL-vaultjob/dir-ryN77APt1rIo
DhoTdbmBHm3DfBWL-vaultjob/dir-ryN77APt1rIo/file-E3l7u3XG
DhoTdbmBHm3DfBWL-vaultjob/dir-ydEDerqUGfCS
DhoTdbmBHm3DfBWL-vaultjob/dir-ydEDerqUGfCS/file-PsZ514Ug
[…]
DhoTdbmBHm3DfBWL-vaultjob/DhoTdbmBHm3DfBWL.manifest
```
In your own specific implementations, generate the logical dataset ID according to your required naming scheme. The contents of that dataset would be your own files and directories, as the sketch below illustrates for a toy example.
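The workshop scripts handle this for you, but to make the layout concrete, here is a minimal Python sketch (not the workshop's actual script) that generates a toy dataset and its manifest on a mounted File Gateway share. The mount point, ID format, and manifest format are assumptions.

```python
import os
import secrets
from pathlib import Path

# Illustrative sketch only: create a toy logical dataset plus its manifest on a
# mounted File Gateway share. The mount point and manifest format are assumptions.
share_root = Path("/mnt/filegateway")              # assumed NFS/SMB mount point
dataset_id = secrets.token_hex(8)                  # stand-in for a generated dataset ID
dataset_root = share_root / f"{dataset_id}-vaultjob"

# A couple of directories, each containing one small data file.
for _ in range(2):
    subdir = dataset_root / f"dir-{secrets.token_hex(6)}"
    subdir.mkdir(parents=True, exist_ok=True)
    (subdir / f"file-{secrets.token_hex(4)}").write_bytes(os.urandom(1024))

# The manifest: a recursive listing of all files and directories in the dataset.
entries = sorted(str(p.relative_to(share_root)) for p in dataset_root.rglob("*"))
(dataset_root / f"{dataset_id}.manifest").write_text("\n".join(entries) + "\n")
```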
AWS CDK application stacks
The AWS CDK application contains two stacks:
- EventProcessingStack: Deploys the event processing architecture only. This is intended to be used with a Storage Gateway (File Gateway) configured to generate file upload notifications. NOTE: This stack does not create the File Gateway or File Gateway client. For the workshop walkthrough, these are created as part of the data vaulting stack.
- DataVaultingStack: Deploys a "minimal" Amazon VPC with two Amazon EC2 instances – a File Gateway appliance and a File Gateway NFS client. This stack is used to demonstrate an example data vaulting operation, triggering the components created by the event processing stack.
Since customers can deploy File Gateways in both hybrid and AWS cloud-based environments, the AWS CDK application separates the data vaulting environment into a dedicated stack. This allows you to deploy the event processing flow in isolation, in order to integrate with File Gateways in your specific environments. To do this, you simply associate a File Gateway with the Amazon S3 bucket created by the event processing stack.
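As a rough sketch of how a CDK app entry point could be structured around these two stacks (module paths and the bucket attribute name are assumptions, not the repository's exact code):

```python
#!/usr/bin/env python3
# Rough sketch of a CDK app entry point using the two stacks described above.
import aws_cdk as cdk
from stacks.event_processing_stack import EventProcessingStack
from stacks.data_vaulting_stack import DataVaultingStack

app = cdk.App()

# Deployable on its own ("cdk deploy EventProcessingStack") when integrating
# with a File Gateway that already exists in your environment.
event_processing = EventProcessingStack(app, "EventProcessingStack")

# Optional demonstration environment; its File Gateway vaults data into the
# bucket created by the event processing stack.
DataVaultingStack(
    app,
    "DataVaultingStack",
    vault_bucket=event_processing.vault_bucket,  # assumed attribute name
)

app.synth()
```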
Example data vaulting environment
The AWS CDK application contains the data vaulting stack as a useful demonstration of a real-world use case. All resources in this stack reside in a private VPC with no internet connectivity. The following is an illustration of the architecture:
The stack creates the following resources:
- An Amazon VPC with three private subnets and various Amazon VPC endpoints for the relevant AWS services.
- An Amazon S3 bucket used to deploy the AWS CDK application scripts required in the workshop walkthrough. Amazon EC2 user data commands automatically copy these scripts to the File Gateway client.
- 1 x Amazon EC2 instance using a Storage Gateway AMI and 150 GB of additional Amazon EBS cache volume storage – to be used as a File Gateway. This instance resides within one of the private subnets. It cannot communicate outside of the Amazon VPC and only allows inbound NFS connections from the File Gateway client.
- 1 x Amazon EC2 instance using Amazon Linux 2 and 150 GB of additional Amazon EBS storage – to be used as a File Gateway client. This instance resides in a private subnet. It cannot communicate outside of the Amazon VPC and allows no inbound connections.
The workshop walks you through generating sample data within this environment and vaulting it to Amazon S3 via the File Gateway instance. The File Gateway instance generates upload notifications that the event processing flow consumes and reconciles.
Observing the event processing flow
To observe the event processing flow in action following a data vaulting operation, you can inspect the resources created by the event processing stack. Viewing the following resources in the order listed demonstrates how the processing flow executed:
- Amazon S3 bucket: Objects created in the Amazon S3 bucket, uploaded by the File Gateway.
- Amazon CloudWatch Logs: Logs created to record "data" and "manifest" file upload event types.
- Amazon DynamoDB table: Items created to record the receipt of upload events.
- AWS Step Functions state machine: State machine execution that reconciles "manifest" file contents against the file upload events received (a sketch for listing these executions programmatically follows this list).
- Amazon CloudWatch Logs: File upload reconciliation events emitted by the Step Functions state machine.
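If you prefer to observe the flow programmatically rather than through the console, a small boto3 sketch like the following could list recent executions of the reconciliation state machine. The state machine ARN is a placeholder you would substitute with the one created by the stack.

```python
import boto3

# Sketch: list recent executions of the reconciliation state machine after a
# data vaulting run. Replace the ARN with the state machine created by the stack.
sfn = boto3.client("stepfunctions")
state_machine_arn = "arn:aws:states:REGION:ACCOUNT_ID:stateMachine:NAME"

response = sfn.list_executions(stateMachineArn=state_machine_arn, maxResults=10)
for execution in response["executions"]:
    print(execution["name"], execution["status"], execution["startDate"])
```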
Amazon EventBridge rules route file upload events to their corresponding Amazon CloudWatch log groups. The following are example screenshots of file upload events.
A "information" file upload event:
A "manifest" file upload event:
The following is a diagram of the Step Functions state machine. This state machine implements the file upload event reconciliation logic. It executes a combination of Choice, Pass, Task, and Wait states:
The following is a summary of the steps executed:
- Configure Count: Configures the maximum total number of iterations the state machine executes. The relevant CDK context key sets the count value, as described in the "AWS CDK application principles" section of this post.
- Reconcile Iterator: Executes an AWS Lambda function that increases the value of the current iteration count by one. If the current value equals the maximum count value configured, the Lambda function sets the Boolean variable `continue` to `False`, preventing the state machine from entering another iteration loop.
- Check Count Reached: Checks if the Boolean variable `continue` is `True` or `False`. Proceeds to the "Reconcile Check Upload" step if `True` or the "Reconcile Notify" step if `False`.
- Reconcile Check Upload: Executes an AWS Lambda function that reads the "manifest" file from the Amazon S3 bucket and compares the contents with the file upload events written to the Amazon DynamoDB table. If these are identical, the Lambda function sets the Boolean variable `reconcileDone` to `True`, indicating the reconcile process has completed. This variable is set to `False` if these data sources do not match (a sketch of this comparison follows the list).
- Reconcile Check Complete: Checks if the Boolean variable `reconcileDone` is `True` or `False`. Proceeds to the "Reconcile Notify" step if `True` or the "Wait" step if `False`.
- Wait: A simple wait state that sleeps for a configured time. This state obtains the sleep time from a CDK context key, as described in the "AWS CDK application principles" section of this blog post. This state is entered whenever the Boolean variables `continue` and `reconcileDone` are set to `True` and `False` respectively.
- Reconcile Notify: Executes an AWS Lambda function that sends an event to the EventBridge custom bus, notifying on the status of the reconciliation process. This is either `Successful` if completed within the maximum number of configured iterations or `Timeout` if not. Proceeds to the final "Done" state, completing the state machine execution.
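To make the "Reconcile Check Upload" step concrete, a Lambda handler performing that comparison might look something like the following sketch. The table schema, attribute names, input field names, and manifest format are illustrative assumptions rather than the repository's actual code.

```python
import os
import boto3
from boto3.dynamodb.conditions import Key

# Sketch of the "Reconcile Check Upload" logic: compare the manifest stored in
# Amazon S3 with the upload events recorded in DynamoDB. Table, attribute, and
# input field names are illustrative assumptions.
s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table(os.environ["UPLOAD_EVENTS_TABLE"])

def handler(event, context):
    bucket = event["bucket-name"]
    manifest_key = event["object-key"]
    dataset_id = event["dataset-id"]

    # Files (and directories) the client says it copied: one path per manifest line.
    body = s3.get_object(Bucket=bucket, Key=manifest_key)["Body"].read().decode("utf-8")
    expected = {line.strip() for line in body.splitlines() if line.strip()}

    # Objects for which File Gateway has emitted an upload notification so far.
    uploaded = set()
    query_kwargs = {"KeyConditionExpression": Key("dataset_id").eq(dataset_id)}
    while True:
        response = table.query(**query_kwargs)
        uploaded.update(item["object_key"] for item in response["Items"])
        if "LastEvaluatedKey" not in response:
            break
        query_kwargs["ExclusiveStartKey"] = response["LastEvaluatedKey"]

    # Reconciliation is complete only when both sets match exactly.
    event["reconcileDone"] = expected == uploaded
    return event
```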
The notification sent by the Step Functions state machine is the final step in the event processing flow and is written, via EventBridge, to an Amazon CloudWatch log group. See the following for an example screenshot:
The structure of this event is as follows:
{ "version": "0", "id": "[ID]", "detail-type": "File Upload Reconciliation Successful", "source": "vault.awarding", "account": "[ACCOUNT ID]", "time": "[YYYY-MM-DDTHH:MM:SSZ]", "region": "[REGION]", "resources": [], "detail": { "set-id": "[LOGICAL DATASET ID]", "outcome-time": [EPOCH TIME], "bucket-name": "[BUCKET NAME]", "object-cardinal": "[MANIFEST FILE OBJECT]", "object-size": [SIZE BYTES] } }
Since an EventBridge custom event bus is used, you can extend and customize the solution by adding additional targets to the EventBridge rule. By doing so, you can enable other applications or processes to consume the event and perform further downstream processing on the logical dataset.
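For example, a sketch of subscribing an additional Lambda consumer to successful reconciliation events on the custom bus might look like the following. The source and detail type match the event shown above; the bus lookup, function code path, and construct names are assumptions.

```python
from aws_cdk import Stack, aws_events as events, aws_events_targets as targets, aws_lambda as lambda_
from constructs import Construct

class DownstreamConsumerSketch(Stack):
    """Sketch: attach an extra consumer to reconciliation events on the custom bus."""

    def __init__(self, scope: Construct, construct_id: str, *, bus_name: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # Look up the custom event bus created by the event processing stack.
        vault_bus = events.EventBus.from_event_bus_name(self, "VaultBus", bus_name)

        post_processor = lambda_.Function(
            self,
            "PostProcessor",
            runtime=lambda_.Runtime.PYTHON_3_12,
            handler="index.handler",
            code=lambda_.Code.from_asset("lambda/post_processor"),
        )

        events.Rule(
            self,
            "ReconciliationSucceededRule",
            event_bus=vault_bus,
            # Match the notification event shown above.
            event_pattern=events.EventPattern(
                source=["vault.application"],
                detail_type=["File Upload Reconciliation Successful"],
            ),
            targets=[targets.LambdaFunction(post_processor)],
        )
```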
The File Gateway implements a write-back cache and asynchronously uploads data to Amazon S3. It optimizes cache usage and the order of file uploads. It may also perform temporary partial uploads during the process of fully uploading a file (the partial copy can be seen momentarily in the Amazon S3 bucket at a smaller size than the original). Hence, you may notice a small delay and/or non-sequential uploads when comparing objects appearing in the Amazon S3 bucket with the arrival of corresponding Amazon CloudWatch Logs events.
However, since the File Gateway only generates file upload notifications after it has completely uploaded files to Amazon S3, it is in these scenarios that the file upload notification feature becomes a powerful mechanism to coordinate downstream processing. The AWS CDK workshop walkthrough is a good demonstration of this feature for real-world scenarios where a File Gateway is often managing hundreds of TBs of uploads to Amazon S3. Often this can be for hundreds of thousands of files copied by multiple clients.
Cleaning up
Don't forget to complete the cleanup section (Module 7) in the workshop to prevent ongoing AWS service charges in your account.
Conclusion
In this post, we discussed an AWS CDK application that enables you to leverage individual File Gateway file upload events to group together uploaded datasets for downstream processing. You can use the application to vault data to AWS for the backup and recovery of critical business assets, or to create your own custom data processing pipelines for files uploaded to Amazon S3.
Thanks for reading our blog and we hope you enjoy working through the steps in the AWS CDK application workshop. If you have any comments or questions, please leave them in the comments section or create new issues and pull requests in the GitHub repository.
To learn more about the services mentioned in this post, refer to these product pages:
- AWS Storage Gateway
- Amazon S3
- Amazon EventBridge
- AWS Step Functions
- Amazon DynamoDB
- Amazon CloudWatch
- AWS Cloud Development Kit
Source: https://aws.amazon.com/blogs/storage/process-aws-storage-gateway-file-upload-notifications-with-aws-cdk/