What is the datalake?
Prequel Import makes use of a secure cache of the source data in your cloud storage bucket. This enables efficient change detection without the need to store any customer data on Prequel’s core infrastructure.
How is the datalake structured?
A Datalake is configured with one cloud storage bucket. Each Provider is assigned to exactly one Datalake at creation, and that assignment is immutable. The datalake can be single-tenant or multi-tenant:
- Shared Datalake: Many Providers point to the same Datalake. Prequel partitions the shared bucket internally by tenant, so data from separate Providers is never commingled.
- Per-Provider Datalake: Each Provider has a dedicated Datalake. Use this if you prefer bucket-level isolation.
Lifecycle responsibility
Prequel manages the lifecycle of cached data under the lakehouse/ prefix of your datalake. Staging data written to the artifacts/ prefix is not managed by Prequel. You must configure a lifecycle policy on the artifacts/ prefix (or <bucket_prefix>/artifacts/ if you set a bucket prefix on the datalake) so that staging data expires. The setup steps below include a lifecycle policy step for each vendor. We recommend matching the lifecycle policy duration to your datalake’s retention_window_days.
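As a minimal sketch of the prefix rule described above, the helper below returns the prefix that needs a customer-managed lifecycle policy. The function name `artifacts_prefix` is hypothetical, and trimming leading and trailing slashes from the bucket prefix is an assumption, not documented Prequel behavior.

```python
def artifacts_prefix(bucket_prefix: str = "") -> str:
    """Return the staging prefix that Prequel does NOT manage.

    Assumption: a bucket prefix is joined to "artifacts/" with a single
    slash, and any surrounding slashes on the bucket prefix are ignored.
    """
    cleaned = bucket_prefix.strip("/")
    return f"{cleaned}/artifacts/" if cleaned else "artifacts/"
```

With no bucket prefix the lifecycle rule would be scoped to `artifacts/`; with a bucket prefix of `team-a` it would be scoped to `team-a/artifacts/`.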
Configuring your datalake
- AWS S3
- Google Cloud Storage
- Azure Blob Storage
Create an S3 bucket
In the AWS console, navigate to the S3 service page and click Create bucket. Enter a Bucket name and choose an AWS Region. We recommend setting Object Ownership to “ACLs disabled” and Block Public Access settings for this bucket to “Block all public access”. Make a note of the bucket name and region.
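If you prefer to script this step instead of using the console, the recommended settings map onto S3 API parameters roughly as follows. This is a sketch rather than an official setup script: the helper name `bucket_settings` is hypothetical, and it only builds the request arguments so that it can be reviewed without AWS credentials. "ACLs disabled" corresponds to the `BucketOwnerEnforced` object ownership setting, and "Block all public access" corresponds to enabling all four public access block flags.

```python
def bucket_settings(bucket_name: str, region: str) -> dict:
    """Build request arguments for creating a locked-down S3 bucket."""
    create_kwargs = {
        "Bucket": bucket_name,
        # "ACLs disabled" in the console
        "ObjectOwnership": "BucketOwnerEnforced",
    }
    # us-east-1 is the default region and must not be passed as a
    # LocationConstraint; every other region needs one.
    if region != "us-east-1":
        create_kwargs["CreateBucketConfiguration"] = {
            "LocationConstraint": region
        }

    public_access_kwargs = {
        "Bucket": bucket_name,
        # "Block all public access" in the console
        "PublicAccessBlockConfiguration": {
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    }
    return {
        "create_bucket": create_kwargs,
        "put_public_access_block": public_access_kwargs,
    }


# Applying the settings with boto3 (not executed here):
#   import boto3
#   s3 = boto3.client("s3", region_name=region)
#   settings = bucket_settings("my-datalake-bucket", region)
#   s3.create_bucket(**settings["create_bucket"])
#   s3.put_public_access_block(**settings["put_public_access_block"])
```

The bucket name `my-datalake-bucket` in the commented usage is a placeholder.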
Create an IAM policy
Create an IAM policy with the following permissions, replacing BUCKET_NAME with the name of the bucket you created:
Create an IAM role
Create an IAM role with the trust policy below, then attach the policy from the previous step. Replace <some_service_account_id> with the service account ID from your deployment details. Once the role is created, make a note of its ARN.
Set a lifecycle policy on the artifacts prefix
On the bucket, navigate to Management > Lifecycle rules and click Create lifecycle rule. Limit the rule scope by prefix to artifacts/ (or <bucket_prefix>/artifacts/ if you set a bucket prefix on the datalake), then add an Expire current versions of objects action set to the same number of days as the datalake’s retention_window_days.
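The console rule above can also be expressed as an S3 lifecycle configuration document. The sketch below builds one; the function name and the rule ID `expire-prequel-artifacts` are hypothetical, and the code assumes an unversioned bucket where a plain expiration action is sufficient.

```python
def artifacts_lifecycle_configuration(prefix: str, retention_window_days: int) -> dict:
    """Build a lifecycle configuration that expires objects under `prefix`.

    `prefix` is "artifacts/" or "<bucket_prefix>/artifacts/".
    `retention_window_days` should match the datalake's retention_window_days.
    """
    if retention_window_days < 1:
        raise ValueError("retention_window_days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": "expire-prequel-artifacts",  # hypothetical rule name
                "Status": "Enabled",
                # Limit the rule scope to the staging prefix only
                "Filter": {"Prefix": prefix},
                # "Expire current versions of objects" in the console
                "Expiration": {"Days": retention_window_days},
            }
        ]
    }


# Applying the rule with boto3 (not executed here):
#   import boto3
#   s3 = boto3.client("s3")
#   s3.put_bucket_lifecycle_configuration(
#       Bucket="my-datalake-bucket",
#       LifecycleConfiguration=artifacts_lifecycle_configuration("artifacts/", 30),
#   )
```

Note that `put_bucket_lifecycle_configuration` replaces the bucket's entire lifecycle configuration, so any existing rules on the bucket would need to be included in the same `Rules` list. The bucket name and the 30-day value in the commented usage are placeholders.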