
What is the datalake?

The datalake is a secure cache of the source data that Prequel Import keeps in your cloud storage bucket. This enables efficient change detection without storing any customer data on Prequel’s core infrastructure.

How is the datalake structured?

Each Datalake is configured with one cloud storage bucket. Each Provider is assigned to exactly one Datalake at creation, and that assignment is immutable. A Datalake can be single-tenant or multi-tenant:
  • Shared Datalake: Many Providers point to the same Datalake. Prequel partitions the shared bucket internally by tenant, so data from separate Providers is never commingled.
  • Per-Provider Datalake: Each Provider has a dedicated Datalake. Use this if you prefer bucket-level isolation.
You can mix the two patterns within a single environment, sharing Datalakes among some customers and pinning other customers to their own.

Lifecycle responsibility

Prequel manages lifecycle for cached data under the lakehouse/ prefix of your datalake. Staging data written to the artifacts/ prefix is not managed by Prequel. You must configure a lifecycle policy on the artifacts/ prefix (or <bucket_prefix>/artifacts/ if you set a bucket prefix on the datalake) to expire it. The setup steps below include a lifecycle policy step for each vendor. We recommend matching the lifecycle policy duration to your datalake’s retention_window_days.
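
As a minimal sketch of the prefix rule described above, the helper below computes the key prefix a lifecycle rule would be scoped to. The function name and the slash normalization are assumptions made for illustration. They are not part of Prequel's API.

```python
def artifacts_lifecycle_prefix(bucket_prefix: str = "") -> str:
    """Return the key prefix that the artifacts lifecycle rule should target.

    Hypothetical helper: trimming stray slashes from bucket_prefix is an
    assumption, not documented Prequel behavior.
    """
    cleaned = bucket_prefix.strip("/")
    # With no bucket prefix, staging data sits directly under artifacts/.
    return f"{cleaned}/artifacts/" if cleaned else "artifacts/"


if __name__ == "__main__":
    print(artifacts_lifecycle_prefix())           # no bucket prefix set
    print(artifacts_lifecycle_prefix("prod/eu"))  # bucket prefix set
```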

Configuring your datalake

1. Create an S3 bucket

In the AWS console, navigate to the S3 service page and click Create bucket. Enter a Bucket name and choose an AWS Region. We recommend setting Object Ownership to “ACLs disabled” and Block Public Access settings for this bucket to “Block all public access”. Make a note of the bucket name and region.
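
If you prefer to script this step, the following is one possible sketch using boto3 instead of the console. It assumes boto3 is installed and AWS credentials are configured. The function names are illustrative, and the bucket name and region are placeholders you supply.

```python
def create_bucket_kwargs(bucket_name: str, region: str) -> dict:
    """Build the arguments for S3 CreateBucket."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default region and must not be passed as a LocationConstraint.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


# "ACLs disabled" in the console corresponds to BucketOwnerEnforced.
OWNERSHIP_CONTROLS = {"Rules": [{"ObjectOwnership": "BucketOwnerEnforced"}]}

# "Block all public access" enables all four settings.
PUBLIC_ACCESS_BLOCK = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def create_datalake_bucket(bucket_name: str, region: str) -> None:
    """Create the bucket with the recommended settings (makes real AWS calls)."""
    import boto3  # imported lazily so the helpers above work without boto3

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_kwargs(bucket_name, region))
    s3.put_bucket_ownership_controls(
        Bucket=bucket_name, OwnershipControls=OWNERSHIP_CONTROLS
    )
    s3.put_public_access_block(
        Bucket=bucket_name, PublicAccessBlockConfiguration=PUBLIC_ACCESS_BLOCK
    )
```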
2. Create an IAM policy

Create an IAM policy with the following permissions, replacing BUCKET_NAME with the name of the bucket you created:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::BUCKET_NAME"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": "arn:aws:s3:::BUCKET_NAME/*"
    }
  ]
}
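
To avoid hand-editing the placeholder, the policy above can be generated for a given bucket name. This is a small sketch. The function name is hypothetical and the output mirrors the policy document shown above.

```python
import json


def datalake_bucket_policy(bucket_name: str) -> dict:
    """Return the IAM policy document above with BUCKET_NAME filled in."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            # Listing applies to the bucket itself.
            {"Effect": "Allow", "Action": "s3:ListBucket", "Resource": bucket_arn},
            # Object reads, writes, and deletes apply to keys inside the bucket.
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"{bucket_arn}/*",
            },
        ],
    }


if __name__ == "__main__":
    # "my-datalake-bucket" is a placeholder bucket name.
    print(json.dumps(datalake_bucket_policy("my-datalake-bucket"), indent=2))
```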
3. Create an IAM role

Create an IAM role with the trust policy below, then attach the policy from the previous step. Replace <some_service_account_id> with the service account ID from your deployment details.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "accounts.google.com"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "accounts.google.com:sub": "<some_service_account_id>"
        }
      }
    }
  ]
}
Once the role is created, make a note of its ARN.
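
The trust policy can be templated in the same way. The sketch below fills in the service account ID. The function name is illustrative, and the ID passed in must come from your deployment details.

```python
import json


def datalake_trust_policy(service_account_id: str) -> dict:
    """Return the trust policy above with the service account ID filled in."""
    if not service_account_id:
        raise ValueError("service_account_id must not be empty")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Federated": "accounts.google.com"},
                "Action": "sts:AssumeRoleWithWebIdentity",
                # Only the named service account may assume the role.
                "Condition": {
                    "StringEquals": {"accounts.google.com:sub": service_account_id}
                },
            }
        ],
    }


if __name__ == "__main__":
    # Substitute the service account ID from your deployment details.
    print(json.dumps(datalake_trust_policy("<some_service_account_id>"), indent=2))
```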
4. Set a lifecycle policy on the artifacts prefix

On the bucket, navigate to Management > Lifecycle rules and click Create lifecycle rule. Limit the rule scope by prefix to artifacts/ (or <bucket_prefix>/artifacts/ if you set a bucket prefix on the datalake), then add an Expire current versions of objects action set to the same number of days as the datalake’s retention_window_days.
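
The same rule can be expressed programmatically. The sketch below builds a lifecycle configuration in the shape boto3's put_bucket_lifecycle_configuration expects. The rule ID and function names are assumptions chosen for illustration, and applying the rule requires boto3 and AWS credentials.

```python
def artifacts_expiration_config(prefix: str, retention_window_days: int) -> dict:
    """Build an S3 lifecycle configuration that expires objects under prefix."""
    if retention_window_days < 1:
        raise ValueError("retention_window_days must be at least 1")
    return {
        "Rules": [
            {
                "ID": "expire-artifacts",  # illustrative rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                # Expires current object versions after the given number of days.
                "Expiration": {"Days": retention_window_days},
            }
        ]
    }


def apply_lifecycle_config(bucket_name: str, config: dict) -> None:
    """Apply the configuration to the bucket (makes a real AWS call)."""
    import boto3  # imported lazily so the builder above works without boto3

    boto3.client("s3").put_bucket_lifecycle_configuration(
        Bucket=bucket_name, LifecycleConfiguration=config
    )


if __name__ == "__main__":
    # 30 is a placeholder; use your datalake's retention_window_days.
    print(artifacts_expiration_config("artifacts/", 30))
```

Note that put_bucket_lifecycle_configuration replaces the bucket's entire lifecycle configuration, so include any existing rules in the same request if the bucket already has some.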