Building and securing data lakes

Task Statement 3.5: Determine high-performing data ingestion and transformation solutions.

📘 AWS Certified Solutions Architect – Associate (SAA-C03)


1. What is a Data Lake?

A data lake is a centralized storage system that allows you to store large amounts of raw data in its original format.

Key characteristics:

  • Stores structured data (tables, databases)
  • Stores semi-structured data (JSON, XML)
  • Stores unstructured data (images, videos, logs)
  • Can scale to petabytes or more
  • Data is stored as-is, without needing transformation first

Core AWS service used:

  • Amazon S3 (Simple Storage Service) → Main storage layer for data lakes

2. Why Build a Data Lake?

A data lake is used when:

  • You want to store all types of data in one place
  • You need analytics, reporting, or machine learning
  • You want cheap, scalable storage

Benefits:

  • Low cost storage
  • High scalability
  • Supports many analytics tools
  • Flexible schema (schema-on-read)
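
Schema-on-read means the raw records are stored untouched and a schema is applied only when someone reads them. The sketch below illustrates the idea in plain Python; the sample records, the CLICK_SCHEMA mapping, and the read_with_schema helper are all hypothetical names made up for this illustration, not part of any AWS service.

```python
import json

# Hypothetical raw records, stored exactly as they arrived (one JSON object per line).
RAW_LINES = [
    '{"user_id": "42", "event": "click", "ts": "2026-04-02T10:15:00Z"}',
    '{"user_id": "7", "event": "view", "ts": "2026-04-02T10:16:30Z", "page": "/home"}',
]

# The schema belongs to the reader, not the storage: column name -> type to cast to.
CLICK_SCHEMA = {"user_id": int, "event": str}


def read_with_schema(lines, schema):
    """Parse raw JSON lines, keeping and casting only the columns the schema names."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({col: cast(record[col]) for col, cast in schema.items()})
    return rows


rows = read_with_schema(RAW_LINES, CLICK_SCHEMA)
```

A different consumer could read the same raw lines with a different schema (for example, one that also includes "page") without rewriting the stored data.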

3. Key Components of a Data Lake Architecture

To build a data lake, you must understand these layers:


3.1 Data Ingestion Layer

This layer brings data into the data lake.

Types of ingestion:

  • Batch ingestion
    • Data loaded periodically
    • Example: daily logs upload
  • Streaming ingestion
    • Real-time data flow
    • Example: application logs or clickstreams
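
Streaming pipelines that land in S3 usually buffer incoming records into small batches before writing an object, because S3 stores whole objects rather than appended records. The following is a minimal sketch of that buffering idea only; micro_batches and its max_records parameter are hypothetical and do not correspond to any AWS API.

```python
def micro_batches(records, max_records):
    """Group an incoming stream into fixed-size batches; the last one may be smaller."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == max_records:
            yield batch
            batch = []
    if batch:  # flush whatever is left when the stream ends
        yield batch


# Seven records buffered three at a time.
batches = list(micro_batches(range(7), 3))
```

Real services typically flush on elapsed time as well as on size, which this sketch leaves out for brevity.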

AWS Services:

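  • Amazon Kinesis → streaming data (Kinesis Data Streams for ingestion, Amazon Data Firehose for delivery to S3)
  • AWS DataSync → large-scale online transfers from on-premises or other storage
  • AWS Snowball → offline data transfer using physical devices
  • AWS Transfer Family → SFTP, FTPS, and FTP uploads directly into S3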

3.2 Storage Layer (Core of Data Lake)

Amazon S3 is used because:

  • Virtually unlimited storage (no capacity to provision)
  • High durability (designed for 99.999999999%, or 11 nines)
  • Cost-effective
  • Supports lifecycle management

Best Practice: Organize data using prefixes

Example structure:

s3://data-lake/
├── raw/
├── processed/
├── curated/

Data Zones:

  • Raw Zone
    • Original data (unchanged)
  • Processed Zone
    • Cleaned and transformed data
  • Curated Zone
    • Ready for analytics
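
One way to keep the zones consistent is to build every object key through a small helper, so the zone is always the first prefix segment. This is a sketch under that assumption; zone_key, promote, and the "clicks" dataset name are hypothetical.

```python
ZONES = ("raw", "processed", "curated")


def zone_key(zone, dataset, filename):
    """Build an object key of the form <zone>/<dataset>/<filename>."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone!r}")
    return f"{zone}/{dataset}/{filename}"


def promote(key, target_zone):
    """Return the key the same object would have in another zone."""
    zone, rest = key.split("/", 1)
    if zone not in ZONES or target_zone not in ZONES:
        raise ValueError(f"cannot promote {key!r} to {target_zone!r}")
    return f"{target_zone}/{rest}"


raw_key = zone_key("raw", "clicks", "part-0001.json")
processed_key = promote(raw_key, "processed")
```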

3.3 Data Processing & Transformation Layer

This layer prepares data for analysis.

AWS Services:

  • AWS Glue
    • Serverless ETL (Extract, Transform, Load)
    • Glue crawlers automatically discover schemas
  • Amazon EMR
    • Big data processing using Spark/Hadoop
  • AWS Lambda
    • Lightweight transformations
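
A lightweight Lambda transformation is often triggered by an S3 event notification. The sketch below only parses such an event and plans where each raw object would be written in the processed zone; it makes no AWS calls. The handler name, the raw/ and processed/ prefixes, and the sample event are assumptions for illustration. Object keys in S3 event notifications arrive URL-encoded, which is why the key is decoded first.

```python
from urllib.parse import unquote_plus


def handler(event, context):
    """Plan a raw -> processed copy for every object named in an S3 event."""
    planned = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        if not key.startswith("raw/"):
            continue  # ignore objects outside the raw zone
        planned.append({
            "bucket": bucket,
            "source": key,
            "target": "processed/" + key[len("raw/"):],
        })
    return planned


# Hypothetical, trimmed-down event containing only the fields the handler reads.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "data-lake"},
                "object": {"key": "raw/app+logs/2026-04-02.json"}}}
    ]
}
plan = handler(sample_event, None)
```

A real function would go on to read the source object, transform it, and write the result to the target key.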

3.4 Data Catalog Layer

A data catalog helps organize and find data.

AWS Service:

  • AWS Glue Data Catalog

Features:

  • Stores metadata (table definitions)
  • Enables querying data easily
  • Works with Athena, Redshift, etc.
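
To make "metadata (table definitions)" concrete, here is a sketch of what a catalog entry for Parquet files might contain. It builds a plain dictionary in the shape that I believe the Glue CreateTable API expects as its table input; the table name, columns, S3 location, and the parquet_table_input helper are hypothetical, and nothing here calls AWS.

```python
def parquet_table_input(name, location, columns, partition_keys):
    """Build a Glue-style table definition for Parquet files under an S3 prefix."""
    def as_columns(pairs):
        return [{"Name": col, "Type": typ} for col, typ in pairs]

    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": as_columns(partition_keys),
        "StorageDescriptor": {
            "Columns": as_columns(columns),
            "Location": location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


clicks_table = parquet_table_input(
    name="clicks",
    location="s3://data-lake/curated/clicks/",
    columns=[("user_id", "bigint"), ("event", "string")],
    partition_keys=[("year", "string"), ("month", "string"), ("day", "string")],
)
```

The catalog holds only this description; the data itself stays in S3 at the given location.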

3.5 Analytics Layer

This layer is used to analyze data.

AWS Services:

  • Amazon Athena
    • Query data directly from S3 using SQL
  • Amazon Redshift Spectrum
    • Query S3 data using Redshift
  • Amazon QuickSight
    • Data visualization and dashboards
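
Athena bills by the amount of data scanned, so queries that filter on partition columns tend to be both cheaper and faster. The sketch below just builds such a SQL string; the clicks table, its columns, and string-typed partition columns named year, month, and day are assumptions, and daily_query is a hypothetical helper meant for trusted inputs only.

```python
def daily_query(table, columns, year, month, day):
    """Build a SELECT restricted to one day's partition (trusted inputs only)."""
    select_list = ", ".join(columns)
    return (
        f"SELECT {select_list} FROM {table} "
        f"WHERE year = '{year:04d}' AND month = '{month:02d}' AND day = '{day:02d}'"
    )


sql = daily_query("clicks", ["user_id", "event"], 2026, 4, 2)
```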

4. Securing a Data Lake (Very Important for Exam)

Security is a major exam topic. You must secure data at multiple levels.


4.1 Identity and Access Management (IAM)

Controls who can access what.

Best Practices:

  • Prefer IAM roles (temporary credentials) over IAM users with long-lived access keys
  • Apply least privilege principle
  • Use IAM policies to restrict access
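
As an illustration of least privilege, the sketch below builds an IAM policy document that allows read-only access to a single prefix of the data lake bucket. The bucket name data-lake, the curated/ prefix, and the read_only_prefix_policy helper are assumptions for the example.

```python
import json


def read_only_prefix_policy(bucket, prefix):
    """Allow listing and reading objects under one prefix, and nothing else."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListOnlyThisPrefix",
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}*"]}},
            },
            {
                "Sid": "ReadObjectsUnderPrefix",
                "Effect": "Allow",
                "Action": "s3:GetObject",
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}*",
            },
        ],
    }


policy = read_only_prefix_policy("data-lake", "curated/")
policy_json = json.dumps(policy, indent=2)
```

Note that s3:ListBucket applies to the bucket ARN while s3:GetObject applies to object ARNs, which is why the two statements use different resources.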

4.2 Bucket-Level Security (Amazon S3)

Key controls:

  • Bucket Policies
  • Access Control Lists (ACLs): legacy; AWS recommends keeping them disabled and using policies instead
  • Block Public Access (VERY IMPORTANT)

Exam Tip:

Always block public access unless required.
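
Block Public Access consists of four independent settings, and "blocking public access" normally means turning on all of them. The sketch below shows the four setting names and a small check that reports which ones are off; missing_public_access_blocks is a hypothetical helper, and no AWS call is made.

```python
# The four Block Public Access settings, all enabled.
BLOCK_ALL_PUBLIC_ACCESS = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}


def missing_public_access_blocks(config):
    """Return the names of any Block Public Access settings that are not enabled."""
    return sorted(
        name for name in BLOCK_ALL_PUBLIC_ACCESS if not config.get(name, False)
    )


fully_blocked = missing_public_access_blocks(BLOCK_ALL_PUBLIC_ACCESS)
partly_blocked = missing_public_access_blocks({"BlockPublicAcls": True})
```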


4.3 Encryption

Types of encryption:

1. Encryption at Rest

Protects stored data.

  • SSE-S3 → Keys managed by S3 (the default for new objects)
  • SSE-KMS → Keys held in AWS KMS (recommended for control and auditing)
  • SSE-C → Customer-provided keys (you supply the key with each request)

2. Encryption in Transit

  • Use HTTPS (TLS)
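
Both kinds of encryption can be enforced at the bucket level. The sketch below builds two plain dictionaries: a bucket policy that denies any request not sent over TLS, and a default-encryption configuration that selects SSE-KMS. The bucket name, the placeholder KMS key ARN, and both helper functions are assumptions for illustration.

```python
def deny_insecure_transport(bucket):
    """Bucket policy that rejects every request made without TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "DenyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            "Condition": {"Bool": {"aws:SecureTransport": "false"}},
        }],
    }


def default_kms_encryption(kms_key_arn):
    """Default-encryption configuration that applies SSE-KMS to new objects."""
    return {
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": kms_key_arn,
            },
            "BucketKeyEnabled": True,  # reduces the number of KMS requests
        }]
    }


tls_policy = deny_insecure_transport("data-lake")
# Placeholder ARN; substitute a real key.
encryption = default_kms_encryption("arn:aws:kms:us-east-1:111122223333:key/EXAMPLE-KEY-ID")
```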

4.4 AWS Lake Formation (Key Service)

Used to simplify security and governance of data lakes.

What it does:

  • Centralized access control
  • Fine-grained permissions (table, column level)
  • Works with Glue Data Catalog

Key Features:

  • Role-based access control
  • Data filtering
  • Secure sharing of datasets

4.5 Data Access Control (Fine-Grained)

Using Lake Formation, you can:

  • Restrict access to specific tables
  • Restrict access to specific columns
  • Control row-level access (advanced)
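
A column-level grant can be sketched as the request that would be sent to Lake Formation. The dictionary below follows the shape I understand the GrantPermissions API to take (a principal, a table-with-columns resource, and a list of permissions); the role ARN, database, table, and column names are hypothetical, and column_select_grant only builds the request without sending it.

```python
def column_select_grant(principal_arn, database, table, columns):
    """Build a request granting SELECT on specific columns of one table."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical analyst role that may see events but not other columns.
grant = column_select_grant(
    "arn:aws:iam::111122223333:role/AnalystRole",
    "lake_db",
    "clicks",
    ["user_id", "event"],
)
```

Row-level restrictions are expressed separately through data filters rather than through this column list.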

4.6 Logging and Monitoring

Track who accessed data.

AWS Services:

  • AWS CloudTrail
    • Logs API calls (S3 object-level calls are data events and must be enabled explicitly)
  • Amazon CloudWatch
    • Monitoring and alerts
  • S3 Access Logs
    • Track requests to S3
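
Answering "who accessed this data" usually comes down to filtering log records. The sketch below scans CloudTrail-style records for GetObject calls against one bucket. The sample records are made up and contain only the fields the function reads; object_reads is a hypothetical helper.

```python
def object_reads(records, bucket):
    """Return (time, caller ARN, key) for each GetObject call against the bucket."""
    hits = []
    for record in records:
        params = record.get("requestParameters") or {}
        if (record.get("eventSource") == "s3.amazonaws.com"
                and record.get("eventName") == "GetObject"
                and params.get("bucketName") == bucket):
            caller = record.get("userIdentity", {}).get("arn")
            hits.append((record.get("eventTime"), caller, params.get("key")))
    return hits


# Made-up sample records for illustration only.
sample_records = [
    {"eventTime": "2026-04-02T10:15:00Z", "eventSource": "s3.amazonaws.com",
     "eventName": "GetObject",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:role/AnalystRole"},
     "requestParameters": {"bucketName": "data-lake", "key": "curated/clicks/a.parquet"}},
    {"eventTime": "2026-04-02T10:16:00Z", "eventSource": "s3.amazonaws.com",
     "eventName": "PutObject",
     "userIdentity": {"arn": "arn:aws:iam::111122223333:role/EtlRole"},
     "requestParameters": {"bucketName": "data-lake", "key": "raw/clicks/b.json"}},
]
reads = object_reads(sample_records, "data-lake")
```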

4.7 Data Protection and Compliance

Features:

  • Versioning (protects against accidental deletion and overwrites)
  • MFA Delete (requires MFA to permanently delete an object version or change the versioning state)
  • Object Lock (WORM – Write Once Read Many)
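
These features are switched on through bucket configuration. The sketch below builds the two configuration dictionaries in the shape I believe the S3 API uses for versioning and for an Object Lock default retention rule; the 365-day retention period and the default_retention helper are assumptions for the example.

```python
# Versioning configuration with versioning turned on.
VERSIONING_ENABLED = {"Status": "Enabled"}


def default_retention(mode, days):
    """Object Lock configuration with a default retention period in days."""
    if mode not in ("GOVERNANCE", "COMPLIANCE"):
        raise ValueError(f"unknown retention mode: {mode!r}")
    if days <= 0:
        raise ValueError("retention period must be a positive number of days")
    return {
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": mode, "Days": days}},
    }


# COMPLIANCE mode: retention cannot be shortened or removed by any user.
object_lock = default_retention("COMPLIANCE", 365)
```

GOVERNANCE mode is the softer option: users with a specific permission can still override the retention settings.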

5. Data Governance (Important Concept)

Data governance ensures:

  • Data is organized
  • Data is secure
  • Data is usable

Tools:

  • AWS Lake Formation
  • AWS Glue Data Catalog

Governance Tasks:

  • Data classification
  • Access control
  • Auditing

6. Best Practices for Building Data Lakes


Storage Best Practices

  • Use Amazon S3
  • Organize data using prefixes (folders)
  • Use lifecycle policies (for example, transition older data to S3 Glacier storage classes)
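
A lifecycle policy is a list of rules, each scoped by a filter. The sketch below builds one rule that transitions objects under a prefix to a Glacier storage class and optionally expires them later. The 90-day and 730-day values, the raw/ prefix, and the archive_rule helper are assumptions chosen for illustration.

```python
def archive_rule(prefix, glacier_after_days, expire_after_days=None):
    """Lifecycle rule: move objects under a prefix to Glacier, optionally expire them."""
    rule = {
        "ID": f"archive-{prefix.strip('/')}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [{"Days": glacier_after_days, "StorageClass": "GLACIER"}],
    }
    if expire_after_days is not None:
        if expire_after_days <= glacier_after_days:
            raise ValueError("expiration must come after the transition")
        rule["Expiration"] = {"Days": expire_after_days}
    return rule


lifecycle = {"Rules": [archive_rule("raw/", 90, expire_after_days=730)]}
```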

Performance Best Practices

  • Store data in optimized formats:
    • Parquet
    • ORC
  • Partition data:
    • year=2026/month=04/day=02/
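
The partition layout above can be generated from a date so that every writer produces the same key=value prefixes. This is a minimal sketch; partition_prefix and the curated/clicks dataset path are hypothetical.

```python
from datetime import date


def partition_prefix(dataset, day):
    """Build a Hive-style year=/month=/day= prefix for one day of a dataset."""
    return f"{dataset}/year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"


prefix = partition_prefix("curated/clicks", date(2026, 4, 2))
# prefix == "curated/clicks/year=2026/month=04/day=02/"
```

Zero-padding the month and day keeps prefixes in chronological order when listed alphabetically.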

Security Best Practices

  • Enable encryption (SSE-KMS preferred)
  • Use IAM roles
  • Enable S3 Block Public Access
  • Use Lake Formation for fine control

Cost Optimization

  • Use S3 lifecycle policies
  • Use S3 Intelligent-Tiering
  • Compress data
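
To see why compression helps with storage cost, the sketch below gzips a batch of repetitive JSON records using only the Python standard library. The records are synthetic and exist purely for the demonstration; actual savings depend on the data.

```python
import gzip
import json

# Synthetic, highly repetitive records (typical of logs and clickstreams).
records = [{"user_id": i % 10, "event": "click"} for i in range(1000)]
raw = "\n".join(json.dumps(r) for r in records).encode("utf-8")

compressed = gzip.compress(raw)

# Compression is lossless: decompressing returns the original bytes.
restored = gzip.decompress(compressed)
ratio = len(compressed) / len(raw)
```

Columnar formats such as Parquet and ORC apply compression internally, so they often combine both the format and the compression benefit.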

7. Common Exam Scenarios


Scenario 1:

Need centralized storage for large datasets
→ Use an Amazon S3-based data lake


Scenario 2:

Need fine-grained access control
→ Use AWS Lake Formation


Scenario 3:

Need to query data in S3 without loading it into a database
→ Use Amazon Athena


Scenario 4:

Need schema management
→ Use AWS Glue Data Catalog


Scenario 5:

Need secure data storage
→ Use:

  • S3 encryption
  • IAM policies
  • Lake Formation permissions

8. Quick Summary (Exam Revision)


Core Services:

  • S3 → Storage
  • Glue → ETL + Catalog
  • Lake Formation → Security & governance
  • Athena → Query
  • Kinesis → Streaming ingestion

Security Checklist:

  • IAM roles (least privilege)
  • S3 Block Public Access
  • Encryption (SSE-KMS)
  • Lake Formation permissions
  • Logging (CloudTrail)

Architecture Flow:

Ingestion → S3 (Data Lake) → Glue → Catalog → Athena/Analytics

Final Exam Tip

If a question mentions:

  • Large-scale storage + multiple data types → Think S3 Data Lake
  • Security + fine-grained access → Think Lake Formation
  • Query without loading → Think Athena
  • Metadata/catalog → Think Glue Data Catalog