Designing data streaming architectures

Task Statement 3.5: Determine high-performing data ingestion and transformation solutions.

📘 AWS Certified Solutions Architect - Associate (SAA-C03)


1. What is Data Streaming Architecture?

A data streaming architecture is a system that processes data continuously in real time as it is generated.

Key idea:

  • Data is processed immediately (or near real-time)
  • Instead of waiting for batches, data flows continuously

Examples of streaming data in IT systems:

  • Application logs being generated continuously
  • Metrics from servers (CPU, memory usage)
  • User activity events from web or mobile apps
  • IoT sensor data

2. Streaming vs Batch Processing

Feature         | Streaming                       | Batch
Data processing | Real-time                       | Scheduled
Latency         | Very low (seconds/milliseconds) | High (minutes/hours)
Use case        | Monitoring, alerting            | Reporting, analytics
Complexity      | Higher                          | Lower

Exam Tip:

  • If a question mentions real-time or low latency → Streaming
  • If a question mentions scheduled processing → Batch
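
The difference can be illustrated with a small sketch in plain Python (no AWS services involved). The function names are made up for this example: the batch version waits for all values before producing one answer, while the streaming version emits an updated answer after every value.

```python
def batch_average(values):
    """Batch style: collect everything first, then compute once."""
    collected = list(values)
    return sum(collected) / len(collected)


def streaming_averages(values):
    """Streaming style: update the result as each value arrives."""
    count = 0
    total = 0.0
    for value in values:
        count += 1
        total += value
        yield total / count


if __name__ == "__main__":
    cpu_readings = [10, 20, 30]  # hypothetical metric samples
    print(batch_average(cpu_readings))             # one result at the end
    print(list(streaming_averages(cpu_readings)))  # one result per sample
```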

3. Core Components of a Streaming Architecture

A typical streaming system has 4 main layers:


1. Data Producers

These generate data continuously.

Examples:

  • Applications
  • Servers
  • IoT devices

2. Data Ingestion Layer

This collects and streams incoming data.

AWS Services:

  • Amazon Kinesis
  • Amazon MSK (Managed Kafka)
  • Amazon SQS (sometimes for buffering)

3. Data Processing Layer

This processes data in real time.

AWS Services:

  • AWS Lambda
  • Amazon Kinesis Data Analytics
  • Amazon EMR (stream processing)

4. Data Storage Layer

Stores processed or raw data.

AWS Services:

  • Amazon S3 (data lake)
  • Amazon DynamoDB (real-time storage)
  • Amazon Redshift (analytics)

4. AWS Streaming Services (Very Important)


1. Amazon Kinesis (Core Service)

Kinesis is the most important service for streaming in the exam.

Kinesis Components:

a) Kinesis Data Streams

  • Real-time data ingestion
  • Stores data for 24 hours to 365 days
  • Supports multiple consumers

Key Concepts:

  • Shard = unit of capacity
    • 1 shard =
      • 1 MB/sec (or 1,000 records/sec) input
      • 2 MB/sec output
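
As a rough sizing aid, the per-shard limits can be turned into a shard estimate. The sketch below is only an approximation for provisioned streams: it uses the 1 MB/sec write and 2 MB/sec read figures, plus the per-shard write limit of 1,000 records per second. The function name and the sample workload are hypothetical.

```python
import math

# Per-shard limits for Kinesis Data Streams (provisioned mode).
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_write_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SHARD)
    by_write_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SHARD)
    by_read_bytes = math.ceil(read_mb_per_sec / READ_MB_PER_SHARD)
    # A stream always has at least one shard.
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


if __name__ == "__main__":
    # Hypothetical workload: 5 MB/s written, 3,000 records/s, 8 MB/s read.
    print(shards_needed(5, 3000, 8))
```

The largest of the three requirements wins, which is why a stream of many tiny records can need more shards than its byte volume alone suggests.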

When to use:

  • High-throughput streaming data
  • Multiple applications need same data

b) Kinesis Data Firehose (now named Amazon Data Firehose)

  • Fully managed delivery service
  • Automatically loads data into:
    • S3
    • Redshift
    • OpenSearch

Features:

  • No shard management
  • Automatic scaling
  • Built-in transformation (Lambda)

When to use:

  • Simple pipeline → stream → storage
  • No need for custom processing

c) Kinesis Data Analytics (now named Amazon Managed Service for Apache Flink)

  • Real-time data processing using SQL or Apache Flink

When to use:

  • Real-time analytics
  • Filtering, aggregations
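
To make "filtering, aggregations" concrete, here is a plain-Python sketch of a tumbling-window count, the kind of query a stream analytics engine runs continuously. It is not Flink or SQL code; the event fields (`ts`, `level`, `service`) and the function name are assumptions for illustration, and a real engine would also handle late or out-of-order events.

```python
from collections import defaultdict


def error_counts_per_window(events, window_seconds=60):
    """Filter to ERROR events, then count them per (window start, service)."""
    counts = defaultdict(int)
    for event in events:
        if event["level"] != "ERROR":  # filtering step
            continue
        # Tumbling windows are fixed-size and do not overlap.
        window_start = int(event["ts"] // window_seconds) * window_seconds
        counts[(window_start, event["service"])] += 1  # aggregation step
    return dict(counts)


if __name__ == "__main__":
    sample = [
        {"ts": 0, "level": "ERROR", "service": "api"},
        {"ts": 30, "level": "INFO", "service": "api"},
        {"ts": 59, "level": "ERROR", "service": "api"},
        {"ts": 61, "level": "ERROR", "service": "db"},
    ]
    print(error_counts_per_window(sample))
```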

2. AWS Lambda (Serverless Processing)

  • Processes streaming data automatically
  • Works with:
    • Kinesis
    • SQS
    • DynamoDB Streams

Features:

  • No server management
  • Auto scaling
  • Event-driven

When to use:

  • Lightweight transformations
  • Real-time triggers
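
A minimal sketch of a Lambda handler for a Kinesis event source is shown below. The event shape (`Records` → `kinesis` → base64-encoded `data`) follows the Kinesis event format that Lambda delivers; the `transform` function and the `host`/`cpu` fields are hypothetical. Returning `batchItemFailures` only has an effect if `ReportBatchItemFailures` is enabled on the event source mapping.

```python
import base64
import json


def transform(payload):
    """Hypothetical lightweight transformation: flag hosts with high CPU."""
    return {"host": payload["host"], "cpu_high": payload["cpu"] > 90}


def handler(event, context):
    """Entry point for a Lambda function subscribed to a Kinesis data stream."""
    failures = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        try:
            # Kinesis record data arrives base64-encoded inside the event.
            payload = json.loads(base64.b64decode(kinesis["data"]))
            result = transform(payload)
            print(json.dumps(result))  # stand-in for writing to a real sink
        except (ValueError, KeyError, TypeError):
            # Report the failed record so Lambda can retry from it
            # instead of from the start of the batch.
            failures.append({"itemIdentifier": kinesis["sequenceNumber"]})
    return {"batchItemFailures": failures}
```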

3. Amazon MSK (Managed Kafka)

  • Fully managed Apache Kafka service

When to use:

  • Kafka-based architectures
  • Complex event streaming systems

4. Amazon SQS (Buffering Layer)

  • Not a streaming service, but often used as a buffer in streaming architectures

Types:

  • Standard Queue → high throughput
  • FIFO Queue → ordered processing

Use case:

  • Decouple producers and consumers
  • Handle traffic spikes

5. Data Flow Patterns in Streaming


1. Fan-Out Pattern

  • One stream → multiple consumers

Example:

  • One Kinesis stream → Lambda + Analytics + Storage

Types:

  • Shared throughput (standard consumers): all consumers share the 2 MB/sec read capacity of each shard
  • Enhanced fan-out: each registered consumer gets its own dedicated 2 MB/sec per shard
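
The practical difference between the two types is how read capacity is divided. The sketch below is a simplified calculation that assumes 2 MB/sec of read capacity per shard and an even split among standard consumers; the function name is made up for this example.

```python
READ_MB_PER_SEC_PER_SHARD = 2.0


def read_throughput_per_consumer(shards, consumers, enhanced_fan_out):
    """Approximate read capacity (MB/sec) available to each consumer."""
    if shards < 1 or consumers < 1:
        raise ValueError("shards and consumers must be at least 1")
    total = shards * READ_MB_PER_SEC_PER_SHARD
    if enhanced_fan_out:
        # Each registered consumer gets dedicated capacity per shard.
        return total
    # Standard consumers share the same per-shard capacity.
    return total / consumers


if __name__ == "__main__":
    print(read_throughput_per_consumer(4, 3, enhanced_fan_out=False))
    print(read_throughput_per_consumer(4, 3, enhanced_fan_out=True))
```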

2. Producer → Stream → Consumer

Basic pipeline:

Producer → Kinesis → Lambda → S3
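
On the producer side of this pipeline, records are typically sent in batches. The sketch below only builds the request entries (`Data` and `PartitionKey`) in groups of at most 500, which is the per-request record limit of the `PutRecords` API; it does not enforce the request size limit. The stream name, the `device_id` field, and the helper name are hypothetical, and the actual boto3 call is left as a comment so the example runs without AWS credentials.

```python
import json
from itertools import islice

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def build_put_records_batches(events, key_field):
    """Yield lists of PutRecords entries, at most 500 entries per list."""
    iterator = iter(events)
    while True:
        chunk = list(islice(iterator, MAX_RECORDS_PER_CALL))
        if not chunk:
            return
        yield [
            {
                "Data": json.dumps(event).encode("utf-8"),
                # Records with the same partition key land on the same shard.
                "PartitionKey": str(event[key_field]),
            }
            for event in chunk
        ]


# Possible usage with boto3 (assumes credentials and an existing stream):
# import boto3
# kinesis = boto3.client("kinesis")
# for batch in build_put_records_batches(events, "device_id"):
#     response = kinesis.put_records(StreamName="example-stream", Records=batch)
#     # A non-zero response["FailedRecordCount"] means some records need a retry.
```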

3. Stream → Buffer → Processing

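Producer → Kinesis → SQS → Lambda

Kinesis does not write to SQS on its own; a consumer (for example, a Lambda function) forwards records from the stream to the queue.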

Why?

  • Improve reliability
  • Prevent overload
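
The effect of a buffer can be seen in a small simulation. This is a toy model, not an SQS client: arrivals and the consumer rate are counted per time tick, and the function name and numbers are invented for illustration.

```python
def simulate_backlog(arrivals_per_tick, consumer_rate):
    """Return the queue backlog after each tick for a fixed-rate consumer."""
    backlog = 0
    history = []
    for arrivals in arrivals_per_tick:
        backlog += arrivals                      # producers enqueue messages
        processed = min(backlog, consumer_rate)  # consumer drains at its own pace
        backlog -= processed
        history.append(backlog)
    return history


if __name__ == "__main__":
    # A spike of 50 messages arrives in the second tick; the consumer handles 20 per tick.
    print(simulate_backlog([10, 50, 10, 0, 0], consumer_rate=20))
```

The spike is absorbed by the queue and worked off over the following ticks, so the consumer never receives more than it can handle.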

6. Scaling in Streaming Architectures


Kinesis Scaling

  • Provisioned mode: scale by adding/removing shards (resharding)
  • On-demand mode: capacity adjusts automatically
  • More shards = more throughput

Exam Tip:

  • If throughput increases → increase shards

Lambda Scaling

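  • Automatically scales based on incoming events
  • For Kinesis sources, concurrency follows the shard count (one concurrent batch per shard by default)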

Firehose Scaling

  • Fully automatic (no manual scaling)

7. Data Durability and Reliability


Kinesis Data Streams

  • Data replicated across multiple AZs
  • Retention:
    • Default: 24 hours
    • Max: 365 days

Firehose

  • Retries delivery automatically
  • Can write records that fail delivery or transformation to an S3 backup location

SQS

  • Standard queues deliver each message at least once; FIFO queues support exactly-once processing
  • Retains messages for up to 14 days (default: 4 days)

8. Ordering and Processing


Ordering Guarantees

  • Kinesis → ordered within a shard (records with the same partition key go to the same shard)
  • SQS FIFO → strict ordering within a message group

Exam Tip:

  • Need strict ordering → use:
    • Kinesis (same shard)
    • SQS FIFO
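
Ordering in Kinesis depends on the partition key: Kinesis hashes the key with MD5 into a 128-bit integer, and each shard owns a range of that hash space. The sketch below imitates that mapping under two assumptions that may not match a real stream: the shards split the hash space evenly, and the digest is read as a big-endian integer. The function name is hypothetical.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit value


def shard_for_key(partition_key, shard_count):
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")  # byte order is an assumption here
    range_size = HASH_SPACE // shard_count
    # Clamp so the top of the hash space falls into the last shard.
    return min(hash_value // range_size, shard_count - 1)


if __name__ == "__main__":
    for key in ["device-1", "device-2", "device-1"]:
        print(key, "->", shard_for_key(key, 4))
```

Because the mapping is deterministic, all records for one key (for example, one device or one user) stay on one shard and keep their order.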

Exactly Once vs At Least Once

Type          | Meaning
At least once | May process duplicates
Exactly once  | No duplicates

AWS Behavior:

  • Most services = at least once

Solution:

  • Use idempotent processing (handling the same record twice has the same effect as handling it once)
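
One common way to make a consumer idempotent is to remember which event IDs have already been applied and skip repeats. The sketch below keeps that record in memory purely for illustration; the class and field names are hypothetical, and a real consumer would need a durable store (for example, a conditional write to a database table) so the memory survives restarts.

```python
class IdempotentProcessor:
    """Applies each event at most once, keyed by a unique event ID."""

    def __init__(self):
        self._seen_ids = set()  # in-memory only; a real system needs durable storage
        self.total = 0

    def process(self, event_id, amount):
        """Return True if the event was applied, False if it was a duplicate."""
        if event_id in self._seen_ids:
            return False
        self._seen_ids.add(event_id)
        self.total += amount
        return True


if __name__ == "__main__":
    processor = IdempotentProcessor()
    # The second delivery of "evt-1" simulates an at-least-once duplicate.
    for event_id, amount in [("evt-1", 10), ("evt-2", 5), ("evt-1", 10)]:
        processor.process(event_id, amount)
    print(processor.total)
```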

9. Data Transformation in Streaming


Methods:

1. AWS Lambda

  • Simple transformations

2. Kinesis Data Analytics

  • SQL-based real-time processing

3. Firehose + Lambda

  • Inline transformation before storage
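
A Firehose transformation Lambda receives a batch of base64-encoded records and must return each `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`. The sketch below follows that contract; the business logic (dropping `DEBUG` entries and adding a `source` field) and the field names are made-up examples.

```python
import base64
import json


def handler(event, context):
    """Transform Firehose records before they are delivered to storage."""
    output = []
    for record in event["records"]:
        record_id = record["recordId"]
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            if not isinstance(payload, dict):
                raise ValueError("expected a JSON object")
        except ValueError:
            # Return the original data unchanged and mark the record as failed.
            output.append({"recordId": record_id, "result": "ProcessingFailed",
                           "data": record["data"]})
            continue

        if payload.get("level") == "DEBUG":
            # Hypothetical rule: debug entries are not stored.
            output.append({"recordId": record_id, "result": "Dropped",
                           "data": record["data"]})
            continue

        payload["source"] = "firehose-transform"  # hypothetical enrichment
        line = json.dumps(payload) + "\n"         # newline-delimited JSON output
        output.append({"recordId": record_id, "result": "Ok",
                       "data": base64.b64encode(line.encode("utf-8")).decode("utf-8")})
    return {"records": output}
```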

10. Security in Streaming Architectures


Key Security Controls:

1. Encryption

  • Data in transit → TLS
  • Data at rest → KMS

2. IAM Roles & Policies

  • Control access to streams and services

3. VPC Endpoints

  • Keep traffic between your VPC and the streaming service on the AWS network instead of the public internet

4. Fine-grained Access

  • Control producers and consumers separately
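
Separating producer and consumer access usually means two IAM policies with different actions on the same stream. The sketch below builds both as Python dictionaries and prints them as JSON. The stream ARN (account ID, Region, stream name) is a placeholder, and the action lists are a reasonable minimum rather than a complete policy for every client library.

```python
import json

# Placeholder ARN: replace the Region, account ID, and stream name with real values.
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream"

producer_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        # Producers may only write to the stream.
        "Action": ["kinesis:PutRecord", "kinesis:PutRecords"],
        "Resource": STREAM_ARN,
    }],
}

consumer_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        # Consumers may only discover shards and read from them.
        "Action": [
            "kinesis:DescribeStreamSummary",
            "kinesis:ListShards",
            "kinesis:GetShardIterator",
            "kinesis:GetRecords",
        ],
        "Resource": STREAM_ARN,
    }],
}

if __name__ == "__main__":
    print(json.dumps(producer_policy, indent=2))
    print(json.dumps(consumer_policy, indent=2))
```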

11. Monitoring and Troubleshooting


AWS Tools:

Amazon CloudWatch

  • Metrics:
    • Incoming data rate
    • Errors
    • Latency

CloudWatch Logs

  • Debug processing issues

Alarms

  • Trigger alerts on failures
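
As an example of an alarm, the sketch below builds the parameters for a CloudWatch alarm on the Kinesis metric `GetRecords.IteratorAgeMilliseconds`, which grows when consumers fall behind the stream. The alarm name, stream name, threshold, and evaluation settings are illustrative assumptions; the boto3 call is left as a comment so the example runs without AWS access.

```python
def iterator_age_alarm(stream_name, threshold_ms=60_000):
    """Build PutMetricAlarm parameters for consumer lag on a Kinesis stream."""
    return {
        "AlarmName": f"{stream_name}-iterator-age-high",  # hypothetical naming scheme
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": 60,             # evaluate one-minute data points
        "EvaluationPeriods": 5,   # alarm after five consecutive breaches
        "Threshold": float(threshold_ms),
        "ComparisonOperator": "GreaterThanThreshold",
    }


# Possible usage with boto3 (assumes credentials are configured):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(**iterator_age_alarm("example-stream"))

if __name__ == "__main__":
    print(iterator_age_alarm("example-stream"))
```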

12. Cost Optimization


Kinesis

  • Cost based on:
    • Number of shards
    • Data volume

Firehose

  • Pay per data processed

Lambda

  • Pay per request and execution duration

Exam Tip:

  • If you want low management overhead and cost efficiency → Firehose
  • If you want full control → Kinesis Data Streams

13. Common Exam Scenarios


Scenario 1:

Need real-time analytics with custom processing

→ Use:

  • Kinesis Data Streams + Lambda
    OR
  • Kinesis Data Analytics

Scenario 2:

Need simple delivery to S3 with minimal setup

→ Use:

  • Kinesis Data Firehose

Scenario 3:

Need multiple consumers reading same stream

→ Use:

  • Kinesis Data Streams (fan-out)

Scenario 4:

Need buffering and decoupling

→ Use:

  • SQS

Scenario 5:

Need ordered processing

→ Use:

  • Kinesis (same shard)
  • SQS FIFO

14. Key Differences (Very Important for Exam)

Feature    | Data Streams      | Firehose
Control    | High              | Low
Scaling    | Manual (shards)   | Automatic
Processing | Custom            | Limited
Use case   | Complex pipelines | Simple delivery

15. Final Exam Tips

  • Streaming = real-time processing
  • Kinesis is the core service
  • Choose:
    • Data Streams → flexibility
    • Firehose → simplicity
  • Use Lambda for transformations
  • Scale using:
    • Shards (Kinesis)
    • Auto-scaling (Lambda)
  • Ensure:
    • Durability
    • Security
    • Monitoring