Task Statement 3.5: Determine high-performing data ingestion and transformation solutions.

📘 AWS Certified Solutions Architect – Associate (SAA-C03)

1. What is Data Streaming Architecture?

A data streaming architecture is a system that processes data continuously in real time as it is generated.

Key idea:

Data is processed immediately (or near real-time)
Instead of waiting for batches, data flows continuously

Examples of streaming data in IT systems:

Application logs being generated continuously
Metrics from servers (CPU, memory usage)
User activity events from web or mobile apps
IoT sensor data

2. Streaming vs Batch Processing

Feature	Streaming	Batch
Data processing	Real-time	Scheduled
Latency	Very low (seconds/milliseconds)	High (minutes/hours)
Use case	Monitoring, alerting	Reporting, analytics
Complexity	Higher	Lower

Exam Tip:

If a question mentions real-time or low latency → Streaming
If a question mentions scheduled processing → Batch
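
To make the latency difference concrete, here is a minimal Python sketch. It is illustrative only: the function names and the sample CPU readings are made up for this example. The batch version gives one answer after all data has arrived; the streaming version gives an updated answer as each event arrives.

```python
from typing import Iterable, Iterator, List


def batch_average(readings: List[float]) -> float:
    """Batch: wait until the whole data set is collected, then compute once."""
    return sum(readings) / len(readings)


def streaming_average(readings: Iterable[float]) -> Iterator[float]:
    """Streaming: emit an updated average as soon as each reading arrives."""
    total = 0.0
    count = 0
    for value in readings:
        total += value
        count += 1
        yield total / count


cpu_readings = [40.0, 60.0, 80.0]  # hypothetical CPU % samples

print(batch_average(cpu_readings))            # one result, after all data is in
print(list(streaming_average(cpu_readings)))  # one result per event
```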

3. Core Components of a Streaming Architecture

A typical streaming system has 4 main layers:

1. Data Producers

These generate data continuously.

Examples:

Applications
Servers
IoT devices

2. Data Ingestion Layer

This collects and streams incoming data.

AWS Services:

Amazon Kinesis
Amazon MSK (Managed Kafka)
Amazon SQS (sometimes for buffering)

3. Data Processing Layer

This processes data in real time.

AWS Services:

AWS Lambda
Amazon Kinesis Data Analytics (now named Amazon Managed Service for Apache Flink)
Amazon EMR (stream processing)

4. Data Storage Layer

Stores processed or raw data.

AWS Services:

Amazon S3 (data lake)
Amazon DynamoDB (real-time storage)
Amazon Redshift (analytics)

4. AWS Streaming Services (Very Important)

1. Amazon Kinesis (Core Service)

Kinesis is the most important service for streaming in the exam.

Kinesis Components:

a) Kinesis Data Streams

Real-time data ingestion
Stores data for 24 hours to 365 days
Supports multiple consumers

Key Concepts:

Shard = unit of capacity
- 1 shard =
  - 1 MB/sec input (or 1,000 records/sec)
  - 2 MB/sec output
Partition key = decides which shard a record is written to
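
The per-shard limits translate directly into a sizing calculation. The sketch below is a rough estimate, not an official sizing tool; the function name and the sample workload are assumptions for illustration. It also applies the 1,000 records/sec per-shard write limit that Kinesis enforces alongside the 1 MB/sec limit.

```python
import math


def required_shards(write_mb_per_sec: float,
                    records_per_sec: float,
                    read_mb_per_sec: float) -> int:
    """Smallest shard count that satisfies all three per-shard limits."""
    by_write_bytes = math.ceil(write_mb_per_sec / 1.0)       # 1 MB/sec in per shard
    by_write_records = math.ceil(records_per_sec / 1000.0)   # 1,000 records/sec in per shard
    by_read_bytes = math.ceil(read_mb_per_sec / 2.0)         # 2 MB/sec out per shard
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


# Hypothetical workload: 5 MB/sec in, 3,000 records/sec, 12 MB/sec read in total
print(required_shards(5, 3000, 12))  # the read side is the bottleneck here
```

Whichever limit needs the most shards decides the stream size.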

When to use:

High-throughput streaming data
Multiple applications need same data

b) Kinesis Data Firehose (now named Amazon Data Firehose)

Fully managed delivery service
Automatically loads data into:
- S3
- Redshift
- OpenSearch

Features:

No shard management
Automatic scaling
Built-in transformation (Lambda)

When to use:

Simple pipeline → stream → storage
No need for custom processing

c) Kinesis Data Analytics (now named Amazon Managed Service for Apache Flink)

Real-time data processing using SQL or Apache Flink

When to use:

Real-time analytics
Filtering, aggregations

2. AWS Lambda (Serverless Processing)

Processes streaming data automatically
Works with:
- Kinesis
- SQS
- DynamoDB Streams

Features:

No server management
Auto scaling
Event-driven

When to use:

Lightweight transformations
Real-time triggers
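
As an illustration of a lightweight transformation, here is a minimal sketch of a Lambda handler that reads a batch from Kinesis. Kinesis delivers each payload base64-encoded under record["kinesis"]["data"]. The JSON body and its "level" field are assumptions for this example, as is the hand-built test event.

```python
import base64
import json


def handler(event, context):
    """Decode each Kinesis record in the batch and count error-level events."""
    errors = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)
        if item.get("level") == "ERROR":  # "level" is a hypothetical field
            errors.append(item)
    return {"processed": len(event["Records"]), "errors": len(errors)}


def _fake_record(body: dict) -> dict:
    """Build a record shaped like the ones Lambda receives from Kinesis."""
    data = base64.b64encode(json.dumps(body).encode("utf-8")).decode("ascii")
    return {"kinesis": {"partitionKey": "server-1", "data": data}}


sample_event = {"Records": [_fake_record({"level": "INFO"}),
                            _fake_record({"level": "ERROR"})]}
print(handler(sample_event, None))
```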

3. Amazon MSK (Managed Kafka)

Fully managed Apache Kafka service

When to use:

Kafka-based architectures
Complex event streaming systems

4. Amazon SQS (Buffering Layer)

Not a streaming service, but often used as a buffer in streaming architectures

Types:

Standard Queue → high throughput, at-least-once delivery, best-effort ordering
FIFO Queue → ordered, exactly-once processing (lower throughput)

Use case:

Decouple producers and consumers
Handle traffic spikes
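
The buffering effect can be shown with a small simulation. This is a plain in-memory model, not the SQS API; the function name, tick-based timing, and traffic numbers are assumptions for illustration. The consumer keeps a steady pace while the queue absorbs the spike.

```python
from collections import deque


def simulate_buffer(arrivals_per_tick, consumer_rate):
    """Return the queue depth after each tick for a fixed-rate consumer."""
    queue = deque()
    depths = []
    next_id = 0
    for arrivals in arrivals_per_tick:
        for _ in range(arrivals):              # producers enqueue messages
            queue.append(next_id)
            next_id += 1
        for _ in range(min(consumer_rate, len(queue))):
            queue.popleft()                    # consumer drains at its own pace
        depths.append(len(queue))
    return depths


# Spike of 10 messages in the second tick; the consumer handles 4 per tick
print(simulate_buffer([2, 10, 2, 0, 0], consumer_rate=4))  # [0, 6, 4, 0, 0]
```

No message is lost during the spike; the backlog simply drains over the following ticks.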

5. Data Flow Patterns in Streaming

1. Fan-Out Pattern

One stream → multiple consumers

Example:

One Kinesis stream → Lambda + Analytics + Storage

Types:

Shared throughput (standard consumers)
Enhanced fan-out (dedicated throughput per consumer)
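
The difference between the two types is easiest to see as arithmetic on the 2 MB/sec per-shard read limit. A minimal sketch, with a made-up function name, that ignores other limits such as per-shard read request rates:

```python
def read_mb_per_sec_per_consumer(consumers: int, enhanced_fan_out: bool) -> float:
    """Approximate read throughput each consumer gets from ONE shard."""
    per_shard_read_mb = 2.0  # 2 MB/sec output per shard
    if enhanced_fan_out:
        return per_shard_read_mb          # dedicated to every consumer
    return per_shard_read_mb / consumers  # shared among all consumers


print(read_mb_per_sec_per_consumer(4, enhanced_fan_out=False))  # 0.5
print(read_mb_per_sec_per_consumer(4, enhanced_fan_out=True))   # 2.0
```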

2. Producer → Stream → Consumer

Basic pipeline:

Producer → Kinesis → Lambda → S3

3. Stream → Buffer → Processing

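Producer → Kinesis → Lambda → SQS → Lambda

Kinesis does not deliver to SQS by itself; a consumer (e.g., Lambda) forwards records to the queue.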

Why?

Improve reliability
Prevent overload

6. Scaling in Streaming Architectures

Kinesis Scaling

Provisioned mode: scale by adding/removing shards (resharding)
On-demand mode: capacity adjusts automatically
More shards = more throughput

Exam Tip:

If throughput increases → increase shards

Lambda Scaling

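Automatically scales based on incoming events
With Kinesis as the source, concurrency is tied to the shard count (one concurrent batch per shard by default)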

Firehose Scaling

Fully automatic (no manual scaling)

7. Data Durability and Reliability

Kinesis Data Streams

Data replicated across multiple AZs
Retention:
- Default: 24 hours
- Max: 365 days

Firehose

Retries delivery automatically
Stores failed data in S3 (backup)

SQS

At-least-once delivery (Standard queues); messages stored redundantly across multiple AZs
Retains messages temporarily (default 4 days, max 14 days)

8. Ordering and Processing

Ordering Guarantees

Kinesis → ordered within a shard
SQS FIFO → strict ordering within a message group

Exam Tip:

Need strict ordering → use:
- Kinesis (same partition key → same shard)
- SQS FIFO
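
Records land in the same shard when they share a partition key: Kinesis takes the MD5 hash of the key as a 128-bit number and routes the record to the shard whose hash key range contains it. The sketch below assumes the ranges are split evenly across shards, which is a simplification (after resharding they need not be), and the function name is made up.

```python
import hashlib


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Map a partition key to a shard index, assuming evenly split hash ranges."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    hash_value = int.from_bytes(digest, "big")  # 128-bit integer
    range_size = 2 ** 128 // shard_count
    # Clamp so the top of the hash space falls in the last shard.
    return min(hash_value // range_size, shard_count - 1)


# Every event for the same (hypothetical) user goes to the same shard,
# so those events keep their relative order.
for key in ["user-42", "user-42", "user-7"]:
    print(key, "-> shard", shard_for_key(key, shard_count=4))
```

This is why a per-entity key such as a user or device ID is a common partition key: ordering is kept per entity while the load spreads across shards.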

Exactly Once vs At Least Once

Type	Meaning
At least once	May process duplicates
Exactly once	No duplicates

AWS Behavior:

Most services = at least once (e.g., Kinesis, SQS Standard)

Solution:

Use idempotent processing (handling the same record twice has the same effect as handling it once)
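
A minimal sketch of idempotent processing: each event carries a unique ID, and the processor skips IDs it has already applied. The class, the field names, and the in-memory set are assumptions for illustration; a real consumer would keep the seen IDs in a durable store (for example, a DynamoDB table written with a conditional put).

```python
class IdempotentProcessor:
    """Applies each event at most once, keyed by a unique event ID."""

    def __init__(self):
        self.seen_ids = set()  # stand-in for a durable deduplication store
        self.balance = 0

    def process(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event["event_id"] in self.seen_ids:
            return False
        self.seen_ids.add(event["event_id"])
        self.balance += event["amount"]
        return True


processor = IdempotentProcessor()
deliveries = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 5},
    {"event_id": "e1", "amount": 10},  # redelivered duplicate
]
results = [processor.process(e) for e in deliveries]
print(results, processor.balance)  # [True, True, False] 15
```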

9. Data Transformation in Streaming

Methods:

1. AWS Lambda

Simple transformations

2. Kinesis Data Analytics

SQL- or Apache Flink-based real-time processing

3. Firehose + Lambda

Inline transformation before storage
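
A sketch of what a Firehose transformation Lambda can look like. Firehose passes a batch under event["records"], each with a recordId and base64-encoded data, and expects every recordId back with a result of "Ok", "Dropped", or "ProcessingFailed". The JSON body and its "message" field are assumptions for this example.

```python
import base64
import json


def handler(event, context):
    """Firehose transformation: uppercase a field, drop records without it."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        if not payload.get("message"):  # "message" is a hypothetical field
            output.append({"recordId": record["recordId"],
                           "result": "Dropped",
                           "data": record["data"]})
            continue
        payload["message"] = payload["message"].upper()
        # Newline-delimit so each delivered object holds one JSON document per line.
        encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
        output.append({"recordId": record["recordId"],
                       "result": "Ok",
                       "data": encoded.decode("ascii")})
    return {"records": output}
```

The handler can be tried locally by passing a dict shaped like the Firehose event; no AWS call is made inside it.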

10. Security in Streaming Architectures

Key Security Controls:

1. Encryption

Data in transit → TLS
Data at rest → KMS

2. IAM Roles & Policies

Control access to streams and services

3. VPC Endpoints

Private communication from a VPC without crossing the public internet (interface endpoints, AWS PrivateLink)

4. Fine-grained Access

Control producers and consumers separately
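
One way to separate producers from consumers is to give each role its own IAM policy. The sketch below only builds the policy documents as Python dicts; the stream ARN, account ID, and helper name are hypothetical, and a real consumer may need more actions (for example, for enhanced fan-out or KMS decryption).

```python
import json

# Hypothetical stream ARN used only for illustration
STREAM_ARN = "arn:aws:kinesis:us-east-1:111122223333:stream/example-stream"


def allow_policy(actions):
    """Build a minimal identity-based policy allowing the actions on one stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": actions, "Resource": STREAM_ARN}
        ],
    }


# Producers may only write
producer_policy = allow_policy(["kinesis:PutRecord", "kinesis:PutRecords"])

# Consumers may only discover shards and read
consumer_policy = allow_policy([
    "kinesis:DescribeStreamSummary",
    "kinesis:ListShards",
    "kinesis:GetShardIterator",
    "kinesis:GetRecords",
])

print(json.dumps(producer_policy, indent=2))
```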

11. Monitoring and Troubleshooting

AWS Tools:

Amazon CloudWatch

Metrics:
- Incoming data rate
- Errors
- Latency

CloudWatch Logs

Debug processing issues

Alarms

Trigger alerts on failures
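
As a sketch of an alarm, the function below builds the parameters for a CloudWatch alarm on the Kinesis metric GetRecords.IteratorAgeMilliseconds, which grows when consumers fall behind. The function name, alarm name, thresholds, and SNS topic ARN are assumptions; with boto3 the dict could be passed as boto3.client("cloudwatch").put_metric_alarm(**params), which is not called here.

```python
def iterator_age_alarm(stream_name: str, max_age_ms: int, sns_topic_arn: str) -> dict:
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"{stream_name}-consumer-falling-behind",
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": 60,               # evaluate one-minute data points
        "EvaluationPeriods": 5,     # alarm after 5 consecutive breaches
        "Threshold": max_age_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


params = iterator_age_alarm(
    "example-stream",                                     # hypothetical stream
    max_age_ms=60_000,                                    # hypothetical threshold
    sns_topic_arn="arn:aws:sns:us-east-1:111122223333:example-alerts",
)
print(params["AlarmName"])
```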

12. Cost Optimization

Kinesis

Cost based on:
- Number of shards (billed per shard hour in provisioned mode)
- Data volume (PUT payload units)

Firehose

Pay per data processed

Lambda

Pay per request and execution duration

Exam Tip:

If you want low management + cost-efficient → Firehose
If you want full control → Kinesis Data Streams

13. Common Exam Scenarios

Scenario 1:

Need real-time analytics with custom processing

→ Use:

Kinesis Data Streams + Lambda
OR
Kinesis Data Analytics

Scenario 2:

Need simple delivery to S3 with minimal setup

→ Use:

Kinesis Data Firehose

Scenario 3:

Need multiple consumers reading same stream

→ Use:

Kinesis Data Streams (fan-out)

Scenario 4:

Need buffering and decoupling

→ Use:

Amazon SQS

Scenario 5:

Need ordered processing

→ Use:

Kinesis (same shard)
SQS FIFO

14. Key Differences (Very Important for Exam)

Feature	Data Streams	Firehose
Control	High	Low
Scaling	Manual (shards) or on-demand mode	Automatic
Processing	Custom	Limited
Use case	Complex pipelines	Simple delivery

15. Final Exam Tips

Streaming = real-time processing
Kinesis is the core service
Choose:
- Data Streams → flexibility
- Firehose → simplicity
Use Lambda for transformations
Scale using:
- Shards (Kinesis)
- Auto-scaling (Lambda)
Ensure:
- Durability
- Security
- Monitoring