Task Statement 3.5: Determine high-performing data ingestion and transformation solutions.
AWS Certified Solutions Architect – Associate (SAA-C03)
1. What is Data Streaming Architecture?
A data streaming architecture is a system that processes data continuously in real time as it is generated.
Key idea:
- Data is processed immediately (or near real-time)
- Instead of waiting for batches, data flows continuously
Examples of streaming data in IT systems:
- Application logs being generated continuously
- Metrics from servers (CPU, memory usage)
- User activity events from web or mobile apps
- IoT sensor data
2. Streaming vs Batch Processing
| Feature | Streaming | Batch |
|---|---|---|
| Data processing | Real-time | Scheduled |
| Latency | Very low (seconds/milliseconds) | High (minutes/hours) |
| Use case | Monitoring, alerting | Reporting, analytics |
| Complexity | Higher | Lower |
Exam Tip:
- If question mentions real-time, low latency → Streaming
- If question mentions scheduled processing → Batch
3. Core Components of a Streaming Architecture
A typical streaming system has 4 main layers:
1. Data Producers
These generate data continuously.
Examples:
- Applications
- Servers
- IoT devices
2. Data Ingestion Layer
This collects and streams incoming data.
AWS Services:
- Amazon Kinesis
- Amazon MSK (Managed Kafka)
- Amazon SQS (sometimes for buffering)
3. Data Processing Layer
This processes data in real time.
AWS Services:
- AWS Lambda
- Amazon Kinesis Data Analytics
- Amazon EMR (stream processing)
4. Data Storage Layer
Stores processed or raw data.
AWS Services:
- Amazon S3 (data lake)
- Amazon DynamoDB (real-time storage)
- Amazon Redshift (analytics)
4. AWS Streaming Services (Very Important)
1. Amazon Kinesis (Core Service)
Kinesis is the most important service for streaming in the exam.
Kinesis Components:
a) Kinesis Data Streams
- Real-time data ingestion
- Stores data for 24 hours to 365 days
- Supports multiple consumers
Key Concepts:
- Shard = unit of capacity
- 1 shard =
- 1 MB/sec (or 1,000 records/sec) input
- 2 MB/sec output
When to use:
- High-throughput streaming data
- Multiple applications need same data
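The per-shard limits under Key Concepts (1 MB/sec input, 2 MB/sec output) can be turned into a quick sizing estimate. The sketch below is only an illustration: the function name and sample workload are made up, and the 1,000 records/sec write limit is the figure AWS documents for a provisioned shard rather than something stated in these notes.

```python
import math

# Per-shard limits: 1 MB/s in and 2 MB/s out come from the notes;
# 1,000 records/s in is the documented write limit for a provisioned shard.
WRITE_MB_PER_SHARD = 1.0
WRITE_RECORDS_PER_SHARD = 1000
READ_MB_PER_SHARD = 2.0


def shards_needed(write_mb_s: float, records_s: float, read_mb_s: float) -> int:
    """Return the smallest shard count that satisfies all three limits."""
    return max(
        1,
        math.ceil(write_mb_s / WRITE_MB_PER_SHARD),
        math.ceil(records_s / WRITE_RECORDS_PER_SHARD),
        math.ceil(read_mb_s / READ_MB_PER_SHARD),
    )


# Hypothetical workload: 4.5 MB/s written, 3,000 records/s, 9 MB/s read.
print(shards_needed(write_mb_s=4.5, records_s=3000, read_mb_s=9.0))  # 5
```

Whichever limit is hit first decides the shard count, so a stream of many tiny records can need more shards than its MB/sec figure suggests.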
b) Kinesis Data Firehose (renamed Amazon Data Firehose)
- Fully managed delivery service
- Automatically loads data into:
- S3
- Redshift
- OpenSearch
Features:
- No shard management
- Automatic scaling
- Built-in transformation (Lambda)
When to use:
- Simple pipeline → stream → storage
- No need for custom processing
c) Kinesis Data Analytics (renamed Amazon Managed Service for Apache Flink)
- Real-time data processing using SQL or Apache Flink
When to use:
- Real-time analytics
- Filtering, aggregations
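To make "filtering, aggregations" concrete, here is a plain-Python sketch of a tumbling-window count, the kind of query a streaming analytics application typically runs. It is not the Flink or SQL API; the event tuples and the 60-second window are assumptions chosen for illustration.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_seconds=60):
    """Count events per (window_start, key) using fixed, non-overlapping windows."""
    counts = defaultdict(int)
    for timestamp, key in events:
        # Round the timestamp down to the start of its window.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Hypothetical (timestamp_in_seconds, event_type) pairs.
events = [(0, "login"), (15, "login"), (59, "click"), (60, "login"), (125, "click")]
print(tumbling_window_counts(events))
```

A real streaming engine does the same grouping continuously and emits each window's result as soon as the window closes, instead of waiting for the full list.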
2. AWS Lambda (Serverless Processing)
- Processes streaming data automatically
- Works with:
- Kinesis
- SQS
- DynamoDB Streams
Features:
- No server management
- Auto scaling
- Event-driven
When to use:
- Lightweight transformations
- Real-time triggers
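A lightweight transformation in Lambda might look like the sketch below. When Kinesis invokes a function, each record's payload arrives base64-encoded under `record["kinesis"]["data"]`. The JSON payload shape and the `level` field are assumptions for the example, not part of any AWS contract.

```python
import base64
import json


def handler(event, context):
    """Decode each Kinesis record and count error-level log events."""
    errors = []
    for record in event["Records"]:
        # Kinesis delivers the payload base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"])
        log_event = json.loads(payload)
        if log_event.get("level") == "ERROR":  # "level" is a hypothetical field
            errors.append(log_event)
    return {"error_count": len(errors)}
```

Lambda polls the stream and passes records in batches, so the function only deals with decoding and business logic.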
3. Amazon MSK (Managed Kafka)
- Fully managed Apache Kafka service
When to use:
- Kafka-based architectures
- Complex event streaming systems
4. Amazon SQS (Buffering Layer)
- Not a streaming tool, but used in streaming architecture
Types:
- Standard Queue → high throughput, at-least-once delivery, best-effort ordering
- FIFO Queue → ordered processing, exactly-once processing
Use case:
- Decouple producers and consumers
- Handle traffic spikes
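The effect of a buffer on traffic spikes can be seen in a small simulation. This is a toy model using an in-memory deque rather than SQS itself; the arrival numbers and consumer rate are made up.

```python
from collections import deque


def simulate_buffer(arrivals_per_tick, consumer_rate):
    """Return the queue backlog after each tick for a fixed-rate consumer."""
    queue = deque()
    backlog = []
    for arrivals in arrivals_per_tick:
        queue.extend(range(arrivals))  # producer enqueues without waiting
        for _ in range(min(consumer_rate, len(queue))):
            queue.popleft()  # consumer drains at its own pace
        backlog.append(len(queue))
    return backlog


# A spike of 10 messages arrives in the second tick; the consumer handles 4 per tick.
print(simulate_buffer([2, 10, 2, 0, 0], consumer_rate=4))  # [0, 6, 4, 0, 0]
```

The producer is never blocked and the consumer is never pushed past its rate; the queue absorbs the difference and drains once the spike has passed.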
5. Data Flow Patterns in Streaming
1. Fan-Out Pattern
- One stream → multiple consumers
Example:
- One Kinesis stream → Lambda + Analytics + Storage
Types:
- Shared throughput (standard consumers)
- Enhanced fan-out (dedicated throughput per consumer)
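The difference between the two fan-out types is easiest to see as arithmetic. The sketch below assumes the 2 MB/sec per-shard read figure from the notes, an even split among standard consumers, and that enhanced fan-out gives each registered consumer its own 2 MB/sec per shard; the function name is illustrative.

```python
READ_MB_PER_SHARD = 2.0  # per-shard read throughput from the notes


def read_mb_per_consumer(shards: int, consumers: int, enhanced_fan_out: bool) -> float:
    """Approximate read throughput available to each consumer, in MB/s."""
    total = shards * READ_MB_PER_SHARD
    if enhanced_fan_out:
        return total  # each consumer gets a dedicated 2 MB/s per shard
    return total / consumers  # standard consumers share the 2 MB/s per shard


print(read_mb_per_consumer(shards=4, consumers=4, enhanced_fan_out=False))  # 2.0
print(read_mb_per_consumer(shards=4, consumers=4, enhanced_fan_out=True))   # 8.0
```

With shared throughput, every added consumer reduces what the others can read, which is why enhanced fan-out is the usual answer when many consumers need low latency.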
2. Producer → Stream → Consumer
Basic pipeline:
Producer → Kinesis → Lambda → S3
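The producer side of this pipeline can be sketched as follows. The function takes the client as a parameter so it can be exercised without AWS access; `put_record` with `StreamName`, `Data`, and `PartitionKey` is the boto3 Kinesis call, while the `user_id` field and the stream name in the usage note are assumptions.

```python
import json


def send_event(kinesis_client, stream_name: str, event: dict) -> dict:
    """Write one JSON event to a Kinesis data stream."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(event).encode("utf-8"),
        # Records with the same partition key land on the same shard.
        PartitionKey=str(event["user_id"]),  # "user_id" is a hypothetical field
    )
```

With boto3 installed and credentials configured, a call would look like `send_event(boto3.client("kinesis"), "example-stream", {"user_id": 42, "action": "click"})`.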
3. Stream → Buffer → Processing
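Producer → Kinesis → SQS → Lambda
(Kinesis does not write to SQS directly; a consumer such as Lambda or EventBridge Pipes forwards records from the stream to the queue.)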
Why?
- Improve reliability
- Prevent overload
6. Scaling in Streaming Architectures
Kinesis Scaling
- Provisioned mode: scale by adding/removing shards (resharding)
- On-demand mode: capacity adjusts automatically
- More shards = more throughput
Exam Tip:
- If throughput increases → increase shards
Lambda Scaling
- Automatically scales based on incoming events
Firehose Scaling
- Fully automatic (no manual scaling)
7. Data Durability and Reliability
Kinesis Data Streams
- Data replicated across multiple AZs
- Retention:
- Default: 24 hours
- Max: 365 days
Firehose
- Retries delivery automatically
- Stores failed data in S3 (backup)
SQS
- Delivers each message at least once (Standard queues)
- Retains messages for up to 14 days (default: 4 days)
8. Ordering and Processing
Ordering Guarantees
- Kinesis → ordered within a shard (records with the same partition key go to the same shard)
- SQS FIFO → strict ordering within a message group
Exam Tip:
- Need strict ordering → use:
- Kinesis (same shard)
- SQS FIFO
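"Same shard" in practice means "same partition key". Kinesis hashes the partition key with MD5 into a 128-bit key space, and each shard owns a range of that space. The sketch below assumes the range is split evenly across shards, which need not hold after resharding; the function name is illustrative.

```python
import hashlib

HASH_SPACE = 2 ** 128  # MD5 produces a 128-bit hash key


def shard_for_key(partition_key: str, shard_count: int) -> int:
    """Return the index of the shard whose hash range contains the key.

    Assumes the hash space is divided evenly across shards.
    """
    hash_value = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    return hash_value * shard_count // HASH_SPACE


# The same key always maps to the same shard, so its records stay in order.
print(shard_for_key("user-42", 4) == shard_for_key("user-42", 4))  # True
```

Choosing a key such as a user or device ID keeps each entity's events ordered while still spreading load across shards.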
Exactly Once vs At Least Once
| Type | Meaning |
|---|---|
| At least once | May process duplicates |
| Exactly once | No duplicates |
AWS Behavior:
- Most services = at least once
Solution:
- Use idempotent processing
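Idempotent processing means a duplicate delivery has no additional effect. A minimal sketch, assuming every event carries a unique `id` (a hypothetical field) and using an in-memory set where a real system would use a durable store such as a database table:

```python
class IdempotentProcessor:
    """Apply each event at most once, keyed by a unique event ID."""

    def __init__(self):
        self.seen_ids = set()  # stand-in for a durable store
        self.balance = 0

    def process(self, event: dict) -> bool:
        """Return True if the event was applied, False if it was a duplicate."""
        if event["id"] in self.seen_ids:
            return False  # duplicate delivery: skip
        self.balance += event["amount"]
        self.seen_ids.add(event["id"])
        return True


processor = IdempotentProcessor()
payment = {"id": "evt-1", "amount": 50}
processor.process(payment)
processor.process(payment)  # redelivered by an at-least-once source
print(processor.balance)  # 50
```

With at-least-once delivery, the consumer rather than the service is responsible for making retries safe.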
9. Data Transformation in Streaming
Methods:
1. AWS Lambda
- Simple transformations
2. Kinesis Data Analytics
- SQL-based real-time processing
3. Firehose + Lambda
- Inline transformation before storage
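For the Firehose + Lambda option, the function receives a batch under `event["records"]` and must return every `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`, plus base64-encoded `data`. The sketch below follows that contract; the JSON payload and its `message` field are assumptions for illustration.

```python
import base64
import json


def handler(event, context):
    """Firehose transformation: uppercase the 'message' field of each record."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            payload["message"] = payload["message"].upper()  # hypothetical field
            line = json.dumps(payload) + "\n"  # newline-delimited output
            data = base64.b64encode(line.encode("utf-8")).decode("utf-8")
            output.append({"recordId": record["recordId"], "result": "Ok", "data": data})
        except (ValueError, KeyError):
            # Hand the untouched record back so Firehose can treat it as failed.
            output.append(
                {"recordId": record["recordId"], "result": "ProcessingFailed", "data": record["data"]}
            )
    return {"records": output}
```

Appending a newline to each record is a common choice so the delivered objects are easy to read line by line later.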
10. Security in Streaming Architectures
Key Security Controls:
1. Encryption
- Data in transit → TLS
- Data at rest → KMS
2. IAM Roles & Policies
- Control access to streams and services
3. VPC Endpoints
- Secure private communication
4. Fine-grained Access
- Control producers and consumers separately
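Controlling producers and consumers separately usually comes down to two IAM policies with different Kinesis actions. The sketch below builds both as Python dictionaries; the stream ARN is a placeholder, and a real consumer (for example one built on the Kinesis Client Library) may need further permissions.

```python
import json

# Hypothetical stream ARN; replace the region, account ID, and stream name.
STREAM_ARN = "arn:aws:kinesis:us-east-1:123456789012:stream/example-stream"


def stream_policy(actions):
    """Build a minimal IAM policy document allowing the given actions on one stream."""
    return {
        "Version": "2012-10-17",
        "Statement": [{"Effect": "Allow", "Action": list(actions), "Resource": STREAM_ARN}],
    }


# Producers may only write; consumers may only read.
producer_policy = stream_policy(["kinesis:PutRecord", "kinesis:PutRecords"])
consumer_policy = stream_policy(
    [
        "kinesis:DescribeStreamSummary",
        "kinesis:ListShards",
        "kinesis:GetShardIterator",
        "kinesis:GetRecords",
    ]
)
print(json.dumps(producer_policy, indent=2))
```

Keeping the two action sets disjoint means a compromised producer cannot read the stream and a compromised consumer cannot write to it.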
11. Monitoring and Troubleshooting
AWS Tools:
Amazon CloudWatch
- Metrics:
- Incoming data rate
- Errors
- Latency
CloudWatch Logs
- Debug processing issues
Alarms
- Trigger alerts on failures
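As an example of alarming on latency, the sketch below builds the arguments for a CloudWatch alarm on `GetRecords.IteratorAgeMilliseconds`, the Kinesis metric that shows how far consumers lag behind the stream. The alarm name, threshold, evaluation settings, and SNS topic are assumptions to adjust per workload.

```python
def iterator_age_alarm(stream_name: str, threshold_ms: int, sns_topic_arn: str) -> dict:
    """Build keyword arguments for CloudWatch's PutMetricAlarm API."""
    return {
        "AlarmName": f"{stream_name}-consumer-lag",  # hypothetical naming scheme
        "Namespace": "AWS/Kinesis",
        "MetricName": "GetRecords.IteratorAgeMilliseconds",
        "Dimensions": [{"Name": "StreamName", "Value": stream_name}],
        "Statistic": "Maximum",
        "Period": 60,  # seconds per datapoint
        "EvaluationPeriods": 5,  # alarm after 5 consecutive breaches
        "Threshold": threshold_ms,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


params = iterator_age_alarm(
    "example-stream", 60_000, "arn:aws:sns:us-east-1:123456789012:example-alerts"
)
print(params["MetricName"])
```

With boto3 available, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.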
12. Cost Optimization
Kinesis
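- Cost based on:
- Number of shards (shard-hours in provisioned mode)
- Data volume (PUT payload units, or per GB in on-demand mode)
Firehose
- Pay per GB of data ingested
Lambda
- Pay per request and execution duration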
Exam Tip:
- If you want low management + cost-efficient → Firehose
- If you want full control → Kinesis Data Streams
13. Common Exam Scenarios
Scenario 1:
Need real-time analytics with custom processing
→ Use:
- Kinesis Data Streams + Lambda
- OR Kinesis Data Analytics
Scenario 2:
Need simple delivery to S3 with minimal setup
→ Use:
- Kinesis Data Firehose
Scenario 3:
Need multiple consumers reading same stream
→ Use:
- Kinesis Data Streams (fan-out)
Scenario 4:
Need buffering and decoupling
→ Use:
- SQS
Scenario 5:
Need ordered processing
→ Use:
- Kinesis (same shard)
- SQS FIFO
14. Key Differences (Very Important for Exam)
| Feature | Data Streams | Firehose |
|---|---|---|
| Control | High | Low |
| Scaling | Manual (shards) in provisioned mode; automatic in on-demand mode | Automatic |
| Processing | Custom | Limited |
| Use case | Complex pipelines | Simple delivery |
15. Final Exam Tips
- Streaming = real-time processing
- Kinesis is the core service
- Choose:
- Data Streams → flexibility
- Firehose → simplicity
- Use Lambda for transformations
- Scale using:
- Shards (Kinesis)
- Auto-scaling (Lambda)
- Ensure:
- Durability
- Security
- Monitoring
