Task Statement 3.5: Determine high-performing data ingestion and transformation solutions.
📘 AWS Certified Solutions Architect – Associate (SAA-C03)
1. Introduction
When designing data ingestion and transformation systems on AWS, you must always consider:
- How much data (size) is being processed
- How fast (speed) the data is generated and needs to be processed
These two factors directly affect:
- Performance
- Cost
- Scalability
- Choice of AWS services (especially streaming services like Amazon Kinesis)
2. Understanding Data Size
What is Data Size?
Data size refers to the volume of data being ingested or processed.
Common Categories
- Small-scale data
- MBs to GBs
- Example: Application logs from a few servers
- Medium-scale data
- GBs to TBs
- Example: Logs from multiple services or databases
- Large-scale data
- TBs to PBs
- Example: Enterprise-wide data platforms, analytics pipelines
Why Data Size Matters
- Determines storage choice (S3, EBS, etc.)
- Impacts processing services (Lambda vs EMR vs Glue)
- Affects network throughput
- Influences cost
3. Understanding Data Speed
What is Data Speed?
Data speed refers to how fast data is generated and processed.
Types of Speed
1. Batch Processing (Low Speed)
- Data collected over time and processed later
- Example: Daily reports from database exports
2. Near Real-Time Processing
- Small delay (seconds to minutes)
- Example: Monitoring dashboards
3. Real-Time Streaming (High Speed)
- Data processed continuously, within milliseconds to seconds of arrival
- Example: Real-time log processing, metrics pipelines
Why Speed Matters
- Determines latency requirements
- Influences architecture design
- Decides streaming vs batch services
4. Matching Size and Speed to AWS Services
| Requirement | Best Approach |
|---|---|
| Low size + low speed | Batch processing (S3 + Lambda) |
| High size + low speed | Batch analytics (S3 + EMR/Glue) |
| Low size + high speed | Streaming (Kinesis, Lambda) |
| High size + high speed | High-throughput streaming (Kinesis, MSK) |
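The decision table above can be sketched as a simple selection helper. This is an illustrative heuristic only: the thresholds (1 TB of data, one minute of acceptable latency) are assumptions chosen for the sketch, not AWS-defined limits.

```python
def suggest_approach(data_tb: float, max_latency_s: float) -> str:
    """Map data size and latency needs to a candidate approach.

    Thresholds are illustrative assumptions: "high size" means roughly
    a terabyte or more, "high speed" means results needed within ~60 s.
    """
    high_size = data_tb >= 1.0
    high_speed = max_latency_s < 60
    if high_size and high_speed:
        return "High-throughput streaming (Kinesis, MSK)"
    if high_size:
        return "Batch analytics (S3 + EMR/Glue)"
    if high_speed:
        return "Streaming (Kinesis, Lambda)"
    return "Batch processing (S3 + Lambda)"

# Example: 5 TB needing sub-second results -> high-throughput streaming
print(suggest_approach(5.0, 0.5))
```

In a real design you would refine the thresholds per workload, but the branching logic mirrors the table exactly.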
5. Streaming Data Services in AWS
What is Streaming Data?
Streaming data is:
- Continuous
- Unbounded
- Generated in real-time
Instead of waiting for a complete dataset, records are processed as they arrive.
6. Amazon Kinesis Overview
Amazon Kinesis is a fully managed service used to:
- Collect
- Process
- Analyze real-time streaming data
7. Core Kinesis Services
1. Kinesis Data Streams (KDS)
Purpose
- Real-time ingestion of streaming data
Key Features
- Low latency (milliseconds)
- Scalable using shards
- Durable storage (24-hour retention by default, extendable up to 365 days)
Important Concepts
- Shard
- Unit of capacity
- Each shard supports:
- 1 MB/sec or 1,000 records/sec write
- 2 MB/sec read (shared across standard consumers)
- Producer
- Sends data into stream
- Consumer
- Reads data from stream
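How a producer's records land on shards can be sketched in pure Python. Kinesis hashes each record's partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash-key range contains it; the sketch below assumes the hash space is split evenly across shards, and the 4-shard stream size is just an example value.

```python
import hashlib

NUM_SHARDS = 4          # assumed stream size for this sketch
HASH_SPACE = 2 ** 128   # Kinesis hashes partition keys into a 128-bit space

def shard_for_key(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Return the shard index whose hash-key range contains this key.

    Mirrors Kinesis routing: MD5(partition_key) -> 128-bit integer,
    assuming evenly split shard hash-key ranges.
    """
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    return h * num_shards // HASH_SPACE

# The same partition key always maps to the same shard, which is what
# preserves per-key ordering within a stream.
assert shard_for_key("device-42") == shard_for_key("device-42")
```

This is why choosing a high-cardinality partition key matters: too few distinct keys concentrates traffic on a few shards ("hot shards") and wastes the rest of your provisioned throughput.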
2. Kinesis Data Firehose
Purpose
- Load streaming data directly into storage services
Key Features
- Fully managed (no shard management)
- Automatic scaling
- Near real-time (buffers data by size or time interval before delivery)
- Delivers data to:
- Amazon S3
- Amazon Redshift
- Amazon OpenSearch Service
Use Case
- When you want simple ingestion with no management
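Firehose is near real-time rather than real-time because it buffers records and flushes when either a size or a time threshold is reached. A minimal sketch of that buffering rule, assuming 5 MB / 300 s defaults (Firehose's configurable buffering hints for S3 delivery):

```python
import time

class BufferedDelivery:
    """Flush accumulated records when a size or time threshold is hit.

    Mimics Firehose buffering hints; the 5 MB / 300 s values are
    illustrative defaults, configurable per delivery stream.
    """
    def __init__(self, max_bytes=5 * 1024 * 1024, max_seconds=300):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.buffer = []
        self.buffered_bytes = 0
        self.last_flush = time.monotonic()

    def put(self, record: bytes) -> list:
        """Buffer one record; return the flushed batch if a threshold tripped."""
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        full = self.buffered_bytes >= self.max_bytes
        stale = time.monotonic() - self.last_flush >= self.max_seconds
        if full or stale:
            batch, self.buffer, self.buffered_bytes = self.buffer, [], 0
            self.last_flush = time.monotonic()
            return batch  # in Firehose this batch would be written to S3
        return []

# Tiny thresholds so the size-based flush is visible immediately:
d = BufferedDelivery(max_bytes=10, max_seconds=300)
assert d.put(b"12345") == []                     # 5 bytes buffered, below limit
assert d.put(b"67890") == [b"12345", b"67890"]   # hits 10 bytes -> flush
```

The buffering is the trade-off to remember for the exam: larger buffers mean fewer, bigger objects in S3 (cheaper, better for analytics) but higher delivery latency.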
3. Kinesis Data Analytics (now Amazon Managed Service for Apache Flink)
Purpose
- Process streaming data using SQL or Apache Flink
Key Features
- Real-time transformations
- Filtering, aggregation, enrichment
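The kind of real-time aggregation this service performs can be sketched as a tumbling window over a stream of (timestamp, value) events. This pure-Python stand-in shows the logic you would normally express in streaming SQL or Flink; the 60-second window is an arbitrary example.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per fixed (tumbling) time window.

    events: iterable of (epoch_seconds, value) tuples.
    Equivalent in spirit to a streaming-SQL GROUP BY over a
    60-second step on the event timestamp.
    """
    counts = defaultdict(int)
    for ts, _value in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

events = [(0, "a"), (30, "b"), (61, "c"), (119, "d"), (120, "e")]
print(tumbling_window_counts(events))  # {0: 2, 60: 2, 120: 1}
```

A tumbling window partitions time into non-overlapping buckets; sliding and session windows are the other common variants in stream processing.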
4. Kinesis Video Streams
Purpose
- Ingest, store, and process video streams from connected devices (for example, cameras)
8. When to Use Amazon Kinesis
Use Kinesis when:
- Data arrives continuously
- Low latency is required
- You need real-time analytics
- Data must be processed immediately
9. Choosing the Right Kinesis Service
| Requirement | Service |
|---|---|
| Full control over streaming | Kinesis Data Streams |
| No management, simple delivery | Kinesis Firehose |
| Real-time analytics | Kinesis Data Analytics |
| Video streaming | Kinesis Video Streams |
10. Performance and Scaling in Kinesis
Scaling with Shards (Kinesis Data Streams)
- Increase shards → increase throughput
- More shards = more parallel processing
Throughput Example
- 10 shards:
- 10 MB/sec write
- 20 MB/sec read
Important Exam Point
- Shard limits are critical for performance questions
- Know:
- 1 MB/sec write per shard
- 2 MB/sec read per shard
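The per-shard limits above translate directly into a sizing calculation: the shard count must cover both the write rate (1 MB/s per shard) and the read rate (2 MB/s per shard for shared-throughput consumers). A minimal sketch, ignoring the 1,000 records/s per-shard write limit for simplicity:

```python
import math

WRITE_MB_PER_SHARD = 1.0   # per-shard ingest limit
READ_MB_PER_SHARD = 2.0    # per-shard egress limit (shared consumers)

def shards_needed(write_mb_s: float, read_mb_s: float) -> int:
    """Smallest shard count satisfying both throughput limits."""
    for_writes = math.ceil(write_mb_s / WRITE_MB_PER_SHARD)
    for_reads = math.ceil(read_mb_s / READ_MB_PER_SHARD)
    return max(for_writes, for_reads, 1)

# 10 MB/s in and 20 MB/s out -> 10 shards
print(shards_needed(10, 20))   # 10
print(shards_needed(3, 14))    # 7: reads dominate (14 / 2)
```

The second call shows why you size for the larger of the two requirements: 3 MB/s of writes needs only 3 shards, but 14 MB/s of reads needs 7.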
11. Comparing Kinesis with Other Services
| Service | Type | Use Case |
|---|---|---|
| Kinesis | Streaming | Real-time ingestion |
| Amazon SQS | Queue | Message buffering |
| Amazon SNS | Pub/Sub | Notifications |
| Amazon MSK | Kafka | Advanced streaming |
12. Architecture Considerations
When designing a solution, consider:
1. Throughput
- How much data per second?
2. Latency
- Real-time vs batch?
3. Scalability
- Will data volume increase?
4. Durability
- Need data retention?
5. Cost
- More shards = higher cost
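The "more shards = higher cost" point can be made concrete with a back-of-the-envelope calculation. The shard-hour rate below is an assumption for illustration only; always check current Kinesis pricing, which also includes PUT payload charges and an on-demand capacity mode.

```python
# Illustrative only: the rate below is an assumption, not current AWS pricing.
SHARD_HOUR_USD = 0.015  # assumed provisioned-mode shard-hour rate

def monthly_shard_cost(num_shards: int, hours: int = 730) -> float:
    """Shard-hour cost for ~1 month, ignoring PUT payload charges."""
    return num_shards * hours * SHARD_HOUR_USD

print(round(monthly_shard_cost(10), 2))  # 10 shards ~= $109.50/month at the assumed rate
```

The linear relationship is the takeaway: doubling throughput by doubling shards doubles this portion of the bill, which is why right-sizing shard counts appears in cost-focused exam scenarios.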
13. Common Architecture Patterns
Pattern 1: Real-Time Processing
- Producer → Kinesis Data Streams → Lambda → Database
Pattern 2: Streaming to Storage
- Producer → Kinesis Firehose → S3
Pattern 3: Real-Time Analytics
- Producer → Kinesis Streams → Kinesis Analytics → Dashboard
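Pattern 1's Lambda step receives a batch of Kinesis records whose payloads are base64-encoded. A minimal handler sketch follows; the event shape matches Lambda's Kinesis integration, while `save_to_database` is a hypothetical placeholder for a real DynamoDB or RDS write.

```python
import base64
import json

def save_to_database(item: dict) -> None:
    """Hypothetical sink — stand-in for a real DynamoDB/RDS write."""
    print("saved:", item)

def handler(event, context=None):
    """Decode each Kinesis record in the batch and persist it."""
    saved = 0
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        save_to_database(json.loads(payload))
        saved += 1
    return {"records_processed": saved}

# Simulated event with one record, shaped like Lambda's Kinesis delivery:
fake_event = {"Records": [
    {"kinesis": {"data": base64.b64encode(b'{"metric": 42}').decode()}}
]}
print(handler(fake_event))  # {'records_processed': 1}
```

Lambda invokes this handler once per batch per shard, so shard count also bounds the consumer-side parallelism in this pattern.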
14. Key Exam Tips
Must Remember
- Streaming = real-time processing
- Kinesis = main AWS streaming service
- Shards control throughput
- Firehose = easiest option (no shard management)
Service Selection Logic
- Need control → Kinesis Data Streams
- Need simplicity → Firehose
- Need analytics → Data Analytics
Typical Exam Questions
You may be asked:
- Which service handles real-time ingestion?
- How to scale streaming throughput?
- Differences between Data Streams and Firehose
- When to use batch vs streaming
15. Summary
- Data size determines how much data is processed
- Data speed determines how fast it must be processed
- Streaming is used for real-time, continuous data
- Amazon Kinesis is the key AWS streaming solution
- Choose services based on:
- Throughput
- Latency
- Scalability
- Management overhead
