Task Statement 3.5: Determine high-performing data ingestion and transformation solutions.
📘 AWS Certified Solutions Architect – Associate (SAA-C03)
1. Introduction
When designing data ingestion and transformation systems on AWS, you must always consider:
- How much data (size) is being processed
- How fast (speed) the data is generated and needs to be processed
These two factors directly affect:
- Performance
- Cost
- Scalability
- Choice of AWS services (especially streaming services like Amazon Kinesis)
2. Understanding Data Size
What is Data Size?
Data size refers to the volume of data being ingested or processed.
Common Categories
- Small-scale data
- MBs to GBs
- Example: Application logs from a few servers
- Medium-scale data
- GBs to TBs
- Example: Logs from multiple services or databases
- Large-scale data
- TBs to PBs
- Example: Enterprise-wide data platforms, analytics pipelines
Why Data Size Matters
- Determines storage choice (S3, EBS, etc.)
- Impacts processing services (Lambda vs EMR vs Glue)
- Affects network throughput
- Influences cost
3. Understanding Data Speed
What is Data Speed?
Data speed refers to how fast data is generated and processed.
Types of Speed
1. Batch Processing (Low Speed)
- Data collected over time and processed later
- Example: Daily reports from database exports
2. Near Real-Time Processing
- Small delay (seconds to minutes)
- Example: Monitoring dashboards
3. Real-Time Streaming (High Speed)
- Data processed continuously, within milliseconds to seconds of arrival
- Example: Real-time log processing, metrics pipelines
Why Speed Matters
- Determines latency requirements
- Influences architecture design
- Decides streaming vs batch services
4. Matching Size and Speed to AWS Services
| Requirement | Best Approach |
|---|---|
| Low size + low speed | Batch processing (S3 + Lambda) |
| High size + low speed | Batch analytics (S3 + EMR/Glue) |
| Low size + high speed | Streaming (Kinesis, Lambda) |
| High size + high speed | High-throughput streaming (Kinesis, MSK) |
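The decision table above can be sketched as a simple selection helper. This is an illustrative heuristic only: the thresholds (1 TB of data, one minute of acceptable latency) are assumptions chosen for the sketch, not AWS-defined limits.

```python
def suggest_approach(data_tb: float, max_latency_s: float) -> str:
    """Map data size and latency needs to a candidate approach.

    Thresholds are illustrative assumptions: "high size" means roughly
    a terabyte or more, "high speed" means results needed within ~60 s.
    """
    high_size = data_tb >= 1.0
    high_speed = max_latency_s < 60
    if high_size and high_speed:
        return "High-throughput streaming (Kinesis, MSK)"
    if high_size:
        return "Batch analytics (S3 + EMR/Glue)"
    if high_speed:
        return "Streaming (Kinesis, Lambda)"
    return "Batch processing (S3 + Lambda)"

# Example: 5 TB needing sub-second results -> high-throughput streaming
print(suggest_approach(5.0, 0.5))
```

In a real design you would refine the thresholds per workload, but the branching logic mirrors the table exactly.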
5. Streaming Data Services in AWS
What is Streaming Data?
Streaming data is:
- Continuous
- Unbounded
- Generated in real-time
Instead of waiting for a complete dataset, records are processed as they arrive.
6. Amazon Kinesis Overview
Amazon Kinesis is a fully managed service used to:
- Collect
- Process
- Analyze real-time streaming data
7. Core Kinesis Services
1. Kinesis Data Streams (KDS)
Purpose
- Real-time ingestion of streaming data
Key Features
- Low latency (milliseconds)
- Scalable using shards
- Durable storage (24-hour retention by default, extendable up to 365 days)
Important Concepts
- Shard
- Unit of capacity
- Each shard supports:
- 1 MB/sec or 1,000 records/sec write
- 2 MB/sec read (shared across standard consumers)
- Producer
- Sends data into stream
- Consumer
- Reads data from stream
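How a producer's records land on shards can be sketched in pure Python. Kinesis hashes each record's partition key with MD5 into a 128-bit integer and routes the record to the shard whose hash-key range contains it; the sketch below assumes the hash space is split evenly across shards, and the 4-shard stream size is just an example value.

```python
import hashlib

NUM_SHARDS = 4          # assumed stream size for this sketch
HASH_SPACE = 2 ** 128   # Kinesis hashes partition keys into a 128-bit space

def shard_for_key(partition_key: str, num_shards: int = NUM_SHARDS) -> int:
    """Return the shard index whose hash-key range contains this key.

    Mirrors Kinesis routing: MD5(partition_key) -> 128-bit integer,
    assuming evenly split shard hash-key ranges.
    """
    h = int.from_bytes(hashlib.md5(partition_key.encode()).digest(), "big")
    return h * num_shards // HASH_SPACE

# The same partition key always maps to the same shard, which is what
# preserves per-key ordering within a stream.
assert shard_for_key("device-42") == shard_for_key("device-42")
```

This is why choosing a high-cardinality partition key matters: too few distinct keys concentrates traffic on a few shards ("hot shards") and wastes the rest of your provisioned throughput.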
2. Kinesis Data Firehose
Purpose
- Load streaming data directly into storage services
Key Features
- Fully managed (no shard management)
- Automatic scaling
- Near real-time (buffers data by size or time interval before delivery)
- Delivers data to:
- Amazon S3
- Amazon Redshift
- Amazon OpenSearch Service
Use Case
- When you want simple ingestion with no management
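Firehose is near real-time rather than real-time because it buffers records and flushes when either a size or a time threshold is reached. A minimal sketch of that buffering rule, assuming 5 MB / 300 s defaults (Firehose's configurable buffering hints for S3 delivery):

```python
import time

class BufferedDelivery:
    """Flush accumulated records when a size or time threshold is hit.

    Mimics Firehose buffering hints; the 5 MB / 300 s values are
    illustrative defaults, configurable per delivery stream.
    """
    def __init__(self, max_bytes=5 * 1024 * 1024, max_seconds=300):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.buffer = []
        self.buffered_bytes = 0
        self.last_flush = time.monotonic()

    def put(self, record: bytes) -> list:
        """Buffer one record; return the flushed batch if a threshold tripped."""
        self.buffer.append(record)
        self.buffered_bytes += len(record)
        full = self.buffered_bytes >= self.max_bytes
        stale = time.monotonic() - self.last_flush >= self.max_seconds
        if full or stale:
            batch, self.buffer, self.buffered_bytes = self.buffer, [], 0
            self.last_flush = time.monotonic()
            return batch  # in Firehose this batch would be written to S3
        return []

# Tiny thresholds so the size-based flush is visible immediately:
d = BufferedDelivery(max_bytes=10, max_seconds=300)
assert d.put(b"12345") == []                     # 5 bytes buffered, below limit
assert d.put(b"67890") == [b"12345", b"67890"]   # hits 10 bytes -> flush
```

The buffering is the trade-off to remember for the exam: larger buffers mean fewer, bigger objects in S3 (cheaper, better for analytics) but higher delivery latency.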
3. Kinesis Data Analytics (now Amazon Managed Service for Apache Flink)
Purpose
- Process streaming data using SQL or Apache Flink
Key Features
- Real-time transformations
- Filtering, aggregation, enrichment
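The kind of real-time aggregation this service performs can be sketched as a tumbling window over a stream of (timestamp, value) events. This pure-Python stand-in shows the logic you would normally express in streaming SQL or Flink; the 60-second window is an arbitrary example.

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per fixed (tumbling) time window.

    events: iterable of (epoch_seconds, value) tuples.
    Equivalent in spirit to a streaming-SQL GROUP BY over a
    60-second step on the event timestamp.
    """
    counts = defaultdict(int)
    for ts, _value in events:
        window_start = (ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

events = [(0, "a"), (30, "b"), (61, "c"), (119, "d"), (120, "e")]
print(tumbling_window_counts(events))  # {0: 2, 60: 2, 120: 1}
```

A tumbling window partitions time into non-overlapping buckets; sliding and session windows are the other common variants in stream processing.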
4. Kinesis Video Streams
Purpose
- Ingest, store, and process video streams from connected devices (for example, cameras)
8. When to Use Amazon Kinesis
Use Kinesis when:
- Data arrives continuously
- Low latency is required
- You need real-time analytics
- Data must be processed immediately
9. Choosing the Right Kinesis Service
| Requirement | Service |
|---|---|
| Full control over streaming | Kinesis Data Streams |
| No management, simple delivery | Kinesis Firehose |
| Real-time analytics | Kinesis Data Analytics |
| Video streaming | Kinesis Video Streams |
10. Performance and Scaling in Kinesis
Scaling with Shards (Kinesis Data Streams)
- Increase shards → increase throughput
- More shards = more parallel processing
Throughput Example
- 10 shards:
- 10 MB/sec write
- 20 MB/sec read
Important Exam Point
- Shard limits are critical for performance questions
- Know:
- 1 MB/sec write per shard
- 2 MB/sec read per shard
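The per-shard limits above translate directly into a sizing calculation: the shard count must cover both the write rate (1 MB/s per shard) and the read rate (2 MB/s per shard for shared-throughput consumers). A minimal sketch, ignoring the 1,000 records/s per-shard write limit for simplicity:

```python
import math

WRITE_MB_PER_SHARD = 1.0   # per-shard ingest limit
READ_MB_PER_SHARD = 2.0    # per-shard egress limit (shared consumers)

def shards_needed(write_mb_s: float, read_mb_s: float) -> int:
    """Smallest shard count satisfying both throughput limits."""
    for_writes = math.ceil(write_mb_s / WRITE_MB_PER_SHARD)
    for_reads = math.ceil(read_mb_s / READ_MB_PER_SHARD)
    return max(for_writes, for_reads, 1)

# 10 MB/s in and 20 MB/s out -> 10 shards
print(shards_needed(10, 20))   # 10
print(shards_needed(3, 14))    # 7: reads dominate (14 / 2)
```

The second call shows why you size for the larger of the two requirements: 3 MB/s of writes needs only 3 shards, but 14 MB/s of reads needs 7.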
11. Comparing Kinesis with Other Services
| Service | Type | Use Case |
|---|---|---|
| Kinesis | Streaming | Real-time ingestion |
| Amazon SQS | Queue | Message buffering |
| Amazon SNS | Pub/Sub | Notifications |
| Amazon MSK | Kafka | Advanced streaming |
12. Architecture Considerations
When designing a solution, consider:
1. Throughput
- How much data per second?
2. Latency
- Real-time vs batch?
3. Scalability
- Will data volume increase?
4. Durability
- Need data retention?
5. Cost
- More shards = higher cost
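The "more shards = higher cost" point can be made concrete with a back-of-the-envelope calculation. The shard-hour rate below is an assumption for illustration only; always check current Kinesis pricing, which also includes PUT payload charges and an on-demand capacity mode.

```python
# Illustrative only: the rate below is an assumption, not current AWS pricing.
SHARD_HOUR_USD = 0.015  # assumed provisioned-mode shard-hour rate

def monthly_shard_cost(num_shards: int, hours: int = 730) -> float:
    """Shard-hour cost for ~1 month, ignoring PUT payload charges."""
    return num_shards * hours * SHARD_HOUR_USD

print(round(monthly_shard_cost(10), 2))  # 10 shards ~= $109.50/month at the assumed rate
```

The linear relationship is the takeaway: doubling throughput by doubling shards doubles this portion of the bill, which is why right-sizing shard counts appears in cost-focused exam scenarios.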
13. Common Architecture Patterns
Pattern 1: Real-Time Processing
- Producer → Kinesis Data Streams → Lambda → Database
Pattern 2: Streaming to Storage
- Producer → Kinesis Firehose → S3
Pattern 3: Real-Time Analytics
- Producer → Kinesis Streams → Kinesis Analytics → Dashboard
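Pattern 1's Lambda step receives a batch of Kinesis records whose payloads are base64-encoded. A minimal handler sketch follows; the event shape matches Lambda's Kinesis integration, while `save_to_database` is a hypothetical placeholder for a real DynamoDB or RDS write.

```python
import base64
import json

def save_to_database(item: dict) -> None:
    """Hypothetical sink — stand-in for a real DynamoDB/RDS write."""
    print("saved:", item)

def handler(event, context=None):
    """Decode each Kinesis record in the batch and persist it."""
    saved = 0
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        save_to_database(json.loads(payload))
        saved += 1
    return {"records_processed": saved}

# Simulated event with one record, shaped like Lambda's Kinesis delivery:
fake_event = {"Records": [
    {"kinesis": {"data": base64.b64encode(b'{"metric": 42}').decode()}}
]}
print(handler(fake_event))  # {'records_processed': 1}
```

Lambda invokes this handler once per batch per shard, so shard count also bounds the consumer-side parallelism in this pattern.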
14. Key Exam Tips
Must Remember
- Streaming = real-time processing
- Kinesis = main AWS streaming service
- Shards control throughput
- Firehose = easiest option (no shard management)
Service Selection Logic
- Need control → Kinesis Data Streams
- Need simplicity → Firehose
- Need analytics → Data Analytics
Typical Exam Questions
You may be asked:
- Which service handles real-time ingestion?
- How to scale streaming throughput?
- Differences between Data Streams and Firehose
- When to use batch vs streaming
15. Summary
- Data size determines how much data is processed
- Data speed determines how fast it must be processed
- Streaming is used for real-time, continuous data
- Amazon Kinesis is the key AWS streaming solution
- Choose services based on:
- Throughput
- Latency
- Scalability
- Management overhead
