Sizes and speeds needed to meet business requirements

Task Statement 3.5: Determine high-performing data ingestion and transformation solutions.

📘 AWS Certified Solutions Architect – Associate (SAA-C03)


1. What This Topic Means

When designing data ingestion and transformation solutions, you must decide:

  • How much data (size) is being processed
  • How fast (speed) the data must be processed

These two factors directly affect:

  • Performance
  • Cost
  • Scalability
  • Service selection

👉 In simple terms:
“How big is the data, and how quickly must it move?”


2. Understanding Data Size

2.1 Types of Data Sizes

Small-scale Data

  • MBs to a few GBs
  • Fits easily in memory or small storage

Medium-scale Data

  • Tens to hundreds of GBs
  • May require distributed processing

Large-scale Data

  • TBs to PBs
  • Requires highly scalable, distributed systems

2.2 Why Size Matters

Data size affects:

1. Storage Choice

  • Small → simple object storage (e.g., Amazon S3)
  • Large → partitioned storage (S3 data lakes, distributed file systems)

2. Processing Method

  • Small → batch processing on one instance
  • Large → parallel processing (e.g., AWS Glue, EMR)

3. Transfer Strategy

  • Small → direct upload
  • Large → optimized transfer tools (multipart upload, AWS DataSync)
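
The multipart-upload idea above can be sketched without touching AWS at all: split a large payload into fixed-size parts that could then be uploaded in parallel. The 8 MiB part size is an illustrative choice; real S3 multipart uploads require each part (except the last) to be at least 5 MiB.

```python
# Sketch: splitting a large object into fixed-size parts, as S3 multipart
# upload does. Part size here is illustrative; S3 requires parts >= 5 MiB
# (except the final part).

def split_into_parts(data: bytes, part_size: int) -> list[bytes]:
    """Return consecutive chunks of `data`, each at most `part_size` bytes."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

payload = b"x" * (20 * 1024 * 1024)            # 20 MiB of dummy data
parts = split_into_parts(payload, 8 * 1024 * 1024)

print(len(parts))                              # 3 parts: 8 + 8 + 4 MiB
print([len(p) for p in parts])
```

Each part could then be uploaded concurrently, which is where the speed gain over a single sequential upload comes from.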

2.3 Key AWS Concepts Related to Size

Partitioning

  • Large datasets are split into smaller parts
  • Improves performance and parallelism

Compression

  • Reduces data size
  • Saves storage and speeds up transfer
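
A quick sketch of the compression point, using `gzip` from the Python standard library on repetitive CSV-like data (real pipelines more often use Snappy or Zstandard on Parquet/ORC files, but the effect is the same):

```python
# Sketch: compressing repetitive text with gzip. Repetitive, structured
# data (logs, CSV exports) typically compresses very well, which cuts
# both storage cost and transfer time.
import gzip

original = b"timestamp,user_id,event\n" * 10_000   # repetitive CSV-like data
compressed = gzip.compress(original)

ratio = len(compressed) / len(original)
print(f"{len(original)} -> {len(compressed)} bytes ({ratio:.1%})")
```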

File Formats

  • Efficient formats improve performance:
    • Columnar (Parquet, ORC)
    • Row-based (CSV, JSON)

👉 Exam Tip:

  • Columnar formats = faster analytics + less data scanned
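
The columnar-vs-row point can be made concrete with plain Python structures. In a columnar layout, an aggregate over one column touches only that column's values; a row layout forces a scan of every field of every row, which is exactly why columnar formats scan less data.

```python
# Sketch: the same table in row-based and columnar layouts.
rows = [{"user": f"u{i}", "country": "DE", "amount": i} for i in range(1000)]

# Columnar layout: one list per column.
columns = {
    "user":    [r["user"] for r in rows],
    "country": [r["country"] for r in rows],
    "amount":  [r["amount"] for r in rows],
}

# SUM(amount) over the columnar layout reads 1,000 values;
# over the row layout it would touch 3,000 fields.
total = sum(columns["amount"])
print(total)
```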

3. Understanding Data Speed

3.1 What Is Speed?

Speed refers to:

  • How quickly data is ingested
  • How quickly it is processed
  • How quickly results are available

3.2 Types of Data Processing Speeds

1. Batch Processing

  • Data processed at intervals (minutes, hours, daily)
  • Suitable when immediate results are NOT required

2. Near Real-Time Processing

  • Small delay (seconds to minutes)
  • Used when quick insights are needed

3. Real-Time (Streaming)

  • Continuous data processing
  • Very low latency (milliseconds to seconds)

3.3 Why Speed Matters

Speed affects:

1. System Design

  • Real-time → streaming architecture
  • Batch → scheduled jobs

2. Cost

  • Faster (real-time) systems = generally higher cost
  • Slower (batch) systems = generally cheaper

3. User Experience

  • Faster results = better responsiveness

4. Matching Size and Speed Together

This is the most important concept for the exam.

You must choose the right architecture based on BOTH size and speed.


4.1 Common Combinations

Small Size + Low Speed

  • Simple ingestion
  • Batch processing
  • Minimal infrastructure

Large Size + Low Speed

  • Data lakes (Amazon S3)
  • Batch processing (AWS Glue, EMR)
  • Partitioned data

Small Size + High Speed

  • Streaming ingestion
  • Real-time processing
  • Lightweight services

Large Size + High Speed

  • Distributed streaming systems
  • Parallel processing
  • Scalable architecture

👉 Exam Tip:

  • Large + Fast = most complex + most expensive
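
The four combinations above can be encoded in a small, hypothetical helper (the pattern names simply mirror the examples in this section; this is a study aid, not an official decision tool):

```python
# Sketch: a hypothetical lookup encoding the four size/speed combinations.

def recommend(size: str, speed: str) -> str:
    patterns = {
        ("small", "low"):  "simple ingestion + batch processing",
        ("large", "low"):  "S3 data lake + batch (AWS Glue / EMR)",
        ("small", "high"): "streaming ingestion + real-time processing",
        ("large", "high"): "distributed streaming + parallel processing",
    }
    return patterns[(size, speed)]

print(recommend("large", "low"))
```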

5. Throughput vs Latency (Important Exam Concept)

5.1 Throughput

  • Amount of data processed per second
  • Example: MB/s or GB/s

5.2 Latency

  • Time taken to process a single request

5.3 Key Difference

  Concept       Meaning
  Throughput    Volume of data processed per unit time
  Latency       Delay in processing a single request

👉 Exam Tip:

  • High throughput does NOT always mean low latency
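
A small numeric sketch of why the two are independent: a system can move a lot of data per second (high throughput) while each individual item still waits a long time for its result (high latency). The numbers below are arbitrary illustration values.

```python
# Sketch: throughput and latency measured on the same workload.
requests = 1000            # items processed
item_size_mb = 5           # MB per item
wall_time_s = 50           # total elapsed time for the whole batch
per_item_delay_s = 2.0     # time from submitting one item to seeing its result

throughput_mb_s = requests * item_size_mb / wall_time_s
print(f"throughput: {throughput_mb_s:.0f} MB/s, latency: {per_item_delay_s} s")
```

Here throughput is a healthy 100 MB/s, yet every item still waits 2 seconds: high throughput, but not low latency.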

6. Scaling for Size and Speed

6.1 Horizontal Scaling (Most Important)

  • Add more resources (instances, nodes)
  • Used for large-scale systems

6.2 Vertical Scaling

  • Increase power of a single resource
  • Limited scalability

👉 Exam Tip:

  • AWS best practices favor horizontal scaling (scale out rather than scale up)
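
Horizontal scaling can be simulated with a thread pool from the standard library: the same dataset is split into shards and processed by several workers instead of one bigger one.

```python
# Sketch: horizontal scaling simulated with a thread pool -- work is
# split across N workers rather than run on a single larger worker.
from concurrent.futures import ThreadPoolExecutor

def process_shard(shard: list[int]) -> int:
    return sum(x * x for x in shard)       # stand-in for a transformation

data = list(range(10_000))
shards = [data[i::4] for i in range(4)]    # split the work across 4 "nodes"

with ThreadPoolExecutor(max_workers=4) as pool:
    total = sum(pool.map(process_shard, shards))

print(total)
```

Adding more shards and workers scales the work out; making `process_shard` run on a faster machine would be the vertical alternative, with a hard ceiling.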

7. Data Ingestion Speed Optimization Techniques

1. Parallel Uploads

  • Upload multiple parts simultaneously

2. Streaming Services

  • Continuous data ingestion

3. Buffering

  • Temporarily stores data before processing

4. Batching

  • Groups data for efficient processing
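
Buffering and batching (items 3 and 4 above) can be sketched together: records accumulate in a buffer and are flushed in groups, similar in spirit to how Kinesis Data Firehose buffers by size and time before delivery. The batch size of 4 is an arbitrary illustration value.

```python
# Sketch: grouping buffered records into fixed-size batches before
# handing them to downstream processing.

def batch_records(records: list[str], batch_size: int) -> list[list[str]]:
    """Group records into batches of at most `batch_size`."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

events = [f"event-{n}" for n in range(10)]
batches = batch_records(events, batch_size=4)
print([len(b) for b in batches])           # [4, 4, 2]
```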

8. Data Transformation Speed Optimization

1. Distributed Processing

  • Process data across multiple nodes

2. In-Memory Processing

  • Faster than disk-based processing

3. Efficient File Formats

  • Reduces processing time

4. Partition Pruning

  • Only process required data
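
Partition pruning is easy to see in miniature: when data is laid out by a partition key (here, date-style prefixes like those used in S3 data lakes), a filter on that key means only the matching partition is read and the rest are skipped entirely.

```python
# Sketch: partition pruning over date-partitioned data. Only partitions
# matching the filter are scanned; the others are never touched.

partitions = {
    "date=2024-01-01": [100, 200],
    "date=2024-01-02": [300, 400],
    "date=2024-01-03": [500, 600],
}

wanted = "date=2024-01-02"
scanned = {k: v for k, v in partitions.items() if k == wanted}  # prune the rest

print(sum(sum(v) for v in scanned.values()))   # one partition read, not three
```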

9. Cost vs Performance Trade-off

This is a very common exam question area.

  Requirement              Result
  High speed               High cost
  Large data               More resources
  Real-time + large data   Very expensive

👉 Exam Tip:

  • Always choose the simplest solution that meets requirements

10. AWS Services Selection Based on Size & Speed

For Large Data

  • Amazon S3 (storage)
  • AWS Glue (ETL)
  • Amazon EMR (big data processing)

For High-Speed Streaming

  • Amazon Kinesis (Data Streams / Data Firehose)
  • AWS Lambda

For Batch Processing

  • AWS Glue
  • Amazon EMR

For Hybrid (Batch + Streaming)

  • Combine multiple services

11. Key Exam Scenarios to Remember

Scenario 1:

  • Data arrives slowly, processed daily
    ✅ Use batch processing

Scenario 2:

  • Continuous data flow, needs immediate processing
    ✅ Use streaming services

Scenario 3:

  • Massive data, no urgency
    ✅ Use distributed batch processing

Scenario 4:

  • Massive + real-time data
    ✅ Use streaming + distributed architecture

12. Common Mistakes (Exam Traps)

❌ Choosing real-time processing when not required
❌ Ignoring cost implications
❌ Not partitioning large datasets
❌ Using single-node processing for large data
❌ Confusing throughput with latency


13. Quick Summary (Must Remember)

  • Size = how much data
  • Speed = how fast data moves
  • Choose architecture based on both
  • Batch = slower but cheaper
  • Streaming = faster but more expensive
  • Large data requires:
    • Partitioning
    • Distributed processing
  • Optimize using:
    • Compression
    • Efficient formats
    • Parallel processing

14. Final Exam Tip

When you see a question:

  1. Identify:
    • Data size (small / large)
    • Required speed (batch / real-time)
  2. Then choose:
    • Storage
    • Processing method
    • AWS services

👉 If unsure:

  • Default to scalable, distributed, and cost-efficient solutions