Selecting appropriate compute options for data processing (for example, Amazon EMR)

Task Statement 3.5: Determine high-performing data ingestion and transformation solutions.

📘 AWS Certified Solutions Architect – Associate (SAA-C03)


1. What is Data Processing in AWS?

Data processing is the act of taking raw data and performing operations on it to make it useful. This can include:

  • Filtering data
  • Transforming formats
  • Aggregating data
  • Running analytics or machine learning workloads

In AWS, this often involves large-scale data stored in services like Amazon S3, databases, or data lakes. To process this data efficiently, you need compute resources — these are essentially virtual machines, containers, or clusters that perform the processing.


2. AWS Compute Options for Data Processing

AWS provides multiple compute options. Choosing the right one depends on:

  • Data size – small vs. massive data
  • Processing type – batch vs. streaming
  • Complexity – simple transformations vs. advanced analytics
  • Cost and scalability – how fast you need results and budget constraints

Here are the main options:


A. Amazon EMR (Elastic MapReduce)

Purpose: Big data processing using distributed frameworks like Hadoop, Spark, Presto, or Hive.

When to use:

  • Large datasets (terabytes or petabytes)
  • Complex batch processing, like aggregating logs or performing ETL (Extract, Transform, Load) jobs
  • Running SQL-like queries over unstructured or semi-structured data

Key Features:

  • Managed service: AWS handles provisioning, scaling, and patching.
  • Auto-scaling: EMR can automatically add or remove instances based on processing needs.
  • Integration: Reads and writes Amazon S3 directly for storage, connects to Amazon RDS and Amazon Redshift for databases, and can share table metadata with Amazon Athena through the AWS Glue Data Catalog.

Example in IT terms: Processing terabytes of system logs stored in S3 nightly to generate performance reports for servers.
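
A minimal boto3 sketch of that nightly pattern follows: it launches a transient EMR cluster that runs a single Spark step and terminates when the step finishes. The bucket paths, script location, instance types, and release label are placeholders, and the default EMR service roles are assumed to already exist in the account.

```python
import boto3

# Placeholder locations -- replace with your own buckets and script.
LOG_URI = "s3://example-logs-bucket/emr-logs/"
SCRIPT_URI = "s3://example-scripts-bucket/nightly_log_report.py"

emr = boto3.client("emr", region_name="us-east-1")

# Launch a transient cluster: run one Spark step, then shut down.
response = emr.run_job_flow(
    Name="nightly-log-aggregation",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    LogUri=LOG_URI,
    Instances={
        "InstanceGroups": [
            {"Name": "Primary", "InstanceRole": "MASTER",
             "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"Name": "Core", "InstanceRole": "CORE",
             "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        # Terminate the cluster once all steps complete.
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "aggregate-server-logs",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", SCRIPT_URI],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print("Cluster ID:", response["JobFlowId"])
```

Because the cluster is transient, you pay only while the nightly job runs — a common cost-optimization pattern the exam likes to contrast with always-on clusters.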


B. AWS Lambda

Purpose: Serverless compute for small or event-driven tasks.

When to use:

  • Lightweight transformations
  • Real-time processing of streaming data (like from Kinesis or DynamoDB Streams)
  • Quick responses without managing servers

Key Features:

  • No server management – you just upload code
  • Automatic scaling – scales instantly with traffic
  • Pay-per-use – cost is based on the number of requests and execution duration, not idle servers

Example in IT terms: Transforming incoming log data from IoT devices in real time to extract only error messages.
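
Here is a minimal handler sketch for that scenario, assuming the log records arrive on a Kinesis stream as JSON documents with a `level` field (that field name is an assumption about the log format, not part of the Lambda API).

```python
import base64
import json

def lambda_handler(event, context):
    """Triggered by a Kinesis stream; keeps only records marked as errors."""
    errors = []
    for record in event["Records"]:
        # Kinesis delivers record data base64-encoded.
        payload = base64.b64decode(record["kinesis"]["data"]).decode("utf-8")
        message = json.loads(payload)
        if message.get("level") == "ERROR":
            errors.append(message)

    # In a real pipeline you would forward these to S3, SNS, or another stream.
    print(f"Extracted {len(errors)} errors from {len(event['Records'])} records")
    return {"errorCount": len(errors)}
```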


C. Amazon EC2

Purpose: Traditional virtual machines for full control.

When to use:

  • Custom applications or software that cannot run on managed services
  • High-performance workloads with specific OS or configuration needs
  • Large batch jobs where cost optimization is secondary

Key Features:

  • Full control over OS, software, and instance types
  • Choice of instance type: compute-optimized, memory-optimized, GPU instances for analytics or ML

Example in IT terms: Running a custom data aggregation tool that combines logs from multiple sources before pushing to a data warehouse.
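
A hedged boto3 sketch of that batch pattern: launch one compute-optimized instance whose user data downloads a (hypothetical) aggregation script from S3, runs it, and shuts the instance down when the job completes. The AMI ID, bucket names, and instance profile are placeholders you would replace with your own.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# User data runs once at boot: fetch the custom tool, run it, then power off.
user_data = """#!/bin/bash
aws s3 cp s3://example-tools-bucket/log_aggregator.py /opt/log_aggregator.py
python3 /opt/log_aggregator.py --output s3://example-warehouse-staging/
shutdown -h now
"""

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="c5.4xlarge",         # compute-optimized for the batch job
    MinCount=1,
    MaxCount=1,
    UserData=user_data,
    IamInstanceProfile={"Name": "batch-processing-role"},  # assumed profile
    InstanceInitiatedShutdownBehavior="terminate",
)

print("Launched:", response["Instances"][0]["InstanceId"])
```

Setting the shutdown behavior to "terminate" means the instance cleans itself up after the batch run, so you only pay for the hours the job actually uses.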


D. AWS Fargate / Amazon ECS / Amazon EKS

Purpose: Containerized data processing workloads.

When to use:

  • Microservices architecture
  • Stateless processing jobs
  • Workloads that need portability across environments

Key Features:

  • Serverless container management (Fargate) – no need to manage EC2 nodes
  • Scalable clusters – ECS or EKS handles scheduling and load balancing
  • Integration with other AWS services – S3, RDS, CloudWatch

Example in IT terms: Running nightly ETL jobs in Docker containers that process and normalize data before inserting it into a database.
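
A minimal sketch of kicking off that nightly job on Fargate, assuming the ECS cluster, task definition, subnets, and security group already exist (all of those names and IDs below are placeholders).

```python
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

# Start one run of the containerized ETL job on Fargate -- no EC2 nodes to manage.
response = ecs.run_task(
    cluster="etl-cluster",                 # assumed existing cluster
    launchType="FARGATE",
    taskDefinition="nightly-etl:1",        # assumed registered task definition
    count=1,
    networkConfiguration={
        "awsvpcConfiguration": {
            "subnets": ["subnet-0abc1234"],
            "securityGroups": ["sg-0abc1234"],
            "assignPublicIp": "DISABLED",
        }
    },
)

print("Started task:", response["tasks"][0]["taskArn"])
```

In practice the same call is often wired to an EventBridge schedule so the container runs every night without any servers sitting idle in between.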


E. AWS Glue

Purpose: Serverless ETL service for large-scale data transformation.

When to use:

  • Preparing data for analytics or machine learning
  • Transforming semi-structured or unstructured data in S3
  • Building data catalogs for easy query access

Key Features:

  • Serverless: no need to manage infrastructure
  • Auto-generates code for transformations
  • Integrates with Amazon Athena, Redshift, and S3

Example in IT terms: Cleaning and normalizing IoT sensor data to a structured table for analytics dashboards.
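
Below is a minimal Glue Spark job sketch of that transformation. The database, table, field names, and output bucket are assumptions for illustration; in a real job they would come from your own Glue Data Catalog and S3 layout.

```python
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# Standard boilerplate for a Glue Spark job.
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glue_context = GlueContext(sc)
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read raw sensor records from a catalog table (database/table names assumed).
raw = glue_context.create_dynamic_frame.from_catalog(
    database="iot_raw", table_name="sensor_readings"
)

# Normalize field names and types into a clean, structured schema.
cleaned = ApplyMapping.apply(
    frame=raw,
    mappings=[
        ("device_id", "string", "device_id", "string"),
        ("ts", "string", "event_time", "timestamp"),
        ("temp", "double", "temperature_c", "double"),
    ],
)

# Write the structured output to S3 as Parquet for analytics dashboards.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-analytics-bucket/sensor_clean/"},
    format="parquet",
)

job.commit()
```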


3. How to Choose the Right Compute Option

| Criteria | Recommended Compute Option |
| --- | --- |
| Large-scale batch processing | Amazon EMR |
| Real-time, small data tasks | AWS Lambda |
| Custom software, OS control | Amazon EC2 |
| Containerized workloads | ECS / EKS / Fargate |
| ETL & data cataloging | AWS Glue |

Exam tip: Remember that EMR = big data batch, Lambda = small real-time, EC2 = custom control, Fargate/ECS/EKS = containerized, and Glue = serverless ETL.


4. Exam Pointers

  1. Performance & Cost: Selecting compute options should balance processing speed, scalability, and cost.
  2. Managed vs Self-managed: Managed services like EMR, Glue, and Lambda reduce operational overhead.
  3. Integration with other AWS services: Data processing usually connects with S3, Redshift, DynamoDB, Kinesis, Athena.
  4. Workload Type: Batch vs streaming is critical for choosing the correct service.

Key Takeaway:

  • For big batch jobs, choose EMR.
  • For serverless, event-driven tasks, choose Lambda.
  • For custom or legacy software, choose EC2.
  • For containerized apps, choose ECS/EKS/Fargate.
  • For ETL with minimal setup, choose Glue.

These distinctions are heavily tested on the SAA-C03 exam.
