What I Learned Managing a $2M AWS Data Engineering Architecture
A real-world deployment managing 100 terabytes of data in the life science domain
Source: Irfan Azim Saherwardi
Yes, you read that right: I managed an AWS data engineering marvel with a $2M annual budget for more than two years, back in my early days.☺️
I’ll share how MNCs use tech to give their customers the best experience, and to help their internal teams perform tasks seamlessly.😁
And yes, I’ll share some of the things you should not do (basically, my blunders).🥲
So let’s get started🚀🚀
Assessing Business Requirements✍️📝
This is the most crucial step, not only in an MNC but in a startup as well.
The first step towards a good AWS data architecture is knowing what you are working with.
For example (let me explain with a day-to-day life situation),😉
Consider the planning of a road trip: you know what distance you will be covering (the volume of data), what speed limits apply (data velocity), and what kind of roads you’ll be on (data variety).
— Are you dealing with gigabytes, terabytes, or exabytes🙄 of customer transactions every day?
— Or streaming real-time IoT data that never sleeps?
Learning your data’s size, speed, and complexity will guide you in selecting the correct tools and building a system that doesn’t just survive but thrives.
And then there’s this MONSTER: compliance and security😯. If you’re handling sensitive data, you can’t play with it without clearance and permissions.
Think HIPAA for health care, GDPR for Europe, or your company’s internal standards. 🤔
Your architecture needs to tick all those checkboxes while still being fast and reliable. And reliability is important because nobody likes downtime, you know.😣 Whether it is a Black Friday sale or a routine weekday, your system needs to perform like a champion and be ready for anything. So, let’s map this out the smart way and make sure every base is covered!
Cost Optimization Strategies💰💲💰
Source: pradeepl.com
The best architecture is an event-based architecture if you are a bootstrapped startup. (This is for you tech readers who dream of creating something on your own.)
But why am I saying this?
Let’s talk money 💸💰💸 because AWS can be a very expensive cloud provider if you don’t understand the difference between the services that you NEED and the services that you WANT (two totally different things).
One of my early mistakes: a simple data copy cost the team 💲300😲. How?
I ran an EMR job instead of a Glue job🥴 to copy 3.4 terabytes of data from the Prod to the Dev environment while building a new ETL pipeline for a new vendor. DO NOT DO THIS❌
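For a plain S3-to-S3 copy, a server-side copy (via `aws s3 sync`, S3 Batch Operations, or a few lines of boto3) does the job without spinning up a cluster. Here is a minimal sketch; the bucket names and prefix are hypothetical, not the real ones from that project.

```python
# A minimal sketch (hypothetical bucket names and prefix) of the cheaper route:
# a server-side S3 copy instead of launching an EMR cluster for a plain copy.
import boto3

s3 = boto3.resource("s3")
SRC_BUCKET = "prod-datalake-bucket"   # placeholder name
DST_BUCKET = "dev-datalake-bucket"    # placeholder name

def copy_prefix(prefix: str) -> None:
    """Copy every object under `prefix` from Prod to Dev, server-side."""
    for obj in s3.Bucket(SRC_BUCKET).objects.filter(Prefix=prefix):
        # copy() does a managed (multipart if needed) server-side copy,
        # so the bytes never flow through the machine running this script.
        s3.Object(DST_BUCKET, obj.key).copy({"Bucket": SRC_BUCKET, "Key": obj.key})

copy_prefix("vendor_x/raw/")  # hypothetical prefix
```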
No company wants to spend its entire fortune managing its tech.🫣
Rightsizing your resources is like finding the perfect fit in your wardrobe: no overspending on a fancy jacket (or instance type in AWS😆) you’ll never use and no squeezing into something too small that can’t handle the load.
AWS offers you various options: on-demand instances for quick victories, and reserved or spot instances for long-term discounts.
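When you genuinely do need a cluster, the market type of the worker nodes is where most of the savings hide. Below is a minimal sketch of launching an EMR cluster with spot core nodes via boto3; the names, instance types, region, and IAM roles are placeholders, not my production setup.

```python
# A minimal sketch, assuming you genuinely need an EMR cluster: run the workers
# on spot capacity and let the cluster terminate itself when the steps finish.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # placeholder region

response = emr.run_job_flow(
    Name="spot-etl-cluster",          # placeholder name
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"Name": "driver", "InstanceRole": "MASTER",
             "Market": "ON_DEMAND", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            # Spot capacity for the workers: the big cost saver.
            {"Name": "workers", "InstanceRole": "CORE",
             "Market": "SPOT", "InstanceType": "m5.xlarge", "InstanceCount": 4},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate after the steps finish
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster:", response["JobFlowId"])
```

The auto-terminate flag matters as much as the spot pricing: an idle cluster left running overnight burns money just as fast as an oversized one.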
AWS Data Architecture Design🤓🧠💡🤓🤔💪🏻
IF YOU ARE INTO ARCHITECTURE, THIS WILL BE HELPFUL📚🤓📊:
A. Data Ingestion Layer
- AWS Kinesis: Ideal for real-time data streaming, allowing for the ingestion of large volumes of data from various sources such as IoT devices and web applications.
- AWS Data Pipeline: Suitable for batch data processing, enabling the scheduling and automation of data movement and transformation tasks.
Handling Real-Time vs. Batch Data Ingestion
- Real-Time Ingestion: Utilize services like Amazon Kinesis Data Streams or AWS IoT Core to capture streaming data efficiently (see the producer sketch after this list).
- Batch Ingestion: Use AWS Data Pipeline or AWS Glue to manage scheduled data loads from on-premises systems or other cloud sources.
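To make the real-time path concrete, here is a tiny producer sketch that pushes sensor readings into a Kinesis data stream with boto3. The stream name and payload are made up for illustration; a real device fleet would batch records with put_records and handle retries and throttling.

```python
# A minimal sketch of real-time ingestion into Kinesis Data Streams
# (hypothetical stream name and payload).
import json
import boto3

kinesis = boto3.client("kinesis")

def send_reading(device_id: str, temperature_c: float) -> None:
    """Push one sensor reading onto the stream."""
    record = {"device_id": device_id, "temperature_c": temperature_c}
    kinesis.put_record(
        StreamName="iot-sensor-stream",      # placeholder stream name
        Data=json.dumps(record).encode(),
        PartitionKey=device_id,              # keeps a device's readings ordered per shard
    )

send_reading("sensor-42", 21.7)
```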
B. Data Storage Layer
- Amazon S3: Serves as the primary storage solution, offering scalability and durability for raw and processed data.
- Amazon Redshift: A powerful data warehousing solution for complex queries and analytics.
- Amazon RDS: Useful for structured relational database needs, supporting various database engines.
Implementing Data Partitioning and Indexing Strategies
- Data Partitioning: Organize data in S3 using prefixes to optimize query performance and reduce costs.
- Indexing Strategies: Implement indexing in Amazon Redshift to enhance query speed, using sort keys and distribution keys effectively (both ideas are sketched below).
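A small sketch of both ideas, with illustrative names only: landing events under date-based (Hive-style) S3 prefixes so query engines can prune partitions, plus a Redshift DDL string showing where DISTKEY and SORTKEY go.

```python
# A minimal sketch (placeholder bucket, table, and columns) of S3 partitioning
# by date prefix, plus Redshift DDL with a distribution key and a sort key.
import json
import datetime as dt
import boto3

s3 = boto3.client("s3")
BUCKET = "analytics-datalake"   # placeholder bucket

def write_event(event: dict) -> None:
    """Land one event under a Hive-style, date-partitioned prefix."""
    today = dt.date.today()
    key = (f"events/year={today.year}/month={today.month:02d}/"
           f"day={today.day:02d}/{event['event_id']}.json")
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(event).encode())

write_event({"event_id": "e-123", "amount": 42.0})

# Redshift side: distribute on the join key, sort on the filter column.
CREATE_SALES = """
CREATE TABLE sales (
    sale_id     BIGINT,
    customer_id BIGINT,
    sale_date   DATE,
    amount      DECIMAL(12, 2)
)
DISTKEY (customer_id)   -- co-locates rows that are joined on customer_id
SORTKEY (sale_date);    -- speeds up date-range scans
"""
```

With prefixes laid out like this, Athena and Redshift Spectrum only scan the partitions a query actually touches, which is where the cost savings come from.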
C. Data Processing Layer
- AWS Glue: Provides serverless ETL capabilities, allowing for easy transformation and loading of data into target storage solutions.
- Amazon EMR: Enables big data processing using frameworks like Apache Spark, Hive, and HBase for large-scale data analysis.
Designing ETL Workflows and Data Transformation Processes
- Create ETL workflows using AWS Glue to automate the extraction, transformation, and loading of data from various sources into the storage layer (a minimal job script follows this list).
- Use Amazon EMR for more complex transformations that require distributed processing capabilities.
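For flavor, here is roughly what a minimal Glue ETL job script looks like. The catalog database, table, and output path are placeholders, and the script assumes it runs inside the Glue job environment, where the awsglue libraries are available.

```python
# A minimal sketch of a Glue ETL job: read from the Glue Data Catalog,
# rename/retype a few columns, and write partition-friendly Parquet to S3.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Extract: read from the Glue Data Catalog (placeholder database/table).
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_zone", table_name="vendor_orders"
)

# Transform: keep and rename only the columns downstream analytics need.
cleaned = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("order_ts", "string", "order_date", "timestamp"),
        ("total", "double", "order_total", "double"),
    ],
)

# Load: write to the curated zone in S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://analytics-datalake/curated/orders/"},
    format="parquet",
)
job.commit()
```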
D. Data Access and Analytics Layer
- Amazon Athena: Allows users to perform ad-hoc querying directly on data stored in S3 without needing to load it into a database.
- Amazon QuickSight: A business intelligence service that provides interactive dashboards and visualizations based on the ingested data.
Implementing Data Cataloging and Metadata Management
- AWS Glue Data Catalog: Acts as a central repository for metadata about datasets stored in S3, enabling easier discovery and management of data assets.
- Ensure proper governance by implementing access controls and auditing capabilities across all layers (a small Athena query sketch follows below).
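And to close the loop, a small sketch of an ad-hoc Athena query over data registered in the Glue Data Catalog, run from boto3. The database, table, and results bucket are example names.

```python
# A minimal sketch of ad-hoc querying with Athena over S3 data that is
# registered in the Glue Data Catalog (placeholder database/table/bucket).
import time
import boto3

athena = boto3.client("athena")

query_id = athena.start_query_execution(
    QueryString="SELECT order_date, SUM(order_total) AS revenue "
                "FROM orders GROUP BY order_date ORDER BY order_date",
    QueryExecutionContext={"Database": "curated"},              # Glue Catalog database
    ResultConfiguration={"OutputLocation": "s3://analytics-query-results/"},
)["QueryExecutionId"]

# Poll until Athena finishes, then page through the results.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows[1:]:  # the first row is the header
        print([col.get("VarCharValue") for col in row["Data"]])
```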
I think that is a lot of information to process for one day.🤭😁😜❤️
In upcoming posts I’ll share even bigger blunders of mine, along with some healthy learnings for people who want to understand business and how to integrate tech to make their business processes seamless and fast.
Stay tuned guys.❤️❤️❤️