Machine Learning in Layman Terms
Why a Data Lake is Preferred
- Centralized Repository: A Data Lake acts as a single storage location for all your raw data, whether it's CSV, JSON, Parquet, Avro, or another format, without forcing a predefined schema.
- Scalability and Flexibility: Handles massive volumes of data and grows as needed (horizontal scaling). Cloud-based data lakes can scale almost instantly without complex infrastructure changes.
- Supports Diverse Data Types: Stores structured data (tables), semi-structured data (JSON, XML), unstructured data (images, audio, video, PDFs), and even streaming data from IoT or real-time sources.
- Cost-Effective Storage: Separates compute from storage, so you only pay for the storage tier you use. Raw data storage is generally cheaper than in a traditional data warehouse.
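As a small illustration of the "no predefined schema" point, the sketch below reads the same kind of records from two different raw formats (CSV and JSON) with pandas and combines them; the payloads and column names are hypothetical:

```python
import io
import pandas as pd

# Hypothetical raw payloads, as they might land in a data lake
# in two different formats with no enforced schema.
csv_payload = io.StringIO("customer_id,amount\n1,100.0\n2,250.5\n")
json_payload = io.StringIO('[{"customer_id": 3, "amount": 75.0}]')

# pandas infers the structure at read time ("schema on read").
csv_df = pd.read_csv(csv_payload)
json_df = pd.read_json(json_payload)

# The two sources can still be combined into one table.
combined = pd.concat([csv_df, json_df], ignore_index=True)
print(combined)
```

The same idea extends to Parquet or Avro files: the structure is interpreted when the data is read, not when it is stored.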
ETL = Extract → Transform → Load
ETL is a data integration process that moves data from multiple sources into a target system (like a Data Lake, Data Warehouse, or Analytics Platform) in three main steps:
Extract
- Pull data from various sources: databases, APIs, files, logs, streaming services, IoT devices, etc.
- Data can be structured (tables), semi-structured (JSON, XML), or unstructured (images, PDFs).
- The goal here is collection, not cleaning.
Transform
- Clean, normalize, and restructure the extracted data to fit the target system's needs.
- Common transformations:
  - Removing duplicates
  - Changing formats (e.g., date formats, currency)
  - Aggregating (e.g., daily sales → monthly sales)
  - Joining multiple datasets
  - Enriching with external data
- This is where business rules are applied.
Load
- Store the transformed data into the destination system.
- The target could be:
  - Data Warehouse (for analytics-ready data)
  - Data Lake (for raw or partially processed data)
  - Operational DB (for application usage)
- Loading can be batch (scheduled intervals) or real-time (streaming).
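The three steps above can be sketched end to end in pandas; the source data, column names, and output filename here are hypothetical:

```python
import io
import pandas as pd

# --- Extract: pull raw data from a source (a CSV string stands in for a file or API) ---
raw = io.StringIO(
    "OrderDate,Region,Sales\n"
    "2024-01-05,East,100\n"
    "2024-01-05,East,100\n"   # duplicate record
    "2024-01-20,West,200\n"
    "2024-02-03,East,150\n"
)
df = pd.read_csv(raw)

# --- Transform: apply business rules ---
df = df.drop_duplicates()                          # remove duplicates
df["OrderDate"] = pd.to_datetime(df["OrderDate"])  # normalize the date format
monthly = (
    df.groupby([df["OrderDate"].dt.to_period("M"), "Region"])["Sales"]
      .sum()                                       # daily sales -> monthly sales
      .reset_index()
)

# --- Load: write the analytics-ready result to the target system ---
monthly.to_csv("monthly_sales.csv", index=False)
print(monthly)
```

In a real pipeline the extract step would read from databases or APIs and the load step would write to a warehouse, but the shape of the process is the same.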
Data Transformation using Pandas
We will be using Jupyter Notebook in the web browser for practice - Jupyter Notebook
Once the notebook is open, click Upload and upload your CSV file.
Once uploaded, open the Python launcher inside Jupyter Notebook and select Python (Pyodide).
Let us try to read the existing data.
import pandas as pd
Note: To run the Python code line by line, you can either click "Run" as shown in the screenshot or press "Shift + Enter" on the keyboard. If the cursor moves to the next cell, your command executed successfully.
Here df stands for DataFrame; now we will learn how to read our data using the pandas library.
df=pd.read_csv("customer_sales_data.csv")
df.head() #shows first few rows
df.info() #get summary of dataframe
df.isnull().sum() ##get missing values from each column
df['Quantity'].unique() ##shows the distinct values in the Quantity column
Now let us try cleaning the data; this is where our transformation work starts.
df[df['Quantity'].isnull()] ##displays all rows with a null Quantity
df[df['Quantity'].isnull()][['Quantity','ProductName','Price']] ##displays only selected columns for those rows
quantity_median=df['Quantity'].median()
df['Quantity']=df['Quantity'].fillna(quantity_median)##fill all null values with quantity median values
df['PhoneNumber']=df['PhoneNumber'].fillna('Unknown')##replace empty values with 'Unknown'
df.to_csv('cleaned_data.csv', index=False)##write the cleaned data to a new CSV file
Here index refers to the row-number column that pandas generates automatically; index=False stops pandas from writing it to the file. This produces a new CSV file in which you can review the changes.
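The fill steps above can be verified on a small hypothetical DataFrame (the sample values below are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same kinds of gaps as the tutorial data.
df = pd.DataFrame({
    "Quantity": [1.0, np.nan, 3.0, np.nan, 2.0],
    "PhoneNumber": ["555-0101", None, "555-0102", None, None],
})

quantity_median = df["Quantity"].median()  # median of 1, 3, 2 -> 2.0
df["Quantity"] = df["Quantity"].fillna(quantity_median)
df["PhoneNumber"] = df["PhoneNumber"].fillna("Unknown")

# After filling, no column should have missing values left.
print(df.isnull().sum())
```

Running df.isnull().sum() again after the fills is a quick sanity check that the cleaning step actually removed every gap.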
Now let us label the Data
##To label the data
bins=[1,2,3,4]
labels=['low', 'medium', 'high']
df['Quantity_label'] = pd.cut(df['Quantity'], bins= bins, labels= labels, include_lowest=True)
Here bins and labels pair up as follows: the range 1-2 is labelled low, 2-3 is labelled medium, and 3-4 is labelled high, and the resulting category is stored in the new column Quantity_label.
df[['Quantity_label', 'Quantity']].head()
df.to_csv("transformed_data.csv", index=False)
This was one dataset and one type of transformation. You can categorize such transformed data and load it into an ML model, which we will discuss further on; this demo is an overview of how things are designed in ML.