Machine Learning in Layman Terms
Why a Data Lake is Preferred
- Centralized Repository: A Data Lake acts as a single storage location for all your raw data, whether it's CSV, JSON, Parquet, Avro, or another format, without forcing a predefined schema.
- Scalability and Flexibility: Handles massive volumes of data and grows as needed (horizontal scaling). Cloud-based data lakes can scale almost instantly without complex infrastructure changes.
- Supports Diverse Data Types: Stores structured data (tables), semi-structured data (JSON, XML), unstructured data (images, audio, video, PDFs), and even streaming data from IoT or real-time sources.
- Cost-Effective Storage: Separates compute from storage, so you only pay for the storage tier you use. Raw data storage is generally cheaper than in a traditional data warehouse.
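As a small illustration of the "no predefined schema" point, the sketch below reads the same kind of records from two different raw formats (CSV and JSON) with pandas and combines them; the payloads and column names are hypothetical:

```python
import io
import pandas as pd

# Hypothetical raw payloads, as they might land in a data lake
# in two different formats with no enforced schema.
csv_payload = io.StringIO("customer_id,amount\n1,100.0\n2,250.5\n")
json_payload = io.StringIO('[{"customer_id": 3, "amount": 75.0}]')

# pandas infers the structure at read time ("schema on read").
csv_df = pd.read_csv(csv_payload)
json_df = pd.read_json(json_payload)

# The two sources can still be combined into one table.
combined = pd.concat([csv_df, json_df], ignore_index=True)
print(combined)
```

The same idea extends to Parquet or Avro files: the structure is interpreted when the data is read, not when it is stored.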
ETL = Extract → Transform → Load
ETL is a data integration process that moves data from multiple sources into a target system (like a Data Lake, Data Warehouse, or Analytics Platform) in three main steps:
Extract
- Pull data from various sources: databases, APIs, files, logs, streaming services, IoT devices, etc.
- Data can be structured (tables), semi-structured (JSON, XML), or unstructured (images, PDFs).
- The goal here is collection, not cleaning.
Transform
- Clean, normalize, and restructure the extracted data to fit the target system's needs.
- Common transformations:
  - Removing duplicates
  - Changing formats (e.g., date formats, currency)
  - Aggregating (e.g., daily sales → monthly sales)
  - Joining multiple datasets
  - Enriching with external data
- This is where business rules are applied.
Load
- Store the transformed data into the destination system.
- The target could be:
  - Data Warehouse (for analytics-ready data)
  - Data Lake (for raw or partially processed data)
  - Operational DB (for application usage)
- Loading can be batch (scheduled intervals) or real-time (streaming).
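The three steps above can be sketched end to end in pandas; the source data, column names, and output filename here are hypothetical:

```python
import io
import pandas as pd

# --- Extract: pull raw data from a source (a CSV string stands in for a file or API) ---
raw = io.StringIO(
    "OrderDate,Region,Sales\n"
    "2024-01-05,East,100\n"
    "2024-01-05,East,100\n"   # duplicate record
    "2024-01-20,West,200\n"
    "2024-02-03,East,150\n"
)
df = pd.read_csv(raw)

# --- Transform: apply business rules ---
df = df.drop_duplicates()                          # remove duplicates
df["OrderDate"] = pd.to_datetime(df["OrderDate"])  # normalize the date format
monthly = (
    df.groupby([df["OrderDate"].dt.to_period("M"), "Region"])["Sales"]
      .sum()                                       # daily sales -> monthly sales
      .reset_index()
)

# --- Load: write the analytics-ready result to the target system ---
monthly.to_csv("monthly_sales.csv", index=False)
print(monthly)
```

In a real pipeline the extract step would read from databases or APIs and the load step would write to a warehouse, but the shape of the process is the same.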
Data Transformation using Pandas
We will be using Jupyter Notebook in the web browser for practice - Jupyter Notebook
Once the notebook is open, click Upload and upload your CSV file.
Once uploaded, open the Python launcher inside Jupyter Notebook and select Python (Pyodide).
Let us try to read the existing data.
import pandas as pd
Note: To run the Python code line by line, you can either click "Run" as shown in the screenshot or press "Shift + Enter" on the keyboard. If the cursor moves to the next cell, your command executed successfully.
Here df stands for DataFrame; now we will learn how to read our data using the pandas library.
df=pd.read_csv("customer_sales_data.csv")
df.head() #shows first few rows
df.info() #get summary of dataframe
df.isnull().sum() ##get missing values from each column
df['Quantity'].unique() ##shows the distinct values in the Quantity column
Now let us try cleaning the data; this is where our transformation work starts.
df[df['Quantity'].isnull()] ##displays all rows with a null Quantity
df[df['Quantity'].isnull()][['Quantity','ProductName','Price']] ##displays only selected columns for those rows
quantity_median=df['Quantity'].median()
df['Quantity']=df['Quantity'].fillna(quantity_median)##fill all null values with quantity median values
df['PhoneNumber']=df['PhoneNumber'].fillna('Unknown')##replace empty values with 'Unknown'
df.to_csv('cleaned_data.csv', index=False)##write the cleaned data to a new CSV file
Here index refers to the row-number column that pandas generates automatically; index=False stops pandas from writing it to the file. This produces a new CSV file in which you can review the changes.
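The fill steps above can be verified on a small hypothetical DataFrame (the sample values below are made up):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same kinds of gaps as the tutorial data.
df = pd.DataFrame({
    "Quantity": [1.0, np.nan, 3.0, np.nan, 2.0],
    "PhoneNumber": ["555-0101", None, "555-0102", None, None],
})

quantity_median = df["Quantity"].median()  # median of 1, 3, 2 -> 2.0
df["Quantity"] = df["Quantity"].fillna(quantity_median)
df["PhoneNumber"] = df["PhoneNumber"].fillna("Unknown")

# After filling, no column should have missing values left.
print(df.isnull().sum())
```

Running df.isnull().sum() again after the fills is a quick sanity check that the cleaning step actually removed every gap.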
Now let us label the Data
##To label the data
bins=[1,2,3,4]
labels=['low', 'medium', 'high']
df['Quantity_label'] = pd.cut(df['Quantity'], bins= bins, labels= labels, include_lowest=True)
Here bins and labels pair up as follows: the range 1-2 is labelled low, 2-3 is labelled medium, and 3-4 is labelled high, and the resulting category is stored in the new column Quantity_label.
df[['Quantity_label', 'Quantity']].head()
df.to_csv("transformed_data.csv", index=False)
This was one dataset and one type of transformation. You can categorize such transformed data and load it into an ML model, which we will discuss further on; this demo is an overview of how things are designed in ML.