Day 1: Introduction to Data Science

Welcome back with a different set of content.

As part of a New Year’s resolution, I will be shifting my focus to this series, which covers Data Science, Machine Learning, Artificial Intelligence, Generative AI, RAG, and more. So gear up for an exciting ride, and let us learn together.

What the Heck is Data Science?

Data Science is the practice of turning data into insights. It’s a blend of statistics, programming, and domain expertise, all working together to solve complex problems.


Imagine This:

Netflix, Hulu, YouTube, or any other streaming platform wants to recommend your next favorite movie or video. How does it do that?
The answer: the platform uses Data Science to analyze:

  • The movies you’ve watched.

  • How long you’ve watched them.

  • What other users with similar preferences liked.

By processing this data, it predicts what you’re most likely to enjoy next. That’s Data Science in action!
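To make the "users with similar preferences" idea concrete, here is a minimal sketch. It is not how Netflix or any real platform works: the viewers, titles, and watch hours are made up, and cosine similarity is just one common way to compare viewers. It uses only the Python standard library, so it runs before any setup.

```python
import math

# Hypothetical watch history: hours each viewer spent on each title.
watch_hours = {
    "you":   {"Space Saga": 5.0, "Cooking Duel": 0.5},
    "alex":  {"Space Saga": 4.0, "Cooking Duel": 1.0, "Robot Wars": 3.0},
    "priya": {"Space Saga": 0.5, "Cooking Duel": 6.0, "Baking Stars": 4.0},
}

def cosine_similarity(a, b):
    """Cosine similarity between two sparse title -> hours dictionaries."""
    shared = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(target, history):
    """Rank unseen titles by similarity-weighted hours from other viewers."""
    seen = history[target]
    scores = {}
    for other, titles in history.items():
        if other == target:
            continue
        sim = cosine_similarity(seen, titles)
        for title, hours in titles.items():
            if title not in seen:
                scores[title] = scores.get(title, 0.0) + sim * hours
    return sorted(scores, key=scores.get, reverse=True)

print(recommend("you", watch_hours))
```

In this toy data, "alex" watches much like "you" do, so the title only alex has seen ranks first.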


Importance of Data Science

Data Science is predominantly used to support complex business decisions by predicting future outcomes from previous ones, and it drives decision-making across many industries. Here are a few examples:

  • Healthcare: Predict diseases early to save lives.

  • Retail: Optimize inventory so businesses never run out of popular products.

  • Transportation: Apps like Uber match riders and drivers efficiently, even during busy times.


Data Science Process

At its core, Data Science follows a structured process.

  • Define the Problem: What are you trying to solve? Example: Predict which customers will buy a product.

  • Collect Data: Gather relevant data from various sources like databases, APIs, or logs.

  • Preprocess Data: Clean and prepare data by handling missing values and removing duplicates.

  • Analyze Data: Use techniques like clustering or regression to identify trends.

  • Build Models: Train algorithms to predict or classify outcomes.

  • Deploy and Monitor: Put the model into production and track its performance over time.
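The steps above can be sketched end to end on a tiny, made-up dataset. This is an illustration only: the column names `ad_spend` and `sales`, the median fill, and the straight-line model are all assumptions chosen to keep the example short. It needs Pandas and NumPy, which are installed in the setup section further down.

```python
import numpy as np
import pandas as pd

# Collect: a made-up table standing in for data pulled from a database or API.
raw = pd.DataFrame({
    "ad_spend": [10.0, 20.0, 20.0, 30.0, np.nan, 50.0],
    "sales":    [25.0, 45.0, 45.0, 65.0, 85.0, 105.0],
})

# Preprocess: drop exact duplicate rows, then fill the missing value
# with the column median (one simple strategy among many).
clean = raw.drop_duplicates().copy()
clean["ad_spend"] = clean["ad_spend"].fillna(clean["ad_spend"].median())

# Analyze / build a model: fit a straight line,
# sales = slope * ad_spend + intercept.
slope, intercept = np.polyfit(clean["ad_spend"], clean["sales"], deg=1)

# "Deploy": wrap the fitted line in a function that can score new inputs.
def predict_sales(ad_spend):
    return slope * ad_spend + intercept

print(clean)
print(f"Predicted sales at ad_spend=40: {predict_sales(40):.1f}")
```

Note that the filled-in value is only a guess, so it pulls the fitted line away from the other points. Monitoring a deployed model is largely about catching this kind of drift between assumptions and reality.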


Let’s move on to Big Data.

Big Data

In simple words: “Handling large amounts of data”.

Big Data refers to datasets that are too large and complex for traditional tools to handle. It’s everywhere:

  • Millions of posts published on X every day.

  • Thousands of transactions processed every second by e-commerce sites.

  • Sensor data from IoT devices like smart thermostats.

Big Data is defined by the 5 Vs:

Big Data Characteristics

  • Volume: Massive amounts of data.

  • Velocity: Data arrives rapidly (think stock market updates).

  • Variety: Data comes in different formats (text, images, videos).

  • Veracity: Data can be uncertain or inconsistent (e.g., rumors on social media).

  • Value: The ultimate goal is to extract meaningful insights.
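One practical consequence of Volume and Velocity is that you often cannot load everything into memory at once. A minimal sketch of the usual workaround, assuming a made-up stream of transaction amounts, is to process the data in small batches and keep only running totals. Real Big Data tools do far more than this; the function names here are hypothetical.

```python
from itertools import islice

def transaction_stream(n):
    """Yield n made-up transaction amounts one at a time (stand-in for a real feed)."""
    for i in range(n):
        yield (i % 100) + 0.5

def batches(iterable, size):
    """Group an iterator into lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

count = 0
total = 0.0
for batch in batches(transaction_stream(100_000), size=10_000):
    # Only one batch is held in memory at any moment.
    count += len(batch)
    total += sum(batch)

print(f"{count} transactions, average amount {total / count:.2f}")
```

The same pattern applies whether the batches come from a file, a database cursor, or a message queue.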


Setting up Coding Environment

Download and install Python from python.org.

Once Python is installed, proceed to install Jupyter Notebook:

pip install notebook

We will also need a few libraries to get started. We’ll use these often:

  • Pandas for data manipulation.

  • NumPy for numerical operations.

  • Matplotlib and Seaborn for visualizations.

pip install pandas numpy matplotlib seaborn

Run this in a Jupyter Notebook to ensure everything works:

import pandas as pd
import numpy as np
print("Environment is ready!")
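If you also want to confirm which versions were installed, a small sketch using the standard library’s `importlib.metadata` (available in Python 3.8 and later) can report them. The helper name `installed_versions` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

report = installed_versions(["notebook", "pandas", "numpy", "matplotlib", "seaborn"])
for name, ver in report.items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as NOT INSTALLED can be added with the `pip install` commands above.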

A Simple Example: Getting Started with Pandas.

import pandas as pd

data = {
    "Product": ["Book", "Pen", "Notebook"],
    "Price": [10.99, 1.50, 5.49],
    "Quantity": [3, 10, 5]
}
df = pd.DataFrame(data)
print(df)

Calculating Total Revenue

df["Revenue"] = df["Price"] * df["Quantity"]
print(df)

The output is as follows:

    Product  Price  Quantity  Revenue
0      Book  10.99         3    32.97
1       Pen   1.50        10    15.00
2  Notebook   5.49         5    27.45

Do you realize what you have just done? You’ve just analyzed a small dataset using Python—congratulations!
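The Revenue column holds revenue per product. As a small next step, this sketch rebuilds the same table, adds the rows up to get the overall total, and picks the best-selling product by revenue. The variable names `total_revenue` and `best` are just illustrative.

```python
import pandas as pd

# Same product table as in the example above.
df = pd.DataFrame({
    "Product": ["Book", "Pen", "Notebook"],
    "Price": [10.99, 1.50, 5.49],
    "Quantity": [3, 10, 5],
})
df["Revenue"] = df["Price"] * df["Quantity"]

# Sum the per-product revenue to get the overall total.
total_revenue = df["Revenue"].sum()

# Select the row with the highest revenue.
best = df.loc[df["Revenue"].idxmax()]

print(f"Total revenue: {total_revenue:.2f}")
print(f"Top product: {best['Product']} ({best['Revenue']:.2f})")
```

This prints a total of 75.42, with Book as the top product at 32.97.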


Key Takeaways from Day 1 Study:

  • Data Science is about extracting insights from data to solve real-world problems.

  • Big Data is defined by the 5 Vs: Volume, Velocity, Variety, Veracity, and Value.

  • Setting up your environment with Python, Pandas, and Jupyter Notebook prepares you for hands-on work.

Well, the concepts above are enough to kickstart a journey in this exciting field. We will cover more exciting concepts in later stages. So keep yourself strapped in for this epic learning journey. Ciao!!