What Does Analyzing Data Mean in Pandas?
Analyzing data in Pandas means exploring, inspecting, and understanding a dataset before deeper processing or modeling. This step helps you answer questions like:
- What does the data contain?
- Are there missing or duplicate values?
- What are the data types?
- What patterns or distributions exist?
Pandas provides simple yet powerful functions to perform Exploratory Data Analysis (EDA).
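As a rough illustration of how those questions map onto pandas calls, here is a minimal sketch on a small in-memory DataFrame. The values are made up for the example and do not come from data.csv; the column names Age, Salary, and City are borrowed from the later sections, and one missing value and one repeated row are included on purpose so the checks have something to find.

```python
import pandas as pd

# Hypothetical sample data; one missing Age and one repeated row
df = pd.DataFrame(
    {
        "Age": [25, 32, None, 41, 32],
        "Salary": [50000, 64000, 58000, 72000, 64000],
        "City": ["Pune", "Delhi", "Pune", "Mumbai", "Delhi"],
    }
)

# What does the data contain?
shape = df.shape                              # (rows, columns)

# Are there missing or duplicate values?
missing_per_column = df.isnull().sum()        # missing count per column
duplicate_rows = int(df.duplicated().sum())   # rows identical to an earlier row

# What are the data types?
dtypes = df.dtypes

# What patterns or distributions exist?
city_counts = df["City"].value_counts()       # frequency of each city

print(shape)
print(missing_per_column)
print(duplicate_rows)
print(dtypes)
print(city_counts)
```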
Load Sample Data
import pandas as pd
df = pd.read_csv("data.csv")
1. View the Data
First and Last Rows
print(df.head())
# Output:
# Shows first 5 rows
print(df.tail())
# Output:
# Shows last 5 rows
Random Sample
print(df.sample(3))
# Output:
# Displays 3 random rows
2. Understand Dataset Structure
Dataset Information
# info() prints its report directly and returns None, so no print() is needed
df.info()
# Output:
# Column names, non-null count, data types
This helps identify:
- Number of rows and columns
- Data types
- Missing values
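Note that info() writes its report to standard output and returns None. If you want the report as a string (for a log file, say), one option is to pass a buffer through the buf parameter. A minimal sketch, using a made-up three-row frame rather than data.csv:

```python
import io

import pandas as pd

# Hypothetical frame for illustration; one missing Age value
df = pd.DataFrame({"Age": [25.0, None, 41.0], "City": ["Pune", "Delhi", "Pune"]})

buffer = io.StringIO()
result = df.info(buf=buffer)   # report goes into the buffer instead of stdout
report = buffer.getvalue()

print(result)   # info() has no return value
print(report)   # the text info() would normally print

# The same facts are also available as regular objects
non_null_counts = df.notna().sum()   # non-null values per column
column_types = df.dtypes             # dtype of each column
```

The last two lines are often more convenient than parsing the report text when you need the counts or types in later code.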
Shape of Dataset
print(df.shape)
# Output:
# (rows, columns)
Column Names
print(df.columns)
# Output:
# Index of column names
3. Summary Statistics
Describe Numerical Data
print(df.describe())
# Output:
# count, mean, std, min, 25%, 50%, 75%, max
Describe All Columns
print(df.describe(include="all"))
# Output:
# Summary for numeric and non-numeric columns
4. Analyze Individual Columns
print(df["Age"].mean())
print(df["Salary"].max())
# Output:
# Mean and max values
Value Counts
print(df["City"].value_counts())
# Output:
# Frequency of each category
5. Check Missing Values
print(df.isnull())
# Output:
# Boolean DataFrame
print(df.isnull().sum())
# Output:
# Count of missing values per column
6. Handle Duplicates
print(df.duplicated())
# Output:
# Boolean series
print(df.duplicated().sum())
# Output:
# Number of duplicate rows
7. Correlation Analysis
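Correlation measures how strongly pairs of numerical columns move together. By default, corr() computes the Pearson coefficient, which captures linear relationships.
# numeric_only=True (pandas 1.5+) skips text columns such as City;
# without it, pandas 2.0+ raises an error on non-numeric columns
print(df.corr(numeric_only=True))
# Output:
# Correlation matrix of the numeric columns
Values range from:
- +1 → perfect positive correlation
- 0 → no linear correlation
- -1 → perfect negative correlation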
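To make the two endpoints concrete, here is a small sketch with made-up columns: y rises in step with x and z falls in step with it, so the coefficients should come out at (or within rounding error of) +1 and -1. The City column is there only to show a non-numeric column being left out via numeric_only=True, a parameter available in pandas 1.5 and later.

```python
import pandas as pd

# Hypothetical data chosen so the relationships are exact
df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 5],
        "y": [2, 4, 6, 8, 10],              # y = 2 * x, moves with x
        "z": [10, 8, 6, 4, 2],              # z = 12 - 2 * x, moves against x
        "City": ["A", "B", "A", "B", "A"],  # non-numeric, excluded below
    }
)

corr = df.corr(numeric_only=True)   # Pearson correlation by default
print(corr.round(2))

# Pull single coefficients out of the matrix
xy = corr.loc["x", "y"]
xz = corr.loc["x", "z"]
```

Real data rarely hits the endpoints; values in between indicate weaker linear relationships.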
8. Sorting Data
print(df.sort_values(by="Age"))
# Output:
# Sorted by Age
print(df.sort_values(by="Salary", ascending=False))
# Output:
# Sorted descending
9. Filtering Data
print(df[df["Age"] > 30])
# Output:
# Rows where Age > 30
10. Detect Outliers
print(df["Salary"].describe())
# Output:
# count, mean, std, min, quartiles, max; a min or max far from the quartiles suggests extreme values
Key Points to Remember
- Always analyze data before cleaning or modeling
- Use head(), info(), and describe() first
- Check missing values and duplicates early
- Understand column distributions and relationships
- Pandas makes EDA fast and simple
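For the outlier step, reading describe() by eye works on small tables but does not scale. One common rule of thumb, which this tutorial does not cover above, is to flag values more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch with made-up salaries; the helper name iqr_outliers and the factor k=1.5 are illustrative choices, not pandas features:

```python
import pandas as pd


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Hypothetical salaries with one obviously extreme value
salaries = pd.Series([30000, 31000, 32000, 33000, 35000, 200000], name="Salary")
outliers = iqr_outliers(salaries)
print(outliers)
```

Flagged values are candidates for a closer look, not automatically errors; whether to keep, cap, or drop them depends on the dataset.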