Q2: Data Wrangling II

Create an "Academic performance" dataset of students and perform the following operations using Python.

1. Scan all variables for missing values and inconsistencies. If there are missing values and/or inconsistencies, use any of the suitable techniques to deal with them.
2. Scan all numeric variables for outliers. If there are outliers, use any of the suitable techniques to deal with them.
3. Apply data transformations on at least one of the variables. The purpose of this transformation should be one of the following reasons: to change the scale for better understanding of the variable, to convert a non-linear relation into a linear one, or to decrease the skewness and convert the distribution into a normal distribution.

Reason and document your approach properly.

Solution and implementation for Q2 from Data Science Laboratory (ds).

2_data_wrangling_2.ipynb

Code Cell [In]

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import zscore
from sklearn.preprocessing import MinMaxScaler

Code Cell [In]

# Load dataset
df = pd.read_csv("data_2.csv")

print(df)
print(df.isnull().sum())
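The null count above covers missing values, but the task also asks for a scan for inconsistencies. The following is a minimal sketch of one way to do that: flag every non-missing value that falls outside a plausible range. The ranges in `VALID_RANGES` and the helper `find_inconsistencies` are assumptions made for illustration (the assignment does not specify them), and the rows are copied inline from the CSV listing on this page so the block runs on its own.

```python
import io
import pandas as pd

# Rows copied from the CSV listing on this page, inlined so the sketch is self-contained.
CSV_TEXT = """Name,Age,GPA,Test_Score,Attendance
Amit,18.0,8.1,78,85.0
Neha,19.0,7.8,82,90.0
Rahul,,9.0,91,88.0
Priya,20.0,8.5,85,
Karan,21.0,7.2,76,76.0
Sneha,19.0,8.9,88,92.0
Rohit,18.0,3.5,620,87.0
Pooja,20.0,8.0,80,89.0
Arjun,50.0,9.5,92,900.0
Kavya,19.0,8.3,79,84.0
"""
df = pd.read_csv(io.StringIO(CSV_TEXT))

# ASSUMPTION: plausible value ranges; these are not given in the assignment.
VALID_RANGES = {
    "Age": (15, 30),
    "GPA": (0, 10),
    "Test_Score": (0, 100),
    "Attendance": (0, 100),
}

def find_inconsistencies(frame, ranges):
    """Return a boolean frame: True where a non-missing value is out of range."""
    flags = pd.DataFrame(False, index=frame.index, columns=list(ranges))
    for col, (low, high) in ranges.items():
        # between() is inclusive; missing values are left for the null check.
        flags[col] = frame[col].notna() & ~frame[col].between(low, high)
    return flags

flags = find_inconsistencies(df, VALID_RANGES)

# Rows with at least one out-of-range value, then a per-column count.
print(df.loc[flags.any(axis=1)])
print(flags.sum())
```

With these assumed ranges, a value such as a test score of 620 or an attendance of 900 is reported as an inconsistency before any statistical outlier test is applied.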

Code Cell [In]

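# Fill missing values with the column median.
# The median is used for both columns because the outliers (e.g. Age 50,
# Attendance 900) are still present at this point and would distort the mean.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Attendance'] = df['Attendance'].fillna(df['Attendance'].median())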

print(df)

Code Cell [In]

# Statistical summary
print(df.describe())

Code Cell [In]

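# Boxplots for outliers: one panel per column, because the columns have
# very different ranges and a shared axis would hide the smaller ones
cols = ['Age','GPA','Test_Score','Attendance']
df[cols].plot(kind='box', subplots=True, layout=(1, 4), figsize=(12, 4))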

plt.show()

Code Cell [In]

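# Remove outliers: keep only rows where every numeric column has |z-score| < 2.5
z = np.abs(zscore(df[['Age','GPA','Test_Score','Attendance']]))

# .copy() makes the filtered frame independent, so the column assignment
# in the next cell does not raise pandas' SettingWithCopyWarning
df = df[(z < 2.5).all(axis=1)].copy()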

print(df)
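On a ten-row sample, the mean and standard deviation behind the z-score are themselves pulled toward the extreme values, so the result is sensitive to the chosen threshold. As a hedged alternative (not the method used in the notebook), the sketch below applies the interquartile-range rule, which relies on quartiles only. The helper `iqr_mask` and the multiplier `k=1.5` are illustrative choices; the values are the GPA and Test_Score columns from the CSV listing on this page.

```python
import pandas as pd

# GPA and Test_Score columns copied from the CSV listing on this page.
scores = pd.DataFrame({
    "Name": ["Amit", "Neha", "Rahul", "Priya", "Karan",
             "Sneha", "Rohit", "Pooja", "Arjun", "Kavya"],
    "GPA": [8.1, 7.8, 9.0, 8.5, 7.2, 8.9, 3.5, 8.0, 9.5, 8.3],
    "Test_Score": [78, 82, 91, 85, 76, 88, 620, 80, 92, 79],
})

def iqr_mask(frame, cols, k=1.5):
    """True for rows whose values all lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[cols].quantile(0.25)
    q3 = frame[cols].quantile(0.75)
    iqr = q3 - q1
    # Comparing a DataFrame with a Series aligns the Series on column names.
    within = (frame[cols] >= q1 - k * iqr) & (frame[cols] <= q3 + k * iqr)
    return within.all(axis=1)

mask = iqr_mask(scores, ["GPA", "Test_Score"])
cleaned = scores[mask].copy()

# Names of the rows the IQR rule would drop.
print(scores.loc[~mask, "Name"].tolist())
```

Because quartiles barely move when a single extreme value is added, the fences stay close to the bulk of the data. Whether to drop, cap, or correct the flagged rows remains a judgment call that should be documented alongside the code.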

Code Cell [In]

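# Normalize data: min-max scaling maps each numeric column to [0, 1],
# so that variables with very different ranges can be compared on one scale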
scaler = MinMaxScaler()

df[['Age','GPA','Test_Score','Attendance']] = scaler.fit_transform(
    df[['Age','GPA','Test_Score','Attendance']]
)

print(df)

Code Cell [In]

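# Check skewness (min-max scaling is a linear rescaling, so it changes
# the scale of each column but not its skewness)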
print(df[['Age','GPA','Test_Score','Attendance']].skew())
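The task lists decreasing skewness as one possible purpose of the transformation. Min-max scaling changes the scale only; a non-linear transform such as a logarithm is one common way to reduce right skew. The sketch below illustrates the difference on a hypothetical `Study_Hours` column, which is not part of the dataset on this page and is assumed only for demonstration.

```python
import numpy as np
import pandas as pd

# HYPOTHETICAL right-skewed values, used only to illustrate the idea;
# this column does not exist in the dataset on this page.
hours = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 15, 40], name="Study_Hours")

# Min-max scaling: a linear map onto [0, 1].
scaled = (hours - hours.min()) / (hours.max() - hours.min())

# log1p computes log(1 + x), which compresses large values more than small ones.
logged = np.log1p(hours)

print("original skew:", hours.skew())
print("min-max skew: ", scaled.skew())
print("log1p skew:   ", logged.skew())
```

The first two skewness values agree up to floating-point error, while the log-transformed series is less skewed. A log transform assumes non-negative values, and it is best applied after data-entry errors have been handled, since it would otherwise only disguise them.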

Data_2.csv

Name,Age,GPA,Test_Score,Attendance
Amit,18.0,8.1,78,85.0
Neha,19.0,7.8,82,90.0
Rahul,,9.0,91,88.0
Priya,20.0,8.5,85,
Karan,21.0,7.2,76,76.0
Sneha,19.0,8.9,88,92.0
Rohit,18.0,3.5,620,87.0
Pooja,20.0,8.0,80,89.0
Arjun,50.0,9.5,92,900.0
Kavya,19.0,8.3,79,84.0
