Group A

Q1: Data Wrangling I

Perform the following operations using Python on any open source dataset (e.g., data.csv):

1. Import all the required Python libraries.
2. Locate open source data from the web (e.g., https://www.kaggle.com). Provide a clear description of the data and its source (i.e., the URL of the website).
3. Load the dataset into a pandas DataFrame.
4. Data preprocessing: check for missing values in the data using pandas isnull(), and use the describe() function to get some initial statistics. Provide variable descriptions, types of variables, etc. Check the dimensions of the DataFrame.
5. Data formatting and data normalization: summarize the types of variables by checking the data types (i.e., character, numeric, integer, factor, and logical) of the variables in the dataset. If variables are not in the correct data type, apply proper type conversions.
6. Turn categorical variables into quantitative variables in Python.

In addition to the code and outputs, explain every operation that you perform in the above steps and explain everything that you do to import/read/scrape the dataset.

Data Wrangling I

Solution and implementation for Q1 from Data Science Laboratory (ds).

1_data_wrangling_1.py
import pandas as pd
import numpy as np
import os
import urllib.request
from sklearn.preprocessing import MinMaxScaler

file_name = "titanic.csv"
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"

if os.path.exists(file_name):
    print("Dataset already Downloaded.")
else:
    urllib.request.urlretrieve(url, file_name)

# Load Dataset
df = pd.read_csv(file_name)

print("\nFirst 5 rows of dataset:")
print(df.head())

# Data Preprocessing

# Check missing values
print("\nMissing Values:")
print(df.isnull().sum())

# Dataset statistics
print("\nStatistical Summary:")
print(df.describe())

# Dataset dimensions
print("\nDataset Dimensions (rows, columns):")
print(df.shape)

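# Dataset information
# df.info() prints its report directly and returns None,
# so it is called on its own rather than wrapped in print().
print("\nDataset Information:")
df.info()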

# Data Formatting and Data Normalization

# Check data types
print("\nData Types Before Conversion:")
print(df.dtypes)

# Convert categorical variables
df['Sex'] = df['Sex'].astype('category')
df['Embarked'] = df['Embarked'].astype('category')

print("\nData Types After Conversion:")
print(df.dtypes)

# Handle missing values
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Normalize numerical columns
scaler = MinMaxScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])

print("\nNormalized Values (Age and Fare):")
print(df[['Age','Fare']].head())

# 6. Convert categorical variables to numerical

# Convert Sex to numeric codes.
# Categories are ordered alphabetically, so female -> 0 and male -> 1.
df['Sex'] = df['Sex'].cat.codes

# One-hot encoding for Embarked
# dtype=int yields 0/1 columns; recent pandas versions default to True/False.
df = pd.get_dummies(df, columns=['Embarked'], dtype=int)

print("\nDataset After Converting Categorical Variables:")
print(df.head())

print("\nFinal Dataset Dimensions:")
print(df.shape)
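
The imputation, normalization, and encoding steps in the script can be checked by hand on a small frame. The sketch below is illustrative only: the four-row `toy` DataFrame is made-up data, not part of the Titanic dataset, and the min-max scaling is written out with plain pandas arithmetic as (x - min) / (max - min), which is the same formula MinMaxScaler applies with its default range of 0 to 1.

```python
import pandas as pd

# Hypothetical toy data standing in for the Titanic columns used above.
toy = pd.DataFrame({
    "Age": [22.0, None, 38.0, 54.0],
    "Fare": [7.25, 71.28, 8.05, 51.86],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "S"],
})

# Fill gaps: mean for the numeric column, most frequent value for the categorical one.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Min-max scaling by hand: (x - min) / (max - min) maps each column onto [0, 1].
num_cols = ["Age", "Fare"]
col_min = toy[num_cols].min()
col_max = toy[num_cols].max()
toy[num_cols] = (toy[num_cols] - col_min) / (col_max - col_min)

# Label-encode Sex (categories sort alphabetically: female -> 0, male -> 1).
toy["Sex"] = toy["Sex"].astype("category").cat.codes

# One-hot encode Embarked into 0/1 indicator columns.
toy = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

print(toy)
```

Here the missing Age is replaced by the mean of the other three values (38), so the scaled Age column comes out as 0, 0.5, 0.5, 1, which makes it easy to confirm that the smallest value maps to 0 and the largest to 1.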
