Asking Basic Questions While Analyzing Data for ML Model Preparation
Here is a list of seven basic questions you should ask while performing data analysis.
Here we are using the Titanic dataset as an example:
First we import pandas with import pandas as pd, then we use the pd.read_csv() function, passing it the path to the CSV file, to load the data into a Jupyter notebook.
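As a minimal sketch of the loading step, the snippet below reads a few made-up rows instead of the real file so that it runs anywhere. The column names and values are assumptions for illustration only; with the actual dataset you would pass the path of your CSV file to pd.read_csv.

```python
import io

import pandas as pd

# Hypothetical stand-in for the Titanic CSV file: three made-up rows
# kept in memory so the example does not depend on a file on disk.
csv_text = """survived,pclass,sex,age
1,1,female,29.0
0,3,male,
1,2,female,14.0
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(df)
```

The empty field in the second row is read as a missing value (NaN), which is what the later missing-value check picks up.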
1. How big is the data?
To check how big the data is, we can use the df.shape attribute, which returns the number of rows and columns as a (rows, columns) tuple.
For this dataset, there are 1310 rows and 14 columns.
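A small sketch of the size check, using a made-up four-row DataFrame (the columns and values are assumptions, not the real Titanic data):

```python
import pandas as pd

# Hypothetical miniature dataset standing in for the Titanic data.
df = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "pclass": [1, 3, 2, 3],
    "age": [29.0, 22.0, 14.0, 35.0],
})

# shape is an attribute (no parentheses) holding (number_of_rows, number_of_columns).
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```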
2. What does the data look like?
To preview the data, we can use a few pandas functions:
df.head() : This shows the first 5 rows of the dataset by default.
df.tail() : This shows the last 5 rows of the dataset by default.
df.sample(n) : This returns n randomly selected rows from the dataset.
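The three preview functions above can be tried on a made-up ten-row DataFrame like this (the column names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical ten-row dataset so that head/tail/sample have something to slice.
df = pd.DataFrame({
    "passenger": range(1, 11),
    "pclass": [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
})

first_rows = df.head()     # first 5 rows by default
last_rows = df.tail(3)     # last 3 rows; pass a number to change the default of 5
# 4 random rows; random_state is optional and only makes the draw repeatable.
random_rows = df.sample(n=4, random_state=0)

print(first_rows)
print(last_rows)
print(random_rows)
```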
3. What are the data types of the columns?
To see the data type of each column in the dataset, we can use df.info(), which lists the data type of every column along with its non-null count.
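Here is a rough sketch of the type check on a made-up DataFrame with one integer, one float and one text column (all names and values are assumptions). df.dtypes is shown alongside df.info() because it returns the types as a Series that can be used in code:

```python
import pandas as pd

# Hypothetical miniature dataset with mixed column types.
df = pd.DataFrame({
    "survived": [1, 0, 1],
    "age": [29.0, None, 14.0],
    "sex": ["female", "male", "female"],
})

# info() prints the dtype, non-null count and memory usage of every column.
df.info()

# dtypes returns the same type information as a Series (column name -> dtype).
dtypes = df.dtypes
print(dtypes)
```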
4. Are there any missing values?
There are many ways to find missing values in a dataset, but here we simply use the df.isnull() method. Chaining .sum() onto df.isnull() gives the exact number of missing values in each column.
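A minimal sketch of the missing-value count, using a made-up DataFrame with a few gaps (column names and values are assumptions):

```python
import pandas as pd

# Hypothetical miniature dataset with some missing entries.
df = pd.DataFrame({
    "survived": [1, 0, 1],
    "age": [29.0, None, 14.0],
    "cabin": ["B5", None, None],
})

# isnull() marks each missing cell as True; sum() then counts the True values per column.
missing_per_column = df.isnull().sum()
# Summing once more gives the total number of missing cells in the whole DataFrame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print(total_missing)
```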
5. How does the data look mathematically?
To get a high-level statistical summary of the data, we can use the df.describe() method, which reports the count, mean, standard deviation, minimum, quartiles and maximum of each numeric column.
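As a small illustration, here is describe() on a made-up DataFrame (names and values are assumptions). Note that, by default, a text column such as the hypothetical sex column is left out of the summary when numeric columns are present:

```python
import pandas as pd

# Hypothetical miniature dataset: two numeric columns and one text column.
df = pd.DataFrame({
    "age": [20.0, 30.0, 40.0],
    "fare": [10.0, 20.0, 60.0],
    "sex": ["female", "male", "female"],
})

# describe() summarises only the numeric columns by default.
summary = df.describe()
print(summary)

# Individual statistics can be picked out by row label and column name.
mean_age = summary.loc["mean", "age"]
```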
6. Are there duplicate values?
To find out whether there are any duplicated rows in the dataset, we can use the df.duplicated() method, which returns a Series of boolean values, one for each row. Chaining .sum() onto it, as in df.duplicated().sum(), gives the exact number of duplicated rows.
If any duplicate rows are found, we can use the pandas method df.drop_duplicates() to remove them from the dataset.
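The duplicate check and removal can be sketched as follows, on a made-up DataFrame in which one row is repeated (names and values are assumptions):

```python
import pandas as pd

# Hypothetical miniature dataset: the third row repeats the first one.
df = pd.DataFrame({
    "name": ["Allen", "Brown", "Allen", "Clark"],
    "age": [29, 40, 29, 35],
})

# duplicated() flags a row as True when an identical row appeared earlier.
duplicate_mask = df.duplicated()
n_duplicates = int(duplicate_mask.sum())

# drop_duplicates() returns a new DataFrame; df itself is left unchanged.
deduplicated = df.drop_duplicates()

print(n_duplicates)
print(deduplicated)
```

Since drop_duplicates() returns a new DataFrame by default, the result has to be assigned (for example back to df) for the removal to take effect.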
7. How are the columns correlated?
Correlation between columns can help us find patterns, fine-tune the data and guide feature selection, that is, removing irrelevant columns that are not helpful for our model.
We can use df.corr() to compute the correlation between the columns of the dataset, and df.corr()["column_name"] to look at the correlations of one specific column.
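Below is a rough sketch of the correlation step on a made-up DataFrame (names and values are assumptions). One detail worth knowing: in recent pandas versions, calling df.corr() on a DataFrame that still contains text columns raises an error, so this sketch first keeps only the numeric columns with select_dtypes. That selection step is our addition, not part of the original walkthrough.

```python
import pandas as pd

# Hypothetical miniature dataset with one text column mixed in.
df = pd.DataFrame({
    "survived": [0, 1, 1, 0],
    "pclass": [3, 1, 3, 1],
    "fare": [7.25, 71.28, 8.05, 53.10],
    "sex": ["male", "female", "female", "male"],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
numeric_df = df.select_dtypes(include="number")
corr = numeric_df.corr()

# Correlation of every numeric column with one specific column.
fare_corr = corr["fare"]

print(corr)
print(fare_corr)
```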
Conclusion : There are numerous ways to analyze data, and everyone has their own style of doing analysis. This is just a basic set of checks that we can perform to get a quick glance at the data.
Feel free to comment with your suggestions or corrections. Till then,
Keep coding :-)