Python - Exploratory Data Analysis CheatSheet

Reading a CSV file

Use header=None when the columns are not labeled in your csv file

df = pd.read_csv("pathToFile.csv", header=None)
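As a small illustration of what header=None changes, the sketch below reads a made-up, in-memory CSV instead of a file on disk. The sample data and the column names passed to names= are assumptions for the example only.

```python
import io

import pandas as pd

# Hypothetical CSV content with no header row.
raw = "1,alice,3.5\n2,bob,4.0\n"

# header=None tells pandas the first line is data, not column labels.
df = pd.read_csv(io.StringIO(raw), header=None)
print(df.columns.tolist())  # integer labels 0, 1, 2

# Optionally supply your own labels (these names are invented).
named = pd.read_csv(io.StringIO(raw), header=None, names=["id", "name", "score"])
print(named.columns.tolist())
```

Without header=None, the first data row would be consumed as column labels and the frame would have one row fewer.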

Reading an Excel(.xlsx) file

Use header=None when the columns are not labeled in your xlsx file

df = pd.read_excel("pathToFile.xlsx", header=None)

Show first 5 rows of a DataFrame

df.head()

Show last 5 rows of a DataFrame

df.tail()

Show shape of the dataframe

df.shape

Show all column names in the DataFrame

df.columns

Count occurrences of each unique value in a column

df['column_name'].value_counts()

Show mean, std dev, max etc for each column

df.describe()

Show datatypes for all columns

df.info()

Count the null/NaN values in each column

df.isnull().sum()
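A minimal sketch of how the count works, using a made-up frame: isnull() returns booleans, so sum() counts the True values and mean() gives the fraction missing. The column names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a few gaps.
df = pd.DataFrame({
    "age": [25, np.nan, 31, np.nan],
    "city": ["bangalore", "lucknow", None, "bangalore"],
})

missing_counts = df.isnull().sum()   # number of missing values per column
missing_share = df.isnull().mean()   # fraction of missing values per column

print(missing_counts)
print(missing_share)
```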

Heat map showing which columns and rows have null/NaN values

NOTE: import seaborn as sns

sns.heatmap(df.isnull())

Drop multiple columns at once

axis=1 is for columns

df.drop(['column_1','column_2'],axis=1,inplace=True)

Fill NaN values with mean value of a column

df['column_name']=df['column_name'].fillna(df['column_name'].mean())
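For a quick check of the behaviour, here is a sketch on an invented column. Series.mean() skips NaN by default, so the fill value is the mean of the observed values only.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing value.
df = pd.DataFrame({"score": [10.0, np.nan, 20.0]})

# The mean ignores the NaN, so it is (10 + 20) / 2.
fill_value = df["score"].mean()
df["score"] = df["score"].fillna(fill_value)

print(df)
```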

Get numerical values for categorical data

df['column_name'] = pd.factorize(df['column_name'])[0]

Get all unique values in categorical data

unique = pd.factorize(df['column_name'])[1]
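The two factorize snippets above use the two halves of one return value. A small sketch with made-up colour labels shows how they relate: element 0 holds the integer codes, element 1 holds the unique values in order of first appearance.

```python
import pandas as pd

# Hypothetical categorical column.
colors = pd.Series(["red", "blue", "red", "green"])

codes, uniques = pd.factorize(colors)
print(codes)    # integer code per row
print(uniques)  # unique labels, indexed by code

# Indexing the uniques with the codes recovers the original labels
# (valid here because this example contains no missing values).
recovered = uniques[codes]
```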

Get unique values in any column

df['column_name'].unique()

Convert column to float data type

df['column_name'] = df['column_name'].astype("float")

Make existing column the index

df = df.set_index('column_name')
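A short sketch of what set_index does when given a column name, on an invented frame: the column moves out of the data and becomes the row labels, and reset_index() undoes it. The column names and values are assumptions.

```python
import pandas as pd

# Hypothetical frame.
df = pd.DataFrame({"city": ["bangalore", "lucknow"], "population": [1, 2]})

# Passing the column name moves that column into the index.
indexed = df.set_index("city")
print(indexed.loc["lucknow", "population"])

# reset_index() turns the index back into an ordinary column.
restored = indexed.reset_index()
```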

Get subset of df where column value is equal to some value

df_bangalore = df[df['city']=='bangalore']
df_lucknow = df[df['city']=='lucknow']

Show the index (row labels) of the dataframe

df.index

Convert dataframe to numpy array

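NOTE: Column names are dropped. Columns of mixed types (e.g. numbers and strings) produce an array of dtype object, so convert or drop non-numeric columns first if you need a numeric array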

df.to_numpy()
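To see how column types affect the result, the sketch below converts two made-up frames: one all numeric, one mixing integers and strings.

```python
import numpy as np
import pandas as pd

# Hypothetical all-numeric frame: int and float columns share a float64 array.
numeric = pd.DataFrame({"a": [1, 2], "b": [0.5, 1.5]})
arr = numeric.to_numpy()
print(arr.dtype, arr.shape)

# Hypothetical mixed frame: numbers and strings fall back to dtype object.
mixed = pd.DataFrame({"a": [1, 2], "c": ["x", "y"]})
mixed_arr = mixed.to_numpy()
print(mixed_arr.dtype)
```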

Sort values by a column

df.sort_values(by='colName')

Copy a whole dataframe

df.copy()

Drop the rows which have NaN values

df.dropna()

Replace NaN values with a specified value

df.fillna(value=10)

Return a dataframe of boolean values marking where the NaN values are

pd.isna(df)

Calculate the mean of each numeric column

df.mean(numeric_only=True)

Calculate the mean of each row

df.mean(axis=1)
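The difference between the two calls is only the axis. A sketch on an invented two-column frame:

```python
import pandas as pd

# Hypothetical numeric frame.
df = pd.DataFrame({"a": [1.0, 3.0], "b": [2.0, 6.0]})

col_means = df.mean()        # axis=0 (default): one value per column
row_means = df.mean(axis=1)  # axis=1: one value per row

print(col_means)
print(row_means)
```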

Concatenate dataframes

pd.concat([df[:2],df[3:6]])

Merge two dataframes on a shared key column

pd.merge(df1,df2,on='indexColName')
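pd.merge performs an inner join by default, so keys found in only one frame are dropped. The sketch below uses two invented frames sharing an "id" column and contrasts the default with how="left".

```python
import pandas as pd

# Hypothetical frames that share the key column "id".
df1 = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
df2 = pd.DataFrame({"id": [2, 3, 4], "score": [10, 20, 30]})

# Default inner join: only ids present in both frames survive.
inner = pd.merge(df1, df2, on="id")

# Left join: every row of df1 is kept; unmatched scores become NaN.
left = pd.merge(df1, df2, on="id", how="left")

print(inner)
print(left)
```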

Groupby column and sum

df.groupby('colName').sum()

Subtract a specific column from all columns

df.subtract(df['col'],axis=0)
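Here axis=0 aligns the Series with the row index, so each row has its own 'col' value subtracted from every column, including 'col' itself. A sketch on an invented frame:

```python
import pandas as pd

# Hypothetical frame; "col" is the column being subtracted.
df = pd.DataFrame({"col": [1, 2], "x": [10, 20], "y": [5, 5]})

# axis=0 matches the Series to rows, then subtracts it from each column.
result = df.subtract(df["col"], axis=0)
print(result)
```

The 'col' column becomes all zeros in the result; drop it afterwards if it is not needed.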

Save a dataframe to csv file

df.to_csv('filename.csv')

Save a dataframe to excel sheet

df.to_excel('filename.xlsx',sheet_name='Sheet1')

Label Encoding

Encodes categorical values as integers in a single column

from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
integer_encoded = label_encoder.fit_transform(df['column_name'])

One hot encoding / Get dummies

df_processed = pd.get_dummies(df, prefix_sep="__",columns=["column_1", "column_2"])
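To show what the resulting columns look like, the sketch below encodes one made-up categorical column. New column names are built as original name, then prefix_sep, then category value. The indicator dtype differs between pandas versions (boolean in recent releases, integer in older ones), so the example casts to int before printing.

```python
import pandas as pd

# Hypothetical frame with one categorical column.
df = pd.DataFrame({"column_1": ["a", "b"], "value": [1, 2]})

df_processed = pd.get_dummies(df, prefix_sep="__", columns=["column_1"])

# Untouched columns come first, followed by one indicator per category.
print(df_processed.columns.tolist())
print(df_processed["column_1__a"].astype(int).tolist())
```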

Standard Scaling

NOTE: Call fit_transform only on the training dataset, and use transform alone for test and post-deployment data, so that the scaling parameters are learned from the training data only

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
train_data = scaler.fit_transform(train_data)
test_data = scaler.transform(test_data)
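A tiny worked sketch of why the note matters, with made-up single-feature arrays: the scaler learns the mean and standard deviation from the training data, and the test data is shifted and scaled by those same numbers rather than by its own statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature datasets.
train_data = np.array([[1.0], [3.0]])  # mean 2.0, population std 1.0
test_data = np.array([[2.0], [4.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_data)  # learns mean/std, then scales
test_scaled = scaler.transform(test_data)        # reuses the training mean/std

print(scaler.mean_, scaler.scale_)
print(train_scaled.ravel(), test_scaled.ravel())
```

The scaled test set is not centred on zero here, which is expected: it is expressed in the training set's units.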
