Knowledge Hub: April 2020

Most machine learning work follows the same path. This post lays out a step-by-step structure, with commands, that you can apply to any dataset you need to work with.

Common Steps

Step	Action to perform	Command
Step 1	Import Libraries	import numpy as np
		import pandas as pd
		from matplotlib import pyplot as plt
		import seaborn as sns

Step 2	Either import the built-in datasets or create a dataframe
	to import the built-in datasets	from sklearn import datasets
	to create a dataframe from a file	df=pd.read_csv('path or the file name')

Step 3	once the built-in datasets are imported, load a particular dataset	<dataset name>=datasets.load_<dataset name>() for example diabetes=datasets.load_diabetes()

Step 4	to see the description of the dataset	print(<dataset name>.DESCR) for example print(diabetes.DESCR)

Step 5	To see the first rows (head) of the dataframe	<dataframe name>.head() for example recipes.head()

Step 6	Get the dimensions of the dataframe	<dataframe name>.shape for example recipes.shape
Step 7	to get info regarding our dataset	df.info() where df is the dataframe name
Step 8	to find null values in the dataframe	df.isnull() where df is the dataframe name
Step 9	to find the correlation	df.corr()
Step 10	Heat map of corr	sns.heatmap(df.corr())
Step 11	to create heatmap of null values	sns.heatmap(df.isnull())
Step 12	to count the null values in a column	<dataframe name>['<column name>'].isnull().value_counts() for example df['Cabin'].isnull().value_counts()
Step 13	to change the values of a column from text to numbers	<mapping>={"<value 1>":0,"<value 2>":1}; df['<column>']=df['<column>'].map(<mapping>) for example gend={"Male":0,"Female":1}; df['Gender']=df['Gender'].map(gend)
Step 14	to drop any column from your dataset	df.drop("Address",axis=1,inplace=True) where df is the dataframe name
Step 15	to find the columns	df.columns
Step 16	To scale the data (if needed)	from sklearn.preprocessing import StandardScaler ###import the scaler class first###
		scaler=StandardScaler() ###scaler can be any variable name###
		scaler.fit(df.drop('Purchased',axis=1)) ###fit the scaler on every column except the target column Purchased###
		scale_arr=scaler.transform(df.drop('Purchased',axis=1)) ###transform the same columns of the original dataframe into a new scaled array###
		<new dataframe>=pd.DataFrame(scale_arr,columns=['Column 1','Column 2']) for example new_df=pd.DataFrame(scale_arr,columns=['Age','EstimatedSalary'])
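
The cleaning and scaling steps above (Steps 12 to 16) can be tried end to end on a small frame. This is only a minimal sketch: the four rows of data are made up for illustration, and the column names are borrowed from the examples in the table, not from a real file.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the values are invented for this sketch.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": [19, 35, 26, 27],
    "EstimatedSalary": [19000, 20000, 43000, 57000],
    "Address": ["a", "b", "c", "d"],
    "Purchased": [0, 0, 1, 1],
})

# Step 12: count null and non-null values in one column.
null_counts = df["Age"].isnull().value_counts()

# Step 13: map text categories to numbers.
gend = {"Male": 0, "Female": 1}
df["Gender"] = df["Gender"].map(gend)

# Step 14: drop a column that is not needed.
df = df.drop("Address", axis=1)

# Step 16: scale every feature column, leaving the target column out.
features = df.drop("Purchased", axis=1)
scaler = StandardScaler()
scale_arr = scaler.fit_transform(features)  # fit and transform in one call
new_df = pd.DataFrame(scale_arr, columns=features.columns)

print(null_counts)
print(new_df)
```

Here fit_transform replaces the separate fit and transform calls from the table; the result is the same when both are applied to the same data.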

Linear Regression

Step 17	Now divide the data into X and Y, where Y is the dependent variable and X holds the independent variables.	<dataset name>_X=<dataset name>.data for example diabetes_X=diabetes.data	x=<dataframe name>[['<column 1>','<column 2>','<column 3>','<column 4>']] for example x=df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Area Population']]; y=<dataframe name>[['<target column>']] for example y=df[['Price']]
Step 18	Import the statistical model	from sklearn import <model> for example from sklearn import linear_model	from sklearn.model_selection import train_test_split
	import mean_squared_error if you need the mean squared error; import SVC or DecisionTreeRegressor instead if you need those models	from sklearn.metrics import mean_squared_error	x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=42) #splits the data into training and test sets. test_size=0.30 keeps 30% of the rows for testing. random_state=42 fixes the random seed so the same rows are picked on every run
	We will take part of the data for testing and part of the data for training	<dataset name>_X_test=<dataset name>_X[-30:] for example diabetes_X_test=diabetes_X[-30:]; <dataset name>_X_training=<dataset name>_X[:-30] for example diabetes_X_training=diabetes_X[:-30]	from sklearn.linear_model import LinearRegression
		<dataset name>_y_test=<dataset name>.target[-30:] for example diabetes_y_test=diabetes.target[-30:]; <dataset name>_y_training=<dataset name>.target[:-30] for example diabetes_y_training=diabetes.target[:-30]	<model name>=LinearRegression() where <model name> can be any variable name
			<model name>.fit(x_train,y_train)
Step 19	Now create the statistical model	model=linear_model.LinearRegression()	y_predict=<model name>.predict(x_test)
			y_predict
			<model name>.coef_
			<model name>.intercept_
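
Put together, the linear regression steps above can be sketched as one short script. This is a minimal sketch, assuming the built-in diabetes dataset that the table already uses as its example; the variable names model and mse are my own choices.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Steps 3 and 17: load the built-in dataset and separate X from Y.
diabetes = datasets.load_diabetes()
x, y = diabetes.data, diabetes.target

# Step 18: keep 30% of the rows for testing; random_state fixes the split.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=42
)

# Step 19: create the model, fit it on the training rows, predict the test rows.
model = LinearRegression()
model.fit(x_train, y_train)
y_predict = model.predict(x_test)

# Compare predictions with the held-out targets.
mse = mean_squared_error(y_test, y_predict)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)
print("Mean squared error:", mse)
```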

Logistic Regression

Step 18	Import the statistical model	from sklearn import <model> for example from sklearn import linear_model	from sklearn.model_selection import train_test_split
	import mean_squared_error if you need the mean squared error; import SVC or DecisionTreeRegressor instead if you need those models	from sklearn.metrics import mean_squared_error	x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=42) #splits the data into training and test sets. test_size=0.30 keeps 30% of the rows for testing. random_state=42 fixes the random seed so the same rows are picked on every run
	We will take part of the data for testing and part of the data for training	<dataset name>_X_test=<dataset name>_X[-30:] for example diabetes_X_test=diabetes_X[-30:]; <dataset name>_X_training=<dataset name>_X[:-30] for example diabetes_X_training=diabetes_X[:-30]	from sklearn.linear_model import LogisticRegression
		<dataset name>_y_test=<dataset name>.target[-30:] for example diabetes_y_test=diabetes.target[-30:]; <dataset name>_y_training=<dataset name>.target[:-30] for example diabetes_y_training=diabetes.target[:-30]	<model name>=LogisticRegression() where <model name> can be any variable name
			<model name>.fit(x_train,y_train)
Step 19	Now create the statistical model	model=linear_model.LogisticRegression()	y_predict=<model name>.predict(x_test)
			y_predict
			<model name>.coef_
			<model name>.intercept_
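
The logistic regression steps above can be sketched the same way. Logistic regression needs a categorical target, so this sketch assumes the built-in iris dataset instead of diabetes; that choice, the max_iter=1000 setting (raised so the solver has room to converge) and the stratify=y argument are my assumptions, not part of the table.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a built-in classification dataset (an assumption for this sketch).
iris = datasets.load_iris()
x, y = iris.data, iris.target

# Step 18: hold out 30% of the rows; stratify keeps class proportions equal.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=42, stratify=y
)

# Step 19: create the model, fit it, and predict the held-out rows.
clf = LogisticRegression(max_iter=1000)
clf.fit(x_train, y_train)
y_predict = clf.predict(x_test)

# For a classifier, accuracy is a more natural score than mean squared error.
accuracy = metrics.accuracy_score(y_test, y_predict)
print("Coefficients:", clf.coef_)
print("Intercept:", clf.intercept_)
print("Accuracy=", accuracy)
```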

KNeighbors Classifier

Step 18	Import the statistical model	from sklearn import <model> for example from sklearn import neighbors	from sklearn.model_selection import train_test_split
	import mean_squared_error if you need the mean squared error; import SVC or DecisionTreeRegressor instead if you need those models	from sklearn.metrics import mean_squared_error	x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=42,stratify=y) #stratify=y keeps the class proportions the same in the training and test sets
	We will take part of the data for testing and part of the data for training	<dataset name>_X_test=<dataset name>_X[-30:] for example diabetes_X_test=diabetes_X[-30:]; <dataset name>_X_training=<dataset name>_X[:-30] for example diabetes_X_training=diabetes_X[:-30]	from sklearn.neighbors import KNeighborsClassifier
		<dataset name>_y_test=<dataset name>.target[-30:] for example diabetes_y_test=diabetes.target[-30:]; <dataset name>_y_training=<dataset name>.target[:-30] for example diabetes_y_training=diabetes.target[:-30]	<model name>=KNeighborsClassifier(n_neighbors=3) where <model name> can be any variable name
			<model name>.fit(x_train,y_train)
			y_predict=<model name>.predict(x_test)
Step 19	Now create the statistical model	model=neighbors.KNeighborsClassifier()	from sklearn import metrics
			print("Accuracy=",metrics.accuracy_score(y_test,y_predict))
			<model name>=KNeighborsClassifier(n_neighbors=5); <model name>.fit(x_train,y_train); y_predict=<model name>.predict(x_test); print("Accuracy=",metrics.accuracy_score(y_test,y_predict))
			<model name>=KNeighborsClassifier(n_neighbors=7); <model name>.fit(x_train,y_train); y_predict=<model name>.predict(x_test); print("Accuracy=",metrics.accuracy_score(y_test,y_predict))
			###if the features in x have very different ranges, the graph is hard to plot and read, so we scale the data first###
			from sklearn.preprocessing import StandardScaler
			scaler=StandardScaler(); x_scaled=scaler.fit_transform(x); x_scaled
			x_train,x_test,y_train,y_test=train_test_split(x_scaled,y,test_size=0.30,random_state=42,stratify=y)
			<model name>=KNeighborsClassifier(n_neighbors=7); <model name>.fit(x_train,y_train); y_predict=<model name>.predict(x_test); print("Accuracy=",metrics.accuracy_score(y_test,y_predict))
			<model name>=KNeighborsClassifier(n_neighbors=9); <model name>.fit(x_train,y_train); y_predict=<model name>.predict(x_test); print("Accuracy=",metrics.accuracy_score(y_test,y_predict))
			from sklearn.model_selection import cross_val_score
			neighbors=list(range(1,50,2))
			cv_scores=[]
			for k in neighbors:
			    knn=KNeighborsClassifier(n_neighbors=k)
			    scores=cross_val_score(knn,x_scaled,y,scoring='accuracy')
			    cv_scores.append(scores.mean())
			cv_scores ###mean cross-validated accuracy for each k###
			MSE=[1-x for x in cv_scores] ###here MSE holds the misclassification error (1 - accuracy) for each k###
			MSE
			optimal_k=neighbors[MSE.index(min(MSE))]
			print(optimal_k)
			x_train,x_test,y_train,y_test=train_test_split(x_scaled,y,test_size=0.30,random_state=42,stratify=y); <model name>=KNeighborsClassifier(n_neighbors=25); <model name>.fit(x_train,y_train); y_predict=<model name>.predict(x_test); print("Accuracy=",metrics.accuracy_score(y_test,y_predict)) ###replace 25 with the optimal_k printed above###
			plt.plot(neighbors,MSE); plt.xlabel('Number of K'); plt.ylabel('Error'); plt.show()
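
The search for the best number of neighbours above can be sketched as a single script. This is a minimal sketch, assuming the built-in wine dataset (chosen because its features have very different ranges); the dataset, the explicit cv=5 setting and the name errors are my assumptions, and the plot is left out so the script runs without a display.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Load a built-in classification dataset (an assumption for this sketch).
wine = datasets.load_wine()
x, y = wine.data, wine.target

# Scale the features so no single column dominates the distance calculation.
x_scaled = StandardScaler().fit_transform(x)

# Try odd values of k and record the mean cross-validated accuracy for each.
neighbors = list(range(1, 50, 2))
cv_scores = []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, x_scaled, y, cv=5, scoring="accuracy")
    cv_scores.append(scores.mean())

# Misclassification error is 1 - accuracy; the best k has the lowest error.
errors = [1 - s for s in cv_scores]
optimal_k = neighbors[errors.index(min(errors))]
print("Optimal k:", optimal_k)

# Refit with the chosen k and score it on a held-out test set.
x_train, x_test, y_train, y_test = train_test_split(
    x_scaled, y, test_size=0.30, random_state=42, stratify=y
)
knn = KNeighborsClassifier(n_neighbors=optimal_k)
knn.fit(x_train, y_train)
y_predict = knn.predict(x_test)
accuracy = metrics.accuracy_score(y_test, y_predict)
print("Accuracy=", accuracy)
```

Like the table, this scales the whole dataset before splitting it. A stricter workflow would fit the scaler on the training rows only, so that nothing from the test rows influences the scaling.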

Thursday, April 30, 2020

machine learning step by step commands and structure
