Most machine learning projects follow the same path. This post lays out a step-by-step structure that you can apply to any dataset you need to run.
Common Steps
Linear Regression
Logistic Regression
Common Steps
Action needed
|
Commands
|
|
Step 1
|
Import libraries
|
import numpy as np
|
import pandas as pd
|
||
from matplotlib import pyplot as plt
|
||
import seaborn as sns
|
||
Step 2
|
Either import the built-in datasets or create a DataFrame
|
|
To import the built-in datasets
|
from sklearn import datasets
|
|
To create a DataFrame from a file
|
df=pd.read_csv('path or the file name')
|
|
Step 3
|
Once the built-in datasets module is imported, load a particular dataset
|
<dataset name>=datasets.load_<dataset name>(), for example
diabetes=datasets.load_diabetes()
|
Step 4
|
To see the description of the dataset
|
print(<dataset name>.DESCR), for example
print(diabetes.DESCR)
|
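As a quick check of Steps 2 to 4, the sketch below loads the built-in diabetes dataset and looks at what comes back. The `as_frame=True` call is an addition that is not part of the steps above; it assumes a reasonably recent scikit-learn and is handy because the later steps call DataFrame methods such as head().

```python
from sklearn import datasets

# Load the built-in diabetes dataset; the result is a dictionary-like Bunch.
diabetes = datasets.load_diabetes()

# Features and target are NumPy arrays.
print(diabetes.data.shape)     # (442, 10)
print(diabetes.target.shape)   # (442,)
print(diabetes.feature_names)

# Print only the start of the long description text.
print(diabetes.DESCR[:200])

# Assumption: as_frame=True (newer scikit-learn) also returns a pandas DataFrame
# holding the ten features plus a 'target' column.
diabetes_df = datasets.load_diabetes(as_frame=True).frame
print(diabetes_df.shape)       # (442, 11)
```

With `diabetes_df` in hand, calls such as `diabetes_df.head()` work directly, which they would not on the Bunch object itself.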
Step 5
|
To see the first rows of the DataFrame
|
<dataframe name>.head(), for example recipes.head()
|
Step 6
|
Get the dimensions of the DataFrame
|
<dataframe name>.shape, for example recipes.shape
|
Step 7
|
To get summary information about the DataFrame (column types and non-null counts)
|
df.info(), where df is the DataFrame name
|
Step 8
|
To find null values in the DataFrame
|
df.isnull(), where df is the DataFrame name
|
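Steps 5 to 8 are easiest to follow on a tiny DataFrame. The data below is made up purely for illustration, and the per-column `isnull().sum()` at the end is an extra that is not in the steps above.

```python
import numpy as np
import pandas as pd

# Hypothetical data, made up for illustration only.
df = pd.DataFrame({
    "Age": [22, 38, np.nan, 35],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

print(df.head())      # first rows (up to five)
print(df.shape)       # (4, 3): four rows, three columns
df.info()             # column types and non-null counts, printed directly
print(df.isnull())    # True wherever a value is missing

# A common follow-up: number of missing values per column.
null_counts = df.isnull().sum()
print(null_counts)
```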
Step 9
|
To find the pairwise correlation between columns
|
df.corr()
|
Step 10
|
Heatmap of the correlation matrix
|
sns.heatmap(df.corr())
|
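Here is a minimal sketch of the correlation step on made-up data. The `numeric_only=True` argument is an assumption about your pandas version: recent releases raise an error when df.corr() meets a text column, and this argument tells pandas to skip such columns.

```python
import pandas as pd

# Hypothetical data, made up for illustration only.
df = pd.DataFrame({
    "Age": [25, 32, 47, 51, 62],
    "EstimatedSalary": [30000, 42000, 58000, 61000, 79000],
    "Gender": ["Male", "Female", "Female", "Male", "Female"],
})

# numeric_only=True skips the text column 'Gender'.
corr = df.corr(numeric_only=True)
print(corr)
```

The `corr` matrix is exactly what sns.heatmap draws; passing `annot=True` to sns.heatmap additionally writes the value inside each cell.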
Step 11
|
To create a heatmap of the null values
|
sns.heatmap(df.isnull())
|
Step 12
|
To count the null and non-null values in a column
|
<dataframe name>['<column name>'].isnull().value_counts(), for example
df['Cabin'].isnull().value_counts()
|
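To make the output of Step 12 concrete, here is a small sketch with a made-up 'Cabin' column that has two missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical column with two missing entries.
df = pd.DataFrame({"Cabin": ["C85", np.nan, "E46", np.nan, "C123"]})

counts = df["Cabin"].isnull().value_counts()
print(counts)
# The False row counts the filled cells, the True row counts the missing ones.
```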
Step 13
|
To change the values from text labels to numbers
|
<mapping>={"first value":0,"second value":1}
df['<column>']=df['<column>'].map(<mapping>), for example
gend={"Male":0,"Female":1}
df['Gender']=df['Gender'].map(gend)
|
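The mapping step can be tried out on a made-up column like this:

```python
import pandas as pd

# Hypothetical column of text labels.
df = pd.DataFrame({"Gender": ["Male", "Female", "Female", "Male"]})

# Dictionary from each text label to the number that replaces it.
gend = {"Male": 0, "Female": 1}
df["Gender"] = df["Gender"].map(gend)

print(df["Gender"].tolist())   # [0, 1, 1, 0]
```

One thing to watch: any label that is missing from the dictionary becomes NaN after map(), so it is worth checking for nulls again afterwards.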
Step 14
|
To drop a column from the DataFrame
|
df.drop("Address",axis=1,inplace=True), where df is the DataFrame name
|
Step 15
|
To list the column names
|
df.columns
|
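A short sketch of dropping a column and then listing what is left, using made-up data:

```python
import pandas as pd

# Hypothetical data, made up for illustration only.
df = pd.DataFrame({
    "Age": [25, 32],
    "Address": ["12 Hypothetical St", "34 Example Ave"],
    "Price": [100.0, 150.0],
})

# axis=1 means "drop a column"; inplace=True changes df itself.
df.drop("Address", axis=1, inplace=True)

print(df.columns.tolist())   # ['Age', 'Price']
```

If you prefer not to modify df in place, `df = df.drop(columns="Address")` gives the same result.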
Step 16
|
To scale the data (if needed)
|
from sklearn.preprocessing import StandardScaler   # import the scaler first
|
scaler=StandardScaler()   # scaler can be any variable name
|
||
scaler.fit(df.drop('Purchased',axis=1))   # fit the scaler on every column except the target column 'Purchased'
|
||
scale_arr=scaler.transform(df.drop('Purchased',axis=1))   # transform the original features into a new scaled array
|
||
<new dataframe>=pd.DataFrame(scale_arr,columns=['Column 1','Column 2']), for example
new_df=pd.DataFrame(scale_arr,columns=['Age','EstimatedSalary'])
|
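The scaling step, put together on made-up data, might look like the sketch below. It uses fit_transform to do the fit and the transform in one call, and it reuses the original column names rather than typing them out; both are small conveniences, not something the steps above require.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical data; 'Purchased' is the target and is left unscaled.
df = pd.DataFrame({
    "Age": [19, 35, 26, 27, 45],
    "EstimatedSalary": [19000, 20000, 43000, 57000, 76000],
    "Purchased": [0, 0, 0, 0, 1],
})

features = df.drop("Purchased", axis=1)

scaler = StandardScaler()
scale_arr = scaler.fit_transform(features)   # fit and transform in one call

# Reuse the original column names instead of typing them again.
new_df = pd.DataFrame(scale_arr, columns=features.columns)

print(new_df.mean())        # approximately 0 for each column
print(new_df.std(ddof=0))   # approximately 1 for each column (population std)
```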
Linear Regression
Step 17
|
Now divide the data into X and y, where y is the dependent variable (the target) and X holds the independent variables (the features).
|
For a built-in dataset: <dataset name>_X=<dataset name>.data, for example diabetes_X=diabetes.data
|
For a DataFrame: x=<dataframe name>[['<feature column>','<feature column>','<feature column>','<feature column>']], for example
x=df[['Avg. Area Income', 'Avg. Area House Age', 'Avg. Area Number of Rooms', 'Area Population']]
y=<dataframe name>[['<target column>']], for example
y=df[['Price']]
|
Step 18
|
Import the statistical model
|
from sklearn import <module>, for example from sklearn import linear_model
|
from sklearn.model_selection import train_test_split
|
Import mean_squared_error if you need the mean squared error; import SVC or DecisionTreeRegressor instead if you need one of those models
|
from sklearn.metrics import mean_squared_error
|
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=42)
# Divide the data into training and test sets.
# test_size=0.30 is the share of the data kept for testing, here 30%.
# random_state=42 fixes the shuffle, so the same rows are picked on every run.
|
|
We will take part of the data for testing and the rest for training
|
<dataset name>_X_test=<dataset name>_X[-30:], for example diabetes_X_test=diabetes_X[-30:]
<dataset name>_X_training=<dataset name>_X[:-30], for example diabetes_X_training=diabetes_X[:-30]
|
from sklearn.linear_model import LinearRegression
|
|
<dataset name>_y_test=<dataset name>.target[-30:], for example diabetes_y_test=diabetes.target[-30:]
<dataset name>_y_training=<dataset name>.target[:-30], for example diabetes_y_training=diabetes.target[:-30]
|
<it can be any name>=LinearRegression()
|
||
<it can be any name>.fit(x_train,y_train)
|
|||
Step 19
|
Now create the statistical model
|
model=linear_model.LinearRegression()
|
y_predict=<it can be any name>.predict(x_test)
|
y_predict
|
|||
<it can be any name>.coef_
|
|||
<it can be any name>.intercept_
|
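Putting the linear regression steps together, a minimal end-to-end sketch on the built-in diabetes data could look like this. The variable name `model` is just one choice for the "it can be any name" placeholder.

```python
from sklearn import datasets
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Built-in regression dataset: x holds the features, y the target.
diabetes = datasets.load_diabetes()
x, y = diabetes.data, diabetes.target

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=42
)

model = LinearRegression()
model.fit(x_train, y_train)
y_predict = model.predict(x_test)

print(model.coef_)        # one coefficient per feature
print(model.intercept_)   # the constant term
mse = mean_squared_error(y_test, y_predict)
print("MSE =", mse)
```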
Logistic Regression
Step 18
|
Import the statistical model
|
from sklearn import <module>, for example from sklearn import linear_model
|
from sklearn.model_selection import train_test_split
|
Import mean_squared_error if you need the mean squared error; import SVC or DecisionTreeRegressor instead if you need one of those models
|
from sklearn.metrics import mean_squared_error
|
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=42)
# Divide the data into training and test sets.
# test_size=0.30 is the share of the data kept for testing, here 30%.
# random_state=42 fixes the shuffle, so the same rows are picked on every run.
|
|
We will take part of the data for testing and the rest for training
|
<dataset name>_X_test=<dataset name>_X[-30:], for example diabetes_X_test=diabetes_X[-30:]
<dataset name>_X_training=<dataset name>_X[:-30], for example diabetes_X_training=diabetes_X[:-30]
|
from sklearn.linear_model import LogisticRegression
|
|
<dataset name>_y_test=<dataset name>.target[-30:], for example diabetes_y_test=diabetes.target[-30:]
<dataset name>_y_training=<dataset name>.target[:-30], for example diabetes_y_training=diabetes.target[:-30]
|
<it can be any name>=LogisticRegression()
|
||
<it can be any name>.fit(x_train,y_train)
|
|||
Step 19
|
Now create the statistical model
|
model=linear_model.LogisticRegression()
|
y_predict=<it can be any name>.predict(x_test)
|
y_predict
|
|||
<it can be any name>.coef_
|
|||
<it can be any name>.intercept_
|
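For logistic regression the target has to be a class label, so the sketch below swaps in the built-in breast cancer dataset as a stand-in for your own two-class data (the diabetes target is a continuous number and does not fit here). Scoring with accuracy, adding `stratify=y` and scaling the features are my own choices for this illustration, not steps taken from the table above.

```python
from sklearn import datasets, metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: breast cancer data stands in for your own two-class dataset.
cancer = datasets.load_breast_cancer()
x, y = cancer.data, cancer.target

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.30, random_state=42, stratify=y
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)

model = LogisticRegression()
model.fit(x_train, y_train)
y_predict = model.predict(x_test)

accuracy = metrics.accuracy_score(y_test, y_predict)
print("Accuracy =", accuracy)
print(model.coef_.shape)   # (1, 30): one row of coefficients for a two-class problem
print(model.intercept_)
```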
KNeighbors Classifier
Step 18
|
Import the statistical model
|
from sklearn import <module>, for example from sklearn import linear_model
|
from sklearn.model_selection import train_test_split
|
Import mean_squared_error if you need the mean squared error; import SVC or DecisionTreeRegressor instead if you need one of those models
|
from sklearn.metrics import mean_squared_error
|
x_train,x_test,y_train,y_test=train_test_split(x,y,test_size=0.30,random_state=42,stratify=y)
|
|
We will take part of the data for testing and the rest for training
|
<dataset name>_X_test=<dataset name>_X[-30:], for example diabetes_X_test=diabetes_X[-30:]
<dataset name>_X_training=<dataset name>_X[:-30], for example diabetes_X_training=diabetes_X[:-30]
|
from sklearn.neighbors import KNeighborsClassifier
|
|
<dataset name>_y_test=<dataset name>.target[-30:], for example diabetes_y_test=diabetes.target[-30:]
<dataset name>_y_training=<dataset name>.target[:-30], for example diabetes_y_training=diabetes.target[:-30]
|
<it can be any name>=KNeighborsClassifier(n_neighbors=3)
<it can be any name>.fit(x_train,y_train)
|
||
y_predict=<it can be any name>.predict(x_test)
|
|||
Step 19
|
Now create the statistical model
|
model=linear_model.LinearRegression()
|
from sklearn import metrics
print("Accuracy=",metrics.accuracy_score(y_test,y_predict))
|
<it can be any name>=KNeighborsClassifier(n_neighbors=5)
<it can be any name>.fit(x_train,y_train)
y_predict=<it can be any name>.predict(x_test)
print("Accuracy=",metrics.accuracy_score(y_test,y_predict))
|
|||
<it can be any name>=KNeighborsClassifier(n_neighbors=7)
<it can be any name>.fit(x_train,y_train)
y_predict=<it can be any name>.predict(x_test)
print("Accuracy=",metrics.accuracy_score(y_test,y_predict))
|
|||
# If the values in x are spread over very different ranges, the data is hard to plot and the distances KNN relies on are dominated by the largest feature, so we scale x first.
|
|||
from sklearn.preprocessing import StandardScaler
|
|||
scaler=StandardScaler()
x_scaled=scaler.fit_transform(x)
x_scaled
|
|||
x_train,x_test,y_train,y_test=train_test_split(x_scaled,y,test_size=0.30,random_state=42,stratify=y)
|
|||
<it can be any name>=KNeighborsClassifier(n_neighbors=7)
<it can be any name>.fit(x_train,y_train)
y_predict=<it can be any name>.predict(x_test)
print("Accuracy=",metrics.accuracy_score(y_test,y_predict))
|
|||
<it can be any name>=KNeighborsClassifier(n_neighbors=9)
<it can be any name>.fit(x_train,y_train)
y_predict=<it can be any name>.predict(x_test)
print("Accuracy=",metrics.accuracy_score(y_test,y_predict))
|
|||
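Rather than retyping the fit, predict and accuracy lines for every value of k, the same experiment can be wrapped in a small helper. The sketch below is only an illustration: the built-in wine dataset stands in for your own data, and `knn_accuracy` is a hypothetical helper name, not a scikit-learn function.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: the wine data stands in for your own dataset.
wine = datasets.load_wine()
x, y = wine.data, wine.target


def knn_accuracy(features, target, k):
    """Split the data, fit a KNN classifier with k neighbors, return test accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        features, target, test_size=0.30, random_state=42, stratify=target
    )
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(x_train, y_train)
    return metrics.accuracy_score(y_test, knn.predict(x_test))


# Same scaling call as in the steps above.
x_scaled = StandardScaler().fit_transform(x)

# Accuracy on raw and on scaled features for each k tried above.
results = {}
for k in (3, 5, 7, 9):
    results[k] = (knn_accuracy(x, y, k), knn_accuracy(x_scaled, y, k))
    print(k, "raw:", results[k][0], "scaled:", results[k][1])
```

When the features span very different ranges, the scaled accuracy will often come out higher, but that depends on the data, so it is worth printing both.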
from sklearn.model_selection import cross_val_score
|
|||
neighbors=list(range(1,50,2))   # odd values of k from 1 to 49
cv_scores=[]
for k in neighbors:
    knn=KNeighborsClassifier(n_neighbors=k)
    scores=cross_val_score(knn,x_scaled,y,scoring='accuracy')
    cv_scores.append(scores.mean())
|
|||
cv_scores
|
|||
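MSE=[1-x for x in cv_scores]   # misclassification error for each k (1 minus the mean accuracy); despite the name, this is not a mean squared error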
|
|||
MSE
|
|||
optimal_k=neighbors[MSE.index(min(MSE))]
|
|||
print(optimal_k)
|
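The search for the best k can be run from start to finish as in the sketch below. The built-in wine dataset is an assumption standing in for your own data, `cv=5` is written out explicitly, and the error list is named `errors` here because it holds misclassification rates.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Assumption: the wine data stands in for your own dataset.
wine = datasets.load_wine()
x_scaled = StandardScaler().fit_transform(wine.data)
y = wine.target

neighbors = list(range(1, 50, 2))   # odd k from 1 to 49
cv_scores = []
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    # Five-fold cross-validation; each fold yields one accuracy value.
    scores = cross_val_score(knn, x_scaled, y, cv=5, scoring="accuracy")
    cv_scores.append(scores.mean())

# Misclassification error for each k: 1 minus the mean accuracy.
errors = [1 - score for score in cv_scores]

# The k with the smallest error; ties go to the smallest k.
optimal_k = neighbors[errors.index(min(errors))]
print(optimal_k)
```

Passing `optimal_k` straight into `KNeighborsClassifier(n_neighbors=optimal_k)` avoids having to copy the printed number by hand.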
|||
x_train,x_test,y_train,y_test=train_test_split(x_scaled,y,test_size=0.30,random_state=42,stratify=y)
<it can be any name>=KNeighborsClassifier(n_neighbors=25)
<it can be any name>.fit(x_train,y_train)
y_predict=<it can be any name>.predict(x_test)
print("Accuracy=",metrics.accuracy_score(y_test,y_predict))
|
|||
plt.plot(neighbors,MSE)
plt.xlabel('Number of K')
plt.ylabel('Error')
plt.show()
|