As the saying goes, "he who stays near vermilion turns red; he who stays near ink turns black."
This article was written in a Jupyter notebook.
The k-nearest neighbors algorithm (kNN for short) is a widely used supervised learning algorithm, and its working mechanism is very simple: given a test sample, find the k training samples closest to it under some distance metric, then make a prediction based on the information carried by these k "neighbors".
In classification tasks the usual choice is "majority voting": the class label that appears most often among the k samples becomes the prediction. The votes can also be weighted by distance, with closer samples receiving larger weights (a small sketch of this follows below).
The schematic diagram of kNN (not reproduced here) makes it clear that different values of k may yield very different classification results.
kNN handles multi-class problems naturally; the idea is simple yet remarkably powerful.
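As a quick illustration of distance-weighted voting, here is a minimal sketch; the function weighted_vote and its arguments are illustrative names of my own, not from any library:

from collections import defaultdict

def weighted_vote(labels, dists):
    """Each of the k neighbors casts a vote of weight 1/distance; the class with the largest total wins."""
    scores = defaultdict(float)
    for label, d in zip(labels, dists):
        scores[label] += 1.0 / d  # the closer the neighbor, the heavier its vote
    return max(scores, key=scores.get)

weighted_vote([0, 1, 1], [0.5, 2.0, 4.0])  # -> 0: one close neighbor outweighs two distant ones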
The data preparation stage follows.
# Import the two required packages
import numpy as np
import matplotlib.pyplot as plt
# The data points
raw_data_X = [[3.3935, 2.3312],
              [3.1100, 1.7815],
              [1.3438, 3.3683],
              [3.5822, 4.6791],
              [2.2803, 2.8669],
              [7.4234, 4.6965],
              [5.7450, 3.5339],
              [9.1721, 2.5111],
              [7.7927, 3.4240],
              [7.9398, 0.7916]
             ]
# The label of each data point
raw_data_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
X_train = np.array(raw_data_X)
y_train = np.array(raw_data_y)
# Draw the scatter plot
plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], color='g')
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], color='r')
plt.show()
# A sample to classify
x = np.array([8.0936, 3.3657])
plt.scatter(X_train[y_train==0, 0], X_train[y_train==0, 1], color='g')
plt.scatter(X_train[y_train==1, 0], X_train[y_train==1, 1], color='r')
plt.scatter(x[0], x[1], color='b')
plt.show()
Execution result: (scatter plot omitted)
1. Compute the "distance" between the sample to classify and every data point.
2. Choose the size of k.
3. Find the k points in the training set nearest to the sample.
4. Apply "majority voting" to predict the class.
# sqrt is needed to compute the Euclidean distance
from math import sqrt
distances = []
for x_train in X_train:
    d = sqrt(np.sum((x_train - x) ** 2))  # numpy's element-wise (universal) operations do the work
    distances.append(d)
# The loop can also be written as a list comprehension:
# distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in X_train]
nearest = np.argsort(distances)  # indices of the training points sorted by distance to x
# Choose k
k = 6
# The labels of the k nearest points
topK_y = [y_train[i] for i in nearest[:k]]
# Use Counter to tally the class labels
from collections import Counter
votes = Counter(topK_y)
votes.most_common(1)  # the single class with the most votes
# Out[22]:
# [(1, 5)]
predict_y = votes.most_common(1)[0][0]  # the predicted class
predict_y
# Out[27]:
# 1
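The steps above can be gathered into one small reusable function. This is just a sketch of the code already shown (the name kNN_classify is my own choice):

def kNN_classify(k, X_train, y_train, x):
    """Predict the class of a single sample x by majority vote among its k nearest neighbors."""
    distances = [sqrt(np.sum((x_train - x) ** 2)) for x_train in X_train]
    nearest = np.argsort(distances)
    topK_y = [y_train[i] for i in nearest[:k]]
    votes = Counter(topK_y)
    return votes.most_common(1)[0][0]

kNN_classify(6, X_train, y_train, x)
# -> 1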
The third-party library scikit-learn (not part of the Python standard library) also ships a ready-made kNN implementation.
# Import the kNN classifier
from sklearn.neighbors import KNeighborsClassifier
# Create the classifier object
kNN_classifier = KNeighborsClassifier(n_neighbors=6)
kNN_classifier.fit(X_train, y_train)  # train the model first
"""
fit returns the object itself
Out[7]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski', metric_params=None, n_jobs=1, n_neighbors=6, p=2, weights='uniform')
"""
X_predict = x.reshape(1, -1)  # predict expects a matrix, while x here is only a single vector
X_predict
# Out[9]:
# array([[ 8.0936, 3.3657]])
y_predict = kNN_classifier.predict(X_predict)
y_predict[0]
# Out[13]:
# 1
This example uses the Iris dataset from sklearn.datasets.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data
y = iris.target
X.shape
# Out[4]:
# (150, 4)
y.shape
# Out[5]:
# (150,)
When we get a dataset, we usually use one part of it to train the model and keep the other part to test the trained model; this is the idea behind train_test_split.
y
"""
Out[6]:
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
"""
shuffle_indexes = np.random.permutation(len(X))  # shuffle the data: a random permutation of the 150 indices
shuffle_indexes
"""
Out[8]:
array([ 22, 142, 86, 111, 72, 80, 17, 137, 5, 66, 33, 55, 40, 122, 108, 24, 45, 110, 68, 46, 118, 44, 136, 121, 78, 31, 103, 35, 105, 107, 76, 116, 84, 144, 123, 57, 42, 7, 38, 28, 117, 115, 89, 58, 126, 74, 49, 27, 94, 77, 85, 21, 119, 132, 100, 120, 6, 104, 62, 53, 64, 41, 106, 26, 29, 18, 129, 146, 148, 1, 82, 139, 135, 96, 127, 56, 37, 130, 65, 149, 113, 92, 131, 2, 4, 125, 54, 79, 50, 61, 112, 95, 19, 109, 102, 141, 30, 39, 83, 25, 140, 60, 12, 20, 138, 71, 59, 11, 13, 0, 52, 91, 3, 73, 23, 124, 15, 14, 81, 97, 75, 114, 16, 69, 32, 134, 36, 8, 63, 51, 147, 67, 93, 47, 133, 48, 143, 43, 34, 98, 87, 88, 145, 70, 90, 9, 10, 128, 101, 99])
"""
test_ratio = 0.2  # proportion reserved for the test set
test_size = int(len(X) * test_ratio)
test_indexes = shuffle_indexes[:test_size]   # indices of the test set
train_indexes = shuffle_indexes[test_size:]  # indices of the training set
X_train = X[train_indexes]
y_train = y[train_indexes]
X_test = X[test_indexes]
y_test = y[test_indexes]
print(X_train.shape)
print(y_train.shape)
# (120, 4)
# (120,)
print(X_test.shape)
print(y_test.shape)
# (30, 4)
# (30,)
sklearn likewise provides a ready-made way to split a dataset into training and test sets.
# First create a kNN classifier my_knn_clf (construction omitted; see the KNN class in the appendix)
# Import the function
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
print(X_train.shape)
print(y_train.shape)
# (120, 4)
# (120,)
print(X_test.shape)
print(y_test.shape)
# (30, 4)
# (30,)
my_knn_clf.fit(X_train, y_train)
# Out[32]:
# KNN(k=3)
y_predict = my_knn_clf.predict(X_test)
sum(y_predict == y_test)/len(y_test)
# Out[34]:
# 1.0
This example uses the handwritten digits dataset from sklearn.datasets to demonstrate computing classification accuracy.
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
from sklearn import datasets
digits = datasets.load_digits()  # the handwritten digits dataset
digits.keys()
# Out[3]:
# dict_keys(['data', 'target', 'target_names', 'images', 'DESCR'])
print(digits.DESCR)
X = digits.data    # the data
y = digits.target  # the labels
# Pick an arbitrary sample
some_digit = X[666]
some_digit_image = some_digit.reshape(8, 8)
plt.imshow(some_digit_image, cmap=matplotlib.cm.binary)
plt.show()
Execution result: (digit image omitted)
# Import the split function and split the dataset first
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Import the kNN classifier
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
y_predict = knn_clf.predict(X_test)
# Import sklearn's accuracy function
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_predict)
# Out[27]:
# 0.99444444444444446
Hyperparameters: parameters that must be fixed before the algorithm runs, such as k in kNN.
Model parameters: parameters learned by the algorithm during training.
From the discussion of kNN above, it follows that kNN has no model parameters.
import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
X = digits.data
y = digits.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)
# Out[4]:
# 0.98611111111111116
best_score = 0.0
best_k = -1
for k in range(1, 11):  # search for the best k in 1..10: build a classifier for each k and judge it by score
    knn_clf = KNeighborsClassifier(n_neighbors=k)
    knn_clf.fit(X_train, y_train)
    score = knn_clf.score(X_test, y_test)
    if score > best_score:
        best_k = k
        best_score = score
print("best_k=", best_k)
print("best_score=", best_score)
# best_k= 7
# best_score= 0.988888888889
When kNN takes distance into account, a neighbor's influence is inversely related to its distance from the query point: the larger the distance, the smaller the weight of its vote (sklearn's weights="distance" uses the inverse of the distance).
best_method = ""
best_score = 0.0
best_k = -1
for method in ["uniform", "distance"]:
    for k in range(1, 11):  # search for the best k in 1..10
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights=method)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_method = method
print("best_method=", best_method)
print("best_k=", best_k)
print("best_score=", best_score)
# best_method= uniform
# best_k= 1
# best_score= 0.994444444444
%%time
best_p = -1
best_score = 0.0
best_k = -1
for k in range(1, 11):  # search for the best k in 1..10
    for p in range(1, 6):  # the distance parameter p
        knn_clf = KNeighborsClassifier(n_neighbors=k, weights="distance", p=p)
        knn_clf.fit(X_train, y_train)
        score = knn_clf.score(X_test, y_test)
        if score > best_score:
            best_k = k
            best_score = score
            best_p = p
print("best_p=", best_p)
print("best_k=", best_k)
print("best_score=", best_score)
# best_p= 2
# best_k= 1
# best_score= 0.994444444444
# CPU times: user 15.3 s, sys: 51.5 ms, total: 15.4 s
# Wall time: 15.5 s
The nested loops above can be abstracted as a grid: traverse every point on the grid and keep the parameter combination with the best score.
A supplement on the "distance" used in kNN.
Minkowski distance: this introduces the hyperparameter p; with p = 1 it reduces to the Manhattan distance, and with p = 2 to the Euclidean distance.
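For reference, the Minkowski distance between two n-dimensional points x and y is

$$D(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$$

so the hyperparameter p interpolates between (and beyond) the Manhattan and Euclidean metrics.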
Data preparation:
import numpy as np
from sklearn import datasets
digits = datasets.load_digits()
X = digits.data
y = digits.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=666)
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
knn_clf.score(X_test, y_test)
# Out[4]:
# 0.98888888888888893
The grid-search idea of the previous section can be expressed more concisely as follows.
# Define the parameter grid: each dict lists the candidate values of the parameters to traverse
param_grid = [
    {
        'weights': ['uniform'],
        'n_neighbors': [i for i in range(1, 11)]
    },
    {
        'weights': ['distance'],
        'n_neighbors': [i for i in range(1, 11)],
        'p': [i for i in range(1, 6)]
    }
]
knn_clf = KNeighborsClassifier()
# Import the grid-search class (it performs cross-validation, hence the CV in the name)
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(knn_clf, param_grid)
%%time
grid_search.fit(X_train, y_train)
# CPU times: user 2min 2s, sys: 320 ms, total: 2min 2s Wall time: 2min 3s
"""
GridSearchCV(cv=None, error_score='raise', estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=5, p=2, weights='uniform'),
fit_params=None, iid=True, n_jobs=1, param_grid=[{'weights': ['uniform'],
'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}, {'weights': ['distance'],
'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'p': [1, 2, 3, 4, 5]}],
pre_dispatch='2*n_jobs', refit=True, return_train_score='warn', scoring=None,
verbose=0)
"""
grid_search.best_estimator_  # returns the best classifier (the trailing underscore marks an attribute computed from user-supplied data)
"""
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=3, weights='distance')
"""
grid_search.best_score_  # the best (cross-validated) accuracy
# Out[11]:
# 0.98538622129436326
grid_search.best_params_  # the parameters of the best estimator
# Out[12]:
# {'n_neighbors': 3, 'p': 3, 'weights': 'distance'}
knn_clf = grid_search.best_estimator_
knn_clf.score(X_test, y_test)
# Out[14]:
# 0.98333333333333328
%%time
grid_search = GridSearchCV(knn_clf, param_grid, n_jobs=-1, verbose=2)  # n_jobs: number of cores to use (-1 = all); verbose: progress output, larger integers give more detail
grid_search.fit(X_train, y_train)

# Fitting 3 folds for each of 60 candidates, totalling 180 fits
# [CV] n_neighbors=1, weights=uniform ..................................
# [CV] n_neighbors=1, weights=uniform ..................................
# [CV] n_neighbors=1, weights=uniform ..................................
# [CV] n_neighbors=2, weights=uniform ..................................
# [CV] ................... n_neighbors=1, weights=uniform, total= 0.7s
# [CV] n_neighbors=3, weights=uniform ..................................
# [CV] ................... n_neighbors=2, weights=uniform, total= 1.0s ......
# CPU times: user 651 ms, sys: 343 ms, total: 994 ms
# Wall time: 1min 23s
# [Parallel(n_jobs=-1)]: Done 180 out of 180 | elapsed: 1.4min finished

Data normalization: since kNN relies on distances, features on very different scales should first be mapped onto a comparable scale, otherwise the feature with the largest range dominates the distance.
import numpy as np
import matplotlib.pyplot as plt
x = np.random.randint(1, 100, size=100)
(x - np.min(x)) / (np.max(x) - np.min(x))  # min-max normalization: rescale to [0, 1]
# The same treatment for a matrix
X = np.random.randint(0, 100, (50, 2))
X = np.array(X, dtype=float)  # convert to a float type so the values can hold decimals
X[:, 0] = (X[:, 0] - np.min(X[:, 0])) / (np.max(X[:, 0]) - np.min(X[:, 0]))
X[:, 1] = (X[:, 1] - np.min(X[:, 1])) / (np.max(X[:, 1]) - np.min(X[:, 1]))
plt.scatter(X[:, 0], X[:, 1])
plt.show()
# Inspect the properties of min-max normalization
np.mean(X[:, 0])  # mean of the first column
# Out[13]:
# 0.55073684210526319
np.std(X[:, 0])  # standard deviation of the first column
# Out[14]:
# 0.29028548370502699
np.mean(X[:, 1])  # mean of the second column
# Out[15]:
# 0.50515463917525782
np.std(X[:, 1])  # standard deviation of the second column
# Out[16]:
# 0.29547909688276441
X2 = np.random.randint(0, 100, (50, 2))
X2 = np.array(X2, dtype=float)
# Mean-variance normalization (standardization): zero mean, unit standard deviation
X2[:, 0] = (X2[:, 0] - np.mean(X2[:, 0])) / np.std(X2[:, 0])
X2[:, 1] = (X2[:, 1] - np.mean(X2[:, 1])) / np.std(X2[:, 1])
plt.scatter(X2[:,0], X2[:,1])
plt.show()
# Inspect the properties of mean-variance normalization
np.mean(X2[:, 0])  # check the mean (essentially 0)
# Out[24]:
# -3.9968028886505634e-17
np.std(X2[:, 0])  # check the standard deviation (essentially 1)
# Out[25]:
# 0.99999999999999989
np.mean(X2[:, 1])
# Out[26]:
# -3.552713678800501e-17
np.std(X2[:, 1])
# Out[27]:
# 1.0
Note how the test set must be normalized: the test set stands in for real-world data, and in production you may receive only a single sample at a time, so computing a mean over it is meaningless. Instead, subtract the mean of the training data and divide by the standard deviation of the training data.
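A minimal numpy sketch of this rule, reusing X_train and X_test from above (the statistics are computed column-wise on the training set only):

mean_train = np.mean(X_train, axis=0)  # per-feature means of the training set
std_train = np.std(X_train, axis=0)    # per-feature standard deviations of the training set
X_train_scaled = (X_train - mean_train) / std_train
X_test_scaled = (X_test - mean_train) / std_train  # the test set uses the *training* statistics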
Scikit-Learn provides a dedicated class for data normalization. As with a kNN classifier object, you must call fit first and only then perform the transformation, as follows.
Data preparation
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
X = iris.data
y = iris.target
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size = 0.2, random_state=666)
# Import the mean-variance normalization (standardization) class
from sklearn.preprocessing import StandardScaler
standardScaler = StandardScaler()
standardScaler.fit(X_train)
# Out: StandardScaler(copy=True, with_mean=True, with_std=True)
# Inspect what the scaler has learned
standardScaler.mean_
# Out[9]:
# array([ 5.83416667, 3.0825 , 3.70916667, 1.16916667])
standardScaler.scale_  # the standard deviations
# Out[10]:
# array([ 0.81019502, 0.44076874, 1.76295187, 0.75429833])
X_train = standardScaler.transform(X_train)
# Once the training data has been normalized, the test data must be normalized as well
X_test_standard = standardScaler.transform(X_test)
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier(n_neighbors=3)
knn_clf.fit(X_train, y_train)
"""
Out[17]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
metric_params=None, n_jobs=1, n_neighbors=3, p=2, weights='uniform')
"""
knn_clf.score(X_test_standard, y_test)
# Out[18]:
# 1.0
knn_clf.score(X_test, y_test)  # this result is wrong: the test data passed in must be normalized the same way as the training data
# Out[19]:
# 0.33333333333333331
Appendix: an implementation modeled after sklearn's kNN classifier, called in exactly the same way.
PyCharm, Sublime, Notepad... use whatever editor you like.
kNN.py:
import numpy as np
from math import sqrt
from collections import Counter
from .metrics import accuracy_score

class KNNClassifier:

    def __init__(self, k):
        """Initialize the kNN classifier"""
        assert k >= 1, "k must be valid"
        self.k = k
        self._X_train = None
        self._y_train = None

    def fit(self, X_train, y_train):
        """Train the kNN classifier on the training set X_train, y_train"""
        assert X_train.shape[0] == y_train.shape[0], \
            "the size of X_train must equal to the size of y_train"
        assert self.k <= X_train.shape[0], \
            "the size of X_train must be at least k"
        self._X_train = X_train
        self._y_train = y_train
        return self

    def predict(self, X_predict):
        """Given a dataset X_predict to classify, return the vector of predictions"""
        assert self._X_train is not None and self._y_train is not None, \
            "must fit before predict"
        assert X_predict.shape[1] == self._X_train.shape[1], \
            "the feature number of X_predict must be equal to X_train"
        y_predict = [self._predict(x) for x in X_predict]
        return np.array(y_predict)

    def _predict(self, x):
        """Given a single sample x, return the predicted class of x"""
        assert x.shape[0] == self._X_train.shape[1], \
            "the feature number of x must be equal to X_train"
        distances = [sqrt(np.sum((x_train - x) ** 2))
                     for x_train in self._X_train]
        nearest = np.argsort(distances)
        topK_y = [self._y_train[i] for i in nearest[:self.k]]
        votes = Counter(topK_y)
        return votes.most_common(1)[0][0]

    def score(self, X_test, y_test):
        """Compute the accuracy on the given test set"""
        y_predict = self.predict(X_test)
        return accuracy_score(y_test, y_predict)

    def __repr__(self):
        return "KNN(k=%d)" % self.k
metrics.py:
import numpy as np

def accuracy_score(y_true, y_predict):
    """Compute the accuracy between y_true and y_predict"""
    # the predictions and the true labels must match in number for a one-to-one comparison
    assert y_true.shape[0] == y_predict.shape[0], \
        "the size of y_true must be equal to the size of y_predict"
    return sum(y_predict == y_true) / len(y_true)
model_selection.py:
import numpy as np

def train_test_split(X, y, test_ratio=0.2, seed=None):
    """Split X and y into X_train, X_test, y_train, y_test according to test_ratio"""
    # every sample must have exactly one label
    assert X.shape[0] == y.shape[0], \
        "the size of X must be equal to the size of y"
    assert 0.0 <= test_ratio <= 1.0, \
        "test_ratio must be valid"
    if seed:
        np.random.seed(seed)
    shuffle_indexes = np.random.permutation(len(X))  # a random permutation of all sample indices
    test_size = int(len(X) * test_ratio)
    test_indexes = shuffle_indexes[:test_size]    # indices of the test set
    train_indexes = shuffle_indexes[test_size:]   # indices of the training set
    X_train = X[train_indexes]
    y_train = y[train_indexes]
    X_test = X[test_indexes]
    y_test = y[test_indexes]
    return X_train, X_test, y_train, y_test
preprocessing.py:
import numpy as np

class StandardScaler:

    def __init__(self):
        self.mean_ = None
        self.scale_ = None

    def fit(self, X):
        """Learn the mean and standard deviation of each feature from the training set X"""
        assert X.ndim == 2, "The dimension of X must be 2"
        self.mean_ = np.array([np.mean(X[:, i]) for i in range(X.shape[1])])
        self.scale_ = np.array([np.std(X[:, i]) for i in range(X.shape[1])])
        return self

    def transform(self, X):
        """Standardize X (zero mean, unit variance) using this StandardScaler"""
        assert X.ndim == 2, "The dimension of X must be 2"
        assert self.mean_ is not None and self.scale_ is not None, \
            "must fit before transform!"
        assert X.shape[1] == len(self.mean_), \
            "the feature number of X must be equal to mean_ and std_"
        resX = np.empty(shape=X.shape, dtype=float)
        for col in range(X.shape[1]):
            resX[:, col] = (X[:, col] - self.mean_[col]) / self.scale_[col]
        return resX
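Assuming the four files above are saved in a local package (hypothetically named playML here; the relative import in kNN.py requires a package), the classes are called exactly like their sklearn counterparts:

import numpy as np
from sklearn import datasets
# playML is a placeholder name for the package holding the files above
from playML.model_selection import train_test_split
from playML.preprocessing import StandardScaler
from playML.kNN import KNNClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_ratio=0.2, seed=666)

# Standardize with statistics learned from the training set only
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

knn_clf = KNNClassifier(k=3)
knn_clf.fit(X_train_std, y_train)
knn_clf.score(X_test_std, y_test)  # accuracy on the test set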