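Hyperopt-sklearn by hyperopt

What is Hyperopt-sklearn?

Finding the right classifier for your data can be hard. Once you have chosen a classifier, tuning all of its parameters to get the best results is tedious and time consuming. Even after all that work, you may have picked the wrong classifier to begin with. Hyperopt-sklearn provides a solution to this problem.

Usage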

from hpsklearn import HyperoptEstimator

# Load Data
# ...

# Create the estimator object
estim = HyperoptEstimator()

# Search the space of classifiers and preprocessing steps and their
# respective hyperparameters in sklearn to fit a model to the data
estim.fit( train_data, train_label )

# Make a prediction using the optimized model
prediction = estim.predict( unknown_data )

# Report the accuracy of the classifier on a given set of data
score = estim.score( test_data, test_label )

# Return instances of the classifier and preprocessing steps
model = estim.best_model()

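Search Algorithms

Any search algorithm available in hyperopt can be used to drive the estimator. You can also supply your own algorithm or mix several algorithms. The number of points to evaluate before returning, and an optional timeout (in seconds), can be set for any search algorithm.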

from hpsklearn import HyperoptEstimator
from hyperopt import tpe

estim = HyperoptEstimator( algo=tpe.suggest, 
                            max_evals=150, 
                            trial_timeout=60 )
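
To make the roles of the evaluation budget and the timeout concrete, here is a minimal sketch of a budgeted random search in plain Python. It is not how hpsklearn is implemented: the function `random_search`, its `objective` and `sample` arguments, and the toy problem are all hypothetical, and the timeout is only checked after a trial finishes, whereas a real implementation would interrupt the trial.

```python
import random
import time

def random_search(objective, sample, max_evals, trial_timeout=None, seed=0):
    """Evaluate up to max_evals sampled points and return the best (loss, point).

    Trials that take longer than trial_timeout seconds are discarded after
    the fact; this sketch does not interrupt a running trial.
    """
    rng = random.Random(seed)
    best = None
    for _ in range(max_evals):
        point = sample(rng)
        start = time.perf_counter()
        loss = objective(point)
        elapsed = time.perf_counter() - start
        if trial_timeout is not None and elapsed > trial_timeout:
            continue  # treat the slow trial as failed
        if best is None or loss < best[0]:
            best = (loss, point)
    return best

# Hypothetical one-dimensional problem: minimise (x - 3)^2 over [-10, 10].
best_loss, best_x = random_search(
    objective=lambda x: (x - 3.0) ** 2,
    sample=lambda rng: rng.uniform(-10.0, 10.0),
    max_evals=150,
    trial_timeout=60,
)
print(best_loss, best_x)
```

A larger `max_evals` gives the search more chances to find a good point at the cost of run time, while the per-trial timeout keeps a single slow configuration from using up the whole budget.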

Search algorithms currently available

Random Search
Tree of Parzen Estimators (TPE)
Annealing
Tree
Gaussian Process Tree

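Classifiers

If you know which type of classifier you want to use on your dataset, you can tell hpsklearn, and it will search only the parameter space of that classifier.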

from hpsklearn import HyperoptEstimator, svc

estim = HyperoptEstimator( classifier=svc('mySVC') )

You can also provide a set of classifiers and, optionally, the probability with which the estimator picks each one.

from hpsklearn import HyperoptEstimator, random_forest, svc, knn
from hyperopt import hp

clf = hp.pchoice( 'my_name',
          [ ( 0.4, random_forest('my_name.random_forest') ),
            ( 0.3, svc('my_name.svc') ),
            ( 0.3, knn('my_name.knn') ) ] )

estim = HyperoptEstimator( classifier=clf )
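
As a rough illustration of what the weights mean, the sketch below draws from the same three options with Python's standard library. The plain strings are hypothetical stand-ins for the classifier search spaces, and `weighted_choice` is not part of hyperopt. The observed frequencies match the weights only under purely random sampling; adaptive algorithms such as TPE are expected to treat the weights as a starting prior and shift away from them as results come in.

```python
import random
from collections import Counter

# Hypothetical stand-ins for the three classifier sub-spaces; in hpsklearn
# each entry would be a search space rather than a plain string.
weighted_options = [
    (0.4, "random_forest"),
    (0.3, "svc"),
    (0.3, "knn"),
]

def weighted_choice(options, rng):
    """Pick one option with probability proportional to its weight."""
    weights = [w for w, _ in options]
    values = [v for _, v in options]
    return rng.choices(values, weights=weights, k=1)[0]

rng = random.Random(0)
n_draws = 10_000
counts = Counter(weighted_choice(weighted_options, rng) for _ in range(n_draws))
for name, n in counts.most_common():
    print(name, n / n_draws)
```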

Currently built-in sklearn classifiers

SVC
LinearSVC
KNeighborsClassifier
RandomForestClassifier
ExtraTreesClassifier
SGDClassifier
MultinomialNB
BernoulliRBM
ColumnKMeans

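More to come!

Preprocessing

You can also control which preprocessing steps are applied to your data. These are passed to HyperoptEstimator as a list. An empty list means that no preprocessing is done.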

from hpsklearn import HyperoptEstimator, pca

estim = HyperoptEstimator( preprocessing=[ pca('my_pca') ] )
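
The list semantics can be sketched in plain Python. This assumes the steps are applied in list order, which the text above does not state explicitly; `apply_steps` and `min_max_scale` are hypothetical helpers written for this illustration, not hpsklearn functions.

```python
def apply_steps(steps, rows):
    """Apply each preprocessing step in order; an empty list leaves rows unchanged."""
    for step in steps:
        rows = step(rows)
    return rows

def min_max_scale(rows):
    """Rescale each column to [0, 1]; constant columns map to 0."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]

data = [[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]]
unchanged = apply_steps([], data)             # no preprocessing
scaled = apply_steps([min_max_scale], data)   # a single scaling step
print(unchanged)
print(scaled)
```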

Currently built-in sklearn preprocessing steps

PCA
TfidfVectorizer
StandardScaler
MinMaxScaler
Normalizer
OneHotEncoder

More to come!

Installation

git clone https://github.com/hyperopt/hyperopt-sklearn.git
cd hyperopt-sklearn
pip install -e .

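Documentation

This project is under active development, and formal documentation is forthcoming. For now, have a look at the examples below or browse the source code.

This project is based on

Examples

An example on the MNIST digits dataset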

from hpsklearn import HyperoptEstimator, any_classifier
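from sklearn.datasets import fetch_openml
from hyperopt import tpe
import numpy as np

# Download the data and split into training and test sets

# fetch_mldata has been removed from scikit-learn; fetch_openml replaces it
digits = fetch_openml('mnist_784', version=1, as_frame=False)

X = digits.data
y = digits.target

test_size = int( 0.2 * len( y ) )
seed = 0  # any fixed integer works; the original left seed undefined
np.random.seed( seed )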
indices = np.random.permutation(len(X))
X_train = X[ indices[:-test_size]]
y_train = y[ indices[:-test_size]]
X_test = X[ indices[-test_size:]]
y_test = y[ indices[-test_size:]]

estim = HyperoptEstimator( classifier=any_classifier('clf'),  
                            algo=tpe.suggest, trial_timeout=300)

estim.fit( X_train, y_train )

print( estim.score( X_test, y_test ) )
# <<show score here>>
print( estim.best_model() )
# <<show model here>>
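
The train/test split in the example relies on slicing a shuffled index array with a negative offset. The sketch below reproduces that pattern with the standard library only; `shuffled_split` is a hypothetical helper, and the guard for an empty test set is an addition, since a slice such as `indices[:-0]` would otherwise return an empty training set.

```python
import random

def shuffled_split(n_samples, test_fraction, seed):
    """Return (train_indices, test_indices) taken from one seeded permutation."""
    test_size = int(test_fraction * n_samples)
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    if test_size == 0:
        # indices[:-0] is empty, so handle this case explicitly.
        return indices, []
    return indices[:-test_size], indices[-test_size:]

train_idx, test_idx = shuffled_split(n_samples=10, test_fraction=0.2, seed=0)
print(train_idx, test_idx)
```

Fixing the seed makes the split repeatable, so scores from different runs are computed on the same held-out samples.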

Not all of the classifiers in sklearn support sparse data. For convenience, hpsklearn provides any_sparse_classifier, which samples only from the available classifiers that accept sparse data.

from hpsklearn import HyperoptEstimator, any_sparse_classifier, tfidf
from sklearn.datasets import fetch_20newsgroups
from sklearn import metrics
from hyperopt import tpe
import numpy as np

# Download the data and split into training and test sets

train = fetch_20newsgroups( subset='train' )
test = fetch_20newsgroups( subset='test' )
X_train = train.data
y_train = train.target
X_test = test.data
y_test = test.target

estim = HyperoptEstimator( classifier=any_sparse_classifier('clf'), 
                            preprocessing=[tfidf('tfidf')],
                            algo=tpe.suggest, trial_timeout=300)

estim.fit( X_train, y_train )

print( estim.score( X_test, y_test ) )
# <<show score here>>
print( estim.best_model() )
# <<show model here>>

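Experimental results

Algorithm comparison

The tests were run on the 20 newsgroups dataset with 300 evaluations for each algorithm. The set of available classifiers consisted of support vector machines (SVM), k-nearest neighbors (KNeighborsClassifier), naive Bayes (MultinomialNB), and stochastic gradient descent (SGDClassifier). TfidfVectorizer was used for preprocessing in all cases.

Each algorithm was run several times (between 6 and 9) with different random seeds, and the validation score after each evaluation was recorded. The results are very noisy, because this is a difficult space to search. In each case a linear trend line was fit to the data and is overlaid on it in red. The results for TPE and random search are shown below.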

Validation averaged over multiple trials

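These are the linear trend lines found for each algorithm. They give a rough indication of how much each algorithm improves from one evaluation to the next, based on the kinds of points it finds. Random search shows no improvement, whereas the other algorithms tend to search more promising regions of the space as they gather more information.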

Trend of validation loss for each algorithm
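
A linear trend line of the kind described above can be obtained with an ordinary least-squares fit. The sketch below is one way to do it in plain Python; `fit_line` is a hypothetical helper, and the loss values are made up for illustration and are not data from these experiments.

```python
def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept, with x = 0, 1, 2, ..."""
    n = len(ys)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Made-up validation losses, one per evaluation (not experimental data).
losses = [0.50, 0.42, 0.47, 0.38, 0.40, 0.31, 0.35, 0.28]
slope, intercept = fit_line(losses)
print(slope, intercept)
```

When the fitted quantity is a loss, a negative slope means the search is improving as evaluations accumulate, and a slope near zero corresponds to the "no improvement" behavior reported for random search.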

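Comparison to default parameters

This example shows how to select parameters with hyperopt-sklearn and compares them with the default parameters chosen by scikit-learn. It indicates how much improvement is possible with roughly the same amount of code and without any expert domain knowledge.

The table below shows the F1 scores obtained by running classifiers on the 20 newsgroups dataset with scikit-learn's default parameters and with hyperopt-sklearn's optimized parameters. The hyperopt-sklearn results come from a single run of 25 evaluations.

Classifier	Default Parameters	Optimized Parameters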
SVM	0.0053	0.8369
SGD	0.8498	0.8538
KNN	0.6597	0.6741
MultinomialNB	0.7684	0.8344
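
For readers unfamiliar with the metric, the sketch below computes an F1 score by hand. The text does not say how the per-class scores were averaged for this multi-class dataset, so the macro average used here is an assumption, and the labels are a toy example rather than newsgroup data.

```python
def f1_score(y_true, y_pred, positive):
    """F1 for one class: the harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    labels = sorted(set(y_true))
    return sum(f1_score(y_true, y_pred, c) for c in labels) / len(labels)

# Toy labels for three classes.
y_true = ["a", "a", "b", "b", "c", "c"]
y_pred = ["a", "b", "b", "b", "c", "a"]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```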

from hpsklearn import HyperoptEstimator, svc
from sklearn import svm

# Load Data
# ...

if use_hpsklearn:
    estim = HyperoptEstimator( classifier=svc('mySVC') )
else:
    estim = svm.SVC( )

estim.fit( X_train, y_train )

print( estim.score( X_test, y_test ) )
# <<show score here>>