kmeanstf.kmeanstf.BaseKMeansTF

class kmeanstf.kmeanstf.BaseKMeansTF(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, max_mem=1800000000, tunnel=False, max_tunnel_iter=100, max_tunnel_moves_per_iter=10, criterion=1.0, local_trials=1, collect_history=False)

Base class for KMeansTF and TunnelKMeansTF

Note

To use this class directly, set the parameter tunnel to False (k-means) or True (tunnel k-means). The recommended usage is via the derived classes KMeansTF and TunnelKMeansTF.

Parameters:

n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
init ('random', 'k-means++' or array) – method of initialization
n_init (int) – number of runs of the initial k-means phase with different initializations (default 10). Only one tunnel phase is performed even if n_init is larger than 1.
max_iter (int) – Maximum number of Lloyd iterations for a single run of the k-means algorithm.
tol (float) – Relative tolerance with regards to inertia to declare convergence.
verbose (int) – Verbosity mode.
random_state (int or None) – None, or an integer used to seed the random number generators of Python, NumPy and TensorFlow
algorithm (str) – 'full'; no other values are implemented. Added for compatibility with sklearn. Attention: sklearn.KMeans has 'elkan' as default, which as of Dec 2019 interprets the tol parameter differently than 'full'; see https://github.com/scikit-learn/scikit-learn/issues/15831
max_mem (int) – maximum memory for GPU data (the default value suits a GTX 1060)
tunnel (boolean) – whether to perform tunnel k-means
max_tunnel_iter (int) – maximum number of tunnel iterations to perform
max_tunnel_moves_per_iter (int) – maximum number of centroids to move in one tunnel iteration
criterion (float) – initial required ratio error/utility (is increased adaptively)
local_trials (int) – how many times each tunnel move should be repeated with a different random offset vector (1 or larger)
collect_history (bool) – collect historic information on inertia, criterion, tunnel moves, codebooks

Variables:

cluster_centers_ (array, [n_clusters, n_features]) – Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.
labels_ (array, [n_samples]) – Label of each point, i.e. the index of its closest centroid
inertia_ (float) – Sum of squared distances of samples to their closest cluster center.
n_iter_ (int) – Number of iterations run.
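
The label and inertia definitions above can be illustrated with a small, dependency-free sketch. It is not the library's TensorFlow implementation; the helper name `assign_labels_and_inertia` and the toy data are assumptions made for illustration only.

```python
def assign_labels_and_inertia(samples, centers):
    """Return (labels, inertia) for samples given fixed cluster centers."""
    labels = []
    inertia = 0.0
    for x in samples:
        # Squared Euclidean distance from x to every center.
        d2 = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
        # The label is the index of the closest center.
        best = min(range(len(centers)), key=d2.__getitem__)
        labels.append(best)
        # Inertia accumulates the squared distance to that closest center.
        inertia += d2[best]
    return labels, inertia


# Hypothetical toy data: two well-separated groups in 2D.
samples = [[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 2.0]]
centers = [[0.0, 0.5], [4.0, 1.0]]
labels, inertia = assign_labels_and_inertia(samples, centers)
print(labels, inertia)
```

If the centers are moved after the labels were computed (for example because iteration stopped at max_iter), recomputing the labels this way may give a different assignment, which is the inconsistency the description of the cluster centers warns about.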

Methods

`__init__`([n_clusters, init, n_init, …])	Initialize self.
`fit`(X)	Compute k-means clustering.
`fit_predict`(X)	Compute cluster centers and predict cluster index for each sample.
`get_errs_and_utils`(X[, centroids])	Get error and utility values with respect to X.
`get_history`()	Get collected history data of performed run of fit().
`get_log`([abbr])	Get statistics of performed run of fit().
`get_params`()	Get params used to define class.
`get_system_status`([do_print])	Report the TensorFlow version and the availability of GPUs.
`predict`(X)	Predict the closest cluster each sample in X belongs to.
`self_test`([X, n_clusters, n_init, n, d, g, …])	self-testing routine
`set_random_seed`(seed)	setting random seed for tensorflow, python and numpy

fit(X)

Compute k-means clustering.

Parameters:	X (tensor) – samples

sets:

self.cluster_centers_

self.inertia_

fit_predict(X)

Compute cluster centers and predict cluster index for each sample.

Parameters:	X (tensor) – samples
Returns:	array of cluster indices

get_errs_and_utils(X, centroids=None)

Get error and utility values with respect to X.

Parameters:	X (tensor) – samples

Error and utility are computed for the given centroids or, if centroids is None, for self.cluster_centers_.

Returns:	errors (array), utilities (array)
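
This page does not define error and utility. The sketch below assumes the per-centroid definitions commonly used with utility-based codebook methods: the error of a centroid is the summed squared distance of the samples assigned to it, and its utility is the increase in total error that would result if the centroid were removed (each of its samples would fall back to its second-nearest centroid). The function name and data are hypothetical, and the library's actual computation may differ.

```python
def errors_and_utilities(samples, centers):
    """Per-centroid error and utility under the assumed definitions.

    Requires at least two centers, since utility needs a second-nearest one.
    """
    k = len(centers)
    errors = [0.0] * k
    utilities = [0.0] * k
    for x in samples:
        d2 = [sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers]
        order = sorted(range(k), key=d2.__getitem__)
        nearest, second = order[0], order[1]
        # Error: squared distance to the nearest centroid.
        errors[nearest] += d2[nearest]
        # Utility: extra error if the nearest centroid did not exist.
        utilities[nearest] += d2[second] - d2[nearest]
    return errors, utilities


# Hypothetical 1D example with two centroids.
samples = [[0.0], [1.0], [10.0]]
centers = [[0.5], [10.0]]
errors, utilities = errors_and_utilities(samples, centers)
print(errors, utilities)
```

Under this reading, a centroid with a large error and another with a small utility form a natural pair for a tunnel move, which is consistent with the criterion parameter being described as a ratio error/utility.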

get_history()

Get collected history data of performed run of fit().

(only present if collect_history == True)

Returns:	history (dict)

get_log(abbr=False)

Get statistics of performed run of fit()

Parameters:	abbr (bool) – return with abbreviated keys
Returns:	log (dict)

get_params()

Get params used to define class

Returns:	params (dict)

static get_system_status(do_print=False)

Report the TensorFlow version and the availability of GPUs.

Parameters:	do_print (bool) – also print the result

Example output (if do_print==True):

TENSORFLOW: 2.0.0
Physical GPUs: 1   Logical GPUs: 1

Returns:	dict with the TensorFlow version, the number of physical GPUs and the number of logical GPUs

predict(X)

Predict the closest cluster each sample in X belongs to.

Parameters:	X (tensor) – samples
Returns:	array of cluster indices

static self_test(X=None, n_clusters=100, n_init=10, n=10000, d=2, g=50, sig: float = None, verbose=0, stats_only=0, init='k-means++', plot=True, voro=True)

self-testing routine

Runs both k-means++ and tunnel k-means and prints the SSE improvement of tunnel k-means over k-means++ (in the scikit-learn implementation). Uses a Gaussian mixture distribution (default) or the provided data set X. Typical output:

Data is mixture of 50 Gaussians in unit square with sigma=0.00711
algorithm      | data.shape  |   k  | init      | n_init  |     SSE   | Runtime  | Improvement
---------------|-------------|------|-----------|---------|-----------|----------|------------
k-means++      | (10000, 2)  |  100 | k-means++ |      10 |   0.66179 |    2.09s | 0.00%
tunnel k-means | (10000, 2)  |  100 | random    |       1 |   0.63933 |    3.37s | 3.39%
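
The Improvement column in the output above matches the relative SSE reduction with respect to the k-means++ row. The short check below reproduces the 3.39% figure from the two SSE values shown; the formula is inferred from those numbers, not taken from the library source.

```python
# SSE values copied from the typical output above.
sse_kmeans_pp = 0.66179
sse_tunnel = 0.63933

# Relative reduction of SSE compared to the k-means++ baseline.
improvement = (sse_kmeans_pp - sse_tunnel) / sse_kmeans_pp
print(f"{improvement:.2%}")
```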

Parameters:

X – data set to use (as tensorflow or numpy array). If None, use mixture of Gaussians according to the other parameters
n_clusters (int) – the k in k-means
n_init (int) – number of runs with different initializations
n (int) – number of data points to generate
d (int) – number of features (dimensionality) of generated data points
g (int) – number of Gaussians
sig (float) – standard deviation of the Gaussians; if None, a value is chosen based on the number of Gaussians
init ('k-means++' or 'random') – initialization method for k-means (tunnel k-means is initialized as random)
plot (bool) – plot the result?
voro (bool) – show Voronoi regions in plot?
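
To make the roles of n, d, g and sig concrete, here is a minimal sketch of how such a test data set could be generated with the standard library. It assumes component means drawn uniformly from the unit hypercube and equal component weights; how self_test actually places the Gaussians and picks a default sig is not stated here, so the function `gaussian_mixture` and its `seed` argument are illustrative assumptions.

```python
import random


def gaussian_mixture(n=10000, d=2, g=50, sig=0.01, seed=0):
    """Draw n points in d dimensions from g isotropic Gaussians."""
    rng = random.Random(seed)
    # Assumed: component means are uniform in the unit hypercube.
    means = [[rng.random() for _ in range(d)] for _ in range(g)]
    points = []
    for _ in range(n):
        # Assumed: every component is equally likely.
        mean = rng.choice(means)
        points.append([rng.gauss(m, sig) for m in mean])
    return points


X = gaussian_mixture(n=1000, d=2, g=5, sig=0.01, seed=42)
print(len(X), len(X[0]))
```

A list of lists like this can be converted to a NumPy array or tensor before being passed as X.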

static set_random_seed(seed)

Set the random seed for TensorFlow, Python and NumPy.

Parameters:	seed (int) – random seed
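
As a rough illustration of what seeding all three generators involves, the sketch below uses the standard seeding calls of each library. It is not the library's own code: the helper name `seed_everything` is hypothetical, and NumPy and TensorFlow are seeded only if they are installed (TensorFlow 2.x is assumed for `tf.random.set_seed`).

```python
import random


def seed_everything(seed):
    """Seed Python's, NumPy's and TensorFlow's random number generators."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed.
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)  # assumes TensorFlow 2.x
    except ImportError:
        pass  # TensorFlow not installed; nothing to seed.


# Seeding twice with the same value reproduces the same random draw.
seed_everything(42)
a = random.random()
seed_everything(42)
b = random.random()
print(a == b)
```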