kmeanstf.kmeanstf.BaseKMeansTF

class kmeanstf.kmeanstf.BaseKMeansTF(n_clusters=8, init='k-means++', n_init=10, max_iter=300, tol=0.0001, verbose=0, random_state=None, max_mem=1800000000, tunnel=False, max_tunnel_iter=100, max_tunnel_moves_per_iter=10, criterion=1.0, local_trials=1, collect_history=False)

Base class for KMeansTF and TunnelKMeansTF

Note

To use it directly, set the parameter tunnel to False (k-means) or True (tunnel k-means). The recommended usage is via the derived classes KMeansTF and TunnelKMeansTF.

Parameters:
  • n_clusters (int) – The number of clusters to form as well as the number of centroids to generate.
  • init ('random', 'k-means++' or array) – method of initialization
  • n_init (int) – number of runs of the initial k-means phase with different initializations (default 10, per the signature above). Only one tunnel phase is performed even if n_init is larger than 1.
  • max_iter (int) – Maximum number of Lloyd iterations for a single run of the k-means algorithm.
  • tol (float) – Relative tolerance with regards to inertia to declare convergence.
  • verbose (int) – Verbosity mode.
  • random_state (int) – None, or an integer used to seed the random number generators of Python, NumPy and TensorFlow
  • algorithm (str) – “full”; no other values are implemented. Added for compatibility with sklearn. Attention: sklearn.KMeans has ‘elkan’ as default, which as of Dec 2019 interprets the ‘tol’ parameter differently than ‘full’; see https://github.com/scikit-learn/scikit-learn/issues/15831
  • max_mem (int) – max memory for GPU data (default value fits for GTX 1060)
  • tunnel (boolean) – perform tunnel k-means?
  • max_tunnel_iter (int) – maximum number of tunnel iterations to perform
  • max_tunnel_moves_per_iter (int) – maximum number of centroids to move in one tunnel iteration
  • criterion (float) – initial required ratio error/utility (is increased adaptively)
  • local_trials (int) – how many times each tunnel move should be repeated with a different random offset vector (1 or larger)
  • collect_history (bool) – collect historic information on inertia, criterion, tunnel moves, codebooks
Variables:
  • cluster_centers_ (array, [n_clusters, n_features]) – Coordinates of cluster centers. If the algorithm stops before fully converging (see tol and max_iter), these will not be consistent with labels_.
  • labels_ (array, [n_samples]) – Label of each point, i.e. the index of its closest centroid
  • inertia_ (float) – Sum of squared distances of samples to their closest cluster center.
  • n_iter_ (int) – Number of iterations run.
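The relationship between max_iter, tol and the fitted variables can be illustrated with a small pure-Python sketch of Lloyd's algorithm. This is not the library's TensorFlow implementation; the helper names (sq_dist, assign, lloyd) are hypothetical, initialization is plain random sampling, and the stopping rule shown (relative decrease of inertia at most tol) is only one plausible reading of "relative tolerance with regards to inertia".

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(X, centers):
    """Return (labels, inertia): index of the closest center per sample,
    and the sum of squared distances to those closest centers."""
    labels, inertia = [], 0.0
    for x in X:
        dists = [sq_dist(x, c) for c in centers]
        j = min(range(len(centers)), key=dists.__getitem__)
        labels.append(j)
        inertia += dists[j]
    return labels, inertia

def lloyd(X, n_clusters, max_iter=300, tol=1e-4, seed=0):
    """Illustrative Lloyd iterations (assumed stopping rule, see lead-in)."""
    rng = random.Random(seed)
    # 'random' initialization: pick n_clusters distinct samples as centers
    centers = [list(x) for x in rng.sample(X, n_clusters)]
    labels, inertia = assign(X, centers)
    n_iter = 0
    for n_iter in range(1, max_iter + 1):
        # Move each center to the mean of the samples assigned to it
        for j in range(n_clusters):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:  # keep the old center if its cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
        labels, new_inertia = assign(X, centers)
        # Assumption: stop when the relative decrease of inertia is <= tol
        converged = inertia - new_inertia <= tol * inertia
        inertia = new_inertia
        if converged:
            break
    return centers, labels, inertia, n_iter

X = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, labels, inertia, n_iter = lloyd(X, n_clusters=2)
print(centers, labels, inertia, n_iter)
```

The four return values correspond in spirit to the cluster centers, labels, inertia and iteration count listed above.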

Methods

__init__([n_clusters, init, n_init, …]) Initialize self.
fit(X) Compute k-means clustering.
fit_predict(X) Compute cluster centers and predict cluster index for each sample.
get_errs_and_utils(X[, centroids]) Get error and utility values wrt. X
get_history() Get collected history data of performed run of fit().
get_log([abbr]) Get statistics of performed run of fit()
get_params() Get params used to define class
get_system_status([do_print]) Report the TensorFlow version and the availability of GPUs.
predict(X) Predict the closest cluster each sample in X belongs to.
self_test([X, n_clusters, n_init, n, d, g, …]) Self-testing routine
set_random_seed(seed) Set the random seed for TensorFlow, Python and NumPy
fit(X)

Compute k-means clustering.

Parameters:X (tensor) – samples

sets:

  • self.cluster_centers_
  • self.inertia_
fit_predict(X)

Compute cluster centers and predict cluster index for each sample.

Parameters:X (tensor) – samples
Returns:array of cluster indices
get_errs_and_utils(X, centroids=None)

Get error and utility values wrt. X

Parameters:X (tensor) – samples

Error and utility are computed for the given centroids or, if centroids is None, for self.cluster_centers_.

Returns:errors (array), utilities (array)
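This page does not define "error" and "utility". The sketch below assumes the definitions commonly used for utility-based k-means variants: the error of a centroid is the summed squared distance of the samples closest to it, and its utility is the amount by which the total error would grow if that centroid were removed (each of its samples would fall back to its second-closest centroid). The function name errs_and_utils and the pure-Python style are illustrative only and may differ from what the library computes.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def errs_and_utils(X, centroids):
    """Per-centroid error and utility under the assumed definitions.
    Requires at least two centroids."""
    k = len(centroids)
    errors = [0.0] * k
    utilities = [0.0] * k
    for x in X:
        # Sort (distance, index) pairs to find closest and second-closest
        ranked = sorted((sq_dist(x, c), j) for j, c in enumerate(centroids))
        (d1, j1), (d2, _) = ranked[0], ranked[1]
        errors[j1] += d1          # contribution to the winner's error
        utilities[j1] += d2 - d1  # extra error if the winner were removed
    return errors, utilities

X = [(0, 0), (2, 0), (10, 0)]
centroids = [(1, 0), (10, 0)]
errors, utilities = errs_and_utils(X, centroids)
print(errors, utilities)  # [2.0, 0.0] [162.0, 81.0]
```

Under these assumptions, a centroid with low utility and a centroid with high error make a natural pair for a tunnel move, which is what the criterion parameter (ratio error/utility) appears to control.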
get_history()

Get the history data collected during the performed run of fit().

(only present if collect_history == True)

Returns:history (dict)
get_log(abbr=False)

Get statistics of the performed run of fit().

Parameters:abbr (bool) – return with abbreviated keys
Returns:log (dict)
get_params()

Get params used to define class

Returns:params (dict)
static get_system_status(do_print=False)

Report the TensorFlow version and the availability of GPUs.

Parameters:do_print (bool) – also print the result

Example output (if do_print==True):

TENSORFLOW: 2.0.0
Physical GPUs: 1   Logical GPUs: 1
Returns:dict with TensorFlow version, number of physical GPUs, number of logical GPUs
predict(X)

Predict the closest cluster each sample in X belongs to.

Parameters:X (tensor) – samples
Returns:array of cluster indices
static self_test(X=None, n_clusters=100, n_init=10, n=10000, d=2, g=50, sig: float = None, verbose=0, stats_only=0, init='k-means++', plot=True, voro=True)

Self-testing routine

Runs both k-means++ and tunnel k-means and prints the SSE improvement of tunnel k-means over k-means++ (in the scikit-learn implementation). Uses a Gaussian mixture distribution (default) or the provided data set X. Typical output:

Data is mixture of 50 Gaussians in unit square with sigma=0.00711
algorithm      | data.shape  |   k  | init      | n_init  |     SSE   | Runtime  | Improvement
---------------|-------------|------|-----------|---------|-----------|----------|------------
k-means++      | (10000, 2)  |  100 | k-means++ |      10 |   0.66179 |    2.09s | 0.00%
tunnel k-means | (10000, 2)  |  100 | random    |       1 |   0.63933 |    3.37s | 3.39%
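The Improvement column in the sample output is consistent with the relative SSE reduction measured against the k-means++ baseline. The formula below is inferred from the numbers shown, not taken from the library source, and the function name improvement_percent is hypothetical.

```python
def improvement_percent(sse_baseline, sse_tunnel):
    """Relative SSE reduction of tunnel k-means over the baseline, in percent."""
    return 100.0 * (sse_baseline - sse_tunnel) / sse_baseline

# Values copied from the sample output above
print(f"{improvement_percent(0.66179, 0.63933):.2f}%")  # 3.39%
```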
Parameters:
  • X – data set to use (as tensorflow or numpy array). If None, use mixture of Gaussians according to the other parameters
  • n_clusters (int) – the k in k-means
  • n_init (int) – number of runs with different initializations
  • n (int) – number of data points to generate
  • d (int) – number of features (dimensionality) of generated data points
  • g (int) – number of Gaussians
  • sig (float) – standard deviation of Gaussians, if ‘None’ a value is chosen based on number of Gaussians
  • init ('k-means++' or 'random') – initialization method for k-means (tunnel k-means is initialized as random)
  • plot (bool) – plot the result?
  • voro (bool) – show Voronoi regions in plot?
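For orientation, the default test data ("mixture of g Gaussians in unit square") could be generated roughly as follows. This is a sketch under assumptions: the function name gaussian_mixture is hypothetical, all Gaussians are given equal weight, and sig must be passed explicitly because the rule self_test uses to pick sig when it is None is not documented here.

```python
import random

def gaussian_mixture(n=10000, d=2, g=50, sig=0.01, seed=0):
    """Draw n points in d dimensions from g isotropic Gaussians whose
    means are uniformly distributed in the unit hypercube."""
    rng = random.Random(seed)
    means = [[rng.random() for _ in range(d)] for _ in range(g)]
    points = []
    for _ in range(n):
        mean = rng.choice(means)  # assumption: equal mixture weights
        points.append([rng.gauss(m, sig) for m in mean])
    return points

data = gaussian_mixture(n=1000, d=2, g=50, sig=0.00711)
print(len(data), len(data[0]))  # 1000 2
```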
static set_random_seed(seed)

Set the random seed for TensorFlow, Python and NumPy

Parameters:seed (int) – random seed
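Seeding all three generators typically amounts to something like the sketch below. The helper name seed_everything is hypothetical and this is not the library's own code; NumPy and TensorFlow are seeded only if they can be imported, so the snippet also runs in an environment that has neither.

```python
import random

def seed_everything(seed):
    """Seed Python's, NumPy's and TensorFlow's random number generators."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed, nothing to seed
    try:
        import tensorflow as tf
        tf.random.set_seed(seed)  # TensorFlow 2.x global seed
    except ImportError:
        pass  # TensorFlow not installed, nothing to seed

seed_everything(42)
first = random.random()
seed_everything(42)
second = random.random()
print(first == second)  # True: the same seed reproduces the same draw
```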