AMozeika/clustering

clustering

This repository contains C code that clusters data (and infers the number of clusters in the data), R code that processes the clustering results, and an example data set together with its clustering results.

The C code implements a parallelised version of the efficient population dynamics algorithm developed for model-based Bayesian clustering in https://arxiv.org/abs/1810.02627, which assumes Gaussian-distributed data. To compile on a multiprocessor (Linux) machine, the command "gcc -Wall -fopenmp PopulDynamClustV5v4.c -lm -O3 -o populdynam;" was used in the terminal. To run on a multiprocessor (Linux) machine, the command "date; ./populdynam<parameters.in>parameters.out; date;" was used in the terminal. The program reads its parameters from standard input (here redirected from parameters.in) and writes to standard output (here redirected to parameters.out); the two "date" calls only print the start and end times of the run.

The data must be in tab-separated csv format (and transformed beforehand, if needed). 10dL2c0.csv is a sample of correlated data in 10 dimensions with 2 clusters (the clusters have the same mean vectors and different random covariance matrices). 10dL2c0K10.in is the *.in file for this data.

The meaning of the numbers "99191 20000 10 1 10 100 1000 10dL2c0.csv 0 nofile" in this file is given in the first line of the 10dL2c0K10.out file: "#seed=99191 N=20000 d=10 K1=1 K2=10 rest.=100 t_max=1000 data-file=10dL2c0.csv". Here "seed=99191" is the seed for the random number generator, "N=20000" is the sample size, "d=10" is the dimension of the data, "K1=1 K2=10" is the range of the number of clusters to consider, "rest.=100" is the number of restarts of the algorithm, "t_max=1000" is the maximum allowed "time" parameter, and "data-file=10dL2c0.csv" is the name of the data file. The last two entries, "0 nofile", are always the same.

The upper bound on the computational complexity is proportional to $(K2-K1)\times t_{max}\times \mbox{rest.}\times N$.
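As an illustration of the parameter line described above, the following is a minimal C sketch that reads the ten whitespace-separated fields and evaluates the complexity expression. It is not taken from PopulDynamClustV5v4.c: the struct `run_params`, the functions `parse_run_params` and `complexity_bound`, and the 255-character limit on file names are all assumptions made for this example.

```c
#include <stdio.h>

/* Illustrative container for one parameter line; the field names follow the
 * header line of the *.out file. Hypothetical type, not from the repository. */
struct run_params {
    long seed;            /* seed for the random number generator */
    long N;               /* sample size */
    int  d;               /* dimension of the data */
    int  K1, K2;          /* range of the number of clusters to consider */
    long restarts;        /* "rest.": number of restarts */
    long t_max;           /* maximum allowed "time" parameter */
    char data_file[256];  /* data-file name (length limit is an assumption) */
    int  flag;            /* ninth field, "0" in the example */
    char extra[256];      /* tenth field, "nofile" in the example */
};

/* Parse a line such as
 *   "99191 20000 10 1 10 100 1000 10dL2c0.csv 0 nofile"
 * Returns 0 on success and -1 if fewer than ten fields could be read. */
int parse_run_params(const char *line, struct run_params *p)
{
    int n = sscanf(line, "%ld %ld %d %d %d %ld %ld %255s %d %255s",
                   &p->seed, &p->N, &p->d, &p->K1, &p->K2,
                   &p->restarts, &p->t_max, p->data_file,
                   &p->flag, p->extra);
    return n == 10 ? 0 : -1;
}

/* Evaluate (K2 - K1) * t_max * rest. * N, the quantity the upper bound is
 * proportional to. Computed in double to avoid integer overflow. */
double complexity_bound(const struct run_params *p)
{
    return (double)(p->K2 - p->K1) * (double)p->t_max
         * (double)p->restarts * (double)p->N;
}
```

Calling `parse_run_params` on the example line and then `complexity_bound` on the result gives 9 × 1000 × 100 × 20000 = 1.8 × 10^10, which is the scale of the worst-case work (up to a constant factor) for the 10dL2c0K10.in run.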

The clustering algorithm produces files that can be processed by the R code. For the data file "10dL2c0.csv", the R scripts are "10dL2c0.csv.cluster.statistics.r" and "10dL2c0.csv.clustering.statistics.r". The R code takes the clustering results and produces a number of *.tex and image files. One then opens the "10dL2c0.csv.cluster.statistics.tex" and "10dL2c0.csv.clustering.statistics.tex" files produced by the R code and compiles them in a TeX editor. This yields the presentations "10dL2c0.csv.cluster.statistics.pdf" and "10dL2c0.csv.clustering.statistics.pdf", which contain statistics of the clusters and of the clustering, respectively.
