Merge pull request #64 from linqiaozhi/master
Support variable degree of freedom, better documentation, added license
linqiaozhi authored Feb 8, 2019
2 parents 3e78adf + e960b3c commit 1647095
Showing 11 changed files with 1,591 additions and 241 deletions.
135 changes: 135 additions & 0 deletions LICENSE.txt
@@ -0,0 +1,135 @@
Different attribution requirements and conditions apply to different files in this repository:

========================================================================================
The following license applies to the following files in the src directory: tsne.cpp, tsne.h, sptree.cpp, sptree.h, vptree.h

Copyright (c) 2014, Laurens van der Maaten (Delft University of Technology)
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
1. Redistributions of source code must retain the above copyright
notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
notice, this list of conditions and the following disclaimer in the
documentation and/or other materials provided with the distribution.
3. All advertising materials mentioning features or use of this software
must display the following acknowledgement:
This product includes software developed by the Delft University of Technology.
4. Neither the name of the Delft University of Technology nor the names of
its contributors may be used to endorse or promote products derived from
this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY LAURENS VAN DER MAATEN ''AS IS'' AND ANY EXPRESS
OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO
EVENT SHALL LAURENS VAN DER MAATEN BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY
OF SUCH DAMAGE.


========================================================================================
The following license applies to the following files in the src directory: nbodyfft.h, nbodyfft.cpp, parallel_for.h, time_code.h

(The MIT License)

Copyright (c) [2019] [George Linderman]

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

========================================================================================
The following license applies to following files in the progress_bar directory: ProgressBar.hpp

(The MIT License)

Copyright (c) 2016 Prakhar Srivastav <[email protected]>

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
'Software'), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

========================================================================================
The following license applies to following files in the src directory: annoylib.h


Copyright (c) 2013 Spotify AB

Licensed under the Apache License, Version 2.0 (the "License"); you may not
use this file except in compliance with the License. You may obtain a copy of
the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
License for the specific language governing permissions and limitations under
the License.

========================================================================================
The following license applies to all files in the src/winlibs/fftw directory

FFTW is Copyright © 2003, 2007-11 Matteo Frigo, Copyright © 2003, 2007-11
Massachusetts Institute of Technology.

FFTW is free software; you can redistribute it and/or modify it under the terms
of the GNU General Public License as published by the Free Software Foundation;
either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program; if not, write to the Free Software Foundation, Inc., 51 Franklin
Street, Fifth Floor, Boston, MA 02110-1301 USA You can also find the GPL on the
GNU web site.

In addition, we kindly ask you to acknowledge FFTW and its authors in any
program or publication in which you use FFTW. (You are not required to do so;
it is up to your common sense to decide whether you want to comply with this
request or not.) For general publications, we suggest referencing: Matteo Frigo
and Steven G. Johnson, “The design and implementation of FFTW3,” Proc. IEEE 93
(2), 216–231 (2005).

Non-free versions of FFTW are available under terms different from those of the
General Public License. (e.g. they do not require you to accompany any object
code using FFTW with the corresponding source code.) For these alternative
terms you must purchase a license from MIT’s Technology Licensing Office. Users
interested in such a license should contact us ([email protected]) for more
information.
30 changes: 19 additions & 11 deletions README.md
@@ -1,17 +1,25 @@
# FFT-accelerated Interpolation-based t-SNE (FIt-SNE)
## Introduction
t-Stochastic Neighborhood Embedding ([t-SNE](https://lvdmaaten.github.io/tsne/)) is a highly successful method for dimensionality reduction and visualization of high dimensional datasets. A popular [implementation](https://github.com/lvdmaaten/bhtsne) of t-SNE uses the Barnes-Hut algorithm to approximate the gradient at each iteration of gradient descent. We modified this implementation as follows:
t-Stochastic Neighborhood Embedding ([t-SNE](https://lvdmaaten.github.io/tsne/)) is a highly successful method for dimensionality reduction and visualization of high dimensional datasets. A popular [implementation](https://github.com/lvdmaaten/bhtsne) of t-SNE uses the Barnes-Hut algorithm to approximate the gradient at each iteration of gradient descent. We accelerated this implementation as follows:

* Computation of the N-body Simulation: Instead of approximating the N-body simulation using Barnes-Hut, we interpolate onto an equispaced grid and use FFT to perform the convolution, dramatically reducing the time to compute the gradient at each iteration of gradient descent. See [this](http://gauss.math.yale.edu/~gcl22/blog/numerics/low-rank/t-sne/2018/01/11/low-rank-kernels.html) post for some intuition on how it works.
* Computation of Input Similiarities: Instead of computing nearest neighbors using vantage-point trees, we approximate nearest neighbors using the [Annoy](https://github.com/spotify/annoy) library. The neighbor lookups are multithreaded to take advantage of machines with multiple cores. Using "near" neighbors as opposed to strictly "nearest" neighbors is faster, but also has a smoothing effect, which can be useful for embedding some datasets (see [Linderman et al. (2017)](https://arxiv.org/abs/1711.04712)). If subtle detail is required (e.g. in identifying small clusters), then use vantage-point trees (which is also multithreaded in this implementation).
* Early exaggeration: In [Linderman and Steinerberger (2017)](https://arxiv.org/abs/1706.02582), we showed that appropriately choosing the early exaggeration coefficient can lead to improved embedding of swissrolls and other synthetic datasets.
* Late exaggeration: Increasing the exaggeration coefficient late in the optimization process (e.g. after 800 of 1000 iterations) can improve separation of the clusters.
* Computation of Input Similarities: Instead of computing nearest neighbors using vantage-point trees, we approximate nearest neighbors using the [Annoy](https://github.com/spotify/annoy) library. The neighbor lookups are multithreaded to take advantage of machines with multiple cores. Using "near" neighbors as opposed to strictly "nearest" neighbors is faster, but also has a smoothing effect, which can be useful for embedding some datasets (see [Linderman et al. (2017)](https://arxiv.org/abs/1711.04712)). If subtle detail is required (e.g. in identifying small clusters), then use vantage-point trees (which are also multithreaded in this implementation).
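
To give some intuition for the grid-and-FFT idea in the first bullet, here is a minimal 1D sketch, not the FIt-SNE code itself. It assumes the charges already sit on an equispaced grid (the interpolation step is omitted) and uses the Cauchy kernel of standard t-SNE. The function names `cauchy_kernel`, `direct_sum`, and `fft_sum` are hypothetical. On an equispaced grid the kernel matrix is Toeplitz, so the matrix-vector product can be computed by embedding it in a circulant matrix and applying the FFT.

```python
import numpy as np


def cauchy_kernel(d):
    """Cauchy kernel 1 / (1 + d^2), evaluated elementwise."""
    return 1.0 / (1.0 + d * d)


def direct_sum(grid, charges):
    """O(n^2) reference: out[i] = sum_j K(grid[i] - grid[j]) * charges[j]."""
    diff = grid[:, None] - grid[None, :]
    return cauchy_kernel(diff) @ charges


def fft_sum(grid, charges):
    """Same sum in O(n log n), using a circulant embedding of the Toeplitz kernel matrix."""
    n = grid.size
    h = grid[1] - grid[0]  # grid spacing; the grid is assumed equispaced
    # First column of a 2n x 2n circulant matrix: offsets 0..n-1, one padding
    # entry that is never used, then offsets -(n-1)..-1.
    offsets = np.concatenate([np.arange(n), [0], np.arange(-(n - 1), 0)]) * h
    column = cauchy_kernel(offsets)
    padded = np.concatenate([charges, np.zeros(n)])
    # Circular convolution via FFT; the first n entries are the Toeplitz product.
    out = np.fft.irfft(np.fft.rfft(column) * np.fft.rfft(padded), 2 * n)
    return out[:n]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grid = np.linspace(-1.0, 1.0, 64)
    charges = rng.standard_normal(64)
    print(np.max(np.abs(direct_sum(grid, charges) - fft_sum(grid, charges))))
```

The two functions should agree up to floating-point error; the saving comes from replacing the dense `n x n` product with three FFTs of length `2n`.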


Check out our [preprint](https://arxiv.org/abs/1712.09005) for more details and some benchmarks.

R, Matlab, and Python wrappers are `fast_tsne.R`, `fast_tsne.m`, and `fast_tsne.py` respectively. [Gioele La Manno](https://twitter.com/GioeleLaManno) implemented a Python (Cython) wrapper, which is available on PyPI [here](https://pypi.python.org/pypi/fitsne).
## Features
Additionally, this implementation includes the following features:
* Early exaggeration: In [Linderman and Steinerberger (2017)](https://arxiv.org/abs/1706.02582), we showed that appropriately choosing the early exaggeration coefficient can lead to improved embedding of swissrolls and other synthetic datasets. Early exaggeration is built into all t-SNE implementations; here we highlight its importance as a parameter.
* Late exaggeration: Increasing the exaggeration coefficient late in the optimization process can improve separation of the clusters. [Kobak and Berens (2018)](https://www.biorxiv.org/content/10.1101/453449v1) suggest starting late exaggeration immediately following early exaggeration.
* Initialization: Custom initialization can be provided from Python/Matlab/R. As suggested by [Kobak and Berens (2018)](https://www.biorxiv.org/content/10.1101/453449v1), initializing t-SNE with the first two principal components (scaled to have standard deviation 0.0001) results in an embedding which often preserves the global structure more effectively than the default random initialization. See there for other initialisation tricks.
* Variable degrees of freedom: [Kobak et al. (2019)]() show that decreasing the degree of freedom (df) of the t-distribution (resulting in heavier tails) reveals fine structure that is not visible in standard t-SNE.
* Perplexity combination: The perplexity parameter determines the width of the Gaussian kernel, such that small perplexity values uncover local structure while larger values reveal global structure. [Kobak and Berens (2018)](https://www.biorxiv.org/content/10.1101/453449v1) show that using a combination of several perplexity values, resulting in a multi-scale embedding, can be useful.
* All optimisation parameters can be controlled from Python/Matlab/R. For example, [Belkina et al. (2018)](https://www.biorxiv.org/content/10.1101/451690v2) highlight the importance of increasing the learning rate when embedding large data sets.
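
As an illustration of the initialization bullet above, the following is a minimal sketch of a PCA-based starting embedding built with NumPy alone. The function name `pca_init` is hypothetical, and rescaling so that the first component has standard deviation 0.0001 is one reading of "scaled to have standard deviation 0.0001"; the wrappers may expect something slightly different.

```python
import numpy as np


def pca_init(X, n_components=2, target_std=1e-4):
    """Return the leading principal components of X, rescaled for use as an initial embedding.

    Assumption: the scaling is chosen so that the first component has
    standard deviation `target_std`.
    """
    Xc = X - X.mean(axis=0)  # center each feature
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    Y = U[:, :n_components] * S[:n_components]  # principal component scores
    return Y / np.std(Y[:, 0]) * target_std


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))
    Y0 = pca_init(X)
    print(Y0.shape, np.std(Y0[:, 0]))
```

The resulting array has one row per data point and can then be passed to whichever wrapper argument accepts a custom initialization.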


## Installation
R, Matlab, and Python wrappers are `fast_tsne.R`, `fast_tsne.m`, and `fast_tsne.py` respectively. Each of these wrappers can be used after installing FFTW and compiling the C++ code, as below. [Gioele La Manno](https://twitter.com/GioeleLaManno) implemented a Python (Cython) wrapper, which is available on PyPI [here](https://pypi.python.org/pypi/fitsne).

**Note:** If you update to a new version of FIt-SNE using `git pull`, be sure to recompile.

@@ -34,18 +42,18 @@ If you would like to compile it yourself see below. The code has been currently
2. Copy the binary file (e.g. `x64/Debug/FItSNE.exe`) generated by the build process to the `bin/` folder
3. For Windows, we have added all dependencies, including the [FFTW library](http://www.fftw.org/), which is distributed under the GNU General Public License. For the binary to find the FFTW DLLs, you need to either add `src/winlibs/fftw/` to your PATH, or to copy the DLLs into the `bin/` directory.

As of this commit, only the R wrapper properly calls the Windows executable.
The Python and Matlab wrappers can be trivially changed to call it (just
changing `bin/fast_tsne` to `bin/FItSNE.exe` in the code), and will be changed
in future commits.
As of this commit, only the R wrapper properly calls the Windows executable. The Python and Matlab wrappers can be trivially changed to call it (just changing `bin/fast_tsne` to `bin/FItSNE.exe` in the code), and will be changed in future commits.
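
The path change described above could be sketched in Python roughly as follows. This is a hypothetical helper, not code from `fast_tsne.py`: the function name `fitsne_binary` and its arguments are assumptions, and only the two binary paths `bin/fast_tsne` and `bin/FItSNE.exe` come from the text.

```python
import os
import sys


def fitsne_binary(root=".", platform=None):
    """Pick the FIt-SNE executable under `root` for the given platform.

    `platform` defaults to sys.platform; values starting with "win"
    select the Windows executable, anything else the Unix binary.
    """
    platform = sys.platform if platform is None else platform
    name = "FItSNE.exe" if platform.startswith("win") else "fast_tsne"
    return os.path.join(root, "bin", name)


if __name__ == "__main__":
    print(fitsne_binary())
```

A wrapper could call such a helper once and pass the result to its subprocess invocation instead of hard-coding `bin/fast_tsne`.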

Many thanks to [Josef Spidlen](https://github.com/jspidlen) for this Windows implementation!

## References
If you use our software, please cite:
## Acknowledgements and References
We are grateful to members of the community who have [contributed](https://github.com/KlugerLab/FIt-SNE/graphs/contributors) to improving FIt-SNE, especially [Dmitry Kobak](https://github.com/dkobak), [Pavlin Poličar](https://github.com/pavlin-policar), and [Josef Spidlen](https://github.com/jspidlen).

If you use FIt-SNE, please cite:

George C. Linderman, Manas Rachh, Jeremy G. Hoskins, Stefan Steinerberger, Yuval Kluger. (2017). Efficient Algorithms for t-distributed Stochastic Neighborhood Embedding. *arXiv:1712.09005* ([link](https://arxiv.org/abs/1712.09005))

Our implementation is derived from the Barnes-Hut implementation:

Laurens van der Maaten (2014). Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15(1):3221–3245. ([link](https://dl.acm.org/citation.cfm?id=2627435.2697068))

1,048 changes: 957 additions & 91 deletions examples/test.ipynb

Large diffs are not rendered by default.
