Comparison of Data Visualization, Outlier Detection and Data Dimensionality Reduction Methods

Authors

  • Xingyu Zhao

DOI:

https://doi.org/10.54097/wgchmc87

Keywords:

Data Visualization, Data Dimensionality Reduction, FPS, PCA, t-SNE.

Abstract

As data volumes grow in the digital age, everyday data is becoming harder to quantify and process, which makes the choice of data processing methods increasingly important. This paper compares methods for data visualization, dimensionality reduction, and outlier detection. Two different types of datasets, ModelNet40 and red wine quality, are used to introduce visualization based on Farthest Point Sampling (FPS). This method gives a clear visual impression of a dataset's dimensionality and scale, and lets users observe the structure, type, and scale of the data. For dimensionality reduction, the study compares Principal Component Analysis (PCA), t-Distributed Stochastic Neighbor Embedding (t-SNE), Triplets Manifold Approximation and Projection (TriMAP), Uniform Manifold Approximation and Projection (UMAP), Pairwise Controlled Manifold Approximation Projection (PaCMAP), and an autoencoder. The comparison shows that each method performs differently depending on the dataset, so choosing an appropriate method yields better results with less effort. Finally, the paper addresses outlier detection. Outliers make data harder to process and make subsequent results inaccurate, so identifying them is necessary. The methods considered include isolation forest and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). The paper analyzes and summarizes the methods applied to the different datasets.
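To make the sampling step named in the abstract concrete, the following is a minimal sketch of greedy Farthest Point Sampling in plain Python. It is not the paper's implementation: the function name `farthest_point_sampling`, the `start` parameter, and the toy point cloud are all assumptions made for illustration, and Euclidean distance is assumed as the metric.

```python
import math
import random


def farthest_point_sampling(points, k, start=0):
    """Greedy FPS sketch: return k indices into `points`, each new index
    being the point farthest from every point chosen so far."""
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(points)")
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # The next sample is the point with the largest nearest-chosen distance.
        nxt = max(range(n), key=min_dist.__getitem__)
        chosen.append(nxt)
        # Update nearest-chosen distances against the newly added point.
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen


# Hypothetical toy data: a dense cluster near the origin plus three far corners.
rng = random.Random(0)
cloud = [(rng.uniform(0, 0.1), rng.uniform(0, 0.1)) for _ in range(50)]
cloud += [(10.0, 0.0), (0.0, 10.0), (10.0, 10.0)]

sample = farthest_point_sampling(cloud, k=4)
print(sample)
```

With this toy input, the four selected indices cover the starting cluster point and the three distant corners, which illustrates why FPS tends to preserve the overall extent of a dataset better than uniform random sampling. This version costs O(n·k) distance evaluations; a vectorized array library would be the usual choice for point clouds the size of ModelNet40.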

Published

13-03-2024

How to Cite

Zhao, X. (2024). Comparison of Data Visualization, Outlier Detection and Data Dimensionality Reduction Methods. Highlights in Science, Engineering and Technology, 85, 1141-1149. https://doi.org/10.54097/wgchmc87