Machine Learning Zuihitsu — VI

Eren Ünlü
6 min read · Jan 5, 2022

On the Sequential Treatment of Tabular Data

Dr. Eren Unlu, Data Scientist and Machine Learning Engineer @Datategy, Paris

Important Post-Edit: I realized that I made some fundamental mistakes and, due to my limited time, left out certain points. So please read the article first, and then consider these points.

1. I included the response variable in the UMAP reduction phase, which is a no-no, of course.

2. If a sequential treatment such as the one proposed in this article were to be used for tabular regression or classification, the most proper approach would be to infer out-of-sample points together with the training dataset. In other words, after the RNN model is trained, the UMAP values calculated for the training and test points should be concatenated and reordered, and then all the points should be inferred. Of course, this is not computationally practical; however, we would just like to demonstrate a basic and fun tabular experiment. Note that the proposed approach somewhat resembles neighbor-based regression algorithms such as k-NN, in that we leverage the proximity information between data points, here with the backing of recurrent neural networks.

3. Only a single sequential step (time step) is considered here. Longer sequences could be evaluated, but this is not obligatory.
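The merge-and-reorder step from point 2 can be sketched as follows. This is only a minimal illustration, not the code used for the article: the function name `merge_and_order` and the toy values are hypothetical, and plain Python lists stand in for the arrays one would normally use.

```python
def merge_and_order(train_values, test_values):
    """Merge 1-D embedding values of train and test points and sort them.

    Returns a list of (value, source, index) tuples in ascending order of
    value, where source is "train" or "test" and index is the position of
    the point inside its own set.
    """
    tagged = [(v, "train", i) for i, v in enumerate(train_values)]
    tagged += [(v, "test", i) for i, v in enumerate(test_values)]
    # Sort on the embedding value only; Python's sort is stable, so ties
    # keep train points ahead of test points.
    return sorted(tagged, key=lambda t: t[0])


# Toy 1-D embedding values (made up for illustration).
train_values = [0.9, 0.1, 0.5]
test_values = [0.3, 0.7]

ordered = merge_and_order(train_values, test_values)
# Positions in the merged sequence that belong to test points; these are
# the positions whose predictions would be read off after inference.
test_positions = [pos for pos, (_, src, _) in enumerate(ordered) if src == "test"]
```

With this layout, each test point sits between its nearest training neighbors in the sequence, which is where the k-NN flavor mentioned above comes from.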

I was working with transformer-based tabular data models for a particular project at hand. Such models, for instance Amazon’s TabTransformer [1] and Google’s TabNet [2], are gaining traction in the field nowadays. Any revisit of tabular data with a different touch, or a completely novel approach to it, always fascinates me. Therefore, I try to keep up with the latest developments on the topic. At the end of the day, everything in machine learning, from any type of computer vision application to NLP, is somehow a tabular regression at the lowest level. Add to this the hard fact that tabular data still reigns in the data world, being the most popular type of information scheme out there in the market.

So the idea sparked in my head: can we use recurrent neural networks for tabular data? I am sure this idea is not innovative at all and that people have investigated it extensively. I did not even bother to check Google Scholar, where I am sure academics have produced some spectacular results. I had limited time and just wanted to give the first idea that came to my mind a shot and share the process with you.

As I said, this is a quick shot, so I won’t be dealing with hyper-parameter optimization and the like. We will do everything in one go, in turbo mode, but in a decent and proper manner, complying with all the rules of a basic data science pipeline. So, let’s begin!

We will be using the Boston House Price Dataset [3], one of the most popular datasets in the wild. Using certain attributes of the neighborhood and features of a house, we want to predict its value. We have 13 input features, most of them numerical.

In order to use recurrent neural networks on tabular data, we first need to induce sequentiality in it. So, rather than the use of recurrent networks, the real crux of this approach is the method of tabular sequentialization. As I said, I am not trying to present a proper scientific study or a novel breakthrough; I am just ‘coding aloud’ with you.

OK. So how can we inject sequentiality into a basic tabular dataset that inherently has zero spatial coherence in the ‘positions’ of its data points? There is no notion of neighborhood; the first data point has nothing to do statistically with the second one, and so on. I propose to reduce the dataset to one dimension and sort the indices according to these values. We will perform this operation with the current head of the dimensionality reduction leaderboard, UMAP [4][6]. The Uniform Manifold Approximation and Projection (UMAP) algorithm has proven efficient and robust, preserving local and global topology well. As with other manifold approximation techniques, the main goal is to preserve the relative distances between data points when projecting them into lower dimensions.

After separating the training and test instances, we concatenate the input features and the output before reducing them to a single dimension. Note that we make sure there is no test leakage at any step.
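The leakage-free part of this step can be sketched as below. Several things here are assumptions rather than the article's actual code: `SumReducer` is a hypothetical stand-in that is not UMAP, `split_rows` and the toy table are made up, and the sketch reduces the input features only, following the post-edit note about keeping the response variable out of the reduction.

```python
import random


class SumReducer:
    """Hypothetical stand-in for a 1-D reducer with a fit/transform API.

    It is NOT UMAP: it centers each feature on the training mean and sums
    the result. With umap-learn installed, `umap.UMAP(n_components=1)`
    offers the same fit/transform methods and could be dropped in instead.
    """

    def fit(self, rows):
        n = len(rows)
        self.means_ = [sum(col) / n for col in zip(*rows)]
        return self

    def transform(self, rows):
        return [sum(x - m for x, m in zip(row, self.means_)) for row in rows]


def split_rows(rows, test_fraction=0.25, seed=0):
    """Shuffle row indices and split them into train and test index lists."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(len(rows) * test_fraction))
    return idx[n_test:], idx[:n_test]


# Toy feature table: 8 rows, 2 features (made-up numbers).
X = [[1.0, 2.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0],
     [4.0, 5.0], [5.0, 4.0], [0.0, 1.0], [7.0, 7.0]]

train_idx, test_idx = split_rows(X)
X_train = [X[i] for i in train_idx]
X_test = [X[i] for i in test_idx]

reducer = SumReducer().fit(X_train)      # statistics come from train only
z_train = reducer.transform(X_train)     # one scalar per training row
z_test = reducer.transform(X_test)       # test rows reuse the fitted reducer
```

The point of the design is simply that `fit` never sees the test rows. If a real UMAP reducer is swapped in, its output has shape (n, 1) and would need flattening to get one scalar per row.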

As mentioned previously, we don’t seek to maximize efficiency, so the choice of hyper-parameters for all models, such as the number of neighbors for UMAP, is quasi-arbitrary.

Now each data point in our system is represented by a single scalar value. So how will we impose sequentiality? We sort the data points according to their UMAP values. Let’s visualize the sorted data points, with certain features highlighted by a color mapping based on their raw values. Note that both training and test points are included, which supports the validity of the ordering.
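The sorting step itself is tiny. Here is a minimal sketch with made-up values; the helper name `order_by_embedding` is hypothetical and plain lists are used instead of arrays.

```python
def order_by_embedding(values):
    """Return row indices sorted by ascending 1-D embedding value."""
    return sorted(range(len(values)), key=lambda i: values[i])


# Toy embedding values and targets (made up for illustration).
z = [0.42, -1.30, 2.10, 0.05]
y = [21.0, 13.5, 34.2, 18.9]

order = order_by_embedding(z)
z_sorted = [z[i] for i in order]
y_sorted = [y[i] for i in order]   # targets follow the same permutation
```

With NumPy arrays, the same permutation would come from `numpy.argsort`. The important detail is that features, targets and embedding values are all reordered with the one index list.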

As you can see above, the UMAP ordering works quite well in clustering neighborhoods semantically.

Now that we have a sequential context, we can use recurrent neural networks to predict the y values. We just need to reshape the dataset accordingly. Note that I will also include the UMAP values as an additional feature. I prefer to use a bidirectional LSTM with 2 stacked recurrent layers.
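The reshaping can be sketched as follows, assuming the rows are already sorted by their embedding values. The helper `to_sequences` and the toy numbers are hypothetical; the windowed variant with more than one time step is an illustration of what a longer sequence could look like, not something used in the article.

```python
def to_sequences(rows, values, timesteps=1):
    """Build (samples, timesteps, features + 1) nested lists.

    `rows` must already be sorted by `values`. Each sample is a window of
    `timesteps` consecutive rows, and the embedding value is appended to
    every row as an extra feature. Sample k ends at row k + timesteps - 1,
    so its target would be the y of that last row.
    """
    augmented = [list(r) + [v] for r, v in zip(rows, values)]
    return [augmented[k:k + timesteps]
            for k in range(len(augmented) - timesteps + 1)]


# Toy sorted table: 4 rows, 2 features, plus their embedding values.
X_sorted = [[0.1, 1.0], [0.2, 1.1], [0.4, 0.9], [0.8, 1.5]]
z_sorted = [-1.3, 0.05, 0.42, 2.1]

single_step = to_sequences(X_sorted, z_sorted, timesteps=1)  # shape (4, 1, 3)
windowed = to_sequences(X_sorted, z_sorted, timesteps=2)     # shape (3, 2, 3)
```

The nested list has the (samples, time steps, features) layout that recurrent layers in common deep learning frameworks typically expect, so converting it to a 3-D array should be all that is left before feeding an LSTM.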

As you can see below, we have a decent outcome, considering that we did everything in one shot without any optimization.

Thank you and see you soon!

References

[1] Huang, Xin, et al. “Tabtransformer: Tabular data modeling using contextual embeddings.” arXiv preprint arXiv:2012.06678 (2020).

[2] Arık, Sercan O., and Tomas Pfister. “Tabnet: Attentive interpretable tabular learning.” arXiv (2020).

[3] https://www.kaggle.com/prasadperera/the-boston-housing-dataset

[4] Becht, Etienne, et al. “Dimensionality reduction for visualizing single-cell data using UMAP.” Nature biotechnology 37.1 (2019): 38–44.

[5] https://towardsdatascience.com/how-exactly-umap-works-13e3040e1668

[6] McInnes, Leland, John Healy, and James Melville. “UMAP: uniform manifold approximation and projection for dimension reduction.” (2020).
