DeConFuse: a deep convolutional transform-based unsupervised fusion framework

This work proposes an unsupervised fusion framework based on deep convolutional transform learning. The great learning ability of convolutional filters for data analysis is well acknowledged. The success of convolutive features owes to the convolutional neural network (CNN). However, CNN cannot perform learning tasks in an unsupervised fashion. In a recent work, we showed that this shortcoming can be addressed by adopting a convolutional transform learning (CTL) approach, where convolutional filters are learnt in an unsupervised fashion. The present paper aims at (i) proposing a deep version of CTL, (ii) proposing an unsupervised fusion formulation taking advantage of the proposed deep CTL representation, and (iii) developing a mathematically sound optimization strategy for performing the learning task. We apply the proposed technique, named DeConFuse, to the problem of stock forecasting and trading. A comparison with state-of-the-art methods (based on CNN and long short-term memory network) shows the superiority of our method for performing a reliable feature extraction.


Introduction
In the last decade, the Convolutional Neural Network (CNN) has enjoyed tremendous success in different types of data analysis. It was initially applied to images in computer vision tasks. The operations within the CNN were believed to mimic the human visual system. Although such a link between human vision and CNN may be present, it has been observed that deep CNNs are not exact models for human vision [1]. For instance, biologists consider that the human visual system consists of 6 layers [2,3] and not the 20+ layers used in GoogleNet [4].
Neural network models have also been used for analyzing time series data. Until recently, long short-term memory (LSTM) networks were used almost exclusively as neural network models for time series analysis, as they were supposed to mimic memory and hence were deemed suitable for such tasks. However, LSTMs are not able to model very long sequences, and their training is hardware intensive. Owing to these shortcomings, LSTMs are being replaced by CNNs. The reason for the great results of CNN methods for time series analysis (1D data processing in general) is not well understood. One possibility may lie in the universal function approximation capacity of deep neural networks [5,6] rather than their biological semblance. The research in this area is primarily led by its success rather than its understanding.
An important point to mention is that the performance of CNN is largely driven by the availability of very large labeled datasets. This probably explains their tremendous success in facial recognition tasks. Google's FaceNet [7] and Facebook's DeepFace [8] architectures are trained on 400 million facial images, a significant proportion of the world's population. These companies have easy access to gigantic labeled facial image datasets, as the images are 'tagged' by their respective users. In the said problem, deep networks reach almost 100% accuracy, even surpassing human capabilities. However, when it comes to tasks that require expert labeling, such as facial recognition from sketches (requiring forensic expertise) [8] or ischemic attack detection from EEG (requiring medical expertise) [9], the accuracies become modest. Indeed, such expert labeling is difficult to acquire, which limits the size of the available labeled datasets.
A number of machine learning researchers, including Hinton himself, are wary of supervised learning. In an interview with Axios, Hinton mentioned his 'deep suspicion' of backpropagation, the workhorse behind all supervised deep neural networks. He even added that "I don't think it's how the brain works," and "We clearly don't need all the labeled data". It seems that Hinton is hinting towards unsupervised learning frameworks. Unsupervised learning techniques do not require targets / labels to learn from data. This approach typically benefits from the fact that data is inherently very rich in its structure, unlike targets that are sparse in nature. Thus, it does not take into account the task to be performed while learning about the data, removing the need for the human expertise that is required in supervised learning. More on the topic of unsupervised versus supervised learning can be found in a blog by DeepMind.

In this work, we would like to keep the best of both worlds, i.e. the success of convolutive models from CNN and the promises of unsupervised learning formulations. With this goal in mind, we developed convolutional transform learning (CTL) [10]. This is a representation learning technique that learns a set of convolutional filters from the data without label information. Instead of learning the filters (by backpropagating) from data labels, CTL learns them by minimizing a data fidelity loss, thus making the technique unsupervised. CTL has been shown to outperform several supervised and unsupervised learning schemes in the context of image classification. In the present work, we propose to extend the shallow CTL version to deeper layers, with the aim of generating a feature extraction strategy that is well suited for 1D time series analysis. This is the first major contribution of this work: deep convolutional transform learning.
In most applications, time series signals are multivariate, as they arise from multiple sources/sensors. For example, biomedical signals like ECG and EEG come from multiple leads; financial data from stocks are recorded with different inputs (open, close, low, high and net asset value); demand forecasting problems in smart grids come with multiple types of data (power consumption, temperature, humidity, occupancy, etc.). In all such cases, the final goal is to perform a prediction/classification task from such multivariate time series. We propose to address this problem as one of feature fusion. The information from each of the sources will be processed by the proposed deep CTL pipeline, and the generated deep features will be finally fused by an unsupervised fully connected layer. This is the second major contribution of this work: an unsupervised fusion framework with deep CTL.
The resulting features can be used for different applicative tasks. In this paper, we focus on the applicative problem of financial stock analysis. The ultimate goal may be either to forecast the stock price (regression problem) or to decide whether to buy or sell (classification problem). Depending on the considered task, we can pass the generated features into a suitable machine learning tool that may not be as data hungry as deep neural networks. Therefore, by adopting such a processing architecture, we expect to yield better results than traditional deep learning, especially in cases where access to labeled data is limited.

CNN for Time Series Analysis
Let us briefly review and discuss CNN based methods for time series analysis. For a more detailed review, the interested reader can peruse [22]. We mainly focus on studies on stock forecasting, as it will be our use case for experimental validation.
The traditional choice for processing time series with neural networks is to adopt a recurrent neural network (RNN) architecture. Variants of RNN like long short-term memory (LSTM) [38] and gated recurrent unit (GRU) [39] have been proposed. However, due to the complexity of training such networks via backpropagation through time, they have been progressively replaced with 1D CNN [11]. For example, in [12], a generic time series analysis framework was built based on LSTM, with performance assessed on the UCR time series classification datasets [14]. The later study from the same group [13], based on 1D CNN, showed considerable improvement over the prior model on the same datasets.
There are also several studies that convert 1D time series data into a matrix form so as to be able to use 2D CNNs [15][16][17]. Each column of the matrix corresponds to a subset of the 1D series within a given time window, and the resulting matrix is processed as an image. The 2D CNN model has been especially popular in stock forecasting. In [17], the said techniques have been used on stock prices for forecasting. A slightly different input is used in [18]: instead of using the standard stock variables (open, close, high, low and NAV), it uses high frequency data for forecasting major points of inflection in the financial market. In another work [19], a similar approach is used for modeling Exchange Traded Funds (ETF). It has been seen that the 2D CNN model performs the same as LSTM or the standard multi-layer perceptron [20,21]. The apparent lack of performance improvement in the aforementioned studies may be due to an incorrect choice of CNN model, since an inherently 1D time series is modeled as an image.
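The matrix construction used by these 2D CNN approaches can be illustrated with a minimal sketch. The function name `series_to_window_matrix` and the `step` parameter are hypothetical choices for illustration; the cited studies may use different window lengths, overlaps, or orientations.

```python
def series_to_window_matrix(series, window, step=1):
    """Stack sliding windows of a 1D series as the columns of a matrix.

    The result is a list of rows (window x num_windows), so that column j
    holds series[j * step : j * step + window].
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    starts = range(0, len(series) - window + 1, step)
    columns = [series[s:s + window] for s in starts]
    # Transpose the list of columns into a row-major matrix.
    return [list(row) for row in zip(*columns)]


prices = [10, 11, 12, 13, 14, 15]
matrix = series_to_window_matrix(prices, window=3, step=1)
# matrix has 3 rows and 4 columns; its first column is [10, 11, 12]
```

Consecutive columns overlap heavily, which illustrates why treating such a matrix as an image imposes a 2D structure on data that is inherently 1D.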

Deep Learning and Fusion
We now review existing works for processing multivariate data inputs within the deep learning framework. Since the present work aims at being applied to stock price forecasting / trading, we will mostly focus our review on the multi-channel / multi-sensor fusion framework. Multimodal data and fusion for image processing, less related to our work, will be mentioned at the end of this subsection for the sake of completeness.
Deep learning has been widely used recently for analyzing multi-channel / multi-sensor signals. In several such studies, all the sensors are stacked one after the other to form a matrix, and a 2D CNN is used for analyzing these signals. For example, [23] uses this strategy for human activity recognition from multiple body sensors. It is important to distinguish such an approach from the aforementioned studies [17][18][19][20][21]. Here, the images are not formed by stacking windowed segments of the same signal one after the other, but by stacking signals from different sensors. The said study [23] does not account for any temporal modeling; this is rectified in [24]. There, a 2D CNN is used on a time series window, but the different windows are finally processed by a GRU, thus explicitly incorporating time series modeling. There is however no explicit fusion framework in [23,24]. The information from raw multivariate signals is simply fused to form matrices and treated by 2D convolutions. A true fusion framework was proposed in [25]. Each signal channel is processed by a deep 1D CNN, and the outputs from the different signal processing pipelines are then fused by a fully connected layer. Thus, the fusion happens at the feature level and not at the raw signal level as in [23,24].
Another area that routinely uses deep learning based fusion is multi-modal data processing. This area is not as well defined as multi-channel data processing; nevertheless, we will briefly discuss some studies on this topic. In [26], a fusion scheme based on a deep belief network (DBN) and a stacked autoencoder (SAE) is proposed for fusing audio and video channels in audio-visual analysis. Each channel is processed separately and connected by a fully connected layer to produce fused features. These fused features are further processed for inference. We can also mention the work on video based action recognition addressed in [27], which proposes a fusion scheme for incorporating temporal information (processed by CNN) and spatial information (also processed by CNN).
There are several other such works on image analysis [28][29][30]. In [28], a fusion scheme is proposed for processing color and depth information (via 3D and 2D convolutions respectively) with the objective of action recognition. In [29], it was shown that by fusing hyperspectral data (high spatial resolution) with Lidar (depth information), better classification results can be achieved. In [30], it was shown that fusing deeply learnt features (from CNN) with handcrafted features via a fully connected layer can improve analysis tasks. In this work, our interest lies in the first problem, that of inference from 1D / time-series multichannel signals. To the best of our knowledge, all prior deep learning based studies on this topic are supervised. In keeping with the vision of Hinton and others, our goal is to develop an unsupervised fusion framework using deeply learnt convolutive filters.

Convolutional Transform Learning
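Convolutional Transform Learning (CTL) was introduced in our seminal paper [10]. Since it is a recent work, we present it in detail here, to make the current paper self-contained. CTL learns a set of filters $(t_m)_{1 \le m \le M}$ operated on observed samples $(s^{(k)})_{1 \le k \le K}$ to generate a set of features $(x_m^{(k)})_{1 \le m \le M, 1 \le k \le K}$. Formally, the inherent learning model is expressed through convolution operations defined as

$$(\forall m \in \{1, \dots, M\},\ \forall k \in \{1, \dots, K\}) \quad t_m * s^{(k)} = x_m^{(k)}. \tag{1}$$

Following the original study on transform learning [34], a sparsity penalty is imposed on the features to improve representation ability and limit overfitting issues. Moreover, along the same lines as CNN models, a non-negativity constraint is imposed on the features. Training then consists of learning the convolutional filters and the representation coefficients from the data. This is expressed as the following optimization problem

$$\underset{(t_m)_m,\,(x_m^{(k)})_{m,k}}{\text{minimize}} \quad \sum_{k=1}^{K} \sum_{m=1}^{M} \left( \frac{1}{2} \| t_m * s^{(k)} - x_m^{(k)} \|_2^2 + \psi(x_m^{(k)}) \right) + \mu \sum_{m=1}^{M} \| t_m \|_2^2 - \lambda \log\det\left([t_1 \,|\, \dots \,|\, t_M]\right), \tag{2}$$

where $\psi$ is a suitable penalization function. Note that the regularization term "$\mu \|\cdot\|_F^2 - \lambda \log\det$" ensures that the learnt filters are unique, something that is not guaranteed in CNN. Let us introduce the matrix notation

$$T * S - X = \begin{bmatrix} t_1 * s^{(1)} - x_1^{(1)} & \dots & t_M * s^{(1)} - x_M^{(1)} \\ \vdots & \ddots & \vdots \\ t_1 * s^{(K)} - x_1^{(K)} & \dots & t_M * s^{(K)} - x_M^{(K)} \end{bmatrix}, \tag{3}$$

where $T = [t_1 \,|\, \dots \,|\, t_M]$, $S = [s^{(1)} \,|\, \dots \,|\, s^{(K)}]$, and $X = [x_1^{(k)} \,|\, \dots \,|\, x_M^{(k)}]_{1 \le k \le K}$.
The cost function in Problem (2) can be compactly rewritten as

$$F(T, X) = \frac{1}{2} \| T * S - X \|_F^2 + \Psi(X) + \mu \| T \|_F^2 - \lambda \log\det(T), \tag{4}$$

where $\Psi$ applies the penalty term $\psi$ column-wise on $X$.
A local minimizer to (4) can be reached efficiently using the alternating proximal algorithm [31][32][33], which alternates between proximal updates on variables $T$ and $X$. More precisely, set a Hilbert space $(\mathcal{H}, \|\cdot\|)$, and define the proximity operator [21] at $\tilde{X} \in \mathcal{H}$ of a proper lower-semicontinuous convex function $\psi : \mathcal{H} \to \,]-\infty, +\infty]$ as

$$\operatorname{prox}_{\psi}(\tilde{X}) = \underset{X \in \mathcal{H}}{\operatorname{argmin}} \ \psi(X) + \frac{1}{2} \| X - \tilde{X} \|^2. \tag{5}$$

Then, the alternating proximal algorithm reads

$$\text{For } n = 0, 1, \dots \quad \begin{cases} T^{[n+1]} = \operatorname{prox}_{\gamma_1 F(\cdot, X^{[n]})}\left(T^{[n]}\right) \\ X^{[n+1]} = \operatorname{prox}_{\gamma_2 F(T^{[n+1]}, \cdot)}\left(X^{[n]}\right) \end{cases} \tag{6}$$

with initializations $T^{[0]}$, $X^{[0]}$ and $\gamma_1$, $\gamma_2$ positive constants. For more details on the derivations and the convergence guarantees, the readers can refer to [10].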
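To make the structure of the one-layer CTL cost concrete, the following is a minimal pure-Python sketch that evaluates a data fidelity term, a Frobenius penalty and a log-det term for given filters and features. Several points are assumptions made for illustration and are not taken from the paper: the function names, the use of "valid" convolutions, the factor 1/2 on the fidelity term, features that already satisfy the non-negativity constraint (so that the penalty contributes zero), and the definition of the log-det of a rectangular filter matrix as the sum of the logarithms of its singular values.

```python
import math


def conv1d_valid(signal, kernel):
    """Discrete 1D convolution restricted to fully overlapping positions."""
    flipped = kernel[::-1]
    n_out = len(signal) - len(kernel) + 1
    return [sum(f * s for f, s in zip(flipped, signal[i:i + len(kernel)]))
            for i in range(n_out)]


def log_det_via_gram(filters):
    """Sum of log singular values of the matrix whose columns are the filters.

    Uses a Cholesky factorisation G = L L^T of the Gram matrix G = T^T T,
    since sum_i log(sigma_i) = 0.5 * log det(G) = sum_i log(L_ii).
    Requires linearly independent filters.
    """
    m = len(filters)
    gram = [[sum(a * b for a, b in zip(filters[i], filters[j]))
             for j in range(m)] for i in range(m)]
    chol = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            partial = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][j] = math.sqrt(gram[i][i] - partial)
            else:
                chol[i][j] = (gram[i][j] - partial) / chol[j][j]
    return sum(math.log(chol[i][i]) for i in range(m))


def ctl_cost(filters, samples, features, mu, lam):
    """Fidelity + Frobenius penalty - log-det term (features assumed >= 0)."""
    fidelity = 0.0
    for k, sample in enumerate(samples):
        for m, filt in enumerate(filters):
            convolved = conv1d_valid(sample, filt)
            fidelity += sum((c - x) ** 2
                            for c, x in zip(convolved, features[k][m]))
    frobenius = sum(t * t for filt in filters for t in filt)
    return 0.5 * fidelity + mu * frobenius - lam * log_det_via_gram(filters)


filters = [[1.0, 0.0], [0.0, 1.0]]        # M = 2 filters of size 2
samples = [[1.0, 2.0, 3.0]]               # K = 1 sample
features = [[[2.0, 3.0], [1.0, 2.0]]]     # exact convolution outputs
cost = ctl_cost(filters, samples, features, mu=0.5, lam=1.0)
```

In this toy setting the features match the convolutions exactly, so only the regularization terms contribute to the cost. The log-det term grows when the filters become more diverse, which is the mechanism the text credits for avoiding degenerate filters.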

Fusion based on Deep Convolutional Transform Learning
In this section, we discuss our proposed formulation. First, we extend the aforementioned CTL formulation to a deeper version. Next, we develop the fusion framework based on transform learning, leading to our DeConFuse strategy.

Deep Convolutional Transform Learning
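Deep CTL consists of stacking multiple convolutional layers on top of each other to generate the features, as shown in Figure 1. To learn all the variables in an end-to-end fashion, deep CTL relies on the key property that the solution $X$ to the CTL problem, assuming fixed filters $T$, can be reformulated as the simple application of an element-wise activation function, that is

$$\underset{X}{\operatorname{argmin}} \ F(T, X) = \phi(T * S), \tag{7}$$

with $\phi$ the proximity operator of $\Psi$ [41]. For example, if $\Psi$ is the indicator function of the positive orthant, then $\phi$ identifies with the famous rectified linear unit (ReLU) activation function. Many other examples are provided in [41]. Consequently, deep features can be computed by stacking many such layers

$$(\forall \ell \in \{1, \dots, L-1\}) \quad X_\ell = \phi_\ell(T_\ell * X_{\ell-1}), \tag{8}$$

where $X_0 = S$ and $\phi_\ell$ is a given activation function for layer $\ell$. Putting all together, deep CTL amounts to

$$\underset{T_1, \dots, T_L, X}{\text{minimize}} \quad F_{\text{conv}}(T_1, \dots, T_L, X \mid S), \tag{9}$$

where

$$F_{\text{conv}}(T_1, \dots, T_L, X \mid S) = \frac{1}{2} \left\| T_L * \phi_{L-1}\big(T_{L-1} * \dots \phi_1(T_1 * S)\big) - X \right\|_F^2 + \Psi(X) + \sum_{\ell=1}^{L} \left( \mu \| T_\ell \|_F^2 - \lambda \log\det(T_\ell) \right). \tag{10}$$

This is a direct extension of the one-layer formulation in (4).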
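The layer stacking described above can be sketched as a simple forward pass. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, convolutions are "valid", ReLU is applied after every layer, and every feature map is convolved with every filter of the next layer (a wiring chosen only for simplicity, which may differ from the architecture of Figure 1).

```python
def conv1d_valid(signal, kernel):
    """Discrete 1D convolution restricted to fully overlapping positions."""
    flipped = kernel[::-1]
    n_out = len(signal) - len(kernel) + 1
    return [sum(f * s for f, s in zip(flipped, signal[i:i + len(kernel)]))
            for i in range(n_out)]


def relu(values):
    """Element-wise projection onto the non-negative orthant."""
    return [max(v, 0.0) for v in values]


def deep_features(sample, layers):
    """Propagate one 1D sample through stacked convolutional layers.

    `layers` is a list of layers, each layer being a list of filters.
    Assumed wiring: each current feature map is convolved with every
    filter of the next layer, followed by ReLU.
    """
    maps = [sample]
    for layer_filters in layers:
        maps = [relu(conv1d_valid(fmap, filt))
                for fmap in maps for filt in layer_filters]
    return maps


sample = [1.0, -2.0, 3.0, -4.0, 5.0]
layers = [[[1.0, -1.0]], [[1.0, 1.0]]]   # two layers, one filter each
features = deep_features(sample, layers)
# features == [[5.0, 5.0, 9.0]]
```

Here ReLU plays the role of the proximity operator of the indicator function of the non-negative orthant: it is the projection that maps each negative entry to zero and leaves the others unchanged.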

Multi-Channel Fusion Framework
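We now propose a fusion framework to learn in an unsupervised fashion a suitable representation of multi-channel data that can then be utilised for a multitude of tasks. This framework takes the channels of input data samples to separate branches of convolutional layers, leading to multiple sets of channel-wise features. These decoupled features are then concatenated and passed to a fully-connected layer, which yields a unique set of coupled features. The complete architecture, called DeConFuse, is shown in Fig. 2. It is trained by solving the joint optimization problem

$$\underset{T, X, \tilde{T}, Z}{\text{minimize}} \quad F_{\text{fusion}}(\tilde{T}, Z, X) + \sum_{c=1}^{C} F_{\text{conv}}\big(T_1^{(c)}, \dots, T_L^{(c)}, X^{(c)} \mid S^{(c)}\big), \tag{11}$$

where $F_{\text{conv}}$ denotes the deep CTL cost function of the previous subsection, applied to the input samples $S^{(c)}$ of channel $c$, and

$$F_{\text{fusion}}(\tilde{T}, Z, X) = \frac{1}{2} \Big\| Z - \sum_{c=1}^{C} \operatorname{flat}(X^{(c)}) \tilde{T}_c \Big\|_F^2 + \Psi(Z) + \sum_{c=1}^{C} \left( \mu \| \tilde{T}_c \|_F^2 - \lambda \log\det(\tilde{T}_c) \right). \tag{12}$$

Here, $T = (T^{(c)})_{1 \le c \le C}$ with $T^{(c)} = (T_1^{(c)}, \dots, T_L^{(c)})$ the convolutional filters of channel $c$, $X = (X^{(c)})_{1 \le c \le C}$ are the channel-wise features, $\tilde{T} = (\tilde{T}_c)_{1 \le c \le C}$ is the (not convolutional) linear transform that fuses them, $Z$ are the fused features, and the operator "flat" transforms $X^{(c)}$ into a matrix where each row contains the "flattened" features of a sample.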
To summarize, our formulation aims to jointly train the channel-wise convolutional filters $T^{(c)}$ and the fusion coefficients $\tilde{T}$ in an end-to-end fashion.
We explicitly learn the features $X$ and $Z$ subject to non-negativity constraints so as to avoid trivial solutions and make our approach completely unsupervised. Moreover, the "log-det" regularization on both $T^{(c)}$ and $\tilde{T}$ breaks symmetry and forces diversity in the learnt transforms, whereas the Frobenius regularization ensures that the transform coefficients are bounded.
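The fusion step can be sketched for a single sample as follows. This is a hedged illustration rather than the authors' code: the function names are hypothetical, and the fused features are computed as the non-negative part of the sum of the per-channel linear maps, which is equivalent to concatenating the flattened channel features and applying one fully connected layer followed by ReLU.

```python
def flatten(feature_maps):
    """Concatenate the feature maps of one sample into a single row vector."""
    return [v for fmap in feature_maps for v in fmap]


def row_times_matrix(row, matrix):
    """Row vector times matrix, with the matrix given as a list of rows."""
    return [sum(r * matrix[i][j] for i, r in enumerate(row))
            for j in range(len(matrix[0]))]


def fuse(channel_features, channel_transforms):
    """Fused features of one sample: ReLU(sum_c flat(X_c) @ T_c)."""
    n_out = len(channel_transforms[0][0])
    fused = [0.0] * n_out
    for maps, transform in zip(channel_features, channel_transforms):
        contribution = row_times_matrix(flatten(maps), transform)
        fused = [f + c for f, c in zip(fused, contribution)]
    # Non-negativity constraint on the fused features.
    return [max(f, 0.0) for f in fused]


# Two channels, each with two flattened features, fused into two outputs.
channel_features = [[[1.0, 2.0]], [[3.0], [4.0]]]
channel_transforms = [
    [[1.0, 0.0], [0.0, 1.0]],    # transform for channel 1
    [[1.0, 0.0], [0.0, -1.0]],   # transform for channel 2
]
z = fuse(channel_features, channel_transforms)
# z == [4.0, 0.0]
```

Because each channel has its own transform, the fusion operates on learnt features rather than on raw signals, in line with the feature-level fusion discussed in the related work.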

Optimization algorithm
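As for the solution of Problem (11), we remark that all terms of the cost function are differentiable, except the indicator function of the non-negativity constraint. We can, therefore, find a local minimizer to (11) by employing the projected gradient descent, whose iterations read

$$\text{For } n = 0, 1, \dots \quad \begin{cases} T^{[n+1]} = T^{[n]} - \gamma \nabla_T J\big(T^{[n]}, X^{[n]}, \tilde{T}^{[n]}, Z^{[n]}\big) \\ X^{[n+1]} = P_+\Big(X^{[n]} - \gamma \nabla_X J\big(T^{[n]}, X^{[n]}, \tilde{T}^{[n]}, Z^{[n]}\big)\Big) \\ \tilde{T}^{[n+1]} = \tilde{T}^{[n]} - \gamma \nabla_{\tilde{T}} J\big(T^{[n]}, X^{[n]}, \tilde{T}^{[n]}, Z^{[n]}\big) \\ Z^{[n+1]} = P_+\Big(Z^{[n]} - \gamma \nabla_Z J\big(T^{[n]}, X^{[n]}, \tilde{T}^{[n]}, Z^{[n]}\big)\Big) \end{cases} \tag{13}$$

with initialization $T^{[0]}$, $X^{[0]}$, $\tilde{T}^{[0]}$, $Z^{[0]}$, $\gamma > 0$, $P_+ = \max\{\cdot, 0\}$, and $J$ the differentiable part of the cost function in (11). In practice, we make use of accelerated strategies [36] within each step of this algorithm to speed up learning.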
There are two notable advantages with the proposed optimization approach. Firstly, we rely on automatic differentiation [37] and stochastic gradient approximations to efficiently solve Problem (11). Secondly, we are not limited to ReLU activation in (8), but rather we can use more advanced ones, such as SELU [35]. This is beneficial for the performance, as shown by our numerical results.
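The projected gradient scheme can be illustrated on a toy problem. The sketch below is generic and makes several simplifying assumptions: the gradient is supplied by hand instead of automatic differentiation, no acceleration or stochastic approximation is used, and the hypothetical `constrained` flags mark which variables are projected onto the non-negative orthant (as for the features) and which are updated freely (as for the transforms).

```python
def projected_gradient(grad, x0, constrained, step, n_iter):
    """Gradient steps followed by projection of the constrained coordinates.

    grad        -- function returning the gradient of the smooth cost at x
    constrained -- list of booleans; True means the coordinate must be >= 0
    """
    x = list(x0)
    for _ in range(n_iter):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
        # Projection P_+ = max{., 0}, applied only where required.
        x = [max(xi, 0.0) if c else xi for xi, c in zip(x, constrained)]
    return x


# Toy smooth cost: 0.5 * ||x - target||^2, whose gradient is x - target.
target = [2.0, -1.0, -3.0]


def toy_gradient(x):
    return [xi - ti for xi, ti in zip(x, target)]


solution = projected_gradient(
    toy_gradient, x0=[0.0, 0.0, 0.0],
    constrained=[True, True, False], step=0.5, n_iter=60,
)
# The first coordinate approaches 2, the second is clipped at 0,
# and the unconstrained third coordinate approaches -3.
```

In the full method the same pattern applies with the four blocks of variables, the gradients being obtained by automatic differentiation.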

Computational Complexity of Proposed Framework - DeConFuse
Table 1 summarizes the computational complexity of the DeConFuse architecture, both for training and test phases. Specifically, we report the cost incurred for every input sample at each iteration of gradient descent in the training phase, and for the output computation in the testing phase. The computational complexity of the DeConFuse architecture is comparable to that of a regular CNN. The only addition is the log-det regularization, which requires computing the truncated singular value decomposition of $T^{(c)}$ and $\tilde{T}_c$. However, as the size of these matrices is determined by the filter size, the number of filters, and the number of output features per sample, the training complexity is not worse than that of a CNN.

Experimental Evaluation
We carry out experiments on the real-world problem of stock forecasting and trading. Stock forecasting is a regression problem aiming at estimating the price of a stock at a future date (the next day in our case) given inputs up to the current date. Stock trading is a classification problem, where the decision whether to buy or sell a stock has to be taken at each time. The two problems are related by simple logic: if the price of a stock is expected to increase at a later date, the stock must be bought; if the price is expected to go down, the stock must be sold. We use the five raw inputs for both tasks, namely open price, close price, high, low and net asset value (NAV). One could compute technical indicators based on the raw inputs [17] but, in keeping with the essence of true representation learning, we chose to stay with those raw values. Each of the five inputs is processed by a separate 1D processing pipeline. Each pipeline produces a flattened output (Fig. 1). The flattened outputs are then concatenated and fed into the transform learning layer acting as the fully connected layer (Fig. 2) for fusion. While our processing pipeline ends here (being unsupervised), the benchmark techniques are supervised and have an output node. The node is binary (buy / sell) for classification and real valued for regression. More precisely, we compare with two state-of-the-art time series analysis models, namely TimeNet [12] and ConvTimeNet [13]. In the former, the individual processing pipelines are based on LSTM, and in the latter they use 1D CNN.
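The logic relating forecasting and trading can be sketched as a labeling rule. The rule below is an assumption made for illustration (the exact labeling procedure is not specified here): day t is labeled "buy" when the next-day close is strictly higher than the current close, and "sell" otherwise, with ties treated as "sell". The function name is hypothetical.

```python
def trading_labels(close_prices):
    """Label each day as 1 (buy) if the next close is higher, else 0 (sell).

    The last day has no next-day price and therefore receives no label.
    """
    return [1 if nxt > cur else 0
            for cur, nxt in zip(close_prices, close_prices[1:])]


closes = [100.0, 101.0, 99.0, 99.0, 105.0]
labels = trading_labels(closes)
# labels == [1, 0, 0, 1]
```

With such labels, the same unsupervised features can feed either a regressor (predicting the next price) or a classifier (predicting the buy / sell decision).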
We make use of a real dataset from the National Stock Exchange (NSE) of India. The dataset contains information on 150 symbols between 2014 and 2018; these stocks were chosen after filtering out stocks that had less than three years of data. The companies available in the dataset are from various sectors such as IT (e.g., TCS, INFY), automobile (e.g., HEROMOTOCO, TATAMOTORS), banking (e.g., HDFCBANK, ICICIBANK), coal and petroleum (e.g., OIL, ONGC), steel (e.g., JSWSTEEL, TATASTEEL), construction (e.g., ABIRLANUVO, ACC), and public sector units (e.g., POWERGRID, GAIL). The detailed architectures of the tested techniques, namely DeConFuse, ConvTimeNet and TimeNet, are presented in Table 2. For DeConFuse, TimeNet and ConvTimeNet, we have tuned the architectures to yield the best performance and have randomly initialized the weights for each stock's training.

Stock Forecasting - Regression
Let us start with the stock forecasting problem. We feed the unsupervised features generated by the proposed architecture into an external regressor, namely ridge regression. Evaluation is carried out in terms of mean absolute error (MAE) between the predicted and actual stock prices for all 150 stocks. The stock forecasting results are shown in Table 5 in appendix section A. The MAE for individual stocks is presented for each of close price, open price, high price, low price and net asset value.
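The regression stage can be illustrated with a small self-contained sketch of ridge regression and of the MAE metric. It is written in pure Python for readability and is not the implementation used in the experiments; the function names, the absence of an intercept, and the toy data are assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def ridge_fit(features, targets, alpha):
    """Weights minimising ||Z w - y||^2 + alpha * ||w||^2 (no intercept)."""
    d = len(features[0])
    gram = [[sum(row[i] * row[j] for row in features)
             + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(row[i] * y for row, y in zip(features, targets))
           for i in range(d)]
    return solve_linear(gram, rhs)


def predict(features, weights):
    """Linear predictions Z w."""
    return [sum(f * w for f, w in zip(row, weights)) for row in features]


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


fused = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy fused features Z
prices = [2.0, 3.0, 5.0]                        # toy next-day prices
weights = ridge_fit(fused, prices, alpha=1.0)
mae = mean_absolute_error(prices, predict(fused, weights))
```

The penalty weight `alpha` shrinks the regression weights; with `alpha=0.0` the toy data above are fitted exactly.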
From Table 5, it can be seen that the MAE values reached by the proposed DeConFuse solution for the first four prices (open, close, high, low) are exceptionally good for all of the 150 stocks. Regarding NAV prediction, the proposed method performs extremely well for 128 stocks. Among the remaining 22 stocks, there are 13 stocks, highlighted in red, for which DeConFuse does not give the lowest MAE but is still very close to the best results given by the TimeNet approach.
For a concise summary of the results, the average values over all stocks are shown in Table 3. From this summary, it can be observed that our error is more than an order of magnitude lower than that of the state-of-the-art methods. The plots of one of the regressed prices (close price) for some example stocks in Fig. 3 show that the close prices predicted by DeConFuse are closer to the true close prices than the benchmark predictions.

Stock Trading - Classification
We now focus on the stock trading task. In this case, the unsupervised features generated by DeConFuse are the inputs to an external classifier based on a Random Decision Forest (RDF) with 5 decision tree classifiers of depth 3. Even though we used this architecture, we found that the results from RDF are robust to changes in architecture. This is a well-known phenomenon about RDFs [40]. We evaluate the results in terms of precision, recall, F1 score, and area under the ROC curve (AUC). From the financial viewpoint, we also calculate annualized returns (AR) using the predicted trading signals / labels as well as the true trading signals / labels, named Predicted AR and True AR respectively. Certain results in Table 6 are highlighted in bold or red. The first set of results, marked in bold, are the ones where a technique gives the best performance for a given metric and stock. The proposed solution DeConFuse gives the best results for 89 stocks for the precision score, 85 stocks for the recall score, 125 stocks for the F1 score, 91 stocks for the AUC measure, and 56 stocks in the case of the AR metric. The other set, marked in red, highlights the cases where DeConFuse has not performed the best but performs nearly equal to the best performance given by one of the benchmarks (here, a maximum difference of 0.05 in the metric is considered), i.e. DeConFuse gives the next best performance. We noticed that there are 24 stocks for which DeConFuse gives the next best precision value. Likewise, there are 18 such stocks in the case of recall, 22 stocks for the F1 score, 26 stocks for AUC values, and 1 stock in the case of AR. Overall, DeConFuse reaches very satisfying performance compared with the benchmark techniques. This is also corroborated by the summary of trading results in Table 4. We also display empirical convergence plots for a few stocks, namely RELIANCE, ONGC, HINDUNILVR and ICICIBANK, in Fig. 4.
We can see that the training loss decreases to a point of stability for each example.

The advantage of our framework is its ability to learn in an unsupervised fashion. For example, consider the problem we address. With traditional deep learning based models, we need to train two separate deep networks, one for regression and one for classification. In contrast, we can reuse our features for both tasks, without re-training for each specific task. This has advantages in other areas as well. For example, one can either do ischemia detection, i.e. detect whether one is having a stroke at the current time instant (from EEG), or ischemia prediction, i.e. forecast whether a stroke is going to happen. In standard deep learning, two networks need to be trained and tuned to tackle these two problems. With our proposed method, there is no need for this double effort.
In the future, we plan to extend the framework to supervised / semi-supervised formulations. We believe that the semi-supervised formulation will be of immense practical importance. We would also like to extend it to 2D convolutions in order to handle image data.

Consent for publication
Not Applicable.

Availability of data and materials
The dataset used is a real dataset from the Indian National Stock Exchange (NSE) covering the past four years and is publicly available. We have shared the data with our implementation, available at: https://github.com/pooja290992/DeConFuse.git.

Competing interests
The authors declare that they have no competing interests.

Funding
This work was supported by the CNRS-CEFIPRA project under grant NextGenBP PRC2017.

A Detailed Stock Forecasting Results

Fig. 1: Deep CTL architecture. The illustration is given for $L = 2$ layers, with the first layer $T_1$ composed of $M_1 = 4$ filters of size 5 × 1, and the second layer composed of $M_2 = 8$ filters of size 3 × 1.

Fig. 2: DeConFuse architecture. Since we have multi-channel data, for each channel $c \in \{1, \dots, C\}$, we learn a different set of convolutional filters $T^{(c)}$ and features $X^{(c)}$. At the same time, we learn the (not convolutional) linear transform $\tilde{T} = (\tilde{T}_c)_{1 \le c \le C}$ to fuse the channel-wise features $X = (X^{(c)})_{1 \le c \le C}$, along with the corresponding fused features $Z$, which constitute the final output of the proposed DeConFuse model.

Notation used in Table 1:
- $D$ = input sample size
- $K$ = num. of samples
- $C$ = num. of channels
- $L$ = num. of layers
- $P_\ell$ = filter size at layer $\ell$
- $M_\ell$ = num. of filters at layer $\ell$
- $D_\ell$ = output sample size at layer $\ell$
- $I = D_L M_L$ is the num. of output features per sample and per channel at the last convolution layer
- $O = \alpha I C$ (with $\alpha \in \,]0, 1]$) is the num. of output features per sample at the fully-connected layer

Authors' contributions
- Ms. Pooja Gupta has introduced the CTL within the fusion framework and performed all the numerical experiments.
- Ms. Jyoti Maggu originally formulated the transform learning model and the deep version for it.
- Dr. Angshul Majumdar has helped with the model formulation and the assessment of the experimental part.
- Dr. Emilie Chouzenoux and Dr. Giovanni Chierchia have contributed to the formulation of the model and the optimization algorithms.
- All the authors have contributed to the writing and proofreading of the paper.

Table 1: Time complexity in training and test phases (for one input sample)

Table 2: Description of compared models

Table 3: Summary of Forecasting Results

Table 4: Summary of Trading Results

Table 5: Stock-wise Forecasting Results