posted Sep 1, 2015, 10:47 AM by Shyam Sarkar
The CALSTATDN model combines methods from calculus (CAL), statistics (STAT) and database normalization (DN) to process and analyze large volumes of data captured from sensors, embedded devices or any other internet data source within physical, environmental or biological systems as part of the “Internet of Things” (IoT). FIG 1 is a conceptual flow diagram of the machine learning approach in the CALSTATDN model. It shows an unknown target function f(X). Training samples are drawn from a probability distribution P over the data values X. Derivatives, or rates of change, of the values in X are also computed and used for training. The training sets are used to select a hypothesis that approximates f, and the best hypothesis g must fit the training samples very well.

In the CALSTATDN model, if the hypothesis set H is chosen carefully to include models from both calculus and statistics, the approximation will be very close to the true unknown function f, provided there is enough training data. The best hypothesis g(X) applies both calculus-based and statistics-based models with a high level of correctness, leading to a very small estimated error. The model combines the ideas of rates of change, differentiation and integration from calculus with the idea of generalization over sets of values from statistical computing to derive the best hypothesis g(X) explaining the behavior of a function f(X), using fewer data points and fewer generalizations over unseen data points than conventional machine learning methods.
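
The idea of training on both values and their rates of change can be illustrated with a minimal sketch. It is not the CALSTATDN implementation: the polynomial hypothesis set, the function names `fit_with_derivatives` and `predict`, and the `weight` parameter are all assumptions made for illustration. Derivatives are estimated from the samples by finite differences, and a least-squares fit is asked to match the samples and the estimated derivatives at the same time.

```python
import numpy as np

def fit_with_derivatives(x, y, degree=3, weight=1.0):
    """Least-squares polynomial fit to sampled values and their estimated derivatives.

    Hypothetical helper: the polynomial hypothesis set and the weighting
    scheme are illustrative assumptions, not taken from the post.
    """
    # Calculus part: finite-difference estimate of the rate of change of y.
    dy = np.gradient(y, x, edge_order=2)

    powers = np.arange(degree + 1)
    # Rows [1, x, x^2, ...]: the hypothesis evaluated at each sample.
    value_rows = x[:, None] ** powers
    # Rows [0, 1, 2x, 3x^2, ...]: the derivative of the hypothesis.
    deriv_rows = np.zeros_like(value_rows)
    deriv_rows[:, 1:] = powers[1:] * x[:, None] ** (powers[1:] - 1)

    # Statistics part: one regression over both kinds of observations.
    design = np.vstack([value_rows, weight * deriv_rows])
    target = np.concatenate([y, weight * dy])
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs  # coefficients in increasing order of power

def predict(coeffs, x):
    """Evaluate the fitted hypothesis g at the points x."""
    return np.polynomial.polynomial.polyval(x, coeffs)

# Example: approximate a target function from 21 samples.
x = np.linspace(-1.0, 1.0, 21)
y = x**3 - 2.0 * x
g = fit_with_derivatives(x, y, degree=3)
```

In this sketch the derivative rows act as extra training constraints, so the fitted curve is penalized for having the wrong slope as well as for missing the sampled values.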

FIG 2 is a flow diagram showing the implementation, with its iterative stages, of the final machine learning hypothesis described in FIG 1.

The final hypothesis is applied to large volumes of data collected from the Internet of Things (IoT) and other Internet sources. In the flow diagram, calculus-based analysis is performed on the data sets first: rates of change are found, and differentiation yields first, second or higher derivatives. Next, statistics-based analysis such as clustering, regression or other generalization techniques is performed on the data sets. This stage finds generalized values that characterize sets or subsets of values inside the large data set. These generalized values are used as key values in the tables of a normalized database, and the data tables are normalized using these keys.
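
The three stages described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the author's implementation: quantile binning stands in for clustering, the table names `rate_group` and `reading` and all function names are hypothetical, and an in-memory SQLite database plays the role of the normalized database.

```python
import sqlite3
import numpy as np

def calculus_stage(t, values):
    """Return first and second derivatives of the readings with respect to t."""
    d1 = np.gradient(values, t)
    d2 = np.gradient(d1, t)
    return d1, d2

def statistics_stage(rates, n_groups=3):
    """Group readings by rate of change and summarize each group by its mean.

    Quantile bins are an assumed stand-in for a clustering technique.
    """
    edges = np.quantile(rates, np.linspace(0.0, 1.0, n_groups + 1)[1:-1])
    labels = np.digitize(rates, edges)  # group index for every reading
    means = {int(g): float(rates[labels == g].mean()) for g in np.unique(labels)}
    return labels, means

def build_normalized_db(t, values, labels, means):
    """Store each generalized value once and reference it by key from the readings."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE rate_group ("
        "group_id INTEGER PRIMARY KEY, mean_rate REAL NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE reading ("
        "reading_id INTEGER PRIMARY KEY, t REAL, value REAL, "
        "group_id INTEGER REFERENCES rate_group(group_id))"
    )
    conn.executemany("INSERT INTO rate_group VALUES (?, ?)", list(means.items()))
    conn.executemany(
        "INSERT INTO reading (t, value, group_id) VALUES (?, ?, ?)",
        [(float(a), float(b), int(g)) for a, b, g in zip(t, values, labels)],
    )
    conn.commit()
    return conn

# Example with a synthetic sensor signal.
t = np.linspace(0.0, 10.0, 101)
values = np.sin(t)
d1, d2 = calculus_stage(t, values)
labels, means = statistics_stage(d1)
conn = build_normalized_db(t, values, labels, means)
```

Because the generalized value of each group lives in its own table, a query can select all readings of one group through the key alone, which is what makes key-based partitioning possible in the next step.
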
FIG 2 further illustrates how data sets are partitioned, based on queries, into normalized data tables. The partitions are formed over primary keys and are used for parallel execution at the next level of machine learning. A condition check then determines whether more processing is needed. If the answer is “Yes”, processing returns to the calculus-based computing stage for one or more further iterations. If the answer is “No”, processing moves on to integration operations in calculus, along with other operations, to generate the results of the analysis in the form of graphs and charts. The integration operation is necessary as the inverse of the derivatives computed at earlier stages of the method. In the CALSTATDN model, parallel execution at several iterative stages improves overall performance by several orders of magnitude.
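
A rough sketch of the partitioned, parallel processing and of integration as the inverse of differentiation is given below. The details are assumptions: the partition keys, the function names and the use of a thread pool are hypothetical, and the “Yes”/“No” condition check is left out because the post does not specify its criterion.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def differentiate(partition):
    """Calculus stage for one partition: rate of change of values over t."""
    t, values = partition
    return t, np.gradient(values, t)

def integrate(t, rates, initial=0.0):
    """Cumulative trapezoidal integration, undoing the differentiation step."""
    steps = 0.5 * (rates[1:] + rates[:-1]) * np.diff(t)
    return initial + np.concatenate(([0.0], np.cumsum(steps)))

def process_partitions(partitions, max_workers=4):
    """Run the calculus stage on every partition in parallel.

    `partitions` maps a key (assumed here to be a primary-key value)
    to a (t, values) pair of arrays.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        derived = list(pool.map(differentiate, partitions.values()))
    return dict(zip(partitions.keys(), derived))

# Example: two partitions processed in parallel, then integrated back.
t = np.linspace(0.0, 2.0, 201)
partitions = {1: (t, t**2), 2: (t, np.sin(t))}
derived = process_partitions(partitions)
recovered = {
    key: integrate(tt, rates, initial=partitions[key][1][0])
    for key, (tt, rates) in derived.items()
}
```

Threads keep the sketch short. For CPU-bound work on large partitions, a process pool or a distributed framework would be a more realistic choice. Integration needs the initial value of each series, since differentiation discards the constant offset.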

The flow diagram of FIG 2 demonstrates iterative stages of calculus-based operations, statistics-based operations and database normalization for querying data partitions, leading to the next stage of parallel computation. These iterative stages are applied to analyze extremely large data sets, with improved performance and higher levels of correctness for machine learning through a reduction in errors.