posted Sep 1, 2015, 10:47 AM by Shyam Sarkar

The CALSTATDN model combines methods of calculus (CAL), statistics (STAT) and database normalization (DN) to process and analyze large volumes of data captured from sensors, embedded devices or any other internet data source within physical, environmental or biological systems as part of the “Internet of Things” (IoT). FIG 1 is a conceptual flow diagram illustrating the machine learning model underlying CALSTATDN. An unknown target function f(X) is shown. Samples are drawn from a probability distribution P over data values X for training purposes. Derivatives, i.e. rates of change of the values in X, are computed and also used for training the model. The training sets are used to train a model toward a correct hypothesis that approximates the function f. The best hypothesis g must fit the training samples well. In the CALSTATDN model, if the hypothesis set H is chosen carefully to include models from both calculus and statistics, then the approximation will be very close to the true unknown function f, provided there are enough training data sets. The best hypothesis g(X) applies both calculus- and statistics-based models with a high level of correctness, leading to a very small estimated error. The CALSTATDN model exploits the powerful ideas of rates of change, differentiation and integration in calculus, along with generalization over sets of values in statistical computing, to derive the best hypothesis g(X) explaining the behavior of any function f(X) with fewer data points, and with fewer generalization errors over unseen data points, than conventional machine learning methods.
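The idea of training on values of X together with their rates of change can be sketched numerically. In the toy example below (the signal, the noise level, and the least-squares fitting routine are all illustrative assumptions, not part of CALSTATDN itself), a linear hypothesis fit on raw values alone explains the target poorly, while adding finite-difference derivatives as features makes the fit nearly exact:

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 10.0, 200)
x = np.sin(t) + 0.01 * rng.standard_normal(t.size)   # noisy observed signal X
y = np.cos(t)                                        # unknown target f to approximate

# Rates of change of X, estimated by central finite differences.
dx = np.gradient(x, t)

def fit_r2(features, target):
    """Least-squares linear fit with intercept; returns the R^2 goodness of fit."""
    A = np.column_stack([features, np.ones(len(target))])
    coef, *_ = np.linalg.lstsq(A, target, rcond=None)
    resid = target - A @ coef
    return 1.0 - resid.var() / target.var()

r2_values = fit_r2(x.reshape(-1, 1), y)              # hypothesis on values only
r2_both = fit_r2(np.column_stack([x, dx]), y)        # values plus derivative

print(f"R^2, values only:         {r2_values:.3f}")
print(f"R^2, values + derivative: {r2_both:.3f}")
```

Here the derivative of the signal carries exactly the information the target needs, so the augmented hypothesis fits with far fewer surprises on unseen points.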

FIG 2 is a flow diagram showing the implementation, in iterative stages, of the final machine learning hypothesis described in FIG 1.

The final hypothesis derived is applied to large volumes of data collected from the Internet of Things (IoT) and other internet sources. In the flow diagram, analysis based on calculus models, such as finding rates of change and performing differentiation operations to find first, second or higher derivatives, is first performed on the data sets. Next, analysis based on statistics, such as clustering, regression or other generalization techniques, is performed on the data sets. This stage finds generalized values that characterize sets or subsets of values inside the large data set. These generalized values are used as key values inside tables in a normalized database. Normalization over the data tables is performed using such keys.
FIG 2 further illustrates partitioning of the data sets, based on queries, into normalized data tables. Data partitions based on partitioning over primary keys are illustrated. These partitions are used for parallel execution at the next level of machine learning. A condition check determines whether more processing is needed. If the answer is “Yes”, the next stage goes back to the calculus-based computing stage for further iteration(s). If the answer is “No”, the next stage performs integration operations in calculus, along with other operations, to generate the results of the analysis in the form of graphs and charts. The integration operation is necessary as the inverse of the derivative computations performed at earlier stages of this method. In the CALSTATDN model, parallel execution at several iterative stages improves overall performance by several orders of magnitude.
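The stages just described can be sketched end to end on a toy sensor stream. Everything below is an illustrative stand-in: the "statistics" stage is reduced to thresholding the derivative into three behavior classes instead of real clustering, and the "normalized database" is just a dictionary keyed by class label:

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 8.0, 200)
x = np.sin(t) + 0.005 * rng.standard_normal(t.size)  # hypothetical IoT sensor stream

# Stage 1 (calculus): rate of change of the signal.
dx = np.gradient(x, t)

# Stage 2 (statistics): generalize derivative values into three behavior
# classes -- falling, flat, rising -- a crude stand-in for clustering.
labels = np.digitize(dx, bins=[-0.5, 0.5])           # 0=falling, 1=flat, 2=rising

# Stage 3 (database normalization): each class label acts as a key of a
# normalized table; a partition is the set of rows sharing that key.
partitions = {key: np.flatnonzero(labels == key) for key in np.unique(labels)}
for key, rows in partitions.items():
    print(f"key={key}: {rows.size} rows")            # could be processed in parallel

# Final stage (calculus, inverse step): integrate the derivative to recover
# the signal up to the constant x[0], mirroring the integration stage.
recovered = x[0] + np.concatenate(
    [[0.0], np.cumsum((dx[1:] + dx[:-1]) / 2 * np.diff(t))])
print("max reconstruction error:", np.max(np.abs(recovered - x)))
```

The trapezoidal cumulative sum at the end is the numerical inverse of the finite-difference derivative from Stage 1, which is why the original stream is recovered closely.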

The flow diagram of FIG 2 demonstrates iterative stages over calculus-based operations, statistics-based operations and database normalization, with querying of data partitions leading to the next stage of parallel computation. These iterative stages are applied to analyze extremely large data sets with substantially improved performance and higher levels of correctness, with reduced error, in machine learning.

The THING in "Internet of Things"

posted Aug 30, 2015, 9:41 PM by A Sarkar   [ updated Aug 31, 2015, 12:41 AM by Shyam Sarkar ]

A thing, in the Internet of things (IoT), is an entity or physical object that has the ability to transfer data over a network. A thing has an identifier. 
I have some questions about such a THING in the "Internet of Things".

IoT is expected to generate large amounts of data from diverse locations, accumulating very quickly, thereby increasing the need to better index, store and process such data.
A thing can be a simple thing, like a sensor, or a composite thing made up of disparate or separate parts.
IoT software vendors might like to master a particular stage of the machine-to-machine (M2M) data flow process, such as: 
    (1) managing the communication with connected devices/sensors or with composite things comprising many sensors/devices;
    (2) providing middleware for integration with data repositories, where definitions for both simple and composite things should be maintained;
    (3) storing data from simple or composite things with static or dynamic relationships;
    (4) securing the simple or composite data; and
    (5) analyzing and visualizing data for all types of things, including composite things.
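The simple/composite distinction above maps naturally onto a composite data structure. The sketch below is only illustrative (the class names, fields and readings are invented, not from any IoT standard): a simple thing has an identifier and reports its own data, while a composite thing also aggregates the reports of its parts:

```python
from dataclasses import dataclass, field

@dataclass
class Thing:
    """A simple thing: an identified entity that can transfer data."""
    identifier: str
    reading: float = 0.0

    def report(self):
        return {self.identifier: self.reading}

@dataclass
class CompositeThing(Thing):
    """A composite thing made up of separate parts, each itself a thing."""
    parts: list = field(default_factory=list)

    def report(self):
        data = {self.identifier: self.reading}
        for part in self.parts:          # recurse into nested things
            data.update(part.report())
        return data

home = CompositeThing("smart-home-01", parts=[
    Thing("thermostat-01", 21.5),
    Thing("door-sensor-01", 0.0),
])
print(home.report())
```

With such a structure, the same object and its interrelationships could in principle flow through all five stages listed above, which is what the closing question asks about.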

Are we going to deal with objects and interrelationships over objects at all levels of the data flow process?

"PAC" (Probably Approximately Correct) Learning and Calculus

posted Aug 29, 2015, 9:57 PM by Shyam Sarkar   [ updated Aug 31, 2015, 12:38 AM ]

The learning processes in machine learning algorithms are generalizations from past experiences. After having experienced a learning data set, the generalization process is the ability of a machine learning algorithm to accurately execute on new examples and tasks. The learner needs to build a general model about a problem space enabling a machine learning algorithm to produce sufficiently accurate predictions in future cases. The training examples come from some generally unknown probability distribution.

In theoretical computer science, computational learning theory performs computational analysis of machine learning algorithms and their performance. The training data set is limited in size and may not capture all forms of distributions in future data sets. Performance is represented by probabilistic bounds, and errors in generalization are quantified by bias-variance decompositions. In computational learning theory, a computation is considered feasible if it can be done in polynomial time. Positive results are established when a certain class of functions can be learned in polynomial time, whereas negative results are established when learning cannot be done in polynomial time.

PAC (Probably Approximately Correct) learning is a framework for mathematical analysis of machine learning theory.  The basic idea of PAC learning is that a really bad hypothesis can be easy to identify.  A bad hypothesis will err on one of the training examples with high probability. A consistent hypothesis will be probably approximately correct. If there are more training examples, then the probability of “approximately correct” becomes much higher. The theory investigates questions about (a) sample complexity: how many training examples are needed to learn a successful hypothesis, (b) computational complexity: how much computational effort is needed to learn a successful hypothesis, and finally (c) bounds for mistakes: how many training examples will the learner misclassify before converging to a successful hypothesis.

Mathematically, let (1) X be the set of all possible examples, (2) D be the probability distribution over X from which observed instances are drawn, (3) C be the set of all possible concepts c, where c: X → {0, 1}, and (4) H ⊆ C be the set of possible hypotheses considered by a learner. The true error of hypothesis h, with respect to the target concept c and observation distribution D, is the probability that h will misclassify an instance drawn according to D:

    error_D(h) = Pr_{x ∼ D} [ c(x) ≠ h(x) ]

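The true error just defined can be estimated empirically by drawing instances from D and counting disagreements between c and h. The concept, hypothesis and distribution below are toy assumptions chosen so the exact error is known (the two thresholds differ on an interval of probability mass 0.05 under the uniform distribution):

```python
import numpy as np

rng = np.random.default_rng(2)

c = lambda x: (x > 0.50).astype(int)    # target concept: threshold at 0.50
h = lambda x: (x > 0.55).astype(int)    # learned hypothesis: threshold at 0.55

x = rng.uniform(0.0, 1.0, 1_000_000)    # instances drawn according to D = U(0, 1)
error = np.mean(c(x) != h(x))           # fraction misclassified

print(f"estimated true error: {error:.4f}")   # analytically, Pr[0.50 < x <= 0.55] = 0.05
```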
The error should be zero in the ideal case. A concept class C is “PAC learnable” by a hypothesis class H if and only if there exists a learning algorithm L such that, given any target concept c in C, any target distribution D over the possible examples X, and any pair of real numbers 0 < ε, δ < 1, L takes as input a training set of m examples drawn according to D, where m is bounded above by a polynomial in 1/ε and 1/δ, and outputs a hypothesis h in H such that, with confidence (probability over all possible choices of the training set) greater than 1 − δ, the error of h is less than ε.

A hypothesis is consistent with the training data if it returns the correct classification for every example presented to it. A consistent learner returns only hypotheses that are consistent with the training data. Given a consistent learner, the number of examples m sufficient to assure that any hypothesis it returns will be probably (with probability 1 − δ) approximately (within error ε) correct is

    m ≥ (1/ε) (ln |H| + ln (1/δ))
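That sufficient sample size for a finite hypothesis space, m ≥ (1/ε)(ln |H| + ln(1/δ)), is easy to compute directly; the example values of |H|, ε and δ below are illustrative:

```python
import math

def pac_sample_bound(eps, delta, hypothesis_space_size):
    """Examples sufficient for a consistent learner over a finite hypothesis
    space to be probably (1 - delta) approximately (within eps) correct."""
    return math.ceil(
        (math.log(hypothesis_space_size) + math.log(1.0 / delta)) / eps)

# e.g. |H| = 2^10 boolean hypotheses, 5% error tolerance, 95% confidence
print(pac_sample_bound(eps=0.05, delta=0.05, hypothesis_space_size=2**10))
```

Note that the bound grows only logarithmically in |H| and 1/δ but linearly in 1/ε, so tightening the error tolerance is what drives up the sample requirement fastest.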

Calculus is an important branch of mathematics that has so far not been considered one of the building blocks of machine learning techniques. Calculus is used in every branch of physical science, computer science, statistics, engineering, economics, business, medicine, meteorology and epidemiology, and in any other field where a problem must be modeled mathematically to derive an optimal solution. It allows one to go from (non-constant) rates of change to the total change, or vice versa. A mathematical model expressed in calculus for a large data set can represent a hypothesis with very low error (ε), or even zero error, in machine learning. A complex hypothesis can be constructed with one or more of its parts represented as calculus-based models.

The fundamental theorem of calculus states that differentiation and integration are inverse operations. More precisely, it relates the values of anti-derivatives to definite integrals. It can also be interpreted as a precise statement of the fact that differentiation is the inverse of integration. In machine learning, if a hypothesis involves models represented in calculus, then complementary processes of differentiation and integration must be involved in the overall learning process. Calculus-based mathematical models can be used as part of a hypothesis for machine learning over a wide variety of data sets derived from devices such as heart monitoring implants, bio-chip transponders on farm animals, electric clams in coastal waters, automobiles with built-in sensors, smart homes, smart cities or airplanes with sensors. These devices or sensors, used inside physical, biological or environmental systems, collect large volumes of data. Efficient machine learning algorithms for such data sets can use hypotheses based on mathematical models involving both calculus and statistics.
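The inverse relationship can be checked numerically on sampled data: differentiating a sampled function and then integrating the derivative back recovers the original, up to the constant of integration f(a). The damped sine below is an arbitrary illustrative choice:

```python
import numpy as np

t = np.linspace(0.0, 2.0 * np.pi, 1000)
f = np.sin(t) * np.exp(-0.2 * t)               # sampled function

df = np.gradient(f, t)                         # differentiation

# Trapezoidal cumulative integral of df, anchored at f(t[0]) -- the
# discrete analogue of the fundamental theorem of calculus.
F = f[0] + np.concatenate(
    [[0.0], np.cumsum((df[1:] + df[:-1]) / 2 * np.diff(t))])

print("max |recovered - original|:", np.max(np.abs(F - f)))
```

The residual is small and shrinks with the sampling step, which is the discrete counterpart of the exact statement F(t) = f(a) + ∫ₐᵗ f′(s) ds.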
