Can constraints on input data be learned while controlling for the output?

felvel

Hey guys,

I have a problem that is (for me) very tough, so I decided to ask the pattern recognition wizards in this forum. Although I'm neither a mathematician nor a computer scientist, I think my mathematical background is decent enough to describe the problem in precise terms, so here goes:

For each experiment (condition) k = 1,…,m there are j observations of each of n continuous input variables X1,…,Xn, paired with the corresponding j observations of an output variable Yk. It is known that X1,…,Xn >= 0 in every condition, and that Y depends on the input variables in the same manner each time, i.e. Yk = f(X1,…,Xn) for all k. So, although the function f is not defined explicitly (it's probably linear in the input values, though), it is known to be the same for all experiments. Now, here's the question: is there a way to recognize an underlying pattern (if one exists, of course) in the variables X1,…,Xn that is conserved across all m conditions and is *not* simply the result of their known contribution to the output variable Y?

In other words, is there a way to recognize/learn from the data non-trivial constraint(s) on X1,…,Xn (or at least on some of the input variables) to which f(X1,…,Xn), as expressed by the output data Yk, is subject in all m conditions? When I say 'recognize/learn', I mean that I only need to be able to assert, with statistical significance, that such an implicit pattern or constraint exists. Deducing it explicitly would be even better, but I don't know if that's possible to begin with.

After studying Bishop's book, I thought that tree-based models or multivariate regression could help. Treating X1,…,Xn as binary variables (0 for zero values, 1 for non-zero values) is certainly an acceptable simplification in this case, but I cannot assume a priori that X1,…,Xn depend on each other (that would be a possible outcome at best). On the other hand, I thought of using the residuals of a multivariate regression for each condition k, but that gives me "what's left of Yk that's not explained linearly by X1,…,Xn" for each k. What I need instead is to find a pattern in "what's left of X1,…,Xn after explaining the Y values" for each k.
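
To make that last idea concrete, here is a minimal sketch of what I have in mind, in Python with NumPy. Everything in it is an assumption for illustration only: the synthetic data, the coefficients in `beta`, the built-in link between X1 and X3, and the helper names `residualize_on_y` and `residual_correlations` are all made up. Within each experiment it regresses every Xi on Y (with an intercept), keeps the residuals, and then compares the residual correlation matrices across experiments. One caveat I'm aware of: removing Y from each Xi can make the residuals correlated with each other through f alone, so a non-zero residual correlation is not by itself evidence of an extra constraint.

```python
import numpy as np

def residualize_on_y(X, y):
    """Return the part of each column of X not linearly explained by y."""
    # Design matrix with an intercept and y as the only predictor.
    A = np.column_stack([np.ones_like(y), y])
    # One least-squares fit per column of X (coef has shape (2, n)).
    coef, *_ = np.linalg.lstsq(A, X, rcond=None)
    return X - A @ coef

def residual_correlations(experiments):
    """experiments: list of (X, y) with X of shape (j, n) and y of shape (j,).
    Returns an array of shape (m, n, n) of residual correlation matrices."""
    return np.stack([
        np.corrcoef(residualize_on_y(X, y), rowvar=False)
        for X, y in experiments
    ])

# Synthetic stand-in for the real data (hypothetical values throughout).
rng = np.random.default_rng(0)
beta = np.array([1.0, 2.0, 0.5])          # assumed shared linear f
experiments = []
for _ in range(4):                        # m = 4 experiments
    X = rng.uniform(0.0, 1.0, size=(200, 3))      # j = 200, n = 3, all >= 0
    X[:, 2] = 0.8 * X[:, 0] + 0.2 * X[:, 2]       # assumed hidden link X1 -> X3
    y = X @ beta + rng.normal(0.0, 0.5, size=200)
    experiments.append((X, y))

corr = residual_correlations(experiments)
mean_corr = corr.mean(axis=0)   # average residual correlation across experiments
spread = corr.std(axis=0)       # how much it varies between experiments

print(np.round(mean_corr, 2))
print(np.round(spread, 2))
```

A pattern that is stable across experiments would show up as entries of `mean_corr` far from zero with a small `spread`. In this toy data the X1/X2 entry also comes out negative even though X1 and X2 were generated independently, which is the effect of f alone. Whether a real pattern can be told apart from that effect is exactly the part I don't know how to test.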

I'm sorry for being so wordy, but I needed to describe the problem as well as possible. Perhaps I misunderstood some method in the book that could be useful, or maybe you think a different technique should be used (if there is one). All advice and thoughts are most welcome.

Thanks a lot in advance.