FINAL EXAM for 4721, S99.

This is an open book exam, not intended to take more than 3 hours, but you are permitted to use as much time as you like. Due May 10, under Eric's door, 703 CEPSR.

MANDATORY: Write on your hand-in, "I promise I have done this exam completely on my own, without any consultation with anyone besides Eric Siegel or Jeff Sherwin," and sign it. Note that your instructor has prosecuted students who infringed on this agreement in the past. Feel free to ask Jeff or me questions while you are working on the exam.

------------------------------------------------------------------------

QUESTION 1

Given the following events in a lifetime:

    Grammar School
    Recess
    Learn ABC's
    Secondary School
    College Application Process
    Standardized Tests
    College Acceptance
    Undergraduate
    Cram for Finals
    Semester Abroad
    Find One's Self
    Graduate School
    TA for Eric's class
    Thesis Defense

Draw an Allen temporal network that breaks this (or a domain of your choosing) up hierarchically into a few sub-networks (note: you do not need to read the extra parts of the Allen paper to get this concept). Assume each event corresponds to one interval (i.e., each happens only once). The stipulation is that each subset denotes one *reference interval*. Reference intervals are intervals that may "bridge" two sets of interconnected intervals (i.e., sub-nets), thus putting one sub-space of time in relation to another sub-space of time. It then follows that a change in sub-space A may or may not affect sub-space B, depending on whether the added constraints in A ever propagate to the bridge node (i.e., reference interval) in A. Identify sub temporal spaces, and their bridges to other sub temporal spaces. Please label each node in the format "intervalName(referenceIntervalName)" and each arc with a list of constraints. Assume everything in a sub-net is during the reference interval -- the reference interval covers a time span that includes everything in the sub-net.
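For concreteness, here is a toy sketch (not part of the required answer) of how constraints compose in an Allen-style network. It uses only three of Allen's thirteen relations -- before, after, during -- with the full composition table projected onto that subset; the interval names follow the example events above.

```python
# Toy sketch of Allen-style constraint composition, restricted to three
# of Allen's thirteen relations: before, after, during.  Each table entry
# is the full Allen composition projected onto this subset.

COMPOSE = {
    ("before", "before"): {"before"},
    ("before", "after"):  {"before", "after", "during"},  # unconstrained
    ("before", "during"): {"before", "during"},
    ("after",  "before"): {"before", "after", "during"},  # unconstrained
    ("after",  "after"):  {"after"},
    ("after",  "during"): {"after", "during"},
    ("during", "before"): {"before"},
    ("during", "after"):  {"after"},
    ("during", "during"): {"during"},
}

def compose(s1, s2):
    """Possible X-Z relations, given the sets of X-Y and Y-Z relations."""
    out = set()
    for r1 in s1:
        for r2 in s2:
            out |= COMPOSE[(r1, r2)]
    return out

# Reference-interval bridging: Recess is during its reference interval
# Grammar School, and Grammar School is before College, so a constraint
# between reference intervals propagates down to the sub-net members.
recess_vs_school = {"during"}    # Recess during Grammar School
school_vs_college = {"before"}   # Grammar School before College
print(compose(recess_vs_school, school_vs_college))  # -> {'before'}
```

Note how the third example shows the bridging at work: a member of one sub-net becomes related to another sub-net purely through the two reference intervals.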
Add another interval to one of the sub-nets and explain why it would or would not affect relations outside of its temporal sub-space. Explain in general when and whether constraint propagation needs to "bridge a gap," influencing relations across sub-spaces. From both a representation and an implementation standpoint, please explain a few advantages and disadvantages of considering such sub-spaces. How would a modified propagation algorithm make use of these advantages while keeping the problem just as solvable as without sub-spaces? As part of your answer, please explain the new algorithm and, as a result, any differences in time and space complexity. Feel free to draw pictures to help explain your answer.

(Thanks to Jeff Sherwin for the first draft of this exam question.)

------------------------------------------------------------------------

QUESTION 2

Quantitative (or metric) temporal constraints are between individual time points (not intervals), and indicate a simple constraint on the maximum or minimum distance between these time points, e.g., time point x is at most 3 time units before y. Unlike qualitative (Allen) constraints, assume no disjunctions are allowed.

A. Provide a detailed domain example in which it is useful to be able to represent and reason about both qualitative (Allen) temporal constraints and metric constraints. The example must motivate the need for both. The second-to-last video of the semester has a presentation on combining the two into a hybrid model, although you probably don't need to review it to answer this question.

B. In a couple of sentences, abstract from this example to describe the general types of domain situations that require this kind of approach.

C. Give two example limitations of the hybrid approach -- what still cannot be represented and/or reasoned about?
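As background (not required for the answer), non-disjunctive metric constraints of the kind described above form a simple temporal network: each constraint "y - x <= d" becomes a weighted edge, and the network is consistent exactly when the graph has no negative cycle. A minimal sketch, with made-up time points and bounds:

```python
import math

# Each metric constraint "t[y] - t[x] <= d" becomes an edge x -> y with
# weight d.  "x is at most 3 units before y" gives t[y] - t[x] <= 3.
# The network is consistent iff all-pairs shortest paths yield no
# negative cycle (no negative entry on the diagonal).

def stn_consistent(n, constraints):
    """constraints: list of (x, y, d) meaning t[y] - t[x] <= d."""
    dist = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for x, y, d in constraints:
        dist[x][y] = min(dist[x][y], d)
    # Floyd-Warshall all-pairs shortest paths.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return all(dist[i][i] >= 0 for i in range(n))

# t1 - t0 <= 3 together with t0 - t1 <= -5 forces t1 - t0 >= 5: inconsistent.
print(stn_consistent(2, [(0, 1, 3), (1, 0, -5)]))  # -> False
print(stn_consistent(2, [(0, 1, 3), (1, 0, 0)]))   # -> True
```

Qualitative Allen constraints cannot express the "3 units" bound, and the metric graph cannot express disjunctive interval relations -- which is one way to see why a domain might need both.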
------------------------------------------------------------------------

QUESTION 3

The basic belief network (BN) algorithm in Russell and Norvig for general inference over polytree BNs (Section 15.3) takes a set of evidence as input (a conjunction of known values for a subset of variables), and computes the conditional probabilities for some individual variable (the query variable) over its possible values, given the evidence.

A. Expand this algorithm to accept an arbitrary boolean expression as the evidence (the query is still a single variable). Hints:

   0. You don't need to change the textbook's algorithm -- just use it as a subroutine.
   1. Assume you have an algorithm to convert an arbitrary boolean expression into disjunctive normal form.
   2. Describe and use a (sub-)algorithm that converts a query P(Q|E) (where E is a boolean expression in disjunctive normal form, and Q is a literal consisting of one variable) into an arithmetic expression of probabilities and conditional probabilities, each of which has no disjunction in the condition (i.e., only NOT and conjunction on the right side of the vertical bar "|").

B. How many times must the book's polytree algorithm be invoked by your algorithm, in terms of some measure (your choice) of the complexity of the boolean expression that describes the evidence?

C. In a couple of sentences, describe a domain example in which your new algorithm would be useful.

------------------------------------------------------------------------

QUESTION 4

A. In about 5-8 sentences of detail, generalize the backpropagation algorithm for neural network induction by allowing it to concurrently induce its own step size (i.e., learning rate). This is different from momentum, because it is the rate of decrease in error (the gradient of the gradient -- a meta-gradient) that controls the change to the step size. Thus, it is learning the weights and simultaneously learning the best learning rate -- a type of meta-learning.
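To fix intuition for the meta-gradient idea (this is one illustrative scheme, in the delta-bar-delta/Rprop family, not the intended answer -- all constants here are arbitrary choices), here is a sketch on a single parameter with a quadratic error surface:

```python
# One illustrative scheme for learning a per-parameter step size alongside
# the parameter itself.  The meta-gradient is proxied by the sign agreement
# of successive gradients: same sign -> error still falling in this
# direction, so grow the rate; sign flip -> we overshot a minimum, so
# shrink it.  The constants (1.2, 0.5, lr0=0.1) are arbitrary.

def adaptive_descent(grad, w0, steps=50, lr0=0.1, up=1.2, down=0.5):
    w, lr, prev_g = w0, lr0, 0.0
    for _ in range(steps):
        g = grad(w)
        if g * prev_g > 0:      # consecutive gradients agree in sign
            lr *= up
        elif g * prev_g < 0:    # sign flip: step was too large
            lr *= down
        w -= lr * g
        prev_g = g
    return w, lr

# Error E(w) = (w - 2)^2, so dE/dw = 2*(w - 2); the minimum is at w = 2.
w, lr = adaptive_descent(lambda w: 2 * (w - 2), w0=0.0)
print(f"w = {w:.3f} (minimum at 2.0), final rate = {lr:.3f}")
```

Note that the rate here is per-parameter; the question asks for a rate per node, one of several reasonable granularities.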
Keep a separate learning rate at each node of the network. What are the initial learning rate settings? Are they all the same? How is the meta-gradient measured? How is the learning rate updated? There is more than one way to do this.

B. In 2-4 sentences, how might this influence backprop's automatic delegation of "sub-tasks" to different internal nodes?

C. In 2-4 sentences, how could this approach be evaluated -- what set of experiments would make it possible to conclude whether it helps?

------------------------------------------------------------------------

QUESTION 5

Consider designing an automatic induction method to derive a hypothesis in a Turing-complete representation. This representation could be a Turing machine, some variation thereof, or a programming language. Clearly, the induction method would depend on the representation choice.

A. Describe 2 pros and 2 cons of having such a powerful hypothesis representation.

B. One con (which you can't use :) is the halting problem -- when evaluating a hypothesis (by trying it out), how do you know if it goes into an infinite loop? Describe 2 methods to deal with this.

C. Describe your technical representation choice for how a hypothesis would access its input data (i.e., a training or testing example). Note that this data could be a scalably large array. Would it be random access, or like a Turing machine tape head? Justify your answer with a few sentences describing an example domain, and how learning might proceed if the learning method made random mutations to the hypothesis, keeping only those that improve performance over training cases. (In describing this, you are in a sense forced to visualize certain features of the error surface over the space of Turing machines! -- may the force be with you.)

D. If you implemented such an algorithm with such-and-such success, by what method could you evaluate whether such a powerful hypothesis representation was actually needed?
Showing that a less powerful hypothesis representation can solve the problem would show that Turing-completeness was not needed; but if it cannot, what could theoretically or empirically support the need for Turing-completeness?

------------------------------------------------------------------------

QUESTION 6

I am applying neural network (NN) backpropagation to a classification problem over a set of supervised training examples. I also have a separate set of supervised testing examples. I run the NN for 100 epochs. At the end of each epoch, I evaluate the NN over the test cases. However, the testing performance does not influence learning at all -- weight adjustments are based solely on performance on training cases. At the end of the 100 epochs, as output from the learning algorithm, I output the best hypothesis I have seen, i.e., the one that scored highest over the test cases, as well as its test performance.

There is a problem with this technique: the performance measure of this best individual (over the test cases) may be biased, and may not accurately estimate its performance in general. Why is this? Note that performance over the test cases never influenced the operations of backprop, which is a good thing. Answer in 1-3 sentences.

(FYI, beware: this type of mistake is actually made in some research papers!)
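The bias in question can be seen numerically. In the sketch below (illustrative only; the "hypotheses" are random guessers rather than trained networks, so every one of them has true accuracy 0.5), picking the best of 100 by its score on a fixed test set and then reporting that same score overstates the winner's real performance:

```python
import random

# 100 "hypotheses" that all guess labels at random: true accuracy is 0.5
# for every one of them.  Selecting the one that scores best on a fixed
# 50-case test set, then reporting that same test score, inflates the
# reported performance well above 0.5.

def accuracy_of_random_guesser(rng, n_cases):
    # Each case is guessed correctly with probability 0.5 (a coin flip).
    return sum(rng.random() < 0.5 for _ in range(n_cases)) / n_cases

rng = random.Random(0)
test_scores = [accuracy_of_random_guesser(rng, 50) for _ in range(100)]
best = max(test_scores)                      # score used to pick the winner
fresh = accuracy_of_random_guesser(rng, 50)  # the winner on unseen cases
print(f"best-of-100 test score: {best:.2f}, fresh-data score: {fresh:.2f}")
```

The best-of-100 score lands well above 0.5 while the fresh-data score stays near 0.5 -- the test set has been used for model selection, so its score on the selected model is no longer an unbiased estimate.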