FINAL EXAM for 4721, S99.

This is an open book exam, not intended to take more than 3 hours, but you are permitted to use as much time as you like. Due May 10, under Eric's door, 703 CEPSR.

MANDATORY: Write on your hand-in, "I promise I have done this exam completely on my own, without any consultation with anyone besides Eric Siegel or Jeff Sherwin," and sign it. Note that your instructor has prosecuted students who infringed on this agreement in the past. Feel free to ask Jeff or me questions while you are working on the exam.

------------------------------------------------------------------------

QUESTION 1

Given the following events in a lifetime:

    Grammar School
    Recess
    Learn ABC's
    Secondary School
    College Application Process
    Standardized Tests
    College Acceptance
    Undergraduate
    Cram for Finals
    Semester Abroad
    Find One's Self
    Graduate School
    TA for Eric's class
    Thesis Defense

Draw an Allen temporal network that breaks this (or a domain of your choosing) up hierarchically into a few sub-networks (note: you do not need to read the extra parts of the Allen paper to get this concept). Assume each event corresponds to one interval (i.e., each happens only once). The stipulation is that each subset denotes one *reference interval*. Reference intervals are intervals that may "bridge" two sets of interconnected intervals (i.e., sub-nets), thus putting one sub-space of time in relation to another sub-space of time. It then follows that a change in sub-space A may or may not affect sub-space B, depending on whether the added constraints in A ever propagate to the bridge node (i.e., reference interval) in A. Identify sub temporal spaces, and their bridges to other sub temporal spaces. Please label each node in the format "intervalName(referenceIntervalName)" and each arc with a list of constraints. Assume everything in a sub-net is during the reference interval -- the reference interval covers a time span that includes everything in the sub-net.
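For concreteness, here is a toy sketch (not part of the required answer) of how constraints compose in an Allen-style network. It uses only three of Allen's thirteen relations -- before, after, during -- with the full composition table projected onto that subset; the interval names follow the example events above.

```python
# Toy sketch of Allen-style constraint composition, restricted to three
# of Allen's thirteen relations: before, after, during.  Each table entry
# is the full Allen composition projected onto this subset.

COMPOSE = {
    ("before", "before"): {"before"},
    ("before", "after"):  {"before", "after", "during"},  # unconstrained
    ("before", "during"): {"before", "during"},
    ("after",  "before"): {"before", "after", "during"},  # unconstrained
    ("after",  "after"):  {"after"},
    ("after",  "during"): {"after", "during"},
    ("during", "before"): {"before"},
    ("during", "after"):  {"after"},
    ("during", "during"): {"during"},
}

def compose(s1, s2):
    """Possible X-Z relations, given the sets of X-Y and Y-Z relations."""
    out = set()
    for r1 in s1:
        for r2 in s2:
            out |= COMPOSE[(r1, r2)]
    return out

# Reference-interval bridging: Recess is during its reference interval
# Grammar School, and Grammar School is before College, so a constraint
# between reference intervals propagates down to the sub-net members.
recess_vs_school = {"during"}    # Recess during Grammar School
school_vs_college = {"before"}   # Grammar School before College
print(compose(recess_vs_school, school_vs_college))  # -> {'before'}
```

Note how the third example shows the bridging at work: a member of one sub-net becomes related to another sub-net purely through the two reference intervals.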
Add another interval to one of the sub-nets and explain why it would or would not affect relations outside of its temporal sub-space. Explain in general when and whether constraint propagation needs to "bridge a gap," influencing relations across sub-spaces. From both a representation and an implementation standpoint, please explain a few advantages and disadvantages of considering such sub-spaces. How would a modified propagation algorithm make use of these advantages while keeping the problem just as solvable as without sub-spaces? As part of your answer, please explain the new algorithm and, as a result, any differences in time and space complexity. Feel free to draw pictures to help explain your answer.

(Thanks to Jeff Sherwin for the first draft of this exam question.)

------------------------------------------------------------------------

QUESTION 2

Quantitative (or metric) temporal constraints are between individual time points (not intervals), and indicate a simple constraint on the maximum or minimum distance between these time points, e.g., time point x is at most 3 time units before y. Unlike qualitative (Allen) constraints, assume no disjunctions are allowed.

A. Provide a detailed domain example in which it is useful to be able to represent and reason about both qualitative (Allen) temporal constraints and metric constraints. The example must motivate the need for both. The second-to-last video of the semester has a presentation on combining the two into a hybrid model, although you probably don't need to review it to answer this question.

B. In a couple of sentences, abstract from this example to describe the general types of domain situations that require this kind of approach.

C. Give two example limitations of the hybrid approach -- what still cannot be represented and/or reasoned about?
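As background (not required for the answer), non-disjunctive metric constraints of the kind described above form a simple temporal network: each constraint "y - x <= d" becomes a weighted edge, and the network is consistent exactly when the graph has no negative cycle. A minimal sketch, with made-up time points and bounds:

```python
import math

# Each metric constraint "t[y] - t[x] <= d" becomes an edge x -> y with
# weight d.  "x is at most 3 units before y" gives t[y] - t[x] <= 3.
# The network is consistent iff all-pairs shortest paths yield no
# negative cycle (no negative entry on the diagonal).

def stn_consistent(n, constraints):
    """constraints: list of (x, y, d) meaning t[y] - t[x] <= d."""
    dist = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for x, y, d in constraints:
        dist[x][y] = min(dist[x][y], d)
    # Floyd-Warshall all-pairs shortest paths.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return all(dist[i][i] >= 0 for i in range(n))

# t1 - t0 <= 3 together with t0 - t1 <= -5 forces t1 - t0 >= 5: inconsistent.
print(stn_consistent(2, [(0, 1, 3), (1, 0, -5)]))  # -> False
print(stn_consistent(2, [(0, 1, 3), (1, 0, 0)]))   # -> True
```

Qualitative Allen constraints cannot express the "3 units" bound, and the metric graph cannot express disjunctive interval relations -- which is one way to see why a domain might need both.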
------------------------------------------------------------------------

QUESTION 3

The basic belief network (BN) algorithm in Russell and Norvig for general inference over polytree BNs (Section 15.3) takes a set of evidence as input (a conjunction of known values for a subset of variables), and computes the conditional probabilities for some individual variable (the query variable) over its possible values, given the evidence.

A. Expand this algorithm to accept an arbitrary boolean expression as the evidence (the query is still a single variable). Hints:

   0. You don't need to change the textbook's algorithm -- just use it as a subroutine.
   1. Assume you have an algorithm to convert an arbitrary boolean expression into disjunctive normal form.
   2. Describe and use a (sub-)algorithm that converts a query P(Q|E) (where E is a boolean expression in disjunctive normal form, and Q is a literal consisting of one variable) into an arithmetic expression of probabilities and conditional probabilities, each of which has no disjunction in the condition (i.e., only NOT and conjunction on the right side of the vertical bar "|").

B. How many times must the book's polytree algorithm be invoked by your algorithm, in terms of some measure (your choice) of the complexity of the boolean expression that describes the evidence?

C. In a couple of sentences, describe a domain example in which your new algorithm would be useful.

------------------------------------------------------------------------

QUESTION 4

A. In about 5-8 sentences of detail, generalize the backpropagation algorithm for neural network induction by allowing it to concurrently induce its own step size (i.e., learning rate). This is different from momentum, because it is the rate of decrease in error (the gradient of the gradient -- a meta-gradient) that controls the change to the step size. Thus, it is learning the weights and simultaneously learning the best learning rate -- a type of meta-learning.
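To fix intuition for the meta-gradient idea (this is one illustrative scheme, in the delta-bar-delta/Rprop family, not the intended answer -- all constants here are arbitrary choices), here is a sketch on a single parameter with a quadratic error surface:

```python
# One illustrative scheme for learning a per-parameter step size alongside
# the parameter itself.  The meta-gradient is proxied by the sign agreement
# of successive gradients: same sign -> error still falling in this
# direction, so grow the rate; sign flip -> we overshot a minimum, so
# shrink it.  The constants (1.2, 0.5, lr0=0.1) are arbitrary.

def adaptive_descent(grad, w0, steps=50, lr0=0.1, up=1.2, down=0.5):
    w, lr, prev_g = w0, lr0, 0.0
    for _ in range(steps):
        g = grad(w)
        if g * prev_g > 0:      # consecutive gradients agree in sign
            lr *= up
        elif g * prev_g < 0:    # sign flip: step was too large
            lr *= down
        w -= lr * g
        prev_g = g
    return w, lr

# Error E(w) = (w - 2)^2, so dE/dw = 2*(w - 2); the minimum is at w = 2.
w, lr = adaptive_descent(lambda w: 2 * (w - 2), w0=0.0)
print(f"w = {w:.3f} (minimum at 2.0), final rate = {lr:.3f}")
```

Note that the rate here is per-parameter; the question asks for a rate per node, one of several reasonable granularities.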
Keep a separate learning rate at each node of the network. What are the initial learning rate settings? Are they all the same? How is the meta-gradient measured? How is the learning rate updated? There is more than one way to do this.

B. In 2-4 sentences, how might this influence backprop's automatic delegation of "sub-tasks" to different internal nodes?

C. In 2-4 sentences, how could this approach be evaluated -- what set of experiments would make it possible to conclude whether it helps?

------------------------------------------------------------------------

QUESTION 5

Consider designing an automatic induction method to derive a hypothesis in a Turing-complete representation. This representation could be a Turing machine, some variation thereof, or a programming language. Clearly, the induction method would depend on the representation choice.

A. Describe 2 pros and 2 cons of having such a powerful hypothesis representation.

B. One con (which you can't use :) is the halting problem -- when evaluating a hypothesis (by trying it out), how do you know if it goes into an infinite loop? Describe 2 methods to deal with this.

C. Describe your technical representation choice for how a hypothesis would access its input data (i.e., a training or testing example). Note that this data could be a scalably large array. Would it be random access, or like a Turing machine tape head? Justify your answer with a few sentences describing an example domain, and how learning might proceed if the learning method made random mutations to the hypothesis, keeping only those that improve performance over training cases. (In describing this, you are in a sense forced to visualize certain features of the error surface over the space of Turing machines! -- may the force be with you.)

D. If you implemented such an algorithm with such-and-such success, by what method could you evaluate whether such a powerful hypothesis representation was actually needed?
Showing that a less powerful hypothesis representation can solve the problem would show that Turing-completeness was not needed; but if it cannot, what could theoretically or empirically support the need for Turing-completeness?

------------------------------------------------------------------------

QUESTION 6

I am applying neural network (NN) backpropagation to a classification problem over a set of supervised training examples. I also have a separate set of supervised testing examples. I run the NN for 100 epochs. At the end of each epoch, I evaluate the NN over the test cases. However, the testing performance does not influence learning at all -- weight adjustments are based solely on performance on training cases. At the end of the 100 epochs, as output from the learning algorithm, I output the best hypothesis I have seen, i.e., the one that scored highest over the test cases, as well as its test performance.

There is a problem with this technique: the performance measure of this best individual (over the test cases) may be biased, and may not accurately estimate its performance in general. Why is this? Note that performance over the test cases never influenced the operations of backprop, which is a good thing. Answer in 1-3 sentences.

(FYI, beware: this type of mistake is actually made in some research papers!)
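The bias in question can be seen numerically. In the sketch below (illustrative only; the "hypotheses" are random guessers rather than trained networks, so every one of them has true accuracy 0.5), picking the best of 100 by its score on a fixed test set and then reporting that same score overstates the winner's real performance:

```python
import random

# 100 "hypotheses" that all guess labels at random: true accuracy is 0.5
# for every one of them.  Selecting the one that scores best on a fixed
# 50-case test set, then reporting that same test score, inflates the
# reported performance well above 0.5.

def accuracy_of_random_guesser(rng, n_cases):
    # Each case is guessed correctly with probability 0.5 (a coin flip).
    return sum(rng.random() < 0.5 for _ in range(n_cases)) / n_cases

rng = random.Random(0)
test_scores = [accuracy_of_random_guesser(rng, 50) for _ in range(100)]
best = max(test_scores)                      # score used to pick the winner
fresh = accuracy_of_random_guesser(rng, 50)  # the winner on unseen cases
print(f"best-of-100 test score: {best:.2f}, fresh-data score: {fresh:.2f}")
```

The best-of-100 score lands well above 0.5 while the fresh-data score stays near 0.5 -- the test set has been used for model selection, so its score on the selected model is no longer an unbiased estimate.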