Separating the evaluation into components and developing measures for each component. Metrics that can be obtained for individual output instances include fidelity to the source content, fluency, grammaticality, and style. Other metrics apply at the system or task level, such as coverage, variability, and development time. Ad hoc measures of the kind Matthew mentioned can also be useful, but I will argue that standardizing the measures, and identifying the distinct aspects of output quality, will facilitate objective measurement of progress. One goal would be to reduce, quantify, and control the effects of human evaluators; another would be to establish a basis for comparing the benefits offered by statistical generation techniques against traditional methods.

Integration of statistical and knowledge-based models. I believe there is an opportunity to go beyond the initial statistics-does-all approach by judiciously re-inserting traditional generation knowledge models and mechanisms into the process. I will give a couple of examples where such knowledge would help a statistical model, and where obtaining the same performance within the statistical paradigm alone would be less efficient than the integrated approach.
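The component separation above can be sketched as follows. This is a minimal illustration, not a proposed metric suite: the scorers (fluency, grammaticality, fidelity) are hypothetical placeholders standing in for whatever human ratings or automatic measures a standardized evaluation would adopt; only the structure, reporting each quality aspect separately rather than a single ad hoc number, is the point.

```python
# Minimal sketch of component-wise evaluation. The three scorers below
# are toy placeholders (assumptions), not real metrics.
from statistics import mean

def fluency(text):
    # Placeholder: a real measure might be a language-model score.
    return 1.0 if text.endswith(".") else 0.5

def grammaticality(text):
    # Placeholder: a real measure might be a parser-based check.
    return 1.0 if text[:1].isupper() else 0.5

def fidelity(text, source_facts):
    # Placeholder: fraction of source facts realized in the output.
    covered = sum(1 for fact in source_facts if fact in text)
    return covered / len(source_facts)

def evaluate(text, source_facts):
    """Report each quality component separately, plus a summary score."""
    scores = {
        "fidelity": fidelity(text, source_facts),
        "fluency": fluency(text),
        "grammaticality": grammaticality(text),
    }
    scores["overall"] = mean(scores.values())
    return scores

report = evaluate("John scored a goal.", ["John", "goal"])
```

Keeping the components separate also serves the stated goal of controlling evaluator effects: disagreement between human judges can be quantified per component instead of being averaged away in one opaque score.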
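One common shape for the integration argued for above is knowledge-based filtering of a statistical n-best list. The sketch below is an assumption-laden illustration: the candidate list, scores, and the toy agreement check are all invented for the example. It shows how a small piece of traditional grammatical knowledge can rule out a candidate that a pure statistical model would need much more training data to reject.

```python
# Minimal sketch of hybrid generation: a hypothetical statistical
# generator proposes scored candidates; a knowledge-based constraint
# prunes them before the statistical score ranks the survivors.

CANDIDATES = [  # hypothetical (text, statistical score) n-best list
    ("The results was significant.", 0.41),
    ("The results were significant.", 0.39),
]

def agrees(text):
    """Toy knowledge-based check: plural subject 'results' forbids 'was'."""
    words = text.split()
    if "results" in words:
        return "was" not in words
    return True

def pick(candidates):
    # Knowledge model filters; if it rejects everything, fall back to
    # the unfiltered list so the system still produces output.
    valid = [c for c in candidates if agrees(c[0])] or candidates
    return max(valid, key=lambda c: c[1])[0]

best = pick(CANDIDATES)
```

Here the statistically higher-scoring candidate is ungrammatical; the constraint removes it cheaply, whereas matching that behavior within the statistical paradigm alone would require enough data for the model to learn the agreement pattern itself.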