
I've always been especially interested in unsupervised learning, with pattern-recognition approaches at the center of my modeling experiments.
The unsupervised mode of learning is especially valuable because supervised learning requires that the "student" follow some expert's verbalization of the available "wisdom" in the relevant subject domain. And that wisdom is often incomplete, faulty or poorly articulated. How much and how well even animals can learn without a teacher has provided continuing stimulus for a search for simple learning rules.
My professional interest in Pattern-Recognition and Artificial Neural Networks (ANN's) goes back to my graduate studies at Columbia University. Some of the connections and motivations are reviewed briefly in the following excerpt from my "Tenuous but Contingent Connections":
"By 1965, we had completed our work under the DRB (Diagnostic Research Branch of the NIH) contract and had published our Disc Electrophoresis papers [72, 73]. I had also published a paper which grew out of my summary to DRB on my work on pattern-recognition algorithms designed to perform automated diagnosis from electropherograms of serum proteins [76]. This paper, which I consider my most important intellectual accomplishment, was a distillation of a number of activities and interests: As a graduate student at Columbia, I had taken a course, with L. Zadeh (now at the University of California at Berkeley), on Claude Shannon's then brand new "information theory" [77], because it somehow seemed relevant to the problems in which I was deeply immersed, of analyzing multiple measurements on microscopic images. The cytoanalyzer and flying-spot microscopes were ways to collect such data rapidly. But it was not clear what might be optimal stategies for reducing the mass of numbers to a meaningful few. Later, as Tanimoto and I commuted to work together, we discussed his work on numerical taxonomy and his and BJ's thoughts on the Zworykin group's automated diagnosis effort. Our high-resolution electropherograms suddenly seemed like ideal one-dimensional models for the much more complex two-dimensional images of cells, for exploring the potentials of numerical taxonomy, pattern-recognition and automated diagnosis, which we were beginning to view as different versions of the same problem."
The Abstract and the Summary of the paper referred to above, appear below, and the original document can be downloaded here:
Ornstein,
L."Computer Learning and the Scientific Method: A Proposed Solution
to the Information Theoretical Problem of Meaning"(540K)J.
Mount Sinai Hosp. 32, 437 - 494 (1965).
ABSTRACT:
This discussion outlines and implements the theory of an inductive inference technique that automatically discovers classes among large numbers of input patterns, generates operational definitions of class membership with explicit levels of confidence, creates a continuously up-dated "self-organized" coded hierarchical taxonomic classification of patterns, and recognizes to which already discovered class or classes, if any, a new input belongs in an information theoretically efficient way. Relationships to the "scientific method" and learning are discussed.
The purpose of this paper has been to examine some aspects of the general problem of learning in terms of pattern recognition.
In our approach, we start with a detailed "image" of each pattern of the set, placing a minimum number of arbitrary constraints on the limits of the raw data, (e.g., setting only a maximum level of resolution and the outer bounds of each image field). The data points of the image or pattern are then represented as the coordinates of each pattern, and the pattern itself, as a point in an n-dimensional space (hyper-space), and a semi-metric measure of "similarity" of pattern to pattern is used to define a distance between samples and characterizes this hyperspace as a semi-metric space (rather than a metric space). The semi-metric distances between patterns are examined and the largest "natural" constellations or clusters are separated from one another. The data points of patterns within each constellation are reweighted, based on the information content of the subpopulation, and are further divided into subgroups by the same techniques. Those boundary conditions which must be arbitrarily set at the outset can be systematically varied to produce the maximum decrease in information at each subdivision. This serves as the device for converging to the "most natural" subdivision by measuring how closely our semi-metric space approximates a defined, ideal "Information Space", thus removing a large element of remaining "arbitrariness" from the procedure. In this way, a taxonomy of patterns is automatically generated. By introducing a simple binary coding device which assigns 0's and l's to designate the smaller and larger constellations at each subdivision, an efficient Shannon-Fano Code (10, 22) (in terms of Shannon's Binary Coding Theorem (10)) for the "description," "communication" and processing of such patterns is automatically generated. Such a taxonomic procedure both "discovers" classes of patterns and orders them in a "dictionary" and provides the means of efficiently recognizing whether any new sample of a pattern "belongs" to one or more of these already discovered classes. It thus solves the "semantic problem" of Information Theory (10). Mechanisms for continuously updating the process on the basis of the statistics of the portion of the population of patterns already sampled and analyzed are explicitly introduced into the scheme.
This technique departs widely from the more familiar decision-space model of decision theory, (e.g., see 30, 94, 95, 96, 97, 98, 99, 100), where, at the outset, specific "teleological" constraints are usually placed on a limited synthetic set of pattern parameters which it is hoped will allow the differentiation and recognition of the different kinds of known patterns; where a metric decision-space is usually carved up by sets of "hyper-planes" which are intuitively and computationally more complex and usually more arbitrarily distributed than our decision boundaries; and where no attempt is usually made to structure the resulting set of classes in an efficient "dictionary" or taxonomy.
The relationship of our model to "learning" and to the "scientific method" is discussed. The areas of mathematics which might be further exploited to increase the sophistication and generality of such procedures are noted. The pertinence of computer design to the efficient and widespread application of such a class of techniques is briefly discussed and a change in direction is recommended in this connection. Attention is finally drawn to potential social consequences of the widespread use of such techniques.
Then (1969) I began my long-term collaboration with Technicon
Instruments Corporation (see Hematology and Cytometry
Page)
on the
development of
automated
white and red blood cell analyzers, and soon tried (again,
unsuccessfully) to convince my friend, E.C. ("Jack") Whitehead, the
owner of Technicon, that the time had come for equiping his clinical
blood-analyzer machines with automated diagnostic capabilities.
At about that time, I was also asked to prepare a paper for a special November, 1969 issue of the Proceedings of the I.E.E.E.
That paper, "Hierarchic Heuristics", re-explored pattern recognition, using monosyllabic speech as a model, attempting to add some additional useful perspective to the general treatment I had developed in "Computer Learning and the Scientific Method" . However, its publication was blocked by J.R. Pierce (then head of Bell Labs) as a "naive amateurish effort". Although he refused to provide explicit criticisms, shortly thereafter, Pierce published a note, "Whither Speech Recognition" J. Acoust. Soc. Am. 46, 1049 (1969), which clearly reveals the source of the predjudice with which he reviewed my manuscript.
The Abstract and Summary of Hierarchic Heuristics appear below, and the original unpublished document can be downloaded here:
Ornstein,
L. "Hierarchic Heuristics: Their relevance to economic
pattern-recognition and high-speed data-processing" (290K) 43
pages (1969)
ABSTRACT:
Economic and intellectual motives exist for trying to use computers for the resolution of questions that typically require resort to vast stores of empirical experience in the search for their answers (e.g., questions which may require rapid random access to reference files containing more than l08 bits of data). Such questions can be recast as pattern-recognition problems (43,44,45). They range from the routine and intellectually trivial, such as word recognition in ordinary speech, to tasks which can tax the intellectual capacity of the trained professional, such as the workng out of a medical diagnosis. As of this time, no economic or adequately quick solutions to such problems, even using late third generation computers, have been proposed. I will attempt to establish some confidence in the proposition that such pattern-recognition problems might be solved economically by the application of hierarchical principles. Important problems in telecommunications coding also will be shown to find novel solutions within the hierarchically structured solution to the pattern-recognition problem. The general pertinence of large and fast hierarchically structured computer memories to real-time solutions will be discussed and a new design* for such a memory will be briefly described.
* Ornstein, L. ""Data Processing", U.S. Patent No. 3,633,171 (1972)
Properties of taxonomic keys and hierarchical logic which permit the design of search techniques with increased efficiency, over simple sequential searches, of N/MlogMN (where M is the number of branches at each branch point in a taxonomic tree, and N is the total number of branch tips---i.e., item classes) have been examined. Near optimum significant codes which can be produced with such techniques have been discussed. Hierarchic hardware configurations for mass-memories together with hierarchic parallel processing techniques, which can provide additional speed-up factors of M and K/log2K (where K is the number of sequential algebraic operations performed on a list with a serial computer) have also been reviewed. These will permit such coding and decoding in real-time. For purposes of illustration, algorithms for a hierarchical pattern-recognition model for automatic speech recognition (without a teacher) have been outlined. Such techniques should permit transmission of speech over channels with capacities as small as 50 bits/ sec. Relevance to automated medical diagnosis has also been briefly discussed.
Neural networks are potentially attractive because they are most easily implementable in hardware as parallel computing devices. And "programing" them to operate in parallel promised to be many orders of magnitude easier than writing software for other kinds of parallel machinery. However Minsky and Papert seemed to have driven a silver stake into the heart of the Perceptron approach when they proved that a two-layer Perceptron could never separate sub-classes which require a separating hypersurface more complicated than a hyperplane. I'd had only marginal interest in Perceptrons up to then, and I (and many others) now felt vindicated in having payed so little attention to Rosenblatt's work.
But, as is now well known, it turns out that multi-layer Perceptrons deserve much more attention; and by 1986, [Rumelhart, D.E., Hinton, G.E., and Williams, R.J. "Learning Representations by Back Propagating Errors" Nature, 323, 533-536 (1986)] new methods for the supervised training of multi-layer ANN's , using so-called "Back Propagation", "Back-Propagation" or "Backpropagation" burst on the scene and have been applied successfully for a very wide range of interesting classification problems.
This stimulated me to re-examine multi-layer ANN's, especially with respect to my earlier work, to see whether it was feasible to harness Back Propagation, or something like it, to operate in the unsupervised mode. In the narrow context of my current responsibilities, I judged that this deserved attention, because by 1990, the rapidly dropping cost and increasing speed of desktop CPU's and RAM meant that the time was ripe for building sophisticated automated diagnosis into the Technicon (now Bayer) automated analyzers which I had been helping to develop since 1969. Part of what resulted was:
Ornstein, L. "Unsupervised Neural Network Classification with Back Propagation" U.S. Patents No. 5,444,796, Aug. 22, 1993 (hardware claims) and No. 5,590,218, Dec. 31, 1996 (software claims).
An unsupervised back propagation method for training neural networks: For a set of inputs, target outputs are assigned 1's and 0's randomly or arbitrarily for a small number of outputs. The learning process is initiated and the convergence of outputs towards targets is monitored. At intervals, the learning process is paused and the values for those targets for outputs which are converging at less than average rate, are changed (e.g., 0 to 1, or 1 to 0), and the learning is then resumed with the new targets. The process is continuously iterated and the outputs converge on a stable classification, thereby providing unsupervised back propagation. In a further embodiment, samples classified with the trained network may serve as the training sets for additional layers of a hierarchical classification tree which converges to indivisible brach tips. After training is completed, such a tree may be used to classify new unlabelled samples with high efficiency. In yet another embodiement, the unsupervised back propagation method of the present invention may be adapted to classify fuzzy sets.
More:![]()
Return to Len's Home Page (Contingent Links)
<lenornst@pipeline.com>