David Vernon

MACHINE VISION
Automated Visual Inspection and Robot Vision
Machine Vision: Automated Visual Inspection and Robot Vision
David Vernon
Department of Computer Science, Trinity College Dublin, Ireland
Prentice Hall
New York London Toronto Sydney Tokyo Singapore
First published 1991 by Prentice Hall International (UK) Ltd, 66 Wood Lane End, Hemel Hempstead, Hertfordshire HP2 4RG. A division of Simon & Schuster International Group.
© Prentice Hall International (UK) Ltd, 1991
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior permission, in writing, from the publisher. For permission within the United States of America contact Prentice Hall Inc., Englewood Cliffs, NJ 07632.

Typeset in 10 on 12 point Times by MCS Typesetters, Salisbury, Wiltshire, England

Printed and bound in Great Britain by Cambridge University Press

Library of Congress Cataloging-in-Publication Data is available from the publisher

British Library Cataloguing in Publication Data
Vernon, David
Machine vision.
I. Title
006.3

ISBN 0-13-543398-3

1 2 3 4 5  95 94 93 92 91
Everything that I can spy
Through the circle of my eye,
Everything that I can see
Has been woven out of me;
I have sown the stars, and threw
Clouds of morning and of eve
Up into the vacant blue;
Everything that I perceive,
Sun and sea and mountain high,
All are moulded by my eye:
Closing it, what shall I find?
Darkness and a little wind.
James Stephens
The Hill of Vision
Contents

Preface
Acknowledgements

1 An introduction to computer vision
1.1 Computer vision: image processing or artificial intelligence?
1.2 Industrial machine vision vs. image understanding
1.3 Sensory feedback for manufacturing systems: why vision?
1.4 Examples of industrial machine vision problems and solutions
1.4.1 Measurement of steel bars
1.4.2 Inspection of computer screens
1.5 A typical system architecture

2 Illumination and sensors
2.1 Illumination
2.2 Sensors
2.2.1 Image formation: elementary optics
2.2.2 Camera sensors
2.2.3 Camera interfaces and video standards
2.2.4 Characteristics of camera sensors
2.2.5 Commercially available cameras

3 Image acquisition and representation
3.1 Sampling and quantization
3.1.1 Spatial frequency and the effects of sampling
3.2 Inter-pixel distances
3.3 Adjacency conventions
3.4 Image acquisition hardware
3.5 Speed considerations

4 Fundamentals of digital image processing
4.1 Point operations
4.1.1 Contrast stretching
4.1.2 Thresholding
4.1.3 Noise suppression by image addition
4.1.4 Background subtraction
4.2 Neighbourhood operations
4.2.1 Convolution
4.2.2 Noise suppression
4.2.3 Thinning, erosion, and dilation
4.3 Geometric operations
4.3.1 Spatial warping
4.3.1.1 The spatial transformation
4.3.1.2 Grey-level interpolation
4.3.2 Registration and geometric decalibration
4.4 Mathematical morphology
4.4.1 Basic set theory
4.4.2 Structuring elements and hit or miss transformations
4.4.3 Erosion and dilation
4.4.4 Opening and closing
4.4.5 Thinning and the extraction of endpoints
4.4.6 Application: identification of endpoints of electrical wires
4.4.7 A brief introduction to grey-scale mathematical morphology

5 The segmentation problem
5.1 Introduction: region- and boundary-based approaches
5.2 Thresholding
5.2.1 Global, local, and dynamic approaches
5.2.2 Threshold selection
5.3 An overview of edge detection techniques
5.3.1 Gradient- and difference-based operators
5.3.2 Template matching
5.3.3 Edge fitting
5.3.4 Statistical techniques
5.3.5 Assessment of edge detection
5.4 Region growing
5.4.1 The split and merge procedure using quadtrees
5.5 Boundary detection
5.5.1 Boundary refining
5.5.2 Graph-theoretic techniques
5.5.3 Dynamic programming
5.5.4 Contour following

6 Image analysis
6.1 Introduction: inspection, location, and identification
6.2 Template matching
6.2.1 Measures of similarity
6.2.2 Local template matching
6.3 Decision-theoretic approaches
6.3.1 Components of statistical pattern recognition process
6.3.2 Simple feature extraction
6.3.3 Classification
6.3.3.1 A synopsis of classification using Bayes' rule
6.4 The Hough transform
6.4.1 Hough transform for line detection and circle detection
6.4.2 The generalized Hough transform
6.5 Histogram analysis

7 An overview of techniques for shape description
7.1 A taxonomy of shape descriptors
7.2 External scalar transform descriptors: features of the boundary
7.3 Internal scalar transform descriptors: features of the region
7.4 External space domain descriptors: spatial organization of the boundary
7.4.1 An algorithm for resampling the boundary chain codes
7.5 Internal space domain descriptors: spatial organization of the region

8 Robot programming and robot vision
8.1 A brief review of robot programming methodologies
8.2 Description of object pose with homogeneous transformations
8.3 Robot programming: a wire crimping task specification
8.4 A simple robot-programming language
8.5 Two vision algorithms for identifying ends of wires
8.5.1 A binary vision algorithm
8.5.2 A grey-scale vision algorithm
8.5.3 The vision/manipulator interface
8.6 The camera model and the inverse perspective transformation
8.6.1 The camera model
8.6.2 The inverse perspective transformation
8.6.3 Recovery of the third dimension
8.7 Three-dimensional vision using structured light

9 Introduction to image understanding
9.1 Representations and information processing: from images to object models
9.2 Organization of visual processes
9.3 Visual representations
9.3.1 The raw primal sketch
9.3.2 The full primal sketch
9.3.3 The two-and-a-half-dimensional sketch
9.3.4 Three-dimensional models
9.3.4.1 Volumetric representations
9.3.4.2 Skeletal representations
9.3.4.3 Surface representations
9.3.5 The extended Gaussian image
9.4 Visual processes
9.4.1 Stereopsis
9.4.2 Camera motion
9.4.3 Shading
9.5 Concluding remarks

Appendix: Separability of the Laplacian of Gaussian Operator

Index

Preface
Machine vision is a multidisciplinary subject, utilizing techniques drawn from optics, electronics, mechanical engineering, computer science, and artificial intelligence. This book is intended to be an in-depth introduction to machine vision which will allow the reader quickly to assimilate and comprehend the essentials of this evolving and fascinating topic. Significant emphasis will be placed on providing the reader with a solid grounding in the fundamental tools for image acquisition, processing, and analysis; a range of techniques, dealing with very simple two-dimensional systems, through more sophisticated robust two-dimensional approaches, to the current state of the art in three-dimensional robot vision, will be explained in some detail. Both application areas of automated visual inspection and robot vision are addressed. Recognizing that machine vision is just a component of a larger automation system, a brief introduction to robot programming will be provided, together with an explanation of the mechanisms by which robot vision modules interact with the programming language.

It is important to recognize that the discipline of machine vision is presently undergoing a maturing process, with sophisticated techniques drawn from current research being exploited more and more in industrial systems. Without doubt, there is a long way to go, but the die is well cast. Acknowledging this trend, the last chapter of the book is devoted to the more research-orientated topics of three-dimensional image understanding and early visual processing (e.g. stereopsis and visual motion). It would indeed be foolhardy to attempt an exhaustive treatment of these areas; each deserves a volume on its own. However, if the essence of the philosophy of robot vision in its broadest sense is cogently imparted to the reader, then the exercise will have been successful and worth while.
The book is directed at final-year undergraduate and first-year graduate students in computer science and engineering, and at practising industrial engineers; the fundamental philosophy being to impart sufficient knowledge so that the reader will be competent to begin the implementation of a simple vision system and to enable him/her to study each issue independently in more depth. To that end, care is taken to provide adequate references to supporting texts, reports, and research papers. In this way the book may be viewed both as a self-contained introductory text and as a springboard to more detailed and specific study.
Acknowledgements
Special thanks are due to Kenneth Dawson of the Computer Vision Group, Department of Computer Science, Trinity College, Dublin, for his work on the raw primal sketch, the extended Gaussian image, and the polyhedral models; and to Massimo Tistarelli and Prof. Giulio Sandini at the University of Genoa for their help with the examples of camera motion and stereopsis. Many people in the Trinity Computer Vision Group read draft versions of this book and I am grateful for their contributions. I would also like to record a note of thanks to Dr R. Dixon, The University, Manchester, for his many valuable comments on an earlier draft of this book. Several of the examples in this book were facilitated by research funded by the Commission of the European Communities under the European Strategic Programme for Research and Development in Information Technology: Project 419, Image and Movement Understanding.
1
An introduction to computer vision

1.1 Computer vision: image processing or artificial intelligence?
What is computer vision and why would one be interested in studying it? It is perhaps easier to answer these two questions in reverse order. There are several reasons why one would be interested in computer vision, but the following two will serve to illustrate the many directions from which one can view the subject area:

1. All naturally occurring intelligent life-forms exhibit an ability to interact with and manipulate their environment in a coherent and stable manner. This interaction is facilitated by ongoing intelligent interplay between perception and motion control (i.e. action); visual perception is fundamentally important to most intelligent life.

2. Most manufacturers are concerned with the cosmetic integrity of their products; customers quite often equate quality of appearance with functional quality. So, to ensure the successful long-term marketing of an item, it is highly desirable that its appearance is checked visually before packaging and shipping. Likewise, it is desirable that the inspection process be automated and effected without human intervention.
These two motivations for the study of perception characterize two possible extremes of interest in the processing, analysis, and interpretation of visual imagery: from the philosophical and perhaps esoteric to the immediate and pragmatic. And the subject matter of everything between these two extremes presents one with wide and varied spectra of commercial interest, difficulty and, indeed, success. The answer to the first question (what is computer vision?) now becomes a little easier to identify. The world we live in and experience is filled with an endless variety of objects, animate and inanimate, and, to borrow a phrase from David Marr (of whom we shall hear more later in the book), it is by looking and seeing
that we come to know what is where in this world. So, if vision is a means to an end, to know the world by looking, then computer vision is exactly the same, except that the medium by which the knowledge is gained is now a computational instrument rather than the brain of some living creature. Without doubt, this is a very broad definition. But the subject matter of computer vision is this broad: topics such as image restoration, image enhancement, automated visual inspection, robot vision, computer-based image understanding of general three-dimensional scenes, and visual perception and cognition all fall under the umbrella of the term 'computer vision'.

Although for centuries man has been interested in solving the puzzle of how man comes to 'see', the first computational experiments in developing artificial machine vision systems were conducted in the late 1950s and, over the last twenty-five to thirty years, computer-based vision systems of widely varying degrees of complexity have been used in many diverse areas such as office automation, medicine, remote sensing by satellite, and in both the industrial world and the military world. The applications have been many and varied, encompassing character recognition, blood cell analysis, automatic screening of chest X-rays, registration of nuclear medicine lung scans, computer-aided tomography (CAT), chromosome classification, land-use identification, traffic monitoring, automatic generation of cartographic projections, industrial parts inspection for quality assurance, part identification, automatic guidance of seam welders, and visual feedback for automatic assembly and repair. Military applications have included the tracking of moving objects, automatic navigation based on passive sensing, and target acquisition and range-finding.

As we have seen, computer vision is concerned with recovering the physical structure of a three-dimensional world by the automatic analysis of images of that world.
However, it is necessary to qualify the use of the word image. First, the image is a two-dimensional one and, hence, we inevitably lose information in the projection process, i.e. in passing from a three-dimensional world to a two-dimensional image. Quite often, it is the recovery of this lost information which forms the central problem in computer vision. Second, the images are digital images: they are discrete representations (i.e. they have distinct values at regularly sampled points) and they are quantized representations (i.e. each value is an integer value). Computer vision includes many techniques which are useful in their own right, e.g. image processing (which is concerned with the transformation, encoding, and transmission of images) and pattern recognition (frequently the application of statistical decision theory to general patterns, of which visual patterns are but one instance). More significantly, however, computer vision includes techniques for the useful description of shape and of volume, for geometric modelling, and for so-called cognitive processing. Thus, though computer vision is certainly concerned with the processing of images, these images are only the raw material of a much broader science which, ultimately, endeavours to emulate the perceptual capabilities of man and, perhaps, to shed some light upon the manner by which he accomplishes his amazingly adaptive and robust interaction with his environment.
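The notion of a digital image as a discrete, quantized representation can be made concrete with a short sketch (illustrative only: the grid size and the 256-level range are assumptions for the example, not values from the text):

```python
def acquire(scene, rows, cols, levels=256):
    """Sample a continuous brightness function scene(y, x), defined on
    the unit square with values in [0, 1], at a regular rows x cols grid
    of points (a discrete representation), then quantize each sample to
    one of `levels` integer grey values (a quantized representation)."""
    image = []
    for r in range(rows):
        y = r / (rows - 1)
        row = []
        for c in range(cols):
            x = c / (cols - 1)
            # Quantization: map [0, 1] brightness to an integer 0..levels-1.
            row.append(min(levels - 1, int(scene(y, x) * (levels - 1) + 0.5)))
        image.append(row)
    return image

# A synthetic scene whose brightness ramps from left (dark) to right (bright).
image = acquire(lambda y, x: x, rows=4, cols=4)
print(image[0])  # [0, 85, 170, 255]
```

Both steps lose information: sampling discards whatever lies between grid points, and quantization collapses a continuum of brightnesses onto a handful of integers.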
1.2 Industrial machine vision vs. image understanding
Computer vision, then, is an extremely broad discipline (or set of disciplines) and, in order to get to grips with it, we need to identify some way of classifying different approaches. To begin with, we note that humans live and work within a general three-dimensional world, pursuing many goals and objectives in an unconstrained and constantly changing environment in which there are many varied and, often, ill-defined objects. Industrial automation, on the other hand, is given to performing single repeated tasks involving relatively few objectives, all of which are known and defined, in manufacturing environments which are normally constrained and engineered to simplify those tasks. Industrial systems do not yet work with general three-dimensional environments (although the environments they do work in are often much less structured than one would suppose) and vision systems for manufacturing still exploit many assumptions, which would not generally apply to unconstrained worlds with many objects and many goals, in order to facilitate processing and analysis. There is a considerable dichotomy between the two approaches, a situation which must change and is changing; it is for this reason that the final chapter is concerned with advanced techniques and their migration to the industrial environment.

Let us look a little closer at each of these classes of computer vision. Approaches associated with general environments are frequently referred to by the terms 'image understanding' or 'scene analysis'. The latter term is now quite dated, as it typically refers to approaches and systems developed during the 1970s. Vision systems specifically intended for the industrial environment are often referred to generically as 'industrial machine vision systems'. Image understanding vision systems are normally concerned with three-dimensional scenes, which are partially constrained, but viewed from one (and often several) unconstrained viewpoint.
The illumination conditions may be known, e.g. the position of the room light might be assumed, but usually one will have to contend with shadows and occlusion, i.e. partially hidden objects. As such, the data or scene representation is truly a two-dimensional image representation of a three-dimensional scene, with high spatial resolution (i.e. it is extremely detailed) and high grey-scale resolution (i.e. it exhibits a large variation in grey-tone). Occasionally, colour information is incorporated, but not nearly as often as it should be. Range data is sometimes explicitly available from active range-sensing devices, but a central theme of image understanding is the automatic extraction of both range data and local orientation information from several two-dimensional images using, e.g., stereopsis, motion, shading, occlusion, texture gradients, or focusing. One of the significant aspects of image understanding is that it utilizes several redundant information representations (e.g. based on the object edges or boundaries, the disparity between objects in two stereo images, and the shading of the object's surface); and it also incorporates different levels of representation in
order to organize the information being made explicit in the representation in an increasingly powerful and meaningful manner. For example, an image understanding system would endeavour to model the scene with some form of parameterized three-dimensional object models built from several low-level processes based on distinct visual cues. At present, image-understanding systems utilize both explicit knowledge (or models) and software-embedded knowledge for reasoning, that is, for controlling image analysis.

Most industrial machine vision systems contrast sharply with the above approach. The scenes in an industrial environment are usually assumed to be two-dimensional, comprising known isolated rigid parts, frequently with a contrasting visual backdrop. Lighting is almost always a critical factor and must be very carefully organized. Typically, the ambient room lighting will be totally inadequate, and even confusing, so that each inspection station will require its own set of dedicated lights, each designed for the task in hand. The images which industrial machine vision systems use are frequently two-dimensional binary images (pure black and white, with no intermediate grey-levels) of essentially two-dimensional scenes. There is normally just one simple internal object representation or model; the analysis strategy being to extract salient features (e.g. area, circularity, or some other measure of shape) and to make some decision, typically using feature-based discrimination. This process frequently uses software-embedded (hard-coded) knowledge of the scene. There are two complementary areas of industrial machine vision: robot vision and automated visual inspection. Both of these use essentially the same techniques and approaches, although the visual inspection tasks are, in general, not as difficult as those involved in visual perception for robotic parts manipulation, identification, and assembly. This is because the inspection environment is usually easier to control and the accept/reject decisions required for inspection are often easier to determine than the location and identification information needed for assembly. The significant problems associated with robotic part handling, too, have meant that advanced three-dimensional robot vision has not received the attention it merits.
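The feature-based discrimination strategy just described (extract salient features such as area and circularity from a binary image, then decide) can be sketched as follows. This is a minimal illustration, not any system from the text: the feature set, the perimeter estimate, and the acceptance bounds are all invented for the example:

```python
import math

def features(blob):
    """Compute area and circularity for a binary blob given as a set of
    (row, col) pixel coordinates.  Circularity = 4*pi*A / P**2 is 1.0
    for a perfect disc and smaller for elongated or ragged shapes."""
    area = len(blob)
    # Crude perimeter estimate: count pixel edges bordering the background.
    perimeter = sum(
        (r + dr, c + dc) not in blob
        for (r, c) in blob
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1))
    )
    return area, 4 * math.pi * area / perimeter ** 2

def classify(blob, min_area=9, min_circularity=0.5):
    """Accept a part if its salient features fall inside known bounds."""
    area, circ = features(blob)
    return area >= min_area and circ >= min_circularity

# A solid 4x4 square: area 16, perimeter 16, circularity pi/4 (about 0.785).
square = {(r, c) for r in range(4) for c in range(4)}
print(classify(square))  # True
```

The decision logic here is deliberately trivial; the point is the shape of the pipeline: binary image, a handful of scalar features, and a hard-coded accept/reject rule.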
1.3 Sensory feedback for manufacturing systems: why vision?

The answer to this question must necessarily be double-barrelled:

1. We need feedback because the manufacturing system is not perfect and free of errors: we wish to ensure that we are informed when errors begin to creep into the process, so that we can take corrective action and ensure that quality and productivity are maintained.

2. We use vision because it is by far the most versatile sense available and conveys extremely rich information when compared with, e.g., sonar or infra-red sensing. Furthermore, unlike tactile sensing, it senses the environment in a remote manner rather than having to be in contact with the objects being analysed.

Systems which are equipped with (useful) visual capabilities are inherently adaptive and can deal with uncertainty in the environment, or at least that is what one would hope for. The upshot of this is that, by incorporating vision in the manufacturing process, not only can we identify when things go wrong (e.g. in visual inspection) but the uncertain and variable nature of the manufacturing environment can be catered for. Unfortunately, vision, while versatile, is also the most complex of the senses, due mainly to the fact that most information in visual images is implicitly coded and requires extensive processing and analysis to make it explicit. Visual sensing is difficult: in humans, ten of the estimated one hundred cortical areas in the brain are devoted to vision and much work remains to be done before we can claim to have even a modest grasp of visual sensing. Given that one acknowledges that vision is (potentially) a very powerful sense, let us look at some of the motivations for using visual processes in the industrial workplace.

□ Safety and reliability

Considerations of safety and reliability usually arise in environments which are hazardous for humans (e.g. in close proximity to a milling bit) or because manufactured parts are of critical importance and 100 per cent inspection is required (e.g. defects in a brake lining might conceivably cause loss of life). Machine vision also facilitates consistency in inspection standards; such systems don't suffer from the 'Monday-morning' syndrome and their performance can be (and should be) quantitatively assessed.

□ Product quality

High-volume production using humans seldom facilitates inspection of all parts, but automated visual inspection techniques may make it feasible; this depends on the complexity of the task and the effective throughput that is required by the manufacturing system. The latter consideration is particularly important if the vision system is to be incorporated in an on-line manner, i.e. inspecting each part as it is manufactured.

□ Flexible automation

In environments where quality assurance is performed by a machine, it is feasible to integrate the inspection task into the complete production or manufacturing cycle, and allow it to provide feedback to facilitate on-line control. This provides for the adaptive requirements of AMT (advanced manufacturing technology) systems and facilitates the overall control by computer, such as is found (or, more realistically, as will be found in the future) in advanced computer integrated manufacturing (CIM) environments. Integration within a CIM system is, however, one of the problems that is forestalling the rapid deployment of vision technology in advanced manufacturing environments, not least because the vision system will not usually know how to communicate with, e.g., the CAD (computer-aided design) database or another manufacturer's robotic palletizing system.

1.4 Examples of industrial machine vision problems and solutions

Of the two subsections of industrial machine vision, automated visual inspection is presently by far the most important area, accounting for at least 80 per cent of all current applications. Within inspection, there is a great diversity of uses in fields such as quality assurance, machine monitoring, and test and calibration. These applications may be classified on a functional basis as either gauging, inspection, or sorting:

□ Gauging is concerned with the measurement of dimensional characteristics of parts and with checking tolerances.

□ Inspection, per se, is concerned with performing part verification, i.e. establishing whether there are any parts or sections missing from the object being inspected or whether there are any extraneous parts which should not be present. Alternatively, one might be interested in performing flaw detection, i.e. effecting the detection and classification of flaws (usually surface flaws) on the part: for example, the detection of scratches on plastic housings.

□ Sorting is concerned with the identification and recognition of parts. Parts will usually be on a conveyor system and it is pertinent to note that this does not necessarily involve robot manipulation, as the part can quite simply be pushed into an appropriate bin using a simple pneumatically actuated flipper.

The applications of robot vision are less varied. At present, this is primarily due to the uncertain nature of the industrial environment within which a robot will be operating: there are typically three dimensions to deal with instead of two, partly manufactured objects tend to be poorly finished, having spurious material attached (e.g. swarf), and they tend not to appear exactly where they are expected or are supposed to be. The two application areas which are well developed are materials handling and welding. The reasons for this are that these problems can often be reduced to two dimensions; objects on a conveyor or pallet are visualized as two-dimensional silhouetted shapes, and tracking a welding seam is conceived, locally, as a two-dimensional exercise in the estimation of the disparity between the position of the welding rod and the metal seam. The application which one would expect to be the main subject matter of robot vision, part assembly, is very poorly developed for the reasons given above and, additionally, because the assembly operation requires greater dexterity on the part of the robot than is currently offered by commercial manipulators. It is for this reason that the development of compliant manipulation and the incorporation of tactile sensors in the robot grippers are fundamentally important. Despite these negative comments, there is still a great deal of potential for existing vision technologies; there are many common industrial problems which can be successfully and cost-effectively solved with the judicious use of well-understood and simple vision tools. To illustrate this and some of the techniques that have been alluded to so far, a few very simple examples are in order; a more sophisticated application is described in detail in Chapter 8.

1.4.1 Measurement of steel bars

Raw material for steel ingots to be used in a casting process is delivered in four-foot long bars with a circular cross-section. Individual ingots are then cut from the bar and stacked in a wooden box before being moved to the next stage of the casting process. To ensure high-quality casting with minimal wastage of raw material, it is essential that the weight of the charges fall within acceptable tolerances. The nominal tolerances on weight are ±5 g. If the cross-sectional area of the bar of raw material were constant, the required weight would be given by an ingot of fixed length. Unfortunately, the cross-sectional area is not constant and the outside diameter of the bar can vary considerably (a 1.25" bar has a nominal variation of 0.01" but often exceeds this limit). Thus, in order to ensure the correct weight of a cut charge, the cross-sectional area of the bar must be monitored and a length of bar chosen such that the volume (and hence weight) falls within the required tolerance. To measure the cross-sectional area, an 'outside diameter' gauge, comprising a diametrically opposed light source and an image sensor, is mounted on a robot end effector and moved along the axis of the bar (see Figure 1.1). In this manner, the bar will form a 'shadow' on the light sensor, the size of which corresponds to the diameter of the bar at that point. Thus, a profile of the bar can be built as the bar passes through the gauge. One pass of the sensor generates a single signature describing how the diameter varies over the scanned length. If it is assumed that the bar is truly circular in cross-section, this single measure is sufficient to compute the volume of the bar; several passes of the sensor at different orientations (i.e. rotating the sensor about the axis of the bar after each pass) allow this assumption to be dropped. The essential point to be noticed about this system is that the techniques used are inherently simple (measuring the length of a shadow cast directly on a sensor) and do not require any special sophisticated software. The gauging process is very fast and consequently can be integrated into the production system as a feedback device to control the ingot cutting operation directly.

Figure 1.1 Measurement and inspection of steel bars. [Diagram: a collimated light source diametrically opposed to a sensor; the bar's diameter is given by the width of the shadow it casts.]
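The gauging computation just described can be sketched as follows. This is an illustrative reconstruction only: the steel density, sampling step, and target weight are invented figures, and the real system works from the sensor's shadow signature rather than a ready-made list of diameters:

```python
import math

def cut_length(diameters_mm, step_mm, target_g, tol_g, density_g_mm3=0.00785):
    """Given diameter samples taken at regular intervals along the bar,
    accumulate weight slice by slice (assuming a circular cross-section)
    and return the cut length at which the weight falls within tolerance
    of the target, or None if no acceptable cut exists."""
    weight = 0.0
    for i, d in enumerate(diameters_mm):
        # Weight of one cylindrical slice: pi * (d/2)^2 * step * density.
        weight += math.pi * (d / 2.0) ** 2 * step_mm * density_g_mm3
        if abs(weight - target_g) <= tol_g:
            return (i + 1) * step_mm
        if weight > target_g + tol_g:
            return None  # overshot the tolerance band: step too coarse
    return None

# A nominally 31.75 mm (1.25") bar sampled every 1 mm along its axis.
profile = [31.75] * 200
print(cut_length(profile, step_mm=1.0, target_g=500.0, tol_g=5.0))
```

Because the diameter varies along the bar, the returned length differs from charge to charge; averaging diameters over several passes at different orientations would relax the circular cross-section assumption, as the text notes.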
1.4.2 Inspection of computer screens

Cathode ray tubes (CRTs) must be inspected and calibrated during the assembly of personal computers. The picture position, width, height, distortion, brightness, and focus need to be computed and, if any of these measurements lie outside specified tolerances, real-time feedback is required to facilitate on-line adjustment. Such a system has been configured using standard off-the-shelf equipment, comprising an Apple Macintosh II, a Data Translation image acquisition device (framestore), and a Panasonic CCD (charge-coupled device) camera. For the purposes of illustrating the image analysis, consider the problem of identifying the position of the picture on the screen, such as is depicted in Figure 1.2. We wish to measure the distance between the edge of the picture (shown in white) and the inside edge of the computer faceplate, i.e. the width of the black border. To do this, we sample a small linear section (shown as the vertical XX section in Figure 1.2) and identify the transitions between white and black and between black and grey. This is accomplished by estimating the rate of change in intensity or colour along the strip. A signature of this intensity change is shown in Figure 1.2; the required distances, D1 and D2, correspond to the distance between the first two peaks we encounter as we journey from the centre of the strip to the periphery. The CRT is adjusted until these two distances are, to within specified tolerances, equal. All the other features which need to be checked, with the exception of focus and brightness, are computed in a similar manner.

1.5 A typical system architecture

In the context of a manufacturing environment, a machine vision process will typically encompass nine components: the manufacturing assembly line; some mechanism for delivering the parts to the machine vision system; a method of automatically presenting the part to the system (including part acquisition, part positioning, and registration); some sensing system which provides a representation understandable by computer systems (and digital systems, in general); the representation itself; some process to extract pertinent features; a set of criteria upon which some decision process can be based; a method facilitating the automatic release of the system; and some mechanism for collecting the processed part (see Figure 1.3). Rather than discuss in detail each component of the imaging system, as this would pre-empt the subject matter of later chapters, we will present here no more than a schematic overview (illustrated in Figure 1.4), denoting many components and terms which have not yet been explained; the reader is asked to bear in mind that their introduction here is by way of preview rather than a self-contained exposition.

The imaging part of the machine vision system comprises subsystems for image formation, image acquisition and (pre)processing, image analysis, and image interpretation. The image formation system, consisting of the part illumination, the sensing element and the associated optics, is critical to the successful deployment of the system. This system generates an analogue representation of the image scene, typically in the form of a conventional TV video signal. The task of the image acquisition and processing subsystem is to convert this signal into a digital image, and to manipulate the resultant image to facilitate the subsequent extraction of information. The image analysis phase, working with
Figure 1.2 Inspection of computer screens. (Annotations: change in brightness; vertical eccentricity = d1 - d2.)
An introduction to computer vision
Figure 1.3 Components of a machine vision system in the context of a manufacturing environment. (Blocks: manufacturing assembly line; delivery to inspection device; automatic registration and positioning; automated visual inspection system, comprising the imaging system with its camera/lens and lighting subsystems; inspection and part pickup.)
Figure 1.4 Schematic overview of a machine vision system.
the processed image, is concerned with the extraction of explicit information regarding the contents of the image (e.g. object position, size, orientation). Thus, there is a fundamental difference between image processing and image analysis: the former facilitates transformations of images to (hopefully, more useful) images, while image analysis facilitates the transformation from an image to an explicit (symbolic) piece of information. The final phase, image interpretation, is concerned with making some decision (e.g. accept or reject) based upon the description of what is in the image; it provides the link back to the environment to control the inspection process or the robot manipulator.

Each subsystem requires its own specialized type of hardware: image formation uses camera technology, optics, and various light sources. Image acquisition requires an analogue-to-digital converter or 'frame-grabber' to capture a frame of video information; image processing is a computationally intensive task (due to the amount of data comprising an image) and the predominant trend is to accomplish as much of this in hardware as possible. Image analysis typically takes place on a conventional micro/minicomputer, though, again, there is a trend to accomplish as much of the analysis as possible in dedicated hardware. Image
interpretation is normally effected in software on a conventional computer, usually the same one which implements the image analysis.

Exercises

1. Comment on the validity of the statement that 'Industrial machine vision and image understanding have nothing in common'.
2. Given that an inspection task typically comprises three main stages of sensing, image processing, and image analysis, identify the component processes within each stage by describing the functional structure of a task involving the automated visual inspection of O-ring seals. The objective of the inspection task is to compute the outside and inside diameters and the diameter ratio and to check that each of the three values is within a certain tolerance. Pay particular attention to the components related to image analysis.
3. Describe briefly the essential difference between information contained in images and the information required to make a decision in a manufacturing environment.

References and further reading

Ballard, D.H. and Brown, C.M. 1982 Computer Vision, Prentice Hall, New Jersey.
Batchelor, B.G., Hill, D.A. and Hodgson, D.C. (eds) 1985 Automated Visual Inspection, IFS (Publications) Ltd, United Kingdom.
Besl, P.J., Delp, E.J. and Jain, R. 1985 'Automatic visual solder joint inspection', IEEE Journal of Robotics and Automation, Vol. RA-1, No. 1, pp. 42-56.
Binford, T.O. 1982 'Survey of model-based image analysis systems', The International Journal of Robotics Research, Vol. 1, No. 1, pp. 18-64.
Bolles, R.C. 1981 An Overview of Applications of Image Understanding to Industrial Automation, SRI International, Technical Note No. 242.
Boyle, R.D. and Thomas, R.C. 1988 Computer Vision - A First Course, Blackwell Scientific Publications, Oxford.
Brady, J.M. (ed.) 1981 Computer Vision, Elsevier Science Publishers, The Netherlands.
Brady, M. 1982 'Computational approaches to image understanding', ACM Computing Surveys, Vol. 14, No. 1, pp. 3-71.
Chin, R.T. and Harlow, C.A. 1982 'Automated visual inspection: a survey', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-4, No. 6, pp. 557-73.
Christiansen, D. 1984 'The automatic factory', IEEE Spectrum, Vol. 20, No. 5, p. 33.
Connors, R.W., McMillin, C.W., Lin, K. and Vasquez-Espinosa, R.E. 1983 'Identifying and locating surface defects in wood: part of an automated lumber processing system', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 6, pp. 573-83.
Corby, N.R. 1983 'Machine vision for robotics', IEEE Transactions on Industrial Electronics, Vol. IE-30, No. 3, pp. 282-91.
Cuadrado, J.L. and Cuadrado, C.Y. 1986 'A.I. in computer vision', Byte, Vol. 11, No. 1, pp. 237-58.
Duda, R.O. and Hart, P.E. 1973 Pattern Classification and Scene Analysis, Wiley, New York.
Fairhurst, M.C. 1988 Computer Vision for Robotic Systems, Prentice Hall International (UK), Hertfordshire.
Fischler, M.A. 1981 Computational Structures of Machine Perception, SRI International, Technical Note No. 233.
Gonzalez, R.C. and Safabakhsh, R. 1982 'Computer vision techniques for industrial applications and robot control', Computer, Vol. 15, No. 12, pp. 17-32.
Gonzalez, R.C. and Wintz, P. 1977 Digital Image Processing, Addison-Wesley, Reading, Mass.
Hall, E.L. 1979 Computer Image Processing and Recognition, Academic Press, New York.
Hanson, A.R. and Riseman, E.M. 1978 'Segmentation of natural scenes', in Computer Vision Systems, Hanson, A.R. and Riseman, E.M. (eds), Academic Press, New York.
Hara, Y., Akiyama, N. and Karasaki, K. 1983 'Automatic inspection system for printed circuit boards', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 6, pp. 623-30.
Horn, B.K.P. 1986 Robot Vision, The MIT Press, Cambridge, Massachusetts.
Jarvis, J. 1982 Computer Vision Experiments on Images of Wire-Wrap Circuit Boards, Conference Record of 1982 Workshop on Industrial Applications of Machine Vision, pp. 144-50.
Kelley, R.B., Birk, J.R., Martins, H.A.S. and Tella, R. 1982 'A robot system which acquires cylindrical workpieces from bins', IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-12, No. 2, pp. 204-13.
Kelley, R.B., Martins, H.A.S., Birk, J.R. and Dessimoz, J.-D. 1983 'Three vision algorithms for acquiring workpieces from bins', Proceedings of the IEEE, Vol. 71, No. 7, pp. 803-21.
Mahon, J., Harris, N. and Vernon, D. 1989 'Automated visual inspection of solder paste deposition on surface mount technology printed circuit boards', accepted for publication in Computers in Industry.
Moore, F.W. 1987 'Remote visual inspection of nuclear fuel pellets with fibre optics and video image processing', Optical Engineering, Vol. 26, No. 2, pp. 152-5.
Nevatia, R. 1982 Machine Perception, Prentice Hall, New Jersey.
Pau, L.F. 1983 'Integrated testing and algorithms for visual inspection of integrated circuits', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 6, pp. 602-8.
Pau, L.F. 1984 'Approaches to industrial image processing and their limitations', Electronics and Power, February, pp. 135-40.
Pinker, S. (ed.) 1984 'Cognition', The International Journal of Cognitive Psychology, Vol. 18, Nos. 1-3.
Pratt, W.K. 1978 Digital Image Processing, Wiley, New York.
Pugh, A. (ed.) 1983 Robot Vision, IFS (Publications) Ltd, United Kingdom.
Rosenfeld, A. 1984 'Why computers can't see (yet)', Abacus, Vol. 1, No. 1, pp. 17-26, Springer-Verlag, New York.
Rosenfeld, A. and Kak, A. 1976 Digital Picture Processing, Academic Press, New York.
Rosenfeld, A. and Kak, A. 1982 Digital Picture Processing, Academic Press, New York.
Sanderson, R.J. 1983 Machine Vision Systems: A Summary and Forecast, Tech Tran Corp., Illinois.
Spoehr, K.T. and Lehmkuhle, S.W. 1982 Visual Information Processing, Freeman and Co., San Francisco.
Suresh, B., Fundakowski, R.A., Levitt, T.S. and Overland, J.E. 1983 'A real-time automated visual inspection system for hot steel slabs', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 6, pp. 563-72.
Vernon, D. (Rapporteur) 1987 Sensoric Feedback for Control and Decision Support Systems, Report of a Workshop in the Framework of ESPRIT Conference 1987, Directorate General XIII (eds).
Vernon, D. 1989 'Computers see the light', Technology Ireland, Vol. 21, No. 1, pp. 21-3.
2

Illumination and sensors

2.1 Illumination
Scene and object illumination play a key role in the machine vision process. The central purpose of imposing controlled constant illumination is to enhance visually the parts to be imaged so that their flaws, defects, and features are highlighted and so that their identification and classification by the vision system becomes somewhat easier. Although the choice of lighting will typically be application-dependent, some general points may be pertinent. The common incandescent bulb is probably the simplest source of light. It is cost-effective and it is easily adjusted for light intensity; however, it generally provides directional illumination since it is, to an approximation, a point source of light. Hence, incandescent bulbs cast strong shadows which invariably cause problems for machine vision software. Special bulbs are normally required as degradation in emitted light intensity is common with age. Furthermore, incandescent bulbs emit considerable infrared radiation; this does not cause problems for humans as we are not sensitive to such light, but some camera sensors, particularly so-called CCD cameras, are sensitive and visual data can be washed out by the reflected infrared rays. For most machine vision applications, a diffuse source of light is the most suitable. Diffuse lighting is non-directional and produces a minimum amount of shadow. Fluorescent lighting is the simplest and most common method of obtaining diffuse illumination and is especially good for providing illumination of large areas. In situations where the only features that need to be inspected are evident from the silhouette of the object, backlighting is the most appropriate. Backlighting, e.g. in the form of a light table, provides high contrast between the object and the background upon which the object rests. Its advantage is that it facilitates very simple object isolation or segmentation, a topic to which we will return in Chapter 5.
In some manufacturing environments, it is necessary to inspect moving objects. Depending on the characteristics of the image sensor, it may be necessary to 'freeze' the motion of the object for an instant by the use of a strobe light or electronic flash. The lamp emits a short (1 ms) burst of light; thus the moving object is illuminated for a very short period and appears stationary. The activation of the strobe must be synchronized with the acquisition of the image. Alternatively, one can exploit cameras with a very fast 'shutter speed' or, rather, the electronic equivalent, a very short exposure time. The exposure is usually referred to as the integration time since it is the period over which the sensor integrates or averages the incident light. One would normally choose the latter option of a short integration time, since it is more ergonomic and less disruptive for humans.
Control of illumination and light levels

As many inspection systems base much of their analysis on the absolute intensity of the incident light, the control of the object illumination can be important. In particular, if image processing and analysis decisions are being made on the basis of a fixed intensity datum (threshold), then some problems will occur if the illumination, and hence the reflected light, changes. If possible, the vision system should be able to adapt to such changes, although this does not necessarily mean that it should be capable of dealing with dynamic changes. Most illumination systems degrade quite slowly over time and it would be quite satisfactory if the system were capable of self-calibration at the beginning of each day. Other alternatives exist, however, to this adaptive approach. One solution is to ensure that the illumination does, in fact, remain constant by monitoring it using light meters and adjusting the illumination system appropriately. Alternatively, the aperture of the camera lens might be altered. Note, however, that electronically controlled aperture lenses (so-called auto-iris lenses) should not be employed directly; their function is to alter the aperture so that the average amount of light passing through the lens remains constant. This is not appropriate for machine vision systems, as the grey-tone shade of a particular feature would vary depending on the intensity of the ambient lighting.

It was mentioned above that incandescent lighting is not suitable for some cameras and, in general, one should ensure that the lighting system is compatible with the image sensor. For example, mains-powered lighting is inherently 'flickery' due to the a.c. characteristics of the mains electricity supply. Humans do not notice this, in general, because they effectively 'integrate' (or average) the incident illumination over a short period of time. This process also accounts for our ability to view moving objects in cinema films and perceive the motion to be continuous. Machine vision sensors, as we shall see in the next section, do not integrate in quite the same way and, when they acquire the image, the flicker can become apparent. The use of an appropriate (d.c., say) power supply can alleviate this problem, when it does occur.

2.2 Sensors

The task of a camera system is to convert an optical picture, typically representing a two-dimensional or three-dimensional scene, into some form suitable for use with electrical systems. Since the camera represents the direct link between the environment being examined and the information-processing system, it plays a particularly significant role and merits a good deal of attention.

2.2.1 Image formation: elementary optics

Before commencing a discussion of cameras, image sensors, and image acquisition components proper, we will first address a few fundamental issues on the optics of a vision system and of lenses in particular. Lenses are required to focus part of the visual environment onto the image sensor. It is possible to construct an imaging system without lenses, using, for example, a collimated light source to project a shadow of a (small) object onto the sensor, but such a configuration is not typical of the requirements of a vision system. Lenses are defined by their focal length (quoted in millimetres: mm) and their aperture (the f-number). These parameters determine the performance of the lens in terms of light-gathering power and magnification, and they often have a bearing on its physical size. The focal length of a lens is a guide to the magnification it effects and to its field of view. Selecting the focal length which is appropriate to a particular application is simply a matter of applying the basic lens equation:
1/u + 1/v = 1/f

where

u is the distance from the lens to the object,
v is the distance from the lens to the image,
f is the focal length.

Noting that the magnification factor M is:

M = image size / object size

and, equivalently:

M = image distance / object distance = v/u

substituting v = uM into the lens equation gives:

f = uM / (M + 1)
Hence, if we know the required magnification factor and the distance from the object to the lens, we can compute the required focal length. For example, if a 10 cm wide object is to be imaged on a common 8.8 × 6.6 mm sensor from a distance of 0.5 m, this implies a magnification factor of:
M = 8.8/100 = 0.088

So:

f = (500 × 0.088)/1.088 = 40.44

Thus we require a 40.44 mm focal length lens. Typically, one would use a slightly shorter focal length (e.g. 35 mm) and accept the slight loss in resolution due to the larger field of view.

The minimum focal distance is the minimum distance between the front of the lens and the object at which the object can still be in focus. Generally speaking, lenses with short focal lengths have smaller minimum focal distances. If the lens is required to focus on a relatively small object at short distances, the minimum focal distance (typically 300 mm) may be too large. In that case, an extension tube can be used to increase the distance, v, of the lens from the sensor and hence decrease the distance, u, from the lens to the object. For a given focal length, f, the lens equation stipulates that u decreases as v increases, if the image is to remain in focus.

The f-number is a measure of the amount of light allowed to pass through the lens and is a normalized measure of lens aperture. It is defined by the focal length divided by the diameter of the aperture. The standard scale is 1.4, 2, 2.8, 4, 5.6, 8, 11, 16; each increase reduces the amount of light passing through the lens by one-half.

The depth of field is the distance between the nearest and the furthest points of a scene which remain in acceptable focus at any one aperture setting. In general, the depth of field gets larger as the focused point moves away from the lens (given constant aperture and focal length). Also, the depth of field increases significantly as the aperture closes (i.e. as the f-number increases) for any given focusing distance.

Lenses have many types of standard mounts, for example the Pentax, Olympus and Nikon bayonet mounts, but the standard on television, and CCTV (closed-circuit TV), cameras is a screw mount called the C mount. Since there is a vast choice of 35 mm photographic lenses, it makes sense to be able to use these photographic lenses with C mount cameras, and there are several types of bayonet adaptor available for C mount cameras. However, it should be remembered that these lenses are usually much more bulky than the miniature lenses which are specifically designed for CCD cameras. The reason for this is that a photographic lens is designed to image a 35 mm format and, for a given focal length, the optical surface must be much larger. CCD sensors (as we will see in the next section) are typically less than 10 mm in width and hence the optical surface can be much smaller.

Many of the visual problems which cause difficulty in interpreting a scene can often be solved by the use of simple filters on the camera lens. Filters are frequently used by the television and video industries to produce special effects, for example the starlight filter used to make candlelight seem soft and star-like. Filters are just as often used to reduce, or remove, such effects; polaroid sunglasses are probably the most widely used 'filters' and are used to reduce glare. In machine vision, one of the most annoying and potentially disruptive problems is that due to specular reflection (mirror-like reflection on shiny objects). The use of a simple polarizing filter on the camera lens can often reduce the effect of these reflections. Some sensors are sensitive to segments of the electromagnetic spectrum which do not convey visual data, e.g. infrared radiation. The use of a simple infrared blocking filter can solve this problem in a simple and effective manner.
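The focal-length calculation described above is easily expressed in a few lines of code. The following is a minimal sketch (the function names are ours, purely illustrative) that reproduces the worked example of the 10 cm object imaged on an 8.8 mm wide sensor from 0.5 m:

```python
def magnification(image_size_mm, object_size_mm):
    # M = image size / object size (equivalently, image distance / object distance)
    return image_size_mm / object_size_mm

def required_focal_length(object_distance_mm, m):
    # From the lens equation 1/u + 1/v = 1/f with v = uM:  f = uM / (M + 1)
    return object_distance_mm * m / (m + 1)

# Worked example from the text: a 10 cm wide object imaged on an
# 8.8 x 6.6 mm sensor from a distance of 0.5 m.
m = magnification(8.8, 100.0)            # 0.088
f = required_focal_length(500.0, m)      # approximately 40.44 mm
```

In practice one would then round down to the nearest stock focal length (35 mm here), as the text suggests, accepting the slightly larger field of view.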
2.2.2 Camera sensors
There are essentially two types of video camera available: one, the vidicon, is based on vacuum-tube technology and the other is based on semiconductor technology. Briefly, a vidicon tube is a photoconductive device which employs a photosensitive sensor layer consisting of several million mosaic cells insulated from one another on a transparent metal film (refer to Figure 2.1). Each cell represents a small capacitor whose charge is a function of incident light. The sensor layer is scanned in a raster format with an electron beam over 625 lines in accordance with the television standard (discussed in the next section). This beam is deflected magnetically by a set of coils outside the tube bulb. The electron beam makes up charge lost through the incidence of light in individual mosaic cells and
Figure 2.1 Schematic of the pickup tube of the Vidicon television camera. (Labels: transparent front plate, lens, focusing coil, deflection coil, cathode, deflected electron beam, sensor layer, video signal.)
so generates the video signal at the sensor element. This video signal is simply a continuous analogue signal proportional to the light intensity of the focused image. The camera electronics insert synchronization pulses (syncs.) to indicate scan lines, fields and frame ends (see Section 2.2.3). A distinction is made between the following camera types depending on the sensor element: the standard vidicon has a sensor element comprised of antimony sulphide (Sb2S3), the silicon diode vidicon has a sensor element made from silicon (Si), while the plumbicon has a sensor element made of lead oxide (PbO). Most solid-state cameras are based on charge-coupled device (CCD) technology, though there are several variations on the theme. In order for the reader to be at least acquainted with the names of these devices, they are listed here:
CTD    charge transfer devices
STD    single transfer devices
BBD    bucket brigade devices
CCD    charge coupled devices
CID    charge injection devices
SCCD   surface charge coupled devices
BCCD   bulk charge coupled devices

Figure 2.2 Interline transfer of charge in CCD sensors. (Light-sensitive photosites alongside shielded CCD shift registers: (1) charges are shifted to the shielded area; (2) the charges in the column are shifted down one cell; (3) a row of charge is then shifted out to the output signal amplifier.)
CCD technology is the most widespread, the basic structure of which is that of an analogue shift register consisting of a series of closely spaced capacitors. Charge integration (accumulation) by the capacitors, or photosites, caused by the photons comprising the incident light, provides the analogue representation of light intensity. At the end of the integration period (exposure time) these charges are read out of the sensor. CCD sensors most commonly use one of three addressing strategies: interline transfer, frame transfer, and column-row transfer.

The interline transfer CCD is organized into column pairs of devices. An imaging column of photosensors is adjacent to an opaque vertical shift register (see Figure 2.2). Charge accumulates in the imaging column until the end of the integration period, when it is transferred to the opaque column. The signal then shifts vertically into a horizontal shift register that represents the picture sequentially, line by line. The advantage of the interline transfer is that the transfer time (to opaque storage) is short compared to the integration period. This is desirable because when transfer time approaches the integration time, solid-state sensors tend to exhibit a locally contained spreading of the image response, called smear. Thus, the interline transfer minimizes smear.

In the frame transfer organization (refer to Figure 2.3) the sensor consists of vertical columns of CCD shift registers divided into two zones. One zone, where charge accumulates during integration time, is photosensitive. When integration is complete, the whole array is transferred in parallel to the opaque storage area of the second zone.

A third type of solid-state sensor employs x-y addressing to transfer charge from the photosite to the output signal amplifier. The sensor elements are addressed
Figure 2.3 Frame transfer of charge in CCD sensors. (Light-sensitive photosites above shielded CCD shift registers: (1) all charges are shifted down to the shielded area; (2) each row of charge is then shifted out to the output signal amplifier.)
by selecting individual column and row electrodes. Charge collected under the column electrode is transferred to the row electrode and amplified for output. All the devices discussed so far have been so-called 'area sensors', i.e. the sensor has been a two-dimensional array of photosites. However, there is another important class of solid-state sensor (and camera): this is the linear array sensor or
linescan sensor. In effect, these sensors are simply a one-dimensional array (row) of photosites, and use exactly the same technology as the two-dimensional array sensors. They differ in two important characteristics, however. Firstly, these sensors can have between 256 and 4096 photosites in a row and, hence, can achieve much greater resolution than state-of-the-art array cameras. Since they are inherently one-dimensional devices they can only take pictures of slices of a two-dimensional scene, and if a two-dimensional image is required, several such slices must be acquired. Thus, these sensors are best suited to inspection applications in which the scene to be scanned is in continuous linear motion (or, indeed, where the camera itself can be translated). The second point to notice about these sensors is that the video signal they generate does not correspond to any particular video standard; what is produced is essentially a time-varying analogue voltage which represents the incident light along the line of photosites. The repercussion of this characteristic is that most systems which use a linescan camera tend to have custom-designed computer interfaces, although it is worth noting that matched linescan cameras and digitizers and, indeed, linescan digitizers are now commercially available (e.g. from Datacube Inc. and Imaging Technology Inc.).
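Since a linescan sensor yields only one row of samples per exposure, a two-dimensional image must be accumulated as the part moves past the camera, one line per increment of motion. The sketch below illustrates that accumulation loop; the 'sensor' here is purely illustrative and stands in for whatever custom interface delivers each line:

```python
def read_line(sensor_row):
    # Stand-in for one linescan exposure: a single row of intensity samples.
    return list(sensor_row)

def build_image(moving_scene, n_lines):
    # Accumulate successive line scans, one per increment of part motion,
    # into a two-dimensional image (a list of rows).
    return [read_line(moving_scene[i]) for i in range(n_lines)]

# A toy 3-line 'scene' passing under the sensor:
image = build_image([[10, 20], [30, 40], [50, 60]], 3)
```

The essential point is that the vertical resolution of the result is set by the part's speed and the line rate, not by the sensor itself.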
2.2.3 Camera interfaces and video standards

The monochrome video signal standard used in the United States and Japan is RS-170, a subset of the NTSC (National Television Systems Committee) standard. Europe uses the international CCIR (International Radio Consultative Committee) standard, which is similar to, but incompatible with, RS-170. Television video signal standards are, unfortunately, inappropriate for the requirements of machine vision and both standards present essentially the same problem to machine vision applications. However, since the entertainment industry is still a far more lucrative market for camera manufacturers than machine vision is, few image sensors and cameras deviate from television standards. We will have a brief look at the CCIR standard to understand better the limitations it imposes on the imaging system.

Approximately fifty pictures per second are necessary for a flicker-free television picture. The generation of a complete new picture every 1/50th of a second (20 ms) would place a severe load on the capabilities of the video amplifier so, in an attempt to obtain a flicker-free image, the odd lines of a picture are scanned first and then, in the following 1/50th of a second, the even lines are scanned. This means that a complete picture is established in 1/25th of a second. This procedure, whereby every alternate line is skipped in each field, is called interlaced scanning. Thus, the number of complete pictures per second, i.e. the picture frame frequency, is 25 Hz; the raster field frequency is 50 Hz. Each frame comprises 625 lines in the CCIR standard (525 in the NTSC RS-170 standard); thus, with twenty-five pictures of 625 lines, 625 × 25 = 15625 lines will be scanned in a second. The so-called line frequency, f, is, hence, 15625 Hz. The signal from a line comprises the picture information, a synchronizing pulse marking the end of a line (horizontal sync. pulse), and a blanking period. The time available for one line is:

T = 1/f = 1/15625 s = 64 × 10⁻⁶ s = 64 μs

11.5 μs of this is used for the blanking and synchronizing signal. This consists of a 6.5 μs porch blanking signal and a 5 μs sync. pulse. The sync. pulse signals the beginning of one line and the end of another, and the porch represents a quiescent voltage level which prevents the beam appearing as a bright line during the line flyback (when being displayed on a video monitor). The end of the picture consisting of 625 lines, i.e. the frame, is characterized by several picture pulses which are significantly different from the line pulses. During these pulses, the picture is blanked and the beam returns to the top of the picture for the next frame. A (simplified) diagrammatic representation of a video signal is depicted in Figure 2.4.

Unfortunately for machine vision, the CCIR standard specifies a 4:3 horizontal-to-vertical aspect ratio for video signals. The television monitor has a similar aspect ratio and thus an image of a square will appear square. However, as we shall see shortly, image acquisition devices often have a 1:1 aspect ratio, meaning that the rectangular video picture is effectively squeezed into a square with a consequent geometric distortion. This has very severe repercussions for vision-based gauging applications, which we will address in Chapter 3.
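The CCIR line-timing arithmetic above is easy to check numerically. The following sketch uses only the figures quoted in the text (625 lines, 25 frames per second, 6.5 μs porch, 5 μs sync. pulse):

```python
LINES_PER_FRAME = 625
FRAMES_PER_SECOND = 25           # two interlaced fields from a 50 Hz field rate

line_frequency_hz = LINES_PER_FRAME * FRAMES_PER_SECOND   # 15625 Hz
line_period_us = 1_000_000 / line_frequency_hz            # 64 microseconds

porch_us = 6.5                   # porch blanking signal
sync_us = 5.0                    # horizontal sync. pulse
active_line_us = line_period_us - (porch_us + sync_us)    # picture information
```

This leaves 52.5 μs of each 64 μs line for actual picture information, a figure that becomes important in Chapter 3 when choosing a sampling frequency for the analogue-to-digital converter.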
2.2.4 Characteristics of camera sensors
There are several measures of the usefulness and reliability of a camera sensor; these characteristics will be briefly considered in turn.
Resolution

Given that the purpose of the camera sensor is to convert an incident analogue optical image to some electrical signal, the resolution of a sensor can be defined as the number of optical image elements which can be discriminated by that sensor
Figure 2.4 Composite video signal. (Voltage levels marked: reference white, reference black, blanking, and sync.; the sync. pulse is indicated.)
and represented by the resulting signal. The resolution is limited by the number of photosites in the sensor since this defines the frequency with which the optical image is sampled. The concept of image sampling is an important one and we will defer detailed discussion of this issue until Chapter 3. For the moment, note that in solid-state cameras the effective resolution of the sensor is approximately half that of the number of photosites in any given direction. Resolution is also limited by the array geometry and by how much opaque material separates the photosites: interline transfer sensors have less effective resolution than frame transfer sensors due to the presence of shielded buffers between each photosensitive line. As we will see in Chapter 3, the resolution of the camera system is also constrained by the limitations imposed by the CCIR video standard and by the sampling frequency of the analogue-to-digital converter which converts the video signal to a digital image. Tube camera resolution is a function of the electron-beam diameter relative to the area of the photoconductive layer. Tube camera resolution is generally higher than that of solid-state cameras and easily outstrips the limitations imposed by the CCIR standard.
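The halving rule above has a direct practical consequence for gauging: it sets the smallest object feature a given sensor can resolve across a given field of view. A rough illustration (the example figures here are ours, not the book's):

```python
def effective_resolution(photosites):
    # Approximation stated in the text: for solid-state cameras the effective
    # resolution is about half the photosite count in a given direction.
    return photosites / 2

def smallest_resolvable_feature(field_of_view_mm, photosites):
    # Smallest feature distinguishable across the field of view, in mm.
    return field_of_view_mm / effective_resolution(photosites)

# e.g. a 100 mm field of view imaged on a 512-photosite row:
feature_mm = smallest_resolvable_feature(100.0, 512)
```

For this hypothetical 100 mm field of view and 512 photosites, features smaller than about 0.4 mm cannot be reliably discriminated, whatever the subsequent processing does.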
linearity for common types of sensor:
Image sensor Sb 2S3 vidicon PbO plumbicon CCD
Gamma 0.6 0.95
1
o
Lag Lag is often defined as the percentage of the signal current at a certain point of the target some period after the illumination has been switched off. Typical values for an elapsed time of 60 ms are shown below: Image sensor Sb2 S3 PbO plumbicon CCD Lag 20070 20705070 1070
The lag introduces a limitation in the practical industrial application of TV cameras in terms of the permissible speed of movement of an object under consideration.
Geometric faults

For television cameras with electron-beam scanning, deviations in the constancy of the vertical and horizontal deflection show up as faults in the geometrical placement of the picture content. Standard industrial cameras are designed not as measuring cameras but to generate a picture for subjective human examination, and they exhibit relatively large geometrical faults: typically ±1 per cent or ±2 per cent of the picture frame for a standard industrial television camera, and usually much larger for cheaper cameras. With CCD cameras there are no geometrical faults due to electron-beam scanning; any geometric distortion is due to the lens.
Spectral sensitivity

The spectral sensitivity of an image sensor system is defined as the variation of the output as a function of the wavelength of the incident light. Referring to Figure 2.5, we see that solid-state image sensors are sensitive to light in the near-infrared region of the spectrum; since such radiation does not carry any useful visual information, an infrared cut filter should be used with such cameras, particularly if they are to be used with incandescent lighting.

Blooming
If the sensor of a television camera is subjected to intensive brightness, then the
Sensitivity and transfer linearity

The input signal of an image sensor is a distribution of light or brightness. The output signal is a current or voltage which is related to the brightness. The sensitivity of the sensor is defined as the ratio of the output magnitude to the input magnitude. In general, the output will be some power function of the input magnitude:
output magnitude = (input magnitude)^γ
where γ (gamma) is the exponent of the transfer function which, rearranging the above equation, is given by:

γ = log(output magnitude) / log(input magnitude)

A linear sensor would have a γ of 1. Typical values of transfer linearity for common types of sensor are tabulated above.
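As a quick numerical sketch of this transfer exponent (the input/output values below are hypothetical, not measurements of any particular sensor):

```python
import math

def gamma(input_mag, output_mag):
    """Estimate the transfer exponent: gamma = log(output) / log(input)."""
    return math.log(output_mag) / math.log(input_mag)

# A vidicon-like sensor with gamma = 0.6: an input of 100 units
# yields 100 ** 0.6 output units.
print(round(gamma(100.0, 100.0 ** 0.6), 2))   # 0.6
# A linear (CCD-like) sensor returns its input unchanged: gamma = 1.
print(round(gamma(100.0, 100.0), 2))          # 1.0
```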
Figure 2.5 Spectral sensitivity of vidicon and CCD cameras (relative response against wavelength, from the ultraviolet to about 1200 nm).
excess charge carriers spread into neighbouring zones and light is registered there also. Thus a thin beam of very bright light will be sensed over an area considerably larger than the cross-sectional area of the beam. This effect is called 'blooming' and is especially noticeable with specular (mirror-like) reflections. While tube cameras are most susceptible to blooming, solid-state sensors, too, will exhibit this characteristic under extreme conditions.
expresses the ratio of electrical variables of the same unit in logarithmic terms, one can compute the level, defined as twenty times the decimal log of the ratio of the linear variables and expressed in decibels (dB). Thus, a signal-to-noise ratio of 20 dB means that the picture quality is very bad (20 dB implies that log10(S/N) equals 1 and, thus, the signal-to-noise ratio is 10:1). For satisfactory quality, a signal-to-noise ratio of 40 dB or more is required.
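The decibel arithmetic just described can be checked in a few lines:

```python
import math

def snr_db(signal_amplitude, noise_amplitude):
    """Signal-to-noise ratio as a level in decibels: 20 * log10(S / N)
    for linear (amplitude) quantities."""
    return 20.0 * math.log10(signal_amplitude / noise_amplitude)

print(snr_db(10.0, 1.0))    # 20.0 -> 'very bad' picture quality
print(snr_db(100.0, 1.0))   # 40.0 -> the minimum for satisfactory quality
```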
Noise
Noise, unwanted interference on the video signal, is defined quantitatively by the signal-to-noise ratio (S/N), i.e. the ratio of the amplitude of the wanted signal to the average amplitude of the interference. For television cameras, the signal-to-noise ratio is defined as the peak-to-peak video signal divided by the effective amplitude of the noise. If one follows general practice in telecommunications and
2.2.5 Commercially available cameras
Table 2.1 summarizes the characteristics of a number of popular CCD cameras which were available at the time of writing. We have listed only CCD cameras since these are the most popular for industrial machine vision.
Exercises
Table 2.1 CCD camera systems

Vendor      Model      Camera type   Sensor resolution   Output signal                      Interface required
Sony        XC77CE     CCD Area      756 x 581           CCIR                               Standard framestore
Sony        XC57CE     CCD Area      500 x 582           CCIR                               Standard framestore
Panasonic   WV50       CCD Area      500 x 582           CCIR                               Standard framestore
Pulnix      TM540      CCD Area      500 x 582           CCIR                               Standard framestore
Fairchild   CCD3000    CCD Area      488 x 380           CCIR                               Standard framestore
Hitachi     KP120      CCD Area      320 x 240           CCIR                               Standard framestore
Videk       Megaplus   CCD Area      1320 x 1035         Non-interlaced analogue; digital   Special framestore
1. Identify the basic operational differences between vidicon-tube cameras and solid-state cameras. Which type of camera is more appropriate for use in FMS (flexible manufacturing systems) requiring automated vision for either inspection or robot guidance? Why is this so?

2. What limitations does the European CCIR television standard impose on the effective resolving power of imaging equipment? Is it an appropriate standard for machine vision? Explain.

3. It is required to image a 5 cm wide object using a CCD camera, whose sensor measures 8.8 mm x 6.6 mm, at a distance of 0.5 m. What focal length lens should you choose? What options do you have if you wish to image an object which measures only 1 mm x 1 mm?
References and further reading
Batchelor, B.G., Hill, D.A. and Hodgson, D.C. (eds) 1985 Automated Visual Inspection, IFS (Publications) Ltd, UK.
Blouke, M.M., Corrie, B., Heidtmann, D.L., Yang, F.H., Winzenread, M., Lust, M.L. and Marsh, H.H. 1987 'Large format, high resolution image sensors', Optical Engineering, Vol. 26, No. 9, pp. 837-43.
CCIR, International Radio Consultative Committee 1982 Recommendations and Reports of the CCIR, 1982, XVth Plenary Assembly, Geneva, Vol. XI, Part 1.
Dunbar, P. 1986 'Machine vision', Byte, Vol. 11, No. 1, pp. 161-73.
Herrick, C.N. 1972 Television Theory and Servicing, Reston Publishing Company, Virginia.
Market Intelligence Research Co. 1986 Solid-state Cameras for Machine Vision, Market Intelligence Research Company, 4000 Middle Field Road, Palo Alto, California, 94303, USA. (Vision, SME, Vol. 4, No. 1, p. 12.)
Tassone, J. 1987 'An illumination system for inspecting printed circuit boards with surface mounted components', Vision, SME, Vol. 4, No. 4, pp. 1-5.
Table 2.1 (continued)

Vendor                    Model      Camera type    Sensor resolution        Output signal             Interface required
Fairchild                 CCD1600R   CCD Linescan   3456 x 1                 Analogue                  Requires camera controller
Honeywell                 HVS 256    CCD Linescan   256 x 1                  RS232; RS422              Conventional serial port
Fuji                      PJ1        CCD Linescan   2048 x 1                 Analogue; digital flags   Can be directly interfaced to PLC(a)
Analytic Vision Systems   IMS90      CCD Linescan   1024 x 1; 2048 x 4 4096  Analogue; digital         Slow-scan framestore; can be directly interfaced to PLC

(a) PLC: Programmable Logic Controller.
3

Image acquisition and representation
3.1 Sampling and quantization
Any visual scene can be represented by a continuous function (in two dimensions) of some analogue quantity. This is typically the reflectance function of the scene: the light reflected at each visible point in the scene. Such a representation is referred to as an image, and the value at any point in the image corresponds to the intensity of the reflectance function at that point. A continuous analogue representation cannot be conveniently interpreted by a computer and an alternative representation, the digital image, must be used. Digital images also represent the reflectance function of a scene but they do so in a sampled and quantized form. They are typically generated with some form of optical imaging device (e.g. a camera) which produces the analogue image (e.g. the analogue video signal discussed in the preceding chapter), and an analogue-to-digital converter: this is often referred to as a 'digitizer', a 'framestore', or a 'framegrabber'. The framegrabber samples the video signal in some predetermined fashion (usually in an equally spaced square grid pattern) and quantizes the reflectance function at those points into integer values called greylevels (see Figure 3.1). Each integer greylevel value is known as a pixel and is the smallest discrete accessible subsection of a digital image. The number of greylevels in the (equally spaced) greyscale is called the quantization or greyscale resolution of the system. In all cases, the greyscale is bounded by two greylevels, black and white, corresponding to the minimum and maximum measurable intensity respectively. Most current acquisition equipment quantizes the video signal into 256 discrete greylevels, each of which is conveniently represented by a single byte. In certain cases, a greyscale of -128 to +127 is more convenient; processed images need not necessarily represent the reflectance function and pixels may assume negative values, but the greylevel can still be represented by a signed-byte integer.
Figure 3.1 Sampling and quantization: a rectangular sampling pattern quantized onto a greyscale.

The sampling density, the number of sampling points per unit measure, is usually referred to as the (spatial) resolution and, since the sampling device is usually arranged as a square grid, it is measured in terms of the number of sampling elements along each orthogonal axis. This normally corresponds to the number of pixels in both the horizontal and vertical directions. Most current commercial framegrabbers have spatial resolutions of 512 x 512 pixels. In summary, digital image acquisition equipment is essentially concerned with the generation of a two-dimensional array of integer values representing the reflectance function of the actual scene at discrete spatial intervals, and this is accomplished by the processes of sampling and quantization. Since these are fundamentally important concepts, we will look at them in a little more depth in the remainder of this section.
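The sampling-and-quantization process just summarized can be sketched in a few lines of Python. The `acquire` function and its reflectance-function argument are illustrative names, not part of any real acquisition system:

```python
def acquire(reflectance, width, height, levels=256):
    """Sample a continuous reflectance function f(x, y) -> [0, 1] on an
    equally spaced square grid and quantize each sample into integer
    greylevels 0..levels-1 (0 = black, levels-1 = white)."""
    image = []
    for row in range(height):
        line = []
        for col in range(width):
            # Sample at the centre of each grid cell, in unit coordinates.
            x = (col + 0.5) / width
            y = (row + 0.5) / height
            sample = reflectance(x, y)
            # Quantize: map [0, 1] onto the discrete greyscale.
            line.append(min(levels - 1, int(sample * levels)))
        image.append(line)
    return image

# A smooth horizontal ramp from black to white, digitized at 8 x 8.
ramp = acquire(lambda x, y: x, 8, 8)
print(ramp[0])   # [16, 48, 80, 112, 144, 176, 208, 240]
```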
3.1.1 Spatial frequency and the effects of sampling
Recall that the objective of the image acquisition system (which includes the sensor subsystem) is to convert an analogue optical image to a digital image in as faithful a manner as possible. As we have seen, this is achieved by first using the sensor to sample the analogue optical image, generating an analogue video signal, and then by subsequently resampling the video signal and generating the digital signal. Thus, there are three factors which can limit the effective resolution and fidelity of the final digital image:

(a) the sensor sampling frequency;
(b) the bandwidth of the video signal;
(c) the sampling frequency of the analogue-to-digital converter, i.e. the framegrabber.
We shall consider each of these in turn. First, however, we need to look a little closer at the idea of sampling an analogue image and to develop the concept of spatial frequency. We will do this intuitively at first and then we will formalize the ideas somewhat. Readers who find this area interesting should consult Duda and Hart (1973), which contains a particularly lucid account of this topic. High-frequency signals, such as high-pitch sound waves, periodically change their value over a very short period of time. Similarly, a high-frequency spatial signal, such as an image, periodically changes its value (e.g. its intensity or greylevel) over a very short distance, i.e. it changes abruptly from one greylevel to another. Conversely, low spatial frequencies correspond to 'slower' changes in intensity, where the change occurs gradually from one position in the image to another. To make this idea more formal, consider a spatially unbounded analogue optical image g(x, y). The Fourier transform G(fx, fy) of the image g(x, y) is defined by:

G(fx, fy) = F(g(x, y))
          = ∫ ∫ g(x, y) exp[-2πi(fx x + fy y)] dx dy    (3.1)

where both integrals are taken from -∞ to ∞.
The inverse Fourier transform of G(fx, fy) is defined by:

g(x, y) = F^-1(G(fx, fy))
        = ∫ ∫ G(fx, fy) exp[2πi(fx x + fy y)] dfx dfy    (3.2)

where, again, both integrals are taken from -∞ to ∞.
The variables fx and fy which identify the domain over which G(fx, fy) is defined are known as the spatial frequencies in the x- and y-directions, respectively, and the domain is known as the spatial frequency domain. What do these variables and this domain represent? The integral in equation (3.2) can be viewed as a 'generalized sum' of complex exponentials, defined in terms of the spatial frequencies fx and fy, and in terms of the spatial coordinates x and y. Each exponential is weighted by a value given by G(fx, fy) and, thus, equation (3.2) is an expansion (or expression) of the analogue optical function g(x, y) in terms of this weighted generalized sum of exponentials. These weights are, in fact, given by equation (3.1), i.e. the Fourier transform of the image function, and will, of course, vary with image content. Since these complex exponentials can also be expressed in terms of sine functions,* a spatial frequency domain which has, for example, just two non-zero values at, say, (fx1, fy1) and (-fx1, -fy1) corresponds to a sinusoidal variation in intensity in the spatial domain, i.e. to an image g(x, y) which comprises sinusoidally undulating 'stripes' of alternating light and dark intensity. The period and orientation of these stripes depends on the exact values of fx1 and
Figure 3.2 (a) An image comprising a sinusoidal variation in intensity along the x axis; and (b) its Fourier transform, comprising two spatial frequency components (fx1, fy1) and (-fx1, -fy1), both of which are spatial frequencies in the x direction.
fy1.
Conversely, an image g(x, y) which comprises a sinusoidal variation in intensity can be expressed in terms of the spatial frequencies (fx1, fy1) and (-fx1, -fy1); see Figure 3.2. The 'quicker' these sinusoidal variations, i.e. the greater the frequency of variation in the spatial domain, the further (fx1, fy1) and (-fx1, -fy1) are from the origin of the G(fx, fy) domain. Of course, a sinusoidal variation in intensity is not a particularly common image. However, Fourier analysis tells us that more complex functions can be constructed by including more terms of varying weight in the 'generalized sum' of exponentials, i.e. by including further spatial frequency components. The exact weight is, again, determined by equation (3.1), i.e. the Fourier transform. An abrupt change in intensity will require the presence of a large number of terms which will correspond to high spatial frequencies, many of which are far removed from the origin of the G(fx, fy) domain; see Figure 3.3. Thus, we have now arrived at the interpretation we required. That is: high spatial frequencies correspond to the presence of abrupt, or sharp, changes in the intensity of the image. The next issue to which we turn is that of sampling. In particular, we would like to know what sampling frequency is required, i.e. how often one needs to sample in the spatial domain in order to represent an image faithfully. Shannon's sampling theorem tells us that a band-limited image (i.e. an image which does not comprise infinitely high spatial frequencies) can be faithfully represented (i.e. reconstructed) if the image is sampled at a frequency twice that of the highest spatial frequency present in the image. This sampling frequency is often referred to as the Nyquist frequency.
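The correspondence between a sinusoidal intensity variation and a pair of spatial frequency components can be demonstrated with a direct discrete Fourier transform. This is an illustrative one-dimensional sketch (one image line), not the continuous two-dimensional transform of equation (3.1):

```python
import cmath
import math

def dft(samples):
    """Discrete Fourier transform, evaluated directly (O(N^2))."""
    n = len(samples)
    return [sum(samples[k] * cmath.exp(-2j * cmath.pi * f * k / n)
                for k in range(n))
            for f in range(n)]

# An image line whose intensity varies sinusoidally: 5 cycles in 64 samples.
n = 64
line = [math.cos(2 * math.pi * 5 * k / n) for k in range(n)]
spectrum = [abs(c) for c in dft(line)]

# Energy appears only at spatial frequency +5 and its negative
# counterpart -5 (which the DFT stores at index n - 5 = 59).
peaks = [f for f, mag in enumerate(spectrum) if mag > 1.0]
print(peaks)   # [5, 59]
```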
We are now in a position to return to address the three issues we raised at the beginning of this section: the effect of the sensor sampling frequency (resolution), the video signal bandwidth, and the analogue-to-digital (framegrabber) sampling frequency on the effective resolving power of the image acquisition system. First, we now see that a sensor which has, for example, 756 photosites in the horizontal direction, i.e. along a line, will only be capable of representing a sinusoidal variation in intensity which has a spatial frequency of 378 cycles per unit distance. A pattern of 378 alternately light and dark bars with abrupt edges, i.e. discontinuities in intensity, would obviously require a much higher sensor sampling frequency to represent the image faithfully. Second, the resolution of a television picture is also limited by the number of lines and the frequency bandwidth of the video signal. We saw in Chapter 2 that the line frequency for the CCIR standard video signal is 15625 Hz. In addition, the nominal bandwidth of the CCIR standard is 5.0 MHz, meaning that a signal can transmit a video image with five million periodic variations in the signal (brightness) levels per second. This results in an absolute maximum of 5 x 10^6 / 15625 = 320 periodic (or sinusoidal) variations per line; that is, the maximum spatial frequency which can be faithfully represented by a video signal is 320 cycles per line. Third, a framegrabber which has, for example, a sampling frequency of 512 pixels in the horizontal direction, will only be capable of faithfully representing a sinusoidal variation in intensity which has a spatial frequency of 256 cycles per unit
Figure 3.3 (a) An image comprising a step discontinuity in intensity along the x axis; and (b) its Fourier transform, exclusively comprising spatial frequency components fx, i.e. spatial frequencies in the x direction.
distance. Again, a pattern of 256 alternately light and dark bars with abrupt edges would require a much higher sampling frequency to represent the image faithfully.
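The bandwidth arithmetic above can be checked directly (figures as quoted in the text):

```python
# CCIR limits quoted above: 5.0 MHz nominal bandwidth, 15625 lines/second.
bandwidth_hz = 5.0e6
line_frequency_hz = 15625

cycles_per_line = bandwidth_hz / line_frequency_hz
print(cycles_per_line)   # 320.0 -> at most 320 cycles per line

# A framegrabber sampling 512 pixels per line resolves at most
# 512 / 2 = 256 cycles per line (the Nyquist limit).
print(512 // 2)          # 256
```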
square image and, hence, the effective distance between horizontal pixels is 4/3 times that between vertical neighbours (see Figure 3.4). The situation becomes even more unsatisfactory when one considers the distance between a pixel and its diagonal neighbour. Most discussions of interpixel distances assume the diagonal interpixel interval to be √2 (i.e. the length of the hypotenuse completing the right-angle triangle formed by the horizontal interpixel interval of length 1 and the vertical interpixel interval of length 1). However, if we are working with a framestore which has a square aspect ratio and a video signal which has a 4:3 aspect ratio, then the diagonal interpixel interval is, in fact, 5/3, i.e. the length of the hypotenuse completing the right-angle triangle formed by the horizontal interpixel interval of length 4/3 and the vertical interpixel interval of length 1 (see Figure 3.5). Unfortunately, this is not the complete story. The CCIR standard stipulates that a picture comprises 625 lines. However, only 576 of these carry visual information, while the remainder are used for other purposes. Framestores with a vertical resolution of 512 pixels (i.e. 512 lines) do not capture all 576 of these lines of video; they only capture the first 512 of them. This introduces a further distortion, resulting in an effective CCIR aspect ratio of 4:(3 x 512/576) = 4:2.66, or 3:2. The effective vertical, horizontal and diagonal interpixel distances are thus 1, 3/2 and √13/2, respectively.
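These interpixel distances follow from simple right-angle-triangle geometry, and can be verified numerically:

```python
import math

# Square (1:1) framestore digitizing a 4:3 video picture: horizontal
# neighbours are effectively 4/3 further apart than vertical neighbours.
h, v = 4.0 / 3.0, 1.0
print(math.hypot(h, v))   # 5/3 ~ 1.6667: the diagonal interpixel interval

# With only 512 of the CCIR's 576 active lines captured, the effective
# aspect ratio becomes 3:2, so the horizontal interval grows to 3/2.
h = 3.0 / 2.0
print(math.hypot(h, v))   # sqrt(13)/2 ~ 1.8028
```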
3.2 Interpixel distances
As mentioned in the preceding chapter, video signals assume an aspect ratio of 4:3 (horizontal to vertical) and we noted in Section 3.1 that, although framestores with 4:3 aspect ratios are becoming available, most use a square aspect ratio of 1:1. The aspect ratio mismatch has serious repercussions for the gauging or measuring functions of an inspection system: the rectangular picture is being squeezed into a
3.3 Adjacency conventions
There is yet another problem which is associated with the representation of digital images: the exact spatial relationship of one pixel with its neighbours. In effect, this problem is one of defining exactly which are the neighbours of a given pixel. Consider the 3 x 3 neighbourhood in an image (shown in Figure 3.6) where the
Figure 3.4 Distortion arising when acquiring a video signal with a conventional (square) framestore.
Figure 3.5 Interpixel distances.
Figure 3.6 A 3 x 3 pixel neighbourhood.
within other regions as one goes from level to level in the nesting. Thus, one would apply 4-adjacency to the background region, 8-adjacency to the objects resting on the background, 4-adjacency to holes or regions contained within the objects, 8-adjacency to regions contained within these, and so on. This convention means that one never encounters topological anomalies such as the one described in the above example. Because the 8-adjacency convention allows diagonally connected neighbours, measurements of object features (e.g. perimeter) will be more faithful with 8-adjacency. Consequently, if it is possible to stipulate which convention is to be applied to a particular object of interest, one should opt for the 8-adjacency convention. The preceding discussion has been developed in the context of the acquisition of digital images from area cameras. As we have seen, however, linescan cameras are extremely important imaging devices, particularly for high-resolution gauging applications. The same sampling and quantization processes must also be used with these linear array cameras; the only difference in this case is that only one line of video information needs to be digitized. Similarly, the discussion of adjacency conventions and interpixel distances applies equally. Image acquisition equipment for linescan sensors will often be custom-designed for a particular application and configuration, although it is worth reiterating the point made earlier that matched linescan cameras and digitizers and, indeed, general-purpose linescan digitizers are now appearing on the market. These linescan digitizers can deal with many different scan rates. They are often referred to as slow-scan digitizers. This is an unfortunate misnomer: they are really variable-scan digitizers and can deal with extremely high data rates.
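The doughnut anomaly is easy to reproduce with a toy connected-components count. The image below is an illustrative 5 x 5 approximation of the doughnut of Figure 3.7, not the book's actual data:

```python
def count_components(image, neighbours):
    """Count connected regions of 1-pixels under a given adjacency."""
    rows, cols = len(image), len(image[0])
    seen = set()
    count = 0
    for r in range(rows):
        for c in range(cols):
            if image[r][c] == 1 and (r, c) not in seen:
                count += 1
                # Flood-fill this component with an explicit stack.
                stack = [(r, c)]
                while stack:
                    y, x = stack.pop()
                    if (y, x) in seen:
                        continue
                    seen.add((y, x))
                    for dy, dx in neighbours:
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1):
                            stack.append((ny, nx))
    return count

FOUR = [(-1, 0), (1, 0), (0, -1), (0, 1)]
EIGHT = FOUR + [(-1, -1), (-1, 1), (1, -1), (1, 1)]

# A simplified 'doughnut': a diamond ring of dark (1) pixels on white (0).
doughnut = [
    [0, 0, 1, 0, 0],
    [0, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 1, 0, 0],
]
print(count_components(doughnut, FOUR))    # 8: the segments fall apart
print(count_components(doughnut, EIGHT))   # 1: a single connected ring
```

Under 4-adjacency no two ring pixels touch, while under 8-adjacency the diagonal links close the ring, exactly as the text describes.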
Figure 3.7 Adjacency conventions: a dark doughnut on a white table.
pixels are labelled 0 through 8; which pixels does pixel 8 touch? One convention, called 4-adjacency, stipulates that pixel 8 touches (i.e. is connected to) pixels 0, 2, 4, and 6, but does not touch (i.e. is not connected to) pixels 1, 3, 5, and 7. An alternative convention, called 8-adjacency, stipulates that pixel 8 is connected to all eight neighbours. Adopting either convention universally can, however, lead to difficulties. Figure 3.7 shows part of a digital image depicting, in extremely simplified form, a dark doughnut on a white table. If we apply the 4-adjacency convention, then we have an obvious problem: there are four 'doughnut segments' (two vertical and two horizontal) but none of the segments is touching: the segments are not connected. Applying the 8-adjacency convention, the segments are now connected in a ring (as we would expect) but now so too is the inside of the doughnut connected to the outside: a topological anomaly. In themselves, neither convention is sufficient and it is normal practice to use both conventions: one for an object and one for the background on which it rests. In fact, this can be extended quite generally so that the adjacency conventions are applied alternately to image regions which are recursively nested (or embedded)
3.4 Image acquisition hardware
A typical machine vision system can be configured in two ways: by building it yourself using off-the-shelf equipment or by buying a complete turnkey system. The former alternative is very cost-effective but a significant amount of work needs to be done to integrate the system, ensuring that the image acquisition and processing devices are correctly matched with the host computer and can be controlled by it. The latter alternative is more expensive but, depending on the application, turnaround time on development should be much reduced. There are two mutually dependent components to be chosen when configuring a machine vision system with off-the-shelf equipment: the CPU and the framestore. Most commercial framestores are dedicated to just a few of the more popular computer buses, in particular the VME bus, the Multibus, the Q-bus, the IBM PC bus, the Nubus (Apple Macintosh II) and, to some extent, the IBM MicroChannel Architecture used in the newer PS/2 models. Each bus enforces certain restrictions on the way the framestore is used, and most of the functional differences are related to the bus structure and the available support equipment (i.e. the framestore sister boards).
Table 3.1 Overview of commercial framegrabber hardware

The table compares boards from Imaging Technology (PCVision Plus, Series 100, 150/151 family, 200 family), Datacube (MaxVideo family), Data Translation (DT 2851/2858, DT 2603, DT 2861, DT 225) and Matrox (MVP-AT, MVP-NB) on the following features: number of frame buffers; spatial resolution (from 256 x 256 to 768 x 512, with 512 x 512 most common); bus compatibility (VME, Multibus, Q-bus, PC-AT, Nubus); RS170 and CCIR compatibility; number of greylevels (typically 256); dedicated graphics planes; I/O-mapped or memory-mapped access; input LUTs; pseudo-colour; pan/scroll/zoom; dedicated video bus; number of video inputs; 4:3 aspect-ratio compensation; convolution; erosion; dilation; sister boards; real-time ALU operations; histogramming; feature extraction; area-of-interest processing; prototyping board; and variable-scan interface. In several cases the functionality is provided by a sister board rather than by the framegrabber itself. A 'feature', in this context, has an extremely restricted meaning: it is a region having a specified greylevel.
It would not be appropriate to enter into a discussion of different CPUs here; suffice it to say that most microprocessors are supported by VME bus and Multibus boards (e.g. 68030, 80386, 32032), while Q-bus computers are generally MicroVAX based. The IBM PC-AT bus is supported by IBM's PC-AT and the multitude of PC clones (most of which are now significantly faster than the AT). Once an image has been digitized, it is stored on the framegrabber in memory as a square two-dimensional array. It may now be accessed, processed, and analysed by the host computer. However, it is frequently desirable to process the digital image somewhat before the computer ever uses the information, mainly because the resultant image may be more amenable to subsequent analysis and it may take the computer an inordinately long time to perform such processing. This preprocessing can frequently be accomplished in one frame-time (40 ms) by hardware resident on the framegrabber or by sister boards for the framegrabber. This rate of processing is often called real-time image processing, as the image data can be processed at the same rate at which the camera system generates it. Recall, however, the comment in Chapter 1 that a more pragmatic definition of real-time in machine vision is simply that the vision process should keep up with production rates, i.e. if the vision system produces the information as quickly as it is required, it runs in real-time. To provide the reader with some idea of the capabilities that are provided by some commercial framegrabbers and sister boards, a brief summary of the main characteristics is given in Table 3.1. Unfortunately, many of the capabilities that are cited refer specifically to image processing techniques which are not covered until the next chapter. To describe them here, however briefly, would be pre-emptive and you should refer again to this section after having read Chapter 4.
Custom commercial systems, which can be bought complete with processing and analysis software, and subsequently programmed to accomplish a required function, tend to be significantly more expensive than a system configured from off-the-shelf equipment. Since such systems are by their very nature unique, a direct comparison is very difficult. However, a brief synopsis of some systems that are available is included so that the reader can form some idea of their capabilities. Universal Instruments Corp. market several printed circuit board (PCB) inspection systems. Model 5511 exploits four high-resolution CCD cameras, each of which views a small portion of the board from a different angle in order to facilitate thorough inspection. The complete board is inspected by moving the PCB on an X-Y table. It uses high-speed strobed light emitting diode (LED) illumination to allow it to take images while the PCB is being moved on the X-Y table. This increases the overall speed of the system quite significantly since it removes the acceleration and deceleration time between snapshots. The quoted inspection performance is an impressive twelve square inches per second. The 5515 is based on the Motorola 68020 (nominally running at 20 MHz) and can use CAD data to drive the inspection strategy. IRI (International Robomation Intelligence) Ltd offer a wide variety of systems, from development vision systems with imaging resolutions of 256 x 256
and 512 x 512, to fully fledged turnkey PCB inspection systems. The development systems are typically based on Motorola 680XX microprocessors and incorporate a real-time operating system and an extensive library of image processing functions. It should be emphasized that this library of functions is a collection of image processing routines, rather than image analysis routines. The distinction will be discussed in detail in the next chapter, but suffice it for the present to note that a significant amount of software development will be required in most cases before a fully operational target system can be configured and incorporated in a manufacturing environment. A typical IRI development system, the SD512 Vision System, features 512 x 512 x 8 bit resolution, and can store eight images simultaneously. It can interface to normal video cameras, to linescan cameras, and to special-purpose video cameras such as the Videk MEGAPLUS 1320 x 1035 sensor. Optional array processors are available to implement convolution and algebraic operations in near real-time (250 million operations per second). A development system such as this would normally incorporate a MC68020 25 MHz host processor, a MC68851 memory management unit, 8 Mbytes of memory, a convolution processor, a correlation processor, a CCD camera, colour monitor, terminal, 56 Mbyte hard disk, eight camera ports, and 2 Mbytes of image memory. Complete turnkey systems, e.g. the IRI PCB INSPECTOR, would typically cost between two and three times the cost of a development system. In a similar vein to IRI Ltd, Computer Recognition Systems Ltd, in the United Kingdom, offer a mix of configured development systems and application engineering expertise and have developed machine vision systems for several industries.
A typical CRS development system comprises at least a VME-bus workstation (MC68020, 1 Mbyte program RAM, 40 Mbyte hard disk, 5.25" floppy disk drive, Idris operating system, Pascal, and C), two 512 x 512 x 8 bit image memories, an 8 bit resolution framegrabber, an extensive image processing algorithm library and development software, an edge detector (simple 3 x 3 masks), and four video inputs. Note that these summaries, and indeed Table 3.1, are provided for illustration only; while every effort has been made to ensure that this information is correct, manufacturers are continually upgrading their products and specific models often reappear with much enhanced capabilities. As such, no responsibility can be accepted for errors in the functional specifications or prices. You should contact the vendor to ensure that the information is accurate and up to date.
3.5 Speed considerations
There are several issues which must be addressed when evaluating vision systems and their potential processing speed. The first of these is obviously the processing power of the host computer. An Intel 80286-based PC-AT will operate about three times faster than an 8086-based PC, while an 80386-based system can deliver nearly ten times the power of a PC. A DEC MicroVAX will outperform an LSI-11 or PDP-11. Many of the newer, more powerful VMEbus systems are based on Motorola's 68030; but you should be sure to compare the processor clock frequencies and memory access times of otherwise similar CPU boards.

No matter how fast the CPU is, if it cannot communicate effectively with the framestore then the effective image acquisition time will become significant, i.e. the image may be grabbed in 1/25th of a second but it may take nearly one second to transfer it to the host. In general, memory-mapped framestores are the most efficient since transfer may not be necessary; this is the distinct advantage of VME systems, such as some of those marketed by Datacube and Imaging Technology. Many boards are input/output (I/O) mapped, however, and image transfer must take place pixel by pixel; some boards attempt to alleviate this bottleneck by offering DMA (direct memory access) transfer capabilities.

Most microprocessor systems do not support floating point arithmetic directly (the Inmos Transputer is one exception) and, if your application requires a significant number of floating point operations, this may be another computational bottleneck. Even if you can get a floating point coprocessor, it is best to adhere to integer arithmetic wherever possible when writing image analysis software.

As is evident from the survey of imaging systems in the preceding section, most framestores facilitate simple image manipulation (such as contrast stretching, thresholding, and trivial segmentation: see the next chapter for a detailed explanation) through the use of lookup tables (LUTs). Each LUT will have an entry for every greylevel that the system can deal with, typically 256 of them. This entry corresponds to the value to which this greylevel is to be changed. When digitizing the video signal each incoming pixel is checked against the LUT, and the table value, not the digitized value, is stored in the frame buffer.
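To make the table-lookup mechanism concrete, the following Python fragment mimics in software what the framestore hardware does at digitization time. It is an illustrative sketch only (the inversion table is just an example transformation, not one prescribed by the text):

```python
# Illustrative sketch: a 256-entry lookup table (LUT), as used by
# framestores to remap greylevels as pixels are digitized.

def make_inversion_lut():
    """Example table: photometric inversion (negate the greyscale)."""
    return [255 - g for g in range(256)]

def apply_lut(image, lut):
    """Replace every pixel by its table entry, as the hardware would."""
    return [[lut[pixel] for pixel in row] for row in image]

image = [[0, 128, 255],
         [64, 192, 32]]
lut = make_inversion_lut()
print(apply_lut(image, lut))   # [[255, 127, 0], [191, 63, 223]]
```

Because the table is consulted once per pixel, any point operation, however complicated to compute, costs only a single indexed read at run time.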
Most vendors now offer ALU (arithmetic logic unit) boards which contain high-speed pipeline processors. Basic operations include 8-bit 2's complement arithmetic, logical AND, OR, and XOR, multiplication, and 16-bit addition. A full frame can usually be processed in one frame-time (40 ms). These boards are intended to provide real-time frame summation, thresholding, contrast enhancement, and relatively fast edge enhancement and filtering. If real-time filtering, such as edge detection, is required then sister boards can often be deployed. It is important to note that these systems generally utilize their own high-speed video bus to effect communication between the framegrabber and the sister processing boards. As a concluding note, remember that the application requirements will dictate the effective pixel throughput of the inspection or robot vision system, which, in turn, will dictate the required architecture and whether or not the application is feasible. If real-time response is required, as it normally will be, this does not necessarily mean that the system must process the part in, say, 10 ms; it means that the machine vision system must not delay the overall production system. If extremely fast on-the-fly processing is required then one may have to restrict the functionality of the system somewhat to ensure a feasible solution or, alternatively, one may have to deploy dedicated hardware to implement either the initial processing or the subsequent analysis, or both.
Exercises
1. What compatibility problems would you encounter in measuring the perimeter of objects when the alternate 8-adjacency/4-adjacency convention is used?
2. What do you understand by the terms 'quantization resolution' and 'sampling density'?
3. Identify two adjacency conventions and discuss the merits of each. What adjacency convention would you choose in an application to measure visually the perimeter of coastlines in ordnance survey maps using a conventional 4:3 aspect ratio CCIR camera and a 1:1 aspect ratio framestore? How would you ensure that your measurement is as accurate as possible, given a fixed field of view?
4. What problems do you encounter when attempting to use line-scan cameras?
5. Dedicated video buses are used to alleviate the communication bottleneck between framegrabbers and their sister boards. Why?
References and further reading
Agin, G. 1975 An Experimental Vision System for Industrial Application, SRI International, Technical Report No. 103.
Agin, G. 1977 Servoing with Visual Feedback, SRI International, Technical Report No. 149.
Agin, G. 1980 'Computer vision systems for industrial inspection and assembly', Computer, Vol. 13, No. 5.
Beedie, M. 1986 'Image IC detects edges in real time', Electronic Design, May, pp. 50-8.
Duda, R.O. and Hart, P.E. 1973 Pattern Classification and Scene Analysis, Wiley, New York.
Dunbar, P. 1986 'Machine vision', Byte, Vol. 11, No. 1, pp. 161-73.
Giordano, A., Maresca, M., Sandini, G., Vernazza, T. and Ferrari, D. 1985 A Systolic Convolver for Parallel Multiresolution Edge Detection, Internal Report, DIST, University of Genoa.
Giordano, A., Maresca, M., Sandini, G., Vernazza, T. and Ferrari, D. 1987 'VLSI-based systolic architecture for fast Gaussian convolution', Optical Engineering, Vol. 26, No. 1, pp. 63-8.
Healy, P. and Vernon, D. 1988 'Very coarse granularity parallelism: implementing 3-D vision with transputers', Proc. Image Processing '88, Blenheim Online Ltd, London, pp. 229-45.
Li, H.F., Tsang, C.M. and Cheung, Y.S. 1983 'A low-cost real-time imaging and processing system', Software and Microsystems, Vol. 2, No. 5, pp. 12-19.
Plessey Semiconductors Ltd 1986 PDSP16401 2-Dimensional Edge Detector, Preliminary product information.
Selfridge, P.G. and Mahakian, S. 1985 'Distributed computing for vision: architecture and a benchmark test', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-7, No. 5, pp. 623-6.
4 Fundamentals of digital image processing
Although the distinction between digital image processing and digital image analysis is not immediately obvious, it is an extremely important one. Image processing can be thought of as a transformation which takes an image into an image, i.e. it starts with an image and produces a modified (enhanced) image. On the other hand, digital image analysis is a transformation of an image into something other than an image, i.e. it produces some information representing a description or a decision. However, digital image analysis techniques are usually applied to previously processed (enhanced) images. Since the analysis of enhanced images is accomplished quickly and easily by human observers, it is often erroneously assumed that the analysis phase is easy, if not trivial. Although image processing techniques often provide more pleasing or more visually accessible images when viewed by a human observer, and the human is able to detect salient features without difficulty, it is essential to realize that the interpretation of such emergent features is simple only in the context of the powerful perceptual abilities of the human visual system. For machine vision systems, the sole purpose of image processing is to produce images which make the subsequent analysis more simple and more reliable. We are not at all interested in how 'well' the image looks. In particular, the image processing phase should facilitate the extraction of information, it should compensate for non-uniform illumination, and, possibly, readjust the image to compensate for distortions introduced by the imaging system.

Historically, digital image processing had two main thrusts to its development. One was the natural extension of one-dimensional (temporal) digital signal processing to two (spatial) dimensions. Consequently, two-dimensional signal processing was approached from a mathematical basis, allowing a great deal of rigorous manipulation to be performed using classical linear system theory. The second, more heuristic, thrust considers digital images as a set of discrete sample points, performing arithmetic operations on the individual points. This contrasts with the signal processing approach, which treats images as a discrete representation of a continuous two-dimensional function. We will continue our discussion of digital image processing from the second viewpoint, addressing some simple but useful techniques, and drawing when appropriate from the signal processing approach.

There are, broadly speaking, three distinct classes of operations: point operations, neighbourhood operations, and geometric operations. We mentioned in the preceding chapter that there is a trend to perform as much image processing as possible in hardware, particularly in advanced framegrabbers or in their sister boards which have access to the image data via a dedicated video bus. Point and neighbourhood operations are typically the type of operations that are performed by these framegrabber sister boards, while very few systems offer any extensive operations for geometric image processing.

4.1 Point operations

A point operation is an operation in which each pixel in the output image is a function of the greylevel of the pixel at the corresponding position in the input image, and only of that pixel. Point operations are also referred to as greyscale manipulation operations. They cannot alter the spatial relationships of the image. Typical uses of point operations include photometric decalibration, to remove the effects of spatial variations in the sensitivity of a camera system; contrast stretching
101 100 103 105 107 105 103 110
110 140 120 122 130 130 121 120
134 134 135 131 137 138 120 121
132 132 132 133 133 150 160 155
134 140 140 135 140 156 160 174
130 138 139 150 169 175 170 165
126 133 138 149 163 169 180 185
130 140 150 169 178 185 190 200

Figure 4.1 Digital image.
(e.g. if a feature or object occupies a relatively small section of the total greyscale, these point operations can manipulate the image so that it occupies the entire range); and thresholding, in which all pixels having greylevels in specified ranges in the input image are assigned a single specific greylevel in the output image. As we shall see, these operations are most effectively accomplished using the hardware input lookup tables (LUTs) which are provided by most framegrabbers.
4.1.1 Contrast stretching
Consider the contrived digital image shown in Figure 4.1. If we look at the distribution of the greylevels in this image, we find that they lie in the range 100 to 200. Obviously the complete range is not being used and the contrast of this image would be quite poor. The contrived greylevel histogram shown in Figure 4.2 illustrates graphically this poor use of the available greyscale. We wish to enhance the contrast so that all greylevels of the greyscale are utilized. This contrast stretching is achieved by first shifting all the values so that the actual pixel greylevel range begins at 0, i.e. we add to every pixel the difference between the final low value (0) and the initial low value (100): 0 - 100 = -100. The effect of this is shown in Figure 4.3. Next, we scale everything, reassigning values in the range 0-100 to the range 0-255, i.e. we scale by a factor of (255 - 0)/(100 - 0) = 2.55; see Figure 4.4. Thus, in this instance, to stretch the contrast so that all greylevels in the greyscale are utilized, one must simply apply the following operation:
Figure 4.3 Shifted greylevel histogram.
new_pixel_value := (old_pixel_value - 100) * 2.55
Figure 4.4 Stretched greylevel histogram.
By way of example, let us consider the application of this stretching to any pixel having either the lowest greylevel (100) in the original image or the highest greylevel (200) in the original image:
If old_pixel_value = 100 then new_pixel_value := (100 - 100) * 2.55 = 0
If old_pixel_value = 200 then new_pixel_value := (200 - 100) * 2.55 = 255
Figure 4.2 Greylevel histogram.
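In Python, the shift-and-scale operation just described might be sketched as follows. This is an illustrative fragment, not the book's code; it uses integer arithmetic throughout (in the spirit of the advice in Section 3.5 on avoiding floating point), and the clamping to 0-255 is an assumption:

```python
# Illustrative sketch of the contrast stretch described above: shift the
# greylevels down by LOW, then scale the range LOW..HIGH onto 0..255
# using integer arithmetic.

def stretch(image, low, high):
    return [[min(255, max(0, (p - low) * 255 // (high - low))) for p in row]
            for row in image]

image = [[100, 150],
         [175, 200]]
print(stretch(image, low=100, high=200))   # [[0, 127], [191, 255]]
```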
We can express the algorithm a little more formally in pseudocode as follows:

/* contrast stretching in a 512x512 pixel image */
/* HIGH and LOW are assumed to be the highest and lowest greylevels, */
/* respectively, in the unstretched image */
scale_factor := 255 / (HIGH - LOW)
FOR i := 1 TO 512 DO
    FOR j := 1 TO 512 DO
        IF image[i, j] < LOW
        THEN image[i, j] := 0
        ELSE IF image[i, j] > HIGH
        THEN image[i, j] := 255
        ELSE image[i, j] := (image[i, j] - LOW) * scale_factor

while the LUT formulation might be written as:

/* contrast stretching using LUT */
scale_factor := 255 / (HIGH - LOW)

/* initialise LUT */
FOR i := 0 TO LOW-1 DO
    LUT[i] := 0
FOR i := LOW TO HIGH DO
    LUT[i] := (i - LOW) * scale_factor
FOR i := HIGH+1 TO 255 DO
    LUT[i] := 255

/* stretch using LUT */
FOR i := 1 TO 512 DO
    FOR j := 1 TO 512 DO
        image[i, j] := LUT[ image[i, j] ]

As an example, Figure 4.5 shows the results of applying this contrast stretching algorithm to a real digital image. The associated greylevel histograms are displayed to the right of the image. There are many other approaches which can be used for contrast enhancement, e.g. histogram equalization is a technique which computes the histogram of the greylevel distribution in the image and reassigns greylevels to pixels in an effort to generate a histogram where there are equally many pixels at every greylevel, thereby producing an image with a flat or uniform greylevel histogram.

Figure 4.5 Contrast stretching. The greylevel histogram of the original image (top-left) is shown at the top-right; the stretched image with its histogram is shown below.

4.1.2 Thresholding

Some scenes, e.g. those of bare printed circuit boards (PCBs), can be very simple, from a visual point of view, in that they comprise just one or two types of objects: in this case the solder tracks and the circuit board. Suppose we wish to process an image of such a scene so that it exhibits extremely high contrast and so that the solder tracks and circuit board are very easily distinguishable. If we stipulate that there are just two allowable greylevels, black and white, then this would certainly result in the requisite contrast. The problem is to convert a greyscale image (typically with 256 greylevels) into an image with just two levels of grey. Consider again a contrived greyscale histogram. In Figure 4.6 we can see that all the image pixels representing the circuit board probably have greylevels in the range 0-160 while the solder tracks probably have greylevels in the range 160-255. If we assign all the pixels in the range 0-160 to be black and all pixels in the range 160-255 to be white we will give effect to this extreme contrast stretching. Happily, we also now have an explicit labelling in our image: the circuit board pixels are labelled (identified) with a greylevel of 0 and the solder is labelled with a greylevel of 255. In effect we have nominated the greylevel 160 as the threshold, and the reassignment of pixel values is known as thresholding. The effect of thresholding on the greyscale histogram can be seen in Figure 4.7; the pseudocode that follows summarizes the thresholding algorithm:
Figure 4.6 Greylevel histogram of a PCB.

/* Threshold an image in place (i.e. without an */
/* explicit destination_image) */
FOR i := 1 TO 512 DO
    FOR j := 1 TO 512 DO
        IF image[i, j] > threshold
        THEN image[i, j] := 255
        ELSE image[i, j] := 0
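A hypothetical Python rendering of this in-place thresholding, for illustration only (the sample image and threshold of 160 echo the PCB example above):

```python
# Illustrative sketch of in-place thresholding: pixels above the
# threshold become white (255), all others black (0).

def threshold_in_place(image, t):
    for i, row in enumerate(image):
        image[i] = [255 if p > t else 0 for p in row]

image = [[40, 200, 160],
         [161, 10, 255]]
threshold_in_place(image, 160)
print(image)   # [[0, 255, 0], [255, 0, 255]]
```

Note that, as in the pseudocode, a pixel exactly at the threshold is assigned to the black (background) class.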
Figure 4.7 Greylevel histogram of a PCB after thresholding.

The algorithm may also be formulated using a LUT:

/* Threshold an image in place (i.e. without an */
/* explicit destination_image) using a LUT */
/* initialise LUT */
FOR i := 0 TO threshold DO
    LUT[i] := 0
FOR i := threshold+1 TO 255 DO
    LUT[i] := 255

/* threshold using LUT */
FOR i := 1 TO 512 DO
    FOR j := 1 TO 512 DO
        image[i, j] := LUT[ image[i, j] ]

Note that the subject of thresholding is an important one to which we will return and will discuss in greater detail in Chapter 5, specifically with regard to the selection of the threshold point and to the incorporation of contextual information when applying the threshold. An example of digital image thresholding is shown in Figure 4.8.

Figure 4.8 Thresholding. A greyscale image (top-left) with its histogram (top-right) is thresholded at a greyscale value of 128 (bottom-left); the resultant histogram is shown in the bottom-right quadrant.

4.1.3 Noise suppression by image addition

If it is possible to obtain multiple images of a scene, each taken at a different time, and if the scene contains no moving objects, then the noise in the image can be reduced by averaging these images. The rationale is quite simple: in the averaging process the constant part of the image (that which is due to light reflected from stationary objects) is unchanged while the noise, which will in general change from image to image, will accumulate more slowly. The assumptions inherent in this approach are as follows:

1. The noise in each image is indeed uncorrelated.
2. The noise has a zero mean value, i.e. it averages out to an image with a greylevel of zero which contributes nothing to the real image.

With these assumptions, it is possible to show that averaging N images increases the signal-to-noise ratio by a factor of √N. Many commercial framestores incorporate facilities to accomplish this averaging in real time (i.e. as the image is acquired) and, as such, it is worth bearing this technique in mind as it involves very little computational overhead. However, you should also bear in mind the assumptions upon which this technique is based; not all noise has these desirable properties.

4.1.4 Background subtraction

Digital image subtraction refers to an operation in which the pixel values of two images are subtracted on a point-by-point basis. It can be useful for the subtraction of a known pattern (or image) of superimposed noise or, indeed, for motion detection: stationary objects cancel each other out while moving objects are highlighted when two images of the same dynamic scene, which have been taken at slightly different times, are subtracted. This process of subtraction of an uninteresting background image from a foreground image containing information of interest is referred to, not surprisingly, as 'background subtraction'.

Photometric decalibration is one of the most important applications of background subtraction. In some circumstances, camera systems exhibit a non-uniform response to light across the field of view. Quite often, this is caused by the lens and is manifested as an image centre which is somewhat brighter than the periphery. This can cause severe problems if one is using thresholding techniques to isolate objects. Since the greylevel which is assumed to correspond to the object may change (from point to point) depending on the position in the image, one solution (apart from replacing the lens) is to model this non-uniform response by, e.g., taking an image of a uniformly shaded surface, identifying the minimum greylevel of this image and subtracting this value from each pixel to generate an image which represents the effective response of the camera. Images which are subsequently acquired can then be processed by subtracting this calibration image from them.

4.2 Neighbourhood operations

A neighbourhood operation generates an 'output' pixel on the basis of the pixel at the corresponding position in the input image and on the basis of its neighbouring pixels. The size of the neighbourhood may vary: several techniques use 3 x 3 or 5 x 5 neighbourhoods centred at the input pixel, but many of the more advanced and useful techniques now use neighbourhoods which may be as large as 63 x 63 pixels. Neighbourhood operations are often referred to as 'filtering operations'. This is particularly true if they involve the convolution of an image with a filter kernel or mask. Such filtering often addresses the removal (or suppression) of noise or the enhancement of edges, and is most effectively accomplished using convolver (or filtering) hardware, available as sister boards for most framegrabbers. Other neighbourhood operations are concerned with modifying the image, not by filtering it in the strict sense, but by applying some logical test based on, e.g., the presence or absence of object pixels in the local neighbourhood surrounding the pixel in question. Object thinning, or skeletonizing, is a typical example of this type of operation, as are the related operations of erosion and dilation, which, respectively, seek to contract and enlarge an object in an orderly manner. Since convolution is such an important part of digital image processing, we will discuss it in detail before proceeding.
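Before moving on to neighbourhood operations proper, the two point operations of Sections 4.1.3 and 4.1.4, frame averaging and background subtraction, can be sketched in Python. This is an illustrative reconstruction, not the book's code; the pure-Python list representation and the clamping of negative differences to zero are assumptions:

```python
# Illustrative sketches of two whole-image point operations:
# temporal averaging of N frames (noise suppression) and
# background subtraction (e.g. photometric decalibration).

def average_frames(frames):
    """Pixel-by-pixel integer mean of N equally sized frames."""
    n = len(frames)
    return [[sum(f[i][j] for f in frames) // n
             for j in range(len(frames[0][0]))]
            for i in range(len(frames[0]))]

def subtract(image, background):
    """Point-by-point subtraction, clamped at zero."""
    return [[max(0, image[i][j] - background[i][j])
             for j in range(len(image[0]))]
            for i in range(len(image))]

frames = [[[100, 52]], [[104, 48]], [[96, 50]]]   # three noisy 1x2 frames
print(average_frames(frames))                     # [[100, 50]]
print(subtract([[120, 80]], [[20, 90]]))          # [[100, 0]]
```

In the averaging example the zero-mean noise cancels and the underlying values (100 and 50) are recovered, which is exactly the behaviour the √N argument predicts.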
4.2.1 Convolution

The convolution operation is much used in digital image processing and, though it can appear very inaccessible when presented in a formal manner for real continuous functions (signals and images), it is quite a simple operation when considered in a discrete domain. Before discussing discrete convolution, let us consider why one is interested in the operation in the first place. The two-dimensional convolution integral is given by the equation:

g(i, j) = f * h = ∫∫ f(i − m, j − n) h(m, n) dm dn

(where both integrals are taken from −∞ to +∞). This embodies the fact that the output g of a shift-invariant linear system (i.e. most optical, electronic, and opto-electronic systems, to a good approximation, and most filtering techniques) is given by the 'convolution' or application of the input signal f with a function h which is characteristic of the system. Thus, the form of g depends on the input f (obviously) and on the form of the system h through which it is being passed. The relationship is given by the convolution integral. The function h is normally referred to as the filter, since it dictates what elements of the input image are allowed to pass through to the output image. By choosing an appropriate filter, we can enhance certain aspects of the output and attenuate others. A particular filter h is often referred to as a 'filter kernel'. Attempts to form a conceptual model of this convolution operation in one's mind are often fruitless. However, it becomes a little easier if we transfer now to the discrete domain of digital images and replace the integral with the sigma (summation) operator. Now the convolution operation is given by:

g(i, j) = f * h = Σm Σn f(i − m, j − n) h(m, n)

The summation is taken only over the area where f and h overlap. This multiplication and summation is illustrated graphically in Figure 4.9. Here, a filter kernel h is a 3 x 3 pixel image, with the origin h(0, 0) at the centre, representing a mask of nine distinct weights: h(−1, −1) ... h(+1, +1); see Figure 4.10. The kernel or mask is superimposed on the input image, the input image pixel values are multiplied by the corresponding weight, the nine results are summed, and the final value of the summation is the value of the output image pixel at a position corresponding to the position of the centre element of the kernel. Note that the convolution formula requires that the mask h be first rotated by 180° since, e.g., f(i − 1, j − 1) must be multiplied by h(1, 1), f(i − 1, j) must be multiplied by h(1, 0), ..., and f(i + 1, j + 1) must be multiplied by h(−1, −1). Quite often, the rotation by 180° is omitted if the mask is symmetric. The algorithm to generate g is given by the following pseudocode:

FOR i := (low_limit_of_i + 1) TO (high_limit_of_i - 1) DO
    FOR j := (low_limit_of_j + 1) TO (high_limit_of_j - 1) DO
        /* for each feasible point in the image ... form the convolution */
        temp := 0
        FOR m := -1 TO +1 DO
            FOR n := -1 TO +1 DO
                temp := temp + F[i-m, j-n] * H[m, n]
        G[i, j] := temp

Figure 4.9 Convolution.
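A direct Python transcription of this pseudocode may help; it is an illustrative sketch (the border pixels, where the mask would overhang the image, are simply left at zero):

```python
# Illustrative 3x3 discrete convolution:
#     g(i,j) = sum_m sum_n f(i-m, j-n) h(m,n)
# computed for every point where the mask lies fully inside the image.
# Indexing f[i-m][j-n] implicitly applies the 180-degree rotation of h.

def convolve3x3(f, h):
    rows, cols = len(f), len(f[0])
    g = [[0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            temp = 0
            for m in (-1, 0, 1):
                for n in (-1, 0, 1):
                    temp += f[i - m][j - n] * h[m + 1][n + 1]
            g[i][j] = temp
    return g

# A symmetric mask (local sum), for which the rotation has no effect:
h = [[1, 1, 1],
     [1, 1, 1],
     [1, 1, 1]]
f = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
print(convolve3x3(f, h))   # centre value is 45, the sum of all nine pixels
```

With the symmetric local-sum mask shown, rotating h by 180° changes nothing, which is exactly the special case noted in the text.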
Before returning to the discussion of neighbourhood operators, a short note about the relationship of the convolution operator to classical twodimensional digital signal processing is appropriate. This can be safely skipped without any loss of continuity. All image systems introduce some amount of distortion into an image, for example image blur due to camera shake. Since optical and electrical systems are (to an approximation) linear, the imaging system may also be assumed to be linear. This implies that, in the language of linear system theory, it has some transfer function H(wx, Wy). This is the frequency domain equivalent of the system impulse response hex, y) in the spatial domain; hex, y) typifies how the system would
54
55
Fundamentals of digital image processing
Neighbourhood operations
h(1,1)
h(1,0)
h(1,+1)
Figure 4.11 shows this local neighbourhood average mask and Figure 4.12 illustrates the application of the mask to part of an image. Referring to Figure 4.12, we can see that the result of this filtering, i.e. the value of the output pixel which would be placed in the output image at the same position as the input pixel corresponding to the centre position of the mask, is: 10hl/9 + 100*1/9 + 103*1/9 + 110*1/9 + 140*1/9 + 120*1/9 + 134*1/9 + 134*1/9 + 135*1/9
h(O, 1)
h(O,O)
h(O,
+ 1)
h(+
1, 1)
h(+
1,0)
h(+1,+1l
1/9
1/9
1/9
1/9
Figure 4.10 3 x 32 convolution filter h.
1/9
1/9
respond if a single unit spike were input to it. Thus, the transfer function describes how some input image is transformed into some output image. Since the output of this system is a distortion of the correct image we would expect that if we could model the system, i.e. define its transfer function, compute the inverse, and apply it to the distorted image, we would arrive at a much better approximation to the original. This is accomplished by convolving the inverse impulse response with the image: this operation is referred to as a 'deconvolution'. Alternatively, the image may be transformed to the frequency domain using the digital Fourier transform, multiplied by the inverse of the transfer function, and transformed back to the spatial domain by use of the inverse Fourier transform.
1/9
Figure 4.11
1/9
1/9
Local average mask.
101 * 1/9 110
.*
100 * 1/9 140 * 1/9 134
~.
103 * 1/9 120 * 1/9 135 * 1/9 132
105
107
105
103
110
122
1/9 134 * 1/9 132
130
130
121
120
4.2.2 Noise suppression
If we define noise as any unwanted contamination of an image then the mechanism by which noise is removed depends on the assumption we make regarding the form of this unwanted contamination. One of the most common (and realistic) assumptions made about noise is that it has a high spatial frequency (refer back to Section 3.1.1). In this case, it is often adequate to apply a lowpass spatial fIlter which will attenuate the higher spatial frequencies and allow the low spatial frequency component to pass through to the resultant destination image. Of course if the image itself exhibits high spatial frequencies then it will be somewhat degraded after filtering. These lowpass filters can be implemented by convolving the image with some simple mask; the mask values constitute the weighting factors which will be applied to the corresponding image point when the convolution is being performed. For example, each of the mask values might be equally weighted, in which case the operation we are performing is simply the evaluation of the local mean of the image in the vicinity of the mask.
1/9 132
131
137
138
120
121
133
133
150
160
155
134
140
140
135
140
156
160
174
130
138
139
150
169
175
170
165
126
133
138
149
163
169
180
185
130
140
150
169
178
185
190
200
Figure 4.12
Image smoothing using local average mask.
56
57
Fundamentals of digital image processing
Neighbourhood operations
which is equivalent to: 1/9* [101 + 100 + 103 + 110 + 140 + 120 + 134 + 134 + 135] . . which is equal to 121. Thus the central point becomes 121 instead of 140 and the Image WIll appear much smo~ther. This averaging or smoothing is, of course, applied at all points of the image. Occasionally, it may be more useful to apply this smoothing subject to some condition, e.g. the centre pixel is only assigned if the difference between the average value and the original pixel value is greater than a previously set threshold. This goes some way towards removing noise without smoothing out too much of the detail in the original image. The algorithm may be expressed in pseudocode as follows:
/* For a 512 x 512 image, indexed from 0 to 511 */
FOR i := 1 TO 510 DO
    FOR j := 1 TO 510 DO
        /* for each feasible point in the image: compute the average */
        average := (source_image[i-1, j-1] + source_image[i-1, j] + source_image[i-1, j+1]
                  + source_image[i, j-1]   + source_image[i, j]   + source_image[i, j+1]
                  + source_image[i+1, j-1] + source_image[i+1, j] + source_image[i+1, j+1]) / 9
        /* destination image assumes the average value */
        destination_image[i, j] := average

There are other noise-suppression techniques, of course. For example, median filtering is a noise-reducing technique whereby a pixel is assigned the value of the median of pixel values in some local neighbourhood. The size of the neighbourhood is arbitrary, but neighbourhoods in excess of 3 x 3 or 5 x 5 may be impractical from a computational point of view since the evaluation of the median requires that the image pixel values be first sorted. In general, the median filter is superior to the mean filter in that image blurring is minimized. Unfortunately, it is computationally complex and is not easily effected in hardware and thus tends not to be used much in machine vision.

Gaussian smoothing, whereby the image is convolved with a Gaussian function, is perhaps one of the most commonly used smoothing techniques in advanced computer vision since it possesses several useful properties. Unfortunately, it is not yet widely used in industrial machine vision because the sizes of the masks are very large and the processing is consequently computationally intensive. It should be noted, however, that some vendors do offer dedicated image processing boards for Gaussian filtering and, if such hardware is available, this type of smoothing should be considered. The Gaussian function G(x, y) is defined by:

    G(x, y) = (1 / 2πσ²) exp[-(x² + y²) / 2σ²]

where σ defines the effective spread of the function: Gaussian functions with a small value for σ are narrow, whereas those with a large value for σ are broad. Figure 4.13 illustrates the shape of a Gaussian function with σ = 1.5, 3.0, and 6.0 pixels respectively. The mask weights are included for reference. Note that, since the Gaussian function is defined over an infinite support, i.e. it has non-zero (but very small) values at x, y = ±∞, we must decide at what point to truncate the function. Normally, one chooses the quantization resolution of the function by deciding on the integer number which will represent the maximum amplitude at the centre point (e.g. 1000), and then one chooses the mask size which includes all non-zero values. Typically, a mask size of 23 x 23 pixels would be required to represent a two-dimensional Gaussian function with a value of 3.0 pixels for σ. Fortunately, there is no need to use two-dimensional Gaussian functions since the convolution with a two-dimensional Gaussian can be effected by two convolutions with one-dimensional Gaussian functions. Specifically:

    G(x, y) * I(x, y) = G(x) * {G(y) * I(x, y)}
Thus, the image is first convolved with a 'vertical' Gaussian and then the resulting image is convolved with a 'horizontal' Gaussian. Why is the Gaussian such a popular smoothing function? The reason is quite straightforward. While the primary purpose of a smoothing function is to reduce noise, it is often desirable to be able to choose the resolution at which intensity changes are manifested in the image, i.e. to choose the level of detail which is retained in the image. For example, an image which has been smoothed just a little (small σ) will retain a significant amount of detail, while one which has been smoothed a great deal (large σ) will retain only the gross structure (you can accomplish the same thing yourself by squinting). In more formal terms, we wish to sharply delimit the spatial frequencies (see Section 3.1.1) which are present in the image, i.e. to localize the spatial frequency bandwidth of the image. On the other hand, we also need to ensure that the smoothing function does not distort the image excessively by smearing the features. To allow for this, we need to ensure that
Fundamentals of digital image processing
Neighbourhood operations
the function has a limited support in space. The Gaussian function optimizes the tradeoff between these two conflicting requirements.
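The separability property just described can be sketched in Python with NumPy. This is an illustration, not the book's implementation; truncating the sampled mask at three standard deviations is an assumption of this sketch, not the book's quantization rule.

```python
import numpy as np

def gaussian_kernel_1d(sigma, truncate=3.0):
    """Sampled one-dimensional Gaussian, truncated at +/- truncate*sigma
    and normalized so that the mask weights sum to one."""
    radius = int(truncate * sigma + 0.5)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x * x) / (2.0 * sigma * sigma))
    return k / k.sum()

def gaussian_smooth(image, sigma):
    """Smooth with two one-dimensional convolutions (rows, then columns)
    rather than a single, far more expensive two-dimensional convolution."""
    k = gaussian_kernel_1d(sigma)
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="same"), 1, image)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="same"), 0, rows)
```

For a mask of 2r + 1 weights, the separable form costs 2(2r + 1) multiplications per pixel instead of (2r + 1)², which is exactly why the factorization above matters in practice.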
Figure 4.13 The Gaussian function for three values of σ (1.5, 3.0, and 6.0) together with their corresponding discrete one-dimensional masks; note that the result of convolution with these masks should be normalized by dividing by the sum of the mask weights.

4.2.3 Thinning, erosion, and dilation

Thinning is an iterative neighbourhood operation which generates a skeletal representation of an object. It assumes, of course, that you know exactly what constitutes the object in the image and what constitutes the background (i.e. everything which is not part of the object). Such an image is said to have been segmented into its component parts: the topic of segmentation is extremely important and will be discussed in detail in the next chapter.

The skeleton of an object may be thought of as a generalized axis of symmetry of the object and hence it is a suitable representation for objects which display obvious axial symmetry. The medial axis transform (MAT) proposed by Blum is one of the earliest and most widely studied techniques for generating the skeleton. More recently, Brady has introduced the related (but extended) concept of smoothed local symmetries (SLS) which we will discuss in the last section of Chapter 7. The skeleton is frequently used as a shape descriptor which exhibits three topological properties: connectedness (one object generates one skeleton); invariance to scaling and rotation; and information preservation in the sense that the object can be reconstructed from the medial axis.

The concept of thinning a binary image (an image comprising just two grey-levels: black and white) of an object is related to such medial axis transformations in that it generates a representation of an approximate axis of symmetry of a shape by successive deletion of pixels from the boundary of the object. In general, this thinned representation is not formally related to the original object shape and it is not possible to reconstruct the original boundary from the object.

Thinning can be viewed as a logical neighbourhood operation where object pixels are removed from an image. Obviously, the removal must be constrained somewhat so that we have a set of conditions for pixel removal. The first restriction is that the pixel must lie on the border of the object. This implies that it has at least one 4-connected neighbouring pixel which is a background pixel. The removal of pixels from all borders simultaneously would cause difficulties: for example, an object two pixels thick will vanish if all border pixels are removed simultaneously. A solution to this is to remove pixels of one border orientation only on each pass of the image by the thinning operator. Opposite border orientations are used alternately to ensure that the resultant skeleton is as close to the medial axis as possible.

The second restriction is that the deletion of a pixel should not destroy the object's connectedness, i.e. the number of skeletons after thinning should be the same as the number of objects in the image before thinning. This problem depends on the manner in which each pixel in the object is connected to every other pixel. A pixel can be considered to be connected to, and a component of, an object if it has a grey-level of 255 and at least one adjacent object pixel (note that we are assuming object pixels have a grey-level of 255, white, and background pixels have a grey-level of 0, black). Consider now the 5-pixel object shown in Figure 4.14.

Figure 4.14 A critically connected object.

The pixel C 'connects' the two object segments AB and ED; that is, if C were removed then this would break the object in two; this pixel is 'critically connected'. Obviously, this property may occur in many more cases than this, and critical connectivity may be characterized as follows. Given a pixel, labelled 8, and its eight adjacent neighbours, labelled 0-7 (see Figure 3.6), assume that writing the pixel number (e.g. 7) indicates presence, i.e. it is an object pixel, whereas writing it with an overbar (e.g. 7̄) indicates absence, i.e. it is a background pixel. Assume, also, normal Boolean logic sign conventions (+ indicates logical OR, and . indicates logical AND). Then pixel 8 is critically connected if the following expression is true:

    8.{[(1+2+3).(5+6+7).4̄.0̄] + [(1+0+7).(3+4+5).2̄.6̄]
     + [3.(5+6+7+0+1).2̄.4̄] + [1.(3+4+5+6+7).2̄.0̄]
     + [7.(1+2+3+4+5).0̄.6̄] + [5.(7+0+1+2+3).4̄.6̄]}

Figure 4.15 depicts these six neighbourhood conditions which correspond to the presence of critical connectivity. Hence, the second restriction implies that if a pixel is critically connected then it should not be deleted.

Figure 4.15 Neighbourhoods exhibiting critical connectivity.

A thinning algorithm should also preserve an object's length. To facilitate this, a third restriction must be imposed such that arc-ends, i.e. object pixels which are adjacent to just one other pixel, must not be deleted. Note that a thinned image should be invariant under the thinning operator, i.e. the application of the thinning algorithm to a fully thinned image should produce no changes. This is also important as it provides us with a condition for stopping the thinning algorithm. Since the pixels of a fully thinned image are either critically connected or are arc-ends, imposing the second and third restrictions allows this property to be fulfilled. The final thinning algorithm, then, is to scan the image in a raster fashion, removing all object pixels according to these three restrictions, varying the border orientation from pass to pass. The image is thinned until four successive passes (corresponding to the four border orientations) produce no changes to the image, at which stage thinning ceases. Figure 4.16 illustrates the result of thinning an object in a binary image.

Figure 4.16 A grey-scale image (top-left) is thresholded to produce a binary image (top-right) which is then thinned (bottom-right). The image at the bottom-left is an intermediate partially thinned version.

The concepts of erosion and dilation are related to thinning in the sense that erosion can be considered as a single pass of a thinning operator (stripping object pixels of all four border orientations). However, there is one significant difference in that, for most applications, one does not mind if the object breaks in two and, hence, the complex check for critical connectivity is no longer required. In fact, it is this property of object splitting that makes the erosion operation particularly useful.

The dilation operation effects the reverse of erosion, i.e. an expansion of the object into all those background pixel cells which border the object. Thus, erosion 'shrinks' an object while dilation 'enlarges' it. This interpretation of erosion and dilation, although common, is quite a loose one. The operations of erosion and dilation do have a much more formal meaning in the context of a branch of image processing referred to as mathematical morphology. We will return to mathematical morphology in Section 4.4 but, for the present, we will continue with the informal treatment.

To see the usefulness of erosion and dilation operations, consider the following common printed circuit board inspection problem. Typically, in PCB manufacturing, a film negative depicting the appropriate pattern of electrical contacts (pads) and conductors (tracks) is contact-printed on a copper-clad board which has been covered with a photoresistive solution. The board is then etched, leaving just the copper circuit patterns. Such a process can lead to many problems with the final circuit pattern, e.g. the track may be broken, it may be too wide or too thin, or there may be spurious copper remaining on the PCB. These potential faults (shorts and breaks) are impossible for conventional functional (electrical) testing to detect and it is for this reason that these visual techniques are so useful.

Figure 4.17 PCB track with extraneous copper.
Figure 4.18 Dilated PCB track.

Figure 4.19 PCB track with a neck.
Figure 4.20 Eroded PCB track.
If the tracks (and/or pads) are too large or have extraneous copper attached (see Figure 4.17), then dilating the image a number of times will cause the two tracks to merge (see Figure 4.18) and a subsequent analysis of the track connectivity will identify this potential fault. Conversely, a track which is too thin or has a neck (see Figure 4.19) will break when eroded (Figure 4.20). Similar connectivity analysis will identify this potential circuit break. A pseudocoded algorithm for erosion can be formulated as follows:
Figure 4.21 A grey-scale image (top-left) is thresholded to produce a binary image (top-right) which is then eroded twice (bottom-left). The application of two passes of the dilation algorithm is shown at bottom-right.
FOR all pixels in the image
    IF the pixel is an object pixel AND all its neighbours are object pixels
        copy it to the destination image
while a dilation algorithm can be formulated as:

FOR all pixels in the image
    IF the pixel is an object pixel
        make it and its eight neighbours object pixels in the destination image
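The two pseudocode algorithms translate directly into Python with NumPy. This is a minimal sketch, not the book's implementation; it assumes a binary image stored as a 0/1 `uint8` array and the eight-neighbour definition of adjacency used above.

```python
import numpy as np

def erode(img):
    """A pixel survives only if it and all eight of its neighbours are object pixels."""
    p = np.pad(img, 1)  # surround the image with a background border
    out = np.ones_like(img)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            # AND in the neighbour shifted by (di, dj)
            out &= p[1 + di : 1 + di + img.shape[0], 1 + dj : 1 + dj + img.shape[1]]
    return out

def dilate(img):
    """Every object pixel switches on itself and its eight neighbours."""
    p = np.pad(img, 1)
    out = np.zeros_like(img)
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            # OR in the neighbour shifted by (di, dj)
            out |= p[1 + di : 1 + di + img.shape[0], 1 + dj : 1 + dj + img.shape[1]]
    return out
```

Repeated application of `erode` will split a thin track at a neck, and repeated `dilate` will merge tracks that are too close, which is exactly the connectivity-based PCB check described in the text.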
Figure 4.21 illustrates the effect of several applications of these erosion and dilation algorithms to an object in a binary image.

4.3 Geometric operations

Geometric operations change the spatial relationships between objects in an image, i.e. the relative distances between points a, b and c will typically be different after a geometric operation or 'warping'. The applications of such warping include geometric decalibration, i.e. the correction of geometric distortion introduced by the imaging system (most people are familiar with the barrel distortion that arises in photography when using a very short focal length 'fisheye' lens), and image registration, i.e. the intentional distortion of one image with respect to another so that the objects in each image superimpose on one another. The techniques used in both these applications are identical and will be discussed in detail before describing actual usage.

4.3.1 Spatial warping

The approach to geometric image manipulation described here is called spatial warping and involves the computation of a mathematical model for the required
distortion, its application to the image, and the creation of a new corrected (decalibrated or registered) image. The distortion may be specified by locating control points (also called fiducial points) in the input image (the image to be warped) and identifying their corresponding control points in an ideal (undistorted or registered) image. The distortion model is then computed in terms of the transformation between these control points generating a spatial warping function which will allow one to build the output image pixel by pixel, by identifying the corresponding point in the input image. Since, in general, the estimates of the coordinates of input pixels yielded by the warping function will not correspond to exact (integer) pixel locations, we also require some method of estimating the greylevel of the output pixel when the 'corresponding' pixel falls 'between the integer coordinates' (see Figure 4.22). The question is: how do the four pixels surrounding the computed point contribute to our estimate of its greylevel, i.e. how do we interpolate between the four? We will return to this question later; suffice it at present to summarize the two requirements of geometric operations:
(a) a spatial transformation which allows one to derive the position of a pixel in the input which corresponds to the pixel being 'filled' or generated in the output;
(b) an interpolation scheme to estimate the grey-level of this input pixel.
Note that the grey-level interpolation algorithm may be permanently established in the software whereas the spatial transformation will change as the environment changes (e.g. camera, lens, registration requirements).
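The output-driven structure implied by these two requirements (for each output pixel, find the corresponding input position, then interpolate) can be sketched as follows. This is an illustration only; the warping function passed in is a stand-in assumption, and zero-order (nearest-neighbour) interpolation is used for brevity.

```python
import numpy as np

def warp_image(src, warp_fn):
    """Build the output image pixel by pixel: for each output position,
    ask the warping function for the corresponding (possibly fractional)
    input position and sample it with nearest-neighbour interpolation."""
    h, w = src.shape
    dst = np.zeros_like(src)
    for r in range(h):
        for c in range(w):
            i, j = warp_fn(r, c)                  # position in the input image
            i, j = int(round(i)), int(round(j))   # zero-order interpolation
            if 0 <= i < h and 0 <= j < w:         # outside the input: leave background
                dst[r, c] = src[i, j]
    return dst
```

Note that the loop runs over the *output* image, which guarantees that every output pixel receives exactly one value; looping over the input instead would leave holes wherever the forward mapping spreads pixels apart.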
4.3.1.1 The spatial transformation
The spatial transformation is expressed in general form as a mapping from a point (x, y) in the output image to its corresponding (warped) position (i, j) in the input image:
    (i, j) = (Wx(x, y), Wy(x, y))
Figure 4.22 Spatial transformation and grey-level interpolation.

That is, the first coordinate, i, of the warped point is a function of the current position in the output; likewise for the second coordinate, j. Thus, given any point (x, y) in the output image, the coordinates of the corresponding point in the input image may be generated using the warping functions Wx and Wy respectively. It would, of course, be ideal if we had some analytic expression for Wx and Wy but this is rarely the case. Instead, we normally model each spatial warping function by a polynomial function. So, we assume that, for example, the warping functions are given by the following equations:

    Wx(x, y) = Σ (p = 0 to n) Σ (q = 0 to n) apq x^p y^q
    Wy(x, y) = Σ (p = 0 to n) Σ (q = 0 to n) bpq x^p y^q

For example, if n = 2 (which is adequate to correct for most distortions):

    Wx(x, y) = a00 x^0 y^0 + a10 x^1 y^0 + a20 x^2 y^0 + a01 x^0 y^1 + a11 x^1 y^1
             + a21 x^2 y^1 + a02 x^0 y^2 + a12 x^1 y^2 + a22 x^2 y^2
    Wy(x, y) = b00 x^0 y^0 + b10 x^1 y^0 + b20 x^2 y^0 + b01 x^0 y^1 + b11 x^1 y^1
             + b21 x^2 y^1 + b02 x^0 y^2 + b12 x^1 y^2 + b22 x^2 y^2
Now, the only thing that remains to complete the specification of the spatial warping function is to determine the values of these coefficients, i.e. to compute a00 to a22 and b00 to b22. To do this, we have to assume that we know the transformation exactly for a number of points (at least as many as the number of coefficients; nine in this case), that is, to assume that we know the values of x and y and their corresponding i
and j values. We then write the relationships explicitly in the form of the two equations above. We then solve these equations simultaneously to determine the value of the coefficients. Remember that we have two sets of simultaneous equations to set up: one for the 'a' coefficients and one for the 'b' coefficients. However, the same values relating x, y to i, j can be used in each case. This is now where the control points come in as we are going to use these to provide us with the (known) relationships between (x, y) and (i, j). If we have nine unknown coefficients, as in the example above, then in order to obtain a solution we require at least nine such observations, {(x1, y1), (i1, j1)} ... {(x9, y9), (i9, j9)}, say. Such a system is said to be exactly determined. However, the solution of these exact systems is often ill-conditioned (numerically unstable) and it is usually good practice to overdetermine the system by specifying more control points than you need (and hence generate more simultaneous equations). These equations can then be solved, yielding the coefficients and hence the warping functions, using standard personal computer based maths packages such as MATLAB.

For the sake of completeness, we include here details of how to solve such an overdetermined system. The reader can safely skip this section if (s)he so wishes. The first point to note is that an overdetermined system (where the number of equations is greater than the number of unknown values) does not have an exact solution and there are going to be some errors for some points. The idea, then, is to minimize these errors. We will use the common approach of minimizing the sum of the square of each error (i.e. to generate the so-called least-square-error solution). Consider, again, a single control point and assume we are attempting to compute the apq coefficients:

    i1 = a00 x1^0 y1^0 + a10 x1^1 y1^0 + a20 x1^2 y1^0 + a01 x1^0 y1^1 + a11 x1^1 y1^1
       + a21 x1^2 y1^1 + a02 x1^0 y1^2 + a12 x1^1 y1^2 + a22 x1^2 y1^2

If we use m control points in total we will have m such equations which (noting that x^0 and y^0 are both equal to 1) we may write in matrix form. We have to include the errors since there will not be a set of a00 ... a22 which will simultaneously provide us with exactly i1 ... im in the overdetermined case. Let us abbreviate this matrix equation to:

    i = Xa + e

Similarly:

    j = Xb + e

We require a solution for a, so we might think of multiplying across by X⁻¹ to obtain an expression. Unfortunately, X is non-square (the number of equations is greater than the number of coefficients) and one cannot invert a non-square matrix. However, we can indulge in a little algebra and calculus to derive an expression for a in terms of X and i:

    i = Xa + e
    e = i - Xa

We form the sum of the square of each error by computing eᵀe:

    eᵀe = (i - Xa)ᵀ(i - Xa)

Differentiating eᵀe with respect to a, to find out how the errors change as the coefficients change:

    d(eᵀe)/d(a) = (0 - X)ᵀ(i - Xa) + (i - Xa)ᵀ(0 - X)
                = (-X)ᵀ(i - Xa) + (iᵀ - (Xa)ᵀ)(-X)
                = -Xᵀ(i - Xa) + (iᵀ - aᵀXᵀ)(-X)
                = -Xᵀi + XᵀXa - iᵀX + aᵀXᵀX

But noting that iᵀX and aᵀXᵀX are the transposes of Xᵀi and XᵀXa respectively, we may transpose these two subexpressions to collect all terms as column vectors:

                = -Xᵀi + XᵀXa - Xᵀi + XᵀXa
                = 2(XᵀXa - Xᵀi)

The sum of the square of each error is minimized when d(eᵀe)/d(a) is equal to zero, thus:

    0 = 2(XᵀXa - Xᵀi)
    XᵀXa = Xᵀi
    a = (XᵀX)⁻¹Xᵀi

(XᵀX)⁻¹Xᵀ is commonly referred to as the 'pseudo-inverse' of X and is written X†. When this has been computed, the coefficient matrix a may be computed by simply multiplying X† by i; b is obtained in a like manner.

4.3.1.2 Grey-level interpolation

Once the spatial mapping function has been found, the output image can be built, pixel by pixel and line by line. The coordinates given by the warping function, denoting the corresponding points in the input image, will not in general be integer values and the grey-level must be interpolated from the grey-levels of the surrounding pixels.
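The coefficient estimation just derived can be sketched in Python with NumPy. This is an illustration, not the book's code; `np.linalg.lstsq` computes the same least-square-error solution as the pseudo-inverse, in a numerically better-behaved way than forming (XᵀX)⁻¹Xᵀ explicitly. The n = 2 polynomial model and the function names are assumptions of this sketch.

```python
import numpy as np

def design_matrix(xy):
    """One row per control point; columns are the terms x^p * y^q, p, q in 0..2."""
    x, y = xy[:, 0], xy[:, 1]
    return np.stack([x**p * y**q for p in range(3) for q in range(3)], axis=1)

def fit_warp(output_pts, input_pts):
    """Least-squares fit of the 'a' and 'b' coefficient vectors from control points
    given as (x, y) in the output image and (i, j) in the input image."""
    X = design_matrix(output_pts)
    a, *_ = np.linalg.lstsq(X, input_pts[:, 0], rcond=None)
    b, *_ = np.linalg.lstsq(X, input_pts[:, 1], rcond=None)
    return a, b

def warp(a, b, xy):
    """Map output-image points (x, y) to input-image points (i, j)."""
    X = design_matrix(np.atleast_2d(np.asarray(xy, dtype=float)))
    return X @ a, X @ b
```

As the text recommends, supplying more than nine control points overdetermines the system; `lstsq` then returns the coefficients that minimize the sum of squared errors over all of them.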
The simplest interpolation function is nearest-neighbour interpolation (zero-order interpolation) whereby the grey-level of the output pixel (which is what we are trying to estimate) is given by the grey-level of the input pixel which is nearest to the calculated point in the input image (see Figure 4.23). The computation involved in this interpolation function is quite trivial but the function generally yields quite adequate results. If the image exhibits very fine detail in which adjacent pixel grey-levels vary significantly, i.e. the image exhibits high spatial frequencies, some of this detail may be lost. In such cases, bilinear (i.e. first-order) interpolation should be considered since the estimate is made on the basis of four neighbouring input pixels. Consider the case shown in Figure 4.24 where we need to estimate the grey-level of a point somewhere between image pixels (i, j), (i, j+1), (i+1, j), and (i+1, j+1). Let the position of this point relative to pixel (i, j) be given by coordinates (p, q); 0 <= p, q <= 1. The grey-level at point (p, q) is constrained by the grey-level at the four neighbouring pixels and is a function of its position between these neighbours. To estimate the grey-level, f(p, q), we fit a surface through the four neighbours' grey-levels, the equation of which will in general identify the grey-level at any point between the neighbours. The surface we fit is a hyperbolic paraboloid and is defined by the bilinear equation:

Figure 4.23 Nearest-neighbour interpolation.

Figure 4.24 Bilinear interpolation.

    f(p, q) = ap + bq + cpq + d

There are four coefficients, a, b, c, and d, which we must determine to identify this function for any given 2 x 2 neighbourhood in which we wish to interpolate. Thus we require four simultaneous equations in a, b, c, and d; these are supplied from our knowledge of the grey-level at the four neighbours given by relative coordinates (0, 0), (0, 1), (1, 0), (1, 1). Specifically, we know that:

    a x 0 + b x 0 + c x 0 x 0 + d = f(i, j)            (4.1)
    a x 0 + b x 1 + c x 0 x 1 + d = f(i, j + 1)        (4.2)
    a x 1 + b x 0 + c x 1 x 0 + d = f(i + 1, j)        (4.3)
    a x 1 + b x 1 + c x 1 x 1 + d = f(i + 1, j + 1)    (4.4)

Directly from (4.1), we have:

    d = f(i, j)                                        (4.5)

Rearranging (4.2) and substituting for d, we have:

    b = f(i, j + 1) - f(i, j)                          (4.6)

Rearranging (4.3) and substituting for d, we have:

    a = f(i + 1, j) - f(i, j)                          (4.7)

Rearranging (4.4) and substituting for a, b, and d, we have:

    c = f(i + 1, j + 1) + f(i, j) - f(i + 1, j) - f(i, j + 1)    (4.8)
Equations (4.5)-(4.8) allow us to compute the coefficients a, b, c, and d, which define the bilinear interpolation for a given 2 x 2 neighbourhood with known pixel grey-levels. For example, if the coordinates of the point at which we wish to estimate the grey-level are (60.4, 128.1) and the grey-levels at pixels (60, 128), (60, 129), (61, 128),
and (61, 129) are 10, 12, 14, and 15, respectively, then the greylevel at this point, in relative coordinates, is given by:
    f(0.4, 0.1) = (14 - 10) x 0.4 + (12 - 10) x 0.1 + (15 + 10 - 14 - 12) x 0.4 x 0.1 + 10
                = 11.76
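Equations (4.5)-(4.8) translate directly into a short function; this is a sketch in Python, and the function and argument names are illustrative rather than the book's.

```python
def bilinear(f00, f01, f10, f11, p, q):
    """Bilinear interpolation at offset (p, q) from the top-left neighbour,
    with 0 <= p, q <= 1.  f00 = f(i, j), f01 = f(i, j+1),
    f10 = f(i+1, j), f11 = f(i+1, j+1)."""
    d = f00                        # equation (4.5)
    b = f01 - f00                  # equation (4.6)
    a = f10 - f00                  # equation (4.7)
    c = f11 + f00 - f10 - f01      # equation (4.8)
    return a * p + b * q + c * p * q + d
```

With the worked example above, `bilinear(10, 12, 14, 15, 0.4, 0.1)` reproduces the value 11.76; at the four corners (p, q in {0, 1}) the surface passes exactly through the known neighbour grey-levels, as a fitted hyperbolic paraboloid must.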
4.3.2 Registration and geometric decalibration

This technique for spatial warping can be used directly to effect either registration of two images or to correct the geometric distortion which may have been introduced by the imaging system. In the former case, one merely needs to identify several corresponding points in the two images and use these as the control points when generating the polynomial coefficients. In the latter case, one might use the imaging system to generate an image of a test card (e.g. a square grid) and superimpose an undistorted copy of this pattern on the distorted image. The corresponding control points can then be explicitly identified in each version of the pattern. For example, in the case of the grid pattern, the control points might be the points of intersection of the grid lines.

4.4 Mathematical morphology

4.4.1 Basic set theory

Mathematical morphology is a methodology for image processing and image analysis which is based on set theory and topology. As such, it is a formal and rigorous mathematical technique and we need to establish a basic 'language' before proceeding. For the most part, this is the language of set theory. In the following, points and vectors will be denoted by latin lower-case letters: x, y, z; sets will be denoted by latin upper-case letters: X, Y, Z; and the symbol φ denotes the empty set. The more common set operations we will encounter in this brief treatment of morphology include the following:

Set inclusion. This is written: Y ⊂ X, i.e. Y is a subset of (is included in) the set X. This is defined as y ∈ Y ⇒ y ∈ X: if y is an element of Y, then y is also an element of X.

Complement. For any set X, the complement of X is written Xᶜ. This is the set of all elements which are not elements of X.

Union. The union of two sets X and Y, written X ∪ Y, is defined:

    X ∪ Y = {x | x ∈ X or x ∈ Y}

This should be read: the set X union Y is the set of all x such that x is an element of X or x is an element of Y.

Intersection. The intersection of two sets X and Y, written X ∩ Y, is defined:

    X ∩ Y = (Xᶜ ∪ Yᶜ)ᶜ

In effect, the intersection of two sets is the complement of the union of their respective complements. Thus, the intersection of X and Y is the complement of the set of elements which are not in X or not in Y, i.e. the set of elements which is common to both sets X and Y.

Let us now consider two further concepts. The first is translation. The translation of a set X by h is denoted Xh. This is a set where each element (point) is translated by a vector h. Second, we need to introduce a general symbolism for set transformation. A transformation of a set X is denoted by Ψ(X). Ψ( ) is the set transformation and in an expression such as Y = Ψ(X), Y is the transformed set. Finally, we require the concept of duality of set transformations. Given some transformation Ψ, we define the dual of the transformation Ψ*:

    Ψ*(X) = (Ψ(Xᶜ))ᶜ

For example, intersection is the dual of union since X ∩ Y = (Xᶜ ∪ Yᶜ)ᶜ.

4.4.2 Structuring elements and hit or miss transformations

We are now in a position to continue with the discussion of mathematical morphology. Let us begin with the concept of a structuring element. A structuring element Bx, centred at x, is a set of points which is used to 'extract' structure in a set, X, say. For example, the structuring element might be a square or a disk, as shown in Figure 4.25, or any other appropriate shape. Now we define a hit or miss transformation as the 'point by point' transformation of a set X, working as follows. We choose, and fix, a structuring element B. Define B¹x to be that subset of Bx (recall Bx is the translate of B to a position x) whose elements belong to the 'foreground' and B²x to be the subset of Bx whose elements belong to the 'background' (i.e. B¹ ∩ B² = φ).

Figure 4.25 Square and circular structuring elements.
A point x belongs to the hit or miss transform, denoted X ⊗ B, if and only if B¹x is included in X and B²x is included in Xᶜ, the complement of X:

    X ⊗ B = {x | B¹x ⊂ X; B²x ⊂ Xᶜ}

Thus, X ⊗ B defines the points where the structuring element B exactly matches (hits) the set X, i.e. the image. For example, let B be the structuring element shown in Figure 4.26(a) and let X be the set shown in Figure 4.26(b). Then X ⊗ B is the set shown in Figure 4.26(c).

Figure 4.26 (a) Structuring element B; (b) image set X; (c) B 'hit or miss' X: X ⊗ B.

4.4.3 Erosion and dilation

In Section 4.2.3, we informally introduced the concepts of erosion and dilation. We will now define them formally. Let B̌ be the transposed set of B, i.e. the symmetrical set of B with respect to its origin. Then, the erosion operation is denoted ⊖ and the erosion of a set X with B is defined:

    X ⊖ B̌ = {x | Bx ⊂ X}

Note that this is equivalent to a hit or miss transformation of X with B, where B² = φ, i.e. where there are no background points. Note also that X ⊖ B is not an erosion; it is the Minkowski subtraction of B from X. Intuitively, the erosion of a set by a structuring element amounts to the generation of a new (transformed/eroded) set where each element in the transformed set is a point where the structuring element is included in the original set X. For example, let B and X be the structuring element and set shown in Figure 4.27(a) and (b), respectively; then X ⊖ B̌ is depicted in Figure 4.27(c). In a more complex case, if B is a circular structuring element, Figure 4.28 schematically illustrates the effect of erosion.

Figure 4.27 (a) Structuring element B; (b) image set X; (c) the erosion of X with B: X ⊖ B̌.
x
The opening is the domain swept out by all the translates of B which are included in X. This effec:s a smoothing of the contours of X, cuts narrow isthmuses, suppresses small Islands and sharp edges in X.
4.4.5 Thinning and the extraction of endpoints
Figure 4.28 Erosion of X by a circular structuring element B.

Dilation is a closely related operation. In fact, dilation is the dual of erosion. The dilation operation is denoted by the symbol ⊕. Thus:

X ⊕ B = (Xᶜ ⊖ B)ᶜ

That is, the dilation of X is the erosion of Xᶜ. This amounts to saying that the erosion of a set (e.g. an object) with a given structuring element is equivalent to the dilation of its complement (i.e. its background) with the same structuring element, and vice versa.

4.4.4 Opening and closing

After having eroded X by B, it is not possible in general to recover the initial set by dilating the eroded set X ⊖ B by the same B. This dilation reconstitutes only a part of X, which is simpler and has fewer details, but may be considered as that part which is most essential to the structure of X. This erosion followed by dilation filters out (i.e. generates) a new subset of X which is extremely rich in morphological and size distribution properties. This transformation is called an opening. The opening of a set X with a structuring element B, denoted X_B, is defined:

X_B = (X ⊖ B) ⊕ B

The closing of X with respect to B, denoted X^B, is defined:

X^B = (X ⊕ B) ⊖ B

Opening is the dual of closing, i.e.:

(Xᶜ)_B = (X^B)ᶜ

and

(Xᶜ)^B = (X_B)ᶜ

4.4.5 Thinning

As we have seen in Section 4.2.3, an approximation to the skeleton can be achieved by an iterative transformation known as thinning. From the perspective of mathematical morphology, the thinning of a set X by a sequence of structuring elements {L} is denoted X ○ {L}. The thinning of X by a single structuring element L, X ○ L, is defined:

X ○ L = X / X ⊛ L

That is, the set X less the set of points in X which hit L. Thus, if X ⊛ L identifies border points, and L is appropriately structured to maintain connectivity of a set, then repeated application of the thinning process successively removes border points from a set until the skeleton is achieved. At this point, further application of the thinning transform yields no change in the skeletal set. This, of course, should be reminiscent of the thinning algorithm discussed in Section 4.2.3. Recalling the definition of a hit and miss transformation:

X ⊛ L = {x | ¹Lₓ ⊂ X; ²Lₓ ⊂ Xᶜ}

where ¹Lₓ ∩ ²Lₓ = ∅.

We can now proceed to develop a thinning algorithm by defining L. The sequence {L} which is used for thinning is based on a single structuring element and is generated by rotating the structuring element (through 360° in increments of 45° for a square lattice). This sequence {L} = {L₁, L₂, ..., L₈} is shown in Figure 4.29. The thinning algorithm then amounts to the repeated transformation of a set Xᵢ → Xᵢ₊₁ defined:

Xᵢ₊₁ = ((... (Xᵢ ○ L₁) ○ L₂) ○ L₃) ... ○ L₈

Figure 4.29 Sequence of structuring elements used in the thinning operation.
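These set operations can be sketched directly on binary arrays. The following is a minimal illustration, not the book's code: `erode`, `dilate` and `opening` are hypothetical helper names, structuring elements are given as lists of (row, column) offsets, and `np.roll` is used for translation (so the image is treated as wrapping around at its borders).

```python
import numpy as np

def erode(X, B):
    """Erosion X ⊖ B: keep points where B, translated there, fits inside X.
    X is a 2-D boolean array; B is a list of (dy, dx) offsets."""
    out = np.ones_like(X)
    for dy, dx in B:
        out &= np.roll(np.roll(X, -dy, axis=0), -dx, axis=1)
    return out

def dilate(X, B):
    """Dilation X ⊕ B: union of X translated by each point of B."""
    out = np.zeros_like(X)
    for dy, dx in B:
        out |= np.roll(np.roll(X, dy, axis=0), dx, axis=1)
    return out

def opening(X, B):
    """Opening X_B = (X ⊖ B) ⊕ B."""
    return dilate(erode(X, B), B)
```

For a symmetric structuring element, this sketch reproduces the duality stated above: `dilate(X, B)` equals the complement of `erode(~X, B)`.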
Fundamentals of digital image processing
Figure 4.30 Structuring elements E₁, ..., E₈ used to identify endpoints.
The skeleton is achieved when Xᵢ = Xᵢ₊₁. Initially X₀ = X, i.e. the original (unthinned) image. Given a skeleton set X, we can identify the endpoints, i.e. points which are connected to just one other point, using the hit or miss transform and an appropriate set of structuring elements {E}, shown in Figure 4.30. Thus, the endpoints of the skeleton are given by:

Y = ⋃ᵢ₌₁⁸ X ⊛ Eᵢ

That is, the union of all those points which hit with one of these endpoint structuring elements.
4.4.6 Application: identification of endpoints of electrical wires

In Chapter 8, we will be considering a complete vision and robotics case study: automated crimping of electrical wires. Part of this application is concerned with the identification of the positions of the ends of these electrical wires lying on a flat surface. In Chapter 8, we will be employing conventional techniques, but for the sake of illustration we will briefly consider a morphological approach here. Assuming that we are dealing with a greyscale image of long thin wires (see Figures 8.18 and 8.19), then there are just three simple steps in achieving our objective:

Step 1. Threshold the greylevel image to obtain a binary image. Call this set X1.

Step 2. Generate the skeleton X2:

X2 = X1 ○ {L}

Step 3. Identify endpoints of skeleton X3:

X3 = ⋃ᵢ₌₁⁸ X2 ⊛ Eᵢ

X3 is the set of all endpoints of these electrical wires.
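Steps 1 and 3 of this pipeline can be sketched as follows. This is a minimal illustration, not the book's implementation: the function names are my own, the thinning of Step 2 (the sequence {L}) is omitted and an already-thinned skeleton is assumed, and `np.roll` wraps at the image borders.

```python
import numpy as np

def hit_or_miss(X, fg, bg):
    """X ⊛ E: keep points where every fg offset lands in X and
    every bg offset lands in the complement of X."""
    out = np.ones_like(X)
    for dy, dx in fg:
        out &= np.roll(np.roll(X, -dy, axis=0), -dx, axis=1)
    for dy, dx in bg:
        out &= ~np.roll(np.roll(X, -dy, axis=0), -dx, axis=1)
    return out

def endpoints(X2):
    """Step 3: union of X2 ⊛ E_i over the eight endpoint elements of
    Figure 4.30 (centre point set, exactly one neighbour set)."""
    nbrs = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
            (0, 1), (1, -1), (1, 0), (1, 1)]
    X3 = np.zeros_like(X2)
    for keep in nbrs:
        fg = [(0, 0), keep]                    # centre plus one neighbour in X
        bg = [n for n in nbrs if n != keep]    # all other neighbours in Xᶜ
        X3 |= hit_or_miss(X2, fg, bg)
    return X3

# Step 1 would be a simple threshold, e.g. X1 = grey >= 128 (value illustrative).
# Step 2 (thinning) omitted; assume X2 is a one-pixel-wide skeleton:
X2 = np.zeros((7, 7), dtype=bool)
X2[3, 1:6] = True
```

Applied to the synthetic wire above, the two endpoints of the line are recovered and interior skeleton points are rejected.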
4.4.7 A brief introduction to greyscale mathematical morphology

So far, all of the mathematical morphology has assumed that we are dealing with a set X and its complement Xᶜ, i.e. that we have been dealing with binary images comprising a foreground (the set of object points X) and a background (the set of all other points Xᶜ). Unfortunately, greyscale mathematical morphology is considerably more difficult to understand than binary morphology and this section is just intended to serve as an introduction to the manner in which we approach greylevel images from a morphological point of view. For the following, we will consider one-dimensional slices of a two-dimensional image function. This makes it easier to represent in diagrams and it is somewhat easier to follow. Thus, a slice along the X-axis (i.e. a line of) a greyscale image can be viewed as shown in Figure 4.31(a). If we threshold this function choosing values ≥ T, that is, generate a set X_T:

X_T = {x | f(x) ≥ T}

where T is the threshold value, we generate the set of points shown in a bold horizontal line in Figure 4.31(b). A greyscale image, then, is considered to be a function f and is a sequence of sets:

{x_λ(f)}

and the greyscale image is effectively the entire sequence of sets generated by successively decreasing the threshold level, as shown in Figure 4.31(c).

Figure 4.31 (a) A one-dimensional slice of a two-dimensional image function. (b) Thresholding this one-dimensional function at a value T generates a set X_T = {x | f(x) ≥ T}, depicted by the bold line. (c) Representation of an image by a sequence of such sets.

Greylevel erosion is defined:

f → f ⊖ B : {x_λ(f)} → {x_λ(f ⊖ B)}

and equivalently:

{x_λ(f ⊖ B)} = {x_λ(f) ⊖ B}

This greylevel erosion is the function, or sequence of sets, which are individually the erosion of sets generated at successive thresholds. Figure 4.32 illustrates the erosion of a (one-dimensional) function with a flat structuring element.

Figure 4.32 Erosion of a one-dimensional function with a flat structuring element; the eroded function is depicted by the thin curve.

Similarly, greylevel dilation is defined:

f → f ⊕ B : {x_λ(f)} → {x_λ(f) ⊕ B}

and Figure 4.33 illustrates the dilation of a (one-dimensional) function with a flat structuring element.

Figure 4.33 Dilation of a one-dimensional function with a flat structuring element; the dilated function is depicted by the thin curve.

Exercises

1. Why is convolution a useful operation in image processing? Be specific in your answer by identifying the relationship between convolution and filtering.
2. Identify and annotate two simple techniques for noise removal in digital images; detail any assumptions upon which the techniques are based.
3. What is the essential difference between erosion and thinning? What is the relationship between thinning and the medial axis transform?
4. Why are look-up table (LUT) formulations of algorithms computationally efficient?
5. If two images can be registered by translation and rotation operations, is it necessary to use spatial warping techniques? Will greylevel interpolation be an issue?

References and further reading

Arcelli, C. 1979 'A condition for digital points removal', Signal Processing, Vol. 1, pp. 28-35.
Blum, H. 1967 'A transformation for extracting new descriptors of shape', in Models for the Perception of Speech and Visual Form, W. Wathen-Dunn (ed.), MIT Press, Cambridge, Massachusetts, pp. 153-71.
Brady, M. and Asada, H. 1984 'Smoothed local symmetries and their implementation', The International Journal of Robotics Research, Vol. 3, No. 3, pp. 36-61.
Castleman, K.R. 1979 Digital Image Processing, Prentice Hall, New York.
Gonzalez, R.C. and Wintz, P. 1977 Digital Image Processing, Addison-Wesley, Reading, Massachusetts.
Hall, E.L. 1979 Computer Image Processing and Recognition, Academic Press, New York.
Hilditch, C.J. 1983 'Comparison of thinning algorithms on a parallel processor', Image and Vision Computing, Vol. 1, No. 3, pp. 115-32.
Kenny, P.A., Dowsett, D.J., Vernon, D. and Ennis, J.T. 1990 'The application of spatial warping to produce aerosol ventilation images of the lung immediately after perfusion with the same labelled isotope', Physics in Medicine and Biology, Vol. 35, No. 5, pp. 679-85.
Motzkin, Th. 1935 'Sur quelques propriétés caractéristiques des ensembles bornés non convexes', Atti. Acad. Naz. Lincei, 21, pp. 773-9.
Nackman, L.R. and Pizer, S.M. 1985 'Three dimensional shape description using the symmetric axis transform I: Theory', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-7, No. 2, pp. 187-202.
Pratt, W.K. 1978 Digital Image Processing, Wiley, New York.
Rosenfeld, A. and Kak, A. 1982 Digital Picture Processing, Academic Press, New York.
Rosenfeld, A. 1975 'A characterization of parallel thinning algorithms', Information and Control, Vol. 29, pp. 286-91.
Serra, J. 1982 Image Analysis and Mathematical Morphology, Academic Press, London.
Tamura, H. 1978 'A comparison of line-thinning algorithms from a digital geometry viewpoint', Proceedings 4th International Joint Conference on Pattern Recognition, pp. 715-19.
Zhang, T.Y. and Suen, C.Y. 1984 'A fast parallel algorithm for thinning digital patterns', Communications of the ACM, Vol. 27, No. 3, pp. 236-9.
5

The segmentation problem

5.1 Introduction: region and boundary-based approaches
Segmentation is a word used to describe a grouping process in which the components of a group are similar with respect to some feature or set of features. The inference is that this grouping will identify regions in the image which correspond to unique and distinct objects in the visual environment. There are two complementary approaches to the problem of segmenting images and isolating objects: boundary detection and region growing.

Region growing effects the segmentation process by grouping elemental areas (in simple cases, individual image pixels) sharing a common feature into connected two-dimensional areas called regions. Such features might be pixel greylevel or some elementary textural pattern, e.g. the short thin bars present in a herringbone texture.

Boundary-based segmentation is concerned with detecting or enhancing the boundary pixels of objects within the image and subsequently isolating them from the rest of the image. The boundary of the object, once extracted, may easily be used to define the location and shape of the object, effectively completing the isolation. An image comprising boundaries alone is a much higher-level representation of the scene than is the original greyscale image, in that it represents important information explicitly. In cases where the boundary shape is complicated, the gap between these two representations is wide and it may be necessary to introduce an intermediate representation which is independent of the shape. Since boundaries of objects are often manifested as intensity discontinuities, a natural intermediate representation is composed of local object-independent discontinuities in image intensity, normally referred to as 'edges'. The many definitions of the term edge can be summarized by the observation (or premise) that an edge occurs in an image
if some image attribute (normally image intensity) changes its value discontinuously. In particular, edges are seen as local intensity discontinuities while boundaries are global ones. The usual approach to segmentation by boundary detection is first to construct an edge image from the original greyscale image, and then to use this edge image to construct the boundary image without reference to the original greyscale data, by edge linking to generate short curve segments, edge-thinning, gap-filling, and curve segment linking, frequently with the use of domain-dependent knowledge. This association of local edges is normally referred to as boundary detection and the generation of the local edge image is referred to as edge detection.

Boundary detection algorithms vary in the amount of domain-dependent information or knowledge which they incorporate in associating or linking the edges, and their effectiveness is obviously dependent on the quality of the edge image. The more reliable the edge elements in terms of their position, orientation and, indeed, authenticity, the more effective the boundary detector will be. However, for relatively simple, well-defined shapes, boundary detection may become redundant, or at least trivial, as edge detection performance improves. Since the computational complexity of the segmentation process as a whole is a function of the complexity of both the edge detection and the boundary detection, minimal segmentation complexity may be achieved by a trade-off between the sophistication of the edge detector and the boundary detector. It is worth noting, however, that since edge detection is essentially a filtering process and can often be effected in hardware, while boundary detection requires more sophisticated software, the current (and, probably, correct) trend is to deploy the most effective and sophisticated edge detector (e.g. the Canny operator or the Marr-Hildreth operator) and to simplify the boundary detection process.

The remainder of this chapter is devoted to a discussion of a region-based segmentation technique (thresholding), edge detection, region growing and, finally, boundary detection.
5.2 Thresholding

Greylevel thresholding, which we covered briefly in Chapter 4, is a simple region-based technique. However, we include it in a section of its own here because it is a very commonly used and popular technique. As we saw in Chapter 4, in situations where an object exhibits a uniform greylevel and rests against a background of a different greylevel, thresholding will assign a value of 0 to all pixels with a greylevel less than the threshold level and a value of 255 (say) to all pixels with a greylevel greater than the threshold level. Thus, the image is segmented into two disjoint regions, one corresponding to the background, and the other to the object.

5.2.1 Global, local, and dynamic approaches

In a more general sense, a threshold operation may be viewed as a test involving some function of the greylevel at a point, some local property of the point, e.g. the average greylevel over some neighbourhood, and the position of the point in the image. Thus, a threshold operation may be viewed as a test involving a function T of the form:

T(x, y, N(x, y), g(x, y))

where g(x, y) is the greylevel at the point (x, y) and N(x, y) denotes some local property of the point (x, y). If g(x, y) > T(x, y, N(x, y), g(x, y)) then (x, y) is labelled an object point, otherwise it is labelled a background point, or conversely. This is the most general form of the function T, however, and three classes of thresholding (global, local, and dynamic) may be distinguished on the basis of restrictions placed on this function (see Weszka, 1978). These are:

Global thresholding: the test is dependent only on the greylevel of the point.

T = T(g(x, y))

Local thresholding: the test is dependent on a neighbourhood property of the point and on the greylevel of the point.

T = T(N(x, y), g(x, y))

Dynamic thresholding: the test is dependent on the point coordinates, a neighbourhood property of the point and on the greylevel of the point.

T = T(x, y, N(x, y), g(x, y))

It is pertinent to note, however, that most systems utilize the simplest of these three approaches, global thresholding: the threshold test is based exclusively on the global threshold value and on the greylevel of a test point, irrespective of its position in the image or of any local context. The approach is facilitated either by constraining the scene to ensure that there is no uneven illumination or by photometrically decalibrating the image before thresholding. The advantage of this approach is that the thresholding can be accomplished by commonly available hardware using a look-up table, as described in Chapter 4.
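As a concrete illustration of the look-up table formulation, a global threshold over 8-bit pixels reduces to a single 256-entry table indexed by greylevel. This is a sketch only: the threshold value of 128 and the function name are illustrative, not from the text.

```python
import numpy as np

T = 128                                    # illustrative global threshold
# 256-entry look-up table: greylevels below T map to 0, the rest to 255
lut = np.where(np.arange(256) < T, 0, 255).astype(np.uint8)

def threshold(image):
    """Apply the global threshold as one table look-up per pixel."""
    return lut[image]
```

Because the decision depends only on g(x, y), the whole operation is a memory indexed by pixel value, which is why it maps so readily onto commonly available hardware.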
5.2.2 Threshold selection
The selection of an appropriate threshold is the single major problem for reliable segmentation. Of the several techniques which have been proposed, most are based on the analysis of the greylevel histogram, selecting thresholds which lie in the region between the two modes of the (bimodal) histogram. The assumption that the histogram is indeed bimodal, with one mode corresponding to the greylevel representing the object and the other to the greylevel representing the background, is often not valid; histograms are frequently noisy and the two modes may be
difficult to detect (see Figure 5.1). Just as often, the object will generate a single mode while the background will comprise a wide range of greylevels, giving rise to a unimodal histogram (see Figure 5.2). While several applications of a simple smoothing (or local averaging) operator to a noisy histogram will help with noisy bimodal histograms, it will be of little use with unimodal histograms. Figure 5.3 illustrates the effect of smoothing a greyscale histogram by iterative application of a local 3 × 1 neighbourhood averaging operator.

Figure 5.1 Noisy bimodal greyscale histogram.

Figure 5.2 Unimodal greyscale histogram.

Figure 5.3 Greyscale histogram smoothing: top-left: no smoothing; top-right: one application of a 3 × 1 neighbourhood average operator; bottom-left: two applications; bottom-right: three applications.

Another useful approach to threshold selection is to use the average greylevel of those pixels which are on the boundary between the object and the background as an estimate of the threshold value. As the greylevel of such a boundary pixel will typically lie between those of the object and the background, it provides a good indication of the threshold value (see Figure 5.4). The difficulty, of course, lies in deciding which pixels are on the boundary. One of the best approaches is to use a reliable edge detector, such as the Marr-Hildreth operator described in the next section, to identify these boundary points. The threshold selection procedure first uses a Marr-Hildreth operator to locate edges in the image and the mean greylevel of the image pixels at these edge locations is computed. This mean represents the global threshold value. To illustrate this approach, Figure 5.5 shows the binary image generated by thresholding the original greyscale image at a threshold equal to the mean greylevel of the boundary points generated using the Marr-Hildreth operator. It should be noted that, although this threshold selection technique is computationally complex and may take a significant amount of time to compute, it is only a calibration exercise and need not be performed before every threshold operation.
Figure 5.4 Using edge pixels to select a threshold: (a) image of dark, round object on a light background with section X-X shown; (b) profile of image intensity along section X-X; the mean greylevel of the edge pixels gives the threshold value.
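The selection procedure can be sketched as follows. This is an illustration only: a simple gradient-magnitude edge mask stands in for the Marr-Hildreth operator used in the text, and `edge_fraction` (the proportion of pixels treated as boundary points) is an assumed parameter of my own.

```python
import numpy as np

def select_threshold(image, edge_fraction=0.1):
    """Global threshold = mean greylevel of the likely boundary pixels."""
    f = image.astype(float)
    # gradient magnitude as a stand-in edge detector (central differences)
    g = np.abs(np.gradient(f, axis=0)) + np.abs(np.gradient(f, axis=1))
    cut = np.quantile(g, 1.0 - edge_fraction)   # keep the strongest responses
    return f[g >= cut].mean()                   # mean greylevel at edge points
```

On a clean step between a dark object and a light background, the flagged pixels straddle the boundary, so their mean lies between the two greylevels, exactly the behaviour sketched in Figure 5.4.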
5.3 An overview of edge detection techniques
As might be expected when dealing with a process which is fundamental to image segmentation, the literature concerning edge detection is large. This section will not attempt to review all detectors in detail; rather it will survey and describe the different approaches to edge detection and illustrate each approach with specific algorithms. There are four distinct approaches to the problem of edge detection:

(a) gradient- and difference-based operators;
(b) template matching;
(c) edge fitting;
(d) statistical edge detection.

Each of these four approaches will be considered in turn.

Figure 5.5 Automatic threshold selection using the Marr-Hildreth theory of edge detection: (a) original greyscale image; (b) automatically thresholded binary image.

5.3.1 Gradient and difference-based operators

If we define a local edge in an image to be a transition between two regions of significantly different intensities, then the gradient function of the image, which measures the rate of change, will have large values in these transitional boundary areas. Thus gradient-based, or first-derivative-based, edge detectors enhance the image by estimating its gradient function and then signal that an edge is present if the gradient value is greater than some defined threshold. In more detail, if ∂f/∂x and ∂f/∂y represent the rates of change of a two-dimensional function f(x, y) in the x- and y-directions, respectively, then the rate of change in a direction θ (measured in the positive sense from the X-axis) is given by:

(∂f/∂x) cos θ + (∂f/∂y) sin θ

The direction θ at which this rate of change has the greatest magnitude is given by:

θ = arctan[(∂f/∂y) / (∂f/∂x)]

with magnitude:

√[(∂f/∂x)² + (∂f/∂y)²]

The gradient of f(x, y) is a vector at (x, y) with this magnitude and direction. Thus the gradient may be estimated if the directional derivatives of the function are known along (any) two orthogonal directions. The essential differences between all gradient-based edge detectors are the directions which the operators use, the manner in which they approximate the one-dimensional derivatives of the image function in these directions, and the manner in which they combine these approximations to form the gradient magnitude. These gradient functions are intuitively easy to understand when we confine ourselves to the discrete domain of digital images, where partial derivatives become simple first differences. For example, the first difference of a two-dimensional function in the x-direction is simply:

f(x + 1, y) − f(x, y)

Similarly, the first difference of a two-dimensional function in the y-direction is simply:

f(x, y + 1) − f(x, y)

An operator due to Roberts estimates the derivatives diagonally over a 2 × 2 neighbourhood. The magnitude of the gradient g(x, y), at an image point (x, y), is approximated by taking the RMS of the directional differences:

g(x, y) ≈ R(x, y) = √[{f(x, y) − f(x + 1, y + 1)}² + {f(x, y + 1) − f(x + 1, y)}²]

R(x, y) is usually referred to as the Roberts cross operator. The differences may be combined in a way other than the RMS to provide a computationally simpler version, the Roberts absolute value estimate of the gradient function, given by:

g(x, y) ≈ R(x, y) = |f(x, y) − f(x + 1, y + 1)| + |f(x, y + 1) − f(x + 1, y)|

Rosenfeld and Kak have argued that a third version, the Roberts max operator, given by:

g(x, y) ≈ R(x, y) = Max(|f(x, y) − f(x + 1, y + 1)|, |f(x, y + 1) − f(x + 1, y)|)

affords better invariance to edge orientation. Applying the Roberts max operator to edges of equal strength, but of different orientation, produces less variation in the resultant magnitude value than if the cross operator were used. To illustrate these edge detectors, a test image with fairly fine structure (a tray of electrical wires) was acquired: see Figure 5.6. The results of the Roberts RMS, absolute value, and max operators are shown in Figures 5.7, 5.8, and 5.9, respectively.

Figure 5.6 A tray of wires.

One of the main problems with the Roberts operator is its susceptibility to noise because of the manner in which it estimates the directional derivatives, i.e. the first differences, of the image function f(x, y). This has prompted an alternative estimation of the gradient by combining the differencing process with local averaging. For example, the Sobel operator estimates the partial derivative in the
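The three Roberts estimates can be written down directly from the formulas above. This is a numpy sketch of my own: array axis 0 plays the role of x and axis 1 the role of y, and each output is one pixel smaller in both dimensions.

```python
import numpy as np

def roberts(f):
    """Roberts cross (RMS), absolute value, and max gradient estimates."""
    f = f.astype(float)
    d1 = f[:-1, :-1] - f[1:, 1:]     # f(x, y) - f(x+1, y+1)
    d2 = f[:-1, 1:] - f[1:, :-1]     # f(x, y+1) - f(x+1, y)
    rms = np.sqrt(d1**2 + d2**2)
    absval = np.abs(d1) + np.abs(d2)
    rmax = np.maximum(np.abs(d1), np.abs(d2))
    return rms, absval, rmax
```

The absolute-value and max forms trade the square root of the cross operator for cheaper arithmetic, as the text notes.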
Figure 5.7 Roberts RMS edge detection operator.

Figure 5.8 Roberts absolute value edge detection operator.

Figure 5.9 Roberts max edge detection operator.

Figure 5.10 Sobel RMS edge detection operator.
x-direction over a 3 × 3 region centred at f(x, y) by:

Sx = {f(x + 1, y − 1) + 2f(x + 1, y) + f(x + 1, y + 1)} − {f(x − 1, y − 1) + 2f(x − 1, y) + f(x − 1, y + 1)}

This essentially takes the difference of a weighted average of the image intensity on either side of f(x, y). Similarly:

Sy = {f(x − 1, y + 1) + 2f(x, y + 1) + f(x + 1, y + 1)} − {f(x − 1, y − 1) + 2f(x, y − 1) + f(x + 1, y − 1)}

The gradient may then be estimated as before, either by calculating the RMS (see Figure 5.10):

g(x, y) ≈ S = √(Sx² + Sy²)

or by taking the absolute values (see Figure 5.11):

g(x, y) ≈ S = |Sx| + |Sy|

Figure 5.11 Sobel absolute value edge detection operator.

In an analogous manner Prewitt suggests an approximation of the partial derivatives by:

Px = {f(x + 1, y − 1) + f(x + 1, y) + f(x + 1, y + 1)} − {f(x − 1, y − 1) + f(x − 1, y) + f(x − 1, y + 1)}

Py = {f(x − 1, y + 1) + f(x, y + 1) + f(x + 1, y + 1)} − {f(x − 1, y − 1) + f(x, y − 1) + f(x + 1, y − 1)}

and the gradient may be estimated as before (see Figures 5.12 and 5.13).

Figure 5.12 Prewitt RMS edge detection operator.

Figure 5.13 Prewitt absolute value edge detection operator.

Quite often, the directional differences are estimated using simple convolution kernels, one kernel for each different operator. These yield two partial derivative images, which are then combined on a point by point basis, either as the RMS or as the sum of absolute values, to produce the final gradient estimate. Figure 5.14 illustrates the convolution kernels for the Roberts, Sobel, and Prewitt operators. Strictly speaking, the kernel should first be rotated by 180° before the convolution is performed (see Section 4.2.1). However, this is normally omitted since the resultant error of 180° in the gradient direction can be ignored.

Once the gradient magnitude has been estimated, a decision as to whether or not an edge exists is made by comparing it to some predefined value; an edge is deemed to be present if the magnitude is greater than this threshold. Obviously the choice of threshold is important and in noisy images threshold selection involves a trade-off between missing valid edges and including noise-induced false edges.

So far, edge detection has been discussed on the basis of first-derivative directional operators. However, an alternative method uses an approximation to the Laplacian:

∇² = ∂²/∂x² + ∂²/∂y²

i.e. the sum of second-order, unmixed, partial derivatives. The standard approximation is given by:

L(x, y) = f(x, y) − ¼{f(x, y + 1) + f(x, y − 1) + f(x + 1, y) + f(x − 1, y)}

The digital Laplacian has zero response to linear ramps (and thus gradual changes in intensity) but it does respond on either side of the edge, once with a positive sign and once with a negative sign. Thus in order to detect edges, the image is enhanced by evaluating the digital Laplacian and isolating the points at which the resultant
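The Sobel estimates and the digital Laplacian can be sketched in the same style as the Roberts example (my own illustration: axis 0 as x, axis 1 as y, interior pixels only, so outputs shrink by one pixel on each side).

```python
import numpy as np

def sobel(f):
    """Sobel Sx, Sy (difference of weighted 3-pixel averages), combined
    as the RMS estimate and the sum-of-absolute-values estimate."""
    i = f.astype(float)
    sx = (i[2:, :-2] + 2 * i[2:, 1:-1] + i[2:, 2:]) \
       - (i[:-2, :-2] + 2 * i[:-2, 1:-1] + i[:-2, 2:])
    sy = (i[:-2, 2:] + 2 * i[1:-1, 2:] + i[2:, 2:]) \
       - (i[:-2, :-2] + 2 * i[1:-1, :-2] + i[2:, :-2])
    return np.sqrt(sx**2 + sy**2), np.abs(sx) + np.abs(sy)

def laplacian(f):
    """Standard digital Laplacian approximation L(x, y)."""
    i = f.astype(float)
    return i[1:-1, 1:-1] - 0.25 * (i[1:-1, 2:] + i[1:-1, :-2]
                                   + i[2:, 1:-1] + i[:-2, 1:-1])
```

On a linear ramp such as f(x, y) = 3x, the Laplacian is identically zero while the Sobel magnitude is constant, matching the zero-response-to-ramps behaviour described above.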
image goes from positive to negative, i.e. at which it crosses zero. The Laplacian has one significant disadvantage: it responds very strongly to noise.

A different and much more successful application of the Laplacian to edge detection was proposed by Marr and Hildreth in 1980. This approach first smooths the image by convolving it with a Gaussian filter, evaluates the Laplacian of the smoothed image, and marks edges at the zero-crossings of the result.

Despite some criticism of this technique, it is very widely used. The operator possesses a number of useful properties: for example, the evaluation of the Laplacian and the convolution commute so that, for a Gaussian with a given standard deviation, we can derive a single filter, the Laplacian of Gaussian:

∇²{f(x, y) * G(x, y)} = ∇²G(x, y) * f(x, y)

Furthermore, this two-dimensional convolution is separable into four one-dimensional convolutions (see Appendix I for a derivation):

∇²(f(x, y) * G(x, y)) = G(x) * [f(x, y) * ∂²G(y)/∂y²] + G(y) * [f(x, y) * ∂²G(x)/∂x²]

Figure 5.14 Convolution kernels for estimation of the partial derivatives with (a) Roberts; (b) Sobel; and (c) Prewitt edge detection operators.
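The zero-crossing idea behind the Marr-Hildreth operator can be illustrated in one dimension. This is a sketch only: the kernel is a sampled second derivative of a Gaussian, and the values of sigma, the kernel radius, and the tolerance are all illustrative choices of my own.

```python
import numpy as np

def log_kernel(sigma, radius):
    """Sampled 1-D Laplacian of Gaussian: d²/dx² of exp(-x² / 2 sigma²)."""
    x = np.arange(-radius, radius + 1, dtype=float)
    g2 = (x**2 / sigma**4 - 1.0 / sigma**2) * np.exp(-x**2 / (2 * sigma**2))
    return g2 - g2.mean()      # force zero net response to constant regions

def zero_crossings(r, tol=1e-3):
    """Indices where the filtered signal changes sign (ignoring float noise)."""
    s = np.sign(np.where(np.abs(r) < tol, 0.0, r))
    return np.where(s[:-1] * s[1:] < 0)[0]

# a noiseless step edge at index 32: the single zero crossing marks the edge
f = np.where(np.arange(64) >= 32, 200.0, 0.0)
r = np.convolve(f, log_kernel(2.0, 8), mode='same')
```

The filtered step is positive on one side of the edge and negative on the other, so the sign change localizes the edge, which is precisely the zero-crossing behaviour the text describes.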