
Machine Vision
Automated Visual Inspection and Robot Vision

David Vernon
Department of Computer Science, Trinity College Dublin, Ireland

Prentice Hall
New York London Toronto Sydney Tokyo Singapore

First published 1991 by
Prentice Hall International (UK) Ltd
66 Wood Lane End, Hemel Hempstead
Hertfordshire HP2 4RG
A division of Simon & Schuster International Group

© Prentice Hall International (UK) Ltd, 1991

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form, or by any means, electronic, mechanical, photocopying, recording or otherwise, without prior permission, in writing, from the publisher. For permission within the United States of America contact Prentice Hall Inc., Englewood Cliffs, NJ 07632.

Typeset in 10 on 12 point Times by MCS Typesetters, Salisbury, Wiltshire, England
Printed and bound in Great Britain by Cambridge University Press

Library of Congress Cataloging-in-Publication Data is available from the publisher

British Library Cataloguing in Publication Data
Vernon, David
Machine vision.
I. Title
006.3
ISBN 0-13-543398-3

1 2 3 4 5 95 94 93 92 91

Everything that I can spy
Through the circle of my eye,
Everything that I can see
Has been woven out of me;
I have sown the stars, and threw
Clouds of morning and of eve
Up into the vacant blue;
Everything that I perceive,
Sun and sea and mountain high,
All are moulded by my eye:
Closing it, what shall I find?
- Darkness and a little wind.

James Stephens
The Hill of Vision

Contents

Preface
Acknowledgements

1 An introduction to computer vision
  1.1 Computer vision: image processing or artificial intelligence?
  1.2 Industrial machine vision vs. image understanding
  1.3 Sensory feedback for manufacturing systems: why vision?
  1.4 Examples of industrial machine vision problems and solutions
    1.4.1 Measurement of steel bars
    1.4.2 Inspection of computer screens
  1.5 A typical system architecture

2 Illumination and sensors
  2.1 Illumination
  2.2 Sensors
    2.2.1 Image formation: elementary optics
    2.2.2 Camera sensors
    2.2.3 Camera interfaces and video standards
    2.2.4 Characteristics of camera sensors
    2.2.5 Commercially available cameras

3 Image acquisition and representation
  3.1 Sampling and quantization
    3.1.1 Spatial frequency and the effects of sampling
  3.2 Inter-pixel distances
  3.3 Adjacency conventions
  3.4 Image acquisition hardware
  3.5 Speed considerations

4 Fundamentals of digital image processing
  4.1 Point operations
    4.1.1 Contrast stretching
    4.1.2 Thresholding
    4.1.3 Noise suppression by image addition
    4.1.4 Background subtraction
  4.2 Neighbourhood operations
    4.2.1 Convolution
    4.2.2 Noise suppression
    4.2.3 Thinning, erosion, and dilation
  4.3 Geometric operations
    4.3.1 Spatial warping
      4.3.1.1 The spatial transformation
      4.3.1.2 Grey-level interpolation
    4.3.2 Registration and geometric decalibration
  4.4 Mathematical morphology
    4.4.1 Basic set theory
    4.4.2 Structuring elements and hit or miss transformations
    4.4.3 Erosion and dilation
    4.4.4 Opening and closing
    4.4.5 Thinning and the extraction of endpoints
    4.4.6 Application: identification of endpoints of electrical wires
    4.4.7 A brief introduction to grey-scale mathematical morphology

5 The segmentation problem
  5.1 Introduction: region- and boundary-based approaches
  5.2 Thresholding
    5.2.1 Global, local, and dynamic approaches
    5.2.2 Threshold selection
  5.3 An overview of edge detection techniques
    5.3.1 Gradient- and difference-based operators
    5.3.2 Template matching
    5.3.3 Edge fitting
    5.3.4 Statistical techniques
    5.3.5 Assessment of edge detection
  5.4 Region growing
    5.4.1 The split and merge procedure using quad-trees
  5.5 Boundary detection
    5.5.1 Boundary refining
    5.5.2 Graph-theoretic techniques
    5.5.3 Dynamic programming
    5.5.4 Contour following

6 Image analysis
  6.1 Introduction: inspection, location, and identification
  6.2 Template matching
    6.2.1 Measures of similarity
    6.2.2 Local template matching
  6.3 Decision-theoretic approaches
    6.3.1 Components of a statistical pattern recognition process
    6.3.2 Simple feature extraction
    6.3.3 Classification
      6.3.3.1 A synopsis of classification using Bayes' rule
  6.4 The Hough transform
    6.4.1 Hough transform for line detection and circle detection
    6.4.2 The generalized Hough transform
  6.5 Histogram analysis

7 An overview of techniques for shape description
  7.1 A taxonomy of shape descriptors
  7.2 External scalar transform descriptors: features of the boundary
  7.3 Internal scalar transform descriptors: features of the region
  7.4 External space domain descriptors: spatial organization of the boundary
    7.4.1 An algorithm for resampling the boundary chain codes
  7.5 Internal space domain descriptors: spatial organization of the region

8 Robot programming and robot vision
  8.1 A brief review of robot programming methodologies
  8.2 Description of object pose with homogeneous transformations
  8.3 Robot programming: a wire crimping task specification
  8.4 A simple robot-programming language
  8.5 Two vision algorithms for identifying ends of wires
    8.5.1 A binary vision algorithm
    8.5.2 A grey-scale vision algorithm
    8.5.3 The vision/manipulator interface
  8.6 The camera model and the inverse perspective transformation
    8.6.1 The camera model
    8.6.2 The inverse perspective transformation
    8.6.3 Recovery of the third dimension
  8.7 Three-dimensional vision using structured light

9 Introduction to image understanding
  9.1 Representations and information processing: from images to object models
  9.2 Organization of visual processes
  9.3 Visual representations
    9.3.1 The raw primal sketch
    9.3.2 The full primal sketch
    9.3.3 The two-and-a-half-dimensional sketch
    9.3.4 Three-dimensional models
      9.3.4.1 Volumetric representations
      9.3.4.2 Skeletal representations
      9.3.4.3 Surface representations
    9.3.5 The extended Gaussian image
  9.4 Visual processes
    9.4.1 Stereopsis
    9.4.2 Camera motion
    9.4.3 Shading
  9.5 Concluding remarks

Appendix: Separability of the Laplacian of Gaussian Operator
Index

Preface

Machine vision is a multi-disciplinary subject, utilizing techniques drawn from optics, electronics, mechanical engineering, computer science, and artificial intelligence. This book is intended to be an in-depth introduction to machine vision which will allow the reader quickly to assimilate and comprehend the essentials of this evolving and fascinating topic. Significant emphasis will be placed on providing the reader with a solid grounding in the fundamental tools for image acquisition, processing, and analysis; a range of techniques, dealing with very simple two-dimensional systems, through more sophisticated robust two-dimensional approaches, to the current state of the art in three-dimensional robot vision, will be explained in some detail. Both application areas of automated visual inspection and robot vision are addressed. Recognizing that machine vision is just a component of a larger automation system, a brief introduction to robot programming will be provided, together with an explanation of the mechanisms by which robot vision modules interact with the programming language.

It is important to recognize that the discipline of machine vision is presently undergoing a maturing process, with sophisticated techniques drawn from current research being exploited more and more in industrial systems. Without doubt, there is a long way to go, but the die is well cast. Acknowledging this trend, the last chapter of the book is devoted to the more research-orientated topics of three-dimensional image understanding and early visual processing (e.g. stereopsis and visual motion). It would indeed be foolhardy to attempt an exhaustive treatment of these areas; each deserves a volume on its own. However, if the essence of the philosophy of robot vision in its broadest sense is cogently imparted to the reader, then the exercise will have been successful and worthwhile.

The book is directed at final-year undergraduate and first-year graduate students in computer science and engineering, and at practising industrial engineers; the fundamental philosophy being to impart sufficient knowledge so that the reader will be competent to begin the implementation of a simple vision system and to enable him/her to study each issue independently in more depth. To that end, care is taken to provide adequate references to supporting texts, reports, and research papers. In this way the book may be viewed both as a self-contained introductory text and as a spring-board to more detailed and specific study.

Acknowledgements

Special thanks are due to Kenneth Dawson of the Computer Vision Group, Department of Computer Sciences, Trinity College, Dublin, for his work on the raw primal sketch, the extended Gaussian image, and the polyhedral models; and to Massimo Tistarelli and Prof. Giulio Sandini at the University of Genoa for their help with the examples of camera motion and stereopsis. Many people in the Trinity Computer Vision Group read draft versions of this book and I am grateful for their contributions. I would also like to record a note of thanks to Dr R. Dixon, The University, Manchester, for his many valuable comments on an earlier draft of this book. Several of the examples in this book were facilitated by research funded by the Commission of the European Communities under the European Strategic Programme for Research and Development in Information Technology: Project 419 - Image and Movement Understanding.


1
An introduction to computer vision

1.1 Computer vision: image processing or artificial intelligence?
What is computer vision and why would one be interested in studying it? It is perhaps easier to answer these two questions in reverse order. There are several reasons why one would be interested in computer vision, but the following two will serve to illustrate the many directions from which one can view the subject area:

1. All naturally occurring intelligent life-forms exhibit an ability to interact with and manipulate their environment in a coherent and stable manner. This interaction is facilitated by on-going intelligent interplay between perception and motion-control (i.e. action); visual perception is fundamentally important to most intelligent life.

2. Most manufacturers are concerned with the cosmetic integrity of their products; customers quite often equate quality of appearance with functional quality. So, to ensure the successful long-term marketing of an item, it is highly desirable that its appearance is checked visually before packaging and shipping. Likewise, it is desirable that the inspection process be automated and effected without human intervention.

These two motivations for the study of perception characterize two possible extremes of interest in the processing, analysis, and interpretation of visual imagery: from the philosophical and perhaps esoteric to the immediate and pragmatic. And the subject matter of everything between these two extremes presents one with wide and varied spectrums of commercial interest, difficulty and, indeed, success.

The answer to the first question (what is computer vision?) now becomes a little easier to identify. The world we live in and experience is filled with an endless variety of objects, animate and inanimate, and, to borrow a phrase from David Marr (of whom we shall hear more later in the book), it is by looking and seeing

that we come to know what is where in this world. So, if vision is a means to an end - to know the world by looking - then computer vision is exactly the same, except that the medium by which the knowledge is gained is now a computational instrument rather than the brain of some living creature. Without doubt, this is a very broad definition. But the subject matter of computer vision is this broad: topics such as image restoration, image enhancement, automated visual inspection, robot vision, computer-based image understanding of general three-dimensional scenes, and visual perception and cognition all fall under the umbrella of the term 'computer vision'.

Although for centuries man has been interested in solving the puzzle of how man comes to 'see', the first computational experiments in developing artificial machine vision systems were conducted in the late 1950s and, over the last twenty-five to thirty years, computer-based vision systems of widely varying degrees of complexity have been used in many diverse areas such as office automation, medicine, remote sensing by satellite, and in both the industrial world and the military world. The applications have been many and varied, encompassing character recognition, blood cell analysis, automatic screening of chest X-rays, registration of nuclear medicine lung scans, computer-aided tomography (CAT), chromosome classification, land-use identification, traffic monitoring, automatic generation of cartographic projections, industrial parts inspection for quality assurance, part identification, automatic guidance of seam welders, and visual feedback for automatic assembly and repair. Military applications have included the tracking of moving objects, automatic navigation based on passive sensing, and target acquisition and range-finding.

As we have seen, computer vision is concerned with determining the physical structure of a three-dimensional world by the automatic analysis of images of that world. However, it is necessary to qualify the use of the word image. First, the image is a two-dimensional one and, hence, we inevitably lose information in the projection process, i.e. in passing from a three-dimensional world to a two-dimensional image. Quite often, it is the recovery of this lost information which forms the central problem in computer vision. Second, the images are digital images: they are discrete representations (i.e. they have distinct values at regularly sampled points) and they are quantized representations (i.e. each value is an integer value).

Computer vision includes many techniques which are useful in their own right, e.g. image processing (which is concerned with the transformation, encoding, and transmission of images) and pattern recognition (frequently the application of statistical decision theory to general patterns, of which visual patterns are but one instance). More significantly, however, computer vision includes techniques for the useful description of shape and of volume, for geometric modelling, and for so-called cognitive processing. Thus, though computer vision is certainly concerned with the processing of images, these images are only the raw material of a much broader science which, ultimately, endeavours to emulate the perceptual capabilities of man and, perhaps, to shed some light upon the manner by which he accomplishes his amazingly adaptive and robust interaction with his environment.

1.2 Industrial machine vision vs. image understanding
Computer vision, then, is an extremely broad discipline (or set of disciplines) and in order to get to grips with it, we need to identify some way of classifying different approaches. To begin with, we note that humans live and work within a general three-dimensional world, pursuing many goals and objectives in an unconstrained and constantly changing environment in which there are many varied and, often, ill-defined objects. Industrial automation, on the other hand, is given to performing single repeated tasks involving relatively few objectives, all of which are known and defined, in manufacturing environments which are normally constrained and engineered to simplify those tasks. Industrial systems do not yet work with general three-dimensional environments (although the environments they do work in are often much less structured than one would suppose) and vision systems for manufacturing still exploit many assumptions, which would not generally apply to unconstrained worlds with many objects and many goals, in order to facilitate processing and analysis. There is a considerable dichotomy between the two approaches - a situation which must change and is changing - and it is for this reason that the final chapter is concerned with advanced techniques and their migration to the industrial environment.

Let us look a little closer at each of these classes of computer vision. Approaches associated with general environments are frequently referred to by the terms 'image understanding' or 'scene analysis'. The latter term is now quite dated as it typically refers to approaches and systems developed during the 1970s. Vision systems specifically intended for the industrial environment are often referred to generically as 'industrial machine vision systems'.

Image understanding vision systems are normally concerned with three-dimensional scenes, which are partially constrained, but viewed from one (and often several) unconstrained viewpoint. The illumination conditions may be known, e.g. the position of the room light might be assumed, but usually one will have to contend with shadows and occlusion, i.e. partially hidden objects. As such, the data or scene representation is truly a two-dimensional image representation of a three-dimensional scene, with high spatial resolutions (i.e. it is extremely detailed) and high grey-scale resolutions (i.e. it exhibits a large variation in grey-tone). Occasionally, colour information is incorporated but not nearly as often as it should be. Range data is sometimes explicitly available from active range-sensing devices, but a central theme of image understanding is the automatic extraction of both range data and local orientation information from several two-dimensional images using, e.g., stereopsis, motion, shading, occlusion, texture gradients, or focusing.

One of the significant aspects of image understanding is that it utilizes several redundant information representations (e.g. based on the object edges or boundaries, the disparity between objects in two stereo images, and the shading of the object's surface); and it also incorporates different levels of representation in


order to organize the information being made explicit in the representation in an increasingly powerful and meaningful manner. For example, an image understanding system would endeavour to model the scene with some form of parameterized three-dimensional object models built from several low-level processes based on distinct visual cues. At present, image-understanding systems utilize both explicit knowledge (or models) and software-embedded knowledge for reasoning, that is, for controlling image analysis.

Most industrial machine vision systems contrast sharply with the above approach. The scenes in an industrial environment are usually assumed to be two-dimensional, comprising known isolated rigid parts, frequently with a contrasting visual backdrop. Lighting is almost always a critical factor and must be very carefully organized. Typically, the ambient room lighting will be totally inadequate, and even confusing, so that each inspection station will require its own set of dedicated lights, each designed for the task in hand. The images which industrial machine vision systems use are frequently two-dimensional binary images (pure black and white, with no intermediate grey-levels) of essentially two-dimensional scenes. There is normally just one simple internal object representation or model; the analysis strategy being to extract salient features (e.g. area, circularity, or some other measure of shape) and to make some decision, typically using feature-based discrimination. This process frequently uses software-embedded (hard-coded) knowledge of the scene.

There are two complementary areas of industrial machine vision: robot vision and automated visual inspection. Both of these use essentially the same techniques and approaches, although the visual inspection tasks are, in general, not as difficult as those involved in visual perception for robotic parts manipulation, identification, and assembly. This is because the inspection environment is usually easier to control and the accept/reject decisions required for inspection are often easier to determine than the location and identification information needed for assembly. The significant problems associated with robotic part handling, too, have meant that advanced three-dimensional robot vision has not received the attention it merits.

1.3 Sensory feedback for manufacturing systems: why vision?

The answer to this question must necessarily be double-barrelled:

1. We need feedback because the manufacturing system is not perfect and free of errors: we wish to ensure that we are informed when errors begin to creep into the process, so that we can take corrective action and ensure that quality and productivity are maintained.

2. We use vision because it is by far the most versatile sense available and conveys extremely rich information when compared with, e.g., sonar or infra-red sensing. Furthermore, unlike tactile sensing, it senses the environment in a remote manner rather than having to be in contact with the objects being analysed. Systems which are equipped with (useful) visual capabilities are inherently adaptive and can deal with uncertainty in the environment, or at least that is what one would hope for.

The upshot of this is that, by incorporating vision in the manufacturing process, not only can we identify when things go wrong (e.g. in visual inspection) but the uncertain and variable nature of the manufacturing environment can be catered for. Unfortunately, vision, while versatile, is also the most complex of the senses, due mainly to the fact that most information in visual images is implicitly coded and requires extensive processing and analysis to make it explicit. Visual sensing is difficult: in humans, ten of the estimated one hundred cortical areas in the brain are devoted to vision, and much work remains to be done before we can claim to have even a modest grasp of visual sensing. Given that one acknowledges that vision is (potentially) a very powerful sense, let us look at some of the motivations for using visual processes in the industrial workplace.

□ Safety and reliability

Considerations of safety and reliability usually arise in environments which are hazardous for humans (e.g. in close proximity to a milling bit) or because manufactured parts are of critical importance and 100 per cent inspection is required (e.g. defects in a brake lining might conceivably cause loss of life). Machine vision also facilitates consistency in inspection standards; such systems don't suffer from the 'Monday-morning' syndrome and their performance can be (and should be) quantitatively assessed.

□ Product quality

High-volume production using humans seldom facilitates inspection of all parts but automated visual inspection techniques may make it feasible; this depends on the complexity of the task and the effective throughput that is required by the manufacturing system. The latter consideration is particularly important if the vision system is to be incorporated in an on-line manner, i.e. inspecting each part as it is manufactured.

□ Flexible automation

In environments where quality assurance is performed by a machine, it is feasible to integrate the inspection task into the complete production or manufacturing cycle, and allow it to provide feedback to facilitate on-line control. This provides for the adaptive requirements of AMT (advanced manufacturing technology) systems and facilitates the overall control by computer, such as is found (or, more realistically, as will be found in the future) in advanced computer integrated manufacturing (CIM) environments.

Integration within a CIM system is, however, one of the problems that is forestalling the rapid deployment of vision technology in advanced manufacturing environments, not least because the vision system will not usually know how to communicate with, e.g., the CAD (computer aided design) database or another manufacturer's robotic palletizing system.

1.4 Examples of industrial machine vision problems and solutions

Of the two sub-sections of industrial machine vision, automated visual inspection is presently by far the most important area, accounting for at least 80 per cent of all current applications. Within inspection, there is a great diversity of uses in fields such as quality assurance, machine monitoring, and test and calibration. These applications may be classified on a functional basis as either gauging, inspection, or sorting:

• Gauging is concerned with the measurement of dimensional characteristics of parts and with checking tolerances.

• Inspection, per se, is concerned with performing part verification, i.e. establishing whether there are any parts or sections missing from the object being inspected or whether there are any extraneous parts which should not be present. Alternatively, one might be interested in performing flaw detection, i.e. effecting the detection and classification of flaws (usually surface flaws) on the part: for example, the detection of scratches on plastic housings.

• Sorting is concerned with the identification and recognition of parts. Parts will usually be on a conveyer system and it is pertinent to note that this does not necessarily involve robot manipulation as the part can quite simply be pushed into an appropriate bin using a simple pneumatically actuated flipper.

The applications of robot vision are less varied. At present, this is primarily due to the uncertain nature of the industrial environment within which a robot will be operating: there are typically three dimensions to deal with instead of two, partly manufactured objects tend to be poorly finished, having spurious material attached (e.g. swarf), and they tend not to appear exactly where they are expected or are supposed to be. The two application areas which are well developed are materials handling and welding. The reasons for this are that these problems can often be reduced to two dimensions: objects on a conveyer or pallet are visualized as two-dimensional silhouetted shapes, and tracking a welding seam is conceived, locally, as a two-dimensional exercise in the estimation of the disparity between the position of the welding rod and the metal seam. The application which one would expect to be the main subject matter of robot vision - part assembly - is very poorly developed for the reasons given above and, additionally, because the assembly operation requires greater dexterity on the part of the robot than is currently offered by commercial manipulators. It is for this reason that the development of compliant manipulation and the incorporation of tactile sensors in the robot grippers are fundamentally important.

Despite these negative comments, there is still a great deal of potential for existing vision technologies; there are many common industrial problems which can be successfully and cost-effectively solved with the judicious use of well-understood and simple vision tools. To illustrate this and some of the techniques that have been alluded to so far, a few very simple examples are in order; a more sophisticated application is described in detail in Chapter 8.

1.4.1 Measurement of steel bars
Raw material for steel ingots to be used in a casting process is delivered in four-foot long bars with a circular cross-section. Individual ingots are then cut from the bar and stacked in a wooden box before being moved to the next stage of the casting process. To ensure high-quality casting with minimal wastage of raw material, it is essential that the weight of the charges fall within acceptable tolerances. The nominal tolerances on weight are ± 5 g.

If the cross-sectional area of the bar of raw material were constant, the required weight would be given by an ingot of fixed length. Unfortunately, the cross-sectional area is not constant and the outside diameter of the bar can vary considerably (a 1.25" bar has a nominal variation of 0.01" but often exceeds this limit). Thus, in order to ensure the correct weight of a cut charge, the cross-sectional area of the bar must be monitored and a length of bar chosen such that the volume (and hence weight) falls within the required tolerance.

To measure the cross-sectional area, an 'outside diameter' gauge comprising a diametrically opposed light source and an image sensor is mounted on a robot end effector and moved along the axis of the bar (see Figure 1.1). In this manner, the bar will form a 'shadow' on the light sensor, the size of which corresponds to the diameter of the bar at that point. Thus, a profile of the bar can be built as the bar passes through the gauge. One pass of the sensor generates a single signature describing how the diameter varies over the scanned length. If it is assumed that the bar is truly circular in cross-section, this single measure is sufficient to compute the volume of the bar; several passes of the sensor at different orientations (i.e. rotating the sensor about the axis of the bar after each pass) allow this assumption to be dropped.

The essential point to be noticed about this system is that the techniques used are inherently simple (measuring the length of a shadow cast directly on a sensor) and do not require any special sophisticated software. The gauging process is very fast and consequently can be integrated into the production system as a feedback device to control the ingot cutting operation directly.

Figure 1.1 Measurement and inspection of steel bars. (Collimated light source; diameter given by width of shadow.)
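The volume computation implied by this gauging scheme can be sketched in a few lines of code. The following Python fragment is an illustrative sketch only and is not taken from the book: it assumes the gauge delivers a list of diameters sampled at a fixed step along the bar, assumes a nominal density for steel, and walks along the profile until the accumulated weight of the prospective ingot reaches the target. The function and constant names are invented for the example.

```python
# Minimal sketch (not from the book): choose a cut length for a steel ingot so
# that its weight falls within the +/- 5 g tolerance, given a diameter profile
# measured by the outside-diameter gauge as it moves along the bar.
# Assumptions: diameters sampled every `step_mm` millimetres, circular
# cross-section, and a steel density of about 7.85 g/cm^3.

import math

STEEL_DENSITY_G_PER_MM3 = 7.85e-3   # 7.85 g/cm^3 expressed per mm^3 (assumed)

def cut_length_for_weight(diameters_mm, step_mm, target_g, tolerance_g=5.0):
    """Accumulate slice volumes along the measured profile and return the
    length (mm) at which the ingot weight reaches the target, or None if the
    target cannot be met within tolerance."""
    weight_g = 0.0
    for i, d in enumerate(diameters_mm):
        area_mm2 = math.pi * (d / 2.0) ** 2            # cross-section of this slice
        weight_g += area_mm2 * step_mm * STEEL_DENSITY_G_PER_MM3
        if weight_g >= target_g - tolerance_g:
            # Accept only if we have not overshot the upper tolerance limit.
            return (i + 1) * step_mm if weight_g <= target_g + tolerance_g else None
    return None

# Example: a nominally 1.25" (31.75 mm) bar whose diameter varies slightly.
profile = [31.75 + 0.05 * math.sin(i / 10.0) for i in range(1000)]
print(cut_length_for_weight(profile, step_mm=1.0, target_g=1500.0))
```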

1.4.2 Inspection of computer screens

Cathode ray tubes (CRTs) must be inspected and calibrated during the assembly of personal computers. The picture position, width, height, distortion, brightness, and focus need to be computed and, if any of these measurements lie outside specified tolerances, real-time feedback is required to facilitate on-line adjustment. Such a system has been configured using standard off-the-shelf equipment, comprising an Apple Macintosh II, a Data Translation image acquisition device (framestore), and a Panasonic CCD (charge coupled device) camera.

For the purposes of illustrating the image analysis, consider the problem of identifying the position of the picture on the screen, such as is depicted in Figure 1.2. We wish to measure the distance between the edge of the picture (shown in white) and the inside edge of the computer face-plate, i.e. the width of the black border. To do this, we sample a small linear section (shown as the vertical X-X section in Figure 1.2) and identify the transitions between white and black and between black and grey. This is accomplished by estimating the rate of change in intensity or colour along the strip. A signature of this intensity change is shown in Figure 1.2; the required distances, D1 and D2, correspond to the distance between the first two peaks we encounter as we journey from the centre of the strip to the periphery. The CRT is adjusted until these two distances are, to within specified tolerances, equal. All the other features which need to be checked, with the exception of focus and brightness, are computed in a similar manner.
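The strip analysis described above amounts to scanning outwards from the centre of the X-X section and locating the first two sharp intensity transitions on each side. The sketch below is an illustrative reconstruction, not the system's actual code; the threshold value and the synthetic test strip are assumptions made for the example.

```python
# Minimal sketch (not from the book): estimate the black-border widths D1 and
# D2 from the intensity samples along the vertical X-X strip by finding the
# first two strong intensity changes encountered from the centre outwards.

def transitions_from_centre(samples, step, threshold, count=2):
    """Indices of the first `count` strong intensity changes met while walking
    from the centre of the strip towards one end (step = -1 or +1)."""
    found = []
    i = len(samples) // 2
    while 0 < i + step < len(samples) and len(found) < count:
        if abs(samples[i + step] - samples[i]) > threshold:
            found.append(i + step)
        i += step
    return found

def border_widths(samples, threshold=50):
    up = transitions_from_centre(samples, -1, threshold)
    down = transitions_from_centre(samples, +1, threshold)
    d1 = abs(up[0] - up[1]) if len(up) == 2 else None      # border above the picture
    d2 = abs(down[0] - down[1]) if len(down) == 2 else None  # border below the picture
    return d1, d2

# Synthetic strip: grey face-plate, black border, white picture, black border, grey.
strip = [120] * 20 + [10] * 15 + [250] * 60 + [10] * 25 + [120] * 20
d1, d2 = border_widths(strip)
print(d1, d2, d1 - d2)   # vertical eccentricity = D1 - D2
```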

1.5 A typical system architecture
In the context of a manufacturing environment, a machine vision process will typically encompass nine components: the manufacturing assembly line; some mechanism for delivering the parts to the machine vision system; a method of automatically presenting the part to the system (including part acquisition, part positioning, and registration); some sensing system which provides a representation understandable by computer systems (and digital systems, in general); the representation itself; some process to extract pertinent features; a set of criteria upon which some decision process can be based; a method facilitating the automatic release of the system; and some mechanism for collecting the processed part (see Figure 1.3).

Rather than discuss in detail each component of the imaging system, as this would pre-empt the subject matter of later chapters, we will present here no more than a schematic overview (illustrated in Figure 1.4) denoting many components and terms which have not yet been explained; the reader is asked to bear in mind that their introduction here is by way of preview rather than a self-contained exposition.

The imaging part of the machine vision system comprises sub-systems for image formation, image acquisition and (pre-)processing, image analysis, and image interpretation. The image formation system, consisting of the part illumination, the sensing element and the associated optics, is critical to the successful deployment of the system. This system generates an analogue representation of the image scene, typically in the form of a conventional TV video signal. The task of the image acquisition and processing sub-system is to convert this signal into a digital image, and to manipulate the resultant image to facilitate the subsequent extraction of information. The image analysis phase, working with the processed image, is concerned with the extraction of explicit information regarding the contents of the image (e.g. object position, size, orientation). Thus, there is a fundamental difference between image processing and image analysis: the former facilitates transformations of images to (hopefully, more useful) images, while image analysis facilitates the transformation from an image to an explicit (symbolic) piece of information. The final phase, image interpretation, is concerned with making some decision (e.g. accept or reject) based upon the description of what is in the image; it provides the link back to the environment to control the inspection process or the robot manipulator.

Each sub-system requires its own specialized type of hardware: image formation uses camera technology, optics, and various light sources. Image acquisition requires an analogue-to-digital converter or 'frame-grabber' to capture a frame of video information; image processing is a computationally intensive task (due to the amount of data comprising an image) and the predominant trend is to accomplish as much of this in hardware as possible. Image analysis typically takes place on a conventional micro/mini computer, though, again, there is a trend to accomplish as much of the analysis as possible in dedicated hardware. Image interpretation is normally effected in software on a conventional computer, usually the same one which implements the image analysis.
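The distinction drawn here between image processing (image to image), image analysis (image to symbolic description) and image interpretation (description to decision) can be made concrete with a toy pipeline. The following sketch is illustrative only and is not taken from the book; the thresholding step, the area/centroid features and the accept/reject rule are assumptions chosen for brevity.

```python
# Minimal sketch (not from the book) of the processing / analysis /
# interpretation stages described above.

from typing import List, Dict

Image = List[List[int]]   # a digital image: a 2-D array of quantized grey-levels

def process(image: Image, threshold: int = 128) -> Image:
    """Image processing: transform an image into a (more useful) image,
    here a binary image produced by simple thresholding."""
    return [[255 if p >= threshold else 0 for p in row] for row in image]

def analyse(binary: Image) -> Dict[str, float]:
    """Image analysis: extract explicit symbolic information, here the area
    and centroid of the bright region."""
    points = [(r, c) for r, row in enumerate(binary)
              for c, p in enumerate(row) if p == 255]
    area = len(points)
    cr = sum(r for r, _ in points) / area if area else 0.0
    cc = sum(c for _, c in points) / area if area else 0.0
    return {"area": area, "centroid_row": cr, "centroid_col": cc}

def interpret(description: Dict[str, float], min_area: int, max_area: int) -> str:
    """Image interpretation: make a decision from the symbolic description."""
    return "accept" if min_area <= description["area"] <= max_area else "reject"

image = [[10, 10, 200, 210], [12, 220, 230, 11], [10, 10, 10, 10]]
print(interpret(analyse(process(image)), min_area=3, max_area=6))
```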

Figure 1.2 Inspection of computer screens. (Change in brightness along the vertical X-X section; vertical eccentricity = D1 - D2.)

Figure 1.3 Components of a machine vision system in the context of a manufacturing environment. (Manufacturing assembly line; delivery to inspection device; automatic registration and positioning; imaging system with camera/lens and lighting sub-systems; inspection and part pickup.)

Figure 1.4 Schematic overview of an automated visual inspection system.

C0)-

C"..E:f
?I

..... c:
-ctl > c:

"
o
c:

0) ....

'iii
Q)

0

:>

"C

.E

co,.,

Cl

c

?
c: o c:

0

(J

.E

ro

o
Q)

C

'.j:i

'.j:l ctl

0)?1

E -0
c: o
ro

co

-

E

'E @

a..

J.-I;~i!1 .Ctl

c:

',p
+"'

c:

0

I< ..

~,

1:

C1.l

(J

0..0.

ro

e
(/J

11

Exercises
1. Comment on the validity of the statement that 'Industrial machine vision and image understanding have nothing in common'.

2. Given that an inspection task typically comprises three main stages of sensing, image processing, and image analysis, identify the component processes within each stage by describing the functional structure of a task involving the automated visual inspection of O-ring seals. The objective of the inspection task is to compute the outside and inside diameters and the diameter ratios and to check that each of the three values is within a certain tolerance. Pay particular attention to the components related to image analysis.

3. Describe briefly the essential difference between information contained in images and the information required to make a decision in a manufacturing environment.

References and further reading
Ballard, D.H. and Brown, C.M. 1982 Computer Vision, Prentice Hall, New Jersey.
Batchelor, B.G., Hill, D.A. and Hodgson, D.C. (eds) 1985 Automated Visual Inspection, IFS (Publications) Ltd, United Kingdom.
Besl, P.J., Delp, E.J. and Jain, R. 1985 'Automatic visual solder joint inspection', IEEE Journal of Robotics and Automation, Vol. RA-1, No. 1, pp. 42-56.
Binford, T.O. 1982 'Survey of model-based image analysis systems', The International Journal of Robotics Research, Vol. 1, No. 1, pp. 18-64.
Brady, J.M. (ed.) 1981 Computer Vision, Elsevier Science Publishers, The Netherlands.
Brady, M. 1982 'Computational approaches to image understanding', ACM Computing Surveys, Vol. 14, No. 1, pp. 3-71.
Bolles, R.C. 1981 An Overview of Applications of Image Understanding to Industrial Automation, SRI International, Technical Note No. 242.
Boyle, R.D. and Thomas, R.C. 1988 Computer Vision - A First Course, Blackwell Scientific Publications, Oxford.
Chin, R.T. and Harlow, C.A. 1982 'Automated Visual Inspection: A survey', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-4, No. 6, pp. 557-73.
Christiansen, D. 1984 'The automatic factory', IEEE Spectrum, Vol. 20, No. 5, p. 33.
Connors, R.W., McMillin, C.W., Lin, K. and Vasquez-Espinosa, R.E. 1983 'Identifying and locating surface defects in wood: part of an automated lumber processing system', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 6, pp. 573-83.
Corby, N.R. 1983 'Machine vision for robotics', IEEE Transactions on Industrial Electronics, Vol. IE-30, No. 3, pp. 282-91.
Caudrado, J.L. and Caudrado, C.Y. 1986 'A.I. in computer vision', Byte, Vol. 11, No. 1, pp. 237-58.
Duda, R.O. and Hart, P.E. 1973 Pattern Classification and Scene Analysis, Wiley, New York.
Fairhurst, M.C. 1988 Computer Vision for Robotic Systems, Prentice Hall International (UK), Hertfordshire.
Fischler, M.A. 1981 Computational Structures of Machine Perception, SRI International, Technical Note No. 233.
Gonzalez, R.C. and Safabakhsh, R. 1982 'Computer vision techniques for industrial applications and robot control', Computer, Vol. 15, No. 12, pp. 17-32.
Gonzalez, R.C. and Wintz, P. 1977 Digital Image Processing, Addison-Wesley, Reading, Mass.
Hall, E.L. 1979 Computer Image Processing and Recognition, Academic Press, New York.
Hanson, A.R. and Riseman, E.M. 1978 'Segmentation of natural scenes', in Computer Vision Systems, Hanson, A.R. and Riseman, E.M. (eds), Academic Press, New York.
Hara, Y., Akiyama, N. and Karasaki, K. 1983 'Automatic inspection system for printed circuit boards', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 6, pp. 623-30.
Horn, B.K.P. 1986 Robot Vision, The MIT Press, Cambridge, Massachusetts.
Jarvis, J. 1982 Computer Vision Experiments on Images of Wire-Wrap Circuit Boards, Conference Record of the 1982 Workshop on Industrial Applications of Machine Vision, pp. 144-50.
Kelley, R.B., Birk, J.R., Martins, H.A.S. and Tella, R. 1982 'A robot system which acquires cylindrical workpieces from bins', IEEE Transactions on Systems, Man, and Cybernetics, Vol. SMC-12, No. 2, pp. 204-13.
Kelley, R.B., Martins, H.A.S., Birk, J.R. and Dessimoz, J-D. 1983 'Three vision algorithms for acquiring workpieces from bins', Proceedings of the IEEE, Vol. 71, No. 7, pp. 803-21.
Mahon, J., Harris, N. and Vernon, D. 1989 'Automated visual inspection of solder paste deposition on surface mount technology printed circuit boards', accepted for publication in Computers in Industry.
Moore, F.W. 1987 'Remote visual inspection of nuclear fuel pellets with fibre optics and video image processing', Optical Engineering, Vol. 26, No. 2, pp. 152-5.
Nevatia, R. 1982 Machine Perception, Prentice Hall, New Jersey.
Pau, L.F. 1983 'Integrated testing and algorithms for visual inspection of integrated circuits', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 6, pp. 602-8.
Pau, L.F. 1984 'Approaches to industrial image processing and their limitations', Electronics and Power, February, pp. 135-40.
Pinker, S. (ed.) 1984 'Cognition', The International Journal of Cognitive Psychology, Vol. 18, Nos. 1-3.
Pratt, W.K. 1978 Digital Image Processing, Wiley, New York.
Pugh, A. (ed.) 1983 Robot Vision, IFS (Publications) Ltd, United Kingdom.
Rosenfeld, A. 1984 'Why Computers Can't See (Yet)', Abacus, Vol. 1, No. 1, pp. 17-26, Springer Verlag, New York.
Rosenfeld, A. and Kak, A. 1976 Digital Picture Processing, Academic Press, New York.
Rosenfeld, A. and Kak, A. 1982 Digital Picture Processing, Academic Press, New York.
Sanderson, R.J. 1983 Machine Vision Systems: A Summary and Forecast, Tech Tran Corp., Illinois.
Spoehr, K.T. and Lehmkuhle, S.W. 1982 Visual Information Processing, Freeman and Co., San Francisco.
Suresh, B., Fundakowski, R.A., Levitt, T.S. and Overland, J.E. 1983 'A real-time automated visual inspection system for hot steel slabs', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-5, No. 6, pp. 563-72.
Vernon, D. (Rapporteur) 1987 Sensoric Feedback for Control and Decision Support Systems, Report of a Workshop in the Framework of ESPRIT Conference 1987, Directorate General XIII (eds).
Vernon, D. 1989 'Computers See the Light', Technology Ireland, Vol. 21, No. 1, pp. 21-3.

2
Illumination and sensors

2.1 Illumination

Scene and object illumination play a key role in the machine vision process. The central purpose of imposing controlled, constant illumination is to enhance visually the parts to be imaged so that their flaws, defects, and features are highlighted and so that their identification and classification by the vision system becomes somewhat easier. Although the choice of lighting will typically be application-dependent, some general points may be pertinent.

The common incandescent bulb is probably the simplest source of light. It is cost-effective and it is easily adjusted for light intensity; however, it generally provides directional illumination since it is, to an approximation, a point source of light. Hence, incandescent bulbs cast strong shadows which invariably cause problems for machine vision software. Special bulbs are normally required as degradation in emitted light intensity is common with age. Furthermore, incandescent bulbs emit considerable infra-red radiation; this does not cause problems for humans as we are not sensitive to such light, but some camera sensors, particularly so-called CCD cameras, are sensitive and visual data can be washed out by the reflected infra-red rays.

For most machine vision applications, a diffuse source of light is the most suitable. Diffuse lighting is non-directional and produces a minimum amount of shadow. Fluorescent lighting is the simplest and most common method of obtaining diffuse illumination and is especially good for providing illumination of large areas.

In situations where the only features that need to be inspected are evident from the silhouette of the object, back-lighting is the most appropriate. Back-lighting, e.g. in the form of a light table, provides high contrast between the object and the background upon which the object rests. Its advantage is that it facilitates very simple object isolation or segmentation, a topic to which we will return in Chapter 5.

In some manufacturing environments, it is necessary to inspect moving objects. Depending on the characteristics of the image sensor, it may be necessary to 'freeze' the motion of the object for an instant by the use of a strobe light or electronic flash. The lamp emits a short (1 ms) burst of light; thus the moving object is illuminated for a very short period and appears stationary. The activation of the strobe must be synchronized with the acquisition of the image. Alternatively, you can exploit cameras with a very fast 'shutter speed' or, rather, the electronic equivalent, a very short exposure time. The exposure is usually referred to as the integration time since it is the period over which the sensor integrates or averages the incident light. One would normally choose the latter option of a short integration time, since it is more ergonomic and less disruptive for humans.

□ Control of illumination and light levels

As many inspection systems base much of their analysis on the absolute intensity of the incident light, the control of the object illumination can be important. In particular, if image processing and analysis decisions are being made on the basis of a fixed intensity datum (threshold), then some problems will occur if the illumination, and hence the reflected light, changes. If possible, the vision system should be able to adapt to such changes, although this does not necessarily mean that it should be capable of dealing with dynamic changes. Most illumination systems degrade quite slowly over time and it would be quite satisfactory if the system were capable of self-calibration at the beginning of each day.

Other alternatives exist, however, to this adaptive approach. One solution is to ensure that the illumination does, in fact, remain constant by monitoring it using light meters and adjusting the illumination system appropriately. Alternatively, the aperture of the camera lens might be altered. Note, however, that electronically controlled aperture lenses (so-called auto-iris lenses) should not be employed directly; their function is to alter aperture so that the average amount of light passing through the lens remains constant. This is not appropriate for machine vision systems as the grey-tone shade of a particular feature would vary, depending on the intensity of the ambient lighting.

It was mentioned above that incandescent lighting is not suitable for some cameras and, in general, one should ensure that the lighting system is compatible with the image sensor. For example, mains-powered lighting is inherently 'flickery' due to the a.c. characteristics of the mains electricity supply. Humans do not notice this, in general, because they effectively 'integrate' (or average) the incident illumination over a short period of time. This process also accounts for our ability to view moving objects in cinema films and perceive the motion to be continuous. Machine vision sensors, as we shall see in the next section, do not integrate in quite the same way and, when they acquire the image, the flicker can become apparent. The use of an appropriate (d.c., say) power supply can alleviate this problem, when it does occur.
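One simple form of the self-calibration suggested above is to re-derive the intensity datum each day from an image of a known reference target. The sketch below is an illustrative assumption, not a prescribed procedure; the reference values and the proportional scaling rule are invented for the example.

```python
# Minimal sketch (not from the book): recalibrate a fixed decision threshold
# from a reference patch so that slow lamp degradation does not invalidate
# decisions based on absolute intensity.

def recalibrated_threshold(reference_patch, nominal_reference_mean, nominal_threshold):
    """Scale the nominal threshold by the ratio of today's observed reference
    brightness to its value when the system was commissioned."""
    observed_mean = sum(reference_patch) / len(reference_patch)
    return nominal_threshold * (observed_mean / nominal_reference_mean)

# Example: the white calibration tile read 200 grey-levels when installed;
# today it reads about 180, so the threshold is lowered proportionally.
todays_patch = [182, 179, 181, 178, 180]
print(recalibrated_threshold(todays_patch, nominal_reference_mean=200.0,
                             nominal_threshold=128.0))   # -> about 115
```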

2.2 Sensors
The task of a camera system is to convert an optical picture, typically representing a two-dimensional or three-dimensional scene, into some form suitable for use with electrical systems. Since the camera represents the direct link between the environment being examined and the information-processing system, it plays a particularly significant role and merits a good deal of attention.

2.2.1 Image formation: elementary optics
Before commencing a discussion of cameras, image sensors, and image acquisition components proper, we will first address a few fundamental issues on the optics of


a vision system and of lenses in particular. Lenses are required to focus part of the visual environment onto the image sensor. It is possible to construct an imaging system without lenses, using, for example, a collimated light source to project a shadow of a (small) object onto the sensor, but such a configuration is not typical of the requirements of a vision system. Lenses are defined by their focal length (quoted in millimetres: mm) and their aperture (the f number). These parameters determine the performance of the lens in terms of light-gathering power and magnification, and often have a bearing on its physical size.

The focal length of a lens is a guide to the magnification it effects and its field of view. Selecting the focal length which is appropriate to a particular application is simply a matter of applying the basic lens equation:

1/u + 1/v = 1/f

where

v is the distance from the lens to the image,
u is the distance from the lens to the object,
f is the focal length.

Noting that the magnification factor M is:

M = image size / object size

and, equivalently:

M = image distance / object distance (that is, M = v/u)

Thus:

f = uM / (M + 1)

Hence, if we know the required magnification factor and the distance from the object to the lens, we can compute the required focal length. For example, if a 10 cm wide object is to be imaged on a common 8.8 x 6.6 mm sensor from a distance of 0.5 m, this implies a magnification factor of:

M = 8.8/100 = 0.088

So:

f = (500 x 0.088)/1.088 = 40.44 mm

Thus we require a 40.44 mm focal length lens.
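The calculation above is easily packaged as a small helper. The following sketch simply re-expresses the lens equation and the magnification relation in code; the function names are illustrative and all distances are in millimetres.

```python
# Minimal sketch (not from the book) of the focal-length selection just worked
# through, based on the lens equation 1/u + 1/v = 1/f and M = v/u.

def magnification(object_size_mm, sensor_size_mm):
    return sensor_size_mm / object_size_mm

def required_focal_length(object_distance_mm, m):
    """f = uM / (M + 1), from the lens equation with v = uM."""
    return object_distance_mm * m / (m + 1.0)

# A 100 mm wide object imaged on an 8.8 mm wide sensor from 500 mm away:
m = magnification(100.0, 8.8)                      # 0.088
print(round(required_focal_length(500.0, m), 2))   # 40.44 mm
```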

Typically, one would use a slightly shorter focal length (e.g. 35 mm) and accept the slight loss in resolution due to the larger field of view.

The minimum focal distance is the minimum distance between the front of the lens and the object at which the object can still be in focus. Generally speaking, lenses with short focal lengths have smaller minimum focal distances. If the lens is required to focus on a relatively small object at short distances, the minimum focal distance (typically 300 mm) may be too large. In that case, an extension tube can be used to increase the distance, v, of the lens from the sensor and hence decrease the distance, u, from the lens to the object. For a given focal length, f, the lens equation stipulates that u decreases as v increases, if the image is to remain in focus.

The f-number is a measure of the amount of light allowed to pass through the lens and is a normalized measure of lens aperture. It is defined by the focal length divided by the diameter of the aperture. The standard scale is 1.4, 2, 2.8, 4, 5.6, 8, 11, 16; each increase reduces the amount of light passing through the lens by one-half. The depth of field is the distance between the nearest and the furthest points of a scene which remain in acceptable focus at any one aperture setting. In general, the depth of field gets larger as the focused point moves away from the lens (given constant aperture and focal length). Also, the depth of field increases significantly as the aperture closes (i.e. as the f number increases) for any given focusing distance.

Lenses have many types of standard mounts, for example the Pentax, Olympus, and Nikon bayonet mounts, but the standard on television and CCTV (Closed Circuit TV) cameras is a screw mount called the C mount. Since there is a vast choice of 35 mm photographic lenses, it makes sense to be able to use these photographic lenses with C mount cameras, and there are several types of bayonet adaptors available for C mount cameras. However, it should be remembered that these lenses are usually much more bulky than the miniature lenses which are specifically designed for CCD cameras. The reason for this is that a photographic lens is designed to image a 35 mm format and, for a given focal length, the optical surface must be much larger. CCD sensors (as we will see in the next section) are typically less than 10 mm in width and hence the optical surface can be much smaller.

Many of the visual problems which cause difficulty in interpreting a scene can often be solved by the use of simple filters on the camera lens. Filters are frequently used by the television and video industries to produce special effects, for example, the star-light filter used to make candlelight seem soft and star-like. Filters are just as often used to reduce, or remove, such effects; polaroid sunglasses are probably the most widely used 'filters' and are used to reduce glare. In machine vision, one of the most annoying and potentially disruptive problems is that due to specular reflection (mirror-like reflection on shiny objects). The use of a simple polarizing filter on the camera lens can often reduce the effect of these reflections. Some sensors are sensitive to segments of the electromagnetic spectrum which do not convey visual data, e.g. infra-red radiation. The use of a simple infra-red blocking filter can solve this problem in a simple and effective manner.

2.2.2 Camera sensors
There are essentially two types of video camera available: one, the vidicon, is based on vacuum-tube technology and the other is based on semi-conductor technology. Briefly, a vidicon tube is a photoconductive device which employs a photosensitive sensor layer consisting of several million mosaic cells insulated from one another on a transparent metal film (refer to Figure 2.1). Each cell represents a small capacitor whose charge is a function of incident light. The sensor layer is scanned in a raster format with an electron beam over 625 lines in accordance with the television standard (discussed in the next section). This beam is deflected magnetically by a set of coils outside the tube bulb. The electron beam makes up charge lost through the incidence of light in individual mosaic cells and so generates the video signal at the sensor element. This video signal is simply a continuous analogue signal proportional to the light intensity of the focused image. The camera electronics insert synchronization pulses (syncs.) to indicate scan lines, fields and frame ends (see Section 2.2.3). A distinction is made between the following camera types depending on the sensor element: the standard vidicon has a sensor element comprised of antimony sulphide (Sb2S3), the silicon diode vidicon has a sensor element made from silicon (Si), while the plumbicon has a sensor element made of lead oxide (PbO).

Most solid-state cameras are based on charge-coupled device (CCD) technology, though there are several variations on the theme. In order for the reader to be at least acquainted with the names of these devices, they are listed here:

• charge transfer devices (CTD)
• single transfer devices (STD)
• bucket brigade devices (BBD)
• charge coupled devices (CCD)
• charge injection devices (CID)
• surface charge coupled devices (SCCD)
• bulk charge coupled devices (BCCD)
Figure 2.1 Schematic of the pickup tube of the Vidicon television camera. (Transparent front plate, lens, focusing coil, deflection coil, cathode, deflected electron beam, sensor layer, video signal output.)

Figure 2.2 Interline transfer of charge in CCD sensors. Light-sensitive photosites adjoin shielded CCD shift registers: (1) charges are shifted to the shielded area; (2) the charges in the column are shifted down one cell; (3) a row of charge is then shifted out to the output signal amplifier.

CCD technology is the most widespread, the basic structure of which is that of an analogue shift register consisting of a series of closely spaced capacitors. Charge integration (accumulation) by the capacitors, or photosites, caused by the photons comprising the incident light, provides the analogue representation of light intensity. At the end of the integration period (exposure time) these charges are read out of the sensor. CCD sensors most commonly use one of three addressing strategies: interline transfer, frame transfer, and column-row transfer.

The interline transfer CCD is organized into column pairs of devices. An imaging column of photosensors is adjacent to an opaque vertical shift register (see Figure 2.2). Charge accumulates in the imaging column until the end of the integration period, when it is transferred to the opaque column. The signal then shifts vertically into a horizontal shift register that represents the picture sequentially, line by line. The advantage of the interline transfer is that the transfer time (to opaque storage) is short compared to the integration period. This is desirable because when transfer time approaches the integration time, solid-state sensors tend to exhibit a locally contained spreading of the image response, called smear. Thus, the interline transfer minimizes smear.

In the frame transfer organization (refer to Figure 2.3) the sensor consists of vertical columns of CCD shift registers divided into two zones. One zone, where charge accumulates during integration time, is photosensitive. When integration is complete, the whole array is transferred in parallel to the opaque storage area of the second zone.

A third type of solid-state sensor employs x-y addressing to transfer charge from the photosite to the output signal amplifier. The sensor elements are addressed by selecting individual column and row electrodes. Charge collected under the column electrode is transferred to the row electrode and amplified for output.

Figure 2.3 Frame transfer of charge in CCD sensors: (1) all charges are shifted down to the shielded CCD shift registers; (2) each row of charge is then shifted out as the output signal (to the amplifier).
All the devices discussed so far have been so-called 'area sensors', i.e. the sensor has been a two-dimensional array of photosites. However, there is another important class of solid-state sensor (and camera): the linear array, or line-scan, sensor.


In effect, these sensors are simply a one-dimensional array (row) of photosites, and they use exactly the same technology as the two-dimensional array sensors. They differ in two important characteristics, however. First, these sensors can have between 256 and 4096 photosites in a row and, hence, can achieve much greater resolution than state-of-the-art array cameras. Since they are inherently one-dimensional devices they can only take pictures of slices of a two-dimensional scene, and if a two-dimensional image is required, several such slices must be acquired. Thus, these sensors are best suited to inspection applications in which the scene to be scanned is in continuous linear motion (or, indeed, where the camera itself can be translated). The second point to notice about these sensors is that the video signal they generate does not correspond to any particular video standard: what is produced is essentially a time-varying analogue voltage which represents the incident light along the line of photosites. The repercussion of this characteristic is that most systems which use a line-scan camera tend to have custom-designed computer interfaces, although it is worth noting that matched line-scan cameras and digitizers and, indeed, general-purpose line-scan digitizers are now commercially available (e.g. from Datacube Inc. and Imaging Technology Inc.).

2.2.3 Camera interfaces and video standards

The monochrome video signal standard used in the United States and Japan is RS-170, a subset of the NTSC (National Television Systems Committee) standard. Europe uses the international CCIR (International Radio Consultative Committee) standard, which is similar to, but incompatible with, RS-170. Television video signal standards are, unfortunately, inappropriate for the requirements of machine vision and both standards present essentially the same problem to machine vision applications. However, since the entertainment industry is still a far more lucrative market for camera manufacturers than machine vision is, few image sensors and cameras deviate from television standards. We will have a brief look at the CCIR standard to better understand the limitations it imposes on the imaging system.

Approximately fifty pictures per second are necessary for a flicker-free television picture. As the generation of a complete new picture every 1/50th of a second (20 ms) would place a severe load on the capabilities of the video amplifier, in an attempt to obtain a flicker-free image the odd lines of a picture are scanned first and then, in the following 1/50th of a second, the even lines are scanned. This means that a complete picture is established in 1/25th of a second. This procedure, whereby every alternate line is skipped in each field, is called interlaced scanning. Thus, the number of complete pictures per second, i.e. the picture frame frequency, is 25 Hz; the raster field frequency is 50 Hz. Each frame comprises 625 lines in the CCIR standard (525 in the NTSC RS-170 standard); thus, with twenty-five pictures of 625 lines, 625 x 25 = 15625 lines will be scanned in a second. The so-called line frequency, f, is, hence, 15625 Hz. The signal from a line comprises the picture information, a synchronizing pulse marking the end of a line (horizontal sync. pulse), and a blanking period. The time available for one line is:

    T = 1/f = 1/15625 s = 64 × 10⁻⁶ s = 64 μs

11.5 μs of this are used for the blanking and synchronizing signal. This consists of a 6.5 μs porch blanking signal and a 5 μs sync. pulse. The sync. pulse signals the beginning of one line and the end of another, and the porch represents a quiescent voltage level which prevents the beam appearing as a bright line during the line flyback (when being displayed on a video monitor). The end of the picture consisting of 625 lines, i.e. the frame, is characterized by several picture pulses which are significantly different from the line pulses. During these pulses, the picture is blanked and the beam returns to the top of the picture for the next frame. A (simplified) diagrammatic representation of a video signal is depicted in Figure 2.4. Unfortunately for machine vision, the CCIR standard specifies a 4:3 horizontal-to-vertical aspect ratio for video signals. The television monitor has a similar aspect ratio and thus an image of a square will appear square. However, as we shall see shortly, image acquisition devices often have a 1:1 aspect ratio, meaning that the rectangular video picture is effectively squeezed into a square with a consequent geometric distortion. This has very severe repercussions for vision-based gauging applications, which we will address in Chapter 3.
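To make these timing figures concrete, the following short Python sketch (an illustration only; the variable names are not from the text) simply restates the CCIR arithmetic given above: the 64 μs line period, the portion lost to blanking and synchronization, and the 25 Hz frame rate.

import math

# CCIR (625-line, interlaced) timing arithmetic, as described in the text.
LINES_PER_FRAME = 625
FRAMES_PER_SECOND = 25                  # two interlaced 50 Hz fields per frame

line_frequency = LINES_PER_FRAME * FRAMES_PER_SECOND   # 15 625 Hz
line_period_us = 1.0e6 / line_frequency                # 64 us per line

porch_us = 6.5
sync_us = 5.0
blanking_us = porch_us + sync_us                       # 11.5 us per line
active_line_us = line_period_us - blanking_us          # ~52.5 us of picture

print(f"line frequency   : {line_frequency} Hz")
print(f"line period      : {line_period_us:.1f} us")
print(f"active line time : {active_line_us:.1f} us")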

2.2.4 Characteristics of camera sensors
There are several measures of the usefulness and reliability of a camera sensor; these characteristics will be briefly considered in turn.

Resolution
Given that the purpose of the camera sensor is to convert an incident analogue optical image to some electrical signal, the resolution of a sensor can be defined as the number of optical image elements which can be discriminated by that sensor and represented by the resulting signal. The resolution is limited by the number of photosites in the sensor, since this defines the frequency with which the optical image is sampled. The concept of image sampling is an important one and we will defer detailed discussion of this issue until Chapter 3. For the moment, note that in solid-state cameras the effective resolution of the sensor is approximately half that of the number of photosites in any given direction. Resolution is also limited by the array geometry and by how much opaque material separates the photosites: interline transfer sensors have less effective resolution than frame transfer sensors due to the presence of shielded buffers between each photosensitive line. As we will see in Chapter 3, the resolution of the camera system is also constrained by the limitations imposed by the CCIR video standard and by the sampling frequency of the analogue-to-digital converter which converts the video signal to a digital image. Tube camera resolution is a function of the electron-beam diameter relative to the area of the photoconductive layer. Tube camera resolution is generally higher than that of solid-state cameras and easily outstrips the limitations imposed by the CCIR standard.

Figure 2.4 Composite video signal (showing the reference white, reference black, blanking and sync. voltage levels, and the sync. pulse).

Sensitivity and transfer linearity
The input signal of an image sensor is a distribution of light or brightness. The output signal is a current or voltage which is related to the brightness. The sensitivity of the sensor is defined as the ratio of the output magnitude to the input magnitude. In general, the output will be some power function of the input magnitude:

    output magnitude = (input magnitude)^γ

where γ (gamma) is the exponent of the transfer function which, rearranging the above equation, is given by:

    γ = log(output magnitude) / log(input magnitude)

A linear sensor would have a γ of 1. The following are typical values of transfer linearity for common types of sensor:

    Image sensor       Gamma
    Sb2S3 vidicon      0.6
    PbO plumbicon      0.95
    CCD                1.0
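The power-law transfer function can be sketched in a few lines of Python (an illustration only; the gamma values are those quoted in the table above, and the normalization of brightness to the range 0-1 is an assumption of the example):

import math

def sensor_output(brightness, gamma):
    """Power-law transfer function: output = input ** gamma (0..1 range assumed)."""
    return brightness ** gamma

def estimate_gamma(brightness, output):
    """Rearranged form: gamma = log(output) / log(input)."""
    return math.log(output) / math.log(brightness)

gammas = {"Sb2S3 vidicon": 0.6, "PbO plumbicon": 0.95, "CCD": 1.0}
for name, g in gammas.items():
    out = sensor_output(0.5, g)           # response to a mid-grey input
    print(f"{name:14s} gamma={g:4.2f} output(0.5)={out:.3f} "
          f"estimated gamma={estimate_gamma(0.5, out):.2f}")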

Lag
Lag is often defined as the percentage of the signal current remaining at a certain point of the target some period after the illumination has been switched off. Typical values for an elapsed time of 60 ms are shown below:

    Image sensor       Lag
    Sb2S3 vidicon      20%
    PbO plumbicon      2%-5%
    CCD                1%

The lag introduces a limitation in the practical industrial application of TV cameras in terms of the permissible speed of movement of an object under consideration.

Geometric faults
For television cameras with electron-beam scanning, deviations in the constancy of the vertical and horizontal deflection show up as faults in the geometrical placement of the picture content. Standard industrial cameras are not designed as measuring cameras but to generate a picture for subjective human examination, and they exhibit relatively large geometrical faults. For the standard industrial television camera the error is usually ±1 per cent or ±2 per cent of the picture frame, and it is usually much larger with cheaper cameras. With CCD cameras there are no geometrical faults due to electron-beam scanning; any geometric distortion is due to the lens.

Spectral sensitivity
The spectral sensitivity of an image sensor system is defined as the variation of the output as a function of the wavelength of the incident light. Referring to Figure 2.5, we see that solid-state image sensors are sensitive to light in the near infra-red region of the spectrum and, since such radiation does not carry any useful visual information, an infra-red cut filter should be used with such cameras, particularly if they are to be used with incandescent lighting.

Figure 2.5 Spectral sensitivity of Vidicon and CCD cameras (relative response plotted against wavelength).

Blooming
If the sensor of a television camera is subjected to intensive brightness, then the


excess charge carriers spread into neighbouring zones and light is registered there also. Thus a thin beam of very bright light will be sensed over an area considerably larger than the cross-sectional area of the beam. This effect is called 'blooming' and is especially noticeable with specular (mirror-like) reflections. While tube cameras are most susceptible to blooming, solid-state sensors, too, will exhibit this characteristic under extreme conditions.

Noise
Noise, unwanted interference on the video signal, is defined quantitatively by the signal-to-noise ratio (S/N), i.e. the ratio of the amplitude of the wanted signal to the average amplitude of the interference. For television cameras, the signal-to-noise ratio is defined as the peak-to-peak video signal divided by the effective amplitude of the noise. If one follows general practice in telecommunications and expresses the ratio of electrical variables of the same unit in logarithmic terms, one can compute the level, defined as twenty times the decimal log of the ratio of the linear variables and expressed in decibels (dB). Thus, a signal-to-noise ratio of 20 dB means that the picture quality is very bad (20 dB implies that log10(S/N) equals 1 and, thus, the signal-to-noise ratio is 10:1). For satisfactory quality, a signal-to-noise ratio of 40 dB or more is required.

2.2.5 Commercially available cameras
Table 2.1 summarizes the characteristics of a number of popular CCD cameras which are available at time of writing. We have listed only CCD cameras since these are the most popular for industrial machine vision.

Table 2.1 CCD camera systems

Vendor                    Model      Camera type     Sensor resolution          Output signal                      Interface required
Sony                      XC-77CE    CCD area        756 x 581                  CCIR                               Standard framestore
Sony                      XC-57CE    CCD area        500 x 582                  CCIR                               Standard framestore
Panasonic                 WV50       CCD area        500 x 582                  CCIR                               Standard framestore
Pulnix                    TM-540     CCD area        500 x 582                  CCIR                               Standard framestore
Fairchild                 CCD3000    CCD area        488 x 380                  CCIR                               Standard framestore
Hitachi                   KP-120     CCD area        320 x 240                  CCIR                               Standard framestore
Videk                     Megaplus   CCD area        1320 x 1035                Non-interlaced analogue; digital   Special framestore
Fairchild                 CCD1600R   CCD line-scan   3456 x 1                   Analogue                           Requires camera controller
Honeywell                 HVS 256    CCD line-scan   256 x 1                    RS232; RS422                       Conventional serial port
Fuji                      PJ1        CCD line-scan   2048 x 1                   Analogue; digital flags            Can be directly interfaced to PLC(a)
Analytic Vision Systems   IMS-90     CCD line-scan   1024 x 1; 2048 x 4; 4096   Analogue; digital                  Slow-scan framestore; can be directly interfaced to PLC

(a) PLC: Programmable Logic Controller.

Exercises

1. Identify the basic operational differences between vidicon-tube cameras and solid-state cameras. Which type of camera is more appropriate for use in FMS (flexible manufacturing systems) requiring automated vision for either inspection or robot guidance? Why is this so?

2. What limitations does the European CCIR television standard impose on the effective resolving power of imaging equipment? Is it an appropriate standard for machine vision? Explain.

3. It is required to image a 5 cm wide object using a CCD camera, whose sensor measures 8.8 mm x 6.6 mm, at a distance of 0.5 m. What focal length lens should you choose? What options do you have if you wish to image an object which measures only 1 mm x 1 mm?

References and further reading
Batchelor, B.G., Hill, D.A. and Hodgson, D.C. (eds) 1985 Automated Visual Inspection, IFS (Publications) Ltd, UK.
Blouke, M.M., Corrie, B., Heidtmann, D.L., Yang, F.H., Winzenread, M., Lust, M.L. and Marsh, H.H. 1987 'Large format, high resolution image sensors', Optical Engineering, Vol. 26, No. 9, pp. 837-43.
CCIR, International Radio Consultative Committee 1982 Recommendations and Reports of the CCIR, 1982, XVth Plenary Assembly, Geneva, Vol. XI, Part 1.
Dunbar, P. 1986 'Machine vision', Byte, Vol. 11, No. 1, pp. 161-73.
Herrick, C.N. 1972 Television Theory and Servicing, Reston Publishing Company, Virginia.
Market Intelligence Research Co. 1986 Solid-state Cameras for Machine Vision, Market Intelligence Research Company, 4000 Middle Field Road, Palo Alto, California, 94303, USA. (Vision, SME, Vol. 4, No. 1, p. 12.)
Tassone, J. 1987 'An illumination system for inspecting printed circuit boards with surface mounted components', Vision, SME, Vol. 4, No. 4, pp. 1-5.




3
Image acquisition and representation


3.1 Sampling and quantization
Any visual scene can be represented by a continuous function (in two dimensions) of some analogue quantity. This is typically the reflectance function of the scene: the light reflected at each visible point in the scene. Such a representation is referred to as an image and the value at any point in the image corresponds to the intensity of the reflectance function at that point. A continuous analogue representation cannot be conveniently interpreted by a computer and an alternative representation, the digital image, must be used. Digital images also represent the reflectance function of a scene but they do so in a sampled and quantized form. They are typically generated with some form of optical imaging device (e.g. a camera) which produces the analogue image (e.g. the analogue video signal discussed in the preceding chapter), and an analogue-to-digital converter: this is often referred to as a 'digitizer', a 'frame-store', or 'frame-grabber'. The frame-grabber samples the video signal in some predetermined fashion (usually in an equally spaced square grid pattern) and quantizes the reflectance function at those points into integer values called grey-levels (see Figure 3.1). Each integer grey-level value is known as a pixel and is the smallest discrete accessible sub-section of a digital image. The number of grey-levels in the (equally spaced) grey-scale is called the quantization or grey-scale resolution of the system. In all cases, the grey-scale is bounded by two grey-levels, black and white, corresponding to the minimum and maximum measurable intensity respectively. Most current acquisition equipment quantizes the video signal into 256 discrete grey-levels, each of which is conveniently represented by a single byte. In certain cases, a grey-scale of -128 to +127 is more convenient; processed images need not necessarily represent the reflectance function and pixels may assume negative values, but the grey-level can still be represented by a signed-byte integer.
Figure 3.1 Sampling and quantization.

The sampling density, the number of sampling points per unit measure, is usually referred to as the (spatial) resolution and, since the sampling device is usually arranged as a square grid, it is measured in terms of the number of sampling elements along each orthogonal axis. This normally corresponds to the extent of the number of pixels in both the horizontal and vertical directions. Most current commercial frame-grabbers have spatial resolutions of 512 x 512 pixels. In summary, digital image acquisition equipment is essentially concerned with the generation of a two-dimensional array of integer values representing the reflectance function of the actual scene at discrete spatial intervals, and this is accomplished by the processes of sampling and quantization. Since these are fundamentally important concepts, we will look at them in a little more depth in the remainder of this section.
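The two steps just described can be sketched in Python (an illustration only; the reflectance function below is invented purely for the example, and the 8-bit grey-scale is the usual assumption): a continuous function is sampled on a square grid and each sample is quantized to one of 256 grey-levels.

import math

def reflectance(x, y):
    """A made-up continuous 'scene' used only to illustrate sampling."""
    return 0.5 + 0.5 * math.sin(2 * math.pi * x) * math.cos(2 * math.pi * y)

def acquire(resolution=8, grey_levels=256):
    """Sample on a square grid and quantize each sample into an integer grey-level (a pixel)."""
    image = []
    for i in range(resolution):
        row = []
        for j in range(resolution):
            sample = reflectance(i / resolution, j / resolution)      # sampling
            pixel = min(int(sample * grey_levels), grey_levels - 1)   # quantization
            row.append(pixel)
        image.append(row)
    return image

for row in acquire():
    print(" ".join(f"{p:3d}" for p in row))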

3.1.1 Spatial frequency and the effects of sampling
Recall that the objective of the image acquisition system (which includes the sensor sub-system) is to convert an analogue optical image to a digital image in as faithful a manner as possible. As we have seen, this is achieved by first using the sensor to sample the analogue optical image, generating an analogue video signal, and then by subsequently re-sampling the video signal and generating the digital signal. Thus, there are three factors which can limit the effective resolution and fidelity of the final digital image:

(a) the sensor sampling frequency;
(b) the bandwidth of the video signal;
(c) the sampling frequency of the analogue-to-digital converter, i.e. the frame-grabber.


We shall consider each of these in turn. First, however, we need to look a little closer at the idea of sampling an analogue image and to develop the concept of spatial frequency. We will do this intuitively at first and then we will formalize the ideas somewhat. Readers who find this area interesting should consult Duda and Hart (1973), which contains a particularly lucid account of this topic. High-frequency signals, such as high-pitch soundwaves, periodically change their value over a very short period of time. Similarly, a high-frequency spatial signal, such as an image, periodically changes its value (e.g. its intensity or grey-level) over a very short distance, i.e. it changes abruptly from one grey-level to another. Conversely, low spatial frequencies correspond to 'slower' changes in intensity, where the change occurs gradually from one position in the image to another. To make this idea more formal, consider a spatially unbounded analogue optical image g(x, y). The Fourier transform G(fx, fy) of the image g(x, y) is defined by:

    G(fx, fy) = F{g(x, y)} = ∫∫ g(x, y) exp[-2πi(fx x + fy y)] dx dy        (3.1)

where both integrals are taken from -∞ to +∞.

The inverse Fourier transform of G(fx, fy) is defined by:

    g(x, y) = F⁻¹{G(fx, fy)} = ∫∫ G(fx, fy) exp[2πi(fx x + fy y)] dfx dfy        (3.2)

where, again, both integrals are taken from -∞ to +∞.

The variables fx and fy which identify the domain over which G(fx, fy) is defined are known as the spatial frequencies in the x- and y-directions, respectively, and the domain is known as the spatial frequency domain. What do these variables and this domain represent? The integral in equation (3.2) can be viewed as a 'generalized sum' of complex exponentials, defined in terms of the spatial frequencies fx and fy, and in terms of the spatial coordinates x and y. Each exponential is weighted by a value given by G(fx, fy) and, thus, equation (3.2) is an expansion (or expression) of the analogue optical function g(x, y) in terms of this weighted generalized sum of exponentials. These weights are, in fact, given by equation (3.1), i.e. the Fourier transform of the image function, and will, of course, vary with image content. Since these complex exponentials can also be expressed in terms of sine functions,* a spatial frequency domain which has, for example, just two non-zero values at, say, (fx1, fy1) and (-fx1, -fy1) corresponds to a sinusoidal variation in intensity in the spatial domain, i.e. to an image g(x, y) which comprises sinusoidally undulating 'stripes' of alternating light and dark intensity. The period and orientation of these stripes depend on the exact values of fx1 and fy1.


Figure 3.2 (a) An image comprising a sinusoidal variation in intensity along the x axis; and (b) its Fourier transform, comprising two spatial frequency components (fx1, fy1) and (-fx1, -fy1), both of which are spatial frequencies in the x direction.
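The relationship illustrated in Figure 3.2 can be checked numerically. The sketch below is an illustration only (it uses NumPy, which is obviously not assumed by the text): it builds an image containing a single sinusoidal variation along x and shows that its discrete Fourier transform has just two significant components, at (fx1, fy1) and (-fx1, -fy1).

import numpy as np

N, fx1 = 64, 5                      # image size and spatial frequency (cycles per image width)
x = np.arange(N)
row = np.sin(2 * np.pi * fx1 * x / N)
image = np.tile(row, (N, 1))        # the same sinusoidal row repeated down the image

spectrum = np.fft.fftshift(np.fft.fft2(image))
magnitude = np.abs(spectrum)

# The only significant components lie at +/- fx1 on the fx axis (fy = 0).
peaks = np.argwhere(magnitude > 0.5 * magnitude.max()) - N // 2
print("significant spatial frequency components (fy, fx):")
print(peaks)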


Conversely, an image g(x, y) which comprises a sinusoidal variation in intensity can be expressed in terms of the spatial frequencies (fx1, fy1) and (-fx1, -fy1); see Figure 3.2. The 'quicker' these sinusoidal variations, i.e. the greater the frequency of variation in the spatial domain, the further (fx1, fy1) and (-fx1, -fy1) are from the origin of the G(fx, fy) domain. Of course, a sinusoidal variation in intensity is not a particularly common image. However, Fourier analysis tells us that more complex functions can be constructed by including more terms of varying weight in the 'generalized sum' of exponentials, i.e. by including further spatial frequency components. The exact weight is, again, determined by equation (3.1), i.e. the Fourier transform. An abrupt change in intensity will require the presence of a large number of terms corresponding to high spatial frequencies, many of which are far removed from the origin of the G(fx, fy) domain; see Figure 3.3. Thus, we have now arrived at the interpretation we required. That is: high spatial frequencies correspond to the presence of abrupt, or sharp, changes in the intensity of the image.

The next issue to which we turn is that of sampling. In particular, we would like to know what sampling frequency is required, i.e. how often one needs to sample in the spatial domain in order to represent an image faithfully. Shannon's sampling theorem tells us that a band-limited image (i.e. an image which does not comprise infinitely high spatial frequencies) can be faithfully represented (i.e. reconstructed) if the image is sampled at a frequency twice that of the highest spatial frequency present in the image. This sampling frequency is often referred to as the Nyquist frequency.

We are now in a position to return to address the three issues we raised at the beginning of this section: the effect of the sensor sampling frequency (resolution), the video signal bandwidth, and the analogue-to-digital (frame-grabber) sampling frequency on the effective resolving power of the image acquisition system. First, we now see that a sensor which has, for example, 756 photosites in the horizontal direction, i.e. along a line, will only be capable of representing a sinusoidal variation in intensity which has a spatial frequency of 378 cycles per unit distance. A pattern of 378 alternately light and dark bars with abrupt edges, i.e. discontinuities in intensity, would obviously require a much higher sensor sampling frequency to represent the image faithfully. Second, the resolution of a television picture is also limited by the number of lines and the frequency bandwidth of the video signal. We saw in Chapter 2 that the line frequency for the CCIR standard video signal is 15625 Hz. In addition, the nominal bandwidth of the CCIR standard is 5.0 MHz, meaning that a signal can transmit a video image with five million periodic variations in the signal (brightness) levels. This results in an absolute maximum of 5 x 10^6 / 15625 = 320 periodic (or sinusoidal) variations per line; that is, the maximum spatial frequency which can be faithfully represented by a video signal is 320 cycles per line.

Figure 3.3 (a) An image comprising a step discontinuity in intensity along the x axis; and (b) its Fourier transform, exclusively comprising spatial frequency components fx, i.e. spatial frequencies in the x direction.

Third, a frame-grabber which has, for example, a sampling frequency of 512 pixels in the horizontal direction will only be capable of faithfully representing a sinusoidal variation in intensity which has a spatial frequency of 256 cycles per unit distance. Again, a pattern of 256 alternately light and dark bars with abrupt edges would require a much higher sampling frequency to represent the image faithfully.
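The three limits just described amount to a little arithmetic; the Python sketch below (an illustration only) simply restates the figures from the text in terms of the Nyquist criterion.

def max_cycles(samples_per_line):
    """Nyquist: a line of N samples can represent at most N/2 cycles."""
    return samples_per_line // 2

sensor_photosites = 756
bandwidth_hz = 5.0e6              # nominal CCIR video bandwidth
line_frequency_hz = 15625.0
framegrabber_pixels = 512

print("sensor limit        :", max_cycles(sensor_photosites), "cycles per line")
print("video signal limit  :", int(bandwidth_hz / line_frequency_hz), "cycles per line")
print("frame-grabber limit :", max_cycles(framegrabber_pixels), "cycles per line")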


3.2 Inter-pixel distances
As mentioned in the preceding chapter, video signals assume an aspect ratio of 4:3 (horizontal-to-vertical) and we noted in Section 3.1 that, although framestores with 4:3 aspect ratios are becoming available, they normally use a square aspect ratio of 1:1. The aspect ratio mis-match has serious repercussions for gauging or measuring functions of an inspection system: the rectangular picture is being squeezed into a square image and, hence, the effective distance between horizontal pixels is 4/3 times that between vertical neighbours (see Figure 3.4). The situation becomes even more unsatisfactory when one considers the distance between a pixel and its diagonal neighbour. Most discussions of inter-pixel distances assume the diagonal inter-pixel interval to be √2 (i.e. the length of the hypotenuse completing the right-angle triangle formed by the horizontal inter-pixel interval of length 1 and the vertical inter-pixel interval of length 1). However, if we are working with a framestore which has a square aspect ratio and a video signal which has a 4:3 aspect ratio, then the diagonal inter-pixel interval is, in fact, 5/3, i.e. the length of the hypotenuse completing the right-angle triangle formed by the horizontal inter-pixel interval of length 4/3 and the vertical inter-pixel interval of length 1 (see Figure 3.5). Unfortunately, this is not the complete story. The CCIR standard stipulates that a picture comprises 625 lines. However, only 576 of these carry visual information while the remainder are used for other purposes. Framestores with a vertical resolution of 512 pixels (i.e. 512 lines) do not capture all 576 of these lines of video; they only capture the first 512 of them. This introduces a further distortion, resulting in an effective CCIR aspect ratio of 4:(3 x 512/576) = 4:2.66, or 3:2. The effective vertical, horizontal and diagonal inter-pixel distances are thus 1, 3/2 and √13/2 respectively.
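These distances follow directly from the aspect ratios; the short Python sketch below (illustrative only, with assumed numbers in the final line) reproduces the 4:3 and 3:2 cases and shows how a horizontal extent measured in pixels might be rescaled before being compared with a vertical one.

import math

def interpixel_distances(horizontal_scale):
    """Vertical distance taken as 1; horizontal and diagonal follow from the aspect ratio."""
    vertical = 1.0
    horizontal = horizontal_scale
    diagonal = math.hypot(horizontal, vertical)
    return vertical, horizontal, diagonal

print("4:3 video into 1:1 framestore :", interpixel_distances(4.0 / 3.0))  # 1, 4/3, 5/3
print("576->512 line cropping (3:2)  :", interpixel_distances(3.0 / 2.0))  # 1, 3/2, sqrt(13)/2

# e.g. a horizontal extent of 100 pixels corresponds to 100 * 3/2 = 150 'vertical' units
print("corrected horizontal extent   :", 100 * 3.0 / 2.0)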
Figure 3.4 Distortion arising when acquiring a video signal with a conventional (square) framestore: a 4:3 aspect ratio video camera signal is acquired into a 1:1 aspect ratio framestore and then displayed on a 4:3 aspect ratio video monitor.

Figure 3.5 Inter-pixel distances (vertical 1, horizontal 4/3, diagonal 5/3).
3.3 Adjacency conventions
There is yet another problem which is associated with the representation of digital images: the exact spatial relationship of one pixel with its neighbours. In effect, this problem is one of defining exactly which are the neighbours of a given pixel. Consider the 3 x 3 neighbourhood in an image (shown in Figure 3.6) where the pixels are labelled 0 through 8; which pixels does pixel 8 touch? One convention, called 4-adjacency, stipulates that pixel 8 touches (i.e. is connected to) pixels 0, 2, 4, and 6, but does not touch (i.e. is not connected to) pixels 1, 3, 5, and 7. An alternative convention, called 8-adjacency, stipulates that pixel 8 is connected to all eight neighbours. Adopting either convention universally can, however, lead to difficulties. Figure 3.7 shows part of a digital image depicting, in extremely simplified form, a dark doughnut on a white table. If we apply the 4-adjacency convention, then we have an obvious problem: there are four 'doughnut segments' (two vertical and two horizontal) but none of the segments is touching: the segments are not connected. Applying the 8-adjacency convention, the segments are now connected in a ring (as we would expect) but now so too is the inside of the doughnut connected to the outside: a topological anomaly. In themselves, neither convention is sufficient and it is normal practice to use both conventions: one for an object and one for the background on which it rests. In fact, this can be extended quite generally so that the adjacency conventions are applied alternately to image regions which are recursively nested (or embedded) within other regions as one goes from level to level in the nesting. Thus, one would apply 4-adjacency to the background region, 8-adjacency to the objects resting on the background, 4-adjacency to holes or regions contained within the objects, 8-adjacency to regions contained within these, and so on. This convention means that one never encounters topological anomalies such as the one described in the above example. Because the 8-adjacency convention allows diagonally connected neighbours, measurements of object features (e.g. perimeter) will be more faithful with 8-adjacency. Consequently, if it is possible to stipulate which convention is to be applied to a particular object of interest, one should opt for the 8-adjacency convention.

Figure 3.6 A 3 x 3 pixel neighbourhood.

Figure 3.7 Adjacency conventions: a dark doughnut on a white table.

The preceding discussion has been developed in the context of the acquisition of digital images from area cameras. As we have seen, however, line-scan cameras are extremely important imaging devices, particularly for high-resolution gauging applications. The same sampling and quantization processes must also be used with these linear array cameras; the only difference in this case is that only one line of video information needs to be digitized. Similarly, the discussion of adjacency conventions and inter-pixel distances applies equally. Image acquisition equipment for line-scan sensors will often be custom-designed for a particular application and configuration, although it is worth reiterating the point made earlier that matched line-scan cameras and digitizers and, indeed, line-scan digitizers are now appearing on the market. These line-scan digitizers are, in fact, general purpose devices which can deal with many different scan rates. They are often referred to as slow-scan digitizers. This is an unfortunate misnomer: they are really variable-scan digitizers and can deal with extremely high data rates.
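The two adjacency conventions described above can be sketched in a few lines of Python (an illustration only; the function names are not from the text): each function returns the neighbours of a pixel under one convention, and the final function applies the alternating, nesting-level-based scheme.

def neighbours_4(i, j):
    """4-adjacency: only the horizontal and vertical neighbours are connected."""
    return [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]

def neighbours_8(i, j):
    """8-adjacency: the four diagonal neighbours are connected as well."""
    return neighbours_4(i, j) + [(i - 1, j - 1), (i - 1, j + 1),
                                 (i + 1, j - 1), (i + 1, j + 1)]

# Alternating convention: 4-adjacency for the background (nesting level 0),
# 8-adjacency for objects resting on it (level 1), 4-adjacency for holes
# inside the objects (level 2), and so on.
def neighbours(i, j, nesting_level):
    return neighbours_8(i, j) if nesting_level % 2 else neighbours_4(i, j)

print(len(neighbours(3, 3, 0)), len(neighbours(3, 3, 1)))   # 4 and 8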

3.4 Image acquisition hardware
A typical machine vision system can be configured in two ways: by building it yourself using off-the-shelf equipment or by buying a complete turnkey system. The former alternative is very cost-effective but a significant amount of work needs to be done to integrate the system, ensuring that the image acquisition and processing devices are correctly matched with the host computer and can be controlled by it. The latter alternative is more expensive but, depending on the application, turnaround time on development should be much reduced. There are two mutually dependent components to be chosen when configuring a machine vision system with off-the-shelf equipment: the CPU and the framestore. Most of the commercial framestores are dedicated to just a few of the more popular computer buses, in particular, the VME bus, the Multibus, the Q-Bus, the IBM PC bus, the Nu-bus (Apple Macintosh II), and to some extent the IBM MicroChannel Architecture used in the newer PS-2 models. Each bus enforces certain restrictions on the way the framestore is used and most of the functional differences are related to the bus structure and the available support equipment (i.e. the framestore sister boards).

Table 3.1 Overview of commercial frame-grabber hardware. The boards compared are the ITI PCVision Plus, ITI Series 100, ITI 150/151 family and ITI 200 family (ITI: Imaging Technology Corporation), the Datacube MaxVideo family, the DT 2851/2858 family, DT 2603, DT 2861 and DT 225 (DT: Data Translation Corporation), and the Matrox MVP-AT/MVP-NB. The features compared include: number of frame buffers; spatial resolution (256 x 256 to 768 x 512, most commonly 512 x 512); bus compatibility (VME, Multibus, Q-bus, PC-AT, Nu-bus); RS-170 and CCIR input; number of grey levels; dedicated graphics plane; I/O-mapped or memory-mapped access; input LUTs; pseudo-colour; pan/scroll/zoom; dedicated video bus; number of video inputs; 4:3 aspect ratio compensation; convolution, erosion and dilation (in some cases provided by sister boards); real-time ALU operations; histogramming; feature extraction (where a 'feature' has an extremely restricted meaning: a region having a specified grey-level); area-of-interest processing; prototyping board; and variable-scan interface.
It would not be appropriate to enter into a discussion of different CPUs here; suffice it to say that most microprocessors are supported by VME bus and Multibus boards (e.g. 68030, 80386, 32032), while Q-bus computers are generally MicroVax based. The IBM PC AT bus is supported by IBM's PC-AT and the multitude of PC clones (most of which are now significantly faster than the AT).

Once an image has been digitized, it is stored on the frame-grabber in memory as a square two-dimensional array. It may now be accessed, processed, and analysed by the host computer. However, it is frequently desirable to process the digital image somewhat before the computer ever uses the information, mainly because the resultant image may be more amenable to subsequent analysis and it may take the computer an inordinately long time to perform such processing. This pre-processing can frequently be accomplished in one frame-time (40 ms) by hardware resident on the frame-grabber or by sister boards for the frame-grabber. This rate of processing is often called real-time image processing, as the image data can be processed at the same rate at which the camera system generates it. Recall, however, the comment in Chapter 1 that a more pragmatic definition of real-time in machine vision is simply that the vision process should keep up with production rates, i.e. if the vision system produces the information as quickly as it is required, it runs in real-time.

To provide the reader with some idea of the capabilities that are provided by some commercial frame-grabbers and sister boards, a brief summary of the main characteristics is given in Table 3.1. Unfortunately, many of the capabilities that are cited refer specifically to image processing techniques which are not covered until the next chapter. To describe them here, however briefly, would be pre-emptive and you should refer again to this section after having read Chapter 4.

Custom commercial systems, which can be bought complete with processing and analysis software, and subsequently programmed to accomplish a required function, tend to be significantly more expensive than a system configured from off-the-shelf equipment. Since such systems are by their very nature unique, a direct comparison is very difficult. However, a brief synopsis of some systems that are available is included so that the reader can form some idea of their capabilities. Universal Instruments Corp. market several printed circuit board (PCB) inspection systems. Model 5511 exploits four high-resolution CCD cameras, each of which views a small portion of the board from a different angle in order to facilitate thorough inspection. The complete board is inspected by moving the PCB on an X-Y table. It uses high-speed strobed light emitting diode (LED) illumination to allow it to take images while the PCB is being moved on the X-Y table. This increases the overall speed of the system quite significantly since it removes the acceleration and deceleration time between snapshots. The quoted inspection performance is an impressive twelve square inches per second. The 5515 is based on the Motorola 68020 (nominally running at 20 MHz) and can use CAD data to drive the inspection strategy. IRI (International Robomation Intelligence) Ltd offer a wide variety of systems, from development vision systems with imaging resolutions of 256 x 256

and 512 x 512, to fully fledged turnkey PCB inspection systems. The development systems are typically based on Motorola 680XX microprocessors and incorporate a real-time operating system and an extensive library of image processing functions. It should be emphasized that this library of functions is a collection of image processing routines, rather than image analysis routines. The distinction will be discussed in detail in the next chapter but suffice it for the present to note that a significant amount of software development will be required in most cases before a fully operational target system can be configured and incorporated in a manufacturing environment. A typical IRI development system, the SD512 Vision System, features 512 x 512 x 8 bit resolution, and can store eight images simultaneously. It can interface to normal video cameras, to line-scan cameras, and to special-purpose video cameras such as the Videk MEGAPLUS 1320 x 1035 sensor. Optional array processors are available to implement convolution and algebraic operations in near real time (250 million operations per second). A development system such as this would normally incorporate a MC68020 25 MHz host processor, a MC68851 memory management unit, 8 Mbytes of memory, a convolution processor, a correlation processor, CCD camera, colour monitor, terminal, 56 Mbyte hard disk, eight camera ports, and 2 Mbytes of image memory. Complete turnkey systems, e.g. the IRI PCB INSPECTOR, would typically cost between two and three times the cost of a development system.

In a similar vein to IRI Ltd, Computer Recognition Systems Ltd, in the United Kingdom, offer a mix of configured development systems and application engineering expertise and have developed machine vision systems for several industries. A typical CRS development system comprises at least a VMEbus workstation (MC68020, 1 Mbyte program RAM, 40 Mbyte hard disk, 5.25" floppy disk drive, Idris operating system, Pascal, and C), two 512 x 512 x 8 bit image memories, an 8 bit resolution frame-grabber, an extensive image processing algorithm library and development software, an edge detector (simple 3 x 3 masks), and four video inputs.

Note that these summaries, and indeed Table 3.1, are provided for illustration only; while every effort has been made to ensure that this information is correct, manufacturers are continually upgrading their products and specific models often re-appear with much enhanced capabilities. As such, no responsibility can be accepted for errors in the functional specifications or prices. You should contact the vendor to ensure that the information is accurate and up-to-date.

3.5 Speed considerations
There are several issues which must be addressed when evaluating vision systems and their potential processing speed. The first of these is obviously the processing power of the host computer. An Intel 80286-based PC-AT will operate about three times faster than an 8086-based PC, while an 80386-based system can deliver nearly ten times the power of a PC. A DEC MicroVAX will outperform an LSI-11 or


PDP-11. Many of the newer, more powerful VME-bus systems are based on Motorola's 68030; but you should be sure to compare the processor clock frequencies and memory access times of otherwise similar CPU boards.

No matter how fast the CPU is, if it can't communicate effectively with the framestore then the effective image acquisition time will become significant, i.e. the image may be grabbed in 1/25th of a second but it may take nearly one second to transfer it to the host. In general, memory-mapped framestores are the most efficient since transfer may not be necessary; this is the distinct advantage of VME systems, such as some of those marketed by Datacube and Imaging Technology. Many boards are input/output (I/O) mapped, however, and image transfer must take place pixel by pixel; some boards attempt to alleviate this bottleneck by offering DMA (direct memory access) transfer capabilities.

Most microprocessor systems do not support floating point arithmetic directly (the Inmos Transputer is one exception) and, if your application requires a significant number of floating point operations, this may be another computational bottleneck. Even if you can get a floating point co-processor, it is best to adhere to integer arithmetic wherever possible when writing image analysis software.

As is evident from the survey of imaging systems in the preceding section, most framestores facilitate simple image manipulation (such as contrast stretching, thresholding, trivial segmentation: see the next chapter for detailed explanation) through the use of look-up tables (LUTs). Each LUT will have an entry for every grey-level that the system can deal with, typically 256 of them. This entry corresponds to the value to which this grey-level is to be changed. When digitizing the video signal each incoming pixel is checked against the LUT, and the table value, not the digitized value, is stored in the frame buffer. Most vendors now offer ALU (arithmetic logic unit) boards which contain high-speed pipe-line processors. Basic operations include 8 bit 2's complement, logical AND, OR, XOR, multiplication, and 16 bit addition. A full frame can usually be processed in one frame-time (40 ms). These boards are intended to provide real-time frame summation, thresholding, contrast enhancement, and relatively fast edge enhancement and filtering. If real-time filtering, such as edge detection, is required then sister boards can often be deployed. It is important to note that these systems generally utilize their own high-speed video bus to effect communication between the frame-grabber and the sister processing boards.

As a concluding note, remember that the application requirements will dictate the effective pixel throughput of the inspection or robot vision system which, in turn, will dictate the required architecture and whether or not the application is feasible. If real-time response is required, as it normally will be, this does not necessarily mean that the system must process the part in, say, 10 ms; it means that the machine vision system must not delay the overall production system. If extremely fast on-the-fly processing is required then one may have to restrict the functionality of the system somewhat to ensure a feasible solution or, alternatively, one may have to deploy dedicated hardware to implement either the initial processing or subsequent analysis, or both.
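As a rough illustration of the 'keep up with production' argument, the Python sketch below (the frame rates are assumptions chosen only to show the arithmetic) budgets the time available per pixel for a given processing rate:

def per_pixel_budget_ns(width, height, frames_per_second):
    """Time available per pixel if every pixel of every frame must be touched."""
    pixels_per_second = width * height * frames_per_second
    return 1.0e9 / pixels_per_second

# Full video-rate processing of a 512 x 512 image: ~6.6 Mpixel/s, ~150 ns per pixel.
print(f"{per_pixel_budget_ns(512, 512, 25):.0f} ns per pixel at 25 frames/s")

# If the production line only needs a decision five times a second, the budget is larger.
print(f"{per_pixel_budget_ns(512, 512, 5):.0f} ns per pixel at 5 inspections/s")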

Exercises
1. What compatibility problems would you encounter in measuring the perimeter of objects when the alternate 8-adjacency/4-adjacency convention is used?

2. What do you understand by the terms 'quantization resolution' and 'sampling density'?

3. Identify two adjacency conventions and discuss the merits of each. What adjacency convention would you choose in an application to measure visually the perimeter of coastlines in ordnance survey maps using a conventional 4:3 aspect ratio CCIR camera and a 1:1 aspect ratio framestore? How would you ensure that your measurement is as accurate as possible, given a fixed field of view?

4. What problems do you encounter when attempting to use line-scan cameras?

5. Dedicated video buses are used to alleviate the communication bottleneck between frame-grabbers and their sister boards. Why?

References and further reading
Agin, G. 1975 An Experimental Vision System for Industrial Application, SRI International, Technical Report No. 103.
Agin, G. 1977 Servoing with Visual Feedback, SRI International, Technical Report No. 149.
Agin, G. 1980 'Computer vision systems for industrial inspection and assembly', Computer, Vol. 13, No. 5.
Beedie, M. 1986 'Image IC detects edges in real time', Electronic Design, May, pp. 50-8.
Duda, R.O. and Hart, P.E. 1973 Pattern Classification and Scene Analysis, Wiley, New York.
Dunbar, P. 1986 'Machine vision', Byte, Vol. 11, No. 1, pp. 161-73.
Giordano, A., Maresca, M., Sandini, G., Vernazza, T. and Ferrari, D. 1985 A Systolic Convolver for Parallel Multiresolution Edge Detection, Internal Report, DIST, University of Genoa.
Giordano, A., Maresca, M., Sandini, G., Vernazza, T. and Ferrari, D. 1987 'VLSI-based systolic architecture for fast Gaussian convolution', Optical Engineering, Vol. 26, No. 1, pp. 63-8.
Healy, P. and Vernon, D. 1988 Very Coarse Granularity Parallelism: Implementing 3-D Vision with Transputers, Proc. Image Processing '88, Blenheim Online Ltd, London, pp. 229-45.
Li, H.F., Tsang, C.M. and Cheung, Y.S. 1983 'A low-cost real-time imaging and processing system', Software and Microsystems, Vol. 2, No. 5, pp. 121-9.
Plessey Semiconductors Ltd 1986 PDSP16401 2-Dimensional Edge Detector, Preliminary product information.
Selfridge, P.G. and Mahakian, S. 1985 'Distributed computing for vision: architecture and a benchmark test', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-7, No. 5, pp. 623-6.



4
Fundamentals of digital image processing

Although the distinction between digital image processing and digital image analysis is not immediately obvious, it is an extremely important one. Image processing can be thought of as a transformation which takes an image into an image, i.e. it starts with an image and produces a modified (enhanced) image. On the other hand, digital image analysis is a transformation of an image into something other than an image, i.e. it produces some information representing a description or a decision. However, digital image analysis techniques are usually applied to previously processed (enhanced) images. Since the analysis of enhanced images is accomplished quickly and easily by human observers, it is often erroneously assumed that the analysis phase is easy, if not trivial. Although image processing techniques often provide more pleasing or more visually accessible images when viewed by a human observer, and the human is able to detect salient features without difficulty, it is essential to realize that the interpretation of such emergent features is simple only in the context of the powerful perceptual abilities of the human visual system. For machine vision systems, the sole purpose of image processing is to produce images which make the subsequent analysis more simple and more reliable. We are not at all interested in how 'well' the image looks. In particular, the image processing phase should facilitate the extraction of information, it should compensate for non-uniform illumination, and, possibly, re-adjust the image to compensate for distortions introduced by the imaging system.

Historically, digital image processing had two main thrusts to its development. One was the natural extension of one-dimensional (temporal) digital signal processing to two (spatial) dimensions. Consequently, two-dimensional signal processing was approached from a mathematical basis, allowing a great deal of rigorous manipulation to be performed using classical linear system theory. The second, more heuristic, thrust considers digital images as a set of discrete sample points, performing arithmetic operations on the individual points. This contrasts with the signal processing approach, which treats images as a discrete representation of a continuous two-dimensional function. We will continue our discussion of digital image processing from the second viewpoint, addressing some simple but useful techniques, and drawing when appropriate from the signal processing approach.

There are, broadly speaking, three distinct classes of operations: point operations, neighbourhood operations, and geometric operations. We mentioned in the preceding chapter that there is a trend to perform as much image processing as possible in hardware, particularly in advanced frame-grabbers or in their sister boards which have access to the image data via a dedicated video bus. Point and neighbourhood operations are typically the type of operations that are performed by these frame-grabber sister boards, while very few systems offer any extensive operations for geometric image processing.

4.1 Point operations
A point operation is an operation in which each pixel in the output image is a function of the grey-level of the pixel at the corresponding position in the input image, and only of that pixel. Point operations are also referred to as grey-scale manipulation operations. They cannot alter the spatial relationships of the image. Typical uses of point operations include photometric decalibration, to remove the effects of spatial variations in the sensitivity of a camera system; contrast stretching (e.g. if a feature or object occupies a relatively small section of the total grey-scale, these point operations can manipulate the image so that it occupies the entire range); and thresholding, in which all pixels having grey-levels in specified ranges in the input image are assigned a single specific grey-level in the output image. As we shall see, these operations are most effectively accomplished using the hardware input look-up tables (LUTs) which are provided by most frame-grabbers.

101 110 134 132 134 130 126 130

100 140 134 132 140 138 133 140

103 120 135 132 140 139 138 150

105 122 131 133 135 150 149 169

107 130 137 133 140 169 163 178

105 130 138 150 156 175 169 185

103 121 120 160 160 170 180 190

110 120 121 155 174 165 185 200

Figure 4.1 Digital image.



4.1.1 Contrast stretching
Consider the contrived digital image shown in Figure 4.1. If we look at the distribution of the grey-levels in this image, we find that the grey-levels lie in the range from 100 to 200. Obviously the complete range is not being used and the contrast of this image would be quite poor. The contrived grey-level histogram shown in Figure 4.2 illustrates graphically this poor use of the available grey-scale. We wish to enhance the contrast so that all grey-levels of the grey-scale are utilized. This contrast stretching is achieved by first shifting all the values so that the actual pixel grey-level range begins at 0, i.e. add to every pixel the difference between the final low value (0) and the initial low value (100): 0 - 100 = -100. The effect of this is shown in Figure 4.3. Next, we scale everything, reassigning values in the range 0-100 to the range 0-255, i.e. we scale by a factor = (255 - 0)/(100 - 0) = 2.55; see Figure 4.4. Thus, in this instance, to stretch the contrast so that all grey-levels in the grey-scale are utilized, one must simply apply the following operation:

    new pixel value := (old pixel value - 100) * 2.55
By way of example, let us consider the application of this stretching to any pixel having either the lowest grey-level (100) in the original image or the highest grey-level (200) in the original image:

    If old pixel value = 100
        new pixel value := (100-100)*2.55 = 0
    If old pixel value = 200
        new pixel value := (200-100)*2.55 = 255

Figure 4.2 Grey-level histogram.

Figure 4.3 Shifted grey-level histogram.

Figure 4.4 Stretched grey-level histogram.


We can express the algorithm a little more formally in pseudo-code as follows:

/* contrast stretching in a 512x512 pixel image */
/* HIGH and LOW are assumed to be the highest and lowest */
/* grey-levels, respectively, in the unstretched image   */
scale_factor := 255 / (HIGH-LOW)
FOR i := 1 TO 512 DO
    FOR j := 1 TO 512 DO
        IF image[i, j] < LOW THEN
            image[i, j] := 0
        ELSE IF image[i, j] > HIGH THEN
            image[i, j] := 255
        ELSE
            image[i, j] := (image[i, j] - LOW) * scale_factor

while the LUT formulation might be written as:

/* contrast stretching using LUT */
scale_factor := 255 / (HIGH-LOW)

/* initialise LUT */
FOR i := 0 TO LOW-1 DO
    LUT[i] := 0
FOR i := LOW TO HIGH DO
    LUT[i] := (i-LOW) * scale_factor
FOR i := HIGH+1 TO 255 DO
    LUT[i] := 255

/* stretch using LUT */
FOR i := 1 TO 512 DO
    FOR j := 1 TO 512 DO
        image[i, j] := LUT[ image[i, j] ]

Figure 4.5 Contrast stretching. The grey-level histogram of the original image (top-left) is shown at the top-right; the stretched image with its histogram is shown below.

As an example, Figure 4.5 shows the results of applying this contrast stretching algorithm to a real digital image. The associated grey-level histograms are displayed to the right of the image. There are many other approaches which can be used for contrast enhancement, e.g. histogram equalization is a technique which computes the histogram of the grey-level distribution in the image and reassigns grey-levels to pixels in an effort to generate a histogram where there are equally many pixels at every grey-level, thereby producing an image with a flat or uniform grey-level histogram.
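Histogram equalization can be expressed in the same spirit as the pseudo-code above; the Python sketch below is an illustration only (it is not the book's code): it builds the cumulative histogram and uses it, suitably scaled, as a look-up table.

def equalize(image, grey_levels=256):
    """Histogram equalization: map each grey-level through the scaled cumulative histogram."""
    histogram = [0] * grey_levels
    for row in image:
        for pixel in row:
            histogram[pixel] += 1

    total = sum(histogram)
    lut, cumulative = [0] * grey_levels, 0
    for g in range(grey_levels):
        cumulative += histogram[g]
        lut[g] = round((grey_levels - 1) * cumulative / total)

    return [[lut[pixel] for pixel in row] for row in image]

# A small low-contrast example: the output spreads the levels over the full grey-scale.
print(equalize([[100, 100, 120, 120], [140, 140, 160, 200]]))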

4.1.2 Thresholding
Some scenes, e.g. those of bare printed circuit boards (PCBs), can be very simple, from a visual point of view, in that they comprise just one or two types of objects: in this case the solder tracks and the circuit board. Suppose we wish to process an image of such a scene so that it exhibits extremely high contrast and so that the solder tracks and circuit board are very easily distinguishable. If we stipulate that there are just two allowable grey-levels, black and white, then this would certainly result in the requisite contrast. The problem is to convert a grey-scale image (typically with 256 grey-levels) into an image with just two levels of grey. Consider again a contrived grey-scale histogram. In Figure 4.6 we can see that all the image


pixels representing the circuit board probably have grey-levels in the range 0-160, while the solder tracks probably have grey-levels in the range 160-255. If we assign all the pixels in the range 0-160 to be black and all pixels in the range 160-255 to be white, we will give effect to this extreme contrast stretching. Happily, we also now have an explicit labelling in our image: the circuit board pixels are labelled (identified) with a grey-level of 0 and the solder is labelled with a grey-level of 255. In effect we have nominated the grey-level 160 as the threshold and the reassignment of pixel values is known as thresholding. The effect of thresholding on the grey-scale histogram can be seen in Figure 4.7; the pseudo-code that follows summarizes the thresholding algorithm:
/* Threshold an image in place (i.e. without an */
/* explicit destination_image) */
FOR i := 1 TO 512 DO
   FOR j := 1 TO 512 DO
      IF image[i, j] > threshold
      THEN image[i, j] := 255
      ELSE image[i, j] := 0


Figure 4.6 Grey-level histogram of a PCB.


The algorithm may also be formulated using a LUT:
/* Threshold an image in place (i.e. without an */
/* explicit destination_image) using a LUT */

/* initialise LUT */
FOR i := 0 TO threshold DO
   LUT[i] := 0
FOR i := threshold+1 TO 255 DO
   LUT[i] := 255

/* threshold using LUT */
FOR i := 1 TO 512 DO
   FOR j := 1 TO 512 DO
      image[i, j] := LUT[ image[i, j] ]

Figure 4.7 Grey-level histogram of a PCB after thresholding.
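For illustration only, the direct and LUT formulations of thresholding might be sketched in Python with NumPy as follows; the function names and the threshold value of 128 are assumptions made here, not taken from the text.

import numpy as np

def threshold_image(image, threshold=128):
    """Return a binary version of an 8-bit grey-scale image.

    Pixels above `threshold` become 255 (object) and the rest become 0
    (background), mirroring the in-place pseudo-code above.
    """
    # Direct formulation: a vectorized comparison replaces the nested loops.
    return np.where(image > threshold, 255, 0).astype(np.uint8)

def threshold_lut(threshold=128):
    """Build the equivalent 256-entry look-up table."""
    lut = np.zeros(256, dtype=np.uint8)
    lut[threshold + 1:] = 255
    return lut

# Usage sketch: binary = threshold_lut(128)[image]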

Note that the subject of thresholding is an important one to which we will return and will discuss in greater detail in Chapter 5, specifically with regard to the selection of the threshold point and to the incorporation of contextual information when applying the threshold. An example of digital image thresholding is shown in Figure 4.8.

Figure 4.8 Thresholding. A grey-scale image (top-left) with its histogram (top-right) is thresholded at a grey-scale value of 128 (bottom-left); the resultant histogram is shown in the bottom-right quadrant.

4.1.3 Noise suppression by image addition

If it is possible to obtain multiple images of a scene, each taken at a different time, and if the scene contains no moving objects, then the noise in the image can be reduced by averaging these images. The rationale is quite simple: in the averaging process the constant part of the image (that which is due to light reflected from stationary objects) is unchanged while the noise, which will in general change from image to image, will accumulate more slowly. The assumptions inherent in this approach are as follows:



1. The noise in each image is indeed uncorrelated.
2. The noise has a zero mean value, i.e. it averages out to an image with a grey-level of zero which contributes nothing to the real image.

With these assumptions, it is possible to show that averaging N images increases the signal-to-noise ratio by a factor of √N. Many commercial framestores incorporate facilities to accomplish this averaging in real time (i.e. as the image is acquired) and, as such, it is worth bearing this technique in mind as it involves very little computational overhead. However, you should also bear in mind the assumptions upon which this technique is based; not all noise has these desirable properties.
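An illustrative Python/NumPy sketch of this frame-averaging technique follows (not from the text; the function name and the use of a list of frames are assumptions made here).

import numpy as np

def average_images(frames):
    """Suppress zero-mean, uncorrelated noise by averaging N frames.

    `frames` is assumed to be a list of identically sized 8-bit grey-scale
    images of a static scene, captured one after another.
    """
    stack = np.stack([f.astype(np.float64) for f in frames])
    mean = stack.mean(axis=0)            # the noise tends towards its zero mean
    return np.clip(np.rint(mean), 0, 255).astype(np.uint8)

The √N improvement only holds under the two assumptions listed above; averaging frames of a scene containing motion, or frames with correlated noise, will not deliver it.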

4.1.4 Background subtraction

Digital image subtraction refers to an operation in which the pixel values of two images are subtracted on a point-by-point basis. It can be useful for the subtraction of a known pattern (or image) of superimposed noise or, indeed, for motion detection: stationary objects cancel each other out while moving objects are highlighted when two images of the same dynamic scene, which have been taken at slightly different times, are subtracted. This process of subtraction of an uninteresting background image from a foreground image containing information of interest is referred to, not surprisingly, as 'background subtraction'.

Photometric decalibration is one of the most important applications of background subtraction. In some circumstances, camera systems exhibit a non-uniform response to light across the field of view. Quite often, this is caused by the lens and is manifested as an image centre which is somewhat brighter than the periphery. This can cause severe problems if one is using thresholding techniques to isolate objects. Since the grey-level which is assumed to correspond to the object may change (from point to point) depending on the position in the image, one solution (apart from replacing the lens) is to model this non-uniform response by, e.g., taking an image of a uniformly shaded surface, identifying the minimum grey-level of this image and subtracting this value from each pixel to generate an image which represents the effective response of the camera. Images which are subsequently acquired can then be processed by subtracting this calibration image from them.
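The photometric decalibration procedure just described might be sketched in Python with NumPy as follows; this is illustrative only, and the function names and 8-bit range are assumptions of the sketch rather than anything prescribed by the text.

import numpy as np

def build_calibration_image(uniform_surface_image):
    """Model the non-uniform response of the camera/lens.

    Following the text: subtract the minimum grey-level of an image of a
    uniformly shaded surface from every pixel, leaving just the variation
    in response across the field of view.
    """
    surface = uniform_surface_image.astype(np.int16)
    return surface - int(surface.min())

def decalibrate(image, calibration_image):
    """Subtract the calibration image from a newly acquired image."""
    corrected = image.astype(np.int16) - calibration_image
    return np.clip(corrected, 0, 255).astype(np.uint8)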

4.2 Neighbourhood operations

A neighbourhood operation generates an 'output' pixel on the basis of the pixel at the corresponding position in the input image and on the basis of its neighbouring pixels. The size of the neighbourhood may vary: several techniques use 3 x 3 or 5 x 5 neighbourhoods centred at the input pixel, but many of the more advanced and useful techniques now use neighbourhoods which may be as large as 63 x 63 pixels. The neighbourhood operations are often referred to as 'filtering operations'. This is particularly true if they involve the convolution of an image with a filter kernel or mask. Such filtering often addresses the removal (or suppression) of noise or the enhancement of edges, and is most effectively accomplished using convolver (or filtering) hardware, available as sister boards for most frame-grabbers. Other neighbourhood operations are concerned with modifying the image, not by filtering it in the strict sense, but by applying some logical test based on, e.g. the presence or absence of object pixels in the local neighbourhood surrounding the pixel in question. Object thinning, or skeletonizing, is a typical example of this type of operation, as are the related operations of erosion and dilation, which, respectively, seek to contract and enlarge an object in an orderly manner. Since convolution is such an important part of digital image processing, we will discuss it in detail before proceeding.

4.2.1 Convolution
The convolution operation is much used in digital image processing and, though it can appear very inaccessible when presented in a formal manner for real continuous functions (signals and images), it is quite a simple operation when considered in a discrete domain. Before discussing discrete convolution, let us consider why one is interested in the operation in the first place. The two-dimensional convolution integral, which is given by the equation:

g(i, j) = f * h = ∫∫ f(i - m, j - n) h(m, n) dm dn

(with both integrals taken from -∞ to +∞) embodies the fact that the output g of a shift-invariant linear system (i.e. most optical, electronic and optoelectronic systems, to a good approximation, and most filtering techniques) is given by the 'convolution' or application of the input signal f with a function h which is characteristic of the system. Thus, the form of g depends on the input f (obviously) and on the form of the system h through which it is being passed. The relationship is given by the convolution integral. The function h is normally referred to as the filter, since it dictates what elements of the input image are allowed to pass through to the output image. By choosing an appropriate filter, we can enhance certain aspects of the output and attenuate others. A particular filter h is often referred to as a 'filter kernel'.

Attempts to form a conceptual model of this convolution operation in one's mind are often fruitless. However, it becomes a little easier if we transfer now to the discrete domain of digital images and replace the integral with the sigma (summation) operator. Now the convolution operation is given by:

g(i, j) = f * h = Σm Σn f(i - m, j - n) h(m, n)

The summation is taken only over the area where f and h overlap. This multiplication and summation is illustrated graphically in Figure 4.9. Here, a filter kernel h is a 3 x 3 pixel image, with the origin h(0, 0) at the centre, representing a mask of nine distinct weights: h(-1, -1) ... h(+1, +1); see Figure 4.10. The kernel or mask is superimposed on the input image, the input image pixel values are multiplied by the corresponding weight, the nine results are summed, and the final value of the summation is the value of the output image pixel at a position corresponding to the position of the centre element of the kernel. Note that the convolution formula requires that the mask h be first rotated by 180° since, e.g., f(i - 1, j - 1) must be multiplied by h(1, 1), f(i - 1, j) must be multiplied by h(1, 0), ..., and f(i + 1, j + 1) must be multiplied by h(-1, -1). Quite often, the rotation by 180° is omitted if the mask is symmetric. The algorithm to generate g is given by the following pseudo-code:
FOR i := (low_limit_of_i + 1) TO (high_limit_of_i - 1) DO
   FOR j := (low_limit_of_j + 1) TO (high_limit_of_j - 1) DO
      /* for each feasible point in the image ... form the convolution */
      temp := 0
      FOR m := -1 TO +1 DO
         FOR n := -1 TO +1 DO
            temp := temp + F[i-m, j-n] * H[m, n]
      G[i, j] := temp
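For readers who prefer executable code, the following Python/NumPy sketch implements the same discrete convolution. It is illustrative only: the function name, the generalization to an arbitrary odd-sized kernel, and the choice to leave un-computable border pixels at zero are all assumptions made here.

import numpy as np

def convolve2d(f, h):
    """Discrete 2-D convolution of image f with a small kernel h.

    Implements g(i, j) = sum_m sum_n f(i - m, j - n) h(m, n): the kernel is
    rotated by 180 degrees, as the convolution formula requires.  Border
    pixels for which the kernel does not fit are left at zero.
    """
    f = f.astype(np.float64)
    kh, kw = h.shape
    rh, rw = kh // 2, kw // 2
    h_rot = h[::-1, ::-1]                      # the 180-degree rotation
    g = np.zeros_like(f)
    for i in range(rh, f.shape[0] - rh):
        for j in range(rw, f.shape[1] - rw):
            window = f[i - rh:i + rh + 1, j - rw:j + rw + 1]
            g[i, j] = np.sum(window * h_rot)
    return g

# e.g. smoothing with a 3 x 3 local-average kernel:
# smooth = convolve2d(image, np.full((3, 3), 1.0 / 9.0))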
Figure 4.9 Convolution.

Before returning to the discussion of neighbourhood operators, a short note about the relationship of the convolution operator to classical two-dimensional digital signal processing is appropriate. This can be safely skipped without any loss of continuity. All image systems introduce some amount of distortion into an image, for example image blur due to camera shake. Since optical and electrical systems are (to an approximation) linear, the imaging system may also be assumed to be linear. This implies that, in the language of linear system theory, it has some transfer function H(ωx, ωy). This is the frequency domain equivalent of the system impulse response h(x, y) in the spatial domain; h(x, y) typifies how the system would respond if a single unit spike were input to it. Thus, the transfer function describes how some input image is transformed into some output image. Since the output of this system is a distortion of the correct image we would expect that if we could model the system, i.e. define its transfer function, compute the inverse, and apply it to the distorted image, we would arrive at a much better approximation to the original. This is accomplished by convolving the inverse impulse response with the image: this operation is referred to as a 'deconvolution'. Alternatively, the image may be transformed to the frequency domain using the digital Fourier transform, multiplied by the inverse of the transfer function, and transformed back to the spatial domain by use of the inverse Fourier transform.

Figure 4.10 3 x 3 convolution filter h, comprising the weights h(-1, -1) ... h(+1, +1).

Figure 4.11 Local average mask (a 3 x 3 mask with each weight equal to 1/9).

4.2.2 Noise suppression
If we define noise as any unwanted contamination of an image then the mechanism by which noise is removed depends on the assumption we make regarding the form of this unwanted contamination. One of the most common (and realistic) assumptions made about noise is that it has a high spatial frequency (refer back to Section 3.1.1). In this case, it is often adequate to apply a low-pass spatial filter which will attenuate the higher spatial frequencies and allow the low spatial frequency component to pass through to the resultant destination image. Of course, if the image itself exhibits high spatial frequencies then it will be somewhat degraded after filtering. These low-pass filters can be implemented by convolving the image with some simple mask; the mask values constitute the weighting factors which will be applied to the corresponding image point when the convolution is being performed. For example, each of the mask values might be equally weighted, in which case the operation we are performing is simply the evaluation of the local mean of the image in the vicinity of the mask.

Figure 4.11 shows this local neighbourhood average mask and Figure 4.12 illustrates the application of the mask to part of an image. Referring to Figure 4.12, we can see that the result of this filtering, i.e. the value of the output pixel which would be placed in the output image at the same position as the input pixel corresponding to the centre position of the mask, is:

101*1/9 + 100*1/9 + 103*1/9 + 110*1/9 + 140*1/9 + 120*1/9 + 134*1/9 + 134*1/9 + 135*1/9

Figure 4.12 Image smoothing using local average mask.

which is equivalent to:

1/9 * [101 + 100 + 103 + 110 + 140 + 120 + 134 + 134 + 135]

which is equal to 121. Thus the central point becomes 121 instead of 140 and the image will appear much smoother. This averaging or smoothing is, of course, applied at all points of the image. Occasionally, it may be more useful to apply this smoothing subject to some condition, e.g. the centre pixel is only assigned if the difference between the average value and the original pixel value is greater than a previously set threshold. This goes some way towards removing noise without smoothing out too much of the detail in the original image. The algorithm may be expressed in pseudo-code as follows:

/* For a 512x512 image, indexed from 0 to 511 */
FOR i := 1 TO 510 DO
   FOR j := 1 TO 510 DO
      /* for each feasible point in the image: compute average */
      average := (source_image[i-1, j-1] + source_image[i-1, j] + source_image[i-1, j+1] +
                  source_image[i,   j-1] + source_image[i,   j] + source_image[i,   j+1] +
                  source_image[i+1, j-1] + source_image[i+1, j] + source_image[i+1, j+1]) / 9
      /* destination image assumes the average value */
      destination_image[i, j] := average

There are other noise-suppression techniques, of course. For example, median filtering is a noise-reducing technique whereby a pixel is assigned the value of the median of pixel values in some local neighbourhood. The size of the neighbourhood is arbitrary, but neighbourhoods in excess of 3 x 3 or 5 x 5 may be impractical from a computational point of view since the evaluation of the median requires that the image pixel values be first sorted. In general, the median filter is superior to the mean filter in that image blurring is minimized. Unfortunately, it is computationally complex and is not easily effected in hardware and thus tends not to be used much in machine vision.

Gaussian smoothing, whereby the image is convolved with a Gaussian function, is perhaps one of the most commonly used smoothing techniques in advanced computer vision since it possesses several useful properties. Unfortunately, it is not yet widely used in industrial machine vision because the sizes of the masks are very large and the processing is consequently computationally intensive. It should be noted, however, that some vendors do offer dedicated image processing boards for Gaussian filtering and, if such hardware is available, this type of smoothing should be considered. The Gaussian function G(x, y) is defined by:

G(x, y) = (1 / 2πσ²) exp[-(x² + y²) / 2σ²]

where σ defines the effective spread of the function: Gaussian functions with a small value for σ are narrow, whereas those with a large value for σ are broad. Figure 4.13 illustrates the shape of a Gaussian function with σ = 1.5, 3.0, and 6.0 pixels respectively. The mask weights are included for reference. Note that, since the Gaussian function is defined over an infinite support, i.e. it has non-zero (but very small) values at x, y = ±∞, we must decide at what point to truncate the function. Normally, one chooses the quantization resolution of the function by deciding on the integer number which will represent the maximum amplitude at the centre point (e.g. 1000), and then one chooses the mask size which includes all non-zero values. Typically, a mask size of 23 x 23 pixels would be required to represent a two-dimensional Gaussian function with a value of 3.0 pixels for σ. Fortunately, there is no need to use two-dimensional Gaussian functions since the convolution of a two-dimensional Gaussian can be effected by two convolutions with one-dimensional Gaussian functions. Specifically:

G(x, y) * I(x, y) = G(x) * {G(y) * I(x, y)}

Thus, the image is first convolved with a 'vertical' Gaussian and then the resulting image is convolved with a 'horizontal' Gaussian. Why is the Gaussian such a popular smoothing function? The reason is quite straightforward. While the primary purpose of a smoothing function is to reduce noise, it is often desirable to be able to choose the resolution at which intensity changes are manifested in the image, i.e. to choose the level of detail which is retained in the image. For example, an image which has been smoothed just a little (small σ) will retain a significant amount of detail, while one which has been smoothed a great deal (large σ) will retain only the gross structure (you can accomplish the same thing yourself by squinting). In more formal terms, we wish to sharply delimit the spatial frequencies (see Section 3.1.1) which are present in the image, i.e. to localize the spatial frequency bandwidth of the image. On the other hand, we also need to ensure that the smoothing function does not distort the image excessively by smearing the features. To allow for this, we need to ensure that

the function has a limited support in space. The Gaussian function optimizes the trade-off between these two conflicting requirements.
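The separable formulation can be sketched in Python with NumPy as follows. This is illustrative only: the 3σ truncation used here is a common convention and differs slightly from the quantization-based truncation described in the text, and the function names are invented for the sketch.

import numpy as np

def gaussian_mask(sigma):
    """One-dimensional Gaussian mask, truncated at about 3 sigma."""
    radius = int(np.ceil(3.0 * sigma))
    x = np.arange(-radius, radius + 1)
    mask = np.exp(-(x * x) / (2.0 * sigma * sigma))
    return mask / mask.sum()               # normalize by the sum of the weights

def gaussian_smooth(image, sigma):
    """Separable Gaussian smoothing: a 'vertical' pass then a 'horizontal' pass."""
    mask = gaussian_mask(sigma)
    img = image.astype(np.float64)
    # Convolve each column with the 1-D mask, then each row of the result.
    vertical = np.apply_along_axis(lambda c: np.convolve(c, mask, mode='same'), 0, img)
    return np.apply_along_axis(lambda r: np.convolve(r, mask, mode='same'), 1, vertical)

Separating the two-dimensional convolution in this way reduces the work per pixel from the square of the mask width to twice the mask width, which is the practical reason the separable form is preferred.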
Figure 4.13 The Gaussian function for three values of σ (1.5, 3.0, and 6.0) together with their corresponding discrete one-dimensional masks; note that the result of convolution with these masks should be normalized by dividing by the sum of the mask weights.

4.2.3 Thinning, erosion, and dilation
Thinning is an iterative neighbourhood operation which generates a skeletal representation of an object. It assumes, of course, that you know exactly what constitutes the object in the image and what constitutes the background (i.e. everything which is not part of the object). Such an image is said to have been segmented into its component parts: the topic of segmentation is extremely important and will be discussed in detail in the next chapter. The skeleton of an object may be thought of as a generalized axis of symmetry of the object and hence it is a suitable representation for objects which display obvious axial symmetry. The medial axis transform (MAT) proposed by Blum is one of the earliest and most widely studied techniques for generating the skeleton. More recently, Brady has introduced the related (but extended) concept of smoothed local symmetries (SLS) which we will discuss in the last section of Chapter 7. The skeleton is frequently used as a shape descriptor which exhibits three topological properties: connectedness (one object generates one skeleton); invariance to scaling and rotation; and information preservation in the sense that the object can be reconstructed from the medial axis.

The concept of thinning a binary image - an image comprising just two grey-levels: black and white - of an object is related to such medial axis transformations in that it generates a representation of an approximate axis of symmetry of a shape by successive deletion of pixels from the boundary of the object. In general, this thinned representation is not formally related to the original object shape and it is not possible to reconstruct the original boundary from the object. Thinning can be viewed as a logical neighbourhood operation where object pixels are removed from an image. Obviously, the removal must be constrained somewhat so that we have a set of conditions for pixel removal.

The first restriction is that the pixel must lie on the border of the object. This implies that it has at least one 4-connected neighbouring pixel which is a background pixel. The removal of pixels from all borders simultaneously would cause difficulties: for example, an object two pixels thick will vanish if all border pixels are removed simultaneously. A solution to this is to remove pixels of one border orientation only on each pass of the image by the thinning operator. Opposite border orientations are used alternately to ensure that the resultant skeleton is as close to the medial axis as possible.

The second restriction is that the deletion of a pixel should not destroy the object's connectedness, i.e. the number of skeletons after thinning should be the same as the number of objects in the image before thinning. This problem depends on the manner in which each pixel in the object is connected to every other pixel. A pixel can be considered to be connected to, and a component of, an object if it has



a grey-level of 255 and at least one adjacent object pixel (note that we are assuming object pixels have a grey-level of 255 - white - and background pixels have a grey-level of 0 - black). Consider now the 5-pixel object shown in Figure 4.14. The pixel C 'connects' the two object segments AB and ED; that is, if C were removed then this would break the object in two; this pixel is 'critically connected'. Obviously, this property may occur in many more cases than this, and critical connectivity may be characterized as follows. Given a pixel, labelled 8, and its eight adjacent neighbours, labelled 0-7 (see Figure 3.6), assume that writing the pixel number (e.g. 7) indicates presence, i.e. it is an object pixel, whereas writing it with an overbar (e.g. 7̄) indicates absence, i.e. it is a background pixel. Assume, also, normal Boolean logic sign conventions (+ indicates logical OR, and . indicates logical AND). Then pixel 8 is critically connected if the following expression is true:

8 . { [(1 + 2 + 3).(5 + 6 + 7).4̄.0̄] + [(1 + 0 + 7).(3 + 4 + 5).2̄.6̄] + [3.(5 + 6 + 7 + 0 + 1).2̄.4̄] + [1.(3 + 4 + 5 + 6 + 7).2̄.0̄] + [7.(1 + 2 + 3 + 4 + 5).0̄.6̄] + [5.(7 + 0 + 1 + 2 + 3).4̄.6̄] }

Figure 4.15 depicts these six neighbourhood conditions which correspond to the presence of critical connectivity. Hence, the second restriction implies that if a pixel

is critically connected then it should not be deleted.

Figure 4.14 A critically connected object.
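The Boolean expression above translates almost directly into code. The following Python sketch is illustrative only: the neighbour labelling (0-7 running consecutively around the centre, so that 0 and 4, and 2 and 6, are opposite 4-connected neighbours) is assumed from the structure of the expression since Figure 3.6 is not reproduced here, and the complemented (overbarred) terms are interpreted as background pixels.

def critically_connected(n):
    """Evaluate the critical-connectivity test for the centre pixel.

    `n` is a sequence of the eight neighbour values indexed 0-7, where a
    true value means 'object pixel'.  An overbar in the text corresponds
    to `not n[k]` here; the labelling convention is an assumption.
    """
    conditions = [
        (n[1] or n[2] or n[3]) and (n[5] or n[6] or n[7]) and not n[4] and not n[0],
        (n[1] or n[0] or n[7]) and (n[3] or n[4] or n[5]) and not n[2] and not n[6],
        n[3] and (n[5] or n[6] or n[7] or n[0] or n[1]) and not n[2] and not n[4],
        n[1] and (n[3] or n[4] or n[5] or n[6] or n[7]) and not n[2] and not n[0],
        n[7] and (n[1] or n[2] or n[3] or n[4] or n[5]) and not n[0] and not n[6],
        n[5] and (n[7] or n[0] or n[1] or n[2] or n[3]) and not n[4] and not n[6],
    ]
    return any(conditions)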
A thinning algorithm should also preserve an object's length. To facilitate this, a third restriction must be imposed such that arc-ends, i.e. object pixels which are adjacent to just one other pixel, must not be deleted. Note that a thinned image should be invariant under the thinning operator, i.e. the application of the thinning algorithm to a fully thinned image should produce no changes. This is also important as it provides us with a condition for stopping the thinning algorithm. Since the pixels of a fully thinned image are either critically connected or are arc-ends, imposing the second and third restrictions allows this property to be fulfilled. The final thinning algorithm, then, is to scan the image in a raster fashion, removing all object pixels according to these three restrictions, varying the border orientation from pass to pass. The image is thinned until four successive passes (corresponding to the four border orientations) produce no changes to the image, at which stage thinning ceases. Figure 4.16 illustrates the result of thinning an object in a binary image.

The concepts of erosion and dilation are related to thinning in the sense that erosion can be considered as a single pass of a thinning operator (stripping object pixels of all four border orientations). However, there is one significant difference in that, for most applications, one does not mind if the object breaks in two and, hence, the complex check for critical connectivity is no longer required. In fact, it is this property of object splitting that makes the erosion operation particularly useful.

Figure 4.15 Neighbourhoods exhibiting critical connectivity.


Figure 4.17 PCB track with extraneous copper.

Figure 4.16 A grey-scale image (top-left) is thresholded to produce a binary image (top-right) which is then thinned (bottom-right). The image at the bottom-left is an intermediate partially thinned version.

The dilation operation effects the reverse of erosion, i.e. an expansion of the object into all those background pixel cells which border the object. Thus, erosion 'shrinks' an object while dilation 'enlarges' it. This interpretation of erosion and dilation, although common, is quite a loose one. The operations of erosion and dilation do have a much more formal meaning in the context of a branch of image processing referred to as mathematical morphology. We will return to mathematical morphology in Section 4.4 but, for the present, we will continue with the informal treatment.

To see the usefulness of erosion and dilation operations, consider the following common printed circuit board inspection problem. Typically, in PCB manufacturing, a film negative depicting the appropriate pattern of electrical contacts (pads) and conductors (tracks) is contact-printed on a copper-clad board which has been covered with a photoresistive solution. The board is then etched, leaving just the copper circuit patterns. Such a process can lead to many problems with the final circuit pattern, e.g. the track may be broken, it may be too wide or too thin, or there may be spurious copper remaining on the PCB. These potential faults (shorts and breaks) are impossible for conventional functional (electrical) testing to detect and it is for this reason that these visual techniques are so useful.

Figure 4.18 Dilated PCB track.

Figure 4.19 PCB track with a neck.


Figure 4.20 Eroded PCB track.

If the tracks (and/or pads) are too large or have extraneous copper attached (see Figure 4.17), then dilating the image a number of times will cause the two tracks to merge (see Figure 4.18) and a subsequent analysis of the track connectivity will identify this potential fault. Conversely, a track which is too thin or has a neck (see Figure 4.19) will break when eroded (Figure 4.20). Similar connectivity analysis will identify this potential circuit break. A pseudo-coded algorithm for erosion can be formulated as follows:


FOR all pixels in the image
   IF the pixel is an object pixel AND all its neighbours are object pixels
   THEN copy it to the destination image


while a dilation algorithm can be formulated as:

FOR all pixels in the image
   IF the pixel is an object pixel
   THEN make it and its eight neighbours object pixels in the destination image

Figure 4.21 illustrates the effect of several applications of these erosion and dilation algorithms to an object in a binary image.

Figure 4.21 A grey-scale image (top-left) is thresholded to produce a binary image (top-right) which is then eroded twice (bottom-left). The application of two passes of the dilation algorithm is shown at bottom-right.
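As a rough illustration of the two pseudo-coded operations above, here is a sketch in Python with NumPy (not from the text). It assumes object pixels have grey-level 255, and it uses np.roll for the neighbour shifts, which wraps around at the image border; for real use the outermost rows and columns should be treated as background.

import numpy as np

def erode(binary):
    """One pass of erosion: keep a pixel only if it and all eight of its
    neighbours are object pixels (value 255)."""
    obj = (binary == 255)
    keep = obj.copy()
    # AND each pixel with its eight neighbours by shifting the object mask.
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            keep &= np.roll(np.roll(obj, di, axis=0), dj, axis=1)
    return np.where(keep, 255, 0).astype(np.uint8)

def dilate(binary):
    """One pass of dilation: an object pixel turns itself and its eight
    neighbours into object pixels."""
    obj = (binary == 255)
    grow = obj.copy()
    for di in (-1, 0, 1):
        for dj in (-1, 0, 1):
            grow |= np.roll(np.roll(obj, di, axis=0), dj, axis=1)
    return np.where(grow, 255, 0).astype(np.uint8)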

4.3 Geometric operations

Geometric operations change the spatial relationships between objects in an image, i.e. the relative distances between points a, b and c will typically be different after a geometric operation or 'warping'. The applications of such warping include geometric decalibration, i.e. the correction of geometric distortion introduced by the imaging system (most people are familiar with the barrel distortion that arises in photography when using a very short focal length 'fish-eye' lens), and image registration, i.e. the intentional distortion of one image with respect to another so that the objects in each image superimpose on one another. The techniques used in both these applications are identical and will be discussed in detail before describing actual usage.

4.3.1 Spatial warping
The approach to geometric image manipulation described here is called spatial warping and involves the computation of a mathematical model for the required



distortion, its application to the image, and the creation of a new corrected (decalibrated or registered) image. The distortion may be specified by locating control points (also called fiducial points) in the input image (the image to be warped) and identifying their corresponding control points in an ideal (undistorted or registered) image. The distortion model is then computed in terms of the transformation between these control points generating a spatial warping function which will allow one to build the output image pixel by pixel, by identifying the corresponding point in the input image. Since, in general, the estimates of the coordinates of input pixels yielded by the warping function will not correspond to exact (integer) pixel locations, we also require some method of estimating the grey-level of the output pixel when the 'corresponding' pixel falls 'between the integer coordinates' (see Figure 4.22). The question is: how do the four pixels surrounding the computed point contribute to our estimate of its grey-level, i.e. how do we interpolate between the four? We will return to this question later; suffice it at present to summarize the two requirements of geometric operations:

(a) a spatial transformation which allows one to derive the position of a pixel in the input which corresponds to the pixel being 'filled' or generated in the output;
(b) an interpolation scheme to estimate the grey-level of this input pixel.

Note that the grey-level interpolation algorithm may be permanently established in the software whereas the spatial transformation will change as the environment changes (e.g. camera, lens, registration requirements).

Figure 4.22 Spatial transformation and grey-level interpolation.

4.3.1.1 The spatial transformation

The spatial transformation is expressed in general form as a mapping from a point (x, y) in the output image to its corresponding (warped) position (i, j) in the input image:

(i, j) = (Wx(x, y), Wy(x, y))

That is, the first coordinate, i, of the warped point is a function of the current position in the output; likewise for the second coordinate, j. Thus, given any point (x, y) in the output image, the coordinates of the corresponding point in the input image may be generated using the warping functions Wx and Wy respectively. It would, of course, be ideal if we had some analytic expression for Wx and Wy but this is rarely the case. Instead, we normally model each spatial warping function by a polynomial function. So, we assume that, for example, the warping functions are given by the following equations:

Wx(x, y) = Σ(p=0..n) Σ(q=0..n) a_pq x^p y^q
Wy(x, y) = Σ(p=0..n) Σ(q=0..n) b_pq x^p y^q

For example, if n = 2 (which is adequate to correct for most distortions):

Wx(x, y) = a00 x^0 y^0 + a10 x^1 y^0 + a20 x^2 y^0 + a01 x^0 y^1 + a11 x^1 y^1 + a21 x^2 y^1 + a02 x^0 y^2 + a12 x^1 y^2 + a22 x^2 y^2
Wy(x, y) = b00 x^0 y^0 + b10 x^1 y^0 + b20 x^2 y^0 + b01 x^0 y^1 + b11 x^1 y^1 + b21 x^2 y^1 + b02 x^0 y^2 + b12 x^1 y^2 + b22 x^2 y^2

Now, the only thing that remains to complete the specification of the spatial warping function is to determine the values of these coefficients, i.e. to compute a00-a22 and b00-b22. To do this, we have to assume that we know the transformation exactly for a number of points (at least as many as the number of coefficients; nine in this case), that is, to assume that we know the values of x and y and their corresponding i


and j values. We then write the relationships explicitly in the form of the two equations above. We then solve these equations simultaneously to determine the value of the coefficients. Remember that we have two sets of simultaneous equations to set up: one for the 'a' coefficients and one for the 'b' coefficients. However, the same values relating x, y to i, j can be used in each case. This is now where the control points come in as we are going to use these to provide us with the (known) relationships between (x, y) and (i, j). If we have nine unknown coefficients, as in the example above, then in order to obtain a solution we require at least nine such observations, {(x1, y1), (i1, j1)} ... {(x9, y9), (i9, j9)}, say. Such a system is said to be exactly determined. However, the solution of these exact systems is often ill-conditioned (numerically unstable) and it is usually good practice to overdetermine the system by specifying more control points than you need (and hence generate more simultaneous equations). These equations can then be solved, yielding the coefficients and hence the warping functions, using standard personal computer based maths packages such as MATLAB.

For the sake of completeness, we include here details of how to solve such an overdetermined system. The reader can safely skip this section if (s)he so wishes. The first point to note is that an overdetermined system (where the number of equations is greater than the number of unknown values) does not have an exact solution and there are going to be some errors for some points. The idea, then, is to minimize these errors. We will use the common approach of minimizing the sum of the square of each error (i.e. to generate the so-called least-square-error solution). Consider, again, a single control point and assume we are attempting to compute the apq coefficients:
i1 = a00 x1^0 y1^0 + a10 x1^1 y1^0 + a20 x1^2 y1^0 + a01 x1^0 y1^1 + a11 x1^1 y1^1 + a21 x1^2 y1^1 + a02 x1^0 y1^2 + a12 x1^1 y1^2 + a22 x1^2 y1^2

If we use m control points in total we will have m such equations which (noting that x^0 and y^0 are both equal to 1) we may write in matrix form. We have to include the errors since there will not be a set of a00-a22 which will simultaneously provide us with exactly i1-im in the overdetermined case. Let us abbreviate this matrix equation to:

i = Xa + e

Similarly:

j = Xb + e

We require an expression for a, so we might think of multiplying across by X^-1 to obtain one. Unfortunately, X is non-square (the number of equations is greater than the number of coefficients) and one cannot invert a non-square matrix. However, we can indulge in a little algebra and calculus to derive an expression for a in terms of X and i:

i = Xa + e
e = i - Xa

We form the sum of the square of each error by computing e^T e:

e^T e = (i - Xa)^T (i - Xa)

Differentiating e^T e with respect to a, to find out how the errors change as the coefficients change:

d(e^T e)/d(a) = (0 - XI)^T (i - Xa) + (i - Xa)^T (0 - XI)
             = (-XI)^T (i - Xa) + (i^T - (Xa)^T)(-XI)
             = -IX^T (i - Xa) + (i^T - a^T X^T)(-XI)
             = -IX^T i + IX^T Xa - i^T XI + a^T X^T XI

But noting that i^T XI and a^T X^T XI are 1 x 1 matrices and that the transpose of a 1 x 1 matrix is equal to itself, we transpose these two sub-expressions:

             = -IX^T i + IX^T Xa - IX^T i + IX^T Xa
             = 2(I)(X^T Xa - X^T i)

The sum of the square of each error is minimized when d(e^T e)/d(a) is equal to zero, thus:

0 = 2(I)(X^T Xa - X^T i)
(X^T X)^-1 X^T Xa = (X^T X)^-1 X^T i
a = (X^T X)^-1 X^T i

(X^T X)^-1 X^T is commonly referred to as the 'pseudo-inverse' of X and is written X†. When this has been computed, the coefficient matrix a may be computed by simply multiplying X† by i; b is obtained in a like manner.
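For illustration, the overdetermined fit can be carried out in a few lines of Python with NumPy; np.linalg.lstsq computes the same least-square-error solution as the pseudo-inverse derivation above. The function name, the argument layout and the column ordering are assumptions of this sketch, not the book's.

import numpy as np

def warp_coefficients(xy_out, ij_in, n=2):
    """Fit the polynomial warping functions Wx and Wy by least squares.

    `xy_out` and `ij_in` are m x 2 arrays of corresponding control points
    in the output and input images respectively; m should exceed (n+1)^2
    so that the system is overdetermined.  Returns the coefficient
    vectors a and b.
    """
    x, y = xy_out[:, 0], xy_out[:, 1]
    # One column per term x^p * y^q, p, q = 0..n (nine columns for n = 2).
    X = np.column_stack([(x ** p) * (y ** q)
                         for p in range(n + 1) for q in range(n + 1)])
    # lstsq returns the least-square-error (pseudo-inverse) solution.
    a, *_ = np.linalg.lstsq(X, ij_in[:, 0], rcond=None)
    b, *_ = np.linalg.lstsq(X, ij_in[:, 1], rcond=None)
    return a, b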
4.3.1.2 Grey-level interpolation

Once the spatial mapping function has been found, the output image can be built, pixel by pixel and line by line. The coordinates given by the warping function, denoting the corresponding points in the input image, will not in general be integer values and the grey-level must be interpolated from the grey-levels of the surrounding pixels. 71

The simplest interpolation function is nearest-neighbour interpolation (zero-order interpolation) whereby the grey-level of the output pixel (which is what we are trying to estimate) is given by the grey-level of the input pixel which is nearest to the calculated point in the input image (see Figure 4.23). The computation involved in this interpolation function is quite trivial but the function generally yields quite adequate results. If the image exhibits very fine detail in which adjacent pixel grey-level varies significantly, i.e. the image exhibits high spatial frequencies, some of this detail may be lost. In such cases, bi-linear (i.e. first-order) interpolation should be considered since the estimate is made on the basis of four neighbouring input pixels. Consider the case shown in Figure 4.24 where we need to estimate the grey-level of a point somewhere between image pixels (i, j), (i, j + 1), (i + 1, j), and (i + 1, j + 1). Let the position of this point relative to pixel (i, j) be given by coordinates (p, q); 0 ≤ p, q ≤ 1. The grey-level at point (p, q) is constrained by the grey-level at the four neighbouring pixels and is a function of its position between these neighbours. To estimate the grey-level, f(p, q), we fit a surface through the four neighbours' grey-levels, the equation of which will in general identify the grey-level at any point between the neighbours. The surface we fit is a hyperbolic paraboloid and is defined by the bilinear equation:

f(p, q) = ap + bq + cpq + d

Figure 4.24 Bilinear interpolation.
There are four coefficients, a, b, c, and d, which we must determine to identify this

function for any given 2 x 2 neighbourhood in which we wish to interpolate. Thus we require four simultaneous equations in a, b, c, and d; these are supplied from our knowledge of the grey-level at the four neighbours given by relative coordinates (0, 0), (0, 1), (1, 0), (1, 1). Specifically, we know that:

a x 0 + b x 0 + c x 0 x 0 + d = f(i, j)            (4.1)
a x 0 + b x 1 + c x 0 x 1 + d = f(i, j + 1)        (4.2)
a x 1 + b x 0 + c x 1 x 0 + d = f(i + 1, j)        (4.3)
a x 1 + b x 1 + c x 1 x 1 + d = f(i + 1, j + 1)    (4.4)

Directly from (4.1), we have:

d = f(i, j)    (4.5)

Rearranging (4.2) and substituting for d, we have:

b = f(i, j + 1) - f(i, j)    (4.6)

Rearranging (4.3) and substituting for d, we have:

a = f(i + 1, j) - f(i, j)    (4.7)

Rearranging (4.4) and substituting for a, b, and d, we have:

c = f(i + 1, j + 1) + f(i, j) - f(i + 1, j) - f(i, j + 1)    (4.8)

Figure 4.23 Nearest-neighbour interpolation.

Equations (4.5)-(4.8) allow us to compute the coefficients a, b, c, and d, which define the bilinear interpolation for a given 2 X 2 neighbourhood with known pixel grey-levels. For example, if the coordinates of the point at which we wish to estimate the grey-level are (60.4, 128.1) and the grey-level at pixels (60, 128), (60, 129), (61, 128),


and (61, 129) are 10, 12, 14, and 15, respectively, then the grey-level at this point, in relative coordinates, is given by:
f(0.4, 0.1) = (14 - 10) x 0.4 + (12 - 10) x 0.1 + (15 + 10 - 14 - 12) x 0.4 x 0.1 + 10 = 11.76
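A direct transcription of equations (4.5)-(4.8) into Python might look as follows; this is an illustrative sketch only, and the function name is invented here.

def bilinear(f, i, j, p, q):
    """Interpolate the grey-level at (i + p, j + q), with 0 <= p, q <= 1.

    `f` is the image (indexable as f[row][col]); the coefficients follow
    equations (4.5)-(4.8) above.
    """
    d = f[i][j]
    b = f[i][j + 1] - f[i][j]
    a = f[i + 1][j] - f[i][j]
    c = f[i + 1][j + 1] + f[i][j] - f[i + 1][j] - f[i][j + 1]
    return a * p + b * q + c * p * q + d

# Reproducing the worked example: a 2 x 2 neighbourhood with grey-levels
# 10, 12, 14 and 15 gives bilinear(((10, 12), (14, 15)), 0, 0, 0.4, 0.1),
# which evaluates to 11.76 (to within floating-point rounding).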

4.3.2 Registration and geometric decalibration

This technique for spatial warping can be used directly to effect either registration of two images or to correct the geometric distortion which may have been introduced by the imaging system. In the former case, one merely needs to identify several corresponding points in the two images and use these as the control points when generating the polynomial coefficients. In the latter case, one might use the imaging system to generate an image of a test card (e.g. a square grid) and superimpose an undistorted copy of this pattern on the distorted image. The corresponding control points can then be explicitly identified in each version of the pattern. For example, in the case of the grid pattern, the control points might be the points of intersection of the grid lines.

4.4 Mathematical morphology

4.4.1 Basic set theory

Mathematical morphology is a methodology for image processing and image analysis which is based on set theory and topology. As such, it is a formal and rigorous mathematical technique and we need to establish a basic 'language' before proceeding. For the most part, this is the language of set theory. In the following, points and vectors will be denoted by latin lowercase letters: x, y, z; sets will be denoted by latin uppercase letters: X, Y, Z; and the symbol φ denotes the empty set. The more common set operations we will encounter in this brief treatment of morphology include the following:

- Set inclusion. This is written: Y ⊂ X, i.e. Y is a subset of (is included in) the set X. This is defined as y ∈ Y ⇒ y ∈ X: if y is an element of Y, then y is also an element of X.
- Complement. For any set X, the complement of X is written Xᶜ. This is the set of all elements which are not elements of X.
- Union. The union of two sets X and Y, written X ∪ Y, is defined:

X ∪ Y = {x | x ∈ X or x ∈ Y}

This should be read: the set X union Y is the set of all x such that x is an element of X or x is an element of Y.

- Intersection. The intersection of two sets X and Y, written X ∩ Y, is defined:

X ∩ Y = (Xᶜ ∪ Yᶜ)ᶜ

In effect, the intersection of two sets is the complement of the union of their respective complements. Thus, the intersection of X and Y is the complement of the set of elements which are not in X or not in Y, i.e. the set of elements which is common to both sets X and Y.

Let us now consider two further concepts. The first is translation. The translation of a set X by h is denoted Xh. This is a set where each element (point) is translated by a vector h. Second, we need to introduce a general symbolism for set transformation. A transformation of a set X is denoted by Ψ(X). Ψ( ) is the set transformation and in an expression such as Y = Ψ(X), Y is the transformed set. Finally, we require the concept of duality of set transformations. Given some transformation Ψ, we define the dual of the transformation Ψ*:

Ψ*(X) = (Ψ(Xᶜ))ᶜ

For example, intersection is the dual of union since X ∩ Y = (Xᶜ ∪ Yᶜ)ᶜ.

4.4.2 Structuring elements and hit or miss transformations

We are now in a position to continue with the discussion of mathematical morphology. Let us begin with the concept of a structuring element. A structuring element Bx, centred at x, is a set of points which is used to 'extract' structure in a set, X, say. For example, the structuring element might be a square or a disk, as shown in Figure 4.25, or any other appropriate shape. Now we define a hit or miss transformation as the 'point by point' transformation of a set X, working as follows. We choose, and fix, a structuring element B. Define Bx¹ to be that subset of Bx (recall Bx is the translate of B to a position x) whose elements belong to the 'foreground' and Bx² to be the subset of Bx whose elements belong to the 'background' (i.e. B¹ ∩ B² = φ).

Figure 4.25 Square and circular structuring elements.


A point x belongs to the hit or miss transform, denoted X ⊗ B, if and only if Bx¹ is included in X and Bx² is included in Xᶜ, the complement of X:

X ⊗ B = {x | Bx¹ ⊂ X; Bx² ⊂ Xᶜ}

Thus, X ⊗ B defines the points where the structuring element B exactly matches (hits) the set X, i.e. the image. For example, let B be the structuring element shown in Figure 4.26(a) and let X be the set shown in Figure 4.26(b). Then X ⊗ B is the set shown in Figure 4.26(c).

Figure 4.26 (a) Structuring element B; (b) image set X; (c) B 'hit or miss' X: X ⊗ B.

4.4.3 Erosion and dilation

In Section 4.2.3, we informally introduced the concepts of erosion and dilation. We will now define them formally. Let B̌ be the transposed set of B, i.e. the symmetrical set of B with respect to its origin. Then, the erosion operation is denoted ⊖ and the erosion of a set X with B is defined:

X ⊖ B̌ = {x | Bx ⊂ X}

Note that this is equivalent to a hit or miss transformation of X with B, where B² = φ, i.e. where there are no background points. Note also that X ⊖ B is not an erosion - it is the Minkowski subtraction of B from X. Intuitively, the erosion of a set by a structuring element amounts to the generation of a new (transformed/eroded) set where each element in the transformed set is a point where the structuring element is included in the original set X. For example, let B and X be the structuring element and set shown in Figure 4.27(a) and (b), respectively; then X ⊖ B̌ is depicted in Figure 4.27(c). In a more complex case, if B is a circular structuring element, Figure 4.28 schematically illustrates the effect of erosion.

Figure 4.27 (a) Structuring element B; (b) image set X; (c) the erosion of X with B: X ⊖ B̌.


Figure 4.28 Erosion of X by a circular structuring element B.

Dilation is a closely related operation. In fact, dilation is the dual of erosion. The dilation operation is denoted by the symbol ⊕. Thus:

X ⊕ B = (Xᶜ ⊖ B̌)ᶜ

That is, the dilation of X is the erosion of Xᶜ. This amounts to saying that the erosion of a set (e.g. an object) with a given structuring element is equivalent to the dilation of its complement (i.e. its background) with the same structuring element, and vice versa.

4.4.4 Opening and closing

After having eroded X by B, it is not possible in general to recover the initial set by dilating the eroded set X ⊖ B̌ by the same B. This dilate reconstitutes only a part of X, which is simpler and has fewer details, but may be considered as that part which is most essential to the structure of X. This new set (i.e. the result of erosion followed by dilation) filters out (i.e. generates) a new subset of X which is extremely rich in morphological and size distribution properties. This transformation is called an opening. The opening of a set X with a structuring element B, denoted X_B, is defined:

X_B = (X ⊖ B̌) ⊕ B

The closing of X with respect to B, denoted X^B, is defined:

X^B = (X ⊕ B̌) ⊖ B

Opening is the dual of closing, i.e.:

(Xᶜ)_B = (X^B)ᶜ

The opening is the domain swept out by all the translates of B which are included in X. This effects a smoothing of the contours of X, cuts narrow isthmuses, and suppresses small islands and sharp edges in X.

4.4.5 Thinning and the extraction of endpoints

As we have seen in Section 4.2.3, an approximation to the skeleton can be achieved by an iterative transformation known as thinning. From the perspective of mathematical morphology, the thinning of a set X by a sequence of structuring elements {L} is denoted X ○ {L}; that is, X ○ L is defined:

X ○ L = X / (X ⊗ L)

That is, the set X less the set of points in X which hit L. Thus, if X ⊗ L identifies border points, and L is appropriately structured to maintain connectivity of a set, then repeated application of the thinning process successively removes border points from a set until the skeleton is achieved. At this point, further application of the thinning transform yields no change in the skeletal set. This, of course, should be reminiscent of the thinning algorithm discussed in Section 4.2.3. Recalling the definition of a hit and miss transformation:

X ⊗ L = {x | ¹Lx ⊂ X; ²Lx ⊂ Xᶜ}
¹Lx ∩ ²Lx = φ

We can now proceed to develop a thinning algorithm by defining L. The sequence {L} which is used for thinning is based on a single structuring element and is generated by rotating the structuring element (through 360° in increments of 45° for a square lattice). This sequence {L} is shown in Figure 4.29. The thinning algorithm then amounts to the repeated transformation of a set Xi → Xi+1 defined:

Xi+1 = ((( ... (Xi ○ L1) ○ L2) ○ L3) ... ○ L8)

Figure 4.29 Sequence of structuring elements used in the thinning operation.

Figure 4.30 Structuring elements used to identify end-points.

The skeleton is achieved when Xi = Xi+1. Initially X0 = X, i.e. the original (unthinned) image. Given a skeleton set X, we can identify the endpoints, i.e. points which are connected to just one other point, using the hit or miss transform and an appropriate set of structuring elements {E}, shown in Figure 4.30. Thus, the endpoints of the skeleton are given by:

Y = ∪(i=1..8) X ⊗ Ei

That is, the union of all those points which hit with one of these endpoint structuring elements.


4.4.6 Application: identification of endpoints of electrical wires
In Chapter 8, we will be considering a complete vision and robotics case study: automated crimping of electrical wires. Part of this application is concerned with the identification of the positions of the ends of these electrical wires lying on a flat surface. In Chapter 8, we will be employing conventional techniques, but for the sake of illustration we will briefly consider a morphological approach here. Assuming that we are dealing with a grey-scale image of long thin wires (see Figures 8.18 and 8.19), then there are just three simple steps in achieving our objective:

Step 1. Threshold the grey-level image to obtain a binary image. Call this set X1.

Step 2. Generate the skeleton X2:

X2 = X1 ○ {L}

Step 3. Identify endpoints of skeleton X3:

X3 = ∪(i=1..8) X2 ⊗ Ei

X3 is the set of all endpoints of these electrical wires.
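A sketch of Step 3 in Python with NumPy is given below; it is illustrative only and not the book's implementation. It assumes that thresholding and thinning (Steps 1 and 2) have already produced a one-pixel-wide skeleton, and it replaces the eight hit or miss masks Ei by the equivalent test that a skeleton pixel has exactly one object neighbour.

import numpy as np

def endpoints(skeleton):
    """Return the coordinates of skeleton endpoints.

    `skeleton` is assumed to be an already-thinned binary image (object
    pixels 255, background 0).  A skeleton pixel is an endpoint if exactly
    one of its eight neighbours is an object pixel - the condition each of
    the endpoint masks tests for.
    """
    obj = (skeleton == 255).astype(np.uint8)
    points = []
    for i in range(1, obj.shape[0] - 1):
        for j in range(1, obj.shape[1] - 1):
            if obj[i, j] and obj[i - 1:i + 2, j - 1:j + 2].sum() == 2:
                points.append((i, j))      # the pixel plus exactly one neighbour
    return points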
4.4.7 A brief introduction to grey-scale mathematical morphology
So far, all of the mathematical morphology has assumed that we are dealing with a set X and its complement Xᶜ, i.e. that we have been dealing with binary images comprising a foreground (the set of object points X) and a background (the set of all other points Xᶜ). Unfortunately, grey-scale mathematical morphology is considerably more difficult to understand than binary morphology and this section is just intended to serve as an introduction to the manner in which we approach grey-level images from a morphological point of view.

For the following, we will consider one-dimensional slices of a two-dimensional image function. This makes it easier to represent in diagrams and it is somewhat easier to follow. Thus, a slice along the x-axis (i.e. a line of) a grey-scale image can be viewed as shown in Figure 4.31(a). If we threshold this function, choosing values ≥ T, that is, generate a set Xh:

Xh = {x | x ≥ T}

where T is the threshold value, we generate the set of points shown as a bold horizontal line in Figure 4.31(b). A grey-scale image, then, is considered to be a function f and is a sequence of sets:

f ≡ {Xh(f)}

and the grey-scale image is effectively the entire sequence of sets generated by successively decreasing the threshold level, as shown in Figure 4.31(c).

Figure 4.31 (a) A one-dimensional slice of a two-dimensional image function. (b) Thresholding this one-dimensional function at a value T generates a set Xh = {x | x ≥ T}, depicted by the bold line. (c) Representation of an image by a sequence of such sets.

Grey-level erosion is defined:

f → f ⊖ B

and equivalently:

{Xh(f)} → {Xh(f) ⊖ B}

This grey-level erosion is the function, or sequence of sets, which are individually the erosion of sets generated at successive thresholds. Figure 4.32 illustrates the erosion of a (one-dimensional) function with a flat structuring element.

Figure 4.32 Erosion of a one-dimensional function with a flat structuring element; the eroded function is depicted by the thin curve.

Similarly, grey-level dilation is defined:

f → f ⊕ B

and equivalently:

{Xh(f)} → {Xh(f) ⊕ B}

and Figure 4.33 illustrates the dilation of a (one-dimensional) function with a flat structuring element.

Figure 4.33 Dilation of a one-dimensional function with a flat structuring element; the dilated function is depicted by the thin curve.

Exercises

1. Why is convolution a useful operation in image processing? Be specific in your answer by identifying the relationship between convolution and filtering.
2. Identify and annotate two simple techniques for noise removal in digital images; detail any assumptions upon which the techniques are based.
3. What is the essential difference between erosion and thinning? What is the relationship between thinning and the medial axis transform?
4. Why are look-up table (LUT) formulations of algorithms computationally efficient?
5. If two images can be registered by translation and rotation operations, is it necessary to use spatial warping techniques? Will grey-level interpolation be an issue?

References and further reading

Arcelli, C. 1979 'A condition for digital points removal', Signal Processing, Vol. 1, pp. 283-5.

Blum, H. 1967 'A transformation for extracting new descriptors of shape', in Models for the Perception of Speech and Visual Form, W. Wathen-Dunn (ed.), MIT Press, Cambridge, Massachusetts, pp. 153-71.
Brady, M. and Asada, H. 1984 'Smoothed local symmetries and their implementation', The International Journal of Robotics Research, Vol. 3, No. 3, pp. 36-61.
Castleman, K.R. 1979 Digital Image Processing, Prentice Hall, New York.
Gonzalez, R.C. and Wintz, P. 1977 Digital Image Processing, Addison-Wesley, Reading, Massachusetts.
Hall, E.L. 1979 Computer Image Processing and Recognition, Academic Press, New York.
Hilditch, C.J. 1983 'Comparison of thinning algorithms on a parallel processor', Image and Vision Computing, Vol. 1, No. 3, pp. 115-32.
Kenny, P.A., Dowsett, D.J., Vernon, D. and Ennis, J.T. 1990 'The application of spatial warping to produce aerosol ventilation images of the lung immediately after perfusion with the same labelled isotope', Physics in Medicine and Biology, Vol. 35, No. 5, pp. 679-85.
Motzkin, Th. 1935 'Sur quelques propriétés caractéristiques des ensembles bornés non convexes', Atti Accad. Naz. Lincei, 21, pp. 773-9.
Nackman, L.R. and Pizer, S.M. 1985 'Three dimensional shape description using the symmetric axis transform I: Theory', IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-7, No. 2, pp. 187-202.
Pratt, W.K. 1978 Digital Image Processing, Wiley, New York.
Rosenfeld, A. and Kak, A. 1982 Digital Picture Processing, Academic Press, New York.
Rosenfeld, A. 1975 'A characterization of parallel thinning algorithms', Information and Control, Vol. 29, pp. 286-91.
Serra, J. 1982 Image Analysis and Mathematical Morphology, Academic Press, London.
Tamura, H. 1978 'A comparison of line-thinning algorithms from a digital geometry viewpoint', Proceedings 4th International Joint Conference on Pattern Recognition, pp. 715-19.
Zhang, T.Y. and Suen, C.Y. 1984 'A fast parallel algorithm for thinning digital patterns', Communications of the ACM, Vol. 27, No. 3, pp. 236-9.

5
The segmentation problem

5.1 Introduction: region- and boundary-based approaches

Segmentation is a word used to describe a grouping process in which the components of a group are similar with respect to some feature or set of features. The inference is that this grouping will identify regions in the image which correspond to unique and distinct objects in the visual environment. There are two complementary approaches to the problem of segmenting images and isolating objects: boundary detection and region growing. Region growing effects the segmentation process by grouping elemental areas (in simple cases, individual image pixels) sharing a common feature into connected two-dimensional areas called regions. Such features might be pixel grey-level or some elementary textural pattern, e.g. the short thin bars present in a herringbone texture. Boundary-based segmentation is concerned with detecting or enhancing the boundary pixels of objects within the image and subsequently isolating them from the rest of the image. The boundary of the object, once extracted, may easily be used to define the location and shape of the object, effectively completing the isolation.

An image comprising boundaries alone is a much higher-level representation of the scene than is the original grey-scale image, in that it represents important information explicitly. In cases where the boundary shape is complicated, the gap between these two representations is wide and it may be necessary to introduce an intermediate representation which is independent of the shape. Since boundaries of objects are often manifested as intensity discontinuities, a natural intermediate representation is composed of local, object-independent discontinuities in image intensity, normally referred to as 'edges'. The many definitions of the term edge can be summarized by the observation (or premise) that an edge occurs in an image if some image attribute (normally image intensity) changes its value discontinuously. In particular, edges are seen as local intensity discontinuities while boundaries are global ones.

The usual approach to segmentation by boundary detection is first to construct an edge image from the original grey-scale image, and then to use this edge image to construct the boundary image without reference to the original grey-scale data, by edge linking to generate short curve segments, edge-thinning, gap-filling, and curve-segment linking, frequently with the use of domain-dependent knowledge. This association of local edges is normally referred to as boundary detection, and the generation of the local edge image is referred to as edge detection. Boundary detection algorithms vary in the amount of domain-dependent information or knowledge which they incorporate in associating or linking the edges, and their effectiveness is obviously dependent on the quality of the edge image. The more reliable the edge elements in terms of their position, orientation and, indeed, authenticity, the more effective the boundary detector will be. However, for relatively simple, well-defined shapes, boundary detection may become redundant, or at least trivial, as edge detection performance improves. Since the computational complexity of the segmentation process as a whole is a function of the complexity of both the edge detection and the boundary detection, minimal segmentation complexity may be achieved by a trade-off between the sophistication of the edge detector and that of the boundary detector. It is worth noting, however, that since edge detection is essentially a filtering process and can often be effected in hardware, while boundary detection requires more sophisticated software, the current (and, probably, correct) trend is to deploy the most effective and sophisticated edge detector (e.g. the Canny operator or the Marr-Hildreth operator) and to simplify the boundary detection process.

The remainder of this chapter is devoted to a discussion of a region-based segmentation technique (thresholding), edge detection, region growing and, finally, boundary detection.

5.2 Thresholding

Grey-level thresholding, which we covered briefly in Chapter 4, is a simple region-based technique. However, we include it in a section of its own here because it is a very commonly used and popular technique. As we saw in Chapter 4, in situations where an object exhibits a uniform grey-level and rests against a background of a different grey-level, thresholding will assign a value of 0 to all pixels with a grey-level less than the threshold level and a value of 255 (say) to all pixels with a grey-level greater than the threshold level. Thus, the image is segmented into two disjoint regions, one corresponding to the background and the other to the object.

5.2.1 Global, local, and dynamic approaches

In a more general sense, a threshold operation may be viewed as a test involving some function of the grey-level at a point, some local property of the point (e.g. the average grey-level over some neighbourhood), and the position of the point in the image. Thus, a threshold operation may be viewed as a test involving a function T of the form:

    T(x, y, N(x, y), g(x, y))

where g(x, y) is the grey-level at the point (x, y) and N(x, y) denotes some local property of the point (x, y). If g(x, y) > T(x, y, N(x, y), g(x, y)) then (x, y) is labelled an object point, otherwise it is labelled a background point, or conversely. This is the most general form of the function T, however, and three classes of thresholding (global, local, and dynamic) may be distinguished on the basis of the restrictions placed on this function (see Weszka, 1978). These are:

Global thresholding: the test is dependent only on the grey-level of the point.
    T = T(g(x, y))
Local thresholding: the test is dependent on a neighbourhood property of the point and on the grey-level of the point.
    T = T(N(x, y), g(x, y))
Dynamic thresholding: the test is dependent on the point coordinates, a neighbourhood property of the point, and the grey-level of the point.
    T = T(x, y, N(x, y), g(x, y))

It is pertinent to note, however, that most systems utilize the simplest of these three approaches, global thresholding: the threshold test is based exclusively on the global threshold value and on the grey-level of a test-point, irrespective of its position in the image or of any local context. The approach is facilitated either by constraining the scene to ensure that there is no uneven illumination or by photometrically decalibrating the image before thresholding. The advantage of this approach is that the thresholding can be accomplished by commonly available hardware using a look-up table, as described in Chapter 4.
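The look-up-table formulation of global thresholding mentioned above can be sketched in a few lines. The following is a minimal illustration (plain NumPy; the 8-bit image, its size and the threshold value are arbitrary stand-ins), not the book's implementation.

```python
import numpy as np

def global_threshold_lut(image, t):
    """Global thresholding via a 256-entry look-up table: pixels with
    grey-level greater than t become 255 (object), the rest 0 (background)."""
    lut = np.where(np.arange(256) > t, 255, 0).astype(np.uint8)
    return lut[image]                      # one table look-up per pixel

image = np.random.randint(0, 256, (64, 64), dtype=np.uint8)   # stand-in test image
binary = global_threshold_lut(image, t=128)

# A local or dynamic threshold would make t depend on a neighbourhood property
# or on (x, y) as well, and so could not be expressed as a single global LUT.
```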

5.2.2 Threshold selection
The selection of an appropriate threshold is the single major problem for reliable segmentation. Of the several techniques which have been proposed, most are based on the analysis of the grey-level histogram, selecting thresholds which lie in the region between the two modes of the (bi-modal) histogram. The assumption that the histogram is indeed bi-modal, with one mode corresponding to the grey-level representing the object and the other to the grey-level representing the background, is often not valid; histograms are frequently noisy and the two modes may be difficult to detect (see Figure 5.1). Just as often, the object will generate a single mode while the background will comprise a wide range of grey-levels, giving rise to a uni-modal histogram (see Figure 5.2). While several applications of a simple smoothing (or local averaging) operator will help with a noisy bi-modal histogram, they will be of little use with a uni-modal one. Figure 5.3 illustrates the effect of smoothing a grey-scale histogram by iterative application of a local 3 x 1 neighbourhood averaging operator.

[Figure 5.1: Noisy bi-modal grey-scale histogram (frequency plotted against grey-scale, 0-255).]
[Figure 5.2: Uni-modal grey-scale histogram (frequency plotted against grey-scale, 0-255).]
[Figure 5.3: Grey-scale histogram smoothing: top-left: no smoothing; top-right: one application of a 3 x 1 neighbourhood average operator; bottom-left: two applications; bottom-right: three applications.]

Another useful approach to threshold selection is to use the average grey-level of those pixels which lie on the boundary between the object and the background as an estimate of the threshold value. As the grey-level of such a boundary pixel will typically lie between those of the object and the background, it provides a good indication of the threshold value (see Figure 5.4). The difficulty, of course, lies in deciding which pixels are on the boundary. One of the best approaches is to use a reliable edge detector, such as the Marr-Hildreth operator described in the next section, to identify these boundary points. The threshold selection procedure first uses the Marr-Hildreth operator to locate edges in the image and then computes the mean grey-level of the image pixels at these edge locations; this mean represents the global threshold value. To illustrate this approach, Figure 5.5 shows the binary image generated by thresholding the original grey-scale image at a threshold equal to the mean grey-level of the boundary points generated using the Marr-Hildreth operator. It should be noted that, although this threshold selection technique is computationally complex and may take a significant amount of time to compute, it is only a calibration exercise and need not be performed before every threshold operation.
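A rough sketch of this selection procedure is given below. The book uses the Marr-Hildreth operator to locate the edge points; purely as a stand-in, this sketch takes the strongest gradient responses (simple central differences) as the edge pixels, and the edge_fraction parameter is an arbitrary choice introduced for illustration.

```python
import numpy as np

def threshold_from_edge_pixels(image, edge_fraction=0.05):
    """Estimate a global threshold as the mean grey-level of the pixels
    lying on (approximate) object/background boundaries."""
    f = image.astype(float)
    gx = np.zeros_like(f)
    gy = np.zeros_like(f)
    gx[1:-1, 1:-1] = f[1:-1, 2:] - f[1:-1, :-2]          # central differences
    gy[1:-1, 1:-1] = f[2:, 1:-1] - f[:-2, 1:-1]
    magnitude = np.hypot(gx, gy)
    cutoff = np.quantile(magnitude, 1.0 - edge_fraction)  # keep the strongest responses
    edge_pixels = magnitude >= cutoff
    return f[edge_pixels].mean()

# t = threshold_from_edge_pixels(image)
# binary = np.where(image > t, 255, 0)   # calibration done once; the threshold is then reused
```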

[Figure 5.4: Using edge pixels to select a threshold: (a) image of a dark, round object on a light background with section X-X shown; (b) profile of image intensity along section X-X, the mean grey-level of the edge pixels marking the threshold value.]

[Figure 5.5: Automatic threshold selection using the Marr-Hildreth theory of edge detection: (a) original grey-scale image; (b) automatically thresholded binary image.]

5.3 An overview of edge detection techniques
As might be expected when dealing with a process which is fundamental to image segmentation, the literature concerning edge detection is large. This section will not attempt to review all detectors in detail; rather it will survey and describe the different approaches to edge detection and illustrate each approach with specific algorithms. There are four distinct approaches to the problem of edge detection:

(a) gradient- and difference-based operators;
(b) template matching;
(c) edge fitting;
(d) statistical edge detection.

Each of these four approaches will be considered in turn.

5.3.1 Gradient- and difference-based operators
If we define a local edge in an image to be a transition between two regions of significantly different intensities, then the gradient function of the image, which measures the rate of change, will have large values in these transitional boundary areas. Thus gradient-based, or first-derivative-based, edge detectors enhance the image by estimating its gradient function and then signal that an edge is present if the gradient value is greater than some defined threshold.

In more detail, if ∂f/∂x and ∂f/∂y represent the rates of change of a two-dimensional function f(x, y) in the x- and y-directions, respectively, then the rate of change in a direction θ (measured in the positive sense from the x-axis) is given by:

    (∂f/∂x) cos θ + (∂f/∂y) sin θ

The direction θ at which this rate of change has the greatest magnitude is given by:

    θ = arctan[(∂f/∂y) / (∂f/∂x)]

with magnitude:

    √[(∂f/∂x)² + (∂f/∂y)²]

The gradient of f(x, y) is a vector at (x, y) with this magnitude and direction. Thus the gradient may be estimated if the directional derivatives of the function are known along (any) two orthogonal directions. The essential differences between all gradient-based edge detectors are the directions which the operators use, the manner in which they approximate the one-dimensional derivatives of the image function in these directions, and the manner in which they combine these approximations to form the gradient magnitude. These gradient functions are intuitively easy to understand when we confine ourselves to the discrete domain of digital images, where partial derivatives become simple first differences. For example, the first difference of a two-dimensional function in the x-direction is simply:

    f(x+1, y) - f(x, y)

Similarly, the first difference of a two-dimensional function in the y-direction is simply:

    f(x, y+1) - f(x, y)

An operator due to Roberts estimates the derivatives diagonally over a 2 x 2 neighbourhood. The magnitude of the gradient g(x, y), at an image point (x, y), is approximated by taking the RMS of the directional differences:

    g(x, y) ≈ R(x, y) = √[ {f(x, y) - f(x+1, y+1)}² + {f(x, y+1) - f(x+1, y)}² ]

R(x, y) is usually referred to as the Roberts cross operator. The differences may be combined in a way other than the RMS to provide a computationally simpler version, the Roberts absolute value estimate of the gradient function, given by:

    g(x, y) ≈ R(x, y) = |f(x, y) - f(x+1, y+1)| + |f(x, y+1) - f(x+1, y)|

Rosenfeld and Kak have argued that a third version, the Roberts max operator, given by:

    g(x, y) ≈ R(x, y) = max( |f(x, y) - f(x+1, y+1)| , |f(x, y+1) - f(x+1, y)| )

affords better invariance to edge orientation: applying the Roberts max operator to edges of equal strength, but of different orientation, produces less variation in the resultant magnitude value than if the cross operator were used.

To illustrate these edge detectors, a test image with fairly fine structure (a tray of electrical wires) was acquired: see Figure 5.6. The results of the Roberts RMS, absolute value, and max operators are shown in Figures 5.7, 5.8, and 5.9 respectively. One of the main problems with the Roberts operator is its susceptibility to noise, because of the manner in which it estimates the directional derivatives, i.e. the first differences, of the image function f(x, y). This has prompted an alternative estimation of the gradient in which the differencing process is combined with local averaging.
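The three Roberts estimates above reduce to a few lines of array arithmetic. The sketch below (plain NumPy, with the image indexed as f[x, y]) is illustrative rather than the book's code; note that the output is one sample smaller in each dimension because the operator works on a 2 x 2 neighbourhood.

```python
import numpy as np

def roberts_estimates(image):
    f = image.astype(float)
    d1 = f[:-1, :-1] - f[1:, 1:]                    # f(x, y)   - f(x+1, y+1)
    d2 = f[:-1, 1:] - f[1:, :-1]                    # f(x, y+1) - f(x+1, y)
    rms      = np.sqrt(d1 ** 2 + d2 ** 2)           # Roberts cross operator
    absolute = np.abs(d1) + np.abs(d2)              # Roberts absolute value operator
    maximum  = np.maximum(np.abs(d1), np.abs(d2))   # Roberts max operator
    return rms, absolute, maximum

# An edge is signalled wherever the chosen estimate exceeds a predefined threshold, e.g.:
# edges = rms > 30          (the value 30 is an arbitrary illustrative choice)
```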

[Figure 5.6: A tray of wires.]
[Figure 5.7: Roberts RMS edge detection operator.]
[Figure 5.8: Roberts absolute value edge detection operator.]
[Figure 5.9: Roberts max edge detection operator.]
[Figure 5.10: Sobel RMS edge detection operator.]

For example, the Sobel operator estimates the partial derivative in the x-direction over a 3 x 3 region centred at f(x, y) by:

    Sx = {f(x+1, y-1) + 2f(x+1, y) + f(x+1, y+1)} - {f(x-1, y-1) + 2f(x-1, y) + f(x-1, y+1)}

This essentially takes the difference of a weighted average of the image intensity on either side of f(x, y). Similarly:

    Sy = {f(x-1, y+1) + 2f(x, y+1) + f(x+1, y+1)} - {f(x-1, y-1) + 2f(x, y-1) + f(x+1, y-1)}

The gradient may then be estimated as before, either by calculating the RMS (see Figure 5.10):

    g(x, y) ≈ S = √(Sx² + Sy²)

or by taking the sum of the absolute values (see Figure 5.11):

    g(x, y) ≈ S = |Sx| + |Sy|
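A sketch of the Sobel estimate follows, using 3 x 3 convolution kernels that correspond to the Sx and Sy expressions above (cf. Figure 5.14(b)); it relies on scipy.ndimage and is an illustration under those assumptions rather than the book's implementation.

```python
import numpy as np
from scipy.ndimage import convolve

# Kernel for Sx, with the image indexed as f[x, y]; the Sy kernel is its transpose.
# ndimage.convolve performs true convolution (the kernel is flipped); as noted in the
# text, the 180-degree rotation only changes the sign of Sx and Sy, not the magnitude.
sobel_x = np.array([[-1, -2, -1],
                    [ 0,  0,  0],
                    [ 1,  2,  1]], dtype=float)
sobel_y = sobel_x.T

def sobel_gradient(image):
    f = image.astype(float)
    sx = convolve(f, sobel_x, mode='nearest')
    sy = convolve(f, sobel_y, mode='nearest')
    return np.sqrt(sx ** 2 + sy ** 2), np.abs(sx) + np.abs(sy)   # RMS and absolute-value estimates

# The Prewitt operator differs only in its kernels (weights 1, 1, 1 in place of 1, 2, 1).
```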

In an analogous manner, Prewitt suggests an approximation of the partial derivatives by:

    Px = {f(x+1, y-1) + f(x+1, y) + f(x+1, y+1)} - {f(x-1, y-1) + f(x-1, y) + f(x-1, y+1)}
    Py = {f(x-1, y+1) + f(x, y+1) + f(x+1, y+1)} - {f(x-1, y-1) + f(x, y-1) + f(x+1, y-1)}

and the gradient may be estimated as before (see Figures 5.12 and 5.13).

[Figure 5.11: Sobel absolute value edge detection operator.]
[Figure 5.12: Prewitt RMS edge detection operator.]
[Figure 5.13: Prewitt absolute value edge detection operator.]

Quite often, the directional differences are estimated using simple convolution kernels, one kernel for each operator. These yield two partial derivative images, which are then combined on a point-by-point basis, either as the RMS or as the sum of absolute values, to produce the final gradient estimate. Figure 5.14 illustrates the convolution kernels for the Roberts, Sobel, and Prewitt operators. Strictly speaking, the kernel should first be rotated by 180° before the convolution is performed (see Section 4.2.1); however, this is normally omitted since the resultant error of 180° in the gradient direction can be ignored.

[Figure 5.14: Convolution kernels for estimation of the partial derivatives with (a) Roberts; (b) Sobel; and (c) Prewitt edge detection operators.]

Once the gradient magnitude has been estimated, a decision as to whether or not an edge exists is made by comparing it to some predefined value; an edge is deemed to be present if the magnitude is greater than this threshold. Obviously the choice of threshold is important and, in noisy images, threshold selection involves a trade-off between missing valid edges and including noise-induced false edges.

So far, edge detection has been discussed on the basis of first-derivative directional operators. However, an alternative method uses an approximation to the Laplacian:

    ∇² = ∂²/∂x² + ∂²/∂y²

i.e. the sum of the second-order, unmixed, partial derivatives. The standard approximation is given by:

    L(x, y) = f(x, y) - 1/4 {f(x, y+1) + f(x, y-1) + f(x+1, y) + f(x-1, y)}

The digital Laplacian has zero response to linear ramps (and thus to gradual changes in intensity) but it does respond on either side of an edge, once with a positive sign and once with a negative sign. Thus, in order to detect edges, the image is enhanced by evaluating the digital Laplacian, and the points at which the resultant image goes from positive to negative, i.e. at which it crosses zero, are isolated. The Laplacian has one significant disadvantage: it responds very strongly to noise.
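A minimal sketch of the digital Laplacian and a crude zero-crossing test is given below (plain NumPy; the sign-change test between adjacent samples is one simple way of isolating the crossings, not necessarily the book's).

```python
import numpy as np

def digital_laplacian(image):
    """Standard approximation L(x, y) = f(x, y) - 1/4 (sum of the four neighbours)."""
    f = image.astype(float)
    L = np.zeros_like(f)
    L[1:-1, 1:-1] = f[1:-1, 1:-1] - 0.25 * (f[1:-1, 2:] + f[1:-1, :-2] +
                                            f[2:, 1:-1] + f[:-2, 1:-1])
    return L

def zero_crossings(L):
    """Mark pixels where the Laplacian changes sign between adjacent samples."""
    zc = np.zeros(L.shape, dtype=bool)
    zc[:, :-1] |= (L[:, :-1] * L[:, 1:]) < 0
    zc[:-1, :] |= (L[:-1, :] * L[1:, :]) < 0
    return zc

# Because the plain Laplacian responds strongly to noise, these crossings are
# usually taken from a Gaussian-smoothed image, as described next.
```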
A different and much more successful application of the Laplacian to edge detection was proposed by Marr and Hildreth in 1980. This approach first smooths the image by convolving it with a two-dimensional Gaussian, G(x, y), and then applies the Laplacian, marking edges at the zero-crossings of the result. Despite some criticism of this technique, it is very widely used. The operator possesses a number of useful properties: for example, the evaluation of the Laplacian and the convolution commute so that, for a Gaussian with a given standard deviation, we can derive a single filter, the Laplacian of Gaussian:

    ∇²(f(x, y) * G(x, y)) = ∇²G(x, y) * f(x, y)

Furthermore, this two-dimensional convolution is separable into four one-dimensional convolutions (see Appendix I for a derivation):

    ∇²(f(x, y) * G(x, y)) = G(x) * [f(x, y) * ∂²G(y)/∂y²] + G(y) * [f(x, y) * ∂²G(x)/∂x²]
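The separable form above maps directly onto four one-dimensional convolutions. The sketch below uses scipy.ndimage.gaussian_filter1d (whose order argument gives a derivative-of-Gaussian filter); the sigma value is an arbitrary illustrative choice rather than anything prescribed by the book.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def laplacian_of_gaussian(image, sigma=2.0):
    f = image.astype(float)
    # f * d2G/dx2 along one axis, then * G along the other ...
    term_x = gaussian_filter1d(gaussian_filter1d(f, sigma, axis=0, order=2),
                               sigma, axis=1, order=0)
    # ... plus the corresponding pair with the axes exchanged
    term_y = gaussian_filter1d(gaussian_filter1d(f, sigma, axis=1, order=2),
                               sigma, axis=0, order=0)
    return term_x + term_y

# Edges would then be marked at the zero-crossings of the result, exactly as for
# the plain digital Laplacian sketched earlier.
```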
