User Interfaces for Topic Management of Web Sites
Brian Amento

Dissertation submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science and Applications

Committee: Deborah Hix (Chair), Roger Ehrich, Rex Hartson, Will Hill, Robert Schulman, Loren Terveen

September 26, 2001
Blacksburg, Virginia

Keywords: Information Access, Information Retrieval, Information Visualization, Human Computer Interaction, Social Filtering, Collaborative Filtering

Copyright 2001, Brian Amento

User Interfaces for Topic Management of Web Sites
Brian Amento

(ABSTRACT)
Topic management is the task of gathering, evaluating, organizing, and sharing a set of web sites for a specific topic. Current web tools do not provide adequate support for this task. We created and continue to develop the TopicShop system to address this need. TopicShop includes (1) a web crawler/analyzer that discovers relevant web sites and builds site profiles, and (2) user interfaces for information workspaces. We conducted an empirical pilot study comparing user performance with TopicShop vs. Yahoo!, and used the results to improve the design of TopicShop. Based on the results and user comments from the pilot study, a number of key design changes were incorporated into a second version of TopicShop: (1) the tasks of evaluation and organization are treated as integral instead of separable; (2) spatial organization is important to users and must be well supported in the interface; and (3) distinct user and global datasets help users deal with the large quantity of information available on the web. A full empirical study using the second iteration of TopicShop covered more areas of the World Wide Web and validated results from the pilot study. Across the two studies, TopicShop subjects found over 80% more high-quality sites (where quality was determined by independent expert judgements) while browsing only 81% as many sites and completing their task in 89% of the time. The site profile data that TopicShop provides – in particular, the number of pages on a site and the number of other sites that link to it – were the key to these results, as users exploited them to identify the most promising sites quickly and easily. We also evaluated a number of link- and content-based algorithms using a dataset of web documents rated for quality by human topic experts. Link-based metrics did a good job of picking out high-quality items. Precision at 5 (the standard information retrieval metric, here the proportion of the 5 top-ranked items that are actually high quality) is about 0.75, and precision at 10 is about 0.55; this is in a dataset where 32% of all documents were of high quality. Surprisingly, a simple content-based metric, which ranked documents by the total number of pages on their containing site, performed nearly as well. These studies give insight into users' needs for the task of topic management, and provide empirical evidence of the effectiveness of task-specific interfaces (such as TopicShop) for managing topical collections.

Table of Contents

CHAPTER 1: INTRODUCTION
1.1 INTRODUCTION
1.2 MOTIVATION OF RESEARCH
1.3 OBJECTIVES OF RESEARCH
1.4 APPROACH TO RESEARCH
1.5 CONTRIBUTIONS OF RESEARCH

CHAPTER 2: RELATED WORK
2.1 FILTERING
2.1.1 COLLABORATIVE/SOCIAL FILTERING
2.2 STRUCTURE IN THE WEB
2.2.1 HYPERTEXT STRUCTURE
2.2.2 USING STRUCTURE IN TOOLS
2.2.3 WEB CRAWLING
2.3 WEB PAGE ARCHIVING
2.4 INFORMATION WORKSPACES

CHAPTER 3: PHOAKS SYSTEMS
3.1 INTRODUCTION
3.1.1 USENET NEWS
3.1.2 FREQUENCY OF MENTION IN PUBLIC CONVERSATION
3.1.3 CLASSIFICATION RULES: DEVELOPMENT & ITERATIVE REFINEMENT
3.2 PHOAKS ARCHITECTURE
3.2.1 PHOAKS NEWS AGENT
3.2.1.1 Filtering
3.2.1.2 Categorization
3.2.1.3 Disposition
3.2.2 WEB INTERFACE
3.3 LESSONS LEARNED

CHAPTER 4: TOPICSHOP SYSTEMS
4.1 WEB CRAWLING
4.2 WEBCITE
4.2.1 LESSONS LEARNED
4.3 TOPICSHOP
4.4 CURRENT INTERNET RESOURCE DISCOVERY TECHNIQUES
4.4.1 COMPREHENSIVE INDICES (WEB DIRECTORIES)
4.4.2 KEYWORD SEARCHES
4.4.3 HYBRID DIRECTORY/KEYWORD SEARCHES
4.4.4 SPECIALIZED INDICES
4.4.5 SOCIALLY FILTERED
4.4.6 TOPICSHOP

CHAPTER 5: OVERVIEW OF USER STUDIES
5.1 HYPOTHESIS
5.2 EXPERIMENTS
5.2.1 SELECTING A DOMAIN
5.2.2 INTRODUCTION TO PILOT STUDY
5.2.3 INTRODUCTION TO INTERFACE EVALUATION

CHAPTER 6: PILOT STUDY
6.1 INTRODUCTION
6.2 EXPERIMENTAL DESIGN
6.3 PARTICIPANTS
6.4 METHODOLOGY
6.5 DATA COLLECTION AND ANALYSIS
6.6 QUANTITATIVE RESULTS
6.7 USER EXPLORATION STRATEGIES
6.8 DESIGN IMPLICATIONS

CHAPTER 7: USER INTERFACE EVALUATION
7.1 LESSONS LEARNED
7.2 TOPICSHOP DESIGN ITERATION
7.3 EXPERIMENTAL DESIGN
7.4 PARTICIPANTS
7.5 METHODOLOGY
7.6 DATA COLLECTION AND ANALYSIS
7.6.1 PHASE ONE: USER STUDY
7.6.2 PHASE TWO: EXPERT RATINGS
7.7 QUANTITATIVE RESULTS
7.7.1 EXPERT METRICS
7.7.2 FINDING QUALITY SITES
7.7.3 USER SEARCH EFFICIENCY
7.7.4 REQUIRED EFFORT
7.7.5 USER CATEGORIZATION
7.7.6 RELATIONSHIP BETWEEN EVALUATION AND ORGANIZATION SUB-TASKS
7.7.7 EXPERT RATINGS FOR SITE BREAKDOWNS
7.7.8 COMPARING HUMAN PERFORMANCE TO AUTOMATIC METRICS
7.7.9 QUESTIONNAIRE RESULTS
7.7.10 QUALITATIVE OBSERVATIONS
7.8 DESIGN SUMMARY
7.8.1 SPATIAL ORGANIZATION IN WORK AREA

CHAPTER 8: COMPARISON OF STUDIES
8.1 RESULTS
8.1.1 FINDING QUALITY SITES
8.1.2 USER EFFORT
8.1.3 QUESTIONNAIRE

CHAPTER 9: PREDICTING QUALITY SITES
9.1 EXPERIMENT
9.1.1 DATA
9.2 RESULTS
9.2.1 EXPERT AGREEMENT
9.2.2 LINK-BASED METRIC COMPARISON
9.2.3 PREDICTING QUALITY
9.2.4 DISCUSSION

CHAPTER 10: SUMMARY AND CONCLUSIONS

CHAPTER 11: REFERENCES

CHAPTER 12: CURRICULUM VITAE

Table of Figures
FIGURE 1.1: RESEARCH ROAD MAP
FIGURE 3.1: PHOAKS WEB INTERFACE
FIGURE 4.1: WEBCITE USER INTERFACE
FIGURE 4.2: FIRST VERSION OF TOPICSHOP (DETAILS VIEW)
FIGURE 4.3: FIRST VERSION OF TOPICSHOP (ICONS VIEW)
FIGURE 6.1: SEARCH ENGINE USAGE
FIGURE 6.2: WEB BROWSE HISTORY FROM USER PILOT STUDY
FIGURE 7.1: REVISED VERSION OF TOPICSHOP, BASED ON RESULTS OF PILOT STUDY
FIGURE 7.2: A SAMPLE SUBJECT'S CATEGORIZATION OF TORI AMOS SITES (SUBJECT 3)
FIGURE 7.3: A SECOND SUBJECT'S CATEGORIZATION OF TORI AMOS SITES (SUBJECT 4)
FIGURE 7.4: GROUPS FOR TORI AMOS AS CREATED BY SUBJECTS 3 & 4
FIGURE 7.5: TIMELINES OF USER ACTIVITY
FIGURE 7.6: AUTOMATED METRICS COMPARED TO SUBJECTS' JUDGMENTS
FIGURE 9.1: DATA FOR QUALITY EXPERIMENTS


Table of Tables
TABLE 4.1: COMPARISON OF SEARCH INTERFACES
TABLE 6.1: PILOT STUDY EXPERIMENTAL DESIGN
TABLE 6.2: EXPERT INTERSECTION ANALYSIS
TABLE 6.3: EXPERT WEIGHTED UNION ANALYSIS
TABLE 6.4: AMOUNT OF WORK
TABLE 7.1: MAIN STUDY EXPERIMENTAL DESIGN
TABLE 7.2: NUMBER OF SITES IN EXPERT SETS
TABLE 7.3: AVERAGE EXPERT MAJORITY SCORES FOR TOPICSHOP AND YAHOO USERS
TABLE 7.4: MAJORITY SCORE FOR TOP 5/TOP 10 USER SITES
TABLE 7.5: INTERSECTION BETWEEN USERS' SELECTIONS AND TOP 15 EXPERT-RATED SITES
TABLE 7.6: TASK TIME (IN MINUTES)
TABLE 7.7: TIME TO VISIT TOP 5 SITES
TABLE 7.8: PERCENTAGE OF TIME SPENT BROWSING/ORGANIZING
TABLE 7.9: AVERAGE NUMBER OF SITES BROWSED
TABLE 7.10: AVERAGE SITE INTERSECTION AMONG USERS
TABLE 7.11: PAIRWISE CATEGORY AGREEMENT BETWEEN USERS (1-4)
TABLE 7.12: DISTRIBUTION OF ORGANIZATIONAL ACTIONS ACROSS TIME QUARTILES
TABLE 7.13: EXPERT SCORES OF SITE CATEGORIES
TABLE 8.1: EXPERT INTERSECTION COMPARISON ACROSS STUDIES
TABLE 8.2: TASK TIME COMPARISON ACROSS STUDIES
TABLE 8.3: COMPARISON OF NUMBER OF SITES BROWSED
TABLE 8.4: USER CONFIDENCE FROM QUESTIONNAIRE
TABLE 8.5: SITE PARAMETER RANKINGS
TABLE 9.1: EXPERT AGREEMENT USING CORRELATIONS
TABLE 9.2: EXPERT AGREEMENT, USING CATEGORIES
TABLE 9.3: METRIC SIMILARITY
TABLE 9.4: METRIC SIMILARITY, INTERSECTION OF TOP 5 AND 10
TABLE 9.5: LINEAR MODEL FOR PREDICTING EXPERT AVERAGE
TABLE 9.6: NUMBER AND PROPORTION OF GOOD
TABLE 9.7: PRECISION AT 5 AND 10
TABLE 9.8: MAJORITY SCORE AT 5 AND 10
TABLE 9.9: AVERAGE EXPERT SCORES OF TOP 10 SITES


CHAPTER 1: INTRODUCTION

1.1 INTRODUCTION

Web search and navigation are difficult problems that have received much attention, with search engines like AltaVista and directories like Yahoo being the most widespread attempts at solutions. However, users have information needs and interests that are larger in scope and longer in duration than AltaVista and Yahoo can satisfy. In particular, users want to manage their persistent interests in broad topics and to comprehend collections of web documents relating to those topics.

1.2 MOTIVATION OF RESEARCH

Typical search solutions are content-based: a user query is filled by matching keywords to the text of web pages. While this approach works in many situations, it fails when users want to find quality information on a topic and manage the resulting information over a period of time. By utilizing the inherent structure found on the World Wide Web, we may gain more insight into the perceived quality of a web site. By viewing links to web pages as endorsements (a site linking to a page might validate that it contains quality content), we can use the concepts of social filtering (utilizing user preferences for prediction) to create better collections of topically coherent web sites. Social filtering is a method of filtering objects (documents, videos, web pages, etc.) that concentrates on the characteristics of people and their preferences in addition to the objects' content. The focus of social filtering is shifted from strictly


assessing the content of objects to evaluating the personal and organizational relationships of the community of users accessing those objects.

An important task that many web users perform is gathering, evaluating, and organizing relevant information resources for a given topic; we call this topic management. Sometimes users investigate topics of professional interest, at other times topics of personal interest. Users may create collections of web information resources for their own use or for sharing with coworkers or friends. For example, one might gather a collection of web sites on wireless telephony as part of a report for work, or a collection on The X-Files as a service for fellow fans. Librarians might prepare topical collections for their clients, and teachers for their students.

Topic management is a difficult task that is not supported well by current web tools. A common way to find an initial set of (potentially) relevant resources is to use a search engine like AltaVista or an index like Yahoo. At this point, however, a user's work has just begun: the initial set usually is quite large, consisting of dozens to hundreds of sites of varying quality and relevance, covering assorted aspects of the topic. Users typically want to select a manageable number – say 10 to 20 – of high-quality sites that cover the topic. With existing tools, users simply have to browse and view resources one after another until they are satisfied they have a good set or, more likely, they get tired and give up. Browsing a web site is an expensive operation, both in time and cognitive effort. And bookmarks, probably the most common way of keeping track of web sites, are a fairly primitive organizational technique.

While many web search utilities provide answers to specific queries, they do not provide convenient, efficient methods for exploring the body of knowledge available about a topic. Some search resources allow users to find a category that closely matches the topic they are interested in, but the end result is simply an alphabetical list of web sites that contain information on the given topic. New

techniques that provide additional functionality need to be available on the web to support broader types of information gathering.

Most research done on search engines (see Related Work, Chapter 2) has concentrated on tweaking search algorithms to yield very small gains in the relevance ranking of results with respect to the user's query. While improving result relevance is still important, the small gains attained are out of proportion to the amount of work that must be done by the user. Even after these gains are realized,


there still remains the problem of what to do with the ranked list of information. With better user interfaces and visualization methods for presenting results, we may help users find information more efficiently and effectively. We created and continue to develop the TopicShop system (discussed in Chapter 4) to address this need. TopicShop includes (1) a web crawler that discovers relevant web sites and builds site profiles, and (2) information workspaces for exploring and organizing sites.

1.3 OBJECTIVES OF RESEARCH

This research had multiple initial goals. First, we wanted to gain a better understanding of the task of topic management and the methods people use to complete it, while showing that the task has limited existing support on the web. Second, we wanted to evolve the interface designs in TopicShop to be more efficient and to give users better access to the data necessary for topic management. Finally, using two controlled empirical studies, we validated that these interfaces enable users to perform the topic management task effectively, and we demonstrated their usefulness for people maintaining persistent collections of web sites (such as links page maintainers) by enabling them to easily use the TopicShop system for their own web sites.

1.4 APPROACH TO RESEARCH

We conducted two between-subjects empirical comparisons of TopicShop and Yahoo, with users performing the task of topic management. We also investigated the effectiveness of these two user interfaces in supporting users' needs in managing persistent topic collections of web sites. Yahoo is a popular Internet tool currently used for the task of topic management, and combined with bookmarks it serves as a good comparison for TopicShop. Users in both of our studies were presented data on a topic either in the Yahoo interface in its original form or, in our TopicShop interface, as a topic crawl using Yahoo sites as seed sites. They were asked to use the interface provided to them to evaluate the sites in the collection from the topic crawl, selecting those they thought were of the highest quality (i.e., sites that together provide a good overview of the topic). By measuring their performance and soliciting subjective feedback, we gained insight into the benefits of each interface concept and incorporated changes into later iterations of TopicShop that improve its


usefulness. In addition, topic experts evaluated these same sites, and we used these expert quality judgements to rate the quality of each user's collection of sites. We also performed numerous analyses of the notion of quality and how it can be predicted through automated measures. These studies and analyses are detailed in Chapters 4 through 10.

[Figure 1.1 (Research Road Map) traces the path of this research: lessons learned from the PHOAKS production system led to TopicShop V1; a pilot study (raising the question "But, what is quality?") produced user feedback that shaped TopicShop V2; the main study then supported both an empirical evaluation and an algorithm evaluation, with the goal of comparing the quality of TopicShop and Yahoo subjects' collections.]

Figure 1.1: Research Road Map

When designing TopicShop, we kept a number of goals in mind, including: making relevant but invisible information visible, including a rich representation of the desired web sites; making it simple for users to explore and organize resources; and integrating topic management into a user's normal computing and communications environment. By following these guidelines and iterating the design based on user feedback from empirical studies, we have improved the TopicShop user interface to meet our goals and the needs of users.

1.5 CONTRIBUTIONS OF RESEARCH

Contributions of this research include:

• A Java applet-based web crawler that efficiently gathers relevant web sites about a topic using the hypertext structure of the web, and returns results with detailed site profiles in response to a user's topical query.

• The TopicShop visualization and management user interface, which this work showed to be more effective for displaying results and managing topic collections, developed by thoroughly analyzing how users perform the task of topic management.

• Empirical evidence, from controlled studies comparing topic management interfaces, regarding the effectiveness of a task-specific user interface for topic management over current search engine technology. We have shown that TopicShop subjects find more high-quality sites, while doing less work, in less time. In addition, we have shown that simple features of web sites can be used to predict their quality.


CHAPTER 2: RELATED WORK

2.1 FILTERING

Information filtering is a technique that uses past user data to make recommendations about something a user will want in the future. For example, a text retrieval system can log what a user has searched for in the past and suggest other documents the user might be interested in, based on those past queries. Most information filtering to date has been content-based, but there may be better methods of filtering.

2.1.1 Collaborative/Social Filtering

Social filtering is a type of information filtering where, instead of filtering on document content, systems filter on similarity of user preferences. By matching a user to other similar users, a system can suggest documents or items that those similar users have commented on in the past. One of the earliest investigations into personal ratings for HCI-type user modeling, by Allen [4], had discouraging results, but later attempts have built on this early work and shown some very successful results.

Malone et al. [61] describe three types of information filtering: cognitive filtering, economic filtering, and social filtering. In an email filtering system, cognitive filtering, often referred to as content-based filtering, involves matching messages to receivers based on the actual content of the message, while economic filtering considers the estimated search cost and benefit of use to the user before suggesting a document.


An example of a content-based filtering system is Rhodes and Starner's Remembrance Agent [83], an automated information retrieval system that watches what a user is currently typing and then scans old email messages, notes, and online documents for something relevant to the user's current interest. The user is not required to do anything within the system; it simply watches in the background and interrupts when relevant information is found. Settings allow users to control the frequency with which document suggestions are made. Suggestions were limited to one "nugget" of information so that they would not interfere too much with the user's current work. Because this system relied solely on text matching, it created many false positives, but since these could easily be ignored, this was not a major problem.

The Information Lens [60] is an information sharing system that uses social filtering to filter email. It filters automatically by matching messages against user-defined rules and performing a specified action on the matching messages. In Information Lens, users fill out a template of specified fields like time, topic, and meeting place for each message they send, and can then write rules to automatically filter their incoming messages based on these fields.

INFOSCOPE [30][91] is a system that filters Usenet news articles (Usenet is explained in Chapter 3) and can be thought of as an extension of Information Lens. The system recategorizes newsgroup messages into virtual newsgroups by matching user profiles or following user-defined rules. A virtual newsgroup is a logical entity containing articles from multiple newsgroups that match some series of patterns specified by the user. Agents monitor user behavior behind the scenes and automatically suggest new virtual newsgroups that might be useful. This system is strictly based on individual users and the information they are dealing with; there is no collaboration between the users of the system.

Collaborative filtering recommendation systems match user profiles and suggest items that similar users recommend. The first of these systems was the Video Recommender System by Hill, Stead,

Rosenstein, and Furnas [46]. The interface to this system was through email: users sent in their ratings of movies they had seen and received back a list of additional movies they might like to see. This was based on the idea of a virtual community of user preferences, matched against a user's likes and dislikes to find similar users and make recommendations. By giving the system an


idea of what types of things a user likes, the system can compare to other users and locate additional movies that the user will probably also like.
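The profile-matching idea behind these recommenders can be illustrated briefly: correlate the active user's ratings with every other user's ratings, then suggest items liked by the best-matched neighbor. The following is a minimal sketch with made-up ratings, not the Video Recommender's actual algorithm:

```python
# Minimal user-user collaborative filtering sketch; all ratings are hypothetical.
ratings = {
    "ann":  {"Alien": 5, "Heat": 4, "Big": 1},
    "bob":  {"Alien": 4, "Heat": 5, "Big": 2, "Speed": 5},
    "carl": {"Alien": 1, "Heat": 2, "Big": 5, "Speed": 1},
}

def similarity(u, v):
    """Pearson correlation over the movies two users have both rated."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    xs = [ratings[u][m] for m in common]
    ys = [ratings[v][m] for m in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Recommend for "ann": items her most similar neighbor liked that she has not rated.
neighbor = max((u for u in ratings if u != "ann"), key=lambda u: similarity("ann", u))
print(neighbor, [m for m in ratings[neighbor] if m not in ratings["ann"]])
# -> bob ['Speed']
```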

Another system that supports this same type of recommendation through user profiles was developed by Shardanand [87]. The system, called Ringo [88], a precursor of Firefly, is a social information filtering system that makes personalized music recommendations. Users could indicate their listening preferences by assigning specific ratings to music. The profile a user generated could then be compared to other users' profiles to determine which users were similar in their musical tastes. Then, recommendations could be made from the combined list of albums that similar users liked.

Resnick et al. [82] designed and implemented another social filtering architecture based on personal ratings and demonstrated its application to filtering net news. In this system, called GroupLens, users rate articles on a numerical scale and the system correlates the existing user profiles to predict which articles a user will be interested in. The system was successfully field tested with about 200 users.

Filtering net news articles can result in one of four cases: hit, miss, false positive, or correct rejection [54]. A hit and a correct rejection are both desirable outcomes in a filtering system. A false positive simply adds noise to the results; most times, human users can easily detect false positives that the system could not. Missing a relevant document, though, can be dangerous if necessary information was contained in the document, so most filtering systems attempt to limit the number of missed documents.

Another filtering system that works on Usenet news is called URN [14][15]. This collaborative Usenet interface allows users to vote on articles and provide keywords associated with those articles. These votes and weights are updated in the system and collated across all users. The system can then determine which articles a user might want to see based on past voting and display them by popularity.

Tapestry [39][93] is a site-oriented email system that allows the entry of text annotations that can be used later to filter messages for other users. Annotations are typically rich in high-quality information, so adding annotations to messages provides useful data and insights about the message content for future reference. Users can search through this email repository by writing SQL-like queries over the text and annotations of the messages. Both querying the system and annotating messages require significant user effort. There is a tradeoff between the quality of the data collected and the user effort required. This is


true of many systems: if users must put forth a large amount of work for little benefit, they are usually less inclined to use the system [40]. We have seen systems that require different types of work from their users. Some simply require a user vote, while others require users to write and attach full textual annotations to the messages in the system. The effort required to annotate documents well is too high. Instead of using high-quality recommendations from a few people, it can be much more useful to have a large number of lower-quality recommendations [28]; by gathering information from many diverse users, we can better predict quality documents.

The system developed by Maltz and Ehrlich [62] supported both active and passive filtering. Active filtering enables users to create explicit recommendations and send them to specific colleagues; this way, users receive recommendations from someone they know personally. In the passive filtering aspect of the system, users can annotate documents they feel someone might be interested in, without sending them to anyone in particular, leaving them in the system for others to discover. Then, when a user happens to be reading a particular document, they can see any comments made by other users.

Answer Garden 2 [2], a slightly different type of social filtering system, is an organizational memory system that provides collaborative help. By capturing the questions and answers between

employees and support staff in an organization, a huge repository of information can be built to assist other employees. When an employee has a question, they can ask the Answer Garden system, and if an answer is not contained in the repository, the question is sent to an escalation agent. This agent goes through a series of steps, each more intrusive, to attempt to get an answer: first a chat room is consulted, then a newsgroup, and finally, if no answer is found, a specific expert is contacted to answer the question. The answer is then added to the repository, and anybody asking a similar question can be given an answer immediately.

We have already seen systems that filter net news and email archives. Another potential source of rich information is bookmark lists. Siteseer [81] is a system that mines personal bookmark lists. Since bookmarks are an implicit declaration of interest in the bookmarked page's content, a count of the number of times a site appears in users' bookmark lists can be seen as a quality ranking for that site. Since


individual users tend to group their bookmarks into folders, the system also attempts to gain more information about a site by observing the site groupings across multiple subjects.

2.2 STRUCTURE IN THE WEB

The World Wide Web is a collection of linked hypertext documents. The underlying structure between web pages can be thought of as a directed graph: pages are represented by nodes, and links between the pages are the edges. A basic intuition derived from this structure is that links often represent an endorsement of the quality and relevance of the linked-to site; thus, link structure can be considered a form of social filtering. Some useful graph properties can now be considered. The out-degree of a node is the number of links going from a particular node to other nodes in the graph. The in-degree of a node is the number of nodes pointing at a particular node.

2.2.1 Hypertext Structure

Many researchers have already done work in analyzing network structure, and these same ideas can be applied to the web [48] in most cases. Network analysis has significant potential to generate insight into the communicative nature of web structures. The structure of the web can be an important component to investigate when dealing with web sites. In fact, links to another web page can be considered an endorsement of that web page.

Citation links are another area where structure has been investigated thoroughly. Butterfly [59] is a system that accesses DIALOG's science citation databases and performs a purely structural analysis to support users in managing collections of information resources. The central UI object is a butterfly, which represents one article, its references, and its citers. By utilizing the citation structure, the interface makes it easy for users to browse from one article to a related one, group articles, and generate queries to retrieve articles that stand in a particular relationship to the current article.

There are two important issues to consider when building a hypertext structure: navigation and viewing. When building the structure, it is important to keep in mind how easy it will be for users to navigate and view the information contained within it. Furnas [31][33] studied the requirements for building effectively view-navigable structures. He found that the out-degree should be small with respect to the overall size of the structure. This means that nodes in the hypertext graph should not


point to every other node, but rather to a small subset of nodes. In addition, the distance between nodes should be kept small with respect to the structure: for related nodes there should be a short path from one to the other, rather than having to traverse the entire graph. On the web, we have no control over this structure, but luckily, so far, it meets both of these criteria.

Botafogo et al. [12] developed a number of algorithms for analyzing arbitrary networks, splitting them into structures (pre-trees, hierarchies) that are easier to visualize and navigate. These aggregate structures are inferred by identifying articulation points in the undirected graph and removing them to create a set of subgraphs. An articulation point is a node whose removal, along with its edges, would disconnect the graph into two or more components. The algorithm removes indices (nodes with many out-links) and references (nodes with many in-links) because these nodes cause over-connection in the graph, and to have good articulation points the graph must not be highly connected. We will see later that these two types of nodes (indices and references) are an important part of the web structure.

Pirolli et al. [75] have done some work on categorizing pages in a hyperlink structure. Their categorization algorithm uses hyperlink structure, text similarity, and user access data to categorize web pages into various functional roles such as head, index, and content. These functional roles are used to extract structures from the web, determined using spreading activation from a set of user-provided seed pages. Their system follows the links from the seed pages and allows the structure to grow continually as more pages are evaluated. A head page is the front page of a site.

Nodes that have a high in-degree, and thus many links pointing at the page, are considered content pages. Finally, index pages contain a large number of links to other pages, most often content pages. These algorithms were tested on the Xerox PARC web site and were shown to categorize the pages with very good accuracy.

The web is very large, and it is not possible to ensure that everything is structured well. Order can be imposed at a local level, but global organization is unplanned [37]; the high-level structure of the web emerges only after later analysis. Another analysis of web structure, distinguishing two types of web pages, was developed by Kleinberg et al. [52][53] in an effort to gain information about web sites to aid user comprehension. The algorithm is called HITS (Hyperlink-Induced Topic Search). The two main types of pages are hub pages and authoritative pages, and the two terms are mutually dependent: a good hub is one that links to many authorities, and a good authority is one that is linked to by many hubs.


Authorities and hubs, when isolated in a graph, should form dense bipartite communities; that is, one set of pages (hubs) will point to the other set of pages (authorities). This analysis uses co-citation to categorize the pages in the structure, clustering pairs of documents based on the number of times they were both cited by a third document.

Several researchers have extended this basic algorithm. Chakrabarti et al. [23][24] weight links based on the similarity of the text surrounding the hyperlink in the source document to the query that defined the topic. Bharat & Henzinger [10] made several important extensions. First, they weighted documents based on their similarity to the query topic. Second, they count only links between documents from different hosts, and average the contribution of links from any given host to a specific document; that is, if there are k links from documents on one host to a document D on another host, then each of the links is assigned a weight of 1/k when the authority score of D is computed. In experiments, they showed that their extensions led to significant improvements over the basic authority algorithm. PageRank [74] is another link-based algorithm for ranking documents. Like Kleinberg's algorithm, it is an iterative

algorithm that computes a document’s score based on the scores of documents that link to it.
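The iterative computation at the heart of HITS can be sketched in a few lines. The following is a minimal illustration over a toy link graph; the graph and the fixed iteration count are assumptions for demonstration, and real implementations add refinements such as the text-based link weighting and the 1/k host-based averaging described above:

```python
# Minimal HITS sketch: mutually reinforcing hub and authority scores.
links = {            # page -> pages it links to (toy graph, not real data)
    "A": ["C", "D"],
    "B": ["C", "D"],
    "C": ["D"],
    "D": [],
}

hubs = {p: 1.0 for p in links}
auths = {p: 1.0 for p in links}

for _ in range(20):  # iterate toward the fixed point (fixed count for simplicity)
    # A page's authority score sums the hub scores of pages linking to it.
    auths = {p: sum(hubs[q] for q in links if p in links[q]) for p in links}
    # A page's hub score sums the authority scores of the pages it links to.
    hubs = {p: sum(auths[q] for q in links[p]) for p in links}
    # Normalize so the scores do not grow without bound.
    a_norm = sum(v * v for v in auths.values()) ** 0.5
    h_norm = sum(v * v for v in hubs.values()) ** 0.5
    auths = {p: v / a_norm for p, v in auths.items()}
    hubs = {p: v / h_norm for p, v in hubs.items()}

print("authorities:", auths)  # D (and C) emerge as authorities
print("hubs:", hubs)          # A and B emerge as hubs
```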

Another project has concentrated on new techniques for inducing clusters of related documents on the web. Pitkow and Pirolli [78] describe algorithms that find lawful properties of document behavior and use. These methods again start with co-citation analysis, but then use a desirability ranking of pages to improve the clusters.

Two kinds of pages recur throughout the above work on hypertext structure: pages that are heavily linked to, and pages that point to many other pages. The first of these have a high number of in-links and have been named content, reference, and authoritative pages. The second type have a high number of out-links and are called index and hub pages.

2.2.2 Using Structure in Tools

After a thorough analysis of the structure, the next step is to use it to help users find the information they are seeking. By incorporating structure analysis into tools aimed at finding web pages and collections, we can improve the efficiency of searching on the web.


A number of researchers have created interfaces to support users in managing collections of information resources. SenseMaker [8] focuses on supporting users in the contextual evolution of their interest in a topic. It attempts to make it easy to evolve a collection, e.g., expanding it by query-by-example operations or limiting it by applying a filter. In addition, Mukherjea et al. [71] designed algorithms for analyzing arbitrary networks, splitting them into substructures that are easier for users to visualize and navigate.

Pirolli and Card [77] define the term information foraging to cover activities associated with assessing, seeking, and handling information sources. The main idea behind this metaphor is that systems need to adapt their designs to the context of the information users are seeking and the tasks that will be performed with that information. Depending on the task at hand, systems need to adapt to different

information foraging strategies. In today's information-rich world, the design problem is no longer how to collect more information, but how to optimize a user's time and increase the relevant information gained. Again, this goes back to the tradeoff between the value of information obtained and the cost of performing the search activity [40].

Scatter/Gather [76] is a cluster-based browsing technique for large text collections, based on information foraging theory. The interface presents summaries of clusters of similar documents,

allowing the user to navigate through the topic structure. Gathering means selecting individual clusters that are of interest; scattering is the process of reclustering the selected clusters to reveal more fine-grained clusters of documents. This type of interface supports browsing a collection of documents rather than searching it, and is aimed at satisfying the user's need to learn about the collection in general before looking for specific documents. It gives users a chance to iteratively reveal the topic structure of the collection and eventually locate desired documents.

Two additional systems developed in support of information foraging theory are the WebBook and WebForager [20]. The WebBook uses a book metaphor to group a collection of related web pages into a compact unit for viewing, storing, and additional interaction. The WebForager lets users view and manage multiple WebBooks on their desktop. The collections of web pages that make up a WebBook can be generated using a set of automated methods provided by the system. Typical methods of building collections include: following all links from a page one level, following relative links from a web page,


extracting book-like structures by following previous and next links, and grouping pages returned from a search query.

Another browsing method for viewing pages on the web was developed using multiple hierarchical windows. Kandogan et al. [50] have shown that, through the extensive use of single-user operations on multiple windows, their Elastic Windows browser provides an efficient overview and sense of current location in information structures. Their interface facilitates the organization and filtering of information and aids users in accessing previously visited pages without high cognitive demands. As users' goals change, they can quickly organize, filter, and restructure the pages on the screen using this browser.

The Navigational View Builder [70] is a tool to effectively build overview diagrams of the hypertext structure behind the web. It uses binding, clustering, filtering, and hierarchization to accomplish this task. Binding is done first, to bind the information attributes to the visual attributes of the nodes and links in the structure. Clustering provides abstracted views that show the overall information space on a single screen, by analyzing the structure and the content of pages. Filtering reduces the amount of information on the screen by specifying relationships in the links, or specific content, to filter out. Hierarchization is performed on the resulting set of pages by inferring the hierarchy from the content and underlying structure. While this system attempts to do all this work automatically, the authors admit that they had to manually enter many useful semantic attributes that could not be extracted automatically. One of these attributes was the page topic, which we will show later can be semi-

automatically generated.

The web is unlike traditional hypertext systems in that it is both redundant and incomplete. On the web, when there is no link between two pages, that does not mean they are unrelated; it simply means they have yet to be linked. The web also contains many pages containing the same information; in traditional hypertext systems, this would not be true. Spertus [88] states that content search alone is lacking and that, because of the untraditional nature of the web, new techniques are necessary. ParaSite is a system that analyzes the links between web pages to find additional pages related to a given set of pages, and to infer the topic and function of the pages seen along the way.

Google [16] is another system that crawls and indexes the web, making use of the structure to provide more satisfying search results. Each page encountered in a crawl in this system is assigned a page


rank that consists of the in-links and out-links and the similarity of anchor text and page text; these results are returned in response to a search query.

Improving search results is also the goal of WebQuery [22], which builds a graph of links and nodes from an initial search engine result set and extends it, assigning the highest rank to the most highly connected nodes. This system is unique because it allows users to visualize the results in a number of ways: cone trees, 2D graphs, 3D graphs, lists, and bullseyes. For large sets of web pages, cone trees provide the best view because they make excellent use of screen real estate. A 3D graph is the best view when there are fewer nodes with similar connectivity. The bullseye view helps to draw attention to the most highly ranked node and allows nodes to be selected, bringing them to the front and displaying their relationships. These layout possibilities all serve different information-seeking needs.

The structure within a single web site can also improve navigation for users trying to view large web sites. MAPA [29] is a system for inducing and visualizing the hierarchical structure within a web site. It extracts the structure and builds an interactive map of the site to use for navigation. A walker gathers information about individual pages within the site and then organizes the total link topology to make the interactive map visualizations. Lamping et al. [55] explored hyperbolic tree visualizations of information structures. Another system that aids in mapping a single web site is WebCutter [58]. This system has tightly integrated search- and browse-oriented information discovery tools that interactively crawl through a site to generate visualizations. Like other systems described above, WebCutter allows the user to explore the information using a few different visualizations: a tree control, useful for abstraction; an ellipsis (star-like) layout, for pursuing incremental exploration; and a fisheye view, for focusing on different regions of the graph.

2.2.3 Web Crawling

A more specific application of hypertext structure is web crawlers. By crawling through the structure of web pages, we can collect information about pages and build a graph of page links. One way to improve the efficiency of methodically crawling through a set of web pages is to dynamically vary the order in which pages are visited. There is an optimal order in which a crawler should visit URLs (Uniform Resource Locators) to obtain the most important pages first. Also, since the web is


very large and not all URLs can be visited in a reasonable amount of time, we want to visit first the pages that add the most to a crawl. Cho et al. [26] investigated metrics to use when updating the URL ordering for a crawl, including query similarity, backlink count (the same as in-links), page rank, and a location metric. The page rank is similar to an in-link count, except it is weighted to consider the in-link count of each page pointing at the site. The location metric attempts to categorize sites by looking at the URL itself to determine whether a site is a homepage, a commercial site, or one of a number of other page types. Of these metrics, page rank works best.

A user study was conducted on the ARC system (Automatic Resource Compiler) [23][24]. A list of authoritative web sites on a topic was compiled using this automated system by performing three tasks: search and growth; weighting; and iteration and reporting. The search and growth phase followed links one level from each node to grow the set of seed sites into an extended set. Then each page was weighted and the process repeated. Finally, the results were reported in a sorted list. The study compared the results of the ARC system with Yahoo and Infoseek. The lists were presented to the users in their original form, including the title of the search engine that generated them, and users gave subjective rankings of the three lists. The results showed that ARC was able to produce a list that was almost competitive with Yahoo's and Infoseek's lists, and occasionally produced a slightly better list. We will show later that an enhanced crawling method similar to this, along with an efficient interface tailored to this task, can actually produce significantly better results.

Miller and Bharat [65] developed a framework for site-specific web crawlers. SPHINX is a Java toolkit and interactive development environment to support users in creating maps of a single site. Users can customize the crawls by using classifiers that analyze the content of the site's pages and categorize them specific to the particular topic.
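The ordering idea, visiting the most promising URLs first, can be sketched with a priority queue keyed on a simple importance metric such as backlink count. This is a minimal illustration under assumed data; a real crawler recomputes the ordering as new links are discovered and may use richer metrics such as page rank:

```python
import heapq

# Hypothetical backlink counts for URLs on the crawl frontier.
backlinks = {
    "http://popular.example/": 120,
    "http://midsize.example/": 40,
    "http://niche.example/": 3,
}

# Max-heap via negated counts: the highest backlink count is visited first.
frontier = [(-count, url) for url, count in backlinks.items()]
heapq.heapify(frontier)

while frontier:
    neg_count, url = heapq.heappop(frontier)
    print(f"visiting {url} (backlinks: {-neg_count})")
    # A real crawler would fetch the page here, extract its links,
    # update backlink estimates, and push newly discovered URLs.
```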

2.3 WEB PAGE ARCHIVING

As users browse information on the web, they need to keep track of quality sites they have seen

that may be useful to them in the future. Most browsers have some type of archiving capability, called bookmarks or favorites in current popular browsers. These are usually very primitive, often

consisting simply of a method to mark pages of interest for later retrieval and the ability to group the pages


in folders. As this list of marked pages gets large, it becomes difficult to handle. Users who browse the web often need better methods of archiving the best information found in their web sessions.

Abrams, Baecker, and Chignell [1] carried out a study of several hundred web users who used bookmarks. Bookmarks are a very popular way to create personal information spaces of web resources. They observed a number of strategies for organizing bookmarks, including a flat ordered list, a single level of folders, and hierarchical folders. They also made four design recommendations to help users manage their bookmarks more effectively. First, bookmarks must be easy to organize, e.g., via automatic sorting techniques. Second, visualization techniques are necessary to provide comprehensive overviews of large sets of bookmarks. Third, rich representations of sites are required; many users noted that site titles are not accurate descriptors of site content. Finally, tools for managing bookmarks must be well integrated with web browsers. These four design goals are important to consider when creating user interfaces such as TopicShop.

One of the first systems to concentrate on bookmarks was the Group Asynchronous Browsing system [102]. It is a collaborative system that merges web sites from multiple personal bookmark lists and even different web-based topic directories. Based on the concept of a multitree, general bookmark lists, which may be further categorized into folders, are combined on the server. Users can then query the server to specify a subset of trees from the large multitree database. The system generates an HTML document listing the web sites matching the query and any cross-references linking to other related sites and bookmark files.

WebTagger [51] is a personal bookmarking service that provides individuals and groups with a customizable means of organizing and accessing web-based information resources. This system is a

collaborative bookmarking system based on some of the ideas in the Group Asynchronous Browsing system. By sending bookmarks to the system during a browsing session, users can later retrieve bookmarks from the large repository by querying the system. The returned results list shows categories for each web page and allows the user to rate the results, so the system can retrieve better-quality sites corresponding to the feedback the user has given.

Automating bookmarking by keeping a history is another method that can be employed to help users with this task. Takano's dynamic bookmark tool [92] supports revisiting past web pages. The system automatically watches and archives a user's navigation behavior and shows the analyzed results


as clues for which URLs to revisit. Not only does this system allow users to find past sites they are interested in, it also supports users in finding URLs that they have visited before but did not realize were important enough to explicitly add to their bookmark list.

Bookmarks are typically gathered opportunistically, as users happen to encounter interesting sites, and bookmark files usually span many different topics. We are more interested in situations where users are explicitly engaged in gathering and organizing a collection of related resources for a specific topic. Our systems attempt to support users in performing this activity.

The Data Mountain of Robertson et al. [84] represents documents as thumbnail images in a 3D virtual space. Users can move and group the images freely, with various interesting visual and audio cues used to help users arrange the documents. In a study comparing the use of Data Mountain to Internet Explorer Favorites, Data Mountain users retrieved items more quickly, with fewer incorrect or failed retrievals.

Hightower et al. [42] based their work on the observation that users often return to previously visited pages. They used Pad++ [9] to implement PadPrints, browser companion software that presents a zoomable interface to a user's browsing history. Interfaces to browsing history reduce the need for users to create collections of items explicitly, although the problems of organizing a collection are the same however it is obtained.

2.4 INFORMATION WORKSPACES

After evaluating items and selecting the interesting ones, users must organize the items for future use. Card, Robertson, and Mackinlay [19] introduced the concept of information workspaces to refer to environments in which information items can be stored and manipulated. A departure point for most such systems is the file manager popularized by the Apple Macintosh and later by Microsoft Windows. Such systems typically include a list view, which shows various properties of items, and an icon view, which lets users organize icons representing the items in a 2D space. Mander, Salomon, and Wong [64] enhanced the basic metaphor with the addition of "piles": users could create and manipulate piles of items. Interesting interaction techniques for displaying, browsing, and searching piles were designed and tested in an experiment investigating this issue.


Marshall & Shipman's VIKI system [66] lets users organize collections of items by arranging them in 2D space. Hierarchical collections are supported, and later extensions [89] added automatic visual layouts, specifically non-linear layouts such as fisheye views [34].


CHAPTER 3: PHOAKS SYSTEMS

3.1 INTRODUCTION

Usenet (User's Network) news is full of pointers to useful resources, but because of its immense

size it is not always easy to find the best and most reliable ones without manually sifting through many non-relevant messages. We have developed the PHOAKS (People Helping One Another Know Stuff)

system at AT&T Labs to address the problem of constructing collections of web pages, by scouring Usenet news and keeping a database of all web pages that have been mentioned in its everyday conversations. The basic premise of PHOAKS is that an effective way to find good information resources (web sites) about a given topic is to ask experts in that topic. Since users of Usenet newsgroups are already carrying on discussions about thousands of topics, there is a large body of information available for finding recommendations of quality resources (web sites, downloadable files, etc.) on the Internet, without requiring any additional work from the users.

The typical user searching for information could, as one of many search methods, read through a newsgroup and look for relevant resources. But newsgroups have enormous amounts of traffic, and this could be a time-consuming task. Some of the more active newsgroups have thousands of posted messages per day, and most people do not want to sift through that many messages to find what they are looking for. An agent such as PHOAKS eliminates much of this work by automatically sifting out resources from all the messages posted and presenting them to the user.


3.1.1 Usenet News

Usenet is a large, distributed repository for message exchange among interested users on the

Internet. It can be thought of as a global Internet bulletin board. It is subdivided into many topic areas, and users posting messages decide where their message fits best. The network topology of Usenet is distributed over many servers around the world. A local user posts a message to their local server, and from there the message propagates out and eventually reaches all other news servers in the world. Reading messages is also done from a local server that has received messages from other servers. Due to the distributed nature of Usenet, there can be a lag between when a message is posted and when different servers around the world receive it. This may even lead to some machines receiving a reply to a message before the original message itself, because of the paths the messages followed within the distributed network.

There are over 23,000 newsgroups in Usenet news (taken from logs of innd, a common Usenet news server). These are the topics into which all messages are divided. The structure of the subdivided topics is a very large hierarchy: at the top level there are about 600 broad categories, and each level deeper into the hierarchy leads to more specific topics. There are eight major top-level topic hierarchies, comprising approximately 6900 groups:

• alt (alternative) [4641 groups, ~21%]: Almost any topic can appear in this hierarchy.
• comp (computers) [903 groups, ~4%]: Related to computer hardware and software.
• misc (miscellaneous) [135 groups, ~0.61%]: Themes not easily classified into the other hierarchies.
• news [30 groups, ~0.14%]: Concerned with the news software, network, and administration.
• rec (recreation) [708 groups, ~3%]: Discussions oriented toward hobbies and recreational activities.
• sci (science) [205 groups, ~0.93%]: Discussions of research or applications of science.
• soc (society) [264 groups, ~1.2%]: Topics relating to social issues and world cultures.
• talk [29 groups, ~0.13%]: Geared toward debates; topics are usually very open-ended.

Some example groups are rec.boats, alt.sports.hockey.nhl.ny-rangers, and comp.lang.java.


There are other top-level topics; many of them are foreign topic hierarchies for other countries. The alt tree is different from the other trees in the way groups are added to it. In most newsgroup hierarchies, users can make requests for new groups; the requests are voted on and eventually approved by an administrator, and the groups are added. But in the alt tree, anyone can simply add a group. This leads to a very wide variety of topics that might not make it into one of the other hierarchies. Due to the ease of group additions, the alt tree typically has discussions of big news stories, sometimes minutes after the events have occurred.

Social filtering can help determine which web sites mentioned in the messages are most important for the topic under which they are posted. By systematically counting the number of times a web site is mentioned within a newsgroup, we can gather a list of the most talked-about sites for each newsgroup. This list can be used to rank the sites and show a user which sites were most highly recommended by the community of users participating in the newsgroup discussion. PHOAKS was developed to implement this idea by constantly monitoring newsgroups and storing in a database all web resources mentioned in the discussions.

3.1.2 Frequency of Mention in Public Conversation

The metric that PHOAKS uses to determine which web sites are most popular within a newsgroup is frequency of mention. The social data provided in Usenet news, in the form of messages posted by users, can be used to determine which URL mentions the users of each newsgroup have referenced most often. Counting one vote per distinct person posting a message with the URL mention provides a frequency count for URL mentions. This prevents users from posting multiple times about a site to manipulate the system and move their favorite page higher on the recommended resource list for a newsgroup. Currently, PHOAKS requires a threshold of only one vote for a resource to make it onto the frequency page, which is a list of the top forty resources for a newsgroup ranked by frequency of mention. A better approach might be to accept a resource only when at least two distinct people have recommended it; this would help eliminate the spam and automated posts that are found throughout Usenet. The main presentation page of PHOAKS shows frequency counts for web resources gathered from newsgroups.


Another order of presentation is recency of mention. Using this ordering, PHOAKS presents web resources that users are currently talking about in a newsgroup. The recency view of PHOAKS lists all resources recommended in the most recent posts that PHOAKS has come across, in descending order by date and time of the post.

A combination of recency and frequency is also available in PHOAKS. This allows a moving time window, so that only resources recommended within a specified time period are included in the presentation of the top recommended resources list. Since PHOAKS started running in October 1997, many pages have built up a large number of recommendations, which makes it more difficult for newer resources to reach the top of the frequency-ordered list in high-traffic newsgroups. A moving time window allows users to specify that they would rather see more current recommendations and information. Since web sites for many topics are rapidly changing, the best pages may be ones that have a fair number of recent recommendations instead of a large number of old recommendations. Of course, some pages that remain around for a long time are still the best sources of information available, but those pages will probably be recommended continually.

3.1.3 Classification Rules: Development & Iterative Refinement

Each URL mention in a net news post is classified by PHOAKS into one of a number of categories by applying a set of classification rules. By manually reading through a few thousand posts, we generated an initial representative set of categories:

- Private – mentions in messages with the private header field set to true
- PHOAKS URL – any mention of a PHOAKS web page
- Spam – URLs mentioned in more than 40 newsgroups in the same message
- Kill – mentions of an URL on a list of system-definable undesirable sites
- Quoted – mention was inside a quoted area of text
- Code – mention was part of a source code sample
- Signature – mention was part of a user’s signature
- Organization Signature – mention was part of a user’s signature and was of the posting organization
- URL in Signature – mention was part of a user’s signature and contained the user’s email name
- Approved FAQ – mention within an approved FAQ
- Unapproved FAQ – mention within an unapproved FAQ
- URL in FAQ – mention in a message that appears to be a FAQ but is not specifically tagged as one
- Self Recommendation – mention is a recommendation of the user’s own site
- Recommendation List – mention is a recommendation within a list of two or more
- Recommendation – mention is a recommendation
- Other

PHOAKS uses the categorization to determine which URLs will be counted as recommendations. Currently the categories used in frequency calculations are Recommendation, Approved/Unapproved FAQ, and URL in FAQ. Each of these categories consists of cases where the message poster is recommending an URL. For example, in a list of frequently asked questions, the URLs found there are usually answers to specific questions and can be considered recommendations.

The rule set that PHOAKS uses to determine an URL’s category has gone through three iterations. An initial set of rules was created using rule-learning software called RIPPER [27]. This software takes as input a set of text samples along with a list of features and a result to conclude about each sample (in our case, categories). RIPPER then analyzes the features and develops a boolean combination of features that best predicts the desired result for each sample. Two independent raters manually classified a set of 200 URL mentions. Ten percent of these URLs were used to initialize RIPPER, and the rest were used by RIPPER to learn a rule set that produced results similar to the examples given. Once these initial rules were created, they were applied to another independent set of URLs randomly sampled from PHOAKS. These URLs were also manually classified, and the results were compared to the automated classification, leading to a more refined set of rules. Finally, one more iteration was performed, and once the rules were able to predict the categories of URL mentions with about 85% accuracy, they were used in the PHOAKS system.

There are two aspects of rule accuracy: precision (the percentage of URLs that the rules classify into a certain category that actually belong in that category) and recall (the percentage of URLs belonging to a category that the rules classify into that category). For the current set of rules in PHOAKS, precision is 88% and recall is 87%, with an inter-rater reliability of 88%, as determined by applying the rules to a sample set of URLs and comparing the results to human categorization performed by two independent raters. It is much more important that the rules filter out false positives. A few false negatives are acceptable, because there are enough data coming through Usenet daily that these will tend to be overcome. But false positives lead to incorrect recommendations and must be kept to a minimum.
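The two rule-accuracy measures can be made concrete with a short sketch (illustrative; not the actual evaluation code used for PHOAKS):

    # Precision and recall of rule classification for a single category.
    def precision_recall(rule_set, rater_set):
        # rule_set: URL mentions the rules placed in the category
        # rater_set: URL mentions the human raters placed in the category
        true_positives = len(rule_set & rater_set)
        precision = true_positives / len(rule_set) if rule_set else 0.0
        recall = true_positives / len(rater_set) if rater_set else 0.0
        return precision, recall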

3.2 PHOAKS ARCHITECTURE

The PHOAKS system architecture was carefully designed to be general enough to support many different text filtering and collection tasks. There are three main parts of PHOAKS: filtering, categorizing, and disposition. New functions can be plugged in to perform these three tasks to create a new system with different behaviors. Examples of tasks this architecture is capable of supporting are URL filtering from net news (described below), FAQ collection, and personal mail filtering.

3.2.1 PHOAKS News Agent

3.2.1.1 Filtering

The first part of PHOAKS extracts recommendations of web resources from Usenet messages. To do this, PHOAKS searches through every message of net news looking for a pattern (http://) that indicates the following text is an URL. Any message that contains binary data is ignored, because these messages are typically long and take too much time to filter. Messages containing binary data do not normally contain URL mentions other than in the message header, so not many mentions are missed.

To allow general searching in the filtering module, PHOAKS searches for a textual pattern or a regular expression. The pattern recognizers allow boolean operations, so a pattern can consist of multiple phrases, each assigned a different weight. Different sections of the message or text can also be specified, to reduce search time. Sometimes it may only be necessary to search for patterns in the subject header, while other times patterns should be sought throughout the entire message body, depending on the application.
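A minimal sketch of the URL-extraction step is shown below (illustrative; the real PHOAKS filter supports weighted boolean patterns and per-section searching as described above, and its binary-data check is certainly more robust than the one assumed here):

    import re

    URL_PATTERN = re.compile(r'https?://[^\s<>")]+')

    def extract_urls(message_body):
        # Skip messages carrying binary data, as PHOAKS does; here that is
        # crudely approximated by checking for uuencoded content.
        if "\nbegin 644 " in message_body:
            return []
        return URL_PATTERN.findall(message_body)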


3.2.1.2 Categorization

Next, PHOAKS must classify every URL mention into one of the categories described above. This is done by first performing a sort of tokenization of the message. We developed a set of features describing aspects of each message that we thought were important for determining the category of its URL mentions (e.g., the URL Block feature corresponds to the block of text surrounding an URL mention). While developing the rules described above, additional features were added so messages could be classified into the most appropriate categories. These features are used in conjunction with the rules to categorize URL mentions.

Combining these syntactic features into rules gives a systematic method of categorizing URL mentions from the messages. For example, if the URL is within 20 lines of the end of the message and occurs after a double dash, the category is Signature. If there are no special characters to set off the signature, another rule looks for standard signature items like email addresses, phone numbers, etc.; if the URL occurs toward the end of a message and appears alongside any of these signature items, it is also categorized as a signature.


3.2.1.3 Disposition

Finally, after all other processing is finished, the message data and its category are stored in a database for later retrieval. At this point the HTML (Hypertext Markup Language) for each URL mentioned in the message is fetched. A title is taken from the page text and stored for use as the display name for the resource, and a reduced representation of the page text is kept so that a search index can be built on it. This process not only provides valuable information but also allows PHOAKS to ensure that the URL is a valid resource; if it is not, it is marked as unfetchable and not checked again until the next iteration of the system. The web pages for each URL mention are continually checked to make sure that they are current. There is also the occasional problem of network lag and unavailable servers, where a resource may appear to be invalid; if a resource once existed and is no longer available, its absence must be confirmed five times before the resource is dropped from PHOAKS.

Since the web is constantly changing, sites tend to move frequently. If a site move is done in a standard way, such as adding a field in the HTTP header or using a meta-refresh tag, PHOAKS can determine that the two sites should be equated and combines the data for the two records. If, on the other hand, no forwarding information is left, or some non-standard method of guiding users to the newly moved site is used, then PHOAKS will track two different sites. Since the titles of the web pages are usually the same in these cases, PHOAKS can infer that the sites are the same and will list them together when presenting the information to users. However, the site information is not combined and does not affect the frequency count for either site.

Figure 3.1: PHOAKS Web Interface


3.2.2 Web Interface

The last component of PHOAKS displays information from the database to users in the form of web pages (shown in Figure 3.1). This system was designed to be easily extended and updated. It incorporates a template language, so site maintainers can build page templates from which web pages are generated dynamically. This page definition language is an extension of HTML that adds iteration and conditional constructs and a set of variables specific to PHOAKS data. The language makes it easy to describe, for example, a resource summary page as an iteration over all recommended resources for a newsgroup.

There is also a software layer in PHOAKS that fills database requests from the dynamically created web pages. When a page is requested by a PHOAKS user, the template is checked and any constructs and variables are translated into database requests. The database layer then queries the database and returns all requested items, and finally the page is created. Since speed of presentation of web pages was important, we developed a caching feature that pre-caches popular pages for each newsgroup and keeps a cached copy of any page that a user has requested (as long as it is still valid). When a page is requested, a CGI (Common Gateway Interface) script first checks whether the page has been cached. If it has, the page is simply displayed in the browser. If not, the page is generated, cached, and displayed. Since 75% of the pages accessed by users of PHOAKS are resource summary pages and index pages, these are the pages that are pre-cached every time the database is updated for a newsgroup. This keeps simultaneous database accesses to a minimum and lets users get commonly accessed pages back immediately.
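The cache check can be sketched as follows (a simplified illustration; the file locations and function names are invented, and the real system served pages through CGI scripts):

    import os

    CACHE_DIR = "/var/phoaks/cache"            # hypothetical cache location

    def get_page(newsgroup, page_name, generate_page, is_valid):
        path = os.path.join(CACHE_DIR, newsgroup, page_name)
        if os.path.exists(path) and is_valid(path):
            with open(path) as f:              # cache hit: serve the stored copy
                return f.read()
        html = generate_page(newsgroup, page_name)  # expand template, query DB
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w") as f:             # cache for subsequent requests
            f.write(html)
        return html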


3.3 LESSONS LEARNED

PHOAKS effectively solved the problem of automatically collecting quality web sites about a topic. In addition, we showed that Usenet messages are an abundant source of recommendations of web pages, that recommendations can be recognized automatically with high accuracy, and that there is some correlation between the number of recommenders of a web page and other metrics of web page quality. However, there were a number of aspects of PHOAKS that needed improvement.

The basic unit of the items recommended by PHOAKS was the web page. However, for many purposes the web page is the wrong unit of information. The World Wide Web consists of many web sites: coherent, structured multimedia documents consisting of many individual web pages. PHOAKS would often contain recommendations of multiple web pages within a single web site. Clearly, recommendations for these pages could be aggregated to count as recommendations for the common web site they are part of. We want to group web pages into sites and present the consolidated structure of each web site to users; but we also must keep the original pointers to individual parts of the web site so that we can indicate which areas of a site were more popular.

A general goal of PHOAKS was to collect as many relevant web pages as possible while including few non-relevant pages. Because PHOAKS monitored newsgroup discussions, and some off-topic web pages were mentioned within newsgroups, PHOAKS sometimes collected web pages that did not concern a particular newsgroup’s topic directly. Another common occurrence in newsgroups is the posting of general publications across many newsgroups; these publications may contain a few resources relevant to a given newsgroup and many that are not. The opposite situation also arose. Since PHOAKS had a “no self-promotion” rule, web pages mentioned by their own site maintainers were not included, and in a few of these cases, because the page had already been mentioned in the discussion, additional users felt no need to repeat the recommendation, so the page was not included in PHOAKS. In future designs, we want to filter out irrelevant resources and include more relevant resources.

PHOAKS was based mainly on a single ranking metric: the number of distinct individuals who recommended an item. This metric is useful for many purposes, but situations arise where users need additional metrics to evaluate sites. By including numerous ways of comparing web sites, we can meet this need and also help eliminate the case where a quality site is excluded because of a low ranking on a single metric.

The main representation of a web page in PHOAKS is its title, which may not always be the best way to communicate what a page is useful for. It is our goal to construct representative profiles of web site content and structure that make it easy for users to evaluate sites, helping them determine both site quality and function.

Finally, there was no information workspace included in PHOAKS. We found that users had a desire to define and organize personal collections of web resources. In future interfaces we want to implement an information workspace that allows users to easily manage their resources and makes it easy for them to share their collections with others.


CHAPTER 4: TOPICSHOP SYSTEMS

4.1 WEB CRAWLING

There are many different sites on the web for any given topic, and an alphabetized list of all known sites is rarely the best method for finding useful information. The inherent link structure of the web can be used to gain further information about web sites. By following all hypertext links on a web site, a topic crawl can be generated for all sites linked to by a particular site. Continuing the crawl deeper into these sites eventually yields a large body of topically related sites that can be analyzed and presented to a user. This is based on the assumption that quality sites point to other relevant, quality sites. Since site designers have in theory already put effort into filtering out poor-quality sites and linking only to quality sites, a crawl can simply follow links to build a better representation of the scope of sites for a given topic.

The basic unit used in many search engines is the web page. While this may work for very specific topics, many times users need to be guided to appropriate sites containing information on a variety of sub-topics. In building topic crawls, the basic component we use is the web site rather than the web page. A site contains a coherent body of content on a given topic and is divided into pages, usually grouping related information, to ease navigation. The pages that make up a site can be roughly sorted into three basic categories: navigation pages, content pages, and links pages. Navigation pages provide structure to the site content by giving indices and tables of contents that a user can click on to find further information.


One navigation page usually represents the top or front page of the site and provides the starting point for navigating the rest of the site. This page is also commonly called the index page and is intended to be the first page a user sees when viewing the site. Content pages contain information about the topic that the site represents. Links pages usually do not add any additional content of their own and are simply collections of links to other sites related to the topic. Of course, not all sites follow this format, but many use this or a similar structure to give users easier ways to maneuver through the site.

Pages are grouped into sites using heuristics that look at the directory structure of URLs (a sketch of this bookkeeping appears below). For example, if the crawler encounters a link to the URL http://a/b/page1.html, and http://a/b/index.html is a site known to the crawler, it records this URL as part of the site. Further, if the link was encountered while the crawler was analyzing the site http://x/y/, a link is recorded from the site http://x/y/ to the site http://a/b/index.html.

Users can generate topic crawls by giving the crawler a user-defined set of seed pages. These seeds can be obtained in various ways: a list of pages the user already knows about, the output of a search engine, or a list of URLs from PHOAKS. The crawl starts from these seed sites and follows the links found on the seed pages, fetching the HTML of the corresponding page for each link and analyzing its content. This process of analyzing content and following links continues for all pages within two links of a seed page. If links are internal to the site (point to a page on the same site), the pages are added to the collection of pages already found for the site, and the internal site structure is slowly revealed as the crawl progresses. Links that are external to the current site (point to a page on a different site) add to the known sites about the topic. As more sites are visited, the inter-site structure is recorded during the crawl. A link does not have to point to the top page of a site; it can go from any page on one site to any page of another site. The resulting structure can be thought of as a directed graph with vertices representing sites and edges representing links between sites, called a site graph.
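A minimal sketch of this grouping and site-graph bookkeeping is shown below (illustrative; the actual crawler’s heuristics, implemented in Java, are more elaborate):

    from collections import defaultdict
    from urllib.parse import urlparse

    class SiteGraph:
        def __init__(self):
            self.site_pages = defaultdict(set)   # site root -> pages in the site
            self.links = defaultdict(set)        # site root -> linked-to site roots

        def site_of(self, url):
            # Treat the directory portion of the URL as the site root, e.g.
            # http://a/b/page1.html belongs to the site rooted at http://a/b/.
            parsed = urlparse(url)
            directory = parsed.path.rsplit('/', 1)[0] + '/'
            return parsed.scheme + '://' + parsed.netloc + directory

        def record_link(self, from_url, to_url):
            src, dst = self.site_of(from_url), self.site_of(to_url)
            self.site_pages[src].add(from_url)
            self.site_pages[dst].add(to_url)
            if src != dst:                       # external link: inter-site edge
                self.links[src].add(dst)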


Our crawl uses a clan graph as the primary information structure. A clan graph is a directed graph where nodes represent documents and edges represent references between them. A local clan graph is the subgraph whose nodes are closely connected to the user-specified set of seed sites. Building on concepts from social network analysis [48][86], co-citation analysis [35], and social filtering [46], we have developed the notion of an NK local clan graph:

The NK local clan graph for a seed set S is {(v,e) | v is in an N-clan with at least K members of S}.

An N-clan is a graph in which every node is connected to every other node by a path of length N or less, and all of the connecting paths go through only nodes in the clan. Our crawler uses 2-clans (the 2K local clan graph) because the 2-clan represents a useful substructure extracted from the large structure of the web. By requiring that sites relate to a certain number of seeds (K), we ensure that we find not just dense graphs, but graphs in which a certain number of the seeds participate.

There are three types of inter-document relationships where a relationship between two documents can be inferred from a known relationship between the other two. Co-citation analysis says that two documents B and C are related if a third document, A, cites them both. Social filtering says that if documents B and C both refer to a third document, A, then B and C may be likely to link to similar sorts of items in general. Transitivity says that if document A refers to B and B to C, then A implicitly refers to C. These three relationships are the minimal 2-clans, which are, in our case, necessary because no smaller structure allows us to make inferences about document relatedness, and sufficient because no larger structure enables other simple inferences [95].
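The three minimal 2-clan inferences can be sketched directly over a set of directed links (illustrative only, with an intentionally naive pairwise scan; the crawler works over the full site graph):

    def related_pairs(links):
        # links: set of (a, b) pairs, meaning document a refers to document b
        related = set()
        for a, b in links:
            for c, d in links:
                if a == c and b != d:
                    related.add(frozenset((b, d)))   # co-citation: a cites both
                if b == d and a != c:
                    related.add(frozenset((a, c)))   # both refer to the same doc
                if b == c and a != d:
                    related.add(frozenset((a, d)))   # transitivity: a -> b -> d
        return related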


During a crawl, a number of parameters describing sites are gathered. The numbers of images, audio files, and movie files are recorded, as well as the numbers of in-links and out-links. The number of links pointing to a site from other, outside sites is its in-link count. This parameter can be used to determine whether a site is popular, by counting the number of site designers who think it is good enough to be linked to; this is a form of social filtering. By considering each in-link to a site to be an endorsement of that site, we can generate a list of the most linked-to, or most endorsed, sites. An out-link is a link from the site to another site. A site with many out-links can be considered a good index site with many links about the desired topic.

Combining these two parameters can provide further information. If a site is pointed to by many sites but does not point to any other sites, it may be an official site (perhaps a corporate site) on the topic, since many sites think it's important but the site itself does not point to any other sites. If, on the other hand, a site is not pointed to by many other sites but itself points to a large number of other resources, it may be a newer site that other site designers have not noticed yet; most likely, it is a link collection site if it has a high number of out-links.

While a crawl is being performed, two metrics are used to ensure that highly relevant sites are visited in the early stages. First, a weighted sum of the in-link counts of all sites that point to a page is used to rank the page on its potential not only for being a quality site but for recommending other quality sites. As a crawl progresses, this ranking improves because more data about visited sites are collected. If a site is pointed to by many other sites that themselves have high in-link counts (and hence are considered good sites because they are endorsed by others), then this site can also be considered a good site. Because of the immense size of the web, a crawl can take a very long time, but by using this metric, more relevant sites are found near the beginning of a crawl, and a crawl can be stopped after some user-defined threshold number of sites is found.

Second, anchor text is searched for keywords related to the crawl. Anchor text is the text description, written by the site designer, that is displayed for each link and is what the user clicks on to visit the linked-to site. This text is usually highly related to what the site contains. So during a crawl, all occurrences of anchor text are saved for each site and can be searched to gain relevance feedback. If a match is found, the ranking for the site is improved; if no match is found, nothing is done, because a missing match does not necessarily mean a site is off-topic. (A sketch of this priority metric appears at the end of this section.)

We noticed differences among the structures of crawls for certain topics. In particular, sites whose purpose was sales and business-related activities tended not to link to other sites and were isolated from the rest of the site graph. For business topics, 79% of the sites were isolated, as opposed to only 32% for non-commercial topics such as entertainment. Similarly, the average density of the graph for business-related topics was 0.004, while for other topics it was much higher, at 0.071. Topics dominated by merchants competing for the same customers do not exhibit collaboration and are not good candidate topics for our systems; collaborative filtering based on linking can work only for topics with a significant number of inter-site links.

An interactive Java applet is used to generate crawls with a server-based Java crawler. The user enters into the applet a few seed sites on the topic they want to crawl, and the applet sends this information to the crawler running on our server. The crawl begins, and the user is given feedback about the status of the crawl in the form of sorted lists of thumbnail images. Users choose which parameters they would like to watch (in-links, images, pages, etc.), and rows of images in sorted order are displayed so the user can see what sites are being gathered in real time.


When a crawl has completed and satisfies the user-specified parameter for the number of sites to gather, the individual files are compressed into a zip archive and downloaded by the user as data for further visualization. This client/server architecture allows the system to be used efficiently by multiple web users and allows us to monitor the crawls being performed.

Once a crawl has been completed and a database of related sites has been compiled, the data can be further analyzed and presented to the user. Two visualizations of the collected data that we have developed are described in the following sections. The first interface, WebCite, is a graphical interface; the other, TopicShop, combines thumbnail images with a simpler text-based representation.
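A minimal sketch of the crawl-ordering metric described above (the weighting scheme and the anchor-text boost factor are assumptions for illustration, not the crawler’s exact formula):

    def crawl_priority(candidate, in_link_counts, referrers, anchor_texts, keywords):
        # Weighted sum of the in-link counts of known sites pointing to the candidate.
        score = sum(in_link_counts.get(site, 0)
                    for site in referrers.get(candidate, ()))
        # Boost the ranking when saved anchor text matches a topic keyword.
        if any(kw in text.lower()
               for text in anchor_texts.get(candidate, ())
               for kw in keywords):
            score *= 1.5                       # hypothetical boost factor
        return score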


Figure 4.1: WebCite User Interface

4.2 WEBCITE

The WebCite user interface (shown in Figure 4.1) displays a graphical thumbnail image for every site visited in a crawl. The layout is a group of concentric semi-circles based on an auditorium seating metaphor, with the sites most central to the topic located on the inner ring of images and the rest fanning outward in elliptical rows where the thumbnail images get smaller as they get farther from the center. Centrality is determined by a metric that combines the in-links and out-links of each site (the number of 2-clans the site occurs in). The title of the main page of each site is placed next to its thumbnail, and the original seed sites are marked with an asterisk.

The interface is interactive: when a user moves the mouse cursor over one of the thumbnail images, it is highlighted by increasing its size, and links to and from the site are shown with colored arrows. Arrows pointing toward the highlighted site from other thumbnail images indicate that those sites link to the highlighted site; likewise, arrows pointing toward other sites indicate that the highlighted site links to them. Any site that neither links to the highlighted site nor is pointed to by it fades to black, so that just the relationships to and from the highlighted site can be seen. Sites can be visited by double-clicking on them, which opens a web browser and displays the top page of the site. Clicking the left mouse button while holding the shift key over any thumbnail image displays the internal pages for that site.

4.2.1 Lessons Learned

The WebCite user interface is a visually pleasing way to display collections of web sites on a topic, and the layout of the thumbnail images begins to show some relationships among the web sites in the collection. However, much of the important information, such as in-links and out-links, is hidden within the interface: users must manipulate the interface by clicking and moving the mouse before the link structure between the sites emerges. This is, of course, better than presenting all inter-site links at once, which would result in a cluttered, useless display of the structure. We will see later (Section 4.3, TopicShop) that by using the numbers of in-links and out-links as parameters of each site and presenting ordered lists, we can let the user see important aspects of the inherent structure of the sites.

User evaluation of this interface revealed that users wanted to change the structure and move the thumbnail images to suit their needs. We want to support this by letting users organize collections to reflect their own understanding of the topic area by grouping and categorizing items.


Figure 4.2: First version of TopicShop (Details View)

4.3 TOPICSHOP

Another visualization for viewing and managing collections is the TopicShop Explorer, a customized version of the normal Windows file Explorer. The TopicShop Explorer, shown in Figure 4.2, is a very small Windows executable that knows how to read and process site profile files. Users can view their collections in two different ways: details or icons. The main feature of the details view is that it shows site profile information, and the main feature of the icons view is that users can arrange icons spatially (Figure 4.2 shows the details view; Figure 4.3 shows the icons view). We had three main design goals for the TopicShop Explorer:

1. Make relevant but invisible information visible. We hypothesize that making site profile information visible will significantly inform users in evaluating a collection of sites. No longer must they decide whether to visit sites (a time-consuming process) based solely on titles and (sometimes) brief textual annotations. (A chief complaint of subjects in the Abrams et al. [1] study was that titles were inadequate descriptors of site content, and that was for sites that users already had browsed and decided to bookmark.) Instead, users can choose to visit only sites that have been endorsed (linked to) by many other sites, or sites that are rich in a particular type of content (e.g., images or audio files). In addition to the site profile data, the thumbnail images are also quite useful; most notably, for sites a user has visited, thumbnail images are an effective visual identifier.

2. Make it simple for users to explore and organize resources. In the details view, users can sort resources by any of the properties (e.g., columns showing the numbers of in-links, out-links, images, etc.) simply by clicking on the label at the top of a column. In either view, right-clicking on a site brings up a window that shows the profile data from which the numbers in the columns are derived (e.g., lists of all sites that link to the selected site and all internal pages of the site). Double-clicking on a site sends the user's default web browser to that site. Users can organize resources both spatially (in the icons view) and by creating subfolders and moving resources into them. Nardi & Barreau [73] found that users of graphical file systems preferred spatial location as a technique for organizing their files. We believe spatial organization is particularly useful early in the exploration process, while users are still discovering important distinctions among resources and user-defined categories have not yet explicitly emerged. As categories do become explicit, users can create folders to contain the sites in each category.

3. Integrate topic management into a user's normal computing and communications environment. The TopicShop Explorer may not look like a novel interface at all; interestingly enough, this was an explicit goal. We wanted it to be as similar to the normal Windows Explorer as possible so Windows users could apply all their existing knowledge, meaning there would be little or no learning time and similar ease of use. Further, this decision makes it very easy for collections of resources to be shared. Since a collection is just a normal Windows folder containing files (of the special type that we designed), it can be shared in all the normal ways. As we have already explained, a collection can be compressed and downloaded. It can also be emailed. And if users share a common network, collections can simply be read directly from any machine on the network.


Figure 4.3: First version of TopicShop (Icons View)

The TopicShop Explorer interface allows users to organize their web site collection from any view. In the details view, users can change the order of the collection of web sites to represent their personal choice of best quality sites. This ordering becomes an additional column in the interface that can be sorted like any other column. In the icons view, spatial organization is allowed and web site icons can be arranged into groups before being moved to a new folder.

4.4 CURRENT INTERNET RESOURCE DISCOVERY TECHNIQUES

Existing web search approaches can be broken down into a number of types: comprehensive indices, keyword searches, hybrid directory/keyword searches, specialized indices, socially filtered interfaces, and task-specific interfaces (i.e., TopicShop).


4.4.1 Comprehensive Indices (Web Directories)

One popular search approach is the comprehensive index (Yahoo, Netscape, Lycos, Infoseek, etc.). This type of search engine has human web librarians who search and evaluate sites and place them in an appropriate category, usually in alphabetical order. The result can be considered a web directory broken down into categories. This typically leads to highly relevant sites, but relies on human involvement to decide which sites are on-topic and which are not. Because of the human role, directories can often provide better results than search engines. Search features are available to automate finding the correct categories and/or sites for a particular query, by allowing users to specify keywords that are matched against category headings.

4.4.2 Keyword Searches

Another type of web search engine is the keyword search engine (AltaVista, Magellan, Excite, etc.). On these sites, a user specifies a query in the form of a list of words and is given back a list of pages ordered by a textual matching metric. This is done by implementing automated crawlers that catalog the web by following links on each page they find and building indices to search on, thus eliminating the need for human intervention. Many times there are multiple pages from the same site listed in the results. These search engines constantly visit web sites on the Internet in order to build catalogs of web pages. Because they run automatically and index so many web pages, keyword search engines may often find information not listed in directories.

4.4.3 Hybrid Directory/Keyword Searches

Many search engines that began as keyword search engines are slowly incorporating a categorical directory index in their databases. Designers of these search engines apparently saw that providing some structure to their lists of sites would be very beneficial to their potential users. By default a user's search query is still answered with a list of pages from the search engine's catalog of the web, but a rough directory index is available as well for at least some of the sites returned.


4.4.4 Specialized Indices

A fourth type of search approach is the specialized index, which can be further broken down into two distinct categories: links pages and web ring interfaces (Links Pages, WebRing, Looplink, etc.). People interested in a topic who want to provide resources to other users create links pages. Typically, links pages are just a list of resources presented on a page of a web site, sometimes categorized by sub-topics. Web rings attempt to provide some structured information to groups of pages by enabling users to form rings, linking together sites related by topic. Usually one user is the ringmaster and allows other site maintainers to join the ring. Navigation among the sites in a ring is accomplished either by going to an index page of sites or by moving around the ring of sites, following links on each page. This provides a sort of topic community of interested site maintainers who support each other by bringing users to the whole ring rather than just to their own sites. The interface is not very efficient, however, because users must traverse pages in the ring to find the information they are interested in.

4.4.5 Socially Filtered

There are also search approaches that attempt to utilize user behavior as a predictor of relevant web sites (Alexa, Firefly). These systems watch where users navigate and also collect ratings from users about the sites they visit. Mapping current behavior to the database provides a method for matching users and making recommendations about what other sites are likely to be related to the current site. These systems are highly automated but still require users to give some feedback by voting for sites. They lead to a collection of many related resources but still do not provide a comprehensive overview of the available information.

4.4.6 TopicShop

While directories appear to contain higher-quality resources, with all items likely to be relevant to the topic, they require a large amount of human effort to construct and maintain a good collection of items for a topic. Search engines are automated, requiring no human effort, offer much more data, and may include relevant items that were missed by a human librarian maintaining a topical collection. However, search results often contain irrelevant information due to the ambiguity of most queries, usually have poor organization, and almost always contain duplicate pages and dead links.


TopicShop attempts to combine the best of these two approaches, using the computational means of a search engine to construct topical collections with the high quality of a directory, and adding representative profiles that users can use to evaluate the quality and function of the resulting items. Task-specific interfaces like TopicShop may make finding relevant information faster and more efficient. Collecting user feedback is a step in the right direction for automatically gathering information about sites, but it requires all users to perform some amount of work before gaining the benefit of the information. Table 4.1 shows some features of current search interfaces.

Interface                    Graphical/  Structural  Sub-Topic        Static/Dynamic  Computer Generated/
                             Textual     Data        Categorization   Interface       Human Filtered
Categorized (Yahoo)          Textual                 X                Static          Human
Rings (Web Ring)             Textual                                  Static          Human
Search Engines (AltaVista)   Textual                                  Static          Computer
Links Pages                  Textual                 X                Static          Human
SortTable                    Textual     X           X                Dynamic         Computer
WebCite                      Graphical   X                            Dynamic         Computer
TopicShop                    Graphical   X           X                Dynamic         Computer

Table 4.1: Comparison of search interfaces

There appears to be a distinct division of labor between the people who prepare content sites and topic guides and the people who use them. Many people are not interested in rating sites and would rather just find what they are looking for, but there is a small collection of motivated people who do want to provide information. Harnessing the knowledge and motivation of these people will benefit other information-seeking users. We propose to make this easier for both classes of people by semi-automating the preparation process with our web crawling system and then improving the management of information with task-specific user interfaces. In addition, many site maintainers have already put forth effort in linking their sites to other sites that they feel are adequate resources on their topic. Using this information directly eliminates the need for users to contribute personal ratings: each link into a site can be considered an endorsement of that site, and used to rank linked-to sites.


Web crawls can be performed by a few individuals who are highly interested in the topic, and the results presented for other users to view. These people can be considered topic librarians and are responsible for managing collections of web sites for a topic. The other class of user is the one attempting to see what resources are available about a topic. This person is a topic novice: either a novice regarding the topic at hand or a novice regarding the availability of web resources on that topic. Either way, they need guidance on the structure of available resources and an introduction to the most useful sites about their topic of interest. This is a promising way that the resource discovery and ranking techniques presented above can be used.

Index pages exist on the web for many, if not most, topics. However, indices have at least two problems: it is difficult for them to be comprehensive and up-to-date, and, paradoxically, the more comprehensive they are, the harder it may be to focus in on just the high-quality sites. Our systems can address both problems. A person maintaining an index can apply techniques from our systems to follow links from the current index and discover new sites that may be relevant. The discovered sites can be presented to the index maintainer, who can then decide which ones to add to the index. And site connectivity information can be used as an aid in ordering sites within the index.

This process is collaborative in two ways. First, over time, a topic index becomes a product of emergent collaboration, since it contains sites because they were linked to by sites from earlier incarnations of the index. Second, it is also a human-computer collaboration, with a web crawling algorithm continuously suggesting new sites to an index maintainer based on their relevance to sites already in the index, and the maintainer retaining the final decision over which sites to add.


CHAPTER 5: OVERVIEW OF USER STUDIES

The need for topic management was motivated in Section 1.2. This research is an attempt to explore and produce effective and efficient mechanisms for supporting this task.

5.1 HYPOTHESIS

TopicShop, with sort orders, categorization, user collections, etc., is more effective and efficient for the task of topic management than typical web search engines and indices that use simple alphabetization and site annotations (e.g., Yahoo). Socially filtered data will, over time, provide a better set of topical resources than automated keyword search engines.

5.2 EXPERIMENTS

We performed a series of evaluations to determine the effectiveness and efficiency of several different user interfaces to socially filtered data, and to provide empirical evidence regarding the benefits of transparent, data-rich interfaces. In addition to the normal parameters of usability studies, the social filtering interfaces we evaluated provide an additional variable to control: topic content.


The subject area in which a web crawl is created for these studies can either be pre-determined and held constant for all subjects, or varied to fit each user's personal interests. Each of these approaches can provide valuable feedback, and each has unique benefits in a usability study.

Personalized topic areas guarantee that a subject already knows a good deal about the topic and may even have a grasp of the breadth of web pages available for it. If subjects already know what is on the web, they have an idea of what they want to look for and will be able to generate better, more specific search strategies to find exactly what they want to see. A topic expert has already seen a large portion of the available web content on a topic and will be interested in finding additional sites they have not been exposed to, plus any information that may have been recently generated. With prior knowledge of the topic content, these users will be able to concentrate on the interface itself and will be able to compare it to other interfaces they have used in the past to investigate their topic.

The other option for providing web crawls is to assign all users an identical topic for which they are novices. This ensures that each user has a similar experience in using the interfaces, because they are given the same initial data, and a better comparison across users is possible when the data set is held constant. When users are not familiar with the topic, they rely on the interface to provide them with good content that begins to teach them about the topic they are researching. Their needs differ from those of users who select their own topics, because topic novices are more interested in first gaining a broad overview of a topic, which is likely to be contained in the most linked-to sites.

A combination of these two approaches was used for the two evaluations of TopicShop. The expert evaluation was performed by topic experts, who were presented with sites on the topic they were interested in. Experts were selected based on their self-perceived knowledge of one of the topic areas that we selected for the studies. Those same topic collections were then presented to novice users who had no prior knowledge of the topic.

5.2.1 Selecting a Domain

The web is an immense repository of information and continues to grow at a very rapid pace. In order to do almost anything on the web, users must apply some type of search strategy. Early in the research, we speculated that analyzing the methods people currently use when searching the web would lead to insights into how user interfaces can better support users in finding information efficiently on the web.


To quantify this, we studied a set of approximately 770K queries issued to the Magellan search engine between March 1997 and August 1998. The Magellan search engine published on its web page a random sampling of twelve queries that users were currently performing. Our sample was taken by collecting these twelve queries every 10 minutes and writing them to a standard text file. By breaking the queries down into their keywords and eliminating common stop words, we discovered that out of 1,473,077 keywords, only 159,725 (10.8%) were unique. So there is a large overlap in the topics that people are investigating on the web.
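A minimal sketch of this keyword analysis (illustrative; the stop-word list here is an abbreviated stand-in for the one actually used):

    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "for", "to"}

    def keyword_stats(queries):
        # Break queries into keywords and drop common stop words.
        keywords = [w for q in queries for w in q.lower().split()
                    if w not in STOP_WORDS]
        unique = set(keywords)
        return len(keywords), len(unique), len(unique) / len(keywords)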

Queries to Magellan can be either small keyword phrases or longer natural language queries. Not surprisingly, small keyword queries are by far the more popular with users. Queries containing three keywords or fewer accounted for 85% of the total queries (one-word queries = 35%, two-word queries = 30%, three-word queries = 20%). Queries of four and five words were an additional 11% of the total, which means that queries of six words through 66 words occurred only 29,475 times (~4%).

We also wanted to investigate the major topics that people are researching on the web, so we decided to analyze a sample of the top 10% of all queries. The 515 most commonly occurring queries accounted for ~96,000 queries and represented approximately 10% of the total queries in the data sample. By categorizing these top 515 queries, we got a good idea of what topics are important on the web. We chose the top-level categories by reading through the 515 queries and coming up with seven distinct categories: business, current events, entertainment, sex, internet/technology, travel, and uncategorized. A brief description of each category and the list of queries were given to two independent raters, who were asked to categorize the 515 queries. After analyzing the entire categorization (inter-rater reliability of .85, Cohen's Kappa, p<.0001), we determined that 42% of the queries had to do with entertainment topics, including media fandom. The next two most popular topics were sex and internet/technology, accounting for 25% and 23%, respectively. Entertainment topics made up almost half of the queries performed and as such are a representative area to study using the TopicShop interfaces.

5.2.2 Introduction to Pilot Study

The initial study we performed was a pilot study comparing Yahoo with TopicShop using 16 subjects and 8 topic experts. These subjects were given sets of 60 sites on one of two topics.


This small-scale study was used to verify that our hypotheses were on target and also to ensure that the methodologies we designed would capture the types of data we were interested in analyzing. Instead of using a couple of pilot subjects in a larger-scale study, we designed this smaller study so that we would be able not only to eliminate any problems in our experimental design, but also to obtain meaningful results without wasting any user data.

With these data, we were also able to iterate on the design of TopicShop and make improvements based on this first user study. Details of this study are presented in the following chapter.

5.2.3 Introduction to Interface Evaluation

After redesigning TopicShop to reflect what we learned from the pilot study, we wanted to perform a more thorough evaluation of our interface. This larger evaluation used 40 subjects and 15 experts covering 5 different topic areas on the web. The number of sites each subject was given was also increased, to better represent the magnitude of content available on the web. One important design change that users requested was better support for organizing the collections of sites they selected. Because of this, the task was changed to include organization and categorization of the selected sites. This allows us to look at ways we can better support this operation and also to investigate how much agreement there is about categories within a topic. Details of this study are presented in Chapter 7.


CHAPTER 6: PILOT STUDY

6.1 INTRODUCTION

We wanted a topic management web tool as a suitable baseline for comparison to TopicShop. Yahoo is the most widely used means of finding and browsing collections of web resources. Figure 6.1 shows an analysis of search engine usage on the web between March 1997 and September 1999. These data were provided by Media Metrix [103] and were based on a sample of 50,000 web users. The chart shows the percentage of these web users who use each search engine on a regular basis. Clearly Yahoo has been the most popular search engine, with Netscape and Microsoft Network following close behind. These are all category-based search directories and, as shown in the graph, they are more popular than keyword search databases like Excite and AltaVista. Yahoo is also the largest of the search directories, cataloging over one million sites, while its closest competitor contains less than half that amount.

Bookmark lists are probably the most common means of organizing collections of resources. According to user surveys done at Georgia Tech [104] over the past 4 years, bookmarks were cited as a browsing strategy for locating information on the web by 80% of all participants. Bookmarks are built into most web browsers and are easy to use, requiring only a mouse click or keystroke to save a site for future reference. The study by Abrams, Baecker and Chignell [1] and the Data Mountain system by Robertson et al. [84] also indicate that bookmarks are a popular method of storing personal collections of sites.


Therefore, we decided that subjects would use either TopicShop or Yahoo/bookmarks. We chose two entertainment topics for the pilot study: homebrewing and the TV program “Buffy the Vampire Slayer”. Each had about 60 sites on its corresponding Yahoo page. Our choice of these topics was influenced by the fact that pursuing special interests, including hobbies and media fandom, is one of the main ways people use the web.

Figure 6.1: Search Engine Usage. (YH=Yahoo, MSN=Microsoft Network, NS=Netscape, GO=go.com {InfoSeek}, LY=Lycos, EX=Excite, AV=AltaVista, SP=Snap {search.com}, HB=HotBot, LS=LookSmart) (Media Metrix [103])

6.2 EXPERIMENTAL DESIGN

To verify that the user interfaces support the tasks they were designed for, our pilot study compared TopicShop and Yahoo/bookmarks. This evaluation concentrated on the initial version of the TopicShop interface along with Yahoo's widely used interface on the web. In phase one, four experts in each topic evaluated the sites and gave their quality judgements. In phase two, a 2x2 between-subjects analysis was conducted with 16 subjects (see Table 6.1). Two topic collections (Buffy the Vampire Slayer and homebrewing) were randomly presented in two different interfaces (TopicShop and Yahoo). The crawls were limited to sites that also existed on the Yahoo page for each topic, to keep the data sets consistent. The order of presentation of the sites shown to each subject was randomized. We randomly assigned each of the 16 subjects to one of the 4 conditions, resulting in 4 people per condition (topic/interface combination).

2x2 Experimental Design

                                    Interface
Topic                          TopicShop     Yahoo
Buffy the Vampire Slayer       4 Subjects    4 Subjects
Homebrewing                    4 Subjects    4 Subjects

Table 6.1: Pilot study experimental design

The two main metrics we wanted to measure were the quality of the resources users gathered and the amount of effort (time and total number of sites browsed) required. To give a quality baseline, in phase one four topic experts were presented with a list of 60 sites (in random order) from each topic; only titles were presented, with no annotations or profile data. This meant that experts had to browse each site and evaluate it based on its content and layout. Each expert collected the 20 "best" sites. For this study, we defined "best" as a set of sites that collectively provided a useful and comprehensive overview for someone wanting to learn about the topic. During analysis, we used the "expert intersection", the set of sites that all experts for a given topic selected, as the yardstick for measuring the quality of the resources subjects selected. It turns out that the expert intersection contained 12 sites for both topics; we discuss the expert intersection in more detail below.

In phase two, for both the TopicShop and Yahoo conditions, topic novice subjects were presented with 60 sites from the appropriate topic, whose quality they were to evaluate. Yahoo subjects saw (as usual) the titles of all sites in the appropriate Yahoo category and, for about half the sites, a brief textual annotation. For the TopicShop condition, we applied our web crawler to the Yahoo sites to produce site profiles, which TopicShop then displayed.

There were two main goals of this pilot study. First, we wanted to verify that the web crawler interfaces we had iteratively designed work for certain types of web search tasks, like maintaining a links page. Second, we wanted to develop expert rankings of collections of web sites in an attempt to quantify what factors go into quality web sites.


6.3 PARTICIPANTS

Subjects for the pilot study consisted of volunteers from AT&T. Topic experts included graduate students from Virginia Tech and employees from AT&T.

6.4 METHODOLOGY

In phase one, experts in each topic were given a list of web site titles of the 60 sites for their topic in random order. The instruction sheet they were given contained information explaining the task and the definition of a quality site. They were asked to look through the links exhaustively and choose the 20 sites they thought were the best quality sites, keeping them in ranked order. Experts took approximately four hours each to complete this task.

In phase two, subjects (topic novices) were assigned randomly to one of the four conditions. To begin the experiment, subjects received 15 minutes of instruction and training in the task and user interface. TopicShop subjects were shown the basic interface features and taught how to collect sites by dragging and dropping icons into folders. Yahoo subjects were shown a sample list of sites and taught how to collect sites by bookmarking. After training, subjects performed a short task to ensure that they were comfortable with collecting and organizing sites.

For the main task, subjects investigated the sites for their assigned topic by using their assigned interface (TopicShop or Yahoo) and browsing to sites.

In both interface conditions, subjects were presented with the same collection of sites for their topic. They were asked to choose the 15 “best” (as defined previously) sites and rank them by quality. Because people do not spend unlimited amounts of time browsing, we wanted to see whether users could find high-quality sites in a limited amount of time. Subjects were asked to complete the task in 45 minutes and were kept informed of the elapsed time at five-minute intervals. Clearly, there is a relationship between time on task and quality of results: the more time spent, the better the results one can expect. By limiting the amount of time, we hoped to focus on any differences in the quality of results (i.e., sites users selected) between the two interfaces. The task ended when subjects were satisfied with their collections of sites or after 45 minutes had elapsed. Subjects then completed a short questionnaire. Finally, we conducted an informal interview to reveal the strategies subjects used to perform the task, their reactions to the interface, and what could have helped them complete the task more effectively.


6.5 DATA COLLECTION AND ANALYSIS

During the pilot study we observed a number of variables. We recorded time on task for each interface and broke it down into time spent in the interface and time spent browsing web sites. In addition, a keystroke-level log captured mouse movement and interface component clicks. This resulted in data regarding where in the interface the subject was and what they were doing during the experiment. Analysis of the browsing history showed each subject’s browsing behavior during the experiment. This included the percentage of time a subject spent on web pages rather than in the interface, the total number of sites visited, and the average visit position of the subject’s top five ranked sites. This last piece of data is taken from a list of web sites in the order that the subject visited them. Comparing the subject’s top five sites with the first five sites they visited shows whether the subject was able to quickly find the sites they thought were best. By comparing rankings of the sites that the subject selected to topic experts’ rankings, we computed a quality metric to rate how similar each subject’s list of sites was to the topic experts’ list. Survey results were also tallied, along with some additional statistical analyses on the questionnaire data.

There were two factors to consider when comparing topic novice subjects’ lists of best quality sites to topic experts’ opinions: endorsement and ranking. Experts not only selected quality sites but also ranked them in order by quality. One method of comparison is the strict intersection of the four experts for each topic. This gives a set of sites, endorsed independently by four people, that can be considered to contain quality information with a nice layout for the topic. Another way we looked at the expert data was to assign a score for each of the 60 sites in a topic. The score was the number of experts (1 to 4) that recommended the site in their list of 20 quality sites. This gives a larger set of sites that were recommended by at least one expert. Finally, since the expert sets were ranked, a weighted score was computed for each site by averaging its position in each expert’s ranked list. This weighted score was then compared against the site’s position in the subjects’ lists to measure similarity to the experts.
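To make these metrics concrete, the following is a minimal sketch of how the endorsement score and the weighted rank score described above could be computed. All names and data are illustrative; this is not the study’s actual analysis code.

```python
# Sketch of the expert endorsement metrics described above. `expert_lists`
# maps each expert to their ranked list of selected sites (best first).
# All names and data here are illustrative, not the actual study code.

expert_lists = {
    "expert1": ["siteA", "siteB", "siteC"],   # in reality, 20 ranked sites
    "expert2": ["siteB", "siteA", "siteD"],
    "expert3": ["siteA", "siteD", "siteB"],
    "expert4": ["siteB", "siteC", "siteA"],
}

# Strict expert intersection: sites selected by every expert.
intersection = set.intersection(*(set(l) for l in expert_lists.values()))

# Endorsement score: number of experts (1 to 4) who selected each site.
all_sites = set().union(*(set(l) for l in expert_lists.values()))
score = {s: sum(s in l for l in expert_lists.values()) for s in all_sites}

# Weighted rank score: average position (1 = best) of a site across the
# ranked lists of the experts who selected it.
def weighted_rank(site):
    positions = [l.index(site) + 1 for l in expert_lists.values() if site in l]
    return sum(positions) / len(positions)

for site in sorted(all_sites, key=weighted_rank):
    print(site, score[site], round(weighted_rank(site), 2))
```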

6.6 QUANTITATIVE RESULTS

We first compared the set of sites chosen by each novice subject to the expert intersection. For each topic, the expert intersection contained 12 sites. For the Buffy topic, Yahoo subjects selected an average of 5.0 sites that were in the expert intersection, while TopicShop subjects selected 7.5 expert-endorsed sites. For homebrewing, Yahoo subjects matched 4.3 sites and TopicShop subjects matched 9.3.

Overall, Yahoo subjects selected 4.6 sites from the expert intersection, while TopicShop subjects selected over 80% more, or 8.4 sites. We performed an analysis of variance to look at the interaction between topic and interface for the expert intersection results. We are only interested in the main effect of the interface factor, but we want to be sure that topic is not significant and that there is no interaction. A 2x2 between-subjects, two-factor ANOVA (interface and topic) shows that topic is not a significant factor (F(1,12)=0.585, p=0.459).

Also, the interaction between topic and interface was not significant (F(1,12)=3.659, p=0.08). The interface factor we are investigating is significant (F(1,12)=32.927, p<0.0001). Since topic is not significant and there is no interaction, the remaining results in this section are presented using pooled independent means t-tests. Expert intersection results are summarized in Table 6.2.

Thus, TopicShop subjects found significantly more high-quality sites in the time given to complete the task. Notice that choosing sites at random would result in obtaining 3 sites in the expert intersection. (Users selected 15 out of 60 sites, or 25%; 25% of the 12 sites in the expert intersection is 3 sites.) The Yahoo score of 4.6 is only somewhat better than random selection (one-sample t-test against a test value of 3: t(7)=3.87, p<0.006). This is probably due to the task time limit of 45 minutes; if Yahoo subjects had had unlimited time, undoubtedly they would have been able to find more high-quality sites. So, we see that TopicShop users found significantly better sites in the time given to complete the task.

                             Interface Type
Topic                        Yahoo    TopicShop
Buffy                        5.0      7.5
Homebrewing                  4.3      9.3
Average over Topic           4.6      8.4

Table 6.2: Expert intersection analysis (average number of expert-endorsed sites selected)

If instead we compare the subjects’ lists of sites to the experts’ weighted union, we see a similar trend. The weighted union is a sum of the ratios of experts that selected each of a subject’s sites (if 3 of the 4 experts selected a site, the ratio of experts would be 0.75). The Yahoo subjects’ average expert weighted unions were 6.5 and 6.63 for Buffy and homebrewing, respectively, with a total average of 6.56. For TopicShop, the scores were higher: 9.5 for Buffy and 10.81 for homebrewing. The total average for TopicShop subjects was 10.16, or 55% higher than Yahoo subjects (pooled independent means t-test, t(14)=-3.97, p<0.0007). Table 6.3 shows results for the expert weighted union.
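A similarly hedged sketch of the weighted union calculation, reusing the illustrative expert_lists from the previous sketch (again, not the study’s actual code):

```python
# Weighted union: for each site a subject selected, add the fraction of
# experts who also selected that site. Illustrative names only.

def weighted_union(subject_sites, expert_lists):
    n_experts = len(expert_lists)
    return sum(
        sum(site in l for l in expert_lists.values()) / n_experts
        for site in subject_sites
    )

# Example: a site selected by 3 of 4 experts contributes 0.75 to the sum.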


                             Interface Type
Topic                        Yahoo    TopicShop
Buffy                        6.5      9.5
Homebrewing                  6.63     10.81
Average over Topic           6.56     10.16

Table 6.3: Expert weighted union analysis

It also is revealing to examine the amount of work subjects performed to complete their tasks. A study of data from the search engine Excite [49] (51,473 queries; 18,113 users) showed that 86% of all users look at three or fewer pages of search results (each results page contained 10 sites). This suggests typical users are willing to consider no more than 30 sites when browsing the web, many of which can be rejected by examining the title alone. In our study, Yahoo subjects browsed an average of 44 sites, while TopicShop subjects visited about 36 (pooled independent means t-test, t(14)=1.14, p<0.14), or about 19% fewer. Further, the task of constructing a high-quality collection of resources is more difficult than doing a simple search; the task is global, since a user is trying to develop a comprehensive overview of a topic, so more sites must be considered. By providing additional data up front, TopicShop enables users to make better decisions about which sites to rule out immediately and which to investigate further. Yahoo users can rely only on textual annotations, which are provided by site maintainers. While these annotations are sometimes helpful, they can be out-of-date or self-promotional, so they are not necessarily good indications of the perceived quality of a site.

Our results for the number of pages viewed per site were very similar to the Excite study. Subjects went only to the front page (first page of the site) of 52% of the sites they visited and navigated to a second page on an additional 20%. Analyzing these results further reveals that sites where subjects visited two pages or fewer were often selected into their final lists of sites. So, many subjects judged the quality of sites after viewing only the first page or two. In fact, 61% of the subject-selected sites matching the expert intersection had only one or two pages browsed by all subjects. Subjects tended to visit more sites than necessary while selecting quality sites because they wanted to be sure there were no additional quality sites they might have missed. Even though they viewed more sites than necessary, subjects found quality sites for their final collections more rapidly using TopicShop than using Yahoo. We can analyze this by looking at the visit position of sites for each subject.


A site’s visit position is calculated by considering the entire temporal sequence of sites each subject visited and finding the position of the site in that list. The average visit position of the top five sites from subjects’ final sets of selected sites was 21 for Yahoo subjects, while TopicShop subjects visited their top five sites within 13 visits on average (pooled independent means t-test, t(14)=1.52, p<0.08). So, even though Yahoo subjects browsed an average of 44 sites and TopicShop subjects an average of 36 sites, their most productive browsing took place within the initial 42% of the sites browsed (an average of 48% for Yahoo subjects, 36% for TopicShop subjects; pooled independent means t-test, t(14)=1.40, p<0.09).
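A minimal sketch of the visit position calculation (hypothetical data; not the study’s scripts):

```python
# Average visit position of a subject's top five selected sites.
# `visit_order` is the temporal sequence of sites the subject browsed;
# `top_five` is the subject's five best-ranked selections. Illustrative only.

def average_visit_position(visit_order, top_five):
    # Position is 1-based: the first site visited has position 1.
    return sum(visit_order.index(site) + 1 for site in top_five) / len(top_five)

visit_order = ["siteC", "siteA", "siteF", "siteB", "siteD", "siteE"]
top_five = ["siteA", "siteB", "siteC", "siteD", "siteE"]
print(average_visit_position(visit_order, top_five))  # (2+4+1+5+6)/5 = 3.6
```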

We also analyzed time on task. We did not expect a large difference in this metric, since we gave subjects a (soft) limit of 45 minutes to complete the task and kept them aware of elapsed time during the experiment. Since subjects were encouraged to finish within 45 minutes, their times were usually not much more than the limit. Some subjects would have taken more time to complete this task had it been available to them. Still, TopicShop subjects took about 11% less time to finalize their selections (41.5 minutes vs. 46.6 minutes for Yahoo); the difference was not statistically significant but was in the predicted direction (pooled independent means t-test, t(14)=-0.845, p<0.21).

In a task like topic management, one of the goals of the interface is to give users additional information and let them make decisions without having to browse through every page and incur the time cost of downloading more pages. Users probably do not want to exhaustively search every available site in order to find the few they are interested in. Instead, if they have a way to evaluate a collection of sites without visiting every one, they can find the information they are interested in more efficiently. So, the time that they spend in the interface rather than the browser should be maximized. Because TopicShop provides more information in the interface, users can spend more time evaluating sites based on their site profiles and need not browse to each page to evaluate its content. The percentage of time that subjects spent in the interface rather than the browser was an average of 24.6% for Yahoo subjects and 34.5% for TopicShop subjects (pooled independent means t-test, t(14)=-3.11, p<0.004). TopicShop shifted 40% more of a subject’s time from the browser to the interface. This means that subjects were able to make more judgments about the potential quality of a site before browsing and visiting it.

The questionnaire administered to subjects at the conclusion of the experiment asked them how confident they were with their results on a scale of 1 to 7 (1 being very confident, 7 being not at all confident). TopicShop subjects were slightly more confident than Yahoo subjects (4.5 vs. 4.75). This is probably explained by the fact that they were given the data derived from a web crawl. Since an in-link can be considered an endorsement of a site, TopicShop subjects felt that if they agreed that a highly linked-to site was a quality site, they were agreeing with the existing opinion of other site designers. Yahoo subjects had only their own opinions to rely on and no data to help strengthen the perceived validity of their selections.

The questionnaire gave data on what information subjects found most useful in evaluating a site. TopicShop site profiles include the title and the number of in-links, out-links, images, audio files, and pages in a site. The questionnaire asked subjects to rate these properties from most to least useful on a scale of 1 to 7. Subjects rated three of these properties — in-links (2.00), title (2.75), and number of pages (3.00) — most highly. The other four properties had an average score greater than 5. Even though many subjects noted that title is not a very good indication of quality, it still was perceived as one of the most useful site properties. In interviews, subjects explained that titles were useful mainly as memory aids for sites. Thus, subjects considered the number of endorsements (in-links) and the size of a site (number of pages) to be the most useful indicators of quality [6].

The questionnaire also asked subjects what additional information would have helped them in evaluating sites. Six of the eight Yahoo subjects said that the number of links between sites would be very useful. One subject even made it a point to go to the links page of every site visited, to see not only what sites were linked to, but also to read any annotations or recommendations made by the site author. Thus, link information was rated as highly useful by those subjects who had seen it and as very desirable by those subjects who had not.

Browser logs show the order in which subjects visited sites during the experiment. If we take the ordered list of viewed sites and look at whether subjects selected a site or not, we can see a couple of trends. Figure 6.2 shows a representation of each subject’s browser history. It shows the site visitation order, where each site is represented by a character describing whether the subject selected the site and whether it coincided with the expert intersection. Some trends worth noting are:

- In general, shorter length is better
- Long strings of periods (.) represent wasted work
- More O’s are good, especially near the beginning


Yahoo
^......O...O.....^.......'.'^....O...O^..........'^.^...'.O........^'..''''' .O..'.O.....^..O...O.'....'.'..''..'...........^'.'...O....^'O... ...'..^..O'..O.'..O..O.....'..'...........'''....''..O O....O'.'^.''.O.'..O.....'O....^.''...''' O.....O..'.'^...'.^''^..'''....'O'O .'..'.'.O^'.......O.O.'''.''O.OO' .O.O'..^''.^...''..'O....... '...O.'O..'..'.^..O''

TopicShop
'O.OOOO.'......'....'..OO......O....''......'. ..O.^.O.....O.O.'.^O.'O.O.^O^.'....'O..'....'. ..^........^...O..O...OO....O.'.'O^'.^O. ...OOO.O...'O..O...O''.OO^.'....O....O OO.'O.O'.O..'.'O'..'O.^....O.'. .'...O.OOO.''.'OO''.O..'...O... O.^.OO...O''O.'..OOO'..O..O..'. ..'OOO''.'O^.''O''.....O'.

Legend: O - Selected, Expert endorsed; . - Not Selected, Not Expert endorsed; ' - Selected, Not Expert endorsed; ^ - Not Selected, Expert endorsed

Figure 6.2: Web browse history from user pilot study

An obvious overall trend is that TopicShop subjects browsed fewer sites on average. In addition, TopicShop subjects tended to select more expert-endorsed sites earlier in their sessions. Their selections were also clustered more closely in time than the Yahoo subjects’. At the end of the Yahoo sessions, when time was running out, subjects were selecting sites just to complete the task of collecting 15 sites. We see from the Yahoo trails that a few times this meant they selected sites that were not considered quality by the experts. TopicShop subjects did not select the majority of sites in their collections at the very end of the session.

Table 6.4 shows a summary of the number of sites users viewed that were considered wasted or productive work. We consider productive work to take place when subjects browsed to a site that they selected and that was also endorsed by an expert (represented by O in the browser trails).


Wasted work occurred when subjects browsed non-expert-endorsed sites that they did not end up selecting (represented by a period in the browser trails). The other two categories shown in the browser trails cannot be considered productive or wasted work because they represent a difference of opinion between the subject and the experts. TopicShop subjects were productive for an average of 23% of the sites they browsed, while Yahoo subjects were productive for only 10% of theirs (pooled independent means t-test, t(14)=-5.38, p<0.00005).
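The trail encoding and the productive/wasted tallies can be sketched as follows (illustrative names and data only; not the actual analysis code):

```python
# Encode a subject's visit sequence into trail symbols and tally
# productive vs. wasted work, as defined above. Illustrative only.

def trail_symbol(site, selected, endorsed):
    if site in selected:
        return "O" if site in endorsed else "'"
    return "^" if site in endorsed else "."

def summarize(visits, selected, endorsed):
    trail = "".join(trail_symbol(s, selected, endorsed) for s in visits)
    productive = trail.count("O") / len(trail)   # selected and expert endorsed
    wasted = trail.count(".") / len(trail)       # neither selected nor endorsed
    return trail, productive, wasted

visits = ["a", "b", "c", "d"]
trail, productive, wasted = summarize(visits, selected={"a", "c"}, endorsed={"a", "d"})
print(trail, productive, wasted)  # O.'^ 0.25 0.25
```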

                      TopicShop                 Yahoo
                  Number   % of Sites       Number   % of Sites
                           Visited                   Visited
Productive Work     67        23%             37        10%
Wasted Work        162        56%            223        63%
Other               60        21%             93        26%

(Note: percentages are of each condition's total sites visited: 289 for TopicShop, 353 for Yahoo.)

Table 6.4: Amount of work (from browser history)

We also observed that most subjects made their judgment of a site by viewing only its front page. It makes sense that the “front door” page of a site should be both attractive and representative of the site as a whole; after all, the site author presumably designs it to be the initial impression a visitor to the site experiences. One can usually get a good idea of the amount and type of content available on the site, as well as its production quality. Subjects navigated to a total of 642 web sites (the total number of symbols in Figure 6.2 above) and looked at only the front page of over half of them. Of the 240 sites that subjects selected for their collections of the best sites, subjects browsed only the front page of 91. Among the 402 sites that subjects rejected, 285 were rejected after browsing the front page. Overall, subjects viewed an average of 2.39 pages per site. Thus, we see that a subject’s initial impression of a site is extremely important. The quality of the front page is very representative of the quality of the entire site.

6.7 USER EXPLORATION STRATEGIES

Most Yahoo subjects, lacking any better options, simply looked through the 60 sites in alphabetical order, reverse alphabetical order, or sometimes a combination of the two. A few users tried reading all the titles and annotations to make some judgments about sites before browsing them; however, their initial judgment of a site often proved inaccurate once it was browsed, so even these users often reverted to exhaustive alphabetical search. Of course, users still read annotations as they proceeded methodically through the list of sites, but did not rely on annotations to decide which sites to browse. Users also often browsed a few sites at random to try to cover a good sample of available sites.


TopicShop subjects used different strategies, ones that were informed by data in the TopicShop Explorer. They spent more time on exploration within the TopicShop interface prior to browsing sites, sorting columns and watching how the arrangement of sites changed. They were mainly looking for sites that appeared near the top in multiple sorts. Many also attempted to get a rough idea of how sites were distributed in each column. Eventually, subjects tended to proceed by selecting a property they thought was useful and evaluating the first few sites in that column. After they exhausted the quality sites in that column, they would move on to another column and continue. Some subjects would also visit sites at the low end of the columns to convince themselves that the profile data could be trusted.

As evidence of the influence of the TopicShop Explorer on user strategies, we looked at the overlap in sites selected by subjects. TopicShop subjects arrived at a much larger common set of sites.

The intersection for the eight TopicShop subjects across both topics averaged 9.5 sites, while the eight Yahoo subjects averaged an intersection of only 2.5 sites. It makes sense that TopicShop users would agree with each other quite a bit, even more than they agreed with the experts, since they relied on the same data, i.e., the profile features, and tended to pursue the same strategies for selecting resources.

To better evaluate the utility of TopicShop data, we created purely automated sets of the 15 best sites using the “gather from the top of the column” strategy. We defined six sets of sites mechanically: five of the sets consisted simply of the top 15 sites for each numeric site profile property, and the sixth consisted of the top three sites on each property. Recall that the Yahoo subjects had an overall average expert intersection of 4.6 (out of 12). All the automated TopicShop strategies performed better, with an average expert intersection of 5.6. We found it surprising and noteworthy that a purely mechanical strategy using only automatically computed data could outperform human subjects who had to rely on Yahoo’s site titles and annotations alone. Of course, TopicShop subjects (human subjects with the added utility of the TopicShop data) outperformed the automated strategies, with an average expert intersection of 8.4. (Again, we assume that the task time limit was a factor; with enough time to browse and evaluate site content, we expect that people would outperform these mechanical strategies. Of course, who has enough time?)
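A sketch of these mechanical “gather from the top of the column” strategies (the profile values and names shown are made up for illustration):

```python
# Automated "gather from the top of the column" strategies. `profiles` maps
# each site to its numeric profile properties. Illustrative names only.

profiles = {
    "siteA": {"in_links": 120, "out_links": 40, "pages": 300, "images": 25, "audio": 2},
    # ... one entry per site in the collection
}

properties = ["in_links", "out_links", "pages", "images", "audio"]

def top_by(prop, k):
    return sorted(profiles, key=lambda s: profiles[s][prop], reverse=True)[:k]

# Five sets: the top 15 sites on each single property.
single_property_sets = {p: top_by(p, 15) for p in properties}

# Sixth set: the top three sites on each property, combined.
combined = {site for p in properties for site in top_by(p, 3)}
```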


We also observed a common but unproductive strategy: nearly all subjects initially assumed that personal home pages (as determined by title and site location) would be of low quality. They supposed that they could immediately eliminate these sites and select only from the resulting, smaller subset. However, subjects quickly realized that this was not true; after visiting a few personal pages, they found that some were of quite high quality, so subjects abandoned this strategy.

As we observed subjects, we noted that about one-third of them kept their lists of high-quality sites sorted by quality as they constructed them. The other subjects selected 15 to 20 sites and went back to sort them later. Yahoo subjects had a very difficult time with this because, after looking at so many sites, they could not recall site content from just the title. Usually they had to revisit all of the sites a second time to order them. Subjects in the Abrams et al. [1] study complained that titles were inadequate descriptors of site content. Our subjects corroborated this result and were not able to recall enough information about a site by simply seeing the title. TopicShop subjects had a much easier time because they used thumbnail images to refresh their memory of different sites. Even the small icons in the details view were useful once a site was visited; they contained enough information (color, general layout, etc.) to trigger subjects’ memory and help them remember site content.

6.8 DESIGN IMPLICATIONS

Observations, interviews, and questionnaires suggested three significant design improvements to the TopicShop user interface. The first design improvement we considered for incorporation into TopicShop version 2 was better methods for creating subcategories of a topic. A key need that subjects in both interface conditions discussed was support for lightweight, flexible categorization. As subjects

explore sites, they create rough mental groupings, using site similarity, site type (general information sites, specific subtopic sites, personal sites, etc.), or even site layout. While the initial version of TopicShop lets subjects create folders and group subcategories of sites within folders, our observations of subjects showed that this seems to be too much overhead for users when they are starting out. Their mental groupings remain indistinct until they have encountered a sufficient number and variety of sites to enable them to articulate the organizing principle of their categories. Further, categories may be split or combined several times in early stages of exploration. And while the icons view (Figure 4.3) of TopicShop does support this flexible, lightweight categorization (and several


subjects used and liked it), this view hides the important site profile data from immediate view. We have two potential design solutions that could be added to TopicShop version 2 to better support categorization.

Linked views are one solution to this problem. One window would show the icons view and another would show the details view, with user selections of thumbnail images mirrored in both windows. Users then could spatially arrange sites as they form opinions about types of sites within a topic, while simultaneously sorting sites based on profile data. As users develop firm categories, they could create folders to hold sites within each category.

Another potential design solution is a color-coding scheme. Users could assign a color to a small informal grouping of sites and add others to the group as they continue to browse. Then, when sites are sorted, they would be sorted first by color (informal group), then by whatever other property the user specified (e.g., in-links, images, etc.). This would let users quickly create groups and still keep all sites in a single window. Again, when users are satisfied that a group really is a category, a folder can be created to contain it. This solution can easily be combined with the previous solution to give users more flexibility.

A second improvement to the design of TopicShop is to add two levels of annotations. One of the TopicShop design goals was to make it easy to reuse and share topical collections. Subjects affirmed that this was important. In support of this desire, all 16 subjects mentioned that they wanted to record comments about sites as they visited and collected them. Comments could be recorded for individual sites as well as for user-defined categories. These comments would be useful both to the original users when they returned to their collections in the future and to people with whom they shared the collections. The comments would explain why sites were selected, why they were considered to be of high quality, and what they were good for.

The final design change involves sorting techniques within TopicShop. Currently, sorting in TopicShop version 1 is limited to a single column, but subjects expressed a desire for several more powerful sorting techniques. First, they wanted to combine several columns, e.g., sorting by the sum of in-links and out-links. Second, they wanted to be able to do a multi-level sort. For example, one might want to sort sites primarily by number of pages, then break ties by using another property, such as number of in-links.


In the next chapter we discuss the redesign of TopicShop, describing which of these design changes we incorporated and how they affected use of TopicShop’s interface, based on another empirical study we conducted.


CHAPTER 7: USER INTERFACE EVALUATION

We developed a new version of the TopicShop Explorer interface (shown in Figure 7.1), incorporating the design changes described in the previous chapter. There are four main components to this interface. The first is the Work Area, an initially blank space where users can drag selected sites for further investigation. The Site Profiles window displays all the detailed information that our crawler has collected for each site. The Focused Site window at the top left corner of the screen shows a large thumbnail image of the last site that a user clicked on. The final section of the interface is the Folder Selection area, where users can select which topic to display in the interface.


Figure 7.1: Revised version of TopicShop, based on results of the pilot study

7.1 LESSONS LEARNED

Like all artifacts, the initial version of the TopicShop Explorer embodied claims about how users will conceive of and carry out their tasks [21]. With its two separate windows for exploring site details and for organizing icons into groups, only one of which could be visible at a time, it embodied a claim that the tasks of site evaluation and organization must be carried out separately. Further, it assumed a single data set (the collection of all topic-relevant items), which could be manipulated in two ways (exploring site profiles or organizing by spatial grouping). The pilot study revealed problems with both implicit claims.

First, users wanted to organize items without losing sight of the detailed information contained in site profiles. One subject commented:

    I really want to organize the large icons, but don’t want to lose the detailed information. Switching all the time is too painful, so I have to settle for the details view only.


The interface must allow users to integrate the two tasks of site evaluation and organization.

Second, users preferred to group sites by spatial organization rather than by creating explicit folders. While the icons view supported this, the resulting groups were not first-class objects. We wanted to explore spatial techniques to make it very easy to create and manipulate groups.

Third, we realized that most items in a collection never would need to be organized, because users would not select them as worthy of further attention. Thus, rather than supporting a single collection, a better design would support two data sets. Users can evaluate the initial, machine-generated collection and select promising items. Organization will only be done for selected items. This also has implications for the nature of task integration. Users must be able to explore within groups they have created; for example, some users selected fairly large sets of similar sites, say ones that contained multimedia information, then wanted to keep only the best of these sites and throw the rest away. To do this, the interface should make it easy to sort within a user-defined group, e.g., to find the multimedia sites with the most in-links or the largest number of pages.

Fourth, site recall could be improved by including more graphical and textual information. Many subjects asked for the ability to annotate both individual sites and groups of sites. (Note that annotations also make collections more informative for others.) And other subjects asked for a larger thumbnail image to provide a better visual cue:

    A larger thumbnail would be nice… It can be used to refresh your memory … and would be more effective if it looked more like the site.

Fifth, the state of the user’s task must be manifest. Most important, it had to be clear which items in the initial collection users had already evaluated and which they had not. Unevaluated items are a kind of agenda of pending work. Subject comments made this clear:

    An indication of whether or not I visited the site would be useful. I can’t tell what I’ve already seen.

    It’s hard to know what you’ve looked at and what you haven’t…


7.2 TOPICSHOP DESIGN ITERATION

Results and comments from the prior study guided us in designing an interface intended to more effectively address users’ needs for topic management. Major changes in the second version of TopicShop include the following:

Two always-visible, linked views support task integration and a cleaner definition of each task. In an attempt to assist users in dealing with the overwhelming number of web sites available on any given topic, we provided site profile data and a work area for organizing sites, keeping both visible at all times. Items in the initial collection are displayed in the Site Profiles window, and the Work Area is initially empty (unlike Figure 7.1, which shows the results of a subject from the main user study). As users

discover sites that they are interested in, using the Site Profiles view, they select them simply by dragging and dropping them in the Work Area. Since icons are created just for selected items, the Work Area is uncluttered and provides a clear picture of sites users care about.

“Piling” icons makes it easy to create first-class groups by spatial arrangement. Users seem to have a natural desire to group things spatially by making piles [73]. Piling allows a lightweight form of categorization because users are not required to create anything to contain the new category or to name it; they simply arrange thumbnail images in the Work Area by dragging icons. As users find sites that they feel are similar, they can arrange those sites close together in the Work Area window. When a user positions one icon “close enough” to another, a group is automatically formed. (How close two icons must be before a pile is formed is a system parameter, set by default so that a pile forms just when the icons’ bounding boxes touch.) Each group is assigned a color. As the views are linked, both the group of icons in the Work Area and the features for sites in that group in the Site Profiles window are displayed using that color as a background. Users can then add sites to the grouping as they visit similar sites. Once a user is confident that a temporary category contains enough sites to be considered a sensible category for the topic, they can assign a meaningful name to it. One of the columns in the details view contains the user’s category information, so sites can be sorted by category. To help users better organize their groups, they can perform operations on piles (i.e., move, name/annotate, arrange, and select), as well as the normal operations on single sites.

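As an illustration of the grouping rule, here is a minimal sketch that assumes icons are axis-aligned rectangles; the names and the merge policy shown are hypothetical simplifications, not TopicShop’s actual implementation:

```python
# Form a pile when two icons' bounding boxes touch or overlap.
# Icons are axis-aligned rectangles: (left, top, right, bottom).

def boxes_touch(a, b):
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Touching or overlapping: neither box lies strictly beyond the other.
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

def update_piles(piles, icon, box, boxes):
    # Merge the icon into every pile containing an icon whose box touches it.
    touching = [p for p in piles if any(boxes_touch(box, boxes[i]) for i in p)]
    merged = {icon} | set().union(*touching) if touching else {icon}
    return [p for p in piles if p not in touching] + [merged]

boxes = {"a": (0, 0, 32, 32), "b": (30, 0, 62, 32), "c": (200, 200, 232, 232)}
piles = []
for icon, box in boxes.items():
    piles = update_piles(piles, icon, box, boxes)
print(piles)  # [{'a', 'b'}, {'c'}] -- 'a' and 'b' touch, 'c' stands alone
```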
Multi-level sorting is a useful operation that can be applied to a pile; it also illustrates how linked views support task integration. In the Site Profiles view, users can reorder sites based on primary and secondary sort keys. Users commonly sorted first by the groups they defined and then by some additional feature, such as in-links or number of pages. This lets users evaluate and compare sites within a single group. Figure 7.1 shows just such a sort.
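A minimal sketch of such a two-key sort (the field names are illustrative, not TopicShop’s internal data model):

```python
# Multi-level sort: primary key = user-defined group, secondary key = in-links
# (descending). Field names and data are illustrative only.

sites = [
    {"title": "siteA", "group": "news", "in_links": 12, "pages": 40},
    {"title": "siteB", "group": "fan pages", "in_links": 95, "pages": 210},
    {"title": "siteC", "group": "news", "in_links": 51, "pages": 18},
]

# Sort by group name, breaking ties by number of in-links (more first).
sites.sort(key=lambda s: (s["group"], -s["in_links"]))
for s in sites:
    print(s["group"], s["title"], s["in_links"])
```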

Visual indicators make the task state apparent. Any site included in the Work Area is marked with a green diamond in the Site Profiles view and kept at the top for easy reference. Users can mark irrelevant or low-quality sites for deletion; this marks the sites with a red X and moves them to the bottom of the list. Thus, users can quickly see which sites they have already processed (selected or deleted) and which need additional evaluation.

Annotations and large thumbnails support reuse and sharing. The Focused Site window (upper left of Figure 7.1) displays the most recently clicked-on site. This large thumbnail image of the site is now displayed on the main screen to give users a more prominent view of the layout of the currently selected site. This is in direct response to users’ claims that a large preview of sites was extremely useful but was too time-consuming to use in the initial TopicShop interface. It also serves as a memory aid to help users quickly remember additional details about the site. Users can create textual annotations for piles or individual sites in the Work Area. Providing two levels of annotations allows users to describe groups they have formed and also to give an indication of what type of content can be found on a given site. Annotations are useful as individual memory aids, but they also allow users to personalize their collection of sites by adding comments to share with other users.

The interface also allows more customization by the user. Users have the option to show or hide any of the views they would like (Work Area view, Site Profiles view, Focused Site view, etc.). Subjects in the pilot experiment performed the task of topic management in many different ways. To support the different approaches users take in evaluating and maintaining a collection of sites, we wanted to provide an interface that users can tailor to their own specific needs. In the revised TopicShop, columns of displayed data can be hidden, as well as moved to any position that the user desires. This way, columns that are important to a user can be displayed first, while less relevant data for the user’s crawl can be moved to the end or hidden altogether. More user feedback has been integrated into the interface for common operations such as drag & drop, selection, and mouse movement. There were quite a few situations where a user had accomplished an operation and did not even realize it because of the lack of useful feedback.

7.3 EXPERIMENTAL DESIGN

The second experiment was similar to the pilot study but was larger in scale and was redesigned in light of lessons from the pilot study. One major change from the pilot study was due to the fact that the topic collections were much larger, ranging from about 90 to over 250 sites. Since experts are required to comprehensively browse each site while establishing their ratings, we wanted to limit the number of sites experts rated to about 40. It would be unrealistic to expect experts to rate all the sites. It was not even possible for experts to rate all the sites that any subject selected, because this subset was also too large. However, we were able to come close. We chose sites for experts to rate by including first all the sites selected by multiple subjects and then a sample of sites selected by a single subject (a more precise explanation is provided in section 7.5). Of course, this means that the order of the two phases was reversed from the pilot study. We first gathered user data and used those results to decide which sites to present to experts for rating in phase two.

This main user study consisted of two tasks that were performed simultaneously, a selection task and an organization task. Again, the experimental design has two levels of interface (TopicShop Explorer and Yahoo/bookmarks), but covers five different topics rather than two. As discussed before (section 5.2.1), analysis of the Magellan search data showed that entertainment was a very popular category that users searched for on the web, so we again selected topics from the domain of popular entertainment, including the television shows Babylon 5, Buffy the Vampire Slayer, and The Simpsons, and the musicians Tori Amos and Smashing Pumpkins. The experimental design was a 2x5 between-subjects design (see Table 7.1). Because results from the previous experiment were statistically significant with only four subjects per cell, we used the same number of subjects per condition, for a total of 40 subjects.


2x5 Experimental Design

                                       Interface
Topic                          TopicShop     Yahoo
Babylon 5                      4 subjects    4 subjects
Buffy the Vampire Slayer       4 subjects    4 subjects
The Simpsons                   4 subjects    4 subjects
Smashing Pumpkins              4 subjects    4 subjects
Tori Amos                      4 subjects    4 subjects

Table 7.1: Main study experimental design

We again obtained collections from Yahoo and then applied our web crawler to obtain site profiles and thumbnail images for use in TopicShop. For this experiment, we configured the crawler to start from the set of sites found on the Yahoo page for each topic, but this time we configured it to crawl beyond the initial set and include any new sites found during the crawl. Because Yahoo is a human-generated index, many of the newer web sites for a topic are not displayed in the Yahoo index; our crawler can find the newest sites before they show up in Yahoo. Topic experts also evaluated the sites that our crawler discovered beyond the initial set of seed sites. This allowed us to evaluate the quality of recent sites that had not yet been added to Yahoo. However, in the experiment, the sets of sites were still kept the same and limited to only the sites that appear on the Yahoo page. By once again maintaining the same data sets across the two interfaces, we could evaluate the efficiency and effectiveness of each interface.
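A simplified sketch of this seeded, expanding crawl; fetch_links is a hypothetical helper standing in for the crawler’s actual page-fetching and link-extraction machinery:

```python
# Breadth-first crawl seeded with the Yahoo sites for a topic; unlike the
# pilot study, the crawl may move beyond the seed set to newly discovered
# sites. `fetch_links(site)` is a hypothetical helper returning the sites
# a given site links to.

from collections import deque

def crawl(seed_sites, fetch_links, max_sites=500):
    seen = set(seed_sites)
    frontier = deque(seed_sites)
    while frontier and len(seen) < max_sites:
        site = frontier.popleft()
        for linked in fetch_links(site):
            if linked not in seen:
                seen.add(linked)      # a newly discovered site
                frontier.append(linked)
    return seen
```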

The experts’ task was to rate a collection of web sites derived from the sites selected by users. We had 16 experts evaluating sites: 4 experts for the Simpsons and 3 experts for each of the other four topics. This time we decided it would be both easier for them and more informative for us if experts rated the quality of sites on a scale of 1 (worst) to 7 (best) instead of ranking them in order. Again, experts rated sites by filling out a web-based form; the form presented sites in a random order. It gave no information other than the URL, so experts had to browse each site to judge its quality.

7.4 PARTICIPANTS

Participants for this study were students from Virginia Tech and were compensated ten dollars per hour for their time. Topic experts consisted of upper-level graduate students and faculty from Virginia Tech, as well as AT&T employees. The topic experts received Amazon gift certificates for participating in our study. The novice subjects for our study came from 13 different majors and were between the ages of 17 and 35. Graduate students made up 39%, and the rest were undergraduates. The majority of subjects used computers and the web daily; they were predominantly PC users, with a few UNIX users and one Macintosh user.

7.5 METHODOLOGY

Again, we used a two-phase approach, except this time the order was reversed from the pilot study to ease the task of the topic experts. For this study, the sizes of the collections in Yahoo had grown considerably, ranging from 88 sites to 258 sites. Since it would be unrealistic to ask an expert to visit and accurately rate that many sites, we culled the list of sites evaluated by experts, using results from the first phase to select which sites experts rated. A more detailed explanation of the site selection method is provided below.

Subjects were assigned randomly to one of the ten conditions (2 interfaces, 5 topics). The study began with a pre-questionnaire to gather some demographic information from the users. We then had users read instructions on the web explaining how to use TopicShop or Yahoo, depending on their assigned experimental condition. TopicShop subjects were shown its basic interface features and taught how to collect and organize sites by dragging and dropping icons in the Work Area. Yahoo subjects were shown a sample list of sites and taught how to collect sites by bookmarking and how to arrange them into categories. After answering any questions they had about their assigned interface, we had users complete a short practice task to get them familiar with their interface and ensure that they were comfortable with collecting and organizing sites.

In phase one of this study, the task was to collect the 15 “best” (as defined previously in section 6.1) sites and organize them into logical groups, with descriptive group labels, as they were collected. Since ranking the best sites was sometimes difficult and often arbitrary in the pilot study, this time subjects were simply asked to collect sites and not worry about their relative rank. To complete this task, subjects utilized any information provided in their interface (for Yahoo: title and annotation; for TopicShop: site profiles and thumbnail images) along with site content.


In the pilot experiment, subjects were given a soft time limit of 45 minutes. They were warned at five-minute intervals when they were getting close to the time limit and were encouraged to attempt to finish within that time. This forced subjects to quickly choose sites they had already seen to fill their list of 15 quality sites. As a result, most subjects finished the task in approximately 45 minutes. The difference in task time between TopicShop and Yahoo subjects was very small in the pilot study. In the final

experiment, no time limit was given for the task. Subjects were free to take as long as they needed to evaluate the sites for their topic. The task ended when subjects were satisfied with their collections of sites. Subjects then

completed a short questionnaire. Finally, we conducted an informal interview to reveal the strategies subjects used to perform the task, their reactions to the interface, and what would have helped them complete the task more effectively.

In phase two, to collect expert ratings, we gathered three experts for each topic and asked them to fill out a short questionnaire detailing their interest in and self-rated knowledge of the topic. They were then given an instruction sheet containing a description of their task and a definition of quality for our purposes. Experts in each topic were given a list of web site titles for approximately 45 sites for their topic in random order and asked to exhaustively browse the list of sites, rating them on a scale of 1 (worst) to 7 (best). When they were finished, they filled out a final questionnaire to give feedback about the task and any problems they ran into.

We asked the experts in the pilot study to look at a set of 60 sites, selecting 20 and ranking them by quality. This turned out to be a difficult and time-consuming task. As mentioned before, we simplified the expert task in this second experiment by reducing the set of URLs that experts were asked to look at, and instead of having them rank the best sites, we simply had them assign a rating to each site they visited. The set of sites presented to experts consisted of four subsets: URLs selected by multiple subjects, URLs selected by a single subject, URLs selected by no subjects, and URLs discovered by our crawler. We included all URLs that were selected by more than one subject in the main user study because, according to our subjects, these were the best sites. This is analogous to the standard information retrieval theory that “good” items are very likely to be in the intersection. A small random sample of URLs that were not chosen by any of our subjects was included so we could test our hypothesis that sites selected by subjects would be given the best expert ratings.


We included 5 URLs that were discovered by our crawler and then randomly added sites from the other two groups (URLs selected by one subject and URLs selected by no subjects) in a ratio of 2 to 1 until the set contained approximately 45 URLs. Table 7.2 shows the exact sizes and breakdown of the expert sets for each topic, indicating how many sites were selected from each group and the total size of the original sets when subsets were selected randomly.

Topic               Total Number  Multiply        Singly selected  Sites not  Discovered  Total expert
                    of Sites      selected Sites  Sites            chosen     Sites       dataset
Babylon 5           173           28              8/42             4/104      5           45
Buffy               258           29              8/39             4/190      5           46
Simpsons            210           21              12/38            6/151      5           44
Smashing Pumpkins   95            33              6/16             3/45       5           47
Tori Amos           88            36              4/19             2/33       5           47
Total               824           147             38/154           19/523     25          229

Table 7.2: Number of sites in expert sets. (In cases where we randomly selected a subset of sites, we use the notation x/y to show that we selected x sites out of a possible y.)
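The selection procedure can be sketched as follows (hypothetical names; a simplification of the procedure described above, not the actual study code):

```python
# Build the set of sites for experts to rate, as described above:
# all multiply-selected URLs, all crawler-discovered URLs, then singly-
# selected and unchosen URLs sampled at random in a 2:1 ratio until the
# set reaches roughly 45 URLs. Illustrative only.

import random

def build_expert_set(multi, discovered, single, unchosen, target=45):
    expert_set = list(multi) + list(discovered)
    single, unchosen = list(single), list(unchosen)
    random.shuffle(single)
    random.shuffle(unchosen)
    while len(expert_set) < target and (single or unchosen):
        # Add two singly-selected URLs for every unchosen URL.
        for _ in range(2):
            if single:
                expert_set.append(single.pop())
        if unchosen:
            expert_set.append(unchosen.pop())
    return expert_set
```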

7.6 DATA COLLECTION AND ANALYSIS

The pilot experiment contained a flaw in the way the browsers were set up that might have affected user task time. When browsing sites on the web, network lag can be introduced by Internet congestion or downed servers and routers. In the pilot experiment, this lag could have affected the task time of some subjects, since our browsers simply loaded pages from the original server on which each site resided. This did not seem to create a noticeable delay, but still might have had some minor impact on the pilot results. For the final experiment, we installed a caching server that pre-loaded all web sites for the study onto a local hard drive, providing a frozen snapshot. This way, we ensured that page load times were the same and consistent across all subjects.

In addition to the data collected for the pilot experiment (user selections, task time, browser history, etc.), we also collected additional interface data in this second, main study. By adding interface instrumentation, we collected log files describing usage of the application and its individual interface components. In this way we could find which features of TopicShop were used most often. A user’s behavior in the interface can be tracked more easily through these logs; in the pilot experiment these data were available but would have required searching through hours of videotape.


7.6.1 Phase One: User Study

The main user study was automated so that data collection would be easier than it was for the pilot study. Batch scripts automatically timed tasks, logged data, and transferred files. For each user, we collected a list of sites that the user selected: either a bookmark page for the Yahoo condition or an icon list for TopicShop. The site categorization that users created was also derived from these two files. In addition, a snapshot of a browser history file was written for each user. Two log files were captured to time users and watch what they were doing during the course of the task. One was a system-level log that registered which windows were active at all times, and the other was an application log showing exactly what users were doing by logging where in the browser or TopicShop window users clicked.

The two questionnaires that subjects filled out before and after the study were web-based forms that recorded results to files. The pre-questionnaire gathered basic demographic information along with computer and Internet search experience. Subjects provided their strategies, confidence with results, and interface feature comments in the post-questionnaire.

7.6.2 Phase Two: Expert Ratings

Expert ratings were collected remotely using forms on the web. This allowed experts to rate sites at their leisure and do a more comprehensive job than they might have in a lab setting. We once again collected demographic information with a pre-questionnaire and also asked about their perceived familiarity with the topic they were evaluating. Then the site rating scores from 1 to 7 were collected, followed by a post-questionnaire that gathered information including how they went about doing their evaluation and how long they spent doing it.

7.7 QUANTITATIVE RESULTS

7.7.1 Expert Metrics

Since we collected a numeric rating of each site viewed by experts, we had various applicable methods of using these expert data in our analysis. Below are the two main expert metrics that we used in analyzing our results: expert average and majority score. We of course looked at other metrics, but since results were comparable for all metrics, we chose these two because they are easy to understand and are the most logical for the types of analysis presented in this section.

The first metric that we used was a straightforward average of the three experts’ ratings. This is simply an average rating from 1 to 7 and is easy to use in calculations. The other metric, majority score, is a bit more complicated. Majority score can be defined as the percentage of experts that rated the site 5 or higher. Since humans tend to apply different scales when rating quality, we wanted to collapse the 1 to 7 ratings from our experts into two bins: good and bad. We decided that a rating of 5, 6, or 7 represented a good site and anything below 5 a bad site. So, by counting the number of experts that rated a particular site as good and dividing by the total number of experts, we get a ratio of how many experts considered a site to be of high quality. We considered URLs with a majority score greater than one half to be high quality, and the rest low quality, according to our experts’ ratings. This is equivalent to saying that high-quality sites have been rated as “good” by more than half the experts for that topic. Note that for four of our five topics, a site must be rated good by 2 of 3 experts to be considered high quality, while for the other topic, the Simpsons, it must be rated good by 3 of 4 experts.
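As a concrete sketch (with made-up ratings, not our experts’ data), the majority score can be computed as follows:

```python
# Majority score: fraction of experts who rated a site "good" (5 or higher
# on the 1-7 scale). Ratings below are made up for illustration.

ratings = {"siteA": [6, 7, 4], "siteB": [3, 5, 2], "siteC": [5, 6, 6]}

def majority_score(site_ratings):
    return sum(r >= 5 for r in site_ratings) / len(site_ratings)

for site, rs in ratings.items():
    score = majority_score(rs)
    label = "high quality" if score > 0.5 else "low quality"
    print(site, round(score, 2), label)
# siteA 0.67 high quality; siteB 0.33 low quality; siteC 1.0 high quality
```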

7.7.2 Finding Quality Sites

One of the main goals of our study was to help users find better quality sites. The first analysis we performed looked at the quality of sites found in each interface condition (TopicShop or Yahoo). Using the expert ratings, we computed an average expert majority score for the set of URLs selected by any subject. Recall that experts only rated a subset of the singleton URLs, those selected by only one subject (e.g., only 8 of the 42 Babylon 5 singletons were rated). The average expert majority score therefore includes a normalized expert score for any un-rated (singleton) URLs that were part of a subject’s collection. The normalized expert score is based on the ratings of URLs that experts did judge: for each topic, we computed it as the average score of all expert-rated singly-selected URLs. When computing the average expert majority score, we substituted the topic-specific normalized expert score for each un-rated singleton URL, rather than using zero to indicate that the URL was not rated. This way a subject’s average expert majority score was not penalized because experts were unable to rate their selected URLs due to time constraints. Table 7.3 shows a summary of scores for each topic.
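A minimal sketch of this substitution (hypothetical names and numbers, not the actual analysis code):

```python
# Average expert majority score for one subject's collection, substituting
# the topic's normalized score for un-rated singleton URLs. Illustrative only.

def average_majority(collection, rated, topic_normalized_score):
    # `rated` maps expert-rated URLs to their majority scores.
    scores = [rated.get(url, topic_normalized_score) for url in collection]
    return sum(scores) / len(scores)

rated = {"siteA": 0.67, "siteB": 0.33}
normalized = 0.40   # average majority score of rated singly-selected URLs
print(average_majority(["siteA", "siteB", "siteX"], rated, normalized))
# (0.67 + 0.33 + 0.40) / 3, approximately 0.47
```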

Topic                        TopicShop  Yahoo  % Increase         Maximum
                                               (TS over Yahoo)    Possible Score
Babylon 5                    0.52       0.38   36.11%             0.91
Buffy the Vampire Slayer     0.50       0.27   80.65%             0.84
Simpsons                     0.40       0.22   80.75%             0.83
Smashing Pumpkins            0.38       0.25   53.48%             0.55
Tori Amos                    0.49       0.26   92.73%             0.75
Average                      0.46       0.28   65.72%             0.78

Table 7.3: Average expert majority scores for TopicShop and Yahoo users

The scores presented are majority scores, which show the percentage of sites in a subject’s collection that would be rated good by the experts. Overall, TopicShop subjects were able to select 66% more high-quality sites than Yahoo subjects. The majority score for TopicShop subjects was 0.46, but only 0.28 for Yahoo subjects (two-way ANOVA, interface factor F(1,30)=36.94, p<0.00001). The topic factor was significant (F(4,30)=2.84, p<0.06), but the interaction was not (F(4,30)=0.49, p<0.74). There was some variability across topics, which may be due to differing amounts of quality content about each topic. Most topics contained only a small number of high-quality sites, so the total expert majority scores for a subject’s collection of 15 sites must include some lower-quality sites that were not rated good by a majority of the experts. The last column in Table 7.3 shows the maximum possible average majority score for the best 15 sites in each topic.

Since there were so few good sites, it is worthwhile to look at the top 5 and top 10 sites in each subject’s collection to see how many of the good sites subjects selected were rated high in quality by the experts. This gives an indication of how many of the limited quality sites subjects were able to find using each interface. Table 7.4 shows the results of this analysis.


Topic                        Sites  TopicShop  Yahoo  % Increase
Babylon 5                     5     0.98       0.82   20.41%
                             10     0.70       0.53   31.25%
Buffy the Vampire Slayer      5     0.92       0.62   48.65%
                             10     0.63       0.34   85.37%
Simpsons                      5     0.88       0.55   59.09%
                             10     0.50       0.31   63.27%
Smashing Pumpkins             5     0.80       0.65   23.08%
                             10     0.55       0.37   50.00%
Tori Amos                     5     0.90       0.52   74.19%
                             10     0.68       0.34   97.56%
Average                       5     0.90       0.63   42.06%
                             10     0.61       0.38   65.49%

Table 7.4: Majority score for top 5/top 10 user sites

Limiting the analysis in this fashion shows that 90% of TopicShop subjects’ top 5 sites were rated good by experts, compared to 63% for Yahoo subjects’ sites. When looking at the top 10 sites in each subject’s collection, the percentages of good sites were 61% and 38% for TopicShop and Yahoo subjects, respectively. Again, TopicShop subjects found more of the better sites than Yahoo subjects. Two-way ANOVAs were run to check statistical significance. For the majority score of the top 10 sites, the interface factor was significant (F(1,30)=21.37, p<0.00005), but the topic factor and the interaction were not (topic: F(4,30)=1.94, p<0.13; interaction: F(4,30)=0.43, p<0.79). The analysis for the majority score of the top 5 sites was similar (interface: F(1,30)=28.37, p<0.00009; topic: F(4,30)=2.08, p<0.11; interaction: F(4,30)=0.84, p<0.51).

Topic                      TopicShop   Yahoo   % Increase
Babylon 5                    7.00       5.75     21.74%
Buffy the Vampire Slayer     7.25       3.50    107.14%
Simpsons                     6.50       5.25     23.81%
Smashing Pumpkins            8.50       5.00     70.00%
Tori Amos                    7.75       3.00    158.33%
Average                      7.40       4.50     76.20%

Table 7.5: Intersection between user selections and the top 15 expert-rated sites
We performed another analysis to look at the number of sites from a subject's collection that intersected with the top 15 expert-rated sites. This metric is more straightforward than the majority scores presented above. For each topic, we generated a "good set" of sites by sorting all expert-rated sites and selecting the 15 best. We then measured the quality of a user's collection by counting how many of their selected sites matched the best sites according to experts. Table 7.5 shows the average number of sites that intersected with the "good set" for each interface condition. Of the 15 best sites, TopicShop subjects found 7.4 on average, while Yahoo subjects found only 4.5. This is a 76.2% increase in quality for TopicShop users. Notice that the relative benefit of TopicShop over Yahoo varies from one metric to another (i.e., 76.2% better for the intersection and 65.72% better for the majority score analysis), because each of these analyses looks at the data in a different way. We computed a 2x5 two-factor ANOVA on this metric since it was also computed in the pilot study. Results showed that the main effect of interface was significant (F(1,30)=18.55, p<.0002); topic and the interaction between topic and interface were both insignificant (topic: F(4,30)=0.656, p<0.627; interaction: F(4,30)=1.097, p<0.376). Once again, since topic is insignificant and the interaction also has no effect, we report the remaining statistical results below using pooled independent means t-tests.
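
A minimal sketch of this "good set" intersection metric, assuming each site has an average expert rating available (all names and values here are illustrative):

    # Sketch of the "good set" intersection metric described above.

    def good_set(expert_avg_rating, n=15):
        """The n best sites according to average expert rating."""
        ranked = sorted(expert_avg_rating, key=expert_avg_rating.get, reverse=True)
        return set(ranked[:n])

    def intersection_score(user_selection, expert_avg_rating, n=15):
        """How many of a user's selected sites fall in the expert good set."""
        return len(set(user_selection) & good_set(expert_avg_rating, n))

    # Toy example with three rated sites and a two-site good set.
    ratings = {"siteA": 6.3, "siteB": 5.1, "siteC": 2.0}
    print(intersection_score(["siteA", "siteC"], ratings, n=2))  # -> 1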

7.7.3 User Search Efficiency

It is important not only to find quality sites, but also to find them quickly. The time and effort that subjects expend matters when evaluating interfaces designed to help users search the web. The next two sections present analyses of the time and effort users spent selecting the sites in their collections. Recall that in this experiment, subjects were given as much time as they needed. TopicShop subjects completed the task in a little over half an hour, while Yahoo subjects typically took almost an hour: the average task time was 37.7 minutes for TopicShop subjects, 28.2% faster than Yahoo subjects, who took over 52 minutes on average. Table 7.6 shows the task times for each topic in both interface conditions. Statistical analysis shows that the overall averages across the five topics were significantly different (pooled independent means t-test, t(38)=-4.219, p<0.00007).

Topic                      TopicShop   Yahoo   % Diff
Babylon 5                    41.45     51.45   19.44%
Buffy the Vampire Slayer     41.23     61.77   33.26%
Simpsons                     33.36     54.05   38.38%
Smashing Pumpkins            35.55     43.71   18.66%
Tori Amos                    36.87     51.68   28.67%
Average                      37.69     52.53   28.25%

Table 7.6: Task Time (in minutes)


We also calculated the time it took each subject to find the 5 sites from their collection with the highest expert ranking. Since most people do not want to search through more than a few web sites to find the information they are looking for, it is important to analyze how quickly an interface allows subjects to find the best material. These results are summarized in Table 7.7.

Topic                      TopicShop   Yahoo   % Diff
Babylon 5                    13.12     15.93   17.64%
Buffy the Vampire Slayer      9.20     24.52   62.48%
Simpsons                     16.64     25.78   35.46%
Smashing Pumpkins             7.99      9.05   11.72%
Tori Amos                    15.46     20.59   24.91%
Average                      12.48     19.17   34.90%

Table 7.7: Time to visit Top 5 sites (in minutes)

Again, TopicShop subjects found the quality sites faster. TopicShop subjects selected their 5 best sites in an average of 12.48 minutes, 34.9% faster than Yahoo subjects, who took 19.17 minutes to select their 5 best sites. The difference in average times was statistically significant (pooled independent means t-test, t(38)=-2.356, p<.01). As shown in Table 7.4, the quality of the 5 best sites found by TopicShop subjects was also much higher than that of the 5 best sites found by Yahoo subjects.

The task for the first phase of this experiment included selection and then organization. Most users intertwined these two sub-tasks, browsing and selecting a small number of sites and then incorporating them into their overall site organization. Another interesting analysis is to look at the percentage of time users spent performing each sub-task.

                               TopicShop                  Yahoo
Topic                      Browsing   Organizing   Browsing   Organizing
Babylon 5                   82.73%      17.27%      61.61%      38.39%
Buffy the Vampire Slayer    82.82%      17.18%      71.32%      28.68%
Simpsons                    73.90%      26.10%      71.32%      28.68%
Smashing Pumpkins           83.66%      16.34%      67.08%      32.92%
Tori Amos                   84.97%      15.03%      52.28%      47.72%
Average                     81.62%      18.38%      63.65%      36.35%

Table 7.8: Percentage of time spent Browsing/Organizing

The results in Table 7.8 show that TopicShop shifted subjects' time from organizing to browsing and selecting. This suggests that organization is much easier in TopicShop: users can more efficiently categorize their collections of sites, freeing up more of their time on this task for browsing. TopicShop subjects browsed sites on their topic for 82% of the time they participated in the experiment, while Yahoo subjects spent only 64% of their time actually viewing content about their topic (pooled independent means t-test, t(38)=-4.893, p<0.00009). Note that this is a slightly different metric than the one calculated in the pilot study: instead of comparing the time a user spent using the interface versus the time they spent using their browser, here we compared the time spent viewing content in the browser or interface with the time spent organizing the selected sites.

7.7.4 Required Effort

One indication of the amount of effort required of users to find quality sites is the number of sites they must visit to find a set acceptable to them. This roughly equates to the amount of work users performed in completing the task and also gives an indication of how much of that work was wasted effort. Clearly, if a user trying to find 15 sites must look at 50, they are wasting a lot of time sifting through low quality sites. With TopicShop, we can reduce the number of sites users must investigate to find a representative set of quality sites. This is summarized in Table 7.9.

Topic                      Total # of Sites   TopicShop   Yahoo   % Diff
Babylon 5                        173            31.00     45.00   31.11%
Buffy the Vampire Slayer         258            25.00     50.00   50.00%
Simpsons                         210            22.50     38.25   41.18%
Smashing Pumpkins                 95            33.25     36.25    8.28%
Tori Amos                         88            23.50     31.75   25.98%
Average                       164.80            27.05     40.25   32.80%

Table 7.9: Average number of sites browsed

On average, TopicShop subjects browsed only 27 sites while Yahoo subjects visited 40. Comparing these numbers to the total number of sites available, subjects in both conditions visited far fewer sites than the total, but TopicShop subjects considered 32.8% fewer sites on average (pooled independent means t-test, t(38)=-4.788, p<0.00001). Since TopicShop gives users additional information about the sites, it is logical that they can rapidly identify sites they think will be high quality, eliminating the need to visit a large number of low quality sites. Subjects in the Yahoo condition also attempted to eliminate sites up front to avoid visiting them, but at best they could only guess based on title and annotation.


7.7.5 User Categorization

From the organization sub-task, we had the groupings that subjects imposed on their collections of web sites. We analyzed these categories to assess whether subjects agreed with other subjects in their interface condition on which sites should be categorized together. To evaluate the different categorizations users made, we first looked at the size of the site intersections between users' selected sets. By looking at the number of sites users had in common with each other, we obtained a better idea of how many sites they might group similarly.

Topic                      TopicShop   Yahoo   % Diff
Babylon 5                    5.17       3.00   72.22%
Buffy the Vampire Slayer     5.00       3.17   57.89%
Simpsons                     7.00       4.17   68.00%
Smashing Pumpkins            6.83       6.00   13.89%
Tori Amos                    5.67       3.17   78.95%
Average                      5.93       3.90   58.19%

Table 7.10: Average site intersection among users

The values presented in Table 7.10 represent the average number of pair-wise intersections within each topic and condition. These were calculated by computing the set of sites selected by each pair of subjects within a topic and interface condition and then averaging the size of those sets across the 6 pairs of subjects (all possible pairs of the 4 subjects in each condition). On average, TopicShop users selected 6 sites that were also selected by other users, while Yahoo users selected 4 such sites (pooled independent means t-test, t(58)=-4.256, p<0.00004). Unfortunately, this set of sites to compare across subjects is fairly small. More shared sites among users would make it easier to investigate how they formed categories, but even a small intersecting set gives an indication of how much agreement there was within a topic's categorizations. A minimal sketch of the pair-wise intersection computation follows.

We defined a number of metrics to measure performance on the organization sub-task. The metrics characterize the effort involved, the level of detail of the organization, and the amount of agreement between subjects on how sites should be grouped.
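
This sketch assumes four subjects per topic and condition, as in the study; the site names are illustrative:

    # Sketch of the average pair-wise intersection size described above.
    from itertools import combinations

    def average_pairwise_intersection(selections):
        """selections: list of per-subject site sets (4 subjects -> 6 pairs)."""
        sizes = [len(a & b) for a, b in combinations(selections, 2)]
        return sum(sizes) / len(sizes)

    # Toy example with four subjects' selections.
    subjects = [{"a", "b", "c"}, {"b", "c", "d"}, {"a", "c"}, {"c", "d", "e"}]
    print(average_pairwise_intersection(subjects))  # -> 1.5 over the 6 pairs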


We first computed how much time subjects spent on the organization sub-task (by examining the log files). TopicShop subjects spent 18% of their total time organizing, while Yahoo subjects spent 36%. Since TopicShop subjects spent less time organizing sites, they could devote more time to evaluating and understanding the content of sites and selecting the good ones. Yet, even while taking less time, TopicShop users still created finer-grained and more informative organizations, as we discuss next.

We also computed the number of groups that subjects created. TopicShop subjects created 4 groups on average, and Yahoo subjects created 3. Thus, TopicShop subjects articulated the structure of the topic somewhat more. In addition, TopicShop subjects grouped nearly all of their selected sites (3% were left ungrouped), while Yahoo subjects left more ungrouped (15% were not grouped).

TopicShop subjects also created more site annotations, making their collections more informative for their own use or for sharing with others. The experiment did not require subjects to annotate sites, yet 10 of 20 TopicShop subjects did so, annotating a total of 15% of their selected sites. Two Yahoo subjects annotated a total of four sites. TopicShop subjects annotated sites using the group and site annotation features of TopicShop; since Yahoo subjects were using the bookmarking feature of their browser, they were also able to annotate sites and groups directly in their bookmarks.

One way to gain insight into how groups are formed is to find out what percentage of sites users categorize similarly. This is a difficult issue to investigate; in general, it requires interpreting the semantics of groups. We computed a simpler metric: for each pair of subjects within a topic and interface condition, and for each pair of sites they had in common, we determined whether they agreed on their categorization, i.e., whether both put the sites in the same group or both put them in different groups. If both subjects grouped the pair of sites together, or both grouped them separately, we counted this as agreement; otherwise, we counted it as disagreement. A sketch of this computation appears below.
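
This sketch assumes each subject's organization is available as a mapping from site to group label; the group names and sites are illustrative:

    # Sketch of the pairwise category agreement metric described above.
    from itertools import combinations

    def category_agreement(groups_a, groups_b):
        """Fraction of common site pairs that two subjects categorize alike.

        groups_a / groups_b map each selected site to its group label.
        For every pair of sites both subjects selected, they agree if both
        placed the pair in one group, or both split it across groups.
        """
        common = set(groups_a) & set(groups_b)
        agree = total = 0
        for s1, s2 in combinations(sorted(common), 2):
            together_a = groups_a[s1] == groups_a[s2]
            together_b = groups_b[s1] == groups_b[s2]
            agree += together_a == together_b
            total += 1
        return agree / total if total else 0.0

    # Toy example: the subjects agree on (a, b) but disagree on pairs with c.
    a = {"a": "audio", "b": "audio", "c": "images"}
    b = {"a": "music", "b": "music", "c": "music"}
    print(category_agreement(a, b))  # 1 of 3 pairs agree -> 0.33...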


Topic                      Interface    1,2    1,3    1,4    2,3    2,4    3,4    Avg     % Diff
Babylon 5                  TSP         0.64   1.00   0.73   1.00   0.50   0.80   0.78    99.57%
                           Yahoo       1.00   0.00   0.00   0.67   0.00   0.67   0.39
Buffy the Vampire Slayer   TSP         0.79   0.50   0.00   0.73   0.83   0.70   0.59    33.46%
                           Yahoo       0.67   0.33   1.00   0.00   0.33   0.33   0.44
Simpsons                   TSP         0.76   0.81   0.93   0.81   0.71   0.67   0.78   116.13%
                           Yahoo       0.67   0.00   0.00   0.40   0.60   0.50   0.36
Smashing Pumpkins          TSP         0.76   0.90   0.72   0.87   0.71   0.53   0.75    40.31%
                           Yahoo       0.90   0.60   0.40   0.50   0.40   0.40   0.53
Tori Amos                  TSP         0.50   0.40   0.62   0.33   0.33   0.67   0.48    17.28%
                           Yahoo       1.00   0.50   0.00   0.33   0.60   0.00   0.41
Average                    TSP                                                   0.68    61.35%
                           Yahoo                                                 0.43

Table 7.11: Pairwise category agreement between users (1-4)

Table 7.11 shows details of this analysis. Each numbered column represents the percentage of site-pair agreement between the corresponding two subjects. On average, TopicShop subjects agreed on the categorization of 68% of the sites they had in common with other subjects, while Yahoo subjects agreed on only 43%. TopicShop subjects, on average, created more categories than Yahoo subjects, so random agreement would be less likely to occur between TopicShop subjects; yet they actually agreed more often than Yahoo subjects (since the pairwise category agreements are not independent, no t-test was computed for this analysis). The organizational facilities provided by TopicShop allowed users to easily group and evaluate the sites in their collections. Subjects used these abilities along with the site profiles and therefore had an advantage over the Yahoo condition when forming groups.


Figure 7.2: A sample subject's categorization of Tori Amos sites (Subject 3)

Figure 7.2 and Figure 7.3 show the groups formed by two Tori Amos subjects (subjects 3 and 4 in Table 7.11). These two subjects had ten sites in common in their final sets of selected sites and agreed on 67% of their pair-wise URL categorizations. Each subject had two separate groups devoted to audio; combining each of these, we can see that they categorized 4 sites similarly as audio sites. Each subject also formed two additional groups: for subject 3, 'tori amos images' and 'complete information on tori amos'; for subject 4, 'Pics' and 'Best'. These two subjects thus categorized their sites into 3 semantically identical categories: audio sites, image sites, and a category representing the best available comprehensive sites on Tori Amos. Figure 7.4 is a representation of the 3 main groups, showing how these two subjects agreed on their categorization. They placed 8 of the 10 sites they had in common in the same groups, but two of the sites that subject 3 classified as image sites were classified by subject 4 as audio and best. This is not surprising, since most sites can be classified into several different categories depending on a user's personal preference and interpretation.

Figure 7.3: A second subject's categorization of Tori Amos sites. (Subject 4)


[Figure 7.4 diagram: three overlapping groupings (Audio, Images, Complete), each showing subject 3's (S3) and subject 4's (S4) placements; the legend distinguishes common sites placed in the same group, common sites placed in different groups, and sites not in common.]

Figure 7.4: Groups for Tori Amos as created by subjects 3 & 4

These results show that TopicShop subjects did a better job of organizing the items they selected: they created more groups, annotated more sites, and agreed more often on how items should be grouped. Further, they achieved these results in half the time Yahoo subjects devoted to the task. We attribute these results to TopicShop making grouping and annotation very easy, and to the rich information about sites that remains visible while users organize them.

7.7.6 Relationship between Evaluation and Organization Sub-tasks

We also studied the relationship between the evaluation and organization sub-tasks. The TopicShop Explorer allows these sub-tasks to be integrated, but does not force a user to perform them together. In the Yahoo/bookmarks condition, on the other hand, browsing sites and organizing bookmarks can only be performed as separate sub-tasks. The log files contained data that let us quantify the relationship between these sub-tasks: each user action was timestamped, and we knew whether it was an evaluation or an organization action. Evaluation actions included visiting a page in a web browser and sorting data in the Site Profiles window. For TopicShop, organization actions included moving or annotating icons or groups in the Work Area. In the Yahoo/bookmarks condition, organization actions included creating a bookmarks folder, naming a folder, naming a bookmarked item, and placing an item in a folder. We computed how many actions of each type occurred in each quartile of the task, i.e., how many occurred in the first 25% of the total time a subject spent on the task, how many in the second 25%, and so on (a minimal sketch of this bucketing appears below).

Table 7.12 shows results for organizational actions. First, it shows how much more organizational work TopicShop users did: 533 actions vs. 172 (and recall they did this in half the time). Second, as expected, TopicShop users integrated organization and evaluation to a much greater extent than did Yahoo users: they did about a quarter of their total organizational work in each of the first two quartiles, dipped slightly in the third quartile, then increased a bit in the final quartile. Yahoo users, on the other hand, did virtually no organizational work in the first quartile of their task, and ended by doing more than 50% in the last quartile. We should emphasize that TopicShop does not force sub-task integration; rather, it enables it. And when users had the choice, they overwhelmingly preferred to integrate the sub-tasks of evaluation and organization.

                     TopicShop                    Yahoo
Quartile       # of actions  % of total    # of actions  % of total
Quartile 1          125          23%             2            1%
Quartile 2          138          26%            31           18%
Quartile 3          110          21%            50           29%
Quartile 4          160          30%            89           52%
Total               533                        172

Table 7.12: Distribution of organizational actions across time quartiles

We also constructed detailed timelines of user activity. Figure 7.5 shows such timelines for two Yahoo and two TopicShop subjects; they provide vivid illustrations of the overall results. TopicShop users interleaved the two sub-tasks throughout the course of their work and performed many more organization actions. Yahoo users, on the other hand, began by focusing exclusively on evaluation; then, toward the end of the task, they shifted to focus mostly on organization, and they did much less organization overall.
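
The quartile bucketing described above can be sketched as follows; the log tuple format and the example values are illustrative assumptions, not the actual log format:

    # Sketch of counting organization actions per task-time quartile.

    def actions_per_quartile(actions, task_length):
        """actions: list of (timestamp_minutes, kind) tuples, where kind is
        "evaluate" or "organize"; task_length: total task time in minutes.
        Returns organize-action counts for the four quartiles."""
        counts = [0, 0, 0, 0]
        for t, kind in actions:
            if kind == "organize":
                q = min(int(4 * t / task_length), 3)  # clamp t == task_length
                counts[q] += 1
        return counts

    # Toy example: a 40-minute task with organize actions spread out.
    log = [(2, "organize"), (12, "evaluate"), (18, "organize"),
           (31, "organize"), (39, "organize")]
    print(actions_per_quartile(log, 40))  # -> [1, 1, 0, 2]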


[Figure 7.5 plot: four horizontal timelines (two TopicShop users, two Yahoo users) plotted against task time in minutes (0 to 80), annotated per quartile with counts of evaluate and organize actions. The TopicShop timelines show organize actions interleaved with evaluate actions throughout; the Yahoo timelines show almost no organize actions until the end of the task.]

Figure 7.5: Timelines of user activity. TopicShop users did more organization actions and interleaved organization with evaluation. Yahoo/bookmarks users did less organization, and did it at the end of their task.

7.7.7 Expert Ratings for Site Breakdowns

To reduce the amount of work our experts had to do, we selected a subset of sites on each topic to be presented to the experts, as explained above (section 7.5). Sites in the expert set fell into four categories: sites selected by multiple users, sites selected by one user, sites selected by no users, and sites discovered by our crawler (and shown to no users). Our expectation supporting this choice was that a site selected by more than one user would be of higher quality than a site selected by a single user, and that any site not chosen by any user would be of the lowest quality. We validated this claim by looking at the average expert rating for each of the groups (a high expert rating is good).

87

Topic                 Multiply selected   Singly selected   Sites not   Discovered
                      sites               sites             chosen      sites
Babylon 5                  4.78                3.37            2.84        4.70
Buffy                      4.28                3.58            1.75        4.2
Simpsons                   3.91                2.33            2.33
Smashing Pumpkins          3.45                2.33            1.44
Tori Amos                  3.95                3.16            2.12
Total                      4.05                2.91            2.09

