Global Reach Internet Productions - Ames & Des Moines Iowa
07-August-2008
Web Usage Mining

The goal of this project is to develop a general framework for the application of data mining and knowledge discovery techniques to discover patterns from clickstream and e-commerce data. WEBMINER is a Web usage mining system integrating techniques to discover association rules, sequential patterns, and classification rules from WWW transaction data. The system includes a knowledge query mechanism and new algorithms for inferring and identifying unique user sessions and transactions from missing data.

WEBMINER: A System for Pattern Discovery from World Wide Web Transactions

This research involves the development of a general and flexible framework for Web usage mining: the application of data mining techniques, such as the discovery of association rules and sequential patterns, to extract relationships from data collected in large Web data repositories. The proposed framework includes a modular architecture for the Web mining process that separates the domain-dependent data transformation tasks, such as the discovery and identification of several types of user transactions, from the generic data mining engine, together with data and transaction models for each of these components. The knowledge discovered can be used, for instance, to restructure a Web site for increased effectiveness, to better manage workgroup communication in intranets, and to analyze user access patterns so that information tailored to specific groups of users can be presented dynamically. The architecture of the WEBMINER system, which is based on the proposed framework, is depicted below.


[Figure: architecture of the WEBMINER system]

One of the significant factors that distinguish Web mining from other data mining activities is the method used for identifying user transactions. This process, which we call transaction identification, is particularly important in the context of Web data because it allows the discovery phase to focus on the relevant access points of a particular user rather than on intermediate pages accessed for navigational reasons. However, special algorithms must be used to identify unique user sessions and to find user transactions, since references are generally not uniquely identified by user and many references are served from caches held by client-side agents or proxy servers and so never reach the server log. A combination of standard methods, such as client-side cookies, and heuristics is used to identify unique user sessions. The heuristics include using the IP, agent, and OS fields as key attributes; using session time-outs; using synchronized referrer log entries to expand user paths belonging to a session; and using sophisticated algorithms to infer cached references by completing and disambiguating user paths belonging to a session.
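The first two heuristics (key attributes plus a session time-out) can be illustrated with a minimal Python sketch. It is not the WEBMINER implementation: the record layout `(ip, user_agent, timestamp, url)`, the function name `sessionize`, and the 30-minute time-out are all assumptions made for illustration, and the OS field is assumed to be carried inside the agent string.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Assumed idle time-out; the text does not state a value.
SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(entries, timeout=SESSION_TIMEOUT):
    """Group log entries into sessions keyed by (ip, agent), split on idle gaps."""
    by_user = defaultdict(list)
    for ip, agent, ts, url in entries:
        by_user[(ip, agent)].append((ts, url))

    sessions = []
    for key, hits in by_user.items():
        hits.sort()  # order each user's requests by timestamp
        current = [hits[0]]
        for prev, hit in zip(hits, hits[1:]):
            # A gap longer than the time-out closes the current session.
            if hit[0] - prev[0] > timeout:
                sessions.append((key, [u for _, u in current]))
                current = []
            current.append(hit)
        sessions.append((key, [u for _, u in current]))
    return sessions

# Hypothetical log: two agents behind one IP address.
log = [
    ("10.0.0.1", "Mozilla/4.0 (Windows)", datetime(2008, 8, 7, 9, 0), "/index.html"),
    ("10.0.0.1", "Mozilla/4.0 (Macintosh)", datetime(2008, 8, 7, 9, 2), "/news.html"),
    ("10.0.0.1", "Mozilla/4.0 (Windows)", datetime(2008, 8, 7, 9, 5), "/products.html"),
    ("10.0.0.1", "Mozilla/4.0 (Windows)", datetime(2008, 8, 7, 10, 30), "/index.html"),
]
sessions = sessionize(log)
```

Keying on the agent string as well as the IP address separates users who share a proxy, and the time-out splits a returning visitor's requests into separate sessions. Referrer-based path expansion and inference of cached references are not covered by this sketch.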

Once unique user sessions are identified, the grouping of user references into transactions must take into account both the nature of the data and the type of analysis to be done. We propose to use such information in a two-step process. In the first step, we use clustering as a general approach to grouping references into transactions. The clustering is based on comparing pairs of log entries and determining the similarity between them by means of one or more distance measures. WEBMINER also uses a model of user browsing behavior and statistical techniques to automatically determine whether a particular user treats individual references as content or navigational references. In the second step, we use information about the type of analysis to specialize the groups formed in the first step into transactions suited to that analysis. When clustering is used to determine the similarity of two references, i.e., whether they belong to the same group, distance metrics can be defined on many different attributes. Determining an appropriate set of attributes to cluster on, and defining appropriate distance metrics for them, is an important problem that is being addressed in our ongoing research.
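As a rough illustration of the first step, the sketch below groups the references of one session using a pairwise distance. It is only one possible reading: the two attributes (time gap and URL path), the weights, the threshold, and the simple sequential threshold clustering are assumptions chosen for brevity, not the attributes or algorithm used in WEBMINER.

```python
from datetime import datetime

def path_distance(url_a, url_b):
    """1.0 minus the share of leading path segments the two URLs have in common."""
    a = [s for s in url_a.split("/") if s]
    b = [s for s in url_b.split("/") if s]
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    shared = 0
    for x, y in zip(a, b):
        if x != y:
            break
        shared += 1
    return 1.0 - shared / longest

def distance(ref_a, ref_b, time_scale=600.0, w_time=0.5, w_path=0.5):
    """Weighted distance between two (timestamp, url) references (assumed weights)."""
    (ts_a, url_a), (ts_b, url_b) = ref_a, ref_b
    # Time gap in seconds, scaled and capped so it lies in [0, 1].
    gap = min(abs((ts_b - ts_a).total_seconds()) / time_scale, 1.0)
    return w_time * gap + w_path * path_distance(url_a, url_b)

def group_references(session, threshold=0.6):
    """Start a new group whenever a reference is too far from the previous one."""
    groups = []
    for ref in session:
        if groups and distance(groups[-1][-1], ref) <= threshold:
            groups[-1].append(ref)
        else:
            groups.append([ref])
    return groups

# Hypothetical time-ordered references from a single session.
session = [
    (datetime(2008, 8, 7, 9, 0), "/products/index.html"),
    (datetime(2008, 8, 7, 9, 1), "/products/widgets.html"),
    (datetime(2008, 8, 7, 9, 2), "/products/gadgets.html"),
    (datetime(2008, 8, 7, 9, 20), "/support/faq.html"),
    (datetime(2008, 8, 7, 9, 21), "/support/contact.html"),
]
groups = group_references(session)
```

With these assumed settings the product pages fall into one group and the support pages into another. Changing the attributes, weights, or threshold changes the grouping, which is why the choice of attributes and metrics is treated above as an open problem.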

Our specific goals in this research include (i) development of a flexible architecture for Web usage mining, (ii) development of a model for a user transaction that consists of multiple user references, (iii) development of clustering algorithms for grouping log entries into transactions, (iv) integration of data from other sources, such as user registration databases, with access log data, (v) adaptation of association rule, temporal sequence, and classification rule discovery algorithms to Web mining, (vi) development of knowledge-based intelligent agents to interpret the discovered rules, (vii) development of a flexible query mechanism that can be used to query the integrated data and the discovered rules in a unified manner, and (viii) experimental evaluation of the system.
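To make goal (v) concrete, the following Python sketch derives association rules between pairs of pages from a set of transactions, using the standard support and confidence measures. The function name `pair_rules`, the thresholds, and the sample transactions are hypothetical; the sketch is restricted to page pairs and is not the adapted algorithm the project describes.

```python
from collections import Counter
from itertools import combinations

def pair_rules(transactions, min_support=0.5, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) rules between pairs of pages."""
    n = len(transactions)
    item_counts = Counter()
    pair_counts = Counter()
    for t in transactions:
        pages = sorted(set(t))  # ignore repeat visits within a transaction
        item_counts.update(pages)
        pair_counts.update(combinations(pages, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # share of transactions containing both pages
        if support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # Confidence: share of transactions with lhs that also contain rhs.
            confidence = count / item_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return rules

# Hypothetical transactions, each a list of pages referenced together.
transactions = [
    ["/products.html", "/order.html"],
    ["/products.html", "/order.html", "/support.html"],
    ["/products.html", "/news.html"],
    ["/order.html", "/products.html"],
]
rules = pair_rules(transactions)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} => {rhs}  support={support:.2f}  confidence={confidence:.2f}")
```

A rule such as `/order.html => /products.html` with high confidence is the kind of relationship that could inform restructuring a site, for example by linking the two pages more directly.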

The Web usage mining project is being conducted by Dr. Bamshad Mobasher. Dr. Mobasher is the VP for Research and Development at Global Reach and the Director of the Center for Web Intelligence at DePaul University.

 

2321 North Loop Drive · Ames, Iowa 50010 · Phone: 515-296-0792 · Toll-Free: 888-287-4108