The goal of this project is to develop a general framework for the application of data mining
and knowledge discovery techniques to discover patterns from clickstream and e-commerce data.
WEBMINER is a Web usage mining system integrating techniques to discover association rules,
sequential patterns, and classification rules from WWW transaction data. The system includes
a knowledge query mechanism and new algorithms for inferring and identifying unique user
sessions and transactions from incomplete data.
WEBMINER: A System for Pattern Discovery from World Wide Web Transactions
This research involves the development of a general and flexible framework for Web
usage mining, that is, the application of data mining techniques, such as the discovery of association rules and
sequential patterns, to extract relationships from data collected in large Web data repositories. The
proposed framework includes a modular architecture for the Web mining process, which distinguishes
between the domain-dependent data transformation tasks, such as the discovery and identification of
several types of user transactions, and the generic data mining engine. It also includes data and transaction models
for each of these components. The type of knowledge discovered can be used, for instance, to
restructure a Web site for increased effectiveness, to better manage workgroup communication
in intranets, and to analyze user access patterns in order to dynamically present information tailored to
specific groups of users. The architecture of the WEBMINER system, which is based on the proposed
framework, is depicted below.
[Figure: architecture of the WEBMINER system]
One of the significant factors that distinguish Web mining from other data
mining activities is the method used for identifying user transactions. This process, which we
call transaction identification, is particularly important in the context of Web data, because
it allows the discovery phase to focus on relevant access points of a particular user rather
than the intermediate pages accessed for navigational reasons. However, special algorithms must
be used to identify unique user sessions and to find user transactions, since generally
references are not uniquely identified by user and many of the references are cached by client-side
agents or proxy servers. A combination of standard methods, such as client-side cookies, and
heuristics is used to identify unique user sessions. The heuristics include using IP, agent,
and OS fields as key attributes; using session time-outs; using synchronized referrer log entries
to expand user paths belonging to a session; and using sophisticated algorithms to infer cached
references by completing and disambiguating user paths belonging to a session.
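The first two heuristics (key attributes and session time-outs) can be illustrated with a minimal Python sketch. It is not the WEBMINER implementation: the log tuple layout, the sample entries, the function name identify_sessions, and the 30-minute time-out are all assumptions made for illustration, and the OS is assumed to be carried inside the agent string.

```python
from datetime import datetime, timedelta

# Hypothetical log entries: (timestamp, ip, user_agent, url).
# The agent string is assumed to carry the OS as well.
LOG = [
    (datetime(2000, 1, 1, 10, 0), "10.0.0.1", "Mozilla/4.0 (Win98)", "/index.html"),
    (datetime(2000, 1, 1, 10, 5), "10.0.0.1", "Mozilla/4.0 (Win98)", "/products.html"),
    (datetime(2000, 1, 1, 10, 6), "10.0.0.1", "Lynx/2.8 (Linux)", "/index.html"),
    (datetime(2000, 1, 1, 11, 30), "10.0.0.1", "Mozilla/4.0 (Win98)", "/index.html"),
]


def identify_sessions(entries, timeout=timedelta(minutes=30)):
    """Group entries into sessions keyed by (ip, agent), split on idle time-outs."""
    sessions = []        # list of {"key": (ip, agent), "urls": [...]}
    open_session = {}    # key -> index of that key's current session
    last_seen = {}       # key -> timestamp of that key's latest entry
    for ts, ip, agent, url in sorted(entries, key=lambda e: e[0]):
        key = (ip, agent)
        # A new key, or a gap longer than the time-out, starts a new session.
        if key not in last_seen or ts - last_seen[key] > timeout:
            sessions.append({"key": key, "urls": []})
            open_session[key] = len(sessions) - 1
        sessions[open_session[key]]["urls"].append(url)
        last_seen[key] = ts
    return sessions


if __name__ == "__main__":
    for session in identify_sessions(LOG):
        print(session["key"], session["urls"])
```

In this sample, the same IP address yields three sessions: two because the agent differs, and a third because the gap between requests from the first agent exceeds the time-out. Referrer-based path expansion and inference of cached references are not covered by this sketch.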
Once unique user sessions are identified, grouping user references into
transactions must use information about both the nature of the data and the type of analysis to
be done. We propose to use such information in a two-step process. In the first step, we use
clustering as a general approach to grouping references into transactions. The clustering is
based on comparing pairs of log entries and determining the similarity between them by means
of one or more distance measures. WEBMINER also uses a model of user browsing behavior and
statistical techniques to automatically determine whether a particular user treats individual
references as content or navigational references. In the second step, we use information about
the type of analysis to specialize the groups formed in step 1 into transactions suited to the
specific analysis. In using clustering to determine the similarity of two references, i.e.,
whether they belong to the same group, distance metrics on many different attributes can be
defined. Determining an appropriate set of attributes to cluster on, and defining appropriate
distance metrics for them, is an important problem that is being addressed in our ongoing research.
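As a rough illustration of the first step, the following Python sketch groups the references of one session using a pairwise distance over two attributes. It is a simple sequential threshold grouping standing in for clustering, not the algorithm used in WEBMINER. The chosen attributes (time gap and URL directory), the equal weighting, the 600-second cap, the 0.5 threshold, and the names distance and group_references are all assumptions.

```python
from datetime import datetime

# Hypothetical references from one session: (timestamp, url).
SESSION = [
    (datetime(2000, 1, 1, 10, 0, 0), "/products/index.html"),
    (datetime(2000, 1, 1, 10, 0, 40), "/products/widgets.html"),
    (datetime(2000, 1, 1, 10, 12, 0), "/support/faq.html"),
    (datetime(2000, 1, 1, 10, 12, 30), "/support/contact.html"),
]


def distance(a, b, max_gap_seconds=600.0):
    """Assumed distance in [0, 1]: mean of a time-gap term and a URL-directory term."""
    gap = abs((b[0] - a[0]).total_seconds())
    time_term = min(gap / max_gap_seconds, 1.0)
    # Compare the directory part of each URL (everything before the last "/").
    dir_a = a[1].rsplit("/", 1)[0]
    dir_b = b[1].rsplit("/", 1)[0]
    path_term = 0.0 if dir_a == dir_b else 1.0
    return (time_term + path_term) / 2.0


def group_references(refs, threshold=0.5):
    """Start a new group when consecutive references are farther apart than threshold."""
    groups = []
    for ref in sorted(refs, key=lambda r: r[0]):
        if groups and distance(groups[-1][-1], ref) <= threshold:
            groups[-1].append(ref)
        else:
            groups.append([ref])
    return groups


if __name__ == "__main__":
    for group in group_references(SESSION):
        print([url for _, url in group])
```

With the sample data, the two product pages fall into one group and the two support pages into another. Different attributes or weights would produce different groupings, which is why the choice of attributes and metrics matters.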
Our specific goals in this research include (i) development of a flexible
architecture for Web usage mining, (ii) development of a model for a user transaction which consists
of multiple user references, (iii) development of clustering algorithms for grouping log entries into transactions,
(iv) integration of data from other sources, such as user registration databases, with access log data,
(v) adaptation of association rule, temporal sequence, and classification rule discovery algorithms
to Web mining, (vi) development of knowledge-based intelligent agents to interpret the discovered
rules, (vii) development of a flexible query mechanism that can be used to query the integrated
data and the discovered rules in a unified manner, and (viii) experimental evaluation of the system.
The web usage mining project is being conducted by Dr. Bamshad Mobasher.
Dr. Mobasher is the VP for Research and Development at Global Reach and the Director of the Center
for Web Intelligence at DePaul University.