
Clustering Algorithms

Such detailed analysis can reveal structural relationships within the data. The final deliverable will be the search time trial results and the conclusions drawn about the optimal algorithm designs. A clear direction for future design work is the desired outcome.

Quality assurance will be provided through systematic review of the experimental procedures and analysis of the test results. Meeting the stated goals by the delivery dates, performing the tests, and completing the project to the satisfaction of the committee members will assure the quality of the research outcomes.

CHAPTER 2

Background and a Review of Literature

Data clustering is a technique for analyzing statistical data sets. Clustering is the classification of similar objects into groups. This is accomplished by partitioning data into groups, known as clusters, so that the elements in each cluster share some common trait, usually proximity according to a defined distance measure. Essentially, the goal of clustering is to identify the distinct groups within a dataset and then place the data within those groups according to their relationships with each other.

One of the main points of document clustering is that it is not primarily about finding documents on the web; that is what search engines are for. A clustering algorithm is concerned with how that information is displayed: in what order the results appear, what is considered relevant to the query, and how broad categories can be narrowed. Search engines do a fine job on their own when the user has a specific query and knows exactly which words will produce the desired results. Otherwise the user gets a ranked list of questionable relevance, because the engine returns anything containing the query words, often with little more than term frequency as the ranking metric. With a good clustering program, the user is instead presented with multiple avenues of inquiry, organized into broad groups that become more specific as selections are made.

Clustering exploits similarities between the documents to be clustered. The similarity of two documents is computed as a function of distance between the corresponding term vectors for those documents. Of the various measures used to compute this distance, the cosine-measure has proved the most reliable and accurate. [10]
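To make the cosine measure concrete, the following sketch (in Python) computes it for two documents, assuming each document has already been reduced to a sparse term vector; the example documents and their raw term-frequency weights are hypothetical, and a real system would more likely use tf-idf weights.

import math

def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors (dicts of term -> weight)."""
    # Dot product over the terms the two vectors have in common.
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

doc1 = {"cluster": 3, "search": 2, "query": 1}
doc2 = {"cluster": 1, "search": 4, "engine": 2}
print(cosine_similarity(doc1, doc2))  # lies in [0, 1] for non-negative weights; 1 means identical direction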

Data clustering algorithms come in two basic types: hierarchical and partitional. In a partitional scheme, queries are matched against the designated clusters, and the documents in the highest-scoring clusters are returned as the result.

When a hierarchy processes a query, it moves down the tree along the highest-scoring branches until it reaches a predetermined stopping condition. The sub-tree where the stopping condition is satisfied is then returned as the result of the search.

Both strategies rely on variations of the near-neighbor search, in which nearness is determined by the similarity measure used to generate the clusters. Cluster search techniques are comparable to direct near-neighbor searches, and both are evaluated in terms of precision and recall.
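For reference, precision is the fraction of retrieved documents that are actually relevant, while recall is the fraction of relevant documents that are actually retrieved. A minimal sketch, using hypothetical document identifiers:

def precision_recall(retrieved, relevant):
    # retrieved and relevant are sets of document identifiers.
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = {"d1", "d2", "d3", "d4"}   # returned by a cluster (or near-neighbor) search
relevant = {"d2", "d3", "d7"}          # judged relevant to the query
print(precision_recall(retrieved, relevant))   # (0.5, 0.666...)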

The evidence suggests that cluster search techniques are only slightly better than direct near-neighbor searches, and the former can even be less effective than the latter in certain circumstances. Because of the quadratic running times required, clustering algorithms are often slow and unsuitable for extremely large collections.

Hierarchical algorithms build a hierarchy of clusters based upon the relationships of the data within the set. They can do this in one of two ways, from the bottom up or from the top down; these two methods are known as agglomerative and divisive, respectively. Agglomerative algorithms start with the individual elements of the set as clusters, then merge them into successively larger clusters. Divisive algorithms start with the entire dataset in one cluster, and then break it up into successively smaller clusters. Because hierarchical algorithms must analyze all the relationships inherent in the dataset, they tend to be costly in terms of time and processing power.
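The agglomerative, bottom-up variant can be sketched as follows; the pairwise distance function, the single-link merge rule, the stop-at-k-clusters criterion, and the one-dimensional points are all simplifying assumptions made for illustration.

def agglomerate(points, dist, k):
    clusters = [[p] for p in points]                     # start: every element is its own cluster
    while len(clusters) > k:
        # Find the pair of clusters whose closest members are nearest (single link).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])                  # merge the two closest clusters
        del clusters[j]
    return clusters

print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.7], lambda a, b: abs(a - b), k=2))

The repeated all-pairs distance scan in this sketch is exactly the kind of work that makes hierarchical methods expensive as the dataset grows.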

Partitional algorithms determine the clusters all at once, at the beginning of the clustering process. Once the clusters have been created, each element of the dataset is analyzed and placed within the cluster it is closest to. Partitional algorithms run much faster than hierarchical ones, which allows them to be used on large datasets, but they have disadvantages as well. Generally, the initial choice of clusters is arbitrary and does not necessarily capture all of the actual groups that exist within the dataset.

Therefore, if a particular group is missed in the initial clustering decision, the members of that group will be placed within the clusters that are closest to them, according to the predetermined parameters of the algorithm. In addition, partitional algorithms can yield inconsistent results: the clusters the algorithm determines in one run will probably not be the same as the clusters it generates the next time it is run on the same dataset.

Of the five algorithms under scrutiny in this paper, two are hierarchical and three are partitional. The two hierarchical methods are suffix tree and single pass. The suffix tree is "a compact representation of a trie corresponding to the suffixes of a given string where all nodes with one child are merged with their parents" [1].

It is a divisive method; it begins with the dataset as a whole and divides it into progressively smaller clusters, each composed of a node with suffixes branching off of it like leaves. Single-pass clustering, on the other hand, is an agglomerative, or bottom-up, method. It begins with the first element as a single cluster and then analyzes each subsequent element in turn, either adding it to an existing cluster or starting a new one, depending on the similarity threshold set by the analyst.
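The single-pass idea can be sketched as follows; the similarity function and threshold stand in for whatever measure and cutoff the analyst chooses, and the Jaccard measure over term sets used in the example is only a placeholder.

def single_pass(docs, similarity, threshold):
    clusters = []                                        # each cluster is a list of documents
    for doc in docs:
        best_sim, best_cluster = 0.0, None
        for cluster in clusters:
            # Compare against every member and keep the best match (one common variant).
            sim = max(similarity(doc, member) for member in cluster)
            if sim > best_sim:
                best_sim, best_cluster = sim, cluster
        if best_cluster is not None and best_sim >= threshold:
            best_cluster.append(doc)                     # falls within an existing cluster
        else:
            clusters.append([doc])                       # otherwise it starts a new cluster
    return clusters

docs = [{"a": 1, "b": 2}, {"a": 1, "b": 1}, {"c": 3}]
jaccard = lambda d1, d2: len(d1.keys() & d2.keys()) / len(d1.keys() | d2.keys())
print(single_pass(docs, jaccard, threshold=0.5))         # two clusters: the {a,b}-documents and the {c}-document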

The three partitional algorithms are k-means, buckshot, and fractionation. K-means chooses k initial centroids (one common heuristic spreads them as far apart as possible), assigns each element to the closest centroid, and then recomputes the centroids from the resulting groups, repeating until the assignments stabilize. Buckshot starts with a random sample of the dataset, derives the centers by clustering that sample, and then places the remaining elements within those clusters. Fractionation is a more careful clustering algorithm which divides the dataset into smaller and smaller groups through successive iterations of the clustering subroutine [2]. Fractionation requires more processing power, and therefore more time.
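A minimal k-means sketch over numeric vectors is given below; the random initial centroids and the fixed iteration count are simplifying assumptions, since real implementations normally iterate until the assignments stop changing.

import random

def kmeans(points, k, iters=20):
    centroids = random.sample(points, k)                 # arbitrary initial centers
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:                                 # assign each point to its nearest centroid
            i = min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            groups[i].append(p)
        for i, g in enumerate(groups):                   # recompute each centroid as the mean of its group
            if g:
                centroids[i] = tuple(sum(dim) / len(g) for dim in zip(*g))
    return centroids, groups

print(kmeans([(1.0, 1.0), (1.2, 0.9), (5.0, 5.2), (5.1, 4.8)], k=2))

The arbitrary starting centroids in this sketch are also the source of the inconsistency noted earlier: two runs on the same data can settle on different clusters.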

Clustering is a methodology for more effective search and retrieval over datasets, and it has been investigated in great depth in the literature. The principle is simple enough: documents with a high degree of similarity will tend to be sought by the same query. By automatically placing documents in groups based upon similarity (i.e. clusters), the search is effectively broadened.

The buckshot algorithm is simple in design and intent. It chooses a small random sample of the documents in the database and then applies the cluster routine to them. The centers of the clusters generated by the subroutine are returned. Because the clustering subroutine is applied only to this small sample, buckshot runs in rectangular rather than quadratic time, which makes the method fast. The tradeoff is that buckshot is not deterministic, since it relies on an initial random sampling step. Repeated use of the algorithm can return different clusters than previous runs, although it is maintained that repeated trials generate clusters that are similar in quality to the previous set of clusters [2].
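That outline can be sketched as follows, assuming three caller-supplied pieces that the text does not specify in code: cluster_subroutine (for instance, an agglomerative routine like the earlier sketch), a centroid function, and a similarity function. The sqrt(k*n) sample size follows the Scatter/Gather description [2].

import math, random

def buckshot(docs, k, cluster_subroutine, centroid, similarity):
    # Cluster only a small random sample, which is where the expensive work happens.
    sample_size = min(len(docs), int(math.sqrt(k * len(docs))))
    sample = random.sample(docs, sample_size)
    seed_clusters = cluster_subroutine(sample, k)        # assumed to return k clusters of sample documents
    centers = [centroid(c) for c in seed_clusters]
    # Assign every document in the full collection to its nearest center (the cheap step).
    clusters = [[] for _ in centers]
    for doc in docs:
        i = max(range(len(centers)), key=lambda c: similarity(doc, centers[c]))
        clusters[i].append(doc)
    return clusters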

Fractionation algorithms find centers by initially breaking the corpus of documents into a set number of buckets of predetermined size. The cluster subroutine is then applied to each bucket individually, breaking the contents of the bucket into yet smaller groups within it. This process is repeated until a set number of groups is found, and these are the k centers. This method is like building a branching tree from the bottom up, with individual documents as the leaves and the centers (clusters) as the roots. The best fractionation methods sort the dataset based on a word-index key (for example, based on words in common between two documents).
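That process can be sketched as follows; cluster_subroutine, sort_key, and the fixed reduction fraction per round are placeholder assumptions. Each item begins as a single document, and later rounds work on merged groups.

def fractionation(docs, k, bucket_size, cluster_subroutine, sort_key, reduction=0.5):
    items = [[d] for d in sorted(docs, key=sort_key)]    # sort so related documents land in the same bucket
    while len(items) > k:
        next_items = []
        for start in range(0, len(items), bucket_size):
            bucket = items[start:start + bucket_size]
            target = max(1, int(len(bucket) * reduction))
            # The subroutine groups the bucket's items into `target` clusters;
            # each cluster is flattened into a single merged item for the next round.
            for cluster in cluster_subroutine(bucket, target):
                next_items.append([doc for item in cluster for doc in item])
        if len(next_items) >= len(items):                # safety guard: stop if no reduction occurred
            break
        items = next_items
    return items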

It is important to note that both buckshot and fractionation rely on a clustering subroutine. Both algorithms are designed to find the initial centers, but rely on a separate algorithm to do the actual clustering of individual documents. This subroutine can itself be agglomerative or divisive; in practice, however, it tends to be an agglomerative hierarchical algorithm.

Buckshot applies the cluster subroutine to a random sampling of the dataset to determine the centers, while the fractionation algorithm uses repeated applications of the subroutine over groups of fixed size in order to find the centers. Fractionation is considered more accurate, while buckshot is much faster, making it more suitable for searching in real time on the Web. [2]

After the centers have been determined, each algorithm proceeds to place the documents with their nearest center. After this step is completed, the resulting partitions are refined into better clusters, either by a simple move-to-nearest protocol, or by techniques known as split (dividing a poorly defined cluster into two distinct groups) and join (merging two similar clusters). This refinement step suffers from the same time and accuracy tradeoff as the initial centering: the more accurate join and split functions are much more time-consuming…
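The simpler move-to-nearest refinement can be sketched as follows (split, join, and the handling of clusters that empty out are omitted); centroid and similarity are placeholder callables supplied by the caller.

def refine_move_to_nearest(clusters, centroid, similarity, rounds=3):
    for _ in range(rounds):
        centers = [centroid(c) for c in clusters]        # recompute a representative for each cluster
        new_clusters = [[] for _ in centers]
        moved = False
        for ci, cluster in enumerate(clusters):
            for doc in cluster:
                # Move each document to whichever center it is now closest to.
                best = max(range(len(centers)), key=lambda j: similarity(doc, centers[j]))
                new_clusters[best].append(doc)
                moved = moved or (best != ci)
        clusters = new_clusters
        if not moved:                                    # stop once an entire pass makes no moves
            break
    return clusters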

References

Black, P. 2005, 'Dictionary of Algorithms and Data Structures', National Institute of Standards and Technology, [online] http://www.nist.gov/dads/ [25 Aug 2005].

Cutting, D.R., Karger, D.R., Pedersen, J.O., and Tukey, J.W. 1992, 'Scatter/Gather: A Cluster-based Approach to Browsing Large Document Collections', Proceedings of the Fifteenth Annual International ACM SIGIR Conference, June 1992, pp. 318-329.

Jensen, E.C., Beitzel, S.M., Pilotto, A.J., Goharian, N., and Frieder, O. 2002, 'Parallelizing the Buckshot Algorithm for Efficient Document Clustering', Information Retrieval Laboratory, Illinois Institute of Technology, Chicago, IL.

Dhillon, I., Fan, J., and Guan, Y. 2001, 'Efficient Clustering of Very Large Document Collections', Data Mining for Scientific and Engineering Applications, Kluwer Academic Publishers.