TreeSet and Hashset

A set is a collection of objects in which no object occurs more than once. Sets implement all the methods in the interface Collection<T>, but do so in a way that ensures that no element occurs twice in the set. For example, if set is an object of type Set<T>, then set.add(obj) will have no effect on the set if obj is already an element of the set. Java has two classes that implement the interface Set<T>: java.util.TreeSet and java.util.HashSet.

In addition to being a Set, a TreeSet has the property that the elements of the set are arranged into ascending sorted order. An Iterator (or a for-each loop) for a TreeSet will always visit the elements of the set in ascending order.

TreeSet cannot hold arbitrary objects, since there must be a way to determine the sorted order of the objects it contains. Ordinarily, this means that the objects in a set of type TreeSet<T> should implement the interface Comparable<T> and that obj1.compareTo(obj2) should be defined in a reasonable way for any two objects obj1 and obj2 in the set. Alternatively, an object of type Comparator<T> can be provided as a parameter to the constructor when the TreeSet is created. In that case, the compareTo() method of the Comparator will be used to compare objects that are added to the set.

TreeSet does not use the equals() method to test whether two objects are the same. Instead, it uses the compareTo() method. This can be a problem as compareTo() can consider two objects to be the same for the purpose of the comparison even though the objects are not equal. For a TreeSet, this means that only one of those objects can be in the set. For example, if the TreeSet contains mailing addresses and if the compareTo() method for addresses just compares their post codes, then the set can contain only one address in each post code. You have to be aware of the semantics of TreeSets, and you need to make sure that compareTo() is defined in a reasonable way for objects that you put into a TreeSet. This will  also be true for Strings, Integers and many other built-in types, since the compareTo() method for these types considers two objects to be the same only if they are actually equal.

In the implementation of a TreeSet, the elements are stored in something similar to a binary sort tree. However, the data structure that is used is balanced in the sense that all the leaves of the tree are at about the same distance from the root of the tree. This ensures that all the basic operations  — inserting, deleting, and searching – – are efficient.

The fact that a TreeSet sorts its elements and removes duplicates makes it very useful in some applications, eg: if you wanted to write a program that would read a file and output an alphabetical list of all the words that occurred in the file, with duplicates removed. The words could be stored in an ArrayList, so it would be up to you to make sure that the list was sorted and contained no duplicates. This can be can be programmed much more easily using a TreeSet instead of a list.

TreeSet automatically eliminates duplicates, and an iterator for the set will automatically visit the items in the set in sorted order. An algorithm for the program, using a TreeSet, would be:

TreeSet words = new TreeSet();

while there is more data in the input file:
   Let word = the next word from the file
   Convert word to lower case
   words.add(word)   // Adds the word only if not already present.

for ( String w : words ) // for each String w in words
   Output w

As another example, suppose that coll is any Collection of Strings. (This would also work for any other type for which compareTo() is properly defined.) We can use a TreeSet to sort the items of coll and remove the duplicates simply by saying:

TreeSet set = new TreeSet();

The second statement adds all the elements of the collection to the set. Since it’s a Set, duplicates are ignored. Since it’s a TreeSet, the elements of the set are sorted. If you would like to have the data in some other type of data structure, it’s easy to copy the data from the set. For example, to place the answer in an ArrayList, you could say:

TreeSet set = new TreeSet();
ArrayList list = new ArrayList();

Now, in fact, every one of Java’s collection classes has a constructor that takes a Collection as an argument. All the items in that Collection are added to the new collection when it is created. So, if coll is of type Collection, then new TreeSet(coll) creates a TreeSet that contains the same elements as coll, but with duplicates removed and in sorted order. This means that we can abbreviate the four lines in the above example to the single command:

ArrayList list = new ArrayList( new TreeSet(coll) );

This makes a sorted list of the elements of coll with no duplicates.

HashSet stores its elements in a hash table, a type of data structure that we will discuss shortly. The operations of finding, adding, and removing elements are implemented very efficiently in hash tables, even more so than for TreeSets. The elements of a HashSet are not stored in any particular order, and so do not need to implement the Comparable interface. (They do, however, need to define a proper “hash code,” as we’ll see shortly.)

The equals() method is used to determine whether two objects in a HashSet are to be considered the same. An Iterator for a HashSet will visit its elements in what seems to be a completely arbitrary order, and it’s possible for the order to change completely when a new element is added. Use a HashSet instead of a TreeSet when the elements it contains are not comparable, or when the order is not important, or when the small advantage in efficiency is important.

In mathematical set theory, the items in a set are called members or elements of that set. Important operations include adding an element to a set, removing an element from a set, and testing whether a given entity is an element of a set. Operations that can be performed on two sets include union, intersection, and set difference. All these operations are defined in Java for objects of type Set, but with different names. Suppose that A and B are Sets. Then:

  • A.add(x) adds the element x to the set A.
  • A.remove(x) removes the element x from the set A.
  • A.contains(x) tests whether x is an element of the set A.
  • A.addAll(B) computes the union of A and B.
  • A.retainAll(B) computes the intersection of A and B.
  • A.removeAll(B) computes the set difference, A – B.

There are of course, differences between mathematical sets and sets in Java. Most important, perhaps, sets in Java must be finite, while in mathematics, most of the fun in set theory comes from working with infinity. In mathematics, a set can contain arbitrary elements, while in Java, a set of type Set<T> can only contain elements of type T. The operation A.addAll(B) acts by modifying the value of A, while in mathematics the operation A union B computes a new set, without changing the value of A or B.

Next: EnumSet