Thursday, June 9, 2011

What is a data structure?

A data structure is encapsulation of data and relevant operations. It is referred to as an Abstract Data Type (ADT), signifying that the actual implementation is left open, subject to satisfying the intended meaning of the operations. For the most part the exact data representation is not presented; rather, all access to the data is through the operations.

The Java class is a perfect model for an ADT. In addition to the encapsulation with accessibility via operations only, the presentation of operations are in the modern object-oriented fashion in which the data takes precedence over the operation.

According to this definition, every Java class is a data structure. However, classically, we have particular interest in the classes which are often termed to be part of the Java collections. In Java terms these are contained in the java.util package. In particular, they are aggregate data structures typically representing unbounded structures whose size is parameterized.

The focus of interest to this class is:
  1. Building data structures.
    • what data to use
    • what operations are of interest
    • how to implement the operations
  2. Comparing and contrasting the efficiency of different implementations
  3. Using data structures to solve difficult algorithmic problems.
There are many factors involved in these pursuits, but we can often pin it down to three things:
  • Simplicity
  • Speed (run-time).
  • Space usage.

Simplicity

This really should be the first consideration. We say time is money, but the "time" may be the amount of time you spend on something, not the time it takes to execute.
  • KISS philosophy: Keep It Simple and Stupid
  • Always use the simplest data structure (dumbest?) first. Try to get a working model. Then build on the model, to make it faster, etc.
  • Don't spend time making something complex so as to speed it up if it won't greatly improve the overall speed.
  • Keep in mind the utility and possible life-cycle of this code so that you're not wasting your time.

Speed

This is the second consideration: try to improve the overall speed. There are several considerations:
  • Profiling and tweaking. Try to speed up the most costly operations. For example, in a database algorithm that involves disk and RAM operations, the disc operations are at least 1000 times as costly as the RAM operations, so we want to focus on making the disc operations less costly.
  • If you identify some time-consuming operation, look for an order-of-magnitude improvement, e.g., going from quadratic time to linear time. This is difficult and involves finding a better data structure and/or better algorithm.

Space efficiency

Advanced data structures usually require extra storage to hold information about the structure which can be used for efficient processing. Usually this features is considered last, but in data-intensive data structures, say, for a database, it is crucial.