CAS CS 105 -- Lab 10 -- 2012 Spring

Here is a table of contact-lens data:

AGE	SPECTACLE-PRESCR	ASTIGMATISM	TEAR-PRODUCT-RATE	CONTACT-LENSES
young	myope	no	reduced	none
young	myope	no	normal	soft
young	myope	yes	reduced	none
young	myope	yes	normal	hard
young	hypermetrope	no	reduced	none
young	hypermetrope	no	normal	soft
young	hypermetrope	yes	reduced	none
young	hypermetrope	yes	normal	hard
pre-presbyopic	myope	no	reduced	none
pre-presbyopic	myope	no	normal	soft
pre-presbyopic	myope	yes	reduced	none
pre-presbyopic	myope	yes	normal	hard
pre-presbyopic	hypermetrope	no	reduced	none
pre-presbyopic	hypermetrope	no	normal	soft
pre-presbyopic	hypermetrope	yes	reduced	none
pre-presbyopic	hypermetrope	yes	normal	none
presbyopic	myope	no	reduced	none
presbyopic	myope	no	normal	none
presbyopic	myope	yes	reduced	none
presbyopic	myope	yes	normal	hard
presbyopic	hypermetrope	no	reduced	none
presbyopic	hypermetrope	no	normal	soft
presbyopic	hypermetrope	yes	reduced	none
presbyopic	hypermetrope	yes	normal	none

We want to predict the tear production rate (TEAR-PRODUCT-RATE) from the other attributes. Which single attribute gives the best one-attribute classifier according to 1R?
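As a reminder of how 1R works: for each attribute, predict the most frequent class for each of its values, count the rows this rule gets wrong, and keep the attribute with the fewest errors. The sketch below is one possible way to code this, not a reference solution. To keep it short it uses only the eight pre-presbyopic rows of the table; the names HEADER, SUBSET and one_r are made up for this illustration. Paste in all 24 rows to answer the question above.

```python
from collections import Counter, defaultdict

HEADER = ["AGE", "SPECTACLE-PRESCR", "ASTIGMATISM",
          "TEAR-PRODUCT-RATE", "CONTACT-LENSES"]

# Assumption: only the eight pre-presbyopic rows, copied from the table.
SUBSET = """\
pre-presbyopic myope no reduced none
pre-presbyopic myope no normal soft
pre-presbyopic myope yes reduced none
pre-presbyopic myope yes normal hard
pre-presbyopic hypermetrope no reduced none
pre-presbyopic hypermetrope no normal soft
pre-presbyopic hypermetrope yes reduced none
pre-presbyopic hypermetrope yes normal none
"""
ROWS = [line.split() for line in SUBSET.strip().splitlines()]


def one_r(rows, header, target):
    """Return {attribute: (errors, rule)} for every non-target attribute."""
    t = header.index(target)
    results = {}
    for a, name in enumerate(header):
        if a == t:
            continue
        # counts[value][cls] = number of rows with this value and this class
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[a]][row[t]] += 1
        # Rule: each attribute value predicts its most frequent class.
        rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
        # Errors: rows that do not belong to the majority class of their value.
        errors = sum(sum(c.values()) - max(c.values()) for c in counts.values())
        results[name] = (errors, rule)
    return results


results = one_r(ROWS, HEADER, "TEAR-PRODUCT-RATE")
best = min(results, key=lambda name: results[name][0])
for name, (errors, rule) in results.items():
    print(f"{name}: {errors}/{len(ROWS)} errors, rule = {rule}")
print("1R picks:", best)
```

Ties between classes or between attributes are broken by first occurrence here; that is an arbitrary choice, and a different tie-breaking rule is equally valid.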

Now, let's build a decision-tree classifier for the contact-lenses dataset using the algorithm discussed in lecture.
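The handout does not restate the lecture algorithm, so the sketch below assumes an ID3-style procedure: at each node, split on the attribute with the highest information gain (entropy reduction), and stop when a node is pure or no attributes remain. If the lecture used a different splitting criterion, only info_gain would need to change. As an illustration it again uses just the eight pre-presbyopic rows, with CONTACT-LENSES as the class; all function and variable names are made up for this example.

```python
import math
from collections import Counter

HEADER = ["AGE", "SPECTACLE-PRESCR", "ASTIGMATISM",
          "TEAR-PRODUCT-RATE", "CONTACT-LENSES"]

# Assumption: only the eight pre-presbyopic rows, copied from the table.
SUBSET = """\
pre-presbyopic myope no reduced none
pre-presbyopic myope no normal soft
pre-presbyopic myope yes reduced none
pre-presbyopic myope yes normal hard
pre-presbyopic hypermetrope no reduced none
pre-presbyopic hypermetrope no normal soft
pre-presbyopic hypermetrope yes reduced none
pre-presbyopic hypermetrope yes normal none
"""
ROWS = [dict(zip(HEADER, line.split())) for line in SUBSET.strip().splitlines()]


def entropy(rows, target):
    """Entropy (in bits) of the class distribution in rows."""
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def split(rows, attr):
    """Group rows by their value of attr."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    return groups


def info_gain(rows, attr, target):
    """Entropy before the split minus the weighted entropy after it."""
    after = sum(len(g) / len(rows) * entropy(g, target)
                for g in split(rows, attr).values())
    return entropy(rows, target) - after


def build_tree(rows, attrs, target):
    """Return a class label (leaf) or {attribute: {value: subtree}}."""
    classes = Counter(r[target] for r in rows)
    if len(classes) == 1 or not attrs:
        return classes.most_common(1)[0][0]  # pure node, or majority vote
    best = max(attrs, key=lambda a: info_gain(rows, a, target))
    rest = [a for a in attrs if a != best]
    return {best: {v: build_tree(g, rest, target)
                   for v, g in split(rows, best).items()}}


TARGET = "CONTACT-LENSES"
tree = build_tree(ROWS, [a for a in HEADER if a != TARGET], TARGET)
print(tree)
```

The tree is returned as nested dictionaries so that it can be printed and inspected directly; a leaf is simply the predicted class label.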

The final project has been posted here. It is an open project -- you can choose whichever data set(s) interest you. Although there are no constraints on the data sets you choose, I would like to share some thoughts on how I would approach the project if it were assigned to me.

Of course, choosing an appropriate dataset is the most important step. Unfortunately, a single data set is sometimes not amenable to mining meaningful patterns. For instance, consider the GDP growth data in UNData: World Development Indicators. Why does this dataset not lend itself to data mining?

You may be able to come up with a model to describe GDP growth of a specific country, but intuitively, finding a model that fits multiple countries is quite hard. To obtain a model that can generalize to previously unseen countries, we would need additional attributes that describe the countries and that might be relevant to their GDP.

Again, before looking for a specific dataset, we should have a well-defined problem.

With thousands of datasets available, a better approach is to start with a question in mind and then look for datasets that can answer it. With different questions, you can explore the GDP growth data from different perspectives, in combination with other datasets. For instance, the following two questions seem far more promising than GDP growth by itself: how important agricultural and industrial growth are to overall GDP growth, and how GDP growth relates to life expectancy. These are just my first thoughts; I am sure that you can find better topics of interest and produce meaningful results.

Maybe you are interested in predicting values in time series data, such as stock prices. But it is not easy to predict values without complementary information, because prediction requires a good model. It is also difficult to evaluate your predictions if the time series is not long enough. An easier route is to look for relationships between the time series and other information.