Here is a table of contact-lens data....
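As a reminder of how 1R works for the question that follows, here is a minimal sketch. For each attribute it builds one rule per attribute value (predict the majority class), counts the errors, and keeps the attribute with the fewest errors. The rows and the names `rows`, `one_r`, `age`, `lenses`, and `tear_rate` are made up for illustration; they are not the course table.

```python
from collections import Counter, defaultdict

# Hypothetical toy rows, NOT the contact-lenses table from the course.
rows = [
    {"age": "young", "lenses": "none", "tear_rate": "reduced"},
    {"age": "young", "lenses": "soft", "tear_rate": "normal"},
    {"age": "old",   "lenses": "none", "tear_rate": "reduced"},
    {"age": "old",   "lenses": "hard", "tear_rate": "normal"},
    {"age": "young", "lenses": "none", "tear_rate": "reduced"},
    {"age": "old",   "lenses": "none", "tear_rate": "normal"},
]


def one_r(rows, target):
    """Return (attribute, rule, errors) for the attribute with fewest errors."""
    best = None
    for attr in (a for a in rows[0] if a != target):
        # Count class values separately for each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[target]] += 1
        # One rule per attribute value: predict the majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the rows that do not belong to the majority class.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1]
                     for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


attr, rule, errors = one_r(rows, "tear_rate")
print(attr, rule, errors)
```

Ties between attributes are broken here by keeping the first one found; the version discussed in lecture may break ties differently.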
We want to classify the tear-production-rate. Which is the best one-attribute classifier according to 1R? Now, let's build a decision-tree classifier for the contact-lenses dataset using the algorithm discussed in lecture.
The final project has been posted here. It is an open project: you can choose the data set(s) you are interested in. While there are no constraints on the data sets you choose, I would like to share how I would approach the project if it were assigned to me.
Choosing an appropriate data set is, of course, the most important step. Unfortunately, a single data set is sometimes not amenable to mining meaningful patterns. For instance, consider the GDP growth data in UNData: World Development Indicators. Why does this data set not lend itself to data mining?
You may be able to come up with a model to describe GDP growth of a specific country, but intuitively, finding a model that fits multiple countries is quite hard. To obtain a model that can generalize to previously unseen countries, we would need additional attributes that describe the countries and that might be relevant to their GDP.
With thousands of data sets available, a better approach is to start with a question in mind and then look for data sets that can answer it. With different questions, you can explore GDP growth data from different perspectives, in combination with other data sets. For instance, the following two questions seem a lot better than GDP growth by itself: how important is agricultural and industrial growth to overall GDP growth, and what is the relationship between GDP growth and life expectancy? These are just my superficial thoughts; I am sure that you can find better topics of interest and produce meaningful results.
Maybe you are interested in predicting values in time series data, such as stock prices. But it is not easy to predict values without complementary information, because a good prediction needs a good model. It is also difficult to evaluate your predictions if the time series is not long enough. An easier route is to look for relationships between the time series and other information.
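One simple way to look for such a relationship is to join two yearly series on their common years and compute the Pearson correlation. The sketch below does this with the standard library only. The numbers and the names `gdp_growth`, `life_expectancy`, `join_on_key`, and `pearson` are hypothetical and chosen for illustration; they are not real UNData figures.

```python
from math import sqrt

# Hypothetical yearly values for one country; NOT real UNData figures.
gdp_growth = {2001: 5.0, 2002: 2.0, 2003: 1.0, 2004: 4.0, 2005: 3.0}
life_expectancy = {2002: 70.0, 2003: 71.0, 2004: 72.0, 2005: 73.0, 2006: 74.0}


def join_on_key(a, b):
    """Keep only the keys (years) present in both series, in sorted order."""
    keys = sorted(a.keys() & b.keys())
    return keys, [a[k] for k in keys], [b[k] for k in keys]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


years, xs, ys = join_on_key(gdp_growth, life_expectancy)
r = pearson(xs, ys)
print(years, round(r, 3))
```

A correlation on a handful of years is only a starting point: it says nothing about causation, and with so few points it can easily arise by chance, which is the same evaluation problem noted above for short time series.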