& MISSING VALUES
DATA MINING TECHNIQUES USED ON VOTING PATTERNS
Sadaf Mehmood, Naima Talib
In this paper, a number of approaches
to classification and to handling missing attribute values are presented and
compared on a data set. Classification is a data mining technique used to predict
group membership for data instances. This paper presents some basic classification
techniques and methods for dealing with missing values, and then compares their
results. The algorithm that produces the highest accuracy is considered the most
suitable for the given data set.
Data mining, data preprocessing, classification algorithm, decision tree analysis,
missing attributes, voting pattern
Data preprocessing is one
of the major steps in the knowledge discovery process. It involves transforming
the raw input data into an appropriate form, which may include dealing with
missing values. Before classification can be performed, the data must be
pre-processed, or cleaned, to remove noise and redundant values and to select the
records and attributes that are relevant. People are likely to make mistakes when
they analyze data or when they try to establish relationships between multiple
features. In this scenario the results will not be satisfactory if the data is not
accurate, i.e. if it is noisy or contains missing values, because it may not give
the valid information or knowledge that we need, or it may provide misleading
information. This makes it difficult to find solutions to certain problems.
Data mining techniques help us
in this scenario. Data mining approaches are applied to large volumes of data to
find new and useful patterns that might otherwise remain unknown. This involves
the use of data analysis tools to discover previously unknown, valid patterns and
meaningful relationships in large data sets, and to summarize them.
Classification is a data mining technique that assigns items in a
collection to target classes. The goal of classification is to accurately predict
the target class for each case in the data set. For example, with our data
set, a classification model can be used to identify whether a representative is a
republican or a democrat.
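To make the idea of predicting a target class concrete, the following is a minimal sketch of a one-rule classifier: it picks the single attribute whose values best predict the class on the training records. It is an illustration only and is not claimed to be one of the algorithms evaluated in this paper. The attribute names (`budget`, `education`), the toy records, and the function names are hypothetical.

```python
from collections import Counter, defaultdict

def train_one_rule(records, target):
    """Return (attribute, rule) for the attribute with the fewest training errors."""
    best = None
    attributes = [a for a in records[0] if a != target]
    for attr in attributes:
        # Count how often each class occurs for each value of this attribute.
        counts = defaultdict(Counter)
        for rec in records:
            counts[rec[attr]][rec[target]] += 1
        # The rule maps every attribute value to its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rec[target] != rule[rec[attr]] for rec in records)
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best[0], best[1]

def predict(model, record, default=None):
    """Predict the class of one record; unseen attribute values give `default`."""
    attr, rule = model
    return rule.get(record[attr], default)

# Hypothetical toy records, not taken from the paper's data set.
votes = [
    {"budget": "y", "education": "n", "party": "democrat"},
    {"budget": "y", "education": "y", "party": "democrat"},
    {"budget": "n", "education": "y", "party": "republican"},
    {"budget": "n", "education": "n", "party": "republican"},
]
model = train_one_rule(votes, "party")
print(model[0], predict(model, {"budget": "y", "education": "n"}))
```

In this toy data the `budget` vote separates the two parties perfectly, so it is the attribute the rule is built on.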
Missing values are a common occurrence in data sets, and you need to have a
strategy for handling them. A missing value can indicate a number of different
things in your data set. Possibly the data was not available or not applicable, or
the event did not happen. It may also be that the person who entered or recorded
the data did not know the right value, or missed filling it in. Data mining
techniques help in dealing with missing values. Typically, they ignore the missing
values, exclude records or attributes that contain missing values, replace
missing values with the mean (taken over the attribute), or infer missing values
from existing values.
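Two of the strategies above, measuring how much is missing and excluding incomplete records, can be sketched in a few lines. This is a minimal sketch under stated assumptions: a missing value is assumed to be encoded as `?` (the marker used in ARFF files), and the records and function names are hypothetical.

```python
MISSING = "?"  # assumption: ARFF-style marker for a missing value

def missing_percentage(records, attribute):
    """Percentage of records whose value for `attribute` is missing."""
    missing = sum(1 for rec in records if rec[attribute] == MISSING)
    return 100.0 * missing / len(records)

def drop_incomplete(records):
    """Keep only the records that have no missing value in any attribute."""
    return [rec for rec in records if MISSING not in rec.values()]

# Hypothetical toy records, not taken from the paper's data set.
rows = [
    {"budget": "y", "education": "?"},
    {"budget": "n", "education": "y"},
    {"budget": "?", "education": "n"},
    {"budget": "y", "education": "n"},
]
print(missing_percentage(rows, "budget"))
print(len(drop_incomplete(rows)))
```

Excluding records is simple but discards information: here half of the toy records are lost even though each has only one missing value.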
Our data set is used to identify
voting patterns in the US House of Representatives. Each state in the US is
represented in the House in proportion to its population, but each state is
entitled to at least one Representative. The total number of Representatives in
the data set is 435. US online voting is based on the CQA standard. The target
class of our data set determines whether a representative is a democrat or a
republican. The paper that we followed uses only classification techniques. The
data set contains some missing values. We first apply classification techniques
to the data set, and then we remove these missing values using the missing-value
techniques built into WEKA. We thus use both classification and missing-value
techniques on the data set and compare their results.
The commonly used methods for data mining
classification tasks can be divided into the following groups:
Decision tree induction methods
Memory based learning
Naive Bayes multinomial text
Naive Bayes updateable
From these, we use three techniques in WEKA (trees, rule-based and Bayesian)
and compare the three.
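As an illustration of how decision tree induction chooses an attribute to split on, the sketch below computes the information gain of an attribute. J48 is WEKA's implementation of C4.5, which uses gain ratio, a normalized variant of information gain. The code shows plain information gain only, as a simplified stand-in rather than the exact criterion. The records and function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(records, attribute, target):
    """Reduction in class entropy obtained by splitting on `attribute`."""
    labels = [rec[target] for rec in records]
    # Group the class labels by the value of the split attribute.
    groups = {}
    for rec in records:
        groups.setdefault(rec[attribute], []).append(rec[target])
    # Weighted average entropy of the groups after the split.
    remainder = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy records, not taken from the paper's data set.
sample = [
    {"budget": "y", "education": "n", "party": "democrat"},
    {"budget": "y", "education": "y", "party": "democrat"},
    {"budget": "n", "education": "y", "party": "republican"},
    {"budget": "n", "education": "n", "party": "republican"},
]
print(information_gain(sample, "budget", "party"))
print(information_gain(sample, "education", "party"))
```

An attribute that separates the classes perfectly has the highest gain, while an attribute that leaves the class mix unchanged has a gain of zero, so a tree learner would split on the former first.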
After classification, we remove the missing values using
data mining techniques.
Approaches to missing attribute values:
Ignore attributes that have missing values
Replace missing values with the mean if the data
is numeric, or with the most frequent value if the
data is categorical
Fill in missing values manually based on your domain knowledge.
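The mean and most-frequent-value replacement described above can be sketched as follows. This is a minimal sketch, assuming a missing value is represented as `None` and that a column is either entirely numeric or entirely categorical. The function name `impute` and the sample columns are hypothetical.

```python
from collections import Counter
from statistics import mean

def impute(values):
    """Fill missing entries (None) with the mean or the most frequent value."""
    present = [v for v in values if v is not None]
    if not present:
        # Nothing to learn a replacement from; return the column unchanged.
        return list(values)
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)                         # numeric column: use the mean
    else:
        fill = Counter(present).most_common(1)[0][0]  # categorical: use the mode
    return [fill if v is None else v for v in values]

print(impute([1.0, None, 3.0]))
print(impute(["y", None, "y", "n"]))
```

Every attribute is treated independently, so the replacement for one column never depends on the values in another.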
Results generated using WEKA (table columns: correctly classified instances,
incorrectly classified instances)
This table shows the results obtained after applying the algorithms.
Classification methods are typically strong in modeling interactions. It is
difficult to recommend any one technique as superior to the others, since the
choice depends on the data set. However, the J48 algorithm gives the best result
and has a lower error rate than the others.
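The counts of correctly and incorrectly classified instances translate into accuracy and error rate as sketched below. This is a minimal sketch: the label lists are hypothetical and are not results from this paper. The error rate here is the fraction of misclassified instances, which is not the same quantity as the mean absolute error that WEKA also reports.

```python
def evaluate(actual, predicted):
    """Summarize predictions as correct/incorrect counts, accuracy and error rate."""
    total = len(actual)
    correct = sum(a == p for a, p in zip(actual, predicted))
    incorrect = total - correct
    return {
        "correct": correct,
        "incorrect": incorrect,
        "accuracy": correct / total,
        "error_rate": incorrect / total,
    }

# Hypothetical labels, not results from the paper.
actual = ["democrat", "republican", "democrat", "democrat"]
predicted = ["democrat", "republican", "republican", "democrat"]
summary = evaluate(actual, predicted)
print(summary)
```

Comparing classifiers then amounts to computing this summary for each one on the same instances and preferring the one with the highest accuracy.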
Missing attribute values using WEKA
We use three filters for removing missing values
Replace missing values
Replace missing with user constant
Replace with missing value
We apply the “Replace missing values” filter; after applying
it, there are no missing values in our data set. We share some screenshots of the
result, in which each attribute has 0% missing values. After removing the missing
values we obtain a complete data set.
Figures: results without removing missing values; results after removing missing values
Our main objective was to compare classification
techniques and methods for dealing with missing attribute values.
The results of our experiment are shown in Table 1 and in the images. In
classification, J48 performs well compared to the others and has a lower absolute
error. The goal of classification algorithms is to generate more certain, precise
and accurate results. Numerous methods have been suggested for the creation of
ensembles of classifiers. Classification methods are typically strong in
modeling interactions.