ABSTRACT – Early disease prediction is one of the core concerns of the biomedical and healthcare communities, as it improves the quality of early diagnosis for fatal diseases such as Congenital Heart Disease and cancer. Advanced data mining techniques can help in such situations. Analysing structured medical data with data mining concepts such as classifiers and Association Rule Mining (ARM) helps to detect the occurrence of a particular disease. A medical data set obtained from an open data source of the United Kingdom is processed and analysed for heart disease prediction, and the system then suggests a hospital for further treatment. An accuracy comparison between the classifier algorithms used is generated in R Studio. These prediction results pave the way for proper diagnosis and early treatment of chronic diseases. The approach can be used to reduce the rise in the death rate caused by fatal diseases being detected only at a critical stage.
Key Words – Data mining, Congenital Heart Disease, ARM.

I. INTRODUCTION

The healthcare industry collects huge amounts of reliable healthcare data which, unfortunately, are not “mined” to discover hidden information for effective decision making. Clinical decisions are often made based on doctors’ intuition and experience rather than on the knowledge-rich data hidden in the database.

I.1 Classification

There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends. These two forms are as follows:

1. Classification
2. Prediction

Classification models predict categorical class labels, whereas prediction models predict continuous-valued functions. The two main classifiers implemented here are the Decision Tree and the Naïve Bayes classification algorithms.
I.2 Association Rule Mining

Association means finding relationships between different data items in the same transaction in order to discover hidden patterns. For instance, suppose that someone who buys a desktop (A) also purchases a speaker (B) in 55% of the cases, and that both items are bought together in 8.2% of all transactions. An association rule for this situation is "A implies B", where 55% is the confidence factor (CF) and 8.2% is the support factor (SF). The Apriori algorithm, Pincer search and AprioriDP are efficient ARM algorithms in data mining.

Figure-1 Data Set Preview

I.3 Data Set Collection

For this proposed work, a dataset with 750 entries for 10 parameters was taken from the data.gov.uk portal [11]. The dataset is the Congenital Heart Disease (CHD) data published by the Healthcare Quality Improvement Partnership, licensed under the Open Government Licence (OGL).
The data contain 30-day outcomes (alive or dead) for congenital heart disease treatment in England, although the audit covers all of the UK and the Republic of Ireland. All data are available in the National Institute for Cardiovascular Outcomes Research (NICOR) Congenital Public Portal. A preview of the structured dataset, with an explanation of all ten attributes, is given above in Figure-1, which is based on the resource “CongenitalDataFieldDescription20102011”.

The experimental procedure adopted in this work involves the steps shown in Figure-2.
Data preprocessing is done to remove duplicates and to clean the data so that it is fit for mining.

Figure-2 Methodology Workflow

II. LITERATURE SURVEY

A considerable amount of research has been carried out in recent decades using data mining techniques on medical data. Ajad Patel et al. [7] developed a system that indicates whether or not a patient is at risk of heart disease. The work describes a prototype using Naïve Bayes and WAC (Weighted Associative Classifier).
The work of Ms. M. C. S. Geetha et al. [1] is able to answer more complex queries in forecasting heart attacks. The predictive accuracy determined by the REPTree, J48 and BayesNet algorithms suggests that the parameters used are consistent indicators for predicting heart disease.

Ilham KADI et al. [3] constructed a cardiovascular dysautonomias prediction system using a decision-tree-based classifier built with the C4.5 tree algorithm and showed it to be accurate and efficient. Swaroopa Shastri et al. [4] analysed a dataset using the Apriori algorithm to produce a detailed correlation between diabetes and kidney disease.

Jagdeep Singh et al. [9] focused on heart disease prediction using associative classification methods. Their proposed hybrid associative classification is implemented in the Weka environment. Similarly, Dao-I Lin et al. [8] presented a novel algorithm called Pincer-search that can efficiently discover the maximum frequent set.
It does not require the explicit examination of every frequent itemset.

III. CLASSIFICATION TECHNIQUES

Classification consists of predicting a certain result based on given input training data. To predict the result, the algorithm processes a training set containing a set of attributes and the corresponding outcome, usually called the prediction attribute.
Data classification is the process of organizing data into categories for its most effective and efficient use. The classification algorithms used in our analysis are the decision tree and Naïve Bayes.

A. Decision Tree

A decision tree is a predictive model that goes from observations about an item's attributes to conclusions about its target value, represented in the leaf nodes. This model is used in statistics, data mining and machine learning. Decision trees in which the target variable can take continuous values (typically real numbers) are called regression trees. J48 is the implementation of the C4.5 algorithm, the successor of ID3 (Iterative Dichotomiser 3), developed by the WEKA project team. J48 allows classification via either decision trees or rules generated from them [1].

C4.5 is a standard algorithm for inducing classification rules in the form of a decision tree. It was introduced by Quinlan as an extension of the basic ID3 algorithm to overcome its disadvantages. C4.5 has been reported to achieve a higher accuracy rate than KNN and Naïve Bayes. Some of the issues it addresses are [3]:

1. Choosing an appropriate attribute selection measure.
2. Pruning the decision tree after its creation.
3. Handling continuous attributes.

It uses a divide-and-conquer approach to form a binary tree, as shown in Figure-3, when applied to the given data set [11].
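The attribute selection measure mentioned in the first point can be illustrated with a small sketch. C4.5 ranks candidate attributes by gain ratio (information gain divided by split information). The records, attribute names and helper functions below are hypothetical and chosen only for illustration; they are not taken from the CHD data set or from the R Studio analysis.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def gain_ratio(rows, attribute, target):
    """Information gain of splitting on `attribute`, divided by the split information."""
    total = len(rows)
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row[target])
    # Weighted entropy of the target that remains after the split.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy([row[target] for row in rows]) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy([row[attribute] for row in rows])
    return gain / split_info if split_info > 0 else 0.0

# Hypothetical toy records (assumption: not real patient data).
records = [
    {"age_group": "infant", "procedure": "surgery", "outcome": "alive"},
    {"age_group": "infant", "procedure": "catheter", "outcome": "alive"},
    {"age_group": "neonate", "procedure": "surgery", "outcome": "dead"},
    {"age_group": "neonate", "procedure": "catheter", "outcome": "dead"},
]

# The attribute with the highest gain ratio would become the root split.
best = max(["age_group", "procedure"], key=lambda a: gain_ratio(records, a, "outcome"))
```

In this toy example `age_group` separates the two outcomes perfectly while `procedure` carries no information, so `age_group` would be selected for the root node.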
Figure-3 Obtained Decision Tree

B. Naïve Bayes Algorithm

Naïve Bayes is a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong independence assumptions between the features. Maximum-likelihood training can be done by evaluating a closed-form expression, which takes linear time, rather than by the expensive iterative approximation used for many other types of classifiers. Bayes’ theorem gives the probability of an event occurring given that another event has already occurred. The mathematical equation is

P(A|B) = P(B|A) P(A) / P(B)

Naïve Bayes can answer diagnostic and predictive problems. It is particularly suited to cases where the dimensionality of the inputs is high [1]. The model assumes the conditional independence of each predictor given the target class [7]. The work in [3] achieved high accuracy rates of up to 98.4% for the training set and 97.76% for the testing set.

The bar charts obtained after processing the training data set [11] with the Naïve Bayes classifier are shown in Figure-4, and the threshold value for disease prediction is fixed at 9 in Figure-5.

Figure-4 Naïve Bayes’ Classifier

Figure-5 Threshold line for disease prediction

IV. ASSOCIATION RULE MINING

Association rule mining was proposed by Agrawal et al. in 1993. It is an important data mining model studied extensively by the database and data mining communities. It was initially used for Market Basket Analysis to find how items purchased by customers are related. The method uses the support and confidence factors.
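As a minimal sketch of how the support and confidence factors are computed, consider the following. The transactions and the function names are hypothetical and serve only as an illustration of the definitions.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Support of (antecedent and consequent) divided by support of the antecedent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical market-basket transactions (assumption: illustrative only).
transactions = [
    {"desktop", "speaker"},
    {"desktop", "speaker", "mouse"},
    {"desktop"},
    {"mouse"},
    {"speaker", "mouse"},
]

rule_support = support(transactions, {"desktop", "speaker"})          # 2 of 5 transactions
rule_confidence = confidence(transactions, {"desktop"}, {"speaker"})  # 2 of 3 desktop buys
```

Here the rule "desktop implies speaker" has a support of 0.4 and a confidence of about 0.67: support is measured against all transactions, whereas confidence is measured only against those that contain the antecedent.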
A. Apriori Algorithm

The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules. It uses a “bottom-up” approach, in which frequent subsets are extended one item at a time (a step known as candidate generation) and groups of candidates are tested against the data. Hence, we have used this algorithm to predict disease occurrence from the available factors obtained from the dataset [11]. Its main advantages are: (i) it uses the large itemset property, (ii) it is easily parallelized and (iii) it is easy to implement.

The algorithm mines the most frequently occurring sets of items in a transactional database (a collection of items bought by customers, or details of website visits). It identifies the frequent individual items and extends them to larger and larger itemsets as long as those itemsets appear sufficiently often in the database [4]. The authors of [4] implemented this to predict diabetes-influenced kidney disease. Algorithms such as Apriori, FP-Growth, Naïve Bayes and ZeroR have also been applied to the prediction of heart diseases.
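The level-wise candidate generation described above can be sketched as follows. This is a simplified illustration and not the implementation used in this work; the item names, the transactions and the `min_support` value are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    n = len(transactions)

    def frequent(candidates):
        # One pass over the data: count each candidate and keep the frequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        return {c: k / n for c, k in counts.items() if k / n >= min_support}

    items = {item for t in transactions for item in t}
    level = frequent({frozenset([i]) for i in items})
    result = dict(level)
    k = 2
    while level:
        # Join step: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = frequent(candidates)
        result.update(level)
        k += 1
    return result

# Hypothetical attribute-value "transactions" (assumption: not the CHD data).
transactions = [
    {"neonate", "surgery", "high_risk"},
    {"neonate", "surgery", "high_risk"},
    {"neonate", "catheter"},
    {"infant", "surgery"},
]
frequent_itemsets = apriori(transactions, min_support=0.5)
```

Each iteration of the loop corresponds to one scan of the database, which is the cost that motivates algorithms needing fewer scans.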
B. Pincer Search Algorithm

To overcome the disadvantage of the Apriori algorithm, which requires many database scans, the Pincer search algorithm can be used, which can mine the frequent candidate set in just two passes.

Dao-I Lin et al. [8] combined the “bottom-up” and the “top-down” searches. The top-down search is used only for maintaining and updating a new data structure, the maximum frequent candidate set. The work mainly relies on two closure properties:

1. If an itemset is infrequent, all its supersets must be infrequent and they need not be examined further.
2. If an itemset is frequent, all its subsets must be frequent and they need not be examined further.

Building on the above work, we have applied this search algorithm to predict the disease that has the highest frequency among the common attributes (IF this, THEN that).
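The first closure property drives the top-down part of the search: whenever the bottom-up pass finds an infrequent itemset, every element of the maximum frequent candidate set (MFCS) that contains it is replaced by smaller candidates. The following is a minimal sketch of that update step only, under the assumption that itemsets are represented as frozensets; the function name and the numeric items are hypothetical.

```python
def update_mfcs(mfcs, infrequent):
    """Shrink the MFCS so that no element contains a known infrequent itemset."""
    mfcs = set(mfcs)
    for s in infrequent:
        # Every candidate containing s is itself infrequent (closure property 1).
        for m in [m for m in mfcs if s <= m]:
            mfcs.discard(m)
            for item in s:
                # Dropping one item of s from m yields a set that no longer contains s.
                smaller = m - {item}
                # Keep only maximal candidates: skip sets covered by an existing element.
                if not any(smaller <= other for other in mfcs):
                    mfcs.add(smaller)
    return mfcs

# Hypothetical example over the items 1..6.
initial = {frozenset({1, 2, 3, 4, 5, 6})}
infrequent = [frozenset({1, 6}), frozenset({3, 6})]
updated = update_mfcs(initial, infrequent)
```

After both infrequent pairs are processed, the MFCS in this example consists of {1, 2, 3, 4, 5} and {2, 4, 5, 6}. If an element of the MFCS later turns out to be frequent, the second closure property means that none of its subsets has to be counted.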
Association rule mining thus also paves the way for disease prediction with the generated candidate set.

V. CONCLUSION

In this paper, we proposed a system for the prediction of diseases using data mining concepts such as the Decision Tree, the Naïve Bayes classifier, the Apriori algorithm and the Pincer-search algorithm on structured data [11]. To the best of our knowledge, existing papers have not focused on applying the Pincer-search algorithm to disease prediction, for which it shows considerably good prediction results. Once a disease is predicted, we can suggest a particular hospital to the patient (corresponding to the data available) for further diagnosis and surgical treatment.

In future, this work can be extended by offering dietary suggestions and precautionary advice. The parameters used are consistent indicators for heart disease; thus, more parameters with geographical variations can be used for better prediction results. Aspects such as processing time, resource usage and memory usage can be improved in future to make the system an important aid for the medical and healthcare communities.