Feature Selection With the Random Forest Packages to Predict Student Performance

Every study program seeks to improve its quality of education and its accreditation. One element that contributes to the accreditation score is the number of students who graduate on time. The more students remain active, the more students graduate on time. The head of the study program therefore needs to predict which students will be inactive in the next semester. To make such predictions, we must first determine which features are needed. This article reports the results of feature selection research for predicting the active status of students. Feature selection was performed on seven features using the RandomForest package in R Studio. One feature, the active status of students, is the output, and six features are inputs: grade point (GP), grade point average (GPA), parent work, school majors, school category, and student hometown. The results rank the features from strongest to weakest as follows: grade point (GP), grade point average (GPA), parent work, school majors, school category, and student hometown.


INTRODUCTION
Every study program seeks to improve its quality of education and its accreditation. One element that contributes to the accreditation score is the number of students who graduate on time [1]. The more students who graduate on time, the better the accreditation score. Non-active students lower the on-time graduation rate. Potentially non-active students therefore need to be identified and handled early, before they become non-active. Such prevention is expected to reduce the number of non-active students and thereby raise the on-time graduation rate, which in turn is expected to strengthen the accreditation of the study program.
Research on predicting student activity has been carried out before. One study predicted the performance of students of the Faculty of Computer Science, Dian Nuswantoro University, using the Decision Tree algorithm [2]. Similar research was carried out using the KNN algorithm [3]. Each of these studies used a single algorithm, so no comparison was made, and another algorithm may well give better predictions.
Many studies have predicted student activity by comparing several algorithms. These include comparisons of Support Vector Machine (SVM) and Decision Tree [4]; J48, Random Forest, Multilayer Perceptron, IB1, and Decision Table [5]; Logistic Regression, Decision Tree, Naïve Bayes, and Neural Network [6]; K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Random Forest [7]; Decision Tree, Support Vector Machine (SVM), and Naïve Bayes [8]; Support Vector Machine (SVM), Neural Network, Naïve Bayes, and K-Nearest Neighbor (KNN) [9]; and K-Nearest Neighbor (KNN), Support Vector Machine (SVM), and Decision Tree [10]. These studies focus only on selecting the algorithm used for prediction. Selecting the right algorithm certainly helps produce accurate predictions. However, even the right algorithm will produce less accurate predictions if the attributes used for prediction are inappropriate.
Feature selection is needed to identify which factors influence student performance. Feature selection is one important means to attack problems with various aspects of data, and to enable existing tools to apply where this would otherwise not be possible [11]. One of the algorithms commonly used for feature selection is Random Forest. A random forest (RF) classifier is an ensemble classifier that produces multiple decision trees, using a randomly selected subset of training samples and variables [12]. Studies that used this algorithm for feature selection include classification and feature selection for the diagnosis and prediction of breast cancer [13], airborne lidar feature selection for urban classification [14], and feature selection for protein division [15]. Research has shown that feature selection improves performance [16]. The purpose of this research was to perform feature selection in order to find the features that most influence student performance.

RESEARCH METHOD
The data used in this research were records of Informatics Engineering students of Politeknik Harapan Bersama from 2014 to 2017 (1530 observations). The data included grade point (GP), grade point average (GPA), parent work, school majors, school category, student hometown, and the active status of students. The tool used in this study was R Studio, with the RandomForest package used for feature selection. The RandomForest package is an implementation of Breiman's random forest algorithm (based on Breiman and Cutler's original Fortran code) for classification and regression. It can also be used in unsupervised mode to assess proximities among data points [17]. The research procedure is shown in Figure 1.

Literature Study
The literature sources are journals, proceedings, and books. The literature study continues alongside the other research stages until the end of the study, so that any supporting references found at later stages can still be used to help complete the research.

Data Collection
The data used in the study are data from Informatics Engineering students of Harapan Bersama Polytechnic from 2014 to 2016. The data include GPA, credits taken, hometown, school of origin, parent work, and student activity in each semester.

Data Preprocessing
At this stage, the input data and the output (target) data are determined. In addition, data normalization is carried out by converting character data into numerical data.
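The conversion of character data into numerical data can be illustrated with a minimal sketch. It is written in Python rather than R for illustration only; the function name encode_column and the example labels are hypothetical, since the text does not list the actual category values or the encoding scheme used in the study. The sketch assigns each distinct string an integer code in order of first appearance.

```python
def encode_column(values):
    """Map each distinct string to an integer code, in order of first appearance."""
    codes = {}
    encoded = []
    for value in values:
        if value not in codes:
            codes[value] = len(codes)  # next unused integer
        encoded.append(codes[value])
    return encoded, codes


# Hypothetical labels for the "school category" feature (assumption, not study data).
school_category = ["public", "private", "public", "public", "private"]
encoded, mapping = encode_column(school_category)
print(encoded)
print(mapping)
```

Keeping the returned mapping makes it possible to translate the numerical codes back into the original labels when the feature scores are interpreted.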

Feature Selection
Feature selection is done to determine which features are most influential. It also reduces the number of features, which simplifies computation.

Result
At this stage, a score is obtained for each feature that affects student activity.

Analysis
The analysis examines the score of each feature.

Conclusion
The final result identifies which features are most influential and which are less influential.

The randomForest() function from the randomForest package for R is an alternative to rpart(), has sufficient complexity, and often provides accurate prediction results. "For each of a large number of bootstrap samples (by default, 500) trees are independently grown. In addition, a new random sample of variables is chosen for use with each new tree. The out-of-bag (OOB) prediction for each observation is determined by a simple majority vote across trees whose bootstrap sample did not include that observation. Trees are grown to their maximum extent, limited however by nodesize (minimum number of observations at a node). Additionally, maxnodes can be used to limit the number of nodes. There is no equivalent to the parameter cp. The main tuning parameter is the number mtry of variables that are randomly sampled at each split. The default is the square root of the total number of variables; this is often satisfactory. It may seem surprising that it is (usually) beneficial to take a random sample of variables. Essentially, mtry controls the trade-off between the amount of information in each individual tree, and the correlation between trees. A very high correlation limits the ability of an individual tree to convey information that is specific to that tree" [18].
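The mechanics quoted above (bootstrap samples, out-of-bag observations, the square-root default for mtry, and majority voting) can be sketched as follows. This is an illustrative Python sketch, not the randomForest implementation; all function names are hypothetical, and the alphabetical tie-breaking in the vote is an assumption made here only to keep the result deterministic.

```python
import math
import random
from collections import Counter


def bootstrap_indices(n_rows, rng):
    """Draw n_rows row indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_rows) for _ in range(n_rows)]


def oob_indices(n_rows, in_bag):
    """Rows never drawn into the bootstrap sample are 'out of bag' (OOB)."""
    drawn = set(in_bag)
    return [i for i in range(n_rows) if i not in drawn]


def default_mtry(n_features):
    """Square-root rule for classification, rounded down, never below 1."""
    return max(1, math.isqrt(n_features))


def majority_vote(votes):
    """Most frequent label; ties are broken alphabetically (assumption)."""
    counts = Counter(votes)
    best = max(counts.values())
    return min(label for label, count in counts.items() if count == best)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n_rows = 10
in_bag = bootstrap_indices(n_rows, rng)
oob = oob_indices(n_rows, in_bag)
print("in-bag rows:", sorted(in_bag))
print("out-of-bag rows:", oob)
print("mtry for 6 input features:", default_mtry(6))
print("OOB vote:", majority_vote(["active", "inactive", "active"]))
```

With the six input features used in this study, the square-root rule gives two candidate variables per split.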

Gain Ratio
"The gain ratio is an extension of the information gain measure, which attempts to overcome the bias that the information gain measure is prone to selecting features with a large number of values" [13]. "Thereby, the information gain measure is used as an attribute selection measure of the decision tree and is obtained by computing the difference between the expected information requirement, classifying a tuple in tuples D, and the new information requirement for attribute A after the partitioning. The measure of the expected information requirement is given by" [19]

$$\mathrm{Info}(D) = -\sum_{i=1}^{m} p_i \log_2 p_i \tag{1}$$

"where m is the number of distinct classes; $p_i$ indicates the probability by calculating the proportion of belonging to class $C_i$ in tuples D. The new information requirement for attribute A is measured by"

$$\mathrm{Info}_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|}\,\mathrm{Info}(D_j) \tag{2}$$

"where v indicates that D was divided into v partitions or subsets, $\{D_1, D_2, \cdots, D_v\}$. Thus, the information gain measure Gain(A) for attribute A can be calculated by the formula"

$$\mathrm{Gain}(A) = \mathrm{Info}(D) - \mathrm{Info}_A(D) \tag{3}$$

"Then, a 'split information' function was used to normalize the information gain measure Gain(A). The split information function was defined by"

$$\mathrm{SplitInfo}(A) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2 \frac{|D_j|}{|D|} \tag{4}$$

Finally, the gain ratio is calculated by dividing Gain(A) by SplitInfo(A), the split information measure:

$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)} \tag{5}$$

The best feature is identified by its gain ratio: the greater the gain ratio, the more important the feature.
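The gain ratio described above can be computed directly from a feature column and a class column. The following is a minimal Python sketch under stated assumptions: the function names and the tiny example columns are hypothetical and are not data from this study, and a gain ratio of 0 is returned when the split information is 0 (a feature with a single value), which is a convention chosen here.

```python
import math
from collections import Counter


def entropy(labels):
    """Expected information: -sum(p_i * log2(p_i)) over the distinct values."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(feature, labels):
    """Information gain of the feature divided by its split information."""
    n = len(labels)
    # Partition the class labels by feature value: D_1, ..., D_v.
    partitions = {}
    for value, label in zip(feature, labels):
        partitions.setdefault(value, []).append(label)
    # Weighted information still required after splitting on the feature.
    info_after = sum(len(p) / n * entropy(p) for p in partitions.values())
    gain = entropy(labels) - info_after
    # Split information is the entropy of the feature's own value distribution.
    split_info = entropy(feature)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical columns (assumption): status is fully determined by the first
# feature and unrelated to the second.
status = ["active", "active", "inactive", "inactive"]
informative = gain_ratio(["it", "it", "science", "science"], status)
uninformative = gain_ratio(["it", "science", "it", "science"], status)
print(informative, uninformative)
```

A feature that separates the classes perfectly obtains the highest ratio, while a feature whose partitions have the same class mix as the full data obtains a ratio of zero.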

Random Forest
"The feature evaluation approach based on random forest is known as an embedded method" [20] "and provides a variable importance criterion for each feature by computing the mean decrease in the classification accuracy for the out of bag (OOB) data from bootstrap sampling" [21]. "Assuming bootstrap samples b = 1, …, B, the mean decrease in classification accuracy $\bar{D}_j$ for variable $x_j$ as the importance measure is given by"

$$\bar{D}_j = \frac{1}{B} \sum_{b=1}^{B} \left( R_b^{oob} - R_{bj}^{oob} \right) \tag{6}$$

where $R_b^{oob}$ denotes the classification accuracy for the OOB data $\ell_b^{oob}$ using the classification model $T_b$, and $R_{bj}^{oob}$ is the classification accuracy for the OOB data $\ell_{bj}^{oob}$ obtained by permuting the values of variable $x_j$ in $\ell_b^{oob}$ (j = 1, …, N). Finally, after the standard deviation $s_j$ of the decreases in classification accuracy is calculated, the z-score of variable $x_j$ is computed as $z_j = \bar{D}_j / (s_j / \sqrt{B})$; a larger z-score indicates a more important variable. In this study, the feature evaluation procedure is performed automatically using the 'RandomForest' R package.

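The mean decrease in OOB classification accuracy and its z-score, as described in the Random Forest subsection, can be sketched in Python as follows. This is an illustration only, not the code of the R package: the function name is hypothetical, the per-tree accuracies are made-up numbers, and the use of the sample standard deviation (divisor B - 1) is an assumption, since the text does not state which estimator is used.

```python
import math
import statistics


def importance_z_score(oob_acc, oob_acc_permuted):
    """Return (mean decrease in OOB accuracy, z-score) for one variable.

    oob_acc[b] is the OOB accuracy of tree b; oob_acc_permuted[b] is the OOB
    accuracy of the same tree after permuting the variable's values.
    """
    if len(oob_acc) != len(oob_acc_permuted) or len(oob_acc) < 2:
        raise ValueError("need matching accuracy lists from at least two trees")
    decreases = [r - rp for r, rp in zip(oob_acc, oob_acc_permuted)]
    n_trees = len(decreases)
    mean_decrease = sum(decreases) / n_trees
    s = statistics.stdev(decreases)  # sample standard deviation (assumption)
    # z = mean decrease divided by its standard error; undefined when s == 0.
    z = mean_decrease / (s / math.sqrt(n_trees)) if s > 0 else None
    return mean_decrease, z


# Made-up accuracies for four trees (assumption, not results from this study).
acc = [0.90, 0.88, 0.92, 0.90]
acc_permuted = [0.80, 0.80, 0.80, 0.82]
mean_d, z = importance_z_score(acc, acc_permuted)
print(round(mean_d, 3), round(z, 2))
```

A variable whose permutation barely changes the OOB accuracy yields a mean decrease near zero and therefore a low importance.
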
Correlation-based feature selection
"Unlike the feature evaluation methods mentioned above, a feature subset was evaluated simply by using the filter algorithm Correlation-based Feature Selection (CFS). The CFS assessed the worth of a set of features using a heuristic evaluation function based on the correlation of features", and Hall and Holmes [22] claimed that good feature subsets contain features that are highly correlated with the class yet uncorrelated with each other. The following formula is therefore used as the evaluation criterion for a subset S:

$$\mathrm{Merit}_S = \frac{k\,\bar{r}_{cf}}{\sqrt{k + k(k-1)\,\bar{r}_{ff}}}$$

where f is a feature, c is the class, $\bar{r}_{cf}$ is the average feature-class correlation, $\bar{r}_{ff}$ is the average feature-feature correlation, and k is the number of features in the subset. Best-first search is used to explore the feature space, with five consecutive non-improving subsets set as the stopping criterion to avoid searching the entire feature subset space.
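The subset evaluation criterion of CFS, in its commonly stated form, divides k times the average feature-class correlation by the square root of k + k(k - 1) times the average feature-feature correlation. A minimal Python sketch of that heuristic follows; the function name cfs_merit and the correlation values are hypothetical, and the sketch covers only the merit function, not the best-first search.

```python
import math


def cfs_merit(k, avg_feature_class_corr, avg_feature_feature_corr):
    """CFS merit of a subset of k features.

    merit = k * r_cf / sqrt(k + k * (k - 1) * r_ff)
    """
    if k < 1:
        raise ValueError("a subset must contain at least one feature")
    denominator = math.sqrt(k + k * (k - 1) * avg_feature_feature_corr)
    return k * avg_feature_class_corr / denominator


# Hypothetical correlations (assumption): four features, each correlated 0.5
# with the class, compared under no redundancy and under full redundancy.
independent = cfs_merit(4, 0.5, 0.0)
redundant = cfs_merit(4, 0.5, 1.0)
print(independent, redundant)
```

The comparison shows the intended behavior of the heuristic: features that duplicate each other add no merit beyond that of a single feature, whereas mutually uncorrelated features raise the merit of the subset.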

RESULTS AND DISCUSSION
Ordered from most to least influential, the features affecting student active status are: grade point (GP), grade point average (GPA), parent work, school majors, school category, and student hometown. The correlation value score of each feature, measured as the Mean Decrease in Gini, is shown in Figure 2. Table 1 lists the features ordered by this score, from largest to smallest.

Fig 2. Correlation Value Score
Feature selection was performed using the RandomForest package in R Studio on seven features. One feature, student active status, is the output, and six features are inputs: grade point (GP), grade point average (GPA), parent work, school majors, school category, and student hometown. The results in Table 1 show that the strongest features are grade point (GP), with a score of 209.27, and grade point average (GPA), with a score of 134.37. This shows that student activity in the next semester is strongly influenced by GP and GPA: the lower the GP and GPA, the greater the likelihood that a student will be inactive in the next semester. The weakest features are student hometown, with a score of 3.8, and school category, with a score of 5.7. This shows that whether a student comes from Tegal and its surroundings or from outside Tegal has little influence on predicting active status in the next semester. Likewise, whether a student comes from a public or a private school has little influence. Parent work and school majors have a moderate influence. This shows that parental income and school major have a notable influence on student academic status. The lower the parents' income, the more likely the student is to be inactive in the next semester, and students from IT majors are more likely to remain active than students from science or other majors.
Based on these results, the management of the study program needs to make efforts to raise students' GP and GPA so that the number of potentially inactive students decreases. The head of the study program also needs to provide information on tuition funding assistance, such as scholarships or help finding student side jobs, so that students with economic problems caused by low parental income can be supported. In addition, the head of the study program must pay special attention to students who come from majors other than IT or science, so that students who have difficulty following academic activities can be supported.

CONCLUSION
In order of influence, the features that most affect student activity in the next semester are: grade point (GP), grade point average (GPA), parent work, school majors, school category, and student hometown. To increase the number of students who are active in the next semester, the head of the study program needs to make efforts to improve students' academic scores. In addition, the head of the study program needs to provide information on tuition funding assistance, such as scholarships or side job assistance for students.