ساخت درخت ‌تصمیم مقیاس‌پذیر مبتنی بر تقسیم سریع داده‌ها و پیش‌هرس

محورهای موضوعی : electrical and computer engineering

سميه لطفي ¹ , محمد قاسم زاده ² , مهران محسن زاده ³ , ميترا ميرزارضايي ⁴

1 - دانشگاه آزاد اسلامی واحد علوم و تحقیقات
2 - دانشگاه یزد
3 - دانشگاه آزاد اسلامی واحد علوم و تحقيقات
4 - دانشگاه آزاد اسلامی واحد علوم و تحقيقات

تاریخ دریافت : 1399/10/02 تاریخ پذیرش : 1400/01/01 تاریخ انتشار : 1400/07/12

کلید واژه: پیش‌هرس, داده‌کاوی, درخت تصمیم, مقیاس‌پذیر,

چکیده مقاله :

دسته‌بندی، یکی از وظایف مهم داده‌کاوی و یادگیری ماشین است و درخت تصمیم به ‌عنوان یکی از الگوریتم‌های پرکاربرد دسته‌بندی، دارای سادگی و قابلیت تفسیر نتایج است. اما در مواجهه با داده‌های حجیم، درخت تصمیم بسیار پیچیده خواهد شد و با محدودیت‌های حافظه و زمان اجرا مواجه‌ است. الگوريتم‌هاي ساخت درخت باید همه مجموعه داده آموزش و یا بخش زیادی از آن را درون حافظه نگه دارند. الگوریتم‌هایی که به علت انتخاب زیرمجموعه‌ای از داده با محدودیت حافظه مواجه نیستند، زمان اضافی جهت انتخاب داده صرف‌ می‌کنند. جهت انتخاب بهترین ویژگی برای ایجاد انشعاب در درخت هم باید محاسبات زیادی بر روی این مجموعه داده انجام شود. در این مقاله، یک رویکرد مقیاس‌پذیر افزایشی بر مبنای تقسیم سریع و هرس، جهت ساخت درخت تصمیم بر روی‌ مجموعه داده‌های حجیم ارائه شده است. الگوریتم ارائه‌شده درخت تصمیم را با استفاده از کل مجموعه داده‌ آموزش اما بدون نیاز به ذخیره‌سازی داده در حافظه اصلی می‌سازد. همچنین جهت کاهش پیچیدگی درخت از روش پیش‌هرس استفاده شده است. نتایج حاصل از اجرای الگوریتم بر روی مجموعه داده‌های UCI نشان می‌دهد الگوریتم ارائه‌شده با وجود دقت و زمان ساخت قابل رقابت با سایر الگوریتم‌ها، بر مشکلات حاصل از پیچیدگی درخت غلبه کرده است.

چکیده انگلیسی:

Classification is one of the most important tasks in data mining and machine learning; and the decision tree, as one of the most widely used classification algorithms, has the advantage of simplicity and the ability to interpret results more easily. But when dealing with huge amounts of data, the obtained decision tree would grow in size and complexity, and therefore require excessive running time. Almost all of the tree-construction algorithms need to store all or part of the training data set; but those algorithms which do not face memory shortages because of selecting a subset of data, can save the extra time for data selection. In order to select the best feature to create a branch in the tree, a lot of calculations are required. In this paper we presents an incremental scalable approach based on fast partitioning and pruning; The proposed algorithm builds the decision tree via using the entire training data set but it doesn't require to store the whole data in the main memory. The pre-pruning method has also been used to reduce the complexity of the tree. The experimental results on the UCI data set show that the proposed algorithm, in addition to preserving the competitive accuracy and construction time, could conquer the mentioned disadvantages of former methods.

منابع و مأخذ:

[1] S. Agarwal, "Data mining: data mining concepts and techniques," in Proc. IEEE Int. Conf. on Machine Intelligence and Research Advancement, . pp. 203-207, Katra, India, 21-23 Dec. 2013.
[2] S. B. Kotsiantis, "Decision trees: a recent overview," Artificial Intelligence Review, vol. 39, no. 4, pp. 261-283, 2013.
[3] Z. Ruiz, J. Salvador, and J. Garcia-Rodriguez, "A survey of machine learning methods for big data," in International Work-Conf. on the Interplay Between Natural and Artificial Computation, pp. 259-267, Springer, Cham, Jun. 2017.
[4] S. Lomax and S. Vadera, "A survey of cost-sensitive decision tree induction algorithms," ACM Computing Surveys (CSUR), vol. 45, no. 2, pp. 1-35, Feb. 2013.
[5] Y. L. Chen, C. C. Wu, and K. Tang, "Time-constrained cost-sensitive decision tree induction," Information Sciences, vol. 354, pp. 140-152, Mar. 2016.
[6] F. Stahl and M. Bramer, "Jmax-pruning: a facility for the information theoretic pruning of modular classification rules," Knowledge-Based Systems, vol. 29, pp. 12-19, 2012.
[7] S. Hwang, H. G. Yeo, and J. S. Hong, "A new splitting criterion for better interpretable trees," IEEE Access, vol. 8, pp. 62762-62774, 2020.
[8] J. Gehrke, V. Ganti, R. Ramakrishnan, and W. Y. Loh, "BOAT-optimistic decision tree construction," ACM SIGMOD Record, vol. 28, no. 2, pp. 169-180, Jun. 1999.
[9] M. Mehta, R. Agrawal, and J. Rissanen, "SLIQ: a fast scalable classifier for data mining," Advances in Database Technology-EDBT'96, pp. 18-32, 1996.
[10] J. Shafer, R. Agrawal, and M. Mehta, "SPRINT: a scalable parallel classifier for data mining," in Proc. Int. Conf. Very Large Data Bases, vol. 96, pp. 544-555, 3-6 Sept. 1996.
[11] J. Gehrke, R. Ramakrishnan, and V. Ganti, "Rainforest-a framework for fast decision tree construction of large datasets," Data Mining and Knowledge Discovery, vol. 4, pp. 127-162, 2000.
[12] G. Hulten and P. Domingos, "Mining decision trees from streams," in Garofalakis M., Gehrke J., Rastogi R. (eds.) Data Stream Management. Data-Centric Systems and Applications. pp. 189-208, Springer Berlin Heidelberg, 2016.
[13] C. C. Wu, Y. L. Chen, Y. H. Liu, and X. Y. Yang, "Decision tree induction with a constrained number of leaf nodes," Applied Intelligence, vol. 45, no. 3, pp. 673-685, Oct. 2016.
[14] A. Franco-Arcega, J. A. Carrasco-Ochoa, G. Sanchez-Diaz, and J. F. Martinez-Trinidad, "Decision tree induction using a fast splitting attribute selection for large datasets," Expert Systems with Applications, vol. 38, no. 11, pp. 14290-14300, Oct. 2011.
[15] B. Chandra, R. Kothari, and P. Paul, "A new node splitting measure for decision tree construction," Pattern Recognition, vol. 43, no. 8, pp. 2725-2731, Aug. 2010.
[16] P. S. Akash, M. E. Kadir, A. A. Ali, and M. Shoyaib, "Inter-node hellinger distance based decision tree," in Proc. 28th Int. Joint Conf. on Artificial Intelligence, IJCAI’19, pp. 1967-1973, Aug. 2019.
[17] M. Bramer, "Using J-pruning to reduce overfitting in classification trees," Knowledge-Based Systems, vol. 15, no. 5, pp. 301-308, Jul. 2002.

اشتراک گذاری

آدرس مقاله

ساخت درخت ‌تصمیم مقیاس‌پذیر مبتنی بر تقسیم سریع داده‌ها و پیش‌هرس