Data Mining Classification Simplified: Key Types, Steps & 5 Best Classifiers

By: Arsalan Mohammed | Published: May 17, 2022

The term Big Data has gained immense popularity. It refers to huge volumes of data, rich in insights, that can provide value to an organization. Multiple techniques can be used to process such data, and Data Mining is one of them.


Data Mining refers to the process of converting raw data into valuable insights by running software solutions that find patterns in batches of data. The Classification Technique is one such Data Mining technique; it helps in grouping data points into predefined categories based on various parameters.

This article will provide you with a comprehensive guide on Data Mining, Data Mining Classification, Classification Applications in Data Mining, and many more. 

What is Data Mining?

Data Mining is the process of discovering and identifying new patterns from Big Data or large amounts of enterprise data. It is also known as KDD – Knowledge Discovery in Databases. The rate of adoption of Data Mining techniques has increased over the past couple of years.

Data Mining helps organizations leverage data to make decision-making more effective than traditional methods. The Data Mining process helps in gaining insights that define the path an enterprise should take regarding its campaigns, products, locations, and many other aspects. Data Mining has two main types: it can either work on the target dataset to describe its parameters (descriptive mining) or predict outcomes by employing Machine Learning models (predictive mining).

With the advancement of software solutions, Artificial Intelligence is being used to expedite the extraction of information. But even as the technology improves, scalability issues remain, and mining the data becomes more difficult and, at the same time, more important.

Hevo Data, a No-code Data Pipeline, helps to load data from any data source such as Databases, SaaS applications, Cloud Storage, SDKs, and Streaming Services and simplifies the ETL process. It supports 150+ data sources (including 40+ free data sources) and is a 3-step process: select the data source, provide valid credentials, and choose the destination. Hevo not only loads the data onto the desired Data Warehouse/Destination but also enriches the data and transforms it into an analysis-ready form without having to write a single line of code.


Hevo is the fastest, easiest, and most reliable data replication platform that will save your engineering bandwidth and time multifold. Try our 14-day full access free trial today to experience an entirely automated hassle-free Data Replication!

What is Data Mining Classification?

Data Mining Classification is a popular technique in which each data point is classified into one of several predefined classes. It is a supervised learning technique: the classifier learns how to assign classes from previously labeled data.

Data Mining Classification Algorithms create relations between and link various parameters of the variables for prediction. The algorithm is called the Classifier, and the observations are called Instances. Classification helps in determining whether an instance is useful to the organization or not.

A Data Mining Classification example is that of a bank giving loans. There is a master database with the details of all the account holders. Classification helps in categorizing this master database into high, medium, and low probability of taking a loan, so that the bank can determine whom to spend time on in order to meet its targets.

What are the Classification Applications in Data Mining?

The classification in Data Mining has many applications in day-to-day life. A few Classification Applications in Data Mining are:

  • Product Cart Analysis on eCommerce platforms uses the classification technique to group items and create combinations of products to recommend. This is a very common Classification Application in Data Mining.
  • Weather patterns can be predicted and classified based on parameters such as temperature, humidity, wind direction, and many more. These Classification Applications of Data Mining are used in daily life.
  • The public health sector classifies diseases based on parameters like spread rate, severity, and more, which helps in charting out strategies to mitigate them. These Classification Applications of Data Mining help in finding cures.
  • Financial institutions use classification to identify defaulters, likely loan seekers, and other categories. These Classification Applications in Data Mining make finding the target audience much easier.

What are the Key Tools & Languages used for Mining Data?

Key Languages Used for Data Mining

  • Python Programming Language: Python is one of the most adaptable programming languages and can handle tasks ranging from Data Mining and Web Development to Application Development and Embedded Systems, all under a single platform. The Pandas library in Python helps in Data Analysis: processing datasets, visualizing them using histograms, and performing operations on data efficiently. This library is also widely used to mine data (a minimal Pandas sketch follows this list).
  • R Programming Language: R has wide support for operations like Data Manipulation, Data Calculations, and Data Visualization. R is also suitable for these operations because it allows all the common Machine Learning algorithms to be implemented swiftly. It also provides various statistical and graphical techniques, such as Linear Modelling, Non-Linear Modelling, Time-Series Analysis, Classification techniques, and many more.
  • SQL (Structured Query Language): SQL is the language designed to maintain and query data stored inside Relational Database Management Systems. SQL allows operations like insertion, deletion, updating, and retrieval of data present in the database. Aggregations such as max, min, and many more can also be applied to the data.
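For a quick illustration of how this looks in practice, the minimal Pandas sketch below loads a dataset, summarizes it, and draws a histogram. The file name and column names are hypothetical placeholders, not taken from the article.

```python
# Minimal sketch: exploring a dataset with Pandas before mining it.
# "customers.csv" and the column names are hypothetical placeholders.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("customers.csv")        # load the raw data
print(df.describe())                     # summary statistics for numeric columns
print(df["segment"].value_counts())      # class distribution of a label column
df["annual_income"].hist(bins=20)        # quick histogram for visual inspection
plt.show()
```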

Best Tools for Data Mining

There are many tools available in the market that can perform efficient Data Mining Classification, a few are mentioned below:

1) Oracle Data Mining

Oracle provides an Enterprise Edition of its Database that includes a prebuilt Oracle Data Mining (ODM) tool. This tool integrates directly with Oracle Database to perform Data Analysis with ease, eliminating the need to move data onto specialized servers. ODM helps in mining data to identify patterns and form valuable insights, and it can process Data Pipelines asynchronously.

2) RapidMiner

RapidMiner is a Java-based Predictive Analytics tool. It is proficient in performing Deep Learning, Text Mining, and Predictive Analytics under a single platform, and it provides both on-premise solutions and a Cloud framework. The templates it employs reduce errors and increase efficiency by shortening delivery times.

3) SAS Enterprise Miner

SAS stands for Statistical Analysis System. SAS provides the Enterprise Miner software, which ships with prebuilt tools and is proficient in Data Mining and Data Optimization. The methodologies employed by the software support the organization's goals. The models incorporated in the tool cover Descriptive, Predictive, and Prescriptive Modeling, and scaling of the system is handled by Distributed Memory Processing.

4) IBM SPSS Modeler

IBM SPSS Modeler is a cutting-edge, enterprise-wide solution that offers Visual Data Science and Machine Learning tools. The tool is proficient in Data Preparation, Predictive Analysis, and Data Mining deployment operations, and it brings the governance and security needs of the organization under the same platform.

What are the Data Mining Classification Techniques?

Data Mining has two main types of Classification Categories available:

  • Generative Classification
  • Discriminative Classification

Now let us understand the two Data Mining Classification categories in detail.

1) Generative Classification

A Generative Data Mining Classification Algorithm models the distribution of the individual classes and learns a model of how the data is generated through estimations and assumptions. The Generative Classification algorithm is then used to predict unseen data.

An example of a Generative Data Mining Classification Algorithm is the Naive Bayes Classifier.

Example : Naive Bayes Classifier – Detecting Spam emails by looking at the previous data.
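As a rough illustration (not the article's exact setup), the sketch below trains a Naive Bayes spam classifier on a tiny, made-up set of labeled emails using scikit-learn; the example emails and labels are invented.

```python
# Sketch: a generative Naive Bayes classifier learning per-class word
# distributions from previously labeled (made-up) emails.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = ["cheap meds available now", "meeting at noon tomorrow",
          "win a cheap holiday prize", "project report attached"]
labels = ["spam", "ham", "spam", "ham"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)          # word-count features
model = MultinomialNB().fit(X, labels)        # estimates P(word | class) and P(class)

new_email = vectorizer.transform(["cheap prize inside"])
print(model.predict(new_email))               # expected: ['spam']
```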

2) Discriminative Classification

A Discriminative Data Mining Classification algorithm is a more basic classifier that determines a class for each row of data directly. Rather than modeling how the data was generated, it learns the decision boundary from the observed data, so its results depend heavily on data quality.

An example of a Discriminative Classifier is Logistic Regression.

Example : Logistic Regression – Acceptance into university based on student grades and test results.
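A minimal sketch of the same idea, assuming made-up grade and test-score numbers: logistic regression learns a decision boundary directly from the labeled observations.

```python
# Sketch: a discriminative logistic regression predicting university acceptance
# from a grade average and a test score (all numbers are invented).
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[3.9, 92], [3.4, 81], [2.1, 55], [3.7, 88], [2.5, 60], [3.0, 72]])
y = np.array([1, 1, 0, 1, 0, 0])              # 1 = accepted, 0 = rejected

clf = LogisticRegression().fit(X, y)
print(clf.predict([[3.6, 85]]))               # predicted class for a new applicant
print(clf.predict_proba([[3.6, 85]]))         # probability of each class
```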

What are the steps involved in Data Mining Classification?

Data Mining Classification is carried out in two phases: a learning phase and a classification phase.

Step 1: Learning Phase

This phase of Data Mining Classification deals with the construction of the classification model based on one of the available algorithms. It requires a training set from which the model learns. Once trained, the model is evaluated against test data, which indicates how accurate the constructed classification model is.

Step 2: Classification Phase

This phase of Data Mining Classification deals with using the constructed model to predict class labels for new data. It also helps in determining the accuracy of the model on real test cases.
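A minimal sketch of the two phases, using scikit-learn's bundled Iris dataset rather than any dataset from the article: the model is constructed on a training split (learning phase) and its accuracy is then estimated on held-out test data (classification phase).

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # learning phase
y_pred = model.predict(X_test)                           # classification phase
print("Accuracy on test data:", accuracy_score(y_test, y_pred))
```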

5 Best Classifiers for Data Mining

  • Linear Regression
  • Logistic Regression
  • Random Forest
  • Naive Bayes
  • Decision Tree

1. Logistic Regression

Logistic Regression is a statistical method that performs binomial classification for a particular event or class. The model outputs the probability for each observation and uses it to decide which side of the binary split the observation falls on. Logistic Regression can also model how multiple independent parameters impact a single outcome.

Logistic Regression is only viable when the predicted variable is binary and there are no missing values in the target dataset. It also requires all the predictors to be independent of each other.

2. Linear Regression

Linear Regression is a Supervised Learning algorithm that performs simple regression to predict values from independent variables. It models the dependent variable as a linear function of one or more independent variables.

The main issue with the model is that it is highly prone to overfitting, and it is not always feasible to separate data in a linear manner.
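A minimal sketch with invented numbers, showing how a fitted linear regression exposes the learned slope and intercept:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])       # single independent variable
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])      # roughly y = 2x

reg = LinearRegression().fit(X, y)
print(reg.coef_, reg.intercept_)              # learned slope and intercept
print(reg.predict([[6]]))                     # predicted value for x = 6
```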

3. Decision Trees

This is one of the most widely used Classification Techniques for Data Mining. It follows a flowchart similar to the structure of a tree. The leaf nodes hold the classes and their labels, while the internal nodes hold decision rules that route a record toward the appropriate leaf node. There can be multiple internal nodes along this path, and the resulting horizontal and vertical splits act as prediction boundaries.

The only challenge is that it can become complex, and it requires expertise to create the tree and ingest data into it.
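The sketch below, using scikit-learn's Iris dataset as a stand-in, prints the flowchart-like rules of a small decision tree so the internal decision nodes and leaf classes are visible:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree))      # internal nodes = decision rules, leaves = class labels
```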

4. Random Forest

As the name suggests, this model employs multiple Decision Trees, each trained on a different subset of the data. The predictions of all the trees are then averaged (or majority-voted) to produce the final class. The subsets are the same size as the original dataset, but the samples are drawn with replacement for every subgroup (bootstrapping).

It is efficient in reducing overfitting and increasing accuracy. The drawback is that it is slow for real-time applications and more complex to implement.
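A minimal sketch on a bundled dataset: each tree is trained on a bootstrap sample, and the forest's cross-validated accuracy is reported.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=200, random_state=0)   # 200 bootstrapped trees
print(cross_val_score(forest, X, y, cv=5).mean())                   # averaged accuracy estimate
```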

5. Naive Bayes

The Naive Bayes algorithm assumes that every independent parameter is independent of the others and contributes roughly equally to the outcome. Using Bayes' theorem, it calculates the probability of a class given that certain feature values have been observed. Naive Bayes requires smaller training sets to learn and is faster at prediction than many other models.

Its weakness is that the independence and equal-importance assumptions rarely hold, so its probability estimates can be poor and may not reflect the real world.

Hevo’s Automated, No-code Data Integration Platform empowers you with everything you need to have a smooth Data Integration experience. Our platform has the following in store for you!

Check out what makes Hevo amazing:

  • Fully Managed:  It requires no management and maintenance as Hevo is a fully automated platform.
  • Data Transformation:  It provides a simple interface to perfect, modify, and enrich the data you want to transfer.
  • Real-Time:  Hevo offers real-time data migration. So, your data is always ready for analysis.
  • Schema Management:  Hevo can automatically detect the schema of the incoming data and map it to the destination schema.
  • Scalable Infrastructure:  Hevo has in-built integrations for 100+ sources that can help you scale your data infrastructure as required.
  • Live Support:  Hevo team is available round the clock to extend exceptional support to its customers through chat, email, and support calls.

What are the Advantages of Data Mining Classification?

  • Data Mining is cost-effective and very efficient compared to other data applications.
  • Data Scientists use Data Mining for information analysis, risk modelling, and product safety. 
  • Data Mining Classification helps businesses make informed decisions and also analyze huge amounts of enterprise data.
  • Data Mining Classification helps financial institutions identify defaulters, likely loan seekers, and other customer categories.

What are the Disadvantages of Data Mining Classification?

  • Data Mining done through Data Analytics tools is a complex and challenging task.
  • There are privacy concerns when the data is mined.
  • The data may become inaccurate, and sometimes there are issues with relevancy.

Data Mining is a leading Data Processing technique that provides a holistic view of raw data. There are various data mining techniques available, that can be chosen based on the data requirements. Data Mining helps organizations stay ahead of the competition by charting plans that are gained from enterprise data. This article provided a comprehensive overview of Data Mining, Data Mining Classification, Classification Applications in Data Mining, and many more.

There are various Data Sources that organizations leverage to capture a variety of valuable data points. But transferring data from these sources into a Data Warehouse for a holistic analysis is a hectic task. It requires you to code and maintain complex functions that can help achieve a smooth flow of data. An Automated Data Pipeline helps in solving this issue, and this is where Hevo comes into the picture. Hevo Data is a No-code Data Pipeline and has awesome 150+ pre-built Integrations that you can choose from.

Hevo can help you Integrate your data from 100+ data sources and load them into a destination to analyze real-time data at an affordable price . It will make your life easier and Data Migration hassle-free. It is user-friendly, reliable, and secure.

SIGN UP for a 14-day free trial and see the difference!

Share your experience of learning about Data Mining Classification in the comments section below.

Arsalan is a research analyst at Hevo and a data science enthusiast with over two years of experience in the field. He completed his B.tech in computer science with a specialization in Artificial Intelligence and finds joy in sharing the knowledge acquired with data practitioners. His interest in data analysis and architecture drives him to write nearly a hundred articles on various topics related to the data industry.



Basic Concept of Classification (Data Mining)

Data Mining: In general terms, data mining means mining or digging deep into data that is in different forms to discover patterns and to gain knowledge from those patterns. In the process of data mining, large data sets are first sorted, then patterns are identified and relationships are established to perform data analysis and solve problems.

Classification is a task in data mining that involves assigning a class label to each instance in a dataset based on its features. The goal of classification is to build a model that accurately predicts the class labels of new instances based on their features.

There are two main types of classification: binary classification and multi-class classification. Binary classification involves classifying instances into two classes, such as “spam” or “not spam”, while multi-class classification involves classifying instances into more than two classes.

The process of building a classification model typically involves the following steps:

Data Collection: The first step in building a classification model is data collection. In this step, the data relevant to the problem at hand is collected. The data should be representative of the problem and should contain all the necessary attributes and labels needed for classification. The data can be collected from various sources, such as surveys, questionnaires, websites, and databases.

Data Preprocessing: The second step in building a classification model is data preprocessing. The collected data needs to be preprocessed to ensure its quality. This involves handling missing values, dealing with outliers, and transforming the data into a format suitable for analysis. Data preprocessing also involves converting the data into numerical form, as most classification algorithms require numerical input.

Handling Missing Values: Missing values in the dataset can be handled by replacing them with the mean, median, or mode of the corresponding feature or by removing the entire record.

Dealing with Outliers: Outliers in the dataset can be detected using various statistical techniques such as z-score analysis, boxplots, and scatterplots. Outliers can be removed from the dataset or replaced with the mean, median, or mode of the corresponding feature.

Data Transformation: Data transformation involves scaling or normalizing the data to bring it into a common scale. This is done to ensure that all features have the same level of importance in the analysis.
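A minimal sketch of these preprocessing steps on a small, made-up data frame: impute a missing value, screen for outliers with a z-score, and scale the features to a common range.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [25, 32, np.nan, 45, 29],
                   "income": [40_000, 52_000, 61_000, 250_000, 48_000]})

df["age"] = df["age"].fillna(df["age"].median())            # handle missing values
z = (df["income"] - df["income"].mean()) / df["income"].std()
df = df[z.abs() < 3]                                        # keep rows within 3 standard deviations
scaled = StandardScaler().fit_transform(df)                 # bring features to a common scale
print(scaled)
```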

Feature Selection: The third step in building a classification model is feature selection. Feature selection involves identifying the most relevant attributes in the dataset for classification. This can be done using various techniques, such as correlation analysis, information gain, and principal component analysis.

Correlation Analysis: Correlation analysis involves identifying the correlation between the features in the dataset. Features that are highly correlated with each other can be removed as they do not provide additional information for classification.

Information Gain: Information gain is a measure of the amount of information that a feature provides for classification. Features with high information gain are selected for classification.

Principal Component Analysis: 

Principal Component Analysis (PCA) is a technique used to reduce the dimensionality of the dataset. PCA identifies the most important features in the dataset and removes the redundant ones.
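As an illustration (on scikit-learn's bundled wine dataset rather than any dataset discussed here), the sketch below inspects the correlation matrix and then applies PCA to reduce the feature space:

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = pd.DataFrame(data.data, columns=data.feature_names)

print(X.corr().round(2))                        # highly correlated pairs are removal candidates
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=2).fit_transform(X_scaled)
print(X_reduced.shape)                          # (n_samples, 2): reduced feature space
```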

Model Selection: The fourth step in building a classification model is model selection. Model selection involves selecting the appropriate classification algorithm for the problem at hand. There are several algorithms available, such as decision trees, support vector machines, and neural networks.

Decision Trees: Decision trees are a simple yet powerful classification algorithm. They divide the dataset into smaller subsets based on the values of the features and construct a tree-like model that can be used for classification.

Support Vector Machines: Support Vector Machines (SVMs) are a popular classification algorithm used for both linear and nonlinear classification problems. SVMs are based on the concept of maximum margin, which involves finding the hyperplane that maximizes the distance between the two classes.

Neural Networks: 

Neural Networks are a powerful classification algorithm that can learn complex patterns in the data. They are inspired by the structure of the human brain and consist of multiple layers of interconnected nodes.
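A minimal sketch of model selection, comparing the three algorithm families named above with cross-validation on a bundled dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
candidates = {"decision tree": DecisionTreeClassifier(),
              "support vector machine": SVC(),
              "neural network": MLPClassifier(max_iter=2000)}

for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5).mean()    # mean accuracy across folds
    print(f"{name}: {score:.3f}")
```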

Model Training: The fifth step in building a classification model is model training. Model training involves using the selected classification algorithm to learn the patterns in the data. The data is divided into a training set and a validation set. The model is trained using the training set, and its performance is evaluated on the validation set.

Model Evaluation: The sixth step in building a classification model is model evaluation. Model evaluation involves assessing the performance of the trained model on a test set. This is done to ensure that the model generalizes well to new, unseen data.
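A minimal sketch of the training and evaluation steps: the data is split into training, validation, and test sets; the model is fit on the training set, checked on the validation set, and its generalization is finally reported on the untouched test set.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)   # model training
print("validation accuracy:", model.score(X_val, y_val))          # tuning check
print(classification_report(y_test, model.predict(X_test)))       # model evaluation
```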

Classification is a widely used technique in data mining and is applied in a variety of domains, such as email filtering, sentiment analysis, and medical diagnosis.


Classification: It is a data analysis task, i.e. the process of finding a model that describes and distinguishes data classes and concepts. Classification is the problem of identifying which of a set of categories (sub-populations) a new observation belongs to, on the basis of a training set of data containing observations whose category membership is known.

Example: Before starting any project, we need to check its feasibility. In this case, a classifier is required to predict class labels such as 'Safe' and 'Risky' for adopting the project and further approving it. It is a two-step process:

  • Learning Step (Training Phase): Construction of the Classification Model. Different algorithms are used to build a classifier by making the model learn from the available training set. The model has to be trained for the prediction of accurate results.
  • Classification Step: The model is used to predict class labels; the constructed model is tested on test data to estimate the accuracy of the classification rules.


Test data are used to estimate the accuracy of the classification rule

Training and Testing:  

Suppose a person is sitting under a fan and the fan starts falling on him; he should move aside in order not to get hurt. That is his training: learning to move away. During testing, if the person sees a heavy object coming towards him or falling on him and moves aside, then the system is tested positively; if he does not move aside, the system is tested negatively. The same is the case with data: it should be trained in order to get accurate and reliable results.

There are certain data types associated with data mining that tell us the format of the data (whether it is in text format or numerical format).

Attributes – Represents different features of an object. Different types of attributes are:  

  • Symmetric : Both values are equally important in all aspects
  • Asymmetric : When both the values may not be important.
  • Ordinal : Values that must have some meaningful order. Example: grade sheets of a few students, containing grades as per their performance, such as A, B, C, D.
  • Continuous : May have an infinite number of values; typically stored as a float. Example: measuring the weight of a few students in an ordered sequence, e.g. 50, 51, 52, 53.
  • Discrete : A finite number of values. Example: marks of a student in a few subjects: 65, 70, 75, 80, 90.

Syntax:  

  • Mathematical Notation: Classification is based on building a function that takes an input feature vector "X" and predicts its outcome "Y" (a qualitative response taking values in a set C).
  • Here a Classifier (or model) is used, which is a supervised function; it can also be designed manually based on expert knowledge. It is constructed to predict class labels (example: the label "Yes" or "No" for the approval of some event).

Classifiers can be categorized into two major types:   

  • Discriminative : A basic classifier that determines just one class for each row of data. It models the decision boundary directly from the observed data, so it depends heavily on the quality of the data rather than on class distributions. Example: Logistic Regression.
  • Generative : It models the distribution of individual classes and tries to learn the model that generates the data behind the scenes by estimating the assumptions and distributions of the model. It is used to predict unseen data. Example: Naive Bayes Classifier, e.g. detecting spam emails by looking at previous data. Suppose there are 100 emails split 1:3, i.e. Class A: 25% (spam emails) and Class B: 75% (non-spam emails). A user wants to check whether an email containing the word "cheap" should be flagged as spam. In Class A (the 25% of spam emails), 20 out of 25 emails contain the word "cheap"; in Class B (the 75% of non-spam emails), only 5 out of 75 contain it. So, if an email contains the word "cheap", what is the probability of it being spam? The answer is 80%, as worked out in the short calculation after this list.
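The 80% figure in the generative example follows directly from Bayes' theorem; the short calculation below works through the arithmetic.

```python
# P(spam | "cheap") via Bayes' theorem, using the counts from the example above.
p_spam, p_not_spam = 0.25, 0.75            # Class A (spam) vs Class B (non-spam)
p_cheap_given_spam = 20 / 25               # spam emails containing "cheap"
p_cheap_given_not_spam = 5 / 75            # non-spam emails containing "cheap"

p_cheap = p_cheap_given_spam * p_spam + p_cheap_given_not_spam * p_not_spam
p_spam_given_cheap = p_cheap_given_spam * p_spam / p_cheap
print(p_spam_given_cheap)                  # 0.8, i.e. the 80% quoted above
```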

Classifiers Of Machine Learning:  

  • Decision Trees
  • Bayesian Classifiers
  • Neural Networks
  • K-Nearest Neighbour
  • Support Vector Machines
  • Linear Regression
  • Logistic Regression

Associated Tools and Languages: Used to mine/ extract useful information from raw data. 

  • Main Languages used : R, SAS, Python, SQL
  • Major Tools used : RapidMiner, Orange, KNIME, Spark, Weka
  • Libraries used : Jupyter, NumPy, Matplotlib, Pandas, ScikitLearn, NLTK, TensorFlow, Seaborn, Basemap, etc.

Real – Life Examples :  

  • Market Basket Analysis:   It is a modeling technique that has been associated with frequent transactions of buying some combination of items.  Example : Amazon and many other Retailers use this technique. While viewing some products, certain suggestions for the commodities are shown that some people have bought in the past.
  • Weather Forecasting:   Changing Patterns in weather conditions needs to be observed based on parameters such as temperature, humidity, wind direction. This keen observation also requires the use of previous records in order to predict it accurately.

Advantages: 

  • Mining Based Methods are cost-effective and efficient
  • Helps in identifying criminal suspects
  • Helps in predicting the risk of diseases
  • Helps Banks and Financial Institutions to identify defaulters so that they may approve Cards, Loan, etc.

Disadvantages: 

  • Privacy: When data is mined, there is a chance that a company may give some information about its customers to other vendors or use this information for its own profit.
  • Accuracy Problem: An accurate model must be selected in order to get the best accuracy and results.

APPLICATIONS:    

  • Marketing and Retailing
  • Manufacturing
  • Telecommunication Industry
  • Intrusion Detection
  • Education System
  • Fraud Detection

GIST OF DATA MINING : 

  • Choosing the correct classification method, like decision trees, Bayesian networks, or neural networks. 
  • Need a sample of data, where all class values are known. Then the data will be divided into two parts, a training set, and a test set.

Now, the training set is given to a learning algorithm, which derives a classifier. The classifier is then tested with the test set, where all class values are hidden. If the classifier classifies most cases in the test set correctly, it can be assumed that it will also work accurately on future data; otherwise, the wrong model may have been chosen.



Data Mining with Python: Implementing Classification and Regression [Video] by Packt Publishing

PacktPublishing/-Data-Mining-with-Python-Implementing-Classification-and-Regression


Data Mining with Python: Implementing Classification and Regression

This is the code repository for Data Mining with Python: Implementing Classification and Regression , published by Packt . It contains all the supporting project files necessary to work through the video course from start to finish.

About the Video Course

Python is a dynamic programming language used in a wide range of domains by programmers who find it simple yet powerful. In today’s world, everyone wants to gain insights from the deluge of data coming their way. Data mining provides a way of finding these insights, and Python is one of the most popular languages for data mining, providing both power and flexibility in analysis. Python has become the language of choice for data scientists for data analysis, visualization, and machine learning.

In this course, you will discover the key concepts of data mining and learn how to apply different data mining techniques to find the valuable insights hidden in real-world data. You will also tackle some notorious data mining problems to get a concrete understanding of these techniques.

We begin by introducing you to the important data mining concepts and the Python libraries used for data mining. You will understand the process of cleaning data and the steps involved in filtering out noise and ensuring that the data available can be used for accurate analysis. You will also build your first intelligent application that makes predictions from data. Then you will learn about the classification and regression techniques such as logistic regression, k-NN classifier, and SVM, and implement them in real-world scenarios such as predicting house prices and the number of TV show viewers.

By the end of this course, you will be able to apply the concepts of classification and regression using Python and implement them in a real-world setting.

What You Will Learn

  • Understand the basic data mining concepts to implement efficient models using Python
  • Know how to use Python libraries and mathematical toolkits such as NumPy, pandas, Matplotlib, and scikit-learn
  • Build your first application that makes predictions from data and see how to evaluate the regression model
  • Analyze and implement Logistic Regression and the KNN model
  • Dive into the most effective data cleaning process to get accurate results
  • Master the classification concepts and implement the various classification algorithms

Instructions and Navigation

Assumed Knowledge

To fully benefit from the coverage included in this course, you will need: Basic knowledge of Python

Technical Requirements

This course has the following software requirements:

  • OS: Any modern OS (Windows, Mac, or Linux)
  • RAM: minimum required for the Operating System
  • CPU: minimum required for the Operating System

Related Products

Exploratory Data Analysis with Pandas and Python 3.x [Video]

Scalable Data Analysis in Python with Dask [Video]

Data Storytelling with Power BI [Video]



A Case Study: Stream Data Mining Classification

Ketan Desale

Continuous and unending growth of data has created many challenges in the data mining task. Data mining is the extraction of meaningful information, i.e. knowledge, from large datasets for future decision making. Data that is continuously generated with changing values is known as streaming data. We face many problems with streaming data, as we are unable to store and process all of it. Network data is one of the best examples of streaming data. An Intrusion Detection System (IDS) is used to detect malicious users to protect the network. A system's safety in a network is of prime importance. In this paper, we present a comprehensive approach to improve the performance of IDS by applying some classification techniques with a streaming dataset. For the experiment, we created our own network dataset, which shows significant accuracy in results after applying the classifiers.

Related Papers

Dr. M.A. Jawale HOD IT COE

In this paper, we have been explored the brief review about the intrusion detection system. This review emphasizes about how to automatically and systematically build adaptable and extensible advanced intrusion detection system using data mining techniques and how to provide in-built prevention policies in the detection system so that it will reduce network administrator's system re-configuration efforts and application of sentiment analysis to


WARSE The World Academy of Research in Science and Engineering

This study presents an overview of intrusion classification algorithms, based on popular methods. Here an intelligent system first performs feature extraction based on oppositional particle swarm optimization algorithm (OPSO). These reduced features are then fed to HFFPNN for training and testing on NSL-KDD dataset. HFFPNN is a hybridization of feed forward neural network (FFNN) and probabilistic neural network(PNN). Pre-processing of NSL-KDD dataset has been done to convert string attributes into numeric attributes before training. This system then behaves intelligently to classify test data into attack and non-attack classes. The aim of the feature reduced system is to achieve same degree of performance as a normal system. Comparison of proposed method with feature reduction is done in terms of various performance metrics. Comparisons with recent and relevant approaches are also tabled. Experimental results show the prominence of HFFPNN technique over the existing techniques in terms of intrusion detection classification. Therefore , the scope of this study has been expanded to encompass hybrid classifiers.

GRD JOURNALS

With popularization of internet, internet attack cases are also increasing, thus information safety has become a significant issue all over the world, hence Nowadays, it is an urgent need to detect, identify and hold up such attacks effectively [1]. In this modern world intrusion occurs in a fraction of seconds and Intruders cleverly use the adapted version of command and thereby erasing their footprints in audit and log files. Successful IDS intellectually differentiate both intrusive and nonintrusive records. Most of the existing systems have security breaches that make them simply vulnerable and could not be solved. Moreover substantial research has been going on intrusion detection system which is still considered as immature and not a perfect tool against intrusion. It has also become a most priority and difficult tasks for network administrators and security experts. So it cannot be replaced by more secure systems [2].

The security of a computer system is compromised when an intrusion takes place. The popularization of shared networks and Internet usage demands increases attention on information system security. Importance of Intrusion detection system (IDS) in computer network security well proven. Data mining approach can play very important role in developing intrusion detection system. Classification is identified as an important technique of data mining. This paper investigates the possibility of using ensemble algorithms and feature selection to improve the performance of network intrusion detection systems.

Ketan Desale

In today's era, network security has become very important and a severe issue in information and data security. The data present over the network is profoundly confidential. In order to perpetuate that data from malicious users a stable security framework is required. Intrusion detection system (IDS) is intended to detect illegitimate access to a computer or network systems. With advancement in technology by WWW, IDS can be the solution to stand guard the systems over the network. Over the time data mining techniques are used to develop efficient IDS. Here,a new approach is introduced by assembling data mining techniques such as data preprocessing, feature selection and classification for helping IDS to attain a higher detection rate. The proposed techniques have three building blocks: data preprocessing techniques are used to produce final subsets. Then, based on collected training subsets various feature selection methods are applied to remove irrelevant & redundant features. ...

Vaibhav Khatavkar

Supervised learning algorithms for Intrusion Detection need labeled data for training. Lots of data is available through the internet, network and host, but this data is unlabeled. Obtaining labeled data requires human expertise, which is costly. This is the main hurdle in developing supervised intrusion detection systems. We can intelligently use both labeled and unlabeled data for intrusion detection. Semi-supervised learning has attracted the attention of researchers working on Intrusion Detection using machine learning. Our goal is to improve the classification accuracy of any given supervised classifier algorithm by using the limited labeled data and large unlabeled data. The key advantage of the proposed semi-supervised learning approach is to improve the performance of the supervised classifier. The results show that the performance of the proposed semi-supervised algorithm is better than the state-of-the-art supervised learning algorithms. We compare the performance of ou...

In today’s world data is rapidly and continuously growing and is not constant in nature. There is a problem to deal with such kind of evolving data, because it is impractical to store and process this streaming data. Also, in real world application, the stream of data coming is typically noisy, has some missing values, repeated features, and thus very large time is wasted to process that data. The time complexity can reduce by selecting only useful features to build model for classification. The proposed system takes into consideration the issue of adaptive preprocessing for streaming data. Here Genetic algorithm (GA) is used as a search method while selecting the features which will further use in learning model. GA alongwith selective windowing strategy is the proposed system. The proposed system is applied to different stream datasets and, also compared with existing preprocessing technique PCA, is showing significant improvement in classification accuracy. Keywords— Genetic Algo...

Today, with the advent of internet, everyone can do information exchange and resource sharing. Even business organization and government agencies are not behind in this move to reach users for their decision making and for business strategies. But at the same time, with ease of use and availability of various software tools, breaching and penetrating into other‟s network and confidential credential can be done by any individual with little knowledge expertise and hence the internet attacks are rise and are main concerns for all internet users and business organizations for internal as well as external intruders. Even, existing solutions and commercial Intrusion Detection Systems (IDSs) are developed with limited and specific intrusion attack detection capabilities without any prevention capabilities to secure vital resources of the information infrastructure. So, this paper explores the details about the implementation and experimental analysis of Advanced Intrusion Detection System...

IJMRAP Editor

IDS is a procedure of observing the events happening in a computer system or network and analyzing them for indication of a conceivable potential event which is an infringement or inevitable dangers of infringement or computer security approaches or standard security strategies of adjacent threats. MATLAB 2018a tool is used for the implementation on a NSL-KDD dataset. The motivation behind this investigation is to detect the attack. This paper, deals with the evaluation of data mining based machine learning algorithms viz. Fuzzy C-Means and Fuzzy Possibilistic C-Means clustering algorithms to identify intrusion over NSL-KDD dataset for effectively detecting the major attack categories i.e. DoS, R2L, U2R and Probe.

Journal of Computer Science IJCSIS

Intrusion Detection (ID) is one of the most challenging problem in today's era of computer security. New innovative ideas are used by the hackers to break the security, hence the challenge for developing better ID systems are increasing day-by-day. In this paper, we applied the Artificial Immune System (AIS) based classifiers for intrusion detection. Each classifier is evaluated based on high accuracy and detection rate with low false alarm rate. The results are compared using percentage split (80%) and cross validation (10 fold) test options basing on two nominal target attributes i.e type of attacks and protocol types having 5 and 3 sub-classes respectively. The results of the experiment in this paper proposes CSCA (clonal selection classification algorithm) to be a better AIS based classifier for applying in network based Intrusion Detection System(IDS).



Using Data Mining Technology to Solve Classification Problems: A Case Study of Campus Digital Library

The Electronic Library

ISSN : 0264-0473

Article publication date: 1 May 2006

Purpose

Traditional library catalogs have become inefficient and inconvenient in assisting library users. Readers may spend a lot of time searching library materials via printed catalogs. Readers need an intelligent and innovative solution to overcome this problem. The paper seeks to examine data mining technology, which is a good approach to fulfill readers' requirements.

Design/methodology/approach

Data mining is considered to be the non‐trivial extraction of implicit, previously unknown, and potentially useful information from data. This paper analyzes readers' borrowing records using the techniques of data analysis, building a data warehouse, and data mining.

Findings

The paper finds that after mining the data, readers can be classified into different groups according to the publications in which they are interested. Some people on the campus also have a greater preference for multimedia data.

Originality/value

The data mining results show that all readers can be categorized into five clusters, and each cluster has its own characteristics. The frequency with which graduates and associate researchers borrow multimedia data is much higher. This phenomenon shows that these readers have a higher preference for accepting digitized publications. Also, the number of readers borrowing multimedia data has increased over the years. This trend indicates that readers' preferences are gradually shifting towards reading digital publications.

  • Digital libraries
  • Electronic publishing
  • Knowledge mining

Chang, C. and Chen, R. (2006), "Using data mining technology to solve classification problems: A case study of campus digital library", The Electronic Library , Vol. 24 No. 3, pp. 307-321. https://doi.org/10.1108/02640470610671178

Emerald Group Publishing Limited

Copyright © 2006, Emerald Group Publishing Limited


  • DOI: 10.1007/978-3-319-91192-2_21
  • Corpus ID: 51892318

Classification, Clustering and Association Rule Mining in Educational Datasets Using Data Mining Tools: A Case Study

  • Sadiq Hussain , R. Atallah , +1 author J. Hazarika
  • Published in Computer Science On-line… 25 April 2018
  • Computer Science, Education



  • Open access
  • Published: 13 September 2024

The government intervention effects on panic buying behavior based on online comment data mining: a case study of COVID-19 in Hubei Province, China

  • Tinggui Chen, Yumei Jin, Bing Wang & Jianjun Yang

Humanities and Social Sciences Communications, volume 11, Article number: 1200 (2024)


  • Science, technology and society
  • Social policy

At the end of 2019, the world grappled with an unparalleled public health crisis due to the COVID-19 pandemic, which also precipitated a global economic downturn. Concurrently, panic buying of materials occurred frequently. To restore benign market order, the government instituted a series of interventions to stabilize the market. This study evaluates the tangible impact of these governmental measures in Hubei, China, a region that found itself at the very epicenter of the epidemic in its onset phase. Existing papers often employ structured questionnaires and structural equation methods, with small samples and limited effective information. In contrast, we used a dataset of tens of thousands of entries and employed text analysis to maximize the extraction of valid information. Through a meticulous analysis of public feedback, our findings unveil several pivotal insights: (1) News measures concerning the supply of materials have the best effect; their effectiveness, in descending order, is ranked as: material sufficiency > authority effect > market supervision > appeal and guidance. (2) Government measures during the epidemic's initial phase exhibited a delay: large-scale panic buying had already formed after the lockdown measures, while the relevant news about material supplies was released only later. (3) A dual approach combining authority influence with material sufficiency yielded the most favorable results. In light of these findings, the paper concludes with tailored recommendations aimed at amplifying the efficacy of government-led public opinion interventions in future crises.

Introduction

Following the emergence of COVID-19, many newspapers from different countries published photos of barren supermarket shelves, underscoring the shortage of food and essentials (Lufkin, 2020; Nicholson, 2020). These reports usually present this phenomenon as 'panic buying' (Nicholson, 2020). The social and psychological reactions of the general public to new outbreaks of infectious diseases, such as SARS, the H1N1 pandemic, and the Ebola virus, often include sentiments such as fear, anxiety, and depression (Maunder et al., 2003; Sim et al., 2010). In fact, research has shown that only a small number of people purchase a large amount of goods (for example, only 3% of people purchased an excessive amount of pasta (Kantar, 2020)). Such data suggest that while many people anticipate and act on potential shortages during crises, only a fraction engage in extreme purchasing, which is different from compulsive purchasing disorders (Sharma et al., 2020). However, after this purchasing behavior occurs, the resulting social problems are extremely serious, leading to supply chain collapses and market disarray. At this point, expedient government intervention is extremely important. The purpose of this paper is to explore how the government can swiftly intervene, thereby mitigating the social impact after such panic buying behavior occurs.

At present, several scholars have found that government intervention plays a very important role in disease prevention and control (Zhao et al., 2020 ). However, there is a noticeable gap in the literature in regard to addressing panic buying during the COVID-19 outbreak, especially during its early stages. Such an investigation holds significant academic value, given that swift and appropriate government actions can effectively reduce its negative impact on society. Therefore, it is highly important to study the influence of government intervention on panic buying during the early outbreak of COVID-19. For example, Li and Dong ( 2022 ) developed a game theoretic supply chain model to assess the impact of government regulation on the shortage of life-saving materials and profits within the supply chain (Liu et al. 2022 ). Olowookere et al. ( 2022 ) emphasized that the government should comprehensively help people overcome the difficulties resulting from epidemics, particularly vulnerable populations. Cariappa et al. ( 2022 ) proposed a fundamental panic buying intervention, i.e., starting from agriculture to build public confidence. Taylor ( 2022 ) summarized the intervention experience. An examination of current intervention research reveals multifaceted analytical perspectives. However, many of these approaches lack the immediacy required for swift responses, rendering them more suitable for post crisis management rather than urgent interventions. In addition, public opinion cannot be accurately reflected, and most related research methods rely heavily on model simulation (Rajkumar and Arafat, 2021 ); thus, actual public data are lacking. Therefore, it is necessary to conduct mining and analysis based on netizens’ online comments on government measures to assess the effectiveness of implementing various government intervention measures. By identifying the most impactful factors on intervention outcomes, this approach aims to shape more effective strategies for future incidents.

Given this background, the cardinal objective of this research is to evaluate the impact of government intervention measures on panic buying during the COVID-19 pandemic by mining and analyzing online comments dispersed across cyberspace. In addition to merely assessing the tangible outcomes of these interventions, this study aims to determine the actual effectiveness of government intervention measures, gain deeper insights into public perceptions and attitudes toward these measures, analyze the emotional trends of the public under different government interventions, and provide valuable insights for formulating more effective strategies in the future. The research methodology will involve semantic network analysis, sentiment analysis, and LDA modeling for categorization and exploration of implementation effects. Additionally, a multiple regression model will be used to analyze the factors influencing the intervention effects. The findings of this study will contribute to enhancing intervention efficiency, mitigating the negative consequences of panic buying, and promoting the normalization of market order.

Materials and methods

Literature materials

Our document retrieval in Google Scholar was carried out with “panic buying” and “intervention” as the main search fields, yielding a wealth of studies addressing various facets of panic buying, encompassing its causes, prevention methods, and effect evaluation. Many studies have analyzed the causes of panic buying via various methods. Prevention methods and effect evaluation have also been studied, but the methods used are relatively simple. At present, the analysis of intervention effects has been conducted mainly by constructing regression models (Si et al., 2020) and PMC index models (Chen et al., 2021). Billore and Anisimova (2021) summarized the relevant research of the past 20 years, upon which we have also based our considerations. This section provides a comprehensive literature review, traces the trajectory from the root causes of panic buying to preventive methods, and explores the evaluation of intervention outcomes. This approach ensures a thorough understanding of the complexities surrounding panic buying.

The research literature elucidates the genesis of panic buying by examining it predominantly through the lens of psychological triggers and the sway of external media. Psychological factors such as perceived scarcity and fear of the unknown contribute to anxiety and panic buying behavior (Yuen et al., 2020 ; Omar et al., 2021 ; Taylor, 2021 ). This behavior can create a cycle of increased anxiety (Prentice et al., 2022 ). Social influences such as norms and observational learning also play a role in amplifying perceptions of scarcity and triggering panic buying (Yuen et al., 2022 ). External media, particularly social media, also significantly influences panic buying. Expert opinions and official communications during public health crises can trigger panic buying (Naeem, 2021 ). Excessive exposure to information on social media intensifies perceived scarcity and purchase intentions (Islam et al., 2021 ). Government and corporate interventions can mitigate panic buying while influencing social dynamics (Prentice et al., 2021 ). Research methods include correlation analysis (Lins and Aquino, 2020 ; Bentall et al., 2021 ), qualitative analysis (Taylor, 2021 ), and statistical simulations (Fu et al., 2021 ) to explore the causes and dynamics of panic buying during crises. Communication patterns during disasters also impact panic buying behaviors (Arafat et al., 2022 ). De Brito Junior et al. (2023) found some results that may assist policymakers in introducing public policies and managing resources during a crisis that requires social distancing and lockdowns. In general, the examination of panic buying’s root causes is characterized by diverse methodologies and well-substantiated conclusions, suggesting a mature body of research in this domain.

The intervention strategies outlined in the literature can be broadly categorized into three types: psychological, market-based, and network monitoring. Lei et al. (2020) used the SAS and SDS to measure the increased anxiety and depression rates in affected populations, stressing the need for government initiatives in economic aid, medical support, and psychological intervention. Bermes (2021) used structural equation modeling to suggest improving consumer resilience and modifying consumers’ information environment. Ho et al. (2020) emphasized psychological support, highlighting the vulnerability of people exposed to epidemics. Mukhtar (2020) reviewed past epidemics to develop crisis intervention plans. Roy et al. (2020) and Zheng et al. (2020) stressed mental health care and psychological panic reduction, respectively. Wu (2009) focused on network emergency management. Tsao et al. (2019) advocated retail strategy changes to ease market pressure. Ling et al. (2020) and Sahin et al. (2020) proposed social and economic system adjustments. Boyacı-Gündüz et al. (2021) underscored food system resilience amid population growth. This review reveals diverse intervention approaches and highlights gaps in multidimensional studies combining psychology and market dynamics.

With regard to assessing the outcomes of diverse interventions for panic buying, several studies have examined this topic further. Prentice et al. (2020) noted that countries typically employ regular interventions during such events; using Twitter data for Australia, they found through semantic analysis that government measures aligned with panic buying periods. Arafat et al. (2021a) discussed historical perspectives and crisis prevention planning, integrating sociology, marketing, and industrial purchasing. Prentice et al. (2021) studied the impacts of government, business, and social groups on panic buying, highlighting the roles of government and business over social groups. Mao et al. (2022) developed a dynamic game model showing that government interventions could control the duration of panic buying. Rajkumar (2021) proposed a biological-psychological-social model, suggesting measured punitive actions, responsible media reporting, and social contact to mitigate panic buying. Niu et al. (2021) emphasized targeted interventions based on survey data. Fast (2014) used network analysis to predict social responses to disease outbreaks. In a series of papers (2020a; 2020b; 2020c), Arafat et al. analyzed media reports on panic buying. However, these studies did not adequately capture the public’s reactions to these events, even though public comments contain a great deal of valuable information. In addition, Arafat et al. (2021b) introduced panic buying intervention measures in their book, but these were mostly summary descriptions and did not analyze changes in public sentiment. The above literature analysis shows that existing studies mostly rely on structural equations, questionnaires, and mathematical modeling but rarely mine information from people’s online comments.

In addition, empirical investigations into panic buying in China have produced insightful findings. Prentice et al. (2021) and Islam et al. (2021) partially used samples from China in their analyses. Wang and Na (2020) applied a multivariate statistical model to study rational and irrational motives for food hoarding by aggregating online samples from three Chinese cities; the results confirmed the occurrence of both rational and irrational food hoarding. Fu et al. (2022) utilized mathematical modeling techniques to analyze the efficacy of panic buying interventions and validated the model using Chinese panic buying data. In addition, Yang et al. (2022) conducted a survey of 517 participants who experienced panic buying during the Omicron wave in China. Their findings revealed connections between media exposure, perceived emotional risk, stakeholder perception, protective awareness, and panic buying behavior. Research on panic buying in China has yielded promising results, but in-depth research is still needed.

In summary, structured questionnaires and structural equation methods may inadvertently overlook much valid information. In contrast, online comments are relatively unconstrained and can capture emotional changes and actual needs immediately and from multiple dimensions. Therefore, more effective information can be obtained by mining people’s online comments on government interventions. Accordingly, this paper analyzes data crawled from online comments under different government interventions and further uses the LDA topic extraction model to determine the effects of the interventions, which has strong practical significance. The LDA review by Jelodar et al. (2019) covers research on LDA from 2003 to 2016, revealing its application across fields such as software engineering, political science, medicine, and linguistics. Moreover, LDA has been applied in communication research (Maier et al., 2021) and artificial intelligence (Yu and Xiang, 2023), highlighting its broad applicability.

LDA (latent Dirichlet allocation) model

The LDA model, introduced by Blei et al. (2003), addresses certain limitations inherent in traditional text analysis methods. Classical methods gauge the correlation between two documents by counting shared words and employing metrics such as term frequency (TF) and term frequency-inverse document frequency (TF-IDF), but they often overlook deeper semantic connections. Such methods merely scratch the surface, focusing on word frequency without diving into the underlying themes and associations. The LDA method can therefore better determine the relationships among comments. This section analyzes the effect of government intervention measures on preventing panic buying by using the powerful topic extraction function of LDA.

The similarity adaptive method is used to find the optimal number of LDA topics; this method does not require manual tuning of the topic number and achieves a small number of iterations, high operational efficiency, and high accuracy. The specific steps are as follows.

(1) Randomly select an initial topic number K to obtain an initial model, and calculate the similarity between topics, i.e., the average cosine value cos θ (i represents the dimension and K represents the number of topics). The specific calculation formula is as follows:

$$\cos \theta =\frac{\sum _{i=1}^{n}{X}_{i}{Y}_{i}}{\sqrt{\sum _{i=1}^{n}{X}_{i}^{2}}\sqrt{\sum _{i=1}^{n}{Y}_{i}^{2}}}$$

where θ represents the angle between X and Y, and X and Y represent two n-dimensional vectors, i.e., X is represented by \(({X}_{1},\,{X}_{2},\ldots ,{X}_{n})\) and Y is represented by \(({Y}_{1},{Y}_{2},\ldots ,{Y}_{n})\). A larger cosine value indicates that the texts are more similar and are grouped into the same category during topic extraction.

(2) The model is trained again by increasing or decreasing the value of K, and the average cosine similarity between topics is recalculated.

(3) Step (2) is repeated until the optimal K value is obtained, i.e., the iteration terminates. At this point, K is the optimal number of topics to extract.
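As an illustration of how this similarity criterion might be computed, the following Python sketch trains a candidate model with Gensim and averages the pairwise cosine similarities between topic-word distributions; the library choices, function names, and variable names are our own assumptions rather than part of the original procedure.

```python
# Sketch: choose the LDA topic number K by minimizing average inter-topic cosine similarity.
# Assumes tokenized documents in `texts`; library choices (gensim, scikit-learn) are ours.
from gensim import corpora
from gensim.models import LdaModel
from sklearn.metrics.pairwise import cosine_similarity

def avg_topic_similarity(texts, k):
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = LdaModel(corpus, num_topics=k, id2word=dictionary, random_state=1)
    topic_word = lda.get_topics()          # one row of word probabilities per topic
    sims = cosine_similarity(topic_word)
    n = sims.shape[0]
    # average the off-diagonal entries (pairwise similarities between distinct topics)
    return (sims.sum() - n) / (n * (n - 1))

# Scanning candidate K values and keeping the lowest average similarity mirrors steps (2)-(3):
# best_k = min(range(2, 10), key=lambda k: avg_topic_similarity(texts, k))
```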

Text sentiment analysis methods

Semantic networks primarily analyze the relationships between sentences. By plotting the relationships between evaluation targets and their respective opinions, they aid in visually analyzing the attributes among evaluation targets. ROSTCM5.8 software is used to generate semantic network graphs for the four intervention categories. By constructing semantic networks, potential connections and hidden information between evaluation targets and evaluation opinions can be uncovered; these are primarily represented through directed edges and nodes. Edges represent connections between nodes, while nodes represent individuals or events.
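The semantic networks in this study are produced with ROSTCM5.8; purely to illustrate the underlying co-occurrence idea, a comparable network could be sketched in Python with networkx as follows (the tokenization, threshold, and function names are our assumptions, not the ROSTCM procedure).

```python
# Sketch: approximate a semantic (co-occurrence) network from segmented comments.
# This illustrates the idea only; it is not the ROSTCM5.8 procedure used in the paper.
from collections import Counter
from itertools import combinations
import networkx as nx

def build_cooccurrence_network(segmented_comments, min_count=5):
    """segmented_comments: list of lists of words (one list per comment)."""
    pair_counts = Counter()
    for words in segmented_comments:
        for w1, w2 in combinations(sorted(set(words)), 2):
            pair_counts[(w1, w2)] += 1

    g = nx.Graph()
    for (w1, w2), count in pair_counts.items():
        if count >= min_count:            # keep only frequent co-occurrences as edges
            g.add_edge(w1, w2, weight=count)
    return g

# g = build_cooccurrence_network(comment_tokens)
# nx.degree_centrality(g) can then highlight the most connected evaluation targets.
```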

Sentiment analysis involves analyzing the sentiment of each sentence. A paragraph of text is input and processed with Python code, and the emotional orientation of the text is returned automatically, along with a score indicating whether it is a positive endorsement or a negative critique. This functionality is known as text sentiment analysis, also referred to as opinion mining. It involves the collection, processing, analysis, summarization, and inference of subjective texts with emotional tones, spanning research fields such as artificial intelligence, machine learning, data mining, and natural language processing. Text sentiment analysis plays a crucial role in today’s information industry: it dissects hot-button events emotionally, identifies the reasons behind emotions, aids governments in understanding public sentiment, and helps prevent harmful events from occurring.

Multiple regression analysis

Toward the end of the article, we employ multiple regression analysis to further explore the factors influencing people’s emotional tendencies. Multiple regression is a statistical method used to study the relationship between a dependent variable and multiple independent variables. It helps us understand the impact of multiple independent variables on the dependent variable and quantifies the magnitude of these impacts. In multiple regression, we initially assume a linear relationship between the dependent variable and the independent variables and establish a mathematical model to describe this relationship; we then estimate and analyze this model.

Background material

The intervention of government departments can alleviate public panic and reduce adverse social impacts. However, little work has been done to sort out intervention measures and evaluate their effects, and the sorting of government intervention plans during the initial outbreak of the epidemic in Hubei Province is particularly limited. Therefore, this section delves into the backdrop of panic buying, examines official intervention mandates, categorizes the government’s countermeasures against panic buying, and subsequently assesses their effectiveness.

Event background

At the onset of the epidemic, the government lacked corresponding management experience, resulting in various practical issues. By retrospectively analyzing the initial control plans and their shortcomings, our study can glean valuable lessons to enhance future management strategies and improve overall responsiveness. Therefore, this paper takes panic buying in Hubei, China, during the initial period of the epidemic as an example to evaluate the effectiveness of the intervention measures. First, we meticulously chart the progression of the epidemic in Hubei Province, aligning it with a chronological framework. Key milestones and pivotal events are itemized, providing a clear and sequential depiction of the unfolding situation. The trajectory of event progression is shown in Fig. 1 , which presents a flow chart capturing the sequence and development of events during the epidemic in the province.

figure 1

Epidemic development timeline in Hubei Province.

Among them, “Shuanghuanglian” is a traditional Chinese patent medicine, a simple preventive remedy that was once claimed to inhibit COVID-19. The “one yuan dish” was a vegetable dish sold at one yuan per pound, launched during the epidemic in Wuhan to stabilize the market and reassure the public.

As illustrated in Fig. 1, the epidemic in Hubei Province arose rapidly, prompting immediate and decisive actions from the government. Notably, the interval between the discovery of human-to-human transmission and the “closure of the city” was less than 3 days. Lockdown measures effectively stemmed the large-scale spread of the epidemic. However, in the lockdown’s aftermath, the public swiftly reacted with a spree of panic buying at supermarkets. In response, the government sought to reassure the public with announcements regarding ample supplies while also enforcing market regulations. However, government intervention showed a notable lag. Panic buying had already taken root by the time the city was closed, and there was a further delay before official assurances regarding supply sufficiency were broadcast. Such delays had lasting consequences. Ideally, news about sufficient supplies should be synchronized with, if not precede, actions such as city closures to better manage public panic and reduce market strain.

Sorting out intervention measures for panic buying

Google’s search function is powerful and can directly show changes in item demand. However, this article analyzes the demand for drugs and other goods based on Weibo data for three reasons. First, Weibo is a platform based on user relationships that is used for information sharing, dissemination, and acquisition (the earliest and most famous microblogging platform is Twitter from the United States). According to official reports, as of the end of 2023, the daily active users of Weibo had reached 260 million, and Weibo has become an important channel for the public to express their opinions; the government also releases reports on relevant measures through Weibo. For drugs and goods that are popular or have regional characteristics in China, sentiment mining of Weibo comments can more accurately reflect changes in demand. Second, Weibo comments contain rich and detailed personal experiences and emotional expressions, such as users’ feelings about products and their effects and side effects, providing more in-depth information for sentiment mining; Google Trends data, in contrast, are more macroscopic and general. Finally, Weibo comments are strongly socially interactive: replies and discussions among users lead to the collision and dissemination of viewpoints, revealing the deep-seated reasons and potential influencing factors behind demand, whereas Google Trends is based on search data and lacks the rich information brought by this social interaction. In addition, compared with structured data such as questionnaires and interviews, online comment data are more spontaneous and expansive, and their volume is large, often dozens, hundreds, or even thousands of times that of a questionnaire, which offers a richer mine of actionable insights. Online review data have been used for data mining to optimize management plans in many fields, such as using online reviews to evaluate hotel management (Guo et al., 2022) and commodity reviews to optimize product functions (Song et al., 2021; Lan et al., 2020); these applications have been shown to draw effective conclusions and promote industrial development.

Therefore, to better understand public sentiment toward government interventions, this study uses “panic buying”, “snap purchase”, “masks”, “materials”, and other search keywords on Weibo to screen out related intervention reports from CCTV News, People’s Daily Online, People’s Daily, and other official media.

Panic buying has several distinct characteristics. It is typically triggered by factors such as concerns regarding the shortage of supplies, panic over ineffective market regulation, and unease resulting from the absence of authoritative information. Research has found that adequate supplies have a major impact on panic buying (Fu et al., 2022; Lins and Aquino, 2020). During the COVID-19 pandemic, local governments disclosed their supplies of materials such as grains, vegetables, and masks; for instance, one city announced that its grain reserves could last more than half a year, which helped avert panic buying and stabilized the market. The authority effect also influences panic buying (Zhang et al., 2020; Naeem, 2021). When rumors circulated, experts were invited to explain the facts; for example, when a drug was rumored to treat COVID-19, experts clarified the situation and eliminated doubts. The government’s regulatory measures can also curb panic buying, and Keane and Neal (2021) argued that such measures are crucial for stability. In contrast, initiative guidance (Mao et al., 2022) is a non-mandatory measure that guides the public through information.

We divided the Weibo topics based on these characteristics and relevant research. We initially conducted keyword searches such as “authority effect”, “adequate supplies”, “market regulation”, and “active guidance” to filter relevant Weibo topics in Fig. 2 . For instance, in the market regulation category, we specifically searched for Weibo topics related to “panic buying” and “market regulation”, and manually selected those with a significant amount of data for analysis. The specific categories and classification descriptions are shown in Table 1 .

figure 2

Proportion of each category.

By selecting representative Weibo topics with relatively hot discussions among the four categories, a total of 14 topics were screened out, and Python was used to crawl the corresponding public comment data to further analyze their perceptions. A total of 84,534 comments were crawled. The Weibo topics and comment volumes are summarized in Table 2 , where the Weibo topics are marked with “#”.
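Purely as an illustration of how such a crawl could be organized, the sketch below pages through a comment endpoint and collects the fields used later in the analysis; the URL, parameters, and JSON field names are placeholders, not Weibo’s actual interface.

```python
# Sketch of the comment-crawling step; the endpoint, parameters, and JSON fields are
# placeholders for illustration and do not reflect Weibo's actual interface.
import time
import requests

COMMENT_API = "https://example.com/api/comments"   # placeholder endpoint

def crawl_topic_comments(topic_id: str, pages: int = 10) -> list[dict]:
    comments = []
    for page in range(1, pages + 1):
        resp = requests.get(COMMENT_API, params={"topic": topic_id, "page": page}, timeout=10)
        resp.raise_for_status()
        for item in resp.json().get("comments", []):    # placeholder field names
            comments.append({"user": item.get("user"),
                             "region": item.get("region"),
                             "time": item.get("time"),
                             "text": item.get("text")})
        time.sleep(1)    # pause between requests
    return comments
```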

Utilizing data scraped from comments on official Weibo posts allows us to understand in a timely manner netizens’ acceptance of and emotional preferences toward governmental announcements, thereby enabling us to better grasp public opinion and inform subsequent work. In this section, the crawled data are cleaned, and semantic network analysis and sentiment analysis are used to further explore netizens’ perceptions of government intervention measures. The semantic network clearly shows the connections between topics and subjects, which helps us observe the basic information in the comments. Sentiment analysis delves deeper, unearthing the underlying emotional biases in the comments and aiding us in assessing the effectiveness and reception of the related governmental texts.

Selecting and cleaning comments

Given the vast internet penetration and extensive user base, netizens voice diverse opinions while maintaining a semblance of moral integrity, resulting in intricate and multifaceted feedback. To best harness these data, we employ Python for preliminary processing, ensuring that only valid comments are retained. This approach allows the subsequent analysis results to more closely reflect the real situation. The data cleaning rule in this section is to delete invalid content while retaining comment information to the greatest extent possible. The main steps are as follows:

(1) Nonessential content, such as punctuation, place names, personal names, modal particles, and advertisements, is eliminated.

(2) Abbreviation-style comments such as “xswl” and “hhh”, which cannot be accurately identified during sentiment analysis, are eliminated.

(3) Regular expressions are used to remove invalid content, including URLs, links, and “reply to @XXX” prefixes, from each comment, and the remaining valid content is retained (a minimal sketch of this step is shown after the list).

(4) English expressions are translated into Chinese.
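A minimal sketch of the regex-based step (3), with illustrative patterns of our own choosing, might look as follows.

```python
# Sketch of the regex-based cleaning step; the patterns are illustrative assumptions.
import re

URL_PATTERN = re.compile(r"https?://\S+")               # URLs and links
REPLY_PATTERN = re.compile(r"回复\s*@[\w\-]+\s*[:：]?")    # "reply to @XXX" prefixes

def clean_comment(text: str) -> str:
    text = URL_PATTERN.sub("", text)
    text = REPLY_PATTERN.sub("", text)
    return text.strip()

# Comments that become empty after cleaning would be discarded as invalid.
```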

After cleaning, a total of 80,280 comments remained, among which the proportion of each category is shown in Fig. 2 .

As depicted in Fig. 2, the public is more enthusiastic about material news, with comments accounting for 44.27%, while the proportion of the authority effect category is lower. This is mainly because, regarding panic buying, authoritative figures such as Zhong Nanshan tend to focus more on the epidemic itself than on the social problems it causes. (Zhong Nanshan is a respiratory disease expert and a key figure leading the fight against the COVID-19 pandemic.)

Semantic analysis based on online comments

The abovementioned cleaned comment data can be visually analyzed to obtain additional essential information.

Semantic network analysis of initiative guidance

Semantic network analysis is performed on the Weibo comments related to initiative guidance, and the results are shown in Fig. 3. The top 30 words and their frequencies are calculated to validate the results of the semantic network analysis, as shown in Table 3.

figure 3

Semantic network of initiative guidance.

As shown in Fig. 3, people’s attention was primarily focused on face masks. Despite government opposition to hoarding, the term “face mask” appeared a staggering 4281 times, and mask prices spiked; both offline and online pharmacies experienced shortages. Additionally, “hoarding” appeared 781 times and “shortage” 676 times. People’s emotions were fragile during the pandemic, making them prone to overinterpreting official statements. For instance, a Weibo repost by the People’s Daily claimed that Shuanghuanglian could suppress the novel coronavirus, triggering panic buying of this traditional Chinese medicine. Although officials later clarified that Shuanghuanglian is not a treatment and urged people to stop panic buying, the clarification had limited impact, and there were reports of worsened conditions due to self-administration. Overall, government guidance was ineffective, and there were instances of careless communication; officials should exercise greater caution to avoid unnecessary misunderstandings during crises. It is noteworthy that toilet paper appeared approximately 200 times, which aligns with reality. Research (Garbe et al., 2020) suggests that toilet paper can provide a sense of security, highlighting its unique prominence in the panic buying events triggered by this pandemic. This calls for reflection.

Semantic network analysis of market regulation

The comment data of market regulation are integrated to generate a semantic network, as shown in Fig. 4 , and the frequencies of the words are shown in Table 4 .

figure 4

Semantic network of market regulation.

Figure 4 highlights a significant concern during the epidemic: rising prices, especially for face masks. The analysis revealed that masks were becoming more expensive and of lower quality. Public discourse centered mainly on mask issues, with “masks” mentioned 6058 times. Terms such as “secondary”, “second-hand”, “fake”, and “black heart” indicate the presence of recycled or counterfeit masks, with “second-hand” appearing 342 times. Reusing masks reduces their effectiveness and increases the risk of cross-infection, posing significant harm and negative social impacts. Additionally, terms such as “severe punishment”, “calling”, and “deserving it” show public support for government actions and active participation in reporting price gouging; “severe punishment” was mentioned 380 times and “deserving it” 372 times, reflecting strong public sentiment. Moreover, alongside rising mask prices, living essentials also saw price hikes, with “price increase” mentioned 1205 times, indicating widespread concern. In conclusion, the analysis combines word frequency and semantic networks to highlight robust public support for and participation in regulatory measures. However, it also underscores significant regulatory loopholes, suggesting that regulation should not only focus on pricing but also consider public sentiment.

Semantic network analysis of the authority effect

The comment data of the authority effect are integrated to generate a semantic network, as shown in Fig. 5 , and the frequencies of the words are shown in Table 5 . (Li Lanjuan is an expert in infectious diseases and is one of the spokespersons representing the fight against the epidemic.)

figure 5

Semantic network of the authority effect.

Figure 5 shows that when a person’s prestige is particularly high, his or her statements carry more influence than those of the general populace and more easily gain trust. Therefore, rational use of the authority effect can effectively improve the efficiency of intervention. During the epidemic, Zhong Nanshan, Li Lanjuan, and other authoritative figures drew the public’s attention. The senior academicians Zhong Nanshan and Li Lanjuan, who fought on the front line of the epidemic, are knowledgeable and respected and enjoy high public prestige, so the public is more willing to believe and obey when their statements are officially released. From the word frequencies in Table 5, “Zhong Nanshan” and “Li Lanjuan” appeared 703 and 317 times, respectively, with “Zhong Nanshan” ranking first, which shows the attention the public gives to such public figures. From the semantic network of the authority effect, it is not difficult to find that people’s comments are more positive: the figures are described as “cute” (451 appearances), commenters hope that they will take good “protection” measures (378 appearances), and many express understanding of their work. Most of the comments are positive and express the desire to return home from Hubei Province early. Overall, the authority effect has a significant impact.

Semantic network analysis of sufficient materials

Figure 6 illustrates a semantic network derived from the integrated comment data, with the corresponding word frequencies detailed in Table 6. During the epidemic, the assurance of adequate supplies provided significant comfort to the affected population. The analysis highlighted essential food items such as “cabbage”, “rice”, “beef and mutton”, and “potato”, which were mentioned 501, 332, 434, and 329 times, respectively. Additionally, phrases such as “thank you” (461 mentions), “refueling” (jiayou, an expression of encouragement; 321 mentions), and “waste” (308 mentions) were prominent. These findings indicate gratitude for material support and concern about food waste: people not only appreciate access to essential supplies but also demonstrate conscientiousness toward resource conservation. This mutual support fosters national cohesion. Overall, the positive impact of adequate supplies during the epidemic is evident, with the comments showing positive sentiment and no derogatory connotations.

figure 6

Semantic network of sufficient materials.

Sentiment analysis based on online comments

Sentiment analysis can help us intuitively grasp netizens’ attitudes toward government intervention. The data collected in this paper are obtained from intervention plans for panic buying events at the beginning of the COVID-19 pandemic, mainly in Hubei Province, the center of the epidemic. Comments from Hubei were analyzed separately to observe the actual relationship between the effect of government intervention and the severity of the epidemic in the intervention area.

Data preprocessing

First, Python is used to reprocess the cleaned 80,280 comments, and the Jieba word segmentation package is adopted to segment the Weibo comments. To avoid redundant words and retain effective content, stop words, including “on the other hand”, “here”, and other conjunctions without actual emotional meaning, are removed, increasing the accuracy of the sentiment analysis results. For the regional comparison, 1301 of the original 1878 comments from Hubei remained after this cleaning.
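A minimal sketch of this segmentation and stop-word removal step is given below; the stop-word set shown is a small illustrative subset, not the full list used in the study.

```python
# Sketch of segmentation and stop-word removal with Jieba; the stop-word set is illustrative.
import jieba

STOP_WORDS = {"另一方面", "这里", "的", "了", "在"}   # e.g., "on the other hand", "here", common particles

def segment(comment: str) -> list[str]:
    words = jieba.lcut(comment)                      # segment the comment into words
    return [w for w in words if w.strip() and w not in STOP_WORDS]

# segmented_comments = [segment(c) for c in cleaned_comments]
```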

Subsequently, the geographical origin of user comments becomes the focal point of our analysis, as illustrated in Fig. 7, in which each circle represents a user’s region and a darker color indicates more users. Except for western regions such as Qinghai, Tibet, and Ningxia, comment volume is relatively evenly distributed across the other regions. Thus, even when narrowed down to the 1301 comments from Hubei, the analytical value of these comments remains undiminished.

figure 7

User distribution (dark color represents more users).

Sentiment analysis

Sentiment analysis in Python involves two dictionaries, a sentiment word dictionary and a degree word dictionary, primarily based on the HowNet Chinese sentiment dictionary. The sentiment dictionary is divided into positive and negative emotion words, while the degree word dictionary categorizes words such as “most”, “very”, “more”, “ish”, “insufficiently”, and “inverse”, each assigned a specific weight for degree distinctions (Li et al., 2015). For instance, “most” carries a weight of 2, “very” is weighted at 1.5, and “inverse” (negation) is marked with −1.
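The weighted-dictionary idea described above can be sketched as follows; the tiny lexicons are placeholders for the HowNet-based dictionaries actually used, and the scoring loop is a simplified illustration.

```python
# Sketch of dictionary-based sentiment scoring with degree-word weights.
# The small lexicons below are placeholders; the study uses HowNet-based dictionaries.
POSITIVE = {"感谢", "支持"}                    # e.g., "thank you", "support"
NEGATIVE = {"涨价", "恐慌"}                    # e.g., "price increase", "panic"
DEGREE = {"最": 2.0, "很": 1.5, "不": -1.0}    # "most", "very", "inverse"/negation

def score(tokens: list[str]) -> float:
    total, weight = 0.0, 1.0
    for tok in tokens:
        if tok in DEGREE:
            weight *= DEGREE[tok]          # degree words modify the next sentiment word
        elif tok in POSITIVE:
            total += weight
            weight = 1.0
        elif tok in NEGATIVE:
            total -= weight
            weight = 1.0
    return total                            # >0 positive, <0 negative, 0 neutral
```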

The results of the sentiment analysis are summarized separately for the non-Hubei areas in Fig. 8 and for the Hubei region in Fig. 9. Positive sentiment is highest in the sufficient materials category, constituting 47.55% of favorable reactions, whereas initiative guidance shows the lowest positive sentiment at 25.26%. Conversely, initiative guidance evokes the highest negative sentiment at 38.81%, whereas sufficient materials records the lowest negative sentiment at 19.53%. This pattern indicates a public preference for sufficient-materials interventions over initiative guidance. In terms of neutral emotions, initiative guidance is the most common, suggesting that it polarizes public sentiment, while the authority effect elicits the least neutral response. Notably, negative emotions under initiative guidance significantly outweigh positive emotions, in contrast to the better public perceptions of the other intervention categories.

figure 8

The sentiment analysis results for the non-Hubei areas.

figure 9

The sentiment analysis results for the Hubei region.

The average sentiment values further clarify the dataset’s collective sentiment. The sufficient materials category scores highest for positive sentiment, with a notable score of 2.3. Conversely, the authority effect and market regulation categories have the lowest average negative sentiment scores, both at −2.2, indicating stronger negative public sentiment in these categories despite their positive aspects. This discrepancy may stem from a perceived lack of market regulation or from overly technical authoritative statements.

In sum, the sufficient materials category clearly garners the most public appreciation among the interventions. When these interventions are ranked based on positive public sentiment, the hierarchy is as follows: sufficient materials > authority effect > market regulation > initiative guidance.

As illustrated in Fig. 9, the favorability of people in Hubei, the center of the epidemic, follows the same order: sufficient materials > authority effect > market regulation > initiative guidance, which is consistent with the situation outside Hubei. However, the percentage of positive emotions for all types of interventions is clearly greater than that in non-Hubei areas, indicating that people in high-risk areas are more eager for government control and show greater support for it than people in low-risk areas. In total, 56.48% of the Hubei comments expressed positive emotions toward such interventions, i.e., more than half, and these positive emotions are pronounced.

In terms of the average emotion score, the highest positive score is 2.6 for both sufficient materials and initiative guidance, which is 0.3 higher than the highest score for the non-Hubei areas. The negative sentiment peaks at −2.6 for Hubei, 0.4 lower than in the non-Hubei areas, because people at the center of the epidemic have more intense emotional expressions and are more likely to experience extreme emotions. Given these insights, the government should fully consider the impact of risk level on panic buying behavior and develop intervention measures in conjunction with the risk level of the area.

Analyzing government intervention effects on panic buying using the LDA model

In the preceding analysis, we delved into the fundamental connections within the comment data and discerned emotional inclinations. To gain a deeper understanding of the impact of government intervention measures on panic buying, this section adopts the LDA topic model for semantic mining, further explores the correlations between texts, extracts topics for the four intervention categories, and analyzes the implementation effect within each category. At the same time, relevant variables are selected, and a regression model is used to explore the factors that affect people’s perceptions of government intervention; regression models are often applied for correlation analysis among multiple factors and are appropriate for this purpose.

LDA topic number optimization

The analysis divides comments into four types, extracting positive and negative emotions based on emotion scores, resulting in 8 distinct comment sets. For each category, the average cosine similarity of positive and negative emotions is calculated. Figures 10 – 13 illustrate the findings:

figure 10

Average cosine similarity change in the initiative guidance category.

figure 11

Average cosine similarity change in the market regulation category.

figure 12

Average cosine similarity change in the authority effect category.

figure 13

Average change in the cosine similarity for the sufficient materials category.

Figure 10 shows that for initiative guidance, setting the LDA topic number K to 2 achieves the lowest average cosine similarity for positive comments. For negative comments, K values of 2 or 6 yield the lowest average cosine similarity.

Figure 11 reveals that for market regulation, K = 3 results in the lowest average cosine similarity for positive comments, while K = 4 achieves this for negative comments.

Figure 12 indicates that for the authority effect, K = 2 yields the lowest average cosine similarity for both positive and negative comments.

Figure 13 demonstrates that for sufficient materials, K = 3 or K = 7 achieves the lowest average cosine similarity for positive comments, and K = 3 for negative comments.

These insights guide the optimal selection of K values for each category, facilitating more nuanced topic extraction and sentiment analysis from the comments.

Topic extraction and analysis

The LDA model, also known as a three-tiered Bayesian probability framework, encompasses documents, topics, and words. The model introduces Dirichlet priors, which give it strong generalization ability and make it less prone to overfitting.

Taking the authority effect category (constituting comment set D, with d representing a comment in set D, hereinafter referred to as document d) as an example, the main steps of topic extraction using LDA are as follows:

Step 1: Select a document that has been divided into words and represent it by the word sequence \(d=({w}_{1},{w}_{2},\ldots ,{w}_{n})\), where w represents a word and 1, 2, …, n are the word sequence numbers; a word is selected with prior probability \(P({w}_{i})\). The Dirichlet distribution, parameterized by the hyperparameter α, is used to generate the topic distribution \({\varphi }_{d}\) (i represents the dimension and K represents the number of topics). The distribution formula used is:

$$p({\varphi }_{d}\,|\,\alpha )=\frac{\varGamma \left(\sum _{i=1}^{K}{\alpha }_{i}\right)}{\prod _{i=1}^{K}\varGamma ({\alpha }_{i})}\prod _{i=1}^{K}{\varphi }_{d,i}^{{\alpha }_{i}-1}$$

where \(\varGamma \left(\cdot \right)\) represents the gamma function and K represents the number of topics.

Step 2: Each topic obeys a multinomial distribution, and the topic z of document d is generated by sampling from this topic multinomial distribution, i.e., \(z \sim {\rm{Multinomial}}({\varphi }_{d})\).

Step 3: The word distributions for each topic are based on the Dirichlet distribution.

Step 4: Each word also obeys a multinomial distribution, and the keyword w under the topic is generated from the topic’s word multinomial distribution.

Using Python’s Gensim module, we applied LDA topic modeling to both the positive and negative comment datasets. After determining the optimal topic numbers, we conducted LDA analysis to identify the recurrent words per topic. For negative comments in the “initiative guidance” category, the candidate K values of 2 and 6 were both tested, and excessive word repetition was observed with 6 topics, so 2 topics were retained. The positive comments also yielded 2 topics, as detailed in Table 7.
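A minimal Gensim sketch of this topic-extraction step is shown below; the parameter values and variable names are illustrative assumptions rather than the exact settings used in the study.

```python
# Sketch: extract the most frequent words per topic with Gensim; parameters are illustrative.
from gensim import corpora
from gensim.models import LdaModel

def top_words_per_topic(texts, num_topics, topn=10):
    """texts: list of token lists for one comment set (e.g., positive 'initiative guidance' comments)."""
    dictionary = corpora.Dictionary(texts)
    corpus = [dictionary.doc2bow(t) for t in texts]
    lda = LdaModel(corpus, num_topics=num_topics, id2word=dictionary,
                   alpha="auto", passes=10, random_state=1)
    # show_topic returns (word, probability) pairs for one topic
    return {i: [w for w, _ in lda.show_topic(i, topn=topn)] for i in range(num_topics)}

# e.g., top_words_per_topic(positive_guidance_tokens, num_topics=2)
```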

Positive comments under “initiative guidance” highlighted two main dimensions: (1) trust in official statements and (2) reduced public anxiety and more rational cognition. Negative comments pointed to poor independent thinking and judgment among the public and to misleading official information.

For “market regulation”, 3 topics were extracted for positive comments and 4 for negative comments; the LDA keyword results are detailed in Table 8. Positives focused on timely government interventions (“response”, “timely”, “fast”), robust regulatory measures, and prompt departmental actions. Negatives cited inadequate regulation that allowed problems to recur, flaws in online sales channels (e.g., order cancellations), material safety concerns, and limited regulatory effectiveness. Market regulation remains a challenging, ongoing effort.

Utilizing LDA topic modeling, we analyzed the “authority effect” category with 2 topics each for positive and negative sentiments. Table 9 summarizes the key themes: positive sentiments emphasize trust in authority figures (“know”, “thank you”, “hard work”), and the social influence of authorities, notably “fans”. Negative comments critique those who challenge authority, spread misinformation, and highlight gaps in public understanding of disease, stressing the need for transparent and credible epidemic information.

In the “sufficient materials” category, after comparing K  = 3 and K  = 7 for positive comments, we settled on 3 topics due to lower cosine values indicating distinct topics. Table 10 reveals positive comments focusing on national unity in epidemic response, material security reducing panic, and increased patriotism. Negative sentiments highlight regional food disparities affecting disaster preparedness, excessive hoarding, and panic buying. Effective government supervision is crucial for managing hoarding and ensuring the equitable distribution of resources tailored to regional needs.

Analysis of factors affecting intervention effects based on multiple regression

Government interventions aim to create an environment where people can thrive, with public feedback guiding their refinement. Understanding public perception postimplementation is crucial for optimizing these measures. Emotions serve as a key indicator of public perception, and a regression analysis will explore factors influencing these emotions and inform intervention adjustments. Stepwise regression will help manage independent variables, ensuring that the analysis avoids collinearity issues.

Construction of the multiple regression model

The comprehensive dataset used in this section encapsulates various dimensions, from user-specific parameters such as ID; the number of followers, fans, posts and likes; contextual data such as follow-up comments; the timestamp of the original Weibo post; comment time; geographical attribution; and the core content of the comment, as illustrated in Fig. 14 . An evaluation system for government interventions under COVID-19 based on the literature (Chen et al., 2020 ) is constructed to reflect the effects of different types of interventions. The evaluation system includes 4 first-level indicators and 8 second-level indicators. These 8 indicators are obtained by directly crawling Weibo data or by expanding crawling data. The indicators at all levels and their meanings are summarized in Table 11 .

figure 14

Crawled comments.

The dependent variable Y is obtained from the sentiment analysis to represent people’s perception of government intervention, and a multiple regression model is constructed as follows:

$$Y={b}_{0}+{b}_{1}{X}_{1}+{b}_{2}{X}_{2}+\cdots +{b}_{8}{X}_{8}+\varepsilon$$

where \(\varepsilon\) represents the error term, \({b}_{0}\) represents the constant term, and \({b}_{i}\) (i = 1, 2, 3, …, 8) represents the respective variable coefficients.
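The study estimates this model in SPSS 26; as a rough Python equivalent for readers, the sketch below fits the same form with statsmodels and checks collinearity via VIF. The column names and data file are assumptions for illustration.

```python
# Sketch: estimate the multiple regression with statsmodels (the paper uses SPSS 26).
# Column names and the data file are illustrative assumptions.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("comments_indicators.csv")        # hypothetical file of the 8 indicators and Y
X = df[["X1", "X2", "X3", "X4", "X5", "X6", "X7", "X8"]]
y = df["Y"]                                        # sentiment score from the sentiment analysis

X_const = sm.add_constant(X)
model = sm.OLS(y, X_const).fit()
print(model.summary())                             # coefficients, p values, R-squared

# Collinearity check: VIF < 10 for every predictor, as in the paper
vifs = {col: variance_inflation_factor(X_const.values, i)
        for i, col in enumerate(X_const.columns) if col != "const"}
print(vifs)
```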

Analysis of factors influencing the effect of intervention

Based on the regression model analyzed using SPSS 26, collinearity among independent variables was checked and found to be within acceptable limits (VIF < 10). The significance level for variable inclusion or exclusion was set at 0.1. The initial results in Table 12 indicate that variables X_1 and X_6 have p values less than 0.05, suggesting that their parameters are statistically significant and should be retained in the regression equation. The final regression results in Table 13 show that epidemic severity (X_1) alone explains 60.1% of the variance in public perception (Y), while together with follow-up comments (X_6), they explain 70% of the variance.

The findings indicate that as epidemic severity increases, there is a stronger public inclination toward robust government intervention and a heightened demand for transparent intervention strategies. This aligns with findings by Chen et al. ( 2022 ) that during severe epidemics, public behavior such as panic buying is significantly influenced by government actions. Negative correlations imply that greater public engagement on the topic increases susceptibility to negative emotions, which diminishes gradually over time.

Certain variables, such as the location of the epidemic center, the number of user followers, the number of user blog posts, the timeliness of commenting, and the number of user likes, do not significantly influence the dependent variable (the emotional value of the public). This suggests that despite various indicators of user attention and engagement, most individuals maintain their own perspectives unaffected by these factors. The location of the epidemic center also does not appear to significantly affect public emotional responses, possibly owing to a perceived similarity of experience among those affected (Fig. 15).

figure 15

Regression normalization residuals.

The regression model requires that residuals follow a normal distribution for the analysis to be valid. SPSS tests confirm this requirement, ensuring the effectiveness of the regression analysis.

Figure 16 illustrates the temporal dynamics of online public discourse, specifically the fluctuations in the number of comments over time. It reveals a notable increase in public engagement within the first 3 h after the release of relevant reports, peaking between the 4th and 5th hours. Subsequently, the volume of public discussion gradually declines, stabilizing after approximately 8 h. This pattern suggests that governmental interventions should ideally be made within 4–5 h of a report’s release, guiding public opinion and mitigating potential consequences before discussions peak.

figure 16

Variation in comment volume with time interval for each category.

Result analysis

This section provides a summary of the key findings of the paper, compares them with the conclusions of the literature, and proposes relevant recommendations. The paper concludes by highlighting the main contributions of this article.

Government prevention and control lag

From the background analysis in 2.3.1, a clear pattern emerges when the timeline of epidemic development events in Hubei Province is sorted: the measures taken by the government lagged behind events. After the lockdown was announced, panic buying had already taken root and spread before news was released to assure the public of material abundance; by that time, large-scale panic buying had occurred and the social impact was difficult to control.

The problem of insufficient breadth and depth of market supervision urgently needs to be solved

The analyses in 3.2 and 3.4 show that the breadth and depth of market supervision have many shortcomings that need to be remedied. Owing to limits on the categories of regulated materials and the venues covered, thorny problems such as repeated price increases occurred frequently, seriously disrupting normal market order.

The in-depth research on government supervision by Shan and Pi (2023) shows that when the government chooses an “active supervision” strategy, its decision-making basis depends mainly on supervision costs and government credibility rather than simply on the “amount of fines”. This finding demonstrates that government management is grounded in rationality and puts the people’s interests first.

Market supervision plays a crucial and indispensable role in curbing panic buying. Herbon and Kogan (2022) emphasize that market intervention can start from subsidizing enterprises and consumers, although it is constrained by the government’s financial pressure. Relevant departments should increase management efforts, optimize the system, refine supervision plans, intervene in the market effectively, and deal with similar problems. The core point is that market supervision should not only focus on solving current problems but also establish long-term mechanisms to enhance the market’s self-regulatory and risk-resistance abilities and ensure that the market can still operate stably in the face of various shocks.

The effects of the adequate supplies category and the authority effect category are more remarkable

The sentiment analysis results in 3.3 clearly show that among the various intervention measures, the adequate supplies category is the most effective, followed by the authority effect category. If the two are ingeniously integrated, that is, if authoritative figures come forward to declare the sufficiency of supplies, stronger and more positive public sentiments can be aroused, and more powerful and effective intervention results can be expected.

During the difficult period when the epidemic was rampant, the influence of prestigious authoritative public figures far exceeded that of ordinary people (Ding, 2009). Renowned figures such as Zhong Nanshan and Li Lanjuan not only have highly authoritative teams supporting them but also, with their rich practical experience and precise professional interpretation, easily win high recognition and full trust from the public.

This phenomenon is not limited to the epidemic itself. Even in the face of a series of social phenomena such as panic buying caused by the epidemic, releasing news about the adequacy of supplies through influential figures at this special stage is highly likely to achieve unexpectedly good effects. The underlying principle is that when the public faces uncertainties and potential risks, they tend to seek guidance and security from authorities. When authoritative figures convey a clear signal of adequate supplies, it can greatly alleviate the anxiety and unease in the public’s hearts, thereby effectively suppressing the spread of panic sentiments and the irrational purchasing behaviors caused thereby.

There was a significant positive correlation between epidemic severity and public perception

In the analysis of 3.5, the severity of the epidemic alone explains 60.1% of the variation in public perception, and the correlation between epidemic severity and public perception is the strongest among the variables examined. This indicates that the public’s perception of government intervention is strongly affected by the severity of the epidemic: the more severe the epidemic, the more eager the public is for government control and the stronger their perception of government intervention. Prentice et al. (2021) similarly found that government intervention and support influenced participation in panic buying.

When the government intervenes in response to panic buying, it can start from the initial outbreak of the epidemic. On the one hand, timely disclosure of epidemic trends can alleviate public panic caused by unknown factors; on the other hand, the government can actively promote the implementation of prevention and control measures and release relevant prevention and control news. In addition, the government can appropriately guide online discussion, dispel rumors, cut off sources of misinformation, correct the direction of public opinion in a timely manner, and guide the public to think rationally.

Practical implications

The practical significance of this paper lies in providing in-depth analysis and possible strategic directions for understanding and responding to the phenomenon of panic buying.

Firstly, it helps the government formulate and adjust intervention policies more precisely. By evaluating the effects of intervention measures and understanding the public attitude, the government can promptly identify the deficiencies of existing policies and thus formulate more targeted and effective strategies to better cope with similar emergencies. For example, at the beginning of 2020, a sudden COVID-19 outbreak occurred in a certain city, leading to the public’s panic buying of masks and disinfectants. Through monitoring the market situation and collecting public opinions, the government found that the previous measure of simply appealing to the public to buy rationally was ineffective. Based on the viewpoint proposed in this article that the intervention measures of the adequate supply category are effective, the government promptly coordinated local related enterprises to increase the production of masks and disinfectants and allocated materials from other places to increase market supply. At the same time, based on the viewpoint that the intervention measures of the authority effect category are effective, medical experts were invited to explain the scientific usage methods of masks and disinfectants and the material reserves of the city to the public through TV and online live broadcasts. As a result, within a short period of time, the public’s rush-buying behavior was alleviated, the prices of masks and disinfectants in the market gradually stabilized, and the supply was sufficient.

Secondly, the finding that the severity of the epidemic is positively correlated with public perception provides strong theoretical support for the government’s crisis communication and public opinion guidance strategies. According to the dynamic changes of the epidemic, the government needs to formulate a phased, hierarchical information release and communication plan to meet the public’s information needs at different stages. For example, in the early days of the COVID-19 epidemic, the public had a limited understanding of the epidemic, lacked a clear judgment of how the situation would develop, and exhibited an obvious herd mentality; following the development of the epidemic, the government disclosed real epidemic data, prevention and control measures, and the allocation of medical resources to the public in stages. During the severe phase of the outbreak, the number of new cases increased significantly, the government’s medical resources faced great pressure, and the public was under greater psychological stress; the government updated information on case numbers and medical resources in a timely and accurate manner and explained its active efforts, making the public clearly aware of the severity of the situation so as to encourage more conscious compliance with prevention and control measures and active cooperation with the government’s work. At the stage when the epidemic came under control, the downward trend in new cases and the policies for resuming work and production were announced in a timely manner; for example, some restaurants gradually reopened dining services to the extent permitted by the government, with customers maintaining social distancing in accordance with regulations, and the clear announcement of these policy changes enabled social and economic activities to gradually return to normal under the premise of safety. Through transparent, timely, and accurate information transmission, the public’s understanding of the epidemic gradually became rational, avoiding both excessive panic and blind optimism and thus better pooling social consensus to fight the epidemic together.

Conclusions

In summary, the analysis highlights several key aspects of the government’s response to the epidemic in Hubei. First, there was a noticeable lag in the implementation of prevention and control measures, leading to panic buying and uncontrollable social impacts. Second, interventions by authoritative figures and assurances about material sufficiency had a positive effect on public sentiment, emphasizing their influential role during the crisis. Third, market regulation has shown insufficient breadth and depth, resulting in issues such as repeated price increases. Finally, there was a significant positive correlation between the severity of the epidemic and public perception of government intervention, indicating that the more severe the epidemic was, the stronger the public’s desire for government control and perception of government involvement were.

Based on the data of public comments, this paper evaluates government intervention measures and draws some conclusions. However, it is imperative to recognize certain inherent limitations, providing avenues for further exploration:

This paper uses online comment data to analyze the effect of government intervention, mainly from the public’s perspective. Future studies could pivot toward an economic lens, drawing insights from the sales metrics of prominent retail outlets before and after the rollout of government initiatives in epidemic areas. In addition, online comment data may have certain limitations: although we chose Weibo, a platform with 230 million daily active users, and obtained as many as 80,000 comment data points, a segment of the population, particularly those less digitally inclined, remains unrepresented in the online discourse. Further optimization can be carried out in the future to address this problem.

While this study primarily anchors its analysis to data from the early stages of the epidemic, it is crucial to understand that government measures should be dynamic and adaptable to the evolving nature of the epidemic. It is necessary to consider the characteristics of epidemic development, grasp people’s emotional situation, and formulate relevant measures. Consequently, the insights gained from this research might not seamlessly translate to the subsequent phases of epidemic management in China.

While the digital realm is often celebrated for its perceived freedom, internet censorship is in fact implemented worldwide. A salient aspect of such censorship is internet filtering, which blocks certain illegal or sensitive words; well-known filtering tools such as PureSight PC, CYBERsitter, SafeEyes, and CyberPatrol are used for this purpose, and filtering and blocking are normally performed on ISP (Internet Service Provider) servers. Although cyberspace is open, speech on it is therefore not completely unrestricted. In the People’s Republic of China, as in most other countries, ISPs are the main body responsible for internet censorship. Consequently, a minority of the comments may have been expunged and were therefore inaccessible for our analysis. The volume of data we amassed in this study mitigates this limitation and supports the robustness of our findings despite such omissions. Moving forward, subsequent research could attempt to quantify this specific impact, which would further strengthen the study’s robustness and lend additional credibility to its findings.

Data availability

The author confirms that all data generated or analyzed during this study are included in this published article. Furthermore, secondary sources and data supporting the findings of this study were all publicly available at the time of submission. Additional data related to this study can be found in the Supplementary Information submitted with this article.


Acknowledgements

This research is supported by the Major Program of National Philosophy and Social Science Foundation of China (Grant No. 22&ZD162), Zhejiang Provincial Natural Science Foundation of China (Grant No. LY22G010004), as well as Zhejiang Gongshang University “digital +” discipline construction key project (Grant No. SZJ2022B019).

Author information

Authors and affiliations

School of Artificial Intelligence and Electronic Commerce, Zhejiang Gongshang University Hangzhou College of Commerce, Hangzhou, China

Tinggui Chen & Bing Wang

School of Statistics and Mathematics, Zhejiang Gongshang University, Hangzhou, China

Tinggui Chen & Yumei Jin

Department of Computer Science and Information Systems, University of North Georgia, Oakwood, GA, USA

Jianjun Yang


Contributions

Conceptualization: Tinggui Chen and Bing Wang; methodology: Yumei Jin and Tinggui Chen; software: Yumei Jin; validation: Jianjun Yang; formal analysis: Yumei Jin and Jianjun Yang; data curation: Bing Wang; writing-original draft: Tinggui Chen, Yumei Jin and Bing Wang. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Bing Wang .

Ethics declarations

Competing interests

The authors declare no competing interests.

Ethical approval

This article does not contain any studies with human participants performed by any of the authors.

Informed consent

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .


About this article

Cite this article

Chen, T., Jin, Y., Wang, B. et al. The government intervention effects on panic buying behavior based on online comment data mining: a case study of COVID-19 in Hubei Province, China. Humanit Soc Sci Commun 11 , 1200 (2024). https://doi.org/10.1057/s41599-024-03725-8


Received : 24 October 2023

Accepted : 05 September 2024

Published : 13 September 2024

DOI : https://doi.org/10.1057/s41599-024-03725-8


A Comparative Study of Famous Classification Techniques and Data Mining Tools

  • Conference paper
  • First Online: 22 November 2019


  • Yash Paul
  • Neerendra Kumar

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 597)


Data mining is the process of extracting the facts and patterns hidden in huge amounts of data and converting them into a readable and understandable form. Data mining has four main modules: classification, association rule analysis, clustering, and sequence analysis. Classification is the major module and is used for classification problems in many different areas. The classification process summarizes data investigation and can be used to develop models that describe the different classes or predict future data trends, leading to a better understanding of the data. In this survey, various data mining classification techniques and some important data mining tools are presented along with their advantages and disadvantages. Classification techniques are grouped into three categories: eager learners, lazy learners, and other classification techniques. Decision trees, Bayesian classification, rule-based classification, Support Vector Machines (SVM), association rule mining, and backpropagation (neural networks) are eager learners. K-Nearest Neighbor (KNN) classification and Case-Based Reasoning (CBR) are lazy learners. Other classification techniques include genetic algorithms, fuzzy logic, and the rough set approach. Here, six important data mining tools and the basic eager learner, lazy learner, and other classification techniques for data classification are discussed. The aim of this article is to provide a survey of six well-known data mining tools and of the main data mining classification techniques.
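To make the eager/lazy distinction concrete, the short sketch below trains an eager learner (a decision tree, which builds its model at training time) and a lazy learner (K-Nearest Neighbor, which defers most of its work to prediction time) on the same data. It uses scikit-learn and a toy dataset purely for illustration; the dataset and hyperparameters are assumptions of this sketch, not choices made in the surveyed paper.

```python
# Minimal sketch: eager learner (decision tree) vs lazy learner (KNN).
# Dataset and hyperparameters are illustrative only.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Eager learner: generalizes at training time by building a tree model.
tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X_train, y_train)

# Lazy learner: stores the training set and classifies by the K nearest neighbors.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

for name, model in [("Decision tree (eager)", tree), ("KNN (lazy)", knn)]:
    print(f"{name}: accuracy = {accuracy_score(y_test, model.predict(X_test)):.3f}")
```

The practical difference shows up in where the cost is paid: the tree spends effort during fitting but is fast to query, while KNN trains almost instantly and pays for the distance computations at prediction time.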




Author information

Authors and affiliations

Ph.D. School of Informatics, Eötvös Loránd University, Budapest, Hungary

Yash Paul

John von Neumann Faculty of Informatics, Óbuda University, Budapest, Hungary

Neerendra Kumar

Department of Computer Science & IT, Central University of Jammu, Jammu, India

Neerendra Kumar

Corresponding author

Correspondence to Yash Paul .

Editor information

Editors and affiliations

Department of Computer Science and Engineering, Jaypee University of Information Technology, Waknaghat, Himachal Pradesh, India

Pradeep Kumar Singh

Indian Institute of Technology Delhi, New Delhi, Delhi, India

Arpan Kumar Kar

Central University of Jammu, Jammu, Jammu and Kashmir, India

Yashwant Singh

Indian Institute of Technology Patna, Patna, Bihar, India

Maheshkumar H. Kolekar

Institute of Technology, Nirma University, Ahmedabad, Gujarat, India

Sudeep Tanwar


Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Paul, Y., Kumar, N. (2020). A Comparative Study of Famous Classification Techniques and Data Mining Tools. In: Singh, P., Kar, A., Singh, Y., Kolekar, M., Tanwar, S. (eds) Proceedings of ICRIC 2019 . Lecture Notes in Electrical Engineering, vol 597. Springer, Cham. https://doi.org/10.1007/978-3-030-29407-6_45


DOI : https://doi.org/10.1007/978-3-030-29407-6_45

Published : 22 November 2019

Publisher Name : Springer, Cham

Print ISBN : 978-3-030-29406-9

Online ISBN : 978-3-030-29407-6

eBook Packages : Engineering Engineering (R0)


What Is Data Classification? How It's Useful for Businesses


In this post

  • Data classification levels
  • Types of data classification
  • Industries that use data classification
  • Benefits of data classification

All businesses manage copious amounts of data. Every day, new documents are created, older files are updated, and collaborative work is shared between employees. 

Business data typically contains sensitive and private information that unauthorized users should never have access to. This is particularly important if you work in a regulated field. But even if you’re not, you have to understand which data is most critical and how to prioritize it through data classification to keep it safe.

What is data classification?

Data classification defines and organizes data in a way that accounts for how important it is against predefined criteria, such as the file type, sensitivity, or the value the data has to the company. 

Improving business security and safeguarding information like financial records, customer health records, or personally identifiable information (PII) requires a reliable data classification process. This involves tagging and categorizing documentation to make it easier to track and search, while eliminating the need to keep copies of important files in multiple locations.

By organizing your business data according to the predetermined framework laid out in your data classification policy, you can sort this information accordingly, find what you need more easily, and invest more security resources into the most critical information. Most businesses find using data-centric security software the most efficient way to do this, as these tools help companies protect the data itself, rather than focusing on servers or other devices where data is stored.

Data classification can be arranged into two types of levels based on the type of data your business has and the size of your organization. The first method uses a simple system that organizes data by sensitivity.

High sensitivity

Also referred to as confidential or restricted, this is the most critical data an organization has. Business operations would experience severe setbacks if this data were lost or compromised. Financial records, business intelligence documentation, and health records all qualify as high sensitivity.

Medium sensitivity

Also classified as internal use only or sensitive, medium-level data is information that stays within an organization but can be shared with the majority of the team. Emails, general business information, and documents that don’t contain sensitive or private information fall into this category.

Low sensitivity

Otherwise known as public or unrestricted data, this category includes information that can be shared with the public. Press releases, website content, and marketing materials are typical examples.

Further approaches to data classification levels

For businesses that hold a large range of data, it helps to break the levels into more defined categories.

  • Public: Any researchable info that can be freely used by the public 
  • Internal: Information that should only be seen by a company’s employees or contractors, such as memos, emails, or corporate intellectual property
  • Confidential or restricted: Trade secrets , proprietary business knowledge, and legal documents
  • Private: Typically belongs to individuals, such as contact information, biometric data, or health records.
  • Critical: Data that’s essential for day-to-day operations of the business, like emergency response plans, system configuration and infrastructure data, and customer databases
  • Regulatory: Any information that falls under national or international compliance rules, like PII , financial records, or medical records
  • Archived: Inactive data that needs to be kept for legal, regulatory, or financial reasons

Once data has been tagged according to its sensitivity level, it can be filed based on its type. This typically has three categories; a short sketch of the first approach follows the list.

  • Content-based classification . Data security tools scan the information to look for potentially sensitive details. This can be done using the tagging level system, wherein data is reviewed for personal or private specifics.
  • Context-based classification . This approach looks at the metadata, which is information about the data’s application and location in the system, along with creator details. Essentially, this type of classification looks at the non-sensitive parts of the data file.
  • User-based classification . Data classified by user is based primarily on how the information is used. This isn’t an automated process; the end user must choose this category manually. These individuals pull from their own knowledge of the data to determine how sensitive the information is after reviewing it.
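As a rough illustration of the content-based approach described in the first bullet above, the following sketch scans a piece of text for a couple of sensitive patterns and maps the result onto the sensitivity levels discussed earlier. The patterns, level names, and mapping are hypothetical; real data-centric security tools use far more sophisticated detectors (dictionaries, machine learning models, exact data matching).

```python
import re

# Minimal, hypothetical content-based classifier: scans text for sensitive
# patterns and returns a sensitivity level. Real tools use richer detectors.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def classify_content(text: str) -> str:
    hits = [name for name, pattern in PATTERNS.items() if pattern.search(text)]
    if "card_number" in hits:
        return "high"      # e.g. confidential/restricted
    if "email" in hits:
        return "medium"    # e.g. internal use only
    return "low"           # public/unrestricted

print(classify_content("Contact: jane.doe@example.com"))          # medium
print(classify_content("Card: 4111 1111 1111 1111"))              # high
print(classify_content("Press release: product launch in May"))   # low
```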

Data classification is essential for businesses in fields that work with high-level compliance and regulatory standards, like the General Data Protection Regulation (GDPR) or the Health Insurance Portability and Accountability Act (HIPAA).

Finance and insurance

Regulatory compliance in the finance and insurance industries is critical because these fields deal with large amounts of PII. Data classification here focuses on keeping information secure to mitigate cyberthreats while ensuring compliance with GDPR and Payment Card Industry Data Security Standard (PCI DSS) regulations.

Government agencies

The government holds information about private citizens and protects sensitive information that’s critical to national security and public safety. This data represents some of the most sensitive information in the world and requires the highest level of security. Government agencies must also comply with regulations like the Freedom of Information Act (FOIA) at the national, state, and local levels.

Healthcare

PII and protected health information (PHI) of patients is extremely sensitive and falls under HIPAA regulations. Data classification ensures compliance and lowers the risk of this information being leaked in a data breach.


Not all healthcare information is the highest level of sensitivity, though. Data like drug information or completed medical studies may be released to the public.

Education

Student records, academic performance data, and other information relating to faculty and staff are all held in databases at educational institutions. Beyond this, certain campus offices may keep other sensitive information, such as tax details for students and their families. Not all information needs to be shared with every department, so data classification protects these files from unauthorized users.

Retail

Both online and brick-and-mortar stores hold large amounts of customer information, such as sales transaction data and payment details, along with operations-critical files like inventory information. This data can be used for targeted marketing and to improve the customer experience, but it must be protected to comply with privacy and payment security regulations.

Organizations that don’t actively classify their data put themselves at greater risk of cyberthreats and compliance-based fines. That’s why incorporating a data classification process into your business is essential, no matter how much and what type of data your company keeps. 

Improves data protection and security

Data classification gives you an added layer of security. It helps your organization prioritize data based on its sensitivity, which means you can focus your resources and budget to protect the most critical assets. It saves you money and it also helps you avoid costly fines should your business suffer a data breach.

Data classification also helps your IT team with identity and access management because knowing the data classification level of documents allows them to assign access according to role. This also goes a long way in preventing internal information theft.

It’s not only regulators and internal employees who are concerned about your business data security, though. A strong data classification policy that fits into your wider security strategy is one of the best ways to build and retain trust with your customers. When they hand over their payment or personal details to your company, they want to know that it’s safe from exploitation by cybercriminals.


Helps meet compliance standards

Not every regulatory standard applies to your business, but it’s likely that you have to comply with something. If you sell to European customers, you need to be GDPR compliant. If you take digital payments in any form, you need to comply with PCI DSS. Using data classification, you can identify which information helps you remain compliant and avoid significant penalties.

Data classification, in conjunction with other data security tools you might be using, can help you keep a paper trail of how information has been used in your business, who has access to those files, and when updates were last made. This is essential if you ever face a compliance audit and need to prove that data is being properly secured.

Enhances operational efficiency 

Classifying data expedites and simplifies analysis and reporting for internal employees. You can easily find the most relevant data without having to root through extraneous information. 

This approach to data security also enhances record retention and assessment. Archived files can be moved to lower-priority storage servers or networks to conserve valuable space on the most used and secure devices.

Cracking the code on your business data

Keeping company data well-organized and secure should be a top priority for any organization, no matter how big or small. With data classification, you can boost your business’s security and create a more efficient data organization system that benefits your whole team.

Looking for more ways to protect your data? With sensitive data discovery software , your employees can locate your most sensitive business information across multiple company systems, databases, and applications.

Holly Landis

Holly Landis is a freelance writer for G2. She also specializes in being a digital marketing consultant, focusing in on-page SEO, copy, and content writing. She works with SMEs and creative businesses that want to be more intentional with their digital strategies and grow organically on channels they own. As a Brit now living in the USA, you'll usually find her drinking copious amounts of tea in her cherished Anne Boleyn mug while watching endless reruns of Parks and Rec.


Identifying the Classification Performances of Educational Data Mining Methods: A Case Study for TIMSS

  • August 2017
  • Educational Sciences Theory & Practice 17(5)

Serpil Kılıç Depren, Yildiz Technical University

Ersoy Öz, Yildiz Technical University



Classification of Logging Data Using Machine Learning Algorithms


1. Introduction

  • The main results of well log data interpretation using machine learning for different types of deposits are presented;
  • A uranium well log (UWL) dataset is presented and described, allowing us to set up machine learning methods for ROZ detection and lithological classification;
  • This paper presents the state-of-the-art result in solving ROZ detection and lithological classification tasks obtained using the UWL dataset;
  • The influence of floating window size on the quality of classification is investigated.

2. Related Works

  • Lithological classification.
  • The identification of reservoirs.
  • Stratigraphic classification.
  • The estimation of rock permeability.
  • The identification of reservoir oxidation.
  • The technology of extraction dictates the necessity of identifying impermeable layers with a thickness of 20 cm (this is a requirement of the regulatory documentation) within the ore-bearing horizon with a thickness of 60–80 m. In some cases, one identified layer in petroleum geophysics corresponds to the entire interpreted ore-bearing horizon in uranium geophysics [ 1 ].
  • The set of recorded logging data is much smaller compared to oil and gas fields. In fact, only fairly simple variations in electrical logging (AR, SP, IL) are available. Gamma logging cannot be used for lithological classification because the contribution to the recorded gamma radiation from radium and its decay products is two orders of magnitude greater than that from lithology. Of the neutron methods, only fission neutron logging is used, aimed at the direct determination of uranium [ 1 ].
  • Difficulties with extracting and tying the core due to the characteristics of the section (sand and clay).
  • Use of experts’ assessments, which contain a significant degree of subjectivity [ 16 ].
  • The regulatory framework, interpretation methods, and standard set of logging methods were inherited from the USSR and underwent only minor changes in Kazakhstan.
  • There are no publicly available datasets that allow for a comparative analysis of classification and forecasting methods based on well logging data from uranium deposits.

3.1. Data Preprocessing

  • ‘up5’: up_w = 5 (size of the top of the data window)
  • ‘dn150’: dn_w = 150 (size of the bottom of the data window)
  • ‘t5’: test_part = 5 (which part of the dataset is used as the test part; here, the 5th of 10 possible test parts)
  • ‘n1’: norm = 1 (whether the input parameters were normalized; 1 = normalized, 0 = not normalized)
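A minimal sketch of the floating-window idea encoded by the up_w and dn_w parameters above is shown below: each depth sample is augmented with the readings up_w positions above and dn_w positions below it, so the classifier sees local context rather than a single reading. The log names, the edge-padding strategy, and the normalization step are assumptions made for illustration; this is not the authors' preprocessing code.

```python
import numpy as np
import pandas as pd

def build_window_features(df: pd.DataFrame, logs=("AR", "SP", "IL"),
                          up_w: int = 5, dn_w: int = 150) -> pd.DataFrame:
    """Attach a floating window of log readings to every depth sample.

    For each row, the feature vector contains the current reading plus
    up_w readings above and dn_w readings below it (edges are padded by
    repeating the first/last value). Column names are illustrative.
    """
    features = {}
    for log in logs:
        values = df[log].to_numpy()
        padded = np.concatenate([
            np.full(up_w, values[0]), values, np.full(dn_w, values[-1])
        ])
        for offset in range(-up_w, dn_w + 1):
            features[f"{log}_{offset:+d}"] = padded[up_w + offset:
                                                    up_w + offset + len(values)]
    out = pd.DataFrame(features, index=df.index)
    # Optional normalization step (norm = 1 in the file-naming scheme above).
    return (out - out.mean()) / out.std()

# Hypothetical usage with synthetic readings:
df = pd.DataFrame({log: np.random.rand(1000) for log in ("AR", "SP", "IL")})
X = build_window_features(df, up_w=5, dn_w=150)
print(X.shape)  # (1000, 3 * (5 + 150 + 1)) -> 468 feature columns
```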

3.2. Training and Evaluating Machine Learning Models

  • The dataset is not balanced: the number of objects in each class differs considerably (class 2: 6876; class 1: 35,812; class 8: 28,073 (ROZ)).
  • In the ROZ classification task, objects of all three classes are equally important to the researcher (a small evaluation sketch follows this list).
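Because the classes are imbalanced but equally important, a macro-averaged F1 score (which weights every class equally, unlike accuracy or micro F1) is the natural headline metric, and class weighting is one common way to counteract the imbalance during training. The sketch below shows that setup on synthetic data; the use of LightGBM with balanced class weights and the synthetic class proportions are illustrative assumptions, not the authors' exact configuration.

```python
from lightgbm import LGBMClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, classification_report

# Synthetic stand-in for an imbalanced three-class problem (proportions illustrative).
X, y = make_classification(n_samples=20000, n_features=10, n_informative=6,
                           n_classes=3, weights=[0.5, 0.1, 0.4], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=0)

# Balanced class weights counteract the unequal class sizes during training.
clf = LGBMClassifier(class_weight="balanced", random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)

# Macro F1 treats every class as equally important; micro F1 equals accuracy here.
print("macro F1:", f1_score(y_test, pred, average="macro"))
print("micro F1:", f1_score(y_test, pred, average="micro"))
print(classification_report(y_test, pred))
```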

4.1. ROZ Identification

4.2. Lithological Classification

5. Discussion

6. Conclusions

Author Contributions

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

Metrics used to evaluate the regression and classification models:

| Model type | Metric | Formula | Explanation |
| --- | --- | --- | --- |
| Regression | Mean absolute error (MAE) | $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$ | $n$ is the sample size, $y_i$ the real value of the target variable for the $i$-th example, and $\hat{y}_i$ the calculated value for the $i$-th example |
| Regression | Determination coefficient | $R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$ | $\bar{y}$ is the mean of the real values |
| Regression | Linear correlation coefficient (Pearson correlation coefficient) | $r = \frac{\sum_{i}(y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i}(y_i - \bar{y})^2 \sum_{i}(\hat{y}_i - \bar{\hat{y}})^2}}$ | measures the strength of the linear relationship between real and calculated values |
| Classification | Accuracy | $\mathrm{Acc} = \frac{N_{\text{correct}}}{N}$ | $N_{\text{correct}}$ is the number of correct answers and $N$ is the total number of possible answers of the model |
| Classification | Precision | $\mathrm{Precision} = \frac{TP}{TP + FP}$ | true positives (TP) and true negatives (TN) are cases of correct operation of the classifier; false negatives (FN) and false positives (FP) are cases of misclassification |
| Classification | Recall | $\mathrm{Recall} = \frac{TP}{TP + FN}$ | share of the actual objects of a class that the classifier recovers |
| Classification | F1 score | $F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ | harmonic mean of precision and recall |
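The quantities in the table above map directly onto standard library calls. The snippet below is a small sketch, using made-up values, of how each metric can be computed with scikit-learn and NumPy.

```python
import numpy as np
from sklearn.metrics import (mean_absolute_error, r2_score, accuracy_score,
                             precision_recall_fscore_support)

# Regression metrics (MAE, R^2, Pearson r) on made-up values.
y_true = np.array([3.0, 2.5, 4.1, 5.0])
y_pred = np.array([2.8, 2.7, 3.9, 5.2])
print("MAE:", mean_absolute_error(y_true, y_pred))
print("R^2:", r2_score(y_true, y_pred))
print("Pearson r:", np.corrcoef(y_true, y_pred)[0, 1])

# Classification metrics (accuracy, precision, recall, F1) on made-up labels.
labels_true = [1, 2, 8, 1, 8, 2, 1, 8]
labels_pred = [1, 2, 8, 8, 8, 1, 1, 8]
print("accuracy:", accuracy_score(labels_true, labels_pred))
for avg in ("macro", "micro"):
    p, r, f1, _ = precision_recall_fscore_support(labels_true, labels_pred,
                                                  average=avg, zero_division=0)
    print(f"{avg}: precision={p:.3f} recall={r:.3f} F1={f1:.3f}")
```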
ROZ identification results for different data window sizes (Dw_N):

| Dw_N | Classifier | Acc | F1 (class 1) | F1 (class 2) | F1 (class 8) | F1 (macro) | F1 (micro) | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | LGBM | 0.807 | 0.805 | 0.633 | 0.659 | 0.699 | 0.807 | 22.2 |
| 5 | RBF SVM | 0.792 | 0.786 | 0.627 | 0.632 | 0.682 | 0.792 | 1596 |
| 25 | LGBM | 0.808 | 0.805 | 0.661 | 0.673 | 0.713 | 0.808 | 52 |
| 25 | RBF SVM | 0.794 | 0.782 | 0.630 | 0.650 | 0.688 | 0.794 | 3274 |
| 50 | LGBM | 0.814 | 0.815 | 0.668 | 0.681 | 0.721 | 0.814 | 81.1 |
| 50 | RBF SVM | 0.795 | 0.780 | 0.603 | 0.662 | 0.682 | 0.795 | 5743.5 |
Lithological classification results for different data window sizes (Dw_N), with per-class F1 scores:

| Dw_N | Model | Acc | F1 (1) | F1 (3) | F1 (4) | F1 (5) | F1 (6) | F1 (7) | F1 (9) | F1 (macro) | F1 (micro) | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5 | LGBM | 0.55 | 0.695 | 0.295 | 0.168 | 0.29 | 0.015 | 0.643 | 0 | 0.301 | 0.55 | 3.98 |
| 5 | XGB | 0.565 | 0.705 | 0.292 | 0.159 | 0.433 | 0 | 0.623 | 0 | 0.344 | 0.565 | 120.97 |
| 5 | MLP | 0.599 | 0.739 | 0.266 | 0.078 | 0.022 | 0 | 0.671 | 0 | 0.276 | 0.599 | 32.001 |
| 10 | LGBM | 0.565 | 0.706 | 0.344 | 0.209 | 0.247 | 0.063 | 0.629 | 0 | 0.314 | 0.565 | 4.731 |
| 10 | RFC | 0.588 | 0.729 | 0.318 | 0.197 | 0.16 | 0.043 | 0.608 | 0 | 0.324 | 0.588 | 24.25 |
| 10 | XGB | 0.577 | 0.715 | 0.348 | 0.19 | 0.337 | 0.089 | 0.628 | 0 | 0.359 | 0.577 | 114.24 |
| 10 | MLP | 0.62 | 0.758 | 0.328 | 0.11 | 0.059 | 0 | 0.646 | 0 | 0.3 | 0.62 | 37.518 |
| 25 | LGBM | 0.578 | 0.719 | 0.375 | 0.206 | 0.185 | 0.022 | 0.65 | 0 | 0.308 | 0.578 | 9.851 |
| 25 | RFC | 0.608 | 0.745 | 0.337 | 0.213 | 0.143 | 0.027 | 0.641 | 0 | 0.332 | 0.608 | 58.774 |
| 25 | XGB | 0.609 | 0.744 | 0.392 | 0.21 | 0.263 | 0.028 | 0.671 | 0 | 0.364 | 0.609 | 260.25 |
| 25 | MLP | 0.641 | 0.774 | 0.402 | 0.18 | 0.142 | 0 | 0.668 | 0 | 0.341 | 0.641 | 50.892 |
| 50 | LGBM | 0.61 | 0.754 | 0.4 | 0.241 | 0.178 | 0.012 | 0.644 | 0 | 0.318 | 0.61 | 17.161 |
| 50 | RFC | 0.639 | 0.773 | 0.405 | 0.224 | 0.17 | 0 | 0.649 | 0 | 0.35 | 0.639 | 104.44 |
| 50 | XGB | 0.636 | 0.768 | 0.415 | 0.243 | 0.301 | 0.005 | 0.683 | 0 | 0.381 | 0.636 | 435.20 |
| 50 | MLP | 0.659 | 0.792 | 0.454 | 0.228 | 0.198 | 0 | 0.674 | 0 | 0.37 | 0.659 | 81.985 |
| 100 | LGBM | 0.634 | 0.776 | 0.434 | 0.253 | 0.153 | 0.011 | 0.665 | 0 | 0.327 | 0.634 | 32.906 |
| 100 | RFC | 0.656 | 0.79 | 0.431 | 0.245 | 0.216 | 0 | 0.649 | 0 | 0.368 | 0.656 | 195.50 |
| 100 | XGB | 0.659 | 0.791 | 0.441 | 0.247 | 0.308 | 0 | 0.692 | 0 | 0.391 | 0.659 | 732.61 |
| 100 | MLP | 0.656 | 0.796 | 0.432 | 0.253 | 0.188 | 0 | 0.666 | 0 | 0.368 | 0.656 | 102.92 |
| 200 | LGBM | 0.65 | 0.794 | 0.471 | 0.277 | 0.091 | 0.019 | 0.636 | 0 | 0.327 | 0.65 | 46.825 |
| 200 | RFC | 0.681 | 0.81 | 0.489 | 0.282 | 0.199 | 0 | 0.633 | 0 | 0.381 | 0.681 | 658.43 |
| 200 | XGB | 0.694 | 0.819 | 0.492 | 0.295 | 0.28 | 0.008 | 0.705 | 0 | 0.41 | 0.694 | 1319.5 |
| 200 | MLP | 0.649 | 0.789 | 0.455 | 0.234 | 0.18 | 0 | 0.652 | 0 | 0.364 | 0.649 | 157.25 |
  • Mukhamediev, R.I.; Kuchin, Y.; Amirgaliyev, Y.; Yunicheva, N.; Muhamedijeva, E. Estimation of Filtration Properties of Host Rocks in Sandstone-Type Uranium Deposits Using Machine Learning Methods. IEEE Access 2022 , 10 , 18855–18872. [ Google Scholar ] [ CrossRef ]
  • Amirova, U.; Uruzbaeva, N. Overview of the development of the world market of Uranium. Univers. Econ. Law Electron. Sci. J. 2017 , 6 , 1–8. [ Google Scholar ]
  • Baldwin, J.L.; Bateman, R.M.; Wheatley, C.L. Application of a neural network to the problem of mineral identification from well logs. Log Anal. 1990 , 31 , SPWLA-1990-v31n5a1. [ Google Scholar ]
  • Poulton, M.M. Computational Neural Networks for Geophysical Data Processing ; Elsevier: Amsterdam, The Netherlands, 2001. [ Google Scholar ]
  • Benaouda, D.; Wadge, G.; Whitmarsh, R.; Rothwell, R.; MacLeod, C. Inferring the lithology of borehole rocks by applying neural network classifiers to downhole logs: An example from the Ocean Drilling Program. Geophys. J. Int. 1999 , 136 , 477–491. [ Google Scholar ] [ CrossRef ]
  • Saggaf, M.; Nebrija, E.L. Estimation of missing logs by regularized neural networks. AAPG Bull. 2003, 87, 1377–1389.
  • Kumar, T.; Seelam, N.K.; Rao, G.S. Lithology prediction from well log data using machine learning techniques: A case study from Talcher coalfield, Eastern India. J. Appl. Geophys. 2022, 199, 104605.
  • Kim, J. Lithofacies classification integrating conventional approaches and machine learning technique. J. Nat. Gas Sci. Eng. 2022, 100, 104500.
  • Thongsamea, W.; Kanitpanyacharoena, W.; Chuangsuwanich, E. Lithological Classification from Well Logs using Machine Learning Algorithms. Bull. Earth Sci. Thail. 2018, 10, 31–43.
  • Liang, H.; Xiong, J.; Yang, Y.; Zou, J. Research on Intelligent Recognition Technology in Lithology Based on Multi-parameter Fusion. 2023.
  • Mohamed, I.M.; Mohamed, S.; Mazher, I.; Chester, P. Formation lithology classification: Insights into machine learning methods. In Proceedings of the SPE Annual Technical Conference and Exhibition, Calgary, AB, Canada, 30 September–2 October 2019.
  • Ahmadi, M.-A.; Ahmadi, M.R.; Hosseini, S.M.; Ebadi, M. Connectionist model predicts the porosity and permeability of petroleum reservoirs by means of petro-physical logs: Application of artificial intelligence. J. Pet. Sci. Eng. 2014, 123, 183–200.
  • Gholami, R.; Moradzadeh, A.; Maleki, S.; Amiri, S.; Hanachi, J. Applications of artificial intelligence methods in prediction of permeability in hydrocarbon reservoirs. J. Pet. Sci. Eng. 2014, 122, 643–656.
  • Zhong, Z.; Carr, T.R.; Wu, X.; Wang, G. Application of a convolutional neural network in permeability prediction: A case study in the Jacksonburg-Stringtown oil field, West Virginia, USA. Geophysics 2019, 84, B363–B373.
  • Khan, H.; Srivastav, A.; Kumar Mishra, A.; Anh Tran, T. Machine learning methods for estimating permeability of a reservoir. Int. J. Syst. Assur. Eng. Manag. 2022, 13, 2118–2131.
  • Kuchin, Y.I.; Mukhamediev, R.I.; Yakunin, K.O. One method of generating synthetic data to assess the upper limit of machine learning algorithms performance. Cogent Eng. 2020, 7, 1718821.
  • Merembayev, T.; Yunussov, R.; Yedilkhan, A. Machine learning algorithms for stratigraphy classification on uranium deposits. Procedia Comput. Sci. 2019, 150, 46–52.
  • Kuchin, Y.I.; Mukhamediev, R.I.; Yakunin, K.O. Quality of data classification under conditions of inconsistency of expert estimations. Cloud Sci. 2019, 6, 109–126. (In Russian)
  • Mukhamediev, R.I.; Kuchin, Y.; Popova, Y.; Yunicheva, N.; Muhamedijeva, E.; Symagulov, A.; Abramov, K.; Gopejenko, V.; Levashenko, V.; Zaitseva, E.; et al. Determination of Reservoir Oxidation Zone Formation in Uranium Wells Using Ensemble Machine Learning Methods. Mathematics 2023, 11, 4687.
  • Dacknov, V.N. Interpretation of the Results of Geophysical Studies of Well Sections; Nedra: Moscow, Russia, 1982; p. 448. (In Russian)
  • Mukhamediev, R.I.; Popova, Y.; Kuchin, Y.; Zaitseva, E.; Kalimoldayev, A.; Symagulov, A.; Levashenko, V.; Abdoldina, F.; Gopejenko, V.; Yakunin, K.; et al. Review of Artificial Intelligence and Machine Learning Technologies: Classification, Restrictions, Opportunities and Challenges. Mathematics 2022, 10, 2552.
  • Singh, H.; Seol, Y.; Myshakin, E.M. Automated well-log processing and lithology classification by identifying optimal features through unsupervised and supervised machine-learning algorithms. SPE J. 2020, 25, 2778–2800.
  • Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157.
  • Al Daoud, E. Comparison between XGBoost, LightGBM and CatBoost using a home credit dataset. Int. J. Comput. Inf. Eng. 2019, 13, 6–10.
  • Bentéjac, C.; Csörgő, A.; Martínez-Muñoz, G. A comparative analysis of gradient boosting algorithms. Artif. Intell. Rev. 2021, 54, 1937–1967.
  • Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
  • Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
  • Fix, E.; Hodges, J.L. Discriminatory analysis. Nonparametric discrimination: Consistency properties. Int. Stat. Rev./Rev. Int. de Stat. 1989, 57, 238–247.
  • Quinlan, J.R. Induction of decision trees. Mach. Learn. 1986, 1, 81–106.
  • Hornik, K.; Stinchcombe, M.; White, H. Multilayer feedforward networks are universal approximators. Neural Netw. 1989, 2, 359–366.
  • Galushkin, A.I. Neural Networks: Fundamentals of Theory; Telecom: Perm, Russia, 2010; p. 496. (In Russian)
  • Bayes, T. An essay towards solving a problem in the doctrine of chances. Biometrika 1958, 45, 296–315.
  • Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297.
  • Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780.
  • LeCun, Y.; Bengio, Y. Convolutional Networks for Images, Speech, and Time Series. In The Handbook of Brain Theory and Neural Networks; MIT Press: Cambridge, MA, USA, 1995; p. 3361.
  • Xueqing, Z.; Zhansong, Z.; Chaomo, Z. Bi-LSTM deep neural network reservoir classification model based on the innovative input of logging curve response sequences. IEEE Access 2021, 9, 19902–19915.
  • Patidar, A.K.; Singh, S.; Anand, S. Subsurface Lithology Classification Using Well Log Data, an Application of Supervised Machine Learning. In Workshop on Mining Data for Financial Applications; Springer Nature: Singapore, 2022; pp. 227–240.
  • Zhang, J.; He, Y.; Zhang, Y.; Li, W.; Zhang, J. Well-Logging-Based Lithology Classification Using Machine Learning Methods for High-Quality Reservoir Identification: A Case Study of Baikouquan Formation in Mahu Area of Junggar Basin, NW China. Energies 2022, 15, 3675.
  • Xing, Y.; Yang, H.; Yu, W. An approach for the classification of rock types using machine learning of core and log data. Sustainability 2023, 15, 8868.
  • Maxwell, K.; Rajabi, M.; Esterle, J. Automated classification of metamorphosed coal from geophysical log data using supervised machine learning techniques. Int. J. Coal Geol. 2019, 214, 103284.
  • Al-Mudhafar, W.J. Integrating well log interpretations for lithofacies classification and permeability modeling through advanced machine learning algorithms. J. Pet. Explor. Prod. Technol. 2017, 7, 1023–1033.
  • Liu, J.J.; Liu, J.C. Integrating deep learning and logging data analytics for lithofacies classification and 3D modeling of tight sandstone reservoirs. Geosci. Front. 2022, 13, 101311.
  • Rogulina, A.; Zaytsev, A.; Ismailova, L.; Kovalev, D.; Katterbauer, K.; Marsala, A. Similarity learning for well logs prediction using machine learning algorithms. In Proceedings of the International Petroleum Technology Conference, Dhahran, Saudi Arabia, 21–23 February 2022; D032S158R005.
  • Zhong, R.; Johnson, R.L., Jr.; Chen, Z. Using machine learning methods to identify coal pay zones from drilling and logging-while-drilling (LWD) data. SPE J. 2020, 25, 1241–1258.
  • Hou, M.; Xiao, Y.; Lei, Z.; Yang, Z.; Lou, Y.; Liu, Y. Machine learning algorithms for lithofacies classification of the Gulong shale from the Songliao Basin, China. Energies 2023, 16, 2581.
  • Schnitzler, N.; Ross, P.S.; Gloaguen, E. Using machine learning to estimate a key missing geochemical variable in mining exploration: Application of the Random Forest algorithm to multi-sensor core logging data. J. Geochem. Explor. 2019, 205, 106344.
  • Joshi, D.; Patidar, A.K.; Mishra, A.; Mishra, A.; Agarwal, S.; Pandey, A.; Choudhury, T. Prediction of sonic log and correlation of lithology by comparing geophysical well log data using machine learning principles. GeoJournal 2021, 88, 47–68.
  • Al-Khudafi, A.M.; Al-Sharifi, H.A.; Hamada, G.M.; Bamaga, M.A.; Kadi, A.A.; Al-Gathe, A.A. Evaluation of different tree-based machine learning approaches for formation lithology classification. In Proceedings of the ARMA/DGS/SEG International Geomechanics Symposium, Al Khobar, Saudi Arabia, 30 October–2 November 2023; p. ARMA-IGS-2023-0026.
  • Merembayev, T.; Yunussov, R.; Yedilkhan, A. Machine learning algorithms for classification geology data from well logging. In Proceedings of the 14th International Conference on Electronics Computer and Computation (ICECCO), Kaskelen, Kazakhstan, 29 November–1 December 2018; pp. 206–212.
  • Wenhua, W.; Zhuwen, W.; Ruiyi, H.; Fanghui, X.; Xinghua, Q.; Yitong, C. Lithology classification of volcanic rocks based on conventional logging data of machine learning: A case study of the eastern depression of Liaohe oil field. Open Geosci. 2021, 13, 1245–1258.
  • Kuchin, Y.; Mukhamediev, R.; Yunicheva, N.; Symagulov, A.; Abramov, K.; Mukhamedieva, E.; Levashenko, V. Application of machine learning methods to assess filtration properties of host rocks of uranium deposits in Kazakhstan. Appl. Sci. 2023, 13, 10958.
  • Kuchin, Y.; Yakunin, K.; Mukhamedyeva, E.; Mukhamedyev, R. Project on creating a classifier of lithological types for uranium deposits in Kazakhstan. J. Phys. Conf. Ser. 2019, 1405, 012001.
  • Pedregosa, F. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  • Scikit-Learn. Machine Learning in Python. Available online: https://scikit-learn.org/stable/ (accessed on 1 February 2024).
  • Ali, M. PyCaret: An Open Source, Low-Code Machine Learning Library in Python. PyCaret Version 1.0.0. 2020. Available online: https://www.pycaret.org (accessed on 1 February 2024).
  • Arnaut, F.; Kolarski, A.; Srećković, V.A. Machine Learning Classification Workflow and Datasets for Ionospheric VLF Data Exclusion. Data 2024, 9, 17.
  • Raschka, S. MLxtend: Providing Machine Learning and Data Science Utilities and Extensions to Python's Scientific Computing Stack. J. Open Source Softw. 2018, 3, 638.
  • Raschka, S. MLxtend Documentation. Available online: https://rasbt.github.io/mlxtend/ (accessed on 3 May 2023).
  • Scikit-Optimize. Sequential Model-Based Optimization in Python. Available online: https://scikit-optimize.github.io/stable/ (accessed on 4 August 2024).
  • Zahedi, L.; Mohammadi, F.G.; Rezapour, S.; Ohland, M.W.; Amini, M.H. Search algorithms for automated hyper-parameter tuning. arXiv 2021, arXiv:2104.14677.

Classifier | Abbreviated Name | References
1. Light gradient boosting machine | LGBM | [ , , ]
2. Random forest classifier | RF | [ ]
3. Extreme gradient boosting | XGB | [ ]
4. k-nearest neighbors | kNN | [ ]
5. Decision tree | DT | [ ]
6. Artificial neural network or multilayer perceptron | MLP or ANN | [ , ]
7. Naive Bayes classifier | NB | [ ]
8. Support vector machines with linear kernel | Linear SVM | [ ]
9. Support vector machines with RBF kernel | RBF SVM | [ ]
10. Long short-term memory | LSTM | [ ]
11. Convolutional neural network | CNN | [ ]
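
The classifiers listed above are all available in standard Python libraries. As a rough illustration (not the configuration used in the cited study), most of them can be instantiated in a few lines with scikit-learn, LightGBM, and XGBoost; the hyperparameter values below are defaults or placeholders:

```python
# Minimal sketch of instantiating the tabulated classifiers with common Python
# libraries (scikit-learn, lightgbm, xgboost). Hyperparameters are defaults or
# placeholders, not the settings used in the cited study.
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

classifiers = {
    "LGBM": LGBMClassifier(),
    "RF": RandomForestClassifier(n_estimators=100),
    "XGB": XGBClassifier(),
    "kNN": KNeighborsClassifier(n_neighbors=5),
    "DT": DecisionTreeClassifier(),
    "MLP": MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500),
    "NB": GaussianNB(),
    "Linear SVM": SVC(kernel="linear"),
    "RBF SVM": SVC(kernel="rbf"),
}
# The LSTM and CNN models are typically built with a deep learning framework
# (e.g., Keras or PyTorch) rather than scikit-learn and are omitted here.
```
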
Extracted Resources | Task | Model | Results | Ref.
Oil | 1, 2 | Bidirectional LSTM | Acc = 92.69% | [ ]
Oil, gas | 1 | DT, RF | f1 = 0.97 (RF), f1 = 0.94 (DT) | [ ]
Oil | 1, 2 | UL, SL | UL = 80%, SL = 90% | [ ]
Oil | 1, 2 | XGBoost and RF | Acc = 0.882 (XGB) | [ ]
Oil, gas | 1 | kNN, RF, XGB, MLP | Acc = 0.79 | [ ]
Coal | 1 | XGB, RF, ANN | Acc = 0.99 (RF) | [ ]
Oil | 4 | DFFNN, XGB, LR | R = 0.9551 (LR) | [ ]
Oil, gas | 1 | ANN | Acc = 0.88 (ANN) | [ ]
Oil, gas | 1 | Hybrid model based on CNN and LSTM | Acc = 87.3% (CNN-LSTM) | [ ]
Oil, gas | 2 | XGB, LogR | ROC AUC = 0.824 | [ ]
Coal | 1 | LR, SVM, ANN, RF, XGB | Acc > 0.9 | [ ]
Coal | 1 | SVM, MLP, DT, RF, XGB | Acc = 0.8 | [ ]
Oil, gas | 1 | MLP, SVM, XGB, RF | Acc = 0.868 (XGB) and Acc = 0.884 (RF) | [ ]
Geothermal wells | 1 | kNN, SVM, XGB | Acc = 0.9067 (XGB) | [ ]
Sulfide ore | 1 | RF | R > 0.66 between calculated and measured Na concentration in core | [ ]
Oil | 1 | UL | Acc = 0.5840 | [ ]
Oil | 1 | RF | F1 = 0.913 | [ ]
Uranium | 1, 3 | RF, kNN, XGB | Acc = 0.65 (1), Acc = 0.95 (3) | [ ]
Oil, gas | 1 | SVM, RF | Acc = 0.9746 | [ ]
Uranium | 4 | ANN, XGB | R = 0.7 (XGB) | [ ]
Uranium | 4 | XGB, LGBM, RF, DFFNN, SVM | R = 0.710, R = 0.845 (LGBM) | [ ]
Uranium | 1 | kNN, LogR, DT, SVM, XGB, ANN, LSTM | Acc = 0.54 (XGB) | [ ]
Uranium | 5 | SVM, ANN, RF, XGB, LGBM | f1_weighted = 0.72 (XGB) | [ ]
Code | Rock Name | AR (Ohm·m) | Filtration Coefficient (m/day)
1 | gravel, pebbles | medium | 12–20
2 | coarse sand | medium | 8–15
3 | medium-grained sand | medium | 5–12
4 | fine-grained sand | medium | 1–7
5 | sandstones | high | 0–0.1
6 | silt, siltstone | low | 0.8–1
7 | clay | low | 0.1–0.8
8 | gypsum, dolomite | high | 0–0.1
9 | carbonate rocks | high | 0–0.1
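
When the codes above serve as classification targets, a simple lookup is enough to translate model output back into rock names. The sketch below is purely illustrative and is built only from the table (the filtration ranges are kept as comments); the helper name is hypothetical:

```python
# Illustrative mapping from the lithology class codes in the table above to
# rock names; a lookup like this is typically used to decode integer
# predictions back into lithology labels.
ROCK_CODES = {
    1: "gravel, pebbles",      # filtration ~12-20 m/day
    2: "coarse sand",          # ~8-15 m/day
    3: "medium-grained sand",  # ~5-12 m/day
    4: "fine-grained sand",    # ~1-7 m/day
    5: "sandstones",           # ~0-0.1 m/day
    6: "silt, siltstone",      # ~0.8-1 m/day
    7: "clay",                 # ~0.1-0.8 m/day
    8: "gypsum, dolomite",     # ~0-0.1 m/day
    9: "carbonate rocks",      # ~0-0.1 m/day
}

def decode_predictions(y_pred):
    """Translate integer class predictions into rock names."""
    return [ROCK_CODES.get(int(code), "unknown") for code in y_pred]
```
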
Classifier | Acc | F1 (class 1) | F1 (class 2) | F1 (class 8) | F1 (macro) | F1 (micro) | Duration
1. LGBM | 0.868 | 0.88 | 0.668 | 0.891 | 0.813 | 0.868 | 5.757
2. RFC | 0.844 | 0.859 | 0.57 | 0.869 | 0.766 | 0.844 | 94.193
3. XGB | 0.859 | 0.87 | 0.657 | 0.882 | 0.803 | 0.859 | 161.549
4. kNN | 0.666 | 0.71 | 0.396 | 0.658 | 0.588 | 0.666 | 134.631
5. DT | 0.721 | 0.762 | 0.419 | 0.753 | 0.645 | 0.721 | 18.003
6. MLP | 0.809 | 0.825 | 0.628 | 0.825 | 0.759 | 0.809 | 54.788
7. NB | 0.479 | 0.404 | 0.279 | 0.678 | 0.454 | 0.479 | 0.98
8. Linear SVM | 0.750 | 0.775 | 0.461 | 0.757 | 0.664 | 0.750 | 3322.33
9. RBF SVM | 0.799 | 0.812 | 0.560 | 0.818 | 0.730 | 0.799 | 1156.59
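
The columns reported above are standard multi-class metrics: overall accuracy, per-class F1 for selected lithology classes, and macro- and micro-averaged F1 (for single-label problems micro-averaged F1 equals accuracy, which is why the Acc and F1 (micro) columns coincide). A minimal sketch of computing them with scikit-learn, assuming y_true and y_pred are arrays of class codes, could look like this:

```python
# Minimal sketch of computing the metrics reported above with scikit-learn.
# y_true / y_pred are assumed to be arrays of lithology class codes.
from sklearn.metrics import accuracy_score, f1_score

def report_metrics(y_true, y_pred, classes_of_interest=(1, 2, 8)):
    acc = accuracy_score(y_true, y_pred)
    per_class = f1_score(y_true, y_pred, labels=list(classes_of_interest),
                         average=None)                    # one F1 per listed class
    f1_macro = f1_score(y_true, y_pred, average="macro")  # unweighted mean over classes
    f1_micro = f1_score(y_true, y_pred, average="micro")  # equals accuracy for
                                                          # single-label tasks
    return {"Acc": acc,
            **{f"f1_class{c}": v for c, v in zip(classes_of_interest, per_class)},
            "f1_macro": f1_macro,
            "f1_micro": f1_micro}
```
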
Dw_N | Acc | F1 (class 1) | F1 (class 2) | F1 (class 8) | F1 (macro) | F1 (micro) | Duration
0 | 0.79 | 0.828 | 0.505 | 0.685 | 0.672 | 0.79 | 1.651
5 | 0.825 | 0.843 | 0.636 | 0.698 | 0.726 | 0.825 | 6.386
25 | 0.82 | 0.839 | 0.656 | 0.69 | 0.729 | 0.82 | 15.56
50 | 0.827 | 0.842 | 0.664 | 0.696 | 0.734 | 0.827 | 26.56
100 | 0.836 | 0.847 | 0.67 | 0.723 | 0.747 | 0.836 | 48.94
200 | 0.837 | 0.815 | 0.68 | 0.73 | 0.741 | 0.837 | 73.61
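
Dw_N in this table and the two that follow appears to parametrize how much neighboring log data is attached to each sample. Assuming it denotes the number of adjacent depth samples whose readings are appended to the feature vector (an assumption on our part, not stated in this excerpt), a window expansion of that kind could be sketched as follows:

```python
# Hypothetical sketch of a depth-window feature expansion: for each depth
# sample, append the readings of the n preceding and n following samples.
# This is one plausible reading of the Dw_N parameter in the tables, not a
# confirmed description of the preprocessing used in the cited study.
import numpy as np

def add_depth_window(X, n):
    """X: (n_samples, n_logs) array ordered by depth; n: half-window size."""
    if n == 0:
        return X.copy()
    parts = [X]
    for shift in range(1, n + 1):
        shallower = np.roll(X, shift, axis=0)   # values from shallower depths
        deeper = np.roll(X, -shift, axis=0)     # values from deeper depths
        # np.roll wraps around at the array edges; repeat the border rows instead.
        shallower[:shift] = X[0]
        deeper[-shift:] = X[-1]
        parts.extend([shallower, deeper])
    return np.hstack(parts)

# Example: add_depth_window(X, n=5) turns k log curves into k * (2*5 + 1) features.
```
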
Dw_N | Acc | F1 (class 1) | F1 (class 2) | F1 (class 8) | F1 (macro) | F1 (micro) | Duration
0 | 0.779 | 0.803 | 0.498 | 0.647 | 0.649 | 0.779 | 0.902
5 | 0.807 | 0.805 | 0.633 | 0.659 | 0.699 | 0.807 | 2.503
25 | 0.808 | 0.805 | 0.661 | 0.673 | 0.713 | 0.808 | 5.833
50 | 0.814 | 0.815 | 0.668 | 0.681 | 0.721 | 0.814 | 10.47
100 | 0.818 | 0.815 | 0.668 | 0.697 | 0.727 | 0.818 | 18.99
200 | 0.826 | 0.815 | 0.679 | 0.739 | 0.744 | 0.826 | 31.5
Dw_N | Acc | F1 (class 1) | F1 (class 3) | F1 (class 4) | F1 (class 5) | F1 (class 6) | F1 (class 7) | F1 (class 9) | F1 (macro) | F1 (micro) | Duration
5 | 0.565 | 0.705 | 0.292 | 0.159 | 0.433 | 0 | 0.623 | 0 | 0.344 | 0.565 | 120.973
10 | 0.577 | 0.715 | 0.348 | 0.19 | 0.337 | 0.089 | 0.628 | 0 | 0.359 | 0.577 | 114.243
25 | 0.609 | 0.744 | 0.392 | 0.21 | 0.263 | 0.028 | 0.671 | 0 | 0.364 | 0.609 | 260.255
50 | 0.636 | 0.768 | 0.415 | 0.243 | 0.301 | 0.005 | 0.683 | 0 | 0.381 | 0.636 | 435.202
100 | 0.659 | 0.791 | 0.441 | 0.247 | 0.308 | 0 | 0.692 | 0 | 0.391 | 0.659 | 732.618
200 |  | 0.819 | 0.492 | 0.295 | 0.28 | 0.008 |  | 0 | 0.41 | 0.694 | 1319.534

Share and Cite

Mukhamediev, R.; Kuchin, Y.; Yunicheva, N.; Kalpeyeva, Z.; Muhamedijeva, E.; Gopejenko, V.; Rystygulov, P. Classification of Logging Data Using Machine Learning Algorithms. Appl. Sci. 2024, 14, 7779. https://doi.org/10.3390/app14177779
