Data Curation Tools/Techniques

Introduction

Data curation is the process of collecting, munging, and storing data in a database system. Lord and Macdonald (2003) describe data curation as the management of digital data from the moment it is created, gathered, and used for analysis.

This process helps achieve quality datasets through tools like Informatica, a cloud-based data integration platform used to carry out ETL (Extraction, Transformation, and Loading), data masking, data visualization, and master data management.
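To make the ETL idea concrete, the following is a minimal Python sketch (not Informatica itself) that extracts records from a hypothetical CSV file, applies a simple transformation, and loads the result into a local SQLite table; the file name, column handling, and table name are illustrative assumptions only.

```python
import sqlite3
import pandas as pd

# Extract: read raw records from a hypothetical source file.
raw = pd.read_csv("survey_raw.csv")  # illustrative file name

# Transform: standardize column names and drop fully empty rows.
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
clean = raw.dropna(how="all")

# Load: write the curated table into a local SQLite database.
with sqlite3.connect("curated.db") as conn:
    clean.to_sql("survey_clean", conn, if_exists="replace", index=False)
```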

Summary of Themes of Concern in Big Data, Data Management, and Databases

When it comes to using big data, data management systems, and databases, many concerns have been raised over time. Security gaps increase as more data is amassed, and the data management systems available grow more complex as technology advances. There is also concern that, although there are several options for data management systems and databases, a skills gap exists when it comes to utilizing these tools and techniques. Digitalization has resulted in more and more data becoming available and constantly changing. Some may also question how accessible big data is, aside from the compliance hurdles that need to be considered when using it.

Khashan et al. (2021) highlight that the popularity of big data analysis is driving the creation of new big data management standards for databases and other storage solutions, and that it is usually hard to predict how much will be spent on storing the data. The difficulty often stems from identifying what kind of data is being stored and how frequently it will be accessed. Extra skills and costs may be incurred when it becomes necessary to house cold data, which is rarely accessed, in storage media separate from hot data, which is frequently accessed and must sit in fast, easy-to-access storage.
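As a rough illustration of separating hot from cold data, the sketch below tags records by how recently they were accessed so they can be routed to fast or archival storage; the 90-day cutoff, field names, and tier labels are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Hypothetical records carrying a last-accessed timestamp.
records = [
    {"id": 1, "last_accessed": datetime.now() - timedelta(days=2)},
    {"id": 2, "last_accessed": datetime.now() - timedelta(days=400)},
]

HOT_WINDOW = timedelta(days=90)  # assumed cutoff between hot and cold data

def storage_tier(record):
    """Route frequently accessed (hot) data to fast storage, the rest to a cold archive."""
    age = datetime.now() - record["last_accessed"]
    return "fast-storage" if age <= HOT_WINDOW else "cold-archive"

for r in records:
    print(r["id"], storage_tier(r))
```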

There is also the concern of where best to store big data. Should it be kept in the cloud, or in an SQL or NoSQL database management system? Smaller datasets could be kept on an external hard drive or burnt onto a DVD. In addition, there is the concern of compatibility of the portal management interface across all locations where big data is to be stored. If this is not ensured, it will hinder the smooth movement of data from one location to another, thereby restricting data replication.

Comparing How Big Data and Data Management are Interrelated

Jiang’s (2022) research addresses how the implementation of big data strategies worldwide has become relevant in several fields. When comparing big data to data management and how they are interrelated, big data can be viewed as very complex datasets that cannot be managed by traditional data processing software and are usually stored in data warehouses. Data management, on the other hand, involves collecting, keeping, and using data securely, efficiently, and cost-effectively.

In this digital age, data management is very important because proper data management minimizes how many times data is moved by localizing it on desirable platforms using intelligent data management techniques. Proper data management also ensures that data are adequately governed by assigning authority and control to the right people responsible for managing them. Big data comes in structured, unstructured, and sometimes semi-structured formats because it can be gathered from varied sources such as social media, customer databases, mobile apps, emails, and even medical records. Velocity, volume, value, variety, and veracity are some characteristics of big data.

Currently, data are being produced at very high speed and in real time. Since very large volumes of these datasets come from different sources, questions arise about how truthful or accurate they are. However, because of the massive amount of data available to analyze, there is a huge benefit in achieving better results and drawing meaningful insights from big data analytics. Some of the best data management practices include clarifying the research goal and making ethical use of the data, its protection, and its security a priority. The researcher must also seek to use quality data and eliminate duplicated data. If the research is being conducted by a team, the data should be readily accessible to everyone on the team.
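As one example of the deduplication practice mentioned above, a short pandas sketch (with made-up column names and values) that drops exact duplicate rows and reports how many were removed:

```python
import pandas as pd

# Hypothetical respondent data containing one duplicated row.
df = pd.DataFrame({
    "respondent_id": [101, 102, 102, 103],
    "score":         [4,   5,   5,   3],
})

deduped = df.drop_duplicates()  # remove exact duplicate records
print(f"Removed {len(df) - len(deduped)} duplicate row(s)")
```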

There should be a data recovery strategy in place, and the data management software should be of high quality. The different types of data management systems, such as data warehouse systems, marketing technology systems, customer relationship management systems, and analytics tools, come with notable advantages and disadvantages. Although established processes and policies decrease potential errors, increase efficiency, provide the structure for information to be easily shared and accessed, and guarantee data integrity and security, the software and hardware expenditure is costly. New technologies bring high complexity, require frequent updates, and call for skilled database staff to handle the data.

Some of the pros of big data are the ability to identify patterns for better decisions, to analyze, for example, customer behavior to improve customer relations, to gain insights into viewers' viewing habits, and to develop effective risk management processes and strategies that improve productivity and efficiency. On the other hand, data quality for big data can be questionable, and there can be serious security implications requiring special levels of protection. Compliance problems, such as privacy rules and regulatory demands, also arise with big data. Just like data management, big data is capital intensive, and there is a shortage of skill sets for picking the right tools and for designing, deploying, and managing big data analyses.

Use of Data Mining Tools and Techniques to Retrieve Data to Answer the Research Questions

Data mining techniques and processes involve using curated data to carry out data analysis and interpretation and to draw meaningful insights that address the research goals and objectives. Li and Tian (2022) define data mining as mining hidden attributes from massive datasets to carry out decision analysis. There are several data mining tools and techniques available for research work; some require coding expertise, while others require none. MonkeyLearn is one example of a data mining tool that does not require coding expertise. Tools like Weka, H2O, and Orange are open source, while RapidMiner supports customization through Python coding.

The choice of a data mining tool usually depends on the complexity and size of the data. For extremely large-scale and complex data mining tasks, Apache Mahout is appropriate (Anil et al., 2020), but for other tasks such as text mining, MonkeyLearn, a machine learning platform, will be most appropriate. For retrieving data to answer this study’s research question, a free open-source data science platform like RapidMiner, with many algorithms for preparing data, carrying out machine learning tasks, and text mining, will be employed. This platform is easy to use both for non-programmers who want to create predictive workflows and for programmers who want to customize or tailor their data mining through the R and Python extensions the platform offers.

Some data mining techniques, such as clustering analysis and classification, will be used to assemble varied attributes of residential real estate. Through these techniques, different suburbs, for example, may be grouped into a cluster based on their similarities in housing demand. To forecast housing demand post-COVID within the suburbs as well as the business districts, a predictive data mining technique will be very useful; here, the change in current housing demand will be used to predict future housing demand. Lastly, outlier detection techniques will be used to identify any anomalies in the data collected for this study, as shown in the sketches below.
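To illustrate the clustering step, a minimal scikit-learn sketch standing in for the RapidMiner workflow; the suburb labels, the two demand features, the figures, and the choice of three clusters are all invented for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical features per suburb: [weekly listings, demand change %].
suburbs = ["A", "B", "C", "D", "E", "F"]
X = np.array([
    [120, -5.0],
    [115, -4.2],
    [40,  12.5],
    [35,  10.8],
    [80,   1.0],
    [78,   0.4],
])

# Group suburbs with similar housing-demand profiles into three clusters.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for suburb, label in zip(suburbs, labels):
    print(suburb, "-> cluster", label)
```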

For instance, if demand for housing in urban areas decreased tremendously during the pandemic, but the data show a huge spike in housing demand during one week within a business district that experienced a shutdown, such an anomaly can easily be spotted. Data repositories such as data warehouses, data lakes, data marts, and data cubes are data libraries that gather, manage, and store datasets for analysis and reporting. A data warehouse can combine data from different sources, while a data lake stores unstructured data classified and tagged with metadata. Data marts, being subsets, are easier and more secure for the user because only authorized users can access the isolated dataset.
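A simple way to flag the kind of weekly demand spike described above is a z-score check; the sketch below uses invented weekly counts and an assumed 2.5-standard-deviation cutoff, whereas the study itself may rely on RapidMiner's built-in outlier detection.

```python
import numpy as np

# Hypothetical weekly housing-demand counts for one business district.
weekly_demand = np.array([52, 49, 55, 51, 50, 48, 210, 53, 47])

# Flag weeks whose demand lies far from the mean.
z_scores = (weekly_demand - weekly_demand.mean()) / weekly_demand.std()
outlier_weeks = np.where(np.abs(z_scores) > 2.5)[0]  # assumed 2.5-sigma cutoff
print("Anomalous weeks (0-indexed):", outlier_weeks)
```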

For less complex datasets, data cubes, which are lists of data with a few dimensions stored in a table format like a spreadsheet, will be ideal. These repositories make it easier to track problems because of compartmentalization, thereby making data reporting and analysis easier and faster. Ultimately, the EOC Satisfaction Survey dataset was organized and prepared through data cleaning techniques: consistency in coding for the categorical variables was ensured, the missing values were initially replaced with NA before exploring the data to determine whether they needed to be deleted, and the features were then renamed to shorten the long text descriptions of each column.
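A minimal pandas sketch of those cleaning steps; the survey file name, column names, and recodings are hypothetical placeholders rather than the actual EOC Satisfaction Survey schema.

```python
import pandas as pd

# Load the hypothetical survey export (file and column names are illustrative).
df = pd.read_csv("eoc_satisfaction_survey.csv")

# Shorten a long descriptive column name.
df = df.rename(columns={"overall_satisfaction_with_course_experience": "satisfaction"})

# Enforce consistent coding of a categorical variable.
df["satisfaction"] = df["satisfaction"].str.strip().str.lower()

# Mark missing values explicitly before deciding whether to drop them.
df = df.replace({"": pd.NA, "N/A": pd.NA})
print(df.isna().sum())  # inspect missingness per column
```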

Author: Adwoa Osei-Yeboah

References

Anil, R., Capan, G., Drost-Fromm, I., Dunning, T., Friedman, E., Grant, T., Quinn, S., Ranjan, P., Schelter, S., & Yılmazel, Ö. (2020). Apache Mahout: Machine learning on distributed dataflow systems. Journal of Machine Learning Research, 21(127), 1-6.

Jiang, S. (2022). Hotspot Mining in the Field of Library and Information Science under the Environment of Big Data. Journal of Environmental and Public Health, 2022, 1. 10.1155/2022/2802835

Khashan, E., Eldesouky, A., & Elghamrawy, S. (2021). An adaptive spark-based framework for querying large-scale NoSQL and relational databases. PloS One, 16(8), e0255562. 10.1371/journal.pone.0255562

Li, J., & Tian, X. (2022). Research on Embedded Multifunctional Data Mining Technology Based on Granular Computing. Computational Intelligence and Neuroscience, 2022, 1-7. 10.1155/2022/4825079

Lord, P., & Macdonald, A. (2003). e-Science curation report: Data curation for e-Science in the UK: An audit to establish requirements for future curation and provision. JISC.