Data Life Cycle Evaluation

Executive Summary

The data for this research were first acquired through web scraping, then cleaned by deleting rows with empty cells and standardizing the variables. The authors explored the data using unsupervised learning methods before using supervised learning algorithms to make predictions about tourism sustainability. The findings from the research were then disseminated in this publication.

Business Problem

Tourism is a major contributor to economic growth in many countries. Due to Covid-19-related lockdowns, tourism patronage dwindled during the pandemic, but it is expected to return in most countries that used to lead the tourism industry [1]. With this post-Covid rise in tourism, there is an opportunity to revisit sustainability in the tourism and hospitality business sector. There is a need to monitor and measure the environmental impact of tourism going forward using clear, trackable metrics.

This effort requires data; however, the traditional way of collecting data for sustainability assessment is costly. This paper focused on identifying readily available alternative sources of data that can be transformed using data science strategies to enable the assessment of sustainability. The authors thus presented the specific business problem as this question: “Can statistical learning techniques using data from an online tourism platform predict tourist accommodations as sustainable, as indicated by the sustainability label?”

Data Collection Methodology

Access to the data needed to measure, for example, energy efficiency can be very challenging. The data acquisition method used in this article was web scraping to obtain tourist responses to surveys and ratings. Data were collected for different locations from social media sites and through Google. When web scraping is done, missing data are commonly discovered because tourists leave surveys or online reviews incomplete. As part of the data cleaning process, the rows that had empty cells were therefore deleted. Even though this reduced the size of the dataset, it was better to work with a transformed dataset to reduce skewness [2].
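The cleaning steps described above (deleting rows with empty cells, then standardizing the variables) can be illustrated with a minimal sketch. The column names and values below are hypothetical and are not taken from the article's dataset; the authors' actual cleaning code is not shown in the paper.

```python
from statistics import mean, pstdev

def drop_incomplete_rows(rows):
    """Keep only rows in which no cell is None or a blank string."""
    def is_empty(cell):
        return cell is None or (isinstance(cell, str) and cell.strip() == "")
    return [row for row in rows if not any(is_empty(c) for c in row.values())]

def standardize(rows, columns):
    """Return copies of the rows with the given numeric columns as z-scores."""
    stats = {}
    for col in columns:
        values = [float(row[col]) for row in rows]
        stats[col] = (mean(values), pstdev(values))
    scaled_rows = []
    for row in rows:
        new_row = dict(row)
        for col in columns:
            mu, sigma = stats[col]
            # A constant column carries no information; map it to 0.0.
            new_row[col] = (float(row[col]) - mu) / sigma if sigma > 0 else 0.0
        scaled_rows.append(new_row)
    return scaled_rows

# Hypothetical scraped listings (illustrative values only).
listings = [
    {"rating": 4.5, "n_photos": 120, "n_amenities": 30},
    {"rating": None, "n_photos": 15, "n_amenities": 12},
    {"rating": 3.5, "n_photos": 40, "n_amenities": ""},
    {"rating": 4.0, "n_photos": 80, "n_amenities": 20},
    {"rating": 3.0, "n_photos": 40, "n_amenities": 10},
]

clean = drop_incomplete_rows(listings)
scaled = standardize(clean, ["rating", "n_photos", "n_amenities"])
print(len(listings), "rows scraped,", len(clean), "rows kept")
```

Dropping incomplete rows is simple but, as the sketch makes visible, it discards every row with even one missing cell, which is why the reduction in dataset size is worth reporting.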

Initial data exploration is usually carried out to identify outliers in the data. In this article, unsupervised learning methods, namely cluster analysis and Principal Component Analysis (PCA), were employed to explore the data further. This was followed by statistical data analysis. During the analysis, the dataset ought to be sampled in a way that ensures it is well represented, in order to reduce bias [3].
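As an illustration of the cluster analysis mentioned above, the following is a minimal K-means sketch written with the Python standard library only. It is not the authors' implementation: the two-feature points are invented for illustration, and the function name and parameters (`kmeans`, `k`, `n_iter`, `seed`) are assumptions of this sketch.

```python
import random
from math import dist

def kmeans(points, k, n_iter=100, seed=0):
    """Plain K-means: returns (cluster index per point, list of centroids)."""
    rng = random.Random(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignments, centroids

# Hypothetical standardized features for six accommodations.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assignments, centroids = kmeans(points, k=2)
print(assignments)
```

On real data, features should be standardized before clustering, since K-means relies on Euclidean distance and would otherwise be dominated by the variables with the largest scale.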

To predict the sustainability of tourism in the different locations, supervised learning methods were used. Although the authors did not specify how the data were split, it is important to consider splitting the data three ways into a training set, a validation set, and a testing set. This helps prevent overfitting of the models that were built and supports a more reliable final evaluation. The supervised learning models built for the prediction included logistic regression, random forest, and k-nearest neighbors. A text analysis of the tourist hotel and Airbnb reviews, although not conducted in this article, would have added value to the data by highlighting recent patterns of tourist experiences post-Covid.
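The three-way split recommended above can be sketched as follows. Because the article does not report how the data were split, the 70/15/15 proportions, the function name `train_val_test_split`, and the fixed seed are all assumptions made for illustration.

```python
import random

def train_val_test_split(items, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle the items and split them into train, validation, and test lists."""
    if val_frac < 0 or test_frac < 0 or val_frac + test_frac >= 1:
        raise ValueError("val_frac and test_frac must be >= 0 and sum to < 1")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n = len(shuffled)
    n_test = round(n * test_frac)
    n_val = round(n * val_frac)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test

# Hypothetical accommodation IDs standing in for rows of the dataset.
accommodation_ids = list(range(100))
train, val, test = train_val_test_split(accommodation_ids)
print(len(train), len(val), len(test))
```

The validation set is used to choose hyperparameters, and the test set is touched only once at the end. When the sustainability label is rare, a stratified split that preserves the class ratio in each part would be a reasonable refinement.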

Results

The general components of the scientific method require researchers to make observations, ask a question, and form a hypothesis before making and testing a prediction based on that hypothesis. The authors did not clearly form a hypothesis; nonetheless, they posed a question with the end goal in mind and built models to make predictions. The results of the analysis focused on comparing accommodation award status across selected features in the dataset. From the descriptive statistics, the authors observed that GreenLeader accommodations received higher ratings, offered more amenities and a wider range of amenity options, and had more photos uploaded by tourists on the sites.

For the observations made during the data exploration phase, the dimensionality reduction analysis showed four principal components that captured a large proportion of the variation in the dataset. From the hierarchical clustering, the dendrogram showed four clusters representing four distinct sets of accommodation characteristics. Lastly, the K-means clustering highlighted the sustainability-related features shared among the accommodations. Overall, the exploratory analysis suggests that the dataset can be split into at least four groups. The authors do not make any claims about whether a causal relationship exists between these features and the sustainability label; however, they observed that the features and the label tend to co-occur in the data.

For the supervised learning algorithms, the models were evaluated using cross-validated average effect, F2 score, recall, and the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC). The grid search results showed similar performance for most of the models. F2 scores were high across the models, although the Random Forest model was the best at discriminating between the two classes and therefore recorded the highest ROC AUC score. When the authors compared the confusion matrices and the overall prediction performance of the different models, no single model was a clearly better choice than the others. The article implied that the authors were not surprised that the machine learning models could not perfectly differentiate between groups.
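To make the evaluation metrics named above concrete, the sketch below computes recall, the F2 score (the F-beta score with beta = 2, which weights recall more heavily than precision), and ROC AUC from first principles. The confusion-matrix counts, labels, and scores are invented for illustration and are not the article's results.

```python
def recall(tp, fn):
    """Share of truly positive cases that the model flagged as positive."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def fbeta(tp, fp, fn, beta=2.0):
    """F-beta score from confusion-matrix counts; beta=2 gives the F2 score."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

def roc_auc(labels, scores):
    """ROC AUC as the probability that a random positive outranks a random
    negative, counting ties as half (the Mann-Whitney formulation)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical counts: 8 true positives, 4 false positives, 2 false negatives.
rec = recall(tp=8, fn=2)
f2 = fbeta(tp=8, fp=4, fn=2, beta=2.0)

# Hypothetical labels (1 = sustainable) and predicted probabilities.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.6]
auc = roc_auc(labels, scores)
print(round(rec, 3), round(f2, 3), round(auc, 3))
```

Because F2 rewards recall, it suits a setting where missing a sustainable accommodation is considered costlier than a false alarm; ROC AUC, by contrast, summarizes ranking quality across all thresholds.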

Outcome / Conclusion

The researchers’ overall purpose was not to predict and identify individual hotels as sustainable, but to provide a probabilistic assessment of the distribution of sustainable hotels in the region and to analyze whether online platform data can be used to inform about sustainable tourism. The article focused on classifying accommodations by sustainability and offered an alternative to existing methodologies using travel platform data. It also added to the literature by utilizing accommodations’ own presentation and associated user interactions to gain information about their sustainability practices. Although the prediction quality in this article was high, it was not excellent, and the findings were subject to limitations.

Author: Adwoa Osei-Yeboah

References

[1] Tasnim, Z., Shareef, M. A., Dwivedi, Y. K., Kumar, U., Kumar, V., Malik, F. T., & Raman, R. (2022). Tourism sustainability during COVID-19: developing value chain resilience. Operations Management Research, 1-17.
[2] Lun, Z., & Khattree, R. (2021). Imputation for skewed data: Multivariate lomax case. Sankhya B, 83(1), 86-113.
[3] Balayn, A., Lofi, C., & Houben, G. J. (2021). Managing bias and unfairness in data for decision support: a survey of machine learning and data engineering approaches to identify and mitigate bias and unfairness within data management and analytics systems. The VLDB Journal, 30(5), 739-768.