Data Optimization For Data Analysis

Business Need

Electronic commerce (e-commerce) essentially involves buying or selling products online or over the Internet. Online stores such as Amazon, eBay, Shopify, or Olist are some forms of e-commerce websites that will be explored in this paper.

Since these e-commerce sites, e.g. Amazon, connect small businesses from different locations, merchants can sell their products through these platforms and ship them directly to the customers using these e-commerce logistics. The objective of this study is to determine how e-commerce organizations can determine future sales whilst discovering product categories that are more prone to customer dissatisfaction.

Whenever a customer purchases a product, the seller receives a notification and fills the order for the customer. For these e-commerce companies to ensure repeated purchases from their customers, it is important for them to access the customer’s journey, and identify which products are profitable and had the best reviews.

This study addresses some of the business problems most e-commerce organizations could face. Understanding this phenomenon is critical because Fedirko et al’s research predicted that in the next 3 years, online sales would increase rapidly, and data-driven solutions are needed to increase the usability of e-commerce systems and their profitability (Fedirko et al, 2021).

Research question (RQ):

RQ1: How can e-commerce platforms monitor the order status of their customers?

RQ2: What features within the e-commerce industry impact the freight values of the products ordered?

RQ3: What is the review score/ratings received on the orders placed with the sellers?

Data Selection Process

For e-commerce organizations to be able to estimate future sales of their products, it will require them to gather historical metrics like previous sales records as well as current sales records. For this study, Kaggle.com is a potential source of the e-commerce dataset to be used (Kaggle.com, 2022). With an appropriate dataset, future sales predictions are made right and will reduce the risks of wastage and optimize inventory levels whilst increasing customer satisfaction and reducing the expenses of the organization.

Records of understocking and overstocking at different times alongside the sales patterns, and product purchase history during these periods go a long way to identifying which products are seasonal products and which categories of products are purchased on a regular basis. By monitoring the sales channels and stock levels, e-commerce platforms will ensure that they always have the right amount of inventory demand. In addition, by collecting customer feedback through survey administration, and reviewing chat logs and customer ratings, these companies can gather meaningful information that influences the future sales prediction of their customers.

Data Cleaning Process & Data Preparation for Analysis

Data preparation processes ensure raw datasets are in a format and/or transformed into formats suitable for further processing or analysis. According to Zhang et al, data preparation is a fundamental stage of data analysis since about 80% of the total data engineering process comprises data cleaning and preparation (Zhang et al., 2003). The steps used for data preparation ensure quality data is utilized in the data analysis phase since the quality of the input data strongly influences the quality of the results achieved from the analysis.

The data preparation framework includes data selection, data integration, data transformation, data cleaning, and lastly data reduction (Sattler & Schallehn, 2001). Sidhu (2021) explains the data preparation phase as a way of understanding a dataset more before proceeding to further analyze it. And this part of the methodology should answer the question ‘Is the data that you collected representative of the problem to be solved?’ (Sidhu, 2021). The Olist e-commerce dataset of 100k orders from 2016 to 2018 is representative of the business problem the study will be addressing.

The variables in the files comprised order status, price, payment and freight performance, customer location, product attributes, customer reviews, and geolocation dataset using Brazilian zip codes. The 9 different files in which the different features were collected consist of the customer file, geolocation file, order items file, order review file, orders file, products file, sellers file, and the product category name translation file. Since all these features were divided into several files, there was the need to merge the different files using a designed schema where two datasets could be joined based on a common variable.


Code1: Loading the Datasets

(9 Different Files in the Olist E-Commerce Dataset were used in the Study)

custrs = pd.read_csv(‘olist_customers_dataset.csv’).sample(n=5000, replace=True, random_state=1)

geoLoc = pd.read_csv(‘olist_geolocation_dataset.csv’).sample(n=5000, replace=True, random_state=1)

ordItm = pd.read_csv(‘olist_order_item_dataset.csv’).sample(n=5000, replace=True, random_state=1)

ordPym = pd.read_csv(‘olist_order_payments_dataset.csv’).sample(n=5000, replace=True, random_state=1)

ordRev = pd.read_csv(‘olist_order_reviews_dataset.csv’).sample(n=5000, replace=True, random_state=1)

orders = pd.read_csv(‘olist_orders_dataset.csv’).sample(n=5000, replace=True, random_state=1)

prodts = pd.read_csv(‘olist_products_dataset.csv’).sample(n=5000, replace=True, random_state=1)

selers = pd.read_csv(‘olist_sellers_dataset.csv’).sample(n=5000, replace=True, random_state=1)

pdtCat = pd.read_csv(‘product_category_name_transaction.csv’).sample(n=5000, replace=True, random_state=1)


Code2: Merging of Product Orders File to Order Payment file code

df_ord_ordPym = orders.merge(ordPym, on =‘order_id’)


This Code2 helps in identifying how best to join the different files based on their common feature. For instance, the orders dataset and the order payment dataset were merged on ‘order_id’. The order items and the seller’s dataset could also be joined on the seller id whilst joining the order dataset to the order item dataset on the order id. Prior to proceeding to use the derived datasets from merging in analysis, it will be more appropriate to clean it.

Lord and Macdonald (2003) describe data curation/cleaning to be a process of managing digital data as soon as it is created, gathered, and used for analysis (Lord & Macdonald, 2003). As part of the data-cleaning steps, there was a need to check for null values, the data types of the variables under consideration, the existence of potential noise, and outliers using boxplots and how to delete them.

Based on the research question of predicting future sales/prices whilst discovering product categories that are more prone to customer dissatisfaction in e-commerce, some variables were dropped. Due to the size of the dataset, random sampling of 5000 datasets was carried out of the sampled data to be utilized in the analysis. Entries with missing data or empty cells as well were replaced by their mean values for numeric variables and whilst the duplicated data were dropped since the dataset is large enough to enable us to proceed in this direction.

Data Modeling

To make sense of the e-commerce dataset, and as part of the exploratory data analysis phase, an initial investigation of the dataset would be carried out to discover insights and patterns and spot unique features (Behrens, 1997). This process will be carried out using Python coding with appropriate libraries like NumPy, Pandas, Matplotlib, and Seaborn in a Jupiter notebook. Because according to Fedirko et al (2021) research, aside from Python being a more popular tool, python is faster and easily deals with Big Data (Fedirko et al, 2021).

To predict e-commerce organizations’ future sales, Univariate analysis like the histogram of the data could be coded to explore how the frequency of values is represented, whilst a box plot could be used to check for potential outliers. A bivariate analysis would follow with a Pairplot and a Heatmap of Pearson correlation. This was used to compare two features to each other and determine potential relationships and correlations that may exist between them. Multivariate analysis such as Linear Regression was explored. As multiple variables are being evaluated together, a possible association between these variables was identified.

The results of the descriptive statistics provide basic information about the variables – orders, products, and sellers in the dataset. From the frequency distribution table, the count of orders, percentages, and frequency of products sold could be determined to help understand how often this trend occurs within a given time. The measure of central tendencies such as the mean, median, and mode would provide an estimation of the average sales of the most patronized seller. The measures of dispersion – range, variance, and standard deviation would be used to ascertain how dispersed the data are. This information is helpful because the median would be chosen over the mean if the data is spread out since data dispersion affects the mean.

Ultimate Findings

From the exploratory analysis, a histogram of the prices of the products was right-skewed. The descriptive statistics also shows a mean price of products of about $55 with a standard deviation of 35. In addressing research question 1, which enquires about how e-commerce platforms could monitor the order status of their new and existing customers, it was discovered from the study that there were 6 order statuses for the customers. These comprise delivered, shipped, canceled, invoiced, processing, or unavailable status. From the order status Python output, 2 orders were canceled, 231 were delivered, 1 was invoiced, 2 were being processed and I was unavailable. It could be inferred from these results that more orders were being delivered to the customers.

In addressing research question 2, it was necessary to explore the features within the e-commerce industry that impact the freight values of the products ordered. Whilst analyzing the dimensions of the products ordered, most of them weighed between 45g to 60g. Most of the product description lengths fell below 600 words with an average of approximately 2 photos uploaded per product although the maximum number of photos uploaded is around 6. The heat map showed a mild positive correlation (0.35) between the freight value of a product and its price. There is also a slight positive correlation (0.39) between the freight value of a product that has been ordered and how much the product weighs.

Although product description length, product height, and product width do have some correlation with freight value, it is very small compared to the correlation explained earlier between the freight value and the weight of the product. The linear regression plot showed that a heavier product does not necessarily result in a larger freight value. There were some products with very lower weight readings but extremely higher freight values and other products that seemed very heavy but had an average freight value. Several factors might have played a role like the geographical location where an order is to be shipped or how fragile the product is.

The last research question for this study required the determination and analysis of the review score/ratings received on the orders placed with the sellers. From the review score summary statistics in Figure 10, it shows an approximation of almost 4 ratings for the products. Overall, the ratings range from 1 to 5 with 1 being the minimum review score and 5 being the highest review score value. for the review scores customers gave the products. 127 products received a review score of 5, 52 products received a review score of 4, 18 products rated 3 for review score, only 3 products received a review score of 2 and finally, the remaining review score of 1 was given to 38 products.

Conclusion

Ultimately, all three research questions were addressed with the analyses that were carried out. Out of the 37 unique product categories, Cama Mesa Banho and Belza saude product categories represented the majority of the products whilst Fashion product categories were among the lowest number of products for the Olist e-commerce dataset. The maximum price of the order was around $155, the minimum price at $7.8, and the mean price at $55.

There were a total of about 2478 sellers who come from 22 different states although 60.6% were from SP state which represented the highest proportion for the different states the sellers were located. Even though the average review score was recorded at 4 out of a total score of 5, it will be interesting to carry out further study on which product categories were rated at 4 or 5 for the different sellers. This will help the e-commerce industry make better-informed decisions on selecting sellers/vendors on their platform.

Author: Adwoa Osei-Yeboah
References

Behrens, J. T. (1997). Principles and Procedures of Exploratory Data Analysis. Psychological Methods, 2(2), 131-160. 10.1037/1082-989X.2.2.131

Fedirko et al. (2021). Data Science and Marketing in E-Commerce Amid COVID-19 Pandemic. European Research Studies Journal, 26(2, 2021), 3-16.

Kaggle.com. (2022). Brazilian E-Commerce Public Dataset by Olist. Retrieved February 22, 2023, from https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce?resource=download

Lord, P., & Macdonald, A. (2003). e-Science Curation Report Data curation for e-Science in the UK : an audit to establish requirements for future curation and provision. Intellectual Property, 1-84.

Sattler, K., & Schallehn, E. (2001). A data preparation framework based on a multidatabase language. Paper presented at the 219-228. 10.1109/IDEAS.2001.938088 https://ieeexplore.ieee.org/document/938088

Sidhu, R. (2021, Apr 24,). Life Cycle of a Data Science Project. https://ieeexplore.ieee.org/document/9442519 Zhang, S., Zhang, C., & Yang, Q. (2003). Data preparation for data mining. Applied Artificial Intelligence, 17(5-6), 375-381. 10.1080/713827180