Use OSS data as machine learning training samples

This article describes how to use data stored in Object Storage Service (OSS) as training samples for PAI.

Note

This article was written by Longlin @ AliCloud and is provided for reference only.

Background information

This article uses OSS and PAI to provide decision support for a traditional stationery retail store. The business scenario is as follows (both the scenario and the data are fictional):

A traditional offline stationery retail store wants to use data mining to find strongly related stationery categories so that it can adjust its shelf layout sensibly. However, the store's cash register equipment is old: it uses a POS cash register running Windows XP, and the only sales data available is the order records exported from the POS cash register in CSV format. This article describes how to import this CSV file into OSS and connect OSS with PAI to produce associated product recommendations.

Operation steps

  1. Upload data to a bucket.

    This example uploads a file named Sample_superstore.csv to the bucket examplebucket in the East China 1 (Hangzhou) region.

    1. Create the sample data file Sample_superstore.csv.

       order_id,order_date,customer_id,item,sales,quantity
       1,20240101,1,aa,10,100
    2. Upload Sample_superstore.csv to examplebucket. For details, see Simple upload.

  2. Connect OSS and PAI.

    1. Create a workflow in the East China 1 (Hangzhou) region. For details, see New Custom Workflow.

    2. Click the new workflow, and then select Source/Destination > Read CSV File.

    3. Double-click the Read CSV File component. On the Parameter setting tab of the component panel on the right, set File Path to oss://examplebucket/Sample_superstore.csv and Schema to order_id string,order_date string,customer_id string,item string,sales string,quantity string. Turn on the Whether to ignore the first row of data switch, and keep the default settings for the other parameters.

    4. Right-click the Read CSV File component, and then click Execute this node.

    5. After the execution is complete, right-click the Read CSV File component, and then click View Data > arbitrarily

      View the table information under the component. The data preview shows at most 1,000 records. To view the full table, follow the on-page instructions to use DataWorks.
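To check the sample file locally before uploading it, the following minimal Python sketch parses the same CSV content the way step 2.3 configures the component: every column is kept as a string and the first (header) row is ignored. It uses only the standard library and does not call OSS or PAI; the names SCHEMA, SAMPLE_CSV, and read_orders are illustrative assumptions, not part of either product.

```python
import csv
import io

# Column names taken from the Schema setting in step 2.3; all values stay strings.
SCHEMA = ["order_id", "order_date", "customer_id", "item", "sales", "quantity"]

# The same content as the sample file built in step 1.1.
SAMPLE_CSV = (
    "order_id,order_date,customer_id,item,sales,quantity\n"
    "1,20240101,1,aa,10,100\n"
)

def read_orders(text, ignore_first_row=True):
    """Parse CSV text into a list of dicts keyed by SCHEMA, keeping values as strings."""
    rows = list(csv.reader(io.StringIO(text)))
    if ignore_first_row:
        rows = rows[1:]  # drop the header line, like the "ignore the first row" switch
    return [dict(zip(SCHEMA, row)) for row in rows]

orders = read_orders(SAMPLE_CSV)
print(orders)
```

If the switch for ignoring the first row were left off, the header line would be read as a data record, which is why the step turns it on.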

Data exploration process

The main algorithm component used in this article is collaborative filtering. For detailed usage of this component, see Collaborative filtering for product recommendation.

The data exploration process in this case is as follows:

This case splits the data at an 8:2 ratio. The ID column is set to order_id so that an order containing multiple items is not split apart, as shown in the following figure:
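The idea of splitting by order_id can be sketched in plain Python. This is only an illustration of a group-wise split under stated assumptions, not the implementation of the PAI split component; the function name split_by_order, the seed parameter, and the made-up rows are hypothetical.

```python
import random

def split_by_order(rows, ratio=0.8, seed=0):
    """Split rows into two parts by order_id so that no order is divided."""
    order_ids = sorted({row["order_id"] for row in rows})
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    rng.shuffle(order_ids)
    cut = int(len(order_ids) * ratio)
    first_ids = set(order_ids[:cut])
    first = [r for r in rows if r["order_id"] in first_ids]
    second = [r for r in rows if r["order_id"] not in first_ids]
    return first, second

# Made-up data for illustration: 10 orders, each containing two items.
rows = [
    {"order_id": str(i), "item": item}
    for i in range(1, 11)
    for item in ("paper", "stapler")
]
part_a, part_b = split_by_order(rows)
print(len(part_a), len(part_b))
```

In this sketch the 8:2 ratio applies to the number of orders rather than to individual rows, which is what keeps the items of one order together.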

This case contains 17 products (items). The collaborative filtering component is used to find, for each item, the item with the highest similarity. The results are as follows:
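As a rough illustration of item-to-item similarity, the sketch below scores each pair of items by the Jaccard similarity of the sets of orders that contain them, and then picks the most similar item. The choice of Jaccard similarity is an assumption made for this example and may differ from the measure the PAI component uses; the function names and the order data are made up and are not the article's data.

```python
from collections import defaultdict
from itertools import combinations

def jaccard_item_similarity(rows):
    """Return {(item_a, item_b): similarity} based on the orders two items share."""
    orders_by_item = defaultdict(set)
    for row in rows:
        orders_by_item[row["item"]].add(row["order_id"])
    sims = {}
    for a, b in combinations(sorted(orders_by_item), 2):
        shared = orders_by_item[a] & orders_by_item[b]
        union = orders_by_item[a] | orders_by_item[b]
        sims[(a, b)] = len(shared) / len(union)
    return sims

def most_similar(item, sims):
    """Return the (other_item, score) pair with the highest similarity to item."""
    candidates = [
        (b if a == item else a, score)
        for (a, b), score in sims.items()
        if item in (a, b)
    ]
    return max(candidates, key=lambda pair: pair[1])

# Made-up orders for illustration only.
rows = [
    {"order_id": "1", "item": "paper"},
    {"order_id": "1", "item": "stapler"},
    {"order_id": "2", "item": "paper"},
    {"order_id": "2", "item": "stapler"},
    {"order_id": "3", "item": "paper"},
    {"order_id": "3", "item": "pen"},
]
sims = jaccard_item_similarity(rows)
best = most_similar("paper", sims)
print(best)
```

Items that often appear in the same order get a score close to 1, and items that never appear together get 0.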

Conclusion

Machine learning shows that "paper" and "stapler" are highly similar to each other, and that both are also highly similar to other products.

Based on this data, the stationery retail store can lay out its shelves in one of two ways:

  • Place the paper and stapler shelves in the middle and arrange the shelves of other products in a ring around them, so that customers can quickly find the strongly correlated paper and staplers no matter which shelf they enter from.

  • Place the paper and stapler shelves at opposite ends of the store. Customers who buy one must cross the entire store to buy the other, and passing the shelves of other products on the way can increase the cross-purchase rate. However, this layout sacrifices shopping convenience and should be used with caution in practice.