Use OSS data as machine learning training samples

This article describes how to use data stored in OSS as training samples for PAI.

Note

This article was written by Longlin @ AliCloud and is provided for reference only.

Background information

This article uses OSS and PAI to provide decision support for a traditional stationery retail store. The business scenario is as follows (both the scenario and the data are fictional):

A traditional offline stationery retail store wants to use data mining to find strongly related stationery categories so that it can adjust its shelf layout sensibly. However, its cash register equipment is outdated (a POS cash register running Windows XP), so the only sales data available is a single order record file in CSV format exported from the POS cash register. This article describes how to import this CSV file into OSS, connect OSS and PAI, and implement product association recommendation.

Operation steps

  1. Upload the data to the bucket.

    This example uploads the file Sample_superstore.csv to the bucket examplebucket in the East China 1 (Hangzhou) region.

    1. Construct the sample data for the Sample_superstore.csv file.

       order_id,order_date,customer_id,item,sales,quantity
       1,20240101,1,aa,10,100

    2. Upload the Sample_superstore.csv file to examplebucket. For details, see Simple upload.

  2. Connect OSS and PAI.

    1. Create a new workflow in East China 1 (Hangzhou). For details, see New Custom Workflow.

    2. Click the new workflow, and then select Source/Destination > Read CSV File.

    3. Double-click the Read CSV File component, click the Parameter setting tab, set File Path to oss://examplebucket/Sample_superstore.csv, set Schema to order_id string,order_date string,customer_id string,item string,sales string,quantity string, and turn on the Whether to ignore the first row of data switch. Keep the default configuration for the other parameters.

    4. Right-click the Read CSV File component, and then click Execute this node.

    5. After the execution is completed, right-click the Read CSV File component, and then click View Data > arbitrarily

      View the table information under the component. The data preview supports only 1,000 records. To view the full table, follow the on-page instructions to go to DataWorks.
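
As a local sanity check of the steps above, the following minimal sketch writes the sample file and reads it back using the same schema string and skip-first-row setting that are configured in the Read CSV File component. It uses only the Python standard library and does not call OSS or PAI. The helper names `parse_schema`, `read_csv_with_schema`, and `write_sample` are hypothetical, and the sketch assumes every column is declared as `string`, as in the schema above.

```python
import csv
import os
import tempfile

# Schema string as entered in the Read CSV File component.
SCHEMA = ("order_id string,order_date string,customer_id string,"
          "item string,sales string,quantity string")

# Sample content of Sample_superstore.csv: a header row plus one record.
SAMPLE_ROWS = [
    ["order_id", "order_date", "customer_id", "item", "sales", "quantity"],
    ["1", "20240101", "1", "aa", "10", "100"],
]


def parse_schema(schema):
    """Split 'name type,name type,...' into a list of (name, type) pairs."""
    pairs = []
    for field in schema.split(","):
        name, col_type = field.split()
        pairs.append((name, col_type))
    return pairs


def write_sample(path):
    """Write the sample rows to path as CSV."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(SAMPLE_ROWS)


def read_csv_with_schema(path, schema, ignore_first_row=True):
    """Read a CSV file into dicts keyed by the schema's column names."""
    names = [name for name, _ in parse_schema(schema)]
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.reader(f))
    if ignore_first_row:
        rows = rows[1:]  # drop the header row, mirroring the component's switch
    return [dict(zip(names, row)) for row in rows]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        local_path = os.path.join(tmp, "Sample_superstore.csv")
        write_sample(local_path)
        for record in read_csv_with_schema(local_path, SCHEMA):
            print(record)
```

If the header row were not skipped, it would be read as an ordinary data record, which is why the switch is turned on in step 3.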

Data exploration process

The main algorithm component used in this article is collaborative filtering. For detailed usage of this component, see Collaborative filtering for product recommendation.

The data exploration process in this case is as follows:

In this case, the source data is split into a training set and a test set at a ratio of 8:2. Because one order may contain multiple items, order_id is selected as the ID column so that an order containing multiple items is never split across the two sets, as shown in the accompanying figure.
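
The idea of splitting on order_id can be sketched in plain Python. This is an illustration only, not the PAI split component: the function name `split_by_order`, the fixed seed, and the choice to apply the 8:2 ratio to distinct orders rather than to rows are all assumptions.

```python
import random


def split_by_order(rows, train_ratio=0.8, seed=42):
    """Split rows into (train, test) so each order_id lands in exactly one set."""
    # Work on distinct order IDs, not rows, so an order is never divided.
    order_ids = sorted({row["order_id"] for row in rows})
    rng = random.Random(seed)  # fixed seed (assumption) for a repeatable split
    rng.shuffle(order_ids)
    n_train = round(len(order_ids) * train_ratio)
    train_ids = set(order_ids[:n_train])
    train = [row for row in rows if row["order_id"] in train_ids]
    test = [row for row in rows if row["order_id"] not in train_ids]
    return train, test


if __name__ == "__main__":
    # Hypothetical data: order i contains (i % 3) + 1 items.
    rows = [
        {"order_id": str(i), "item": f"item{j}"}
        for i in range(10)
        for j in range(i % 3 + 1)
    ]
    train, test = split_by_order(rows)
    print(len({r["order_id"] for r in train}), "orders in the training set")
    print(len({r["order_id"] for r in test}), "orders in the test set")
```

Splitting row by row instead could place two items from the same order in different sets, which would hide their co-occurrence from the training data.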

There are 17 product items in this case. The collaborative filtering algorithm component is used to find, for each item, the item with the highest similarity. The results are shown in the accompanying table.
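
To make "the item with the highest similarity" concrete, here is a minimal sketch of item-to-item similarity computed from order co-occurrence. The source does not state which similarity measure the PAI component uses, so Jaccard similarity is used here purely as an assumption. The function names and the four sample orders are hypothetical and are not the article's data.

```python
from collections import defaultdict
from itertools import combinations


def item_similarities(rows):
    """Return {(item_a, item_b): Jaccard similarity of the order sets of both items}."""
    orders_by_item = defaultdict(set)
    for row in rows:
        orders_by_item[row["item"]].add(row["order_id"])
    sims = {}
    for a, b in combinations(sorted(orders_by_item), 2):
        shared = len(orders_by_item[a] & orders_by_item[b])
        total = len(orders_by_item[a] | orders_by_item[b])
        # Store both directions so lookups work for either item.
        sims[(a, b)] = sims[(b, a)] = shared / total
    return sims


def most_similar(rows):
    """For each item, return (other_item, score) with the highest similarity."""
    best = {}
    for (a, b), score in sorted(item_similarities(rows).items()):
        if a not in best or score > best[a][1]:
            best[a] = (b, score)
    return best


if __name__ == "__main__":
    # Hypothetical orders, for illustration only.
    rows = [
        {"order_id": "1", "item": "paper"},
        {"order_id": "1", "item": "stapler"},
        {"order_id": "2", "item": "paper"},
        {"order_id": "2", "item": "stapler"},
        {"order_id": "3", "item": "paper"},
        {"order_id": "3", "item": "pen"},
        {"order_id": "4", "item": "pen"},
    ]
    for item, (other, score) in sorted(most_similar(rows).items()):
        print(f"{item}: most similar to {other} ({score:.2f})")
```

Two items score high under this measure when most orders that contain one of them also contain the other.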

Conclusion

Through machine learning, we found that "paper" and "stapler" are highly similar to each other, and that both are also highly similar to other products.

Based on this data, the stationery retail store can lay out its shelves in two ways:

  • Place the paper and stapler shelves in the middle and arrange the shelves of other products in a ring around them, so that no matter where customers enter, they can quickly find the highly correlated paper and stapler.

  • Place the paper and stapler shelves at opposite ends of the stationery store. Customers who buy one must cross the entire store to buy the other, and passing the shelves of other products on the way can improve the cross-purchase rate. However, this layout sacrifices shopping convenience for customers and should be applied with caution in practice.