Understanding Data in One Article (2): Data Collection

Editor’s note: With the arrival of the “Ministry” era, our lives have become inseparable from data. But do you really understand data? This series reinterprets the concept and value of data, and how that value is applied and elevated in the “Mini IT” era. Because there is a lot of ground to cover, the author will explain it over several installments.

I. Introduction

In the previous article of this series, part (1) on data sources, we learned that “data” is a huge system (as shown below), and used the example of a vegetable market to explain what a data source means. Today Xiao Chen will explain how, once we have arrived at the designated “vegetable market”, we actually “buy the vegetables”: that is, the process of data collection.

II. Data collection (buying vegetables)

Let’s first introduce a simple classification of data collection, and then go through the key points of each collection method.

1. By data collection method

Offline (questionnaires, field surveys). Note: follow the 5 elements!

The 5 elements:

1) Stay close to the topic and purpose

An important criterion for judging the quality of a questionnaire is whether its content fits the research topic. However exquisitely a questionnaire is designed, it is worthless if it is unrelated to the topic, because the essential purpose of a questionnaire survey is to examine the relevant factors and uncover the relationships behind the surveyed group.

For example, a user satisfaction survey generally involves two dimensions: the product itself (price, packaging, etc.) and audience characteristics (age, region, psychological satisfaction, etc.).

2) Easy to read, easy to understand, and general

After the questionnaire is distributed, respondents have to fill it in, so how well they understand it ultimately determines the quality of the results.

A questionnaire is not an academic paper. It does not need to show off expertise or use obscure wording; what matters is that respondents can really understand it.

Generality means that each question and its options apply to every respondent. For example, suppose a residents’ questionnaire asks “Which mode of transport do you think is the safest?” with the options A. train, B. airplane, C. BMW car, D. electric vehicle. Option C is not general, and it is not on the same dimension as A, B, and D.

3) Take full account of the respondents’ characteristics

When using a questionnaire, design it around the characteristics of the surveyed group. For example, a written questionnaire is not advisable for preschool children or the elderly. Instead, fully consider their language preferences (some elderly people may not speak standard Mandarin but are fluent in a dialect) and how well they can understand the content, and then send a survey team to interview them in person.

4) Consider the order of questions (step by step)

Besides making each question well-formed and reasonable, the questionnaire also needs logic and coherence between one question and the next. Avoid jumping back and forth frequently between dimensions such as time, place, and people.

5) Consider statistical convenience

Besides the respondents, questionnaire design also needs to fully consider the statistical analysis that follows once the questionnaires are collected. To minimize the later workload, do not set too many variables; aim to obtain the needed label information efficiently with as few variables as possible, which helps the later qualitative analysis.
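To illustrate why a small number of pre-coded variables makes later statistics easier, here is a minimal Python sketch. The question, the option codes, and the sample answers are all made up for illustration; they do not come from any real survey.

```python
from collections import Counter

# Hypothetical closed-ended question with pre-coded options (an assumption).
OPTIONS = {
    "A": "Very satisfied",
    "B": "Satisfied",
    "C": "Neutral",
    "D": "Dissatisfied",
}


def tally(responses):
    """Count valid coded answers; return (counts, number_of_invalid_answers)."""
    cleaned = [r.strip().upper() for r in responses]
    valid = [r for r in cleaned if r in OPTIONS]
    return Counter(valid), len(responses) - len(valid)


def shares(counts):
    """Percentage of valid answers for every option, including unused ones."""
    total = sum(counts.values())
    return {code: (100.0 * counts[code] / total if total else 0.0) for code in OPTIONS}


# Made-up answers as they might come back from collected questionnaires.
sample = ["A", "b", "B", "C", "A", "x", "D", "A"]
counts, invalid = tally(sample)
for code, pct in shares(counts).items():
    print(f"{code} ({OPTIONS[code]}): {counts[code]} answers, {pct:.1f}%")
print("invalid answers:", invalid)
```

Because every answer is one of a few fixed codes, counting and computing shares takes only a few lines; free-text answers would need manual coding first.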

Online (subdivided by data collection endpoint)

App side (primary): obtain the relevant data through data tracking points (event tracking):

First, a quick primer for everyone: what is a tracking point, and why does the app side need to pay special attention to it?

A tracking point is a way of collecting data while users are using the app, in order to optimize the product and its operations. Most apps carry their own services and ways of making money (for example Taobao, Get, etc.). To drive conversion and guide purchases, a “point” is embedded in specific interactive components (for example a jump link or a purchase button), and metrics such as PV, UV, dwell time, bounce rate, and purchase rate are then quantified.

In terms of form, tracking is mainly divided into the following types:
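As a rough illustration of how such metrics can be quantified from collected events, here is a minimal Python sketch. The event log and its field layout are made up, and the metric definitions used (PV as total page views, UV as distinct users, bounce rate as the share of sessions with a single page view, purchase rate as buyers divided by UV) are common conventions assumed here, not definitions given in this article.

```python
from collections import defaultdict

# Made-up event log: (user_id, session_id, event, target).
EVENTS = [
    ("u1", "s1", "page_view", "home"),
    ("u1", "s1", "page_view", "item_42"),
    ("u1", "s1", "purchase", "item_42"),
    ("u2", "s2", "page_view", "home"),
    ("u3", "s3", "page_view", "home"),
    ("u3", "s3", "page_view", "item_7"),
    ("u2", "s4", "page_view", "item_42"),
]


def summarize(events):
    """Compute PV, UV, bounce rate and purchase rate under the assumed definitions."""
    views = [e for e in events if e[2] == "page_view"]
    pv = len(views)                      # every page view counts
    uv = len({e[0] for e in views})      # each user counts once

    views_per_session = defaultdict(int)
    for _user, session_id, _event, _target in views:
        views_per_session[session_id] += 1
    bounced = sum(1 for n in views_per_session.values() if n == 1)
    bounce_rate = bounced / len(views_per_session) if views_per_session else 0.0

    buyers = {e[0] for e in events if e[2] == "purchase"}
    purchase_rate = len(buyers) / uv if uv else 0.0

    return {"pv": pv, "uv": uv, "bounce_rate": bounce_rate, "purchase_rate": purchase_rate}


metrics = summarize(EVENTS)
print(metrics)
```

Dwell time is left out of the sketch because it would need timestamps on each event, which this made-up log does not carry.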

Code-based tracking: when an operation on a control occurs, the data is sent by pre-written code. Baidu Tongji (Baidu Statistics) and Umeng currently provide this service.

Let’s take an example: suppose we want to count the clicks on a certain button in the Taobao app. When the button is clicked, the onclick function bound to that button can call the data-sending interface provided by the SDK to send the data.

Advantages: you control when data is sent, and custom event attributes can be recorded in detail. Disadvantages: time and labor costs, and the timeliness of data transmission.
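The button example can be sketched in a few lines of Python. This is only an illustration of the pattern: the `Tracker` class, its `track` method, the `Button` class, and the event and property names are all hypothetical stand-ins, not the API of Baidu Tongji, Umeng, or any real SDK.

```python
import json
import time


class Tracker:
    """Hypothetical stand-in for an analytics SDK (not a real vendor API)."""

    def __init__(self, send=print):
        # In a real SDK this would upload to a server; here it is injectable.
        self._send = send

    def track(self, event, **properties):
        payload = {"event": event, "ts": time.time(), "properties": properties}
        self._send(json.dumps(payload, sort_keys=True))
        return payload


class Button:
    """Minimal UI control that calls its handler when clicked."""

    def __init__(self, name, on_click):
        self.name = name
        self.on_click = on_click

    def click(self):
        self.on_click(self)


sent = []                                  # collects what would be uploaded
tracker = Tracker(send=sent.append)


def on_buy_click(button):
    # The tracking call is written by hand inside the click handler.
    tracker.track("button_click", button=button.name, page="item_detail")
    # ... the button's real business logic would follow here ...


buy_button = Button("buy_now", on_buy_click)
buy_button.click()
buy_button.click()
print(len(sent), "events recorded")
```

The sketch also shows where the cost comes from: every control that should be counted needs its own hand-written call, with its own custom properties.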

Visual tracking: using visual interaction, control operations and events are configured through a visual interface, and the data is then captured in the background. For example, when a user refreshes multiple times, big data algorithms can be combined to compute the user’s preferences, switch the pushed content and products, and then automatically switch to the corresponding personalized recommendation page.

Advantages: low cost and fast; product and marketing staff can also take part; the development burden is reduced. Disadvantages: little behavioral information is recorded, and few kinds of analysis are supported.

Tracking-free collection (no manual tracking points): when UI elements are presented to the user, the platform triggers events through the controls. When an event is triggered, the system exposes a corresponding interface that lets developers handle the behavior. Once the UI is laid out, the system can automatically generate a unique ID for each control. The IDs are generated inside the program; as long as these IDs are guaranteed to be the same on different phones, user data can be collected without tracking points.

Advantages: no manual tracking needed, convenient. Disadvantages: little behavioral information is recorded, and the transmission pressure is high.
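One way to picture the “same ID on different phones” requirement is to derive each control’s ID from its position in the view tree rather than from anything device-specific. The Python sketch below assumes that approach; the path format, the `control_id` function, and the `AutoTracker` class are hypothetical and are not taken from any particular platform.

```python
import hashlib


def control_id(view_path):
    """Derive a stable ID from a control's path in the view tree."""
    canonical = "/".join(view_path)
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:12]


class AutoTracker:
    """Records every control event without per-control tracking code."""

    def __init__(self):
        self.events = []

    def on_event(self, view_path, action):
        # Called by the (hypothetical) UI framework for every control event.
        self.events.append({"control_id": control_id(view_path), "action": action})


# The same button, reached through the same view path on two devices.
buy_button_path = ["ItemDetailPage", "Footer", "Button[1]"]
phone_a = AutoTracker()
phone_b = AutoTracker()
phone_a.on_event(buy_button_path, "click")
phone_b.on_event(buy_button_path, "click")

same = phone_a.events[0]["control_id"] == phone_b.events[0]["control_id"]
print("IDs match across devices:", same)
```

Because the ID depends only on the path, both devices report the same control, and no tracking call has to be written per button. The trade-off mentioned above is also visible: each event carries only an ID and an action, and every interaction is recorded.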

Web side: web crawlers (Python, C …):

Because the tools differ, specific syntax cannot be covered here (everyone can search CSDN for the language they use), but the overall methodology is the same.

Methodology: manually determine the dimensions of the information to crawl → analyze how the target website’s URLs are composed → choose a crawling tool → write the program → fetch the data → save it locally → subsequent data mining.
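The methodology can be sketched with Python’s standard library alone. This is a minimal illustration under stated assumptions: the base URL and its `page` parameter are placeholders, the “dimension” being collected is simply link targets and link text, and a real target site would need its own URL analysis and parsing rules (and its terms of use should be respected).

```python
import csv
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urlencode

BASE_URL = "https://example.com/list"  # placeholder, not a real target site


def build_url(page):
    """Step 2: assume the list pages differ only by a 'page' query parameter."""
    return f"{BASE_URL}?{urlencode({'page': page})}"


class LinkParser(HTMLParser):
    """Step 1: the information to collect here is (href, text) of each link."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append((self._href, "".join(self._text).strip()))
            self._href = None


def fetch(url):
    """Download one page as text."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def crawl(pages, fetch_page=fetch):
    """Fetch each page and collect the parsed rows."""
    rows = []
    for page in pages:
        parser = LinkParser()
        parser.feed(fetch_page(build_url(page)))
        rows.extend(parser.rows)
    return rows


def save_csv(rows, path):
    """Save the rows locally for later data mining."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["href", "text"])
        writer.writerows(rows)
```

A run against a real site would look like `save_csv(crawl(range(1, 4)), "links.csv")`. The `fetch_page` parameter lets the parsing step be tried on a saved HTML string first, without touching the network.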

III. Conclusion

In this issue, the author used the “buying vegetables” example to walk everyone through several ways of collecting data. I hope everyone got something out of it!

In the next issue, building on data collection, the author will explain how to use common tools for data cleaning!