Editor's note: Enterprises generate large volumes of data every day, and that data must be analyzed to create value for the business and its operations. A big data platform exists to meet a company's various data needs. How to build one depends on the company's level of data maturity and the data problems it faces. Using online education as an example, this article shares how to build a big data platform from 0 to 1.
As this is the first article in the series, let me start with the customary introduction. I currently work on big data marketing products at an online education company. By chance, I am responsible for both the data product line and the marketing CRM product line, which gives me more opportunities to think about, and practice, how to integrate data deeply with the marketing business. The goal is to bring the potential of big data to the marketing platform and so achieve refined, data-driven business operations.
I plan to present the practice of building a big data marketing platform for online education in a series of articles. Topics may include: big data platform construction, the user profile service system, the CRM lead dynamic scoring model and allocation algorithm, rollout and adoption plans for data products, and the customer data platform (CDP).
This article mainly explains how to build a big data platform from 0 to 1 in an online education business setting.
I. Diagnosis of corporate data issues
A product exists to meet needs. Do you need to build a big data platform at all, and if so, what kind? That depends on the company's level of data maturity and the data problems it faces. Before building a big data platform, therefore, investigate thoroughly so that you can identify the problems and prescribe the right remedy. To assess an enterprise's data maturity, you can refer to the Data Management Capability Maturity Model (DMM) shown in the figure below.
Based on this research and analysis, our company is at level L2, and its main data problems are as follows:
1) Scattered data sources
Association analysis across business lines is hindered.
The value of data assets is hard to mine further.
There is no unified data platform, so data resources are not consolidated.
Data cannot support the business efficiently.
2) Inconsistent metrics
Different business units calculate metrics using different definitions, so the accuracy and authority of the numbers are questioned.
3) Low data analysis efficiency
Each business unit spends part of its effort on data analysis work.
There is no complete data analysis tool, so data analysts often have to work all the way from the raw data to the final analysis.
4) Data management problems
There is no unified data dictionary.
Data is not organized into layers.
There is no metadata management.
II. Big data platform business architecture and roadmap
The previous part thoroughly diagnosed and pinpointed the company's internal data problems.
1. Data service system blueprint
From a business perspective, the data service system blueprint is shown below. The plan for the data service system needs to satisfy three points: it must cover the company's complete business, run through the entire business process, and grow with the company's development.
In this data service system, the core is the overall modeling of the data and the management of data assets, that is, the unified, layered data warehouse we are all familiar with. Given the characteristics of the online education business, building the data warehouse means building three core data systems:
User data system: user analysis applications, user tags, user behavior data, user basic information master data, etc.
Marketing data system: marketing analysis, marketing preference tags, channel characteristic data, revenue conversion master data, etc.
Learning data system: learning analysis, learning preference tags, learning behavior data, basic learning material data, etc.
2. Data Warehouse Architecture
The data warehouse is layered in the way that is common in the industry, with ODS, DWD, DWS, and ADS layers, as shown below:
1) ODS layer
Data synchronization: structured data is synchronized to the data warehouse incrementally or in full.
Structuring: unstructured data (logs) is processed into structured form and stored in the data warehouse.
History accumulation and cleaning: historical data is retained according to business needs and audit requirements, and the data is cleaned.
2) CDM layer (the common data model, covering the DWD and DWS layers)
Combine related and similar data: use detail wide tables and reuse join computations to reduce data scans.
Process common metrics in one place: based on the naming standard, calculation definitions, and algorithms of the statistical metrics, build logical summary wide tables.
Establish conformed dimensions: build consistent dimensions for data analysis to reduce the risk of inconsistent calculation definitions.
3) ADS layer
Personalized metric processing: metrics that are not shared or that are complex (index, ratio, ranking metrics, etc.).
Application-oriented data assembly: large wide tables, horizontal-to-vertical table conversion, and trend metric series.
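As a rough illustration of the layering above, the sketch below pushes a few made-up order rows through ODS, DWD, DWS, and ADS tables, using Python's built-in sqlite3 module in place of a real warehouse. The table names (ods_order, dwd_order_detail, dws_channel_pay_1d, ads_channel_share), the columns, and the sample rows are all assumptions made for this example, not the company's actual model.

```python
import sqlite3


def build_layers(conn: sqlite3.Connection) -> None:
    """Build hypothetical ODS -> DWD -> DWS -> ADS tables from sample rows."""
    cur = conn.cursor()

    # ODS: rows kept as synchronized from the source system (all text, uncleaned).
    cur.execute(
        "CREATE TABLE ods_order "
        "(order_id TEXT, user_id TEXT, channel TEXT, pay_money TEXT, pay_time TEXT)"
    )
    cur.executemany(
        "INSERT INTO ods_order VALUES (?, ?, ?, ?, ?)",
        [
            ("o1", "u1", "search", "1000", "2021-03-01 09:30:00"),
            ("o2", "u2", "search", "500", "2021-03-01 10:00:00"),
            ("o3", "u3", "social", "300", "2021-03-01 11:00:00"),
            ("o3", "u3", "social", "300", "2021-03-01 11:00:00"),  # duplicate row
            ("o4", "u4", "social", None, "2021-03-01 12:00:00"),  # missing amount
        ],
    )

    # DWD: cleaned detail. Drop duplicates and rows without an amount, cast types.
    cur.execute(
        """CREATE TABLE dwd_order_detail AS
           SELECT DISTINCT order_id, user_id, channel,
                  CAST(pay_money AS REAL) AS pay_money,
                  substr(pay_time, 1, 10) AS pay_date
           FROM ods_order
           WHERE pay_money IS NOT NULL"""
    )

    # DWS: common metrics summarized once, per day and channel.
    cur.execute(
        """CREATE TABLE dws_channel_pay_1d AS
           SELECT pay_date, channel,
                  COUNT(*) AS order_cnt,
                  SUM(pay_money) AS pay_money
           FROM dwd_order_detail
           GROUP BY pay_date, channel"""
    )

    # ADS: an application-specific ratio metric (each channel's revenue share).
    cur.execute(
        """CREATE TABLE ads_channel_share AS
           SELECT s.pay_date, s.channel, s.pay_money / t.day_total AS pay_share
           FROM dws_channel_pay_1d s
           JOIN (SELECT pay_date, SUM(pay_money) AS day_total
                 FROM dws_channel_pay_1d
                 GROUP BY pay_date) t
             ON s.pay_date = t.pay_date"""
    )
    conn.commit()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_layers(connection)
    for row in connection.execute("SELECT * FROM ads_channel_share ORDER BY channel"):
        print(row)
```

The point of the sketch is only the direction of flow: each layer reads from the one below it, so a metric such as revenue per channel is computed once in the summary layer and reused by every application table.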
3. Data processing flow architecture
The data processing flow mainly consists of source data synchronization and cleaning, data processing, model computation, and data application. For an online education company, the source data mainly includes channel data, user data, transaction data, marketing process data, learning data, and external third-party data.
The model engine comprises two kinds of engine: an offline computing engine and a real-time computing engine. It must support deploying algorithms (or rules), training models, and bringing them online, and it must be able to expose interface services to other business systems, for example multi-algorithm real-time lead allocation for the CRM system and user profile segmentation. Throughout the whole flow of data processing, production, and application, full-lifecycle data governance cannot be neglected, because the accuracy, completeness, and consistency of the data directly affect the credibility of the services built on the data system.
4. Building a big data platform from 0 to 1
Drawing on my own experience in driving the construction of a big data platform, I offer the following roadmap for your reference.
III. Data modeling and design specifications
1. Data model selection and example
The common models in dimensional modeling are the star schema, the snowflake schema, and the constellation schema; data warehouse design generally uses the star schema.
The star schema is a multi-dimensional data relationship consisting of one fact table and a set of dimension tables. Each dimension table has a primary key, and together these keys make up the primary key of the fact table. The non-key attributes of the fact table are called facts; they are generally numeric values or other data that can be calculated.
Dimension table: describes the attributes of the subject being analyzed. For example, "Yesterday morning, Zhang San spent 1,000 yuan at Global Network School to buy a zero-foundation course." If the subject of analysis is the purchase, three dimensions can be extracted from this information: the time dimension (yesterday morning), the location dimension (Global Network School), and the product dimension (the zero-foundation course). Typically, the information in a dimension table is fairly fixed and the data volume is small.
Fact table: holds the measures of the subject being analyzed. In the example above, 1,000 yuan is the fact. The fact table contains a foreign key to each dimension table and is related to the dimension tables through joins. The measures in a fact table are usually numeric, records keep being added, and the table grows quickly.
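The purchase example above can be sketched as a small star schema. The snippet below is a minimal illustration using Python's built-in sqlite3 module; the table and column names (dim_date, dim_school, dim_course, fact_purchase, pay_money) and the sample rows are hypothetical and chosen only to mirror the example.

```python
import sqlite3


def build_star_schema(conn: sqlite3.Connection) -> None:
    """Create one fact table and three dimension tables with sample rows."""
    conn.executescript(
        """
        -- Dimension tables: small, fairly static descriptive data.
        CREATE TABLE dim_date   (date_id INTEGER PRIMARY KEY, pay_date TEXT, day_part TEXT);
        CREATE TABLE dim_school (school_id INTEGER PRIMARY KEY, school_name TEXT);
        CREATE TABLE dim_course (course_id INTEGER PRIMARY KEY, course_name TEXT);

        -- Fact table: foreign keys to every dimension plus the numeric measure.
        -- The dimension keys together form the primary key, as described above.
        CREATE TABLE fact_purchase (
            date_id   INTEGER REFERENCES dim_date(date_id),
            school_id INTEGER REFERENCES dim_school(school_id),
            course_id INTEGER REFERENCES dim_course(course_id),
            pay_money REAL,
            PRIMARY KEY (date_id, school_id, course_id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?, ?)",
        [(1, "2021-03-01", "morning"), (2, "2021-03-01", "afternoon")],
    )
    conn.execute("INSERT INTO dim_school VALUES (1, 'Global Network School')")
    conn.executemany(
        "INSERT INTO dim_course VALUES (?, ?)",
        [(1, "Zero-foundation course"), (2, "Advanced course")],
    )
    conn.executemany(
        "INSERT INTO fact_purchase VALUES (?, ?, ?, ?)",
        [(1, 1, 1, 1000.0), (2, 1, 1, 1000.0), (2, 1, 2, 2000.0)],
    )
    conn.commit()


def revenue_by_course(conn: sqlite3.Connection) -> list:
    """Join the fact table to the course dimension and sum the measure."""
    return conn.execute(
        """SELECT c.course_name, SUM(f.pay_money)
           FROM fact_purchase f
           JOIN dim_course c ON f.course_id = c.course_id
           GROUP BY c.course_name
           ORDER BY c.course_name"""
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_star_schema(connection)
    print(revenue_by_course(connection))
```

A real purchase fact table would usually also carry a user dimension; it is left out here to stay close to the three dimensions named in the example.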
2. Data warehouse design specifications
1) Table naming specification
The naming conventions for data warehouse tables are shown in the figure below.
2) Field-level specification
When naming a new metric, refer to the field names that already exist, to avoid a situation in which ten people give the same field ten different names.
Field categories include detail, dimension, metric, time, code, and flag fields. The naming rules are:
A field ending in ID is an identifier; for some dimension IDs, the meaning has to be obtained by joining to the corresponding dimension table.
A field ending in Name is a name; it usually accompanies an ID field and explains its meaning.
A field ending in Code is a code field; for some codes the meaning can be looked up directly in the documentation, while for others it has to be obtained by joining to the data warehouse code table.
A field ending in Time is a time field in the format YYYY-MM-DD HH:MI:SS; it is taken from the source system without further processing.
A field ending in Money is an amount and corresponds to the payment amount in the source system.
A field starting with IS_ is a flag field with only two values: 1 means yes and 0 means no.
Fields not covered by the rules above are named according to their Chinese meaning; most of them are attribute fields. Pay attention to capitalization.
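Rules like these are easy to check automatically. The following is a minimal sketch of such a check, assuming that name parts are separated by underscores (for example user_id, pay_time, is_paid); that separator, the category labels, and the function names classify_field and validate_value are assumptions made for the example and are not part of the specification above.

```python
from datetime import datetime

# Suffix -> category, following the naming rules described above.
_SUFFIX_RULES = (
    ("_ID", "id"),
    ("_NAME", "name"),
    ("_CODE", "code"),
    ("_TIME", "time"),
    ("_MONEY", "amount"),
)


def classify_field(name: str) -> str:
    """Return the field category implied by a field name (case-insensitive)."""
    upper = name.upper()
    if upper.startswith("IS_"):
        return "flag"
    for suffix, category in _SUFFIX_RULES:
        if upper.endswith(suffix):
            return category
    # Anything else is treated as a plain attribute field.
    return "attribute"


def validate_value(name: str, value) -> bool:
    """Check a value against the rule for its field category."""
    category = classify_field(name)
    if category == "flag":
        # Flag fields may only hold 1 (yes) or 0 (no).
        return value in (0, 1)
    if category == "time":
        # Time fields must parse as YYYY-MM-DD HH:MI:SS.
        try:
            datetime.strptime(value, "%Y-%m-%d %H:%M:%S")
            return True
        except (TypeError, ValueError):
            return False
    # No value rule is stated for the other categories.
    return True


if __name__ == "__main__":
    for field in ("user_id", "course_name", "pay_time", "is_paid", "remark"):
        print(field, "->", classify_field(field))
```

A check of this kind could run when a new table is registered, so that naming problems are caught before the table is used downstream.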
IV. Big data platform technical architecture and module overview
While building the big data platform, the company's big data architect and I worked out the technical architecture together. The result of those discussions is shown below.
1) Security module
For a data platform, data security always comes first. Building the security system mainly covers the following aspects:
Data security standards and the definition of security levels
User system
Permission management at the base-component level
Permission management at the service layer
User authentication
Key management
Approval workflows
Data encryption and masking
Auditing
2) Monitoring module
After data security, service stability is the platform's second most important indicator. A good monitoring system helps anticipate risks and locate problems. For example:
Forecast disk capacity
Locate memory and CPU resource problems
Detect abnormal tasks, node downtime, and similar issues
View service load and assess resources
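Disk capacity forecasting, the first item above, can be approximated very simply. The sketch below assumes one disk usage sample per day and a constant growth rate; the function names days_until_full and should_alert and the 14-day threshold are hypothetical, and a production monitoring system would normally rely on its own forecasting and alerting features.

```python
from typing import Optional, Sequence


def days_until_full(used_gb: Sequence[float], capacity_gb: float) -> Optional[float]:
    """Estimate the days left until the disk is full from daily usage samples.

    Returns None when there are too few samples or usage is not growing.
    """
    if len(used_gb) < 2:
        return None
    # Average daily growth between the first and the last sample.
    daily_growth = (used_gb[-1] - used_gb[0]) / (len(used_gb) - 1)
    if daily_growth <= 0:
        return None
    remaining_gb = max(capacity_gb - used_gb[-1], 0.0)
    return remaining_gb / daily_growth


def should_alert(
    used_gb: Sequence[float], capacity_gb: float, threshold_days: float = 14.0
) -> bool:
    """Alert when the projected time to a full disk is within the threshold."""
    remaining = days_until_full(used_gb, capacity_gb)
    return remaining is not None and remaining <= threshold_days


if __name__ == "__main__":
    history = [100.0, 110.0, 120.0, 130.0]  # made-up daily samples in GB
    print(days_until_full(history, capacity_gb=200.0))
    print(should_alert(history, capacity_gb=200.0))
```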
3) Storage module
The storage module is an underlying component module and mainly uses components from the Hadoop ecosystem. Choose a component according to the application scenario, for example:
Hive: offline data warehouse.
HBase: KV storage; it can hold highly aggregated, fixed metrics and serve scenarios with a high volume of concurrent requests.
Druid: OLAP scenarios; it provides sub-second responses under a high request volume where drill-down capability is needed.
Impala: provides more efficient query and analysis on top of the data warehouse data; it suits ad hoc query scenarios but cannot handle a high request volume.
4) Calculation module
YARN provides unified resource management. Spark or Flink can serve as the unified framework for stream and batch processing, or the two can be allowed to coexist for a transitional period.
5) Management module
Data governance: the main platform for managing data warehouse data, including:
Metadata management
Data quality management
Data lineage management
Data security and permission management
Offline task management and scheduling:
This covers pipeline tasks, SQL tasks, shell tasks, and so on. The vast majority of tasks in data warehouse scenarios are SQL tasks, so the dependencies between tasks need to be generated automatically from the SQL, and tasks need to be scheduled according to those dependencies and their priorities.
Streaming task management:
Release, monitoring, restart, and so on of streaming tasks.
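The idea of deriving task dependencies from SQL and then scheduling by dependency and priority can be sketched as follows. This is a simplified illustration, not the platform's scheduler: it assumes each SQL task writes exactly one table through an INSERT statement, it finds table names with regular expressions where a real system would use a proper SQL parser, and the task names, table names, and priority values are hypothetical. It uses graphlib.TopologicalSorter from the Python standard library (Python 3.9 or later).

```python
import re
from graphlib import TopologicalSorter

# Very rough patterns: the table a task writes, and the tables it reads.
_TARGET = re.compile(r"\binsert\s+(?:overwrite\s+table|into)\s+(\w+)", re.IGNORECASE)
_SOURCE = re.compile(r"\b(?:from|join)\s+(\w+)", re.IGNORECASE)


def parse_task(sql: str):
    """Return (target_table, source_tables) for one SQL task."""
    target = _TARGET.search(sql).group(1)
    sources = set(_SOURCE.findall(sql)) - {target}
    return target, sources


def schedule(tasks: dict, priority: dict) -> list:
    """Order tasks so that producers run before consumers.

    Among tasks that are ready at the same time, higher priority runs first;
    ties are broken by task name to keep the order deterministic.
    """
    produced_by = {}  # table name -> task that writes it
    reads = {}        # task name -> tables it reads
    for name, sql in tasks.items():
        target, sources = parse_task(sql)
        produced_by[target] = name
        reads[name] = sources

    # A task depends on the tasks that produce the tables it reads.
    graph = {
        name: {produced_by[table] for table in sources if table in produced_by}
        for name, sources in reads.items()
    }

    sorter = TopologicalSorter(graph)
    sorter.prepare()
    order = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda n: (-priority.get(n, 0), n))
        for name in ready:
            order.append(name)
            sorter.done(name)
    return order


TASKS = {
    "ads_share": "INSERT OVERWRITE TABLE ads_channel_share "
                 "SELECT channel, pay_money FROM dws_channel_pay_1d",
    "dws_channel": "INSERT OVERWRITE TABLE dws_channel_pay_1d "
                   "SELECT o.channel, SUM(o.pay_money) FROM dwd_order_detail o "
                   "JOIN dwd_user_detail u ON o.user_id = u.user_id GROUP BY o.channel",
    "dwd_order": "INSERT OVERWRITE TABLE dwd_order_detail SELECT * FROM ods_order",
    "dwd_user": "INSERT OVERWRITE TABLE dwd_user_detail SELECT * FROM ods_user",
}

if __name__ == "__main__":
    print(schedule(TASKS, priority={"dwd_user": 10}))
```

TopologicalSorter raises a CycleError when tasks depend on each other in a loop, which is a useful failure mode for a scheduler: a circular dependency is reported before anything runs.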
V. Closing remarks
This concludes the first article on the big data marketing platform for online education. The next article will explain how, in the early stage of building a big data platform, to combine the data warehouse with the SA analytics system to quickly meet the operations team's data analysis needs and kick off a data-driven operations strategy.