What is a data lake? Enterprises can use a data lake to preserve business data in as close to its original form as possible, which solves the problem of storing raw data from every domain. A data middle platform, in turn, can help companies handle business more efficiently. However, not every company needs to build a data middle platform. In this article the author explains the data lake and the data middle platform in detail. Let's take a look.
Introduction: Welcome back. If you have not read the first part, please read "10 minutes to understand the differences between the database, data warehouse, data lake, and data middle platform (1)" first. We now begin the second part. If anything here is inaccurate, corrections are welcome.
First, the data lake
The previous part described the data warehouse and compared it with the data lake. Now we will learn more about the data lake.
1. The origin of the data lake
The data lake mainly addresses the problem of storing raw data from every domain, and the word "lake" in its name captures this well. A company's production data (both unstructured and structured), historical business data, temporary data, third-party data, and data returned by IoT devices, mobile applications, and traditional devices can all be loaded with ETL tools and stored in the data lake.
Take mobile phone signaling data or the positioning data returned by GPS, for example. Such data has no predefined structure, which means it can be stored first without being structured and without deciding in advance what analysis it will serve. Data practitioners can explore and experiment with it in later work.
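The idea of storing data first and giving it a structure only when it is read (often called schema-on-read) can be sketched in a few lines of Python. This is only an illustration: the record fields, the values, and the function name read_gps_points are assumptions made for the example and do not come from any particular data lake product.

```python
import json

# Hypothetical raw records from different sources, stored exactly as received.
# No table structure is defined before they are written.
raw_store = [
    '{"source": "gps", "lat": 39.9, "lon": 116.4, "ts": 1600000000}',
    '{"source": "signaling", "cell_id": "A17", "ts": 1600000060}',
    '{"source": "gps", "lat": 31.3, "lon": 120.6, "ts": 1600000120}',
]


def read_gps_points(store):
    """Apply a structure only at read time: keep GPS records, pick three fields."""
    points = []
    for line in store:
        record = json.loads(line)
        if record.get("source") == "gps":
            points.append((record["ts"], record["lat"], record["lon"]))
    return points


gps_points = read_gps_points(raw_store)
```

The signaling record stays in the store untouched. A different analysis could later read it with its own structure, without changing how anything was written.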
Structured and unstructured data were mentioned above. What are they? Below we explain the differences and connections between the two.
2. What is structured / unstructured data
Suppose we have collected a pile of text like this:
There is a student called Xiao Zhao, male, born in 1997, civil engineering department, from Beijing; there is a student called Xiao Li, born in 1998, female, foreign languages department, from Suzhou, Jiangsu; ...
There are tens of thousands of lines of text like this. Whether we keep it in Word or scan the paper files into images, it can be called unstructured data. Now suppose we need statistics by gender, major, and so on. Using the relational database mentioned in the first article, together with related technologies and tools, we can process this text, and the processed result is structured data.
So structured data can be defined as data that is logically expressed and implemented as a two-dimensional table, strictly follows format and length specifications, and is mainly stored and managed in relational databases.
Unstructured data is data that does not lend itself to representation in the two-dimensional tables of a database, including office documents of all formats, XML, HTML, various reports, pictures, audio, video, and so on.
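The step from free text to a two-dimensional table can be sketched as follows, using Python's built-in sqlite3 module as a stand-in for a relational database. The sample lines, the table name student, and the helper names are hypothetical. The sketch also assumes every line lists its fields in the same order, which real text like the example above usually does not, so real processing would need more robust parsing.

```python
import sqlite3

# Hypothetical free-text lines modeled on the student example above.
raw_lines = [
    "Xiao Zhao, male, 1997, civil engineering, Beijing",
    "Xiao Li, female, 1998, foreign languages, Suzhou",
    "Xiao Wang, male, 1998, civil engineering, Shanghai",
]


def parse_line(line):
    """Split one comma-separated line into a fixed five-field record."""
    name, gender, birth_year, major, city = [p.strip() for p in line.split(",")]
    return name, gender, int(birth_year), major, city


def load_students(lines):
    """Create an in-memory table and insert the parsed records."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE student "
        "(name TEXT, gender TEXT, birth_year INTEGER, major TEXT, city TEXT)"
    )
    conn.executemany(
        "INSERT INTO student VALUES (?, ?, ?, ?, ?)", map(parse_line, lines)
    )
    return conn


conn = load_students(raw_lines)

# Once the data is in a table, a statistic by gender is a single query.
count_by_gender = dict(
    conn.execute("SELECT gender, COUNT(*) FROM student GROUP BY gender").fetchall()
)
```

Once the text has fixed columns and types, the same table can answer questions by major or birth year simply by changing the GROUP BY column.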
3. The role of the data lake
Returning to the topic, why do companies build a data lake? First of all, an important part of the data lake is the ODS (Operational Data Store). Recall that the previous article covered OLTP (online transaction processing). OLTP focuses on basic, day-to-day transactions, and the ODS mentioned here holds snapshots and history of OLTP data.
As described in the earlier discussion of databases, the business database differs from the data warehouse. The business database is designed for OLTP and holds the real-time state of the system, while the data warehouse is built for OLAP and serves in-depth, multi-dimensional analysis. As a result, analysis based on a data warehouse has the following limitations:
First, because the architecture of a data warehouse is designed in advance, it can hardly cover everything, so analysis based on the data warehouse is constrained by the predefined analysis goals and by the warehouse's schema. Second, a lot of information is lost in the conversion from the real-time state of OLTP to OLAP. Take a user's wallet balance in an application: the OLTP system only updates the balance in real time as business events occur, and the OLAP system only records the wallet transactions. Querying and analyzing the user's historical balance is therefore troublesome.
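The wallet example can be made concrete with a small Python sketch. All values and names here (transactions, snapshots, balance_from_transactions, balance_from_snapshots) are hypothetical and exist only for illustration. The first function shows what it takes to answer "what was the balance on a given day" when only transactions are kept. The second shows the same question answered from ODS-style snapshots.

```python
from bisect import bisect_right
from datetime import date

# Hypothetical wallet transactions: (transaction date, signed amount).
transactions = [
    (date(2021, 1, 1), 100),
    (date(2021, 1, 3), -30),
    (date(2021, 1, 7), 50),
]


def balance_from_transactions(txns, as_of, opening_balance=0):
    """Rebuild the balance on `as_of` by replaying every transaction up to it."""
    return opening_balance + sum(amount for day, amount in txns if day <= as_of)


# Hypothetical ODS-style snapshots: (snapshot date, balance on that date),
# kept in ascending date order.
snapshots = [
    (date(2021, 1, 1), 100),
    (date(2021, 1, 3), 70),
    (date(2021, 1, 7), 120),
]


def balance_from_snapshots(snaps, as_of):
    """Return the latest snapshot taken on or before `as_of`, or None."""
    days = [day for day, _ in snaps]
    idx = bisect_right(days, as_of)
    return snaps[idx - 1][1] if idx else None


replayed = balance_from_transactions(transactions, date(2021, 1, 5))
looked_up = balance_from_snapshots(snapshots, date(2021, 1, 5))
```

Both calls give the same answer here, but replaying depends on having the full transaction history and a correct opening balance, while the snapshot lookup only needs the stored state.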
Fundamentally, then, the most important role of the data lake is to preserve business data in as close to its original form as possible. The data lake is positioned much like a search engine: just as we search for whatever we need and use it on demand, the data lake keeps the full, unrestricted set of raw data available to access, process, and analyze.
4. Development of the data lake
The concept of the data lake was first proposed in 2011 by James Dixon, Chief Technology Officer of Pentaho. He argued that because of its ordered, predefined nature, the data warehouse inevitably creates data silos, whereas the data lake, with its open nature, can solve the data silo problem.
But as data lakes were adopted in various enterprises, everyone started thinking: this data is useful, so I have to put it in; that data is useful too, so in it goes. Soon all data was being dumped, without a second thought, into the data lake and its related technologies and tools. As the saying goes, nothing can be accomplished without rules. When we treat all data as useful, all data becomes garbage, and the data lake turns into a data swamp that drives up enterprise costs. This is also why a "data lake" is called a "lake" rather than a data river, a data pool, or a data sea.
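First, data must be able to be stored, whereas a river flows away. Second, there must be enough room to store it, whereas a pool is too small. Third, storage must have boundaries, whereas a sea has none. Enterprise data needs long-term accumulation, so "data lake" is the fitting name.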
At the same time, lake water is naturally layered and supports different ecosystems at different depths. This matches an enterprise's need to build a unified data center that stores and manages data: hot data sits in the upper layer, where it is easy to circulate and use, while warm and cold data sit in the layers below, balancing storage capacity against cost.
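One simple way to picture the hot, warm, and cold layers is to classify each dataset by how long ago it was last accessed. The sketch below is only an illustration: the thresholds of 7 and 90 days and the function name storage_tier are assumptions, and real tiering policies vary from company to company.

```python
from datetime import date

# Assumed thresholds, chosen only for illustration.
HOT_MAX_DAYS = 7
WARM_MAX_DAYS = 90


def storage_tier(last_accessed, today):
    """Classify a dataset as hot, warm, or cold by days since last access."""
    idle_days = (today - last_accessed).days
    if idle_days <= HOT_MAX_DAYS:
        return "hot"
    if idle_days <= WARM_MAX_DAYS:
        return "warm"
    return "cold"


tier = storage_tier(date(2021, 1, 1), date(2021, 1, 5))
```

A scheduled job could apply such a rule to move rarely used data to cheaper storage, which is the balance between capacity and cost described above.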
Second, the data middle platform
We have finally arrived at the data middle platform, which has been so popular in recent years. There are many articles online introducing it, but terms such as Hive, Spark, Hadoop, and Kafka sound lofty and leave those of us who are new to products lost in the clouds.
So we will start with what a middle platform is, then look at what a data middle platform is, and finally at what a data middle platform can do.
1. What is a middle platform
First, setting data aside, the concept of the middle platform has taken off in China over the past two years. Tracing it to the source, online articles say the organizational form was borrowed from the game company Supercell, which Ma Yun visited in 2015, after which Alibaba's CEO put forward the "large middle platform, small front end" organizational and business system. So can we use a more lighthearted example to understand the term "middle platform"?
Of course. There is a super cheap chain of Italian-style Western restaurants called "Salia", which most readers have probably experienced: prices as low as 9 yuan, pizza for 24 yuan, and dishes served extremely fast. It is not the most authentic Western food, but for the price it is very good value, and Salia had opened nearly 400 branches in China as of 2019.
So how does Salia keep prices low and stay efficient? The answer is simple: the central kitchen does the rough processing, and the chefs in each store only need to do the final cooking and serve the dish. Compared with the traditional restaurant flow of purchasing (buying ingredients) → preparing ingredients → cooking, this reduces the number of chefs each store needs and lowers costs.
Returning to our R&D process, the purchasing (buying ingredients) step corresponds to the back end of our development, which solves the question of "what there is". The cooking step corresponds to the front end, which decides "how to make it" based on the customer's "taste".
The ingredient preparation step, that is, Salia's "central kitchen", corresponds to our middle platform. Whenever a store has a need, the central kitchen can quickly supply the corresponding materials, which improves business development efficiency and reduces the cost of repeated development.
2. What is the data middle platform
Having introduced the concept of the middle platform, I believe you can work out the data middle platform by analogy. That's right: the purchased "ingredients" correspond to data, and the finished "dishes" correspond to the data applications that business departments need.
The ingredient preparation, then, corresponds to the IT department's various data algorithms. Preparing ingredients separately for every dish is slow and redundant, so the "central kitchen" does it in a standardized and systematic way, and what it turns out for each dish the business units need is a "data product".
This "central kitchen" process corresponds to what we call the "data middle platform". So does every company need to build one? And what business problems can a data middle platform solve?
3. What can the data middle platform do?
Do all companies need to build a data middle platform? First, we should recognize that when a company introduces a technology or product, what matters is not whether it is "fashionable" or "high-tech", but whether it suits the company's current stage of development and can raise profits and reduce costs.
Based on the description of the middle platform and the data middle platform above, the role of the data middle platform can be summarized in the following two points:
First, it provides data products and data services, including but not limited to decision-support tools (such as business reports and large-screen data visualization), data analysis (BI, machine learning models, data mining), and data search (log analysis). Second, it improves data connectivity among the enterprise's departments and avoids the creation of data silos.
Given these two advantages, should a company build a data middle platform, and should it build one from zero to one at the very start? The author offers the following observations:
First, it depends on the industry. Although the digital transformation of traditional enterprises is under way and many industries have already reformed, most traditional enterprises have not even reached the data warehouse era, let alone the data middle platform. "Rome was not built in a day": leaving aside the high financial and time costs of building a data middle platform, the business processes of traditional enterprises and the degree of acceptance among employees are hard gaps to cross, so a data middle platform will struggle to take root. For traditional enterprises or Internet companies that are already in the data warehouse era, moving on to a data middle platform is worth trying, because their departments have a never-ending stream of requests for business support, business statistics, and data pulls.
Second, for start-up companies whose business lines are still changing and who are still experimenting, there is no capacity to build a data middle platform. In other words, "surviving is what matters most".
Third, summary
This series has described the differences and connections between databases, data warehouses, data lakes, and data middle platforms.
It is often said that data is the new oil, and the state has also designated data as a new type of production factor, listing it alongside the traditional production factors.
I have worked in both pan-Internet companies and traditional enterprises. For various reasons, digitalization in traditional enterprises has not gone as smoothly as in the pan-Internet industry. In August 2020, the State-owned Assets Supervision and Administration Commission of the State Council issued the "Notice on Accelerating the Digital Transformation of State-owned Enterprises", which shows that the digital transformation of state-owned enterprises is becoming inevitable. How to help traditional enterprises transform digitally and use data to bring new vitality to traditional industries will be both a challenge and an opportunity for data vendors, and especially for product managers of ToB data products.
The author will continue to share articles on other data products with everyone.