[Data] How Does Data Lead You Into a Pit?

With the arrival of the big data era, data has become especially important in our lives and work. In fact, humans began recording data long ago, and as society has developed, the Internet age needs ever more data to be recorded and relied upon. In this article, the author describes three data “pits” in detail. Let’s take a look.

Let’s open with a case: there are two ad creatives, A and B, and we want to find out which one is more attractive to users and converts them better, with conversion rate as the yardstick. (Conversion rate = number of conversions / number of impressions)

To keep the test fair, both were given the same budget, both started running at 12:00, and both were switched off at 0:00.

After running for that long, both ads had basically spent their budgets, and the results were:

A: 6,500 impressions in total, 70 users converted, conversion rate 1.077%; B: 6,200 impressions in total, also 70 users converted, conversion rate 1.129%.

Seeing the results, Li, the ad specialist, concluded: creative B has the higher conversion rate.

At this point the marketing director, Wang, said: For now, I am more inclined to believe that A is higher.

Li looked completely baffled and asked: Why?

Wang took a sip of his Starbucks and explained: even the same ad has different conversion rates at different times. In general, the conversion rate in the afternoon is lower than in the evening. During the day most people are busy and hard to convert, while at night they have free time and are more likely to convert.

Li asked: You’re right, but what does that have to do with this experiment?

Wang went on: I just looked at the breakdown in the backend. Creative A had 50 conversions from 5,000 impressions in the afternoon and 20 conversions from 1,500 impressions at night; creative B had 20 conversions from 2,200 impressions in the afternoon and 50 conversions from 4,000 impressions at night.

In fact, A’s conversion rate is higher in both the afternoon and the evening. You think B is higher mainly because B delivered most of its volume at night: it got more of the “easy targets”, and you mistook that for B being stronger …
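Wang’s breakdown can be checked with a few lines of arithmetic. The following is a minimal Python sketch that uses only the figures quoted in the case; the names `data`, `segment_rates` and `overall_rate` are made up for illustration.

```python
from fractions import Fraction

# (conversions, impressions) per creative and time slot, from the case above
data = {
    "A": {"afternoon": (50, 5000), "night": (20, 1500)},
    "B": {"afternoon": (20, 2200), "night": (50, 4000)},
}

def rate(conversions, impressions):
    """Conversion rate = conversions / impressions, kept exact with Fraction."""
    return Fraction(conversions, impressions)

def segment_rates(creative):
    """Conversion rate of one creative in each time slot."""
    return {slot: rate(c, n) for slot, (c, n) in data[creative].items()}

def overall_rate(creative):
    """Conversion rate of one creative with both time slots merged."""
    conversions = sum(c for c, _ in data[creative].values())
    impressions = sum(n for _, n in data[creative].values())
    return rate(conversions, impressions)

for name in data:
    parts = ", ".join(
        f"{slot}: {float(r):.3%}" for slot, r in segment_rates(name).items()
    )
    print(f"{name}  {parts}, overall: {float(overall_rate(name)):.3%}")
# A  afternoon: 1.000%, night: 1.333%, overall: 1.077%
# B  afternoon: 0.909%, night: 1.250%, overall: 1.129%
```

A wins in each time slot, yet B wins once the slots are merged, because 4,000 of B’s 6,200 impressions fall in the high-converting night slot while A has only 1,500 of 6,500 there.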

At this point you may be a little confused: yes, broken down by time, A is indeed higher, but looking at the combined totals, B is higher …

So which view is the right one?

First, Pit 1: Simpson’s Paradox

The phenomenon above is a typical “Simpson’s Paradox”: under certain conditions, two groups of data each satisfy some property when examined separately, but once they are merged, they may lead to the opposite conclusion.

It is often encountered in other areas of work as well, and it is often deceptive.

In almost every company, the boss asks subordinates to aggregate the data from each line of business and report it upward, and then believes he knows the “overall” situation.

However, a data expert at Google once said: “Aggregated data is often a pile of crap that makes no sense.”

Why would he say so?

Setting professional mathematical analysis aside, the plainest way I can explain it is this: 20 pigs are 20 pigs and 50 trees are 50 trees, but if you insist on adding them up (20 + 50 = 70), the 70 is meaningless. What does it represent? Nothing.

Apart from its symbolic value in a report, aggregated data often has no other significance. Why only “symbolic”?

Because if the report is used to guide decisions, it can easily lead people into a pit.

Take the creative case above: if the marketing director had been lazy and looked only at the final result, he would probably have misjudged which creative was better. Worse, every later creative might then have been optimized in the direction of the “worse” one.

Fortunately, he had some grounding in data analysis and avoided the pit.

So in a real case, if we must draw a conclusion from this data, A is indeed higher. (Of course, a more rigorous approach is to continue the experiment with a larger budget and strictly controlled time periods, to reduce the effect of chance and of the uneven proportions of data that differ in nature.)
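One way to compare the two creatives on equal footing without rerunning the experiment is to re-weight both to the same traffic mix. This is a standard statistical technique (direct standardization) brought in here for illustration, not something the case itself describes; the reference mix `mix` and the function name `standardized_rate` are assumptions.

```python
# (conversions, impressions) per creative and time slot, from the case above
data = {
    "A": {"afternoon": (50, 5000), "night": (20, 1500)},
    "B": {"afternoon": (20, 2200), "night": (50, 4000)},
}

def standardized_rate(creative, weights):
    """Weighted average of per-slot conversion rates under a shared traffic mix."""
    total = sum(weights.values())
    return sum(
        (conversions / impressions) * weights[slot] / total
        for slot, (conversions, impressions) in data[creative].items()
    )

# Assumed reference mix: combined impressions of both creatives in each slot
mix = {
    slot: sum(data[creative][slot][1] for creative in data)
    for slot in ("afternoon", "night")
}

std_a = standardized_rate("A", mix)
std_b = standardized_rate("B", mix)
print(f"A: {std_a:.3%}  B: {std_b:.3%}")
```

Because A has the higher rate in both slots, A comes out ahead under any shared mix with positive weights; the choice of reference mix changes the size of the gap, not its direction.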

Beyond advertising, Simpson’s Paradox also turns up in all kinds of statistical work. Almost any statistic that is computed as a ratio can be affected, for example:

Retention rate, conversion rate, pass rate, debt ratio, ROI …

So how do you avoid the pits that aggregated data can lead to?

Remember one rule: data that differ in nature should be counted separately.

Second, Pit 2: Mistaking correlation for causation

We have all heard the “beer and diapers” story: through correlation analysis, a retailer found that sales of beer and diapers were highly correlated, so it placed the two closer together on the shelves to increase sales.

Of course, this is a completely made-up story. (It was coined by a manager at Teradata, presumably a marketing manager, as an advertorial to persuade retailers to buy his company’s data services.) What I want to focus on here is correlation analysis.

Today, whether in traditional industries or the Internet industry, data has become one of an enterprise’s most important assets.

Almost every day, each company’s data analysts analyze the correlations between various factors in search of ways to grow. For example, a game company found that the longer users played, the better the retention, so it focused on increasing new users’ play time, which greatly improved retention.

For example, a convenience store found that shoppers who spent longer in the store also spent more per capita, so it tried to guide people around the store counterclockwise. (Most people are right-handed, so a counterclockwise route puts more items on their right side, where they are easier to pick up, and people buy more.)

Undeniably, many effective growth methods can be found through correlation analysis. However, blind faith in correlation sometimes produces the opposite result.

For example, a social app wants to improve retention.

The team found that, of all the factors, the number of messages a user sends has the highest correlation with retention.

Not only that, they also found a cliff-like gap in retention between users who had sent more than 500 messages and users who had not. (The “500” here is what is often called a “magic number”.)

So, to improve retention, the team proposed: if we push up the number of messages new users send, ideally past 500, retention should improve significantly.

So they set up “staged reward tasks” (sending a certain number of messages triggers a reward notification and announces the next reward task) to push up every new user’s message count, in most cases past 500.

However, the final result was: although overall short-term retention rose, overall long-term retention dropped.

Why? The number of messages clearly had the highest correlation with retention …

In fact, this is a typical case of mistaking correlation for causation, and even of reversing cause and effect: users did not stay because they sent many messages; they sent many messages because they stayed.
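The trap can be reproduced with a toy simulation. In the sketch below, a hidden “engagement” level drives both the message count and retention, so the two are strongly correlated even though messages have no effect on retention. Every number and name in it (`simulate`, the uniform engagement, the 0 to 1000 message scale) is invented for illustration and does not come from the app in the story.

```python
import random

def simulate(n_users=20000, nudge=False, seed=0):
    """Toy model: hidden engagement drives BOTH message count and retention."""
    rng = random.Random(seed)
    heavy = heavy_retained = light = light_retained = 0
    for _ in range(n_users):
        engagement = rng.random()             # hidden cause, uniform in [0, 1)
        messages = int(engagement * 1000)     # engaged users send more messages
        if nudge:
            messages = max(messages, 501)     # rewards push everyone past 500
        retained = rng.random() < engagement  # retention depends only on engagement
        if messages > 500:
            heavy += 1
            heavy_retained += retained
        else:
            light += 1
            light_retained += retained
    heavy_rate = heavy_retained / heavy if heavy else 0.0
    light_rate = light_retained / light if light else 0.0
    overall = (heavy_retained + light_retained) / n_users
    return heavy_rate, light_rate, overall

before = simulate(nudge=False)
after = simulate(nudge=True)
print("no nudge: >500 msgs %.1f%%, <=500 msgs %.1f%%, overall %.1f%%"
      % tuple(100 * x for x in before))
print("nudged:   >500 msgs %.1f%%, <=500 msgs %.1f%%, overall %.1f%%"
      % tuple(100 * x for x in after))
```

Without the nudge there is a large gap between the two groups, which is the “cliff” at 500. With the nudge every user crosses 500, yet overall retention is unchanged, because the hidden cause did not move. This sketch only shows that forcing the metric does not help; it does not model the long-term decline described in the story.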

The scheme above may lift short-term retention, but for users who genuinely want to use the product it may be a nuisance.

On the other hand, the incentives attract more non-target users (the “wool party”, people who come only for the rewards) to download and use the app, which pulls down user quality and therefore long-term retention.

As for the retention fix that was finally adopted, it actually lay in the advertising: the app was modeled closely on Instagram, and its distinguishing features are image-related.

But the earlier ads only said vaguely that the app was “fun”, without highlighting its specific functions and usage scenarios, so users’ expectations did not match the product and retention was low.

Interestingly, in the earlier data analysis, the correlation between advertising and retention was not very high.

Third, Pit 3: Believing only the data you can see

If the first two pits come from not understanding the data and the business, the third is one you may fall into more easily the better you understand both. As I have said in past articles: the biggest problem with data is that it can only show the information for which data exists, not the information for which none exists.

Clayton Christensen called these two kinds of information positive data and negative data.

Positive data is data that is structured and quantifiable. For example: sales volume, sales revenue, retail rate, conversion rate, repurchase rate, profit margin, pay rate, performance indicators, market size, and so on. (Positive data is the kind that can be standardized into Excel.)

Negative data is data that is unclear and hard to discover or quantify. For example: users’ motivations, emotions, beliefs and habits, and the way these factors change with the times.

From the day a business starts, the company accumulates more and more positive data:

Which products sell best? Which products have the highest profit? What is the repurchase rate? How are customers distributed by age? How large is the market share? …

As positive data accumulates, its influence on the company grows too:

The sales department influences production planning according to the sales and profits of different products.

The brand department works out product selling points from the hot keywords in its category.

New users are targeted precisely according to the attributes of existing users.

The customer service department gives product improvement advice based on user feedback.

Everything seems to be moving forward and slowly settling into “experience”.

However, things “outside experience” are also brewing and happening. Take e-commerce as an example: while Ali and Jingdong were applying their own growth experience, expanding into categories with higher order values, competing for higher-net-worth users, setting up special discount promotions, and strategically giving up the low-end market, Pinduoduo suddenly burst onto the scene and within a few years became one of the top players in the country.

Ali and Jingdong actually did nothing wrong, so what did Pinduoduo get right? Why do people use Pinduoduo instead of Taobao?

Because it is cheaper.

Why is Pinduoduo so cheap? Because it carries a lot of small-workshop and knockoff goods.

Then why do these workshops and knockoff goods go to Pinduoduo?

On the one hand, other platforms do not allow them to be sold; on the other hand, most of Pinduoduo’s users accept them, just as shoppers do in the offline market.

Yes, for lower-tier users (including merchants), the Pinduoduo app simply moves their offline shopping scenes online. Group buying, bargaining, buying and selling cheap knockoffs: all of this was already part of their everyday offline life. Taobao and Jingdong, by contrast, are more like the mall in the city: expensive, and not somewhere they go often. (And when they do buy something expensive, they want to see the real thing.)

What does this have to do with “positive data” and “negative data”? Let’s start with “negative data”.

Why did Pinduoduo see the opportunity in this market, and even invent the new species called “social e-commerce”?

In fact, for lower-tier users, shopping is itself social: everyone goes to the street market together; you haggle with the vendors you know; you buy a pound of peanuts and the vendor throws in two jujubes; you bring back onions for the neighbor, and the neighbor picks up some salt for you. There are commodity transactions and emotional transactions alike. This is Pinduoduo’s insight into the negative data about how users shop.

So whether it is asking friends to “help out” or “social e-commerce”, it all comes from life, from insight into negative data (motivations, emotions, beliefs, habits, and so on).

Pinduoduo simply moved it all onto the mobile phone and made it more convenient.

As for Ali and Jingdong, there is no doubt that their core teams both understand the e-commerce business and excel at data analysis. So why did such professional teams fail to seize this market opportunity?

The reasons are many.

On the one hand, companies have to grow and teams want to grow, so they naturally prefer to put their attention where the profit is: higher-net-worth users, higher-priced products, higher-frequency products, and so on. (This includes Pinduoduo today.)

On the other hand, with positive data springing up everywhere, it is also natural to focus on products and metrics:

How to improve logistics efficiency? How to improve advertising revenue? How to improve user activity? …

In this way, driven by profit and by data, they understand their users better and better and offer more and better products and services.

But at the same time, they also become more and more certain about who is not their user: “Those lower-tier people are not typical e-commerce users; I have no energy to pay attention to them.”

However, it is precisely this prejudice, slowly hardened by data, that allows the market to be segmented, occupied, and even disrupted.