hive视频热度统计案例
背景及需求描述
现有两个记录了关于youtube视频信息和用户信息的txt文件
其中存放了视频信息的文件每行的字段如下所示:
字段 | 备注 | 详细描述 |
---|---|---|
video id | 视频唯一 id | 11 位字符串 |
uploader | 视频上传者 | 上传视频的用户名 String |
age | 视频年龄 | 视频在平台上的整数天 |
category | 视频类别 | 上传视频指定的视频分类 |
length | 视频长度 | 整形数字标识的视频长度 |
views | 观看次数 | 视频被浏览的次数 |
rate | 流量 | 视频的流量,整型数字 |
conments | 评论数 | 一个视频的整数评论数 |
related ids | 相关视频 id | 相关视频的 id,最多 20 个 |
存放了用户信息的文件每行字段如下所示:
字段 | 备注 | 字段类型 |
---|---|---|
uploader | 上传者用户名 | string |
videos | 上传视频数 | int |
friends | 朋友数量 | int |
希望根据这两个文件所包含的信息,统计硅谷影音视频网站的常规指标,各种 TopN 指标:
-
统计视频观看数 Top10
-
统计视频类别热度 Top10
-
统计视频观看数 Top20 所属类别以及类别包含的 Top20 的视频个数
-
统计视频观看数 Top50 所关联视频的所属类别 Rank
-
统计每个类别视频观看数 Top10
-
统计上传视频最多的用户 Top10 以及他们上传的观看次数在前 20 视频
项目的完成
1. 数据清洗
通过观察原始数据形式,可以发现,视频可以有多个所属分类,每个所属分类用&符号分割,且分割的两边有空格字符,同时相关视频也是可以有多个元素,多个相关视频又用“\t”进行分割。为了分析数据时方便对存在多个子元素的数据进行操作,我们首先进行数据重组清洗操作。即:将所有的类别用“&”分割,同时去掉两边空格,多个**相关视频 id 也使用“&”**进行分割。
(1) maven依赖
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"><modelVersion>4.0.0</modelVersion><groupId>com.atguigu</groupId><artifactId>guli-video</artifactId><version>1.0-SNAPSHOT</version><dependencies><dependency><groupId>junit</groupId><artifactId>junit</artifactId><version>RELEASE</version></dependency><dependency><groupId>org.apache.logging.log4j</groupId><artifactId>log4j-core</artifactId><version>2.8.2</version></dependency><dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-common</artifactId><version>2.7.2</version></dependency><dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-client</artifactId><version>2.7.2</version></dependency><dependency><groupId>org.apache.hadoop</groupId><artifactId>hadoop-hdfs</artifactId><version>2.7.2</version></dependency></dependencies>
</project>
(2)ETLUtils-处理具体的数据清洗逻辑
package com.atguigu.util;import sun.security.krb5.internal.ETypeInfo;public class ETLUtils {/*** 过滤每条输入数据* 1. 过滤小于9个字段的* 2. 去掉类别字段中的空格* 3. 修改相关视频ID字段的分隔符,由'\t'替换为'&'* @param oriStr 输入参数,原始数据* @return 过滤后的数据*/public static String etLString(String oriStr){StringBuffer sb = new StringBuffer();//1.切割拿到素有字符String[] fields = oriStr.split("\t");//2. 根据字段数量进行过滤if (fields.length<9){return null;}//3. 去掉类别中的空格fields[3] = fields[3].replaceAll(" ", "");//4. 修改相关视频ID字段分隔符,由'\t'替换为'&'for (int i = 0; i < fields.length; i++) {//对非相关ID字段进行处理if (i<9){if (i==fields.length-1){sb.append(fields[i]);}else {sb.append(fields[i]).append("\t");}}else {//对相关ID字段进行处理if (i==fields.length-1){sb.append(fields[i]);}else {sb.append(fields[i]).append("&");}}}//5. 返回新的结果return sb.toString();}/*测试public static void main(String[] args) {
// System.out.println(ETLUtils.etLString("LKh7zAJ4nwo\tTheReceptionist\t653\tEntertainment\t424\t13021\t4.34\t1305\t744\tDjdA-5oKYFQ\tNxTDlnOuybo\tc-8VuICzXtU\tDH56yrIO5nI\tW1Uo5DQTtzc\tE-3zXq_r4w0\t1TCeoRPg5dE\tyAr26YhuYNY\t2ZgXx72XmoE\t-7ClGo-YgZ0\tvmdPOOd6cxI\tKRHfMQqSHpk\tpIMpORZthYw\t1tUDzOp10pk\theqocRij5P0\t_XIuvoH6rUg\tLGVU5DsezE0\tuO2kj6_D8B4\txiDqywcDQRM\tuX81lMev6_o"));System.out.println(ETLUtils.etLString("SDNkMu8ZT68\tw00dy911\t630\tPeople & Blogs\t186\t10181\t3.49\t494"));}
*/}
(3)ETLMapper
package com.atguigu.ETL;import com.atguigu.util.ETLUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;import java.io.IOException;//清洗数据
public class ETLMapper extends Mapper<LongWritable,Text,NullWritable,Text>{//定义全局的valueprivate Text v=new Text();@Overrideprotected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {//1. 获取数据String ori = value.toString();//2. 过滤数据String etlStr = ETLUtils.etLString(ori);//3. 写出数据if (etlStr==null){return;}v.set(etlStr);context.write(NullWritable.get(),v);}}
(4) ETLDriver
package com.atguigu.ETL;import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;public class ETLDriver implements Tool {private Configuration configuration;public int run(String[] args) throws Exception {//1.获取Job对象Job job = Job.getInstance(configuration);//2.设置jar包路径job.setJarByClass(ETLDriver.class);//3.设置Mapper/Reducer类路径,输出KV类型job.setMapperClass(ETLMapper.class);job.setMapOutputKeyClass(NullWritable.class);job.setMapOutputValueClass(Text.class);//4.设置最终输出KV类型job.setOutputKeyClass(NullWritable.class);job.setOutputValueClass(Text.class);//5. 设置输入输出路径FileInputFormat.setInputPaths(job,new Path(args[0]));FileOutputFormat.setOutputPath(job,new Path(args[1]));//6.提交任务boolean result = job.waitForCompletion(true);return result?0:1;}public void setConf(Configuration configuration) {this.configuration=configuration;}public Configuration getConf() {return this.configuration;}public static void main(String[] args) throws Exception {//构建配置信息Configuration conf = new Configuration();int result = ToolRunner.run(conf, new ETLDriver(), args);System.out.println(result);}
}
(5) 执行ETL清洗数据
/opt/module/jars
hadoop fs -put /opt/module/datas/gulivideo /video
为方便,把不需要清洗的数据user.txt 也上传至hdfs中
执行jar包中的driver,指定输入路径为原始数据,执行清洗后的数据存放的URL
hadoop jar /opt/module/jars/gulivideo-0.0.1-SNAPSHOT.jar\
com.atguigu.etl.ETLDriver\ # driver所在全路径
/gulivideo/video/2008/0222\ # 输入文件
/guliout # 输出文件位置
清洗完成。
2. 建表
按照上面给出的数据所包含字段建表如下
video表:
原始表
create table gulivideo_ori( videoId string, uploader string, age int, category array<string>, length int, views int, rate float, ratings int, comments int, relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&" stored as textfile;
orc表(列式存储):
create table gulivideo_orc(videoId string,uploader string,age int,category array<string>,length int,views int, rate float, ratings int, comments int, relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&"
stored as orc; --指定以orc格式存储
user表:
原始表:
create table gulivideo_user_ori( uploader string, videos int, friends int)
row format delimited
fields terminated by "\t" stored as textfile;
orc表:
create table gulivideo_user_orc( uploader string, videos int, friends int)
row format delimited
fields terminated by "\t"
stored as orc;
导入 ETL 后的数据到原始表中
load data inpath "/guliout" into table gulivideo_ori;
load data inpath "/user/2008/0903" into table gulivideo_user_ori;
通过原始表向ORC表插入数据
insert into table gulivideo_orc select * from gulivideo_ori;
insert into table gulivideo_user_orc select * from gulivideo_user_ori;
检查数据是否插入成功
select * from gulivideo_orc limit 2;
+------------------------+-------------------------+--------------------+-------------------------+-----------------------+----------------------+---------------------+------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| gulivideo_orc.videoid | gulivideo_orc.uploader | gulivideo_orc.age | gulivideo_orc.category | gulivideo_orc.length | gulivideo_orc.views | gulivideo_orc.rate | gulivideo_orc.ratings | gulivideo_orc.comments | gulivideo_orc.relatedid |
+------------------------+-------------------------+--------------------+-------------------------+-----------------------+----------------------+---------------------+------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| o4x-VW_rCSE | HollyW00d81 | 581 | ["Entertainment"] | 74 | 3534116 | 4.480000019073486 | 9538 | 7756 | ["o4x-VW_rCSE","d2FEj5BCmmM","8kOs3J0a2aI","7ump9ir4w-I","w4lMCVUbAyA","cNt29huGNoc","1JVqsS16Hw8","ax58nnnNu2o","CFHDCz3x58M","qq-AALY0DE8","2VHU9CBNTaA","KLzMKnrBVWE","sMXQ8KC4w-Y","igecQ61MPP4","B3scImOTl7U","X1Qg9gQKEzI","7krlgBd8-J8","naKnVbWBwLQ","rmWvPbmbs8U","LMDik7Nc7PE"] |
| P1OXAQHv09E | smosh | 434 | ["Comedy"] | 59 | 3068566 | 4.550000190734863 | 15530 | 7919 | ["uGiGFQDS7mQ","j1aBQPCZoNE","WsmC6GXMj3I","pjgxSfhgQVE","T8vAZsCNJn8","7taTSPQUUMc","pkCCDp7Uc8c","NfajJLln0Zk","tD-ytSD-A_c","eHt1hQYZa2Y","qP9zpln4JVk","zK7p3o_Mqz4","ji2qlWmhblw","Hyu9HcqTcjE","YJ2W-GnuS0U","NHf2igxB8oo","rNfoeY7F6ig","XXugNPRMg-M","rpIAHWcCJVY","3V2msHD0zAg"] |
+------------------------+-------------------------+--------------------+-------------------------+-----------------------+----------------------+---------------------+------------------------+-------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
3. 完成具体需求
谷粒影音需求:
题目:统计硅谷影音视频网站的常规指标,各种 TopN 指标:
需求1.统计视频观看数 Top10
selectvideoId,views
fromgulivideo_orc
order byviews desc
limit 10;
t1:
+--------------+-----------+--+
| videoid | views |
+--------------+-----------+--+
| dMH0bHeiRNg | 42513417 |
| 0XxI-hvPRRA | 20282464 |
| 1dmVU08zVpA | 16087899 |
| RB-wUgnyGv0 | 15712924 |
| QjA5faZF1A8 | 15256922 |
| -_CSo1gOd48 | 13199833 |
| 49IDp76kjPw | 11970018 |
| tYnn51C3X_w | 11823701 |
| pv5zWaTEVkI | 11672017 |
| D2kJZOfq7zk | 11184051 |
+--------------+-----------+--+
需求2.统计视频类别热度 Top10
某类视频的个数作为视频类别热度
1.使用UDTF函数将视频类别列转为行
select
videoid,
category_name
from
gulivideo_orc
lateral view explode(category) tmp_category as category_name;
t1:
+--------------+----------------+--+
| videoid | category_name |
+--------------+----------------+--+
| o4x-VW_rCSE | Entertainment |
| P1OXAQHv09E | Comedy |
| N0TR0Irx4Y0 | Comedy |
| seGhTWE98DU | Music |
| bNF_P281Uu4 | Travel |
| bNF_P281Uu4 | Places |
| CQO3K8BcyGM | Comedy |
| 3gg5LOd_Zus | Entertainment |
| sdUUx5FdySs | Film |
| sdUUx5FdySs | Animation |
+--------------+----------------+--+
2.按照category_name进行分组,统计每种类别视频的总数,同时按照该总数进行倒序排名,取前十
selectcategory_name,count(category_name)category_count
fromt1
group bycategory_name
order bycategory_count desc
limit 10;
即:
selectcategory_name,count(category_name)category_count
from(selectvideoid,category_name
fromgulivideo_orc
lateral view explode(category) tmp_category as category_name)t1
group bycategory_name
order bycategory_count desc
limit 10;
t2:
+----------------+-----------------+--+
| category_name | category_count |
+----------------+-----------------+--+
| Music | 179049 |
| Entertainment | 127674 |
| Comedy | 87818 |
| Animation | 73293 |
| Film | 73293 |
| Sports | 67329 |
| Gadgets | 59817 |
| Games | 59817 |
| Blogs | 48890 |
| People | 48890 |
+----------------+-----------------+--+
添加排名函数显示更清楚
selectcategory_name,category_count,row_number() over(order by category_count desc) category_rank
from(selectcategory_name,count(category_name)category_count
from(selectvideoid,category_name
fromgulivideo_orc
lateral view explode(category) tmp_category as category_name)t1
group bycategory_name
order bycategory_count desc
limit 10)t2;
结果:
+----------------+-----------------+----------------+--+
| category_name | category_count | category_rank |
+----------------+-----------------+----------------+--+
| Music | 179049 | 1 |
| Entertainment | 127674 | 2 |
| Comedy | 87818 | 3 |
| Film | 73293 | 4 |
| Animation | 73293 | 5 |
| Sports | 67329 | 6 |
| Games | 59817 | 7 |
| Gadgets | 59817 | 8 |
| People | 48890 | 9 |
| Blogs | 48890 | 10 |
+----------------+-----------------+----------------+--+
需求3.统计视频观看数 Top20 所属类别以及这些类别包含的 Top20 的视频个数
1.统计视频观看数前20的视频(需要包含类别方便下一步操作)
selectvideoid,views,category
fromgulivideo_orc
order byviews desc
limit 20;
t1:
+--------------+-----------+---------------------+--+
| videoid | views | category |
+--------------+-----------+---------------------+--+
| dMH0bHeiRNg | 42513417 | ["Comedy"] |
| 0XxI-hvPRRA | 20282464 | ["Comedy"] |
| 1dmVU08zVpA | 16087899 | ["Entertainment"] |
| RB-wUgnyGv0 | 15712924 | ["Entertainment"] |
| QjA5faZF1A8 | 15256922 | ["Music"] |
| -_CSo1gOd48 | 13199833 | ["People","Blogs"] |
| 49IDp76kjPw | 11970018 | ["Comedy"] |
| tYnn51C3X_w | 11823701 | ["Music"] |
| pv5zWaTEVkI | 11672017 | ["Music"] |
| D2kJZOfq7zk | 11184051 | ["People","Blogs"] |
| vr3x_RRJdd4 | 10786529 | ["Entertainment"] |
| lsO6D1rwrKc | 10334975 | ["Entertainment"] |
| 5P6UU6m3cqk | 10107491 | ["Comedy"] |
| 8bbTtPL1jRs | 9579911 | ["Music"] |
| _BuRwH59oAo | 9566609 | ["Comedy"] |
| aRNzWyD7C9o | 8825788 | ["UNA"] |
| UMf40daefsI | 7533070 | ["Music"] |
| ixsZy2425eY | 7456875 | ["Entertainment"] |
| MNxwAU_xAMk | 7066676 | ["Comedy"] |
| RUCZJVJ_M8o | 6952767 | ["Entertainment"] |
+--------------+-----------+---------------------+--+
2.对t1表中的category列进行炸裂(lateral view explode)
selectvideoId,category_name
fromt1
lateral view explode(category) tmp_category as category_name;
即:
selectvideoId,category_name
from(selectvideoid,views,category
fromgulivideo_orc
order byviews desc
limit 20)t1
lateral view explode(category) tmp_category as category_name;
补全t1得到查询结果如下:
t2:
+--------------+----------------+--+
| videoid | category_name |
+--------------+----------------+--+
| dMH0bHeiRNg | Comedy |
| 0XxI-hvPRRA | Comedy |
| 1dmVU08zVpA | Entertainment |
| RB-wUgnyGv0 | Entertainment |
| QjA5faZF1A8 | Music |
| -_CSo1gOd48 | People |
| -_CSo1gOd48 | Blogs |
| 49IDp76kjPw | Comedy |
| tYnn51C3X_w | Music |
| pv5zWaTEVkI | Music |
| D2kJZOfq7zk | People |
| D2kJZOfq7zk | Blogs |
| vr3x_RRJdd4 | Entertainment |
| lsO6D1rwrKc | Entertainment |
| 5P6UU6m3cqk | Comedy |
| 8bbTtPL1jRs | Music |
| _BuRwH59oAo | Comedy |
| aRNzWyD7C9o | UNA |
| UMf40daefsI | Music |
| ixsZy2425eY | Entertainment |
| MNxwAU_xAMk | Comedy |
| RUCZJVJ_M8o | Entertainment |
+--------------+----------------+--+
22 rows selected (3.228 seconds)
3.对t2表进行分组(category_name)求和(总数),然后倒排(根据总数)
selectcategory_name top20_category,count(*) top20_category_contain_videos
fromt2
group bycategory_name
order bytop20_category_contain_videos desc;
最终SQL:
selectcategory_name top20_category,count(*) top20_category_contain_videos
from(selectvideoId,category_name
from(selectvideoid,views,category
fromgulivideo_orc
order byviews desc
limit 20)t1
lateral view explode(category) tmp_category as category_name)t2
group bycategory_name
order bytop20_category_contain_videos desc;
最终查询结果:
+-----------------+--------------------------------+--+
| top20_category | top20_category_contain_videos |
+-----------------+--------------------------------+--+
| Entertainment | 6 |
| Comedy | 6 |
| Music | 5 |
| People | 2 |
| Blogs | 2 |
| UNA | 1 |
+-----------------+--------------------------------+--+
需求4.统计视频观看数 Top50 所关联视频的所属类别的排序(Rank)
1.过滤出视频观看数前50的视频
selectrelatedId,--需要用于后面的Joinviews
fromgulivideo_orc
order byviews desc
limit 50;
t1:
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+--+
| relatedid | views |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-----------+--+
| ["OxBtqwlTMJQ","1hX1LxXwdl8","NvVbuVGtGSE","Ft6fC6RI4Ms","plv1e3MvxFw","1VL-ShAEjmg","y8k5QbVz3SE","weRfgj_349Q","_MFpPziLP9o","0M-xqfP1ibo","n4Pr_iCxxGU","UrWnNAMec98","QoREX_TLtZo","I-cm3GF-jX0","doIQXfJvydY","6hD3gGg9jMk","Hfbzju1FluI","vVN_pLl5ngg","3PnoFu027hc","7nrpwEDvusY"] | 42513417 |
| ["ut5fFyTkKv4","cYmeG712dD0","aDiNeF5dqnA","lNFFR1uwPGo","5Iyw4y6QR14","N1NO0iLbEt0","YtmGrR0tR7E","GZltV9lWQL4","qUDLSsSrrRA","wpQ1llsQ7qo","u9w2z-xtmqY","txVJgU3n72g","M6KcfOAckmw","orkbRVgRys0","HSuSo9hG_RI","3H3kKJLQgPs","46EsU9PmPyk","nn4XzrI1LLk","VTpKh6jFS7M","xH4b9ydgaHk"] | 20282464 |
| ["x0dzQeq6o5Q","BVvxtb0euBY","Tc4iq0IaPgE","caYdEBT36z0","Wch5akcVofs","FgN4E9-U82s","a0ffAHbxsLY","BaR9j3-radY","jbNCtXtAwUo","XJBfdkDlubU","c6JRE4ZBcuA","nRGZJ8GMg3g","BfR7iz2UqZY","cVHrwiP2vro","CowiFyYfcH4","uYxKs7xXopc","dzYaq2yOCb8","9o_D-M91Hhc","0O04jXoZmgY","XphZDHtt3D0"] | 16087899 |
| ["RB-wUgnyGv0","Bdgffnf8Hfw","YSLFsov31dA","KDmGXlOJPbQ","Hr-48XYy9Ns","6E1s0LDL-uM","0j3iXi0V3hk","uEXlbUV45pw","KvMsc6OdKWc","9kGIbR7dqyQ","pEu1muGrREA","DolERIvMbzM","gPtR2eSeDIw","3EpF4fRoT4U","Dl2roCEKffM","QERUjf8fbII","9oviIyGYolo","dblCjXdP7bo","IMPGIaXCnaA","TdGgKd4ZyuY"] | 15712924 |
.............
2.对t1表中的relatedId列进行炸裂(行转列)
selectrelated_ids
fromt1
lateral view explode(relatedid)tmp_related as related_ids
group byrelated_ids;--针对相关视频的id做一个去重
即:
selectrelated_ids
from(selectrelatedId,--需要用于后面的Joinviews
fromgulivideo_orc
order byviews desc
limit 50)t1
lateral view explode(relatedid)tmp_related as related_ids
group byrelated_ids;
t2:
+--------------+--+
| related_ids |
+--------------+--+
| K6FJukNqMKc |
| JojoMIZTr44 |
| XR8L2aVVq2A |
| _zdT1IoScRE |
| -0NOL61faoQ |
...........
+--------------+--+
950 rows selected (7.534 seconds)
3.取出观看数前五十视频关联ID所在视频的类别
selectcategory
fromt2
join gulivideo_orc t_ori
on t2.related_ids=t_ori.videoid;
即:
selectcategory
from(selectrelated_ids
from(selectrelatedId,--需要用于后面的Joinviews
fromgulivideo_orc
order byviews desc
limit 50)t1
lateral view explode(relatedid)tmp_related as related_ids
group byrelated_ids)t2
join gulivideo_orc t_ori
on t2.related_ids=t_ori.videoid;
t3:
+-----------------------+--+
| category |
+-----------------------+--+
| ["Film","Animation"] |
| ["Music"] |
| ["Music"] |
| ["Music"] |
| ["Entertainment"] |
| ["Comedy"] |
| ["Entertainment"] |
| ["Comedy"] |
| ["Entertainment"] |
| ["Music"] |
| ["Comedy"] |
| ["Entertainment"] |
| ["Gadgets","Games"] |
..............
+-----------------------+--+
856 rows selected (14.045 seconds)
4.把t3表的category炸开,然后统计类别并倒序排序
selectexplode(category) category_name
fromt3
即:
selectexplode(category) category_name
from(selectcategory
from(selectrelated_ids
from(selectrelatedId,--需要用于后面的Joinviews
fromgulivideo_orc
order byviews desc
limit 50)t1
lateral view explode(relatedid)tmp_related as related_ids
group byrelated_ids)t2
join gulivideo_orc t_ori
on t2.related_ids=t_ori.videoid)t3
t4:
+----------------+--+
| category_name |
+----------------+--+
| Blogs |
| Music |
| Music |
| Music |
| Music |
| Music |
| Music |
| Music |
| Music |
| Comedy |
| Music |
| Music |
| Music |
| Music |
| Comedy |
| Entertainment |
| People |
| Blogs |
....
+----------------+--+
1,037 rows selected (13.591 seconds)
5.分组(类别)求和(总数),排序
selectcategory_name,count(*) category_countdense_rank() over(order by count(*)) category_rank
fromt4
group bycategory_name
order bycategory_count desc;
即:
selectcategory_name,count(*) category_count,dense_rank() over(order by count(*) desc) category_rank
from(selectexplode(category) category_name
from(selectcategory
from(selectrelated_ids
from(selectrelatedId,--需要用于后面的Joinviews
fromgulivideo_orc
order byviews desc
limit 50)t1
lateral view explode(relatedid)tmp_related as related_ids
group byrelated_ids)t2
join gulivideo_orc t_ori
on t2.related_ids=t_ori.videoid)t3)t4
group bycategory_name
order bycategory_count desc;
最终查询结果:
+----------------+-----------------+----------------+--+
| category_name | category_count | category_rank |
+----------------+-----------------+----------------+--+
| Comedy | 232 | 1 |
| Entertainment | 216 | 2 |
| Music | 195 | 3 |
| Blogs | 51 | 4 |
| People | 51 | 4 |
| Film | 47 | 5 |
| Animation | 47 | 5 |
| News | 22 | 6 |
| Politics | 22 | 6 |
| Gadgets | 20 | 7 |
| Games | 20 | 7 |
| Sports | 19 | 8 |
| Howto | 14 | 9 |
| DIY | 14 | 9 |
| UNA | 13 | 10 |
| Places | 12 | 11 |
| Travel | 12 | 11 |
| Animals | 11 | 12 |
| Pets | 11 | 12 |
| Vehicles | 4 | 13 |
| Autos | 4 | 13 |
+----------------+-----------------+----------------+--+
21 rows selected (15.382 seconds)
建立中间表
从之前的几个需求的实现中发现,一旦涉及求category或relatedId的排名,就需要对两个字段原来的数据进行炸裂(explode),为了提高查询效率,
我们可以提前建立一个中间表来存储将这两个字段炸裂后的数据比如:
创建类别表
create table gulivideo_category( videoId string, uploader string, age int, categoryId string,--用string类型存储炸裂后的类别数据length int, views int, rate float, ratings int, comments int, relatedId array<string>)
row format delimited fields terminated by "\t"
collection items terminated by "&" stored as orc;
向该类别表中插入数据:
insert into table gulivideo_category selectvideoId,uploader,age,categoryId,length,views,rate,ratings,comments,relatedIdfrom gulivideo_orclateral view explode(category) catetory as categoryId
该中间表的数据如下:
+-----------------------------+------------------------------+-------------------------+--------------------------------+----------------------------+---------------------------+--------------------------+-----------------------------+------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| gulivideo_category.videoid | gulivideo_category.uploader | gulivideo_category.age | gulivideo_category.categoryid | gulivideo_category.length | gulivideo_category.views | gulivideo_category.rate | gulivideo_category.ratings | gulivideo_category.comments | gulivideo_category.relatedid |
+-----------------------------+------------------------------+-------------------------+--------------------------------+----------------------------+---------------------------+--------------------------+-----------------------------+------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
| o4x-VW_rCSE | HollyW00d81 | 581 | Entertainment | 74 | 3534116 | 4.480000019073486 | 9538 | 7756 | ["o4x-VW_rCSE","d2FEj5BCmmM","8kOs3J0a2aI","7ump9ir4w-I","w4lMCVUbAyA","cNt29huGNoc","1JVqsS16Hw8","ax58nnnNu2o","CFHDCz3x58M","qq-AALY0DE8","2VHU9CBNTaA","KLzMKnrBVWE","sMXQ8KC4w-Y","igecQ61MPP4","B3scImOTl7U","X1Qg9gQKEzI","7krlgBd8-J8","naKnVbWBwLQ","rmWvPbmbs8U","LMDik7Nc7PE"] |
| P1OXAQHv09E | smosh | 434 | Comedy | 59 | 3068566 | 4.550000190734863 | 15530 | 7919 | ["uGiGFQDS7mQ","j1aBQPCZoNE","WsmC6GXMj3I","pjgxSfhgQVE","T8vAZsCNJn8","7taTSPQUUMc","pkCCDp7Uc8c","NfajJLln0Zk","tD-ytSD-A_c","eHt1hQYZa2Y","qP9zpln4JVk","zK7p3o_Mqz4","ji2qlWmhblw","Hyu9HcqTcjE","YJ2W-GnuS0U","NHf2igxB8oo","rNfoeY7F6ig","XXugNPRMg-M","rpIAHWcCJVY","3V2msHD0zAg"] |
| N0TR0Irx4Y0 | Brookers | 228 | Comedy | 140 | 3836122 | 3.1600000858306885 | 12342 | 8066 | ["N0TR0Irx4Y0","hX21wbRAkx4","OnN9sX_Plvs","ygakq6lAogg","J3Jebemn-jM","bDgGYUA6Fro","aYcIG0Kmjxs","kMYX4JWke34","Cy8hppgAMR0","fzDTn342L3Q","kOq6sFmoUr0","HToQAB2kE3s","uuCr-nXLCRo","KDrJSNIGNDQ","pb13ggOw9CU","nDGRoqfwaIo","F2XZg0ocMPo","AMRBGt2fQGU","sKq0q8JdriI","bHnAETS5ssE"] |
| seGhTWE98DU | vidsquare | 619 | Music | 198 | 3296342 | 4.570000171661377 | 13657 | 8639 | ["xd03Xz7U41A","e5eLJIVY2zs","5-uT93fl8aE","OIpslHiXv7U","b9AohUJAJpc","nmyou72bIoA","y7tb8uhHn4w","F7JRcN8x3Ew","k7_i-K5uKSo","tpxJ9rpg5A8","P75-LsO-rNA","Zk8UFNbcysA","RIuaGhrvUFg","3zWi2Ig91_Q","zby0NdFGSPY","DBElvQroZ6I","wx2o5bdRWvg","tAKCSJcBQr4","7KWlOOhShhs","w3z4fuBchpU"] |
| bNF_P281Uu4 | mattharding2718 | 490 | Travel | 222 | 5231539 | 4.809999942779541 | 29152 | 12529 | ["7WmMcqp670s","RFtTSisZtVY","JFeSH655mas","6FzTEVnwYes","WZYDizqz_Vs","-3TgQyVVAuw","Jd8I5ParUcw","F0L-tYSBw-Y","S-NBv179--s","TjhEebKUVI4","OfP14Nx7504","Esr6EhgWF94","5Dhbw2UWif0","oK00chrYG-I","cwCov7D5NRI","Ji0kcdwE8m4","sLcrB5psfsI","4WCUlSlCl4E","V8DWO-Ox5aU","IekXybapCTo"] |
| bNF_P281Uu4 | mattharding2718 | 490 | Places | 222 | 5231539 | 4.809999942779541 | 29152 | 12529 | ["7WmMcqp670s","RFtTSisZtVY","JFeSH655mas","6FzTEVnwYes","WZYDizqz_Vs","-3TgQyVVAuw","Jd8I5ParUcw","F0L-tYSBw-Y","S-NBv179--s","TjhEebKUVI4","OfP14Nx7504","Esr6EhgWF94","5Dhbw2UWif0","oK00chrYG-I","cwCov7D5NRI","Ji0kcdwE8m4","sLcrB5psfsI","4WCUlSlCl4E","V8DWO-Ox5aU","IekXybapCTo"] |
| CQO3K8BcyGM | TheHill88 | 548 | Comedy | 65 | 3083875 | 2.2200000286102295 | 36460 | 15444 | ["AX3aSeHwf7A","Kt-kvm_5a7Q","V5m2SzdMcPY","xnPX6FhjJBA","KVwOe8i8z4k","Auo9xiUVLVY","SGg6EHotg3E","uc4u1ZMqU98","rxLf50ZME38","lX403b5o2yw","hwL6QuO7djk","Wt3e27jKKBY","W6-l-Qg-LYw","S7iuckvGDRs","ZngFu4vFW2o","xyzXDKFnWtI","P3VAy9Tqvrk","6bhPfjo4lmI","33ue4RBFc6w","us7FDugkLeA"] |
| 3gg5LOd_Zus | NewNuma | 570 | Entertainment | 207 | 4200257 | 3.7300000190734863 | 35964 | 17657 | ["MC1hVHDz7tw","oMWYrMAmbj8","LsDOIlFxMKU","yoiNmHAJ8GA","Do5OmHsDBd4","9Tzetis50Pg","iW1rW3NhanM","y6Lw8BTqWDM","PHv5d735A3E","kwzgb1Cq2kI","iHw72X6AMDw","ywXKhvhQ2v4","e14SdZSm6aA","4-4Am-xdjqU","-4T7RR7N2XA","K6x6fjR-dZY","18srddtzW4Q","ahrbm2G0N7g","ZZfBnemBats","vLl-2QG3HUs"] |
| sdUUx5FdySs | Madyeti47 | 497 | Film | 189 | 5840839 | 4.78000020980835 | 42417 | 17797 | ["_JH-KGwUV9M","t5LDRjiBIGA","EDjnYuVnqq4","er6QIKDtn8Y","4fODSPaqCfs","WMN1pPC1mcg","sdUUx5FdySs","SUbryU_XfX0","ko-H5gX5C14","bev149PYFZE","6wwcp0sNd7s","eFsArrSXLZ8","moum7hC8mY8","edxt2BqZErA","hjS5Q0KdHZI","W7qA56yC64A","ZA-hmLAiGos","WPaqSMbTYfk","pODLB84Glvo","O7oHhyIdPmc"] |
| sdUUx5FdySs | Madyeti47 | 497 | Animation | 189 | 5840839 | 4.78000020980835 | 42417 | 17797 | ["_JH-KGwUV9M","t5LDRjiBIGA","EDjnYuVnqq4","er6QIKDtn8Y","4fODSPaqCfs","WMN1pPC1mcg","sdUUx5FdySs","SUbryU_XfX0","ko-H5gX5C14","bev149PYFZE","6wwcp0sNd7s","eFsArrSXLZ8","moum7hC8mY8","edxt2BqZErA","hjS5Q0KdHZI","W7qA56yC64A","ZA-hmLAiGos","WPaqSMbTYfk","pODLB84Glvo","O7oHhyIdPmc"] |
+-----------------------------+------------------------------+-------------------------+--------------------------------+----------------------------+---------------------------+--------------------------+-----------------------------+------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+--+
需求5.统计每个类别视频观看数 Top10
1.给每个类别根据视频的观看数添加rank值(倒序)
selectcategoryId,videoid,views,rank() over(partition by categoryId order by views desc)rk
fromgulivideo_category;
t1:
+-------------+--------------+----------+-----+--+
| categoryid | videoid | views | rk |
+-------------+--------------+----------+-----+--+
| Animals | 2GWPOPSXGYI | 3660009 | 1 |
| Animals | xmsV9R8FsDA | 3164582 | 2 |
| Animals | 12PsUW-8ge4 | 3133523 | 3 |
| Animals | OeNggIGSKH8 | 2457750 | 4 |
| Animals | WofFb_eOxxA | 2075728 | 5 |
..........
| Vehicles | kfpfmXTP-Zs | 19 | 14158 |
| Vehicles | ZxuO77lsl6o | 19 | 14158 |
| Vehicles | grkq-JH7U1E | 19 | 14158 |
.......
+-------------+--------------+----------+-----+--+
2.过滤Top10
selectcategoryId,videoid,views,rk
from (selectcategoryId,videoid,views,rank() over(partition by categoryId order by views desc)rk
fromgulivideo_category)t1
whererk<=10;
最终查询结果:
+----------------+--------------+-----------+-----+--+
| categoryid | videoid | views | rk |
+----------------+--------------+-----------+-----+--+
| Animals | 2GWPOPSXGYI | 3660009 | 1 |
| Animals | xmsV9R8FsDA | 3164582 | 2 |
| Animals | 12PsUW-8ge4 | 3133523 | 3 |
| Animals | OeNggIGSKH8 | 2457750 | 4 |
| Animals | WofFb_eOxxA | 2075728 | 5 |
| Animals | AgEmZ39EtFk | 1999469 | 6 |
| Animals | a-gW3RbJd8U | 1836870 | 7 |
| Animals | 8CL2hetqpfg | 1646808 | 8 |
| Animals | QmroaYVD_so | 1645984 | 9 |
| Animals | Sg9x5mUjbH8 | 1527238 | 10 |
| Animation | sdUUx5FdySs | 5840839 | 1 |
| Animation | 6B26asyGKDo | 5147533 | 2 |
| Animation | H20dhY01Xjk | 3772116 | 3 |
| Animation | 55YYaJIrmzo | 3356163 | 4 |
| Animation | JzqumbhfxRo | 3230774 | 5 |
| Animation | eAhfZUZiwSE | 3114215 | 6 |
| Animation | h7svw0m-wO0 | 2866490 | 7 |
| Animation | tAq3hWBlalU | 2830024 | 8 |
| Animation | AJzU3NjDikY | 2569611 | 9 |
| Animation | ElrldD02if0 | 2337238 | 10 |
...
+----------------+--------------+-----------+-----+--+
210 rows selected (5.476 seconds)
需求6.统计上传视频最多的用户 Top10 以及他们上传的视频中被观看次数在前20的视频
1.统计上传视频最多的10个用户(从user表中)
selectuploader,--上传视频的作者videos--上传数量
fromgulivideo_user_orc
order byvideos desc
limit 10;
t1
+---------------------+---------+--+
| uploader | videos |
+---------------------+---------+--+
| expertvillage | 86228 |
| TourFactory | 49078 |
| myHotelVideo | 33506 |
| AlexanderRodchenko | 24315 |
| VHTStudios | 20230 |
| ephemeral8 | 19498 |
| HSN | 15371 |
| rattanakorn | 12637 |
| Ruchaneewan | 10059 |
| futifu | 9668 |
+---------------------+---------+--+
2.取出这10个人上传的所有视频,按照观看次数进行排名,取前20
selectvideo.videoid,video.views
fromt1
joingulivideo_orc video
ont1.uploader=video.uploader
order byviews desc
limit 20;
最终SQL:
selectvideo.videoid,video.views,rank() over(order by views desc)
from(selectuploader,--上传视频的作者videos--上传数量
fromgulivideo_user_orc
order byvideos desc
limit 10)t1
joingulivideo_orc video
ont1.uploader=video.uploader
order byviews desc
limit 20;
最终查询结果:
+----------------+--------------+----------------+--+
| video.videoid | video.views | rank_window_0 |
+----------------+--------------+----------------+--+
| -IxHBW0YpZw | 39059 | 1 |
| BU-fT5XI_8I | 29975 | 2 |
| ADOcaBYbMl0 | 26270 | 3 |
| yAqsULIDJFE | 25511 | 4 |
| vcm-t0TJXNg | 25366 | 5 |
| 0KYGFawp14c | 24659 | 6 |
| j4DpuPvMLF4 | 22593 | 7 |
| Msu4lZb2oeQ | 18822 | 8 |
| ZHZVj44rpjE | 16304 | 9 |
| foATQY3wovI | 13576 | 10 |
| -UnQ8rcBOQs | 13450 | 11 |
| crtNd46CDks | 11639 | 12 |
| D1leA0JKHhE | 11553 | 13 |
| NJu2oG1Wm98 | 11452 | 14 |
| CapbXdyv4j4 | 10915 | 15 |
| epr5erraEp4 | 10817 | 16 |
| IyQoDgaLM7U | 10597 | 17 |
| tbZibBnusLQ | 10402 | 18 |
| _GnCHodc7mk | 9422 | 19 |
| hvEYlSlRitU | 7123 | 20 |
+----------------+--------------+----------------+--+
需求6:方案二
这个需求有歧义,如果是求每个上传者上传的视频中的TOP20,则进行如下的查询
1.先找到这十个上传者上传的所有视频和对应的播放数
selectvideo.videoid,video.views,t1.uploader
from(selectuploader,--上传视频的作者videos--上传数量
fromgulivideo_user_orc
order byvideos desc
limit 10)t1
join gulivideo_orc video
ont1.uploader=video.uploader;
t2
+----------------+--------------+----------------+--+
| video.videoid | video.views | t1.uploader |
+----------------+--------------+----------------+--+
| dIeNr9D3aeo | 3338 | expertvillage |
| c0bJkNHBAC0 | 5962 | expertvillage |
| _GnCHodc7mk | 9422 | expertvillage |
| -UnQ8rcBOQs | 13450 | expertvillage |
| hvEYlSlRitU | 7123 | expertvillage |
| Bo06hwZZPSs | 787 | expertvillage |
| 7gJnofOYBhI | 1532 | expertvillage |
| CKq7z8OKfnI | 412 | expertvillage |
| 7xME6Oxh_cc | 1607 | expertvillage |
| k88mOlfBego | 766 | expertvillage |
....................
| k8sdadasdao | 726 | expertkaly |
........
+----------------+--------------+----------------+--+
2.根据uploader分组,然后排序(views),作为rank值
selectuploader,videoid,views,rank() over(partition by uploader order by views desc) rk
from(selectvideo.videoid,video.views,t1.uploaderfrom(selectuploader,--上传视频的作者videos--上传数量fromgulivideo_user_orcorder byvideos desclimit 10)t1join gulivideo_orc videoont1.uploader=video.uploader)t2;
limit 40;
t3
+--------------+--------------+--------+-----+--+
| uploader | videoid | views | rk |
+--------------+--------------+--------+-----+--+
| Ruchaneewan | 5_T5Inddsuo | 3132 | 1 |
| Ruchaneewan | wje4lUtbYNU | 1086 | 2 ||
............
| Ruchaneewan | dOlfPsFSjw0 | 206 | 20 |
...........
| Ruchaneewan | nKxRzjKcxM0 | 192 | 22 |
| Ruchaneewan | GyyZLkd4ZDU | 188 | 23 |
| expertvillage | -IxHBW0YpZw | 39059 | 1 |
| expertvillage | BU-fT5XI_8I | 29975 | 2 |
| expertvillage | ADOcaBYbMl0 | 26270 | 3 |
| expertvillage | yAqsULIDJFE | 25511 | 4 |
| expertvillage | vcm-t0TJXNg | 25366 | 5 |
| expertvillage | 0KYGFawp14c | 24659 | 6 |
| expertvillage | j4DpuPvMLF4 | 22593 | 7 |
...........
+----------------+--------------+--------+-----+--+
3.根据t3表中的rk过滤出每个用户前20的数据
selectuploader,videoid,views,rk
from(selectuploader,videoid,views,rank() over(partition by uploader order by views desc) rkfrom(selectvideo.videoid,video.views,t1.uploaderfrom(selectuploader,--上传视频的作者videos--上传数量fromgulivideo_user_orcorder byvideos desclimit 10)t1join gulivideo_orc videoont1.uploader=video.uploader)t2)t3
whererk<=10;
最终查询结果:
+----------------+--------------+--------+-----+--+
| uploader | videoid | views | rk |
+----------------+--------------+--------+-----+--+
| Ruchaneewan | 5_T5Inddsuo | 3132 | 1 |
| Ruchaneewan | wje4lUtbYNU | 1086 | 2 |
| Ruchaneewan | i8rLbOUhAlM | 549 | 3 |
| Ruchaneewan | OwnEtde9_Co | 453 | 4 |
| Ruchaneewan | 5Zf0lbAdJP0 | 441 | 5 |
| Ruchaneewan | wenI5MrYT20 | 426 | 6 |
| Ruchaneewan | 3hzOiFP-5so | 420 | 7 |
| Ruchaneewan | Iq4e3SopjxQ | 420 | 7 |
| Ruchaneewan | JgyOlXjjuw0 | 418 | 9 |
| Ruchaneewan | fGBVShTsuyo | 395 | 10 |
| expertvillage | -IxHBW0YpZw | 39059 | 1 |
| expertvillage | BU-fT5XI_8I | 29975 | 2 |
| expertvillage | ADOcaBYbMl0 | 26270 | 3 |
| expertvillage | yAqsULIDJFE | 25511 | 4 |
| expertvillage | vcm-t0TJXNg | 25366 | 5 |
| expertvillage | 0KYGFawp14c | 24659 | 6 |
| expertvillage | j4DpuPvMLF4 | 22593 | 7 |
| expertvillage | Msu4lZb2oeQ | 18822 | 8 |
| expertvillage | ZHZVj44rpjE | 16304 | 9 |
| expertvillage | foATQY3wovI | 13576 | 10 |
..........
+----------------+--------------+--------+-----+--+byvideos desclimit 10)t1join gulivideo_orc videoont1.uploader=video.uploader)t2)t3
whererk<=10;
最终查询结果:
+----------------+--------------+--------+-----+--+
| uploader | videoid | views | rk |
+----------------+--------------+--------+-----+--+
| Ruchaneewan | 5_T5Inddsuo | 3132 | 1 |
| Ruchaneewan | wje4lUtbYNU | 1086 | 2 |
| Ruchaneewan | i8rLbOUhAlM | 549 | 3 |
| Ruchaneewan | OwnEtde9_Co | 453 | 4 |
| Ruchaneewan | 5Zf0lbAdJP0 | 441 | 5 |
| Ruchaneewan | wenI5MrYT20 | 426 | 6 |
| Ruchaneewan | 3hzOiFP-5so | 420 | 7 |
| Ruchaneewan | Iq4e3SopjxQ | 420 | 7 |
| Ruchaneewan | JgyOlXjjuw0 | 418 | 9 |
| Ruchaneewan | fGBVShTsuyo | 395 | 10 |
| expertvillage | -IxHBW0YpZw | 39059 | 1 |
| expertvillage | BU-fT5XI_8I | 29975 | 2 |
| expertvillage | ADOcaBYbMl0 | 26270 | 3 |
| expertvillage | yAqsULIDJFE | 25511 | 4 |
| expertvillage | vcm-t0TJXNg | 25366 | 5 |
| expertvillage | 0KYGFawp14c | 24659 | 6 |
| expertvillage | j4DpuPvMLF4 | 22593 | 7 |
| expertvillage | Msu4lZb2oeQ | 18822 | 8 |
| expertvillage | ZHZVj44rpjE | 16304 | 9 |
| expertvillage | foATQY3wovI | 13576 | 10 |
..........
+----------------+--------------+--------+-----+--+