The Full Experiment in 60,000+ Words | Exploring an Economical Data Storage Strategy with Alluxio

Background

As big data adoption continues to expand, data warehouse and data lake practices keep emerging, and the big data wave is booming across telecom, finance, government, and other industries. Over the past 4-5 years, we have watched enterprise data volumes swell ever faster, with storage costs growing linearly as big data innovation continues. This has made enterprises more cautious about big data and, in turn, has slowed the pace of their internal data-driven transformation.

The core challenge: how to build data lake storage more economically.

Since big data storage engines first appeared in 2006, the ecosystem has flourished. On the compute side, MapReduce, Spark, Hive, Impala, Presto, Storm, and Flink have successively broken new ground, while the storage side has remained comparatively cautious and steady. Over the past decade and more, the storage systems most widely discussed in the Apache Hadoop ecosystem have still been HDFS and Ozone.

HDFS

Hadoop HDFS is a distributed file system designed to run on commodity hardware for broad applicability. It has much in common with existing distributed file systems, but its distinguishing traits are clear: it is highly fault-tolerant, designed for deployment on low-cost hardware, and supports horizontal scaling. HDFS provides high-throughput access to application data and suits applications that process massive data sets.

Ozone

Apache Ozone is a highly scalable distributed storage system for analytics, big data, and cloud-native applications. Ozone supports an S3-compatible object API as well as a Hadoop-compatible file system protocol, and it is optimized for efficient object store and file system operations.

An economical data storage strategy rests chiefly on two key properties; once these are in place, every other enhancement follows naturally:

  1. Store each piece of data in the storage system best suited to it;
  2. Keep the storage strategy as non-intrusive to upper-layer applications as possible.

Take the typical HDFS 3-replica strategy: on one hand it guarantees high availability of data blocks, and on the other the extra replicas better satisfy data-locality requirements and raise read throughput. To serve data well, the hardware is usually provisioned with relatively good disks. In early big data practice, a standardized hardware and software stack sped up adoption of the new technology. But as data accumulates, access frequency for much of it drops off exponentially; cold data kept only for compliance audits not only occupies large amounts of space on the production cluster but may go a whole year without a single access. That is an enormous waste of resources.
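As a back-of-the-envelope illustration of why this matters, the raw-capacity cost of triple replication can be sketched as follows. The 100 TB figure is an assumption for illustration only, not a number from the experiment:

```shell
# Hypothetical sizing: raw disk consumed by HDFS triple replication.
# logical_tb is an assumed figure, not measured data.
logical_tb=100
replication=3
raw_tb=$((logical_tb * replication))
echo "${logical_tb} TB logical -> ${raw_tb} TB raw at replication=${replication}"
```

If a large share of that logical data is cold, every cold terabyte still pays the full 3x premium on premium disks, which is the waste the tiering strategy targets.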

At the current stage of big data development, fine-grained data storage is on the agenda. What is needed is a tiered storage system that, while preserving current compute performance, migrates warm and cold data automatically and transparently to upper-layer applications, keeping storage and maintenance costs under control.

Validating the Key Properties

In this article we take a first pass at an economical data storage strategy: we first make the two key properties above concrete, then explore their technical feasibility through several experiments.

**Key property 1:** use one storage system for hot data and another storage system for cold data.

**Key property 2:** a unified namespace spans multiple storage systems and serves data access to upper-layer applications through that single namespace.
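The read side of such a unified namespace can be modeled in a few lines of shell. This is a toy stand-in only: two local directories play the role of the two mounted under-stores, and the lookup loop mimics a priority-ordered read (first store wins); none of the names below come from Alluxio itself:

```shell
# Toy model of priority-ordered read resolution across two "stores".
# hdfs1/hdfs2 are local scratch directories standing in for under-stores.
hdfs1=$(mktemp -d)
hdfs2=$(mktemp -d)
echo "row-from-hdfs2" > "$hdfs2/part-0"   # the file exists only in the lower-priority store

resolved=""
for store in "$hdfs1" "$hdfs2"; do        # priority order: hdfs1, then hdfs2
  if [ -f "$store/part-0" ]; then
    resolved=$(cat "$store/part-0")       # take the first hit and stop
    break
  fi
done
echo "$resolved"
```

The point is that the caller only ever sees one logical path; which physical store answered is invisible, which is exactly the non-intrusiveness that key property 2 demands.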

Technology choices:

  1. Compute engine: Hive (most enterprise users use a SQL engine as their data development tool)
  2. Storage engines: HDFS/Ozone (common storage in the Apache ecosystem)
  3. Data orchestration engine: Alluxio (a third-party open-source component compatible with most Apache-ecosystem components)

Hive

The Apache Hive™ data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and a JDBC driver are provided to connect users to Hive.

About Alluxio

The Alluxio data orchestration system is the world's first distributed, very-large-scale data orchestration system, incubated at UC Berkeley's AMPLab. Since the project was open-sourced, more than 1,200 contributors from over 300 organizations have taken part in its development. Alluxio orchestrates data closer to data analytics and AI/ML applications across clusters, regions, and countries in any cloud, giving upper-layer applications memory-speed data access.

As a de facto standard for compute-storage separation in the big data ecosystem, it has been proven in production by leading Chinese cloud vendors including Alibaba Cloud, Tencent Cloud, Huawei Cloud, and Kingsoft Cloud, and serves as a cornerstone technology for enterprise private clouds. Since its establishment in 2021, the company has appeared on lists including the Zhongguancun International Frontier Technology Innovation Competition TOP 10 for big data and cloud computing, the 2021 PEdaily Venture50 for digital technology, and the "Sci-Tech China" open source innovation list.


We conduct the technical feasibility study in two phases:

**Phase 1:** use the same type of storage system, HDFS, and implement hot/cold tiering across two HDFS systems [simulated scenario: a dedicated cold-data HDFS built with HDFS 3.0 erasure coding or on disk-dense machines]

**Phase 2:** use different types of storage systems, with HDFS as the hot-data store and Ozone as the cold-data store [simulated scenario: HDFS serves hot data / Ozone serves cold data]

Validation Process

Deployment architecture

[Figure: deployment architecture diagram]

Software versions:

  1. Compute engine: Hive 2.3.9
  2. Storage engines: Hadoop 2.10.1, Ozone 1.2.1, Alluxio 2.8
  3. All components deployed in single-node mode

Cluster plan:

| Host | Components |
| --- | --- |
| ip-172-31-30-130.us-west-2.compute.internal | Hive, HDFS1 |
| ip-172-31-19-127.us-west-2.compute.internal | HDFS2, Ozone |
| ip-172-31-17-3.us-west-2.compute.internal | Alluxio |

Experiment 1: transparent hot/cold data tiering across HDFS clusters with Alluxio

## Step 1: Create the database and partitioned table in Hive; data is stored on HDFS_1 by default

create database test location "/user/hive/test.db";
create external table test.test_part(value string) partitioned by (dt string);

# Create the database

hive> create database test location '/user/hive/test.db';
OK                                                                                                                                                                                                                        
Time taken: 1.697 seconds                                                                                                                                                                                                 
hive> 

# Create the table

hive> create external table test.test_part(value string) partitioned by (dt string);
OK                                                                                                                                                                                                                        
Time taken: 0.607 seconds                                                                                                                                                                                                 
hive>                                                                                                

## Step 2: Integrate the two HDFS clusters into a unified namespace with an Alluxio Union URI

alluxio fs mount \
--option alluxio-union.hdfs1.uri=hdfs://namenode_1:8020/user/hive/test.db/test_part \
--option alluxio-union.hdfs2.uri=hdfs://namenode_2:8020/user/hive/test.db/test_part \
--option alluxio-union.priority.read=hdfs1,hdfs2 \
--option alluxio-union.collection.create=hdfs1 \
/user/hive/test.db/test_part union://test_part/ 

# Mount the test directory via an Alluxio Union URI

[root@ip-172-31-17-3 ~]# alluxio fs mkdir /user/hive/test.db                                                                                                                                                           
Successfully created directory /user/hive/test.db                                                                                                                                                                         
[root@ip-172-31-17-3 conf]# alluxio fs mount \                                                                                                                                                                            
> --option alluxio-union.hdfs1.uri=hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part \                                                                                                  
> --option alluxio-union.hdfs2.uri=hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part \                                                                                                  
> --option alluxio-union.priority.read=hdfs1,hdfs2 \                                                                                                                                                                      
> --option alluxio-union.collection.create=hdfs1 \                                                                                                                                                                        
> /user/hive/test.db/test_part union://test_part/                                                                                                                                                                         
Mounted union://test_part/ at /user/hive/test.db/test_part                                                                                                                                                                
[root@ip-172-31-17-3 ~]#

## Step 3: Point the Hive table location at the Union URI, hiding the cross-storage details

alter table test.test_part set location "alluxio://alluxio:19998/user/hive/test.db/test_part";

# Update the Hive table's location

hive> alter table test.test_part set location "alluxio://ip-172-31-17-3.us-west-2.compute.internal:19998/user/hive/test.db/test_part";
OK                                                                                                                                                                                                                        
Time taken: 0.143 seconds                                                                                                                                                                                                 
hive> 

## Step 4: Generate test data

mkdir dt\=2022-06-0{1..6}
echo 1abc > dt\=2022-06-01/000000_0
echo 2def > dt\=2022-06-02/000000_0
echo 3ghi > dt\=2022-06-03/000000_0
echo 4jkl > dt\=2022-06-04/000000_0
echo 5mno > dt\=2022-06-05/000000_0
echo 6pqr > dt\=2022-06-06/000000_0
hdfs dfs -put dt\=2022-06-0{1..3} hdfs://namenode_1:8020/user/hive/test.db/test_part
hdfs dfs -put dt\=2022-06-0{4..6} hdfs://namenode_2:8020/user/hive/test.db/test_part                                                                                                                                                    
[root@ip-172-31-17-3 ~]# mkdir dt\=2022-06-0{1..6}                                                                                                                                                                        
[root@ip-172-31-17-3 ~]# echo 1abc > dt\=2022-06-01/000000_0                                                                                                                                                              
[root@ip-172-31-17-3 ~]# echo 2def > dt\=2022-06-02/000000_0                                                                                                                                                              
[root@ip-172-31-17-3 ~]# echo 3ghi > dt\=2022-06-03/000000_0                                                                                                                                                              
[root@ip-172-31-17-3 ~]# echo 4jkl > dt\=2022-06-04/000000_0                                                                                                                                                              
[root@ip-172-31-17-3 ~]# echo 5mno > dt\=2022-06-05/000000_0                                                                                                                                                              
[root@ip-172-31-17-3 ~]# echo 6pqr > dt\=2022-06-06/000000_0  

# Load the test data into hdfs1 and hdfs2 respectively

[root@ip-172-31-17-3 ~]# hdfs dfs -put dt\=2022-06-0{1..3} hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part                                                                            
[root@ip-172-31-17-3 ~]# hdfs dfs -mkdir -p hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part                                                                                           
[root@ip-172-31-17-3 ~]# hdfs dfs -put dt\=2022-06-0{4..6} hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part 

# Query hdfs1 and hdfs2 to confirm the data has landed

[root@ip-172-31-17-3 ~]# hdfs dfs -ls hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part                                                                                                 
Found 3 items                                                                                                                                                                                                             
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 08:09 hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-01                                                          
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 08:09 hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-02                                                          
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 08:09 hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-03                                                          
[root@ip-172-31-17-3 ~]# hdfs dfs -ls hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part                                                                                                 
Found 3 items                                                                                                                                                                                                             
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 08:10 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-04                                                          
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 08:10 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-05                                                          
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 08:10 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-06 

# Query the Alluxio Union URI to double-check the data in hdfs1 and hdfs2 and confirm the cross-storage union mapping is in effect

[root@ip-172-31-17-3 ~]# alluxio fs ls /user/hive/test.db/test_part                                                                                                                                                       
drwxr-xr-x  root           hdfsadmingroup               1       PERSISTED 07-13-2022 08:09:19:243  DIR /user/hive/test.db/test_part/dt=2022-06-02                                                                         
drwxr-xr-x  root           hdfsadmingroup               1       PERSISTED 07-13-2022 08:09:19:219  DIR /user/hive/test.db/test_part/dt=2022-06-01                                                                         
drwxr-xr-x  root           hdfsadmingroup               1       PERSISTED 07-13-2022 08:10:49:740  DIR /user/hive/test.db/test_part/dt=2022-06-06                                                                         
drwxr-xr-x  root           hdfsadmingroup               1       PERSISTED 07-13-2022 08:10:49:721  DIR /user/hive/test.db/test_part/dt=2022-06-05                                                                         
drwxr-xr-x  root           hdfsadmingroup               1       PERSISTED 07-13-2022 08:10:49:698  DIR /user/hive/test.db/test_part/dt=2022-06-04                                                                         
drwxr-xr-x  root           hdfsadmingroup               1       PERSISTED 07-13-2022 08:09:19:263  DIR /user/hive/test.db/test_part/dt=2022-06-03                                                                         
[root@ip-172-31-17-3 ~]#

## Step 5: Refresh the Hive table metadata

MSCK REPAIR TABLE test.test_part;

hive> MSCK REPAIR TABLE test.test_part;                                                                                                                                                                                   
OK                                                                                                                                                                                                                        
Partitions not in metastore:    test_part:dt=2022-06-01 test_part:dt=2022-06-02 test_part:dt=2022-06-03 test_part:dt=2022-06-04 test_part:dt=2022-06-05 test_part:dt=2022-06-06                                           
Repair: Added partition to metastore test.test_part:dt=2022-06-01                                                                                                                                                         
Repair: Added partition to metastore test.test_part:dt=2022-06-02                                                                                                                                                         
Repair: Added partition to metastore test.test_part:dt=2022-06-03                                                                                                                                                         
Repair: Added partition to metastore test.test_part:dt=2022-06-04                                                                                                                                                         
Repair: Added partition to metastore test.test_part:dt=2022-06-05                                                                                                                                                         
Repair: Added partition to metastore test.test_part:dt=2022-06-06                                                                                                                                                         
Time taken: 1.677 seconds, Fetched: 7 row(s)

# A select confirms that, after the Hive metadata refresh, the Union URI mapping is reflected in the Hive table

hive> select * from test.test_part;
OK                                                                                                                                                                                                                        
1abc    2022-06-01                                                                                                                                                                                                        
2def    2022-06-02                                                                                                                                                                                                        
3ghi    2022-06-03                                                                                                                                                                                                        
4jkl    2022-06-04                                                                                                                                                                                                        
5mno    2022-06-05                                                                                                                                                                                                        
6pqr    2022-06-06                                                                                                                                                                                                        
Time taken: 1.624 seconds, Fetched: 6 row(s)                                                                                                                                                                              
hive>

## Step 6: Configure the automatic hot/cold tiering policy

alluxio fs policy add /user/hive/test.db/test_part "ufsMigrate(olderThan(2m), UFS[hdfs1]:REMOVE, UFS[hdfs2]:STORE)"

# Set the policy: cold data (here, anything older than 2 minutes) migrates automatically from hot storage (hdfs1) to cold storage (hdfs2)

[root@ip-172-31-17-3 ~]# alluxio fs policy add /user/hive/test.db/test_part "ufsMigrate(olderThan(2m), UFS[hdfs1]:REMOVE, UFS[hdfs2]:STORE)"
Policy ufsMigrate-/user/hive/test.db/test_part is added to /user/hive/test.db/test_part.                                            

# Verify via the Alluxio CLI that the policy was set

[root@ip-172-31-17-3 ~]# alluxio fs policy list
id: 1657700423909                                                                                                                                                                                                         
name: "ufsMigrate-/user/hive/test.db/test_part"                                                                                                                                                                           
path: "/user/hive/test.db/test_part"                                                                                                                                                                                      
created_at: 1657700423914                                                                                                                                                                                                 
scope: "RECURSIVE"                                                                                                                                                                                                        
condition: "olderThan(2m)"                                                                                                                                                                                                
action: "DATA(UFS[hdfs1]:REMOVE, UFS[hdfs2]:STORE)"                                                                                                                                                                                                                                                                                                                                                                                
[root@ip-172-31-17-3 ~]#
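Conceptually, what the `ufsMigrate(olderThan(2m), UFS[hdfs1]:REMOVE, UFS[hdfs2]:STORE)` policy does can be imitated on a local disk. In this stand-in, two scratch directories play the roles of the hot and cold stores and `find -mmin` plays the role of the age condition; the real policy is of course evaluated and executed by Alluxio itself, and the file names here are invented:

```shell
# Local stand-in for the migration action: files older than the threshold
# move from the "hot" directory to the "cold" one.
hot=$(mktemp -d)
cold=$(mktemp -d)
touch "$hot/fresh_partition"
touch -d '10 minutes ago' "$hot/stale_partition"   # simulate an old file (GNU touch)

# olderThan(2m) analogue: select files whose mtime is >2 minutes in the past,
# then REMOVE from hot + STORE in cold, i.e. a move.
find "$hot" -type f -mmin +2 -exec mv {} "$cold/" \;
ls "$cold"
```

As in the experiment, the fresh file stays put while the stale one ends up in the cold store; the difference is that Alluxio performs this continuously and keeps the unified namespace pointing at whichever store currently holds the data.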

# After the policy takes effect, inspect hdfs1 and hdfs2: every file in hdfs1 older than 2 minutes has migrated to hdfs2

[root@ip-172-31-17-3 logs]# hdfs dfs -ls hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/test.db/test_part                                                                                              
[root@ip-172-31-17-3 logs]# hdfs dfs -ls hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part                                                                                              
Found 6 items                                                                                                                                                                                                             
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 08:26 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-01                                                          
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 08:26 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-02                                                          
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 08:26 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-03                                                          
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 08:10 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-04                                                          
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 08:10 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-05                                                          
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 08:10 hdfs://ip-172-31-19-127.us-west-2.compute.internal:8020/user/hive/test.db/test_part/dt=2022-06-06                                                          
[root@ip-172-31-17-3 logs]#

# Both during and after the automatic cold-data migration, Hive returns the expected results:

hive> select * from test.test_part;
OK                                                                                                                                                                                                                        
1abc    2022-06-01                                                                                                                                                                                                        
2def    2022-06-02                                                                                                                                                                                                        
3ghi    2022-06-03                                                                                                                                                                                                        
4jkl    2022-06-04                                                                                                                                                                                                        
5mno    2022-06-05                                                                                                                                                                                                        
6pqr    2022-06-06                                                                                                                                                                                                        
Time taken: 0.172 seconds, Fetched: 6 row(s)                                                                                                                                                                              
hive>

Finally, the two stages of experiment 1 — (1) federating a Hive table across two HDFS storage systems via Alluxio's Union URI, and (2) transparent hot/cold data tiering across the two HDFS systems via Alluxio — are shown as simplified schematics in Figures 1 and 2, to clarify the goal, process, and outcome of the experiment.


Figure 1: Federating a Hive table across two HDFS storage systems with Alluxio's Union URI


Figure 2: Transparent hot/cold data tiering across two HDFS storage systems with Alluxio

The next experiment simply swaps the two HDFS systems of the previous setup for two heterogeneous systems, HDFS (hot storage) and Ozone (cold storage); in terms of transparent tiering, the effect is the same.

Experiment 2: transparent hot/cold data tiering across heterogeneous storage (HDFS and Ozone) with Alluxio

## Step 1: Create the database and table in Hive

create database hdfsToOzone location '/user/hive/hdfsToOzone.db';
create external table hdfsToOzone.test(value string) partitioned by (dt string);

# Create the database

hive> create database hdfsToOzone location '/user/hive/hdfsToOzone.db';
OK                                                                                                                                                                                                                        
Time taken: 0.055 seconds                                                                                                                                                                                                 
hive>

# Create the table

hive> create external table hdfsToOzone.test(value string) partitioned by (dt string);
OK                                                                                                                                                                                                                        
Time taken: 0.1 seconds                                                                                                                                                                                                   
hive>

## Step 2: Integrate the HDFS and Ozone clusters into a unified namespace with an Alluxio Union URI

alluxio fs mount \
--option alluxio-union.hdfs.uri=hdfs://HDFS1:8020/user/hive/hdfsToOzone.db/test \
--option alluxio-union.ozone.uri=o3fs://bucket.volume/hdfsToOzone.db/test \
--option alluxio-union.priority.read=hdfs,ozone \
--option alluxio-union.collection.create=hdfs \
--option alluxio.underfs.hdfs.configuration=/mnt1/ozone-1.2.1/etc/hadoop/ozone-site.xml \
/user/hive/hdfsToOzone.db/test union://HDFS_TO_OZONE/ 

# Create the volume and bucket in Ozone using its CLI

[root@ip-172-31-19-127 ~]# ozone sh volume create /v-alluxio                                                                                                                                                      
[root@ip-172-31-19-127 ~]# ozone sh bucket create /v-alluxio/b-alluxio 
[root@ip-172-31-19-127 ~]# ozone fs -mkdir -p o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test                                                                                                                      
[root@ip-172-31-19-127 ~]# 

# First create the experiment directory in Alluxio, then mount it via the Union URI

[root@ip-172-31-17-3 ~]# alluxio fs mkdir /user/hive/hdfsToOzone.db
Successfully created directory /user/hive/hdfsToOzone.db
[root@ip-172-31-17-3 ~]# alluxio fs mount \                                                                                                                                                                               
> --option alluxio-union.hdfs.uri=hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test \                                                                                                 
> --option alluxio-union.ozone.uri=o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test \                                                                                                                                       
> --option alluxio-union.priority.read=hdfs,ozone \                                                                                                                                                                       
> --option alluxio-union.collection.create=hdfs \                                                                                                                                                                         
> --option alluxio.underfs.hdfs.configuration=/mnt1/ozone-1.2.1/etc/hadoop/ozone-site.xml \                                                                                                                               
> /user/hive/hdfsToOzone.db/test union://HDFS_TO_OZONE/                                                                                                                                                                   
Mounted union://HDFS_TO_OZONE/ at /user/hive/hdfsToOzone.db/test                                                                                                                                                          
[root@ip-172-31-17-3 ~]#

## Step 3: Point the Hive table location at the Union URI, hiding the details of the heterogeneous storage

alter table hdfsToOzone.test set location "alluxio://alluxio:19998/user/hive/hdfsToOzone.db/test";

# Update the Hive table's location

hive> alter table hdfsToOzone.test set location "alluxio://ip-172-31-17-3.us-west-2.compute.internal:19998/user/hive/hdfsToOzone.db/test";
OK                                                                                                                                                                                                                        
Time taken: 1.651 seconds                                                                                                                                                                                                 
hive> 

## Step 4: Generate test data

ozone fs -put dt\=2022-06-0{1..3} o3fs://b-alluxio.v-alluxio.ozone:9862/hdfsToOzone.db/test
hdfs dfs -put dt\=2022-06-0{4..6} hdfs://HDFS1:8020/user/hive/hdfsToOzone.db/test

# Load data into Ozone

[root@ip-172-31-19-127 ~]# ozone fs -put dt\=2022-06-0{1..3} o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test                                                                                                                 
2022-07-13 10:00:38,920 [main] INFO impl.MetricsConfig: Loaded properties from hadoop-metrics2.properties                                                                                                                 
2022-07-13 10:00:38,981 [main] INFO impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).                                                                                                             
2022-07-13 10:00:38,981 [main] INFO impl.MetricsSystemImpl: XceiverClientMetrics metrics system started                                                                                                                   
2022-07-13 10:00:39,198 [main] INFO metrics.MetricRegistries: Loaded MetricRegistries class org.apache.ratis.metrics.impl.MetricRegistriesImpl

# Query Ozone via its CLI to confirm the data has landed

[root@ip-172-31-19-127 ~]# ozone fs -ls o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test                                                                                                                                      
Found 3 items                                                                                                                                                                                                             
drwxrwxrwx   - root root          0 2022-07-13 10:00 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-01                                                                                                         
drwxrwxrwx   - root root          0 2022-07-13 10:00 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-02                                                                                                         
drwxrwxrwx   - root root          0 2022-07-13 10:00 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-03                                                                                                         
[root@ip-172-31-19-127 ~]#

# Load data into hdfs1, then confirm via the CLI that it has landed

[root@ip-172-31-17-3 ~]# hdfs dfs -put dt\=2022-06-0{4..6} hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test                                                                          
[root@ip-172-31-17-3 ~]# hdfs dfs -ls hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test                                                                                               
Found 3 items                                                                                                                                                                                                             
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 10:06 hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test/dt=2022-06-04                                                        
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 10:06 hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test/dt=2022-06-05                                                        
drwxr-xr-x   - root hdfsadmingroup          0 2022-07-13 10:06 hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test/dt=2022-06-06                                                        
[root@ip-172-31-17-3 ~]#

#Query via the Alluxio CLI to confirm again that the data resides in hdfs1 and ozone, and that the Union URI cross-storage mapping is in effect

[root@ip-172-31-17-3 ~]# alluxio fs ls /user/hive/hdfsToOzone.db/test                                                                                                                                                     
drwxrwxrwx  root           root                         0       PERSISTED 07-13-2022 10:00:40:670  DIR /user/hive/hdfsToOzone.db/test/dt=2022-06-02                                                                       
drwxrwxrwx  root           root                         0       PERSISTED 07-13-2022 10:00:38:691  DIR /user/hive/hdfsToOzone.db/test/dt=2022-06-01                                                                       
drwxr-xr-x  root           hdfsadmingroup               0       PERSISTED 07-13-2022 10:06:29:206  DIR /user/hive/hdfsToOzone.db/test/dt=2022-06-06                                                                       
drwxr-xr-x  root           hdfsadmingroup               0       PERSISTED 07-13-2022 10:06:29:186  DIR /user/hive/hdfsToOzone.db/test/dt=2022-06-05                                                                       
drwxr-xr-x  root           hdfsadmingroup               0       PERSISTED 07-13-2022 10:06:29:161  DIR /user/hive/hdfsToOzone.db/test/dt=2022-06-04                                                                       
drwxrwxrwx  root           root                         0       PERSISTED 07-13-2022 10:00:40:762  DIR /user/hive/hdfsToOzone.db/test/dt=2022-06-03                                                                       
[root@ip-172-31-17-3 ~]# 
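The unified listing above is the Union URI at work: partitions dt=01..03 live in Ozone while dt=04..06 live in hdfs1, yet all six appear under one Alluxio path. Conceptually, the union mount merges the listings of both under-stores; a toy Python sketch of that merge (an illustration only, not Alluxio's implementation):

```python
def union_list(stores):
    """Merge directory listings from several under-stores into one namespace view."""
    merged = {}
    for store, entries in stores.items():
        for name in entries:
            merged.setdefault(name, store)  # on a name collision, the first store wins
    return sorted(merged)

# Which store currently holds each partition of hdfsToOzone.db/test:
stores = {
    "ozone": ["dt=2022-06-01", "dt=2022-06-02", "dt=2022-06-03"],
    "hdfs1": ["dt=2022-06-04", "dt=2022-06-05", "dt=2022-06-06"],
}
print(union_list(stores))  # all six partitions, presented as a single listing
```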

## step 5: Refresh the Hive table metadata

MSCK REPAIR TABLE hdfsToOzone.test;

hive> MSCK REPAIR TABLE hdfsToOzone.test;                                                                                                                                                                                 
OK                                                                                                                                                                                                                        
Partitions not in metastore:    test:dt=2022-06-01      test:dt=2022-06-02      test:dt=2022-06-03      test:dt=2022-06-04      test:dt=2022-06-05      test:dt=2022-06-06                                                
Repair: Added partition to metastore hdfsToOzone.test:dt=2022-06-01                                                                                                                                                       
Repair: Added partition to metastore hdfsToOzone.test:dt=2022-06-02                                                                                                                                                       
Repair: Added partition to metastore hdfsToOzone.test:dt=2022-06-03                                                                                                                                                       
Repair: Added partition to metastore hdfsToOzone.test:dt=2022-06-04                                                                                                                                                       
Repair: Added partition to metastore hdfsToOzone.test:dt=2022-06-05                                                                                                                                                       
Repair: Added partition to metastore hdfsToOzone.test:dt=2022-06-06                                                                                                                                                       
Time taken: 0.641 seconds, Fetched: 7 row(s)                                                                                                                                                                              
hive>

#After the Hive metadata refresh, a SELECT shows the Alluxio Union URI mapping surfacing in the Hive table

hive> select * from hdfsToOzone.test ;
OK                                                                                                                                                                                                                        
1abc    2022-06-01                                                                                                                                                                                                        
2def    2022-06-02                                                                                                                                                                                                        
3ghi    2022-06-03                                                                                                                                                                                                        
4jkl    2022-06-04                                                                                                                                                                                                        
5mno    2022-06-05                                                                                                                                                                                                        
6pqr    2022-06-06                                                                                                                                                                                                        
Time taken: 0.156 seconds, Fetched: 6 row(s)                                                                                                                                                                              
hive>

## step 6: Configure the migration policy

alluxio fs policy add /user/hive/hdfsToOzone.db/test "ufsMigrate(olderThan(2m), UFS[hdfs]:REMOVE, UFS[ozone]:STORE)"

#Set the policy: cold data (here, data older than 2 minutes) is automatically migrated from hot storage (hdfs1) to cold storage (ozone)

[root@ip-172-31-17-3 ~]# alluxio fs policy add /user/hive/hdfsToOzone.db/test/ "ufsMigrate(olderThan(2m), UFS[hdfs]:REMOVE, UFS[ozone]:STORE)"
Policy ufsMigrate-/user/hive/hdfsToOzone.db/test is added to /user/hive/hdfsToOzone.db/test.
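The ufsMigrate policy is declarative: Alluxio's policy engine periodically scans the path and applies the configured actions to every file whose age exceeds the olderThan threshold. A minimal Python sketch of that selection step (illustrative only, not Alluxio's implementation; the partition timestamps are made up):

```python
import time

def parse_older_than(spec):
    """Parse a duration such as '2m' or '30s' into seconds."""
    units = {"s": 1, "m": 60, "h": 3600, "d": 86400}
    return int(spec[:-1]) * units[spec[-1]]

def select_cold(files, threshold, now):
    """Return paths whose modification time is older than the threshold."""
    cutoff = now - parse_older_than(threshold)
    return sorted(path for path, mtime in files.items() if mtime < cutoff)

now = time.time()
# Hypothetical partition mtimes relative to "now":
files = {
    "dt=2022-06-04": now - 300,  # 5 minutes old -> migrate to cold store
    "dt=2022-06-05": now - 150,  # 2.5 minutes old -> migrate to cold store
    "dt=2022-06-06": now - 60,   # 1 minute old -> stays in hot store
}
print(select_cold(files, "2m", now))  # ['dt=2022-06-04', 'dt=2022-06-05']
```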

#Check via the Alluxio CLI that the policy was set successfully

[root@ip-172-31-17-3 ~]# alluxio fs policy list
id: 1657707130843                                                                                                                                                                                                         
name: "ufsMigrate-/user/hive/hdfsToOzone.db/test"                                                                                                                                                                         
path: "/user/hive/hdfsToOzone.db/test"                                                                                                                                                                                    
created_at: 1657707130843                                                                                                                                                                                                 
scope: "RECURSIVE"                                                                                                                                                                                                        
condition: "olderThan(2m)"                                                                                                                                                                                                
action: "DATA(UFS[hdfs]:REMOVE, UFS[ozone]:STORE)"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   
[root@ip-172-31-17-3 ~]#   

#Once the policy takes effect, check hdfs1 and ozone respectively: all data in hdfs1 older than 2 minutes has been migrated to ozone

[root@ip-172-31-17-3 ~]# ozone fs -ls o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test                                                                                                                                      
Found 6 items                                                                                                                                                                                                             
drwxrwxrwx   - root root          0 2022-07-13 10:00 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-01                                                                                                         
drwxrwxrwx   - root root          0 2022-07-13 10:00 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-02                                                                                                         
drwxrwxrwx   - root root          0 2022-07-13 10:00 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-03                                                                                                         
drwxrwxrwx   - root root          0 2022-07-13 10:21 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-04                                                                                                         
drwxrwxrwx   - root root          0 2022-07-13 10:21 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-05                                                                                                         
drwxrwxrwx   - root root          0 2022-07-13 10:21 o3fs://b-alluxio.v-alluxio/hdfsToOzone.db/test/dt=2022-06-06                                                                                                         
[root@ip-172-31-17-3 ~]# hdfs dfs -ls hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/user/hive/hdfsToOzone.db/test                                                                                               
[root@ip-172-31-17-3 ~]# 

#With the policy in effect, querying Hive both during and after the automatic cold-data migration returns the expected results:

hive> select * from hdfsToOzone.test ;
OK                                                                                                                                                                                                                        
1abc    2022-06-01                                                                                                                                                                                                        
2def    2022-06-02                                                                                                                                                                                                        
3ghi    2022-06-03                                                                                                                                                                                                        
4jkl    2022-06-04                                                                                                                                                                                                        
5mno    2022-06-05                                                                                                                                                                                                        
6pqr    2022-06-06                                                                                                                                                                                                        
Time taken: 0.144 seconds, Fetched: 6 row(s)                                                                                                                                                                              
hive> 

4. Experiment summary

As the results show, Experiment 2 proceeds and behaves almost exactly like Experiment 1, except that the cold-data store has been switched from hdfs2 to a heterogeneous storage system, Ozone.

These experiments demonstrate how Alluxio data orchestration decouples upper-layer applications (such as a Hive-based data warehouse) from the underlying data persistence strategy (HDFS or Ozone, with or without hot/cold tiering). They also show Alluxio's generality and ease of use across heterogeneous storage systems.

We hope this article offers some inspiration on applying Alluxio to an economical data storage strategy.

Appendix

Integrating Alluxio with Hive and HDFS

Alluxio configuration (conf/alluxio-site.properties)

echo 'export ALLX_HOME=/mnt1/alluxio' >> ~/.bashrc
echo 'export PATH=$PATH:$ALLX_HOME/bin' >> ~/.bashrc

alluxio.master.hostname=ip-172-31-17-3.us-west-2.compute.internal
alluxio.underfs.address=hdfs://ip-172-31-30-130.us-west-2.compute.internal:8020/alluxio 
alluxio.worker.tieredstore.level0.dirs.path=/alluxio/ramdisk
alluxio.worker.memory.size=4G
alluxio.worker.tieredstore.levels=1
alluxio.worker.tieredstore.level0.alias=MEM
alluxio.user.file.readtype.default=CACHE
alluxio.user.file.writetype.default=ASYNC_THROUGH
alluxio.security.login.impersonation.username=_HDFS_USER_
alluxio.master.security.impersonation.yarn.groups=*
alluxio.master.security.impersonation.hive.groups=*
alluxio.user.metrics.collection.enabled=true
alluxio.user.block.size.bytes.default=64MB
######## Explore ########
alluxio.user.block.write.location.policy.class=alluxio.client.block.policy.DeterministicHashPolicy
alluxio.user.ufs.block.read.location.policy=alluxio.client.block.policy.DeterministicHashPolicy
alluxio.user.ufs.block.read.location.policy.deterministic.hash.shards=1
alluxio.user.file.persist.on.rename=true
alluxio.master.persistence.blacklist=.staging,_temporary,.tmp
alluxio.user.file.passive.cache.enabled=false 

Hive client core-site.xml

cp /hadoop_home/etc/hadoop/core-site.xml /hive_home/conf

## Copy the Alluxio client jar into the lib subdirectories of the Hadoop and Hive homes

cp /<PATH_TO_ALLUXIO>/client/alluxio-enterprise-2.8.0-1.0-client.jar /hadoop_home/share/lib
cp /<PATH_TO_ALLUXIO>/client/alluxio-enterprise-2.8.0-1.0-client.jar /hive_home/lib

## Register the Alluxio file system

vim /hive_home/conf/core-site.xml
<property>
   <name>fs.alluxio.impl</name>
   <value>alluxio.hadoop.FileSystem</value>
</property>
<property>
   <name>alluxio.master.rpc.addresses</name>
   <value>ip-172-31-17-3.us-west-2.compute.internal:19998</value>
</property> 

HDFS authorization

## Check the HDFS superuser group

vim /hadoop_home/etc/hadoop/hdfs-site.xml
<property>
   <name>dfs.permissions.superusergroup</name>
   <value>hdfsadmingroup</value>
</property>

## Add the user Alluxio runs as (root here) to the supergroup

groupadd hdfsadmingroup
usermod -a -G hdfsadmingroup root

## Sync the OS permission information to HDFS

su - hdfs -s /bin/bash -c "hdfs dfsadmin -refreshUserToGroupsMappings"

## Enable HDFS ACLs

vim /hadoop_home/etc/hadoop/hdfs-site.xml
<property>
   <name>dfs.permissions.enabled</name>
   <value>true</value>
</property>
<property>
   <name>dfs.namenode.acls.enabled</name>
   <value>true</value>
</property>
su - hdfs -s /bin/bash -c "hdfs dfs -setfacl -R -m user:root:rwx /"

Ozone deployment

Download and environment setup

wget https://dlcdn.apache.org/ozone/1.2.1/ozone-1.2.1.tar.gz
echo 'export OZONE_HOME=/mnt1/ozone-1.2.1' >> ~/.bashrc
echo 'export PATH=$PATH:$OZONE_HOME/bin:$OZONE_HOME/sbin' >> ~/.bashrc 

## Add the required settings to ozone-site.xml

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<configuration>
<property>
   <name>ozone.om.address</name>
   <value>ip-172-31-19-127.us-west-2.compute.internal:9862</value>
</property>
<property>
   <name>ozone.metadata.dirs</name>
   <value>/mnt/ozone-1.2.1/metadata/ozone</value>
</property>
<property>
   <name>ozone.scm.client.address</name>
   <value>ip-172-31-19-127.us-west-2.compute.internal:9860</value>
</property>
<property>
   <name>ozone.scm.names</name>
   <value>ip-172-31-19-127.us-west-2.compute.internal</value>
</property>
<property>
   <name>ozone.scm.datanode.id.dir</name>
   <value>/mnt/ozone-1.2.1/metadata/ozone/node</value>
</property>
<property>
   <name>ozone.om.db.dirs</name>
   <value>/mnt/ozone-1.2.1/metadata/ozone/omdb</value>
</property>
<property>
   <name>ozone.scm.db.dirs</name>
   <value>/mnt/ozone-1.2.1/metadata/ozone/scmdb</value>
</property>
<property>
   <name>hdds.datanode.dir</name>
   <value>/mnt/ozone-1.2.1/datanode/data</value>
</property>
<property>
   <name>ozone.om.ratis.enable</name>
   <value>false</value>
</property>
<property>
   <name>ozone.om.http-address</name>
   <value>ip-172-31-19-127.us-west-2.compute.internal:9874</value>
</property>
<property>
   <name>ozone.s3g.domain.name</name>
   <value>s3g.internal</value>
</property>
<property>
   <name>ozone.replication</name>
   <value>1</value>
</property>
</configuration>
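A quick way to sanity-check such a Hadoop-style configuration file is to parse it with Python's standard library. The sketch below reads a trimmed inline copy of the XML above; against the real file you would use `ET.parse(path)` instead:

```python
import xml.etree.ElementTree as ET

# A trimmed inline sample of the ozone-site.xml above, for illustration.
SAMPLE = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<configuration>
<property>
   <name>ozone.om.address</name>
   <value>ip-172-31-19-127.us-west-2.compute.internal:9862</value>
</property>
<property>
   <name>ozone.replication</name>
   <value>1</value>
</property>
</configuration>"""

def load_conf(text):
    """Turn Hadoop-style configuration XML into a name -> value dict."""
    root = ET.fromstring(text)
    return {p.findtext("name"): p.findtext("value") for p in root.findall("property")}

conf = load_conf(SAMPLE)
print(conf["ozone.om.address"])  # the endpoint o3fs clients will contact
```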

Initialization and startup (in this order)

ozone scm --init
ozone --daemon start scm
ozone om --init
ozone --daemon start om 
ozone --daemon start datanode
ozone --daemon start s3g 

Ozone basic operations

#Create a volume named v-alluxio

[root@ip-172-31-19-127 ~]# ozone sh volume create /v-alluxio                                                                                                                                                      
[root@ip-172-31-19-127 ~]#

#Create a bucket named b-alluxio under v-alluxio

[root@ip-172-31-19-127 ~]# ozone sh bucket create /v-alluxio/b-alluxio 
[root@ip-172-31-19-127 ~]#

#View the bucket's details

[root@ip-172-31-19-127 ~]# ozone sh bucket info /v-alluxio/b-alluxio                                                                                                                                                        
{                                                                                                                                                                                                                         
 "metadata" : { },                                                                                                                                                                                                       
 "volumeName" : "v-alluxio",                                                                                                                                                                                             
 "name" : "b-alluxio",                                                                                                                                                                                                   
 "storageType" : "DISK",                                                                                                                                                                                                 
 "versioning" : false,                                                                                                                                                                                                   
 "usedBytes" : 30,                                                                                                                                                                                                       
 "usedNamespace" : 6,                                                                                                                                                                                                    
 "creationTime" : "2022-07-13T09:11:37.403Z",                                                                                                                                                                            
 "modificationTime" : "2022-07-13T09:11:37.403Z",                                                                                                                                                                        
 "quotaInBytes" : -1,                                                                                                                                                                                                    
 "quotaInNamespace" : -1,                                                                                                                                                                                                
 "bucketLayout" : "LEGACY"                                                                                                                                                                                               
}                                                                                                                                                                                                                         
[root@ip-172-31-19-127 ~]#

#Create a key and put content into it

[root@ip-172-31-19-127 ~]# touch Dockerfile                                                                                                                                                                                 
[root@ip-172-31-19-127 ~]# ozone sh key put /v-alluxio/b-alluxio/Dockerfile Dockerfile                                                                                                                                      
[root@ip-172-31-19-127 ~]#

#List all keys in the bucket

[root@ip-172-31-19-127 ~]# ozone sh key list /v-alluxio/b-alluxio/
{                                                                                                                                                                                                                         
 "volumeName" : "v-alluxio",                                                                                                                                                                                             
 "bucketName" : "b-alluxio",                                                                                                                                                                                             
 "name" : "Dockerfile",                                                                                                                                                                                                  
 "dataSize" : 0,                                                                                                                                                                                                         
 "creationTime" : "2022-07-13T14:37:09.761Z",                                                                                                                                                                            
 "modificationTime" : "2022-07-13T14:37:09.801Z",                                                                                                                                                                        
 "replicationConfig" : {                                                                                                                                                                                                 
   "replicationFactor" : "ONE",                                                                                                                                                                                          
   "requiredNodes" : 1,                                                                                                                                                                                                  
   "replicationType" : "RATIS"                                                                                                                                                                                           
 },                                                                                                                                                                                                                      
 "replicationFactor" : 1,                                                                                                                                                                                                
 "replicationType" : "RATIS"                                                                                                                                                                                             
}
[root@ip-172-31-19-127 ~]#

#View the key's details

[root@ip-172-31-19-127 ~]# ozone sh key info /v-alluxio/b-alluxio/Dockerfile                                                                                                                                                
{                                                                                                                                                                                                                         
 "volumeName" : "v-alluxio",                                                                                                                                                                                             
 "bucketName" : "b-alluxio",                                                                                                                                                                                             
 "name" : "Dockerfile",                                                                                                                                                                                                  
 "dataSize" : 0,                                                                                                                                                                                                         
 "creationTime" : "2022-07-13T14:37:09.761Z",                                                                                                                                                                            
 "modificationTime" : "2022-07-13T14:37:09.801Z",                                                                                                                                                                        
 "replicationConfig" : {                                                                                                                                                                                                 
   "replicationFactor" : "ONE",                                                                                                                                                                                          
   "requiredNodes" : 1,                                                                                                                                                                                                  
   "replicationType" : "RATIS"                                                                                                                                                                                           
 },                                                                                                                                                                                                                      
 "ozoneKeyLocations" : [ ],                                                                                                                                                                                              
 "metadata" : { },                                                                                                                                                                                                       
 "replicationFactor" : 1,                                                                                                                                                                                                
 "replicationType" : "RATIS"                                                                                                                                                                                             
}                                                                                                                                                                                                                         
[root@ip-172-31-19-127 ~]#

Mounting Ozone in Alluxio

#Method 1

[root@ip-172-31-17-3 ~]# alluxio fs mount /ozone o3fs://b-alluxio.v-alluxio.ip-172-31-19-127.us-west-2.compute.internal:9862/                                                                                                                          
Mounted o3fs://b-alluxio.v-alluxio.ip-172-31-19-127.us-west-2.compute.internal:9862/ at /ozone                                                                                                                                                         
[root@ip-172-31-17-3 ~]#

#Method 2 (mount with --option)

[root@ip-172-31-17-3 ~]# alluxio fs mount \                                                                                                                                                                               
> --option alluxio.underfs.hdfs.configuration=/mnt1/ozone-1.2.1/etc/hadoop/ozone-site.xml \                                                                                                                               
> /ozone1 o3fs://b-alluxio.v-alluxio/                                                                                                                                                                                     
Mounted o3fs://b-alluxio.v-alluxio/ at /ozone1                                                                                                                                                                            
[root@ip-172-31-17-3 ~]# 
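The two mounts differ only in where the OM endpoint comes from: method 1 embeds it in the o3fs authority (`bucket.volume.om-host:port`), while method 2 omits it and supplies ozone-site.xml via `--option` instead. A small Python sketch of how such an authority decomposes (illustrative; not Ozone's actual parser):

```python
def parse_o3fs(uri):
    """Split o3fs://bucket.volume[.om-host[:port]]/ into (bucket, volume, om)."""
    authority = uri[len("o3fs://"):].rstrip("/").split("/")[0]
    parts = authority.split(".", 2)  # bucket, volume, and an optional OM endpoint
    om = parts[2] if len(parts) > 2 else None  # None -> resolved from ozone-site.xml
    return parts[0], parts[1], om

print(parse_o3fs("o3fs://b-alluxio.v-alluxio/"))
print(parse_o3fs("o3fs://b-alluxio.v-alluxio.ip-172-31-19-127.us-west-2.compute.internal:9862/"))
```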

#Verify that the Ozone mounts succeeded

[root@ip-172-31-17-3 ~]# alluxio fs ls /                                                                                                                                                                                                                                                                          
drwxrwxrwx  root           root                         0       PERSISTED 01-01-1970 00:00:00:000  DIR /ozone1                                                                                                                                                                                                                         
drwxrwxrwx  root           root                         0       PERSISTED 01-01-1970 00:00:00:000  DIR /ozone                                                                                                                                                                                                           
[root@ip-172-31-17-3 ~]#