Principles and Techniques of Distributed Databases (work in progress)

Author: Jiale
Published: November 21, 2025
Category: Computer Science Fundamentals

[Image: 分布式数据库原理及技术.png]

Lesson 1

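Data middle platform

Data lakehouse

Data masking

An early prototype of a distributed database (ICBC Bank)

Blockchain

TCP three-way handshake and four-way termination

Federated learning / deep learning

......

These are background topics; a general familiarity is enough.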

Lesson 2

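Q1: Does the traditional three-level schema / two-level mapping architecture change in a distributed database?

A: It is retained, but with additions (the global schema and the fragmentation schema).

Q2: Which metrics are used to compare a distributed database with a traditional RDBMS in terms of storage and retrieval efficiency?

A: Storage space used and retrieval efficiency.

Q3: How does HDFS (GFS) provide a shared service at the storage layer?

Q4: The three Hive deployment modes and the environments each one suits.

[Figure: 三级模式两级映像.png (three-level schema and two-level mapping)]

Structured data: MySQL, Oracle, ...

Semi-structured data: JSON, XML

Unstructured data: video and the like (=> YOLO => JSON)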

Lesson 3

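Q1: Hive's three working modes (embedded Derby, remote, local)

Q2: How Hive data types relate to databases and tables at creation time: storage location, and uploading data to and downloading data from a database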

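Turn off the firewall

Change to the ZooKeeper installation directory

Start ZooKeeper (bin/zkServer.sh start)

Check its status (bin/zkServer.sh status)

On master, change to the Hadoop directory and run sbin/start-all.sh

MySQL is installed on slave2, and Hive runs in local mode

On slave2:

cd $HIVE_HOME

hive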

Shutdown mirrors the startup sequence:

In the Hive session on the slave (slave2), type exit

On master: sbin/stop-all.sh

On every machine: cd $ZOOKEEPER_HOME

bin/zkServer.sh stop

Shut down all machines

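Shared database: bigdata2024

Local data directory: /data

Hive (with a metastore other than Derby) manages stored data as directories.

Key point 1: Is Hive's data stored locally or on HDFS? => On HDFS, under /warehousedir/home. (Strictly speaking this is the warehouse directory that holds table data; the metastore itself holds only metadata, presumably in the MySQL instance on slave2 in this setup.)

Key point 2: Hive configuration (temporary: enter set commands at the Hive console; permanent: edit hive-site.xml)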
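As a small illustration of the temporary route, the sketch below uses the set command at the Hive console. It assumes an interactive Hive CLI session, and hive.cli.print.header is used purely as an example property. A value set this way lasts only until the session ends, whereas an entry in hive-site.xml applies to every new session.

```sql
-- Print the current value of a property (shown as name=value).
set hive.cli.print.header;

-- Override it for the current session only; the change is lost on exit.
set hive.cli.print.header=true;
```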

desc <table name>; => shows the column names

desc formatted <table name>; => shows the detailed table parameters

load data local inpath '/data/st1.txt' overwrite into table student;

Lesson 4

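Review 1: Where does create table (within a database) store its data by default?

A: On HDFS: /warehousedir/home/xxxx.db/<table name> (called Directory below)

Ways to import data into a table:

[1] insert statement ==> Directory/000000_0

[2] hdfs dfs -put <local file> Directory

[3] load data local inpath '<local file>' into table xxx; ==> Directory/<file name>

Q1: When using create table xxx location, what requirements does the location path have to meet?

A: The location should follow the default directory layout and be specified down to the table-name level. If the statement contains other clauses (such as row format or stored as), location must come after them, on the last line.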

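Q2: How should data be loaded so that the table parses it correctly, especially for the three complex data types (Map, Array, Struct)?

create table stu_test(sid int, sname string, sage int, addr string) location '/testdb';

With this statement there is no xxx.db/<table name> directory, because the location was specified by hand. This is why the location should follow the default directory layout.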
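A minimal sketch of a location that mirrors the default layout. It assumes the warehouse root /warehousedir/home and the database bigdata2024 mentioned earlier in these notes; the table name loc_demo is hypothetical.

```sql
-- Keep the custom location in the same shape as the default one:
-- <warehouse root>/<database>.db/<table name>
create table loc_demo(sid int, sname string)
row format delimited
fields terminated by ','
location '/warehousedir/home/bigdata2024.db/loc_demo';

-- The Location row of the output shows where the table's files live.
desc formatted loc_demo;
```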

create table stu_test(sid int, sname string, sage int, addr string) row format delimited fields terminated by ',' lines terminated by '\n' stored as textfile;

load data local inpath '/data/student100.txt' overwrite into table stu_test;

create table stu_test(sid string, sname string, sage int, addr string); (variant in which the first field is declared as string)

Complex data types

Array format
1. Create the table:
create table student1(sid int, sname string, grade array<float>)
row format delimited
fields terminated by ','
collection items terminated by '#'
lines terminated by '\n';

-- Query the second grade (index 1) of the student with sid = 1
select grade[1] as second_grade
from student1
where sid = 1;

2. Load the data:
load data local inpath '/opt/data/student1.txt' into table student1;
student1.txt is organized as follows:
1,zhangsan,80#90.5#35
2,lisi,90# #88
3,wangwu,87#87
4,mary, #60#
5,tom,60# #
6,jemy,78#

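sid	sname	grade (array)	Notes
1	zhangsan	[80.0, 90.5, 35.0]	Three valid floats
2	lisi	[90.0, NULL, 88.0]	The blank between the two `#` is parsed as NULL
3	wangwu	[87.0, 87.0]	Two valid floats
4	mary	[NULL, 60.0, NULL]	The blank before the first `#` and the empty element after the last `#` are parsed as NULL
5	tom	[60.0, NULL, NULL]	Both elements after the first `#` are blank or empty
6	jemy	[78.0, NULL]	The empty element after the trailing `#` is parsed as NULL, as in rows 4 and 5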
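Once the array column is loaded, it can be inspected with Hive's built-in collection functions. The following is a short sketch against the student1 table defined above; the aliases n_grades, t and g are illustrative names only.

```sql
-- Number of elements in each student's grade array (NULL elements count too).
select sid, sname, size(grade) as n_grades
from student1;

-- Flatten the array: one output row per (student, grade) pair.
select sid, sname, g
from student1
lateral view explode(grade) t as g;
```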

Map format
1. create table student2(sid int, sname string, grade map<string,float>)
       row format delimited
       fields terminated by ','
       collection items terminated by '#'
       map keys terminated by ':'
       lines terminated by '\n';
2. load data local inpath '/opt/data/student2.txt' into table student2;
student2.txt is organized as follows (the keys are subject names: 语文 = Chinese, 数学 = math, 英语 = English):
1,zhangsan,语文:80#数学:90.5#英语:35
2,lisi,语文:90#数学:95#英语:88
3,wangwu,语文:87#数学:87#英语:56
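To check that the map column was parsed as intended, a query along these lines can help. It is a sketch against the student2 table above, using the key 数学 from the sample data; the column aliases are illustrative.

```sql
select sid,
       sname,
       grade['数学']     as math_score,  -- value stored under the key 数学
       map_keys(grade)   as subjects,    -- array of all keys
       map_values(grade) as scores,      -- array of all values
       size(grade)       as n_subjects   -- number of key/value pairs
from student2;
```

If the delimiters in the table definition do not match the file, the usual symptom is a map with a single entry or NULL values, which shows up directly in n_subjects and math_score.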

Lesson 5

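Review: how fields of the three complex types are organized (complete the comprehensive exercise)

set hive.cli.print.current.db=true;

set hive.cli.print.header=true;

create table student3_1_copy as select * from student3_1; (copies the table together with its data: a "deep copy")

create table student3_2_copy like student3_1; (copies only the table structure: a "shallow copy")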
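The difference between the two copies is easy to confirm by counting rows. This sketch assumes the three tables named in the statements above already exist.

```sql
-- CTAS copies the rows, so the counts of source and copy match.
select count(*) from student3_1;
select count(*) from student3_1_copy;

-- LIKE copies only the definition, so the new table starts empty.
select count(*) from student3_2_copy;
```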

Array nested with map

create table student3(sid int, sname string, grade array<map<string,float>>)
row format delimited
fields terminated by ','
collection items terminated by '#'
map keys terminated by ':'
lines terminated by '\n';

This does not work as intended: row format delimited offers only three delimiter levels (fields, collection items, map keys), which is not enough to describe a map nested inside an array.

A JSON SerDe can be used instead:

(1) create table student3(sid int, sname string, grade array<map<string,float>>)
row format serde 'org.apache.hive.hcatalog.data.JsonSerDe';

(2) load data local inpath '/opt/data/student3.txt' into table student3;
student3.txt is organized as follows (one JSON object per line; grade is a JSON array of objects):
{"sid":1,"sname":"zhangsan","grade":[{"语文":80},{"数学":90.5},{"英语":35}]}
{"sid":2,"sname":"lisi","grade":[{"语文":80},{"数学":90.5},{"英语":35}]}
{"sid":3,"sname":"wangwu","grade":[{"语文":80},{"数学":90.5},{"英语":35}]}
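With the JsonSerDe version of student3, nested values are reached by indexing the array first and then looking up a map key. The sketch below assumes student3 is the array<map<string,float>> table defined above; depending on the installation, the hive-hcatalog-core jar may need to be on the classpath for this SerDe. The aliases first_entry, n_entries, t and m are illustrative.

```sql
-- grade is an array of maps: index into the array, or count its elements.
select sid,
       sname,
       grade[0]    as first_entry,
       size(grade) as n_entries
from student3;

-- Flatten to one row per map, then read a key from each map.
-- Rows whose map lacks the key 数学 show NULL for math_score.
select sid, sname, m['数学'] as math_score
from student3
lateral view explode(grade) t as m;
```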

create table student3(sid int, sname string, grade map<string,float>)
row format delimited
fields terminated by ','
collection items terminated by '#'
map keys terminated by ':'
lines terminated by '\n';

-- Query the math score of the student with sid = 101
select grade['数学'] as math_score
from student3
where sid = 101;

Sample data:
101,张三,数学:90.5#语文:88.0#英语:92.3
102,李四,数学:85.0#语文:90.2#英语:89.7

If collection items terminated by '#' is removed, each row's map ends up with only one entry, because '#' is no longer recognized as the separator between entries.

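create table student3(sid int, sname string, grade map<string,array<float>>)
row format delimited
fields terminated by ','
collection items terminated by '#'
map keys terminated by ':'
lines terminated by '\n';

This does not work as intended either, for the same reason: the three delimiter levels cannot describe an array nested inside a map.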

Struct
(1)
create table student4(sid int, info struct<name:string,age:int,sex:string>)
row format delimited
fields terminated by ','
collection items terminated by '#'
lines terminated by '\n';
(2)
load data local inpath '/opt/data/student4.txt' into table student4;
student4.txt is organized as follows:
1,zhangsan#19#Female
2,lisi#20#male
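Struct members are read with dot notation. A brief sketch against the student4 table above:

```sql
-- Access individual struct members as info.<member name>.
-- String comparison is case-sensitive, so 'male' does not match 'Female'.
select sid,
       info.name as name,
       info.age  as age
from student4
where info.sex = 'male';
```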

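Summary:

Array, Map and Struct cannot be nested inside one another using only the delimiters of row format delimited shown here; nesting requires a SerDe such as JsonSerDe.

Array elements and Struct members are both separated by the collection items delimiter.

For a Map to hold several key/value pairs, it implicitly relies on the collection items delimiter as well (between pairs), in addition to the map keys delimiter (between a key and its value).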

Lesson 6

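Finding a table's location

• desc formatted tableName;

• show create table test_table;

create external table student_cp(sid int, sname string) row format delimited fields terminated by ',' location "/tmpdb/student_cp";

For an external table, dropping the table leaves its location in place; the files are not deleted with it. The reason is that only the reference is cut while the storage location remains, so if a table with the same name (and the same structure) is created again later, it reads the data that is still there.

Managed table vs. external table: the only difference is how the storage location is handled when the table is dropped.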
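The drop-and-recreate behavior described above can be tried out as follows. This is a sketch that reuses the student_cp definition from this lesson and assumes some data files already sit under /tmpdb/student_cp.

```sql
-- The Table Type row of the output reads EXTERNAL_TABLE for an external table.
desc formatted student_cp;

-- Dropping an external table removes only its metadata.
drop table student_cp;

-- Recreating it over the same location makes the old files visible again.
create external table student_cp(sid int, sname string)
row format delimited
fields terminated by ','
location '/tmpdb/student_cp';

select * from student_cp;
```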

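load data inpath '<HDFS path>' into table <table name>;

=> After loading, the file no longer exists at the source HDFS path (it is moved into the table's directory).

load data local inpath

=> The local file is not deleted (it is copied).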

Hands-on case: importing data into a partitioned table

1. Creating the tables

Regular table

create table ah16(sname string, sex int, minzu string, sge int, college string,
major string, height int, weight int, breathlen float, breathweight int, physcore int,
bloodtype string)
row format delimited
fields terminated by ','
lines terminated by '\n'
stored as textfile;

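Managing the structure of partitioned tables

Can a partition be added to a table that has no partitions? Two cases arise, depending on whether the partition column is an existing column or a new one:

alter table student00 add partition (gender string);

This does not work.

create table student00(sid int, sname string) partitioned by(data string);

At this point, data cannot be inserted with load data.

alter table student00 add partition (data = '20251011');

Can partitions be added to or dropped from a table that is already partitioned? Is the added partition column an existing one or a new one?

alter table part_table add partition (prov string);

alter table part_table add partition (gender='new');

alter table part_table drop partition (gender='new');

Note that add partition expects values for the table's declared partition columns (col='value'), not a new column definition such as prov string.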

Lesson 7

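Review:

table
external table
(difference ===> the data is managed differently)

partition table (what kind of management do partitions provide?)

Q1: How are partitioned tables (static and dynamic partitioning) created, and how is data imported into them?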

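create table part_table(sid int, sname string, sage int) partitioned by(addr string) row format delimited fields terminated by ',';

load data local inpath '/home/student100' into table part_table partition(addr='guangxi'); (the value is a string, and addr must be written out explicitly; this is strictly enforced)

How does the value guangxi get written in, and is the original data changed?

==> The original data is not changed (the partition value lives in the directory name addr=guangxi, not in the data file).

If the same file is loaded again, the storage location addr=xxx stays the same, but an extra file student100_copy_1 appears in it.

alter table part_table drop partition(addr='guangxi'); (or add partition(addr='guangxi');)

With drop, everything under addr=guangxi is deleted; with add, the partition is created.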
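The effect of adding and dropping partitions can be observed with show partitions. This sketch assumes part_table as created above, with the addr='guangxi' partition already loaded; addr='jiangsu' is just an example value.

```sql
-- Register an empty partition; this creates the directory .../part_table/addr=jiangsu.
alter table part_table add if not exists partition (addr='jiangsu');

-- List the partitions known to the metastore, one per line (for example addr=guangxi).
show partitions part_table;

-- The partition column can be queried like an ordinary column.
select sid, sname, addr
from part_table
where addr = 'guangxi';

-- Remove the partition again; for a managed table its files are deleted as well.
alter table part_table drop if exists partition (addr='jiangsu');
```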

show create table part_table;

This cannot be used for creating a new table.

How are two partition columns handled?

create table part_table2(sid int, sname string) partitioned by(sage int, addr string) row format delimited fields terminated by ',';

load data local inpath '/home/student100' into table part_table2 partition(sage=20,addr='guangxi');

alter table part_table2 add partition(sage=20,addr='jiangsu'); (every partition column must be given an explicit value)

In MySQL, alter operates on the schema; here, followed by add partition, it is essentially an operation on the data.
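With two partition columns, each partition corresponds to a two-level directory path. A short sketch against part_table2 as defined above:

```sql
-- Each partition appears as one path, for example sage=20/addr=guangxi.
show partitions part_table2;

-- Filtering on both partition columns lets Hive read only the matching directory.
select sid, sname
from part_table2
where sage = 20 and addr = 'guangxi';
```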

Can a non-partitioned table be turned into a partitioned table?

No! A new partitioned table has to be created and filled instead:

create table student3(sid int, sname string, grade map<string,float>)
partitioned by(addr string)
row format delimited
fields terminated by ',';

insert into table student3
partition (addr="anhui")
select * from ........
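Q1 of this lesson also mentions dynamic partitioning, where Hive derives the partition value from the data instead of taking it from the statement. The following is a minimal sketch under stated assumptions: the source is the non-partitioned stu_test table from Lesson 4 with columns sid, sname, sage, addr, and the target table name stu_test_part is hypothetical.

```sql
-- Hypothetical target table, partitioned by addr.
create table stu_test_part(sid int, sname string, sage int)
partitioned by (addr string)
row format delimited
fields terminated by ',';

-- Allow every partition column to be determined dynamically.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- No value is given for addr: Hive takes it from the last column of the select
-- and creates one partition per distinct value.
insert into table stu_test_part partition (addr)
select sid, sname, sage, addr
from stu_test;
```

The partition column has to come last in the select list; with a static partition such as partition (addr="anhui"), it is left out of the select list instead.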

Last modified: November 21, 2025

