It's time to learn the real Spark technology

2018/11/22 11:20

Spark SQL can be said to be the essence of Spark; I would put its overall complexity at more than five times that of Spark Streaming. Now that Spark officially promotes Structured Streaming, Spark Streaming is no longer actively maintained. We build our big data computing tasks on Spark, centered on the Dataset API, and migrating code originally written against RDDs pays off handsomely, especially in performance: the performance optimizations built into Spark SQL are more reliable than the so-called best practices people follow when writing raw RDD code. This is especially true for novices. For example, some best practices tell you to put filter operations before map operations; Spark SQL pushes predicates down for you automatically. It also avoids unnecessary shuffle operations: with the relevant configuration enabled, Spark SQL will automatically broadcast a small table and use a broadcast join, turning a shuffle join into a map-side join, and so on. That really saves us a lot of attention.

The code complexity of Spark SQL comes from the inherent complexity of the problem. Most of the logic of the Catalyst framework in Spark SQL is built around a Tree data structure, which is elegant to implement in Scala: Scala's partial functions and powerful case pattern matching keep the whole codebase clear. This article briefly describes some of the mechanisms and concepts in Spark SQL.
SparkSession is the entry point for writing Spark application code. Starting a spark-shell gives you a ready-made SparkSession, and this object is the starting point of the whole Spark application. Let's look at some of SparkSession's important variables and methods:
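The original code listing is not reproduced here, so below is a minimal sketch of that entry point (the app name and master are placeholders); the comments list the members this article keeps coming back to:

    import org.apache.spark.sql.SparkSession

    // Create a session; a spark-shell already gives you an equivalent `spark` value.
    val spark = SparkSession.builder()
      .appName("spark-sql-notes")   // placeholder name
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Members worth knowing about:
    //   spark.sparkContext   -- the underlying SparkContext
    //   spark.sharedState    -- state shared by all sessions (external catalog, global temp views)
    //   spark.sessionState   -- per-session state, discussed next
    //   spark.sql("...")     -- parse and execute a SQL statement, returning a DataFrame
    //   spark.read           -- DataFrameReader for loading external data sources
    //   spark.catalog        -- user-facing view of tables, temp views and functions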
 
The sessionState mentioned above is a key piece: it maintains all the state used by the current session. Roughly, it needs to maintain the following things:
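Based on the Spark 2.x source this article describes (the details vary a bit between versions), sessionState holds roughly:

  • conf: the SQLConf carrying all of the session's spark.sql.* settings
  • catalog: the SessionCatalog of databases, tables, temp views and functions
  • sqlParser: the ParserInterface that turns SQL text into an unresolved LogicalPlan
  • analyzer: resolves relations, attributes and functions on the plan
  • optimizer: the rule-based Catalyst optimizer
  • planner: the SparkPlanner that turns an optimized logical plan into physical plan candidates
  • udfRegistration and functionRegistry: user-defined and built-in functions
  • streamingQueryManager and listenerManager: Structured Streaming queries and execution listeners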
Inside Spark SQL, a DataFrame or Dataset is used to represent a data set, to which you can then apply all kinds of statistical functions and operators. Some people may not know the difference between DataFrame and Dataset; in fact, a DataFrame is simply a Dataset of Row type.
The Row type mentioned here belongs to the API level that Spark SQL exposes. A Dataset, however, does not require its elements to be Rows; it can hold strongly typed data as well. At the bottom, a Dataset processes Catalyst's internal InternalRow or UnsafeRow types, and behind the scenes an Encoder implicitly converts the data you put in into those internal InternalRows; the DataFrame case simply corresponds to a RowEncoder.
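To make that concrete, here is a small sketch (the Person type is made up for illustration, and it assumes the spark session and implicits from the sketch above):

    import org.apache.spark.sql.{DataFrame, Dataset, Encoders}

    // In Spark's own source, DataFrame is literally a type alias: type DataFrame = Dataset[Row]
    case class Person(name: String, age: Int)

    val typed: Dataset[Person] = Seq(Person("a", 12), Person("b", 10)).toDS() // strongly typed
    val untyped: DataFrame = typed.toDF()     // the same data viewed as Dataset[Row] (RowEncoder)

    Encoders.product[Person]  // the Encoder that maps Person to Catalyst's InternalRow/UnsafeRow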
Performing transformations on a Dataset generates a tree whose nodes are of LogicalPlan type. Let's take an example: suppose I have a student table and a score table, and the requirement is to compute the total score of every student aged 11 or above.
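The original code screenshot is not reproduced here, so the setup below is only a sketch: the table names (stu, score) and the columns name, v and age come from the rest of this article, while the join key (id / sid) and the sample rows are assumptions made just to have something runnable:

    case class Student(id: Long, name: String, age: Int)   // schema partly assumed
    case class Score(sid: Long, v: Double)

    val stu   = Seq(Student(0L, "xiaoming", 12), Student(1L, "xiaohong", 10)).toDS()
    val score = Seq(Score(0L, 95.5), Score(0L, 88.0), Score(1L, 70.0)).toDS()

    stu.createOrReplaceTempView("stu")
    score.createOrReplaceTempView("score")

    // Total score of every student aged 11 or above.
    val result = spark.sql(
      """SELECT name, sum(v)
        |FROM stu JOIN score ON stu.id = score.sid
        |WHERE age >= 11
        |GROUP BY name""".stripMargin)

    // result.queryExecution is what drives everything discussed below.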
  
This queryExecution is the execution engine of the whole plan, and the various intermediate plans produced along the way are kept on it. The whole execution process is as follows:
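The original figure is not reproduced here, but the same stages can be read directly off queryExecution (continuing with the result defined above):

    val qe = result.queryExecution
    qe.logical         // the parsed, still unresolved logical plan produced by the Parser
    qe.analyzed        // after the Analyzer has resolved relations, attributes and functions
    qe.optimizedPlan   // after the Catalyst optimizer rules have run
    qe.sparkPlan       // the physical plan chosen by the planner
    qe.executedPlan    // after preparation rules such as EnsureRequirements and codegen
    qe.toRdd           // the RDD[InternalRow] that is actually executed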
After the Parser has parsed it, the SQL statement in our example becomes an abstract syntax tree; the corresponding parsed logical plan AST looks like this:
[Figure: the parsed (unresolved) logical plan of the example query]
We can see that the filter condition has become a Filter node, which is of UnaryNode type, that is, it has exactly one child. The data from the two tables has become UnresolvedRelation nodes, which are of LeafNode type, that is, leaf nodes. The JOIN operation has become a Join node, a BinaryNode with two children.
The nodes mentioned above are all of LogicalPlan type and can be understood as operators: Spark SQL defines an operator for each kind of operation. The abstract syntax tree composed of these operators is the foundation of the whole Catalyst optimization; the Catalyst optimizer works on this tree, moving its nodes around to optimize the plan.
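A quick sketch of how code typically walks such a tree (a toy function, not anything from Spark itself):

    import org.apache.spark.sql.catalyst.plans.logical._

    def describe(plan: LogicalPlan): String = plan match {
      case _: LeafNode   => s"${plan.nodeName}: leaf, no children (e.g. UnresolvedRelation, LocalRelation)"
      case _: UnaryNode  => s"${plan.nodeName}: one child (e.g. Filter, Project, Aggregate)"
      case _: BinaryNode => s"${plan.nodeName}: two children (e.g. Join)"
      case other         => s"${other.nodeName}: something else"
    }

    result.queryExecution.analyzed.foreach(node => println(describe(node)))  // visit every node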
Now we have an abstract syntax tree produced by the Parser, but we don't yet know what score and sum are, so we need the Analyzer to resolve them. The Analyzer turns everything that is Unresolved on the AST into a resolved state. Spark SQL has many resolution rules, and they are easy to understand: for example, ResolveRelations resolves a table (or column) against the catalog, and ResolveFunctions resolves the basic information of a function, such as the sum function in our example. ResolveReferences may be less obvious: a name such as the one in SELECT name corresponds to a variable (of Attribute type) that comes into existence when the table is resolved, and the same variable in the Project node corresponding to the SELECT then becomes a reference to it, carrying the same ID. After ResolveReferences has run it becomes an AttributeReference type, which guarantees that they are given the same value when the data is actually loaded at the end, just like a variable we define while writing code. These rules are applied to the tree over and over until it stabilizes; since extra passes waste performance, some rules run only Once while others run to a FixedPoint. These are trade-offs. Enough talk, let's do a little experiment.
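Before the experiment, here is roughly how those Once / FixedPoint batches are wired together. This is a simplified sketch of a RuleExecutor subclass (NoopRule is made up purely to show the shape; the real Analyzer and Optimizer plug their rules in the same way):

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}

    // A do-nothing rule, only to show the shape of a rule.
    object NoopRule extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan
    }

    object MiniExecutor extends RuleExecutor[LogicalPlan] {
      override val batches = Seq(
        Batch("run until stable", FixedPoint(100), NoopRule), // re-applied until the plan stops changing
        Batch("run exactly once", Once, NoopRule)             // applied a single time
      )
    }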
We use ResolveRelations to resolve our AST. After it runs, we can see that the original UnresolvedRelation has turned into a LocalRelation, which represents a table in local memory. This table was registered in the catalog when we called createOrReplaceTempView. The resolve operation is nothing more than looking the table up in the catalog, pulling out its schema, and resolving the corresponding fields, converting each StructField defined by the user into an AttributeReference marked with an ID.
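Seen from the user-level API, the information ResolveRelations relies on looks roughly like this (the analyzer itself goes through the internal SessionCatalog; this is only an approximation):

    spark.catalog.listTables().show()        // the temp views registered via createOrReplaceTempView
    val stuSchema = spark.table("stu").schema
    // stuSchema is a StructType; during analysis each StructField(name, dataType, nullable)
    // becomes an AttributeReference carrying a fresh expression ID.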
  
Let's apply ResolveReferences as well. You will find that the same fields in the upper-level nodes have become references carrying the same IDs, and their type is AttributeReference. After all the rules have been applied, the whole AST becomes:
[Figure: the analyzed logical plan after all analyzer rules have been applied]
Now comes the key part: logical optimization. Let's look at the logical optimizations Spark SQL provides:



There are many kinds of logical optimization rules in Spark SQL. As mentioned earlier, most of the Catalyst framework's logic is built around a Tree data structure, which is elegant to implement in Scala: partial functions and powerful case pattern matching keep the whole thing clear, and the optimizer rules are written in exactly that style. Enough talk, let's do a small experiment.
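Here is a toy rule in the spirit of ConstantFolding, just to show that pattern-matching style; it is a simplified sketch, not Spark's real rule:

    import org.apache.spark.sql.catalyst.expressions.Literal
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Any foldable expression -- one whose inputs are all constants -- is evaluated right now,
    // at planning time, and replaced by a single Literal; that is how (100 + 10) collapses to 110.
    object FoldConstants extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
        case l: Literal      => l                                     // already a constant, keep it
        case e if e.foldable => Literal.create(e.eval(), e.dataType)  // evaluate and replace
      }
    }

    // It can be injected into the real optimizer, or applied to a plan by hand:
    //   spark.experimental.extraOptimizations = Seq(FoldConstants)
    //   FoldConstants(result.queryExecution.analyzed)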
  
See? The (100 + 10) has been folded into 110.
  
PushPredicateThroughJoin pushes a filter that only touches the stu table down below the join, so far less data gets loaded and performance improves. Let's take a look at the final result.
At least the ColumnPruning, PushPredicateThroughJoin, ConstantFolding and RemoveRedundantAliases optimization rules have been applied by now, and my little tree has changed shape accordingly.
When logical optimization finishes, we still only have an abstract logical plan. It has to be converted into a physical execution plan, turning a logically feasible plan into one that Spark can actually execute:
  
Spark SQL converts each logical node into a corresponding physical node, for example the Join operator. Spark has devised different strategies for this operator for different scenarios, including BroadcastHashJoin, ShuffleHashJoin and SortMergeJoin. There are many optimization points here: during the conversion Spark makes an intelligent choice based on statistics, which involves cost-based optimization, a big topic of its own that deserves a separate article. In our example, because the amount of data is under 10 MB, the join is automatically converted into a BroadcastHashJoin. Sharp-eyed readers will have noticed that some extra nodes appear, so let me explain. The BroadcastExchange node inherits from the Exchange class and is used to exchange data between nodes; here it broadcasts the output of LocalTableScan to every executor node so the join can be done as a map-side join. The final Aggregate operation is split into two steps: first a parallel partial aggregation, then a final aggregation over the partial results, much like the combine followed by the final reduce in MapReduce. An Exchange hashpartitioning is inserted in between to make sure rows with the same key are shuffled into the same partition; a shuffle is required whenever the output distribution of a child in the current physical plan does not satisfy the requirements, and this is the exchange node inserted in the final EnsureRequirements phase. In the database field there is a saying that "whoever masters the join wins the world", so let's focus on the choices Spark SQL makes for the join operation.
A join basically splits the two tables into a large table and a small table. The large table is used as the streaming table to iterate over, and the small table is used as the lookup table; for each record in the large table, the records with the same key are fetched from the lookup table. Spark supports all the usual join types, and the join operation in Spark SQL picks a different join strategy depending on various conditions, choosing among BroadcastHashJoin, SortMergeJoin and ShuffleHashJoin.

  • BroadcastHashJoin: if Spark judges that the storage size of one table is below the broadcast threshold (controlled by spark.sql.autoBroadcastJoinThreshold, 10 MB by default), it broadcasts that small table to every executor and puts it into a hash table as the lookup table; the join can then be completed with a map-side operation, avoiding the shuffle and its large performance cost. Note, however, that BroadcastHashJoin does not support full outer joins; for a right outer join the left table is broadcast, for left outer, left semi and left anti joins the right table is broadcast, and for an inner join whichever table is smaller is broadcast. (See the sketch after this list for the knobs involved.)
 

  • SortMergeJoin: if both tables are large, SortMergeJoin is the better fit. It uses a shuffle to bring the records with the same key into the same partition, then sorts both sides and merges them; the cost of sorting and merging is acceptable.
 

  • ShuffleHashJoin: here, during the shuffle, the lookup side is put into a hash table instead of being sorted. When is ShuffleHashJoin chosen? The small table as a whole must be too big to broadcast (otherwise BroadcastHashJoin would be used), but the average size of each of its partitions must not exceed spark.sql.autoBroadcastJoinThreshold, which guarantees the per-partition lookup table fits in memory without OOM. There is one more condition: the large table must be more than three times the size of the small table; only then does this kind of join pay off.
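A small usage sketch of the knobs involved (the column names are the assumed ones from the example setup earlier):

    import org.apache.spark.sql.functions.broadcast

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024) // default 10MB; -1 disables broadcasting
    val joined = score.join(broadcast(stu), stu("id") === score("sid"))       // hint: force-broadcast the small side
    joined.explain()                                                          // shows which physical join was picked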
 
As mentioned above, the nodes of the AST have now been converted into physical nodes. Starting from the head node, these physical nodes recursively call the execute method, each applying a transform to the RDD produced by its child, building up a chain of RDDs, much like the recursive calls over DStreams in Spark Streaming. The finally executed plan is shown below:
[Figure: the final execution, split into two stages]
It can be seen that the final execution is split into two stages. The small table is broadcast and joined against the large table with a BroadcastHashJoin, without involving a shuffle; then, in the final aggregation step, the HashAggregate sum function is first executed on the map side, after which an Exchange shuffles the rows with the same key (the name) into the same partition, and the final HashAggregate sum operation is performed. And then there is this strange WholeStageCodegen. Why does it exist? When we execute operators such as Filter and Project, those operators contain many expressions. Take SELECT sum(v), name: both sum and v are expressions, v being an attribute-variable expression, and an expression is itself a tree data structure; sum(v) is a tree with a sum node whose child is v. These expressions can be evaluated and can generate code. The most basic function of an expression is to evaluate the input Row; to do that, an Expression must implement the def eval(input: InternalRow = null): Any method.
An Expression processes a Row, and its output can be of any type; a Plan such as Project or Filter, however, outputs def output: Seq[Attribute], which represents a set of variables. Take the Filter (age >= 11) plan in our example: age >= 11 is an expression, and this comparison expression depends on two children, a Literal constant expression that evaluates to 11 and the attribute-variable expression age. The variable is converted into an AttributeReference during the analysis phase, but it is Unevaluable. To obtain the attribute's value from the input Row, the variable's index within a row of data has to be bound according to the schema, producing a BoundReference; at eval time the BoundReference can then fetch the value from the Row by that index. The final output type of the expression age >= 11 is Boolean, while the output of the Filter Plan is a Seq[Attribute].
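To make that concrete, here is a hand-built version of that predicate using Catalyst's expression classes directly (a sketch; normally the analyzer and the binding step do this for you):

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.{BoundReference, GreaterThanOrEqual, Literal}
    import org.apache.spark.sql.types.IntegerType

    // "age" bound to column 0 of the input row, then the predicate age >= 11.
    val ageRef = BoundReference(0, IntegerType, nullable = false)
    val pred   = GreaterThanOrEqual(ageRef, Literal(11))

    pred.eval(InternalRow(12))   // true
    pred.eval(InternalRow(9))    // false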
You can imagine the data flowing through one Plan after another, with the expressions inside each Plan processing it; that amounts to a huge number of calls to small functions, with all the function-call overhead that implies. So can we inline these small functions into one big function? That is exactly what WholeStageCodegen does.
[Figure: the final plan with * marking the operators covered by whole-stage code generation]
You can see that each node of the final execution plan has an * in front of it, indicating that whole-stage code generation is enabled. In our example, Filter, Project, BroadcastHashJoin, Project and HashAggregate all take part in whole-stage code generation and are cascaded into two big functions. If you are interested, you can call queryExecution.debug.codegen on the resulting Dataset to see what the generated code looks like. The Exchange operator, however, does not take part in whole-stage code generation, because it has to send data over the network.
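For example, with the result Dataset from the sketches above:

    result.explain()                        // the physical plan; codegen'd operators carry the * marker
    result.queryExecution.debug.codegen()   // dumps the Java source generated for each codegen stage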




My sharing today ends here. There are actually many more interesting things in Spark SQL, but the inherent complexity of the problem demands a high level of abstraction to keep everything straight, and that makes life harder for readers of the code. If you really dig into it, though, you will gain a lot. If you have any opinion about this article, please leave a message at the end and share your thoughts.