It's time to learn the real Spark technology

2018/11/22 11:20

Spark SQL can be said to be the essence of Spark; I would put its overall complexity at more than five times that of Spark Streaming. Now that Spark officially promotes Structured Streaming, Spark Streaming is no longer actively maintained. We build our big data computing tasks on Spark, centered on the Dataset API, and migrating code originally written against RDDs pays off handsomely, especially in performance: the performance optimizations built into Spark SQL are more reliable than the so-called best practices people follow when writing raw RDD code. This is especially true for novices. For example, some best practices tell you to put filter operations before map operations; Spark SQL pushes predicates down for you automatically. It also avoids unnecessary shuffle operations: with the relevant configuration enabled, Spark SQL will automatically broadcast a small table and use a broadcast join, turning a shuffle join into a map-side join, and so on. That really saves us a lot of attention.

The code complexity of Spark SQL comes from the inherent complexity of the problem. Most of the logic of the Catalyst framework in Spark SQL is built around a Tree data structure, which is elegant to implement in Scala: Scala's partial functions and powerful case pattern matching keep the whole codebase clear. This article briefly describes some of the mechanisms and concepts in Spark SQL.
SparkSession is the entry point for writing Spark application code. Starting a spark-shell gives you a ready-made SparkSession, and this object is the starting point of the whole Spark application. Let's look at some of SparkSession's important variables and methods:
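The original code listing is not reproduced here, so below is a minimal sketch of that entry point (the app name and master are placeholders); the comments list the members this article keeps coming back to:

    import org.apache.spark.sql.SparkSession

    // Create a session; a spark-shell already gives you an equivalent `spark` value.
    val spark = SparkSession.builder()
      .appName("spark-sql-notes")   // placeholder name
      .master("local[*]")
      .getOrCreate()

    import spark.implicits._

    // Members worth knowing about:
    //   spark.sparkContext   -- the underlying SparkContext
    //   spark.sharedState    -- state shared by all sessions (external catalog, global temp views)
    //   spark.sessionState   -- per-session state, discussed next
    //   spark.sql("...")     -- parse and execute a SQL statement, returning a DataFrame
    //   spark.read           -- DataFrameReader for loading external data sources
    //   spark.catalog        -- user-facing view of tables, temp views and functions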
 
The sessionState mentioned above is a key piece: it maintains all the state used by the current session. Roughly, it needs to maintain the following things:
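Based on the Spark 2.x source this article describes (the details vary a bit between versions), sessionState holds roughly:

  • conf: the SQLConf carrying all of the session's spark.sql.* settings
  • catalog: the SessionCatalog of databases, tables, temp views and functions
  • sqlParser: the ParserInterface that turns SQL text into an unresolved LogicalPlan
  • analyzer: resolves relations, attributes and functions on the plan
  • optimizer: the rule-based Catalyst optimizer
  • planner: the SparkPlanner that turns an optimized logical plan into physical plan candidates
  • udfRegistration and functionRegistry: user-defined and built-in functions
  • streamingQueryManager and listenerManager: Structured Streaming queries and execution listeners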
Inside Spark SQL, a DataFrame or Dataset is used to represent a data set, to which you can then apply all kinds of statistical functions and operators. Some people may not know the difference between DataFrame and Dataset; in fact, a DataFrame is simply a Dataset of Row type.
The Row type mentioned here belongs to the API level that Spark SQL exposes. A Dataset, however, does not require its elements to be Rows; it can hold strongly typed data as well. At the bottom, a Dataset processes Catalyst's internal InternalRow or UnsafeRow types, and behind the scenes an Encoder implicitly converts the data you put in into those internal InternalRows; the DataFrame case simply corresponds to a RowEncoder.
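To make that concrete, here is a small sketch (the Person type is made up for illustration, and it assumes the spark session and implicits from the sketch above):

    import org.apache.spark.sql.{DataFrame, Dataset, Encoders}

    // In Spark's own source, DataFrame is literally a type alias: type DataFrame = Dataset[Row]
    case class Person(name: String, age: Int)

    val typed: Dataset[Person] = Seq(Person("a", 12), Person("b", 10)).toDS() // strongly typed
    val untyped: DataFrame = typed.toDF()     // the same data viewed as Dataset[Row] (RowEncoder)

    Encoders.product[Person]  // the Encoder that maps Person to Catalyst's InternalRow/UnsafeRow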
Performing transformations on a Dataset generates a tree whose nodes are of LogicalPlan type. Let's take an example: suppose I have a student table and a score table, and the requirement is to compute the total score of every student aged 11 or above.
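The original code screenshot is not reproduced here, so the setup below is only a sketch: the table names (stu, score) and the columns name, v and age come from the rest of this article, while the join key (id / sid) and the sample rows are assumptions made just to have something runnable:

    case class Student(id: Long, name: String, age: Int)   // schema partly assumed
    case class Score(sid: Long, v: Double)

    val stu   = Seq(Student(0L, "xiaoming", 12), Student(1L, "xiaohong", 10)).toDS()
    val score = Seq(Score(0L, 95.5), Score(0L, 88.0), Score(1L, 70.0)).toDS()

    stu.createOrReplaceTempView("stu")
    score.createOrReplaceTempView("score")

    // Total score of every student aged 11 or above.
    val result = spark.sql(
      """SELECT name, sum(v)
        |FROM stu JOIN score ON stu.id = score.sid
        |WHERE age >= 11
        |GROUP BY name""".stripMargin)

    // result.queryExecution is what drives everything discussed below.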
  
This queryExecution is the execution engine of the whole plan, and the various intermediate plans produced along the way are kept on it. The whole execution process is as follows:
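The original figure is not reproduced here, but the same stages can be read directly off queryExecution (continuing with the result defined above):

    val qe = result.queryExecution
    qe.logical         // the parsed, still unresolved logical plan produced by the Parser
    qe.analyzed        // after the Analyzer has resolved relations, attributes and functions
    qe.optimizedPlan   // after the Catalyst optimizer rules have run
    qe.sparkPlan       // the physical plan chosen by the planner
    qe.executedPlan    // after preparation rules such as EnsureRequirements and codegen
    qe.toRdd           // the RDD[InternalRow] that is actually executed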
After the Parser has parsed it, the SQL statement in our example becomes an abstract syntax tree; the corresponding parsed logical plan AST looks like this:
[Figure: the parsed (unresolved) logical plan of the example query]
We can see that the filter condition has become a Filter node, which is of UnaryNode type, that is, it has exactly one child. The data from the two tables has become UnresolvedRelation nodes, which are of LeafNode type, that is, leaf nodes. The JOIN operation has become a Join node, a BinaryNode with two children.
The nodes mentioned above are all of LogicalPlan type and can be understood as operators: Spark SQL defines an operator for each kind of operation. The abstract syntax tree composed of these operators is the foundation of the whole Catalyst optimization; the Catalyst optimizer works on this tree, moving its nodes around to optimize the plan.
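A quick sketch of how code typically walks such a tree (a toy function, not anything from Spark itself):

    import org.apache.spark.sql.catalyst.plans.logical._

    def describe(plan: LogicalPlan): String = plan match {
      case _: LeafNode   => s"${plan.nodeName}: leaf, no children (e.g. UnresolvedRelation, LocalRelation)"
      case _: UnaryNode  => s"${plan.nodeName}: one child (e.g. Filter, Project, Aggregate)"
      case _: BinaryNode => s"${plan.nodeName}: two children (e.g. Join)"
      case other         => s"${other.nodeName}: something else"
    }

    result.queryExecution.analyzed.foreach(node => println(describe(node)))  // visit every node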
Now we have an abstract syntax tree produced by the Parser, but we don't yet know what score and sum are, so we need the Analyzer to resolve them. The Analyzer turns everything that is Unresolved on the AST into a resolved state. Spark SQL has many resolution rules, and they are easy to understand: for example, ResolveRelations resolves a table (or column) against the catalog, and ResolveFunctions resolves the basic information of a function, such as the sum function in our example. ResolveReferences may be less obvious: a name such as the one in SELECT name corresponds to a variable (of Attribute type) that comes into existence when the table is resolved, and the same variable in the Project node corresponding to the SELECT then becomes a reference to it, carrying the same ID. After ResolveReferences has run it becomes an AttributeReference type, which guarantees that they are given the same value when the data is actually loaded at the end, just like a variable we define while writing code. These rules are applied to the tree over and over until it stabilizes; since extra passes waste performance, some rules run only Once while others run to a FixedPoint. These are trade-offs. Enough talk, let's do a little experiment.
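Before the experiment, here is roughly how those Once / FixedPoint batches are wired together. This is a simplified sketch of a RuleExecutor subclass (NoopRule is made up purely to show the shape; the real Analyzer and Optimizer plug their rules in the same way):

    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.{Rule, RuleExecutor}

    // A do-nothing rule, only to show the shape of a rule.
    object NoopRule extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan
    }

    object MiniExecutor extends RuleExecutor[LogicalPlan] {
      override val batches = Seq(
        Batch("run until stable", FixedPoint(100), NoopRule), // re-applied until the plan stops changing
        Batch("run exactly once", Once, NoopRule)             // applied a single time
      )
    }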
We use ResolveRelations to resolve our AST. After it runs, we can see that the original UnresolvedRelation has turned into a LocalRelation, which represents a table in local memory. This table was registered in the catalog when we called createOrReplaceTempView. The resolve operation is nothing more than looking the table up in the catalog, pulling out its schema, and resolving the corresponding fields, converting each StructField defined by the user into an AttributeReference marked with an ID.
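Seen from the user-level API, the information ResolveRelations relies on looks roughly like this (the analyzer itself goes through the internal SessionCatalog; this is only an approximation):

    spark.catalog.listTables().show()        // the temp views registered via createOrReplaceTempView
    val stuSchema = spark.table("stu").schema
    // stuSchema is a StructType; during analysis each StructField(name, dataType, nullable)
    // becomes an AttributeReference carrying a fresh expression ID.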
  
Let's apply ResolveReferences as well. You will find that the same fields in the upper-level nodes have become references carrying the same IDs, and their type is AttributeReference. After all the rules have been applied, the whole AST becomes:
[Figure: the analyzed logical plan after all analyzer rules have been applied]
Now comes the key part: logical optimization. Let's look at the logical optimizations Spark SQL provides:



There are many kinds of logical optimization rules in Spark SQL. As mentioned earlier, most of the Catalyst framework's logic is built around a Tree data structure, which is elegant to implement in Scala: partial functions and powerful case pattern matching keep the whole thing clear, and the optimizer rules are written in exactly that style. Enough talk, let's do a small experiment.
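Here is a toy rule in the spirit of ConstantFolding, just to show that pattern-matching style; it is a simplified sketch, not Spark's real rule:

    import org.apache.spark.sql.catalyst.expressions.Literal
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    // Any foldable expression -- one whose inputs are all constants -- is evaluated right now,
    // at planning time, and replaced by a single Literal; that is how (100 + 10) collapses to 110.
    object FoldConstants extends Rule[LogicalPlan] {
      def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
        case l: Literal      => l                                     // already a constant, keep it
        case e if e.foldable => Literal.create(e.eval(), e.dataType)  // evaluate and replace
      }
    }

    // It can be injected into the real optimizer, or applied to a plan by hand:
    //   spark.experimental.extraOptimizations = Seq(FoldConstants)
    //   FoldConstants(result.queryExecution.analyzed)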
  
See? The (100 + 10) has been folded into 110.
  
PushPredicateThroughJoin pushes a filter that only touches the stu table down below the join, so far less data gets loaded and performance improves. Let's take a look at the final result.
At least the ColumnPruning, PushPredicateThroughJoin, ConstantFolding and RemoveRedundantAliases optimization rules have been applied by now, and my little tree has changed shape accordingly.
When logical optimization finishes, we still only have an abstract logical plan. It has to be converted into a physical execution plan, turning a logically feasible plan into one that Spark can actually execute:
  
Spark SQL converts each logical node into a corresponding physical node, for example the Join operator. Spark has devised different strategies for this operator for different scenarios, including BroadcastHashJoin, ShuffleHashJoin and SortMergeJoin. There are many optimization points here: during the conversion Spark makes an intelligent choice based on statistics, which involves cost-based optimization, a big topic of its own that deserves a separate article. In our example, because the amount of data is under 10 MB, the join is automatically converted into a BroadcastHashJoin. Sharp-eyed readers will have noticed that some extra nodes appear, so let me explain. The BroadcastExchange node inherits from the Exchange class and is used to exchange data between nodes; here it broadcasts the output of LocalTableScan to every executor node so the join can be done as a map-side join. The final Aggregate operation is split into two steps: first a parallel partial aggregation, then a final aggregation over the partial results, much like the combine followed by the final reduce in MapReduce. An Exchange hashpartitioning is inserted in between to make sure rows with the same key are shuffled into the same partition; a shuffle is required whenever the output distribution of a child in the current physical plan does not satisfy the requirements, and this is the exchange node inserted in the final EnsureRequirements phase. In the database field there is a saying that "whoever masters the join wins the world", so let's focus on the choices Spark SQL makes for the join operation.
A join basically splits the two tables into a large table and a small table. The large table is used as the streaming table to iterate over, and the small table is used as the lookup table; for each record in the large table, the records with the same key are fetched from the lookup table. Spark supports all the usual join types, and the join operation in Spark SQL picks a different join strategy depending on various conditions, choosing among BroadcastHashJoin, SortMergeJoin and ShuffleHashJoin.

  • BroadcastHashJoin: if Spark judges that the storage size of one table is below the broadcast threshold (controlled by spark.sql.autoBroadcastJoinThreshold, 10 MB by default), it broadcasts that small table to every executor and puts it into a hash table as the lookup table; the join can then be completed with a map-side operation, avoiding the shuffle and its large performance cost. Note, however, that BroadcastHashJoin does not support full outer joins; for a right outer join the left table is broadcast, for left outer, left semi and left anti joins the right table is broadcast, and for an inner join whichever table is smaller is broadcast. (See the sketch after this list for the knobs involved.)
 

  • SortMergeJoin: if both tables are large, SortMergeJoin is the better fit. It uses a shuffle to bring the records with the same key into the same partition, then sorts both sides and merges them; the cost of sorting and merging is acceptable.
 

  • ShuffleHashJoin: here, during the shuffle, the lookup side is put into a hash table instead of being sorted. When is ShuffleHashJoin chosen? The small table as a whole must be too big to broadcast (otherwise BroadcastHashJoin would be used), but the average size of each of its partitions must not exceed spark.sql.autoBroadcastJoinThreshold, which guarantees the per-partition lookup table fits in memory without OOM. There is one more condition: the large table must be more than three times the size of the small table; only then does this kind of join pay off.
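A small usage sketch of the knobs involved (the column names are the assumed ones from the example setup earlier):

    import org.apache.spark.sql.functions.broadcast

    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024) // default 10MB; -1 disables broadcasting
    val joined = score.join(broadcast(stu), stu("id") === score("sid"))       // hint: force-broadcast the small side
    joined.explain()                                                          // shows which physical join was picked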
 
As mentioned above, the nodes of the AST have now been converted into physical nodes. Starting from the head node, these physical nodes recursively call the execute method, each applying a transform to the RDD produced by its child, building up a chain of RDDs, much like the recursive calls over DStreams in Spark Streaming. The finally executed plan is shown below:
[Figure: the final execution, split into two stages]
It can be seen that the final execution is split into two stages. The small table is broadcast and joined against the large table with a BroadcastHashJoin, without involving a shuffle; then, in the final aggregation step, the HashAggregate sum function is first executed on the map side, after which an Exchange shuffles the rows with the same key (the name) into the same partition, and the final HashAggregate sum operation is performed. And then there is this strange WholeStageCodegen. Why does it exist? When we execute operators such as Filter and Project, those operators contain many expressions. Take SELECT sum(v), name: both sum and v are expressions, v being an attribute-variable expression, and an expression is itself a tree data structure; sum(v) is a tree with a sum node whose child is v. These expressions can be evaluated and can generate code. The most basic function of an expression is to evaluate the input Row; to do that, an Expression must implement the def eval(input: InternalRow = null): Any method.
An Expression processes a Row, and its output can be of any type; a Plan such as Project or Filter, however, outputs def output: Seq[Attribute], which represents a set of variables. Take the Filter (age >= 11) plan in our example: age >= 11 is an expression, and this comparison expression depends on two children, a Literal constant expression that evaluates to 11 and the attribute-variable expression age. The variable is converted into an AttributeReference during the analysis phase, but it is Unevaluable. To obtain the attribute's value from the input Row, the variable's index within a row of data has to be bound according to the schema, producing a BoundReference; at eval time the BoundReference can then fetch the value from the Row by that index. The final output type of the expression age >= 11 is Boolean, while the output of the Filter Plan is a Seq[Attribute].
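To make that concrete, here is a hand-built version of that predicate using Catalyst's expression classes directly (a sketch; normally the analyzer and the binding step do this for you):

    import org.apache.spark.sql.catalyst.InternalRow
    import org.apache.spark.sql.catalyst.expressions.{BoundReference, GreaterThanOrEqual, Literal}
    import org.apache.spark.sql.types.IntegerType

    // "age" bound to column 0 of the input row, then the predicate age >= 11.
    val ageRef = BoundReference(0, IntegerType, nullable = false)
    val pred   = GreaterThanOrEqual(ageRef, Literal(11))

    pred.eval(InternalRow(12))   // true
    pred.eval(InternalRow(9))    // false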
You can imagine the data flowing through one Plan after another, with the expressions inside each Plan processing it; that amounts to a huge number of calls to small functions, with all the function-call overhead that implies. So can we inline these small functions into one big function? That is exactly what WholeStageCodegen does.
[Figure: the final plan with * marking the operators covered by whole-stage code generation]
You can see that each node of the final execution plan has an * in front of it, indicating that whole-stage code generation is enabled. In our example, Filter, Project, BroadcastHashJoin, Project and HashAggregate all take part in whole-stage code generation and are cascaded into two big functions. If you are interested, you can call queryExecution.debug.codegen on the resulting Dataset to see what the generated code looks like. The Exchange operator, however, does not take part in whole-stage code generation, because it has to send data over the network.
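For example, with the result Dataset from the sketches above:

    result.explain()                        // the physical plan; codegen'd operators carry the * marker
    result.queryExecution.debug.codegen()   // dumps the Java source generated for each codegen stage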




My sharing today ends here. There are actually many more interesting things in Spark SQL, but the inherent complexity of the problem demands a high level of abstraction to keep everything straight, and that makes life harder for readers of the code. If you really dig into it, though, you will gain a lot. If you have any opinion about this article, please leave a message at the end and share your thoughts.