RDD programming

1. Definitions

Resilient Distributed Dataset (RDD)

An RDD in Spark is simply an immutable distributed collection of objects. Each RDD is divided into multiple partitions, which can be computed on different nodes of the cluster. An RDD can contain objects of any Python, Java, or Scala type, including user-defined types.
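As a minimal illustration of partitioning (assuming a local JavaSparkContext named ctx, created as in the examples below), the number of partitions can be requested and inspected:

 JavaRDD<Integer> nums = ctx.parallelize(Arrays.asList(1, 2, 3, 4), 2); // ask Spark for 2 partitions
 System.out.println(nums.partitions().size());                          // prints 2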

2. Foundation

In Spark, all work is expressed either as creating new RDDs, transforming existing RDDs (1. Transformation), or calling operations on RDDs to compute a result (2. Action).

Every Spark program or shell session works roughly like this:
1. Create some input RDDs from external data.
2. Transform them with operations like filter() to define new RDDs. (Transformation)
3. Ask Spark to persist() any intermediate RDDs that will need to be reused. Note: by default, Spark recomputes an RDD every time an action is run on it.
4. Launch actions such as count() and first() to kick off a parallel computation, which Spark then optimizes and executes. (Action)

Note: You can define a new RDD at any time, but Spark always computes it lazily, that is, the first time it is used in an action.

Note: The ability to recompute an RDD at any time is actually why RDDs are described as "resilient": when a machine holding RDD data fails, Spark uses this ability to recompute the lost partitions, transparently to the user.
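A rough end-to-end sketch of these four steps, reusing the log-file path from the examples below (the filter predicate and storage level here are just illustrative choices):

 SparkConf conf = new SparkConf().setMaster("local").setAppName("JavaWordCount");
 JavaSparkContext ctx = new JavaSparkContext(conf);
 JavaRDD<String> lines = ctx.textFile("D:/systemInfo.log");              // 1. create an RDD from external data
 JavaRDD<String> errors = lines.filter(new Function<String, Boolean>() { // 2. a transformation defines a new RDD
     public Boolean call(String line) { return line.contains("Exception"); }
 });
 errors.persist(StorageLevel.MEMORY_ONLY());                             // 3. persist an RDD that will be reused
 System.out.println(errors.count());                                     // 4. actions trigger the actual computation
 System.out.println(errors.first());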

3. Creating RDDs

1. Parallelize an in-memory collection

 SparkConf conf = new SparkConf().setMaster("local").setAppName("JavaWordCount");
 JavaSparkContext ctx = new JavaSparkContext(conf);
 JavaRDD<String> lines = ctx.parallelize(Arrays.asList("pandas", "i like pandas"));
 System.out.println(lines.count());

2. Load an external dataset

 SparkConf conf = new SparkConf().setMaster("local").setAppName("JavaWordCount");
 JavaSparkContext ctx = new JavaSparkContext(conf);
 JavaRDD<String> lines = ctx.textFile("D:/systemInfo.log");
 System.out.println(lines.count());
4. RDD operations

RDDs support two types of operations: transformations and actions.

Transformation: operates on an RDD and returns a new RDD, such as map() and filter().
Action: returns a result to the application or exports data to a storage system, such as count() and first().

 SparkConf conf = new SparkConf().setMaster("local").setAppName("JavaWordCount");
 JavaSparkContext ctx = new JavaSparkContext(conf);
 JavaRDD<String> lines = ctx.textFile("D:/systemInfo.log");
 JavaRDD<String> errsRDD = lines.filter(new Function<String, Boolean>() {
     private static final long serialVersionUID = 1L;
     public Boolean call(String x) {
         System.out.println("RDD transformation calculation");
         return x.contains("Exception");
     }
 });
 System.out.println(lines.count());
 System.out.println(errsRDD.count());

Note: The computation of an RDD transformation is deferred until the RDD is used in an action.
In the example above, if we comment out the two count() println statements, "RDD transformation calculation" is never printed; it only appears once an action such as count() is executed.

5. Passing functions
In Java, a function is an object that implements one of Spark's function interfaces from the org.apache.spark.api.java.function package.

The standard interfaces (name, method to implement, typical usage):

Function<T, R>: R call(T). One input and one output; used for operations such as map() and filter().
Function2<T1, T2, R>: R call(T1, T2). Two inputs and one output; used for operations such as aggregate() and fold().
FlatMapFunction<T, R>: Iterable<R> call(T). One input and zero or more outputs; used for operations such as flatMap().

Note: In Java 8, you can also implement these function interfaces concisely with lambda expressions.
 SparkConf conf = new SparkConf().setMaster("local").setAppName("JavaWordCount");
 JavaSparkContext ctx = new JavaSparkContext(conf);
 JavaRDD<String> lines = ctx.textFile("D:/systemInfo.log");
 JavaRDD<String> errors = lines.filter(new Contains("Exception"));
 System.out.println(errors.count());

 class Contains implements Function<String, Boolean> {
     private static final long serialVersionUID = 1L;
     private String query;

     public Contains(String query) {
         this.query = query;
     }

     @Override
     public Boolean call(String v1) throws Exception {
         return v1.contains(query);
     }
 }
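For comparison, a minimal sketch of the same filter written as a Java 8 lambda (assuming the same lines RDD as above):

 JavaRDD<String> errors = lines.filter(line -> line.contains("Exception"));
 System.out.println(errors.count());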
6. Common transformations and actions
Transformations
The two most common transformations are map() and filter(). The difference is that map() transforms each element into a new value, while filter() keeps or drops the original elements.
map(): returns a new RDD made up of the result of applying the function to each element.
filter(): returns a new RDD made up of only the elements for which the function returns true.

You can also see this from their signatures: map(Function<T, R>), filter(Function<T, Boolean>).
If you want each input element to produce multiple output elements, use flatMap(FlatMapFunction<T, U>).

An example of flatMap():
 SparkConf conf = new SparkConf().setMaster("local").setAppName("JavaWordCount");
 JavaSparkContext ctx = new JavaSparkContext(conf);
 JavaRDD<String> lines = ctx.parallelize(Arrays.asList("hello world", "hi"));
 JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
     private static final long serialVersionUID = 1L;
     public Iterable<String> call(String line) {
         return Arrays.asList(line.split(" "));
     }
 });
 System.out.println(StringUtils.join(words.collect(), ","));

Result: hello,world,hi. The single input "hello world" produced multiple output elements.

For comparison, the corresponding map() version returns a JavaRDD<String[]> containing one array per input line, rather than a flattened JavaRDD<String>:


 JavaRDD<String[]> words = lines.map(new Function<String, String[]>() {
     @Override
     public String[] call(String v1) throws Exception {
         return v1.split(" ");
     }
 });
RDDs also support many of the mathematical set operations.
For two RDDs containing {1, 2, 3} and {3, 4, 5}:
union(): an RDD containing all elements from both RDDs, result {1, 2, 3, 3, 4, 5}
intersection(): an RDD of only the elements found in both RDDs, result {3}
subtract(): the elements of one RDD with the contents of the other removed, result {1, 2}
cartesian(): the Cartesian product of the two RDDs, result {(1, 3), (1, 4), ..., (3, 5)}
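A minimal sketch of these set operations in Java (assuming the ctx from the earlier examples); intersection(), subtract(), and cartesian() involve shuffles, so the ordering of their results may vary:

 JavaRDD<Integer> a = ctx.parallelize(Arrays.asList(1, 2, 3));
 JavaRDD<Integer> b = ctx.parallelize(Arrays.asList(3, 4, 5));
 System.out.println(a.union(b).collect());        // [1, 2, 3, 3, 4, 5]
 System.out.println(a.intersection(b).collect()); // [3]
 System.out.println(a.subtract(b).collect());     // [1, 2] (order may vary)
 System.out.println(a.cartesian(b).collect());    // [(1,3), (1,4), (1,5), (2,3), ...]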

Actions

Some actions have already appeared in the examples above. Other commonly used actions include:

collect(): return all elements of the RDD
count(): return the number of elements in the RDD
countByValue(): return the number of times each element occurs in the RDD
take(num): return num elements from the RDD
top(num): return the top num elements from the RDD
takeOrdered(num)(ordering): return num elements from the RDD, based on the given ordering
takeSample(withReplacement, num, [seed]): return num elements from the RDD at random
reduce(func): combine the elements of the RDD in parallel (for example, to sum them)
fold(func): same as reduce(), but with a provided initial value
aggregate(zeroValue)(seqOp, combOp): similar to reduce(), but used to return a result of a different type
foreach(func): apply func to each element of the RDD
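A minimal sketch of a few of these actions (assuming the ctx from the earlier examples); the expected outputs are shown as comments:

 JavaRDD<Integer> nums = ctx.parallelize(Arrays.asList(1, 2, 3, 3));
 System.out.println(nums.collect());       // [1, 2, 3, 3]
 System.out.println(nums.count());         // 4
 System.out.println(nums.countByValue());  // {1=1, 2=1, 3=2} (entry order may vary)
 System.out.println(nums.take(2));         // [1, 2]
 Function2<Integer, Integer, Integer> add = new Function2<Integer, Integer, Integer>() {
     private static final long serialVersionUID = 1L;
     public Integer call(Integer a, Integer b) { return a + b; }
 };
 System.out.println(nums.reduce(add));     // 9
 System.out.println(nums.fold(0, add));    // 9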

7. Persist()

Every time you run an action on an RDD, Spark recomputes the RDD and all of its dependencies. To avoid computing the same RDD multiple times, you can ask Spark to cache the data. When Spark persists an RDD, the nodes that compute it store their partitions; if a node with cached data fails, Spark recomputes the lost partitions when they are needed.

Spark offers several persistence levels to choose from:

Level: space used / CPU time / in memory / on disk

MEMORY_ONLY: high / low / yes / no
MEMORY_ONLY_SER: low / high / yes / no
MEMORY_AND_DISK: high / medium / some / some (spills to disk if there is too much data to fit in memory)
MEMORY_AND_DISK_SER: low / high / some / some (spills to disk if there is too much data to fit in memory; data in memory is stored in serialized form)
DISK_ONLY: low / high / no / yes
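As a sketch, choosing a level from Java just means passing the corresponding constant from org.apache.spark.storage.StorageLevel (assuming an existing JavaRDD<String> lines); RDDs also provide unpersist() to remove cached data manually:

 lines.persist(StorageLevel.MEMORY_AND_DISK()); // spill to disk when memory fills up
 lines.unpersist();                             // manually remove the RDD from the cache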


If you try to cache too much data and exceed the available memory, Spark evicts old partitions using an LRU (least recently used) policy. For the memory-only storage levels, Spark recomputes the evicted partitions the next time they are needed; for the memory-and-disk levels, evicted partitions are written to disk. Either way, caching too much data will not stop your job, but caching unnecessary data can cause useful data to be evicted and recomputed many times.

An example:
 SparkConf conf = new SparkConf().setMaster("local").setAppName("JavaWordCount");
 JavaSparkContext ctx = new JavaSparkContext(conf);
 JavaRDD<Integer> rdd = ctx.parallelize(Arrays.asList(1, 2));
 JavaRDD<Integer> result = rdd.map(new Function<Integer, Integer>() {
     private static final long serialVersionUID = 1L;
     public Integer call(Integer x) {
         System.err.println("Recalculate: " + x);
         return x * x;
     }
 });
 result.persist(StorageLevel.MEMORY_ONLY());
 System.out.println(result.count());
 System.out.println(result.count());
When result.persist(...) is commented out, the output is:
 Recalculate: 1
 Recalculate: 2
 2
 Recalculate: 1
 Recalculate: 2
 2
With two count() calls, the map() transformation is computed twice.

With persist() enabled, the output is:

 Recalculate: 1
 Recalculate: 2
 2
 2

-- These are notes from Learning Spark.
