Can I add arguments to python code when I submit spark job?
Yes: Put this in a file called args.py
import sys
print(sys.argv)
If you run
spark-submit args.py a b c d e
You will see:
['/spark/args.py', 'a', 'b', 'c', 'd', 'e']
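Note that spark-submit's own options go before the script name; everything after the script is passed through to your program untouched. For instance (the master URL here is just an illustration, not part of the original example):
spark-submit --master local[2] args.py a b c d e
would print the same arguments, with local[2] consumed by spark-submit rather than appearing in sys.argv.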
You can pass arguments from the spark-submit command and then access them in your code as follows: sys.argv[1] gives you the first argument, sys.argv[2] the second, and so on. For example, the code below reads the arguments you pass on the spark-submit command line:
import sys

n = int(sys.argv[1])   # number of table names to read
a = 2                  # table names start at the third argument
tables = []
for _ in range(n):
    tables.append(sys.argv[a])
    a += 1
print(tables)
Save the above file as PysparkArg.py and execute the following spark-submit command:
spark-submit PysparkArg.py 3 table1 table2 table3
Output:
['table1', 'table2', 'table3']
This piece of code is useful in PySpark jobs that need to fetch multiple tables from a database, where the number of tables and their names are supplied by the user at spark-submit time.
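For example, here is a minimal sketch of how those table names might then be used to read from a database, assuming a JDBC source; the URL, credentials, and table handling are illustrative placeholders, not part of the original answer:
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PysparkArg").getOrCreate()

n = int(sys.argv[1])
tables = [sys.argv[i] for i in range(2, 2 + n)]

# Placeholder JDBC settings -- replace with your own database details
jdbc_url = "jdbc:postgresql://dbhost:5432/mydb"
properties = {"user": "myuser", "password": "mypassword"}

# Load each requested table into a DataFrame keyed by table name
dataframes = {
    table: spark.read.jdbc(url=jdbc_url, table=table, properties=properties)
    for table in tables
}

for name, df in dataframes.items():
    print(name, df.count())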
Even though sys.argv is a good solution, I still prefer this more proper way of handling command-line args in my PySpark jobs:
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--ngrams", help="some useful description.")
args = parser.parse_args()
if args.ngrams:
    ngrams = args.ngrams
This way, you can launch your job as follows:
spark-submit job.py --ngrams 3
More information about the argparse module can be found in the Argparse Tutorial.
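As a fuller sketch of the same idea, with type conversion, a required input path, and a SparkSession; these additions are illustrative, not part of the original answer:
import argparse

from pyspark.sql import SparkSession

parser = argparse.ArgumentParser(description="Count lines for an n-gram job.")
parser.add_argument("--ngrams", type=int, default=2,
                    help="size of the n-grams to generate")
parser.add_argument("--input", required=True,
                    help="path to the input text file")
args = parser.parse_args()

spark = SparkSession.builder.appName("ngrams").getOrCreate()

# argparse has already validated and converted the arguments
lines = spark.read.text(args.input)
print("Computing {}-grams over {} lines".format(args.ngrams, lines.count()))
This could then be launched, for example, as spark-submit job.py --ngrams 3 --input ngrams/input.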
Ah, it's possible. http://caen.github.io/hadoop/user-spark.html
# Run as a YARN job on your_queue with 10 executors,
# each with 12 GB of memory and 2 CPU cores
spark-submit \
  --master yarn-client \
  --queue <your_queue> \
  --num-executors 10 \
  --executor-memory 12g \
  --executor-cores 2 \
  job.py ngrams/input ngrams/output
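For completeness, a hedged sketch of what job.py might do with those two positional arguments; the word-count body is purely illustrative, since the linked page does not show the job's code:
import sys

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ngrams").getOrCreate()

# Positional arguments from spark-submit: input path and output path
input_path, output_path = sys.argv[1], sys.argv[2]

# Illustrative job body: count words in the input and write the result
lines = spark.sparkContext.textFile(input_path)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile(output_path)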