Removing duplicate rows based on specific columns in a PySpark DataFrame

Let's create a sample DataFrame of student records. The row for student 1 appears twice, so there is a duplicate to remove.

Python3

# import SparkSession, the entry point for working with DataFrames
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "sravan", "vignan"], ["2", "ojaswi", "vvit"],
        ["3", "rohith", "vvit"], ["4", "sridevi", "vignan"],
        ["1", "sravan", "vignan"], ["5", "gnanesh", "iit"]]

# specify column names
columns = ['student ID', 'student NAME', 'college']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

print('Original data in the DataFrame')
dataframe.show()

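Dropping based on one column

Passing a list with a single column name to dropDuplicates() keeps one row for each distinct value in that column.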

Python3

# keep one row for each distinct value in the
# college column
dataframe.dropDuplicates(['college']).show()

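Dropping based on multiple columns

When several column names are passed, two rows count as duplicates only if they match on every listed column.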

Python3

# keep one row for each distinct combination of
# the college and student ID columns
dataframe.dropDuplicates(['college', 'student ID']).show()
