10 March, 05:49

You are building a predictive solution based on web server log data. The data is collected in a comma-separated values (CSV) format that always includes the following fields:

- date: string
- time: string
- client_ip: string
- server_ip: string
- url_stem: string
- url_query: string
- client_bytes: integer
- server_bytes: integer

You want to load the data into a DataFrame for analysis. You must load the data in the correct format while minimizing the processing overhead on the Spark cluster. What should you do?

a. Load the data as lines of text into an RDD, then split the text based on a comma delimiter and load the RDD into a DataFrame.
b. Define a schema for the data, then read the data from the CSV file into a DataFrame using the schema.
c. Read the data from the CSV file into a DataFrame, inferring the schema.
d. Convert the data to tab-delimited format, then read the data from the text file into a DataFrame, inferring the schema.

Answers (1)
  1. 10 March, 05:59
    See the explanation.

    Explanation:

    The data is collected in a comma-separated values (CSV) format that always includes the following fields:

    - date: string
    - time: string
    - client_ip: string
    - server_ip: string
    - url_stem: string
    - url_query: string
    - client_bytes: integer
    - server_bytes: integer

    What should you do?

    a. Load the data as lines of text into an RDD, then split the text based on a comma delimiter and load the RDD into a DataFrame.

    # note: this snippet parses a local file with pandas to illustrate manual parsing; it does not create a Spark RDD
    # import the csv module and pandas
    import csv
    import pandas as pd

    # open the CSV file
    with open(r"C:/Users/uname/Downloads/abc.csv", newline="") as csv_file:
        # read every row while the file is still open
        csv_reader = csv.reader(csv_file, delimiter=',')
        rows = list(csv_reader)

    # now the parsed rows can be loaded into a pandas DataFrame
    df = pd.DataFrame(rows)
    df.head()
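Option a amounts to parsing every line by hand. The per-line work can be sketched in plain Python, without Spark; the `parse_line` helper and the sample line are invented for illustration, and the column order is assumed to match the field list in the question.

```python
# Column order taken from the field list in the question.
FIELDS = ["date", "time", "client_ip", "server_ip",
          "url_stem", "url_query", "client_bytes", "server_bytes"]


def parse_line(line):
    """Split one log line on commas and cast the two byte counts to int."""
    parts = line.rstrip("\n").split(",")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    # The first six fields stay strings; the last two become integers.
    return tuple(parts[:6]) + (int(parts[6]), int(parts[7]))


# Invented sample line, for illustration only.
sample = "2024-03-10,05:49:00,10.0.0.1,10.0.0.2,/index.html,id=1,512,2048"
record = parse_line(sample)
```

In PySpark a function like this would typically be applied to each line with `rdd.map` before the result is turned into a DataFrame, so the splitting and casting run as hand-written Python code for every line. That is extra work compared with letting the CSV reader apply a schema directly.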

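    b. Define a schema for the data, then read the data from the CSV file into a DataFrame using the schema.

    Supplying the schema gives every column its declared type and avoids the extra pass over the data that schema inference requires.

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, StringType, IntegerType

    spark = SparkSession.builder.getOrCreate()

    # field names and types as listed in the question
    newSchema = StructType([
        StructField("date", StringType(), True),
        StructField("time", StringType(), True),
        StructField("client_ip", StringType(), True),
        StructField("server_ip", StringType(), True),
        StructField("url_stem", StringType(), True),
        StructField("url_query", StringType(), True),
        StructField("client_bytes", IntegerType(), True),
        StructField("server_bytes", IntegerType(), True)])

    # read the comma-separated file with the schema defined above
    abc_DF = spark.read.load('C:/Users/uname/Downloads/new_abc.csv', format="csv", header="true", sep=',', schema=newSchema)

    c. Read the data from the CSV file into a DataFrame, inferring the schema.

    # inferSchema makes Spark scan the data to work out the column types
    abc_DF = spark.read.load('C:/Users/uname/Downloads/new_abc.csv', format="csv", header="true", sep=',', inferSchema="true")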
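The difference between reading with a declared schema and reading with an inferred one can be sketched without Spark. The following is a toy model in standard-library Python, not Spark's actual implementation: the sample rows, `read_with_schema`, and `infer_schema` are all invented for illustration. It shows that inference needs one pass to work out the types before a second pass can convert the values.

```python
import csv
import io

# Invented sample data in the column order given in the question.
SAMPLE = (
    "2024-03-10,05:49:00,10.0.0.1,10.0.0.2,/index.html,id=1,512,2048\n"
    "2024-03-10,05:49:01,10.0.0.3,10.0.0.2,/about.html,,256,1024\n"
)

# Declared schema: one converter per column, in file order.
SCHEMA = [str, str, str, str, str, str, int, int]


def read_with_schema(text, schema):
    """Single pass: convert each field as its row is read."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        rows.append([convert(value) for convert, value in zip(schema, record)])
    return rows


def infer_schema(text):
    """Extra pass: a column is int only if every value in it looks like an int."""
    schema = None
    for record in csv.reader(io.StringIO(text)):
        row_types = [int if value.lstrip("-").isdigit() else str for value in record]
        if schema is None:
            schema = row_types
        else:
            # Fall back to str as soon as two rows disagree on a column.
            schema = [a if a is b else str for a, b in zip(schema, row_types)]
    return schema


inferred = infer_schema(SAMPLE)                      # pass 1: work out the types
rows_inferred = read_with_schema(SAMPLE, inferred)   # pass 2: convert the values
rows_explicit = read_with_schema(SAMPLE, SCHEMA)     # declared schema: one pass only
```

On this sample both routes produce the same rows, but the inferred route reads the data twice. Inference can also pick a type that was not intended, for example when a column that should be a string happens to contain only digits.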

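    d. Convert the data to tab-delimited format, then read the data from the text file into a DataFrame, inferring the schema.

    # note: pandas example; it assumes new_abc.csv has already been converted to tab-delimited text
    import pandas as pd

    # pandas infers the column types while reading
    Df2 = pd.read_csv('new_abc.csv', delimiter="\t")
    print('Contents of Dataframe : ')
    print(Df2)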
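Option d adds a conversion step before anything is loaded. A minimal sketch of that step using only the standard library follows; the `csv_to_tsv` helper is hypothetical and the input is an invented sample. It shows that the conversion is a full pass over the data that changes the delimiter only and adds no type information.

```python
import csv
import io


def csv_to_tsv(text):
    """Rewrite comma-separated text as tab-separated text, row by row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for record in csv.reader(io.StringIO(text)):
        writer.writerow(record)
    return out.getvalue()


# Invented sample line, for illustration only.
converted = csv_to_tsv("2024-03-10,05:49:00,10.0.0.1,10.0.0.2,/index.html,id=1,512,2048\n")
```

Every value is still text after the conversion, so the schema must still be declared or inferred when the file is read.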