
Malcolm McRoberts's Public Library

  • insert into table s_file_pa_job_data_t partition(customer_id)
    values (array(named_struct('pa_axis', 0)));

    that is, using the array() and named_struct() UDFs, which construct an array and a struct, respectively, from scalar values according to your specs (see the Hive UDF documentation).


    but unfortunately if you do that you'll get

    FAILED: SemanticException [Error 10293]: Unable to create temp file  for insert values Expression of type TOK_FUNCTION not supported in insert/values 

    because Hive does not yet support UDF calls in the VALUES clause. As the other posts suggest, you can work around this with a dummy table, which is ugly but works.
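    The dummy-table workaround can be sketched as follows; `dummy` is a hypothetical one-row table, and the statement is built here as a plain Python string since the point is only to show the shape of the SQL:

    ```python
    # Sketch of the dummy-table workaround. "dummy" is a hypothetical
    # table assumed to contain exactly one row. Instead of calling
    # array()/named_struct() in a VALUES clause (which Hive rejects),
    # select the constructed value from the one-row table.
    workaround_sql = (
        "INSERT INTO TABLE s_file_pa_job_data_t PARTITION (customer_id) "
        "SELECT array(named_struct('pa_axis', 0)) "
        "FROM dummy"
    )
    print(workaround_sql)
    ```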

    • In addition to supporting the standard scalar data types, Hive also supports three complex data types:

      • Arrays are collections of related items that are all of the same scalar data type
      • Structs are object-like collections wherein each item is made up of multiple pieces of data, each with its own data type
      • Maps are key/value pair collections
Jul 27, 16
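A hypothetical table declaration combining the three complex types might look like this (built as a string for illustration; table and column names are invented):

```python
# Hypothetical Hive DDL using all three complex types
# (table and column names are invented for illustration).
ddl = """
CREATE TABLE page_views (
  tags  ARRAY<STRING>,                 -- array: items of one scalar type
  owner STRUCT<name:STRING, age:INT>,  -- struct: named fields, mixed types
  props MAP<STRING, STRING>            -- map: key/value pairs
)
"""
print(ddl)
```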

"Deploying Client Configuration Files"

  • You can also distribute these client configuration files manually to the users of a service.

  • You can use the collect_list Hive UDAF:

    from pyspark.sql.functions import expr
    from pyspark.sql import HiveContext

    sqlContext = HiveContext(sc)
    df = sqlContext.createDataFrame(rdd)

    df.groupBy("x").agg(expr("collect_list(y) AS y"))
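    As a pure-Python illustration (not Spark code) of what collect_list does per group:

    ```python
    # Pure-Python sketch of collect_list semantics: gather all y values
    # for each distinct x into a list, one list per group.
    from collections import defaultdict

    rows = [("a", 1), ("a", 2), ("b", 3)]
    collected = defaultdict(list)
    for x, y in rows:
        collected[x].append(y)

    print(dict(collected))  # {'a': [1, 2], 'b': [3]}
    ```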

  • You can pass them as parameters like this:

    "container": {     "type": "DOCKER",     "docker": {         "network": "HOST",         "image": "your/image",         "parameters": [             { "key": "add-host", "value": "host:ip" },             { "key": "dns-search", "value": "url" }         ]     } }

  • ipython notebook --ip='*' 

    Or a specific IP visible to other machines:

    ipython notebook --ip=

  • getItem(key)

    An expression that gets an item at position ordinal out of a list, or gets an item by key out of a dict.
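    The semantics mirror ordinary Python indexing; in PySpark you would call it on a Column (e.g. df.select(df.l.getItem(0))), but the underlying lookup behaves like this:

    ```python
    # getItem by position (list) and by key (dict), illustrated on plain
    # Python containers; in PySpark the same idea applies to array and
    # map columns via Column.getItem.
    items = [10, 20, 30]
    mapping = {"k": 1}

    by_position = items[0]   # 10
    by_key = mapping["k"]    # 1
    ```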

    • pyspark.sql.functions.lag(col, count=1, default=None)[source]

      Window function: returns the value that is offset rows before the current row, and default if there are fewer than offset rows before the current row. For example, an offset of one will return the previous row at any given point in the window partition.


      This is equivalent to the LAG function in SQL.

      • col – name of column or expression
      • count – number of rows to extend
      • default – default value

      New in version 1.4.
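      A pure-Python sketch of the semantics within a single window partition (not Spark code):

      ```python
      def lag_list(values, count=1, default=None):
          """Return a list where element i is values[i - count], or
          `default` when there are fewer than `count` rows before row i."""
          return [values[i - count] if i - count >= 0 else default
                  for i in range(len(values))]

      print(lag_list([10, 20, 30]))  # [None, 10, 20]
      ```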

  • pyspark.sql.functions.monotonically_increasing_id()[source]

    A column that generates monotonically increasing 64-bit integers.


    The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive.
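    Per the PySpark docs, the current implementation puts the partition ID in the upper 31 bits and the per-partition record number in the lower 33 bits; as a sketch:

    ```python
    # Bit layout described in the PySpark docs: upper 31 bits hold the
    # partition ID, lower 33 bits the record number within the partition.
    def mono_id(partition_id, record_number):
        return (partition_id << 33) | record_number

    print(mono_id(0, 0))  # 0
    print(mono_id(1, 0))  # 8589934592  (ids jump between partitions)
    ```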

  • from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    df.withColumn(
        "row_number",
        F.row_number().over(
            Window.partitionBy("a", "b", "c", "d").orderBy("time")
        )
    ).show()
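    Pure-Python sketch of what row_number does: number rows 1, 2, 3, ... within each partition in order (simplified here to a single partition key):

    ```python
    from itertools import groupby

    # Rows of (partition_key, time); row_number orders each partition by
    # time and numbers its rows starting at 1.
    rows = [("a", 3), ("a", 1), ("b", 2)]
    rows.sort(key=lambda r: (r[0], r[1]))

    numbered = []
    for _, group in groupby(rows, key=lambda r: r[0]):
        for n, row in enumerate(group, start=1):
            numbered.append(row + (n,))

    print(numbered)  # [('a', 1, 1), ('a', 3, 2), ('b', 2, 1)]
    ```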

  • #cloud-config
    write_files:
      - path: /test.txt
        content: |
          Here is a line.
          Another line is here.

  • I believe that the following should work:

    #cloud-config
    runcmd:
      - newhostn="<%newhostn%>"
      - hostn=$(cat /etc/hostname)
      - echo "Existing hostname is $hostn"
      - echo "New hostname will be $newhostn"
      - sed -i "s/$hostn/$newhostn/g" /etc/hosts
      - sed -i "s/$hostn/$newhostn/g" /etc/hostname
    power_state:
      mode: reboot

  • You first have to build the package.

    # navigate into your python package (where setup.py is located)
    python setup.py sdist

    This will create a dist/ directory containing a .tar.gz file.
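    A minimal, hypothetical setup.py (package name and version are placeholders) that the sdist command above would operate on:

    ```python
    # Hypothetical minimal setup.py; "mypkg" and the version are
    # placeholders for your package's real metadata.
    from setuptools import setup, find_packages

    METADATA = {
        "name": "mypkg",
        "version": "0.1.0",
        "packages": find_packages(),
    }

    if __name__ == "__main__":
        setup(**METADATA)

    # Running "python setup.py sdist" in this directory then produces
    # dist/mypkg-0.1.0.tar.gz
    ```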
