**Paper ID** 286

**Paper Title** K-Pg: Shared State in Differential Dataflows

**Track Name** Research Paper Second-Round

## Reviewer #1

1. **Overall Evaluation** Weak Reject
2. **Reviewer's confidence** Some Familiarity
3. **Originality** Medium
4. **Importance** Medium
5. **Summary of the contribution (in a few sentences)**

    The paper describes the design and implementation of K-Pg, a novel data analytics engine built on top of a timely dataflow execution engine. Like the prior work Naiad, K-Pg is based on the differential dataflow paradigm, but differs by introducing shared state in the form of indexes across all workers, temporally as well as across iterations. K-Pg models updates to data as streams and computations as operations on updates or batches of updates. The key innovation in the system seems to be the concept of arrangements, which take in a set of updates and output shared indexes.

6. **List 3 or more strong points, labelled S1, S2, S3, etc.**

    S1: A general framework and implementation for large-scale data analytics that subsumes relational, streaming, and iterative processing pipelines.

    S2: The design based on differential data flows includes novel abstractions like the arrange operator.

    S3: The system outperforms best-of-breed systems for relational, streaming, and iterative tasks.

7. **List 3 or more weak points, labelled W1, W2, W3, etc.**

    W1: The paper needs to be better written.

    W2: It is not clear how the implementation of the system conforms to the principles laid out in Section 3.3.

    W3: It is not clear in which ways arrangements are used, or which operators use them.

8. **Detailed evaluation. Number the paragraphs (D1, D2, D3, ...)**

    D1: I think the paper would definitely benefit from a running example that is carried throughout the paper to explain the design of the system.

    D2: While the data structures (or indexes) generated by arrangement operators are shared, some language suggests that when another data flow gets a trace handle, the arrangements are “copied over” during an import. Is that correct?

    D3: What is the batch size in Figure 4c? And what is w in Figure 4b?

    D4: The term relative throughput is not defined. Throughput is relative to what?

9. **Candidate for a Revision? (Answer yes only if an acceptable revision is possible within one month.)**

    Yes

10. **Required changes for a revision, if applicable. Labelled R1, R2, R3, etc.**

    The paper needs a significant rewrite. See weak points.

## Reviewer #2

1. **Overall Evaluation** Reject
2. **Reviewer's confidence** No Familiarity
3. **Originality** Medium
4. **Importance** Medium
5. **Summary of the contribution (in a few sentences)**

    The paper presents K-Pg, a system that enhances data parallel operators with the ability to process shared indexes.

6. **List 3 or more strong points, labelled S1, S2, S3, etc.**

    S1. The implementation is modular.

    S2. The authors compare against specialised systems.

    S3. K-Pg can be used in diverse workflows.

7. **List 3 or more weak points, labelled W1, W2, W3, etc.**

    W1. The paper is difficult to follow.

    W2. The plots are not easy to read.

    W3. Latency plots are often omitted.

8. **Detailed evaluation. Number the paragraphs (D1, D2, D3, ...)**

    The paper introduces shared indexes in order to efficiently implement data parallel workflows in various domains. The experimental evaluation shows that their approach achieves good results.

    Timely dataflows are the central concept of the paper, but they are not explained well in Section 3.1.
    In Section 3.3, it is not clear why the principles are that important.
    Section 4 is difficult to follow.
    The plots are difficult to read.

    In general, the paper is not well written, and it seems that K-Pg is a small extension of Naiad.

9. **Candidate for a Revision? (Answer yes only if an acceptable revision is possible within one month.)**

    No

## Reviewer #3

1. **Overall Evaluation** Reject
2. **Reviewer's confidence** Expert
3. **Originality** Low
4. **Importance** Low
5. **Summary of the contribution (in a few sentences)**

    The authors propose a new system for doing analytics with recursive queries (online and offline). As part of this, they propose a novel operator for indexing and sharing stream data.

6. **List 3 or more strong points, labelled S1, S2, S3, etc.**

    1) A reasonable problem to work on
    2) Indexing and sharing stream history has some novelty
    3) Real implementation and experiments with TPC-H

7. **List 3 or more weak points, labelled W1, W2, W3, etc.**

    1) The writeup needs major improvement
    2) Missing key references
    3) The comparisons with similar systems need major improvement, and probably rethinking.

8. **Detailed evaluation. Number the paragraphs (D1, D2, D3, ...)**

    First, much of the paper is incomprehensible, filled with vague unsupported claims. For instance, there is a whole paragraph after the first table in the intro that I found incomprehensible. I have no idea what is meant. Why the word 'holistic'?

    Section 3.3 also made no sense to me. Is exchange supposed to be a shuffle? Nobody implements this as part of a count operator, including Naiad. If I understood what arrange is supposed to be, it keeps a stream history. Count operators don't typically do this. Why is it part of count? Maybe a clear example would help clear some of this up.

    There are many unsupported claims throughout the paper. For instance, sticking with Section 3.3, there is the statement (relative to Naiad) "executes robustly despite the absence of frustrating systems knobs". What knobs? Robust w.r.t. what?

    While some aspects of the experiments are good (full TPC-H, a real implementation, etc.), the authors didn't actually run the competing systems. Rather, in some cases (they don't say which), they just use numbers published in other papers. How do we know the queries were run the same way? Were they run warm? Cold?

    Also, significant discussion is spent on performance improvements from batching and careful memory layout. The first streaming system to do this, resulting in significantly higher performance, was Trill, which can also be used for offline analytics, and which I think the authors are unaware of. Much of the performance-related effort is less novel than the authors imply.

    Finally, there is a significant body of work on recursive streaming queries, of which Naiad is one example (e.g. "On-the-fly Progress Detection in Iterative Stream Queries"). It is really unclear to me what the value of the proposed system is over these approaches. Perhaps a clear example would help? This paper needs a complete rewrite, and possibly significantly more research in the experimental section, which is why I didn't give it a revise.

9. **Candidate for a Revision? (Answer yes only if an acceptable revision is possible within one month.)**

    No

## META-REVIEWER #1

    Not submitted.