ISO/IEC JTC1/SC22/WG5 N2039

             Response to the WG5 straw ballot on N2033

                     Bill Long and John Reid

This paper contains responses to the comments in the WG5 straw ballot 
on N2033 (see N2038) and a set of edits to N2033. 

Reinhold Bader (A) wrote
"Now that the TS has the concept of stalled images, I think that image
control statements without a STAT= specification that involve a
failed image could now relatively easily be made to result in the 
executing image becoming stalled, instead of terminating the program. 
This would make development of fail-safe packages much easier, 
because the fail-safety can be designed in a top-down manner i.e. 
library code that synchronizes, allocates or deallocates must not 
necessarily be modified."

Response
We see the inclusion of a STAT= specification in an image control 
statement as desirable and believe that this will not have a big
effect on performance.  This is in contrast to protecting against 
remote references on failed images, for which there is no syntax 
and which would have performance overheads. Furthermore, we now 
have a fairly simple model of image control statements without a 
STAT= specifier - if successful they form segment boundaries and 
can be used for segment ordering and hence data consistency, 
or they fail and the program aborts, making data consistency 
issues irrelevant. Therefore, we do not favor the extension of 
stalling to image control statements without a STAT= specification.

.......

Reinhold Bader (B) wrote
(B) Section 7.
It is not specified what happens if no STAT argument is specified.

Response
We suggest this edit (copied from [18:24-25])
[17:32+] Add new para
"If a condition occurs that would assign a nonzero value to a STAT 
argument but the STAT argument is not present, error termination 
is initiated."
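
For illustration only (not part of the proposed edit), the following
sketch contrasts the two cases using CO_SUM. The program and variable
names are hypothetical, and a processor that supports the collective
subroutines of the TS is assumed.

```fortran
program stat_sketch
  implicit none
  integer :: total, status

  total = this_image()

  ! STAT present: an error condition assigns a nonzero value to status
  ! and execution continues, so the program can react to it.
  call co_sum(total, stat=status)
  if (status /= 0) error stop "co_sum reported an error condition"

  ! STAT absent: under the new paragraph, the same error condition
  ! would initiate error termination.
  total = this_image()
  call co_sum(total)

  if (this_image() == 1) print *, "sum of image indices:", total
end program stat_sketch
```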

.......

Reinhold Bader (D) wrote
(D) Collective intrinsics CO_BROADCAST (7.4.10) and
    CO_REDUCE (7.4.13)
There still seems to be some missing text with respect to invoking
these intrinsics with objects of derived type that have POINTER
components.

Response
The concern is that a pointer might appear to be associated with 
a target on a remote image. This can already happen in Fortran 
2008, where the association status of such a pointer is undefined; 
see 16.5.2.5 of the Standard:
"The association status of a pointer becomes undefined when
 ...
(2) the pointer is pointer-assigned to a target on a different image,"
No edits to N2033 are needed. 

.......

Reinhold Bader (C) and (E) are relevant only if the change proposed in
(A) is accepted. Thus, there are no separate responses for these
comments.

.......

Tobias Burnus wrote

The DTS does not address finalization of CO_BROADCAST and CO_REDUCE for 
derived types which have finalizers.

For CO_BROADCAST, simply adding a statement like the following should 
be sufficient and implementation wise, it should be simple as one can 
simply finalize it before the actual data transfer: In the description 
of "A" append: "On all images of the current team but on the image 
specified by SOURCE_IMAGE, A is finalized before it becomes defined."

For CO_REDUCE, the implementation will be more difficult; still, I 
believe it makes sense to require finalization.

Response
For CO_BROADCAST, we already have the words [22:24] "A becomes defined, 
as if by intrinsic assignment, ..." and the standard states in 4.5.6.3, 
para 9: "When an intrinsic assignment statement is executed, 
the variable is finalized after evaluation of expr and before the 
definition of the variable." Therefore, no change is needed. 

For CO_REDUCE, these edits are needed:
[24:22] After "to A" add "as if by intrinsic assignment".
[24:23] After "to A" add "as if by intrinsic assignment".
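
For illustration only, the finalization rule for intrinsic assignment
on which this response relies can be seen in a single-image sketch.
The module, type and procedure names are hypothetical. The sketch
does not call CO_BROADCAST or CO_REDUCE; it shows only the behaviour
that "as if by intrinsic assignment" brings with it.

```fortran
module tracked_mod
  implicit none
  type :: tracked
     integer :: id = 0
  contains
     final :: report
  end type tracked
contains
  subroutine report(self)
    type(tracked), intent(inout) :: self
    print *, "finalizing object with id", self%id
  end subroutine report
end module tracked_mod

program finalize_on_assignment
  use tracked_mod
  implicit none
  type(tracked) :: a, b

  a%id = 1
  b%id = 2

  ! Intrinsic assignment: per 4.5.6.3 para 9, a is finalized after
  ! b is evaluated and before a is defined, that is, while a still
  ! holds its old value.
  a = b
end program finalize_on_assignment
```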

.......

Malcolm Cohen wrote

I agree with Robert Corbett's vote.

I am somewhat taken aback that we've suddenly added this new concept 
(stalled images) with far-reaching effects (and more proposed in other 
comments) at the last minute.

It needs to be clear that it is possible to implement the "reliability" 
(failed/stalled/whatever image) features efficiently on a variety of 
architectures.  It should not require incompatible changes to an existing 
coarray implementation (which the current draft certainly seems to do).  I 
have no problem with some "bells and whistles" potentially requiring extra 
work, but a reasonably effective subset needs to be workable without heroic 
efforts, and without affecting programs that do not use the feature.

Response
See our responses below to Robert Corbett's vote.

.......

Malcolm Cohen wrote

Re finalization, I agree with Tobias Burnus' comments that it would be good 
for this to be spelled out in detail for CO_BROADCAST and CO_REDUCE.  For 
the latter it should say that the result of applying the function is 
finalized, including the final function application, (the latter is as if 
the output variable were assigned an expression that is the last function 
reference).  It should, I think, also be stated that the finalizations of 
the intermediate function results are done on the image that actually 
invoked the function, so that any deallocations are handled by the image 
that did the allocations.

Response
See our responses below to Tobias Burnus' comments.

.......

Robert Corbett wrote
 
My primary objection is to the requirements given in the third
paragraph of Clause 5.9 [14:19-23].  I do not see how the specified
semantics can be implemented without compromising the performance of
codes that do not make use of the feature.  I am not certain that
the semantics can be implemented at all in some common environments.

Response 
The intention was that implementations be permitted not to support
transfer of control to the END TEAM statement, but the present wording
does not say this. We think these edits are needed to address this 
comment and your other comments:

[14:9] Change "detect that an image has stalled" to "manage a stalled
image".

[14:19-25] Replace these two paragraphs by the following two 
paragraphs:
"If an image, in a statement other than an image control statement or an
invocation of a collective or atomic subroutine, attempts to reference
or define data using an <image-selector> that identifies an image that
has failed, the executing image becomes a stalled image. If the
<image-selector> identifies the initial team or the processor does not
have the ability to manage a stalled image, the executing image remains
a stalled image for the rest of the execution of the program.
Otherwise, the executing image resumes execution at the
END TEAM statement of the construct after execution of all
finalizations and deallocations that would have occurred during the
normal completion of active procedures that were invoked within the 
CHANGE TEAM construct.

While an image is stalled, other images can still access data
on that image.  If an image is stalled in the initial team, it 
participates in normal termination as if it had initiated normal 
termination."
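
For illustration only, the situation addressed by the replacement
paragraphs might arise in code such as the following sketch. The
names are hypothetical, and the comments describe the behaviour
proposed in this paper, not that of any existing processor.

```fortran
program stall_sketch
  use, intrinsic :: iso_fortran_env, only: team_type
  implicit none
  type(team_type) :: half
  integer :: x[*]
  integer :: me, neighbour, y

  me = this_image()
  x = me

  ! Split the initial team into two teams, numbered 1 and 2.
  form team (1 + mod(me, 2), half)

  change team (half)
     neighbour = 1 + mod(this_image(), num_images())
     sync all
     ! This is a reference using an image selector in a statement that
     ! is neither an image control statement nor an invocation of a
     ! collective or atomic subroutine. If the identified image has
     ! failed, the executing image becomes a stalled image.
     y = x[neighbour]
  end team
  ! If the processor can manage a stalled image, execution resumes at
  ! the END TEAM statement above after the pending finalizations and
  ! deallocations; otherwise the image remains stalled.
end program stall_sketch
```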

.......

Robert Corbett wrote
Clause 3.7 [5:41]
What does it mean for an image to have "encountered" an
<image-selector>?  I know we use the usual meaning of a word when
we do not specify its meaning, but that rule is inadequate for this
case.  For example, if an image executes a statement that contains an
<image-selector>, but that <image-selector> is part of an operand that
is not evaluated, has the <image-selector> been "encountered?"  My
guess is that it has not, but I cannot tell that from the draft TS.

Response
Your guess is correct. Together with the rewrite of [14:19-25] in
our response to your first comment, these edits are needed:

[5:41] Change "has encountered" to ", in a statement other than an 
image control statement or an invocation of a collective or atomic 
subroutine, attempts to reference or define data using".
[32:14] Change "has encountered" to ", in a statement other than an 
image control statement or an invocation of a collective or atomic 
subroutine, attempts to reference or define data using".

.......

Robert Corbett wrote
Clause 5.9, paragraph 3 [14:20-23]
When does a stalled image transfer control to the END TEAM statement?
Can it happen immediately or must it wait until all other images that
are part of the same team have completed, failed, or stalled?

Response
In order to preserve symmetric memory, it would be necessary for the
stalled image to participate in coarray deallocations. Also, it is 
intended that data on the stalled image remain available to executing
images. The edits needed are included in the rewrite of [14:19-25]
in our response to your first comment.

.......

Robert Corbett wrote
Clause 5.9, paragraph 3 [14:20-23]
Are the deallocations and finalizations subject to any requirements
w.r.t. the order in which they are performed?  For example, during
normal execution, allocatable objects that are part of an instance of
an internal procedure will be deallocated before the allocatable
objects that are part of the related instance of the host procedure.
Is there any requirement that that ordering be respected by the
stalled image?

Response
Yes, the order should be respected. The edits needed are included in 
the rewrite of [14:19-25] in our response to your first comment.

.......

Bill Long wrote
I recommend the following change.
N2033: [14:23] Delete ',without synchronization of coarray deallocations'.

Response
This edit has been included in the response to Robert Corbett.

.......

Nick Maclaren wrote
I agree with Robert Corbett and Malcolm Cohen about stalled images, but
believe that they have understated the issue.  The requirement is to
handle the 'knock-on' effect of image A failing, image B getting stuck
as a consequence, and image C then needing to interact with image C.  I
agree with the authors that the concept is essential if support is to be
provided for failed images, and that is one of the reasons that I have
consistently voted against the whole feature or abstained.

Response
We think you mean "... image C then needing to interact with image B."
As part of our response to Robert Corbett, we have added this sentence
"While an image is stalled, other images can still access data
on that image."  
An implementation can treat a stalled image as being very like an image 
that has initiated normal termination. 

.......

Nick Maclaren wrote
I have implemented error recovery in run-time systems, have used and
worked on it in several contexts, and know that I am not smart enough to
specify it for a language like Fortran.  Of the thousand or so language
and environment specifications I have seen, I have never seen one
specify this successfully, even for a single environment.  It might be
possible in Haskell, but Fortran is not Haskell.  From the lack of
convergence of these documents and the comments on the mailing list,
this TS seems to be failing in the ways that so many others have failed
before it.  It is doubtful that adding this facility takes "full account
of the state of the art" (see the ISO Directives).

Response
Perhaps you were aiming for too perfect a system. It is not intended 
that all possible failure scenarios be covered. For example, NOTE 5.9
explains that it might be impossible to recover from failure of image 1. 

.......

Nick Maclaren wrote
I believe that there is no chance whatsoever that this issue can be
resolved, and WG5 still keep to the schedule agreed in Las Vegas (see
N2020 and N2024).  Indeed, I doubt that it could be done with even a
year's delay.  Solving this problem is not within the state of the art,
despite considerable efforts in a great many contexts over the past
half-century.

I believe that the whole feature of support for failing and stalled
images should be removed, possibly specified in another TS, and not
integrated until there is significant implementation and user experience
in a fairly wide variety of environments.

Response
We would like to remind you that support of failed images is not 
required of the processor. We believe that removing the feature would 
lead to several "no" votes, so we have to "agree to disagree". 

.......

Nick Maclaren wrote

Many or most of the comments in N2013 on events have still not been
addressed, nor have some of ones on atomics and collectives.  In
particular, there are assumptions of cross-facility coherence and
progress but no normative text requiring them - indeed, quite the
opposite.  It is doubtful that the current TS is "consistent, clear and
accurate" (see the ISO Directives).  This is extremely serious, as
adopting an inconsistent set of assumptions will make it almost
impossible to deliver the target specified in 1. Scope, paragraph 2,
even ignoring the problem of the schedule.

I do not believe that this issue is as intractable as the previous one,
because specifying data and control flow and progress are within "the
state of the art".  However, I am doubtful that the facilities in TS can
be implemented efficiently without special hardware or operating system
support, while still delivering the consistency and progress that seem
to be assumed.  However, even if there are no consistency problems to be
resolved, I do not believe that accepting these aspects of this TS is
compatible with keeping to the agreed schedule.

I believe that this area needs further clarity, even if not a polished
specification, before the TS should be accepted.  I am not repeating the
relevant comments in N2013, because there is little point - there has
been little relevant change to the drafts.

Response
It would be very helpful to have explicit suggestions for edits. These
could then be considered for inclusion. We urge Nick Maclaren to provide
suggested edits through his national body. 

.......

Dan Nagle wrote 
I recommend the following change.

[27:14-15] change "a nonzero value" to "a positive value"

Error values are positive.

Response
We agree with this edit. 

.......

John Reid wrote
I recommend the following changes.

[10:19] Delete "or be the value of a team variable for the initial
team".
Reason. Execution of FORM TEAM is always required.

Response
The error lies in the first part of this sentence: "The <team-variable> 
shall have been defined by execution of a FORM TEAM statement in the 
team that executes the CHANGE TEAM statement". It was intended that
other means of defining the team variable, including the use of 
GET_TEAM, be permitted. We therefore suggest this edit:
[10:18] Change "The <team-variable> shall have been defined" to 
"The value of <team-variable> shall be the value of a team variable 
defined".
Further edits are needed to allow for the case where the team is the
initial team:
[10:20] Change "those defined" to "those of team variables defined".
[10:21] Change "team." to "team or be the values of a team variable for 
the initial team."
[11:9] After "designator" add "and the current team is not the initial 
team".
[11:12] After "TEAM_ID." add "If TEAM_ID= appears in a coarray designator 
and the current team is the initial team, the value of <scalar-int-expr>
is ignored."

.......

John Reid wrote
I recommend the following changes.

[10:38], [12:21], [29:34], [34:8], [34:13]. At the end of the sentence
add "since execution last began in this team" (wavy underlined on page
34).
Reason. We need to allow for teams changing during the execution of the
program. At the October meeting, these words were added at [13:5],
[35:26], [35:37], and [36:1].

Response
We agree with these edits, except that the edit for [10:38] should be
[10:38] At the end of the sentence add "since execution last began in 
the team that was current before execution of the CHANGE TEAM statement"
and this edit is needed at [10:41]: 
[10:41] At the end of the sentence add "since execution last began in 
the team that was current before execution of the corresponding 
CHANGE TEAM statement".
Malcolm Cohen later drew our attention to the fact that the concept of 
construct completion, used at [10:32] and in the new text for 
[14:19-25] in our first response to Robert Corbett, has not been 
defined. We therefore propose this further edit:
[10:23] At the end of the paragraph, add the sentence "A CHANGE TEAM 
construct completes execution by executing its END TEAM statement."

.......

John Reid wrote
I recommend the following changes.

[13:1] Change "the team" to "team".
Reason. Definite article is wrong here.

[13:5] Remove space before period.

Response
We agree with these edits. 

.......

John Reid wrote
I recommend the following changes.

[14:9] Change "detect that an image has stalled" to "manage a stalled
image".
[14:20] After "becomes a stalled image" add ". If the processor does
not have the ability to manage a stalled image, the executing image
becomes a stalled image for the rest of the execution of the program.
If the processor has the ability to manage a stalled image, the
executing image becomes a stalled image"
Reason. I think the intention is to allow implementations not to
support stalled images transferring control to the END TEAM statement.
Stalling will still happen and will need to be permanent.

Response
These edits have been superseded by the response to Robert Corbett.

.......

Van Snyder wrote

No, for similar reasons to Robert Corbett, Malcolm Cohen, and Nick
Maclaren.  I am concerned that we have added incompletely
thought-through things at the last minute.

Response
See the responses to Robert Corbett, Malcolm Cohen, and Nick Maclaren. 

.......

Stan Whitlock wrote

Yes, but I recommend the changes in Bill Long and John Reid's ballots.
 
Response
See the responses to Bill Long and John Reid.