Network Working Group                                        N. Williams
Internet-Draft                                              Cryptonector
Intended status: Informational                        September 20, 2013
Expires: March 24, 2014


      Boundary Analysis for Internationalization and Localization
                draft-williams-i18n-boundary-analysis-00

Abstract

   Internationalization of protocols and programs often requires
   determining where to use one or another character repertoire,
   codeset, encoding, where to perform localization, and so on.  This
   document aims to serve as a guide to Internet protocol designers in
   determining what they may or should recommend or require of protocol
   implementors.  Of particular interest in this document are filesystem
   protocols.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current Internet-
   Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six months
   and may be updated, replaced, or obsoleted by other documents at any
   time.  It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 24, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of


Williams                 Expires March 24, 2014                 [Page 1]

Internet-Draft           I18N Boundary Analysis           September 2013


   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.


Table of Contents

   1.      Introduction and Motivation  . . . . . . . . . . . . . . .  3
   1.1.    Conventions used in this document  . . . . . . . . . . . .  3
   2.      Internationalization . . . . . . . . . . . . . . . . . . .  4
   2.1.    Terminology  . . . . . . . . . . . . . . . . . . . . . . .  4
   3.      Filesystems and Remote/Distributed Filesystem Protocols  .  6
   3.1.    On Filesystem Client and Server Implementation
           Architectures and their Relevance  . . . . . . . . . . . .  6
   3.2.    Obvious Boundaries . . . . . . . . . . . . . . . . . . . .  6
   3.3.    Legacy . . . . . . . . . . . . . . . . . . . . . . . . . .  7
   3.3.1.  Legacy Problem #1: Loss of Metadata at the System Call
           Boundary . . . . . . . . . . . . . . . . . . . . . . . . .  7
   3.3.2.  Legacy Problem #2: Unknown Character String Metadata
           in Existing Filesystem Content . . . . . . . . . . . . . .  7
   3.3.3.  Legacy Problem #3: Poor Handling of Unicode
           Equivalence (Normalization)  . . . . . . . . . . . . . . .  8
   3.3.4.  Legacy Problem #4: Ignored Requirements  . . . . . . . . .  8
   3.3.5.  Legacy Problem #5: Constraints Imposed by Non-Internet
           Standards  . . . . . . . . . . . . . . . . . . . . . . . .  8
   3.4.    A World Without Legacy . . . . . . . . . . . . . . . . . .  8
   3.5.    Coping with / Accepting Legacy . . . . . . . . . . . . . .  9
   3.5.1.  Implications . . . . . . . . . . . . . . . . . . . . . . .  9
   3.6.    Recommendations for Filesystem Protocols, Filesystems,
           and Operating Systems  . . . . . . . . . . . . . . . . . . 10
   3.7.    Interoperability Considerations for Filesystem
           Protocols  . . . . . . . . . . . . . . . . . . . . . . . . 11
   4.      Security Considerations  . . . . . . . . . . . . . . . . . 12
   5.      IANA Considerations  . . . . . . . . . . . . . . . . . . . 13
   6.      References . . . . . . . . . . . . . . . . . . . . . . . . 14
   6.1.    Normative References . . . . . . . . . . . . . . . . . . . 14
   6.2.    Informative References . . . . . . . . . . . . . . . . . . 14
           Author's Address . . . . . . . . . . . . . . . . . . . . . 15


1.  Introduction and Motivation

   As the IETF has attempted to internationalize Internet protocols, we
   have learned some valuable lessons.  It is time to write these down.
   This document focuses on internationalization problems in the
   filesystem and remote / distributed filesystem protocol space.

   This document is INFORMATIVE.

1.1.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   Where RFC2119 key words are used herein for stating requirements or
   recommendations, they are used as part of suggested normative
   language to be used by normative Internet protocol specifications
   that accept the internationalization advice given in this document.


2.  Internationalization

   Internationalizing a protocol roughly requires the following tasks:

   1.  decide where to use Unicode [XXX add reference] and what encoding
       of Unicode

   2.  decide where any conversions to other codesets should be done, if
       any

   3.  decide what Unicode characters (and non-characters) to permit or
       forbid

   4.  decide what Unicode character mappings are appropriate

   5.  decide how to handle string equality, including case-sensitive
       and case-insensitive behavior, and whether and how to handle
       Unicode equivalence (normalization)

   In practice there is no simple way to decide where to perform any
   conversions, mappings, or checks: historically most protocols and
   data formats have not tagged strings with language or codeset
   information, codesets and their encodings often overlap, and other
   legacy problems intervene.
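
   The overlap between codesets can be illustrated with a minimal
   sketch.  The octet string and the list of candidate codesets below
   are our own illustrative choices, and plausible_codesets() is a
   hypothetical helper, not part of any protocol or library.  The point
   is only that untagged octets may decode cleanly under more than one
   codeset.

```python
# The same octets decode without error under several codesets, so the
# octets alone cannot tell a receiver which codeset the sender meant.
raw = b"caf\xc3\xa9"

as_utf8 = raw.decode("utf-8")          # 4 characters, ends in U+00E9
as_latin1 = raw.decode("iso-8859-1")   # 5 characters, ends in U+00C3 U+00A9
as_cp1252 = raw.decode("cp1252")       # same result as ISO 8859-1 here


def plausible_codesets(octets, candidates=("utf-8", "iso-8859-1", "cp1252")):
    """Return the candidate codesets under which octets decode cleanly."""
    ok = []
    for name in candidates:
        try:
            octets.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok
```

   A receiver that sees these octets with no tag can at best guess;
   every candidate in the list accepts them.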

   We describe here our experience with NFSv4 in particular and
   filesystems in general.

2.1.  Terminology

   [...]

   Some terms used in this document:

   just-use-8  Where a program or protocol component accepts character
      strings, treating them as arbitrary octet strings, often assuming
      that byte values less than 0x80 are US-ASCII, or that specific
      byte values are specific US-ASCII characters (e.g., filesystem
      path component separators).

   just-send-8  Where a program or protocol component sends character
      strings without regard to whether the strings' codeset/encoding
      is the codeset/encoding expected on the wire.

   just-use-UTF-8  Where a program or protocol component accepts
      character strings that are valid UTF-8 strings without regard to
      normalization.


   just-send-UTF-8  Where a program or protocol component sends UTF-8
      character strings without attempting to normalize or perform any
      similar steps (e.g., applying character mappings and/or
      prohibitions).
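
   As a rough illustration of two of these terms, the following sketch
   contrasts a just-use-8 path splitter with a just-use-UTF-8 name
   check.  Both function names are hypothetical and the behavior shown
   is only one plausible reading of the definitions above.

```python
def split_path_just_use_8(path):
    """just-use-8: treat the path as octets; only 0x2F ('/') is
    interpreted, and every other octet is passed through untouched."""
    return [component for component in path.split(b"/") if component]


def accept_just_use_utf8(name):
    """just-use-UTF-8: accept any well-formed UTF-8 octet string,
    without normalizing it or checking its normalization form."""
    try:
        name.decode("utf-8")  # strict decoding; raises on malformed input
    except UnicodeDecodeError:
        return False
    return True
```

   Note that the just-use-8 splitter happily returns components that
   are not valid UTF-8, while the just-use-UTF-8 check accepts both the
   precomposed and the decomposed spelling of the same name.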

   ...  Define lots more and reference other RFCs...


3.  Filesystems and Remote/Distributed Filesystem Protocols

   Filesystems and filesystem protocols may be the most difficult
   application to internationalize that we in the IETF have seen to
   date.  Initially, for NFSv4 [RFC3530] we believed that we could
   simply mandate the use of UTF-8 [RFC3629], forbid some characters,
   require a choice of normalization forms, and we'd be done.  In
   practice it was not so simple.

3.1.  On Filesystem Client and Server Implementation Architectures and
      their Relevance

   To understand the difficulties faced in internationalizing NFSv4 we
   need to understand the typical architecture of NFSv4 clients and
   servers.  "Typical" is meant literally here: the vast majority, if
   not all, of the major general-purpose operating systems in use at
   this time and over the entire history of NFSv4 share the
   architecture that we describe here, differing only in minor details.

   Normally the architecture and design of clients and servers would be
   of no interest to the IETF: we want neither to dictate nor be unduly
   constrained by such things.  In this case, however, architecture and
   legacy combine to create unusual problems for filesystem protocols,
   so we must take implementation architecture into account.

   Both clients and servers typically have a "kernel" that executes
   privileged-mode object code and that has a pluggable "virtual
   filesystem switch" (VFS) -- an interface that abstracts filesystems
   so as to permit support for many different types of filesystems.
   Clients, and usually servers, also run user-mode object code (less
   privileged than kernel-mode object code) that interfaces with
   filesystems by invoking privileged kernel-mode code through well-
   defined interfaces ("system calls") that allow the kernel to maintain
   privilege separation and isolation.  These system calls also present
   a common, standard, abstract interface to all filesystems that can be
   plugged into the kernel's VFS.  Some servers run no user-mode object
   code to speak of, running all fileserver protocol implementations in
   kernel mode; nonetheless, the architecture is roughly the same for
   servers as for clients.

3.2.  Obvious Boundaries

   Some boundaries are immediately evident:

   o  the system call layer, between user mode and kernel mode

   o  the VFS boundary, between generic kernel object code and specific
      filesystem implementations


   o  the network, between the client implementation and the server
      implementation

   o  the VFS again, between the server and the filesystems beneath it

   o  the persistent storage network, between specific filesystem
      implementations and persistent storage

   Their relevance to I18N will be discussed further below.

3.3.  Legacy

   Many, perhaps all, commonly used general-purpose operating systems
   predate modern internationalization efforts.  (Some operating
   systems, such as those for mobile devices, are new enough that they
   might well pose no legacy I18N issues for filesystems.)

   Most such operating systems treat character strings as mostly opaque
   at many, if not all, of the boundaries described in Section 3.2.  At
   most they interpret path component separator characters, and in
   doing so they assume US-ASCII [XXX add reference] as the lowest
   common denominator.

   Because these operating systems, filesystem on-disk formats, and
   actual on-disk filesystems predate modern internationalization
   efforts, there exist many filesystems with object name strings of
   unknown or mixed codesets.  Strings in filesystems, such as object
   names, are never tagged with codeset information because that
   information was, and usually still is, lost at the system call
   boundary.  The actual codesets (and encodings) used typically vary
   with user (and system administrator) locale preferences.

3.3.1.  Legacy Problem #1: Loss of Metadata at the System Call Boundary

   The first and foremost problem, then, is the loss of locale metadata
   at the system call boundary.  Without fixing this we cannot move to
   an all-Unicode world in filesystem protocols.

3.3.2.  Legacy Problem #2: Unknown Character String Metadata in Existing
        Filesystem Content

   The second most important problem in filesystem internationalization
   is the lack of locale (codeset, encoding) metadata for existing
   (legacy) filesystem content, specifically file and directory names.


3.3.3.  Legacy Problem #3: Poor Handling of Unicode Equivalence
        (Normalization)

   Historically, Unicode input methods have tended to produce pre-
   composed codepoints -- something close to Normalization Form
   Composed (NFC).  But this is not always so.

   Historically, most filesystems have treated file (and directory)
   names as opaque, but at least one filesystem (Apple's HFS+ [XXX add
   reference]) assumes UTF-8 and normalizes to Normalization Form
   Decomposed (NFD) at object-create and object-lookup time.

   This can result in subtle interoperability problems, as two objects
   with equivalent names may exist in namespaces (directories) where
   names are expected to be unique, or users may fail to input names
   that match those that exist in a filesystem.
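
   The first failure mode can be reproduced in a few lines.  The
   sketch below models a directory as a dictionary keyed by raw UTF-8
   octets, which is an assumption standing in for a filesystem that
   treats names as opaque.  The file name is an arbitrary example.

```python
import unicodedata

# Two spellings of the same visible name: precomposed (NFC) and
# decomposed (NFD).  They are canonically equivalent but differ as
# code point sequences and therefore as UTF-8 octet strings.
nfc_name = "r\u00e9sum\u00e9.txt"
nfd_name = unicodedata.normalize("NFD", nfc_name)

# A toy directory that compares names octet-for-octet.
directory = {}
directory[nfc_name.encode("utf-8")] = "object 1"
directory[nfd_name.encode("utf-8")] = "object 2"

# Both entries now coexist, although a user sees one name twice.
entry_count = len(directory)
```

   A user who types the precomposed form will find only "object 1",
   and has no obvious way to tell the two entries apart in a listing.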

3.3.4.  Legacy Problem #4: Ignored Requirements

   The original NFSv4 specification [RFC3530] requires some character
   mappings and prohibitions.  Most implementations have ignored this
   requirement.

3.3.5.  Legacy Problem #5: Constraints Imposed by Non-Internet Standards

   POSIX [XXX add reference] is one common standard for system call
   interfaces to filesystems.  Arguably it requires that:

   1.  applications observe the same file/directory names, when listing
       a directory, as they created;

   2.  no aliases for files/directories may exist other than
       "symlinks" or "hardlinks".

   This makes it very difficult to deploy Unicode normalization anywhere
   other than the application.  But it is not possible to fix every
   POSIX application to normalize on create or lookup either!
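
   To see why the first constraint conflicts with normalization below
   the application, consider a toy directory that normalizes to NFD on
   create.  The class and its methods are hypothetical; they model no
   particular filesystem.

```python
import unicodedata


class NormalizingDir:
    """Toy directory that normalizes names to NFD on create."""

    def __init__(self):
        self._entries = {}

    def create(self, name):
        # The stored name may differ from the name the caller supplied.
        self._entries[unicodedata.normalize("NFD", name)] = object()

    def listdir(self):
        return sorted(self._entries)


d = NormalizingDir()
created = "caf\u00e9"          # precomposed, as many input methods emit
d.create(created)

# The application does not see the name it created when it lists the
# directory, which is the behavior constraint 1 arguably forbids.
round_trips = created in d.listdir()
```

   Here round_trips is False: the listing contains the decomposed
   spelling instead of the one the application used.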

3.4.  A World Without Legacy

   If we didn't have the legacy problems described above we could simply
   mandate the use of Unicode in one specific encoding (e.g., UTF-8) "in
   the middle", with the middle being everything from the system call
   boundary to the VFS boundary, as well as "on the wire".  Any codeset
   conversions and Unicode normalization would be performed at the
   system call boundary (i.e., on the client) and at the VFS boundary
   (if, for example, a filesystem on-disk format requires a different
   codeset/encoding than the protocol does on the wire).
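
   A minimal sketch of conversion at the system call boundary follows.
   It assumes UTF-8 "in the middle" and NFC as the normalization form;
   both are illustrative choices, and to_kernel() and from_kernel() are
   hypothetical names rather than real system interfaces.

```python
import unicodedata

KERNEL_ENCODING = "utf-8"   # assumed encoding "in the middle"


def to_kernel(name, locale_codeset):
    """Convert a name from the caller's locale codeset to normalized
    UTF-8 on the way into the kernel."""
    text = name.decode(locale_codeset)
    return unicodedata.normalize("NFC", text).encode(KERNEL_ENCODING)


def from_kernel(name, locale_codeset):
    """Convert a name from UTF-8 back to the caller's locale codeset.
    Raises UnicodeEncodeError if the locale cannot represent it."""
    return name.decode(KERNEL_ENCODING).encode(locale_codeset)
```

   For example, to_kernel(b"caf\xe9", "iso-8859-1") yields the UTF-8
   octets b"caf\xc3\xa9".  The reverse direction can fail: a name
   containing characters outside the caller's locale codeset has no
   faithful representation there, which is one reason such a fix is
   hard to deploy.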


   Or perhaps in an ideal world all user applications would run only in
   Unicode locales, and would perform explicit codeset conversions when
   handling legacy (non-Unicode) data.  This ideal is one we will likely
   attain in time, as legacy non-Unicode locales are abandoned, legacy
   filesystems are cleaned up, and new operating systems (or new
   versions of them) replace older ones.

   In an ideal world there would be no Unicode normalization problems
   because either there would be just one normal form for Unicode or
   because all implementations of filesystem clients, servers,
   filesystems, and filesystem-using applications, would use a single,
   common normal form.  In practice this is almost certainly an
   impossible ideal.

3.5.  Coping with / Accepting Legacy

   Legacy abounds.  We must cope with it.

   First, the IETF can't cause the system call boundary metadata loss
   problem to be fixed.  The architectures of the relevant operating
   systems are such that the simplest fix for that problem is to convert
   between the user-mode locale's codeset/encoding and the codeset/
   encoding expected by the kernel.  But getting such a fix implemented
   and deployed is difficult for a number of reasons.  One is its
   impact on performance (for users using locales that require
   conversions).  Another is complexity: the user-mode side of system
   calls can sometimes be in a bootstrapping state during which I18N
   object code may not have been loaded yet.  The simplest workaround
   is to recommend that users use only locales that use Unicode as the
   character repertoire and codeset, preferably with the encoding
   expected on the kernel side of the system call boundary.

   The second legacy problem, legacy filesystem content, could be
   addressed by requiring manual inspection and repair of legacy
   content, but there exists such a vast amount of legacy content that
   this is not a realistic option.  There is no fix for the legacy
   filesystem content problem.

3.5.1.  Implications

   Some implications of accepting legacy:

   o  we may want Unicode in the middle, but sometimes we'll have non-
      Unicode content

   o  we can stop the creation of new non-Unicode content on disk, but
      we can't really preclude access to it


   o  normalization-on-create is problematic

   o  normalization-on-lookup is problematic

   o  normalization-insensitive lookups are problematic

   o  ignoring normalization is problematic

   With respect to normalization there's no one solution appropriate to
   all use cases.

3.6.  Recommendations for Filesystem Protocols, Filesystems, and
      Operating Systems

   o  Filesystems SHOULD be configurable to reject object names which
      are not valid in the filesystem's chosen Unicode encoding.

      This allows filesystems (and their servers) to put a stop to the
      rot, except, of course, for non-Unicode strings that happen to
      appear as valid Unicode strings due to codeset/encoding aliasing.

   o  Remote / distributed filesystem protocols _should_ specify the use
      of Unicode on the wire, but they should also allow the use of non-
      Unicode names, leaving it to the filesystem to decide whether to
      accept or reject such names.

      *  For example, this means that NFSv4 _servers_ SHOULD accept
         object names from clients which are not valid UTF-8, contrary
         to the original NFSv4 specification [RFC3530].

   o  Remote / distributed filesystem protocols should permit servers to
      return non-Unicode object names to clients.  This allows servers
      to serve legacy non-Unicode content.

      *  For example, this means that NFSv4 clients SHOULD be prepared
         to accept non-UTF-8 names from NFSv4 servers, contrary to the
         original NFSv4 specification [RFC3530].

   o  Filesystem servers should accept object names from filesystems
      which are not valid in the host operating system's chosen codeset
      and encoding for use above the VFS.

   o  Filesystems SHOULD be configurable as to Unicode normalization,
      allowing at least the following two options:

      *  Normalization-insensitive lookups.


      *  No normalization at all.

   o  Filesystems MAY be configurable as to Unicode normalization,
      allowing this additional option:

      *  Normalize on create and lookup

   o  Operating systems SHOULD be configurable as to codeset/encoding
      conversions at the system call boundary, allowing these options:

      *  convert to/from non-Unicode locales' codesets

      *  no conversion

   o  Operating systems that do not support codeset/encoding conversions
      at the system call boundary SHOULD at least encourage users to use
      or switch to using Unicode locales.
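
   The filesystem-side options above can be sketched as a toy in-memory
   directory.  Everything here is hypothetical: the class name, the
   option names ("none", "insensitive", "normalize"), and the choice of
   UTF-8 and NFC are assumptions made for illustration only.  The
   sketch rejects invalid names only on create, so that existing legacy
   content stays reachable.

```python
import unicodedata


def is_valid_utf8(name):
    try:
        name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


class ToyDirectory:
    """In-memory directory with configurable validity and normalization."""

    def __init__(self, reject_invalid_utf8=False, normalization="none"):
        if normalization not in ("none", "insensitive", "normalize"):
            raise ValueError(normalization)
        self.reject_invalid_utf8 = reject_invalid_utf8
        self.normalization = normalization
        self._entries = {}   # lookup key -> stored name

    def _key(self, name):
        # Legacy (non-UTF-8) octets are always compared byte-for-byte.
        if self.normalization == "none" or not is_valid_utf8(name):
            return name
        text = name.decode("utf-8")
        return unicodedata.normalize("NFC", text).encode("utf-8")

    def create(self, name):
        if self.reject_invalid_utf8 and not is_valid_utf8(name):
            raise ValueError("name is not valid UTF-8")
        key = self._key(name)
        if key in self._entries:
            raise FileExistsError(name)
        # "normalize" stores the normalized form; the other modes
        # preserve the octets the caller supplied.
        stored = key if self.normalization == "normalize" else name
        self._entries[key] = stored

    def lookup(self, name):
        """Return the stored name matching name, or None."""
        return self._entries.get(self._key(name))

    def listdir(self):
        return sorted(self._entries.values())
```

   In "insensitive" mode a lookup by the precomposed spelling finds an
   entry created with the decomposed spelling, and the listing still
   shows the name exactly as created.  In "normalize" mode the listing
   shows the normalized name instead, which is the behavior that makes
   normalize-on-create problematic for POSIX applications.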

3.7.  Interoperability Considerations for Filesystem Protocols

   [[anchor1: Intent: describe interoperability problems that arise
   given current NFSv4 deployments and legacy filesystem contents.]]


4.  Security Considerations

   [[anchor2: Lots to talk about here.  For example, aliasing issues
   w.r.t. multiple equivalent Unicode forms, and the resulting potential
   for confusion.]]


5.  IANA Considerations

   There are no IANA considerations in this document.


6.  References

6.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

6.2.  Informative References

   [RFC3530]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
              Beame, C., Eisler, M., and D. Noveck, "Network File System
              (NFS) version 4 Protocol", RFC 3530, April 2003.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.


Author's Address

   Nicolas Williams
   Cryptonector, LLC

   Email: nico@cryptonector.com

