Network Working Group                                        N. Williams
Internet-Draft                                              Cryptonector
Intended status: Informational                        September 20, 2013
Expires: March 24, 2014


      Boundary Analysis for Internationalization and Localization
                draft-williams-i18n-boundary-analysis-00

Abstract

   Internationalization of protocols and programs often requires
   determining where to use one or another character repertoire,
   codeset, encoding, where to perform localization, and so on.  This
   document aims to serve as a guide to Internet protocol designers in
   determining what they may or should recommend or require of protocol
   implementors.

   Of particular interest in this document are filesystem protocols.

Status of this Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on March 24, 2014.

Copyright Notice

   Copyright (c) 2013 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.
   Code Components extracted from this document must include
   Simplified BSD License text as described in Section 4.e of the Trust
   Legal Provisions and are provided without warranty as described in
   the Simplified BSD License.


Williams                 Expires March 24, 2014                 [Page 1]

Internet-Draft           I18N Boundary Analysis           September 2013


Table of Contents

   1.  Introduction and Motivation
     1.1.  Conventions used in this document
   2.  Internationalization
     2.1.  Terminology
   3.  Filesystems and Remote/Distributed Filesystem Protocols
     3.1.  On Filesystem Client and Server Implementation
           Architectures and their Relevance
     3.2.  Obvious Boundaries
     3.3.  Legacy
       3.3.1.  Legacy Problem #1: Loss of Metadata at the System Call
               Boundary
       3.3.2.  Legacy Problem #2: Unknown Character String Metadata in
               Existing Filesystem Content
       3.3.3.  Legacy Problem #3: Poor Handling of Unicode Equivalence
               (Normalization)
       3.3.4.  Legacy Problem #4: Ignored Requirements
       3.3.5.  Legacy Problem #5: Constraints Imposed by Non-Internet
               Standards
     3.4.  A World Without Legacy
     3.5.  Coping with / Accepting Legacy
       3.5.1.  Implications
     3.6.  Recommendations for Filesystem Protocols, Filesystems, and
           Operating Systems
     3.7.  Interoperability Considerations for Filesystem Protocols
   4.  Security Considerations
   5.  IANA Considerations
   6.  References
     6.1.  Normative References
     6.2.  Informative References
   Author's Address

1.  Introduction and Motivation

   As the IETF has attempted to internationalize Internet protocols we
   have learned some valuable lessons.  It is time to write these down.

   This document focuses on internationalization problems in the
   filesystem and remote / distributed filesystem protocols space.

   This document is INFORMATIVE.

1.1.  Conventions used in this document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in [RFC2119].

   Where RFC2119 key words are used herein for stating requirements or
   recommendations, they are used as part of suggested normative
   language to be used by normative Internet protocol specifications
   that accept the internationalization advice given in this document.

2.  Internationalization

   Internationalizing a protocol roughly requires the following tasks:

   1.  decide where to use Unicode [XXX add reference] and what
       encoding of Unicode

   2.  decide where any conversions to other codesets should be done,
       if any

   3.  decide what Unicode characters (and non-characters) to permit or
       forbid

   4.  decide what Unicode character mappings are appropriate

   5.  decide how to handle string equality, including case-sensitive
       and case-insensitive behavior, and whether and how to handle
       Unicode equivalence (normalization)

   In practice there is no simple way to decide where to perform any
   conversions, mappings, or checks: historically most protocols and
   data formats do not tag strings with language or codeset
   information, codesets and their encodings often overlap, and other
   legacy problems intervene.

   We describe here our experience with NFSv4 in particular and
   filesystems in general.

2.1.  Terminology

   [...]

   Some terms used in this document:

   just-use-8
      Where a program or protocol component accepts character strings,
      treating them as arbitrary octet strings, often assuming that
      byte values less than 0x80 are US-ASCII, or that specific byte
      values are specific US-ASCII characters (e.g., filesystem path
      component separators).

   just-send-8
      Where a program or protocol component sends character strings
      without regard as to whether the string's codeset/encoding is the
      expected on-the-wire codeset/encoding.

   just-use-UTF-8
      Where a program or protocol component accepts character strings
      that are valid UTF-8 strings without regard to normalization.

   just-send-UTF-8
      Where a program or protocol component sends UTF-8 character
      strings without attempting to normalize or perform any similar
      steps (e.g., applying character mappings and/or prohibitions).

   ...

   Define lots more and reference other RFCs...

3.  Filesystems and Remote/Distributed Filesystem Protocols

   Filesystems and filesystem protocols may be the most difficult
   application to internationalize that we in the IETF have seen to
   date.
   Initially, for NFSv4 [RFC3530] we believed that we could simply
   mandate the use of UTF-8 [RFC3629], forbid some characters, require
   a choice of normalization forms, and we'd be done.  In practice it
   was not so simple.

3.1.  On Filesystem Client and Server Implementation Architectures and
      their Relevance

   To understand the difficulties faced in internationalizing NFSv4 we
   need to understand the typical architecture of NFSv4 clients and
   servers.  We say "typical", but it really is typical: the vast
   majority, if not all, of the major general-purpose operating systems
   in use at this time and over the entire history of NFSv4 share the
   architecture that we describe here, differing only in minor details.

   Normally the architecture and design of clients and servers would be
   of no interest to the IETF: we want neither to dictate nor be unduly
   constrained by such things.  In this case architecture and legacy
   combine to create unusual problems for filesystem protocols, so we
   must take implementation architecture into account!

   Both clients and servers typically have a "kernel" that executes
   privileged-mode object code and which has a pluggable "virtual
   filesystem switch" (VFS) -- an interface that abstracts filesystems
   so as to permit support for many different types of filesystems.

   Clients, and usually servers, also run user-mode object code (less
   privileged than kernel-mode object code) that interfaces with
   filesystems by invoking privileged kernel-mode code through
   well-defined interfaces ("system calls") that allow the kernel to
   maintain privilege separation and isolation.  These system calls too
   present a common, standard, abstract interface to all filesystems
   that can be plugged into the kernel's VFS.

   Some servers run no user-mode object code to speak of, running all
   fileserver protocol implementations in kernel mode; nonetheless, the
   architecture is roughly the same for servers as for clients.

3.2.  Obvious Boundaries

   Some boundaries are immediately evident:

   o  the system call layer, between user-mode and kernel mode

   o  the VFS boundary, between generic kernel object code and specific
      filesystem implementations

   o  the network, between the client implementation and the server
      implementation

   o  the VFS again, between the server and the filesystems beneath it

   o  the persistent storage network, between specific filesystem
      implementations and persistent storage

   Their relevance to I18N will be discussed further below.

3.3.  Legacy

   Many, perhaps all, commonly used general-purpose operating systems
   predate modern internationalization efforts.  (Some operating
   systems, such as those for mobile devices, are new enough that they
   might well pose no legacy I18N issues for filesystems.)

   Most such operating systems simply treated character strings as
   mostly opaque at many if not all of the boundaries described in
   Section 3.2, at most interpreting path component separator
   characters, in the process assuming US-ASCII [XXX add reference] as
   the lowest common denominator for the purpose of finding path
   component separators.

   Because these operating systems, filesystem on-disk formats, and
   actual on-disk filesystems predate modern internationalization
   efforts, there exist many filesystems with object name strings of
   unknown or mixed codesets.  Strings, such as object names, in
   filesystems are never tagged with codeset information because the
   codeset information was and still is usually lost at the system call
   boundary.  The actual codesets (and encodings) used typically vary
   along with user (and system administrator) locale preferences.

3.3.1.  Legacy Problem #1: Loss of Metadata at the System Call Boundary

   The first and foremost problem, then, is the loss of locale metadata
   at the system call boundary.  Without fixing this we cannot move to
   an all-Unicode world in filesystem protocols.

3.3.2.  Legacy Problem #2: Unknown Character String Metadata in Existing
        Filesystem Content

   The second most important problem in filesystem internationalization
   is the lack of locale (codeset, encoding) metadata for existing
   (legacy) filesystem content, specifically file and directory names.

3.3.3.  Legacy Problem #3: Poor Handling of Unicode Equivalence
        (Normalization)

   Historically, Unicode input methods tend to produce pre-composed
   codepoints -- something close to Normalization Form Composed (NFC).
   But this is not always so.

   Historically, most filesystems treat file (and directory) names as
   opaque, but at least one filesystem (Apple's HFS+ [XXX add
   reference]) assumes UTF-8 and normalizes to Normalization Form
   Decomposed (NFD) at object-create and object-lookup time.

   This can result in subtle interoperability problems, as two objects
   with equivalent names may exist in namespaces (directories) where
   names are expected to be unique, or users may fail to input names
   that match those that exist in a filesystem.

3.3.4.  Legacy Problem #4: Ignored Requirements

   The original NFSv4 specification [RFC3530] requires some character
   mappings and prohibitions.  Most implementations have ignored this
   requirement.

3.3.5.  Legacy Problem #5: Constraints Imposed by Non-Internet Standards

   POSIX [XXX add reference] is one common standard for system call
   interfaces to filesystems.  Arguably it requires that:

   1.  applications observe the same file/directory names, when listing
       a directory, as they created;

   2.  no aliases may exist for files/directories that are not
       "symlinks" or "hardlinks".

   This makes it very difficult to deploy Unicode normalization
   anywhere other than the application.  But it is not possible to fix
   every POSIX application to normalize on create or lookup either!

3.4.  A World Without Legacy

   If we didn't have the legacy problems described above we could
   simply mandate the use of Unicode in one specific encoding (e.g.,
   UTF-8) "in the middle", with the middle being everything from the
   system call boundary to the VFS boundary, as well as "on the wire".
   Any codeset conversions and Unicode normalization would be performed
   at the system call boundary (i.e., on the client) and at the VFS
   boundary (if, for example, a filesystem on-disk format requires a
   different codeset/encoding than the protocol does on the wire).

   Or perhaps, in an ideal world, all user applications would run only
   in Unicode locales and would have to perform explicit codeset
   conversions when handling legacy (non-Unicode) data.  This ideal is
   one we will likely attain in time, as legacy non-Unicode locales are
   abandoned, legacy filesystems are cleaned up, and new operating
   systems (or new versions of them) take over from older ones.

   In an ideal world there would be no Unicode normalization problems,
   either because there would be just one normal form for Unicode or
   because all implementations of filesystem clients, servers,
   filesystems, and filesystem-using applications would use a single,
   common normal form.  In practice this is almost certainly an
   impossible ideal.

3.5.  Coping with / Accepting Legacy

   Legacy abounds.  We must cope with it.

   First, the IETF can't cause the system call boundary metadata loss
   problem to be fixed.  The architectures of the relevant operating
   systems are such that the simplest fix for that problem is to
   convert between the user-mode locale's codeset/encoding and the
   codeset/encoding expected by the kernel.
   But getting such a fix implemented and deployed is difficult for a
   number of reasons.  One is its impact on performance (for users
   whose locales require conversions).  Another is complexity: the
   user-mode side of system calls can sometimes be in a bootstrapping
   state during which I18N object code may not have been loaded yet.
   The simplest fix for this problem is to recommend that users use
   only locales that use Unicode as the character repertoire and
   codeset, preferably with the encoding expected on the kernel side of
   the system call boundary.

   The second legacy problem, legacy filesystem content, could be
   addressed by requiring manual inspection and repair of legacy
   content, but there exist such vast amounts of legacy content that
   this is not a realistic option.  There is no fix for the legacy
   filesystem content problem.

3.5.1.  Implications

   Some implications of accepting legacy:

   o  we may want Unicode in the middle, but sometimes we'll have
      non-Unicode content

   o  we can stop the creation of new non-Unicode content on disk, but
      we can't really preclude access to it

   o  normalization-on-create is problematic

   o  normalization-on-lookup is problematic

   o  normalization-insensitive lookups are problematic

   o  ignoring normalization is problematic

   With respect to normalization there's no one solution appropriate to
   all use cases.

3.6.  Recommendations for Filesystem Protocols, Filesystems, and
      Operating Systems

   o  Filesystems SHOULD be configurable to reject object names which
      are not valid in the filesystem's chosen Unicode encoding.  This
      allows filesystems (and their servers) to put a stop to the rot,
      except, of course, for non-Unicode strings that happen to appear
      as valid Unicode strings due to codeset/encoding aliasing.

   o  Remote / distributed filesystem protocols _should_ specify the
      use of Unicode on the wire, but they should also allow the use of
      non-Unicode names, leaving it to the filesystem to decide whether
      to accept or reject such names.

      *  For example, this means that NFSv4 _servers_ SHOULD accept
         object names (from clients) which are not valid UTF-8,
         contrary to the original NFSv4 specification [RFC3530].

   o  Remote / distributed filesystem protocols should permit servers
      to return non-Unicode object names to clients.  This allows
      servers to serve legacy non-Unicode content.

      *  For example, this means that NFSv4 clients SHOULD be prepared
         to accept non-UTF-8 names from NFSv4 servers, contrary to the
         original NFSv4 specification [RFC3530].

   o  Filesystem servers should accept object names (from filesystems)
      which are not valid in the host operating system's chosen codeset
      and encoding for use above the VFS.

   o  Filesystems SHOULD be configurable as to Unicode normalization,
      allowing at least the following two options:

      *  Normalization-insensitive lookups.

      *  No normalization at all.

   o  Filesystems MAY be configurable as to Unicode normalization,
      allowing these additional options:

      *  Normalize on create and lookup

   o  Operating systems SHOULD be configurable as to codeset/encoding
      conversions at the system call boundary, allowing these options:

      *  convert to/from non-Unicode locales' codesets

      *  no conversion

   o  Operating systems that do not support codeset/encoding
      conversions at the system call boundary SHOULD at least encourage
      users to use or switch to using Unicode locales.

3.7.  Interoperability Considerations for Filesystem Protocols

   [[anchor1: Intent: describe interoperability problems that arise
   given current NFSv4 deployments and legacy filesystem contents.]]

4.  Security Considerations

   [[anchor2: Lots to talk about here.  For example, aliasing issues
   w.r.t. multiple equivalent Unicode forms, and the resulting
   potential for confusion.]]

5.  IANA Considerations

   There are no IANA considerations in this document.

6.  References

6.1.  Normative References

   [RFC2119]  Bradner, S., "Key words for use in RFCs to Indicate
              Requirement Levels", BCP 14, RFC 2119, March 1997.

6.2.  Informative References

   [RFC3530]  Shepler, S., Callaghan, B., Robinson, D., Thurlow, R.,
              Beame, C., Eisler, M., and D. Noveck, "Network File
              System (NFS) version 4 Protocol", RFC 3530, April 2003.

   [RFC3629]  Yergeau, F., "UTF-8, a transformation format of ISO
              10646", STD 63, RFC 3629, November 2003.

Author's Address

   Nicolas Williams
   Cryptonector, LLC

   Email: nico@cryptonector.com