Zend Weekly Summaries Issue #361

December 13, 2007

TLK: T_IMPORT vs T_USE
TLK: Compiled variables and backpatching #2
TLK: Square brackets
TLK: Solaris and getcwd()
TLK: Class posing
RFC: VS 2005 support
BUG: import NAME conflict
TLK: Exceptions in autoload
TLK: Taint support: first results
CVS: Getopt in, ereg out of the core
PAT: Missed one (or two)

30th September – 6th October 2007


TLK: T_IMPORT vs T_USE

Sebastian Bergmann can remember the heady days of PHP 5.0-dev, when the
original namespace implementation in PHP was attempted and thrown out. When
that implementation was withdrawn, claimed Sebastian, the
T_NAMESPACE and T_USE tokens were retained for
reasons of forward compatibility. (Read: namespace and
use have been reserved words ever since.) Sebastian now wrote to
the internals list to recommend that, rather than introducing the new token
T_IMPORT, the current namespace implementation should be altered
to use T_USE.
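
Whether a word is reserved can be checked from userland with the tokenizer extension. The following is only a sketch, assuming a current PHP build with ext/tokenizer enabled rather than the 5.3-dev tree under discussion; the helper name token_names() is made up for illustration.

```php
<?php
// Hypothetical helper: return the token names for a snippet, ignoring whitespace.
function token_names(string $code): array
{
    $names = [];
    foreach (token_get_all($code) as $token) {
        if (is_array($token)) {
            if ($token[0] === T_WHITESPACE) {
                continue;
            }
            $names[] = token_name($token[0]);
        } else {
            $names[] = $token; // single-character tokens such as ";"
        }
    }
    return $names;
}

// "use" is lexed as the reserved T_USE token...
print_r(token_names('<?php use Foo;'));
// ...whereas "import" is an ordinary identifier (T_STRING) and can name a function.
print_r(token_names('<?php function import() {}'));
```

A reserved word gets its own token, which is why turning import into one would break code that already uses it as a function or method name.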

Johannes Schlüter corrected him – T_NAMESPACE wasn’t
retained, although T_USE was. He’d investigated usage of
import the easy way, using Google’s codesearch facility
(http://www.google.com/codesearch); the search term "->import(" returned
roughly 300 results. The public codebases using import include several major
PHP applications, among them Horde, Tikiwiki
(http://info.tikiwiki.org/tiki-index.php), TYPO3 (http://typo3.com/) and
WordPress (http://wordpress.org/). Johannes therefore
backed Sebastian. If the change could be agreed, he offered to write the
patch.

Andi Gutmans wanted to stay with import, if possible. He
believed that people generally would expect use to include
files. That said, he recognized the problem, and pointed out the same issue
probably arises with namespace. What Andi really wanted to do at
this point was investigate whether it is possible to support reserved words as
identifiers, and whether doing so would make sense. For now, he asked that
T_IMPORT be left as it was; ‘We’re still far enough off from
release that we don’t need to finalize today.’

Greg Beaver and Stas Malyshev immediately started thinking about how it could
be done – Greg actually produced a mostly-working solution
(http://news.php.net/php.internals/32579) later in the week – but Sebastian
Nohn pointed out that PHP users generally would care less about the
appropriateness of the term than about it breaking their software. He
pointedly added Serendipity (http://www.s9y.org/) and the Zend Framework
(http://framework.zend.com/) to Johannes’ list of public code currently using
import. Andi responded; he had no
intention of breaking any software, much less the ZF; he just wanted to see
if this other approach was feasible. If not, it would have to be
use.

Mike Ford completely disagreed with Andi over the connotations of the word
import. He felt that use implies ‘something that’s already lying around ready
to be used’, whereas import would go and fetch something. However, he
conceded, ‘in the global polyglot marketplace, this argument may not have
much force’.

Short version: The naming of ‘import’ isn’t important enough to
justify the pain.


TLK: Compiled variables and backpatching #2

Paul Biggar returned. He thanked Stas for his explanation of compiled
variables, but still had no idea what “backpatching” might be. Could someone
please oblige?

Stas did so, but since the term isn’t generally used by the PHP development
team this is all a little esoteric. He believed that “backpatching” describes
the act of updating opcodes for parsed code to give them awareness of
something that couldn’t be known at the time of parsing. For example, in an
if-else statement, when the if condition is false
you need the else part, but that hasn’t been parsed yet. To get
around this, the opcode created when if was parsed is updated on
parsing else. Stas added, rather enigmatically, that backpatching
is also used to compose variable expressions, e.g.


$foo[$bar]->x->$y->z;
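
The if-else case Stas describes can be sketched in a few lines. This is a toy model, not the Zend Engine's compiler: the opcode names JMPZ, JMP and ECHO and both function names are assumptions made for illustration.

```php
<?php
// Toy "compiler" for: if (<cond>) { echo $then; } else { echo $else; }
function compile_if_else(string $then, string $else): array
{
    $ops = [];

    // While "if" is being parsed the jump target is unknown: emit a placeholder.
    $ops[] = ['op' => 'JMPZ', 'target' => null];
    $jmpz = count($ops) - 1;

    $ops[] = ['op' => 'ECHO', 'value' => $then];

    // After the "then" branch, skip the "else" branch; target unknown as well.
    $ops[] = ['op' => 'JMP', 'target' => null];
    $jmp = count($ops) - 1;

    // "else" has been reached: backpatch the conditional jump.
    $ops[$jmpz]['target'] = count($ops);
    $ops[] = ['op' => 'ECHO', 'value' => $else];

    // End of the statement: backpatch the unconditional jump.
    $ops[$jmp]['target'] = count($ops);

    return $ops;
}

// Minimal interpreter for the three toy opcodes.
function run_ops(array $ops, bool $cond): string
{
    $out = '';
    for ($pc = 0; $pc < count($ops); ) {
        $op = $ops[$pc];
        switch ($op['op']) {
            case 'JMPZ':
                $pc = $cond ? $pc + 1 : $op['target'];
                break;
            case 'JMP':
                $pc = $op['target'];
                break;
            case 'ECHO':
                $out .= $op['value'];
                $pc++;
                break;
        }
    }
    return $out;
}

echo run_ops(compile_if_else('yes', 'no'), false), "\n";
```

The two assignments to 'target' are the backpatching: opcodes emitted earlier are revisited once the parser knows where the later code ended up.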

Short version: So now you know. (If you followed that, anyway.)


TLK: Square brackets

Alexey Zakhlestin had discovered the PDM notes for PHP 6
(http://www.php.net/~derick/meeting-notes.html) from way back when. The
section that particularly caught his eye was the one about square brackets:

"For both strings and arrays, the [] operator will support
substr()/array_slice() functionality"

Was this behaviour likely to appear in PHP 5.3?

Ilia Alshanetsky thought not; it hadn’t been on the table for discussion.
Besides, he personally felt it was ‘a bit too much magic’. Tony Dovgal
agreed; ‘too Perl-ish for me’. Stas simply saw no point in it; ‘since we do
have substr()/array_slice() there’s no need to overload the [] operator’.

Andrei Zmievski wrote that it was on his TODO list.

Martin Alterisio immediately asked how it would impact SPL’s
ArrayAccess and related interfaces and objects. Would there be
an interface to the new functionality, and if so, how would ranges be passed
to it? Would it be consistent with substr() and
array_slice(), if used alongside ArrayAccess? Stas
wrote, rather smugly, that this was exactly the problem with such syntax.
Alexey thought it could be made to work with ArrayAccess, but
would be slow; the requested elements would have to be queried one by one and
then combined into an array. He also thought that adding an interface would
solve the problem, and that ranges could be passed in the same way as with
the [] operator. Martin replied that, in that case,
rangeSet() shouldn’t have a type-hinted third parameter, since
that argument might well implement ArrayAccess.
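
Alexey's element-by-element approach can be approximated in userland today. A minimal sketch, assuming integer offsets; slice_access() is a hypothetical helper, not part of SPL or of any proposed interface.

```php
<?php
// Hypothetical helper: emulate an (offset, length) range on an ArrayAccess
// object by asking for each element individually, then combining the results.
function slice_access(ArrayAccess $source, int $offset, int $length): array
{
    $result = [];
    for ($i = $offset; $i < $offset + $length; $i++) {
        if (isset($source[$i])) {       // existence check per element
            $result[] = $source[$i];    // one read per element
        }
    }
    return $result;
}

// ArrayObject implements ArrayAccess, so it serves as a stand-in here.
$list = new ArrayObject(['a', 'b', 'c', 'd', 'e']);
print_r(slice_access($list, 1, 3)); // the elements at offsets 1, 2 and 3
```

Each element costs a separate call into the object, which is the slowness Alexey anticipated.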

Mike Ford slapped Tony down for throwing out a potential feature simply
because it reminded him of another language. It took a while for Tony to
convince him that ‘too Perl-ish’ is simply shorthand for ‘too cryptic and
makes no sense because it duplicates already implemented functionality (more
than one way to do it, yeah)’. That apart, Mike’s main complaint was that
Tony was arguing against ‘a firm decision of an eminent group of PHP core
developers’ who were ‘committed to implementing this much-needed feature for
PHP 6’. Tony pointed out that most of the eminent developers had in fact
changed their minds since that meeting, as indeed they had about several of
those firm decisions, which were never firm decisions in the first place.
Tony wasn’t just prepared to argue; he would do all he could to block ‘such a
useless feature’ – it was just another syntax alias for substr(). Stas
wondered aloud what Mike “needed” it for, given that the functionality itself
already exists in PHP.

Andrei saw this whole exchange as an exercise in double standards,
particularly since Stas happened to be involved in a deep discussion about ‘a
very esoteric feature’ – class posing – at the time. He didn’t see how the
mention of square brackets justified a knee-jerk reaction. Stas retorted that
it was OK to discuss anything related to PHP on the internals list; ‘and if
you like me to be equal-opportunity knee-jerk-reactor, here it goes: I don’t
think both of these two are really needed :)’.

Larry Garfield wanted to know whether an ArrayAccess object currently works
with array_slice(). He reasoned that if it does, [x, y] would be purely
syntactic sugar. If it doesn’t, [x, y] would be a powerful new feature.
Alexey confirmed that it doesn’t, but Derick Rethans argued that it wouldn’t
actually make any difference whether it works with array_slice() or not.
Marcus Börger clarified this; ArrayAccess was designed not to work in array
functions. It should support array syntax, though, so it will support [x, y]
if the feature goes into PHP. However, Marcus added, he found that kind of
slicing ‘too Perl-ish’ too.

Alexey, having seen ‘Perl-ish’ explained multiple times by now, wrote that
he didn’t find the syntax cryptic at all – in fact, it appeared to him
to make some array algorithms more readable. Besides, it would make userland
implementations of array_slice() possible… Tony did his best
to see this as a genuine need, but failed miserably.

Short version: Too Perl-ish.


TLK: Solaris and getcwd()

One Rob Thompson turned up on the internals list, hoping to resolve a PHP bug
(http://bugs.php.net/bug.php?id=41822) he was seeing under
Solaris. In particular, he wanted to know why the PHP function
getcwd() sometimes fails. This occurs when a path component has
no read permissions, and appears to be a security feature in Solaris itself.
Rob had found that to get through this barrier directly on the system you’d
need to either execute a suid-root getcwd() or (wait for it)
“tell Solaris you already know where you are”. The way to achieve this was by
changing directories and using a fully-qualified path; if the attempt failed,
the library getcwd() would return NULL.

Having gained everyone’s sympathetic attention with that last piece of
information, Rob went on to ask questions. Firstly, he wanted to know, is
there any way for a non-root instance of PHP to “know where it is” in the
directory tree, which would make the directory-changing remedy possible?
Secondly, could anyone confirm that PHP’s include() requires the
POSIX library’s getcwd() in order to manage relative paths?
Thirdly, assuming that the former is possible and the latter correct, could
the Solaris workaround be used to make PHP’s getcwd() work?

Tony braved the deep waters. For PHP to “know where it is”, it would need to
call getcwd(), which plainly doesn’t work on Solaris. However,
Rob was correct in assuming that the POSIX getcwd() – or an
equivalent appropriate to the system – is needed in order for relative
include paths to work. The problem with the Solaris recommendation was that
it required you to know where you were prior to the chdir()
call. To know this, you would need to call getcwd()… Rob went
away and read some more. Much later, he confirmed that this was ‘a “chicken
or the egg” issue’. Worse, ‘with Solaris, there isn’t even any chicken’.

New question. Would a sensible approach be to check for a NULL
getcwd() return value, replacing it with "." (AKA ‘you are where you are,
wherever that is’) for the PHP getcwd() call?
Rob believed this would resolve the problem of relative include paths, but
wasn’t sure of the security issues arising from the solution. He later
offered a patch to achieve this, but Tony gently advised him to look at the
TSRM module, where the real work takes place, rather than trying to fix it in
the stream wrapper.
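
Rob's patch lived in C, where the library getcwd() returns NULL on failure. The same fallback idea can be sketched at script level, where PHP's getcwd() returns false instead. The wrapper name cwd_or_dot() is hypothetical, and the security question Rob raised applies to this sketch just as much.

```php
<?php
// Hypothetical wrapper: fall back to "." when the working directory
// cannot be determined (PHP's getcwd() returns false on failure).
function cwd_or_dot(): string
{
    $cwd = getcwd();
    return $cwd === false ? '.' : $cwd;
}

// A relative path can then be resolved against whatever came back.
echo cwd_or_dot() . DIRECTORY_SEPARATOR . 'lib.php', "\n";
```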

Short version: Solaris… gotta love it in order to live with it.


TLK: Class posing

Sebastian Bergmann had been reading books again, this time about something
called ‘posing’ in Objective-C. Basically, a class can completely replace
another class within an application; the replacement class poses as the
target class, and will receive all messages intended for the target class.
There’s a single restriction pertinent to PHP: ‘a class may only pose as one
of its direct or indirect superclasses’.

Earlier in the year, Johannes apparently implemented class posing for PHP as
a proof-of-concept exercise. Having seen that it can work
(http://news.php.net/php.internals/32538), Johannes, Marcus, Sara Golemon and
Sebastian had discussed where to put the functionality. There was a choice of
deserving PECL extensions – operator or runkit – but to make it viable for
projects like
Sebastian’s PHPUnit, it would need to be
in the PHP core. Sebastian could definitely use it, and would love to see it
in PHP 5.3. What did others think?

Guilherme Blanco was sure he could find a use for it too. However he
thought it might be better implemented using a magic method – something like
__new() – rather than by using a function to handle overloads.
Guilherme reasoned that this would free the class of inheritance issues,
making it better suited to super and extended classes. Strangely enough – or
perhaps it isn’t strange at all – __new() had been Sebastian’s
initial proposal to Johannes, but there had apparently been some performance
implications with that approach.

David Zülke wanted to know if it would be possible to overload final
classes? Sebastian replied that it would; the compile-time class declaration
isn’t affected. Class posing works by intercepting run-time object creation.

Richard Quadling and Stas both commented that Sebastian’s code ‘looks like a
factory pattern’, and asked why he couldn’t implement it in standard PHP.
Stas disliked the introduction of ‘very “magic” things’,
objecting that it would be impossible to know which class was being
instantiated by a call to new Foo(). In his opinion, class
posing belonged firmly in PECL. Sebastian explained that factory and
singleton patterns were simply examples of class posing at work; the reason
he wanted to use them in PHPUnit was to improve the mock objects system
there. For example, given:


public function foo() {
    $bar = new Bar;
    $return = $bar->doSomething();
    // do something with $return
    return 'some value';
}


he would like to be able to ‘stub out’ the Bar class and
have Bar::doSomething() return a pre-configured value. At
present, he can’t pass a stubbed version of the Bar class into
the method, so there’s no way to do that. Using class posing, he could
override the call to new Bar() and have it implement the stub
class.

Stas wondered how Java’s unit tests manage without allowing class
replacement. He wasn’t convinced that allowing it in PHP would be a good
solution, and suggested that Sebastian look to see how unit testing is
achieved without this feature in other OO languages. Sebastian pointed out
that stubs and mock objects aren’t the same thing as unit tests; they are
simply tools that allow better unit tests to be written. Stas asked if any of
the known unit test systems use them, and if so, how they are implemented.
Jared Williams believed they usually rely on the dependency injection
pattern, and produced links to both a Java implementation with a PHP port
and a more lightweight PHP implementation
(http://sourceforge.net/projects/phemto/). However, setting them up is
expensive in terms of implementation registration, reflection performance
etc. Stas thanked Jared and promised to look into them; ‘I am extremely
uncomfortable with an Engine change that would allow “new Foo()” to produce
an object that is not Foo.’
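
For comparison, the dependency injection pattern Jared mentions looks roughly like this when the code under test can be changed. It is a hand-rolled sketch, not the API of either linked library; Bar comes from Sebastian's example, while BarStub and Consumer are names invented here.

```php
<?php
class Bar
{
    public function doSomething(): string
    {
        return 'real result';
    }
}

// Stub with a pre-configured return value; as a subclass it still satisfies
// a Bar type hint.
class BarStub extends Bar
{
    public function doSomething(): string
    {
        return 'stubbed result';
    }
}

class Consumer
{
    private $bar;

    // The collaborator is handed in instead of being created with "new Bar"
    // inside foo(), so a test can supply the stub.
    public function __construct(Bar $bar)
    {
        $this->bar = $bar;
    }

    public function foo(): string
    {
        return 'got ' . $this->bar->doSomething();
    }
}

echo (new Consumer(new BarStub))->foo(), "\n";
```

This only helps when the constructor can be altered, which is exactly what cannot be assumed for third-party code.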

Timm Friebe suggested refactoring the source code. Not being able to do so
was rare, in his experience. That said, if Sebastian really needed to
intercept construction, he could write a file:// stream wrapper
or filter to intercept calls to include and
require, and replace calls to new Foo() with
newinstance('Foo') in the source.
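
The function such a rewrite would call might look like the sketch below. Only the name newinstance() comes from Timm's suggestion; the Posing registry, its methods and the FooStub class are assumptions for illustration, and the stream wrapper that rewrites the source is not shown.

```php
<?php
// Hypothetical registry: target class name => class posing as it.
final class Posing
{
    private static $posers = [];

    public static function pose(string $poser, string $target): void
    {
        // Mirror the restriction quoted earlier: a class may only pose as
        // one of its superclasses.
        if (!is_subclass_of($poser, $target)) {
            throw new InvalidArgumentException("$poser does not extend $target");
        }
        self::$posers[$target] = $poser;
    }

    public static function resolve(string $class): string
    {
        return self::$posers[$class] ?? $class;
    }
}

// What a rewritten "new Foo(...)" would call instead.
function newinstance(string $class, ...$args)
{
    $actual = Posing::resolve($class);
    return new $actual(...$args);
}

class Foo
{
    public function doSomething(): string { return 'real'; }
}

class FooStub extends Foo
{
    public function doSomething(): string { return 'stub'; }
}

Posing::pose('FooStub', 'Foo');
echo newinstance('Foo')->doSomething(), "\n";
```

Because the poser must extend its target, whatever newinstance('Foo') returns still passes a Foo type hint.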

Sebastian argued that you can’t refactor third-party code to comply with a
given pattern; class posing becomes an essential item at that point. Besides,
new Foo() could only produce objects that are in an
is_a relationship with Foo due to the restriction
he’d mentioned earlier, allowing typehints to continue working. Reflection
and __autoload() would continue to work without special
considerations.

Arne Blankerts pointed out that that restriction actually conflicts with
Sebastian’s earlier statement that it’s possible to override final classes.
He suspected it would also ‘cause serious trouble with anything marked
private’ in the original class. Sebastian corrected himself; a class may
pose for a target class that contains final methods. The posing class itself
would be unable to override final methods of its parent. Marcus agreed with
Arne that final should always be respected, and wrote that ‘we need to
investigate further’ concerning the behaviour of private.

Johannes caught up with the thread, and explained that the implementation
Sebastian had mentioned had been ‘simply a conference hack’ to see
whether class posing was possible and allow Sebastian to test it; ‘there
wasn’t much thinking involved’. (Heh.) In an extension – which this
implementation is – the best approach Johannes had found was to use
registration. A core implementation would allow a call to
__new() to be cached during the declaration, which would be
better in terms of performance. However, given that performance isn’t much of
an issue in a test environment, the question was whether the feature is
actually needed in the core or would be best implemented as an extension?
Johannes’ own feeling was that installing an extension should present no
problem for anyone able to perform unit tests using mock objects. He also
wondered about the feasibility of taking ‘the code coverage stuff
available in Xdebug and the
ZEND_NEW overloading implemented here, along with some other
bits and pieces, to create a phpunit extension. However, this was
hardly a topic for discussion on the internals list…

Short version: Watch out for PHPUnit arriving soon in a PECL near you (maybe).


RFC: VS 2005 support

Pierre-Alain Joye raised the subject of dropping support for the Microsoft
Visual Studio 6.0 compiler in PHP 5.3. Although it would have ‘a couple of
side effects’, this would be a one-time job that would make our lives
easier when dealing with Windows ever after.

Richard Quadling immediately asked if it wouldn’t be better to target the
MSVS 2005 Express Edition, which is a free-as-in-beer compiler. Daniel Brown
wondered which side effects Pierre anticipated? He also backed Richard’s
point about targeting the Express Edition, but didn’t know what impact this
might have in the long term.

Rob Richards recommended stringent testing. He’d recently run into an issue
when running an application built with VS 2005 and using DLLs built with
older compilers – the runtime linking is different – and PHP has an awful lot
of third party DLLs to consider. The particular issue Rob had found was when
the VS 2005-built application created a pointer to a file using
fopen(), which was then passed to a DLL built with an older MSVS
version, which then called fwrite(). The ensuing crash was caused
by the clash of two incompatible runtimes.

Stas noted that non-CL builds aren’t supported at all yet, but
the PHP build should be able to use cl.exe with any of the targets
mentioned so far. Did anything actually need changing for VS 2005? He added
that there may be some runtime issues; VS 2005 links with shared libraries
that might well be missing from older systems. This would need verification.
Marcus suggested either targeting VS 2003 (huh?) or supplying the
dependencies alongside PHP.

Andi thought it was worth another go (‘another’ because PHP’s official
Windows distributions dude, Edin Kadribasic, tried this some time ago, see
http://devzone.zend.com/article/1599#Heading8). In Andi’s experience, VS 2005
binaries are ‘significantly faster’ than
VC6 binaries. He added that the Zend team have already tackled some of the
issues arising from an in-house upgrade, and would be able to help out with those.

Marcus just wanted to drop ‘all the VC6 build files’ – presumably
referring to the .dsp files used by Visual Studio, rather than the
generic CL build system that can be used to build PHP under Windows
regardless of compiler version. He would like to have VS 2002 and VS 2003
work as well, ‘and VS 2007 is at the door already’. Pierre explained
about the build system, and that VS 2003 already works fine without any
changes. He would be happy to kill the .dsp files; ‘only Stas (or
Dmitry?) has given them some love lately’. However, the thing about VS
2005 was that it would require ‘a couple of important changes’ to the
build system, e.g. manifest support. Pierre had never tried the VS 2007 beta,
but believed that targeting VS 2005 would be adequate preparation for it.

Nuno Lopes intervened to report that he’d been using VS 2005 to build PHP for
quite some time, with no changes and no problems. He’d even built PHP against
some of Edin’s VC6-compiled third party libraries, again with no issues. That
said, he wasn’t using his homemade binaries in production – just for debugging
purposes. Andi reiterated Rob’s earlier point: unless all the third party
libraries are also compiled with VS 2005, there is a real chance of
problems arising when the data structures in the different runtimes clash.
Apache man William A. Rowe noted that it isn’t just the data structures; one
C runtime (CRT) can have localized resources that aren’t visible to others.
He cited ‘the faux-posix I/O’ as a prime example of this.

Turning to the subject of Apache, William explained that the
httpd binaries shipped by the ASF are built using VC6, and will
remain so for the lifespan of Apache 2.0/2.2. There was a fair chance of
their moving to VS 2005 for Apache 2.4, but – given the number of C library
issues occurring ‘in each iteration’ of the compiler – the Apache
builds are unlikely to upgrade beyond that any time soon. He added that Perl
is still shipped on the VC6 runtime, and Python on the VS 2003 runtime. It
would be a game of cat and mouse unless or until everyone moved to VS 2005.
The important thing was to clean up the .pdb files so that they would
import cleanly; without them it’s no longer possible to export .mak
build files for use outside the MSVS environment.

Andi mentioned that it’s possible to de-couple Apache from PHP by simply
using FastCGI. Marcus wrote that MS also recommend that approach. He went on
to echo William’s point that struct sizes aren’t generally an issue in the
Windows API, so much as ‘the POSIX stuff’ and new functions that don’t
exist in the older runtimes. Memory allocation is a particular problem; you
have to bind statically, and blocks malloc’d in one module can’t
be freed elsewhere. William replied that it’s possible to work around the
allocation issues, so long as the modules are well partitioned and have full
responsibility for freeing their own memory. However, one thing that would
trip up any project was binding tightly to third party libraries. OpenSSL, in
particular, can create ‘a mess all its own’ if compiled to use a
different CRT.

Short version: Spot the missing developer.


BUG: import NAME conflict

Benjamin Schulz reported that:


import Foo::Bar as DomDocument;
import Foo::Exception;
import MyStuff::Dom::XsltProcessor;


resulted in a fatal error, "Import name '...' conflicts with defined class".
He was somewhat bewildered by this; naturally he’d like to be
able to refer to Foo::Exception as simply Exception
within his own application. He wouldn’t be using a namespace there otherwise.
Was there some good reason for this behaviour?

Greg replied that this works fine so long as you import the global class too:


import ::Exception as Notused;
import Foo::Exception;


He agreed that this wasn’t very intuitive, but at least it was simple. Stas
mentioned that he wouldn’t recommend such code, but didn’t really have an
alternative to offer. Markus Fischer pointed out that you wouldn’t
necessarily know the names of global classes in third-party libraries in
advance anyway.

Benjamin wrote bluntly that Greg’s simple solution made the entire concept
seem broken. Greg retorted that import is best used within a
namespace, and if Benjamin just did that there’d be no need to import global
classes. Marcus wondered how Benjamin was defining his Exception
class? Something like:


class Exception extends ::exception { }


would be pretty ugly; besides, that class replaces a core functionality. Why
didn’t he just use the built-in Exception class, if his own
version was so general that it needed that name? Extended
Exception classes should be specialized anyway, in Marcus’ view.
Benjamin pointed out that the problem was nothing to do with exceptions per
se, it was more about global class names that he didn’t want to know about in
his own namespaced code. Did he really need to rename his namespaced
Exception class GenericException? How about
GenericXsltProcessor? What about future SPL classes? What
happens when a PECL extension declares a class in global space that hasn’t
even been thought of yet? This implementation guaranteed that future PHP
releases or extensions would break even applications that use namespaces.

It took an embarrassingly long time for Benjamin to get his point across,
even with Markus and Moritz Bechler’s assistance, mainly because he’d made
the mistake of using an inherited Exception class to illustrate
the problem. Poor lad.

Greg realized a few days later that ‘Benjamin has in fact unearthed a bug
in the implementation of import’, and explained it all over again…
still using Exception:


namespace Foo;
import Blah::Exception;
$a = new Exception;


should be equivalent to:


namespace Foo;
import Blah::Exception as Foo::Exception;
$a = new Foo::Exception;


but wasn’t. Greg also posted hints for a fix, as he didn’t have time to put
the patch together to fix it himself.
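
The import keyword and :: separator shown above belong to the development tree of the time; released PHP versions spell the same thing with use and a backslash separator. Under that assumption, the behaviour Greg expected can be sketched in a single file using the bracketed namespace form. The namespace names are taken from his example.

```php
<?php
namespace Blah {
    class Exception extends \Exception {}
}

namespace Foo {
    use Blah\Exception;

    // The unqualified name resolves through the import rather than to the
    // global \Exception class.
    $a = new Exception('from Blah');
    echo get_class($a), "\n"; // Blah\Exception
}
```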

Stas argued about what that piece of code should be equivalent to, but you
could tell his heart wasn’t in it by the way he ended up agreeing with Greg.
Thankfully, Stas was able to translate the problem into pure internals-speak:
‘unqualified lookups inside namespace should also take imports into
account’.

Dmitry still had issues with the example, and asked for another. Greg
obliged, and once more explained the fix. Later, he decided it would be
quicker to put both the example and the patch in a bug report
(http://bugs.php.net/42859) than keep explaining
both. Dmitry double-checked, and confirmed that both the bug report and the
patch appeared correct. He promised to give them closer attention, and
thanked Greg for his efforts.

Short version: The PEBCAK that wasn’t.


TLK: Exceptions in autoload

Moving swiftly on from there, Greg asked if anyone had a link to the archive
where it’s explained why the executor is unstable after an exception is
thrown in __autoload(). Marcus didn’t recall any particular
thread, but did recall Andi explaining why some things are unstable when
exceptions are pending. Much had changed since then, and Marcus thought now
might be a good time to re-investigate the issue. However, if it meant ending
up with a Java-like exception stack, he’d rather have PHP’s current behaviour.
Greg agreed; you can get around the problem of a fatal error arising from an
uncaught exception in the current setup, to some extent, by using
die(new Exception(...)). He was just hoping to understand
__autoload() internals a little better.

In fact, Greg was hoping to avoid using that rather dodgy die()
trick. He was offering (in another thread) a patch that introduced a new
function, in_class_exists(), which would return
TRUE if __autoload() were called by
class_exists(). It would allow his autoload handler to return if
the existence of a class were queried, and die with the appropriate exception
where an E_ERROR would normally result.

Marcus was bemused; class_exists() should simply return
FALSE on failure. Why make it more complicated? Was Greg hoping
to avoid the time taken to load and compile the class? Wouldn’t he need to do
that at some point anyway, if the class existed? Besides, it’s possible to
avoid a call to __autoload() altogether by calling:


class_exists($classname, false);
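
The effect of that second argument is easy to observe. The sketch below registers its handler with spl_autoload_register() rather than defining __autoload(), an assumption made so that it also runs on PHP versions where __autoload() is no longer available. The class names are placeholders.

```php
<?php
$attempts = [];

// Record every class name the autoloader is asked for; it never loads anything.
spl_autoload_register(function (string $class) use (&$attempts) {
    $attempts[] = $class;
});

var_dump(class_exists('MissingOne'));        // false, after consulting the autoloader
var_dump(class_exists('MissingTwo', false)); // false, autoloader skipped

print_r($attempts); // only MissingOne was requested
```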


Greg explained the scenario he was trying to deal with:

new PEAR2 user Joe User downloads PEAR2 package Blah, which depends
on package Foo, but does not download Foo/something happens and Foo
is erased accidentally/whatever.

Joe, not knowing anything about PEAR2 package Blah, is just trying
it out to see how it works, and so does the drill of:

<?php

include '/path/to/PEAR2/Autoload.php';

$a = new Blah;
$a->doSomething();
$a->doSomethingElse();

?>

This script results in: Fatal Error: class PEAR2::Foo not found in
/path/to/PEAR2/Blah/SomeinternalClass.php on Line XX

Not very useful. PEAR2 knows what the problem is, and also where
Foo should be, but can’t pass along that information to Joe
since there’s no way to safely pass error information out of
__autoload(). Basically, __autoload() just isn’t
very helpful when it comes to debugging, and adding the ability to check
whether a call to class_exists() is the source of the
autoloading would go a long way to resolving that problem. Simply not
allowing class_exists() to call __autoload(), on
the other hand, would not.

Johannes suggested that


function __autoload($a) {
    $bt = debug_backtrace();
    if ($bt[1]["function"] == "class_exists") {
        echo "in get_class";
    }
}


should work quite well, without the need for obscure engine hacks. Greg
explained that he does that already. He saw it as ‘an ugly, unnecessary
hack with lots of potential pitfalls caused by the inability to customize an
error message when a class doesn’t exist’; it also has a nasty
performance hit. Johannes doubted that performance is important in an error
handler; besides, he reckoned the performance issue there was a result of the
search through the file system anyway. Marcus backed him; after all, the
message is only shown when the target class is not present.
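
A variant of Johannes' backtrace check that runs on current PHP might look as follows. It is a sketch under two assumptions: the handler is registered with spl_autoload_register(), and the position of the class_exists frame is not relied upon, since the frame depth differs between PHP versions. The class names are placeholders.

```php
<?php
$log = [];

spl_autoload_register(function (string $class) use (&$log) {
    // Collect the function names of all callers and search them, instead of
    // indexing a fixed frame.
    $callers = array_column(debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS), 'function');
    $log[$class] = in_array('class_exists', $callers, true);
});

class_exists('ProbeOnly');      // a query: the handler can simply return

try {
    new ReallyNeeded();         // a real use: the handler could report details
} catch (Error $e) {
    // PHP 7 and later raise a catchable Error for the missing class
}

print_r($log);
```

The handler learns which case it is in, but the cost is a full backtrace on every autoload call, which is the performance hit Greg objected to.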

Short version: Best investigate the stability issue then.


TLK: Taint support: first results

Wietse Venema, who proposed taint support for PHP almost a year ago
(http://devzone.zend.com/article/1525#Heading5), wrote to internals@ to
update the development team on his progress
(http://news.php.net/php.internals/32576). Although he’d had to adapt his
original plan
somewhat, he now had an initial implementation that adds taint support to the
core, a selection of built-in functions and a couple of extensions. The good
news was that performance was much better than anticipated; Wietse reported
an overhead for make test in the 1% – 2% range. He planned to
release his work for review in the near future, since feedback was now
needed. Although rough at present, the code already had the potential to
manage such tasks as labeling sensitive data.

The taint implementation is controlled by a single INI directive,
taint_error_level, which is an INI_ALL setting.
Setting it to, for example, E_WARNING would make this script:


<?php
$username = $_GET['username'];
echo "Welcome back, $username\n";
?>


output:


Welcome back, xxx
Warning: echo(): Argument contains data that is not converted with htmlspecialchars() or htmlentities() in /path/to/script on line 3

The directive would of course be switched off by default. Taint mode aims to
be context sensitive, offering up advice about escapeshellcmd()
and mysqli_real_escape_string() where appropriate. This is
achieved by adding binary properties (bits) to the unused areas of the
zval struct. The bits currently are named TC_HTML,
TC_SHELL, TC_MYSQL, TC_MYSQLI,
TC_SELF, TC_USER1 and TC_USER2; there
is room for at least another 16 bits, assuming a 32-bit compiler. These bits
are set internally, using taint_marks_* or
taint_checks_* parameters as appropriate. (This part wasn’t very
clear to me either – it looks as if new internal macros are used.) There is no
interface for the TC_USER* bits at present, but the plan is to
make them available at application level. Wietse went on to explain the
propagation rules (conversions from integer to string and pure arithmetic or
string operations retain all taint bits; conversions from string to integer
remove most of them; comparison operators ignore them) before covering the
problem areas he’d already identified. These included functions like
parse_str(), and the problem of empty strings. Also, support for
tainted objects is not yet complete, and object-to-other type conversions, in
particular, may lose taint bits. Finally, Wietse warned that those areas of
PHP (and extensions) that don’t use the correct macros to initialize
zval structures are likely to be problematic when taint checking
is turned on, since they will leave taint bits at uninitialized values.
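
The bit-flag idea can be modelled in userland to make the propagation rules concrete. This is only an analogy: the real implementation stores the bits in the zval, the numeric values below are invented (the source gives only the names), and the rule that htmlspecialchars() clears just the HTML bit is an assumption drawn from the warning text. The Tainted class and safe_for_html() are hypothetical.

```php
<?php
// Hypothetical bit values; the source gives only the names.
const TC_HTML  = 1 << 0;
const TC_SHELL = 1 << 1;
const TC_MYSQL = 1 << 2;

final class Tainted
{
    public $value;
    public $bits;

    public function __construct(string $value, int $bits)
    {
        $this->value = $value;
        $this->bits  = $bits;
    }

    // String operations retain all taint bits of both operands.
    public function concat(Tainted $other): Tainted
    {
        return new Tainted($this->value . $other->value, $this->bits | $other->bits);
    }

    // Assumption: a conversion function clears only the bit for its own context.
    public function htmlEscaped(): Tainted
    {
        return new Tainted(htmlspecialchars($this->value), $this->bits & ~TC_HTML);
    }
}

// Stand-in for the check that echo would perform.
function safe_for_html(Tainted $t): bool
{
    return ($t->bits & TC_HTML) === 0;
}

$name = new Tainted('<b>xxx</b>', TC_HTML | TC_SHELL | TC_MYSQL); // as if from $_GET
$msg  = (new Tainted('Welcome back, ', 0))->concat($name);

var_dump(safe_for_html($msg));                // false: would trigger the warning
var_dump(safe_for_html($msg->htmlEscaped())); // true, though TC_SHELL is still set
```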

David Wang was a little twitchy about those spare zval bits; he
uses three of them in his garbage collection patch. He agreed that there is
room for a lot of free bits, but thought Wietse should be aware that
increasing the size of the zval struct leads to L1 cache misses.
Wietse copy-pasted the paragraph he’d written about the 16 extra bits, and
explained that he’d found micro benchmarks overly processor dependent – hence
his choice of macro benchmarking.

Marcus wrote that he liked the INI approach, and of course the benchmark
results. He was less certain about the database specificity; if it couldn’t
be avoided, having ‘something for PDO’ would be a good move. He didn’t
like the name TC_SELF for something that checks the source of
calls to internal control operations such as eval(), and
suggested that TC_PHP might be more appropriate. Regarding the
evidence of poor macro usage, Marcus mentioned that David had encountered the
same problem. Marcus had come to the conclusion that macro usage should be
enforced, and direct access to zval members disallowed, by evil
means (new zval member prefixes to break non-compliant code).
The nice way to do it is too slow…

PHP user Laurent Jouanneau pointed out that a PHP application can generate
several kinds of output other than HTML: JSON, CSV and PDF, to name but a
few. How could Wietse’s code guess the output type? And if it couldn’t, how
could the warning be disabled where it was inappropriate? Wietse wrote
something flippant about not using echo. M. Sokolewicz called
him on it, and gained the explanation that the code to create PDF output
doesn’t have taint-labeled data; the labels actually need to be put there at
the point of data creation. Rasmus Lerdorf commented that this didn’t make
much sense to him, and gave Wietse a much-abbreviated chunk of common-enough
PHP code to consider:


$user_data = $_REQUEST['data'];
switch ($output_format) {
    case 'html':
        echo "<html>$user_data</html>";
        break;
    case 'xml':
        header('Content-type: text/xml');
        echo "<xml>$user_data</xml>";
        break;
    case 'json':
        header('Content-type: application/json');
        echo json_encode(array($user_data));
        break;
}


$user_data is tainted, but the untainting rules are very different for
those three cases, and… an error that talks about HTML escaping only makes
sense in the html case.

Wietse wondered where ‘the output format feature’ was documented.
Rasmus educated him. It didn’t take too long before Wietse asked about the
Content-Type header. Rasmus agreed that this would work in most
cases, but added that output buffering would break it, since
echo doesn’t output anything before the output buffer is flushed,
and it might never be flushed. Wietse acknowledged that the practice
of setting Content-Type immediately prior to flushing the buffer
would be incompatible with taint checks. It would also be ‘prohibitively
expensive’ to apply taint policy to the contents of the output buffer;
you’d need to record which function, argument, file and line each byte of
data came from, as well as its taint labels.
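
The ordering problem Rasmus described can be sketched as follows. This is only an illustration of the practice under discussion, not code from the thread: $user_data is given a stand-in value, and ob_get_clean() is used in place of a flush so that the sketch itself sends nothing.

```php
<?php
$user_data = 'example';            // stand-in for data taken from the request

ob_start();                        // from here on, output goes into a buffer
echo "<xml>$user_data</xml>";      // echo runs now, before any Content-Type is known

// Only at this point does the script declare what it has been producing.
header('Content-type: text/xml');

// Collect the buffered bytes. A real script would typically flush them
// instead, or possibly discard them without ever flushing.
$body = ob_get_clean();
```

A taint check performed inside echo would therefore run before the header that says which escaping rules apply.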

Stut suggested that taint should simply assume HTML but provide a way to
specify otherwise, either with a specific function call or via
ini_set(). Wietse felt it best not to overburden the interface
with switches and functions; he preferred to rely on header information to
pick up the MIME type. ‘I just need to hook into the header() function and
do a little parsing’, he wrote, thanking Rasmus for his explanation. Stut
meanwhile was trying to think of a situation where the tainting might get in
the way of determining the requested output format, but couldn’t come up with
any.
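
For a rough idea of what that ‘little parsing’ might involve, here is a sketch under stated assumptions. mime_from_header() and escaping_context() are hypothetical names, the mapping table is invented, and falling back to HTML when no Content-Type has been seen follows Stut's suggestion (and PHP's shipped default_mimetype of text/html) rather than anything confirmed about the patch.

```php
<?php
// Hypothetical helper: pull the MIME type out of a raw header line,
// or return null if the line is not a Content-Type header.
function mime_from_header(string $header): ?string
{
    if (!preg_match('/^content-type:\s*([^;\s]+)/i', $header, $m)) {
        return null;
    }
    return strtolower($m[1]);
}

// Hypothetical mapping from MIME type to the escaping context that a
// taint warning should talk about.
function escaping_context(?string $mime): string
{
    if ($mime === null) {
        return 'html';             // assumption: no header seen means HTML
    }
    $map = [
        'text/html'        => 'html',
        'text/xml'         => 'xml',
        'application/xml'  => 'xml',
        'application/json' => 'json',
    ];
    return $map[$mime] ?? 'unknown';
}

$context = escaping_context(mime_from_header('Content-type: text/xml'));
```

With Rasmus's three cases, the html branch sets no header and falls back to 'html', while the other two resolve to 'xml' and 'json'.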

Greg wondered if TC_SELF would be applied to stream data. Wietse
explained that it wouldn’t by default, but could be configured that way. For
the time being, only data from the Web is treated as hostile; all other
external data simply needs to be escaped when used in HTML, shell or SQL. In
case anybody wasn’t clear on this point, he added that the current taint bits
and marking policies are simply a first step; they are liable to change as
Wietse becomes more aware of common practices in PHP.

Short version: If you’re planning to test this, be aware that it’s
not pretty just yet.


CVS: Getopt in, ereg out of the core

Changes in CVS that you should probably be aware of include:

  • Zend Engine bugs #42798
    (__autoload() not triggered for classes used in method
    signature), #42802 (Namespace not
    supported in typehints), #42819
    (namespaces in indexes of constant arrays) and #42820 (defined() on
    constant with namespace prefixes tries to load class) were fixed in 5_3 and
    HEAD [Dmitry]
  • Core bugs #42789
    (join() warning messages are not proper and different return
    value in PHP 5/6) and #42142
    (substr_replace() returns FALSE) were fixed
    [Jani]
  • getopt() is now available on Windows, following its move
    from SAPI level to the PHP core in 5_3 and HEAD. It also now supports long
    options. [written by David Soria Parra, committed by Jani]
  • PHPAPI function php_prefix_varname() is now available in
    the PHP_5_3 branch (affects internals only) [Jani]
  • In ext/json, bug
    #42785
    (json_encode() formats doubles according to locale
    rather than following standard syntax) was fixed in 5_2, 5_3 and CVS HEAD
    [Ilia]
  • In ext/xsl a new method for profiling stylesheets,
    xsl->setProfiling(), is available in the 5_3 branch and CVS HEAD
    [Christian 'Chregu' Stocker]
  • Core bug #42752 was fixed in
    5_2, 5_3 and HEAD following improvements to the recursion detection in
    array_walk() (memleaks remain.) [Tony]
  • In CVS HEAD, strcspn() now behaves the same way in both
    Unicode and native mode, fixing bug
    #42731
    [Tony]
  • Zend Engine bugs #42772
    (Storing $this in a static var fails while handling a cast to
    string) and #42818 ($foo = clone(array());
    leaks memory) were fixed in PHP_5_2, PHP_5_3 and CVS
    HEAD [Dmitry]
  • The internal function php_fgetcsv() gained an
    escape parameter in PHP_5_3 and CVS HEAD, closing bug #40501. This change
    impacts core function fgetcsv(), and the SplFileObject methods
    fgetcsv() and setCsvControl(). The default setting
    in all cases is '\\'. [David Soria Parra]
  • \u, \U and \C are no longer
    supported in single quotes in CVS HEAD, closing bug #42746 [Tony]
  • In ext/pgsql, bug
    #42783
    (pg_insert() does not accept an empty list for
    insertion) was fixed in 5_2, 5_3 and HEAD [Ilia]
  • Zend Engine bug #42817
    (clone() on a non-object does not result in a fatal
    error) was fixed in 5_2, 5_3 and HEAD [Ilia]
  • lcov 1.6 is now officially supported in 5_2, 5_3 and
    HEAD [Nuno]
  • The core regex functions dating back to PHP 3
    (ereg[i](), ereg[i]_replace(),
    split(), join() and sql_regcase())
    were moved to their own extension, ext/ereg, from 5_3 [Jani]
  • In ext/ldap, ldap_set_option() gained two new
    possibilities, LDAP_OPT_NETWORK_TIMEOUT or (for the Netscape
    LDAP SDK) LDAP_X_OPT_CONNECT_TIMEOUT in 5_3 and HEAD, fulfilling
    feature request #42837 [Jani]
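
The new escape parameter on fgetcsv() from the list above can be tried out from userland roughly like this. It is a small sketch that assumes a PHP build where fgetcsv() accepts the fifth argument; the sample line is made up.

```php
<?php
// Made-up sample line: the first field contains backslash-escaped quotes.
$line = '"a \"quoted\" word",b' . "\n";

// Use an in-memory stream so the sketch needs no file on disk.
$handle = fopen('php://memory', 'r+');
fwrite($handle, $line);
rewind($handle);

// Arguments: stream, length (0 = no limit), delimiter, enclosure, escape.
$row = fgetcsv($handle, 0, ',', '"', '\\');
fclose($handle);
```

Because the escape character stops the inner quotes from closing the field, the line should come back as two fields rather than being split inside the quoted text.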

In other CVS news, Pierre formally crowned our new RM; he gave Johannes write
access to all relevant cvs.php.net modules.

Short version: A stupidly busy week.


PAT: Missed one (or two)

David Soria Parra had a bit of a week of it, with two successful patches
going into CVS as reported above and a third, unsuccessful patch nonetheless
leading to a Zend Engine fix.

One Bill Moran notified the list with the information that he’d added a
one-line fix to the report for bug
#42637
(SoapFault: Only http and https are allowed), and
hoped to see it checked in before the next PHP_5_2 branch release. As you’ll
gather from the length of this summary, this was a pretty hectic week on
internals@; Bill’s patch escaped list attention.

Rui Hirokawa continued the world’s most drawn-out conversation – did you ever
see that Red Dwarf episode where the mail pod arrives three million years
late and one of the parcels contains a video of a chess move? the first
move? – actually, from this post it looked like Rui had reached a possible solution
to the _HALT_COMPILER() problem in ext/mbstring. He
offered to disable detect_unicode by default, assuming there
were no objections. We might have to wait a while for François’
reaction, though.

Short version: Bill’s patch – and one from Greg earlier – didn’t get any
feedback.

2 Responses to “Zend Weekly Summaries Issue #361”

  1. sniper Says:

    Actually join() wasn’t moved anywhere. :)
    Just split() / spliti() which require the old regex library.

    –Jani

  2. bweirdan Says:

    >> The core regex functions dating back to PHP 3 (ereg[i](), ereg[i]_replace(), split(), join() and sql_regcase()) were moved to their own extension, ext/ereg, from 5_3 [Jani]

    join is out of the core? are you serious? What does it have to do with ereg*? join always was an alias to implode