
User talk:SteveBaker



:::::::: [[User:SteveBaker|SteveBaker]] ([[User talk:SteveBaker#top|talk]]) 20:47, 10 January 2009 (UTC)

:::::::::I'm happy with that but if you ever see a hardware/software combination other than DNA capable of self-replication, even if it's not sexual, let me know. If you haven't seen it already, I thought you might get a kick out of this: [http://xkcd.com/505/ xkcd].--[[User:OMCV|OMCV]] ([[User talk:OMCV|talk]]) 22:01, 10 January 2009 (UTC)


== Flowchart ==

Revision as of 22:01, 10 January 2009

NOTE: I know some people carry on conversations across two User talk pages. I find this ludicrous and unintuitive, and would much prefer to follow Wikipedia's recommendations (see How to keep a two-way conversation readable). Conversations started here will be continued here, while those I start on other users' pages will be continued there. If a user replies to a post of mine on this page, I will either cut/paste the text to their page, or (more likely) copy/paste from their page to this one and continue it here.

new WP:RDREG userbox

This user is a Reference desk regular.

The box to the right is the newly created userbox for all RefDesk regulars. Since you are an RD regular, you are receiving this notice to remind you to put this box on your userpage! (but when you do, don't include the |no. Just say {{WP:RD regulars/box}} ) This adds you to Category:RD regulars, which is a must. So please, add it. Don't worry, no more spam after this - just check WP:RDREG for updates, news, etc. flaminglawyerc 22:07, 5 January 2009 (UTC)[reply]

Hmmm - we also have:
This WP:RD denizen is:
  • [x] Scaly
  • [  ] Omniscient
  • [  ] Benevolent
...with a bunch of silly options for checking different boxes.
SteveBaker (talk) 23:35, 5 January 2009 (UTC)[reply]

learn how to spell

it's -> its
American's -> Americans

-lysdexia 17:31, 6 January 2009 (UTC) —Preceding unsigned comment added by 69.108.164.45 (talk)

Informal communication - you knew what I meant. Meh. SteveBaker (talk) 17:59, 6 January 2009 (UTC)[reply]

Are we cool?

Hi Steve. Someone on the Ref Desk talk page raised a suggestion that I had tried to 'stifle debate' on a previous policy proposal of yours. I'm assuming that they meant this one. While I disagreed with your proposal at that time, I also aimed to treat you with courtesy and respect; I also had no intention of trying to close off the discussion. Please let me know if you think I was being a jackass; if you found my remarks offensive or my tone imperious, I apologize most sincerely.

I genuinely appreciate the effort that you put into the Ref Desk (both its content and its policies), and wouldn't want to lose your hard work or your insightful approach to policy.

And now I've gone and done it again. I think that your proposal is pretty reasonable, but I fear that it's vulnerable to gaming and ruleslawyering. As is always the case, your mileage may vary, and I may be worried about nothing. I think, though, that your proposal is aiming to codify something that can already be comfortably done – in the relatively rare circumstances where it might be necessary – using equal parts common sense and WP:IAR.

Postscript: I seem to keep addressing you as 'Steve' — if you're finding it presumptuous, rude, or plain annoying, let me know. Cheers! TenOfAllTrades(talk) 23:50, 7 January 2009 (UTC)[reply]

huh! I didn't think you stifled debate at all. So - no, I'm not upset about anything. Weird. I have a tendency (typical of Asperger's people) to feel happier when things are clear-cut and concise - so I like extra rules. This is a fault - and I recognise it. So often I'll propose a rule (because I like that kind of thing) with the expectation that nobody else will be keen on it.
And please - call me "Steve" - that's my name (well, no - actually, it's "Stephen" - but only my mother calls me that).
So - I don't know why all of this seemed necessary - I thought we were getting on just fine (but again, that's another Asperger thing - I have no clue about all this interpersonal stuff!)...now I'm left wondering what *I* might have said to give the impression that I was upset?! Urgh! Give me computers any day!
SteveBaker (talk) 00:50, 8 January 2009 (UTC)[reply]
Don't worry; I didn't get the idea that you might be upset from anything that you said or did. DuncanHill asserted recently that I was trying to stifle the discussion at WT:RD on your last proposal, and I wanted to make sure that things were clear between us. You've always come across as a calm, cool guy, and I didn't want to be standing on your toes while you were being too polite to raise a fuss.
I've occasionally said things on Wikipedia talk pages which come across as much more unpleasant than they ought to be. I've been upset, or I've misread someone else's remarks, or I've just phrased something badly. When I looked back at what I wrote earlier, I didn't think that I had said anything out of line, but I figured I should check my calibration.
By the way — don't stop trying to find ways to improve our policies. I can tell that you're looking for ways to help people navigate Wikipedia as smoothly as possible. As an admin, I (perhaps unfortunately) tend to examine policy with an eye for ways in which it is likely to be abused; I don't want to discourage your efforts. Happy editing! TenOfAllTrades(talk) 03:31, 8 January 2009 (UTC)[reply]
Yeah - it's VERY easy to misread intent in any online communication, so to err on the side of WP:AGF starts to become a habit! I think it's essential that Policy (with a big 'P') be as safe from abuse as it can be. But the WP:RD stuff is Guidelines - and for that stuff, ignore all rules definitely applies. What we need is a way to say to people "what you did is considered wrong by our community - please don't do it again" - and when someone ignores a guideline for perfectly logical, sensible reasons, we can say "pheh - it's only a guideline". It's not policy precisely because that kind of flexibility works when we all AGF and act together to tell people when they are doing wrong. There are very few - if any - unreasonable people working on the RD. The questioners however...yeah...well...that's a different matter.
I don't know how Duncan came by the impression that I was upset or that discussion was stifled...how the heck DO you stifle discussion on an open forum like the RD talk page anyway?! Certainly I didn't get that impression. My suggestion was discussed - the consensus to implement it clearly wasn't there - game over. Move on, try to find another way to make things run more smoothly.
IMHO, the proposal was worthwhile. I've been outspoken in the past about people changing, redacting or otherwise messing with people's answers to questions. That's dangerous because (especially on places like the Science and Math desks) the tendency to want to correct a "wrong" answer is very strong - and the consequences of people editing and/or deleting each other's answers is potentially dangerous to our ability to provide good answers. There have been many cases when I've come back to the RD after a couple of days away and seen a long series of replies to a question - all of which were flat out wrong - and I needed to step in and explain why. If any of those people wants to stick by their guns and is allowed to delete (or worse still, edit) my reply then that would be a disastrous thing for the lone truth-teller. On the other hand, there have been several occasions (one of which is personally very embarrassing - when I got Newton's law of cooling wrong!) when what I have said has been loud and impassioned AND flat out wrong...and it's just as important that I didn't delete the original (correct) answer in the process. So deletion and amendment of answers is something I'm passionate about wishing to disallow - that should be a line that we don't ever cross.
The deletion (but NOT amendment) of questions and (perhaps) their entire thread of answers is a different matter. We have rules about medical/legal/homework questions that are actually very important to our mission at the RD - and whilst our respondents are almost all smart, reasonable people - our questioners are all over the map: Vandals, trolls, annoying little kids who want to talk about sex, people with so little English skills that they are all-but incomprehensible...all of these things are par for the course. I'm not even opposed to removing all of the answers (all or none though!) if the question itself is deleted. In this case we are "un-asking" the question - and replies to an un-asked question are not necessary - so they can go too. But I'm still passionate about not CHANGING what the questioner wrote - those words belong to them - rightly or wrongly. So it should be an all-or-nothing thing. Either the question stays or it goes. In general, it should stay - people can always choose not to answer it - and that works reasonably well in practice. The time pressures on the RD are what makes this a pain to deal with. We need mechanisms to allow prompt, approximate decision-making - with Wikipedia consensus-making not getting in the way of the time pressures...yet still using consensus to decide (possibly after the fact) whether this was a good call. In that way we slowly build up "case law" about what we accept and what we delete - which can be used to help the fast/approximate decision-making do its thing. This is not all that different from the way that AfD works (for example) with the idea of there being rapid deletion of articles that are "obviously" unacceptable based on a kind of "case law" of what we accepted in the past - with an appeals process and a means for more difficult cases to get pushed through full-scale consensus building.
SteveBaker (talk) 13:29, 8 January 2009 (UTC)[reply]

Nuclear battleships

Hi Steve, with respect to this, yes indeed all steel is contaminated to a certain degree since the onset of atmospheric nuclear explosions. This is due to use of atmospheric air and atmospheric-derived oxygen in iron- and steel-making. The atmosphere was and is contaminated with fallout radionuclides.

I know this because it was one of the very first questions I dared to ask at SciRef. I read it in SciAm, probably in the '80s or early '90s (long before I dropped my subscription after SciAm turned into a fluff-rag; they used to have such excellent diagrams plus Gardner and Hofstadter and hard science articles). The tickler for me was their piece describing how steel used in satellites came exclusively from old battleships, so as not to disrupt the sensitive on-board equipment. The difference being that old steel can be remelted without introducing new oxygen, but you can toss other alloying agents in as a solid.

You could find my thread by searching SciRef within the last 12-18 months, title I believe was "Battleships in Space" (or cap/locase variants). I also have a ref discussing how to pin down the age of corpse-teeth by radionuclide content from Nature journal, which I thought I'd added to an article but can't find now. Anyway the effect is real, sorry for being a little late bringing this up! Franamax (talk) 06:36, 8 January 2009 (UTC)[reply]

Note that the radionuclide content in teeth comes from the atmosphere via food, not from the journal itself. :) Franamax (talk) 06:43, 8 January 2009 (UTC)[reply]
Cool! Well, I was merely skeptical about this business - I pointed out (repeatedly) that I didn't know for sure and that it basically just seemed a little 'fishy'. Knowing that excluding atmospheric oxygen from the smelting process is the key goes some way to explaining what's going on - although it still sounds very wrong to me. Thanks for the correction! SteveBaker (talk) 13:32, 8 January 2009 (UTC)[reply]
Hi Franamax - if you're still looking for the age-by-anthropogenic-radiocarbon paper, it's here: [1]. TenOfAllTrades(talk) 14:43, 8 January 2009 (UTC)[reply]
Thanks TOAT, I have a Nature subscription, so I can search their archives (I do that early and often :) It needs to be incorporated somewhere into one of our various articles on "Effects of nuclear testing". I know I made some kind of change to something related at the time, and I discussed with someone creating an updated graph of atmospheric radionuclide levels, but bugger if I can find it now. There's a downside to getting involved in too many things, the loose ends get overwhelming... :(
Steve, there's no correction involved, since you made no assertions. Think about it though - air is needed to smelt iron and oxygen is needed to make new steel, so unless you have a really major fractionation process to purify the air/oxygen, you have to live with the contaminants. That's normally not a problem, except when you're building an enclosure for a really sensitive neutron/alpha-particle/beta-particle/gamma-ray detector. You can either characterize the emissions from your "bad" steel and subtract from the data, or find better steel.
As it happens, there is a source for "good" steel, you just have to cut it off a sunken battleship. Then you re-melt (or re-heat) it under controlled conditions and shape it to what you need. I imagine you'd still need to deal with the surface oxide layer (hot-form then pickle&oil maybe) but you avoid the many steps needed to fully purify the process air. It's one of those cool things that catch my eye and clutter up my brain. I'd love to find that reference, but I have no special access to SciAm, I dropped my subscription around the time teh interwebs was being inventitated. :) Franamax (talk) 02:42, 9 January 2009 (UTC)[reply]

A way too long reply

This is a little late but I know you always like to talk about science. This is an extension of the dialogue on "sexual reproduction would never work" - sorry I didn't respond there but I got sidetracked. I still think your last post was way too reductionist in treating a base pair as a bit of information. I can see that I wasn't giving practical examples, and was just being way too abstract. What I wanted to say was that not all DNA sequences are equal. Some bonding patterns are more robust and are less likely to face replication errors. Some bases are more susceptible to methylation - cytosine, for example, and it's even more susceptible at a CpG site. There are weak sections of DNA between the genes that have been proposed to be there simply to accept damage, from light and potentials, in order to protect the sections that require high fidelity. The structure of a base pair and where it sits in a codon appears to be directly related to which amino acid it codes for - for instance, if there is a uracil in the second position the codon will always code for a nonpolar amino acid. You could argue that this has to do with how it interacts with a transcription agent, but you wouldn't have the support of the biochemists (check "Theories on the origin of the genetic code" on the genetic code page). Synthetic base pairs without meaning in the four-base system can be introduced. Some of the proofing proteins appear to be designed to remove natural base pairs in the wrong place and ignore unnatural ones (that's based on a PhD thesis defense I went to a few years back). These are just a few of the known examples where DNA behaves as more than just 00, 01, 10, or 11; I'm pretty sure in a pure four-symbol system these other functions would be independent. I wish I could cite everything for you but I don't have time, and I suspect you already know much of this. I'm sure there are similar examples for things that happen at PN junctions, but I bet most of them just result in failures of the system's function.
In contrast, these asymmetries in the nature of DNA "bits" have been incorporated into the system's function. DNA can be considered a digital system which is built on a substantially more complex system; a complex system which significantly influences the function of the digital system. I don't know how you would describe this system in "information theory" but I expect it gets a distinction from data that is stored without such operational conditions. I'm interested to see if you have anything more to offer on the subject. Personally, I am always bothered by reductionism in the treatment of things like DNA and neurons. People act as if they understand how "action potentials" combine, as if they knew how dendrites' "math" works. I'm referring to passive cable theory, which in many textbooks is treated like gospel. In reality the neuron is doing a ton more than is described by this model - a good thing, or we probably wouldn't be having this conversation. I guess the normal audience at the reference desk usually just needs the intro-textbook explanation. Finally, your arguments carry a lot more weight when you don't make personal comments. I'm well aware of what I do and don't know in science. I don't know any more than the average layman about information theory, which I would guess is next to nothing; but I know a fair amount about DNA and PN junctions.--OMCV (talk) 02:09, 9 January 2009 (UTC)[reply]

The whole "DNA is binary data" and "mish-mash of software" thing missed a few points. The "software combination" involved in meiosis, gamete formation and zygote formation is actually controlled by a pre-existing "software" mechanism, defined by the software itself, and recursive levels of pre-existing oversight. This includes selective imprinting of the maternal and paternal genomes, homologous recombination during meiosis, nucleosome patterning, histone acetylation, RNA interference, blah blah... Suffice to say that if we'd developed computer software within the context of such recursive machinery, such that the only way the software could exist was within its evolved framework, combining two programs would be a snap.
Another missing point was the estimate of only a 1 in 100 chance that an embryo could develop. That may indeed be correct in any case. I spent awhile searching for statistics on that, but the search gets clouded by "in vitro" results, so I bailed out. It's definite though that not all gamete combinations result in a competent embryo, so software conflicts indeed seem to occur. Franamax (talk) 02:59, 9 January 2009 (UTC)[reply]

OK there are lots and lots of long explanations and complicated jargon and all sorts of interesting stuff in those posts...but the sad part is that NONE of that matters - not at all! The fact is that at each point in the DNA strand there are only four possible things that can happen next - you have A,G,C,T...there isn't a Q or a Z. That means that like it or not - I don't give a damn how those bits are INTERPRETED by the biological machinery - I only care that there are only four options - that's two bits of data no matter how you slice it. This isn't a question about biology - it's a question about choice - there are four choices - that's two bits - bits are a measure of choice. If someone finds another four new base-pairs in the DNA strand of some weird creature (P,Q,R,S say) then you now have 8 choices and that's three bits per base pair.
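Steve's counting rule - bits = log2(number of choices) - can be sketched in a few lines of Python (an illustration added here, not something posted in the original thread):

```python
import math

def bits_per_symbol(alphabet_size: int) -> float:
    """Information capacity of one symbol drawn from an alphabet of the given size."""
    return math.log2(alphabet_size)

print(bits_per_symbol(4))  # 2.0 - four bases (A, G, C, T) = 2 bits per position
print(bits_per_symbol(8))  # 3.0 - a hypothetical 8-base alphabet would give 3 bits
```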

I can try to explain - but you clearly don't understand fundamental information theory - and that's a problem.

Let me try one more time - how about this. If I type:

 F:P()*%$)(Q$P N@

...that's 16 random bytes of data - 128 bits in total. This string of bits is meaningless gibberish - but it's still 128 bits. You can count them. On the other hand, I could write:

 17th July, 1955.

...that's also 16 bytes of data - it's also 128 bits in total. It's obviously a date - and now it has some useful information content. But it's still not very important information. Now, if I embed those exact same 128 bits into a larger context:

 Steve's Birthday is 17th July, 1955.

...then your mental processor (or your computer) can do more with it. Suddenly - those exact same 128 bits are more useful to you. But there are still only 128 bits - the raw information content doesn't change whether it's gibberish or useful data. If I take (again) that exact same data and put a couple of square brackets around it - [[17th July, 1955]] - then that same exact 128 bits will act as a link to a document - or in this case, a command to MediaWiki to open an edit window if you click on it.

The same is true of that DNA data - it doesn't matter a damn what you're telling me about how that data is processed or what fancy processing happens or that this letter combination here means something different in this context than in that context. It's STILL precisely 2 bits per base pair and cannot ever be anything other than that.

I don't doubt that the same bit pattern (or base-pair sequence) does different things in different contexts - that's not additional data in the sequence - it's other things external to that sequence that changed the MEANING of those bits. Just like my birthday - which means something quite different when it's presented without context, with my name in front of it, or inside double-square brackets.
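The "same bits, different meaning" point can be demonstrated concretely: hand the same bytes to different "processors" (a text decoder, an integer decoder, a float decoder) and you get different results, yet the data remains exactly 32 bits. A sketch added for illustration, not part of the original discussion:

```python
import struct

raw = b"17th"  # the same 4 bytes (32 bits) throughout

print(raw.decode("ascii"))          # interpreted as text: '17th'
print(int.from_bytes(raw, "big"))   # interpreted as a big-endian integer
print(struct.unpack(">f", raw)[0])  # interpreted as an IEEE-754 float

# Three different "meanings" - but len(raw) * 8 is still exactly 32 bits.
print(len(raw) * 8)
```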

This is the same thing with software and computer data of all kinds. If I take a string of 8-bit 'base-pairs' (bytes) and say:

  i = i + 2  

...and present it to a mathematician - he's going to say "Hey - that's not true!" - but present it to a C-language compiler and you'll produce a machine-code instruction to add two to the variable called 'i'. The MEANING of those bits depends on the context - but that has IN NO WAY CHANGED THE NUMBER OF BITS.

Furthermore - although "s49087()*&_#$&rY" is seemingly gibberish, it's 128 bits - but it's not very useful. It has fewer than 128 bits of USEFUL information in it...it's always possible that the useful content of your DNA base-pairs is less than 2 bits each - and that's almost certainly true because, of the 64 codons they form, many code for the same "STOP" command...that makes those STOP codons interchangeable - which effectively reduces your useful choice of codon to fewer than 64. But one thing information theory teaches us is that you can't ever GAIN bits. If something takes 128 bits to store - then AT MOST there can only be 2^128 different states - so you can't ever store more than 128 bits in that space. The same is true of your base-pairs. They can NEVER be worth more than 2 bits each - no matter how much fancy science you tell me about them.
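The codon-redundancy point can be made quantitative. A codon is 3 bases, so its raw capacity is 6 bits; but the standard genetic code only distinguishes about 21 outcomes (20 amino acids plus STOP), so the useful content is log2(21) ≈ 4.39 bits - less than the capacity, never more. A sketch added for illustration:

```python
import math

raw_bits_per_codon = 3 * math.log2(4)       # 3 bases x 2 bits = 6 bits of raw capacity
distinct_outcomes = 21                      # 20 amino acids + STOP in the standard code
useful_bits = math.log2(distinct_outcomes)  # ~4.39 bits actually distinguished

# Redundancy means the useful content is below the raw capacity, never above it.
print(raw_bits_per_codon, useful_bits)
```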

So, I'm sorry - but all that you wrote above is utterly irrelevant - I don't even need to read it. There are only four kinds of base-pair in a DNA strand - so they are 2 bits per base-pair - and nothing that biologists can say will change that...it's pure information theory...mathematics. If some biologist tells you otherwise - then he doesn't know what he's talking about.

SteveBaker (talk) 04:52, 9 January 2009 (UTC)[reply]

What he wrote isn't entirely irrelevant, Steve. It's worth reading. If you did, you'd note he mentioned DNA methylation, particularly at cytosines. We call nucleotides A, T, C and G, and think of them as the sole basis of information. However, we often forget that this is a shorthand code for a molecular structure: C stands for cytosine. But, in many species, a significant number of what we just refer to as "C" are actually a different structure, 5-methylcytosine (we'll call it C'). So we actually have a common, 5th possibility (there are other modifications that are rarer; in some species, adenine also undergoes methylation). Then consider that this switch from C → C' is transient and bidirectional. So any given cytosine should probably be considered as having twice the potential for informational storage. I agree with your wider point about how increasing complexity of biological systems doesn't really impact on the information capacity of DNA, but it's worth noting that the complexity actually stretches to the very structure (and hence information coding potential) of DNA itself. Rockpocket 06:16, 9 January 2009 (UTC)[reply]
(e/c) A - C - T - G - methylC - methylCpG. All represent distinct items of information - can you enumerate them in two bits? Methylated cytosine ≠ cytosine. Information theory ultimately depends on the information it conveys. By the same token, analyzing the information content of triplets has to account for the redundancy of codons, so you can't say that a triplet conveys more than (roughly) 24 pieces of information in the context of protein coding. This is the reverse case: methylation status represents an extra bit of information. Put it another way: if you methylate all the C's in your genome, the total quantity of information is the same, but life will be radically different for you. You propose to say that no information has changed, because they're all still C's. Franamax (talk) 06:21, 9 January 2009 (UTC)[reply]
(Of course I read it! I only meant that I didn't NEED to read it in order to explain the issue of context). OK - well if there are "other", functionally different bases kicking around (let's call this methylated cytosine C') - then the original assertion that there are only four bases was incorrect and I have been misinformed! So we do indeed have to increase the bit count from 2 bits to a little more than 2. Technically - the number of bits is the log-to-base-2 of the number of possible states. So with 5 bases, the number of bits per base is log2(5) - don't worry that this is a fractional number of bits - that's a common thing (the digits of a decimal number, for example, contain log2(10) bits of information each - somewhere between 3 and 4 bits). But the CONTEXT in which these things appear is still irrelevant (in information-theoretic terms) to assessing how much data there is. Suppose there were some places in the genome where it didn't matter whether a particular codon contained a C or a C' - that wouldn't change the number of bits that were being encoded - only the efficiency with which the underlying machinery uses them. So we have to revise our idea about the total information content of a DNA strand to be (log2(5) x NumberOfBasePairs) instead of (2 x NumberOfBasePairs) - but that's the only thing that changes. SteveBaker (talk) 18:53, 9 January 2009 (UTC)[reply]
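The revised arithmetic - log2(5) bits per base once C' is admitted as a fifth symbol - works out like this (an illustrative sketch; the 2000-base strand is a hypothetical figure echoing the example elsewhere in the thread):

```python
import math

def strand_bits(n_bases: int, alphabet_size: int) -> float:
    """Total raw information capacity of a strand: n_bases x log2(alphabet_size)."""
    return n_bases * math.log2(alphabet_size)

print(strand_bits(2000, 4))  # 4000.0 bits with {A, C, G, T}
print(strand_bits(2000, 5))  # ~4643.9 bits once methylcytosine counts as a 5th symbol
```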
I still have a hard time separating the integral function - which results from information beyond the A, C, T & G code and exists in all naturally occurring contexts - from its pure data storage aspect. Even if it's not important for storing the data set, it seems important for describing what is stored in the system. But let's see if we can put one part of what I said in Steve's terms. Take the data string "F:P()*%$)(Q$P N@" - that would store fine in silicon, but let's say there is an alphanumeric data system like DNA that would store the "P()" with 99.99% fidelity but "Q$P" with only 90.09% fidelity. As it turns out, every time the "$" is used, the fidelity at that position and the adjacent bits drops considerably. In fact any effort to send you "$" would make the surrounding information suspect. That seems significant to me in reducing the system to bits. I have comments on a few more sections but I'm trying to keep my reply tight. Have a good one.--OMCV (talk) 13:08, 9 January 2009 (UTC)[reply]
The problem you're having is that the CONTEXT in which data occurs matters (both in DNA strands and in software and human speech and in all other systems that I can imagine) - and the PROCESSING RULES which are applied when you 'express' a gene or 'execute' a computer program matter in a given context with a given processor. But neither of those things alters the total number of outcomes that there could possibly be from expressing/executing that data. That's precisely WHY we insist on counting bits and not looking at end results.
If we stick to that rule then the total "information content" of software and DNA can be directly and meaningfully compared without having to descend into all the gory details of how cellular biology works - and we don't have to wonder whether the answer for the computer data would be different on a Mac or a Linux machine. It's JUST information. That's what "bits" are - a simple measure of the number of possible combinations a particular kind of object can represent.
You keep telling me that the same sequence of base-pairs produces different results depending on the surrounding context (and - according to you - this means it's "worth more" than digital data). I'm telling you that this is true of ANY binary data - synthetic or natural - and it doesn't affect the number of bits that the data can represent. Just because the binary pattern for a computer program can be pushed through a loudspeaker to make an annoying squawk - or put into a digital picture frame to produce a random-looking splatter of pixels - doesn't mean that I can claim that the binary data has any more bits hidden inside it - it's the exact same data - it's just been expressed as audio or light using a different processor. Similarly, a 2000-base-pair DNA sequence that happens to produce (I dunno) insulin when you play it forwards and (say) L-tryptophan when you play it backwards - or in the presence of an acidic environment or whatever the heck it is that cells do...that's irrelevant to the DNA sequence itself - it's still only 2000 choices between 4 (or maybe now it's 5) bases - so it's STILL only 4000 bits of information. The thing that "decided" to express it as one thing or the other added extra information of its own to make that happen. That extra information was either originally stored at some other location in the DNA - or is an environmental factor or something - but it wasn't in that 2000 base-pair sequence, because it doesn't have any 'space' to store anything more than the 4000 bits we can 'see' by counting the base-pairs and multiplying by the bits per base-pair (the log2 of the number of kinds of base-pair).
The business of sending me a '$' using DNA and it only coming out as a '$' 80% of the time is again, to do with the processing. You evidently have a faulty replay mechanism ("faulty" is the wrong word in a biological context...but not in a computing context). My car ignition key has three positions - and you can take the key out altogether. It's a 2 bit system (four states). The fact that my car fails to start 20% of the time when I put the key in and turn it all the way to the right is neither here nor there. It's still a 2 bit switch. Same deal with the '$' stored in your DNA.
SteveBaker (talk) 18:53, 9 January 2009 (UTC)[reply]
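For what it's worth, information theory does model a "faulty replay mechanism": it's a noisy channel. If a symbol is delivered wrongly with probability p, the capacity that survives is 1 - H(p), where H is the binary entropy. A sketch added for illustration (the 10% error rate is an assumed figure loosely echoing the "$" example above):

```python
import math

def binary_entropy(p: float) -> float:
    """Entropy H(p) of a biased coin, in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def bsc_capacity(p: float) -> float:
    """Capacity of a binary symmetric channel that flips each bit with probability p."""
    return 1 - binary_entropy(p)

print(bsc_capacity(0.0))  # 1.0 - a perfect replay mechanism delivers the whole bit
print(bsc_capacity(0.1))  # ~0.53 - a 90%-reliable bit reliably carries barely half a bit
```

The storage is still "one bit per cell"; what the noise reduces is how much of it reliably gets through - which is the distinction the car-key analogy is drawing.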
I think I understand how our ideas relate at this point. I'm surprised that "Information Theory" does not have language to describe when a "faulty replay mechanism" or a process mechanism is intrinsically tied to a type of bit within the constraints of a specific system.
So for the sake of review there is the "unique data set" and the "context". The "unique data" of DNA can be reduced to four states (2 bits) per nucleotide; fine, I agreed with that a long time ago. The "context" is more complex and contains data that is not unique; clearly I have a hard time considering this data separately from the "unique data". After all, I'm a chemist and thus like empirical, real-world examples; first principles do little for me. After all, you were the one who said nothing can exist without a context, so it bothers me to do thought experiments without a context. Let the physicists have their theory and abstractions. Nonetheless I see the value: there is only one of four possible states of unique data per nucleotide, so DNA can be reduced or compressed to that abstraction.
In silicon the "context" is usually divided between the hardware, which allows for binary states, and an "adjacent unique data set" that acts on the "unique data" we care about to achieve a "function". In DNA the context is again partly external hardware - the environment of the test tube or cell the DNA resides in, with buffer, small molecules, proteins, different forms of RNA; suffice it to say the significant states far exceed binary. Even if the DNA is a four-state system, the context is already way more complex than a PN junction. The second portion of the context, which is intrinsic to DNA's function, is redundantly stored with each nucleotide - actually each of the four bases carries a slightly different portion of the overall context. The amount of data embedded in the nucleotide "function control system" is enormous and complemented (if that's the right word) by the ability to act on that data. This differentiation between adjacent and embedded context I find interesting. Thanks for your time, have a good one.--OMCV (talk) 04:17, 10 January 2009 (UTC)[reply]
I think we're all coming at this from different viewpoints. Steve is taking the quite correct approach that if the choices are CTGA, it's a two-bit system, no more, no less. Others of us are saying that 1) there are more than just those four choices; 2) it depends on the context; 3) it's a "lossy" system. Steve is right that in terms of pure information capacity, 1) is the only determining factor - how many possible combinations can be made? C-C' and T-U substitutions are two salient examples of why there are more than just four possibilities. However, there are others, so the true information content of a DNA strand would need to consider quite a few possible chemical modifications. Then there's the base-pairing bit - two strands which are not paired in perfect fidelity. This delves into the context part, because depending on which strand is being read, the information content may or may not be different. Nevertheless, mispaired bases represent another contribution to the total number of combinations in a DNA strand of an arbitrary length.
Not all these combinations are relevant, so I'll dare to suggest a synthesis that also addresses the lossy nature of DNA: we should instead be thinking of DNA as a communication medium (which Steve has been doing all along I think). Then we have to look at channel capacity, and it's a "noisy channel", where B is the "bandwidth" representing the total set of permutations, S/N is the signal-to-noise ratio defined by the context, C is the maximum rate through the channel, and R is the effective rate defined by the error-correction capacity of the system (the "context"). Steve would know more about that than me, but it's certainly more complex than just CTGA. Franamax (talk) 07:01, 10 January 2009 (UTC)[reply]
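The channel-capacity relation mentioned above (Shannon-Hartley) can be sketched numerically. A minimal Python illustration with entirely made-up bandwidth and signal/noise figures, just to show the shape of the relation: capacity falls only logarithmically as noise rises.

```python
import math

def channel_capacity(bandwidth_hz, signal, noise):
    """Shannon-Hartley: C = B * log2(1 + S/N), in bits per second."""
    return bandwidth_hz * math.log2(1 + signal / noise)

# Hypothetical figures, purely for illustration: a tenfold increase in
# noise power costs capacity, but only logarithmically.
c_clean = channel_capacity(1000, signal=100, noise=1)   # ~6658 bits/s
c_noisy = channel_capacity(1000, signal=100, noise=10)  # ~3459 bits/s
```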
Yes, exactly. This is why it's important to stick to a rigorous counting system at each level of abstraction. Coming back to what I happen to understand the best - computers - we have a system where a RAM memory cell is strictly 1 bit. That's strictly an implementation decision - it wouldn't make ANY difference if we had chosen to store voltages of 1, 0 and -1 rather than just 1 and 0 as we choose to do. We'd have had a 3-state logic system with about 1.58 bits per cell. But nobody other than the guy who built the computer would have to care about that. From everyone else's perspective, it's just bits. That's a liberating way to limit the complexity of reasoning about the system. Similarly, in a binary computer, we typically group our bits into bytes - collections of 8 bits - and arbitrarily choose to store numbers in the range 0..255 in those bytes. We now have a base-256 logic system built on top of our base-2 logic system. However, that's not the only way. In the 1970s, when microprocessors were new and we were all figuring out the best way to use them, many people decided to group their bits into 'nibbles' of 4 bits each and to store a decimal number with one digit in each nibble (this is called 'Binary Coded Decimal' - BCD for short). Since a nibble can store 4 bits, it has the ability to store things in the range 0..15 - but there are only 10 decimal digits - so about a third of that range was wasted. Now - when you ask "How much storage space is there in a BCD computer?" you have to answer on more than one level:
  • At the hardware level, we have bits and nibbles - so you just add up the number of bits.
  • At the BCD level though - the number 999 has three digits. So it consumes 3 nibbles. You can store anything from 0 to 999 in three BCD nibbles - and that's almost 10 bits...but at the hardware level, three nibbles is 12 bits - so a BCD digit has just about 3.3 bits of storage even though we stored it in a 4-bit chunk of memory.
  • It might be that you use your BCD computer to compute company payroll. Suppose you have to store your workers' salaries in the machine. If the highest paid person gets $98,000 a year - it would be kinda stupid to use 5 nibbles to store that because next year you might give that person a 5% salary increase and now you need 6 nibbles...which would break all of your software. So you might choose to be super-safe and use 6 or more nibbles per salary "number" - but in truth, your salary information is only 5 digits - so you're wasting a nibble for future expansion. At this level of representation, we have even fewer bits per nibble than at the BCD level.
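The BCD bookkeeping in the bullets above can be sketched in a few lines of Python (the $98,000 salary is just the hypothetical figure from the example):

```python
import math

def to_bcd_nibbles(n):
    """Encode a non-negative integer as one 4-bit nibble per decimal digit."""
    return [int(d) for d in str(n)]

salary = 98000
nibbles = to_bcd_nibbles(salary)            # 5 nibbles = 20 bits of hardware
bits_used = 4 * len(nibbles)

# Information actually carried: log2(10) ~ 3.32 bits per decimal digit,
# so roughly 0.68 bits per nibble are wasted by the BCD representation.
bits_of_information = len(nibbles) * math.log2(10)
```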
What happened was that in choosing how we store information on our underlying hardware, we decided to waste about 0.6 bits per nibble for the sake of convenience. That doesn't change the storage capacity of the underlying hardware - only the way we choose to use it. However (and this is important) you can never GAIN information content - at each level of representation, you can only lose it or break even. The laws of information theory are a lot like the laws of thermodynamics - and actually, there is a solid science connection there. You can't get 'free energy' because of the laws of thermodynamics - and you can't get 'free information' for the exact same reason. (And I mean "exact"!)
Forgetting the 'extra' base pairs for the moment, we know that the cell takes 3 base pairs (2 bits each - so 6 bits in total) and treats those as codons - which are really like the instruction set of a computer - the parallels are kinda creepy! 6 bits means that there can only be 64 codons. However, (according to our genetic code article at least) this is kinda like BCD representation because some of those codons share the same meaning. The 64 codons only represent 'commands' for 20 amino acids plus 'START' and 'STOP' commands - 22 possible states - just over 4 bits of information. So there is waste and if we consider the information content at the codon level, 3 base-pairs is storing only just over 4 bits rather than the 6 bits they are theoretically capable of storing. That's just like BCD. Furthermore, not all codon sequences make sense - STOP, START, STOP, START, for example, is useless (presumably) - so at the level of "what represents a protein", you have even more states that don't do anything useful - and the amount of information content is reduced still further.
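The codon arithmetic above is easy to check. A short Python sketch, using the 64-codon / 22-meaning counts quoted from the genetic code article:

```python
import math

codons = 4 ** 3                            # 64 possible 3-base codons
bits_per_codon_raw = math.log2(codons)     # 6.0 bits in principle
meanings = 22                              # ~20 amino acids plus control signals
bits_per_codon_used = math.log2(meanings)  # ~4.46 bits actually distinguished
wasted_bits = bits_per_codon_raw - bits_per_codon_used
```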
So the intermediate conclusion is that at each level of representation, you count a different number of bits. However, each 'level' of representation tends to lose information compared to the level of representation below it.
The relevance of that to processing is important: counting the number of base-pair possibilities puts an UPPER limit on the number of bits you can store in a DNA molecule. The true number of proteins or whatever that it actually codes for is going to be a lot less than that...but absolutely, certainly, without any doubt whatever...it can't code for any more. No matter what context or what processing you provide - that's a hard upper limit. Claiming that some fancy biological 'thing' can make more information come out of a DNA strand is PRECISELY the same thing as claiming that you've invented a perpetual motion machine. "PRECISELY" because information theory and thermodynamics are actually the same thing 'under the hood'.
So yeah - there may be more bits per base-pair because there are these other base-pairs and mismatches and whatever - but you can (and should!) decide how many bits there are per base-pair and use that as the UPPER limit for the storage capacity of your DNA strand...and be aware that the actual, practical limit (which determines the number of possible unique individuals it could code for) is by absolute fundamental thermodynamic necessity a lot less than that. I've heard all sorts of weird ways in which DNA operates - where (for example) some sequences of codons are read backwards as well as forwards...or that a 'skip' in processing can result in 2 base pairs from one codon and one from the next being accidentally read as a different codon. That's all very interesting and amusing but it doesn't alter the answer to the "number of unique possibilities" question because all of that is just a change of context or a change of processing. For any given strand, what you get with the 'correctly lined up' reading of the codons has a precise 1:1 correspondence with what you get for a particular misregistration reading of that same sequence. So that doesn't increase the number of unique individuals you can code for - although it does provide for some cunning ways of getting more proteins from a single strand of DNA.
As for lossiness and bandwidth: Bandwidth and storage space are the same thing in information-theoretic terms. A communications path is just a time-ordered sequence of bits rather than a space-ordered sequence such as you get in a DNA strand or a RAM chip. The same exact rules apply. A lossy system (where some bits that you put into the system get changed, or deleted or extra bits get stuck in there) is still able to be useful - but you need some form of error correction (or merely error detection in most situations - if you can detect an error you can say "try again" and keep doing that until you get a good one - so error correction and error detection are essentially the same thing). In order to detect errors (and therefore to be able to correct them) you need some redundancy in the system. It's possible that the reason the cell uses 3 base pairs to store only 22 unique instructions is precisely that. Certainly, the fact that you have two copies is a classic redundancy. But what all fault-tolerant systems share in common is that you MUST waste some more bits in order to achieve that tolerance - and (I won't bother you with the math) you can predict from the number of bits you waste what the maximum possible degree of fault-tolerance is. But again - fault tolerance is not getting you "free energy" - you can only lose storage capacity.
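The "waste bits to buy error detection" trade described above can be shown with the simplest possible scheme, a single even-parity bit. This is a toy sketch, not a claim about how cells actually do it:

```python
def add_parity(bits):
    """Append one even-parity bit: the redundancy that buys error detection."""
    return bits + [sum(bits) % 2]

def check_parity(bits):
    """True if no single-bit error is detected."""
    return sum(bits) % 2 == 0

word = add_parity([1, 0, 1, 1])      # 4 data bits + 1 redundant bit
corrupted = word[:]
corrupted[2] ^= 1                    # flip one bit "in transit"

ok_clean = check_parity(word)        # True: passes the check
ok_corrupt = check_parity(corrupted) # False: error detected, so "try again"
```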
The other big issue is data compression. We all know that you can take a digital photograph and if it's a 1000x1000 pixel image, with Red, Green and Blue color data at a byte per color per pixel - then that requires 3,000,000 bytes to store accurately. However, you can compress it by storing it as a JPEG file - and now it takes maybe only a tenth of that. It sounds like we've broken the 1st law of thermodynamics - we got 'free information' storage...no different from 'free energy'. However, we haven't. These compression tricks fall into two classes - "Lossy" and "Lossless". Lossy compression schemes result in the reconstituted data being different from the data you started with. A JPEG photo and its original pristine photo look more or less the same - but if you look carefully, you'll see that the colors aren't quite the same and it's a bit more blurry and there are color fringes around some edges. That's because of the "no free lunch" part of information theory. If you use fewer bits - you lose something. There is also "lossless" compression (the "PNG" file format does that - as does Zipping something into a '.zip' file on your PC). But the thing about lossless compression is that it only works on things like our BCD numbers that are already wasting bits. So you could compress a bunch of BCD numbers back down to the point where no memory was wasted anymore. If you try to compress something losslessly and it doesn't have any wasted bits inside - it'll actually get larger - not smaller. This isn't a surprise - because the laws of thermodynamics are hard-and-fast rules, with no exceptions.
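The lossless-compression claim above is directly testable. A Python sketch using the standard zlib library: highly redundant input shrinks dramatically, while (pseudo)random input - which has no wasted bits - comes back slightly larger after "compression":

```python
import os
import zlib

redundant = b"AB" * 5000          # 10,000 bytes with massive redundancy
random_ish = os.urandom(10000)    # 10,000 bytes with (nearly) none

redundant_packed = zlib.compress(redundant)
random_packed = zlib.compress(random_ish)

shrinks = len(redundant_packed) < len(redundant)   # True: waste squeezed out
grows = len(random_packed) >= len(random_ish)      # True: no free lunch
```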
So I guess my conclusion here is that counting the bits in the base-pairs of your DNA strand imposes a HARD limit on the maximum amount of variation that strand can encode...which is why all of the other layers of encoding, context, interpretation, clever trickery, redundancy and compression - don't matter a damn. The number of base-pairs multiplied by the log-to-base-2 of the number of kinds of base pair is the upper limit...all of the other things you are telling me can only possibly reduce that number. When you try to tell me otherwise - it's PRECISELY like some idiot claiming to have invented a perpetual motion machine. You get to the point where you can tell such a person that their invention doesn't work BEFORE they even start to explain it. It's the exact same deal with the information content of DNA. It's thermodynamics.
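The hard upper limit described in this conclusion is one line of arithmetic. A Python sketch (the 10-base-pair strand length is just an example):

```python
import math

def max_bits(num_base_pairs, kinds_of_base_pair=4):
    """Hard upper bound: base pairs times log2 of the number of kinds."""
    return num_base_pairs * math.log2(kinds_of_base_pair)

# A strand of 10 base pairs can distinguish at most 4**10 sequences,
# i.e. 20 bits -- no amount of context or processing can exceed this.
limit_bits = max_bits(10)        # 20.0
unique_sequences = 4 ** 10       # 1048576
```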
SteveBaker (talk) 13:39, 10 January 2009 (UTC)[reply]
As another tangential (and perhaps not relevant to the wider point, but useful for future reference) correction on detail: The codon set that encodes the 20 standard amino acids actually includes start codons, but not stop codons. The amino acid methionine doubles up as a "START" in eukaryotes. Whether the methionine codon's "command" is "START" or not depends only on wider context, not on the information encoded in the codon itself. Incidentally, in prokaryotes there is a further quirk. The same start codon, ATG, codes for both methionine and N-formylmethionine. Which, again, depends on context. I think this impacts your statement, "The 64 codons only represent 'commands' for 20 amino acids plus 'START' and 'STOP' commands - 22 possible states - just over 4 bits of information. So there is waste and if we consider the information content at the codon level, 3 base-pairs is storing only just over 4 bits rather than the 6 bits they are theoretically capable of storing." How does this take into account the exact same codon providing very different 'commands'? Doesn't that increase the number of bits of information stored in that codon? Rockpocket 20:46, 10 January 2009 (UTC)[reply]
Oh - that's odd. Our genetic code article definitely says that UAA, UAG and UGA are "STOP" instructions and AUG is "START". But whatever - the principle remains. The idea that the same "instruction" codes for different things depending on the context is also not a novelty. Pretty much all modern computers do this too. I'm not going to research a particular example - so let me make up something. On the XYZ-2000 computer, the bit pattern 11001100 means "ADD" - unless it's preceded by the 01010101 code - in which case it means "SUBTRACT". This is a common enough thing and it does complicate the business of counting the number of bits represented by the concept of an "INSTRUCTION". On real Pentium chips, the shortest instructions (for the most common operations) are stored in a single byte (8 bits) - but there are instructions that are (IIRC) up to 7 bytes long depending on the context. The total number of "instructions" is easy enough to count though. So yes, the word "codon" has evidently gotten a bit messed up - you want to think of it as an "INSTRUCTION" but you also want it to represent 3 base-pairs when in fact, some codons depend on what came before so they are REALLY 6-base-pair codons. That confusion is not there in computers - we have "bytes" that are the convenient group of 8 bits - and bytes and instructions are not necessarily correlated. So, again, fuzzy biological language is confusing the thought patterns! If we talk instead about 'base-pair-triplets' and 'actions' then we'd say that one action was typically stored in a single base-pair-triplet but some actions (such as the 'methionine' and 'N-formylmethionine' actions) require multiple base-pair-triplets to encode their full meaning. That's exactly like how the Pentium works...and I'd bet that the REASON is exactly the same. When they designed the Pentium, most existing PCs were using 80486's - and they wanted to make sure that all programs written for the '486 would still run on the Pentium. 
So instead of completely changing the mapping of bit patterns to instructions (so there would be a separate base-pair-triplet for 'methionine' and 'N-formylmethionine') they decided to change the rules for replaying instructions such that some instructions would depend on the previous context. This 'backward compatibility' thing would be needed in lifeforms because when the RNA molecule learned (evolved) how to make proteins with N-formylmethionine AS WELL AS methionine - it still had to replay all of the rest of the methionine-only-DNA correctly. So evolution had to pick a genetic code combination that didn't come up very often to use for this special new thing. Same deal with the design of the Pentium.
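The context-dependent decoding described above can be mimicked with a toy decoder. All opcodes here are invented, in the spirit of the made-up "XYZ-2000" earlier: 0xCC normally means ADD, but means SUBTRACT when preceded by a 0x55 prefix byte.

```python
def decode(program):
    """Decode a byte list where a prefix byte changes the next opcode's meaning."""
    out, prefix = [], False
    for byte in program:
        if byte == 0x55:                 # context-changing prefix (invented)
            prefix = True
            continue
        if byte == 0xCC:
            out.append("SUB" if prefix else "ADD")
        else:
            out.append(f"OP_{byte:02X}")
        prefix = False                   # prefix only affects one instruction
    return out

ops = decode([0xCC, 0x55, 0xCC, 0xCC])   # same byte, two meanings
```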
It's weird - the more you guys try to persuade me that DNA is different to software - the more things I see that are chillingly similar. It's incredible that we worked out almost all of these computer techniques BEFORE we figured out how DNA works...yet amazingly the two mechanisms are virtually identical in every important respect! SteveBaker (talk) 21:08, 10 January 2009 (UTC)[reply]
That makes a lot of sense, thanks for explaining it. I actually like the comparison between DNA and software; it makes a lot of intuitive sense to me. I simply don't understand software well enough to know whether the analogy holds up, and you are doing a good job of explaining that. I'm beginning to think that the major difference between the mechanisms is not in how information is stored, but in the amount of "feedback" the code exerts on itself. The information the DNA holds is used to build hardware, which is directed back on itself to modify the code in incredibly complex ways. This, obviously, is how it manages to be self-replicating and thus evolves. My (limited) understanding of software is that this happens in a more limited way, but we are not quite at the same level. We can use software to build hardware, but have not quite perfected self-replicating machines yet. Rockpocket 21:45, 10 January 2009 (UTC)[reply]
Thanks Steve. I completely agree with your treatment of the unique data, and I had never heard of a nibble, which is pretty cool. But coming back to my original point: "functional software" can be accurately and completely described as software, while "functional DNA" must be described as software and hardware. "Software" and "DNA" are descriptions of things at different levels of abstraction.--OMCV (talk) 19:37, 10 January 2009 (UTC)[reply]
(A nibble is 4 bits. Some people call 2 bits a 'nibblet'. Some people spell it nybble on the grounds that we don't talk about bites!)
The distinction you are still making about DNA is 100% one of biologist's own making. The term "DNA" refers to a chemical - a molecule. Except that there are a bazillion variations on DNA (one for every creature on the planet give or take). It's unfortunate that there seems (to an outsider) to be no clear names for:
  1. The general class of molecules that are a double-helix of base-pairs.
  2. A specific molecule of that class.
  3. The data encoded on that one of those molecules.
  4. The language it's encoded in (I guess "genetic code" comes close).
  5. The 'computer' - or more technically the "interpreter" that processes it (Although "RNA" is close).
But the main problem is that we don't have different names for the molecule and the data that's encoded on it. In computers, we hardly ever confuse the RAM chip (2) and the software (3) that's stored inside the RAM chip. In biology - there isn't a clear separation. If you sequence a DNA strand and store all of the A's, G's, T's and C's on a CD-ROM - the data on that CD-ROM is exactly the same thing as the data that's encoded on the molecule(3). The NAME for that thing would be your analog of software. So let's do that: DNAdata and DNAatoms are those two concepts. DNAdata is capable of being stored on a CD-ROM or printed on paper or memorized by some savant - OR you can store it on DNAatoms so that a biological creature can execute it.
Analogy: If I take a copy of Microsoft WORD and store it on a flash drive or on a spinning magnetic disk or place it in RAM and execute it - it's still Microsoft WORD...it's the same software because the information content is identical no matter where we happen to have stored it. We can even Zip up the "WORD.EXE" file so that none of the bits are the same and we STILL call it "Microsoft WORD" because the fundamental information content hasn't changed. I have to put it into a RAM chip in order to execute it - but it's still the same program even when it's on disk someplace where I can't execute it.
So when you copy the C/G/A/T stuff onto a CD-ROM, it's still the same information - and you could (in principle if not in practice) reconstitute a functioning DNA molecule from the data on the CD-ROM. The hardware is kinda disposable.
The thing that IS a bit different from a RAM chip is that the entire physical structure of your storage medium is made up of the base-pairs. In a RAM chip, you can erase the data and you've still got a RAM chip...not so with DNA! But that's just a hardware detail from an information-theoretic perspective. I could take a very large number of Lego bricks in Red, Green, Blue and Yellow and use those four colors to define a 2-bit code. Then I could build a huge tower of bricks that would 'store' Microsoft WORD in Lego at 2 bits per brick. I could, later on, build a robot with a camera that would examine that tower brick by brick - looking at the colors - and copy that data into computer memory and run it. So we can make artificial Lego-DNAatoms and store software on it. At THIS point - is there truly any difference at all between the DATA that's stored on the DNA and software stored in Lego?
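The Lego thought experiment is simple enough to simulate: four brick colors give 2 bits per brick, so any byte string - a program, a genome readout - round-trips losslessly through a "tower of bricks". A Python sketch (the color names are arbitrary):

```python
COLORS = ["red", "green", "blue", "yellow"]          # 4 colors = 2 bits per brick

def to_bricks(data: bytes):
    """Encode a byte string as a list of brick colors, 4 bricks per byte."""
    bricks = []
    for byte in data:
        for shift in (6, 4, 2, 0):                   # most-significant pair first
            bricks.append(COLORS[(byte >> shift) & 0b11])
    return bricks

def from_bricks(bricks):
    """The 'robot with a camera': read colors back into bytes."""
    out, acc = bytearray(), 0
    for i, brick in enumerate(bricks):
        acc = (acc << 2) | COLORS.index(brick)
        if i % 4 == 3:
            out.append(acc)
            acc = 0
    return bytes(out)

tower = to_bricks(b"WORD")       # 4 bytes -> 16 bricks
restored = from_bricks(tower)    # b'WORD' again: the medium was disposable
```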
You're probably going to say that it's because we 'execute' DNAdata directly from the DNAatoms strand without copying it into another storage medium - but that can be true of software too. I could build a little robotic RNA analog that would move up and down the tower of Lego bricks and execute the program (god-awfully slowly!) directly from the Lego. This was exactly the kind of thing that Alan Turing was thinking of when he came up with his "Turing machine" thought experiment...and the 'Church-Turing thesis' implies that all universal computing machines (be they Lego and LegoRNA - or a Pentium IV chip) are equivalent in what they can compute.
So, IMHO, the only reason there is this distinction in your mind - between DNA being hardware and software together, versus my view of the universe - is that you don't have two separate words for the DNAatoms and the DNAdata. If you did, then you'd probably be agreeing with me.
Once you mentally and linguistically separate the two concepts - it's possible to talk rationally about whether we should consider that 'data' to really be 'executable commands' or 'data' in the classic sense of numbers and words. As a computer geek I have to tell you that the line between data and executable is more than just blurry - it doesn't exist.
SteveBaker (talk) 20:47, 10 January 2009 (UTC)[reply]
I'm happy with that, but if you ever see a hardware/software combination other than DNA capable of self-replication, even if it's not sexual, let me know. If you haven't seen it already, I thought you might get a kick out of this. xkcd.--OMCV (talk) 22:01, 10 January 2009 (UTC)[reply]

Flowchart

Hello. I include here a flowchart that I occasionally find useful, in the hopes that you might find it useful too. The difficulty is "Does the message contain really ridiculous misunderstandings and claims?", since what is 'ridiculous' can be hard to tell. That is where Poe's Law unfortunately raises its head. Anyway, I tend to find that if I ask whether someone is being ironic and they aren't, this either leads to them checking their text and seeing a mistake (good) or making it clear that they really do think some very strange things, giving you a good opening for ripping into these views if you so choose (fun!). Or, of course, explaining so that I see I was mistaken.

If you find you are able to tell who I am, please do not write that name on here. Sadly, I'm keeping a low easily-checkable profile with that username on this project. Thanks.

79.66.109.89 (talk) 20:03, 10 January 2009 (UTC)[reply]

I've checked the flow chart and decided that you sent it to me ironically. Hence I should ignore it and therefore I must treat your messa...oh oh...No you don't catch me out that easily!  :-) SteveBaker (talk) 20:12, 10 January 2009 (UTC)[reply]